Single-center versus multi-center data sets for molecular prognostic modeling: A simulation study.
Radiat. Oncol. 15:109 (2020)
Background Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used for prognosis of radiotherapeutic success. The model development process requires two independent data sets, one for discovery and one for validation. Each of them may contain samples collected in a single center or a collection of samples from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Optimal use of limited data resources for discovery and validation with respect to the expected success of a study requires dispassionate, objective decision-making. In this work, we addressed the impact of the choice of single-center and multi-center data as discovery and validation data sets, and assessed how this impact depends on three data characteristics: signal strength, number of informative features and sample size.
Methods We set up a simulation study to quantify the predictive performance of a model trained and validated on different combinations of in silico single-center and multi-center data. The standard bioinformatics analysis workflow of batch correction, feature selection and parameter estimation was emulated. For the determination of model quality, four measures were used: false discovery rate, prediction error, chance of successful validation (significant correlation of predicted and true validation data outcome) and model calibration.
Results In agreement with the literature on generalizability of signatures, prognostic models fitted to multi-center data consistently outperformed their single-center counterparts when the prediction error was the quality criterion of interest. However, for low signal strengths and small sample sizes, single-center discovery sets showed superior performance with respect to false discovery rate and chance of successful validation.
Conclusions With regard to decision making, this simulation study underlines the importance of study aims being defined precisely a priori. Minimization of the prediction error requires multi-center discovery data, whereas single-center data are preferable with respect to false discovery rate and chance of successful validation when the expected signal or sample size is low. In contrast, the choice of validation data solely affects the quality of the estimator of the prediction error, which was more precise on multi-center validation data.
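To make the workflow named in the Methods concrete, the following is a minimal sketch of one simulation run, not the authors' implementation. Everything in it is an assumption chosen for illustration: Gaussian features with an additive per-center offset, within-center mean centering as a stand-in for batch correction, top-k correlation ranking as feature selection, univariate slopes as parameter estimates, and a Fisher z approximation for the "significant correlation" criterion. All function names and parameter values are hypothetical, and model calibration is not covered. Because the additive offset is removed completely by centering, this toy only shows the shape of the pipeline and is not expected to reproduce the reported differences between single-center and multi-center data.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb) if saa > 0 and sbb > 0 else 0.0


def simulate_center(rng, n, p, informative, signal, site_sd):
    """Simulate one center; a per-feature offset mimics site-specific bias (assumed)."""
    offset = [rng.gauss(0.0, site_sd) for _ in range(p)]
    X, y = [], []
    for _ in range(n):
        biology = [rng.gauss(0.0, 1.0) for _ in range(p)]
        y.append(signal * sum(biology[j] for j in informative) + rng.gauss(0.0, 1.0))
        X.append([biology[j] + offset[j] for j in range(p)])
    return X, y


def batch_correct(X):
    """Crude batch correction: center every feature within its own center."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def make_dataset(rng, n_total, n_centers, p, informative, signal, site_sd=1.0):
    """Pool batch-corrected centers; n_centers=1 yields single-center data."""
    X, y = [], []
    for _ in range(n_centers):
        Xc, yc = simulate_center(rng, n_total // n_centers, p, informative, signal, site_sd)
        X.extend(batch_correct(Xc))
        y.extend(yc)
    return X, y


def fit(X, y, k):
    """Select the k features most correlated with y and estimate univariate slopes."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    selected = sorted(range(p), key=lambda j: -abs(pearson(cols[j], y)))[:k]
    my = sum(y) / len(y)
    slopes = {}
    for j in selected:
        mj = sum(cols[j]) / len(y)
        sxx = sum((x - mj) ** 2 for x in cols[j])
        slopes[j] = sum((x - mj) * (t - my) for x, t in zip(cols[j], y)) / sxx
    # Features are centered, so the mean outcome serves as the intercept.
    return selected, slopes, my


def evaluate(model, X_val, y_val, informative):
    """Return false discovery rate, prediction error (MSE) and validation success."""
    selected, slopes, intercept = model
    pred = [intercept + sum(slopes[j] * row[j] for j in selected) for row in X_val]
    fdr = sum(j not in informative for j in selected) / len(selected)
    mse = sum((a - b) ** 2 for a, b in zip(pred, y_val)) / len(y_val)
    r = max(min(pearson(pred, y_val), 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(len(y_val) - 3)  # Fisher z approximation
    return {"fdr": fdr, "mse": mse, "validated": z > 1.96}


rng = random.Random(0)
informative = {0, 1, 2, 3, 4}
for n_centers in (1, 4):  # single-center vs. multi-center discovery set
    X_disc, y_disc = make_dataset(rng, 120, n_centers, 40, informative, signal=0.5)
    X_val, y_val = make_dataset(rng, 120, 4, 40, informative, signal=0.5)
    print(n_centers, evaluate(fit(X_disc, y_disc, k=5), X_val, y_val, informative))
```

Repeating such a run many times over a grid of signal strengths, numbers of informative features and sample sizes, and averaging the three measures, would give the kind of comparison the abstract describes. A more faithful emulation would also need site effects that survive batch correction.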
Publication type
Article: Journal article
Document type
Scientific article
Keywords
Predictive Model; Omics Data; Feature Selection; Predictive Performance; Study Design; Validation; Gene-expression; Treatment Decisions; Signature; Validation; Prediction; Head
Language
English
Publication year
2020
HGF reporting year
2020
ISSN (print) / ISBN
1748-717X
e-ISSN
1748-717X
Source details
Volume: 15
Issue: 1
Article number: 109
Publisher
BioMed Central
Place of publication
Campus, 4 Crinan St, London N1 9XW, England
Review status
Peer reviewed
POF Topic(s)
30203 - Molecular Targets and Therapies
30504 - Mechanisms of Genetic and Environmental Influences on Health and Disease
Research field(s)
Radiation Sciences
PSP element(s)
G-501000-001
G-521800-001
Date of entry
2020-05-18