PuSH - Publication Server of Helmholtz Zentrum München

Cirino, T.* ; Caron, G.* ; Ermondi, G.* ; Charochkina, L.L.* ; Tetko, I.V.

SangsterLogP - the largest publicly available dataset of logP values.

Sci. Data, DOI: 10.1038/s41597-026-07357-2 (2026)
Open Access Gold as soon as Publ. Version/Full Text is submitted to ZB.
We present SangsterLogP, the largest publicly available curated dataset of experimental logP values, comprising more than 23k unique molecules, with experimental logP values ranging from -3.8 to 11.7 (a span of about 15.5 log units). The dataset originated from Dr. James Sangster's comprehensive literature review of over 3k sources. We implemented a systematic curation workflow including a) logD-to-logP adjustment for ionised compounds and b) consensus-based residual analysis for the removal of outliers and duplicates. External validation using retrospective and prospective test sets demonstrated robust predictive performance (RMSE of 0.34 and 0.47 log units, respectively). SangsterLogP also substantially expands coverage of chemical space compared to the widely used legacy PHYSPROP database, including compounds in the beyond-Rule-of-5 domain. The fully annotated dataset, including experimental conditions and sources, is freely accessible via the Zenodo repository and on the Online Chemical database and Modelling Environment website.
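The abstract does not state how the logD-to-logP adjustment or the RMSE were computed. As a minimal sketch only, the Python below uses the standard textbook correction for a monoprotic acid or base, which assumes that only the neutral species partitions into octanol, together with the usual root-mean-square error. The function names `logd_to_logp` and `rmse` and all input values are hypothetical and are not taken from the paper or the dataset.

```python
import math


def logd_to_logp(logd, ph, pka, kind):
    """Estimate neutral-form logP from a logD measured at a given pH.

    Assumption (not from the paper): the compound is monoprotic and only
    its neutral form partitions into octanol, so
        acid: logP = logD + log10(1 + 10**(pH - pKa))
        base: logP = logD + log10(1 + 10**(pKa - pH))
    """
    if kind == "acid":
        exponent = ph - pka
    elif kind == "base":
        exponent = pka - ph
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return logd + math.log10(1.0 + 10.0 ** exponent)


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative values only: an acid measured at a pH equal to its pKa is
# half ionised, so the correction adds log10(2) to the measured logD.
corrected = logd_to_logp(logd=1.0, ph=7.4, pka=7.4, kind="acid")
error = rmse([1.0, 2.0], [1.0, 4.0])
```

Under this assumption the correction is always non-negative and grows by roughly one log unit for each pH unit of additional ionisation, which is why measurements taken far from the neutral-form pH need the largest adjustment.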
Publication type Article: Journal article
Document type Scientific Article
Keywords Workflow ; Outlier ; Data Curation ; Residual ; Chemical Space ; Data Source
ISSN (print) / ISBN 2052-4463
e-ISSN 2052-4463
Journal Scientific Data
Publisher Springer
Publishing Place London
Reviewing status Peer reviewed
Grants European Commission (Erasmus Mundus Joint Master)
HORIZON EUROPE Marie Sklodowska-Curie Actions