Open Access Gold as soon as the published version/full text is submitted to ZB.
SangsterLogP - the largest publicly available dataset of logP values.
Sci. Data, DOI: 10.1038/s41597-026-07357-2 (2026)
We present SangsterLogP, the largest publicly available curated dataset of experimental logP values, comprising more than 23k unique molecules, with experimental logP values ranging from -3.8 to 11.7 (about 15.5 log units). The dataset originated from Dr. James Sangster's comprehensive literature review of over 3k sources. We implemented a systematic curation workflow including (a) logD-to-logP adjustment for ionised compounds and (b) consensus-based residual analysis for outlier and duplicate removal. External validation using retrospective and prospective test sets demonstrated robust predictive performance (RMSE of 0.34 and 0.47 log units, respectively). SangsterLogP also substantially expands coverage of chemical space compared to the widely used legacy PHYSPROP database, including compounds in the beyond-Rule-of-5 domain. The fully annotated dataset, including experimental conditions and sources, is freely accessible via the Zenodo repository and on the Online Chemical database and Modelling Environment website.
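The abstract names the two curation steps but does not give their formulas. As a rough illustration only, the sketch below uses the textbook relation between logD and logP for a monoprotic compound (assuming only the neutral species partitions into octanol) and a simple median-based residual check for replicate measurements. The function names, the median as the consensus value, and the 1.0 log unit threshold are hypothetical choices for this sketch, not the authors' published workflow.

```python
import math
from statistics import median


def logd_to_logp(logd, ph, pka, kind):
    """Estimate logP from a logD measured at a given pH.

    Assumption: monoprotic compound, and only the neutral form
    partitions into the octanol phase (textbook approximation).
    """
    if kind == "acid":
        exponent = ph - pka      # acids ionise above their pKa
    elif kind == "base":
        exponent = pka - ph      # bases ionise below their pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return logd + math.log10(1.0 + 10.0 ** exponent)


def flag_outliers(values, threshold=1.0):
    """Flag replicate values far from the consensus (median).

    threshold is a hypothetical cutoff in log units.
    Returns the consensus and one boolean flag per input value.
    """
    consensus = median(values)
    flags = [abs(v - consensus) > threshold for v in values]
    return consensus, flags


# An acid measured at pH equal to its pKa is half ionised,
# so the correction is log10(2), about 0.30 log units.
corrected = logd_to_logp(1.0, ph=7.4, pka=7.4, kind="acid")

# Four replicate logP values for one molecule; the last one deviates.
consensus, flags = flag_outliers([2.0, 2.1, 1.9, 4.5])
```

In this sketch a flagged value would be reviewed or dropped, and the remaining replicates merged into one entry per molecule; how the actual dataset resolves such cases is described in the article itself.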
Publication type
Article: Journal article
Document type
Scientific Article
Keywords
Workflow ; Outlier ; Data Curation ; Residual ; Chemical Space ; Data Source
ISSN (print) / ISBN
2052-4463
e-ISSN
2052-4463
Journal
Scientific Data
Publisher
Springer
Publishing Place
London
Reviewing status
Peer reviewed
Institute(s)
Institute of Structural Biology (STB)
Grants
European Commission (Erasmus Mundus Joint Master)
HORIZON EUROPE Marie Sklodowska-Curie Actions