Improved Machine Learning Predictions of EC50s Using Uncertainty Estimation from Dose–Response Data

Accepted version
Peer-reviewed

Abstract

In early-stage drug design, machine learning models often rely on compressed representations of data, where raw experimental results are distilled into a single metric per molecule through curve fitting. This process discards valuable information about the quality of the curve fit. In this study, we incorporated a fit-quality metric into machine learning models to capture the reliability of the metric for each individual molecule. Using 40 datasets from PubChem (public) and BASF (private), we demonstrated that including this quality metric can significantly improve predictive performance without additional experiments. Four methods were tested: random forests with parametric bootstrap, weighted random forests, variable output smearing random forests, and weighted support vector regression. When using fit-quality metrics, at least one of these methods led to a statistically significant improvement on 31 of the 40 datasets. In the best case, these methods led to a 22% reduction in the model's root mean squared error. Overall, our results demonstrate that by adapting data processing to account for curve fit quality, we can improve predictive performance across a range of different datasets.
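As a rough illustration of how a per-molecule fit-quality value could enter model training, the sketch below turns curve-fit standard errors into sample weights and into parametric bootstrap label replicates. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which fit-quality metric or weighting scheme was used, so the inverse-variance weighting, the normal noise model, the function names, and the example values are all hypothetical.

```python
import random
import statistics


def inverse_variance_weights(std_errors, floor=1e-3):
    """Map curve-fit standard errors to sample weights that average to 1.

    Assumption: a smaller standard error means a more reliable label.
    The floor guards against division by zero for near-perfect fits.
    """
    raw = [1.0 / max(se, floor) ** 2 for se in std_errors]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]


def parametric_bootstrap_labels(values, std_errors, n_replicates, seed=0):
    """Draw label replicates y ~ Normal(fitted value, fit standard error).

    Assumption: the fit uncertainty is roughly Gaussian on the label scale.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(v, se) for v, se in zip(values, std_errors)]
        for _ in range(n_replicates)
    ]


# Hypothetical pEC50 values and fit standard errors (illustrative only).
pec50 = [6.2, 5.1, 7.4, 4.9]
fit_se = [0.05, 0.40, 0.10, 0.80]

weights = inverse_variance_weights(fit_se)
replicates = parametric_bootstrap_labels(pec50, fit_se, n_replicates=200)

# Per-molecule spread of the resampled labels across all replicates.
spread = [statistics.stdev(column) for column in zip(*replicates)]
```

Weights of this kind could be passed to any learner that accepts per-sample weights, and each replicate could serve as the label vector for one ensemble member, so that poorly fitted molecules contribute noisier and less influential labels.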

Journal Title

Journal of Chemical Information and Modeling

Journal ISSN

1549-9596
1549-960X

Publisher

American Chemical Society

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

BASF funded this project.
