Improved Machine Learning Predictions of EC50s Using Uncertainty Estimation from Dose–Response Data
Accepted version
Peer-reviewed
Abstract
In early-stage drug design, machine learning models often rely on compressed representations of data, in which raw experimental results are distilled into a single metric per molecule through curve fitting. This process discards valuable information about the quality of the curve fit. In this study, we incorporated a fit-quality metric into machine learning models to capture the reliability of the fitted metric for each molecule. Using 40 datasets from PubChem (public) and BASF (private), we demonstrated that including this quality metric can significantly improve predictive performance without additional experiments. Four methods were tested: random forests with parametric bootstrap, weighted random forests, variable output smearing random forests, and weighted support vector regression. When fit-quality metrics were used, at least one of these methods led to a statistically significant improvement on 31 of the 40 datasets. In the best case, these methods led to a 22% reduction in the model's root mean squared error. Overall, our results demonstrate that adapting data processing to account for curve-fit quality can improve predictive performance across a range of datasets.
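As a rough illustration of how a fit-quality metric might enter model training, the sketch below shows two of the ideas named in the abstract: turning per-molecule uncertainty into sample weights, and drawing parametric-bootstrap copies of the labels. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which fit-quality metric was used, so the code assumes it is the standard error of the fitted pEC50, assumes inverse-variance weighting and Gaussian noise, and uses hypothetical function names and made-up values.

```python
import random


def inverse_variance_weights(std_errors, floor=1e-6):
    """Convert per-molecule standard errors into weights that sum to the sample count.

    Assumption: fit quality is expressed as a standard error, so a less
    certain curve fit (larger error) receives a smaller weight.
    """
    raw = [1.0 / max(se, floor) ** 2 for se in std_errors]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


def parametric_bootstrap_labels(values, std_errors, n_replicates, seed=0):
    """Draw noisy copies of the labels, one Gaussian draw per molecule per replicate.

    Assumption: each fitted value is treated as the mean of a normal
    distribution whose spread is the curve-fit standard error.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(v, se) for v, se in zip(values, std_errors)]
        for _ in range(n_replicates)
    ]


# Hypothetical pEC50 estimates and curve-fit standard errors for three molecules.
pec50 = [6.2, 5.1, 7.4]
fit_se = [0.1, 0.5, 0.2]

weights = inverse_variance_weights(fit_se)
replicates = parametric_bootstrap_labels(pec50, fit_se, n_replicates=100)
```

In a setup like this, the weights could be passed as the `sample_weight` argument when fitting a regressor that supports it (for example, scikit-learn's `RandomForestRegressor.fit` or `SVR.fit`), while each bootstrap replicate could serve as the label vector for one ensemble member.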
Journal ISSN
1549-960X

