Random forests and their application to heteroscedastic drug design data


Abstract

Random forests are a popular machine learning method that makes predictions by ensembling decision trees. They are widely used on tabular data, including drug design datasets. Whilst they often give good predictions, they are difficult to interpret, and the reasons for their successes and failures on different datasets are not always clear. This thesis addresses this problem, providing an explanation for the performance of random forests and exploring its practical implications. We develop an analogy between a random forest and a Bayesian model selection procedure, and use it to explain the different behaviours of a random forest. In particular, we examine the role of randomness in obtaining good predictions and explain when and why pruning is required to prevent overfitting in random forests.

These novel explanations for random forest performance enable us to make practical recommendations for selecting model parameters, including specific guidance on how and when to tune different parameters based on properties of the dataset. In addition, we provide simple tools for estimating prediction uncertainty and variable importance. These methods are developed from the random forest interpretation we present and are competitive with standard approaches to these tasks.

Following this, we extend the standard random forest algorithm to incorporate the uncertainty information that arises in heteroscedastic data – datasets where the amount of noise in the target value varies between datapoints. We consider datasets where the relative amount of measurement noise in different datapoints is known. Using 10 drug design datasets, we show that utilising this uncertainty information can lead to significantly better predictive performance. We introduce three random forest variations for learning from heteroscedastic data: parametric bootstrapping, weighted random forests and variable output smearing. All three can improve model performance, demonstrating the adaptability of random forests to heteroscedastic data and thus expanding their applicability. These methods are consistent with our interpretation of a random forest, and their relative performance gives further insight into the random forest mechanism.
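One of the three variations, weighted random forests, can be sketched with scikit-learn's `sample_weight` mechanism. This is a minimal illustration under stated assumptions, not the thesis implementation: the synthetic data and the inverse-variance weighting scheme are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic heteroscedastic data: the noise standard deviation
# varies between datapoints, and its relative size is known.
n = 500
X = rng.uniform(-3, 3, size=(n, 1))
noise_sd = rng.uniform(0.1, 2.0, size=n)   # known relative noise levels
y = np.sin(X[:, 0]) + rng.normal(0.0, noise_sd)

# Weighted random forest: down-weight noisy datapoints so the trees
# fit the reliable measurements more closely (inverse-variance weights).
weights = 1.0 / noise_sd**2
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y, sample_weight=weights)

preds = model.predict(X)
```

The weights influence both the bootstrap resampling and the split criterion inside each tree, so low-noise measurements dominate the fitted ensemble.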

Finally, we apply these random forest methods for heteroscedastic data to more complex drug design datasets. These datasets come from early-stage drug design, where machine learning models often rely on compressed representations of the data: raw experimental results are summarised into a single metric per molecule via a curve-fitting process, which discards information about the quality of the curve fit. We introduce two fit-quality metrics and incorporate them into machine learning models to capture the reliability of the metrics for individual molecules. Using 40 datasets from PubChem (public) and BASF (private), we demonstrate that including a quality metric can significantly improve predictive performance without additional experiments. With fit-quality metrics, at least one of the machine learning methods tested gave a statistically significant performance improvement on 31 of the 40 datasets; in the best case, these methods reduced the model's root mean squared error by 22%. These results demonstrate that by adapting data processing to account for curve fit quality, we can improve predictive performance across a range of drug design datasets. In summary, this thesis develops an interpretation of random forests and uses it to develop novel methods that give improved performance on real drug design tasks.
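The idea of a fit-quality metric can be illustrated with a toy dose-response fit. The Hill curve, the synthetic assay data, and the use of R² as the quality score are all illustrative assumptions for this sketch; they are not the specific metrics developed in the thesis.

```python
import numpy as np
from scipy.optimize import curve_fit

def hill(c, top, ec50, slope):
    """Simple Hill (dose-response) curve: response vs concentration."""
    return top / (1.0 + (ec50 / c) ** slope)

rng = np.random.default_rng(1)
conc = np.logspace(-2, 2, 10)                      # assay concentrations
true = hill(conc, top=100.0, ec50=1.0, slope=1.2)
resp = true + rng.normal(0.0, 5.0, size=conc.size)  # noisy responses

# Summarise the raw curve as fitted parameters (e.g. the EC50)...
params, _ = curve_fit(
    hill, conc, resp,
    p0=[100.0, 1.0, 1.0],
    bounds=([0.0, 1e-6, 0.1], [200.0, 100.0, 5.0]),
)

# ...and also record how well the curve fits: an R^2 fit-quality metric
# that a downstream model can use to judge the summary's reliability.
resid = resp - hill(conc, *params)
r2 = 1.0 - resid.var() / resp.var()
```

Carrying `r2` alongside the fitted summary, rather than discarding it, is the kind of extra per-molecule reliability signal the abstract describes.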

Date

2025-07-05

Advisors

King, Ross

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All rights reserved