
^{1}Optibrium Limited, Cambridge, CB25 9PB, UK

^{2}Cavendish Laboratory, University of Cambridge, Cambridge, CB3 0HE, UK

^{3}Intellegens Limited, Cambridge, CB4 3AZ, UK

Imputation is a powerful statistical method that is distinct from the predictive modelling techniques more commonly used in drug discovery. Imputation uses sparse experimental data in an incomplete dataset to predict missing values by leveraging correlations between experimental assays. This contrasts with quantitative structure–activity relationship methods that use only descriptor–assay correlations. We summarize three recent imputation strategies – heterogeneous deep imputation, assay profile methods and matrix factorization – and compare these with quantitative structure–activity relationship methods, including deep learning, in drug discovery settings. We comment on the value added by imputation methods when used in an ongoing project and find that imputation produces stronger models, earlier in a project, across activity and absorption, distribution, metabolism and elimination end points.

Imputation is the process of filling in the gaps in a dataset, where values have not yet been measured, using the limited data that are already present. This may appear simple when only a few values are missing but is challenging when more than half, or as much as 99%, of the data are absent. Data are a precious commodity for model building, and using imputation enables greater value to be extracted from the limited data available. This is particularly important in drug discovery, where the data are especially sparse and noisy. The recent application of deep learning methods for data imputation has led to significant advances in predictive power and unlocked hidden potential in drug discovery projects. In this paper, we will discuss data imputation, compare it with other modelling methods and describe some example applications in drug discovery.

‘Imputation’ is a less-familiar term within statistics or machine learning in comparison with words such as ‘regression’, ‘prediction’ and ‘classification’. In drug discovery, the latter three terms are commonly associated with quantitative structure–activity relationship (QSAR) models, which attempt to describe a property of interest as a mathematical function of some number of chemical ‘descriptors’ as inputs, to estimate the value of the property for an as-yet unmeasured compound. This process is portrayed in the top half of the accompanying figure.
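As a deliberately simplified sketch of the QSAR workflow described above, the snippet below fits a random forest that maps chemical ‘descriptors’ to a measured property. All data, shapes and parameters here are synthetic, illustrative assumptions, not any published model.

```python
# Minimal QSAR sketch: descriptors in, property prediction out.
# All data are synthetic; the "assay" is a noisy function of two descriptors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                            # 200 compounds x 16 descriptors
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)  # synthetic assay value

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:150], y[:150])              # train on "measured" compounds
pred = model.predict(X[150:])            # predict as-yet "unmeasured" compounds
print(round(float(np.corrcoef(pred, y[150:])[0, 1]), 2))
```

The model sees only descriptors; it never uses measured values of other assays, which is precisely the limitation that imputation methods address.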

QSAR: Quantitative structure–activity relationship.

In contrast, imputation is the process of filling in the gaps in a dataset, where values have not yet been measured, using the limited data that are already present. This process can be simple if only a handful of data points are missing but is very challenging when more than half, or even 99%, of the data are absent in the first instance. Imputation is depicted in the bottom half of the accompanying figure.

Data are precious commodities for model building, and the imputation process of expanding the data that are available dramatically improves the quality of models. This not only captures correlation between descriptors and assays, but also explicitly uses the experimental data as an input, which enables the model to learn directly from assay–assay correlations.
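To make the assay–assay mechanism concrete, here is a minimal sketch with synthetic data: missing values in one assay are imputed from a correlated assay, using the compounds for which both were measured. Everything here (two assays, a linear relationship, roughly 50% sparsity) is an illustrative assumption, not a real imputation engine.

```python
# Illustrative imputation via assay-assay correlation on a sparse matrix.
import numpy as np

rng = np.random.default_rng(1)
n = 100
assay_a = rng.normal(size=n)
assay_b = 2.0 * assay_a + 0.1 * rng.normal(size=n)  # hidden ground truth

data = np.column_stack([assay_a, assay_b])
mask = rng.random((n, 2)) < 0.5                     # True = measured (~50% sparse)
observed = np.where(mask, data, np.nan)

# Fit B ~ A on compounds where both assays were measured...
both = mask[:, 0] & mask[:, 1]
slope, intercept = np.polyfit(observed[both, 0], observed[both, 1], 1)

# ...then impute missing B values from measured A values.
fillable = mask[:, 0] & ~mask[:, 1]
imputed_b = slope * observed[fillable, 0] + intercept
error = np.abs(imputed_b - data[fillable, 1])
print(round(float(error.mean()), 3))
```

A descriptor-only QSAR model could never recover this relationship directly, because it ignores the measured values of assay A at prediction time.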

There are at least three noteworthy methods for imputation in drug discovery. These provide an interesting comparison and are summarized below:

Alchemite™ (Intellegens, Cambridge, UK) is a novel method that combines the power of deep learning with an imputation framework and can exploit nonlinear correlations between heterogeneous assay end points.

Profile QSAR (pQSAR) is a two-step imputation method formed from a combination of commonly used QSAR approaches. pQSAR first builds a layer of random forest models, one per assay; the predictions of these models form an assay ‘profile’ for each compound, which then serves as the input to a second-level partial least squares model for each end point.

The Macau method combines Bayesian probabilistic inference with an implementation of matrix factorization, decomposing the compound-by-assay matrix into low-rank latent factors that can also incorporate side information such as compound descriptors.
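A toy version of matrix-factorization imputation can be sketched as below. This is not the Macau implementation itself (which uses Bayesian inference and side information); it is a plain gradient-descent factorization on synthetic low-rank data, intended only to show how observed entries constrain the reconstruction of missing ones.

```python
# Toy matrix-factorisation imputation: learn low-rank factors U, V from
# observed entries only, then reconstruct the full compound-by-assay matrix.
import numpy as np

rng = np.random.default_rng(3)
n, m, rank = 60, 8, 3
U_true = rng.normal(size=(n, rank))
V_true = rng.normal(size=(m, rank))
M = U_true @ V_true.T                    # low-rank ground truth
mask = rng.random((n, m)) < 0.6          # ~60% of entries observed

U = 0.1 * rng.normal(size=(n, rank))
V = 0.1 * rng.normal(size=(m, rank))
lr = 0.02
for _ in range(2000):                    # gradient descent on observed cells only
    R = (U @ V.T - M) * mask             # residuals on observed entries
    U, V = U - lr * (R @ V), V - lr * (R.T @ U)

rmse_missing = np.sqrt(np.mean((U @ V.T - M)[~mask] ** 2))
print(round(float(rmse_missing), 3))
```

Because the factors are shared across rows and columns, every observed measurement improves the estimates for compounds and assays it touches, which is the essence of the matrix-factorization approach.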

This illustrates the improvement in prediction quality (reduction in root-mean-square error) as the methods that provide uncertainty estimates alongside predictions focus on the most confident fraction of the dataset to impute or predict.

CMF: Collective matrix factorization; DNN: Deep neural network; pQSAR: Profile quantitative structure–activity relationship; RMSE: Root-mean-square error.

The combination of imputation and effective confidence estimates means that scientists can choose an appropriate level of uncertainty and partially fill in the dataset with predictions of sufficiently high quality.
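The mechanics of confidence filtering can be sketched as follows. The synthetic uncertainties here genuinely track the size of the error, which is an idealizing assumption standing in for a well-calibrated model; keeping only the most confident fraction then lowers the error of the values that are filled in.

```python
# Sketch of filtering imputed values by confidence, with synthetic
# predictions whose error scales with the model's own uncertainty estimate.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
truth = rng.normal(size=n)
uncertainty = rng.uniform(0.1, 1.0, size=n)          # per-value uncertainty estimate
pred = truth + uncertainty * rng.normal(size=n)      # error tracks uncertainty

def rmse(p, t):
    return float(np.sqrt(np.mean((p - t) ** 2)))

order = np.argsort(uncertainty)                      # most confident first
top20 = order[: n // 5]                              # keep the top 20%
print(round(rmse(pred, truth), 2), round(rmse(pred[top20], truth[top20]), 2))
```

In practice a threshold on uncertainty, rather than a fixed fraction, lets the scientist decide how much of the matrix to fill in at a chosen quality level.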

An important application of QSAR models is to make predictions for ‘virtual compounds’ which have not yet been synthesized and thus have no experimental information, based solely on molecular descriptors. Deep imputation models can easily be extended to this case by inputting only the compound descriptors for new compounds, as illustrated in the accompanying figure.

With no existing experimental data, the algorithm has the same information as a QSAR method at test time; however, the model can still exploit the assay–assay correlations it learned in training and use these to enhance predictions. We would therefore expect the model's performance to lie between that of a good QSAR method and that of an imputation model with experimental information.
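The value of an experimental input, and what is lost when a virtual compound lacks it, can be shown with a minimal synthetic sketch: a single model takes descriptors plus a second assay's value, and for a ‘virtual’ compound the assay input is unavailable and is masked with the training mean (one simple masking choice among several). Everything here is an illustrative assumption.

```python
# One model, two prediction regimes: with and without a measured assay input.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 300
desc = rng.normal(size=(n, 4))
assay_a = desc @ rng.normal(size=4) + 0.5 * rng.normal(size=n)
assay_b = assay_a + 0.2 * rng.normal(size=n)         # target, strongly tied to A

X = np.column_stack([desc, assay_a])                 # descriptors + assay A input
model = LinearRegression().fit(X[:200], assay_b[:200])

full = model.predict(X[200:])                        # assay A has been measured
masked = X[200:].copy()
masked[:, -1] = X[:200, -1].mean()                   # virtual compound: A unknown
virtual = model.predict(masked)

def rmse(p, t):
    return float(np.sqrt(np.mean((p - t) ** 2)))

print(round(rmse(full, assay_b[200:]), 2), round(rmse(virtual, assay_b[200:]), 2))
```

The gap between the two error values is the information carried by the experimental input; a dedicated descriptor-only QSAR model is the other natural reference point.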

There is evidence that imputation adds value to drug discovery projects beyond the benchmarking data discussed above. An illustrative case study demonstrated the successful application of deep imputation to heterogeneous drug discovery project data, achieving R^{2} = 0.72 where the best QSAR equivalents achieved R^{2} = 0.50. It was also found that QSAR models completely failed to predict a cell-based activity end point, achieving a negative R^{2}, whereas the imputation method could make accurate predictions (R^{2} = 0.66) by exploiting assay–assay correlations.

Further to that study, the value added as a project progresses is illustrated by how model quality grows with the number of measured data points: the percentage of end points with R^{2} > 0.6 rises steadily for the imputation method, even for low numbers of training points, whereas random forest only shows a significant rise beyond 600 measured data points. With few data points, 20–30% of Alchemite end points have excellent models (R^{2} > 0.7), whereas random forest models cannot achieve this level of accuracy even when using all of the data. Therefore, fewer data points are needed to build high-quality imputation models, equating to fewer experiments, reduced cost and a shorter timeframe.
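The underlying analysis, counting what fraction of end points reach a given R^{2} at each training-set size, can be sketched as below on synthetic data. The data, model choice and thresholds are illustrative assumptions; the point is the shape of the computation, not the numbers.

```python
# Learning-curve sketch: fraction of end points with holdout R^2 above a
# threshold, as a function of training-set size (synthetic data throughout).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
n, d, n_endpoints = 400, 8, 6
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, n_endpoints)) + 0.5 * rng.normal(size=(n, n_endpoints))

def fraction_good(n_train, threshold=0.6):
    """Fraction of end points whose model exceeds `threshold` R^2 on holdout."""
    good = 0
    for j in range(n_endpoints):
        rf = RandomForestRegressor(n_estimators=50, random_state=j)
        rf.fit(X[:n_train], Y[:n_train, j])
        if r2_score(Y[300:, j], rf.predict(X[300:])) > threshold:
            good += 1
    return good / n_endpoints

results = {size: fraction_good(size) for size in (50, 300)}
print(results)
```

Plotting such fractions against training-set size, per method, yields exactly the kind of comparison described above.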

The percentage of end points with R^{2} values above 0.5 (light), 0.6 and 0.7 (dark) is plotted against the number of data points used to construct the model, for both Alchemite™ deep imputation and random forest models.

We have explained the concept of imputation and contrasted it with standard QSAR models. Imputation uses the known data explicitly, exploiting assay–assay correlations to reconstruct the missing elements, and thereby gains more value from the investment in experimental measurements. One immediate and obvious advantage of imputation is the filling of the entire data matrix, which can provide as much as 100 times the original data.

Although imputation is a new method in the field of drug discovery, there is growing evidence that it can make more effective and efficient use of data in a variety of applications, improving productivity.

Imputation uses existing data resources more efficiently. An obvious application is for imputation to be used at the data source itself, whether that be a local dataset, corporate repository or large-scale data warehouse. Filling in the gaps in data sources with confident predictions will dramatically increase the wealth of information that can be mined to identify high-quality compounds. Combining proprietary data with all known public domain data in a single imputed database will also give stronger predictive models for many types of end points in the chemical and biological domains.

Imputation can be used to combine diverse types of sparse data, not only from experiments but also from computational simulations and physics-based calculations. Data can be aggregated across different resolutions and scales, with inexpensive and fast calculations assisting the prediction of moderately challenging end points, which in turn assist the prediction of the hardest, slowest and most expensive data points to collect.
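A minimal sketch of this multi-fidelity idea, on synthetic data: a cheap calculated property is used as an extra input when modelling a scarce, expensive end point, improving over descriptors alone. The linear models, noise levels and data split are illustrative assumptions.

```python
# Multi-fidelity fusion sketch: a fast calculation helps predict a scarce,
# expensive end point that lies further along the same causal chain.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
n = 500
desc = rng.normal(size=(n, 6))
cheap = desc @ rng.normal(size=6) + 0.3 * rng.normal(size=n)   # fast calculation
latent = cheap + 0.3 * rng.normal(size=n)
expensive = latent + 0.2 * rng.normal(size=n)                  # slow, costly assay

train, test = slice(0, 60), slice(400, None)  # expensive data are scarce

def rmse(p, t):
    return float(np.sqrt(np.mean((p - t) ** 2)))

base = Ridge().fit(desc[train], expensive[train])
fused = Ridge().fit(np.column_stack([desc, cheap])[train], expensive[train])
print(round(rmse(base.predict(desc[test]), expensive[test]), 2),
      round(rmse(fused.predict(np.column_stack([desc, cheap])[test]),
                 expensive[test]), 2))
```

The cheap property carries information the descriptors alone cannot recover, which is why aggregating across fidelities pays off most for the scarcest end points.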

Any imputation method that can fulfill this would need to be proven to handle very sparse, noisy and heterogeneous data, to scale to large dataset sizes and to remain robust when unrelated data sources are merged. The strongest method so far is the deep imputation method that provides uncertainty estimates for each imputed value. The flexibility of the deep learning framework will facilitate applications to novel kinds of data, including time series, curve and distributional data, text data, aggregations of samples among populations and even graphical and network-based data sources.

For drug discovery, the combination of chemist and computational algorithm will enhance productivity, taking the burden of routine and intensive tasks away from the expert and leaving them freer to be creative and to focus on the human, strategic elements of the design process. Imputation methods will take a leading role in achieving this goal as a foundation for further downstream AI and machine learning methods, which cannot handle incomplete data. Such algorithms could make additional statements about missing values beyond ‘what number can I assign to this missing value?’ and ‘how confident are we in this prediction?’; we could ask questions such as ‘how relevant is this missing value to my project?’, ‘which experiment should I perform next?’ and ‘can I trust this data point or should I remeasure it?’. Addressing these questions will ultimately assist a step change in efficiency and effective resource allocation.

BWJ Irwin, S Mahmoud and MD Segall are employees of Optibrium Ltd. TM Whitehead and GJ Conduit are employees of Intellegens Ltd. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

No writing assistance was utilized in the production of this manuscript.

This work is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit
