Deep imputation on large-scale drug discovery data

More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success rate of pharmaceutical R&D. However, this domain presents a significant challenge for AI methods due to the sparsity of compound data and the noise inherent in results from biological experiments. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest-to-date successful application of deep-learning imputation to datasets which are comparable in size to the corporate data repository of a pharmaceutical company (678 994 compounds by 1166 endpoints). We demonstrate this improvement for three areas of practical application linked to distinct use cases: (a) target activity data compiled from a range of drug discovery projects, (b) a high value and heterogeneous dataset covering complex absorption, distribution, metabolism, and elimination properties, and (c) high throughput screening data, testing the algorithm's limits on early stage noisy and very sparse data. Achieving median coefficients of determination, R², of 0.69, 0.36, and 0.43, respectively, across these applications, the deep learning imputation method offers an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19, and 0.23, respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracies in prediction, enabling greater confidence in decision-making based on the imputed values.


INTRODUCTION
The combination of deep learning and statistical imputation methods is seeing rapidly growing success in a wide range of scientific domains including high-value materials discovery, 1,2 the development of new chemicals for industrial applications, 3,4 battery development, 5 and most importantly for the context of this work small molecules drug discovery. [6][7][8][9][10][11] This success can be attributed to the predictive power of the deep learning methodology combined with the flexibility and practical advantages of the imputation framework, which can handle sparse datasets, and use existing, partial assay data to enhance the quality of predictions for missing values in the dataset. 8 Sparse datasets are common in experimental scientific domains, where it is extremely rare that all possible experiments are run on all possible subjects, often due to the cost and time associated with collecting experimental data. 6,8 In this paper, we will focus on applications of deep learning imputation to the discovery of new drugs. This is a particularly attractive field for the application of artificial intelligence methods of many kinds, 12 due to the high costs, long timescales, and valuable output of pharmaceutical research and development. The average cost of a novel drug that succeeds in clinical trials and reaches the market is $2.6B. 13 This cost is driven by the high failure rate in the R&D process; only 4% of drug discovery projects result in a marketed drug, and only 12% of candidate drugs that enter expensive and time-consuming clinical trials reach the market. 14 However, the value of an efficacious drug to a patient whose disease is cured or ameliorated may be incalculable, and the associated financial benefit to a pharmaceutical company can be commensurately large; a "blockbuster" drug will achieve sales measured in billions of dollars per year.
The low success rate of pharmaceutical R&D is, in large part, due to the complexity of the process and the ultimate goal. Drug discovery begins with a biological target implicated in a disease process. This is typically a protein involved in a biological pathway which, if inhibited or stimulated, will treat the disease; for example, inhibiting an essential protein in bacteria, thereby killing the organism, can result in an antibiotic, while stimulating the dopamine receptor in the brain can treat the symptoms of Parkinson's disease. Once a suitable target has been identified, the objective of a drug discovery project is to identify a therapeutic that will achieve the desired effect on the target when dosed to a patient, while avoiding serious side effects. This process often begins by finding initial "hits" that show activity against the target in an in vitro assay, which is followed by an iterative optimization process in which new chemical compounds are synthesized and tested to identify a candidate drug suitable for clinical trials. The design of a high-quality clinical candidate is a complex process, requiring multi-parameter optimization (MPO) of target activity and many other characteristics required in an efficacious and safe drug, often summarized as absorption, distribution, metabolism, elimination, and toxicity (ADMET) properties. As compounds progress through the drug discovery process, more complex and expensive experiments are used to assess their likelihoods of success as a drug before a candidate is chosen for clinical trials in humans, subject to approval by the regulatory authorities. The drug discovery process, from selection of a target to nomination of a clinical candidate, takes an average of 5.5 years, and the complete R&D process through to launch of a new drug takes 13.5 years on average. 14
Clearly, the ability to make accurate predictions of the best compounds to synthesize and which to progress to more expensive studies, based on initial experimental results, has the potential to dramatically improve the cost, time, and success rate of drug discovery. These objectives are challenging due to the nature of the data available with which to build predictive models. Datasets are typically much smaller and sparser than those seen in traditional machine learning applications, such as image recognition and language processing. It is rare for a compound to have been measured in all relevant experiments, and no experiments are run on all potential compounds of interest. In a typical pharmaceutical company's database, less than 1% of the possible experimental data points across all compounds of interest will have been measured. In addition to sparsity issues, drug discovery data contain a high degree of experimental noise due to the variability inherent in biological assays. 7 Even when sources of experimental variability are minimized, the results for a given compound in an experimental assay will often vary by 0.5 log units. 15 A wide range of machine learning methods has been applied to predict compound activities and ADMET properties. 16 These quantitative structure-activity relationship (QSAR) models relate features calculated from chemical structures (often referred to as "descriptors") to one or more target activities or ADMET properties. A comparison of deep learning imputation with the broader field of machine learning in the context of drug discovery is given by Irwin et al. 7 A successful implementation is the Alchemite method, 1-5 which has outperformed other data imputation techniques both in terms of accuracy and modeling performance 8 as well as flexibility of implementation and robustness to the challenges associated with practical drug discovery applications. 7
Alchemite demonstrated qualitative benefits over a variety of other machine learning methods, including random forests (RFs), 17 deep neural networks, 18 matrix factorization, 19,20 and purpose-made drug discovery imputation routines 21,22 on two benchmark drug discovery datasets. 6 These homogeneous datasets, comprised purely of target activities, were also designed to mimic the challenging extrapolation expected in drug discovery applications, where the training set is "known chemistry," and the test set requires extrapolation into "new chemistry" that has not yet been seen by the models. 22 This application also showed that the uncertainty estimates given by Alchemite allow substantial enhancements to the predictive quality of models. By exploiting the bespoke uncertainty estimate for each prediction and focusing on the most confident predictions, the effective accuracy of models exceeded a coefficient of determination (R²) of 0.9, compared to the headline figure of 0.44 on the entire dataset. 6,8 This focusing effect is impossible for methods which do not provide a robust error bar for each individual prediction, 8 and would not yield a benefit where the error bars are of low quality.
The Alchemite method proved more robust than a suite of standard machine learning methods when applied to a real and active drug discovery project. 7 This application demonstrated that, in addition to coping with noise and sparsity, Alchemite could address temporal evolution within datasets and include a mixture of heterogeneous endpoints in a single model. These endpoints can either be unrelated, in which case they are treated separately, or related through complex functions of multiple experimental measurements. These benefits of deep imputation were retained on the small-scale datasets typical of a drug discovery project, in contrast with many deep learning methods that rely on large-scale datasets to gain value over simpler machine learning methods in this field.
The method also successfully assisted in finding a novel, active anti-malarial compound when combined with generative methods. [8][9][10]23 This application relied on so-called "virtual" models, 8 which depend only on calculated molecular descriptors, allowing virtual screening of the generated compounds, which had not yet been synthesized or tested experimentally. The use of Alchemite's robust uncertainty estimates in combination with probabilistic MPO techniques 24 enabled the confident selection of a compound for synthesis and experimental validation.
The above-reported successes have all, to date, been achieved on small to moderately sized datasets. For typical drug discovery projects, this would mean hundreds to a few thousand compounds (rows), and tens of experimental endpoints (columns) in the data matrix. 7 While the Alchemite method has fulfilled the criteria for a robust and practically useful methodology to tackle challenging applications in the field of drug discovery, 7 a key requirement is to prove the scalability of the method to large datasets, 8 comparable in scale to data available to a moderate-to-large pharmaceutical company. Such a "large" dataset would contain on the order of one million compounds and thousands of experimental endpoints.
Scalability has been the focus of other imputation methods, such as the MACAU matrix factorization method, 19 which, in order to scale to millions of compounds, uses a linear model that captures only a shallow degree of the correlation between endpoints. 8 In contrast, the Alchemite method is a nonlinear deep learning methodology, which has provably exploited multiple experimental correlations to predict complex and multifactorial cell-based properties.
Application to large-scale databases of compound data will bring further benefits. Learning from interassay correlations across much larger numbers of compounds than would be explored in a single project will enable this information to be leveraged across multiple drug discovery projects and biological targets. This will improve the accuracy of predictions and may reveal unexpected historical correlations between experimental endpoints. Virtual models derived from such a "global" model will improve the virtual screening of new compounds. Imputation of new activities for existing compounds may reveal opportunities for repurposing compounds for different therapeutic objectives. These applications will unlock enormous value from the wealth of data stored in pharmaceutical companies' data repositories, but whose full potential is, as yet, unrealized. Furthermore, achieving these objectives with a single multitarget model across all compounds and endpoints will reduce computational complexity and cost, vs building and updating models for individual projects on an ad hoc basis.
In this work, we show the first successful application of the Alchemite deep imputation methodology to a pharma-scale dataset within a reasonable computational cost. This important step demonstrates that the method meets the requirements for implementation as an overarching modeling method to realize the benefits outlined above. To evaluate the algorithm's ability to successfully handle different kinds of data on the large-scale dataset used in this work, we considered three relevant drug discovery applications:
• Project activities: In a drug discovery project, compounds are assessed using assays to test for activity against one or more biological targets. The Project Activities endpoints were used to evaluate the ability to prospectively predict the activities (measured as the concentration at which half of the maximum inhibition is observed, IC50) of compounds against targets. The data are aggregated across many drug discovery projects, and therefore there are 178 different target columns, each corresponding to a distinct IC50.
• ADMET: This dataset was used to evaluate the ability to predict a broad range of ADMET endpoints, including compound solubility and cell permeability, the extent that compounds inhibit common drug-metabolizing enzymes, metabolic stability, and toxicity endpoints. For ADMET endpoints that describe activities against known common ADMET targets, the column data type is either an IC50, or one or more percentage inhibition (%inh) results at different concentrations, but never both together.
• High-throughput screening (HTS): At the inception of a drug discovery project, initial chemical starting points may be identified by a broad and coarse sweep of a wide diversity of compounds to identify those that show indications of activity against the target in question. These HTS campaigns may test hundreds of thousands, or even millions, of compounds. To achieve this at an acceptable cost and time, high-throughput assays are employed that are often noisier than the later project activity assays. The majority of data points will also show little or no activity, which creates a significant bias in the resulting dataset. The objective of this application was to assess the ability to predict HTS activities, despite the limitations in these data. In particular, we wished to test the ability of Alchemite models to predict the activities of a full HTS screen (Assay X) from the results of a much smaller pilot screen. These columns are usually either an IC50 (sometimes with a single high concentration %inh result), or multiple %inh results measured at different concentrations.
In order to allow reproducible comparison of our results, we also include the latest performance of the Alchemite algorithm on a publicly available dataset. The data cover 159 sparse kinase assays downloaded from ChEMBL, 25 and were assembled by Martin et al. 22 This dataset was used in a previous benchmarking study, 6 and the results reported here are the most recent, from an improved version of the Alchemite algorithm.

METHODS
The deep imputation method used in this work, Alchemite, is based on the iterative application of a deep learning algorithm to the sparse experimental data to identify and leverage nonlinear correlations between endpoints. It has been previously described in detail in Verpoort et al. 26 and its application to drug discovery data was described in References 6 and 7. A full description of the method is also included in the "Methods" section of the Supplementary Information. Here, we will provide a high-level summary of the method. Two classes of model are used in this work:
• Imputation: These models generate predictions for the test data points using sparse assay data as input, in addition to molecular descriptors, and test an Alchemite model's ability to "fill in the gaps" in the experimental data for compounds that have been synthesized and tested in some assays.
• Virtual: These models are built to expect only molecular descriptors as input. They test an Alchemite model's ability to make predictions based only on compound structure, that is, for a compound that has not yet been synthesized or tested.
To train an imputation model, missing values in the sparse experimental data are first given provisional estimates drawn from a distribution approximating that of the existent experimental data for each endpoint. For each of the N endpoints, the other N − 1 endpoints and the structural descriptors are used to build a model of the experimental data in that endpoint, and this model is used to impute updated values for the initially missing data for each endpoint in parallel to obtain improved estimates for each missing value. This procedure is then iterated, using the estimates from iteration I − 1 to generate the Ith set of estimates. Once the estimates are sufficiently converged, or the desired number of iterations has been carried out (typically two or three iterations), the algorithm returns the latest set of estimates as the predictions for all missing values in the dataset.
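The iterative scheme described above can be illustrated with a minimal sketch. It is not the Alchemite implementation: a k-nearest-neighbor average stands in for the proprietary per-endpoint model, the provisional estimates are column means rather than draws from a fitted distribution, and a fixed number of iterations replaces a convergence check. The function names `knn_predict` and `iterative_impute` are hypothetical.

```python
def knn_predict(query, train_feats, train_targets, k):
    """Average the targets of the k training rows closest to `query`."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(query, feats)), idx)
        for idx, feats in enumerate(train_feats)
    )
    nearest = [train_targets[idx] for _, idx in ranked[:k]]
    return sum(nearest) / len(nearest)


def iterative_impute(descriptors, endpoints, n_iter=3, k=2):
    """Fill missing endpoint values (None) by iterating per-endpoint models.

    Assumption: a kNN regressor replaces the per-endpoint model of the paper.
    """
    n_rows, n_cols = len(endpoints), len(endpoints[0])
    observed = [[v is not None for v in row] for row in endpoints]

    # Provisional estimates (assumption: column mean, not a random draw).
    estimates = [list(row) for row in endpoints]
    for j in range(n_cols):
        known = [row[j] for row in endpoints if row[j] is not None]
        col_mean = sum(known) / len(known)
        for i in range(n_rows):
            if not observed[i][j]:
                estimates[i][j] = col_mean

    for _ in range(n_iter):
        # Every endpoint is updated from the estimates of iteration I - 1.
        previous = [list(row) for row in estimates]

        def features(i, j):
            # Descriptors plus the other N - 1 endpoints.
            others = [previous[i][c] for c in range(n_cols) if c != j]
            return list(descriptors[i]) + others

        for j in range(n_cols):
            train_rows = [i for i in range(n_rows) if observed[i][j]]
            train_feats = [features(i, j) for i in train_rows]
            train_targets = [endpoints[i][j] for i in train_rows]
            for i in range(n_rows):
                if not observed[i][j]:
                    estimates[i][j] = knn_predict(
                        features(i, j), train_feats, train_targets, k
                    )
    return estimates
```

Measured values are never overwritten; only the initially missing cells are updated at each iteration, and each update reads from a snapshot of the previous iteration so that all endpoints can be processed independently.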
The virtual model is trained similarly, except that the model is constrained not to use experimental endpoints as inputs in the first iteration, mimicking the later application to virtual compounds. This approach still leverages nonlinear correlations between endpoints through the later steps of the iterative procedure, enabling improvements in performance over methods that simply focus on predicting one endpoint at a time. Predictions can then be made by taking as input the chemical descriptors of a compound, iteratively generating estimates for every endpoint, and returning the latest set of consistent estimates.
For both imputation and virtual models, the underlying modeling of each endpoint is performed using a proprietary "gradient" kernel. In contrast to standard neural network sigmoid or rectifier activation functions, which can be envisaged as beginning with a large-length-scale approximation of a function and gradually adding more fine detail, the gradient kernel begins with detailed local models and gradually stitches them together into a cohesive whole. This enables more accurate capture of effects like activity cliffs-where a response rapidly varies as a function of the inputs-and is generally on the order of a thousand times quicker to train due to the inherent parallelizability.
One of the most important elements of the deep imputation model is the ability to quantify the uncertainty in predictions. This enables one to separate the most confident predictions from uncertain predictions, targeting future resources only on those compounds with the highest probability of success. An ensemble of submodels is used to quantify uncertainty for each endpoint at each iteration. Each submodel is trained on a bootstrap sample of the available data to provide accurate treatment of the variation within the data.
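A bootstrap ensemble of this kind can be sketched as follows. This is an illustration only: a one-variable least-squares line stands in for each submodel, and the names `fit_line` and `bootstrap_predict`, the ensemble size, and the seed are assumptions rather than details from the method. The spread of the submodel predictions serves as the uncertainty estimate.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line; falls back to a constant if all x are equal."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def bootstrap_predict(xs, ys, x_new, n_models=25, seed=0):
    """Return (mean prediction, standard deviation) over the ensemble."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        # Bootstrap sample: draw n rows with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line(
            [xs[i] for i in sample], [ys[i] for i in sample]
        )
        preds.append(slope * x_new + intercept)
    return statistics.fmean(preds), statistics.pstdev(preds)
```

For noise-free data every submodel agrees and the reported spread collapses toward zero; for noisy or sparsely sampled regions the submodels disagree and the spread grows, which is the behavior that makes the error bar useful for ranking predictions by confidence.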
One additional complexity of the drug discovery data used here is that multiple endpoints are frequently measured in the same experimental assay. For example, permeability assays often report both apical (P app A to B) and basolateral (P app B to A) permeability endpoints, as well as the ratio (B to A)/(A to B). Because the ratio depends directly on the two values, the apical and basolateral values should not be used as inputs to predict the ratio. In general, one endpoint from a given assay should not be used as input to predict another endpoint from the same assay, as at test time either both endpoints will have been measured for a given compound or neither will be available. To capture this, Alchemite includes generalized, asymmetric constraints on column dependencies. These can also be used to ensure assays that are typically run late in a program are not used as input to predict assays run earlier for a given compound (while still allowing the early stage assays to be used as input to predict the late-stage assays).
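The two rules described here (no inputs from the same assay, and no inputs from later-stage assays) can be expressed as a simple dependency table. The sketch below is a hypothetical illustration of the idea, not Alchemite's constraint format; the function name `allowed_inputs`, the endpoint names, and the integer stage labels are all assumptions.

```python
def allowed_inputs(endpoints):
    """Map each target endpoint to the endpoints permitted as its inputs.

    `endpoints` maps an endpoint name to (assay_id, stage), where a larger
    stage means the assay is typically run later in a program.
    """
    allowed = {}
    for target, (t_assay, t_stage) in endpoints.items():
        allowed[target] = {
            name
            for name, (assay, stage) in endpoints.items()
            if name != target
            and assay != t_assay   # same-assay endpoints are measured together
            and stage <= t_stage   # later assays must not inform earlier ones
        }
    return allowed


# Hypothetical endpoints for illustration only.
example = {
    "papp_a_to_b": ("permeability", 1),
    "papp_b_to_a": ("permeability", 1),
    "efflux_ratio": ("permeability", 1),
    "solubility": ("solubility", 0),
    "in_vivo_clearance": ("pk", 2),
}
constraints = allowed_inputs(example)
```

The resulting table is asymmetric: in this example `solubility` may be used to predict `in_vivo_clearance`, but not the reverse, and the efflux ratio never sees the two permeability values it is computed from.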

COMPARISON QSAR MODELS
An RF model was also constructed for each individual endpoint as a representative example of QSAR methods. These RF models were generated using the scikit-learn implementation of regression RF 27 and take only compound descriptors as input.
A wider comparison of QSAR methods was previously undertaken for project data by Irwin et al. 7 This included partial least squares, RF, Gaussian process, and radial basis function models and found RF models to be broadly representative of the accuracy of QSAR methods. Advanced (multitarget) methods such as deep neural networks and matrix factorization were compared to Alchemite by Whitehead et al. 6 In all cases, Alchemite's deep imputation method was found to outperform the other approaches significantly.

METRICS
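All models were evaluated on an independent test set using two statistics: the coefficient of determination (R²), defined as

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}} \right)^2}{\sum_{i=1}^{N} \left( y_i^{\mathrm{obs}} - \bar{y}^{\mathrm{obs}} \right)^2},$$

which takes values in the range (−∞, 1], where 1 indicates a perfect model, 0 indicates a model no better than random, and negative values indicate predictions that are worse than random; and the root mean square error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}} \right)^2},$$

where N is the number of compounds in the set, $y_i^{\mathrm{pred}}$ is the predicted value, $y_i^{\mathrm{obs}}$ is the experimentally observed value for data point i, and $\bar{y}^{\mathrm{obs}}$ is the mean of the observed values. The RMSE is expressed in the same units as the observed property values. R² values were calculated only for endpoints with greater than five data points to give sufficient statistical relevance.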
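The two statistics can be computed as follows. This sketch assumes the standard definitions of the coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares about the observed mean) and of the RMSE; under that definition, always predicting the observed mean gives R² = 0. The helper `endpoint_r_squared`, which applies the "greater than five data points" rule, is a hypothetical name.

```python
import math


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot  # undefined if all observations are equal


def rmse(y_obs, y_pred):
    """Root mean square error, in the units of the observed values."""
    squared = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(squared / len(y_obs))


def endpoint_r_squared(y_obs, y_pred, min_points=6):
    """Return R^2 only for endpoints with more than five test points."""
    if len(y_obs) < min_points:
        return None
    return r_squared(y_obs, y_pred)
```

Because the denominator of R² is the spread of the observed values, an endpoint whose test values cover a narrow range can show a low R² even when its RMSE is small, which is one reason both statistics are reported.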

DATASET
All modeling data in the main study are proprietary and were provided by Takeda Pharmaceutical Company Ltd. Prior to modeling, all qualified and out-of-range data were removed. The remaining data were transformed into units more amenable for machine learning (eg, log transformations were applied to columns which varied many orders of magnitude). To maintain full modeling rigor, the dataset was split into training, validation, and independent test sets. The full training dataset contained 678 994 compounds and 1166 experimental endpoints; the breakdown across the three applications described above is shown in Table 1.
The blind test set contained a total of 17 660 data points across endpoints for each application, as described in Table 2. The independent, blind test sets were prepared by Takeda and withheld during the model building and internal validation process. Predictions for the blind test sets were provided to Takeda before the experimentally observed values were revealed. The test sets were generated by Takeda in the following ways:
• Project activities: The test data points were selected temporally, that is, the most recently measured data points were withheld, to test the models' abilities to predict the activities of the most recently synthesized compounds and assay results.
• ADMET: The test data points were selected randomly to test the models' abilities to predict ADMET properties for a wide diversity of compounds. This selection method gives an even coverage of each type of endpoint and value according to their prevalence in the overall dataset.
• HTS: A small proportion of the test data points were selected randomly. However, the majority were derived from a single assay (Assay X), for which the results of a pilot screen were provided in the training set, but the results from the remaining compounds in the full screening collection were withheld. The split was arranged in this way to test whether the majority of a more detailed high throughput screen could be predicted using the very broad pilot screen.

ChEMBL kinase dataset
We provide results for a publicly available dataset which allows direct comparison with other methods. The data cover 159 sparse kinase assays downloaded from ChEMBL, 25 and were assembled by Martin et al. 22 The dataset is provided with the Supplemental Information to this manuscript.

Compound descriptors
A total of 330 molecular descriptors were calculated with the StarDrop Auto-Modeller module for each compound. These descriptors can be computed from the atom and bond graph structure of any compound, including virtual compounds, and therefore all descriptors are present for each compound in the dataset. The set of 330 descriptors comprises

TABLE 1 The breakdown of the training data in terms of the three applications (project activities, high-throughput screening [HTS], absorption, distribution, metabolism, elimination, and toxicity [ADMET]) and the number of endpoints, compounds, and data points, along with a measure of sparsity

RESULTS
Here, we show the results of the Alchemite models compared with RF models for each endpoint in the independent test set. Because the data for each application represent a large number of endpoints, as shown in Table 1, we show a profile of endpoint results, ordering the R² values from highest to lowest for each method (see Figure 1 for an example).
Alchemite also provides uncertainty estimates for each prediction, and we present a comparison of the uncertainty estimates with the observed errors for the independent test set.

Figure 1 shows the profile of endpoint results for the independent test data for the ChEMBL kinase dataset. The median R² across the 159 columns in the test set is imputation: 0.48(2), virtual: 0.06(1), RF: −0.14(3). We can see that Alchemite imputation is the highest, followed by Alchemite virtual. The median R² for RF is negative, which means that more than half of the columns have RF models that are worse than random (the variance in the predictions is larger than the variance of the overall column).

Figure 2 shows the profile over project activity endpoints for the independent test data. This plot includes curves showing the performance of Alchemite Imputation and Virtual models relative to the RF models. The median R² for the Alchemite Imputation model is 0.69, compared to 0.28 for RF models. The median R² for the Alchemite Virtual model is 0.55, which is also substantially higher than the RF models, showing that the multitarget deep learning with sparse experimental data trains a very high-quality virtual model. Each of the points in the profile shown in Figure 2 is an R² value for predictions of a different endpoint. In the case of the project activity endpoints, these are all measurements of activity against a target. Scatter plots and uncertainty analysis are given for a focused example in the Supplementary Information to better show how these predictions can be used in practice, while inspecting the quality of the models and the uncertainty predictions.

ADMET test results
The results for the independent test set for the ADMET endpoints are shown in Figure 3, and we can see a similar trend to the above example. The best model is the Alchemite Imputation model with a median R² of 0.36; close behind is the Alchemite Virtual model with a median R² of 0.32; and finally, RF models achieve a median R² of 0.19.
The ADMET endpoints represent a wide variety of different data types, and it is interesting to compare the profile of results for different classes of ADMET endpoints. In particular, Figure 4 shows the accuracy profiles for pIC50 and pEC50 (the negative base-10 logarithm of the concentration in Molar units which exhibits 50% of the maximum effect) endpoints respectively. From these plots, one can see that the Alchemite models have a much larger advantage over RF for the pEC50 endpoints than for pIC50 endpoints. The Alchemite imputation model outperforms the Alchemite virtual model for the pEC50 endpoints, whereas the two models are roughly equivalent for pIC50 endpoints. This result is consistent with those seen in smaller project datasets, where Alchemite tends to show the greatest benefit for complex, multi-mechanistic endpoints. 7 While a pIC50 measurement relates to the inhibition of a single target protein, pEC50 measurements result from more complex assays that may be influenced by multiple factors; for example, the activity of a compound in a cell will relate not only to its activity against a target protein, but also permeability through the cell membrane, solubility in the buffer solution and binding to other proteins in the cellular matrix. This demonstrates one of Alchemite's advantages, namely the ability to learn directly from relationships between experimental endpoints, which may capture these other factors, to make better predictions.

FIGURE 2 Profile of the coefficients of determination (R²) achieved on the independent test set for Project activity endpoints. The endpoints are ordered from highest R² (left) to lowest (right). Alchemite imputation model (orange), Alchemite virtual model (gray), and random forest models (blue) are plotted for comparison

FIGURE 3 Profile of the coefficients of determination (R²) achieved on the independent test set for absorption, distribution, metabolism, elimination, and toxicity (ADMET) endpoints. The endpoints are ordered from highest R² (left) to lowest (right). Alchemite imputation (orange), Alchemite virtual (gray), and random forest (blue) models are plotted for comparison

HTS test results
Figure 5 shows the profile of R² results over the HTS endpoints in the independent test set. In this case, the Alchemite virtual model is essentially equivalent to the RF models, whereas the Alchemite imputation results are an improvement. The median R² for RF is 0.23, similar to the Alchemite virtual model with a median R² of 0.27, but lower than the Alchemite imputation model, which achieved an R² of 0.43. This shows that there is additional information in the correlations between endpoints and in the structure of the training data that can be exploited by using Imputation on HTS data.
One objective of the HTS test was to assess the ability of the Alchemite models to predict activities for the full screening deck for Assay X, based on the pilot screen data for this assay. However, none of the models considered in this study showed sufficient predictive power for this endpoint, and the R² values were close to zero (Alchemite imputation 0.07, Alchemite virtual 0.12, RF −0.17). There are several possible explanations for this:
• The overlap in compounds with data measured in Assay X and other endpoints in the training set is lower than is typical for other endpoints in the training set. The maximum overlap corresponds to only 10% of the compounds for Assay X.
• The distribution of percentage inhibition data is challenging to model. For Assay X, the distribution is shown in Figure 6. The large majority of the measured values are distributed around 0%, plus or minus 10%, and only a very small proportion of compounds are measured to have significant activities, as we would expect from HTS. Furthermore, the noise in the measured values for inactive compounds may be affecting the ability of the accuracy metric to distinguish good from poor models and guide the model optimization. It may be possible to transform the percentage inhibition data to reduce the impact of this noise.
In order to test the impact of the bias in the distribution of observed values, a new version of the model was built in which the active data were oversampled by duplicating the active compounds 15 times relative to the inactive. This did not improve the accuracy of the resulting model. This indicates that the problem is less likely to be due to sampling bias and suggests that the algorithms may be attempting to model the noise in the data rather than the signal. If the compound structures had been known, compound set enrichment would be an option to look for statistically significant signals relative to the HTS background distribution. 29
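The oversampling experiment can be pictured with a short sketch. How "active" was defined in the study is not stated here, so the 50% inhibition threshold, the row layout, and the function name `oversample_actives` below are purely illustrative assumptions.

```python
def oversample_actives(rows, is_active, factor=15):
    """Repeat each active row `factor` times; keep inactive rows once."""
    resampled = []
    for row in rows:
        copies = factor if is_active(row) else 1
        resampled.extend([row] * copies)
    return resampled


# Hypothetical (compound id, percentage inhibition) rows.
pilot = [("c1", 2.0), ("c2", 75.0), ("c3", -4.0)]

# Assumed activity cutoff of 50% inhibition, for illustration only.
balanced = oversample_actives(pilot, lambda row: row[1] >= 50.0)
```

Duplication changes only the relative weight of active compounds during training; it adds no new information, which is consistent with the observation that it did not improve accuracy when the limiting factor is noise rather than class imbalance.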

Taking uncertainties in predictions into account
To test the accuracy of the uncertainty estimates produced by Alchemite, we can plot the RMSE in prediction vs the most confidently predicted fraction of the test set, that is, smaller fractions correspond to the predictions with the smallest error bars (according to the algorithm). If the uncertainties are conveying useful information, we would expect the most confidently predicted fractions of the test set to show better accuracy, that is, a lower RMSE.
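A minimal sketch of this analysis is given below, assuming that each test prediction comes with a single scalar error bar. The function name `rmse_by_confidence` and the example values are hypothetical, and the Gaussian smoothing applied in the published figures is omitted.

```python
import math


def rmse_by_confidence(observed, predicted, uncertainty, fractions):
    """RMSE of the most confidently predicted fraction of a test set."""
    # Rank test points from smallest to largest predicted error bar
    order = sorted(range(len(observed)), key=lambda i: uncertainty[i])
    curve = []
    for fraction in fractions:
        # Keep at least one point, even for very small fractions
        n_keep = max(1, round(fraction * len(order)))
        kept = order[:n_keep]
        mse = sum((observed[i] - predicted[i]) ** 2 for i in kept) / n_keep
        curve.append((fraction, math.sqrt(mse)))
    return curve


# Made-up pIC50-like values; larger error bars accompany larger errors
observed = [6.0, 7.0, 5.0, 8.0]
predicted = [6.1, 6.8, 5.9, 6.5]
uncertainty = [0.1, 0.2, 0.8, 1.2]

curve = rmse_by_confidence(observed, predicted, uncertainty, [0.5, 1.0])
```

If the error bars are informative, the RMSE at a fraction of 0.5 is lower than at 1.0, as it is for these toy values.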
The quality of uncertainty predictions, averaged across all 159 ChEMBL kinase dataset endpoints, as a function of the most confidently predicted fraction of the dataset, is shown in Figure 7. All of the ChEMBL kinase dataset endpoints are in the same units (pIC50, the negative base-10 logarithm of the concentration in molar units which exhibits 50% of the maximum target inhibition), so this average is well-defined. For comparison, the uncertainties in the RF predictions were calculated as the standard deviation of predictions from the ensemble of decision trees. Figure 7 shows that the error bars produced by all three methods, on average, provide some useful information in identifying more accurate predictions. In addition, the absolute RMSE is much higher, on average, for RF predictions. The Alchemite virtual model has a lower RMSE, and the imputation model has the best RMSE.
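The RF uncertainty estimate described above can be sketched as follows, taking the per-tree predictions as given. The function name `ensemble_mean_and_std` and the numbers are illustrative, and the use of the population (rather than sample) standard deviation is an assumption, since the text does not specify which was used.

```python
import statistics


def ensemble_mean_and_std(tree_predictions):
    """Mean prediction and spread across trees for each compound.

    tree_predictions[t][i] is the prediction of tree t for compound i.
    """
    # Transpose so that each tuple holds all tree outputs for one compound
    per_compound = list(zip(*tree_predictions))
    means = [statistics.fmean(values) for values in per_compound]
    # Population standard deviation (assumed; the source does not say which)
    stds = [statistics.pstdev(values) for values in per_compound]
    return means, stds


# Made-up outputs of three trees for two compounds
tree_predictions = [
    [6.0, 5.0],
    [6.2, 7.0],
    [5.8, 6.0],
]
means, stds = ensemble_mean_and_std(tree_predictions)
```

Here both compounds receive the same mean prediction, but the trees disagree far more on the second one, so it is assigned the larger error bar.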
The quality of uncertainty predictions, averaged across all 178 project activity endpoints, as a function of the most confidently predicted fraction of the dataset, is shown in Figure 8. All of the project activity endpoints are also in the same units (pIC50). Figure 8 again shows that the error bars produced by all three methods, on average, provide some useful information in identifying more accurate predictions. The interpretation of the plot is the same as that of Figure 7. However, the benefit from RF error bars is much smaller than that of the Alchemite uncertainty estimates. The decrease in RMSE correlates much more strongly with the error bars for both the virtual and imputation Alchemite models. In addition, the absolute RMSE is much higher, on average, for RF predictions. The Alchemite virtual model has a lower RMSE, and the imputation model has the best RMSE. An example of a similar analysis for an individual project activity endpoint is provided in the Supplementary Information in Figures S3 and S4.
FIGURE 6 The distribution of training values for Assay X. Most of the values are centered around 0% inhibition, that is, inactive compounds, and the width of that peak is likely to be noise
FIGURE 7 Graph illustrating the relationship between confidence and accuracy of prediction for the ChEMBL kinase dataset test set. The x-axis shows the most confidently-predicted fraction of the test data, that is, in moving from right to left only the most confidently predicted values are included. The y-axis shows the root mean square error (RMSE) of the fraction of predictions, aggregated over 159 ChEMBL kinase dataset endpoints. A lower RMSE value indicates more accurate predictions. Results for random forest models are shown in blue, the Alchemite imputation model in orange, and the Alchemite virtual model in gray. For ease of visualization, Gaussian smoothing has been applied to the accuracy calculated at each sampled fraction
FIGURE 8 Graph illustrating the relationship between confidence and accuracy of prediction for the project activities test set. The x-axis shows the most confidently-predicted fraction of the test data, that is, in moving from right to left only the most confidently predicted values are included. The y-axis shows the root mean square error (RMSE) of the fraction of predictions, aggregated over 178 project activity endpoints. A lower RMSE value indicates more accurate predictions. Results for random forest models are shown in blue, the Alchemite imputation model in orange, and the Alchemite virtual model in gray. For ease of visualization, Gaussian smoothing has been applied to the accuracy calculated at each sampled fraction
We can also explore the ability of Alchemite to focus on the most confident predicted values in the more heterogeneous ADMET endpoints. Unlike the project activity endpoints, the mixed units across the ADMET endpoints mean that the error analysis cannot be summarized in a single graph for the full dataset in analogy to Figures 7 and 8. We consider some illustrative examples of individual endpoints in Figures 9 and 10, with additional uncertainty quantification plots shown in Figures S5 and S6 of the Supplementary Information. Figure 9 reflects the performance of Alchemite permeability models (the logarithm of the basolateral to apical permeability in a cell line, Papp B to A) by comparing the most confident 50% of predictions with all predictions for both the imputation (left) and virtual (right) models. The most confident predictions are more closely clustered to the identity line, and the clear outliers have been dropped. Figure 10 plots the predictions for an unrelated pEC50 endpoint and shows that, while the predictions for this endpoint follow the observed values quite well, there is more scatter in the predictions, and some points have large uncertainty estimates. That is to say, predictions for this endpoint are accurate, but not precise. If more precision is required, we can again focus in on the most confident predictions. In this instance, we show the most confident 25% of predictions according to the Alchemite error bars, and these predictions are clustered around the identity line in a tighter grouping than the baseline model. We can see that the correlation between the most confident and accurate results is also strong for these models, even when the baseline R² is not high. This importantly allows the models to be useful even in situations where they would be otherwise discarded due to a poor correlation for the full test set.
The correlations of RMSE with the confidence in predictions data for these endpoints are shown in Figures S5 and S6 […] as $T_{\mathrm{hyp}} \approx 1.3 \times N_{\mathrm{fold}} \, N_{\mathrm{samples}} \, T_{\mathrm{base}}$, where $T_{\mathrm{base}}$ is the base training time for a dataset, $N_{\mathrm{samples}}$ is the number of hyperparameter optimization samples required for convergence (usually 20-50), and $N_{\mathrm{fold}}$ is the number of cross-validation folds. However, following an initial hyperparameter optimization, the model can be updated with new data in the time taken for training, unless there is a significant change in the overall structure of the dataset. Furthermore, the hyperparameter optimization process can be further parallelized over the cross-validation folds to reduce the overall time by a factor of $N_{\mathrm{fold}}$.
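As a worked example of the estimate above, the sketch below evaluates the hyperparameter optimization time for made-up inputs, with and without parallelization over the cross-validation folds. The function name `hyperopt_time`, the 2-hour base training time, and the choice of 5 folds are assumptions for illustration only; 20 samples is the lower end of the range quoted in the text.

```python
def hyperopt_time(t_base, n_samples, n_fold, parallel_over_folds=False):
    """Estimate T_hyp ~ 1.3 * N_fold * N_samples * T_base.

    The result is in the same time units as t_base.
    """
    t_hyp = 1.3 * n_fold * n_samples * t_base
    if parallel_over_folds:
        # Running the folds concurrently divides the wall time by N_fold
        t_hyp /= n_fold
    return t_hyp


# Hypothetical inputs: 2 h base training time, 20 samples, 5 folds
serial_hours = hyperopt_time(t_base=2.0, n_samples=20, n_fold=5)
parallel_hours = hyperopt_time(2.0, 20, 5, parallel_over_folds=True)
```

With these inputs the estimate is about 260 hours when run serially and about 52 hours when the five folds run in parallel.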

CONCLUSIONS
Some general conclusions can be drawn across all applications. Alchemite imputation models consistently outperform RF models, and generally outperform Alchemite virtual models. This highlights a benefit of deep learning imputation, which can learn directly from the relationships between experimental endpoints and gain valuable information, even from very limited experimental data, to more accurately fill in missing experimental values. In all applications, the Alchemite virtual model performed better than or equivalently to RF. The Alchemite algorithm is competitively fast when compared to other deep learning methods and was applied to a pharma-scale dataset at a reasonable computational cost.
We have demonstrated that the Alchemite uncertainty estimates correlate strongly with the accuracy of the corresponding predictions, unlike those derived from RF ensemble-based uncertainties. This result is particularly exciting because generating robust and objectively useful uncertainty estimates from neural networks remains a major challenge. 30 Valid uncertainty estimates are essential to the effective use of models; understanding where a result is likely to be sufficiently accurate enables high-quality compounds to be identified with confidence while avoiding missed opportunities by incorrectly discarding a potentially good compound due to an uncertain prediction. 15
In all applications, there were endpoints that could not be modeled by any method (ie, the rightmost points in Figures 1-5). Without heavy preprocessing, all large datasets will have such endpoints, especially on the repository-wide scale. We should not expect to be able to model all endpoints, particularly when the data are noisy or where few data points are available. However, it is notable that the inclusion of noisy and uncorrelated endpoints in the dataset did not have a detrimental effect on the performance of the Alchemite models for the majority of endpoints. This contrasts with other multitarget modeling approaches that benefit where there are strong correlations between endpoints, but suffer a detrimental effect from the introduction of uncorrelated endpoints into the dataset. 31
There are also some more specific conclusions we can draw for each of the three individual applications.

Project activities conclusions
For the project activity endpoints, all of the Alchemite models significantly outperform RF, showing the method is very effective on activity type endpoints. The results from the independent test were consistent with those from the internal validation; this is remarkable because the test set selected by Takeda Pharmaceuticals was temporally based, representing the most relevant and recent compounds in the corresponding project endpoints. Therefore, the results indicate the consistency and utility one could expect when deploying Alchemite models in real projects. As expected, the Alchemite imputation model slightly outperforms the virtual model because the former model has access to more information, in the form of sparse experimental data. This shows that the cross-correlations between experimental endpoints offer significant practical utility, and it is sensible to exploit this where possible.

HTS conclusions
The Alchemite imputation model outperforms the Alchemite virtual and RF models on HTS data, which represent some of the most challenging and noisy data. However, the prediction of the full screening collection for "Assay X" based on an initial pilot screen was not possible with any of the models. One approach to addressing this may be to apply a classification method, but this is beyond the scope of this study.