Deep imputation on large-scale drug discovery data

Irwin, BWJ; Whitehead, TM; Rowland, S; Mahmoud, SY; Conduit, GJ; Segall, MD

doi:10.1002/ail2.31

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/328785

Repository DOI

https://doi.org/10.17863/CAM.76232

Files

Published version (1.82 MB)

Type

Article

Authors

Irwin, BWJ

https://orcid.org/0000-0001-5102-7439

Whitehead, TM

Rowland, S

Mahmoud, SY

Conduit, GJ

Segall, MD

Abstract

More accurate predictions of the biological properties of chemical compounds would guide the selection and design of new compounds in drug discovery and help to address the enormous cost and low success-rate of pharmaceutical R&D. However, this domain presents a significant challenge for AI methods due to the sparsity of compound data and the noise inherent in results from biological experiments. In this paper, we demonstrate how data imputation using deep learning provides substantial improvements over quantitative structure-activity relationship (QSAR) machine learning models that are widely applied in drug discovery. We present the largest-to-date successful application of deep-learning imputation to datasets which are comparable in size to the corporate data repository of a pharmaceutical company (678 994 compounds by 1166 endpoints). We demonstrate this improvement for three areas of practical application linked to distinct use cases: (a) target activity data compiled from a range of drug discovery projects, (b) a high value and heterogeneous dataset covering complex absorption, distribution, metabolism, and elimination properties, and (c) high throughput screening data, testing the algorithm's limits on early stage noisy and very sparse data. Achieving median coefficients of determination, R², of 0.69, 0.36, and 0.43, respectively, across these applications, the deep learning imputation method offers an unambiguous improvement over random forest QSAR methods, which achieve median R² values of 0.28, 0.19, and 0.23, respectively. We also demonstrate that robust estimates of the uncertainties in the predicted values correlate strongly with the accuracies in prediction, enabling greater confidence in decision-making based on the imputed values.
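
The abstract summarises performance as a median coefficient of determination, R², across endpoints of a sparse compounds-by-endpoints matrix. The sketch below illustrates how such a summary could be computed; it is not the authors' code. It assumes that R² = 1 - SS_res / SS_tot is evaluated per endpoint over the measured entries only, and that the median is then taken across endpoints. The function names, the `min_points` threshold, and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import median


def r_squared(observed, predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def median_r_squared(observed_matrix, predicted_matrix, min_points=2):
    """Median per-endpoint R^2 over a sparse compounds-by-endpoints matrix.

    Rows are compounds, columns are endpoints. Missing measurements are
    represented as None in observed_matrix and are skipped (assumption).
    """
    n_endpoints = len(observed_matrix[0])
    scores = []
    for j in range(n_endpoints):
        # Keep only compounds with a measured value for endpoint j.
        pairs = [
            (obs_row[j], pred_row[j])
            for obs_row, pred_row in zip(observed_matrix, predicted_matrix)
            if obs_row[j] is not None
        ]
        if len(pairs) < min_points:
            continue
        obs, pred = zip(*pairs)
        if len(set(obs)) < 2:
            continue  # zero variance in the measurements: R^2 is undefined
        scores.append(r_squared(obs, pred))
    return median(scores)


# Toy data (illustrative only): 4 compounds by 3 endpoints, None = not measured.
observed = [
    [1.0, None, 5.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 7.0],
    [None, 8.0, 9.0],
]
predicted = [
    [1.0, 0.0, 7.0],
    [2.0, 5.0, 0.0],
    [3.0, 6.0, 7.0],
    [0.0, 7.0, 7.0],
]

summary = median_r_squared(observed, predicted)
```

Predictions at unmeasured positions do not affect the result, because only entries with an observed value enter each endpoint's R².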

Keywords

4613 Theory Of Computation, 3404 Medicinal and Biomolecular Chemistry, 34 Chemical Sciences, 33 Built Environment and Design, 46 Information and Computing Sciences, 3303 Design, Machine Learning and Artificial Intelligence, Networking and Information Technology R&D (NITRD), Bioengineering, Generic health relevance

Journal Title

Applied AI Letters

Journal ISSN

2689-5595

Volume Title

2

Publisher

Wiley

Publisher DOI

https://doi.org/10.1002/ail2.31

Rights

Attribution 4.0 International

Sponsorship

The Royal Society (uf130122)
Royal Society (URF\R\201002)

Optibrium Ltd, Intellegens Ltd, Takeda, Royal Society

Collections

University of Cambridge Research Outputs (Articles and Conferences)
