Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening.

Cortés-Ciriano, Isidro; Firth, Nicholas C; Bender, Andreas; Watson, Oliver

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/292498

Repository DOI

https://doi.org/10.17863/CAM.39658

Files

Accepted version (311.67 KB)

Type

Article

Authors

Cortés-Ciriano, Isidro

Firth, Nicholas C

Bender, Andreas

https://orcid.org/0000-0002-6683-7546

Watson, Oliver

Abstract

The versatility of similarity searching and quantitative structure-activity relationships to model the activity of compound sets within given bioactivity ranges (i.e., interpolation) is well established. However, their relative performance in the common scenario in early-stage drug discovery, where many inactive data points but no active ones are available (i.e., extrapolation from the low-activity to the high-activity range), has not yet been thoroughly examined. To this end, we have designed an iterative virtual screening strategy, which we evaluated on 25 diverse bioactivity data sets from ChEMBL. We benchmark the efficiency of random forest (RF), multiple linear regression, ridge regression, similarity searching, and random selection of compounds to identify a highly active molecule in the test set among a large number of low-potency compounds. We use the number of iterations required to find this active molecule to evaluate the performance of each experimental setup. We show that linear and ridge regression often outperform RF and similarity searching, reducing the number of iterations to find an active compound by a factor of 2 or more. Even simple regression methods seem better able to extrapolate to high-bioactivity ranges than RF, which only provides output values in the range covered by the training set. In addition, examination of the scaffold diversity in the data sets used shows that in some cases similarity searching and RF require twice as many iterations as random selection, depending on the chemical space covered in the initial training data. Lastly, we show using bioactivity data for COX-1 and COX-2 that our framework can be extended to multitarget drug discovery, where compounds are selected by concomitantly considering their activity against multiple targets. Overall, this study provides an approach for iterative screening to discover highly potent compounds when only inactive data are available in the early stages of drug discovery, and identifies the best experimental setup for doing so.
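The iterative screening loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic descriptors and activities, the size of the initial training set, the closed-form ridge fit, and the function names `ridge_fit` and `iterations_to_find_active` are all assumptions made for illustration. The sketch trains on the least active compounds, picks the pool compound with the highest predicted activity at each round, adds it to the training set, and counts the rounds needed to reach the most potent compound.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                        Xc.T @ (y - y_mean))
    return w, y_mean - x_mean @ w


def iterations_to_find_active(X, y, train_idx, pool_idx, alpha=1.0):
    """Count screening rounds until the most potent pool compound is picked."""
    train, pool = list(train_idx), list(pool_idx)
    target = max(pool, key=lambda i: y[i])  # most active compound in the pool
    for iteration in range(1, len(pool) + 1):
        w, b = ridge_fit(X[train], y[train], alpha)
        preds = X[pool] @ w + b
        pick = pool[int(np.argmax(preds))]  # top-ranked untested compound
        if pick == target:
            return iteration
        # "Test" the picked compound: reveal its activity and retrain.
        pool.remove(pick)
        train.append(pick)
    raise RuntimeError("target compound was never selected")


# Synthetic stand-in for a bioactivity data set (assumption, not ChEMBL data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))                      # hypothetical descriptors
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=300)  # pIC50-like values

order = np.argsort(y)
train_idx = order[:50]   # start from the 50 least active compounds only
pool_idx = order[50:]    # everything else is untested

n_iter = iterations_to_find_active(X, y, train_idx, pool_idx)
print(f"Most active compound found after {n_iter} iteration(s)")
```

A linear model is used here because its predictions are not bounded by the activity range of the training set, which is the property the abstract credits for the better extrapolation of regression methods compared with RF.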

Keywords

Algorithms, Drug Discovery, Drug Evaluation, Preclinical, Machine Learning, Quantitative Structure-Activity Relationship

Journal Title

Journal of Chemical Information and Modeling

Journal ISSN

1549-960X

Volume Title

58

Publisher

American Chemical Society

Publisher DOI

https://doi.org/10.1021/acs.jcim.8b00376

Sponsorship

European Research Council (336159)
European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (703543)

This project has received funding from the European Union’s Framework Programme for Research and Innovation Horizon 2020 (2014–2020) under the Marie Skłodowska-Curie Grant Agreement No. 703543 (I.C.-C.). A.B. thanks the European Research Council (Starting Grant ERC-2013-StG 336159 MIXTURE) for funding. N.C.F. is funded by EPSRC (EP/M006093/1).

Collections

Cambridge University Research Outputs