Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity

Allen, Chad HG; Mervin, Lewis H; Mahmoud, Samar Y; Bender, Andreas

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/293843

Repository DOI

https://doi.org/10.17863/CAM.40955

Files

Published version (3.35 MB)

Type

Article

Authors

Allen, Chad HG

Mervin, Lewis H

Mahmoud, Samar Y

Bender, Andreas

https://orcid.org/0000-0002-6683-7546

Abstract

Despite increasing knowledge in both the chemical and biological domains, the assimilation and exploration of heterogeneous datasets encoding the chemical, bioactivity and phenotypic properties of compounds remains a challenge, because the same chemicals must have been assayed across all of the spaces. Here, we have constructed a novel dataset, larger than we have used in prior work, comprising 579 acutely orally toxic compounds and 1,427 non-toxic compounds derived from regulatory GHS information, along with their corresponding molecular and protein target descriptors and qHTS in vitro assay readouts from the Tox21 project. We found no clear association between the results of a FAFDrugs4 toxicophore screen and the acute oral toxicity classifications for our compound set; a screen using a subset of the ToxAlerts toxicophores was also of limited utility, with only slight enrichment toward the toxic set (odds ratio of 1.48). We then investigated to what degree toxic and non-toxic compounds could be separated in each of the spaces, to compare their potential contribution to further analyses. Using an LDA projection, we found the largest degree of separation between toxicity classes using chemical descriptors (Cohen’s d of 1.95) and the lowest using qHTS descriptors (Cohen’s d of 0.67). To compare the predictivity of the feature spaces for the toxicity endpoint, we next trained Random Forest (RF) acute oral toxicity classifiers on molecular, protein target or qHTS descriptors. RFs trained on molecular and protein target descriptors were most predictive, with ROC AUC values of 0.80-0.92 and 0.70-0.85, respectively, across three test sets. RFs trained on chemical and protein target descriptors combined exhibited similar predictive performance to the single-domain models (ROC AUC of 0.80-0.91). Model interpretability was improved by the inclusion of protein target descriptors, which allow the identification of specific targets (e.g. Retinal dehydrogenase) with literature links to toxic modes of action (e.g. oxidative stress). The dataset compiled in this study has been made available for future application.
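
For readers unfamiliar with the two effect sizes quoted in the abstract, the following is a minimal sketch of how an odds ratio and Cohen's d are commonly computed. It is not the authors' code: the function names are hypothetical, the pooled-standard-deviation form of Cohen's d is an assumption (the abstract does not say which variant was used), and the numbers in the usage lines are made-up illustrations rather than data from the study.

```python
from math import sqrt
from statistics import mean, variance


def odds_ratio(toxic_hit, toxic_miss, nontoxic_hit, nontoxic_miss):
    """Odds ratio from a 2x2 table of alert matches vs. toxicity class.

    A value above 1 means the alert matches are enriched in the toxic set.
    """
    return (toxic_hit * nontoxic_miss) / (toxic_miss * nontoxic_hit)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (assumed variant).

    group_a and group_b would be, for example, the one-dimensional LDA
    projections of the toxic and non-toxic compounds.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Illustrative, made-up inputs (not taken from the paper).
example_or = odds_ratio(20, 80, 10, 90)        # (20 * 90) / (80 * 10) = 2.25
example_d = cohens_d([2, 4, 6], [1, 3, 5])     # (4 - 3) / sqrt(4) = 0.5
```

On this reading, the reported odds ratio of 1.48 indicates only modest enrichment of toxicophore matches among toxic compounds, while a Cohen's d of 1.95 means the class means in the projected space are almost two pooled standard deviations apart.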

Keywords

Computational toxicology, Heterogeneous data, Quantitative high-throughput screening, Target prediction

Journal Title

Journal of Cheminformatics

Journal ISSN

1758-2946

Volume Title

11

Publisher

BioMed Central

Publisher DOI

https://doi.org/10.1186/s13321-019-0356-5

Rights

Attribution 4.0 International

Sponsorship

CEFIC Long-range Research Initiative (CEFIC LRI Award 2012 to Andreas Bender)

Collections

Cambridge University Research Outputs