Computational Approaches to Predicting Drug Induced Toxicity

Richard Liam Marchese Robinson
King's College

This dissertation is submitted for the degree of Doctor of Philosophy

Abstract

Novel approaches and models for predicting drug induced toxicity in silico are presented. Typically, these were based on Quantitative Structure-Activity Relationships (QSAR). The following endpoints were modelled: mutagenicity, carcinogenicity, inhibition of the hERG ion channel and the associated arrhythmia - Torsades de Pointes. A consensus model was developed based on Derek for Windows™ and Toxtree and used to filter compounds as part of a collaborative effort resulting in the identification of potential starting points for anti-tuberculosis drugs. Based on the careful selection of data from the literature, binary classifiers were generated for the identification of potent hERG inhibitors. These were found to perform competitively with, or better than, those computational approaches previously presented in the literature. Some of these models were generated using Winnow, in conjunction with a novel proposal for encoding molecular structures as required by this algorithm. The Winnow models were found to perform comparably to models generated using the Support Vector Machine and Random Forest algorithms. These studies also emphasised the variability in results which may be obtained when applying the same approaches to different train/test combinations. Novel approaches to combining chemical information with Ultrafast Shape Recognition (USR) descriptors are introduced: Atom Type USR (ATUSR) and a combination between a proposed Atom Type Fingerprint (ATFP) and USR (USR-ATFP). These were applied to the task of predicting protein-ligand interactions - including the prediction of hERG inhibition.
Whilst, for some of the datasets considered, either ATUSR or USR-ATFP was found to perform marginally better than all other descriptor sets to which they were compared, most differences were statistically insignificant. Further work is warranted to determine the advantages which ATUSR and USR-ATFP might offer with respect to established descriptor sets. The first attempts to construct QSAR models for Torsades de Pointes using predicted cardiac ion channel inhibitory potencies as descriptors are presented, along with the first evaluation of experimentally determined inhibitory potencies as an alternative, or complement, to standard descriptors. No (clear) evidence was found that 'predicted' ('experimental') 'IC-descriptors' improve performance. However, their value may lie in the greater interpretability they could confer upon the models. Building upon the work presented in the preceding chapters, this thesis ends with specific proposals for future research directions.

Acknowledgements

First and foremost, many thanks go to my supervisors, Dr John Mitchell and Professor Robert Glen, for their valuable guidance, feedback and support throughout my doctoral studies. Thanks are also owed to many colleagues, both past and present, at the Unilever Centre for Molecular Science Informatics. In particular, I owe gratitude to the following individuals: Dr Florian Nigsch, for discussions regarding Winnow and initial guidance; Dr Pedro Ballester, for discussions regarding USR and 3D-QSAR; Dr Hamse Mussa, for his explanations of Machine Learning and statistics; and Rob Lowe, for invaluable programming advice and many enjoyable discussions inside and outside of work. Dr Ed Cannon is thanked for providing Python code which was used as a starting point for implementing the descriptors presented in Chapter 5. Dr Chris Marchese, Dr Susana Tomasio, Dr Johannes Kirchmair, Shardul Paricharak, Florian Roessler, Sonia Liggi and Matt Grayson are thanked for proofreading.
I thank the following researchers for helpfully discussing their work and/or providing data used in the work presented in this thesis: Drs Khac-Minh Thai, Elodie Dubus, Ismail Ijjali, Olivier Taboureau, Chris Swain, Chun Wei Yap, Andreas Bender and Munikumar Doddareddy. Valuable feedback was also received from scientists from the Chemical Computing Group, Ideaconsult, ChemAxon, Accelrys and Lhasa Limited. I am indebted to Susan Begg and Emma Graham for their administrative support, and to the computer officers at the Department of Chemistry for their technical support. My friends and family are also owed immense thanks for their support and encouragement, particularly in recent years. Unilever and the Engineering and Physical Sciences Research Council are thanked for funding this work.

Declaration

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated. This thesis does not exceed the specified word limit (60,000) as defined by the Degree Committee of the Faculty of Physics and Chemistry.

Publications

Sections of this thesis are based upon (contributions to) the following journal articles and communications.

Chapter 2
The sections on QSAR modelling methods, figures of merit, model validation and the applicability domain are based on this author's contributions to:
Gleeson, M. P.; Modi, S.; Bender, A.; Marchese Robinson, R. L.; Kirchmair, J.; Promkakaew, M.; Hannongbua, S.; Glen, R. C. Curr. Pharm. Des. 2012, 18, 1266–1291.

Chapter 3
An abridged description of this author's contribution to this collaboration is presented in:
Ballester, P. J.; Mangold, M.; Howard, N. I.; Marchese Robinson, R. L.; Abell, C.; Blumberger, J.; Mitchell, J. B. O. J. R. Soc. Interface, article in press.

Chapter 4
Marchese Robinson, R. L.; Glen, R. C.; Mitchell, J. B. O. Mol. Inf. 2011, 30, 443–458.
Marchese Robinson, R. L.; Glen, R. C.; Mitchell, J. B. O. J. Cheminf. 2012, 4(Suppl 1), O6.

Contents

Chapter 1 Introduction
1.1 Determination of Drug Induced Toxicity
1.2 The Value of Computational Approaches
1.3 Drug Induced Toxicity Endpoints Modelled in this Thesis
1.4 The Intention of this Thesis
Chapter 2 Computational Toxicology and Quantitative Structure-Activity Relationships
2.1 Defining Computational Toxicology
2.2 Historical Background
2.3 Approaches to Predicting Toxicology In Silico
2.4 Read Across
2.5 Expert Systems
2.6 Quantitative Structure-Activity Relationships
Chapter 3 Screening for Mutagenicity and Carcinogenicity in the Context of a Prospective Virtual Screen
3.1 Overview of Collaboration
3.2 Approach Developed for Toxicity Screening
3.3 Empirical Validation of Toxicity Models
3.4 Results Obtained from Application of Toxicity Models
3.5 Conclusions
Chapter 4 Development and Assessment of Binary Classifiers for Identifying Potent hERG Inhibitors
4.1 Introduction
4.2 Datasets
4.3 Model Development and Validation
4.4 Comparisons with Models Developed by Thai and Ecker and by Dubus et al.
4.5 Avoiding Feature Selection Bias: Calculating Overlap between Different Datasets
4.6 Results and Discussion
4.7 Conclusions
Chapter 5 Development of Novel 3D Descriptors
5.1 Introduction
5.2 Proposed Methodology
5.3 Evaluation of Methodology
5.4 Results and Discussion
5.5 Conclusions
Chapter 6 Predicting Drug Induced Torsades de Pointes Using Biological Descriptors
6.1 Introduction
6.2 Modelling Approaches Employed Here
6.3 Datasets
6.4 Summary of TdP Model Comparisons
6.5 Results and Discussion
6.6 Conclusions
Chapter 7 Conclusions and Future Work
7.1 Conclusions
7.2 Future Work
Bibliography
Appendix A. Supplementary Files
Appendix B. Performance of Toxicity Models Previously Reported in the Literature
Appendix C. Additional Computational Details

List of Figures

Figure 1.1 Overview of empirical (yellow) and possible in silico (red) approaches to toxicity assessment in a typical discovery testing scheme for an orally available pharmaceutical. Grey arrows indicate the requirement for empirical data to build most in silico models.

Figure 1.2 Overview of empirical (yellow) and possible in silico (red) approaches to toxicity assessment in drug development and post-marketing surveillance for a typical orally available pharmaceutical. Grey arrows indicate the requirement for empirical data to build most in silico models.

Figure 1.3 The standard model for the potential consequences of hERG inhibition. Here, 'B' denotes a hERG blocker. This image is based on that presented by Crumb and Cavero [59]. The induction of, and degeneration into fibrillation of, Torsades de Pointes is more completely discussed in section 1.3.3.

Figure 2.1 Two of the structural alerts for genotoxic carcinogenicity used in the “hybrid” expert system software program Toxtree. Here, R denotes any atom/group except OH or SH [37].

Figure 2.2 An overview of SVM classification. (A) A linearly separable dataset in the feature space, and a possible separating hyperplane. (B) The SVM solution for such a dataset: the maximum margin hyperplane. (C) A non-linearly separable dataset; highlighted are two misclassified instances and two further instances lying inside the margin, with their corresponding slack-variables. (D) A conceivable corresponding decision boundary (as shown in (C)) in the descriptor space (supposing the feature space is a higher dimensional projection of the descriptor space); only the two misclassified instances are highlighted. N.B.: These images are for illustrative purposes only.

Figure 3.1 Overview of the workflow undertaken to identify novel, experimentally verified inhibitors of type II DHQase. This author's contributions to the study are circled.

Figure 3.2 Results obtained from validating carcinogenicity predictions of all 112 models on the ISSCAN database. All equivocal carcinogens were considered non-carcinogens.

Figure 3.3 Results obtained from validating carcinogenicity predictions of all 112 models on the ISSCAN database. All equivocal carcinogens were considered carcinogens.

Figure 3.4 Results obtained from validating mutagenicity predictions of all 112 models on the ISSCAN database. All equivocal mutagens were considered non-mutagens.

Figure 3.5 Results obtained from validating mutagenicity predictions of all 112 models on the ISSCAN database. All equivocal mutagens were considered mutagens.

Figure 4.1 Two examples of topological substructures encoded as features, for an example molecule, by a generic circular fingerprint considering environments extending up to two bonds from the central atom.

Figure 4.2 The assignment of 'discretized descriptor features' corresponding to descriptor D.

Figure 4.3 The procedure employed to generate orthogonal sparse bigrams (OSBs) as additional features for a generic molecule and generic original set of features (monograms) using a window size of three.

Figure 4.4 A summary of the generation of the 94 feature sets – comprising fingerprint features, discretized descriptor features or both – evaluated in the current work.

Figure 4.5 Distribution of Int-Set (red) and ExtTest-Set (blue) within the plane defined by the first two principal components (PCA plot) for the P_VSA descriptor set (computed as per section 4.3.2.1). Principal components were calculated from the combined Int-Set and ExtTest-Set, using the prcomp() function in R [225], with scale=TRUE.

Figure 4.6 PCA plots generated as per Figure 4.5 (training set: red, ‘external’ test set: blue), for all splits of the Literature-368 dataset; from top left to bottom right: original split, random:1, random:2 and random:3.

Figure 4.7 PCA plots (generated as per Figure 4.5) for the Diverse Subset partitions of the Thai-313 (LHS) and Dubus-203 (RHS) datasets for which results are presented in Table 4.3.

Figure 4.8 Mean MCC values for externally validated (i.e. only ‘unbiased’ feature sets were considered, where relevant) selected Winnow models (generated using multiple training cycles, where relevant), compared to QuaSAR-Classify (QC) mean MCC values and Binary QSAR (BQ) MCC values when trained and tested on the same data.

Figure 4.9 Mean MCC values for externally validated Winnow models (as per Figure 4.8), and the corresponding (mean) MCC values for the externally validated SVM and RF models.

Figure 5.1 Computation of all three (A) USR and (B) supplementary ATUSR descriptors (for atoms typed as H bond donors) corresponding to the distances (dotted lines) between the molecular centroid (black sphere) and the relevant atoms. N.B.: The [NH+] donor (RHS of the molecular centroid; N = blue; H = white) is also typed as a cationic and ring atom. This figure was generated using VMD (version 1.9) [357,358].

Figure 5.2 Removal of inorganic structures from the Doddareddy dataset in Pipeline Pilot, prior to standardization. The numbers reflect the total number of compounds, in all SMILES files, subsequent to assignment of unique compound IDs, presented by Dr Andreas Bender. The 2,644 compounds referred to were presented in a subset of these files.

Figure 5.3 Standard workflow used to process ‘raw’ structures in all datasets modelled in the work presented in this chapter. N.B.: The structures obtained subsequent to each of the different pre-CORINA processing steps were also parsed via Pybel [177].

Figure 5.4 R² values obtained from 10-10CV (five RNG seeds for randomForest) on the hERG-196 dataset, with different descriptor sets calculated from structures obtained: (A) prior to Molecular Mechanics calculations, (B) from local minimisations, (C) from global minimisations, (D) from docking (ChemScore selections), (E) from docking (RF-Score selections). The black lines and circle centres denote the median and mean results respectively.

Figure 5.5 Corresponding MCC values (cf. Figure 5.4) obtained on the hERG-196:Subset dataset.

Figure 5.6 Mean R² values obtained, for the hERG-196 dataset, using the following descriptors, computed from different 3D structures: (A) ATUSR, (B) USR-ATFP, (C) USR+MACCS, (D) USR.

Figure 5.7 Mean MCC values obtained, for the hERG-196:Subset dataset, using the following descriptors, computed from different 3D structures: (A) ATUSR, (B) USR-ATFP, (C) USR+MACCS, (D) USR.

Figure 5.8 Images of the Imai et al. [305] docked structures for cisapride (A) and E-4031 (G) and corresponding aligned structures obtained via: STERGEN (B,H), local minimisation (C,I), global minimisation (D,J), ChemScore selection (E,K) and RF-Score selection (F,L). All molecular images were generated using VIDA (version 4.1.1) [378].

Figure 5.9 R² values obtained for: (A) hERG-196 dataset, (B) ThaiReg dataset. In both cases, the structures used for descriptor calculations were obtained from the standard workflow (see 5.3.4.1). The black lines and circles denote the median and mean results respectively.

Figure 5.10 MCC values obtained for: (A) hERG-196:Subset dataset, (B) ThaiReg:Subset dataset, (C) Doddareddy dataset and (D) Schattel dataset. In all cases, the structures used for descriptor calculations were obtained from the standard workflow (see 5.3.4.1). The black lines and circles denote the median and mean respectively.

Figure 5.11 P(a) versus number of structures in the active cluster subset when clustering the following datasets using predicted logD: (A) Doddareddy (Test), (B) Doddareddy (Train), (C) ThaiReg:Subset, (D) hERG-196:Subset, (E) Schattel. N.B.: The red (blue) points denote values obtained for the real (activity permuted) data; the green points denote a theoretical upper bound to p(a) - computed as the total number of actives in the dataset divided by the current number of structures in the active cluster subset [331].

Figure 5.12 Corresponding plots to those presented in Figure 5.11, based on clustering using USR.

Figure 6.1 Fragments identified in cathinone by Clark and co-workers that were likewise found (green) and not found (red) by this author's implementation of their bit-string descriptor.

Figure 6.2 SMARTS patterns matching a tertiary amine and the corresponding amide.

Figure 6.3 Overview of the procedures used to prepare the structures in the ion channel and TdP datasets from which InChIs and structural descriptors were computed.

Figure 6.4 MCC values (black line: median, circle: mean) obtained using predicted IC-descriptors, and the corresponding results obtained using structural-descriptors alone, across all cross-validation folds, repetitions and RNG seeds. Results obtained on the Yap-2004 (A,C,E) and Clark-2009 (B,D,F) datasets, using predicted IC-descriptors generated from all models (A,B), the putative most relevant set - i.e. KCNH2, KCNQ1, CACNA1C (C,D) and just the KCNH2 model (E,F).

Figure 6.5 MCC values (black line: median, circle: mean) obtained using experimental KCNH2 IC-descriptors, and the corresponding results obtained using structural-descriptors alone, across all cross-validation folds, repetitions and RNG seeds. Results obtained on subsets of the Yap-2004 (A,C) and Clark-2009 (B,D) datasets corresponding to all compounds for which KCNH2 experimental IC-descriptors were obtained (A,B) and subsequent to reducing the inconsistency in the experimental conditions used to obtain the underlying measurements (C,D).

List of Tables

Table 2.1 The confusion matrix for a binary classifier.

Table 3.1 Combinations of options considered for predicting compounds as mutagens/carcinogens based upon the output generated by Toxtree. If any one of the ‘ticked’ outcomes occurred, a positive prediction of mutagenicity/carcinogenicity was made. If these conditions were not met, compounds were deemed to be predicted non-mutagenic/non-carcinogenic by Toxtree.

Table 3.2 Combinations of options considered for predicting compounds as mutagens/carcinogens based upon the output generated by DfW. If any one of the ‘ticked’ outcomes occurred, a positive prediction of mutagenicity/carcinogenicity was made. If these conditions were not met, compounds were deemed to be predicted non-mutagenic/non-carcinogenic by DfW.

Table 3.3 Performance of the selected model.

Table 4.1 Numbers of hERG inhibitors and their distribution amongst the potency categories for the datasets modelled in this chapter.

Table 4.2 Performance of this author’s selected models and literature models on ‘external’ test sets for all (‘Int/Ext’) partitions of the Literature-368 dataset into training and ‘external’ test sets. MCC values (mean MCC values for Winnow, Random Forest (RF) and QuaSAR-Classify) are presented. Values in parentheses are the maximum MCC values obtained across the 50 runs of the QuaSAR-Classify module. Missing values for Winnow (multiple cycles) signify that a single cycle gave the best 5CV result.

Table 4.3 MCC values obtained on ‘external’ test sets for all partitions of the Thai-313 and Dubus-203 datasets used to evaluate this author’s modelling procedures. See Table 4.2 for presentational details.

Table 4.4 Some of the associations, learnt by the Winnow models, between 'important' ECFP_4 features and hERG blockade which were consistent with trends previously noted in the literature. N.B.: [symbol missing] denotes the arithmetic mean weight, across all scorers, assigned to the feature for class A etc. The median difference over all 50 training set orders is reported. All Dubus-203 training set compounds assigned feature 1430169877 were observed to possess a corresponding amide fragment. Occurrence ratios denote the fraction of compounds belonging to the class in question and containing the feature. All SMILES patterns were visualised using MarvinView (version 5.5.0.1) [298]; the 'A' symbols denote wildcard, i.e. undefined, heavy atom connections [185,308].

Table 4.5 Results obtained on the Thai-313 and Dubus-203 datasets in the current work compared to those reported by Thai and Ecker [69] and Dubus et al. [91]. The split of the Thai-313 dataset for which results are highlighted corresponds to the split with the highest MCC values obtained for both Binary QSAR models (i.e. the split used to evaluate this author’s models). The partitions of the Thai-313 dataset used to generate results in the current work (training set: 80 actives/160 inactives, test set: 20 actives/53 inactives) differed slightly from the single partition used by Thai and Ecker (training set: 81 actives/159 inactives, test set: 19 actives/54 inactives), as explained in section 4.4.5. All MCC values obtained by Dubus et al. and Thai and Ecker were estimated herein. Acc., Rec. and Prec. denote accuracy, recall and precision respectively.

Table 4.6 Chi-squared p-values corresponding to (mean, across 50 runs, for QuaSAR-Classify) test set MCC values obtained here for the partitions of the Literature-368, Dubus-203 and Thai-313 datasets used to assess this author’s models. These p-values were computed using the CHIDIST() function in Excel 2007 (32-bit), supposing one degree of freedom (see Chapter 2, section 2.6.4.1), save where a negative MCC was obtained – for which it was supposed that the model must effectively be a random predictor and the p-value was set to one.

Table 5.1 Descriptor sets compared in this work.

Table 5.2 RMSD values (computed from heavy atoms) and USR similarities (computed from all atoms) upon alignment of the structures obtained here to the docking poses presented by Imai et al. [305].
141 Table 5.3 Numbers of clusters obtained, upon clustering using predicted logD (or USR descriptors), computed prior to the application of CORINA (or from structures processed via the standard workflow) with a distance cut-off of 0.2 (see section 5.3.6). For both descriptor sets, the classification and regression datasets were ranked in order of decreasing number of clusters - supposed to correspond to decreasing diversity. ................................................... 145 Table 6.1 Total number of compounds used to generate ion channel models. ...................... 161 Table 6.2 Numbers of TdP dataset compounds with experimental IC-descriptors based on comparing names of compounds in TdP and ion channel datasets. ....................................... 165 Table 6.3 Cross-validated results obtained on the entirety of the ion channel datasets derived for the generation of predicted IC-descriptors. ...................................................................... 170 XIV Table 6.4 Performance of ion channel models on the TdP datasets; the performance of those models for which only a single TdP dataset compound with an experimental IC-descriptor was obtained is omitted. ......................................................................................................... 171 Table 6.5 Performance of KCNH2 model on TdP datasets, after removing problematic measurements and increasing the experimental consistency of the retained measurements. 171 XV Glossary AC = Discretization method using all midpoints between adjacent, ordered, training set instances, with different class labels, as split points (see Chapter 4) Acc. 
= Accuracy AD = Applicability domain ADME = Absorption, distribution, metabolism and excretion ADR = Adverse drug reaction AE = Adverse event AERS = Adverse Event Reporting System ANN(s) = Artificial Neural Network(s) APD(s) = Action potential duration(s) ArizonaCERT = Arizona Center for Education and Research on Therapeutics ASP = Astex Statistical Potential BQ = Binary QSAR CA = ChemAxon descriptor set (see Chapter 4) CART = Classification and Regression Trees algorithm CV = Cross-validation DfW = Derek for Windows TM Dubus-Rel = Descriptor set obtained via feature selection by Dubus and co-workers (see Chapter 4) ECVAM = European Centre for the Validation of Alternative Methods EPA = Environmental Protection Agency FDA = Food and Drug Administration FI = Fayyad and Irani's discretization method FOM(s) = Figure(s) of merit GLP = Good Laboratory Practice GSK = GlaxoSmithKline hERG = Human ether-à-go-go-related gene (or the corresponding ion channel) HTS = High throughput screening XVI IC-descriptors = Descriptors corresponding to (predicted) pIC50 values for cardiac ion channels as described in Chapter 6 ICH = International Conference on Harmonisation IKr = Rapid delayed rectifier current InChI = IUPAC International Chemical Identifier 'Int/Ext' = Dataset partition into a training and test set, where the test set was not directly used for model selection (see Chapter 4) kNN = k-Nearest Neighbours LDA = Linear Discriminant Analysis LOOCV = Leave-one out CV = Number of descriptors randomly sub-sampled at each node when training a Random Forest model MAE = Arithmetic mean absolute error MCC = Matthews Correlation Coefficient MCCV = Monte-Carlo CV MLR = Multiple Linear Regression n.o. = Number of = Number of trees in a Random Forest model NMEs = New molecular entities OOB = Out-of-bag OSB(s) = Orthogonal sparse bigram(s) PC(s) = Principal component(s) PCA = Principal components analysis PDB = Protein Data Bank PLS = Partial Least Squares Prec. 
= Precision
QC = QuaSAR-Classify
(Q)SAR = (Quantitative) Structure-Activity Relationship
Rec. = Recall
RF = Random Forest
RF-Score = Ballester and Mitchell's docking scoring function built using Random Forest
RMSE = Root Mean Square Error
RNG = Random number generator
SA(s) = Structural alert(s)
SDF = Structure-Data file
SMILES = Simplified Molecular Input Line Entry System
SRS = Spontaneous Reporting System
STERGEN = Stereoisomer generator, integrated into the software program CORINA
SVM = Support Vector Machine
SVR = Support Vector Regression
TdP = Torsades de Pointes
TdP+/- = Label assigned to a compound with/without the potential to induce Torsades de Pointes
Thai-Rel = Descriptor set obtained via feature selection by Thai and Ecker (see Chapter 4)
Type II DHQase = Type II dehydroquinase
UMC = Uppsala Monitoring Centre
USR = Ultrafast Shape Recognition
XO = Xenopus Oocytes

Chapter 1 Introduction

This thesis presents novel computational models and modelling approaches for predicting some of the most serious forms of pharmaceutical drug induced toxicity. Toxicology may be broadly defined as the "study of the adverse effects of drugs and chemicals on living systems". 1 Pharmaceutical drugs are chemicals introduced to the human body to induce a desired therapeutic effect. 2 Principally, albeit not exclusively, these are 'small' molecules, with a majority being administered orally. 3 A variety of types of adverse effect, or toxicity "endpoints", 4 may be induced by pharmaceuticals. 5–7 As of the year 2000, safety issues were one of the leading causes of drug attrition, accounting for more than 20% of failures. 8 Safety concerns also led to 26 drugs, deemed to be new molecular entities (NMEs) by the US regulatory body the Food and Drug Administration (FDA), 9 being withdrawn from the US market in the period 1980-2010.
10 As well as the resultant costs incurred by the pharmaceutical sector due to attrition, 8 and market withdrawal, failure to anticipate adverse drug reactions (ADRs) * prior to marketing approval may have lethal consequences for patients. 11–13 Hence, there is a clear need for, 14 and corresponding focus in industry on, 6,7,15 the early identification of the toxic liabilities of potential drug candidates. As is emphasised in this thesis, in silico methods can play an important role in meeting this need.

This chapter discusses the experimental approaches used to anticipate drug induced toxicity, along with the approaches used to identify drug induced toxicity in the patient population. This provides the context for explaining the importance of computational approaches, such as those presented in this thesis, for predicting these toxic effects. This chapter proceeds with an explanation of the importance of the types of toxicity which were modelled in the studies described here and briefly discusses the key issues associated with modelling these endpoints: mutagenicity, carcinogenicity, hERG inhibition and Torsades de Pointes. This chapter concludes with an overview of the work to be presented in the rest of this thesis.

* The World Health Organisation (WHO) defines ADRs as "any noxious, unintended, and undesired effect of a drug, which occurs at doses used in humans for prophylaxis, diagnosis, or therapy". The term "adverse drug event" may be considered to more broadly encompass any injury, including those resulting from overdose, induced upon administration of a drug.11 However, the term "adverse drug reaction" may be used in a looser sense - also encompassing overdose related injuries.6 In this thesis, the term 'adverse drug reaction' is used in this looser sense - i.e. interchangeably with 'adverse (drug) event'.
1.1 Determination of Drug Induced Toxicity

As schematically, and incompletely, illustrated in Figure 1.1 and Figure 1.2, a typical orally available drug is subject to various experimental toxicology, and - ultimately - human clinical safety, assessments prior * to product registration and launch. 7,16,17 To put these assessments into context, these figures also depict the typical testing stages associated with moving from a (virtual, i.e. existing only as computer records) compound library to hits, which show 'reasonable' activity against the therapeutic target, to leads, which have the potential to be optimised by medicinal chemists to yield acceptable drugs, and ultimately to a clinical candidate that is subject to human clinical trials. A fuller explanation of this process, and these terms, is provided in the referenced texts/articles. 7,17–19

Regulatory agencies require data from certain preclinical in vitro (cell based) and in vivo (in animals) experimental toxicity assays, which must conform to Good Laboratory Practice (GLP), to be submitted for a clinical candidate. 7,17 For example, the in vitro assays currently called for by regulatory agencies include bacterial mutagenicity assessments and an assessment of inhibition of the potassium ion channel encoded by the human ether-à-go-go-related gene (hERG) - endpoints discussed in more detail in section 1.3. 17 Prior to this, however, a variety of experimental toxicology assays are carried out during drug discovery. Early stage in vitro assays may be divided into "prospective" and "retrospective" assays. "Retrospective" assays, e.g. for phospholipidosis, are usually designed to identify target organ specific, "dose-limiting" toxicities (which would limit the therapeutic effect achievable by simply increasing the dose of a drug); these assays are usually carried out after the potential for such types of toxicity being relevant is indicated by short term in vivo assays. "Prospective" assays, e.g.
for hERG inhibition, are designed to identify "development-limiting" toxicities which could force compound discontinuation if identified. 7

* As previously noted, even these extensive testing regimes may fail to anticipate ADRs prior to product launch; these may only be identified during "post-marketing surveillance". This is, by definition, particularly likely for unpredictable "idiosyncratic" ADRs.7,10

1.2 The Value of Computational Approaches

However, it would be far more cost effective if experimental toxicity assessment of potential drug candidates could be reduced. In silico approaches to toxicity assessment, which generate toxicity predictions for untested, and possibly unsynthesised, compounds, have the potential to reduce the time and money invested in toxicity assessment in drug discovery. This requires that their predictions are sufficiently reliable. 15 One should note that the error associated with the predictions of in silico models cannot be reduced beyond that inherent in the experimental measurements/observations made for the modelled endpoint. Indeed, these errors pose challenges for both the empirical generation * and assessment of in silico models. 20 The specific errors associated with observations made for each of the endpoints modelled in subsequent chapters of this thesis are discussed in section 1.3.

Various uses for such in silico models can be envisaged:
1. The most high throughput, computationally inexpensive models, could be used to remove predicted potent toxicants from (virtual) screening libraries (prior to synthesis). 15,20
2. The models could be used to prioritise (synthesis of)21 a lead series.15,22
3. If the models are interpretable to medicinal chemists, they could be used to guide structural modifications, to remove a toxic liability. 15,20,21
4. Rather than acting as an alternative to experimental assays, for which it is commonly accepted the currently available models are typically insufficiently reliable, 20,22 they could be used to prioritise compounds for experimental testing. 15,20,22,23
5. They could be used to assist regulatory decision making.24

Gavaghan and co-workers at AstraZeneca suggested that their in-house hERG inhibition model was sufficiently predictive and interpretable to be used, to some extent, for the first four purposes outlined above. 15 Davenport and co-workers at Evotec recently reported the successful application, in the context of a specific discovery project, of in silico models for hERG inhibition for the second and third of these tasks - considerably reducing the typical levels of hERG inhibition measured for newly synthesised compounds. 21 Indeed Müller et al. suggest that, as of the last decade, "computational tools have supported the optimisation of compounds from a specific chemical class with a small amount of experimental hERG data".

* See Chapter 2.

The in silico prediction of mutagenicity has also been deemed a success story, with Hoffmann-La Roche Limited reporting a significant reduction in Ames positive (see section 1.3.1) test results subsequent to the introduction of an in silico pre-filter in the late 1990s. 6 Computational models are also increasingly being used to support regulatory decision making and guidelines specifying the requirements for models to be used in a regulatory environment have been published. 24,25

In contrast to predicting the outcomes of in vitro assays, generating models for the direct prediction of human ADRs is deemed particularly challenging - due to the possibility of many, complex mechanisms, the limitations of available data for human ADRs and variable individual susceptibilities to ADRs, leading to "idiosyncratic" toxicity. 7,14,26 However, over the last decade, work has been ongoing to build models based on data for human ADRs.
27–30 Indeed, Pharmatrope is currently making its models for predicting human ADRs directly from chemical structure, built from post-marketing surveillance data, available to its clients. 31 Additional work has been undertaken to relate ADRs to interactions with specific protein targets 30,32,33 as well as gene expression data 14 - such linkages perhaps enabling the future anticipation of human ADRs from assay profiles. The publicly available databases, derived from post-marketing surveillance, for human ADRs, and the potential limitations with such data, were discussed by Clark and Wiseman 28 and, more recently, by Nigsch et al. 14 These issues are returned to in more detail in section 1.3.3.

Figure 1.1 Overview of empirical (yellow) and possible in silico (red) approaches to toxicity assessment in a typical discovery testing scheme for an orally available pharmaceutical. Grey arrows indicate the requirement for empirical data to build most in silico models.

Figure 1.2 Overview of empirical (yellow) and possible in silico (red) approaches to toxicity assessment in drug development and post-marketing surveillance for a typical orally available pharmaceutical. Grey arrows indicate the requirement for empirical data to build most in silico models.

1.3 Drug Induced Toxicity Endpoints Modelled in this Thesis

1.3.1 Mutagenicity and Carcinogenicity

The potentially lethal effects of cancer, characterised by groups of abnormally proliferating cells, or tumours, 34 are well known, with more than 1.7 million cancer related deaths in Europe estimated in 2004. 35 Current understanding divides chemical carcinogens * into two categories: genotoxic and non-genotoxic (epigenetic). Genotoxic carcinogens cause direct damage to DNA, whilst non-genotoxic carcinogens act via a variety of alternative mechanisms.
37,38 Mutagenicity, a form of genotoxicity leading to transmissible DNA damage, 38 is typically a cause for concern within the pharmaceutical industry because it is viewed as a surrogate for carcinogenicity. 5,39 Not all mutagens, however, are carcinogens 38,40 and the importance of mutagenicity as a toxicity endpoint in its own right, with a possible association with birth defects, 41 has also been stressed. 36,39

* A carcinogen is "an agent which increases the rate of formation of tumors in a population".36

1.3.1.1 Significance of Mutagenicity and Carcinogenicity for the Pharmaceutical Industry

Compounds exhibiting positive results (i.e. indicated to be mutagens) in mutagenicity assays are rarely progressed to clinical trials. 16,42 Non-genotoxic carcinogens, however, are usually negative in these assays. 38 Whilst regulatory agencies have recently accepted shorter duration assessments in some cases, 43,44 the standard two-year rodent bioassay for carcinogenicity traditionally called for may actually entail a study lasting four to five years. 16,43–45 Since a positive result from rodent bioassays can cause the costly abandonment of a drug candidate in late stage clinical trials, there is a major need to identify carcinogens during discovery or early development. 16

1.3.1.2 Experimental Characterisation of Mutagenicity and Carcinogenicity

The in vitro Ames assay is used by almost every pharmaceutical company for short-term assessment of mutagenic (hence, carcinogenic) potential. 5 Originally developed by Ames, this assay uses a panel of mutant, histidine-requiring, strains of bacteria, located on a histidine deficient medium - typically in the presence of homogenised liver, to allow for mammalian metabolic activation. 5,20,46 Mutagenic compounds have the potential to induce mutations that restore histidine independence, allowing the formation of bacterial colonies.
Hence, a positive result is obtained if a compound induces significant colony growth in one of the strains. Different versions of the Ames test may be performed with different strains and/or without metabolic activation. 47 If compounds have not been assessed in the presence of all recommended strains 20 and/or in the absence of metabolic activation, the assay may fail to detect mutagenicity. A GLP Ames assay is required before a candidate can enter clinical trials. 7 The degree of interlaboratory concordance of the Ames assay has been reported as 80-85%, 38 or 70-87%. 47 Recent analysis by Sushko et al. suggested that the average per-compound accuracy (defined as the maximum number of positive/negative results, divided by the total number of decisive results for a compound) of this assay, primarily determined from repeated measurements in the same laboratory, could be 90-94%. 47 Recent studies have presented an alternative "Ames II" assay, with similar levels of predictivity for rodent carcinogenicity test outcomes, which requires lower quantities of a compound for testing. 48 Other genotoxicity tests, such as the in vitro micronucleus assay, may also be used to screen for carcinogenicity in the pharmaceutical industry. 5 Whilst an Ames positive result corresponds to rodent carcinogenicity 77-98% of the time, 40 46% of the 149 carcinogens (or equivocal carcinogens) studied by Zeiger, which were identified in long term rodent assays, were found to be non-mutagens in the Ames assay. 49 Indeed, a 2005 study concluded that information from short-term studies was still not sufficiently predictive of the long-term rodent carcinogenicity assays to replace them. 44 However, the ability of these traditional rodent bioassays to identify human carcinogens, which are traditionally identified via epidemiological studies, has been questioned. In addition to concerns regarding species extrapolation, Cohen also raised concerns regarding dose-extrapolation - i.e. 
traditional carcinogenicity studies are performed using doses which may considerably exceed the expected exposure levels for humans. 45,50

1.3.1.3 Availability of Data for Mutagenicity and Carcinogenicity

A variety of datasets for both mutagenicity and carcinogenicity have been made freely available in an electronic format as required for modelling (see Chapter 2, section 2.6.2) by the US Environmental Protection Agency (EPA) and other organisations. 6,51 These include the ISSCAN database, with carcinogenicity and mutagenicity data from long-term rodent studies and the Ames assay respectively 51 and the IRISTR database, with carcinogenicity assessments derived in accordance with the EPA's cancer risk assessment guidelines - including, where available, data from human epidemiological studies. 52–54 The ISSCAN (version 3a) and IRISTR (version 1b) databases comprised 1,153 and 544 compounds respectively at the time of writing, albeit the same carcinogenicity/mutagenicity assessments had not been made for all compounds therein. Hansen et al. recently presented a freely available Ames mutagenicity dataset comprising 6,512 compounds. 55 To the best of this author's knowledge, this was the largest freely available mutagenicity dataset at the time of writing. Mutagenicity and carcinogenicity datasets of 8,412 and 1,634 compounds respectively were, at the time of writing, commercially available from LeadScope. 56 Additional, proprietary, data is held by the pharmaceutical industry. 57

1.3.2 hERG Inhibition

Drug inhibition of the hERG potassium ion channel, which carries the rapid delayed rectifier current (IKr) in cardiac myocytes, 58,59 i.e. heart muscle cells, 60 is of significance due to its potential relationship to Torsades de Pointes (TdP). TdP, a potentially fatal cardiac arrhythmia, 58* is ultimately the toxicity endpoint of concern, with hERG inhibition considered as a typical surrogate in the pharmaceutical industry.
6,15 Further consideration is given to TdP, as an endpoint in its own right, in section 1.3.3. In the standard model, drug blockade of the hERG ion channel leads to prolongation of the cardiac action potential, † which can lead to prolongation of the QT interval, ‡ followed by TdP (Figure 1.3).59,61 However, levels of hERG inhibition resulting in considerable QT prolongation for some drugs may not lead to observable prolongation in others. 62 Furthermore, some compounds exhibiting modest QT prolongation have been significantly associated with TdP, whilst amiodarone induces larger increases in QT interval, yet is associated with negligible incidence of TdP. 63 Indeed, it is not unknown for TdP to occur in the presence of QT shortening. 64

* An abnormal heart rhythm.60
† The "pattern of electrical activity associated with excitable heart cells".59
‡ A period on the electrocardiogram (ECG), which measures electrical activity across the heart as a whole.59

Figure 1.3 The standard model for the potential consequences of hERG inhibition. Here, 'B' denotes a hERG blocker. This image is based on that presented by Crumb and Cavero.59 The induction of, and degeneration into fibrillation of, Torsades de Pointes is more completely discussed in section 1.3.3.

1.3.2.1 Significance of hERG Inhibition for the Pharmaceutical Industry

In spite of the simplistic nature of the 'standard model', virtually all drugs known to induce QT prolongation inhibit hERG 63 and hERG inhibition is the most common mechanism for drug prolongation of the QT interval. 65 Moreover, the link between drug induced QT prolongation and the risk of TdP was of sufficient concern to regulators by the year 2000 that a working committee of the International Conference on Harmonisation (ICH) was charged with drafting international guidance for regulators 66 that ultimately called for in vitro assessment of IKr inhibition for clinical candidates.
65 Even prior to the finalisation of these guidelines in 2005, pharmaceutical companies were being requested to provide hERG inhibition data to regulators and experimental assessment of hERG inhibition had become a common part of the drug discovery process. 15,66 A correlation between hERG inhibition and QT prolongation/Torsadogenic potential (the potential of a compound to induce TdP) has been determined in various studies. Redfern et al. suggested that a ratio of more than 30:1 between hERG IC50 * and effective therapeutic plasma concentration could be taken as indicative of a lack of TdP causing potential. 68 Similarly, Yao et al. proposed a safety margin of 300:1 (hERG IC50 : maximum unbound plasma concentration) to separate drugs causing TdP and/or QT interval prolongation from those inducing neither cardiac disorder. 62

Heuristics have been developed in the pharmaceutical industry directly relating hERG IC50 values to their typical toxicological significance. For example, the experience of Yao and co-workers within GlaxoSmithKline (GSK) was that compounds with IC50 < 1 μM typically prolonged the QT interval in in vivo assays, whereas those with IC50 > 10 μM typically did not and compounds lying above their proposed safety margin usually had IC50 values above 10 μM. Hence, in early drug development, compounds with hERG IC50 < 1 μM were typically not progressed, whereas those with hERG IC50 > 10 μM were typically progressed without in vivo assessment of their potential to prolong the QT interval. 62 Similar heuristics are accepted elsewhere in the industry: compounds with IC50 > 10 μM are commonly considered safe, whilst the development of potent inhibitors with IC50 < 1 μM 15 is commonly discontinued. 69

1.3.2.2 Experimental Characterisation of hERG Inhibition

A variety of experimental assays are used to assess hERG inhibition.67 These may be grouped into binding and functional assays.
Binding assays estimate the affinity with which a putative hERG blocker binds to the channel, yet do not directly assess a compound's ability to inhibit hERG current.67 Functional assays estimate the extent to which compounds block the hERG channel. 70

* The half-maximal inhibitory concentration,67 as discussed in section 1.3.2.2.

Binding affinity may be quantified by Kd, defined in Equation 1.1. 71

$$K_d = \frac{[L][R]}{[L\text{-}R]} \tag{1.1}$$

In Equation 1.1, [L], [R] and [L-R] represent the equilibrium concentrations of unbound ligand, unbound receptor and bound receptor respectively. In the present context, the ligand and receptor correspond to a putative hERG inhibitor and hERG channel respectively. Whilst Kd values can be determined directly via saturation-binding experiments, 71 they are typically estimated from competitive binding assays. 67,70,72,73 In these assays, the ability of the putative hERG blocker to displace a previously assessed, labelled hERG binder - typically, a potent hERG blocker - is determined. The "IC50" obtained from these experiments corresponds to the unbound concentration of putative blocker at which 50% of the labelled binder has been displaced; this "IC50" is dependent upon the amount of labelled binder used in the assay and, hence, is not an absolute measure of binding affinity. An estimate (Ki) for the Kd of the putative blocker can be obtained using the Cheng-Prusoff equation (Equation 1.2). 71,72,74,75

$$K_i = \frac{IC_{50}}{1 + \dfrac{[L]}{K_{d,L}}} \tag{1.2}$$

In Equation 1.2, [L] and Kd,L respectively denote the equilibrium unbound concentration of the labelled binder, corresponding to 50% of it having been displaced, and the Kd for the labelled binder.

In principle, binding assays suffer from fundamental limitations. They are unable to distinguish between binding which inhibits or activates the hERG channel and may fail to detect blockers which bind in an alternative binding site to the labelled binder.67 Other limitations may arise from failing to assay hERG channels in whole cells.
73 Nonetheless, good experimental correlation has been obtained, in a number of studies, between the Ki values obtained from common competitive binding assays and the IC50 values obtained from electrophysiological (see below) functional assays. However, inhibitory potency may be underestimated to varying degrees for different compounds. 70,72,73 High throughput functional assays, such as the rubidium efflux and membrane potential sensitive dye based assays, which indirectly estimate the effect of a compound on hERG current, may also be used for initial screening purposes. 67,70 However, none of these assays is deemed to be a substitute for electrophysiological assays - which directly measure current flowing through ion channels at specified transmembrane potentials. 67,70,76

hERG channels transition between different states, commonly denoted closed, inactivated and open, in response to changes in transmembrane potential - with current flow only possible in the open state. 58,67,76 In principle, electrophysiological assays allow for the assessment of state dependent block. 76 Whilst electrophysiological assays were traditionally low throughput techniques, work on higher throughput, automated electrophysiology assays was initiated in the late 1990s. 77 In the last decade, these became widely available 77 and work was undertaken to minimise the discrepancies between these assays and conventional electrophysiology. 78,79

Whilst IC50 values may be estimated from the percentage inhibition at a single concentration, more reliable estimates are obtained from a concentration-response curve. 80,81 Typically, inhibition measurements are fitted to a version of the Hill equation - which commonly, 62,82,83 though not always, 80 can be expressed in the following form (Equation 1.3).
$$Y = \frac{1}{1 + \left(\dfrac{IC_{50}}{[B]}\right)^{n_H}} \tag{1.3}$$

In Equation 1.3, Y corresponds to the fractional block of hERG current, [B] corresponds to the unbound concentration of blocker B for which the IC50 is determined, and $n_H$ corresponds to the Hill coefficient which is, usually, 84 also estimated from fitting. Y may be estimated as $(I_C - I_B)/I_C$, where $I_B$ denotes the estimated hERG current in the presence of a given concentration of blocker B and $I_C$ corresponds to the estimated hERG current in the absence of B. 83 This expression supposes that 100% current inhibition is achievable in the limit that [B] tends to infinity, such that the IC50 corresponds to the concentration required for 50% channel blockade. 83 Minor deviations from 100% maximum inhibition may be taken into account via minor modification of the Hill equation. 80 In some studies, however, maximum block of less than 50% has been suggested, requiring alternate IC50 estimations. 85,86 Commonly, hERG inhibitory potency is expressed in terms of the pIC50, i.e. $-\log_{10}$ of the IC50 expressed in mol litre$^{-1}$, calculated from the IC50 obtained in electrophysiological assays. 15,78

Reliable estimates of endogenous IKr inhibition are technically challenging. 87 In the pharmaceutical industry, 62,78 it is common practice to heterologously express hERG, though not potential auxiliary subunits or other modulating proteins, 87 in non-cardiac mammalian expression systems. 67,87 However, inhibitory potencies obtained from cells natively expressing IKr may differ appreciably from those obtained in heterologous expression systems and potencies may vary between different heterologous expression systems. 67,82,87 Notably, inhibitory potencies obtained from heterologous expression in Xenopus Oocytes (XO) are typically lower than those obtained in mammalian systems. 67,87,88 A variety of additional factors - including temperature, voltage protocol etc. - can considerably affect the IC50 value determined from electrophysiological measurements.
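As a minimal sketch of how Equations 1.1 to 1.3 and the pIC50 conversion relate to one another, the following Python functions assume the standard textbook forms of the dissociation constant, the Cheng-Prusoff correction and the Hill equation (with 100% maximum block). The function and parameter names are illustrative only and are not taken from any of the assays or software cited in this chapter.

```python
import math


def dissociation_constant(free_ligand, free_receptor, bound_complex):
    """Equation 1.1 (assumed form): Kd = [L][R] / [L-R]."""
    return free_ligand * free_receptor / bound_complex


def cheng_prusoff_ki(ic50, labelled_conc, kd_labelled):
    """Equation 1.2 (assumed form): Ki = IC50 / (1 + [L] / Kd,L).

    ic50 is the displacement "IC50" from a competitive binding assay;
    labelled_conc and kd_labelled refer to the labelled binder.
    All three arguments must share the same concentration units.
    """
    return ic50 / (1.0 + labelled_conc / kd_labelled)


def hill_fractional_block(blocker_conc, ic50, hill_coefficient=1.0):
    """Equation 1.3 (assumed form): Y = 1 / (1 + (IC50 / [B])^n)."""
    return 1.0 / (1.0 + (ic50 / blocker_conc) ** hill_coefficient)


def ic50_from_single_point(blocker_conc, fractional_block, hill_coefficient=1.0):
    """Invert the Hill form above to estimate IC50 from one measurement.

    The Hill coefficient has to be assumed here, which is one reason a
    full concentration-response fit is generally preferred.
    """
    if not 0.0 < fractional_block < 1.0:
        raise ValueError("fractional_block must lie strictly between 0 and 1")
    ratio = (1.0 - fractional_block) / fractional_block
    return blocker_conc * ratio ** (1.0 / hill_coefficient)


def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 expressed in mol per litre."""
    return -math.log10(ic50_molar)
```

Under these assumptions, a blocker applied at a concentration equal to its IC50 gives a fractional block of 0.5, and an IC50 of 1 μM (1e-6 mol per litre) corresponds to a pIC50 of 6, i.e. the lower of the two potency thresholds discussed in section 1.3.2.1.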
However, variations in these factors do not necessarily lead to statistically significant differences in estimated IC50 values. 82 Nonetheless, the challenges posed to empirical model generation due to the variability in literature derived pIC50 estimates - typically obtained under different conditions 67 - have been widely remarked upon. 89–93 Above and beyond these systematic variations (for a given compound), replicate electrophysiological pIC50 measurements have been reported to have an error of 0.1-0.5 log units. 15

1.3.2.3 Availability of Data for hERG Inhibition

The pharmaceutical industry has access to databases comprising thousands of compounds screened for hERG inhibition in the same assay and, in the last decade, the generation and validation of in silico models on these databases has been reported. * To date, most publicly available datasets are considerably smaller. At the time of writing, data for 1,960 compounds screened in the same membrane potential dye based assay were freely available from PubChem. 95 Larger datasets/databases comprising measurements obtained using a variety of assays/experimental protocols, are also freely available - including, at the time of writing, literature curated hERG activity data for 5,972 compounds from the ChEMBL database. 96 Doddareddy et al. recently presented a dataset of 2,644 compounds, which they claimed to be the largest publicly available dataset at that time. 93

* Modelling studies based on nearly 9,000 and 60,000 compounds, assessed in a single in-house assay, were respectively presented in 2007 by Gavaghan et al. (AstraZeneca)15 and in 2005 by O'Brien and de Groot (Pfizer).94

A recent publication 97 noted a variety of free access online repositories containing hERG/IKr inhibition data, curated from the literature for 100-600 compounds. 98–101 Literature curated datasets might also be purchased from commercial providers, such as Aureus Sciences (formerly Aureus Pharma) 102 and Sunset Molecular.
103 It should be re-emphasised, however, that not all the measurements in these literature derived datasets/databases will have been derived from the same cell type or assays etc. Modelling studies presented in the recent literature, including work presented in Chapter 4 of this thesis, have made available datasets incorporating electrophysiological measurements for a few hundred compounds. 92,104 A recent publication presented HERGCentral - a freely available repository of screening results obtained in a high-throughput electrophysiological assay for more than 300,000 compounds. 97,105

1.3.3 Torsades de Pointes

Torsades de Pointes (TdP) is a rare ventricular tachycardia, * which may degenerate into ventricular fibrillation † - which, left untreated, 60 leads to sudden death. 87 Understanding of the mechanisms underlying drug induced TdP is still incomplete. 106,107 Nonetheless, the induction of TdP is believed to be related to an increase in the heterogeneity of ventricular repolarisation. This process, and the underlying molecular interactions, are very briefly discussed below. The interested reader is referred to the reviews by Fenichel et al., 87 Gupta et al. 108 and Varró and Baczkó. 106

Repolarising currents, of which IKr carried by hERG is one, carry cations to the cellular exterior - restoring resting transmembrane potential. The converse is true for depolarising currents, carried by different cardiac ion channels. The cycle of depolarisation and repolarisation constitutes the action potential, 58 and a reduction in repolarising currents lengthens the action potential duration; 59 prolongation of ventricular action potential durations (APDs) is reflected by elongation of the QT interval (see above).
87 The differential distribution of cardiac ion channels amongst different cell types leads to slight differences in APDs across normal ventricular tissue. Drug induced reduction in ventricular repolarisation can exacerbate differences in APDs which may, albeit not necessarily, 109 promote heterogeneity of repolarisation - creating the conditions for TdP. 87

* An unusually rapid heart rhythm originating in one of the ventricles (the lower chambers of the heart).60
† Rapid, chaotic beating of the ventricles which prevents the heart from effectively pumping blood to the body.60

The potential for a drug to reduce ventricular repolarisation, and increase heterogeneity of repolarisation, depends, in principle, upon its ability to perturb the currents (both depolarising and repolarising) carried by multiple cardiac ion channels. 87,106 As well as direct inhibition of cardiac ion channels, indirect perturbation of ion channel currents may occur via other mechanisms, which may entail interactions with other, regulatory, proteins, such as the inhibition of hERG trafficking to the cell membrane. 61 There is clearly a need to also take pharmacokinetic and physicochemical factors, such as might affect a compound's cellular/tissue distribution and plasma concentrations, into account when considering a compound's potential to induce TdP 61,62 - i.e. its Torsadogenic potential.

In addition to considering the intrinsic Torsadogenic potential of a drug, it is also important to consider the potential for the risk of TdP induction to be increased due to drug-drug interactions and other risk factors, e.g. female gender. 12,108 Some analyses suggest that incidents of drug induced TdP almost never occur with non-cardiac medications in the absence of such risk factors.
115 These additional factors complicate both the prescription, and hence marketing, of potentially Torsadogenic drugs (section 1.3.3.1) 87 and the assessment of Torsadogenic potential (section 1.3.3.2).

1.3.3.1 Significance of Torsades de Pointes for the Pharmaceutical Industry

Given that the significance of hERG inhibition ultimately stems from concerns regarding the Torsadogenic liability of potential drug candidates, the significance of TdP for the pharmaceutical industry is largely reflected in the significance of hERG inhibition (see section 1.3.2). The potentially lethal effects of TdP make it highly important to identify the Torsadogenic potential of drug candidates prior to market approval. Whilst only around 1 in 120,000 patients prescribed the antireflux 110 drug cisapride suffered subsequent incidents of TdP, or other arrhythmias, between 1993 and 1995, 12 the drug was (inconclusively) linked to around 80 deaths in Canada and the US. 110 Cisapride was eventually withdrawn from the market in the year 2000 10,110 due to its association with TdP. 10,111 However, drug induced TdP is less likely to result in market withdrawal for drugs used to treat life threatening conditions. 58 For example, the antiarrhythmic quinidine, with a higher incidence of drug induced TdP than cisapride, 58 remained on the market as of 2010. 112 Even if a drug is not withdrawn from the market, it may be assigned a warning label - with a "boxed warning", or "black box warning", designed to indicate risk of the most serious, e.g. life-threatening, ADRs. 113 Such warnings may negatively impact the revenue generated by a drug, albeit an analysis of the effect of black-box warnings issued for cisapride prior to its withdrawal suggested they had no significant effect on prescriptions.
114 The rarity of TdP induction by non-antiarrhythmic drugs, typically with < 1 case per 10,000-100,000 exposures, makes it virtually impossible to detect in clinical trials. Hence, there is a particularly pressing need for methods - either experimental or in silico - which could identify TdP inducing drugs prior to market launch. 87

1.3.3.2 Approaches to Defining the Torsadogenic Potential of Drugs

Torsadogenic potential is commonly assessed preclinically by considering surrogate indicators which can be measured in in vitro or in vivo assays. 62,115 Indeed, various pre-clinical assays of both kinds are proposed by regulatory guidelines - including single cell assessments of IKr inhibition (see section 1.3.2), changes in action potential duration in multicellular test systems and QT interval alterations in laboratory animals. 65 In light of the limitations with the 'standard model' outlined in section 1.3.2, assessment of IKr inhibition (as indicated above, the effects of this may be offset or enhanced by inhibition/activation of other cardiac ion channels etc.) and QT interval prolongation cannot be considered as direct indicators of Torsadogenic potential. Regarding in vivo assessments of QT interval prolongation, various other ECG parameters have been suggested as potentially more appropriate indicators. 87,108 Moreover, measurements of the QT interval are technically difficult. Indeed, what is actually measured are changes in the QT interval corrected for various co-factors (such as heart rate); appropriate correction is considered challenging - particularly in animal models. 87 Given the limitations of these surrogate measurements, work has been undertaken to directly assess the Torsadogenic potential of drugs in animals treated to increase their predisposition to TdP. The development of arrhythmia has also been studied in isolated whole animal hearts.
Measurements of the electrical activity across an isolated section of an animal heart (the "wedge preparation") may also be able to detect pseudo-ECG patterns characteristic of TdP and could be used to determine Torsadogenic potential. 87,116 Very recently, a human ventricular slice preparation was proposed for experimental assessment of arrhythmic risk. 117 Ideally, Torsadogenic potential would be assessed based upon human data. However, using drug induced TdP as an end point in prospective studies is not always feasible for ethical reasons. 118 In clinical trials, regulatory guidelines emphasise the importance of detecting the risk of TdP induction via assessing prolongation of the QT interval. 119 However, given how rare drug induced TdP is, direct assessment of the risk of TdP induction in humans is not expected to be feasible until after clinical trials. 87,115,119 Assessment of risk from post-marketing surveillance is hugely complicated. Firstly, a considerable number of possible confounding factors exist. These include drug-drug/drug-food interactions, variable individual/demographic susceptibilities (and biases towards different patient demographics), over-reporting (e.g. in response to elevated public concern regarding a drug's safety), under-reporting (e.g. for supposedly obvious toxicities) as well as variable tendencies of physicians to report adverse events. 14,28,33 An additional confounding factor is inconsistent drug usage (e.g. overdoses). Secondly, a reported case of an adverse event being caused by a drug might be based on a physician's impression, 28 and hence be somewhat subjective - or based upon questionable evidence. 118 Other potential problems include the use of multiple terms to categorise the same ADR, 14 multiple names being used for the same drug, the submission of multiple reports for the same incident and the increased chance of observing adverse events with increasing usage of the drug (e.g. as time spent on the market increases).
28 Efforts have been made by the FDA to ensure that standardized terms are used for adverse event reports compiled in its Adverse Event Reporting System (AERS), 14,28 and that resources are available for normalisation of drug names. 28 Given the potential for chance identification of a drug as the cause of an incident of TdP, 28,118 various researchers have proposed only considering a drug to cause TdP based upon statistical tests for association. For example, it has been supposed that reports of TdP induction were due to chance co-occurrence, with this null-hypothesis not being rejected unless the number of reports was statistically significantly higher than the number of reports expected from chance. 28,33 For a given adverse event (E), and a given drug (D), the number of reports expected by chance (R') could be calculated as follows (Equation 1.4).

$$R' = N(D)\,P(E) \qquad (1.4)$$

In Equation 1.4, N(D) is the number of pills/injections of D administered, for all patients, and P(E) is the estimated probability of E co-occurring with any drug (Equation 1.5).

$$P(E) = \frac{\sum_{D'} M(D')}{\sum_{D'} N(D')} \qquad (1.5)$$

In Equation 1.5, M(D') is the number of times E co-occurred with the drug D'. N(D') could be estimated from prescription numbers or shipping volumes. 33,120 However, when using databases of adverse event reports, a more readily available estimate 33 would be the total number of reports involving D' in the database - as recently used to estimate R' for cardiotoxic adverse events by Matthews and Frid 33 as well as Clark and Wiseman. 28 This is not without its limitations. For example, a disproportionately large number of reports for a particular drug or adverse event would artificially inflate R', requiring a larger value of M(D') for a statistically significant association between D' and E to be identified. 33 In addition to statistical analysis of adverse event databases, different types of evidence might be obtained from a variety of sources and used by experts to assess a drug's ability to induce TdP in humans.
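The chance-expectation calculation in Equations 1.4 and 1.5 can be illustrated with a minimal Python sketch. The drug names, report counts and function names below are hypothetical and chosen purely for illustration; N(D') is approximated by the total number of database reports involving each drug, as described above. The Poisson tail probability used to compare observed and expected counts is an assumption of this sketch, and is not necessarily the statistical test used in the cited studies.

```python
import math

def event_probability(event_counts, total_counts):
    """Equation 1.5: P(E) = sum of M(D') over all drugs / sum of N(D') over all drugs."""
    return sum(event_counts.values()) / sum(total_counts.values())

def expected_reports(drug, event_counts, total_counts):
    """Equation 1.4: R' = N(D) * P(E)."""
    return total_counts[drug] * event_probability(event_counts, total_counts)

def poisson_upper_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Assumed test for this sketch only; the source does not specify the test.
    """
    below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below

# Hypothetical counts: M(D') = reports of event E involving D',
# N(D') = all reports involving D' (used here as a proxy for exposure).
event_counts = {"drug_a": 12, "drug_b": 1, "drug_c": 2}
total_counts = {"drug_a": 1000, "drug_b": 4000, "drug_c": 5000}

r_expected = expected_reports("drug_a", event_counts, total_counts)
p_value = poisson_upper_tail(event_counts["drug_a"], r_expected)
print(f"R' = {r_expected:.2f}, p = {p_value:.2e}")
```

In this toy example, the number of reports observed for the hypothetical drug_a is well above the number expected by chance, so the null hypothesis of chance co-occurrence would be rejected at conventional significance levels.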
This expert-based approach is employed by the Arizona Center for Education and Research on Therapeutics (ArizonaCERT). 121 Assessments based upon human data will usually consider a drug to pose no risk of inducing TdP unless evidence of (the potential for) drug induced TdP is available. 27,28 In some cases, an assessment of Torsadogenic potential might be based upon evidence for drugs within the same therapeutic class. 118 Given the low incidence of TdP with TdP causing drugs (e.g. cisapride, see above), this could result in a number of 'false negative' assignments. It should be noted that, whilst it is common 27–29,122–128 to conceive of Torsadogenic potential in terms of TdP causing and non-TdP causing agents, this is somewhat of a simplification: some TdP causing drugs might have a greater potential to cause TdP than others. Clark and Wiseman indicated that they tried, unsuccessfully, to predict a continuous indicator of Torsadogenic potential quantifying the statistical association between a compound and reported incidents of TdP induction. 28

1.3.3.3 Availability of Data on the Torsadogenic Potential of Drugs

In addition to isolated case reports/clinical incidents reported in the literature, 12 publications by De Ponti et al., 118 Redfern et al. 68 as well as Fermini and Fossa, 129 presented lists of drugs categorised according to their Torsadogenic potential. ArizonaCERT makes its lists of (potentially) TdP causing drugs, grouped into different categories of Torsadogenic risk, freely available online. 130 Commercially available resources which present an assessment of drug safety based upon expert review of the literature, and have been used to identify TdP causing agents, 27,33 include Micromedex 131 and Meyler's Side Effects of Drugs. 132 The US FDA's AERS presents case reports of adverse events received from 1996-present; these are either voluntary submissions by patients and healthcare professionals or mandatory submissions from drug manufacturers.
133 In principle, a case report details the names of all drugs administered to the patient, the drug(s) suspected of causing the adverse event(s), details of administration (including dosage), the observed adverse drug event(s) and the outcome for the patient. 28 AERS data is freely available from the FDA's website. 133 The AERS replaced the older Spontaneous Reporting System (SRS) which compiled adverse event case reports from 1969-1997, yet has been suggested to be less well curated. 28 The VigiBase database system compiles adverse event case reports, in standardized language, from across the globe. Access to VigiBase is controlled by the Uppsala Monitoring Centre (UMC). 134,135 Data might also be obtained from official warnings/labels, in spite of their limitations - e.g. they may not be based on the most up to date evidence, and might be based upon supposed risk - such as assignments based upon a drug's therapeutic class. 118 Nonetheless, this information is largely publicly available, 136 e.g. from the FDA's freely accessible MedWatch data files, 137 and may indicate the frequency with which an ADR has been observed in patients taking a particular drug. 136 Studies by Yap et al. 27 and Xue et al. 122 presented datasets comprising the names of a few hundred drugs categorised as TdP causing, using data from ArizonaCERT amongst other sources, and non-TdP causing. Clark and Wiseman recently presented a dataset comprising more than 1,000 drugs, including structures, categorised as either TdP causing or non-TdP causing, based upon statistical analysis of the FDA's AERS. 28

1.4 The Intention of this Thesis

To reiterate, the aim of the work presented in this thesis was the development of new in silico models and/or modelling approaches which could be used to predict drug induction of some of the most serious toxicity endpoints: mutagenicity and carcinogenicity, hERG inhibition and Torsades de Pointes.
Ideally, these approaches would be chemically interpretable, so as to guide lead optimisation/library design. Chapter 2 presents an overview of the computational approaches which may be used for toxicity prediction, with a particular focus on those approaches which were explored in the research presented in subsequent chapters. Chapter 3 presents work undertaken to screen for mutagenic and carcinogenic drug-like compounds in the context of a collaboration resulting in the in silico identification of experimentally verified inhibitors of type II dehydroquinase (DHQase). Chapter 4 describes the development of models to identify potent inhibitors of the hERG ion channel and compares their performance to literature approaches on different datasets. Chapter 5 reports novel approaches to encoding 3D molecular structure for in silico modelling; their merits, compared to existing approaches to encoding molecular structure, are assessed by generating models for hERG inhibition and another biological endpoint. Chapter 6 presents an assessment of novel approaches to identifying TdP causing drugs based upon mechanistically relevant biological information. Finally, Chapter 7 presents the conclusions drawn from the work presented in preceding chapters and offers suggestions for future research. Additional results referred to in this thesis (omitted for brevity), along with source code and data files generated in the course of the research presented, have been made available on the DVD attached to the inside cover (see Appendix A). Appendix B summarises the performance of toxicity models previously presented in the literature, with which the models presented in this thesis are compared. Additional computational details, including versions of software employed where not specified in the main text, are provided in Appendix C. 
Chapter 2 Computational Toxicology and Quantitative Structure-Activity Relationships

As discussed in Chapter 1, there is a clear need for in silico methods to predict the potential toxicological effects of pharmaceuticals and guide the efforts of medicinal chemists in reducing toxic liabilities. Computational toxicology is a field which seeks to develop such methods. This chapter defines 'computational toxicology', presents a brief historical background to the field and then gives an overview of contemporary approaches to predicting toxicology in silico. Arguably, approaches which seek to predict the toxicological effects of compounds solely from their molecular structure are most desirable, and the computational work presented in subsequent chapters of this thesis is almost exclusively concerned with such methods - both expert systems and, in particular, Quantitative Structure-Activity Relationships (QSARs). The remainder of this chapter describes these approaches, with a focus on QSARs, in some detail.

2.1 Defining Computational Toxicology

Computational toxicology may broadly be defined as the application of "mathematical and computer models to predict adverse effects and to better understand the mechanism(s) through which a given chemical causes harm". 138 More narrowly, the field may be conceived as one which employs "computer technology and information processing (informatics) to analyse, model, and estimate chemical toxicity based upon structure activity relationships (SAR)", 139 albeit - as touched upon in section 2.3 - a wider range of in silico approaches are also currently employed for toxicity prediction.

2.2 Historical Background

The notion that chemical composition is a key determinant of toxicological properties is not new and was articulated by Aleksandr Borodin as early as 1858.
140 By 1868, Brown and Fraser were seeking to empirically determine relationships between molecular structure and biological effects, 141 a paradigm underpinning many of the methods employed in contemporary computational toxicology. Various studies in the late 19th century also sought to relate toxicological characteristics to experimentally determined properties of compounds. 140,142 In 1944, Lazarev published work which quantitatively related both the equilibrium constant for the partitioning of organic compounds between water and olive oil and aqueous solubility to toxicological properties. Lazarev further presented limited attempts to relate changes in these physicochemical parameters to functional group contributions - as would allow for the prediction of toxicological effects from molecular structure alone. The rationale underpinning the empirical focus of Lazarev's work - that the complexity of the induction of toxicological effects obscured the mechanism of action 143 - remains a justification for empirical in silico models today. 19 The seminal work of Hansch et al., 144–146 in the early 1960s, presented, to the best of this author's knowledge, the first derivations of equations for bioactivities, including toxicological properties such as carcinogenic potency, based on contributions from multiple structural parameters. 146 Notably, they employed computers to derive these relationships, 144 and began to construct estimates of their, otherwise experimentally derived, substituent constants for compounds for which direct measurements were not obtained. 146,147 Around the same time, Free and Wilson employed the newly introduced electronic computers to derive regression models for the LD50 * values of chemical analogues, using the presence of specific side chains as predictors. 148 Although they did not coin the phrase, these studies arguably mark the advent of computational toxicology.
Whilst initial studies sought to generate relationships between toxicity and chemical structure for small sets of congeneric molecules, belonging to a "well-defined family of molecular structures .... [with different] substituents attached to a common molecular skeleton", 149 the development of software for predicting the toxicological effects of large, heterogeneous sets of compounds commenced in the late 1970s. 140 Today, a wide variety of predictive toxicology software packages exist which may be used to generate predictions for structurally diverse sets of compounds. 150

* The median lethal dose. 140

2.3 Approaches to Predicting Toxicology In Silico

A variety of computational approaches are currently used to predict toxicity. 14,20,150 Some of these (typically) make predictions based solely on the molecular structure of the toxicant - for example, read across, 20,151,152 (Q)SARs 20,25,153 and expert systems. 42,140,154 Some require knowledge of the receptor via which the toxicant acts - for example, proteochemometrics, 155 as well as the "docking" of putative toxicants into the (putative) binding sites of 3D receptor models. 156,157 Others typically require various properties of the toxicant to be experimentally determined - for example, physiologically based pharmacokinetic and pharmacodynamic modelling. 104,107,158,159 Some predictive toxicology programs employ more than one approach to arrive at predictions; for example, so-called “hybrid” expert systems may also employ QSAR components. 154 Detailed discussion of all the approaches currently employed in, and all the issues associated with, computational toxicology is beyond the scope of this thesis. The interested reader is directed to a number of recent reviews/texts for further discussion of these approaches, and other issues that are not covered herein.
14,19,20,150 Since this author was most interested in approaches based on toxicant molecular structure alone, read across, expert systems and QSARs are discussed in greater detail below.

2.4 Read Across

The “read across” approach supposes that chemically similar molecules should exhibit similar toxic effects for a given endpoint. 151 Indeed, this reflects the more generally supposed “molecular similarity principle”, which holds that similar molecules should exhibit similar properties; 2 this principle informs the widespread use of similarity searching, in which molecules are ranked according to their similarity to “query” molecules, known to be desirably active against a protein target of interest, and the top ranking compounds are deemed most likely to be active against the same target. 2,20 Likewise, read across assesses a compound of interest for possible toxicity by considering the most similar molecule(s) known to be toxic. 20,151 It must be noted, however, that the molecular similarity principle is not always observed to hold in practice. Indeed, whether or not this principle holds may be contingent on the manner in which molecules are determined to be similar. 2 This emphasises that the molecular properties used to assess similarity are critical: read across commonly employs “obvious” chemical similarities which should be mechanistically relevant for the endpoint of interest. 20,151,152 For example, similarity might be assessed based on the common presence of particular functional groups. 20 Based on the importance of electrophilic reactivity for skin sensitization, Enoch et al. proposed a read across approach for assessing skin sensitization potency based on computing similarity in terms of an “electrophilicity index”. 151

2.5 Expert Systems

Expert systems "seek to emulate a human expert making predictions".
154 They comprise a "knowledge base", which typically encodes the generalized knowledge of human experts as a collection of rules and an "inference engine" governing how these rules are used to arrive at predictions for specific cases. 140,154 The rules used in an expert system vary in their complexity. In the context of computational toxicology, a simple rule might be: if a molecule exhibits a given structural property, the compound will induce a specific endpoint. Rules of this kind may be used by the Toxtree program to make carcinogenicity predictions on the basis of 'structural alerts' for which there are no “modulating factors” (see below). 37 More complex rules are usually employed by the program Derek for Windows™, 160 for which the rules take the following general form:

If [grounds] is [threshold] then [proposition] is [force]

Here, "grounds" would refer to some variable providing evidence in favour of the "proposition" - with the value ("level of belief", e.g. "plausible", "doubted") of the "grounds" denoted by "threshold", and the level of belief in the "proposition" denoted by "force". An example rule of this form is presented below. The rules used to make predictions are commonly based on the presence of "structural alerts", or simply "alerts" (sets of substructures associated with a particular toxicity endpoint), as illustrated in Figure 2.1. 37,154,160 The presence of these substructures may be ignored, i.e. the alert will not "fire", in the presence of "modulating factors" - structural features deemed to diminish or abolish the toxicity conferred by the alert. 37 It is important to note that these alerts are solely capable of generating positive predictions for toxicity, and the absence of an alert is not an indication of a lack of toxicity - rather, a relevant alert for the compound's structural class may simply not have been compiled yet.
20,37

Figure 2.1 Two of the structural alerts for genotoxic carcinogenicity used in the “hybrid” expert system software program Toxtree. Here, R denotes any atom/group except OH or SH. 37

These rules may also incorporate predicted physicochemical properties (e.g. the coefficient of skin permeability, Kp), as well as the predictions for mechanistically related toxicity endpoints. For example, * a possible rule employed in the expert system software program Derek for Windows™ (DfW) 160 would be:

If [logKP < -5] is [certain] then [skin sensitization] is [species dependent variable 6]

An example of the type of processes applied by an inference engine would be the system of “argumentation” employed within DfW. Argumentation is used, for example, to propagate arguments along a chain, undercut arguments and resolve multiple arguments about the same proposition. In propagating arguments along a chain, uncertainties in the “grounds” (see above) are propagated as uncertainties in the proposition - e.g. a rule declaring that the “force” for a proposition was "certain" when a given “grounds” was "certain" could be used to infer that the force was "plausible" when the same “grounds” was "plausible". Undercutting arguments entails the introduction of context dependency into the strength of an argument in favour of a proposition - e.g. the “force” associated with a proposition based upon the threshold for a specific “grounds” might be dependent upon the species for which the toxic effects of the compound were of interest. In resolving multiple arguments for and against a proposition, DfW would combine the “forces” generated by multiple arguments to arrive at a single “force” - for example, arguments yielding conclusions of "probable" and "doubted" would yield a final conclusion of "plausible". Fuller explanation of this system, and relevant examples of its application, is provided in the referenced articles. 160,161

* Also, see Chapter 3, section 3.2.2.1.
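The interplay between a knowledge base of structural alerts and a simple inference step can be sketched in Python. This is a toy illustration only: the alert names, substructure labels and modulating factors are hypothetical, molecules are represented as pre-computed sets of substructure labels rather than real structures, and the logic is far simpler than the argumentation system of DfW or the rule base of Toxtree.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Alert:
    """One knowledge-base entry: a substructure plus any modulating factors."""
    name: str
    substructure: str
    modulating_factors: frozenset = field(default_factory=frozenset)

# Hypothetical knowledge base; the labels do not correspond to real alerts.
KNOWLEDGE_BASE = [
    Alert("alert_1", "substructure_A"),
    Alert("alert_2", "substructure_B", frozenset({"factor_X"})),
]

def fired_alerts(features, knowledge_base=KNOWLEDGE_BASE):
    """Return the names of alerts whose substructure is present and which
    are not suppressed by any modulating factor found in the molecule."""
    return [alert.name for alert in knowledge_base
            if alert.substructure in features
            and not (alert.modulating_factors & features)]

def predict(features):
    """Alerts can only support positive predictions; if none fire, the
    outcome is 'no alert fired' rather than 'non-toxic'."""
    fired = fired_alerts(features)
    return ("positive", fired) if fired else ("no alert fired", fired)

print(predict({"substructure_A"}))
print(predict({"substructure_B", "factor_X"}))
```

Returning "no alert fired" rather than "non-toxic" mirrors the point made above: the absence of an alert may simply mean that a relevant alert has not yet been compiled.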
A variety of computational toxicology software programs employing expert systems were commercially (including Derek for Windows™) 160 and freely (including Toxtree 162,163 and OncoLogic™) 164 available at the time of writing. Additional explanation of how the programs DfW and Toxtree make toxicity predictions, specifically for mutagenicity and carcinogenicity, is provided in Chapter 3.

2.6 Quantitative Structure-Activity Relationships

In this thesis, the term Quantitative Structure-Activity Relationship (QSAR) is used to denote any empirically derived relationship between molecular structure, encoded as a set of continuous or discrete numbers (descriptors), and bioactivity - either a continuous (e.g. pIC50) or categorical value (e.g. toxic vs. non-toxic). Whilst this broad definition is consistent with the description of QSAR modelling recently presented by Tropsha, 153 and Worth, 25 this term may also be used more specifically, to denote approaches which seek to determine how bioactivity changes within an "island" of "chemical space" defined by some set of descriptors. 19 The relationship expressed by a QSAR is determined (validated) using a training (test) set, requiring a set of molecules, encoded via descriptors, with observed bioactivities. The rest of this chapter will discuss the key issues associated with training and validating QSARs, along with examples of the approaches employed for these tasks - with particular emphasis upon those which were used in the work presented in subsequent chapters of this thesis.

2.6.1 Dataset Preparation

The first step in developing a QSAR model is the acquisition of a dataset comprising molecular structures (represented in an appropriate electronic format, as briefly discussed in section 2.6.2.1) and associated biological observations.
Whilst researchers in the pharmaceutical industry may have access to experimentally consistent datasets comprising thousands of compounds, 15,94 QSAR researchers commonly work with published data 165 and there may be a necessary trade-off between the consistency of biological measurements and the requirement for a sufficiently large * dataset for robust model generation and validation.

* If building a "global" model (see section 2.6.3.1), this dataset must also be structurally diverse.

Datasets may be compiled from primary or secondary literature sources, and various 'benchmark' QSAR datasets have been made freely available in recent publications 55,166 (with the latter comprising measurements derived using the same experimental protocol). 166 QSAR datasets may also be downloaded in electronic formats from various * online repositories. 167 Various public databases also exist which serve as repositories of structures and/or associated biological data. 168–171 Recently, however, concerns have been raised regarding the accuracy of the structural data contained within various public and private databases, with estimated error rates ranging from 0.1% to more than 10%, 165,172,173 potentially having a significant effect on estimated QSAR performance. 165,172 Another concern is the possibility of duplicate entries when combining data from different sources, which may overlap considerably; 174 unless these are identified and removed, the associated redundancy may skew QSAR analysis. If stereochemically indifferent descriptors are used, stereoisomers are effectively structural duplicates and should arguably be treated as such. 165 The 'standardization' of chemical structures is an important step in the preparation of a QSAR dataset.
32,165 Different representations of the same (sub)structures are a problem for two reasons: (a) they may obstruct the identification of duplicate structures within the dataset and (b) the descriptors calculated for the same (or similar) structures will be inconsistent. 32,165 One possible complication is the treatment of salts and multi-fragment compounds; whilst it is standard practice to remove counterions and, more generally, retain the largest fragment in multi-component dataset entries, 32,165 this may not be appropriate if there are reasons to believe either of the fragments could be significant for the observed bioactivity or when the bioactivities of salts differ significantly from their corresponding neutral form. 165,172 The practice of removing additional compounds from the dataset, e.g. inorganics/organometallics, for which some software programs may have problems calculating descriptors, 165,175 and/or 175 compounds violating Lipinski's rules, 176 has also been advocated in the literature. * However, some types of descriptors may be computed for some types of inorganics/organometallics, and analysis a few years ago by Overington et al. suggested that around 25% of small molecule drugs (not including "biological drugs") were not compliant with Lipinski's rules. 3 This further serves to emphasise that the most appropriate pre-processing of structures within and removal of entries from a QSAR dataset is non-trivial - different procedures may be appropriate under different circumstances. 165 Additional dataset preparation steps will also depend upon the descriptors to be calculated. For example, the calculation of 3D descriptors (as discussed in section 2.6.2.2) requires - presuming these are not already pre-computed or experimentally available - the generation of an appropriate 3D structure.

* Those linked to from the webpage cited here (reference 167) do not represent an exhaustive compilation.
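Two of the preparation steps discussed above, retaining the largest fragment of multi-component entries and identifying duplicate structures, can be sketched in Python. This is a deliberately crude illustration which assumes that structures are supplied as SMILES strings already standardised and canonicalised by a cheminformatics toolkit, so that identical structures have identical strings. The regular expression used to count heavy atoms, the function names and the example records are all assumptions of this sketch. As noted above, stripping counterions is not always appropriate.

```python
import re
from collections import defaultdict

# Approximate heavy-atom tokens: any bracket atom, or an organic-subset symbol.
# Crude assumption: bracket hydrogens such as [H+] are also counted.
_ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atom_count(smiles):
    """Approximate number of heavy atoms in a SMILES string."""
    return len(_ATOM.findall(smiles))

def largest_fragment(smiles):
    """Keep the dot-separated component with the most heavy atoms."""
    return max(smiles.split("."), key=heavy_atom_count)

def group_by_parent(records):
    """Map each parent structure onto the activities reported for it."""
    groups = defaultdict(list)
    for smiles, activity in records:
        groups[largest_fragment(smiles)].append(activity)
    return dict(groups)

# Hypothetical records: (SMILES, binary activity label).
records = [
    ("CC(=O)[O-].[Na+]", 1),
    ("CC(=O)[O-]", 1),
    ("c1ccccc1", 0),
    ("c1ccccc1", 1),
]
groups = group_by_parent(records)
# Duplicates with conflicting labels need manual review before modelling.
conflicting = [s for s, acts in groups.items() if len(set(acts)) > 1]
print(groups)
print(conflicting)
```

Grouping the activities rather than silently dropping repeated entries makes conflicting measurements for the same structure visible, which is relevant when data from overlapping sources are combined.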
2.6.2 Computer Representation of Chemical Structures

2.6.2.1 Chemical Structure Formats

Prior to the computation of descriptors, chemical structures must be represented in an electronic format. A large number of such formats exist. 177 Those which were principally employed in the work presented in this thesis were: the IUPAC International Chemical Identifier (InChI), 178 the Simplified Molecular Input Line Entry System (SMILES), 179 the Structure-Data file 180 (SDF) and Tripos MOL2 file formats. 181

2.6.2.2 Descriptors

A plethora of molecular descriptors, or numerical encodings of molecular structure "obtained by a well-specified algorithm applied to a defined molecular representation or a well-specified experimental procedure", exist; many of these were reviewed in the year 2000 by Todeschini and Consonni 182 and the number of descriptors continues to grow. 183 Indeed, much of the work presented in this thesis – from the development of a new approach to encoding descriptors for use with Nigsch's version of the Winnow algorithm 184 in Chapter 4, the 3D descriptors proposed in Chapter 5 and the in silico biological descriptors presented in Chapter 6 – is concerned with novel descriptors.

* Originally designed to alert chemists to compounds with potentially undesirable absorption or permeation, in a drug discovery or development context, 176 compliance with Lipinski’s rules may also be used as a surrogate for “drug-likeness” 175 (see Chapter 3, section 3.1).

Computationally derived descriptors are commonly grouped into the following categories. Those which can be computed solely from the molecular formula, such as molecular weight and elemental atom counts, may be denoted "0D descriptors". The term "1D descriptors" may be used to denote descriptors encoding the list of substructures present in a molecule, with "2D descriptors" corresponding to those based upon the precise connectivity (i.e. topology) of the entire molecular structure.
182 Alternatively, as is accepted in this thesis, descriptors computed from the molecular formula and all descriptors calculated from molecular connectivity (i.e. including the presence/absence of substructures) may be termed 1D and 2D descriptors respectively. 20 Some 2D descriptors may be augmented with stereochemical information, even in the absence of 3D geometrical information; 185 the term 'topological descriptors' is used in this thesis to denote 2D descriptors which do not encode such information. Descriptors calculated from the fixed three-dimensional co-ordinates of a single molecular conformation are termed "3D descriptors". 182,186 The so-called "4D descriptors" are based upon an ensemble of conformations; 20,186 again, this label is defined differently elsewhere. 182 Recent studies have proposed using the output of short-term biological assays as descriptors for predicting long-term biological responses. 187–189 Further recent studies have proposed that in vitro assay information, 128,190 or protein target predictions, 191 be used to inform purely in silico QSAR models of in vivo toxicity. Indeed, the use of predicted in vitro bioactivities as descriptors for predicting Torsades de Pointes is explored in Chapter 6 of this thesis.

2.6.3 QSAR Modelling Methods

2.6.3.1 General Concepts

Regression and Classification

QSAR models designed to predict continuous or categorical measures of bioactivity are denoted regression or classification models respectively; 192 a further distinction may be made between "hard point" classifiers, which solely aim to map an instance (i.e. a molecule encoded via descriptors) onto a class label, and probabilistic classifiers which estimate probabilities of class membership. Here, probability may be viewed in a Bayesian sense - i.e. as a quantification of the degree of confidence in a particular belief.
193,194 Probabilistic classifiers are advantageous when the costs of incorrect class assignment are class dependent and subject to change; 193 for example, it is arguably the case that, in early drug development, an incorrect prediction of toxicity is worse than incorrectly predicting a candidate as non-toxic, whereas the converse is true in a regulatory environment. 195 It must be noted, however, that a number of QSAR modelling methods may effectively be used for both regression and classification - for example, Random Forest (RF), 192 Artificial Neural Networks (ANNs) 196 and Partial Least Squares (PLS). 192

Linear vs. Non-Linear Models

In linear models, the output of the model (either the predicted bioactivity for regression models or the output of a classifier - thresholds for which are used for class assignment) takes the following general form 193,197 (Equation 2.1).

\[ y = \sum_{m=1}^{M} w_m x_m + w_0 \tag{2.1} \]

In Equation 2.1, $y$ is the model output, $x_m$ and $w_m$ the mth descriptor (out of a total of M) and corresponding coefficient (weight) respectively, and $w_0$ an offset term. Linear modelling methods have certain potential advantages over non-linear methods which do not assume this functional relationship. Firstly, the contributions of individual descriptors to the predictions are readily apparent, making them potentially more interpretable. Secondly, the computational overhead of linear methods may be lower than corresponding non-linear approaches - for example, when considering the choice of kernel function (see below) for a Support Vector Machine (SVM). 198 However, many structure-activity relationships, particularly those based on multiple mechanisms, 199 are likely to be non-linear - hence purely linear modelling methods could yield suboptimal predictivity.
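The general linear form of Equation 2.1 can be illustrated with a minimal Python sketch. The function names, the example weights and the zero threshold are illustrative assumptions only; they are not taken from any model developed in this thesis.

```python
def linear_model_output(descriptors, weights, offset):
    """Equation 2.1: offset term plus the weighted sum over all M descriptors."""
    if len(descriptors) != len(weights):
        raise ValueError("one weight is required per descriptor")
    return offset + sum(w * x for w, x in zip(weights, descriptors))


def assign_class(model_output, threshold=0.0):
    """Hard class assignment by thresholding the linear output.

    The threshold value (0.0) and the class labels are hypothetical.
    """
    return "positive" if model_output > threshold else "negative"


# Each descriptor contributes w_m * x_m, so its influence can be read off directly.
example_output = linear_model_output([1.0, 2.0], [0.5, -1.0], 0.25)
example_class = assign_class(example_output)
```

Because each term of the sum depends on a single descriptor, the sign and magnitude of each weight indicate how that descriptor moves the prediction, which is the interpretability advantage noted above.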
However, some non-linear relationships may be captured by linear modelling methods via generating additional descriptors based upon combinations of the original descriptors, 184 or by raising the original descriptors to the power n (n ≥ 2). 146

Global vs. Local Models

"Local" models are based on structurally non-diverse sets of compounds, often on a single congeneric series, 200 and are expected to have limited applicability. 201 "Global" models, as developed in the work presented in this thesis, are based on a diverse range of chemical series and are required for screening diverse chemical libraries. 15

Overfitting

This phenomenon may be defined as a scenario under which a model "fits the idiosyncrasies of a particular training set at the expense of the predictivity of a similar set of molecules" 202 - for example, by learning the noise (random experimental error) as well as the signal in the training set.

Hyperparameters

QSAR modelling methods may be associated with adjustable "hyperparameters" - parameters which control the distribution of other parameters; 193 in the current context, they control the relationship between bioactivity and molecular descriptors learnt from the training set. For some methods (e.g. the Support Vector Machine, described below), these must be carefully selected. 203

2.6.3.2 Examples

A plethora of methods may be employed to generate a QSAR model. Some commonly employed examples are: Multiple Linear Regression (MLR), Partial Least Squares (PLS), 197 Recursive Partitioning (or Decision Tree), 91,192,204,205 Artificial Neural Networks (ANNs), 193,196,206 k-Nearest Neighbours (kNN), 207,208 Support Vector Machines (SVMs), 193,203,209,210 Random Forest 192,204,211 and Naive Bayes. 193,212–214 It should be noted that this list is far from exhaustive - additional methods are presented below and new approaches are regularly proposed in the literature.
194,215 The following presents a detailed discussion of those methods which were used in the work presented in subsequent chapters of this thesis.

Winnow

The Winnow algorithm, originally developed by Littlestone, 216 was recently adapted for QSAR generation by Nigsch. 184 The following description refers to the version of this algorithm implemented by Nigsch 184,213 which was, with the minor addition of multiple training cycles (see below), employed in the work presented in Chapter 4 of this thesis. Instances are presented to Winnow as a set of ‘features’ – text strings which may either be present or absent in an instance. * For each class (C), Winnow holds S independently trained weight vectors (scorers) with positive elements ($w^{C}_{s,m}$) for all M features seen during training, used to generate S scores ($\Omega^{C}_{s,i}$) (Equation 2.2), for a given instance (i), for each class. The predicted class is the class with the highest arithmetic mean score.

\[ \Omega^{C}_{s,i} = \sum_{m=1}^{M} w^{C}_{s,m} x_{i,m} \tag{2.2} \]

In Equation 2.2, $x_{i,m}$ is 1 (0) if the mth feature is present (absent) in instance i, and all M values of $x_{i,m}$ may be considered a set of binary descriptors. The features themselves could correspond to the 'on-bits' of a fingerprint descriptor (as proposed by Nigsch) 184,213 or, as proposed in Chapter 4 of this thesis, to discretized descriptors. The training instances are presented sequentially to Winnow, in a predefined yet randomised order, with S training instances held in memory, and randomly distributed amongst the scorers, for a given training iteration; for the next training iteration, the first training instance which entered the “first-in first-out cache” is removed from memory and the next training instance added.
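The scoring and class assignment step of Equation 2.2 can be sketched as follows. This is a minimal illustration, not Nigsch's implementation: the dictionary-based data layout, the function names and the treatment of features without a stored weight (contributing zero) are all assumptions made for the example.

```python
def scorer_score(weights, present_features):
    """Equation 2.2 for one scorer and one class.

    Since x is 1 for present features and 0 otherwise, the sum reduces to
    adding up the weights of the features present in the instance.
    Features with no stored weight contribute 0 here (an assumption).
    """
    return sum(weights.get(feature, 0.0) for feature in present_features)


def predict_class(scorers_by_class, present_features):
    """Return the class with the highest arithmetic mean score over its scorers."""
    mean_scores = {}
    for label, scorers in scorers_by_class.items():
        scores = [scorer_score(w, present_features) for w in scorers]
        mean_scores[label] = sum(scores) / len(scores)
    best_label = max(mean_scores, key=mean_scores.get)
    return best_label, mean_scores


# Hypothetical model: two classes, S = 2 scorers per class, text-string features.
toy_model = {
    "active": [{"a": 2.0, "b": 1.0}, {"a": 1.0, "b": 1.0}],
    "inactive": [{"a": 0.5, "b": 1.0}, {"a": 0.5, "c": 2.0}],
}
toy_label, toy_means = predict_class(toy_model, {"a", "b"})
```

In this toy case the mean scores are 2.5 for "active" and 1.0 for "inactive", so the instance is assigned to the "active" class.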
Since only a subset of the training data is held in memory at any point in time, this "on-line" algorithm is highly memory efficient during training; 184 however, as illustrated in Chapter 4 of this thesis, this does mean the exact model learnt by Winnow is dependent upon the exact order in which the training set instances are presented. Nigsch originally proposed presenting each training set instance once, for inclusion in the cache, such that training ceased after the last training set instance was added to the cache and all scorers were subsequently updated (i.e. after a 'single training cycle'). 184 In the work presented in Chapter 4 of this thesis, however, the repeated presentation of the training set (in the same order), i.e. multiple training cycles, was also considered. The presentation of the training set in different randomised orders, as commonly used to train ANNs, 196 would arguably yield better results, but this is harder to implement computationally.

* More generally, the term “features” may be used interchangeably with “descriptors” in the QSAR literature, and acquires a distinct meaning again in the context of discussing SVM – see below. 209

The classwise feature weights for each scorer, initially set to unity, are updated via the following "error-driven" procedure. 184 For each scorer, upon encountering instance i during training, the score $\Omega^{C}_{s,i}$ (Equation 2.2) is calculated, for all classes. The weights of all features in instance i are decreased when $\Omega^{C^{*}}_{s,i}$, the score for the wrong class (C*), exceeds $\sum_{m=1}^{M} x_{i,m}$ (the number of features in instance i) and when $\Omega^{C^{*}}_{s,i}$ does not exceed this threshold, yet $|\Omega^{C^{*}}_{s,i} - \sum_{m=1}^{M} x_{i,m}| \le \varepsilon \sum_{m=1}^{M} x_{i,m}$. The weights of all features in instance i are increased when $\Omega^{C^{**}}_{s,i}$, the score for the correct class (C**), does not exceed the threshold and when $\Omega^{C^{**}}_{s,i}$ does exceed the threshold, yet $|\Omega^{C^{**}}_{s,i} - \sum_{m=1}^{M} x_{i,m}| \le \varepsilon \sum_{m=1}^{M} x_{i,m}$, where $0 \le \varepsilon$. A non-zero value for $\varepsilon$, forcing weight updates even when the correct class of the current training instance is relatively highly scored, may make the algorithm more robust.
184 The weights of all features in instance i are increased via multiplying by a “promotion” factor (p), where $1 < p < p_{\max}$, or decreased via multiplying by a “demotion” factor (d), where $d_{\min} < d < 1$. Nigsch initially proposed variable promotion and demotion factors, with $p_{\max}$ and $d_{\min}$ being positive constants, 184 albeit constant positive values may also be considered as per the work presented in Chapter 4 of this thesis. Earlier work by Nigsch indicated that this algorithm, under some circumstances, could perform comparably to Random Forest 184 and a Laplacian modified Naive Bayes classifier. 213 The work presented in Chapter 4 of this thesis indicates that, under some circumstances, this algorithm may perform comparably to SVMs as well. Whilst a linear algorithm, Nigsch's implementation is able to capture non-linearity in the original feature space. This is achieved by constructing additional features, termed "orthogonal sparse bigrams" (OSBs), via non-exhaustive pairing of the original features present in a molecule as previously described by Nigsch. 184,213

Linear Discriminant Analysis

Linear classifiers may also be derived from Linear Discriminant Analysis (LDA). A variety of approaches to determining a linear discriminant separating two classes (i.e. a model with the form of Equation 2.1 along with a threshold delimiting predictions for one class from predictions for the other) exist. 193 However, "LDA" is usually used 217 to denote the determination of a linear discriminant via adjusting the coefficients of the descriptors in order to maximise the Fisher criterion 193,217,218 – designed to maximise the ratio of between-class-variance to within-class-variance within the training set.
193,217,219 Binary classifiers derived from canonical discriminant analysis for mutagenicity and carcinogenicity are incorporated into the Toxtree program, 37 and hence were considered when generating predictions for mutagenicity and carcinogenicity in the work presented in Chapter 3. For binary classification, since the class means are co-linear, canonical discriminant analysis corresponds to LDA. 220

Naive Bayes

This approach estimates the (relative) posterior probabilities of class membership (i.e. the probabilities given the descriptor values), assuming the class conditional descriptor distributions, denoted $p(\mathbf{x}|C_k)$, are given by the product of the class conditional distribution of the individual descriptors, denoted $p(x_m|C_k)$. From Bayes' theorem, it follows that this assumption allows for the posterior probability of class $C_k$, denoted $p(C_k|\mathbf{x})$, to be expressed as follows (Equation 2.3): 193

\[ p(C_k|\mathbf{x}) \propto p(C_k) \prod_{m=1}^{M} p(x_m|C_k) \tag{2.3} \]

In Equation 2.3, $p(C_k)$ is the prior probability of membership of class $C_k$ and the constant of proportionality is the same across all classes. The Binary QSAR methodology, developed for binary classification, 214 was proposed by Thai and Ecker 69 for generating predictive models for hERG inhibition, and their approach was compared to novel approaches in the work presented in Chapter 4. Essentially, this methodology employs a Naive Bayes approach to estimate the posterior probability of a compound belonging to the "active" class. 214 The methodology can work with continuous descriptors - specifically, the descriptors used are principal components derived from the original descriptors (see section 2.6.6) - via estimating $p(x_m|C_k)$ for the mth retained 69 principal component.
The priors are estimated from the training data using modified "active"/"inactive" occurrence counts, which do not tend to zero in the limit of there being few "actives"/"inactives" in the training set. 214

Support Vector Machines

A Support Vector Machine (SVM) determines a linear "decision boundary" (or "hyperplane") which is designed to separate molecules, represented as instances located in a “feature space”, belonging to one class from molecules belonging to the other possible class. Hence, this decision boundary may be used for binary classification. The “feature space” referred to may correspond to the space defined by the original descriptors, or a non-linear projection of this space – generating a linear or non-linear classifier respectively. Mathematically, the decision boundary is defined as $y(\mathbf{x}) = 0$, with molecules assigned to, say, the 'toxic' ('non-toxic') class when $y(\mathbf{x}) > 0$ ($y(\mathbf{x}) < 0$), with $\mathbf{x}$ representing a descriptor vector, where $y(\mathbf{x})$ is computed as per Equation 2.4.

\[ y(\mathbf{x}) = \sum_{j} w_j \phi_j(\mathbf{x}) + b = \mathbf{w} \cdot \boldsymbol{\phi}(\mathbf{x}) + b \tag{2.4} \]

In Equation 2.4, $\boldsymbol{\phi}(\mathbf{x})$ represents a vector in the feature space, b represents a positive or negative scalar (the “bias”), and $\mathbf{w}$ is the normal to the hyperplane. If the training set is linearly separable - i.e. all training set data points in, say, the 'toxic' class can be perfectly separated from all training set data points in the 'non-toxic' class by some hyperplane(s) - in the feature space, an SVM finds the "maximum margin" hyperplane in order to minimise overfitting; * here, the margin is defined as the perpendicular distance between the hyperplane and the closest training set data point. 193,209,210 If the dimensions of the feature space are scaled such that, by definition, $|y(\mathbf{x})| = 1$ (i.e. $y(\mathbf{x}) = \pm 1$) for the training set instances lying closest to the perfectly separating hyperplane, the margin is given by $1/\lVert\mathbf{w}\rVert$ and maximisation of the margin corresponds to minimisation of $\lVert\mathbf{w}\rVert$ (equivalently, of $\lVert\mathbf{w}\rVert^{2}$).
193 If the training set is not linearly separable in the feature space, SVMs allow for some degree of misclassification by introducing "slack-variables". Training set instances associated with non-zero slack-variables may be misclassified and the margin is now defined in terms of the closest training set instances associated with zero-valued slack-variables. During training, the need to maximise the margin, in order to limit overfitting, is balanced against the need to minimise the extent of training set misclassification, such that minimisation of the following expression (Equation 2.5) is attempted. 193,209,210

\[ \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C \sum_{i} \xi_i \tag{2.5} \]

In Equation 2.5, $\xi_i$ is the slack-variable for the ith training set compound (a non-negative scalar which increases as the compound moves further onto the 'wrong side' of the margin for its class), $\boldsymbol{\xi}$ the vector of all slack-variables, and C is the "regularization constant", 210 a hyperparameter 203 which determines the trade-off between the minimisation of misclassification (second term of Equation 2.5) and the maximisation of the margin (first term of Equation 2.5) during training. 210 Figure 2.2 presents an overview of this classification procedure.

* If the feature space corresponds to a highly non-linear projection of the descriptor space, considerable overfitting may nonetheless occur. The degree of non-linearity of this projection is typically controlled by a kernel parameter, emphasising the importance of SVM hyperparameter selection (see below). 203,209

Figure 2.2 An overview of SVM classification. (A) A linearly separable dataset in the feature space, and a possible separating hyperplane. (B) The SVM solution for such a dataset: the maximum margin hyperplane. (C) A non-linearly separable dataset; highlighted are two misclassified instances and two further instances lying inside the margin, with their corresponding slack-variables.
(D) A conceivable corresponding decision boundary (as shown in (C)) in the descriptor space (supposing the feature space is a higher dimensional projection of the descriptor space); only the two misclassified instances are highlighted. N.B.: These images are for illustrative purposes only.

Obtaining the SVM solution of Equation 2.5, and generating predictions via Equation 2.4, does not require explicit generation of the feature space. Rather, all that is required is the determination of dot-products in this space, which may be computed (Equation 2.6) from corresponding vectors in the descriptor space using a kernel function. 210

\[ k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x}) \cdot \boldsymbol{\phi}(\mathbf{x}') \tag{2.6} \]

In Equation 2.6, $\mathbf{x}$ and $\mathbf{x}'$ are vectors in the descriptor space and $k(\mathbf{x}, \mathbf{x}')$ is the kernel function. The linear kernel, $k(\mathbf{x}, \mathbf{x}') = \mathbf{x} \cdot \mathbf{x}'$, corresponds to a dot-product in the descriptor space and yields a linear classifier. 193 A popular choice of non-linear kernel in the QSAR community 209 is the Gaussian Radial Basis Function (RBF) kernel (Equation 2.7). 221 This kernel was used to generate non-linear SVM models in the work presented in Chapter 4 of this thesis.

\[ k(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert\mathbf{x} - \mathbf{x}'\rVert^{2}\right) \tag{2.7} \]

In Equation 2.7, $\gamma$ is an additional hyperparameter controlling the degree of non-linearity (i.e. the degree of “flexibility”) 203 of the SVM model. Different variants of the basic SVM classification procedure outlined above exist, 193,209,210 as well as an adaptation of the SVM approach for regression: Support Vector Regression (SVR). 193,209,210,222

Recursive Partitioning

This approach is used to construct a single Decision Tree model from the training set. Starting from the entire training set (the "root node"), each descriptor is searched for "cutpoints" which partition the training set compounds at the current "parent node" into K "daughter nodes", such that the separation in the experimental bioactivities of the subgroups of the data passed to the daughter nodes is maximised according to some measure of separation.
204 For classification, this measure might be the mean decrease in Gini impurity, 223 or a t-test might be used for continuous bioactivities. 224 Commonly, K = 2 - i.e. only one split criterion is sought per descriptor (e.g. x1 > A, or x1 ≤ A). 204 Cutpoints might also be selected for linear or non-linear combinations of descriptors. 205 Partitioning continues until some stopping criterion (e.g. all compounds in the current node belong to the same class) is met. Predictions are generated by passing compounds through the tree, and assigning the majority class or the average bioactivity value for training set compounds in the final ("leaf") node. 204 Recursive Partitioning is notably prone to overfitting; 205 even small changes to the training set could yield changes to one cutpoint and have the subsequent effect of appreciably changing the structure of the Decision Tree. 204 Overfitting may be limited to some extent via pruning 204 - removing branches from the fully grown tree, with the optimal depth of the tree determined using internal validation (see section 2.6.5) on the training set. 91 The QuaSAR-Classify approaches proposed by Dubus et al. 91 for the generation of hERG blocker classifiers, which were further evaluated in the work presented in Chapter 4, are based on a Recursive Partitioning algorithm.

Random Forest

As proposed by Breiman, 211 this method grows a forest of unpruned decision trees, each trained on independent, random bootstrap samples (i.e. select N from N with replacement) of the training set, with the cutpoints at each node selected from independently, and randomly, chosen subsets of the descriptors. For a new molecule, the forest makes predictions via aggregating the predictions for each tree (either via majority voting for classification or via averaging for regression).
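The bootstrap sampling and aggregation steps just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the randomForest implementation: the function names are hypothetical, tree growing is omitted entirely, and ties in the majority vote are broken arbitrarily in favour of the label encountered first.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Select N from N with replacement: indices of one tree's training sample."""
    return [rng.randrange(n) for _ in range(n)]


def prob_absent_from_sample(n):
    """Chance that one given instance is never drawn in n independent draws."""
    return (1.0 - 1.0 / n) ** n


def aggregate_classification(tree_votes):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Arithmetic mean of the values predicted by the individual trees."""
    return sum(tree_outputs) / len(tree_outputs)


rng = random.Random(0)  # fixed seed only so that the sketch is reproducible
sample = bootstrap_indices(50, rng)
forest_class = aggregate_classification(["toxic", "non-toxic", "toxic"])
forest_value = aggregate_regression([1.0, 2.0, 3.0])
```

For large N, the probability returned by `prob_absent_from_sample` approaches e^-1 (about 0.37), so each tree is trained without seeing roughly a third of the training set instances.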
Since approximately 1/3 of training set instances are expected not to be selected for the generation of any given tree, an estimation of the forest's predictivity may be made on the training set by only aggregating the predictions, for a given training set molecule, of those trees for which the molecule was "out-of-bag" (OOB), i.e. not selected. 192 Whilst other decision tree algorithms could be employed, the standard Random Forest algorithm (as implemented in the randomForest package in the R Statistical Programming language) 225 generates decision trees using the Classification and Regression Trees (CART) algorithm. 192 CART uses a weighted sum of the Gini impurity 223 (variance) 226 over both daughter nodes for split selection when employed for classification (regression). 192,223,227,228 Importantly, Random Forest commonly performs well (albeit, not necessarily optimally) 'off-the-shelf' - i.e. with a set of default hyperparameters. Here, the hyperparameters refer to the number of trees in the forest (ntree) and the number of descriptors randomly sub-sampled at each node (mtry). 192 The behaviour of the forest is expected to converge 192,204,211 as ntree increases (indeed, Breiman proved convergence of the expected error rate for Random Forest classifiers as ntree tends to infinity), 211 with Svetnik et al. suggesting that 500 trees are usually sufficient for many purposes. 192 Due to its ability to capture many non-linear relationships, and approximate linear decision boundaries, 204 its good performance 'off-the-shelf', its inbuilt OOB estimates of model performance and its readily computed 192,223 - albeit, imperfect 227,229 - variable importance measures, the randomForest implementation of Random Forest was used extensively in the work presented in subsequent chapters of this thesis to generate both classification (Chapter 4, Chapter 5 and Chapter 6) and regression (Chapter 5 and Chapter 6) QSARs.
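The Gini-based split selection mentioned above for classification can be illustrated with a short Python sketch for a single descriptor and a binary split. It is a simplified stand-in, not the CART or randomForest code: the function names are hypothetical, candidate cutpoints are taken as midpoints between adjacent distinct descriptor values (an assumption), and the weighted sum over the two daughter nodes is normalised by the parent node size.

```python
from collections import Counter


def gini_impurity(labels):
    """One minus the sum of squared class fractions; 0.0 for a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left_labels, right_labels):
    """Size-weighted Gini impurity over both daughter nodes of a split."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) * gini_impurity(left_labels)
            + len(right_labels) * gini_impurity(right_labels)) / n


def best_cutpoint(values, labels):
    """Search one descriptor for the cutpoint A (split: x <= A vs. x > A)
    giving the lowest weighted impurity. Returns (None, inf) when all
    descriptor values are identical, i.e. no split is possible."""
    distinct = sorted(set(values))
    best_cut, best_score = None, float("inf")
    for lower, upper in zip(distinct, distinct[1:]):
        cut = (lower + upper) / 2.0
        left = [lab for val, lab in zip(values, labels) if val <= cut]
        right = [lab for val, lab in zip(values, labels) if val > cut]
        score = weighted_gini(left, right)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


cut, score = best_cutpoint([1.0, 2.0, 3.0, 4.0], ["n", "n", "t", "t"])
```

In a Random Forest, a search of this kind would be repeated at every node, but only over the randomly chosen subset of descriptors available at that node.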
For the versions of randomForest employed in the work presented in this thesis, the default mtry (M/3 and √M for regression and classification respectively, where M is the number of descriptors, rounded down to the nearest nonzero integer) and ntree (500) values corresponded to those proposed by Svetnik et al. 192

2.6.4 Quantifying the Predictive Power of QSAR Models

In order to summarise the predictive performance of a QSAR model, a variety of statistics, or figures of merit (FOMs), may be employed - each of them with their own strengths and weaknesses. N.B.: Often, in assessing the predictive performance of a modelling approach, multiple values might be computed for a figure of merit (FOM) and summarised using the arithmetic mean; all references to mean values for a figure of merit in subsequent chapters should be understood to refer to the arithmetic mean.

2.6.4.1 Figures of Merit for Classification

The starting point for the assessment of QSAR classifiers is typically the "confusion matrix" (C),* comprising elements Ckl denoting the number of compounds in class k predicted to belong to class l. 230 The following figures of merit are all based on the confusion matrix. A commonly reported measure of overall performance is "accuracy" 126,128,231 (or the "fraction of correct predictions"), 184 the number of correct predictions divided by the total number of compounds for which predictions are made. 231

* Alternatively, this matrix being an example of a "contingency table", 94 the "contingency matrix". 230

However, this is a poor measure of performance - particularly when the compounds to be predicted belong disproportionately to one class (i.e. the data is "unbalanced"); for example, if 99% of compounds belong to a single class, then a totally non-discriminative classifier which predicted everything to belong to that class would appear to be highly predictive - with an accuracy of 99%.
232 For binary classification, with a 'positive' and 'negative' class, the confusion matrix simplifies as follows (Table 2.1).

                      Observed Positive                  Observed Negative
Predicted Positive    Number of true positives (TP)      Number of false positives (FP)
Predicted Negative    Number of false negatives (FN)     Number of true negatives (TN)

Table 2.1 The confusion matrix for a binary classifier.

The Matthews Correlation Coefficient (MCC) is an appropriate FOM for assessing the overall discriminative ability of a binary classifier (Equation 2.8), which assigns equal credit to (equally penalises) correct (incorrect) predictions for either class. The MCC takes values between negative one (all predictions are incorrect) and one (all predictions are correct), whilst random assignment of the class labels would have an expectation value of zero - this value also being obtained in the limit that a classifier predicted all compounds to belong to a single class. Hence, the MCC is more appropriate than accuracy for ranking classifiers predicting unbalanced validation data and may be considered a measure of the extent to which the model would exceed the performance of a random predictor. 230 Indeed, as noted by Baldi, 230 the MCC is related to the chi-squared test-statistic ($\chi^2$) for independence (the "null hypothesis", corresponding here to a random predictor) for a 2 × 2 contingency table (Equation 2.9). 233 Assuming a chi-squared distribution, and supposing the validation data is large enough for the degrees of freedom to be taken as one, a "p-value" (i.e. the probability of obtaining a test-statistic at least as large given the null hypothesis) * may readily be calculated. 233

\[ \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{2.8} \]

\[ |\mathrm{MCC}| = \sqrt{\frac{\chi^2}{N'}} \tag{2.9} \]

In Equation 2.9, N' is the total number of compounds used for model validation.
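Equations 2.8 and 2.9 can be checked numerically with a short Python sketch based on the entries of Table 2.1. The function names are illustrative, and returning zero when the denominator of Equation 2.8 vanishes is a convention assumed here to match the single-class limit described above.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


def mcc(tp, fp, fn, tn):
    """Equation 2.8; 0.0 is returned when the denominator is zero (assumed
    convention, e.g. when every compound is predicted to be in one class)."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def chi_squared_from_mcc(mcc_value, n_compounds):
    """Equation 2.9 rearranged: chi-squared = N' * MCC^2."""
    return n_compounds * mcc_value ** 2


def p_value_one_dof(chi_squared):
    """Upper-tail probability of a chi-squared distribution with one degree
    of freedom, which equals erfc(sqrt(chi_squared / 2))."""
    return math.erfc(math.sqrt(chi_squared / 2.0))


# Unbalanced data: 99 of 100 compounds are negative, everything predicted negative.
trivial_accuracy = accuracy(tp=0, fp=0, fn=1, tn=99)
trivial_mcc = mcc(tp=0, fp=0, fn=1, tn=99)

# A hypothetical, more informative classifier on 100 balanced compounds.
useful_mcc = mcc(tp=40, fp=10, fn=10, tn=40)
useful_chi_squared = chi_squared_from_mcc(useful_mcc, 100)
```

The non-discriminative classifier attains an accuracy of 0.99 but an MCC of zero, whereas the hypothetical second classifier has an MCC of 0.6, corresponding to a chi-squared statistic of 36 for 100 compounds.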
Given the useful properties of the MCC outlined above, this figure of merit was extensively used to summarise the overall performance of the binary classifiers developed and/or assessed in the work presented in subsequent chapters of this thesis. However, any single FOM loses information with respect to the complete confusion matrix - i.e. there is not a one-to-one mapping between the confusion matrix and any FOM. Moreover, the MCC may be relatively high when few compounds are predicted to belong to one class. 230 Cohen's kappa 234 has similar properties to the MCC and may be used to assess classifier predictivity for an unlimited number of classes. Gorodkin also presented an extension of the MCC to multiple classes. 232 The performance of the model for individual classes may be assessed in terms of the recall and precision (which may further be combined into a single measure - the F Measure). These are defined below 175 (for the 'positive' class). † For the binary classification problem toxic vs. non-toxic, recall of toxic compounds and non-toxic compounds may be termed sensitivity and specificity respectively. 195

* As noted in section 2.6.4.2, p-values can be computed for other figures of merit (given the null hypothesis of a random predictor). Comparison of these p-values (lower p-values offering stronger grounds for rejecting the null hypothesis) might be more appropriate than direct comparison of these figures of merit when comparing the performance of models assessed on validation data comprising different numbers of compounds.

† Clearly, the expressions for any other class are analogous.

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2.10} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.11} \]

\[ \mathrm{F\ Measure} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{2.12} \]

2.6.4.2 Figures of Merit for Regression

For regression models, the Pearson's correlation coefficient (r), the coefficient of determination ($R^2$) and the Root Mean Square Error (RMSE) are commonly computed. 235 The common expressions, 236 used throughout the work presented in this thesis, for these are presented in the following equations.
Spearman's rank-correlation coefficient ($\rho$) was also computed (Equation 2.15). 226,237

\[ r = \frac{\sum_{i=1}^{N'} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N'} (y_i - \bar{y})^2 \sum_{i=1}^{N'} (\hat{y}_i - \bar{\hat{y}})^2}} \tag{2.13} \]

\[ R^2 = 1 - \frac{\sum_{i=1}^{N'} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N'} (y_i - \bar{y})^2} \tag{2.14} \]

\[ \rho = 1 - \frac{6 \sum_{i=1}^{N'} d_i^2}{N'(N'^2 - 1)} \tag{2.15} \]

\[ \mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N'} (y_i - \hat{y}_i)^2}{N'}} \tag{2.16} \]

In these preceding equations, N' denotes the number of compounds for which predictions are made, $y_i$ and $\hat{y}_i$ denote the experimentally measured and predicted bioactivity for the ith compound with $\bar{y}$ and $\bar{\hat{y}}$ denoting their respective means. Finally, $d_i$ denotes the difference in ranks assigned to compound i when the compounds used for validation are ranked according to their predicted and observed bioactivities. These statistics have their advantages and disadvantages. 186 For example, the RMSE may appear low (i.e. the model may misleadingly appear predictive) if the model simply predicts all compounds to have the mean bioactivity of the training set (supposing this is close to the mean bioactivity of the test set and the test set's bioactivity values are distributed over a narrow range). 238 Additionally, neither the Pearson's nor Spearman's correlation coefficients take account of non-zero model bias (i.e. when $\sum_{i=1}^{N'} (y_i - \hat{y}_i) \neq 0$). 239 However, for the Pearson's and Spearman's correlation coefficients, statistical tests exist for determining the probability of observing a correlation coefficient at least as positive * as that obtained given the null hypothesis of random predictions 226,237 (i.e. a p-value). †

2.6.5 Model Validation

It is important that unbiased estimations of model performance are made. The compounds used for model validation should not have been used to train the model. 240 When validation data is not used to directly train the model, yet has been used to select hyperparameters or descriptors (selection of the latter commonly termed "feature selection"), 69,91 model performance is expected to be overestimated.
200,207,241–243 The unbiased assessment of a model on data not used for training nor model selection (for example, hyperparameter or descriptor selection) may be termed "external" validation (as opposed to "internal" validation). 153 In this thesis, 'internal' validation is used to denote model assessment on data not directly used for training, but still used to guide model development via hyperparameter or feature selection.

* N.B.: Here, the alternative hypothesis (accepted when the null hypothesis is rejected) is that the correlation coefficient expected for the "population" (in our case, all possible validation set molecules) is greater than zero (the value expected for the null hypothesis). 226 This contrasts with the statistical test proposed for the MCC in section 2.6.4.1, for which - strictly speaking - this null hypothesis can only be rejected in favour of concluding that the classifier is not a random predictor.

As well as using a single distinct test set (a "holdout" sample) to validate a model, cross-validation, based on repeatedly training a model on a subset of the original training set and making predictions for the remaining compounds (see below for details), might yield more robust estimates of model performance if few compounds are available for the holdout sample. 243 Cross-validation is often viewed, by definition, as an "internal" validation method. 186,200 However, if all steps of model generation (including hyperparameter and feature selection) are repeated for each training set, cross-validation estimates of model performance need not be optimistically biased; 243 under such circumstances, one may speak of "external cross-validation". 153,189 Arguably, a further distinction may be made here between attempts to estimate, without bias, the performance of:

1. A single model.
2. A modelling approach (e.g. a specific QSAR modelling method and/or descriptor set).
In the first scenario, "external cross-validation" may be carried out on the training set eventually used in its entirety to build the final model. Hence, one may consider assessment on a holdout sample to be external in a more profound sense. In the second, one is not interested in pooling all the data used for cross-validation to train a single model, hence this consideration does not apply. Cross-validation (CV) typically, albeit not always, 244 randomly partitions a dataset into several training and test sets and estimates the performance of a modelling approach across the partitions. In K-fold CV, a dataset is partitioned into K disjoint sets (folds) of comparable size, and each fold is used in turn to validate the model, with all remaining instances used for training; leave-one-out CV (LOOCV) entails setting K to N - the size of the dataset - i.e. one compound is used for model validation, and the rest for training, at a time. The performance statistics commonly used are the mean FOMs across all K folds, 240,245 albeit the predictions might be pooled across all folds prior to computing a FOM. 246 In Monte-Carlo CV (MCCV), a dataset is repeatedly and independently partitioned at random into a training and test set (using a fixed percentage of the data for testing) and the performance of the modelling approach is assessed by computing the mean FOMs, obtained on the test set, across all repetitions. 245 In stratified CV, an attempt is made to maintain the ratio between the classes in the entire dataset across all partitions. 240,246 Alternative approaches for partitioning the dataset into training and validation data may also be considered. 240 As well as assessing model predictivity on the original, available data, one might also employ "y-randomisation", or "(y-)scrambling" for model assessment.
153,186,247 This entails the randomisation (typically, via permutation) of the bioactivities within a dataset, and an assessment of the expected performance of models (estimated from multiple repetitions of the procedure) obtained by applying the original modelling procedure (including hyperparameter and feature selection) to the permuted dataset. 247 Model performance on the training and/or test data may be compared, 186,189 with some authors having applied a model trained on randomised data to non-randomised test data, 222 and a reduction in estimated model performance after scrambling may be interpreted as indicating that the QSAR model(s) are not based on capturing chance correlations within the dataset. 189

2.6.6 Applicability Domain

Since the training set used to derive a model represents an incomplete coverage of "chemical structure space", the model may only be "successfully" applied to a finite variety of chemical structures.248 The "applicability domain" (AD) of a model is widely understood to define the range of chemical structures to which the model is "applicable".
More precisely, a report for the European Centre for the Validation of Alternative Methods (ECVAM) defined the applicability domain as: "the response and chemical structure space in which the model makes predictions with a given reliability".249 Whilst this may be interpreted as a range of chemical structures for which the expected model performance is well characterised, 25 the applicability domain is commonly interpreted as a region of chemical structure space in which the model is known to exhibit desirable predictivity.47,202,248,250,251

A distinction may be made between those approaches which simply try to categorise compounds as "inside AD/outside AD" and those which seek to directly assess the expected performance of the model for a particular compound.47 Examples of the former include approaches based on molecular fragments, such as categorising compounds with fragments not seen in the training set as outside the AD,249–251 as well as those which are informed by an understanding of the molecular mechanisms responsible for the bioactivity of interest.249 As an example of the latter, skin sensitizers may be categorised as belonging to different "reaction mechanistic domains" and models developed that are specific to one such domain.252

Other "inside AD/outside AD" approaches are based upon the location of compounds with respect to the training set within the space defined by a particular set of descriptors. The simplest of these are the range based methods, which categorise compounds as inside/outside the AD if (some of) their descriptor values lie inside/outside the ranges observed for the training set. Alternatively, range based methods may be applied after transforming to a new co-ordinate system based on the principal components (see below). 249 A principal components analysis (PCA) plot may be used for rapid, qualitative assessment of whether or not a (set of) compound(s) lie inside the chemical space of the training set 69 - i.e.
whether or not they may be considered to lie inside the applicability domain of the model. These types of plots were used for qualitative assessment of the separation in chemical space between the training and test sets for the models presented in Chapter 4 of this thesis.

PCA starts with the computation of the principal components - linear, orthogonal combinations of the original descriptors. The principal components (PCs) are the M eigenvectors of the covariance matrix (X^T X), computed from the N×M matrix (X) with elements X_im denoting the value of the mth descriptor for the ith molecule. The corresponding eigenvalues measure the relative, independent, contribution to the variance associated with the original descriptors for a given PC.* 197 The approximate distribution of the data within the descriptor space may then be visualised by plotting the data in the plane defined by the two PCs with the largest eigenvalues - i.e. a PCA plot. 189,193

In recent years, a number of studies have sought to benchmark measures designed to estimate the predictive performance of a QSAR model for a new compound.47,202,248,251 The studies by Sushko et al.47 and Dragos et al.,248 in keeping with earlier publications,251 advocated the use of measures based on the variation in predictions across an ensemble of predictors as a metric for discriminating between ‘well predicted’ and ‘poorly predicted’ compounds.

* The scaling of descriptors, to remove 'artificially larger' contributions, e.g. due to different units, to the structural variation captured by the descriptors may be appropriate prior to PC computation.197

Chapter 3 Screening for Mutagenicity and Carcinogenicity in the Context of a Prospective Virtual Screen

This chapter presents this author's contribution to a prospective virtual screening project which identified experimentally verified inhibitors of type II dehydroquinase (DHQase).
253,254 This project was a collaboration between members of the Mitchell, Blumberger and Abell research groups of the Department of Chemistry, University of Cambridge. This author's contribution was the development and application of a toxicity filter, used to remove putative inhibitors predicted to induce particularly important toxicity endpoints (see Chapter 1) during the initial stages of this project. This chapter starts (section 3.1) by situating this author’s contribution within the context of this collaboration – summarising the workflow employed and the key findings of the project. The nature of the toxicity filter, along with the motivation for applying a toxicity filter of this kind, is fully explained in section 3.2. The empirical validation of various modelling options, which informed the options selected for the toxicity filter, is presented in sections 3.3 and 3.4. This chapter ends by discussing the implications of the toxicities predicted for the experimentally determined inhibitors. 3.1 Overview of Collaboration The aim of the project was to identify structurally novel, 'drug-like' inhibitors of type II DHQase. This enzyme catalyses the reversible dehydration of 3-dehydroquinate to give 3- dehydroshikimate, a key step in the essential shikimate pathway in Streptomyces coelicolor, Helicobacter pylori and Mycobacterium tuberculosis bacteria 253–255 - the latter two being pathogenic. 253,255,256 However, this pathway is not present in mammals. 253–255 Consequently, inhibitors of the H. pylori or M. tuberculosis isoforms may serve as antibiotics - indeed, inhibitors of the latter may serve as anti-tuberculosis drugs 253,255 - making the identification of new inhibitors of type II DHQase an important step towards discovering structurally novel drugs which could circumvent resistance to existing antibiotics. 255,257 The workflow used in this project is schematically illustrated in Figure 3.1. 
Firstly, a pre-defined “drug-like”* subset of the freely available ZINC (ZINC8) database was downloaded on 14th April 2009. This contained 8,784,580 commercially available compounds. 258–260 The available conformers were compared, in terms of shape similarity, to three ligands co-crystallised with type II DHQase (CA2, RP4 and GAJ † from Protein Data Bank (PDB) 261,262 entries 2BT4, 2CJF and 2C4W respectively) using Ballester and Richards' Ultrafast Shape Recognition (USR) methodology. 263,264 For each template (i.e. co-crystallised ligand), conformers with a similarity below a pre-defined threshold (0.90) were discarded. This resulted in three sets of compounds with 2,963 (CA2 template), 918 (RP4 template) and 498 (GAJ template) unique, non-overlapping ZINC codes (i.e. 4,379 unique ZINC codes in total, corresponding to 4,379 compounds 260).

These sets of compounds were presented to this author in SDF format. As described in section 3.2, these files were parsed to generate combined predictions for mutagenicity and carcinogenicity. Compounds predicted to be both mutagenic and carcinogenic were removed. SDF entries corresponding to ZINC codes which were duplicated in the original files generated by this toxicity filter were also removed. ‡ Ultimately, three SDF files, comprising entries with unique ZINC codes, were obtained. These contained 406 (GAJ template), 847 (RP4 template) and 2,655 (CA2 template) SDF entries respectively, i.e. 3,908 compounds, corresponding to 3,908 unique ZINC codes, in total.
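The similarity threshold applied in the USR screen described above can be illustrated with a minimal Python sketch. It assumes that each conformer has already been encoded as a vector of 12 USR shape descriptors, and it uses the USR similarity score of Ballester and Richards, i.e. the inverse of one plus the mean absolute difference between the two descriptor vectors. The function names and the toy descriptor values are hypothetical; they are not taken from the software used in this project.

```python
from typing import Dict, List, Sequence


def usr_similarity(query: Sequence[float], candidate: Sequence[float]) -> float:
    """USR-style similarity: 1 / (1 + mean absolute descriptor difference)."""
    if len(query) != len(candidate):
        raise ValueError("descriptor vectors must have the same length")
    mean_abs_diff = sum(abs(q - c) for q, c in zip(query, candidate)) / len(query)
    return 1.0 / (1.0 + mean_abs_diff)


def screen_conformers(
    template: Sequence[float],
    conformers: Dict[str, Sequence[float]],
    threshold: float = 0.90,
) -> List[str]:
    """Keep the identifiers of conformers whose similarity is not below the threshold."""
    return [
        conf_id
        for conf_id, descriptors in conformers.items()
        if usr_similarity(template, descriptors) >= threshold
    ]


# Toy 12-element descriptor vectors (invented values, for illustration only).
template = [1.0] * 12
conformers = {
    "conf_a": [1.05] * 12,  # close in shape to the template
    "conf_b": [1.50] * 12,  # dissimilar to the template
}
kept = screen_conformers(template, conformers)
```

An identical pair of descriptor vectors scores exactly 1, and the score decays towards 0 as the descriptors diverge, so a threshold of 0.90 retains only conformers whose descriptors differ from the template by a small amount on average.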
* A "drug-like" compound has "sufficiently acceptable ADME properties and sufficiently acceptable toxicity properties to survive through the completion of human Phase I clinical trials".258 As is common practice,258 the maintainers of ZINC define this in terms of filters based upon - calculated - molecular properties.259

† These codes denote the following ligands: (1S,3R,4R,5S)-1,3,4-trihydroxy-5-(3-phenoxypropyl)cyclohexanecarboxylic acid (CA2), (1S,4S,5S)-1,4,5-trihydroxy-3-[3-(phenylthio)phenyl]cyclohex-2-ene-1-carboxylic acid (RP4), N-tetrazol-5-yl 9-oxo 9H-xanthene-2 sulphonamide (GAJ).

‡ Given that the maintainers of ZINC note260 that they enumerate possible stereoisomers for compounds with incompletely defined stereochemistry, it is possible that these compounds may have been associated with stereochemical ambiguity - which could have resulted in misleading findings in the subsequent docking studies.

These 3,908 compounds were docked, using GOLD, 265 into five PDB crystal structures of type II DHQase (PDB codes: 2BT4, 2C4W, 2CJF, 1GU1 and 1H0R). These corresponded to the S. coelicolor (1GU1, 2BT4, 2CJF), H. pylori (2C4W) and M. tuberculosis (1H0R) isoforms respectively. Docking poses were generated using ChemScore, and re-scored using GoldScore, Astex Statistical Potential (ASP) and RF-score. 266,267 Based on the scores obtained, three protocols were employed to generate overall rankings for all 3,908 compounds. Protocol 1 (a consensus scoring strategy) sorted all poses, per protein structure, generated from docking into the 2BT4, 2CJF and 2C4W structures by their average rank (according to ChemScore, GoldScore and ASP). Protocol 2 ranked the poses according to RF-score. In both cases, the top ranking 100 compounds for each target were selected for further consideration.
Protocol 3 considered all five sets of poses, sorted according to the average ChemScore, GoldScore and ASP ranks; the compounds ranked within the top 500 for all five protein targets were selected for further consideration. For protocol 4, all 4,379 compounds initially passing the USR screen were considered. All such compounds were ranked according to their USR similarity score with respect to the RP4 template; top ranking compounds were selected for further consideration. Given financial constraints and redundancy between the compounds selected by these protocols, a small number of compounds, prioritised via these protocols, was finally purchased for experimental testing.

All purchased compounds were tested for their ability to inhibit conversion of the natural substrate (i.e. 3-dehydroquinate) by both S. coelicolor and M. tuberculosis isoforms of type II DHQase in a kinetic assay, and Ki values were estimated from the measured IC50 values using the original form of the Cheng-Prusoff equation, which is valid for reversible, competitive inhibition. 75 Of the 148 compounds tested, 89 (91) were confirmed as inhibitors of the M. tuberculosis (S. coelicolor) isoform with Ki < 500 μM. Median Ki values were 115 and 108 μM for the M. tuberculosis and S. coelicolor isoforms respectively. The most potent of these inhibitors had Ki values of 23 and 4 μM for the M. tuberculosis and S. coelicolor isoforms respectively. Arguably, this compares favourably to the recent discovery, via high throughput screening (HTS) entailing experimental assessment of 150,000 compounds against the H. pylori isoform, of a novel inhibitor which had Ki values of 20 μM and 230 μM for the H. pylori and S. coelicolor isoforms respectively. 268

The two sets of inhibitors were separately combined with inhibitors of the corresponding isoform previously published in the literature, as identified from the ChEMBL database.
269 For both expanded sets, clustering analysis, 270 and manual inspection of the structures of compounds in each cluster, indicated that the new inhibitors discovered in this virtual screening workflow were structurally distinct from those previously published in the literature.

At the time of writing, this work had been submitted for publication. This publication will present full details of the computational and experimental procedures employed, and results obtained in this study - beyond the description of the toxicity filter presented below.

Figure 3.1 Overview of the workflow undertaken to identify novel, experimentally verified inhibitors of type II DHQase. This author's contributions to the study are circled.

3.2 Approach Developed for Toxicity Screening

3.2.1 General Overview

Combined predictions were generated for both the mutagenicity and carcinogenicity endpoints using Derek for Windows™ (version 11.0), 160,271 made available by Lhasa Limited, and the freely available Toxtree (version 1.51) 159–161 software. For brevity, Derek for Windows™ is referred to as DfW.

As previously noted by Simon-Hettich et al., predictive toxicology filters applied early on in the pharmaceutical life cycle should prioritise specificity over sensitivity (see Chapter 2, section 2.6.4) - i.e. they should focus on minimising the loss of potential candidates due to incorrect predictions of toxicity, even at the expense of retaining some toxic compounds, when attrition costs are low. 195 Hence, only compounds predicted to be both mutagenic and carcinogenic were selected for removal, to reduce losses due to incorrect predictions for either endpoint. The focus on these endpoints was due to the particular importance of identifying both mutagens and carcinogens early on in the pharmaceutical lifecycle, as explained in Chapter 1, section 1.3.1. Of course, these are not the only endpoints which require early identification to reduce attrition costs (e.g.
see Chapter 1, sections 1.3.2 and 1.3.3); however, removal of compounds deemed to exhibit additional toxicities would have resulted either in an accumulated loss of compounds incorrectly predicted to be toxic - if compounds were removed when predicted either to induce both mutagenicity and carcinogenicity or to exhibit another toxicity - or in an increased retention of toxic compounds, if compounds were only removed when predicted to exhibit many types of toxicity.

The focus on prioritising specificity over sensitivity also informed both the selection of options used to generate the predictions made using each program individually and the manner in which combined predictions were generated. Prior to selecting the options presented below, a range of options were considered and their predictive performance assessed on an experimentally derived mutagenicity and carcinogenicity database (see section 3.3).

In order to explain the options selected, it is necessary to briefly review the manner in which DfW and Toxtree generate predictions for mutagenicity and carcinogenicity. Further details may be obtained from the documentation for these programs and the referenced articles and texts.

3.2.2 Generating Predictions using DfW

3.2.2.1 Background

As discussed in Chapter 2, section 2.5, DfW makes predictions on the basis of rules that are based on various types of evidence - structural alerts, physicochemical properties, toxicological properties as well as conclusions arrived at for related toxicity endpoints. For example, the following rules might be used to build up an argument in favour of observing carcinogenicity in rats. 160

If [structural alert = beta-O/S-substituted carboxylic acid or precursor] is [certain] then [peroxisome proliferation] is [plausible (in rats)]

If [peroxisome proliferation (in rats)] is [plausible] then [carcinogenicity] is [plausible (in rats)]

These example rules also emphasise some important characteristics of DfW.
In contrast to Toxtree, 37 DfW generates conclusions of varying strengths (as opposed to simple binary predictions) as well as species specific conclusions - e.g. carcinogenicity is “certain” in rodents, yet “plausible” in humans.160

3.2.2.2 Specific Options Selected

However, for the purposes of the project presented in this chapter, binary predictions (carcinogen/mutagen vs. non-carcinogen/non-mutagen) were required. Hence, 'positive' predictions for carcinogenicity/mutagenicity were generated when DfW concluded that carcinogenicity in humans/in vitro mutagenicity in Salmonella typhimurium (the test organism used in the standard Ames test) 38,46 was “certain”, “probable” or “plausible”, with DfW predictions being otherwise interpreted as 'negative'. *

* Whilst in keeping with literature precedent,272 and required for the purpose of generating an automated toxicity filter herein, digitising the output of DfW in this fashion clearly loses information that would be of considerable value in, say, prioritising compounds for experimental assessment. The specific choice of cutoff is, also, to some extent arbitrary;154 however, internal analysis within Lhasa Limited has suggested that, for several endpoints, compounds classified as “equivocal” - which ranks immediately below “plausible”160 - had a 50:50 chance of being toxic (private correspondence with Dr Chris Barber).

3.2.3 Generating Predictions using Toxtree

3.2.3.1 Background

Toxtree was used to make predictions for mutagenicity/carcinogenicity on the basis of the Benigni/Bossa rulebase. 37 Using this rulebase, the program reports all genotoxic and nongenotoxic carcinogenicity structural alerts (SAs) - see Chapter 2, section 2.5 - that were deemed to have “fired” for the compound in question - i.e. were found in the molecular structure in the absence of any possible modulating factors supposed to abolish the toxic potential conferred by the SA.
If an SA substructure corresponding to an aromatic amine or an α,β-unsaturated aldehyde is identified (even if it does not fire), binary classification, canonical discriminant analysis QSARs may be applied (see Chapter 2, section 2.6.3.2). These QSARs were trained to discriminate between:

1. aromatic amines inducing, or not inducing, mutagenicity in Salmonella typhimurium TA100 (including metabolic activation),
2. aromatic amines exhibiting, or not exhibiting, carcinogenicity in rodents,
3. α,β-unsaturated aliphatic aldehydes inducing, or not inducing, mutagenicity in Salmonella typhimurium TA100 (without metabolic activation).

These QSARs are denoted: 1. QSAR6, 2. QSAR8 and 3. QSAR13 respectively.

3.2.3.2 Specific Options Selected

Toxtree was interpreted as predicting mutagenicity when one of its genotoxic SAs fired. Carcinogenicity was predicted in these cases and also when a positive prediction for carcinogenicity was obtained from QSAR8, with negative QSAR predictions being ignored. Ignoring negative predictions for the Toxtree QSAR models could help avoid a scenario in which toxicity conferred by another SA, other than that for which the QSAR was applied, was ignored due to a negative QSAR prediction. When mutagenicity/carcinogenicity was not predicted, Toxtree was interpreted as predicting the compound to be a non-mutagen/non-carcinogen.

3.2.4 Combined Predictions

3.2.4.1 Background

The value of making toxicity predictions based upon the combined output of different in silico models has been stressed in the recent literature. 273,274 The manner in which overall predictions are generated when the models do not agree can be adjusted to increase sensitivity (specificity), at the cost of reduced specificity (sensitivity). 273 Hence, given the need to maximise specificity for the toxicity filter employed in this project, the use of a combined prediction strategy was attractive.
It should be noted that, for such combined models to add value, they should be complementary - i.e. they should not consistently make identical predictions - so that greater confidence may be assigned when a positive (i.e. toxic) prediction is generated by multiple models. 274 The DfW and Toxtree programs clearly employ different prediction paradigms (see above). Moreover, they may employ complementary SAs: the SA resulting in a prediction for carcinogenicity by DfW, as noted in section 3.2.2.1, is not documented as corresponding to a SA employed by Toxtree for carcinogenicity prediction. 37,160 However, it was not possible to fully assess the extent to which the SAs employed by Toxtree (version 1.51) and DfW differed, since access to the DfW knowledge base was not permitted by the terms under which Derek for Windows™ (version 11.0) was licensed to us. Ultimately, the complementary nature of the two programs is evidenced by the differences in specificity and sensitivity obtained when considering conflicting predictions to be positive or negative (see section 3.3). *

3.2.4.2 Specific Options Selected

Based on the 'yes/no' (or 'positive/negative') predictions generated for carcinogenicity/mutagenicity via the interpretation of the output of DfW and Toxtree presented above, compounds were predicted to be mutagenic/carcinogenic if 'positive' predictions were generated via both programs; compounds were otherwise predicted to be non-mutagenic/non-carcinogenic.

* Whether similar results would be obtained with the current versions of these programs is an open question.

3.2.5 Practical Implementation

All SDF files provided to this author were batch processed via the Toxtree (version 1.51) GUI. * Ultimate generation of toxicity predictions, and generation of filtered SDF files, entailed parsing of SDF files using Python 275 scripts, largely via the Python module Pybel.
177 Parsing of the SDF files included writing the titles (ZINC codes) of each entry to an SDF field (), as required for generating predictions and filtering compounds predicted to be mutagenic and carcinogenic via parsing the SDFs, updated with Toxtree output, using an additional set of Python scripts. These scripts also triggered the generation of Derek for Windows™ (version 11.0) predictions; DfW predictions were matched to the corresponding Toxtree predictions using the field. Clearly documented versions of these Python scripts, and the DfW configuration file, are made available with this thesis (Appendix A). The original versions of these scripts, used to generate (combined) toxicity predictions for the ZINC database subsets, were essentially identical to those used to generate Toxtree and DfW predictions for the toxicity database considered in the next section - save for using a database specific field, specified below, to match Toxtree and DfW predictions.

* For the CA2 set, Toxtree was applied prior to addition of the field; for the GAJ and RP4 sets, Toxtree was applied following this step. DfW was applied to the SDF files generated following addition to the SDF files originally provided to this author.

3.3 Empirical Validation of Toxicity Models

3.3.1 Overview

The ISSCAN 51 (version 3a) database (see Chapter 1, section 1.3.1.3) was downloaded in SDF format and used to assess the predictive performance of various mutagenicity and carcinogenicity binary classifiers constructed using DfW and/or Toxtree. The performance of these models informed the selection of the specific options used for toxicity filtering (as presented in section 3.2).

3.3.2 Options Assessed

In Table 3.1 and Table 3.2, the 12 and four sets of options for generating positive predictions for mutagenicity and carcinogenicity using Toxtree and DfW respectively are presented. As well as generating predictions using Toxtree and DfW individually, all possible combined predictions were generated. Two options were considered for making combined positive predictions (for both endpoints):

1. an overall positive prediction was generated if either sub-model made a positive prediction,
2. an overall positive prediction was only generated if both sub-models made a positive prediction.

These options for generating combined predictions were expected to maximise 1. sensitivity or 2. specificity respectively. Ultimately, this yielded 112 (4+12+[12×4×2]) models to be evaluated. In the absence of positive predictions, all models were interpreted as yielding negative predictions. *

* No claim is being made here that when these models do not predict a compound to be toxic, this actually indicates the compound is non-toxic; indeed, save when a QSAR analysis can be applied, Toxtree is inherently incapable of making negative predictions,37 and not all assessments made by DfW suggest the presence or absence of toxicity (e.g. an “equivocal” or “open” prediction),160 with a "nothing to report" outcome generated in the absence of either a structural alert firing or the activation of any other relevant rule in the knowledge base.276 However, for the purposes of the current work, a toxicity filter was required that would discard compounds anticipated to be toxic whilst retaining all others (i.e. make a de facto non-toxic prediction for the latter compounds).
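The count of 112 candidate models, and the two rules for combining predictions, can be sketched in Python as follows. The option labels are placeholders standing in for the rows of Table 3.1 and Table 3.2, and the function names are hypothetical: this illustrates the logic described above and is not the code provided in Appendix A.

```python
from itertools import product

# Placeholder identifiers for the individual option sets (Tables 3.1 and 3.2).
toxtree_models = [f"toxtree_{i}" for i in range(1, 13)]  # 12 Toxtree option sets
dfw_models = [f"dfw_{i}" for i in range(1, 5)]           # 4 DfW option sets

# DfW likelihood levels interpreted as a 'positive' prediction (section 3.2.2.2);
# every other outcome (e.g. "equivocal", "open") is treated as 'negative'.
POSITIVE_DFW_LEVELS = {"certain", "probable", "plausible"}


def dfw_positive(level: str) -> bool:
    """Binarise a DfW likelihood level."""
    return level.lower() in POSITIVE_DFW_LEVELS


def combine(toxtree_pos: bool, dfw_pos: bool, rule: str) -> bool:
    """Combine two binary predictions.

    'either': positive if either sub-model is positive (favours sensitivity).
    'both':   positive only if both sub-models are positive (favours specificity).
    """
    if rule == "either":
        return toxtree_pos or dfw_pos
    if rule == "both":
        return toxtree_pos and dfw_pos
    raise ValueError(f"unknown combination rule: {rule}")


# Individual models plus every (Toxtree, DfW, rule) combination.
combined_models = list(product(toxtree_models, dfw_models, ("either", "both")))
n_models = len(dfw_models) + len(toxtree_models) + len(combined_models)
```

Under the 'both' rule a disagreement between the two programs yields a negative overall prediction, which is why this rule was expected to favour specificity.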
Toxtree Model Number
Outcomes Treated as Triggering a Positive Mutagenicity Prediction: Genotoxic SA fired; QSAR6 positive prediction; QSAR13 positive prediction
Corresponding Outcomes Treated as Triggering a Positive Carcinogenicity Prediction: Positive mutagenicity prediction; Nongenotoxic SA fired; QSAR8 positive prediction
1 ✓ ✓ ✓
2 ✓ ✓ ✓ ✓
3 ✓ ✓ ✓ ✓
4 ✓ ✓ ✓ ✓ ✓
5 ✓ ✓ ✓ ✓
6 ✓ ✓ ✓ ✓ ✓
7 ✓ ✓ ✓ ✓
8 ✓ ✓ ✓ ✓ ✓
9 ✓ ✓ ✓ ✓ ✓
10 ✓ ✓ ✓ ✓ ✓ ✓
11 ✓ ✓ ✓ ✓ ✓ ✓
12 ✓ ✓ ✓ ✓ ✓ ✓ ✓

Table 3.1 Combinations of options considered for predicting compounds as mutagens/carcinogens based upon the output generated by Toxtree. If any one of the ‘ticked’ outcomes occurred, a positive prediction of mutagenicity/carcinogenicity was made. If these conditions were not met, compounds were deemed to be predicted non-mutagenic/non-carcinogenic by Toxtree.

DfW Model Number
Outcomes Treated as Triggering a Positive Mutagenicity Prediction: In vitro mutagenicity in humans is certain, probable or plausible; Mutagenicity in Salmonella typhimurium is certain, probable or plausible
Corresponding Outcomes Treated as Triggering a Positive Carcinogenicity Prediction: Carcinogenicity in humans is certain, probable or plausible; Carcinogenicity in rodents is certain, probable or plausible
1 ✓ ✓ ✓
2 ✓ ✓ ✓
3 ✓ ✓ ✓
4 ✓ ✓ ✓

Table 3.2 Combinations of options considered for predicting compounds as mutagens/carcinogens based upon the output generated by DfW. If any one of the ‘ticked’ outcomes occurred, a positive prediction of mutagenicity/carcinogenicity was made. If these conditions were not met, compounds were deemed to be predicted non-mutagenic/non-carcinogenic by DfW.

3.3.3 ISSCAN Validation

Subsequent to generating Toxtree predictions using the GUI, and DfW predictions, by parsing the downloaded SDF, the final toxicity predictions, based upon all 112 options, were obtained by parsing the output files using Python scripts as per section 3.2.5. The field was used to match Toxtree and DfW predictions for the same molecule.
It was not possible to obtain any predictions for 12 out of the 1,153 entries in this database. Experimentally derived mutagenicity and carcinogenicity classes were determined by considering the summary assessments for carcinogenicity and mutagenicity based upon rodent studies and the Ames test, and recorded in the SDF fields and , respectively. By default, all performance estimates were generated by considering all equivocal results as corresponding to non-mutagens/non-carcinogens. On this basis, there were 387 and 440 experimentally assessed mutagens and non-mutagens respectively, out of the 1,141 compounds for which predictions could be generated. The corresponding numbers of carcinogens and non-carcinogens were 713 and 428 respectively. Upon counting compounds with equivocal results for these endpoints as mutagens/carcinogens, the number of mutagens and non-mutagens was changed to 399 and 428 (out of the same 1,141). The corresponding numbers of carcinogens and non-carcinogens were 781 and 360 respectively. For each endpoint, performance estimates were only generated based upon those compounds for which toxicity predictions and experimentally derived classes could be obtained. 3.4 Results Obtained from Application of Toxicity Models 3.4.1 ISSCAN Validation Detailed results, obtained from validation of all 112 models on the ISSCAN database, are made available with this thesis (Appendix A). These detailed results include the MCC and corresponding chi-squared p-value (as proposed in Chapter 2, section 2.6.4.1), computed using the CHIDIST( ) function in Excel 2007 (32-bit). The performance of all 112 models is summarised in the Receiver Operating Characteristic (ROC) graphs 277 presented in Figure 3.2, Figure 3.3, Figure 3.4 and Figure 3.5. The line denoting corresponds to the expected performance of a random predictor, better than random predictions lying above this line. 
277 As is indicated by these graphs, the estimated performance of the models was not changed considerably by changing the class label assigned to compounds with equivocal mutagenicity/carcinogenicity assessments. The maximum (median) absolute difference in carcinogen sensitivity was 0.04 (0.02), with all other sensitivity and specificity values changed by at most 0.01 (2dp) upon changing the class labels assigned for equivocal compounds.

The combination of prediction options selected (see sections 3.2.2.2, 3.2.3.2 and 3.2.4.2) corresponded to one of two combinations with the maximum sensitivity, out of those models with the maximum specificity, when applied to discriminating carcinogens from non-carcinogens. This was the case irrespective of whether compounds with equivocal experimental assessments were deemed carcinogens or non-carcinogens. Save for combinations of options with sensitivity values below 0.10, the selected model was one of those with the highest specificity value for separating mutagens and non-mutagens - again, irrespective of the class label assigned to compounds with equivocal experimental findings.

Figure 3.2 Results obtained from validating carcinogenicity predictions of all 112 models on the ISSCAN database. All equivocal carcinogens were considered non-carcinogens.

Figure 3.3 Results obtained from validating carcinogenicity predictions of all 112 models on the ISSCAN database. All equivocal carcinogens were considered carcinogens.

Figure 3.4 Results obtained from validating mutagenicity predictions of all 112 models on the ISSCAN database. All equivocal mutagens were considered non-mutagens.

Figure 3.5 Results obtained from validating mutagenicity predictions of all 112 models on the ISSCAN database. All equivocal mutagens were considered mutagens.
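For reference, the figures of merit reported for each model can be computed from a 2×2 confusion matrix as in the Python sketch below. The p-value calculation rests on an assumption: that the chi-squared statistic is N × MCC² with one degree of freedom (the standard relationship for a 2×2 contingency table), evaluated as an upper tail probability in the manner of Excel's CHIDIST function. The function name and the confusion-matrix counts are invented for illustration and are not taken from the ISSCAN results.

```python
import math


def figures_of_merit(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, MCC and a chi-squared p-value for a 2x2 table.

    Assumes both classes are present (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    # Assumed test statistic: chi-squared = N * MCC^2, one degree of freedom.
    n = tp + fp + tn + fn
    chi_squared = n * mcc ** 2
    # Upper tail of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi_squared / 2.0))

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "chi_squared": chi_squared,
        "p_value": p_value,
    }


# Invented counts, for illustration only.
result = figures_of_merit(tp=40, fp=10, tn=40, fn=10)
```

Each model contributes one point to a ROC graph, with sensitivity plotted against 1 − specificity; the MCC condenses the same confusion matrix into a single value between −1 and 1.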
3.4.1.1 Discussion of the Performance of the Selected Model

Table 3.3 summarises the performance of the selected model, used for filtering the subsets of the ZINC database provided to this author. In terms of the MCC values obtained for discriminating between mutagens and non-mutagens in the ISSCAN database, the selected model was one of the top performing models. Whilst this was not the case for discriminating carcinogens from non-carcinogens in this database, all models with higher MCC values corresponded to lower specificity values - and the minimisation of false positives was our priority in this work (see section 3.2.1). These results suggest the choice of selected model was reasonable.

Endpoint                                                          MCC    P-Value   Sensitivity   Specificity
Mutagenicity                                                      0.66   1.1E-80   0.76          0.89
Mutagenicity (Equivocal mutagens considered mutagens)             0.67   7.6E-83   0.76          0.90
Carcinogenicity                                                   0.33   5.5E-29   0.56          0.78
Carcinogenicity (Equivocal carcinogens considered carcinogens)    0.28   1.1E-21   0.52          0.78

Table 3.3 Performance of the selected model.

3.4.2 Application to Filtering of ZINC Datasets

The original SDF files passed to this author contained a cumulative total of 4,454 entries, corresponding to 4,379 unique ZINC codes. Following the application of the toxicity filter described in section 3.2, a cumulative total of 4,035 SDF entries remained, corresponding to 3,970 unique ZINC codes - i.e. 409 (9%) of the originally provided compounds were predicted to be both mutagenic and carcinogenic, hence were discarded. Of the 3,970 ZINC codes corresponding to those SDF entries which were not immediately discarded following application of the toxicity filter, 30 (113) ZINC codes corresponded to predicted mutagens (carcinogens). Hence, 439 (10%) of the 4,379 compounds present in the SDF files originally provided to this author were predicted mutagens, and 522 (12%) were predicted carcinogens. It was not possible to experimentally validate these predictions.
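The counts quoted above follow from simple bookkeeping over unique ZINC codes, which the short Python check below reproduces. The variable and function names are illustrative only.

```python
# Unique ZINC codes, as reported in the text.
total_codes = 4379           # provided to the toxicity filter
retained_codes = 3970        # remaining after the filter
retained_mutagens = 30       # retained codes predicted mutagenic only
retained_carcinogens = 113   # retained codes predicted carcinogenic only

# Discarded compounds were predicted to be both mutagenic and carcinogenic.
discarded = total_codes - retained_codes
predicted_mutagens = discarded + retained_mutagens
predicted_carcinogens = discarded + retained_carcinogens


def percent(part: int, whole: int) -> int:
    """Percentage of the whole, rounded to the nearest integer."""
    return round(100 * part / whole)


summary = {
    "discarded": (discarded, percent(discarded, total_codes)),
    "predicted_mutagens": (predicted_mutagens, percent(predicted_mutagens, total_codes)),
    "predicted_carcinogens": (predicted_carcinogens, percent(predicted_carcinogens, total_codes)),
}
```

Because every discarded compound counts towards both endpoints, the totals for predicted mutagens and predicted carcinogens are each obtained by adding the discarded count to the corresponding number of retained, singly flagged compounds.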
3.4.3 Toxicity Predictions for Experimentally Tested Compounds

Of the 148 compounds experimentally tested for type II DHQase inhibition, 10 were predicted carcinogenic by the model used for toxicity filtering (see section 3.2), with only one compound predicted to be mutagenic. This supposed mutagen was also predicted to be carcinogenic; it was nonetheless tested, since it was amongst those compounds selected from the original USR screen, prior to application of the toxicity filter. Considering all toxicity predictions generated by Toxtree and DfW, 32 of the experimentally assessed compounds were not predicted to be associated with any potential toxic liability by either program. For these 32 compounds, no Toxtree structural alerts for mutagenicity/carcinogenicity were fired, nor did any QSAR assessments indicate mutagenicity/carcinogenicity. Across all endpoints considered, any assessments made by DfW were that the toxic effect was "doubted", "improbable" or "impossible". However, for many of these compounds, DfW only generated an assessment for a few endpoints; this could mean that they were not covered by the rules in the DfW knowledge base for other endpoints, or that a structural alert did not fire due to modulating factors - as discussed in Chapter 2, section 2.5.

Of these 32 compounds, the lowest Ki (for the M. tuberculosis isoform) was 40 μM; the median Ki (for the M. tuberculosis isoform), across all 20 of these compounds for which a Ki was determined, was 120.5 μM - slightly higher than the median value of 115 μM obtained across all tested compounds for which a Ki was determined for this isoform.

Considering the most potent inhibitor of the M. tuberculosis isoform (Ki = 23 μM), DfW deemed phototoxicity and skin sensitization in rodent, hamster or human to be “plausible” (i.e., on balance, the evidence was deemed to support the proposition that this compound induced these toxicities).
160 Otherwise, no positive toxicity predictions were generated by DfW, with no Toxtree structural alerts fired or QSAR assessments indicating any toxic liability. The mechanisms which underpin skin sensitization 278 may also lead to allergic responses, or drug hypersensitivity, which may be severe, in response to oral drugs. 279 However, given the potentially lethal consequences of untreated tuberculosis, many adverse effects are commonly deemed acceptable in conjunction with anti-tuberculosis drugs. 280 It should also be noted that phototoxic side effects, such as those induced by the anti-tuberculosis drug sparfloxacin, may be avoided if patients adhere to instructions to avoid sunlight. 281 Hence, supposing that these predicted toxicities were to be experimentally verified, and lead optimisation was unable to remove them, the most potent M. tuberculosis isoform inhibitor found in our study could still be a viable lead candidate.* The toxicity predictions obtained for all experimentally tested compounds are summarised in an Excel file made available with this thesis (Appendix A).

* A variety of additional requirements exist, however, for a viable lead candidate - including the synthetic accessibility of more potent derivatives with acceptable ADME properties.18

3.5 Conclusions

Starting from a database of more than 8 × 10^6 drug-like compounds, a collaborative hierarchical virtual screening protocol was used to select 148 compounds for experimental assessment. Of these, 89 (91) were found to inhibit the M. tuberculosis (S. coelicolor) isoform of type II dehydroquinase with Ki < 500 μM. In a much more cost- and resource-effective manner, the strategy developed yielded comparable results to recent experimental high throughput screening programs designed to identify inhibitors of this enzyme.
268 A toxicity filter, developed by this author using commonly employed (hybrid)* expert systems (Derek for Windows™ and Toxtree), was used to identify compounds with potential mutagenic and carcinogenic liabilities, whilst minimising the loss of compounds as false 'positives' (i.e. incorrectly identified toxicants) during this early stage discovery effort. An assessment of different modelling options on the ISSCAN database indicated that the software programs were complementary, such that models could be developed with greater sensitivity or, in our case, greater specificity than was achievable with either program alone. ROC graph analysis and consideration of the MCC, and its associated chi-squared p-value, indicated that the selected models discriminated mutagens and carcinogens from non-mutagens and non-carcinogens better than would be expected for a random predictor. One potential limitation of these assessments, however, is the lack of mutagenicity/carcinogenicity validation data derived from human studies.

Considering both the inhibitory potencies with respect to the M. tuberculosis isoform and the lack of/nature of the anticipated toxicities for some of the identified inhibitors, it is possible that anti-tuberculosis lead candidates may have been identified. Further experimental work, including experimental toxicity screening assays - particularly for those toxicities anticipated in silico - is warranted to explore this possibility.

* See Chapter 2, section 2.5.

Chapter 4 Development and Assessment of Binary Classifiers for Identifying Potent hERG Inhibitors

This chapter describes work undertaken to develop models for binary classification of inhibitors of the human ether-à-go-go-related gene (hERG) potassium ion channel, according to toxicologically relevant potency thresholds proposed by the pharmaceutical industry.
The aim here was to make predictions based upon computationally inexpensive, 2D descriptors and the approaches were compared to those previously presented in the literature based upon these kinds of descriptors. The novel approaches were found to perform comparably to, or better than, the methods previously reported in the literature. It was discovered that the performance of some of the modelling approaches could vary dramatically. This variation occurred, for example, when training and validating the models on different datasets and, in the case of the pseudo-stochastic methods considered here, for different training and testing runs on the same data.

4.1 Introduction

As discussed in detail in Chapter 1, section 1.3.2, there is a clear need for rapid and reliable computational approaches which can discriminate between potent (IC50 < 1 μM) hERG inhibitors, the development of which is often discontinued, and weaker inhibitors. Ideally, these models would also be comprehensible to medicinal chemists to facilitate designing out hERG inhibition from a compound series. Whilst some models69,91,282,283 have been developed to discriminate between potent inhibitors (IC50 < 1 μM) and ‘non-inhibitors’ (IC50 > X μM, where X ≥ 10), these would be of limited value in drug discovery, where many compound series exhibit ‘moderate’ inhibition – with IC50 values in the range of 1-10 μM.15 Consequently, this author was interested in ligand-based, binary classifiers capable of separating 'blockers' (IC50 < 1 μM) from both moderate (1 μM ≤ IC50 < 10 μM) and weak (10 μM ≤ IC50) inhibitors.

Various studies in recent years have sought to develop binary classifiers of this kind. Both Li et al.90 and Tobita and co-workers284 developed classifiers based on Support Vector Machines (SVMs). Whilst Li et al. calculated 3D GRIND descriptors, after docking the structures into a homology model of the ion channel pore domain, Tobita et al.
calculated computationally inexpensive MACCS keys and other 2D descriptors using the Molecular Operating Environment (MOE) software program.285 Likewise, Dubus et al.91 and Thai and Ecker69,231 developed models based on 2D descriptors computed using the same software. In this chapter, novel approaches to generating binary classifiers of this kind are presented and their predictive performance directly compared, on various (partitions of) datasets, to approaches presented by Thai and Ecker69 and Dubus et al.91

4.2 Datasets

The modelling approaches developed by this author, as well as implementations of those presented by Thai and Ecker69 and Dubus et al.,91 were assessed using the following three datasets. Summary statistics for the Literature-368, Dubus-203 and Thai-313 datasets are presented in Table 4.1. The latter two datasets were primarily used to validate this author’s implementations of the binary classifiers proposed by Dubus et al. and Thai and Ecker, via (inexact) replications (see 4.4.5) of the Diverse Subset training and test partitions presented in their respective publications.69,91 The approaches developed here were also assessed on some of these training and test partitions respectively to provide additional comparisons between the different methods and assess dataset dependencies of performance estimates. Copies of the Literature-368 and Thai-313 datasets are presented, in SDF format, in the files made available with this thesis (Appendix A); the Dubus-203 dataset is available on request from Dr Elodie Dubus of Aureus Sciences.102

Literature-368 Dataset

Disjoint sets of 220 and 148 compounds were compiled from literature sources by this author. Since, initially, all models were trained using (subsets of, where specified) the former and evaluated on the latter, these are referred to as the Int-Set and ExtTest-Set, respectively.
The ExtTest-Set was compiled to ensure that it contained no compounds in common with those used for model development by Thai and Ecker,69 nor by Dubus et al.91 This allowed the ExtTest-Set to be used as a truly external test of this author’s models and theirs.

For binary classification, compounds with (arithmetic mean) pIC50 > 6 were assigned to class A (‘Active’) and those with 6 ≥ (arithmetic mean) pIC50 were assigned to class I (‘Inactive’). In total, the Int-Set contained 55 compounds in class A (strong inhibitors) and 165 compounds in class I, with the ExtTest-Set comprising 24 class A and 124 class I compounds. The Int-Set class I compounds included 70 moderate inhibitors with 6.00 ≥ (arithmetic mean) pIC50 > 5.00 and 95 weak inhibitors with 5.00 ≥ (arithmetic mean) pIC50. The numbers of moderate and weak inhibitors amongst the class I compounds in the ExtTest-Set were 50 and 74 respectively.

The starting point for the derivation of the Int-Set was the list of hERG inhibitors presented by Nisius and Göller in 2009.286 Compounds were initially removed from this set where the available, relevant hERG inhibition measurements (see below) indicated assignment to more than one of the potency categories defined above – i.e. strong, moderate and weak inhibitors. In order to maintain desirably high numbers in each potency category, these compounds were supplemented with hERG inhibitors presented in the primary literature as obtained from a Scopus search.

Another Scopus search was subsequently employed, in late 2009, to identify additional hERG inhibitors for inclusion in the ExtTest-Set. In order to avoid adding compounds which were included in the earlier studies by Thai and Ecker69 and Dubus et al.91 (see above), this Scopus search was (in contrast to before) restricted to the primary literature published in 2008-2009.
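The class and potency-category definitions given earlier in this section can be summarised as a small Python sketch. It assumes that each compound is represented by a list of retained pIC50 measurements and that the arithmetic mean is used for assignment, as described above; the function names are hypothetical and do not correspond to any script distributed with this thesis.

```python
import math
from statistics import mean


def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 in micromolar to pIC50 (-log10 of the molar IC50)."""
    return 6.0 - math.log10(ic50_um)


def assign_class(pic50_values):
    """Return (class label, potency category) from the arithmetic mean pIC50.

    Thresholds follow the text: mean pIC50 > 6 is class A (strong);
    6 >= mean pIC50 > 5 is class I (moderate); 5 >= mean pIC50 is class I (weak).
    """
    m = mean(pic50_values)
    if m > 6.0:
        return "A", "strong"
    if m > 5.0:
        return "I", "moderate"
    return "I", "weak"


# Hypothetical measurements, for illustration only.
print(assign_class([6.5, 6.1]))                 # class A, strong
print(assign_class([6.2, 5.6]))                 # mean below 6: class I, moderate
print(assign_class([pic50_from_ic50_um(10.0)])) # exactly 10 uM: class I, weak
```

Note that a compound with a mean pIC50 of exactly 6 (IC50 = 1 μM) falls into class I under these definitions.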
Additional compounds were identified via consulting a recent review of the primary literature67 in addition to correspondence with Dr Chris Swain of Cambridge MedChem Consulting.

Where possible, all measurements from secondary sources were checked against their primary literature references and only retained where these indicated that the experimental conditions criteria outlined below were not violated. Otherwise, where the secondary sources indicated that these criteria were not violated, measurements from these sources were usually retained. However, non-unique pIC50 values – for a given compound – from secondary sources were additionally discarded if it could not be determined that they corresponded to genuinely distinct measurements. In order to maximise the validity of the classes assigned, additional inhibition measurements – from the primary literature or patents – were sought for all compounds in this dataset via SciFinder Scholar 2007287 and recorded if they met the experimental criteria presented below. Due to time constraints, only sources published prior to 2010 were considered.

The selection of hERG inhibition measurements for each compound was designed to estimate the expected pIC50 range to which the compounds would be assigned if electrophysiologically assessed under the conditions typically employed by the pharmaceutical industry – since the 1 μM (pIC50=6) threshold delimiting class A and class I compounds was derived from such assessments.15,62 In order to minimise systematic errors, all inhibition measurements were, to the best of this author’s knowledge, obtained from electrophysiological assays in mammalian, heterologous expression systems. Values which were clearly not derived from inhibition of the tail current elicited in these assays67 were excluded.
These conditions are commonly used within the pharmaceutical industry.62,78 IC50 values which were explicitly noted to have been estimated from single concentration measurements were discarded, as these are typically less reliable (see Chapter 1, section 1.3.2.2). However, in the absence of IC50 values for a compound, single concentration inhibition measurements, which categorically established pIC50 values as > 6.00 or ≤ 5.00 - ‘40% inhibition at 10 μM’, say - were also used for class assignment. Some measurements were obtained using the medium-throughput QPatch,288 PatchXPress®79 and IonWorks™78 automated patch-clamp assays. IonWorks™ measurements which appeared to be potency underestimates289 were discarded in favour of higher potency measurements. This author inferred that all IonWorks™ measurements were obtained using a similar protocol to that employed by Bridgland-Taylor et al., which does not consistently underestimate or overestimate potency.78

In spite of taking care to minimise experimental inconsistencies, appreciable variability in the pIC50 values retained for some compounds is still evident. However, the median absolute difference in pIC50 values – not including measurements where the literature references simply indicated pIC50 > 6 or pIC50 < 5 – recorded for the same compound is only 0.35 log units (upper and lower quartiles: 0.69, 0.15). Moreover, for only five compounds were inhibition measurements, meeting the stringent criteria described above, obtained which indicated assignment to different classes or potency categories. In these cases, the mean pIC50 was used for class assignment (three compounds), and for additional categorisation of class I compounds as moderate or weak inhibitors (two compounds).

The topology of all structures was checked using either SciFinder Scholar 2007287 or literature sources. No pair of compounds in this dataset are stereoisomers. Whilst the measurements presented for some dataset entries (i.e.
compounds) were derived for a specific enantiomer, some entries correspond to a racemate – with all measurements assumed to be derived for the racemate, where applicable, unless explicitly stated otherwise in the literature references.

All pIC50 estimates used for class assignments are presented, along with literature references, in a Word document (Literature_368_References.doc) made available with this thesis (Appendix A). This document also presents the references consulted for the structures presented in SDF format (Appendix A) for this dataset.

Dubus-203 Dataset

This dataset, previously used by Dubus et al., and comprising 96 (107) compounds with IC50 ≤ 1 μM (IC50 > 1 μM), also determined using electrophysiological measurements in mammalian cells,91 was kindly provided to this author in SDF format by Dr Elodie Dubus. Here, the classes are referred to as ‘A’ and ‘I’, and all class A compounds referred to as strong inhibitors, as per the Literature-368 dataset. The class I compounds comprised 48 “moderate” inhibitors (1 μM < IC50 < 10 μM) and 59 “weak” inhibitors (IC50 ≥ 10 μM),91 essentially defined as per the Literature-368 dataset.

Thai-313 Dataset

The structures in this dataset, previously used by Thai and Ecker,69 were in part (285 compounds) kindly provided to this author in SDF format by Dr Khac-Minh Thai, with the remaining structures obtained using SciFinder Scholar 2007287 or the primary literature references provided by Thai and Ecker.69 This dataset contained 100 (213) compounds in class A (I) - as defined for the Literature-368 dataset - with class labels assigned using the IC50 values provided by Thai and Ecker.69 Based on these IC50 values, the class I compounds comprised 97 moderate and 116 weak inhibitors – as defined for the Literature-368 dataset.

Dataset | Total no. of Compounds | Class A | Class I | Class I: Moderate Inhibitors | Class I: Weak Inhibitors
Literature-368 | 368 | 79 | 289 | 120 | 169
Dubus-203 | 203 | 96 | 107 | 48 | 59
Thai-313 | 313 | 100 | 213 | 97 | 116

Table 4.1 Numbers of hERG inhibitors and their distribution amongst the potency categories for the datasets modelled in this chapter.

4.3 Model Development and Validation

As detailed in sections 4.3.1 and 4.3.2, a variety of different 2D descriptor sets were considered for building the models developed in this work, along with a number of different Machine Learning algorithms. Descriptors and hyperparameters were selected using stratified, five-fold cross-validation (5CV) on the training set (see Chapter 2, section 2.6.5). In order to avoid optimistic bias in their estimated performance, all these models were validated on external test sets - i.e. datasets containing none of the instances used to select the descriptor sets or hyperparameters used to build the models and obviously containing none of the data used to train the models – after selecting the hyperparameters and descriptor sets (see Chapter 2, section 2.6.5). The selected descriptor sets and hyperparameters were those which maximised the mean value (across all five validation sets) for the Matthews Correlation Coefficient (MCC) - defined in Chapter 2, section 2.6.4.1.* All learning algorithms, in conjunction with the selected descriptor sets and hyperparameters, were then trained on the entire training set, or subsets thereof, where specified, and evaluated on the corresponding external test set.

4.3.1 Machine Learning Algorithms Considered

The Machine Learning algorithms utilised to generate the models developed in this work are presented below. Save where otherwise specified, all algorithmic implementations were used with their default settings. Detailed descriptions of these algorithms and the meaning of the selected hyperparameters are provided in Chapter 2, section 2.6.3.2.
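The selection protocol described in section 4.3 (stratified 5CV, choosing the candidate that maximises the mean MCC across the five validation sets) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the fold assignment scheme, the convention of returning an MCC of zero when the denominator vanishes, and all function names are introduced here purely for illustration.

```python
import math
import random
from collections import defaultdict


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returning 0.0 for a zero denominator is an assumed convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def stratified_folds(labels, k=5, seed=0):
    """Split instance indices into k folds, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin per class
    return folds


def select_best(cv_mcc_by_candidate):
    """Pick the candidate with the highest mean MCC over its validation folds.

    Ties are resolved by dictionary order, i.e. arbitrarily.
    """
    return max(
        cv_mcc_by_candidate,
        key=lambda name: sum(cv_mcc_by_candidate[name]) / len(cv_mcc_by_candidate[name]),
    )


# Class sizes of the Int-Set (55 class A, 165 class I) are taken from the text.
folds = stratified_folds(["A"] * 55 + ["I"] * 165, k=5)
print([len(fold) for fold in folds])
```

Each candidate (a descriptor set paired with a hyperparameter combination) would be trained on four folds and scored on the fifth, with the five MCC values passed to `select_best`.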
Winnow

Initially, models were developed using Nigsch’s adaptation of Littlestone’s Winnow algorithm.184 Instances are presented to Winnow as a set of ‘features’ – text strings which may either be present or absent in an instance.† In the current context, the instances are compounds and the features correspond to descriptors, discussed in section 4.3.2. Here, four distinct hyperparameter combinations were initially considered, in conjunction with 94 different feature sets. These hyperparameter combinations entailed:

1. One scorer, combined with the original promotion (p) and demotion (d) factors (collectively 'update factors') proposed by Nigsch.184
2. Five scorers and the original update factors.
3. One scorer and constant (p=1.1 and d=0.9) update factors.
4. Five scorers and constant update factors.

* If multiple possibilities maximised this, a single, arbitrary, set was chosen.
† When not discussing this author’s models, the term features is used more generally to refer to explanatory variables.

When the feature sets included “orthogonal sparse bigrams” (OSBs), generated from the initial feature sets (“monograms”), as described by Nigsch,184 ε (see Chapter 2, section 2.6.3.2) was set to 0.15, as per Nigsch,213 with ε set to zero for all feature sets composed solely of monograms. All combinations were evaluated by presenting the 5CV training sets to Winnow in a common, single order.

Winnow Based Feature Selection

The top ranking* hyperparameter-feature set combination, out of the 376 (4×94) possibilities, defined the top feature set, used to generate predictive models using all subsequent learning algorithms. For all algorithms other than Winnow (see below), feature sets were encoded as bit-strings, with 1 (0) assigned when a training set feature was present (absent) in an instance.
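The bit-string encoding described above can be illustrated with a short Python sketch. It assumes that each instance is held as a set of feature strings and that bit positions are defined by the sorted set of features observed in the training set; the function names and the example feature strings are hypothetical.

```python
def build_vocabulary(training_feature_sets):
    """Ordered list of every feature observed in the training set."""
    return sorted(set().union(*training_feature_sets))


def to_bit_string(features, vocabulary):
    """Encode an instance: 1 if a training set feature is present, else 0.

    Features never seen in the training set have no bit position and are
    therefore ignored (an assumption consistent with the description above).
    """
    present = set(features)
    return [1 if feature in present else 0 for feature in vocabulary]


# Hypothetical feature strings, for illustration only.
training = [{"fp:a", "fp:b"}, {"fp:b", "logP_bin1"}]
vocabulary = build_vocabulary(training)
print(vocabulary)
print(to_bit_string({"fp:b", "unseen_feature"}, vocabulary))
```

Fixing the vocabulary from the training set alone ensures that test set instances are encoded with the same bit positions as the instances used to train the model.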
Multiple Winnow Training Cycles

An additional Winnow model was generated using multiple training cycles, in conjunction with the aforementioned top ranking hyperparameter-feature set combination. For each training cycle, the training set instances were presented to Winnow in the same order. The number of training cycles was increased from one to 100, in steps of one, and the top ranking number of cycles selected to build the final model. The implementation of Winnow, in C++, used here was largely identical to that previously used by Nigsch and is made available with this thesis (Appendix A).

Estimated Winnow Performance

All external test set results presented for Winnow are the mean over 50 different training set presentation orders.

* Based on the maximum training set 5CV mean MCC – as per all model selection discussed at the start of section 4.3.

Support Vector Machine

The Support Vector Machine (SVM) algorithm was used with the Radial Basis Function (RBF) kernel (see Chapter 2, Equation 2.7). In addition to the kernel hyperparameter $\gamma$, the regularisation constant (C) was also varied (see Chapter 2, Equation 2.5). As proposed by Morik et al.,290 the use of the following non-uniform cost-factor (Equation 4.1) was considered (a 'uniform cost-factor' means: $C_{+} = C_{-}$), in order to reduce bias in favour of the majority class in the training set:

\[ \frac{C_{+}}{C_{-}} = \frac{n_{-}}{n_{+}} \tag{4.1} \]

Here, $C_{+}$ ($C_{-}$) is the relative weight given to the sum of the slack-variables (see Chapter 2, Equation 2.7) in the ‘positive’ (‘negative’) class and $n_{+}$ ($n_{-}$) the number of training set instances in this class. Here, class A was considered the 'positive', and class I the 'negative', class. Joachims’ SVMlight (version 6.02) implementation of SVM was used.291 Efforts were undertaken to optimise $\gamma$, C and the cost-factor [SVMlight options: -g, -c, -j], using an adaptation of the grid-search procedure recommended by Hsu et al.
221 Initially, all combinations of { } - where denotes the SVMlight default – and { }, in conjunction with a uniform and non-uniform cost-factor, were considered. Additional adjustment of C and γ was subsequently carried out, in conjunction with the cost-factor used in the initially top ranking parameter set (C= , γ= ). If ≠ , all combinations corresponding to varying and , in steps of 0.25, between ± 1.75 and ± 1.75, respectively, were considered. When = , only was varied in this fashion, and the top ranking parameter set for which ≠ was also considered for optimisation in the usual fashion.

Random Forest

The R 225 implementation of Random Forest (randomForest) was used here. Since Random Forest (RF) is noted to perform well ‘off the shelf',192 the default hyperparameters were used throughout (see Chapter 2, section 2.6.3.2).

Estimated Random Forest Performance

All results presented with this algorithm are the mean of the results obtained from training and validating the model with five different random number generator (RNG) seeds.

4.3.2 Feature Sets Available to the Models Developed in this Work

The feature sets considered by this author for generating models included one or both of two types of features: circular fingerprint features and 'discretized descriptor features' corresponding to numerical 2D descriptors (discussed below). The computational workflow (see section 4.3.2.1) used to generate these meant that only molecular connectivity, and not inconsistently available stereochemical information, was encoded by these features.

Circular Fingerprint Features

Three circular fingerprints, encoding topological molecular substructures, were calculated: CFP2,213 ECFP_4 and FCFP_6,185 the former a version of the MOLPRINT 2D fingerprint developed by Bender et al.292 and the latter two developed by SciTegic.
94,213 The former two were previously used to build protein-target prediction models using Winnow by Nigsch et al.,213 whilst the latter was used to build predictive models for hERG inhibition by O'Brien and de Groot.94 The constituent features of these fingerprints, calculated for a given molecule, correspond to a set of possible* non-predefined atom centred topological substructures present in the molecule; each feature is centred on an atom in the molecule and encodes information about atoms located within a (maximum) number of bonds from the central atom (Figure 4.1). The size of the encoded topological substructures, the type of information and manner in which this information is encoded, varies between the three fingerprints.

* As illustrated in Table 4.4, for an example ECFP_4 feature, similar substructures may map onto the same circular fingerprint features.

Figure 4.1 Two examples of topological substructures encoded as features, for an example molecule, by a generic circular fingerprint considering environments extending up to two bonds from the central atom.

Discretized Descriptor Features

These were generated using in-house Python scripts,275 a clearly documented version of which is made available with this thesis (Appendix A). For every numerical descriptor calculated, a compound is assigned a feature corresponding to the range, out of a predefined set, that its value lies within. These ranges were delimited by split points chosen using the distribution of the classes with respect to the training data ordered by their values for the descriptor under consideration. This procedure is illustrated in Figure 4.2. Two different discretization methods were considered for choosing the split points:

i. For all changes in class label going along the ordered training set, the midpoint between the descriptor values of instance K with class CK and instance J with class CJ is chosen as a split point.
ii.
Fayyad and Irani’s method,293 as implemented via the Orange module (version 1.0).294

The latter method also considers the values where there is a change in class for the ordered training set as possible split points for a given descriptor. Split points are, however, selected recursively: at most, a single split point is chosen for the original training set and the same procedure is then applied to the partitioned subsections of the training set until none of the possible split points is deemed ‘valid’. Possible split points, for a given (subset of the) training set, are only deemed ‘valid’ if the following condition holds (Equation 4.2). Out of all such ‘valid’ split points, the split point selected is the one which maximises the first term (LHS) in this expression. This first term, the “information gain”, is a measure of the extent to which the uncertainty in the class label for a randomly selected instance in the (subset of the) training set is reduced upon knowing which side of the split point it lies. The overall form of this expression is designed to prevent excessive partitioning of the training set – whereby the complexity of specifying a split point is insufficiently justified by the extent to which the partition separates instances in different classes.293

\[ \mathrm{Gain}(D, t) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(D, t)}{N} \tag{4.2} \]

In Equation 4.2, the ‘$\mathrm{Gain}(D, t)$’ and ‘$\Delta(D, t)$’ are given by equations 4.3 and 4.4 respectively. In the following equations, $k$, $k_{\le}$ and $k_{>}$ denote the total number of classes represented in the original (subset of the) training set, comprising $N$ instances, and the two subsets created by the split point respectively – where the split point corresponds to a value ($t$) of the descriptor of interest ($D$). The fraction of instances in the original (subset of the) training set lying either side of this split point are denoted by $f_{\le}$ and $f_{>}$. The fraction of instances in the original (subset of the) training set belonging to class $i$ is given by $p_i$. Out of those instances assigned to one of the newly created subsets, the fraction belonging to class $i$ is given by $p_{i,\le}$ or $p_{i,>}$.

\[ \mathrm{Gain}(D, t) = H - f_{\le} H_{\le} - f_{>} H_{>} \tag{4.3} \]

\[ \Delta(D, t) = \log_2\left(3^{k} - 2\right) - \left[ k H - k_{\le} H_{\le} - k_{>} H_{>} \right] \tag{4.4} \]

where

\[ H = -\sum_{i} p_i \log_2 p_i, \qquad H_{\le} = -\sum_{i} p_{i,\le} \log_2 p_{i,\le}, \qquad H_{>} = -\sum_{i} p_{i,>} \log_2 p_{i,>} \tag{4.5} \]

with each sum running over the classes represented in the corresponding set.

For brevity, feature sets generated using the former and latter discretization methods are denoted by AC and FI respectively.

Four different numerical, 2D, descriptor sets were used to generate 'discretized feature sets’. These were:

i. The P_VSA descriptors, proposed by Labute to be widely applicable for QSAR modelling,295 and successfully applied to hERG inhibition modelling by Dubus et al.91 and by Thai and Ecker.69,231,296 As per these earlier studies, these were calculated via MOE.
ii. The 23 "relevant" MOE descriptors used by Dubus et al. in their "model 1".91
iii. The 11 "relevant" MOE descriptors selected by Thai and Ecker.69
iv. A set of estimated physicochemical properties (logP, maximum basic pKa, minimum acidic pKa) and a topological index (Wiener Index),182,297 calculated using ChemAxon’s cxcalc tool.298 Where no basic (acidic) pKa values were reported, the descriptor values were set to -10 (+20), the default minimum (maximum) reported value.

The constitution of descriptor set iv was informed via existing understanding of molecular properties relevant for hERG inhibition. The significance of variations in logP, basic pKa and the incorporation of acidic moieties (i.e. variations in acidic pKa) for hERG inhibitory potential, along with mechanistic rationales, was previously highlighted by Jamieson et al.299 The Wiener Index increases with the number of atoms, and - for a constant number of atoms - is maximised for a linear structure,182 i.e. may partially capture molecular shape, including size, which has previously been indicated to be important for hERG inhibition.300

For brevity, descriptor sets ii, iii and iv, and all feature sets generated from them, are denoted Dubus-Rel, Thai-Rel and CA respectively.
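The two discretization methods described in this section, and the subsequent assignment of a discretized descriptor feature, can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions rather than the in-house scripts or the Orange implementation: tied descriptor values with different class labels are assumed not to generate a split point, the validity test follows the Fayyad and Irani criterion (Equation 4.2) without the recursive partitioning, and all function and feature names are hypothetical.

```python
import math
from collections import Counter


def class_change_midpoints(values, labels):
    """Method i (AC): midpoints wherever the class changes along sorted values."""
    pairs = sorted(zip(values, labels))
    splits = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:  # assumption: ties do not create split points
            splits.append((v1 + v2) / 2.0)
    return splits


def entropy(labels):
    """Class entropy (in bits) of a collection of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_is_valid(values, labels, t):
    """Validity test of Equation 4.2 for a single candidate split point t."""
    left = [c for v, c in zip(values, labels) if v <= t]
    right = [c for v, c in zip(values, labels) if v > t]
    if not left or not right:
        return False
    n = len(labels)
    gain = (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))
    k, k_le, k_gt = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k_le * entropy(left) - k_gt * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n


def assign_feature(descriptor_name, value, splits):
    """Name the range (bin) that a descriptor value falls into."""
    bin_index = sum(value > s for s in splits)
    return f"{descriptor_name}_bin{bin_index}"


# Hypothetical training data for a single descriptor, for illustration only.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["I", "I", "I", "A", "A", "A"]
splits = class_change_midpoints(values, labels)
print(splits, split_is_valid(values, labels, splits[0]))
print(assign_feature("logP", 4.2, splits))
```

In a full implementation of the second method, the best valid split point would be chosen and the same test then applied recursively to the two resulting subsets until no valid split points remain.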
Figure 4.2 The assignment of 'discretized descriptor features' corresponding to descriptor D.

4.3.2.1 Descriptor Calculation

Prior to calculating ChemAxon descriptors, all molecular structures were standardized using ChemAxon’s standardizer tool* and retained in SDF format. ChemAxon’s molconvert tool was then used to generate MOL2 files from the standardized SDFs, from which CFP2 fingerprints, as implemented by Nigsch,213 were calculated. Prior to SciTegic fingerprint calculation, using the Molecular Fingerprint component in Pipeline Pilot Student Edition,301 and MOE descriptor calculation, using MOE’s sddesc tool, explicit hydrogens were added to the standardized structures using ChemAxon’s standardizer tool.

All structures in the Dubus-203 and Thai-313 datasets were updated with new, unique, IDs prior to standardizing, using the Pybel module, with all Thai-313 structures being similarly updated prior to generating Binary QSAR models (see section 4.4.2).177 One structure in the Dubus-203 set, which was not correctly standardized by ChemAxon’s standardizer tool, was manually standardized† prior to calculating CFP2 and SciTegic fingerprints. In 14 cases, explicit hydrogens were not fully removed when standardizing structures in the Thai-313 dataset.

4.3.2.2 Feature Sets Searched

In total, 94 feature sets were generated. Each fingerprint type corresponded to one feature set, as did all eight combinations of each of the four numerical descriptor sets with both discretization methods. All combinations of discretized feature sets with fingerprint feature sets (3×8) yielded 24 additional feature sets. All the constituent features of these feature sets can be considered “monograms”.184,213 For each monogram set, “orthogonal sparse bigrams” (OSBs) – non-exhaustive combinations of pairs of monograms – were generated, as previously described by Nigsch,184,213 using a window-size of three (Figure 4.3).
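To give a flavour of how orthogonal sparse bigrams can be derived from an ordered list of monograms, a generic Python sketch is shown below. It follows the general OSB idea of pairing each feature with the features preceding it inside a sliding window and recording the number of skipped positions; the exact procedure used by Nigsch (Figure 4.3) is not reproduced here, so the feature ordering, the separator and the function name should all be read as assumptions.

```python
def orthogonal_sparse_bigrams(monograms, window=3):
    """Generate generic OSBs from an ordered list of monograms.

    Each feature is paired with every feature up to (window - 1) positions
    before it; the number of skipped positions is stored in the bigram so
    that pairs at different distances remain distinct features.
    """
    bigrams = []
    for right in range(len(monograms)):
        for gap in range(1, window):
            left = right - gap
            if left < 0:
                break  # no earlier features remain inside the window
            skipped = gap - 1
            bigrams.append(f"{monograms[left]}|{skipped}|{monograms[right]}")
    return bigrams


# Hypothetical monograms, for illustration only.
print(orthogonal_sparse_bigrams(["a", "b", "c", "d"], window=3))
```

Because only pairs lying within the window are generated, the number of OSBs grows linearly with the number of monograms rather than quadratically, which is why the combinations are described as non-exhaustive.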
Thus, an additional 35 feature sets were generated based on the combination of all 35 monogram feature sets with their corresponding OSBs. Finally, all combinations of the three new (monograms and OSBs) fingerprint feature sets with all eight (monogram only) discretized feature sets yielded 24 further feature sets. The presence of OSBs in the feature sets should allow for synergies between the monograms to be taken into account by linear classifiers, such as Winnow. 213 The generation of all 94 feature sets evaluated using Winnow is summarised in Figure 4.4.

* The standardizer configuration files are made available with this thesis (Appendix A).
† Full details are provided in FINAL_Dubus203_README-FT.txt, which is made available with this thesis (Appendix A).

Figure 4.3 The procedure employed to generate orthogonal sparse bigrams (OSBs) as additional features for a generic molecule and generic original set of features (monograms) using a window size of three.

Figure 4.4 A summary of the generation of the 94 feature sets – comprising fingerprint features, discretized descriptor features or both – evaluated in the current work.

4.4 Comparisons with Models Developed by Thai and Ecker and by Dubus et al.

4.4.1 Background

Recent studies by Dubus et al. 91 and Thai and Ecker 69 indicated that their binary classification models, using the same threshold as here, were highly predictive. Thus, this author was keen to see how well the models developed here compared with theirs when trained and externally validated on the same datasets. Dubus and co-workers developed models using MOE’s QuaSAR-Classify tool, using the P_VSA and Dubus-Rel descriptor sets referred to above (section 4.3.2). 91 Similarly, Thai and Ecker developed Binary QSAR models based on the aforementioned (section 4.3.2) P_VSA and Thai-Rel descriptor sets. 69 This author implemented all four models using MOE.
4.4.2 Additional Protocols for Descriptor Calculation

To allow fairer direct comparisons with the modelling approaches proposed by Dubus et al. and Thai and Ecker, and to better assess the dataset dependencies of model performance, this author calculated the descriptors used to build these models using computational procedures more closely reflecting those employed by them (private correspondence with Dr Ismail Ijjaali and Dr Khac-Minh Thai respectively). These procedures started with the structures available prior to any previously described standardization and are referred to, here, as the ‘Dubus Standardization’ and ‘Thai Standardization’ procedures.*

• Dubus Standardization: ChemAxon’s standardizer tool was used to remove extraneous molecular fragments and add explicit hydrogens. Subsequently, ChemAxon’s cxcalc tool was used to calculate major protonation states (pH 7.4), with explicit proton assignment [cxcalc majormicrospecies -H 7.4 -f sdf:H]. The resulting structures, in SDF format, were imported into a MOE database and descriptors were calculated using the GUI, then stored in SDF format.
• Thai Standardization: All structures, post-removal of extraneous fragments and processing with ChemAxon’s standardizer tool, were imported into a MOE database and Energy Minimized – followed by descriptor calculation using the GUI.

* The ChemAxon standardizer configuration files are made available with this thesis (Appendix A).

4.4.3 Comparisons with QuaSAR-Classify

As described in the MOE documentation, QuaSAR-Classify employs a recursive partitioning algorithm (see Chapter 2, section 2.6.3.2) to grow a single decision tree from the training set. The Node Split Size and Max Tree Depth were set to five and 15 respectively, with random five-fold cross-validation on the training data being used for pruning, in keeping with Dubus et al. 91 All QuaSAR-Classify experiments were run using SVL scripts.
All QuaSAR-Classify models were run 50 times within the same MOE session, and, unless stated otherwise, the results presented are the mean over these runs. Since Dubus et al. suggested that their models would perform sub-optimally using 'unbalanced' training sets, 91 when the training sets used to select the models developed by this author were unbalanced, subsets of these training sets were generated for training models subsequently evaluated on the corresponding external test sets. These 'balanced' subsets were generated using MOE’s Diverse Subset tool to remove the least diverse compounds from the majority class until the number of remaining compounds equalled the number of compounds in the minority class (i.e. class A). Diverse Subset rankings were generated independently for each class, using (non-bit packed) MACCS keys and the Tanimoto coefficient similarity metric, after importing the structures available after calculation of MOE descriptors, for the models developed by this author, into a MOE database. For computational expediency, only the previously selected Winnow models were re-evaluated after regenerating all discretized features using the balanced training sets.

4.4.4 Comparisons with Binary QSAR

All models were trained with the Component Limit set to 8, as per Thai and Ecker. 69

4.4.5 Diverse Subset Partitions of the Thai-313 and Dubus-203 Datasets

The Diverse Subset train/test splits of the Dubus-203 91 and Thai-313 69 datasets, respectively used by Dubus et al. and Thai and Ecker to evaluate their models, were sought - principally in order to validate the implementations of their models used in the current work, which employed a later version of MOE (Appendix C). For the Thai-313 dataset, Diverse Subset calculation utilised all 184 2D MOE descriptors, calculated using the ‘Thai Standardization’ procedure, and bioactivity values (-log10(IC50 (µM))) calculated from the IC50 values provided by these authors.
69 Values of 4.0 and -4.0 were assigned when they reported the IC50 as << 1 µM, and >> 10 µM respectively. Diverse Subset calculation was run from the MOE GUI, using the Euclidean Distance, scaled to unit variance, to assign diversity rankings, with the 240 most diverse compounds selected for the training set. This procedure conformed to that used by Thai and Ecker (inferred from private correspondence with Dr Khac-Minh Thai). Since the outcome of the Diverse Subset protocol is dependent on the identity of the first compound in the database, this author generated all four possible Diverse Subset splits corresponding to a test set of 20 actives and 53 inactives (the closest match to the split presented by Thai and Ecker). 69 These were identified by computing all 313 possible splits using an in-house Python implementation of the Diverse Subset algorithm; 309 of these 313 conceivable splits were excluded, as they did not correspond to a 20:53 division. For the Dubus-203 dataset, Diversity rankings were computed using (non-bit packed) MACCS keys and the Tanimoto coefficient similarity metric. The Diverse Subset procedure was separately applied to the ‘Dubus Standardization’ processed structures in classes “A” and “I” (referred to as “High” and “Weak” by Dubus et al.),91 to select the 80 maximally diverse structures in each class for training, with the remaining compounds being used as a test set. The models developed by this author were also trained and selected on the Dubus-203 and one of the Thai-313 Diverse Subset training sets, * and then ‘externally’ validated on the corresponding test sets. As explained below, models generated using the Dubus-Rel and Thai-Rel descriptor sets may not have been truly externally validated using this test protocol. 
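The dependence of a diversity-based split on the first compound can be illustrated with a short sketch. It assumes, purely for illustration, that diversity ranking is a greedy max-min procedure (each new compound is the one farthest from those already ranked); the algorithm actually implemented in MOE's Diverse Subset tool, and the in-house implementation referred to above, may differ. Scaling to unit variance is omitted, and the function names, toy descriptor values and class labels are all invented.

```python
import math


def maxmin_ranking(points, seed):
    """Greedy max-min diversity ranking starting from the compound at index `seed`."""
    order = [seed]
    remaining = set(range(len(points))) - {seed}
    # Distance from every unranked compound to its nearest ranked compound.
    nearest = {i: math.dist(points[i], points[seed]) for i in remaining}
    while remaining:
        # Ties are broken by the lowest index so that results are reproducible.
        pick = max(sorted(remaining), key=lambda i: nearest[i])
        order.append(pick)
        remaining.remove(pick)
        for i in remaining:
            nearest[i] = min(nearest[i], math.dist(points[i], points[pick]))
    return order


def splits_matching(points, labels, n_train, test_counts):
    """Try every seed; keep distinct test sets whose class counts equal `test_counts`."""
    kept = set()
    for seed in range(len(points)):
        test = frozenset(maxmin_ranking(points, seed)[n_train:])
        counts = {}
        for i in test:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        if counts == test_counts:
            kept.add(test)
    return kept


# Toy two-descriptor data set (values invented for illustration).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (2.0, 0.0), (2.0, 0.2)]
labels = ["A", "I", "A", "I", "I", "A"]
print(splits_matching(points, labels, n_train=4, test_counts={"A": 1, "I": 1}))
```

Looping over every possible seed and filtering on the class composition of the resulting test set mirrors the exclusion of 309 of the 313 conceivable splits described above.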
* The train/test partition of the Thai-313 dataset used was the partition for which the highest test set MCC values were obtained using both Binary QSAR models; this enabled a direct comparison between the modelling approaches proposed in this work and the best performance achieved, by this author, with the Binary QSAR models.

4.5 Avoiding Feature Selection Bias: Calculating Overlap between Different Datasets

The Thai-Rel and Dubus-Rel descriptor sets were selected using the entirety of the Thai-313 and Dubus-203 datasets respectively (private correspondence with Dr Khac-Minh Thai and Dr Elodie Dubus). This author prefers the practice of only using the training set for feature selection, to avoid any risk of an optimistic bias on the test set arising from its having been included in the feature selection process. * No structures were identified in common between the ExtTest-Set and the Int-Set, or between the ExtTest-Set and either the Thai-313 or the Dubus-203 dataset. This prevented the possibility of incurring feature selection bias when the ExtTest-Set was used for model validation. In practice, such bias could always be avoided for the modelling approaches proposed in the current work, when the ‘external’ test sets overlapped with the Dubus-203 (Thai-313) dataset, if the feature sets based on the Dubus-Rel (Thai-Rel) descriptors were not selected to build the final models. The structures in common between different datasets were identified by comparing them using stereochemically indifferent, canonical SMILES 302 generated using ChemAxon’s molconvert tool (molconvert smiles:0), from the structures used for SciTegic fingerprint calculation. † This procedure also identified 107 (87) compounds in the Int-Set with structures in the Thai-313 (Dubus-203) dataset. This procedure was used to identify all dataset overlaps referred to below.
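Once every structure has been reduced to a canonical string, the overlap calculation is a set intersection. The sketch below assumes that the SMILES have already been canonicalised, with stereochemistry removed, by a single toolkit (for example via the molconvert call quoted above); the helper name `dataset_overlap` and the example molecules are hypothetical and are not taken from the thesis datasets.

```python
def dataset_overlap(smiles_a, smiles_b):
    """Return the canonical SMILES strings present in both data sets."""
    return set(smiles_a) & set(smiles_b)


# Hypothetical canonical, stereo-free SMILES for two small data sets.
training_set = ["CCO", "c1ccccc1", "CCN(CC)CC"]
external_set = ["CC(=O)O", "CCN(CC)CC", "c1ccncc1"]

shared = dataset_overlap(training_set, external_set)
print(sorted(shared))

# Compounds of the external set that are genuinely unseen by the training set.
unseen = [s for s in external_set if s not in shared]
print(unseen)
```

String comparison only detects identical canonical forms, so, as noted in the text, different tautomers of the same compound would not be matched by this approach.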
* The good Diverse Subset test set results obtained by both sets of authors using their P_VSA models, and all results obtained by Thai and Ecker on an external test set of 58 compounds,69,231 are unaffected by such issues.
† These comparisons may not fully capture the overlap between datasets due to the presence of different tautomeric forms etc.

4.6 Results and Discussion

4.6.1 Overall Model Performance

Table 4.2 and Table 4.3 summarise the MCC values obtained using the approaches developed in this work when validated on various ‘external’ test sets - i.e. sets of compounds neither used to directly select the hyperparameters or features employed by, nor train, the models. In addition to presenting results for the original Int-Set/ExtTest-Set split, results are presented for randomly generated ‘Int/Ext’ splits of the Literature-368 dataset to provide a more robust assessment of the expected performance of the modelling approaches. All random splits were generated such that the same numbers of strong, moderate and weak inhibitors were maintained in each Int/Ext split as per the Int-Set/ExtTest-Set. The MCC values obtained with this author’s implementation of the Binary QSAR and QuaSAR-Classify models on these test sets (when trained on the same data as the models developed in this work) are presented for comparison.

The ‘external’ test sets for the random partitions of the Literature-368 dataset and the Diverse Subset test sets for both the Dubus-203 and Thai-313 datasets used to ‘externally’ validate this author’s models, overlapped with both the Thai-313 and Dubus-203 datasets. The Dubus-203 (Thai-313) test set contained approximately 17 (15) compounds in the Thai-313 (Dubus-203) dataset. For the random partitions of the Literature-368 dataset - random:1, random:2 and random:3 - there were approximately 40 (34), 41 (29) and 38 (29) ‘external’ validation set compounds, respectively, in the Thai-313 (Dubus-203) dataset.
Thus, when ‘biased’ features based on the Dubus-Rel or Thai-Rel descriptor sets were selected, additional models were generated based on an ‘unbiased’ feature selection protocol – excluding any such feature sets from selection. The consistently better results obtained on these test sets with this author’s models, including when ‘unbiased’ feature sets were selected, suggest that the poorer results obtained for the original Int-Set/ExtTest-Set split were not due to the absence of feature selection bias. Rather, as illustrated in Figure 4.5, these poor results appear to reflect extrapolation beyond the applicability domain of the models: many compounds in the original ExtTest-Set appear to lie outside of the chemical space occupied by the Int-Set, yet this is not apparent for the random Int/Ext splits of the Literature-368 dataset (Figure 4.6). Figure 4.7 likewise indicates that the Diverse Subset test sets, for which results are presented in Table 4.3, lay within the applicability domain of the models trained on the corresponding training sets. The compositions of all train/test partitions are presented in CSV files made available with this thesis (Appendix A), along with additional details required to exactly reproduce results obtained for the pseudo-stochastic methods used to develop models in this work: Winnow and Random Forest. *

The range of MCC values obtained on the ‘external’ validation sets, with either feature selection protocol, is comparable with those previously reported in the literature (Appendix B, Table B.1). Save for two cases, all models developed by this author, trained and validated on the same data, achieved higher (mean) MCCs than all Binary QSAR and QuaSAR-Classify models, even when truly external validations of this author’s models are presented and not of these others (see section 4.5).
Figure 4.8 graphically summarises the relative performance of the Winnow models (based on an 'optimised' number of training cycles), on truly external data, and all Binary QSAR and QuaSAR-Classify models trained and validated on the same data. As later discussed in detail, these comparisons also highlight the considerable variation in estimated performance of these approaches when trained and validated on different datasets. Figure 4.9 graphically summarises the relative performance of these linear Winnow models compared to the performance of the corresponding SVM and Random Forest (RF) models. The comparable, or better, performance of Winnow to these more sophisticated, non-linear Machine Learning algorithms is interesting: Nigsch et al. previously found that, when models were built using the same features, Random Forest consistently outperformed Winnow, in terms of the MCC. 184

Detailed (mean) figures of merit, including 5CV results, obtained for selected models, are presented, along with the corresponding selected hyperparameters, in a set of CSV files made available with this thesis (Appendix A). Therein, results are also presented (Thesis_Chapter4_Table_S4k.xlsx) which compare the (mean) MCC values presented in Table 4.2 and Table 4.3 to the (mean) MCC values quantifying the performance of these models when applied to the separation of strong (mean pIC50 > 6.00) from moderate (5.00 < mean pIC50 ≤ 6.00), or weak (mean pIC50 ≤ 5.00), inhibitors in the corresponding test sets. Thesis_Chapter4_Table_S4k.xlsx also presents corresponding Bonferroni corrected 303 p-values. † The corrected p-values for all truly externally validated models proposed by this author, when trained and tested on the random:1 and random:2 partitions of the Literature-368 dataset, are less than 0.05 - which strongly suggests that these models were capable of discriminating moderate from strong inhibitors, as would be of value in the pharmaceutical industry. 15 Indeed, save for one SVM model, all models developed by this author performed better, in terms of the (mean) MCC, than expected for a random predictor (MCC equal to zero) when applied to the separation of strong and moderate inhibitors in truly external test sets.

4.6.2 The Effect of Unbalanced Training Sets

The use of unbalanced training sets (i.e. with more compounds in class I than A) appears to bias the Winnow models in favour of predicting compounds as non-blockers (i.e. compounds in class I). For all corresponding ‘externally’ validated Winnow models, the mean numbers of true positives (here, blockers are 'positives' and non-blockers are 'negatives') and false positives (true negatives and false negatives) decreased (increased) upon moving from a balanced to an unbalanced training set. The introduction of bias towards the majority class in the training set has long been recognised as a problem for many learning algorithms. 304 Interestingly, however, all models trained here using unbalanced data had higher (mean) recall and, save for one Random Forest model validated on the Thai-313 dataset, higher (mean) precision for class I than class A when ‘externally’ validated. The combination of higher precision and higher recall, for the ‘external’ validation sets, suggests these models were better able to identify non-blockers than blockers in these test sets.

* Such files are not provided for the Dubus-203 dataset, which is available from Aureus Sciences.102
† The uncorrected p-values were calculated (see Chapter 2, section 2.6.4.1) from the MCC value (or mean value for the pseudo-stochastic methods – Winnow, RF, QuaSAR-Classify) obtained when training and testing the selected modelling approaches on a given train/test partition. They denote the conditional probability of obtaining a (mean) MCC value with at least the magnitude observed on a given test set, supposing a random predictor had been built on the corresponding training set. The Bonferroni correction provides an upper bound, equal to the value below which the corrected p-values are deemed statistically significant, to the conditional probability, supposing an unknown number of approaches performed like random predictors on average, of erroneously declaring that any model (or random selection from a set of corresponding models for the pseudo-stochastic algorithms) built, in this work, on one of the training sets would perform differently, in terms of the mean MCC across various test sets, to a random predictor - based on the (mean) MCC value observed on the single corresponding test set.
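The link between an MCC value and a p-value can be sketched numerically. The sketch below relies on the standard identity for a 2×2 contingency table, chi-squared = N × MCC², with one degree of freedom; that this is the calculation described in Chapter 2, section 2.6.4.1 is an assumption, as is the number of tests passed to the Bonferroni correction. The function names are hypothetical. The example uses the rounded Binary QSAR (P_VSA) test set MCC of 0.60 and the 73-compound Thai-313 Diverse Subset test set.

```python
import math


def mcc_p_value(mcc, n_test):
    """Chi-squared (1 d.o.f.) p-value for a 2x2 table, using chi2 = n * MCC**2."""
    chi2 = n_test * mcc ** 2
    # Survival function of the chi-squared distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


p = mcc_p_value(0.60, 73)
print("{:.1e}".format(p))

# Illustrative correction over two hypothetical tests.
print(bonferroni([0.01, 0.5]))
```

The resulting value is close to the 3.0E-07 reported for this model and test set in Table 4.6; small discrepancies are expected because the MCC used here is rounded to two decimal places.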
| Int/Ext Split | Feature Set Selected (Winnow, SVM, RF) | Winnow (single cycle) | Winnow (multiple cycles) | SVM | RF | Binary QSAR (P_VSA) | Binary QSAR (Thai-Rel) | QuaSAR-Classify (P_VSA) | QuaSAR-Classify (Dubus-Rel) |
|---|---|---|---|---|---|---|---|---|---|
| Original | CFP2 and Thai-Rel(AC) | 0.26 | -- | 0.08 | 0.34 | 0.16 | 0.08 | -- | -- |
| Random:1 | ECFP_4 and CA(FI) | 0.53 | 0.59 | 0.55 | 0.52 | 0.29 | 0.12 | -- | -- |
| Random:2 | ECFP_4 and CA(AC) | 0.46 | 0.51 | 0.47 | 0.52 | 0.34 | 0.17 | -- | -- |
| Random:2 | ECFP_4 and Thai-Rel(AC) | 0.44 | 0.42 | 0.40 | 0.54 | 0.34 | 0.17 | -- | -- |
| Random:3 | ECFP_4 and CA(AC)+Joint OSBs | 0.43 | -- | 0.40 | 0.44 | 0.03 | -0.04 | -- | -- |
| Original (Balanced training set) | CFP2 and Thai-Rel(AC) | 0.10 | -- | -- | -- | -- | -- | -0.03 (0.04) | 0.15 (0.18) |
| Random:1 (Balanced training set) | ECFP_4 and CA(FI) | 0.42 | 0.49 | -- | -- | -- | -- | 0.29 (0.37) | 0.35 (0.38) |
| Random:2 (Balanced training set) | ECFP_4 and CA(AC) | 0.26 | 0.27 | -- | -- | -- | -- | 0.03 (0.06) | 0.22 (0.26) |
| Random:2 (Balanced training set) | ECFP_4 and Thai-Rel(AC) | 0.28 | 0.29 | -- | -- | -- | -- | 0.03 (0.06) | 0.22 (0.26) |
| Random:3 (Balanced training set) | ECFP_4 and CA(AC)+Joint OSBs | 0.37 | -- | -- | -- | -- | -- | 0.22 (0.24) | 0.25 (0.26) |

Table 4.2 Performance of this author’s selected models and literature models on ‘external’ test sets for all (‘Int/Ext’) partitions of the Literature-368 dataset into training and ‘external’ test sets. MCC values (mean MCC values for Winnow, Random Forest (RF) and QuaSAR-Classify) are presented. Values in parentheses are the maximum MCC values obtained across the 50 runs of the QuaSAR-Classify module. Missing values for Winnow (multiple cycles) signify that a single cycle gave the best 5CV result.
| Int/Ext Split | Feature Set Selected (Winnow, SVM, RF) | Winnow (single cycle) | Winnow (multiple cycles) | SVM | RF | Binary QSAR (P_VSA) | Binary QSAR (Thai-Rel) | QuaSAR-Classify (P_VSA) | QuaSAR-Classify (Dubus-Rel) |
|---|---|---|---|---|---|---|---|---|---|
| Dubus-203 Diverse Subset | ECFP_4+OSBs and CA(FI) | 0.84 | -- | 0.83 | 0.87 | -- | -- | 0.39 (0.65) | 0.45 (0.65) |
| Dubus-203 Diverse Subset | CFP2+OSBs and Dubus-Rel(AC) | 0.78 | -- | 0.83 | 0.79 | -- | -- | 0.39 (0.65) | 0.45 (0.65) |
| Thai-313 Diverse Subset | CFP2 and CA(FI)+Joint OSBs | 0.80 | -- | 0.79 | 0.83 | 0.60 | 0.68 | -- | -- |
| Thai-313 Diverse Subset | CFP2 and Dubus-Rel(AC)+Joint OSBs | 0.81 | -- | 0.79 | 0.82 | 0.60 | 0.68 | -- | -- |

Table 4.3 MCC values obtained on ‘external’ test sets for all partitions of the Thai-313 and Dubus-203 datasets used to evaluate this author’s modelling procedures. See Table 4.2 for presentational details.

Figure 4.5 Distribution of Int-Set (red) and ExtTest-Set (blue) within the plane defined by the first two principal components (PCA plot) for the P_VSA descriptor set (computed as per section 4.3.2.1). Principal components were calculated from the combined Int-Set and ExtTest-Set, using the prcomp() function in R 225 - with scale=TRUE.

Figure 4.6 PCA plots generated as per Figure 4.5 (training set: red, ‘external’ test set: blue), for all splits of the Literature-368 dataset; from top left to bottom right: original split, random:1, random:2 and random:3.

Figure 4.7 PCA plots (generated as per Figure 4.5) for the Diverse Subset partitions of the Thai-313 (LHS) and Dubus-203 (RHS) datasets for which results are presented in Table 4.3.

Figure 4.8 Mean MCC values for externally validated (i.e. only ‘unbiased’ feature sets were considered, where relevant) selected Winnow models (generated using multiple training cycles, where relevant), compared to QuaSAR-Classify (QC) mean MCC values and Binary QSAR (BQ) MCC values when trained and tested on the same data.
Figure 4.9 Mean MCC values for externally validated Winnow models (as per Figure 4.8), and the corresponding (mean) MCC values for the externally validated SVM and RF models.

4.6.3 Features Emphasised by the Models Developed in this Work

To better understand the models, estimates were computed of the importance of each feature in all ‘externally’ validated Random Forest and Winnow models – where computationally feasible. * For Winnow, the C++ code was modified to print the arithmetic mean, across all scorers, of the weights assigned to each feature for both classes, and the feature importance was estimated as the absolute difference in these mean weights. For Random Forest, the Gini importance measure was used. 223 Overall feature importance was assigned on the basis of the median rank (more important features having lower values) across all 50 (five) runs of Winnow (Random Forest). For the purposes of the following discussion, the ‘important features’ are the top ranking 1% of features for the model in question.

Evidently, the modelling methods differ in the manner in which they make use of the features: in contrast to the SVM and Random Forest models, the Winnow algorithm can only model linear relationships between the classes and the features and only make predictions on the basis of features present in a compound. This is reflected in the incomplete overlap between the important features for each model: for all evaluated Random Forest models trained on subsets of the Literature-368 dataset, less than 50% of the important features were also deemed important for the Winnow models trained on the same data. Except for the ‘random:1’ split of this dataset, the common important features included fingerprint and discretized descriptor features, suggesting the distinct types of descriptor were adding comparably important information to both types of models.
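The ranking scheme described for Winnow can be sketched as follows: within each run, importance is the absolute difference between the mean class A and mean class I weights; features are ranked per run, and the overall ordering uses the median rank across runs. This is a minimal sketch under stated assumptions: the handling of tied ranks and the rounding used for the top 1% cut-off are not specified in the text and are chosen here arbitrarily, and all names and weight values are invented.

```python
import math
from statistics import median


def rank_features(importance):
    """Map feature -> rank (1 = most important) for a single run."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {feature: position for position, feature in enumerate(ordered, start=1)}


def important_features(runs, top_fraction=0.01):
    """Order features by median per-run rank and return the top fraction."""
    per_run = []
    for mean_weights in runs:
        importance = {f: abs(w_a - w_i) for f, (w_a, w_i) in mean_weights.items()}
        per_run.append(rank_features(importance))
    median_rank = {f: median(r[f] for r in per_run) for f in per_run[0]}
    # Assumption: round the cut-off up and always keep at least one feature.
    n_keep = max(1, math.ceil(top_fraction * len(median_rank)))
    return sorted(median_rank, key=median_rank.get)[:n_keep]


# Each run maps feature -> (mean weight for class A, mean weight for class I).
runs = [
    {"f1": (2.0, 0.5), "f2": (1.0, 1.1), "f3": (0.2, 1.0)},
    {"f1": (1.8, 0.6), "f2": (1.0, 0.9), "f3": (0.1, 1.5)},
    {"f1": (2.1, 0.4), "f2": (1.2, 1.0), "f3": (0.3, 0.9)},
]
print(important_features(runs, top_fraction=0.34))
```

Using the median rank rather than the mean makes the ordering robust to a single run in which a feature happens to receive an unusually high or low weight.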
This author was interested as to whether these models might highlight novel molecular substructures associated with high/low potency hERG inhibition. The median (across all training cycles) signed difference in the mean weights assigned to the important Winnow features was taken as indicative of the ‘global’† association of those features with hERG active/inactive compounds. This difference was defined as positive for features with a higher mean weight for class A than for class I. Those ‘important features’ corresponding to CFP2 or ECFP_4 fingerprint components were closely inspected; SMILES representations of the substructure(s) mapping onto the ECFP_4 features (across the whole of the Literature-368 or Dubus-203 dataset for the corresponding models) were generated using Pipeline Pilot.

An important difference between the CFP2 and ECFP_4 features is that the latter encode a specific set of allowed, non-hydrogen, attachment points for the corresponding substructures. 185 It is important to note, however, that there is not a one to one mapping between molecular substructures and either type of fingerprint features. Multiple substructures may correspond to the same feature, and, more rarely, the same ECFP_4 feature may correspond to multiple substructures. 185 Furthermore, the aforementioned SMILES representations for subtly different substructures may be equivalent (personal communication from Accelrys Limited).

* This was not the case for the Random Forest models based on Dubus-Rel features.
† In addition to the caveats raised below, since the training sets cannot be claimed to span all of chemical space, any trends are, of course, only indicative.

The ECFP_4 based Winnow models appear to exhibit consistency with trends previously noted in the literature, as illustrated in Table 4.4. * In keeping with Song and Clark’s analysis of their linear Support Vector Regression model, 222 the highlighted Winnow models associated a feature encoding a fluoro-phenyl moiety with strong hERG inhibition and a feature encoding an amide with weaker inhibition. The association of a feature (2131972380) encoding a tertiary amine, located away from the molecular periphery, with potent hERG blockade, learnt by all ECFP_4 based Winnow models and highlighted for one such model in Table 4.4, is consistent with the well-known association between such tertiary amines and hERG inhibition. 222,299,305 Various mechanistic rationales have been proposed for this: protonation of the basic amine possibly leads to pi-cation interactions or facilitates non-classical hydrogen bonds with hERG Tyr652 residues. 299,305 The latter of these findings demonstrates that the ECFP_4 based Winnow models were capable of capturing chemically meaningful relationships (i.e. relationships which were unlikely to be artefacts of the dataset) between hERG inhibition and molecular structure.

However, in general, care must be taken when seeking to interpret the contributions of different features to these types of models. Firstly, chance correlations between mechanistically important moieties and chemically unimportant moieties may reveal spurious relationships between hERG blockade liability and chemical substructures. This caveat is perhaps most notably illustrated by the fact that a feature (863188371) corresponding to an ethyl group was deemed the 5th most important for the Winnow model built on the training set of the random:1 split of the Literature-368 dataset using multiple training cycles.

* All features presented in Table 4.4 were amongst the important features for these models. All these features, where contributing to a Winnow model, were exclusively associated with one or more of the SMILES presented in Table 4.4 and a median difference in mean weights with the same sign as presented here.

It must also be noted that Winnow appears to have assigned questionably high significance to some features - given the relative number of active and inactive training set compounds associated with the feature in question. For example, the fluoro-phenyl moiety encoded by feature -296909061, highlighted as an important hERG blocker feature in Table 4.4, was associated with a higher median difference in mean weights than the 2131972380 feature (see Table 4.4) which was more clearly associated with active training set compounds. In addition, Winnow, as a linear algorithm, does not consider the contextual significance of these features and the effect of adding/removing a molecular fragment on hERG inhibitory potency may be dependent on the remaining chemical functionality as well as the position of the substitution.

One final point to bear in mind when seeking to chemically interpret these models is the possibility that hERG inhibition may arise from binding to different (open, closed or inactivated) 58 states of the ion channel and, indeed, different binding sites. 306 The exact nature of the interactions, hence the chemical significance of molecular substructures, may be different for blockers binding to different states of/sites within the ion channel. 300,306,307

The median difference in mean weights is presented, along with corresponding SMILES where relevant, for the important Winnow features in all assessed models, in an Excel file made available with this thesis (Appendix A).
| Feature | Fragment SMILES | Winnow Model | Median 〈w_A〉 − 〈w_I〉 | Training Set Occurrence Ratio Per Class |
|---|---|---|---|---|
| -296909061 | [structure depiction] | Training set: Random:1; multiple training cycles | 1.0196 | Class A: 18/55 (0.33); Class I: 17/165 (0.10) |
| 1430169877 | [structure depiction] | Training set: Dubus-203; feature set: ECFP_4+OSBs and CA(FI) | -2.4957 | Class A: 1/80 (0.01); Class I: 7/80 (0.09) |
| 2131972380 | [structure depiction] | Training set: Random:1; multiple training cycles | 0.9675 | Class A: 17/55 (0.31); Class I: 5/165 (0.03) |

Table 4.4 Some of the associations, learnt by the Winnow models, between 'important' ECFP_4 features and hERG blockade which were consistent with trends previously noted in the literature. N.B.: 〈w_A〉 denotes the arithmetic mean weight, across all scorers, assigned to the feature for class A etc. The median difference over all 50 training set orders is reported. All Dubus-203 training set compounds assigned feature 1430169877 were observed to possess a corresponding amide fragment. Occurrence ratios denote the fraction of compounds belonging to the class in question and containing the feature. All SMILES patterns were visualised using MarvinView (version 5.5.0.1);298 the 'A' symbols denote wildcard, i.e. undefined, heavy atom connections.185,308

4.6.4 Dataset Performance Dependencies

This author was surprised to have obtained considerably worse results on the Literature-368 dataset with Thai and Ecker’s and with Dubus et al.’s binary classifiers compared with those that they previously reported. 69,91 However, as extensively detailed in Table 4.5, the results obtained with this author’s implementations of their methods on their datasets indicate only marginal differences between the different implementations of their binary classifiers: the differences between the maximum test set MCCs obtained here and the corresponding test set MCCs calculated from their previous results are either less than or marginally greater than the range of MCC values obtained using this author’s implementations.
Furthermore, it is clear that this author was unable to exactly reproduce Thai and Ecker’s test set – presumably due to using a later version of MOE (Appendix C) – making it impossible to obtain a direct comparison between the different implementations of their binary classifiers. These findings suggest that the poorer results obtained on the Literature-368 dataset primarily reflect differences in the data used to train and validate these models.

The considerable range in (mean) results obtained using the Binary QSAR and QuaSAR-Classify models, for different train/test partitions of the Literature-368 dataset (Table 4.2), indicates that the consistently lower results obtained on this dataset are not merely a reflection of the reduction in training set size (compared to the Thai-313 and Dubus-203 datasets). The consistently lower results obtained on the Literature-368 dataset, compared to the Leave One Out cross-validated (LOOCV) results on the Thai-313 dataset (Table 4.5), suggest the lower results obtained, with the Binary QSAR models, do not merely reflect the absence of a Diverse Subset train/test partition of the Literature-368 dataset. *

It could be argued that the variation in overall binary classification performance of these models is more appropriately assessed in terms of the chi-squared p-values corresponding to the (mean) MCC values, rather than the MCC values themselves, since these may allow fairer comparisons between the results obtained on differently sized test sets (see Chapter 2, section 2.6.4.1). Table 4.6 presents such p-values for all partitions of the Literature-368, Thai-313 and Dubus-203 datasets used to externally validate this author’s models. The p-values obtained for the Binary QSAR (QuaSAR-Classify) approaches on the Thai-313 (Dubus-203) test set are consistently lower (lower in 50% of cases) than those obtained on the Literature-368 dataset.
This suggests that the (mean) MCC values obtained on the Literature-368 test set (148 compounds) were not lower than those obtained on these other test sets (Dubus-203: 43 compounds, Thai-313: 73 compounds) as an artefact of the test set size having increased.† Since the Thai-313 LOOCV effective test set size (240 compounds, i.e. the number of compounds in the training set) is larger than the Literature-368 test set size, the decrease in MCC values obtained on the Literature-368 dataset using the Binary QSAR methods cannot be an artefact of test set size.

* The Diverse Subset algorithm, as described in the MOE documentation, is designed to ensure pairs of similar molecules are split between the training and test set when partitioning a dataset.285

† The probability of a random predictor obtaining an MCC value greater than or equal to a given size, i.e. the p-value, is smaller for a larger test set - i.e. the chance of a random predictor obtaining a large MCC value decreases with increasing test set size. Hence, if it were supposed that these two approaches effectively yielded random predictors, the MCC would be expected to decrease for a larger test set.

Table 4.5 Results obtained on the Thai-313 and Dubus-203 datasets in the current work compared to those reported by Thai and Ecker69 and Dubus et al.91 The split of the Thai-313 dataset for which results are highlighted corresponds to the split with the highest MCC values obtained for both Binary QSAR models (i.e. the split used to evaluate this author’s models). The partitions of the Thai-313 dataset used to generate results in the current work (training set: 80 actives/160 inactives, test set: 20 actives/53 inactives) differed slightly from the single partition used by Thai and Ecker (training set: 81 actives/159 inactives, test set: 19 actives/54 inactives), as explained in section 4.4.5. All MCC values obtained by Dubus et al. and Thai and Ecker were estimated herein. Acc., Rec. and Prec.
denote accuracy, recall and precision respectively.

(i) Binary QSAR Results (Diverse Subset Splits of Thai-313 Dataset)

Source                        Validation            Descriptors  Acc.  Rec. (A)  Rec. (I)  Prec. (A)  Prec. (I)  MCC   MCC Range (Across All Splits)
Current work (Example split)  Training Set - LOOCV  P_VSA        0.79  0.51      0.93      --         --         0.50  0.51-0.49
Current work (Example split)  Training Set - LOOCV  Thai-Rel     0.81  0.61      0.91      --         --         0.55  0.59-0.55
Current work (Example split)  Test Set              P_VSA        0.85  0.60      0.94      0.80       0.86       0.60  0.60-0.48
Current work (Example split)  Test Set              Thai-Rel     0.88  0.70      0.94      0.82       0.89       0.68  0.68-0.57
Thai and Ecker69              Training Set - LOOCV  P_VSA        0.82  0.58      0.94      --         --         0.57  --
Thai and Ecker69              Training Set - LOOCV  Thai-Rel     0.80  0.61      0.91      --         --         0.56  --
Thai and Ecker69              Test Set              P_VSA        0.89  0.68      0.96      0.87       0.90       0.70  --
Thai and Ecker69              Test Set              Thai-Rel     0.93  0.79      0.98      0.94       0.93       0.82  --

(ii) QuaSAR-Classify Test Set Results (Diverse Subset Split of Dubus-203 Dataset)

Source                                   Descriptors  Acc.  Rec. (A)  Rec. (I)  MCC   MCC Range (Across All Runs)
Current work (Highest MCC from 50 runs)  P_VSA        0.79  1.00      0.67      0.65  0.65-0.36
Current work (Highest MCC from 50 runs)  Dubus-Rel    0.79  1.00      0.67      0.65  0.65-0.43
Dubus et al.91                           P_VSA        0.81  0.94      0.74      0.66  --
Dubus et al.91                           Dubus-Rel    0.74  0.94      0.63      0.56  --

Int/Ext Split                      Binary QSAR (P_VSA)  Binary QSAR (Thai-Rel)  QuaSAR-Classify (P_VSA)  QuaSAR-Classify (Dubus-Rel)
Original                           5.5E-02              3.1E-01                 --                       --
Random:1                           3.5E-04              1.5E-01                 --                       --
Random:2                           3.3E-05              3.8E-02                 --                       --
Random:3                           7.4E-01              1.0E+00                 --                       --
Dubus-203 Diverse Subset           --                   --                      1.1E-02                  3.3E-03
Thai-313 Diverse Subset            3.0E-07              6.6E-09                 --                       --
Original (Balanced training set)   --                   --                      1.0E+00                  7.8E-02
Random:1 (Balanced training set)   --                   --                      3.4E-04                  2.1E-05
Random:2 (Balanced training set)   --                   --                      6.7E-01                  6.7E-03
Random:3 (Balanced training set)   --                   --                      8.0E-03                  2.6E-03

Table 4.6 Chi-squared p-values corresponding to (mean, across 50 runs, for QuaSAR-Classify) test set MCC values obtained here for the partitions of the Literature-368, Dubus-203 and Thai-313 datasets used to assess this author’s models.
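The conversion from a test set MCC to a chi-squared p-value, as tabulated in Table 4.6, can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually used: it assumes the standard relation for a 2×2 contingency table, chi-squared = N × MCC², and the function name `mcc_p_value` is hypothetical. The upper-tail probability for one degree of freedom is evaluated with the complementary error function, which is what a spreadsheet chi-squared tail function returns in this special case.

```python
import math


def mcc_p_value(mcc: float, n_test: int) -> float:
    """Chi-squared p-value (one degree of freedom) for a test set MCC.

    Assumes chi-squared = N * MCC^2, the usual relation between the phi
    coefficient of a 2x2 contingency table and its chi-squared statistic.
    """
    if mcc < 0:
        # A negative MCC is treated as a random predictor: p-value set to one.
        return 1.0
    chi_squared = n_test * mcc ** 2
    # Upper tail of the chi-squared distribution with one degree of freedom:
    # P(X >= x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_squared / 2.0))


# Example: MCC of 0.60 on a 73 compound test set.
print(f"{mcc_p_value(0.60, 73):.2e}")
```

For an MCC of 0.60 on 73 compounds this gives a value of roughly 3E-07, in line with the Thai-313 Diverse Subset entry for the P_VSA Binary QSAR model.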
These p-values were computed using the CHIDIST( ) function in Excel 2007 (32-bit), supposing one degree of freedom (see Chapter 2, section 2.6.4.1), save where a negative MCC was obtained – for which it was supposed that the model must effectively be a random predictor and the p-value was set to one.

4.7 Conclusions

The work presented here shows the considerable variability in results that may be obtained by training and validating the same hERG blocker classifiers on different datasets. This is highlighted by the considerable reduction in model performance observed on the Literature-368 dataset with the models proposed by Dubus et al.91 and by Thai and Ecker69 compared to both previously published validations of these models, as well as validations of these approaches in the current work, on previously published datasets.

This author’s Winnow models perform comparably to models, based on the same descriptors, employing the more sophisticated, non-linear Machine Learning algorithms: Support Vector Machines and Random Forest. The use of multiple training cycles was found to sometimes improve the performance of the Winnow models. Features based upon numerical molecular properties, an advance on earlier work with Winnow in QSAR research, were found to make positive contributions to the models. Moreover, the contributions of some features to these Winnow models are interpretable in terms of known relationships between molecular structure and hERG blockade potency. However, considerable caution is advocated when seeking to infer novel structure-activity relationships from these kinds of models. Having taken care to minimise possible model and feature selection bias, all binary classifiers developed in the current work almost always yielded results on external validation sets as good as or better than the classifiers proposed by Dubus et al. and by Thai and Ecker.
Once possible extrapolation beyond the applicability domain of the model was removed, this author’s approaches perform well when compared to previously published models.

Chapter 5 Development of Novel 3D Descriptors

This chapter describes the conception, implementation and performance evaluation of novel 3D descriptors. Specifically, a chemical feature 'coloured' version of Ballester and Richards' Ultrafast Shape Recognition (USR) methodology, 'Atom Type USR' (ATUSR), was proposed. The performance of this descriptor set when used for the generation of Quantitative Structure-Activity Relationships (QSARs) for protein-ligand binding problems was assessed. As highlighted in earlier chapters, some protein-ligand binding events may mediate serious forms of toxicity, making method development for predicting the binding affinities and selectivities of ligands for toxicologically relevant proteins an important contribution to computational toxicology. The performance of the ATUSR descriptor set was compared to (standard) descriptor sets proposed in the literature and an adaptation of ATUSR designed to investigate the value added by explicitly encoding the 3D distribution of chemical functionality. The effects of conformer generation and dataset variation upon the results were also investigated.

5.1 Introduction

The importance of molecular shape as a determinant of protein receptor mediated biological activity, and receptor selectivity in particular, has been highlighted in the recent literature.263,264,309,310 The need for "shape complementarity" between a ligand (a 'small' molecule) and a protein binding site arises for three reasons:311
1. Ligands with certain shapes are simply unable to fit into the binding site.
2. Close contacts are required to form sufficiently enthalpically favourable intermolecular interactions.
3. Space filling of the receptor site may lead to the entropically favourable expulsion of bound water molecules.
Different approaches to characterising molecular shape in silico have been proposed. These methods may be divided into "alignment-based" and "alignment-free"309 (or "superposition" and "non-superposition")263 methods. Early superposition methods treated molecules as sets of overlapping hard spheres. However, a more realistic and computationally efficient approach characterises shape in terms of a set of atom centred Gaussian functions and computes shape similarity in terms of the maximal obtainable overlap (corresponding to an 'optimal' alignment) between the Gaussian based representations of two molecular conformers.263,311–313 A variation on this approach is employed by OpenEye's Rapid Overlay of Chemical Structures (ROCS) software;314,315 recently, freely available, open source software programs employing a variation on this approach have been described.316,317 However, the need for molecular alignment inherent to superposition methods is an intrinsic limitation to their computational efficiency263 and, possibly, if the optimal alignment is not obtained, to their effectiveness for shape comparison.263,316 Furthermore, they do not immediately offer descriptors that could be used for QSAR modelling.* Various non-superposition, or rotationally invariant, methods of characterising shape (similarity) have been proposed during the last decade, all of which encode molecular shape as a set of descriptors. These include shape signatures, Zernike descriptors and alpha-shape descriptors, proposed by Zauhar et al.,319 Mak et al.320 and Wilson et al.321 respectively and the Ultrafast Shape Recognition (USR) descriptors proposed by Ballester and Richards.263,264 This latter approach supposes that molecular shape is uniquely determined by the distances between all the atoms in a molecule. In principle, this is a simplification, since different atoms have different (effective) radii and the set of inter-atomic distances cannot encode chirality, i.e.
USR descriptors distinguish between diastereomers but not enantiomers. However, the first of these simplifications, which is also made by ROCS, may be of limited consequence in practice.316 Furthermore, USR can be adapted (or supplemented with an additional descriptor) to capture chirality - as shown by Armstrong et al.322 (Zhou et al.).323 As its name suggests, USR has been indicated to be particularly computationally efficient for shape similarity calculations, yet there is a debate in the literature regarding its effectiveness at capturing molecular shape compared to some other methods.263,264,310,324,325 Nonetheless, prospective virtual screening, via similarity searching using USR, recently identified an outstandingly high proportion of actives when compared to the results of experimentally screening against the same target326 (see also Chapter 3 of this thesis).

* Subsequent to alignment, however, 3D descriptors might be computed based upon the values of molecular properties sampled at grid-points defined with respect to a reference molecule to which all other molecules are aligned. Indeed, whilst not aligning conformers according to maximal shape similarity, this approach is employed to derive descriptors by the widely used Comparative Molecular Field Analysis (CoMFA) 3D QSAR method. CoMFA descriptors correspond to the calculated potential energies, at all grid-points, for a set of different probe atoms/groups, designed to characterise the favourability of forming different kinds of intermolecular interactions at different locations around the molecule of interest.318

Whilst molecular shape undoubtedly plays an important role in receptor-specific biological activity, some limitations on the use of molecular shape alone to discriminate between active and inactive compounds have been noted.
Firstly, more flexible proteins and proteins with multiple binding sites may be able to bind compounds with different shapes.310 Nonetheless, given the inexhaustive number of shapes which a binding site might adopt and the limited number of possible binding sites, active compounds would still, arguably, belong to certain classes of shapes. Secondly, perfect shape complementarity for a more flexible ligand may be entropically disfavoured due to the loss of rotational degrees of freedom upon binding.311,327 Thirdly, chemistry contributes to binding free energy327 - i.e. bioactivity is not dependent upon molecular shape alone.310,328,329 The importance of encoding chemical information was emphasised in recent studies by Kirchmair et al.329 and Cannon et al. 175 These latter authors combined a variant of USR with an implementation of the MACCS key descriptor set, which encodes the presence or absence of pre-defined 2D substructural features using a bit-string, and generated classification models using Random Forest. 175 Their results suggest that shape (as encoded by USR) and chemical feature information may be combined synergistically - for some bioactivity prediction tasks. 175 Whilst 3D descriptors are expected to encode additional, spatial, information that 2D descriptors cannot, 3D descriptors may not necessarily yield better models than 2D descriptors.19,324,330,331 This could be because of an inability to approximate the bioactive conformer - leading to 3D descriptors failing to encode additional, relevant, information compared to 2D descriptors. It has also been speculated that much of the spatial information regarding molecular structures is implicitly encoded by their 2D structure.19,330 Prior to considering the general validity of the claim that the bioactive conformer is likely to be poorly approximated, it is worth briefly touching upon the computational approaches used to obtain such an approximation. 
An initial set of 3D co-ordinates must first be assigned to all atoms. A variety of programs are available for this purpose,332 such as the widely employed CORINA. As summarised in the manual, CORINA generates a single, “low energy” conformation based upon a combination of ‘typical’ bond lengths, bond angles and torsion angles – informed by statistical analysis of the conformational preferences of acyclic molecular fragments in small molecule crystal structures – and computationally inexpensive calculations.333

The initial 3D structure, possibly following refinement using a Molecular Mechanics force-field, which uses an empirically parameterised analytical expression for the relative energies of different 3D structures for a given molecule,197,332 may then be subjected to a “conformational search” in order to locate other thermodynamically plausible conformations.332 A variety of approaches may be applied for conformational searching. Usually, truly systematic exploration of all possibilities is computationally infeasible and many of the conformers found during a systematic search would be rejected on energetic grounds in any case. A more computationally efficient approach is to consider random changes in torsion angles, possibly subject to constraints, combined with criteria for selecting energetically plausible intermediate solutions for further adjustments. Monte Carlo and genetic algorithms both employ distinctive variations on this kind of approach. Other algorithms, such as simulated annealing, might also be employed. Detailed discussion of conformational search algorithms lies beyond the scope of this thesis, but the interested reader is directed to Foloppe and Chen for an excellent introduction.
332 Conformational searching may be carried out in vacuo or in a (simplified) solvation model332 or – as per the “docking” approach334 – by assessing the energetic favourability of different conformations inside an experimentally determined,334 or computationally predicted,305 protein binding site.

Assessing how well such approaches can ‘reproduce’ the bioactive conformer depends upon a suitable definition of a ‘closely reproduced’ bioactive conformer. Foloppe and Chen suggest that a root mean squared deviation (RMSD) between corresponding atoms, upon maximal alignment, of 1 Å or below may be considered a “good fit”.332 They recently found that three popular software programs (MOE, Catalyst and ConfGen) were able to locate more than 60% of experimentally determined bioactive conformers in a test set comprising more than 200 “drug-like” compounds.335 They further claimed that most bioactive conformers can be reproduced within an RMSD of 2 Å across methods.332 Nonetheless, even if a conformational search can locate a good approximation of the bioactive conformer, this does not necessarily mean that the program can recognise the bioactive conformer – i.e. determining the bioactive conformer may still remain a problem. Indeed, Foloppe and Chen suggest that appropriate assessment of conformational energetic preferences remains a key challenge.335

However, notwithstanding the possibility that 3D descriptors might genuinely yield worse predictions due to an inability to locate the bioactive conformer, it might be the case that observed superior performance with 2D descriptors is an artefact of "analogue bias" - i.e.
datasets may be disproportionately composed of topologically similar active compounds.330,336 Alternatively, 2D methods could correlate strongly with simple physicochemical properties such as logP,337 which may correlate with nonspecific contributions to binding affinity, rather than the specific contributions that are required for desirable compound selectivity338 and may be better captured by 3D approaches. Hence, 2D descriptors may outperform/perform as well as 3D descriptors for "biased" datasets in which, say, the set of actives was disproportionately made up of topologically similar molecules and/or actives and inactives were well separated in terms of simple (physicochemical) properties,336,339 even when the 3D descriptors would be valuable under circumstances where such conditions did not apply.

In summary, the literature to date suggests that, in developing a novel 3D descriptor set, such as those presented in this chapter, it is important to:
1. Encode both shape and chemical features.
2. Investigate the effects of conformational sampling.
3. Benchmark the more sophisticated descriptors against simpler, 2D descriptors.
4. Assess their performance across different datasets.
5. Consider whether or not dataset biases/property distributions might underpin the observed relative performance of the assessed methods.
These points were considered when assessing the novel 3D descriptor set proposed in section 5.2.

5.2 Proposed Methodology

5.2.1 Descriptors

The novel descriptor set proposed in this work is an extension of USR to encode the 3D arrangement of chemical functionality relevant for binding within a protein. For brevity, this is termed the Atom Type USR (ATUSR) descriptor set. An explanation of how ATUSR descriptors are calculated first requires a detailed explanation of USR. USR descriptors are an approximate encoding of the distances between all atoms in a molecule, and hence of molecular shape (see section 5.1).
A representation of these distances is generated by approximately encoding the distributions of the pairwise distances between all atoms in the molecule and a set of four reference locations, chosen to reduce redundancy between the pairwise distances from any pair of reference locations. These reference locations are: the molecular centroid (ctd), the closest atom to ctd (cst), the farthest atom from ctd (fct) and the farthest atom from fct (ftf). The associated distributions of pairwise distances are approximately encoded by computing estimators (i.e. sample moments) of the first moment about the origin (the mean distance to the reference location, encoding molecular size), plus the second and third moments about the mean.263,264 The lth moments about the origin (Equation 5.1) and the mean (Equation 5.2) are computed using the following expressions.

$$\mu_l = \frac{1}{N}\sum_{n=1}^{N} d_n^{\,l} \qquad (5.1)$$

$$\mu_l' = \frac{1}{N}\sum_{n=1}^{N} \left(d_n - \mu_1\right)^{l} \qquad (5.2)$$

In these equations, $d_n$ denotes the Euclidean distance between the reference location and the nth atom out of all N atoms in the molecule. For each moment, the corresponding descriptor is its lth root, such that all 12 descriptors (four reference points, three moments per point) have dimensions of distance.263,264

The speed with which USR descriptors can be computed,264,326 and the simplicity of their implementation, motivated the development of a novel 3D descriptor set based upon these descriptors. These ATUSR descriptors include the USR descriptors described above, plus 11 groups of additional descriptors corresponding to the USR descriptors computed using the same reference locations, yet only including distances to atoms of a particular type. Hence, ATUSR descriptors comprise 144 (12+11×12) values in total.
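The USR calculation described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code accompanying this thesis: the function names are hypothetical, the sample moments are normalised by N, ties between equally distant atoms are broken by taking the first such atom, and the cube root of the third moment is taken so as to preserve its sign.

```python
import math


def _root_moments(dists):
    """Return the three descriptors for one reference location.

    Mean distance, square root of the 2nd moment about the mean, and
    signed cube root of the 3rd moment about the mean (sign handling is
    an assumption of this sketch).
    """
    n = len(dists)
    mean = sum(dists) / n
    m2 = sum((d - mean) ** 2 for d in dists) / n
    m3 = sum((d - mean) ** 3 for d in dists) / n
    return [mean, math.sqrt(m2), math.copysign(abs(m3) ** (1.0 / 3.0), m3)]


def usr_descriptors(coords):
    """Return 12 USR-style descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    # Molecular centroid (ctd).
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_ctd = [math.dist(ctd, c) for c in coords]
    # Closest (cst) and farthest (fct) atoms from the centroid.
    cst = coords[d_ctd.index(min(d_ctd))]
    fct = coords[d_ctd.index(max(d_ctd))]
    # Farthest atom from fct (ftf).
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]

    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        descriptors.extend(_root_moments([math.dist(ref, c) for c in coords]))
    return descriptors


# Toy example: three collinear 'atoms' spaced 1 Å apart.
toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(len(usr_descriptors(toy)))
```

Because only distances to internally defined reference locations are used, the resulting values are unchanged when the whole structure is translated or rotated.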
The 11 atom types were designed to discriminate between chemical features which might make different contributions to the free energy of binding for the molecule in question to a protein binding site.327 These atom types were: (weak) hydrogen bond acceptor (donor), ring member, aromatic ring member, hydrophobic, fluorine, halogen, anionic and cationic. All atom types were defined using SMARTS340,341 patterns written by this author, and atoms corresponding to these SMARTS identified using the Python Pybel module.177 In principle, atoms could be assigned more than one type - e.g. all 'aromatic ring atoms' are also 'ring atoms'. However, a rule that hydrogen bond donors (acceptors) could not be additionally typed as anionic (cationic) was also hardcoded. Steps were taken towards making atom typing independent of the tautomeric form or formal charge state of the input structure; to achieve the latter aim, steps were taken towards categorising all atoms expected to be protonated (deprotonated), or expected to be associated with delocalised positive (negative) charge in protonated (deprotonated) functional groups, at physiological pH, as cationic (anionic).342–345 The definition of hydrogen bond donors and acceptors was largely based on the definitions used by Mills and Dean (who explicitly excluded donors and acceptors, e.g. C-H groups, deemed to be 'weak') - i.e. donors included any nitrogen/oxygen with an attached hydrogen, whilst acceptors included sp2 sulfurs and all oxygen, and some nitrogen atoms, possessing a lone pair, save where this was expected to be 'extensively' delocalised into an aromatic/electron withdrawing pi-system.346,347 All nitrogen atoms presenting lone pairs upon expected deprotonation at physiological pH, as found in tetrazole344,348 and acyl sulfonamides,345,349 were typed as hydrogen bond acceptors.
Likewise, all nitrogen atoms expected to be protonated at physiological pH were categorised as donors.342 Weak donors were defined as CH groups with polarizing nitrogen, oxygen or fluorine substituents (or aliphatic CH groups with multiple chlorine substituents) and sp CH groups and aliphatic carbon bound SH groups.327,350,351 Weak acceptors were defined as all atoms in an aromatic ring, a double or triple bond (pi-acceptors) or sp3 sulfurs.350,351 In all cases, the heavy atom (X) of an X-H donor was assigned the donor atom type.

The rationale for encoding the positions of atoms in ring systems in general, and aromatic rings in particular, was to encode both the degree of molecular flexibility (ring systems having lower conformational entropy than chains,342 making binding to a receptor more entropically favourable), and the potential for enthalpically favourable offset pi-pi stacking etc.327 respectively.

Except for the exclusion of carbons indicated as making negative contributions to Wildman and Crippen's logP model,352 and the inclusion of fluorine,351–353 hydrophobic atom types were assigned to all non-polar carbon atoms, bromine, iodine and some sulfur atoms, as accepted in the RDKit (BaseFeatures.fdef, Q22010_1).354 The assignment of a distinct atom type to carbon bound fluorines reflects the distinctive effects of fluorine substituents. As the most electronegative element, it may have a particularly profound effect on pKa values,353 hence on the potential for strong hydrogen bonding350 or electrostatic interactions more generally.
Moreover, CF interactions with polar groups "frequently occur in the PDB [Protein Data Bank]"327 and can lead to five-fold increases in binding affinities.353 Fluorine substituents occupy smaller volumes than other polar groups, increase lipophilicity on average,353 and do not offer any potential for 'halogen bonding'.355,356 All iodine and bromine atoms, and chlorine atoms attached to certain substituents which were indicated to increase the chance of their involvement in halogen bonding, were designated as 'halogens'.355,356 Figure 5.1 illustrates the computation of USR and ATUSR descriptors for an example molecule.

Figure 5.1 Computation of all three (A) USR and (B) supplementary ATUSR descriptors (for atoms typed as H bond donors) corresponding to the distances (dotted lines) between the molecular centroid (black sphere) and the relevant atoms. N.B.: The [NH+] donor (RHS of the molecular centroid; N = blue; H = white) is also typed as a cationic and ring atom. This figure was generated using VMD (version 1.9).357,358

5.2.2 Application of Descriptors to Bioactivity Prediction

The performance of the ATUSR descriptor set, compared to all other descriptor sets noted in section 5.3.1, when used to generate various regression and classification QSAR models using Random Forest (as implemented in the R225 package randomForest) was assessed. Since the focus was on comparing different sets of descriptors, and randomForest is noted to perform well with its default parameters,192 no hyperparameter selection was carried out. For all runs, randomForest( ) was called using ntree = 501 (i.e. its default 500, adjusted to avoid random tie breaks for classification) and the default setting for mtry (see Chapter 2, section 2.6.3.2).

5.3 Evaluation of Methodology

The code for all descriptors implemented in this work, along with scripts used to carry out the modelling runs and analysis described in this section, is made available with this thesis (Appendix A).
5.3.1 Descriptor Sets Compared The descriptor sets which were compared are presented in Table 5.1. Extensions of USR and USR-like descriptors based on descriptors computed using moments beyond the third moment about the mean were not considered. Whilst the inclusion of higher order moments might yield improved performance, albeit at greater computational cost,175,359 the focus here was on comparing different ways of incorporating chemical information into USR-like descriptor sets and on comparing their performance relative to standard descriptor sets. Comparisons were sought with MACCS175,331,337,360 and P_VSA69,91,214,296 2D descriptors since both have been shown to be generally applicable to discriminating between bioactivity classes/bioactivity regression modelling. Indeed, the former set of descriptors was shown to outperform sets of USR-like descriptors for some classification tasks. 175 As noted previously, there is a clear need to benchmark the performance of 3D descriptor sets against sets of 2D descriptors, given both the lower computational overhead expected of the latter -which have no requirement to generate 3D conformers, a procedure which may also fail in some cases - and earlier studies which suggested that 3D representations may yield inferior predictive performance (see section 5.1). This latter point was the rationale for also benchmarking the performance of ATUSR descriptors against USR-ATFP - a combination of USR and an 'Atom Type Fingerprint ' (ATFP) comprising the occurrence count for each of the 11 atom types used for the ATUSR descriptors (section 5.2.1). This corresponds to comparing two conceptually different approaches to adding chemical information to USR: 1. The distributions of potential macromolecular interaction sites, relative to the same reference points as USR, are explicitly encoded (as per ATUSR). 2. The presence/number of such sites is encoded in addition to encoding the 3D distribution of all atoms via USR (as per USR-ATFP). 
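The difference between these two approaches can be sketched in Python as follows. This is a hedged illustration rather than the implementation provided with this thesis: the atom typing itself (SMARTS matching) is not reproduced, so `typed_atoms` is a hypothetical input mapping each atom type label to the indices of the atoms assigned that type; the labels in `ATOM_TYPES` are illustrative names for the 11 types of section 5.2.1; and returning zeros when no atom of a given type is present is an assumption of this sketch.

```python
import math

# Illustrative labels for the 11 atom types of section 5.2.1.
ATOM_TYPES = [
    "hbond_acceptor", "hbond_donor", "weak_hbond_acceptor", "weak_hbond_donor",
    "ring", "aromatic_ring", "hydrophobic", "fluorine", "halogen",
    "anionic", "cationic",
]


def reference_points(coords):
    """Centroid, closest atom to it, farthest atom from it, farthest from that."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_ctd = [math.dist(ctd, c) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]
    fct = coords[d_ctd.index(max(d_ctd))]
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]
    return [ctd, cst, fct, ftf]


def moment_block(refs, atoms):
    """Three root-moment descriptors per reference point for the given atoms."""
    out = []
    for ref in refs:
        dists = [math.dist(ref, a) for a in atoms]
        if not dists:
            out.extend([0.0, 0.0, 0.0])  # assumption: absent atom type -> zeros
            continue
        mean = sum(dists) / len(dists)
        m2 = sum((d - mean) ** 2 for d in dists) / len(dists)
        m3 = sum((d - mean) ** 3 for d in dists) / len(dists)
        out.extend([mean, math.sqrt(m2), math.copysign(abs(m3) ** (1 / 3), m3)])
    return out


def atusr(coords, typed_atoms):
    """Approach 1: USR plus one USR-like block per atom type (12 + 11 x 12)."""
    refs = reference_points(coords)  # reference points from ALL atoms
    desc = moment_block(refs, coords)
    for atom_type in ATOM_TYPES:
        subset = [coords[i] for i in typed_atoms.get(atom_type, [])]
        desc.extend(moment_block(refs, subset))
    return desc


def usr_atfp(coords, typed_atoms):
    """Approach 2: USR plus the occurrence count of each atom type (12 + 11)."""
    usr = moment_block(reference_points(coords), coords)
    counts = [float(len(typed_atoms.get(t, []))) for t in ATOM_TYPES]
    return usr + counts


# Toy example with hypothetical atom type assignments.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (3.0, 1.5, 1.0)]
typed = {"hbond_donor": [0], "cationic": [0], "ring": [0, 1, 2]}
print(len(atusr(coords, typed)), len(usr_atfp(coords, typed)))
```

In both functions the first 12 values are the plain USR descriptors; the two descriptor sets differ only in whether the chemical information is added as spatial distributions or as simple counts.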
The second of these approaches is analogous to the MACCS+USR-variant "hybrid descriptor" proposed by Cannon et al.175 Likewise, just as Cannon et al. compared USR-variants and their "hybrid descriptor" to MACCS alone, the ATFP descriptor set was also evaluated in isolation - giving an indication* of whether the encoding of molecular shape is actually adding predictivity.

* To some extent, molecular size - which is encoded by USR - may correlate with these atom occurrence counts.

Descriptor Set
ATUSR
USR-ATFP
ATFP
USR
USR+MACCS
MACCS
P_VSA

Table 5.1 Descriptor sets compared in this work.

5.3.2 Validation Protocol

For most datasets used in this work (see section 5.3.3), Random Forest models built using the aforementioned sets of descriptors (see section 5.2.2) were evaluated using 10 repetitions of 10-fold cross-validation. The partitions were designed to correspond to stratified folds, for classification tasks, or to independent sampling of percentile-based subgroups for continuous bioactivities - likewise intended to ensure that each fold was representative of the dataset as a whole. However, for the larger Doddareddy dataset, a single train/test partition was considered for computational expediency. In all cases, model generation was repeated five times per train/test partition with a different seed for the randomForest (see section 5.2.2) random number generator (RNG).

5.3.3 Datasets

For most of the datasets considered to date, blockade of the hERG potassium ion channel was the modelled bioactivity. This was for the following reasons:
1. The considerable toxicological significance of hERG inhibition (see Chapter 1, section 1.3.2).
2. Earlier work having been undertaken by this author in both reviewing the literature for this toxicity endpoint, and in acquiring data from minimally experimentally inconsistent protocols (see Chapter 4).
3. Earlier work in the literature having highlighted the importance of molecular shape for potent hERG blockade.
Regarding this latter point, work by Chekmarev et al. had suggested that, on the basis of descriptors encoding molecular shape alone, strong and weak hERG inhibitors could be discriminated.282 Moreover, docking of hERG inhibitors into a molecular dynamics informed model by Zachariae et al. suggested that, whilst hERG binding sites may be flexible, and compounds may bind in more than one site/to more than one channel state, specific conformational, size and peripheral structural requirements exist for potent hERG blockade.300 This latter work suggested that, at least for discriminating potent from weaker blockade, a causative relationship between molecular shape and hERG inhibitory potency exists.

The Schattel benchmark dataset was of interest due to all biological measurements having been obtained using the same assay, within the same laboratory, and its having been previously modelled using 3D descriptors and Random Forest.166

hERG-196 Dataset

This comprised 196 compounds for which pIC50 estimates noted to have been derived using the IonWorks™ assay in CHO cells, using a similar protocol to Bridgland-Taylor et al.,78 or otherwise inferred to have been obtained under similar conditions, or from manual, whole cell patch clamp measurements in CHO cells at room temperature, were obtained from the literature.* Where multiple pIC50 estimates were obtained for the same compound, the arithmetic mean was used for modelling.

* Bridgland-Taylor et al. indicated high correlation, with minimal systematic bias, between pIC50 estimates obtained under both sets of conditions.78

The derivation of this dataset started by taking a subset of the Literature-368 dataset presented in Chapter 4, for which precise pIC50 values – i.e. excluding measurements merely indicating pIC50 < 5 or pIC50 > 6 – meeting the criteria noted above were available. Additional compounds, with hERG pIC50 values meeting these criteria, were obtained via a Scopus search of the recent literature in the summer of 2010. Where the compounds in the hERG-196 dataset corresponded to the original subset (see above) of the Literature-368 dataset, the same unique IDs were assigned.* All structures, including stereochemistry, were manually checked using the experimental literature (where the compound was recently reported in a medicinal chemistry SAR study) or SciFinder Scholar 2007.287 Care was taken to specify stereo-centres only where these were specified in the consulted sources, to draw these as per the recommendations in the CORINA manual (see section 5.3.4.1), and to note compounds for which all stereocentres were specified, yet only relative stereochemistry indicated, in these sources (see section 5.3.4.2). The ‘raw’ SDF and a CSV file containing all structures, prior to processing described below, and experimental pIC50 estimates respectively, along with references, are presented in the files made available with this thesis (Appendix A).

ThaiReg Dataset

This was a subset of the 248 compounds previously used to validate hERG regression models by Thai and Ecker.296 Starting from the SDF containing the structures used in this study (kindly provided by Dr Khac-Minh Thai), a Python script was used to prepare a ‘raw’ SDF comprising 248 structures for which precise IC50 values were presented by Thai and Ecker.296 Subsequent to removal of compounds which CORINA failed to parse, and additional compounds based on further analysis (see section 5.3.4.3), a dataset of 225 compounds with pIC50 values was obtained for regression modelling. Since the number of compounds in common between, or with common stereoisomers in, the hERG-196 and ThaiReg datasets was estimated to be 73 or below,† it was suspected that these datasets might occupy different regions of chemical space. This, coupled with the fact
This, coupled with the fact that regression models were successfully built on the original superset of this dataset,296 motivated the assessment of the descriptor sets of interest on this dataset as well.

* These being the original IDs assigned to the Int-Set subset of the Literature-368 dataset, with the original ExtTest-Set subset IDs being advanced by 300, as explained in Literature_368_References.doc (see Appendix A).

† Stereochemistry indifferent InChIs were computed, as per section 5.3.4.3 (save for the use of the SNon flag), for all structures in the 248 compound superset of the ThaiReg dataset and the hERG-196 dataset; 73 unique InChIs were found to be common to both sets of compounds. There were 246 and 196 unique InChIs associated with the superset of the ThaiReg dataset and the hERG-196 dataset respectively.

Doddareddy Dataset

This was a subset of the hERG inhibition classification dataset (2,644 compounds) recently presented by Doddareddy et al.93 Actives were compounds with IC50 < 10 μM; inactives were compounds with IC50 > 30 μM, together with FDA approved drugs (assumed inactive). Prior to the removal of structures subsequent to standardization (see section 5.3.4.3), inorganic compounds were removed using Pipeline Pilot Student Edition (Figure 5.2),301 some of these having forced the generation of 3D structures using CORINA to abort. This gave the ‘raw’ structures used for subsequent processing. Ultimately, a dataset of 2,218 compounds was derived, with the training (1,187 inactives : 826 actives) and test (125 inactives : 80 actives) sets corresponding (post-removal of compounds) to the randomly selected split used by Doddareddy et al. The rationale for modelling this dataset was that, due to its significant size compared to previously compiled hERG datasets, it was suspected to be significantly more diverse and offer a larger coverage of chemical space - characteristics which might affect the relative performance of the evaluated sets of descriptors.
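The activity thresholds stated above can be illustrated with a minimal Python sketch. This is not the script used in this work: the function and constant names are hypothetical, and the rule treating FDA approved drugs as assumed inactives is not reproduced. The conversion to pIC50 assumes IC50 values expressed in μM.

```python
import math

ACTIVE_MAX_UM = 10.0    # actives: IC50 < 10 uM (threshold stated in the text)
INACTIVE_MIN_UM = 30.0  # inactives: IC50 > 30 uM (threshold stated in the text)


def pic50_from_ic50_um(ic50_um):
    """Convert an IC50 in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    return 6.0 - math.log10(ic50_um)


def label_compound(ic50_um):
    """Return 'active', 'inactive', or None for the excluded 10-30 uM range."""
    if ic50_um < ACTIVE_MAX_UM:
        return "active"
    if ic50_um > INACTIVE_MIN_UM:
        return "inactive"
    return None  # intermediate potency: not assigned to either class
```

Under these thresholds, a compound is active if its pIC50 exceeds 5 and inactive if its pIC50 falls below approximately 4.52.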
Figure 5.2 Removal of inorganic structures from Doddareddy dataset in Pipeline Pilot, prior to standardization. The numbers reflect the total number of compounds, in all SMILES files, subsequent to assignment of unique compound IDs, presented by Dr Andreas Bender. The 2,644 compounds referred to were presented in a subset of these files.

hERG-196:Subset and ThaiReg:Subset Datasets

In order to gain some insight into the effect that the large size of the Doddareddy dataset might have on the descriptor sets' relative performances, analogous active (IC50 < 10 μM) vs. inactive (IC50 > 30 μM) classification datasets were derived from the hERG-196 and ThaiReg datasets, via excluding all compounds with (mean) pIC50 values not fulfilling the corresponding criteria. The hERG-196:Subset and ThaiReg:Subset datasets comprised 148 (122 active : 26 inactive) and 185 (155 active : 30 inactive) compounds respectively. Whilst the different active : inactive ratios, and assay types used to derive hERG activity, for the three hERG classification datasets would arguably impact upon the results, it was expected that the different sizes (and potentially different levels of diversity) would exert a greater effect on the relative performance obtained with different sets of descriptors.

Schattel Dataset

This was a dataset of 249 compounds provided by Schattel et al. in SDF format. This comprised compounds, with absolute stereochemistry characterised, determined as active (60 compounds) or inactive (189 compounds) with respect to inhibition of c-Jun N-terminal kinase-3 (JNK3).166

5.3.4 Dataset Processing and Descriptor Calculations

The following are made available with this thesis (Appendix A): all scripts, and standardizer configuration files, specifying the exact options used with ChemAxon's tools,298 CORINA (version 3.20)333 and the Molecular Operating Environment (MOE);285 screenshots completely detailing graphically set options for MacroModel;361 and all files required to reproduce the GOLD334,362 docking runs and obtain the selected docking poses.

5.3.4.1 Standard Processing of Structures

All ‘raw’ structures, presented in SDF format, in the (pre-cursors of) all datasets (see section 5.3.3) were processed via the workflow schematically illustrated in Figure 5.3. These operations were designed to yield one representative 3D structure, under typical assay conditions, per compound.

5.3.4.2 Structural Refinement of the hERG-196 Dataset

Initial results for the hERG-196 dataset, and results (see section 5.4) based on the standard workflow (Figure 5.3), indicated typically marginal differences between the performance of many of the 3D and 2D descriptor sets considered in this chapter (see Table 5.1). In order to investigate whether this reflected poor approximation of the bioactive conformer (see section 5.1), alternative workflows were followed (for the hERG-196 dataset) in order to obtain a conformer for each compound that might better approximate this. For all workflows, the same processing of the ‘raw’ structures via ChemAxon's tools, and the conversion of all finally obtained SDF files to MOL2 format prior to MOE descriptor calculations, was carried out as per the standard workflow (Figure 5.3).
5.3.4.2.1 Preparatory Steps

For all workflows, initial 3D structures were generated, possibly more than one per compound, from the structures obtained subsequent to processing using ChemAxon's tools (Figure 5.3), using CORINA's stereoisomer generator ("STERGEN"); all stereoisomers generated were designed to correspond to different possibilities given the (possibly incomplete) stereochemistry specified for the original structures. Since the STERGEN protocol would not have generated multiple enantiomers for compounds with all stereocentres specified, yet for which absolute stereochemistry was not indicated (see section 5.3.3), enantiomers of the structures generated by CORINA for these compounds were obtained, by reflecting conformers in the x-y plane, using a Python 275 script employing the Pybel module.177 Taking into account those compounds for which multiple possible stereoisomers were generated, a total of 349 3D structures was obtained, for the 196 compounds in the hERG-196 dataset, subsequent to parsing the STERGEN generated structures via this final Python script.

These 349 structures were processed using Molecular Mechanics (section 5.3.4.2.2) and docking (section 5.3.4.2.3) procedures. For the structures obtained immediately prior to and subsequent to the application of Molecular Mechanics, an arbitrary stereoisomer was consistently selected per compound, for obtaining descriptors, where multiple possible stereoisomers were available for a given compound. However, since the highest scoring docking solution was selected per compound, a non-arbitrary possible stereoisomer was used for descriptor calculations subsequent to docking - this being the rationale for considering all possible stereoisomers for each compound.

5.3.4.2.2 Molecular Mechanics

Further processing of these 349 structures was carried out via MacroModel (version 9.7).361 Two sets of calculations were set up:
1. Local minimisations.
2. Conformational searches, designed to find the global minimum for each structure.

In both cases, Maestro (version 9.1.207)361 was used to set up the calculations, starting from the Pybel generated SDF, and to extract MacroModel generated structures in SDF format. Both sets of calculations employed the MMFF94s force-field,363 and a "GB/SA" continuum model of aqueous solvation.197,364 The "optimal" method was selected for all minimisations,* along with 10,000 maximum iterations and the default convergence criterion of 0.05 kJ mol⁻¹ Å⁻¹.361 Conformational searches employed the recommended361 Monte Carlo Multiple Minimum (MCMM) approach,366 with "intermediate" sampling of esters/amides, retention of mirror image conformations, 100 steps per rotatable bond, and a limit of 1,000 steps per molecule. Redundant conformers, identified based on a maximum atom deviation of 0.5 Å, were eliminated, along with any conformers more than 21.0 kJ mol⁻¹ above the current global minimum.

* In practice, this was a Truncated Newton Conjugate Gradient method.365

5.3.4.2.3 Docking

The Molecular Mechanics calculations were premised on the assumption that the global minimum of the free ligand in aqueous solution was a good approximation of the bioactive conformer (and, for the local minimisations, that this was the closest local minimum to the CORINA conformation). Indeed, a recent study suggested that a similar conformational search procedure to that described above could yield a global minimum which was a good approximation of the bioactive conformer.367 However, the bioactive conformer could be substantially different to this global minimum.19,332,368 Hence, as has elsewhere been proposed in the literature,90,369 docking was employed to (hopefully) obtain a better approximation of the bioactive conformer.

All 349 locally minimised structures obtained for the hERG-196 dataset were docked into the hERG homology model presented by Imai et al.,305 using GOLD (version 5.0.1),334,362 subsequent to preparing them in MOL2 format, with all atom and bond types assigned according to the GOLD conventions for functional groups, using CORINA. Docking runs started from the PDB format file presented by Imai et al.,305 containing an experimentally supported cisapride-hERG complex, obtained from molecular docking. This cisapride structure was extracted within GOLD and the binding site defined by all protein atoms (and associated residues) within 12 Å of this ligand. This was deemed appropriate since the binding site included the Ser624, Tyr652 and Phe656 residues (on all four subunits) highlighted by Imai et al.305 as playing key roles in binding potent hERG inhibitors. For each input structure, 40 docking solutions were obtained using ChemScore.370,371 This strategy was based upon* the protocol employed by Imai et al. to obtain experimentally supported docking solutions for hERG inhibitors.305 The generation of diverse solutions, without early termination, was specified. Automatic genetic algorithm settings were employed using the default search efficiency. In keeping with the aforementioned conformational searches, rotations about amide bonds were not considered.

* Unlike Imai et al., for computational expediency, additional docking runs using GoldScore were not carried out.

In addition to selecting the highest ranking solution per compound using ChemScore, the highest ranking solution upon re-scoring using RF-Score was also selected; in both cases, structures were extracted from the docking solution files written by GOLD using Python scripts 275 employing Pybel.177 RF-Score was trained and validated on the same partition of the PDBbind dataset as Ballester and Mitchell,266† prior to applying it to all docking solutions. Copies of these solutions were prepared as required for RF-Score using a Python script; the protein structure file required was obtained via reloading the protein data file specified in the GOLD configuration file and exporting in PDB format via Hermes version 1.4.1.334

† Save for some differences in software and hardware versions (see Appendix C), the instructions presented by these authors were followed. Minor changes were made to their code (see Appendix A); however, no differences were observed between the reported training and test set performances and those presented in Ballester and Mitchell's publication.266

Since the docking solutions written by GOLD were stripped of formal charges, the co-ordinates of the top ranking solutions (selected via ChemScore or RF-Score) were transferred, via Python code, to the corresponding SDF structures obtained via local minimisation, prior to converting to MOL2 format as per the standard workflow (Figure 5.3).

Figure 5.3 Standard workflow used to process ‘raw’ structures in all datasets modelled in the work presented in this chapter. N.B.: The structures obtained subsequent to each of the different pre-CORINA processing steps were also parsed via Pybel.177

5.3.4.3 Preparation of Files for Descriptor Calculations

All MOL2 files ultimately obtained from the workflows described above were used to compute MACCS keys and P_VSA descriptors (stored in SDF format), via MOE. Pybel 177 filtered versions of these MOE exported files and their MOL2 precursors were used to generate MACCS and P_VSA descriptor files and to compute all other descriptor sets presented in Table 5.1 respectively.
An analysis of redundancy/potentially incorrect structures in these datasets was carried out via comparing stereochemistry encoding InChIs computed (from all SDF files obtained subsequent to processing via ChemAxon's tools, yet prior to CORINA processing) using the standalone Windows executable (InChI version 1.03) available from the International Union of Pure and Applied Chemistry (IUPAC).178 The InChIs employed here for structure comparisons were generated using the default options. The structures were updated with these InChIs (as SDF fields) via a Python script employing Pybel. All structures with non-unique InChIs,* and all those for which erroneous stereochemistry assignment was indicated by CORINA (based on warnings/error messages in the CORINA logs related to stereochemistry) were removed from the relevant datasets. In the course of this work, 15 erroneous structures were also removed from the ThaiReg dataset based upon consulting SciFinder 372 or the literature sources provided by Thai and Ecker.296 The resulting MOL2 and SDF files used to compute descriptors are made available with this thesis (Appendix A).

* Save for those active compounds in the Doddareddy dataset with InChIs only matching those computed from inactive compounds corresponding to the seven 'false non-blockers' noted by Doddareddy et al.93 - these 'inactives' being separately specified for removal.

5.3.5 Statistical Comparisons

All values for all figures of merit presented in section 5.4 were computed per randomForest RNG seed-test set combination. Hence, five results were obtained per descriptor set for the Doddareddy dataset and 500 for all other datasets evaluated via 10 times 10-fold cross-validation (10-10CV). The variation observed for each set of descriptors motivated statistical evaluation of the results.

The null hypotheses proposed were that, across all possible test set results that could have been obtained for a dataset corresponding, in terms of its characteristics, to any given dataset which was actually evaluated:*
1. The overall mean difference in performance - as measured using a suitable figure of merit - for any pair of descriptor sets would have been zero.
2. For a given descriptor set, the overall mean performance would have been no better than a random predictor.

However, the variation in results obtained arises from two sources:
1. The selection of different training and test sets.
2. The internal randomness of Random Forest.

This author was unaware of any appropriate statistical test which would enable direct computation of p-values for either null hypothesis, given these distinct sources of variation.

Considering the first null hypothesis, for each set of results obtained for a given dataset using a single randomForest RNG seed, all 10-10CV results for a given pair of descriptors were compared using the "corrected repeated k-fold CV test" advocated by Bouckaert and Frank.373 This test is designed to correct for the underestimation of p-values generated from the standard paired t-test 226 when applied to comparison of repeated cross-validation results - due to the lack of independent sampling of train/test partitions arising from the overlap of cross-validation training sets and, when repeated, test sets.373 Supposing rejection of the null hypothesis for all RNG seeds, the overall mean difference was evaluated for statistical significance by applying a standard paired t-test to the cross-validation mean results, for each RNG seed, as it is reasonable to suppose these results were truly independent. For the Doddareddy dataset, only the latter test was applied; given the large size of the training and test set used for this dataset, it was assumed that the variation in results associated with different training and test set selections would be negligible. In all cases, p-values corresponding to a two-tail test 226 were computed.
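The corrected test statistic can be sketched as follows. This is an illustrative Python sketch rather than the code used in this work, and the function name is hypothetical. It assumes the commonly cited form of the correction, in which the variance of the k × r paired differences is scaled by (1/(k·r) + n_test/n_train) rather than by 1/(k·r) as in the standard paired t-test. The p-value itself would then be obtained from Student's t-distribution with the returned degrees of freedom (for instance via R's pt function), which is not reproduced here.

```python
import math


def corrected_repeated_cv_t(differences, n_train, n_test):
    """Corrected repeated k-fold CV t statistic for paired differences.

    differences: one paired difference in the figure of merit per
                 fold/repeat combination (k * r values in total).
    n_train, n_test: training and test set sizes within one fold.
    Returns (t statistic, degrees of freedom).
    """
    n = len(differences)
    if n < 2:
        raise ValueError("at least two paired differences are required")
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    if variance == 0.0:
        raise ValueError("zero variance: the statistic is undefined")
    # The n_test / n_train term inflates the variance estimate to account
    # for the overlap between cross-validation training (and test) sets.
    t = mean / math.sqrt((1.0 / n + n_test / n_train) * variance)
    return t, n - 1
```

For 10-fold cross-validation, n_test/n_train is 1/9, so the corrected statistic is always smaller in magnitude than the uncorrected paired t statistic computed from the same differences.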
For all tests, the figures of merit considered for classification and regression problems were the MCC and R² respectively.

* Systematic differences were expected between the datasets, that could lead to genuine differences in the (relative) performances of the descriptors - hence, it was deemed appropriate to separately test these hypotheses for each dataset.

Considering the second null hypothesis, one-tail 226 versions of the statistical tests previously described were applied to the results obtained from 10-10CV.373 Since, for a random predictor, the mean MCC and Pearson's correlation coefficient values are expected to be zero,226,230 these values were assessed for statistical significance for classification and regression problems respectively. For the Doddareddy dataset, the chi-squared p-values (Chapter 2, section 2.6.4.1) corresponding to the MCC were computed for each RNG seed. Supposing the (mean) results obtained for each RNG seed were all statistically significant, the overall mean MCC was assessed to be statistically significantly greater than zero based on a standard one-tail t-test.226

Ultimately, a considerable number of comparisons regarding the relative and absolute performance of the descriptor sets were made. Since a decision to reject a given null hypothesis was made upon the basis of observing a sufficiently low p-value (less than 0.05), the expected proportion of rejections which were erroneous (the "false discovery rate", or FDR) 374 could well exceed 0.05 if the original p-values were used. It is appropriate to control the FDR for a set (or "family") of p-values, where any statistically significant finding amongst the family points towards an overall conclusion;374 here the overall conclusions, depending upon the null hypothesis rejected, would be that there were genuine differences in the performance of the descriptor sets - or that a given set of descriptors yielded better than random performance - under some circumstances.
Hence, all p-values obtained, across all datasets, from the tests applied for pairwise comparisons, were treated as a family, whilst only those p-values obtained for a given descriptor set were treated as a family for the purpose of rejecting the second null hypothesis. In order to restrict the FDR to 0.05 or below, for a given family of comparisons, Benjamini and Yekutieli's method375 was applied, using the R function p.adjust( ), 225 to generate adjusted p-values, with differences only declared statistically significant when associated with adjusted p-values less than 0.05. 303

To summarise: multiple adjusted p-values were computed for any given comparison between pairs of descriptor sets or between the results obtained for a single descriptor set and the 'zero-baseline' expected for a random predictor. Differences in overall mean MCC, R² or Pearson's correlation coefficient values - for a given dataset - were only declared statistically significant if all relevant comparisons were deemed statistically significant (adjusted p-values < 0.05).

The results of all statistical comparisons, including a summary of all differences that were deemed statistically significant, are presented in CSV files made available with this thesis (Appendix A).
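For illustration, the Benjamini and Yekutieli adjustment applied to one family of p-values can be sketched in Python as below. This is not the code used in this work (which called R's p.adjust); the function name is hypothetical and the sketch is intended only to mirror the arithmetic of that adjustment: each p-value is multiplied by m·c(m)/rank, where m is the family size, c(m) is the sum of 1/i for i from 1 to m and rank is the ascending rank of the p-value, with monotonicity then enforced from the largest p-value downwards and values capped at one.

```python
def adjust_by(p_values):
    """Benjamini-Yekutieli adjusted p-values, returned in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-sum penalty that makes the procedure valid under dependence.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Visit p-values from largest to smallest, tracking the running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0  # caps adjusted values at one
    for position, index in enumerate(order):
        rank = m - position  # ascending rank of this p-value
        candidate = p_values[index] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


# A difference would be declared significant only if its adjusted
# p-value is below 0.05, as described in the text.
family = [0.01, 0.04, 0.03]
significant = [p < 0.05 for p in adjust_by(family)]
```

In this small example none of the three comparisons would be declared significant after adjustment, although all three raw p-values are below 0.05.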
5.3.6 Dataset Evaluation

In order to facilitate interpretation of the relative performance of the assessed descriptor sets on the datasets presented (section 5.3.3), clustering analysis was employed as an indication of the diversity of the datasets (and, for the classification datasets, their separability into actives and inactives) in terms of molecular shape and lipophilicity - the latter commonly associated with nonspecific receptor binding.338 Agglomerative hierarchical clustering 270 was employed: descriptor vector representations of all molecules in the dataset defined the first set of ("singleton")331 clusters; the closest pair of these were replaced with a new cluster corresponding to the geometric mean of their descriptor vectors - this procedure being carried out iteratively. USR (ChemAxon logD*) descriptors were employed to characterise the datasets in terms of molecular shape (lipophilicity). All descriptor calculations and clustering analyses were carried out using a combination of R 225 and Python 275 code made available with this thesis (Appendix A).

The number of clusters (excluding "singletons") formed prior to the smallest inter-cluster distance exceeding a specified cut-off was used as a measure of dataset diversity. The distance metric employed was one minus the similarity metric defined by Ballester and Richards for the comparison of USR descriptor vectors 263 - hence it took values between zero and one. Following Schreyer and Blundell, it was supposed that conformations with USR similarities below 0.8 could be considered 'significantly different';368 the distance cut-off was set to 0.2 for both sets of descriptors for consistency.
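The distance metric can be illustrated with the following Python sketch, which is not the code made available with this thesis. It assumes the usual form of the USR similarity for 12-element descriptor vectors, namely the reciprocal of one plus the mean absolute difference between corresponding descriptor values; the function names are hypothetical, and the mean is taken over the vector length so that the same function could be applied to vectors of other lengths.

```python
def usr_similarity(descriptors_a, descriptors_b):
    """Similarity in (0, 1]: 1 / (1 + mean absolute descriptor difference)."""
    if len(descriptors_a) != len(descriptors_b) or not descriptors_a:
        raise ValueError("descriptor vectors must be non-empty and equal length")
    total = sum(abs(a - b) for a, b in zip(descriptors_a, descriptors_b))
    return 1.0 / (1.0 + total / len(descriptors_a))


def usr_distance(descriptors_a, descriptors_b):
    """Distance used for clustering: one minus the similarity."""
    return 1.0 - usr_similarity(descriptors_a, descriptors_b)


def exceeds_cutoff(descriptors_a, descriptors_b, cutoff=0.2):
    """True if the pair is further apart than the clustering cut-off."""
    return usr_distance(descriptors_a, descriptors_b) > cutoff
```

A distance cut-off of 0.2 therefore corresponds to a similarity of 0.8, the threshold below which conformations were supposed to be 'significantly different'.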
The separability of classes was indicated via application of an analogous approach to that proposed by Brown and Martin:331 for both the original dataset, and activity permuted versions, the number of "active clusters" - when clustering was terminated upon formation of a specified minimum number of clusters - was defined as the number of ('non-singleton') clusters containing at least one active compound. The proportion (p(a)) of compounds assigned to these clusters (the "active cluster subset") which were active was computed, and compared to the proportion (p(0)) of active compounds in the entire dataset. If the p(a) values obtained, upon specification of increasingly large numbers of clusters,† corresponding to increasingly fewer compounds in the "active cluster subset", for the real data exceed those obtained for the activity permuted data, this indicates that the descriptor set employed for clustering (partially) separates actives and inactives in the dataset.

* In keeping with Shamovsky et al., logD was considered a better measure of lipophilicity for ionisable compounds than logP.338 The structures available immediately prior to processing via CORINA (see Figure 5.3) were used to compute logD predictions. It was not possible to compute logD estimates for nine of the 826 active entries in the Doddareddy training set; hence, these were excluded when clustering based on logD.

† Here, the minimum number of clusters was set to one, then increased - insofar as was possible - from 30 in steps of 30, for all datasets.

5.4 Results and Discussion

All of the following plots were generated using R.225 All box-and-whisker plots summarise the results across all RNG seeds and across all train/test partitions and were generated such that the whiskers extend to the data extremes, with the solid lines denoting the median and the box denoting the upper and lower quartiles.376 The means are denoted by the centre of the circles superposed on the box-and-whisker plots, with error bars extending to plus and minus the standard error of the mean.226 Unless otherwise noted, the coefficient of determination (R²) and Matthews Correlation Coefficient (MCC) are used to summarise the overall performance of descriptor sets when used for regression and classification modelling respectively. (See Chapter 2, section 2.6.4 for definitions of these figures of merit.) More detailed results, referred to below, are made available with this thesis (Appendix A).

5.4.1 Effects of Structural Refinement

In the following plots, the overall performances of all descriptor sets for regression (Figure 5.4) and classification (Figure 5.5) modelling of the hERG-196 and hERG-196:Subset datasets respectively are summarised for all sets of 3D structures generated via the different workflows described in section 5.3.4.2.

Figure 5.4 R² values obtained from 10-10CV (five RNG seeds for randomForest) on the hERG-196 dataset, with different descriptor sets calculated from structures obtained: (A) prior to Molecular Mechanics calculations, (B) from local minimisations, (C) from global minimisations, (D) from docking (ChemScore selections), (E) from docking (RF-Score selections). The black lines and circle centres denote the median and mean results respectively.

Figure 5.5 Corresponding MCC values (c.f. Figure 5.4) obtained on the hERG-196:Subset dataset.

Considering the hERG-196 regression results, both the results presented in Figure 5.4, and all corresponding plots for Pearson's (Spearman's) correlation coefficient, the Root Mean Square Error (RMSE) and the mean absolute error (MAE), indicate that the USR descriptor set is the worst. The USR descriptor set consistently yielded the lowest overall mean R², with all statistically significant pairwise differences in these values involving USR results. This is not so surprising.
Firstly, the importance of chemical information was highlighted in recent classification and similarity searching studies.329,377 Secondly, it is reasonable to suppose that changes in molecular shape might be more useful in discriminating between clearly separated actives and inactives, rather than in controlling more subtle gradations in bioactivity that would need to be captured by a regression model.

In contrast, the classification results (Figure 5.5) obtained using USR are not consistently ranked as worst performing in terms of their overall median (or mean) MCC - which may reflect the ability (as suggested above) of molecular shape to specifically discriminate actives from inactives.

Neither the computation of structures obtained from Molecular Mechanics nor docking yielded systematic improvements in the performance of the 3D descriptor sets.* Indeed, in most cases, the 'refinement' of the original structures obtained from STERGEN led to lower overall mean R² (MCC) values being obtained using the 3D descriptor sets (see Figure 5.6 and Figure 5.7). Furthermore, whilst the performance of all descriptor sets for both classification and regression was otherwise deemed statistically better than random, the mean MCC value for USR (and USR+MACCS) was not deemed statistically significantly better than random when descriptors were computed from ChemScore or RF-Score selected (and globally minimised) structures.

* Those descriptor sets (USR+MACCS and USR-ATFP) comprising both 3D and 2D components were considered 3D descriptor sets.

Figure 5.6 Mean R² values obtained, for the hERG-196 dataset, using the following descriptors, computed from different 3D structures: (A) ATUSR, (B) USR-ATFP, (C) USR+MACCS, (D) USR.

Figure 5.7 Mean MCC values obtained, for the hERG-196:Subset dataset, using the following descriptors, computed from different 3D structures: (A) ATUSR, (B) USR-ATFP, (C) USR+MACCS, (D) USR.
This could be because none of the 'refinement' procedures typically yielded an improved approximation of the bioactive conformer. To gain some partial insight into this, the structures of cisapride and E-4031 from which descriptors were computed for the results presented above were compared to the docking poses presented by Imai et al.305 Since Imai et al. presented experimental results in support of these poses, it was supposed that their poses could be taken as good approximations of the actual bioactive conformers. The structures used for descriptor calculations were aligned to these docking poses, extracted from the PDB format files presented by Imai et al. using OpenBabel (version 2.3.0a), and the Root Mean Squared Deviation (RMSD) between the heavy atoms computed using a Python script employing OpenEye's OEChem Toolkit (version 1.7.7).378 USR similarities were also computed. Visual (quantitative) comparisons of the aligned structures are presented in Figure 5.8 (Table 5.2).

It should be noted that the protonation states assigned by Imai et al. differed from those assigned for descriptor calculations in the present work and a different stereoisomer of cisapride was used for descriptor calculations (save for the RF-Score selected structure). As USR descriptors are computed from all atoms (i.e. both heavy and hydrogen atoms), the differences in protonation states may partially explain the observation that USR similarity did not always increase when the RMSD decreased. The increase (decrease) in RMSD (USR similarity), relative to the original STERGEN structure, observed upon selecting a structure from a global conformational search or a top scoring docking pose for E-4031 lends credence to the suggestion that the refinement procedures did not typically yield an improved approximation of the bioactive conformer.
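As a minimal sketch of the RMSD calculation (the script used in this work employed OEChem, which is not reproduced here), the following Python function computes the RMSD between two coordinate sets. It assumes that the structures have already been aligned, that only heavy atom coordinates are passed in, and that atoms appear in the same order in both lists; the function name is hypothetical.

```python
import math


def heavy_atom_rmsd(coords_a, coords_b):
    """RMSD (same units as the input) between matched, pre-aligned atoms.

    coords_a, coords_b: sequences of (x, y, z) tuples for the heavy atoms,
    listed in the same atom order for both structures.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Because hydrogen atoms are excluded here but contribute to USR descriptors, the two measures need not move together, as noted above.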
It is also worth noting that low Pearson's correlation coefficients (ChemScore: 0.364, RF-Score: 0.287), for the poses selected for descriptor calculation, were obtained between the average pIC50 values (used for QSARs) and the docking scores for the hERG-196 dataset. The much lower performance of RF-Score than that previously reported for the PDBbind dataset, comprising experimental protein-ligand complexes,266,267 further suggests that the docking procedures were (typically) unable to generate improved approximations of the bioactive conformers. Indeed, there are a number of particular difficulties inherent in searching for the bioactive conformer via docking into a hERG model; some of these may be scoring function specific, as discussed by Imai et al. - who did not use docking scores to select their experimentally supported docking poses305 - and further problems may arise due to the possibility that hERG blockers may bind to multiple states of and/or multiple sites within the ion channel (see Chapter 4, section 4.6.3).

Figure 5.8 Images of Imai et al.305 docked structures for cisapride (A) and E-4031 (G) and corresponding aligned structures obtained via: STERGEN (B,H), local minimisation (C,I), global minimisation (D,J), ChemScore selection (E,K) and RF-Score selection (F,L). All molecular images were generated using VIDA (version 4.1.1).378

Compound     Origin of Aligned Structure    RMSD (Å)    USR Similarity
Cisapride    STERGEN                        2.44        0.65
Cisapride    Local minimisation             2.77        0.55
Cisapride    Global minimisation            2.47        0.69
Cisapride    ChemScore selection            2.01        0.55
Cisapride    RF-Score selection             2.24        0.55
E-4031       STERGEN                        1.29        0.79
E-4031       Local minimisation             1.27        0.77
E-4031       Global minimisation            2.14        0.72
E-4031       ChemScore selection            2.11        0.54
E-4031       RF-Score selection             2.19        0.52

Table 5.2 RMSD values (computed from heavy atoms) and USR similarities (computed from all atoms) upon alignment of the structures obtained here to the docking poses presented by Imai et al.305

5.4.2 Relative Performance Across Different Datasets

The overall performance of the different descriptor sets for regression (Figure 5.9) and classification (Figure 5.10) modelling of different datasets is summarised below.

Figure 5.9 R² values obtained for: (A) hERG-196 dataset, (B) ThaiReg dataset. In both cases, the structures used for descriptor calculations were obtained from the standard workflow (see 5.3.4.1). The black lines and circles denote the median and mean results respectively.

Figure 5.10 MCC values obtained for: (A) hERG-196:Subset dataset, (B) ThaiReg:Subset dataset, (C) Doddareddy dataset and (D) Schattel dataset. In all cases, the structures used for descriptor calculations were obtained from the standard workflow (see 5.3.4.1). The black lines and circles denote the median and mean respectively.

Once more, the regression results obtained for USR were evidently the worst. With USR, the overall mean R² values (Figure 5.9) and Pearson and Spearman correlation coefficients were the lowest, and the overall mean RMSE and MAE values the highest (Appendix A), for both datasets considered. Likewise, the only statistically significant pairwise differences in overall mean R² involved USR results (Appendix A).
Whilst, in terms of both the median and mean MCC value (Figure 5.10), the performance of USR was not the worst for the hERG-196:Subset dataset, its performance in terms of the mean MCC was the worst for all other classification datasets. Nonetheless, save for the Doddareddy dataset, none of the other descriptor sets was determined (Appendix A) to give significantly different mean MCC values to those obtained with USR.* This (given the statistically significant pairwise differences observed for the corresponding regression datasets, noted above) still lends support to the proposal that USR might be intrinsically better at discriminating clearly distinct actives from inactives than in generating regression models.

* When the structures used for descriptor calculations were prepared via the standard workflow (section 5.3.4.1).

Nonetheless, for both classification and regression modelling, the performance of USR descriptors was deemed statistically significantly better than random for all datasets when structures were obtained via the standard workflow (see section 5.3.4.1). Indeed, this was the case for all descriptor sets.

There is no indication that the failure of USR(-like) descriptor sets to (significantly) outperform the other descriptor sets, and the typically poor performance of USR in particular, reflects limited diversity of molecular shapes in these datasets.* As indicated by the results in Table 5.3, the hERG-196:Subset dataset, the only dataset for which the median (and mean) MCC for USR was not the worst, appears to be the least diverse in terms of molecular shape of the classification datasets.
* For a dataset composed of very similarly shaped molecules, differences in molecular shape could not account for significant differences in bioactivity.

Descriptor Set    Dataset Type      Dataset              Number of Clusters
USR               Classification    Doddareddy(Train)    209
                                    Doddareddy(Test)     36
                                    Schattel             35
                                    ThaiReg:Subset       31
                                    hERG-196:Subset      21
                  Regression        ThaiReg              37
                                    hERG-196             31
logD              Classification    Doddareddy(Train)    51
                                    Doddareddy(Test)     30
                                    Schattel             21
                                    ThaiReg:Subset       21
                                    hERG-196:Subset      16
                  Regression        hERG-196             21
                                    ThaiReg              19

Table 5.3 Numbers of clusters obtained, upon clustering using predicted logD (or USR descriptors), computed prior to the application of CORINA (or from structures processed via the standard workflow) with a distance cut-off of 0.2 (see section 5.3.6). For both descriptor sets, the classification and regression datasets were ranked in order of decreasing number of clusters - supposed to correspond to decreasing diversity.

For all but the ThaiReg:Subset classification dataset, there would appear to be some indication that the actives and inactives are separable in terms of lipophilicity. This is suggested by the p(a) values obtained upon clustering the original datasets using predicted logD values being, commonly, either above or at the top of the range of those obtained when applying the same clustering procedure to the corresponding activity permuted datasets (Figure 5.11). It is possible that separability of the actives and inactives for the Doddareddy dataset due to differences in lipophilicity could partially account for the good discriminative ability of the P_VSA descriptor set for this dataset (Figure 5.10), since some of its constituent descriptors are based on atomic contributions to logP. 295 Likewise, it is possible that the lack of observable separability, in terms of lipophilicity, for the ThaiReg:Subset dataset contributed to the relatively low performance of the P_VSA descriptor set (Figure 5.10).
Nonetheless, for this latter dataset, a 2D descriptor set (ATFP) gave the highest mean MCC. This latter observation suggests that failure of the 3D descriptor sets to (significantly) outperform the 2D descriptor sets, for the classification datasets, cannot simply be explained by the actives and inactives being separated in terms of lipophilicity.

Curiously, whilst the results prior to this point on the graph indicate separability of actives and inactives in terms of lipophilicity, there is a noticeable decrease in p(a) values (Figure 5.11) when clustering based on predicted logD is terminated for the Doddareddy training set upon formation of 1290 clusters or more (corresponding to 708 structures or fewer in the active cluster subset).* This result does not appear to be readily explainable.

It is important to remember that the quality of the logD predictions made by the tool employed here (see section 5.3.6), 379 in keeping with other methods, is unlikely to be similar across the whole of chemical space. 338 The (presumably variable) prediction error associated with the logD model employed potentially hampers the discernment of separability, on the basis of lipophilicity, of the actives and inactives in the datasets considered in this work.

* Detailed results tables corresponding to, and full size versions of the plots presented in, Figure 5.11 and Figure 5.12 are made available with this thesis (Appendix A).

Figure 5.11 p(a) versus number of structures in the active cluster subset when clustering the following datasets using predicted logD: (A) Doddareddy (Test), (B) Doddareddy (Train), (C) ThaiReg:Subset, (D) hERG-196:Subset, (E) Schattel.
N.B.: The red (blue) points denote values obtained for the real (activity permuted) data; the green points denote a theoretical upper bound to p(a) - computed as the total number of actives in the dataset divided by the current number of structures in the active cluster subset.331

Figure 5.12 Corresponding plots to those presented in Figure 5.11, based on clustering using USR.

5.4.3 Effectiveness of Different Approaches to Combining Chemical Information with USR

An important question here concerns whether or not the ATUSR approach to combining USR with chemical information is (potentially) superior to the "hybrid"175 approach of combining USR with a 2D chemical fingerprint - as was initially hypothesised. To remove the confounding factor of different atom types, it is therefore appropriate to specifically compare ATUSR to the USR-ATFP, ATFP and USR descriptor sets. In the following discussion, all references to relative performance should be understood to solely refer to the relative performance of these descriptor sets (albeit no statistical quantities were recalculated) - where performance was measured in terms of the overall mean R2 or MCC, as applicable.

If the original hypothesis was correct, the following ranking would be expected: ATUSR > USR-ATFP > USR/ATFP. Across all sets of results (obtained for all classification and regression datasets, and all procedures for obtaining 3D structures), the ATUSR descriptor set only yielded the best performance in two cases: for the Schattel and hERG-196:Subset classification datasets, when descriptors were computed from locally minimised structures for the latter. However, in neither case were the differences in descriptor set performance statistically significant. Indeed, the expected ranking (see above) was only observed for the Schattel dataset.
The USR-ATFP "hybrid" descriptor set outperformed all other descriptor sets under consideration for the Doddareddy and ThaiReg datasets as well as for the hERG-196 and hERG-196:Subset datasets under some circumstances.* However, considering those sets of results where the best performance was obtained using the USR-ATFP descriptor set, statistically significant differences between the performance obtained using USR-ATFP and any other descriptor set were only observed for the regression datasets and the Doddareddy dataset. Moreover, all such statistically significant differences involved USR descriptors - i.e. the performance of the USR-ATFP or ATUSR descriptor set was not deemed statistically significantly different to the performance of ATFP descriptors.

* These circumstances being when the 3D structures used to compute the relevant descriptors were obtained from the standard and STERGEN procedures, as well as the local minimisation procedure for the hERG-196 dataset.

Hence, these results lend inconclusive support to the proposition that the ATUSR descriptor set may yield the best performance under some circumstances. Indeed, these results lend only weak support to the proposition that the combination of USR with chemical information leads to improved performance over simply considering chemical information in isolation. This point is reinforced by the observation that statistically significant differences in performance were never observed between the MACCS and USR+MACCS descriptor sets.

Of course, one caveat to note is that the statistical testing protocol employed here (see section 5.3.5) was focussed on controlling Type I error - i.e. erroneously declaring the performance of two descriptor sets to be genuinely different when the observed difference arose from chance. 226,373 This was deemed appropriate as, arguably, the onus should be on demonstrating that new methods genuinely perform better than those currently in widespread use.
The use of adjusted p-values, however, is likely to lead to higher Type II error 303 - i.e. erroneously declaring differences in performance to be statistically insignificant when genuine differences in performance exist. 226,373

5.5 Conclusions

This chapter described the development of a novel set of 3D descriptors (ATUSR) encoding molecular shape and the three-dimensional distribution of potential macromolecular interaction sites. This was achieved by adding chemical information to the Ultrafast Shape Recognition (USR) descriptor set developed by Ballester and Richards263,264 in a conceptually distinct fashion to the approach previously proposed by Cannon et al.175

Initial experiments were carried out to evaluate the effectiveness of ATUSR for both regression and classification protein-ligand binding QSAR. These evaluations included comparisons with standard 2D descriptors as well as comparisons with a combination of USR and a 2D fingerprint (ATFP) based on the same atom types as ATUSR: USR-ATFP.

In keeping with the findings of Cannon et al., 377 the results obtained suggested that combining USR with 2D chemical information could lead to improved performance, with respect to both USR and the original 2D descriptor set, for certain classification and regression tasks under some circumstances. In one case, the results suggested ATUSR could lead to further improved performance. However, in no cases were these improvements deemed statistically significant.

Indeed, many of the descriptor sets (both 2D and 3D) assessed appeared to perform comparably for both regression and classification QSAR. One notable exception, however, was USR. Save for the results obtained on one classification dataset, the overall performance of USR was the worst of all descriptors evaluated, emphasising the importance of encoding chemical information. Nonetheless, all descriptor sets almost always yielded statistically significantly better performance than that expected for a random model.
The use of various Molecular Mechanics and docking approaches to yield, ostensibly, improved representations of the bioactive conformer, for one of the datasets employed in this work, did not lead to the expected improvements in performance for the 3D descriptor sets. However, this appeared to reflect the difficulty of employing these methods to find the bioactive conformer for hERG inhibitors.

For the datasets considered in the current work, it appeared that an increased diversity of molecular shapes did not have an effect on the relative performance of USR(-like) descriptor sets. A particularly interesting issue, however, is that biases in the datasets employed could have favoured 2D descriptor sets with respect to 3D descriptor sets. Qualitative indications were obtained, for some of the classification datasets, that the actives and inactives were partially separable in terms of lipophilicity. This author suggests that this might partially account for the good performance of the 2D descriptor sets relative to the 3D ones on such datasets. However, further analysis of analogue bias (an issue which there was not time to investigate in the work presented in this chapter)336,339 would also be warranted, as this may lead to artificially high observed performance for 2D descriptor sets.330 Some specific proposals are presented in Chapter 7.

The code for the ATUSR descriptor set, and related descriptor sets, is made publicly available to the community (Appendix A), to facilitate further evaluations (or extensions) of these methods.

Chapter 6 Predicting Drug Induced Torsades de Pointes Using Biological Descriptors

This chapter describes the development of QSAR approaches for discriminating between compounds with the potential for inducing Torsades de Pointes (TdP) and those without, denoted TdP+ and TdP- compounds respectively.
The novelty introduced in this work is the incorporation of predicted pIC50 values for biologically relevant cardiac ion channels as additional/alternative descriptors ('IC-descriptors'). A variation on this idea, namely the incorporation of experimental pIC50 values as additional descriptors, was also considered. The former types of descriptors are denoted 'predicted IC-descriptors' and the latter 'experimental IC-descriptors'. The former represents an assessment of a purely in silico approach, whilst the latter explores how useful in vitro data might be in enhancing QSAR approaches to predicting drug induced TdP.

Previously presented approaches to predict Torsadogenic potential in silico, with a particular emphasis on QSAR studies, are first discussed. The QSAR methods used in the current work to identify TdP causing agents, along with the approaches used to generate IC-descriptors, are then discussed. The datasets used to build and validate TdP models and those used to generate IC-descriptors are subsequently presented - and the challenges associated with modelling this data highlighted. The insights obtained to date into the performance of IC-descriptors are discussed. This chapter concludes by emphasising the need for future work.

6.1 Introduction

The first published study to employ QSAR approaches for directly discriminating between compounds assessed as being TdP causing (TdP+) and supposed non-TdP causing agents (TdP-) was presented by Yap et al. in 2004.27 They developed their models using two sets of 'standard' descriptors, in conjunction with various Machine Learning algorithms.27 Around the same time, Xue et al. employed Support Vector Machines (SVMs), and a filtered collection of commonly used descriptors, to model a similarly derived dataset. 122 Since then, a number of publications have reported models generated on the datasets made available by these authors.
A variety of Machine Learning methods and feature selection strategies were considered in these studies - although, in most cases, the same descriptors were employed as per the original publications. 124–127,199

More recently, Frid and Matthews reported the construction of a dataset of 1,632 approved drugs, with data from a variety of sources used to categorise compounds as TdP+ or TdP- (this author's terminology).33 Models for discriminating between TdP+ and TdP- compounds were developed using various software packages commonly employed in computational toxicology. 29 Similarly, Clark and Wiseman recently disclosed a dataset of more than 1,000 drugs identified as TdP+ or TdP-, principally using statistical criteria. They developed binary classification models using descriptors based on a dictionary of molecular fragments. 28

It should be noted that all previously referenced studies sought to relate general purpose descriptors, directly encoding molecular structure, to the Torsadogenic potential of a compound. In contrast, Gepp and Hutter generated models incorporating (amongst a pool of general purpose descriptors) descriptors based on a pharmacophoric SMARTS pattern for human-ether-a-go-go related gene (hERG) ion channel blockers and similarity to the potent hERG inhibitor astemizole. 128

Whilst not seeking to directly predict Torsadogenic potential per se, studies by Yang et al. 107 and Obiol-Pardo et al. 104 simulated the effects of drug compounds on the QT interval (see Chapter 1, section 1.3.3). Interestingly, the latter incorporated predicted pIC50 values for two cardiac ion channels. Last year (2011), Mirams et al. used experimentally measured IC50 values for various cardiac ion channels, along with other experimental parameters and simulation derived parameters, as descriptors to build multiclass models for Torsadogenic potential.
159 The current work builds upon these earlier attempts to incorporate biologically relevant information into the prediction of Torsadogenic potential. This work extends these earlier investigations by seeking to use predicted pIC50 values as descriptors in the TdP models. The resulting cardiotoxicity predictions are evaluated on a larger number of compounds than Obiol-Pardo et al. 104 and Mirams et al. 159 Moreover, the use of conventional (i.e. purely structural) QSAR descriptors, as an alternative/in addition to using (predicted or experimental) pIC50 values is assessed, to the best of this author's knowledge, for the first time.

6.2 Modelling Approaches Employed Here

Binary classification models for discriminating between TdP+ and TdP- compounds were built using Random Forest. Models were generated using standard QSAR descriptors and (predicted) pIC50 values for (different combinations of) ion channels whose inhibition was deemed to be relevant to Torsadogenic potential. The former set of descriptors are referred to as 'structural-descriptors', with the latter referred to as 'IC-descriptors'. Models were generated using both types of descriptors alone and in combination. When experimentally obtained pIC50 values were used as descriptors, these are denoted 'experimental IC-descriptors'. When predicted pIC50 values were employed, these are denoted 'predicted IC-descriptors'. Predicted IC-descriptors were generated using Random Forest regression models, based on the same structural-descriptors used for directly modelling TdP.

6.2.1 Structural-Descriptors

The following combination of structural-descriptors was used to build models for TdP and the ion channel models: Molecular Linear Free Energy Relationship Descriptors (MLFER), 380 McGowan's characteristic volume, 381 as well as a fragment-based bit-string descriptor similar to that used by Clark and Wiseman to develop their models for Torsades de Pointes.
28 The MLFER descriptors correspond to the 'overall' (or "summation") hydrogen bond acidity (MLFER_A), two descriptors (MLFER_BH, MLFER_BO) representing the 'overall' (or "summation") hydrogen bond basicity, the combined dipolarity/polarizability (MLFER_S), the excess molar refraction (MLFER_E) and the gas-hexadecane partition coefficient (MLFER_L). These descriptors, along with McGowan's characteristic volume, were calculated using PaDEL-Descriptor (version 2.5). 382,383 The MLFER descriptors represent an implementation of the fragment-based approach, developed by Platts et al., 380 for calculating the experimentally derived solute-descriptors proposed by Abraham 384 to capture different contributions to solute-solvent interactions (personal correspondence with Dr Chun-Wei Yap). As per the corresponding set of descriptors used by Abraham 384 and Yap, 27 the combined set of MLFER descriptors and McGowan's volume is referred to in this thesis as the Linear Solvation Energy Relationship (LSER) descriptors.

Work by Abraham indicated that these LSER descriptors capture properties relevant for protein-ligand binding. 384 Since both plasma concentrations (which are related to solubility) and protein interactions are relevant for Torsadogenic potential (see Chapter 1, section 1.3.3), these were appropriate descriptors to use for directly modelling TdP and constructing ion channel models. Indeed, the TdP models developed by Yap et al. were based on a subset of these descriptors - albeit, their implementation was not exactly the same as used here (personal correspondence with Dr Chun Wei Yap). 27

In their 2009 study, Clark and Wiseman used 321 molecular fragments to define a bit-string descriptor encoding the presence or absence of these fragments in a molecule. 28 In earlier work, an occurrence count descriptor based on a subset of these fragments was used to generate regression models for various molecular properties, 385 including hERG inhibition.
222 In this work, the 319 pseudo-SMARTS patterns presented by Clark and Wiseman 28 were adapted and a bit-string descriptor implemented based upon the resultant SMARTS patterns. The bit-vector was implemented in Python, using the Pybel module. 177 This Python code, along with a description of the modifications made to the pseudo-SMARTS patterns, is made available with this thesis (Appendix A). There are some discernible differences between this bit-string descriptor and that implemented by Clark and co-workers as illustrated by an example in Figure 6.1. 222

This bit-string was able to encode the presence of moieties which could engage in specific binding within an ion channel pore. For example, it encoded the possible presence of a tertiary amine, a typical feature of potent hERG inhibitors, potentially facilitating non-classical hydrogen bonds within the ion channel pore, 305 as discussed in Chapter 4. Since the specific interactions allowed for by such fragments would not be encoded by the LSER descriptors, the addition of these bit-string descriptors was deemed appropriate for generating ion channel models as well as directly predicting TdP.

The structural-descriptors presented here do not encode stereochemistry - as evidenced by Platts et al., 380 Abraham and McGowan, 381 and the pseudo-SMARTS patterns obtained from Clark and Wiseman. 28 This informed the processing of the datasets described in section 6.3.

Figure 6.1 Fragments identified in cathinone by Clark and co-workers that were likewise found (green) and not found (red) by this author's implementation of their bit-string descriptor.

Figure 6.2 SMARTS patterns matching a tertiary amine and the corresponding amide.

6.2.2 Descriptor Pre-processing

Prior to generating the ion channel and TdP models, all descriptors with constant values in the training set were removed.
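The pre-processing in this section (removal of descriptors that are constant in the training set, followed by the min-max scaling of Equation 6.1) can be illustrated with a minimal sketch. This is not the code distributed with the thesis (Appendix A); the function names `fit_preprocessing` and `apply_preprocessing`, and the use of plain Python lists for the descriptor matrix, are assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def fit_preprocessing(
    train: Sequence[Sequence[float]],
) -> Tuple[List[int], List[float], List[float]]:
    """Find non-constant columns and their training-set minima and maxima."""
    keep, mins, maxs = [], [], []
    for j in range(len(train[0])):
        column = [row[j] for row in train]
        lo, hi = min(column), max(column)
        if hi > lo:  # descriptors constant in the training set are dropped
            keep.append(j)
            mins.append(lo)
            maxs.append(hi)
    return keep, mins, maxs


def apply_preprocessing(
    data: Sequence[Sequence[float]],
    keep: Sequence[int],
    mins: Sequence[float],
    maxs: Sequence[float],
) -> List[List[float]]:
    """Scale retained columns as (x - min) / (max - min), using training-set values."""
    return [
        [(row[j] - lo) / (hi - lo) for j, lo, hi in zip(keep, mins, maxs)]
        for row in data
    ]


# Hypothetical three-descriptor training set; the middle column is constant.
train = [[1.0, 5.0, 0.0], [3.0, 5.0, 1.0], [2.0, 5.0, 0.0]]
keep, mins, maxs = fit_preprocessing(train)
scaled_train = apply_preprocessing(train, keep, mins, maxs)
```

Because the minimum and maximum are taken from the training set only, scaled values for new compounds are not guaranteed to fall within the 0 to 1 range.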
All descriptor values ($x$) were scaled and centred, using the training set maximum ($x_{\max}$) and minimum ($x_{\min}$) values to give new values ($x'$) designed to lie between 0 and 1, as per the bit-string fingerprint descriptors (Equation 6.1). This was carried out in anticipation of generating additional models using alternative Machine Learning algorithms, that could be explored in the future, since the Random Forest algorithm used here is unaffected by scaling or centring of descriptors.

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{6.1} \]

6.2.3 TdP Model Generation

Binary classification (TdP+ vs. TdP-) Random Forest models were generated using the randomForest package in R. 225 Models were generated using ntree = 501 (see Chapter 5, section 5.2.2) and the default value for mtry (see Chapter 2, section 2.6.3.2). Hyperparameter optimisation was not considered for computational expediency, since Random Forest is noted to typically perform well with its default hyperparameters, 192 and this work was focussed on the effect of using different descriptor combinations to generate TdP models. All models were generated five times, using different random number generator (RNG) seeds. The scripts employed for TdP model generation are made available with this thesis (Appendix A).

6.2.4 Generation of Predicted IC-descriptors

The Random Forest regression models used to generate predicted IC-descriptors were built on the datasets described in section 6.3.1. Care was taken to ensure that these datasets did not overlap with those used to generate the TdP models, in order to investigate the effect of incorporating genuinely predicted IC-descriptors into models for TdP. The arithmetic mean of all pIC50 estimates associated with a given dataset entry was used as the response variable. All structural-descriptors discussed in section 6.2.1 were used, subsequent to the pre-processing described in section 6.2.2. The Random Forest implementation matched that used to generate the TdP classification models (section 6.2.3).
The default number of trees was used. However, in order to better evaluate the effect of using predicted IC-descriptors, it was considered more critical to maximise the predictive power of these models; hence, in contrast to before, efforts were made to optimise mtry, by evaluating:

\[ \mathit{mtry} = \mathrm{floor}\left( \frac{j \times M}{6} \right) \quad \text{for } j = 1,2,3,4,5,6. \]

Here M denotes the total number of descriptors, and floor( ) rounds down to the nearest integer. This sequence was chosen to include the randomForest default (see Chapter 2, section 2.6.3.2), half and twice this value, as has been recommended, and the maximum possible value of mtry, which may be appropriate when few descriptors are relevant (a number of the bit-string-descriptors, e.g. the presence of an ethane fragment, were suspected to be largely irrelevant). 386 The model with the highest out-of-bag 192 Pearson's correlation coefficient was selected (see Chapter 2, sections 2.6.3.2 and 2.6.4.2 respectively). The scripts used to build these models are made available with this thesis (Appendix A).

6.2.5 Generation of Experimental IC-descriptors

For a given cardiac ion channel, where experimental estimates for pIC50 values were available for a compound in the TdP datasets, their arithmetic mean was used as an experimental IC-descriptor. The origin of these measurements is explained in section 6.3.1. As explained there, pairs of bioactivity files, matching compound IDs to synonyms and (possibly multiple) pIC50 estimates, were derived for each cardiac ion channel for which data was available. The compounds present in the TdP datasets were matched to pIC50 measurements in these files by comparing the names of the TdP dataset compounds to the synonyms for the ion channel inhibitors presented in these files - ignoring trivial differences such as spaces, capitalisation etc.
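Two of the steps described in sections 6.2.4 and 6.2.5 lend themselves to a short sketch: the grid of candidate mtry values and the name-based matching used to derive experimental IC-descriptors. The sketch below is illustrative only and is not the thesis code. It assumes the candidate values take the form floor(jM/6), which is consistent with the stated sequence (half the regression default, the default M/3, twice the default and the maximum M); the lower bound of 1 is an added safeguard, not something stated in the text. The normalisation rule (lower-casing and discarding non-alphanumeric characters) is one possible reading of "trivial differences", and the synonyms and pIC50 values shown are invented.

```python
from statistics import mean
from typing import Dict, List, Optional


def mtry_candidates(n_descriptors: int) -> List[int]:
    """Candidate mtry values floor(j * M / 6) for j = 1..6 (assumed form)."""
    # max(1, ...) guards against a value of zero when M < 6 (an added safeguard).
    return [max(1, (j * n_descriptors) // 6) for j in range(1, 7)]


def normalise_name(name: str) -> str:
    """Lower-case a name and drop spaces and punctuation (assumed rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def experimental_ic_descriptor(
    compound_name: str, synonym_to_pic50: Dict[str, List[float]]
) -> Optional[float]:
    """Arithmetic mean of all pIC50 estimates whose synonym matches the name."""
    target = normalise_name(compound_name)
    values: List[float] = []
    for synonym, estimates in synonym_to_pic50.items():
        if normalise_name(synonym) == target:
            values.extend(estimates)
    return mean(values) if values else None


# Invented bioactivity records for a single ion channel.
records = {"e 4031": [7.0, 8.0], "Cisapride": [6.5]}
descriptor = experimental_ic_descriptor("E-4031", records)
grid = mtry_candidates(330)
```

Returning `None` for unmatched compounds leaves the handling of missing experimental values to the caller.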
6.3 Datasets

6.3.1 Ion Channel Datasets

In order to generate IC-descriptors, inhibition/activation measurements were sought for cardiac ion channels, inhibition or activation of which might be mechanistically related to the induction of Torsades de Pointes. Inhibition measurements were obtained for the following ion channels, denoted by the name of the gene encoding the primary alpha-subunit: SCN5A (Nav1.5), CACNA1C (Cav1.2), KCNH2 (hERG) and KCNQ1 (KVLQT1), which carry the INa, ICa,L, IKr and IKs currents respectively. 387 Inhibition/activation of these currents within ventricular tissue has the potential to increase or reduce the duration of ventricular repolarisation, the prolongation of which may lead to TdP (see Chapter 1, section 1.3.3). Changes in all of these currents have been suggested to play a role in either promoting or counteracting QT interval prolongation and the induction of Torsades de Pointes. 104,106,159,387

Inhibition measurements were also obtained for the following ion channels (denoted as before): CACNA1H (Cav3.2), 388 CACNA1G (Cav3.1) and KCNA5 (Kv1.5). 387 Due to their (typically) selective expression in non-ventricular - for example, atrial - cells, 387–389 it could be argued that inhibition/activation of their currents is irrelevant to TdP. 389 However, atrial fibrillation has been suggested to be linked with a reduced risk of Torsades de Pointes. 390 Hence, it was deemed reasonable to additionally investigate whether or not generating IC-descriptors based upon inhibition measurements for these ion channels might also enhance models for TdP.

The ChEMBL database (version 11, MySQL format) 269 was searched for inhibition/activation measurements for the cardiac ion channels noted above. 387 The script used to extract data from ChEMBL is made available with this thesis (Appendix A).
Measurements, with the maximum ChEMBL confidence score of nine, were extracted for mammalian targets, corresponding to the gene names encoding the primary alpha subunits. 387,388 Measurements which the extracted assay descriptions indicated were derived from competitive binding assays were excluded, since these assays cannot distinguish between channel activation/inhibition, 67 which have distinct effects on the cardiac action potential. 106 The inhibition measurements derived were pIC50 values, IC50 values (directly converted into pIC50 values) and percent inhibition values. The latter were converted into pIC50 estimates, based on the test concentration provided, using the Logit function. Johnson et al. found pIC50 estimates obtained using the Logit function had a median error of only 0.1 log units for percent inhibition from 20-80 %; hence, only percent inhibition values in this range were considered. 81 Synonyms and molecular structures (canonical SMILES) were also extracted for the corresponding compounds.

Additional data for KCNH2 was obtained from the Literature-368 dataset described in Chapter 4, as well as the hERG-196 dataset described in Chapter 5. Data for KCNH2 and KCNQ1 was also obtained from Obiol-Pardo et al. 104 - kindly provided in electronic format by Dr Manuel Pastor.

The data selected from all three sources was combined for each ion channel. Only a single entry corresponding to a given stereochemically indifferent InChI (see section 6.3.5) was retained in the dataset used to generate the ion channel model; all remaining entries were separately recorded - i.e. two sets of files, an SDF and a corresponding bioactivity file, were derived per ion channel. The removal of dataset entries in this fashion was deemed appropriate for the following reasons. Firstly, the stereochemistry of all structures in the Literature-368 dataset had not been checked. Hence, stereochemistry dependent comparisons could have erroneously retained duplicates.
Secondly, for some sets of entries in the datasets compiled by Obiol-Pardo et al., the same IC50 measurement was assigned to all potential stereoisomers, where this was unknown to Obiol-Pardo and co-workers (private correspondence with Dr Manuel Pastor). Hence, stereochemistry dependent comparisons would have retained duplicate measurements. Thirdly, since stereochemically indifferent structural-descriptors were used to generate IC-descriptors (section 6.2.1), the retention of stereoisomers could have introduced effective redundancy when training and validating the corresponding models.

Each dataset entry was assigned an ID - either the name associated with entries in the Obiol-Pardo et al. datasets, the ChEMBL molregno, or unique IDs assigned to the combined set of compounds taken from the Literature-368 and hERG-196 datasets. Where pairs of dataset entries, originating from one of these three original sources, were associated with the same stereochemistry dependent InChI, yet their IDs did not indicate they were the same compound/stereoisomers, both dataset entries were removed. Likewise, where the synonyms associated with pairs of entries in the bioactivity files corresponding to the compound sets prepared for the generation of ion channel models indicated that the entries corresponded to the same compound, both dataset entries were removed. In both cases it was deemed likely that some of the associated structures were topologically incorrect.* This yielded the pairs of files per ion channel which were parsed to generate experimental IC-descriptors as described in section 6.2.5.

Finally, the dataset entries deemed to correspond to compounds in the TdP datasets when generating experimental IC-descriptors (see section 6.2.5) were removed, followed by removal of dataset entries for which their stereochemically indifferent InChIs were matched to those of compounds in the TdP datasets.
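The conversion of single-concentration percent inhibition values into pIC50 estimates, described earlier in this section, can be sketched as follows. The sketch assumes the usual logit relationship for a Hill coefficient of one, pIC50 = -log10(C) + log10(I / (100 - I)), with the test concentration C in mol/L and the percent inhibition I restricted to the 20-80 % range used here. The function name is hypothetical and the exact implementation used for the thesis (Appendix A) may differ.

```python
import math


def pic50_from_percent_inhibition(percent_inhibition: float, conc_molar: float) -> float:
    """Estimate pIC50 from one percent inhibition value (assumes Hill slope of 1)."""
    if not 20.0 <= percent_inhibition <= 80.0:
        # Values outside this range were not considered in the text.
        raise ValueError("percent inhibition must lie in the 20-80 % range")
    if conc_molar <= 0.0:
        raise ValueError("test concentration must be positive (mol/L)")
    logit = math.log10(percent_inhibition / (100.0 - percent_inhibition))
    return -math.log10(conc_molar) + logit


# 50 % inhibition at 1 micromolar corresponds to an IC50 of 1 micromolar.
estimate = pic50_from_percent_inhibition(50.0, 1e-6)
```

At 50 % inhibition the logit term vanishes, so the estimate reduces to the negative logarithm of the test concentration.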
* Alternatively, some of these inhibitors could have been assigned the correct structure but incorrect synonyms - this being the rationale why the bioactivity files available subsequent to filtering these 'suspicious IDs' were parsed to obtain measurements for experimental IC-descriptors (see section 6.2.5).

Ultimately, this yielded datasets of the following size (Table 6.1), which were used to generate the ion channel models employed for calculating predicted IC-descriptors - since the 45 compounds retained in the SCN5A dataset fell below a heuristic minimum of 50 required for model generation, no ion channel model was built on this dataset.

Ion Channel    n.o. Compounds
CACNA1H        73
KCNH2          1,285
CACNA1G        192
CACNA1C        106
KCNQ1          59
KCNA5          300

Table 6.1 Total number of compounds used to generate ion channel models.

All dataset files used to generate these ion channel models are made available with this thesis (Appendix A).

6.3.2 TdP Datasets

TdP models were evaluated on datasets derived from those originally presented by Yap et al. 27 (Yap-2004 dataset) and Clark and Wiseman 28 (Clark-2009 dataset). The initial descriptions below refer to these original datasets, whilst the numbers of compounds remaining in the versions of the Yap-2004 and Clark-2009 datasets ultimately used to assess the modelling approaches introduced in the current work, following the removal of compounds described in the intervening sections, are presented in section 6.3.3.4.

Yap-2004 Dataset

Yap and co-workers originally selected TdP+ compounds from agents reported, as of 2003, by the Arizona Center for Education and Research on Therapeutics (ArizonaCERT) 391 to have a risk (or a possible risk) of inducing TdP in humans or otherwise to be avoided by patients with congenital long QT syndrome (LQTS). They aimed to exclude compounds otherwise reported as having a weak association with TdP.
Additional TdP+ compounds were elsewhere reported to be associated with TdP/ventricular tachyarrhythmias on the basis of clinical studies/case reports, or were agents with official warnings regarding the occurrence of TdP. All such assignments were based on human studies. Their 243 TdP- compounds corresponded to an additional set of agents for which they found no reported case of TdP in humans. 27

Here, structures corresponding to the names presented by Yap et al. 27 were obtained from Ligand.Info (ligand_info_ver_1_02.sdf.zip), 392,393 DrugBank, 386–389 PubChem, 169,394 SciFinder, 372 the literature (gentamicin), 395,396 or (anakinra, 1IRP) the PDB (as prompted by DrugBank). 261,262,397 The structures were aggregated into a pair of SDF files, corresponding to the training and test set defined by Yap et al., using the Python Pybel module. 177 These SDF files are made available with this thesis (Appendix A).

Clark-2009 Dataset

Clark and Wiseman derived TdP+ assignments for all but five of the TdP+ compounds in this dataset from the FDA's Adverse Event Reporting System (AERS), with drugs extracted from this database assigned as TdP+ on the basis of reports from the AERS describing compounds as either primary or secondary suspect agents in incidents of TdP. Chi-squared values were computed based upon the number of such reports associating a given drug with TdP, compared to an estimation of the number of such reports expected if there was no causative relationship between taking the drug and TdP. Only if the chi-squared value was statistically significant at the 5% level was the null hypothesis - that the reports associating the drug with TdP were due to chance co-occurrence - rejected; otherwise, drugs reported in the AERS were assigned to the TdP- class. An additional set of five compounds withdrawn from the market due to QT-associated events were also assigned to the TdP+ class, yielding 71 TdP+ compounds in total.
28 The 1,307 SMILES presented by Clark and Wiseman were downloaded; non-standard (H) symbols were removed. 179 TdP+ class labels were assigned based on the names of the 71 compounds listed in another of their Supporting Information files. 28

6.3.3 Resolving Potential Problems Identified with the TdP Datasets

An overview of the identification and removal of problematic entries from the versions of the Yap-2004 and Clark-2009 datasets originally derived in the current work is provided below. The calculations presented here were only carried out on the filtered versions of these datasets. The identities of all compounds removed from both datasets are made available with this thesis (Appendix A).

6.3.3.1 Problematic Class Assignments

In their 2006 study, Gepp and Hutter 128 noted that three compounds assigned as TdP- by Yap et al. were reported as being weakly associated with TdP by ArizonaCERT. They identified three additional TdP- compounds which were reported to prolong the QT interval and/or might cause TdP - or, in one case, have a "definitive association" with TdP. 129 In order to improve the consistency of class label assignment within the Yap-2004 dataset, all six of these TdP- compounds were removed. 128 The possibility of unidentified ‘false negatives’ remains an issue for both the Yap-2004 and Clark-2009 datasets (see Chapter 1, section 1.3.3.2).

6.3.3.2 Problematic Structures

Dataset entries found to correspond to polymers, * as well as those identified as corresponding to multiple fragments, were removed. The latter decision was prompted by the detection of compounds (e.g. trimetaphan camsilate) associated with a large organic counterion; the standard approach of discarding minor fragments may be inappropriate for such multifragment compounds, for which the bioactive component is ambiguous. 165 For simplicity, all identified multifragment compounds were removed.
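As an illustration of the multifragment filter, the following minimal sketch flags entries whose SMILES string contains more than one disconnected component. How multifragment entries were actually identified is not specified above; the sketch simply relies on the SMILES convention that a '.' separates disconnected components, and the function names and example entries are hypothetical.

```python
def is_multifragment(smiles):
    """Return True if the SMILES string encodes more than one fragment.

    In SMILES notation a '.' separates disconnected components,
    e.g. a drug and its counterion.
    """
    return "." in smiles.strip()


def drop_multifragment(entries):
    """Split (name, smiles) pairs into kept and removed lists."""
    kept, removed = [], []
    for name, smiles in entries:
        if is_multifragment(smiles):
            removed.append((name, smiles))
        else:
            kept.append((name, smiles))
    return kept, removed


# Illustrative entries only; these are not taken from the TdP datasets.
example_entries = [
    ("ethanol", "CCO"),
    ("sodium acetate", "CC(=O)[O-].[Na+]"),
]
kept_entries, removed_entries = drop_multifragment(example_entries)
```

Removing every flagged entry, rather than retaining the largest fragment, mirrors the decision described above for cases where the bioactive component is ambiguous.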
Additional structures in the Clark-2009 dataset were determined to be incorrect via consulting SciFinder. 372 Some structures with incorrect topology in this dataset were identified via consideration of those cases where the names in the Clark-2009 and Yap-2004 datasets matched, yet not the corresponding stereochemically indifferent InChIs (section 6.3.5). These structures were deemed erroneous based upon their inconsistency with the Yap-2004 structure, which was checked using SciFinder 372 or, in the case of neomycin, a literature reference. 399 Vasopressin was removed from both datasets as no clear structural reference was obtained. Further structures in the Clark-2009 dataset were suspected to be incorrect - for example, dataset entries corresponding to atomic sulfur and lithium - or, in the case of xenon, i.e. a noble gas, to offer no generalisation ability. All these structures were removed from the Clark-2009 dataset.

* Monomeric structures were originally obtained for the polymeric exchange resins, 128 colestipol and colestyramine (cholestyramine); the need for a distinct approach to QSAR modelling of polymers was emphasised by England. 398

6.3.3.3 Structural Redundancy

All compounds with non-unique stereochemically indifferent InChIs (see section 6.3.5) were removed.

6.3.3.4 Final Versions of TdP Datasets Employed in this Work

Ultimately, the steps undertaken in the preceding sections yielded filtered versions of the Yap-2004 and Clark-2009 datasets comprising 329 (99 TdP+, 230 TdP-) and 1170 (67 TdP+, 1103 TdP-) compounds respectively. All references to these datasets presented from here on, unless noted otherwise, should be understood to refer to these filtered versions.

6.3.4 TdP Dataset Compounds for which Experimental IC-descriptors were Available

The numbers of compounds in the Yap-2004 and Clark-2009 datasets for which experimental pIC50 estimates were obtained are presented in Table 6.2.
The original files * presenting these measurements are made available with this thesis (Appendix A).

* N.B.: As these were derived prior to filtering the TdP datasets, the names of some compounds which were ultimately discarded are presented, along with their associated measurements, in these files. These compounds are not included in the counts presented in Table 6.2.

                             n.o. Compounds with Experimental IC-descriptors
TdP Dataset    Ion Channel   TdP+    TdP-
Yap-2004       CACNA1H       1       0
Yap-2004       SCN5A         1       1
Yap-2004       KCNH2         52      8
Yap-2004       CACNA1C       0       0
Yap-2004       CACNA1G       1       0
Yap-2004       KCNA5         1       0
Yap-2004       KCNQ1         14      1
Clark-2009     CACNA1H       0       0
Clark-2009     SCN5A         2       1
Clark-2009     KCNH2         37      88
Clark-2009     CACNA1C       0       1
Clark-2009     CACNA1G       0       0
Clark-2009     KCNA5         1       2
Clark-2009     KCNQ1         11      10

Table 6.2 Numbers of TdP dataset compounds with experimental IC-descriptors based on comparing names of compounds in TdP and ion channel datasets.

6.3.5 Standardization and Comparison of Structures

All structures in the TdP and ion channel datasets were standardized using the same procedure. All structures in a given dataset, including a couple for which some standardization steps failed, were transferred to a single SDF file using the Python module Pybel, 177 prior to InChI generation, and subsequent calculation of structural descriptors. This workflow is illustrated in Figure 6.3. The exact options employed for standardization are specified in supplementary files made available with this thesis (Appendix A). InChIs were computed as per Chapter 5, section 5.3.4.3. All references to stereochemically dependent InChIs in this chapter refer to InChIs generated, via this workflow (Figure 6.3), using default options, whilst stereochemically indifferent InChIs were generated using the SNon flag.

Figure 6.3 Overview of the procedures used to prepare the structures in the ion channel and TdP datasets from which InChIs and structural descriptors were computed.
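The two ways in which stereochemically indifferent InChIs were used to handle structural redundancy (sections 6.3.1 and 6.3.3.3) can be contrasted with a minimal sketch. It operates on identifiers which are assumed to have been computed beforehand; the keys shown are placeholders rather than real InChIs, the function names are hypothetical, and retaining the first entry encountered is an assumption of the sketch, as the choice of retained entry is not specified above.

```python
from collections import Counter


def remove_non_unique(entries, key):
    """TdP dataset policy: discard every entry whose key is shared."""
    counts = Counter(key(entry) for entry in entries)
    kept = [e for e in entries if counts[key(e)] == 1]
    removed = [e for e in entries if counts[key(e)] > 1]
    return kept, removed


def keep_single_entry(entries, key):
    """Ion channel dataset policy: retain one entry per key.

    Assumption: the first entry encountered is the one retained;
    the remainder are set aside so they can be recorded separately.
    """
    seen = set()
    kept, set_aside = [], []
    for entry in entries:
        k = key(entry)
        if k in seen:
            set_aside.append(entry)
        else:
            seen.add(k)
            kept.append(entry)
    return kept, set_aside


# Placeholder identifiers standing in for stereochemically indifferent InChIs.
records = [("cpd-1", "key-A"), ("cpd-2", "key-B"), ("cpd-3", "key-A")]
tdp_kept, tdp_removed = remove_non_unique(records, key=lambda r: r[1])
ic_kept, ic_set_aside = keep_single_entry(records, key=lambda r: r[1])
```

The first policy is the more conservative of the two, since it discards both members of any pair whose structures cannot be distinguished once stereochemistry is ignored.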
6.4 Summary of TdP Model Comparisons

All scripts used to run these experiments, along with all raw input files, are made available with this thesis (Appendix A).

6.4.1 Modelling Runs Based on the Complete TdP Datasets

6.4.1.1 Descriptor Combinations Compared

Models were generated based on:
1. The structural-descriptors - i.e. the combination of the fragment-based bit-vector and LSER descriptors described in section 6.2.1.
2. Predicted IC-descriptors.
3. Combinations of both types of descriptor sets.

For all descriptor sets including predicted IC-descriptors, three versions were considered based on the following combinations of ion channel models:
1. All models (see Table 6.1).
2. The hypothetically most biologically relevant (see section 6.3.1) subset (KCNH2, KCNQ1 and CACNA1C).
3. KCNH2 only.

The KCNH2 only descriptor set was considered for the following reasons. Firstly, given the larger quantities of data available for building these models (Table 6.1), it was suspected the KCNH2 dataset might have greater coverage of the chemical space of the TdP datasets, hence yield higher quality predictions for these datasets and, consequently, more useful predicted IC-descriptors. Secondly, given the current focus on assessing hERG (KCNH2) inhibition as a surrogate for Torsadogenic potential in the pharmaceutical industry (see Chapter 1, section 1.3.2), 62,400 it was considered valuable to assess the extent to which the inclusion of hERG inhibitory information alone might improve predictive models for TdP.

6.4.1.2 Partitioning into Training and Test Sets

For both TdP datasets, models were generated and validated using (multiple repetitions of) 10-fold stratified cross-validation (10CV). Whilst 10 repetitions of 10CV were carried out on the Yap-2004 dataset, 10CV was only carried out once on the larger Clark-2009 dataset for computational expediency.
For each train/test partition, Random Forest models were generated five times, using five different random number generator (RNG) seeds.

6.4.2 Modelling Runs Based on Compounds with Experimental IC-descriptors

6.4.2.1 Subsets of TdP Datasets Employed

Since a sizeable number of compounds with experimental IC-descriptors was only obtained for the KCNH2 channel (see Table 6.2), TdP models were built on the subsets of the TdP datasets for which KCNH2 experimental IC-descriptors were available. It was considered appropriate to generate these based on a subset of pIC50 estimates obtained under more experimentally consistent conditions than those originally obtained. Hence, KCNH2 experimental IC-descriptors were recalculated via only averaging estimates obtained from electrophysiological measurements in mammalian, heterologous expression systems. 67 Measurements for terfenadine which were deemed to be derived from a suboptimal IonWorks TM assay protocol, 78,401 or had been erroneously annotated in ChEMBL, 402 were also excluded. The files containing these reduced sets of measurements are also made available with this thesis (Appendix A).

However, recalculating KCNH2 experimental IC-descriptors based upon excluding measurements, as described above, led to a reduction in the size of the TdP datasets. The subsets of the Yap-2004 and Clark-2009 datasets based on these recalculated descriptors contained 51 (7) and 36 (85) TdP+ (TdP-) compounds respectively. Hence, TdP models were also generated and validated using the datasets corresponding to the original KCNH2 experimental IC-descriptors (see Table 6.2).

6.4.2.2 Descriptor Combinations Compared

Models were generated based on:
1. The structural-descriptors (section 6.2.1).
2. KCNH2 experimental IC-descriptors.
3. Combinations of both types of descriptor sets.
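The stratified partitioning applied to these datasets (section 6.4.1.2) can be illustrated with a minimal, standard-library sketch. This is not the partitioning code used in this work: the function name, the seeding scheme and the round-robin assignment are assumptions made for illustration. The example uses the class counts reported above for the Yap-2004 subset with recalculated descriptors (51 TdP+, 7 TdP-).

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds whilst preserving class proportions.

    Indices of each class are shuffled and then dealt round-robin
    across the folds, continuing from where the previous class stopped
    so that overall fold sizes stay balanced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[(offset + position) % n_folds].append(index)
        offset += len(members)
    return folds


# Class counts taken from the text: 51 TdP+ and 7 TdP- compounds.
labels = ["TdP+"] * 51 + ["TdP-"] * 7
folds = stratified_folds(labels, n_folds=7, seed=1)
minority_per_fold = [sum(labels[i] == "TdP-" for i in fold) for fold in folds]
```

With seven minority class compounds and seven folds, each test fold in this sketch contains exactly one TdP- compound, which illustrates how little minority class information is available per fold for such small subsets.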
6.4.2.3 Partitioning into Training and Test Sets

Due to the reduced size of these datasets, 10 repetitions of stratified cross-validation were carried out using 10, 8 and 7 folds for the subsets of the Clark-2009, Yap-2004 (original experimental IC-descriptors) and Yap-2004 (recalculated experimental IC-descriptors) datasets respectively.

6.4.3 Statistical Comparisons

Overall TdP model performance was assessed in terms of the overall mean MCC. The statistical significance of differences in these mean MCC values was determined using the procedure described for the assessment of repeated cross-validation results in Chapter 5, section 5.3.5. Whilst the exact constitution of these descriptor sets varied, depending upon which ion channels were considered, all results obtained with either the ‘IC-descriptors’ or ‘combination’ descriptor sets, corresponding to either predicted or experimental IC-descriptors, were treated as a single family for the purposes of adjusting the p-values obtained for comparisons to the random predictor baseline. Pairwise comparisons were only carried out between the ‘structural-descriptors’, ‘IC-descriptors’ and ‘combination’ descriptor sets for a given dataset and a given combination of ion channels. * However, all corresponding sets of comparisons involving predicted IC-descriptors and experimental IC-descriptors were considered to be distinct families for the purpose of computing adjusted p-values.

* This was because the primary focus of these investigations was to draw general conclusions, if possible, regarding the predictivity conferred by the use of IC-descriptors. Increasing the number of pairwise comparisons could have led to less powerful statistical tests of these pairwise comparisons upon generating adjusted p-values. 303

6.5 Results and Discussion

The scripts used to compute performance statistics for both the ion channel and TdP models are made available with this thesis (Appendix A).

6.5.1 Results Obtained for Ion Channel Models

In order to estimate the expected performance of these models, for compounds inside their applicability domains, leave-one-out cross-validation (LOOCV) was employed. † For each fold, was re-optimised using the cross-validation training set, in order to avoid biasing the cross-validated performance estimates. All cross-validated performance estimations presented in these sections were obtained from pooling the results across all folds. Finally, the performance of these selected ion channel models on those TdP dataset compounds for which experimental pIC50 values were obtained is presented. Model performance is quantified in terms of the Root Mean Square Error (RMSE), the coefficient of determination (R²), the Pearson (r) and Spearman's rank (ρ) correlation coefficients and their associated one-tail p-values (see Chapter 2, section 2.6.4.2).

† Save where this would have been computationally prohibitive. In practice, this led to the performance of the KCNH2 model, based on a dataset of more than 1,000 compounds (see Table 6.1), being assessed using two repetitions of 10-fold cross-validation (10CV).

6.5.1.1 Cross-validated Performance

The cross-validated results obtained for the ion channel models are presented in Table 6.3. A number of the models, notably for the key ion channels KCNQ1 and CACNA1C identified in section 6.3.1, are clearly not as predictive as might be hoped. Nonetheless, all correlation coefficient p-values obtained are statistically significant at the 5% level, suggesting that the positive correlation observed between the predicted and experimental pIC50 values was not due to chance.
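The regression statistics named above can be sketched as follows. The definitions actually used are those given in Chapter 2, section 2.6.4.2; the sketch assumes that the coefficient of determination is computed as one minus the ratio of the residual to the total sum of squares, a form which, unlike the squared Pearson coefficient, can take negative values. The function names are hypothetical, and the calculation of p-values is omitted.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only; these are not measurements from this work.
observed = [5.0, 6.0, 7.0]
reversed_predictions = [7.0, 6.0, 5.0]
```

For the toy values shown, predictions in exactly the reverse order give a Pearson coefficient of -1 and a coefficient of determination of -3, illustrating that, under this definition, a model can perform worse than simply predicting the mean observed value.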
Ion Channel   CV Repetition   RMSE   R²     r      p-value (r)   ρ      p-value (ρ)
KCNH2         1               0.71   0.52   0.72   0.0E+00       0.69   1.3E-182
KCNH2         2               0.70   0.53   0.73   0.0E+00       0.70   3.6E-190
CACNA1H       N/A             0.55   0.81   0.91   0.0E+00       0.78   1.7E-16
CACNA1C       N/A             0.73   0.25   0.50   2.0E-08       0.51   1.4E-08
KCNQ1         N/A             0.91   0.20   0.46   1.2E-04       0.49   3.7E-05
KCNA5         N/A             0.46   0.29   0.55   0.0E+00       0.55   1.5E-25
CACNA1G       N/A             0.42   0.69   0.83   0.0E+00       0.83   9.2E-51

Table 6.3 Cross-validated results obtained on the entirety of the ion channel datasets derived for the generation of predicted IC-descriptors.

6.5.1.2 Performance of Ion Channel Models for TdP Dataset Compounds

The predicted IC-descriptors were compared to the experimental IC-descriptors - i.e. predicted and (mean) experimental pIC50 values were compared - for the compounds in the TdP datasets for which experimental IC-descriptors were available. For the KCNH2 model, this external validation was repeated after filtering the pIC50 measurements obtained for the TdP dataset compounds as described in section 6.4.2.1. The results obtained for the former and latter validations are presented in Table 6.4 and Table 6.5 respectively. Supposing the statistically significant (at the 5% level) positive correlation between predicted and experimentally obtained pIC50 values obtained for the KCNH2 model held across the entirety of the TdP datasets, these results suggest these predictions are appropriate alternatives, for these datasets, to the inclusion of experimental pIC50 values as descriptors for predicting TdP.

TdP Dataset   Ion Channel   RMSE   R²      r       p-value (r)   ρ       p-value (ρ)   n.o. Compounds
Yap-2004      KCNH2         0.90   0.50    0.76    7.1E-13       0.80    0.0E+00       60
Yap-2004      KCNQ1         1.47   -1.35   -0.16   7.1E-01       0.00    5.0E-01       15
Clark-2009    KCNH2         0.99   0.37    0.67    0.0E+00       0.69    4.6E-19       125
Clark-2009    KCNA5         2.21   -3.36   -0.84   8.1E-01       -1.00   1.0E+00       3
Clark-2009    KCNQ1         1.51   -0.80   -0.09   6.6E-01       -0.08   6.4E-01       21

Table 6.4 Performance of ion channel models on the TdP datasets; the performance of those models for which only a single TdP dataset compound with an experimental IC-descriptor was obtained is omitted.

TdP Dataset   Ion Channel   RMSE   R²     r      p-value (r)   ρ      p-value (ρ)   n.o. Compounds
Yap-2004      KCNH2         0.87   0.52   0.78   1.7E-13       0.81   0.0E+00       58
Clark-2009    KCNH2         0.99   0.37   0.68   0.0E+00       0.70   2.7E-19       121

Table 6.5 Performance of KCNH2 model on TdP datasets, after removing problematic measurements and increasing the experimental consistency of the retained measurements.

6.5.2 Results Obtained for TdP Models

In addition to the performance summaries presented below, detailed results - including the outcomes of the statistical tests noted below - are made available with this thesis (Appendix A). Unless stated otherwise, all figures of merit presented are the mean across all five calls to randomForest(), all cross-validation folds and, where applicable, all repetitions of cross-validation. The results obtained across all repetitions of cross-validation, all folds and all RNG seeds (i.e. calls to randomForest()) are graphically summarised as per Chapter 5, section 5.4. Unless noted otherwise, all following references to model performance refer to model performance in terms of the overall mean MCC.

6.5.2.1 Effect of IC-descriptors on Performance

6.5.2.1.1 Effect of Predicted IC-descriptors on Performance

Figure 6.4 summarises the performance of all TdP models generated using predicted IC-descriptors, and the corresponding models generated using structural-descriptors alone.
Figure 6.4 MCC values (black line: median, circle: mean) obtained using predicted IC-descriptors, and the corresponding results obtained using structural-descriptors alone, across all cross-validation folds, repetitions and RNG seeds. Results obtained on the Yap-2004 (A,C,E) and Clark-2009 (B,D,F) datasets, using predicted IC-descriptors generated from all models (A,B), the putative most relevant set - i.e. KCNH2, KCNQ1, CACNA1C (C,D) and just the KCNH2 model (E,F).

The results obtained on the Yap-2004 dataset indicate that the inclusion of predicted IC-descriptors has no appreciable effect on the performance of models generated using structural-descriptors alone, whilst the performance of models generated using predicted IC-descriptors alone is appreciably worse. Irrespective of the combination of ion channel models, the mean MCC values obtained using IC-descriptors alone were lower than, and statistically significantly different (Appendix A) from, those obtained with the other descriptor sets; no statistically significant differences between the mean MCC values obtained with the other two descriptor sets were observed. Whilst yielding worse performance than the other descriptor sets, the overall mean MCC obtained with any combination of predicted IC-descriptors alone was nonetheless, in keeping with both other descriptor sets, statistically significantly higher than the value expected for a random predictor (zero) for the Yap-2004 dataset.

Interestingly, the overall mean MCC obtained using only these descriptors consistently fell, for this dataset, upon reducing the number of ion channel models used to generate them. Given the supposedly greater biological relevance of inhibiting CACNA1C, KCNQ1 and KCNH2 (see section 6.3.1) - as opposed to the superset of all ion channels - and the indicated poorer predictivity of the CACNA1C and KCNQ1 models, with respect to the KCNH2 model, this is somewhat surprising.
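For reference, the MCC for a single test set can be computed from the confusion matrix counts as in the following minimal sketch. The function name is hypothetical, and returning zero when the denominator vanishes is a common convention which is assumed here rather than taken from this work. The example counts use the class sizes of the filtered Clark-2009 dataset (67 TdP+, 1103 TdP-).

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion matrix counts.

    Assumption: when any marginal total is zero the coefficient is
    undefined, and zero is returned by convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Class sizes from the filtered Clark-2009 dataset: 67 TdP+ and 1103 TdP-.
perfect = mcc(tp=67, tn=1103, fp=0, fn=0)
majority_class_only = mcc(tp=0, tn=1103, fp=0, fn=67)
```

Under this convention, a model which assigns every compound to the majority class scores zero, i.e. no better than the value expected for a random predictor, however high its overall accuracy on an imbalanced dataset.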
The low quality of all models generated on the Clark-2009 dataset, which may well reflect the considerable imbalance in classes (see section 6.3.3.4), * offers little basis for discerning the relative predictivity expected with the different descriptor sets. No models performed statistically significantly better than random. No pairwise differences in overall mean MCC were statistically significant.

* This imbalance appears to bias the models towards simply becoming majority class predictors, as suggested by the observation that, for all descriptor sets, the overall mean recall was less than 0.15 and greater than 0.94 for the TdP+ and TdP- (majority class) compounds respectively.

6.5.2.1.2 Effect of Experimental IC-descriptors on Performance

Figure 6.5 summarises the performance of all models generated using experimental KCNH2 IC-descriptors, and the corresponding models based on structural-descriptors alone.

Figure 6.5 MCC values (black line: median, circle: mean) obtained using experimental KCNH2 IC-descriptors, and the corresponding results obtained using structural-descriptors alone, across all cross-validation folds, repetitions and RNG seeds. Results obtained on subsets of the Yap-2004 (A,C) and Clark-2009 (B,D) datasets corresponding to all compounds for which KCNH2 experimental IC-descriptors were obtained (A,B) and subsequent to reducing the inconsistency in the experimental conditions used to obtain the underlying measurements (C,D).

No statistically significant differences (Appendix A) in overall mean MCC values were observed between the three sets of descriptors for any of these sets of compounds. However, this could well reflect, at least in part, the small size of these datasets (see section 6.4.2.1).
The reduced size of the corresponding validation sets would be expected to lead to increased instability of the results obtained on a given validation set, and the smaller size of the training sets would be expected to lead to a decrease in the 'signal to noise ratio' - making it harder for any modelling approach to learn to discriminate between TdP+ and TdP- compounds. This latter contention is supported by the observation that, for the subsets of the Clark-2009 dataset, only the use of the ‘combination’ descriptor set led to statistically significant improvements in performance with respect to a random predictor and that this was true for none of the modelling approaches assessed on subsets of the Yap-2004 dataset. The considerable class imbalance (see section 6.4.2.1) in the subsets of the Yap-2004 dataset employed here appears to have biased the corresponding models towards becoming majority class predictors. *

6.5.2.2 Contributions of Different Descriptors to the Models

The contributions of different descriptors to the Random Forest TdP models were assessed using the Gini importance measure. 223 However, given the particular bias of this measure towards continuous descriptors, 227 the importance of the binary substructural descriptors was expected to be artificially lowered with respect to the LSER descriptors and IC-descriptors. Nonetheless, the relative importance of the IC-descriptors with respect to each other and the LSER descriptors could yield valuable insights into the contributions of these novel descriptors to the predictions made by the TdP models. The focus below is on the contributions made by IC-descriptors; hence, all the discussion below refers to those models generated using one of the considered combinations of IC-descriptors on its own, or in combination with structural-descriptors.
The final measure of descriptor importance employed here was the median (across all RNG seeds, and cross-validation partitions for a given dataset) Gini importance based rank, † lower values indicating more important features. All calculated importance values are presented in CSV files made available with this thesis (Appendix A).

* As is presumably reflected in the median recall for the TdP+ (majority) class and TdP- class, in all cases, being one and zero respectively (see Appendix A).

† The ranks were generated for all descriptors, including the binary descriptors corresponding to molecular fragments.

6.5.2.2.1 Contributions of Predicted IC-descriptors

Across every single model (i.e. for every cross-validation split, every RNG seed, every predicted IC-descriptor combination and both datasets), the KCNH2 IC-descriptor was the most important descriptor. For all other predicted IC-descriptors, the Gini measure based ranks were less stable across different models. This could be taken as affirmation of the particular biological significance of KCNH2 inhibition for Torsadogenic potential. However, it is clearly necessary to also take into account the predictivity of the ion channel models for the compounds in the TdP datasets. The higher predictivity of the KCNH2 model (see section 6.5.1) compared to the models for the other two ion channels widely understood to be important for the induction of TdP (i.e. CACNA1C and KCNQ1, see section 6.3.1) may also have skewed the results. Also, the non-linearity of Random Forest means that the importance of KCNH2 inhibition suggested here should not be interpreted as indicating that KCNH2 inhibition can be considered in isolation when evaluating Torsadogenic potential.
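The median Gini importance based rank can be illustrated with a minimal sketch. The importance values shown are invented purely for illustration and are not results from this work; the function names are hypothetical, and breaking ties alphabetically by descriptor name is an assumption of the sketch.

```python
from statistics import median


def importance_ranks(importances):
    """Rank descriptors within one model; rank 1 is the most important.

    Assumption: ties are broken alphabetically by descriptor name.
    """
    ordered = sorted(importances, key=lambda name: (-importances[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def median_ranks(per_model_importances):
    """Median rank of each descriptor across a collection of models."""
    collected = {}
    for importances in per_model_importances:
        for name, rank in importance_ranks(importances).items():
            collected.setdefault(name, []).append(rank)
    return {name: median(ranks) for name, ranks in collected.items()}


# Invented importance values for three hypothetical models.
models = [
    {"KCNH2": 0.9, "KCNQ1": 0.2, "CACNA1C": 0.5},
    {"KCNH2": 0.8, "KCNQ1": 0.6, "CACNA1C": 0.3},
    {"KCNH2": 0.7, "KCNQ1": 0.1, "CACNA1C": 0.4},
]
summary = median_ranks(models)
```

Summarising ranks rather than raw importance values makes the measure comparable across models trained on different cross-validation partitions, for which the absolute importance values need not share a common scale.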
Whilst the observation that the CACNA1C descriptor was assigned the second most important median rank for all relevant sets of models built on the Yap-2004 dataset might be supposed to affirm the significance of inhibiting this channel for the induction of TdP (see section 6.3.1), its median rank for the Clark-2009 dataset was not greater than the ranks for the KCNA5 (supposedly mechanistically irrelevant) and KCNQ1 IC-descriptors in both sets of models including all predicted IC-descriptors. These latter results, however, could simply be an artefact of the difficulty in generating discriminative models for the Clark-2009 dataset (see section 6.5.2).

6.5.2.2.2 Contributions of Experimental IC-descriptors

For the models built on the subset of the Clark-2009 dataset, the KCNH2 IC-descriptor was once more consistently assigned the most important rank. That this was not the case for the models built on the subset of the Yap-2004 dataset seems likely to be an artefact of the difficulty in generating discriminative models for this dataset (see section 6.5.2).

6.5.3 The Value Added by IC-Descriptors

The results obtained offer no clear evidence that IC-descriptors, as an alternative or complement to structural-descriptors, improve the performance of models for Torsadogenic potential. That no clear evidence of improved performance was observed with both predicted and experimental IC-descriptors suggests that this finding is not simply an artefact of the poor quality of some of the ion channel models used to generate predicted IC-descriptors, nor the possibility that fractions of the TdP datasets might lie outside the applicability domain of these models.

To put this in context, it should be noted that attempts, reported in the recent literature, to build models, for in vivo endpoints, based on experimentally derived biological descriptors have usually, though not always, suggested that the combination of biological and traditional (i.e.
purely structural) descriptors leads to improved bioactivity predictions over those obtained using structural-descriptors alone. 187–189 However, it has been indicated that the manner in which biological measurements are turned into descriptors may determine whether or not an improvement in predictivity is observed upon their incorporation into the modelling procedure. 188 This point is explored further, with respect to designing new types of IC-descriptors, in Chapter 7. Specifically regarding the development of predicted biological descriptors, this author is unaware of any other attempts to generate models for an in vivo, or clinical, endpoint based on the paradigm proposed here (i.e. the generation of predicted bioactivities based on the same structural-descriptors directly used for modelling the ultimate endpoint of interest). However, work by Cunningham et al. indicated that carcinogenicity predictions based upon fragment descriptors and docking predicted protein targets were improved with respect to models built on either type of information in isolation. 191 Perhaps of greater relevance for the paradigm proposed in this work, Zhu et al. found that in vivo toxicity predictions based on purely structural-descriptors were improved when predictions were separately generated for subclasses of the data predicted, using the same descriptors, to show different associations between in vitro and in vivo biological information. 190 Even if they do not improve predictivity, IC-descriptors could be valuable by yielding more mechanistically interpretable models for TdP. For example, suppose a model were to predict a compound to be Torsadogenic. It would be valuable to be able to rationalise the prediction in mechanistic terms; for example, 'compound X is anticipated to induce TdP, because it strongly inhibits the KCNH2 channel and only weakly inhibits the CACNA1C channel'. 
This could suggest strategies to a medicinal chemist for reducing the anticipated Torsadogenic liability; for example, might the KCNH2/CACNA1C inhibition of 'compound X' be reduced/increased, given the synthetic accessibility of derivatives etc.? Such an analysis was, unfortunately, not possible with the implementation of Random Forest employed in this work. However, Kuz'min et al. recently presented a means of obtaining the signed contributions of descriptors towards the predictions made by Random Forest models for individual compounds. For example, in the current context, 'this descriptor contributes positively towards the predicted Torsadogenic potential of compound X'. 229

6.6 Conclusions

The key objective of the work presented in this chapter was to assess the viability of discriminating between TdP inducing and non-inducing compounds using predicted pIC50 values for (potentially) biologically relevant ion channels as well as the potential for enhancing conventional QSAR approaches (based on structural-descriptors) to this task by including predicted/experimental pIC50 values as additional descriptors (IC-descriptors). It was originally hypothesised that the addition of such novel biological descriptors would, on average, improve performance over models generated using structural-descriptors alone. However, no clear evidence to support this hypothesis was obtained. Nonetheless, the potential value of (predicted) IC-descriptors may lie in their ability to yield mechanistically interpretable predictions of Torsadogenic potential. Further work is required to realise the potential mechanistic interpretability of models generated using IC-descriptors; this could entail using the approach proposed by Kuz'min et al. to interpret the contributions of different descriptors to the predictions made by Random Forest models for specific compounds. 229 Possible improvements to the manner in which IC-descriptors are generated are discussed in Chapter 7.
Chapter 7 Conclusions and Future Work

The aim of the work presented in this thesis was the development of novel models/modelling approaches which could be used to identify, in silico, potential drug compounds inducing important toxicity endpoints. Chapter 1 explained why in silico predictive toxicology is of immense value in pharmaceutical research. It then explained the importance of anticipating, and overviewed current scientific understanding of, the key toxicity endpoints which were the focus of the work presented in this thesis: mutagenicity, carcinogenicity, hERG inhibition and Torsades de Pointes (TdP). Chapter 2 presented an overview of the approaches available to predict toxicity in silico, with particular reference to those methods (notably Quantitative Structure-Activity Relationships, or QSARs) upon which the novel work presented in this thesis was founded.

7.1 Conclusions

In Chapter 3, a consensus model for mutagenicity/carcinogenicity was developed by combining the output generated by two commonly employed predictive toxicology programs: Toxtree 160 and Derek for Windows™. 271 Various options for combining the output of these programs, or using one of these programs in isolation, to generate predictions for both endpoints were evaluated - notably, on a large, QSAR-ready database (ISSCAN). 51 In terms of the Matthews Correlation Coefficient (MCC), the top performing models - for both endpoints - were combined models. Moreover, combining the outputs from both programs allowed the model to be tuned to minimise false predictions of toxicity, as was desirable given the model's intended application as a screening tool in an early stage discovery project. 195 The selected model was used to remove compounds predicted to exhibit both endpoints during a virtual screening workflow which successfully identified possible starting points for structurally novel anti-tuberculosis drugs.
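For readers less familiar with the metric, the following is a minimal sketch of how the MCC is computed from a binary confusion table, and of how a stricter consensus rule can trade sensitivity for fewer false predictions of toxicity. The toy labels, the function names and the "flag only if both programs flag" rule are assumptions made purely for illustration; they are not the Toxtree or Derek for Windows outputs, nor necessarily the combination rule selected in Chapter 3.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels, where True means 'toxic'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the table is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def consensus_both(flags_a, flags_b):
    """Hypothetical rule: predict 'toxic' only if both programs do."""
    return [a and b for a, b in zip(flags_a, flags_b)]


# Invented toy data; not taken from the thesis or from ISSCAN.
truth = [True, True, True, False, False, False, False, False]
program_a = [True, True, False, True, False, False, False, False]
program_b = [True, True, True, False, True, True, False, False]
combined = consensus_both(program_a, program_b)

for name, predictions in [("A", program_a), ("B", program_b), ("A and B", combined)]:
    tp, tn, fp, fn = confusion_counts(truth, predictions)
    print(name, "false positives:", fp, "MCC:", round(matthews_corrcoef(tp, tn, fp, fn), 3))
```

On this toy data the conjunctive rule removes every false positive at the cost of one missed toxic compound, which is the kind of trade-off that is acceptable for an early stage screening filter.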
Subsequent analysis of the wealth of toxicity predictions generated for a range of endpoints suggested that the bioactive compounds identified would not necessarily be lost to attrition for toxicity reasons and pointed to some potential toxicities that should be experimentally tested. In Chapter 4, a novel modelling approach based on (an extension of) Nigsch's version of the Winnow algorithm 184,213 was presented - incorporating for the first time, to the best of this author's knowledge, information encoded by numeric descriptors into QSAR models generated using this technique and investigating the use of multiple training cycles. The additional features based upon numeric descriptors were found to make important contributions to the models, whilst the use of multiple training cycles was found to potentially yield improved results. This approach was used to develop models for the rapid identification of potent hERG inhibitors, the development of which would usually be abandoned in the pharmaceutical industry. 62,69 These binary classification models were rigorously externally validated using carefully selected datasets and directly compared to approaches previously proposed for this task by Thai and Ecker 69 and Dubus et al. 91. The Winnow based models were found to perform competitively with, or better than, those previously presented in the literature, and the Winnow algorithm was found (to this author's knowledge, for the first time in QSAR research) to perform comparably to, or better than, models generated from the same descriptors using Random Forest and SVM - the memory efficient manner in which Winnow learns from the training set (as explained in Chapter 2, section 2.6.3.2) making this a notable finding. This work also highlighted the considerable variability in the estimated performance of a given QSAR approach which may be observed when training and testing the resultant model on different data.
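The general idea of mistake-driven Winnow learning over sparse binary features, with numeric descriptors discretized into bin features and several passes over the training set, can be sketched as follows. This is the textbook multiplicative-update form of Winnow, not Nigsch's variant or the extension developed in Chapter 4; the binning scheme, the parameter values, the fragment names and the toy compounds are all hypothetical.

```python
def bin_feature(name, value, edges):
    """Encode a numeric descriptor as one binary feature naming its bin.

    The bin index is the number of edges the value meets or exceeds
    (an illustrative discretization, not the thesis's scheme).
    """
    index = sum(value >= edge for edge in edges)
    return f"{name}_bin{index}"


class Winnow:
    """Textbook Winnow over sets of active (present) binary features."""

    def __init__(self, alpha=2.0, threshold=2.5):
        self.alpha = alpha          # multiplicative promotion/demotion factor
        self.threshold = threshold  # score needed to predict the active class
        self.weights = {}           # only features seen on a mistake are stored

    def score(self, features):
        # Features never updated keep the implicit initial weight of 1.0.
        return sum(self.weights.get(f, 1.0) for f in features)

    def predict(self, features):
        return self.score(features) >= self.threshold

    def fit(self, examples, cycles=1):
        """Run up to `cycles` passes, updating weights only on mistakes."""
        for _ in range(cycles):
            mistakes = 0
            for features, is_active in examples:
                if self.predict(features) == is_active:
                    continue
                mistakes += 1
                # Promote on a missed active, demote on a false alarm.
                factor = self.alpha if is_active else 1.0 / self.alpha
                for f in features:
                    self.weights[f] = self.weights.get(f, 1.0) * factor
            if mistakes == 0:
                break
        return self


# Hypothetical toy compounds: one structural fragment plus a binned logP value.
LOGP_EDGES = [1.0, 3.0]
training_set = [
    ({"frag_A", bin_feature("logP", 3.6, LOGP_EDGES)}, True),
    ({"frag_B", bin_feature("logP", 0.4, LOGP_EDGES)}, False),
    ({"frag_A", bin_feature("logP", 2.1, LOGP_EDGES)}, True),
    ({"frag_B", bin_feature("logP", 3.2, LOGP_EDGES)}, False),
]
model = Winnow().fit(training_set, cycles=5)
print(model.weights)
```

Because weights are stored only for features that were active when a mistake was made, and each example is processed once per cycle without holding a descriptor matrix in memory, the sketch also illustrates why this family of learners is memory efficient.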
Chapter 5 introduced a novel approach to combining Ballester and Richards' Ultrafast Shape Recognition (USR) descriptors 263 with chemical information: Atom Type USR (ATUSR). The performance of this 3D descriptor set for the generation of both classification and regression QSAR models for protein-ligand binding, primarily for hERG inhibition, was benchmarked on various datasets and compared to (extensions of) USR (analogous to those previously presented by Cannon et al.), 175 as well as commonly used 2D descriptor sets. USR was almost always found to yield the worst performance - yet statistically significantly better performance than expected for a random predictor - for both classification and regression modelling. More generally, the ordering of descriptor performance was different across different datasets. In only some cases were ATUSR descriptors (weakly) suggested to offer improved performance over a conceptually analogous approach (USR-ATFP) to that proposed by Cannon et al. 175 for extending USR. Moreover, the use of various Molecular Mechanics and docking approaches to yield (ostensibly) improved approximations (compared to those generated by CORINA) 333 of the bioactive conformer, employed for descriptor calculations, did not lead to the expected increases in predictive performance for ATUSR. However, this could partly reflect the particular difficulties associated with employing docking to find improved representations of the bioactive conformer for the (hERG) datasets considered. Moreover, some limited evidence was obtained that the actives and inactives in some of the classification datasets were partially separable based on lipophilicity, which may have biased the evaluations in favour of 2D descriptors. In Chapter 6, a novel approach to directly discriminating between compounds with TdP causing potential, and non-TdP causing compounds, was introduced.
The proposed approach incorporated predicted (or experimental) pIC50 values for cardiac ion channels (including hERG) previously indicated to play a role in the induction of TdP as additional descriptors (IC-descriptors). This approach was evaluated on datasets (subsets of those previously used for QSAR modelling of TdP) derived from assessments for TdP causing potential in humans. In contrast to many of the earlier studies which employed biological measurements as additional descriptors for toxicity endpoints, 187–189 the results obtained did not (clearly) indicate that the addition of predicted (experimental) IC-descriptors led to improved performance with respect to the use of traditional (structural) descriptors in isolation. However, the real value of such 'hybrid' models for TdP may lie in improved mechanistic interpretability of the predictions.

7.2 Future Work

Novel Models for TdP Based on Biological Descriptors

Given the novelty of the approach and the potential for improved interpretability of predictions for medicinal chemists, additional work is warranted to assess whether alternative approaches to incorporating IC-descriptors into models for TdP would be more successful (in terms of predictivity) than those described in Chapter 6. Improved versions of IC-descriptors might correspond to binary descriptors denoting levels of inhibition above or below appropriate thresholds. These thresholds might best be selected via internal validation of the resultant TdP models. The use of such descriptors, particularly if compounds close to the 'active vs. inactive' boundary were excluded, could reduce the significance of experimental noise, and more effectively enable the combination of data from different assays, leading to improved predictivity of the ion channel models as well as more robust experimental IC-descriptors.
90,93 Were this to facilitate the inclusion of data from a wider variety of assays, and more compounds, this could both extend the applicability domain of the ion channel models, which may not currently cover the entirety of the TdP datasets, and yield more TdP dataset compounds with experimental IC-descriptors; this could improve the value of the predicted IC-descriptors when applied to the entirety of the TdP datasets and strengthen the basis for drawing conclusions regarding the value of the experimental IC-descriptors. Sedykh et al. recently found that experimental biological descriptors derived from multiple points on a dose response curve yielded better toxicity predictions than simple binary encoding of the corresponding assay data. 188 Hence, it would be appropriate to see whether IC-descriptors based upon multiple dose-response parameters, and other parameters which can be derived from electrophysiological assays, would yield improved IC-descriptors. Such detailed information, from the same electrophysiological hERG assay, has recently been made available for more than 300,000 compounds. 97 Moreover, as noted in Chapter 6, the use of the approach proposed by Kuz'min et al. for the interpretation of Random Forest models, 229 in order to realise the potential mechanistic interpretability of models generated with these kinds of descriptors, would also be warranted.

Unbiased Assessment of ATUSR Descriptors

As briefly discussed above, as well as in Chapter 5, dataset biases may result in the observed performance of 2D descriptors exceeding that of 3D descriptors, even if the 3D descriptors capture useful information that the 2D descriptors do not. Examples of dataset biases which may have this effect are (some forms of) “analogue bias” and “property bias”. 336 In general, analogue bias refers to the actives in a dataset being disproportionately comprised of ('trivially') structurally related analogues.
Where these analogues are similar in terms of (properties which are directly related to) their 2D structural characteristics, artificially improved performance of 2D descriptors may be observed. 330 Property bias refers to separation of the actives and inactives in terms of simple molecular properties, 336 such as logP, 339,403 which may be highly related to 2D descriptors. 337 There is a particular need in the pharmaceutical industry for methods which can identify compounds with, say, lower hERG inhibition, yet similar lipophilicity. 338 Recently, Rohrer and Baumann developed a framework for quantifying both kinds of dataset bias, and for generating "unbiased" versions of an original dataset. 339,403 They applied this framework to generate the Maximum Unbiased Validation (MUV) binary classification datasets, each based on bioactivity data for 30 actives and 15,000 inactives, for 17 protein target interaction based bioactivities. It would be particularly interesting to evaluate the relative performance of ATUSR descriptors, using a similar set of experiments to those presented in Chapter 5, on these unbiased, benchmark datasets. Moreover, an initial inspection of the PDB 261 suggests that relevant (antagonist/agonist bound) crystal structures are available for the protein targets against which inhibition/activation is measured in the relevant PubChem assays. 169 For example, the 3NMQ and 1GWR complexes correspond to antagonist and agonist bound complexes of the HSP90 and ER-alpha targets respectively. This would allow further investigation of how structure-based approaches to approximating the bioactive conformer might improve the relative performance of ATUSR descriptors for 3D QSAR - Celik et al. having previously successfully docked small molecules into the 1GWR structure.
404 Indeed, given that ER-alpha agonists may act as endocrine disrupting chemicals (EDCs), inducing EDC toxicity, 405 investigating the performance of ATUSR descriptors for modelling activation of the ER-alpha receptor could serve as a test of their usefulness in computational toxicology. Moreover, the MUV protocol 339 could be applied to the (hERG) datasets previously modelled in this thesis, and the change (if any) in the relative performance of ATUSR descriptors investigated. As implementations of the descriptor sets presented in Chapter 5 are made available to the community with this thesis (Appendix A), this should facilitate further studies of their effectiveness. Additionally, careful evaluations of their potential computational efficiency would be merited if further investigations demonstrated their usefulness - which might entail re-implementing the ATUSR descriptor set.

Improved Screening for Potent hERG Inhibitors

The results presented in Chapter 4 suggest that Winnow models generated using features obtained from ECFP_4 fingerprints and discretized simple molecular properties, such as logP, pKa values and the Wiener Index, could perform well compared to models previously presented in the literature for discriminating potent (IC50 < 1 μM) hERG inhibitors from moderate (IC50: 1-10 μM) and weak (IC50 ≥ 10 μM) inhibitors. Additional work is warranted to:

1. Build/evaluate such models on a larger set of compounds. This data might be obtained from the ChEMBL database; 269 to obtain appreciably larger quantities of data though, it might be necessary to compromise (see Chapter 6) the rigorous specifications regarding experimental conditions employed when compiling the Literature-368 and hERG-196 datasets discussed in Chapter 4 and Chapter 5 respectively. Alternatively, experimentally consistent hERG inhibition measurements for more than 300,000 compounds are currently available via HERGCentral. 97,105

2.
More rigorously define the applicability domain of these models. As extensively discussed in Chapter 2, section 2.6.6, clearly delimiting the applicability domain of QSAR models is of considerable importance, yet remains a highly active area of research. It is possible that an approach based upon missing features, or the variation in class scores across different training set orders, might serve as useful "distance to model" metrics that could be used to define an applicability domain as per Sushko et al. 47

3. Implement the models as open source, freely available tools that would be readily available to non-experts. This could, in part, entail using the freely available implementation of extended connectivity fingerprints in the software program jCompoundMapper. 406

In summary, the work presented in this thesis has investigated a variety of novel approaches to computationally predicting drug induced toxicity and offers starting points for avenues of future research as well as readily available predictive tools.

Bibliography

(1) Combs, A. B.; Acosta, D. Jr. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 3–20.
(2) Nigsch, F. Computational Prediction of Molecular Properties for Drug Discovery. PhD Thesis, University of Cambridge, United Kingdom, 2008.
(3) Overington, J. P.; Al-Lazikani, B.; Hopkins, A. L. Nat. Rev. Drug Discovery 2006, 5, 993–996.
(4) Toxicity Testing Overview. http://alttox.org/ttrc/tox-test-overview/ (accessed January 11 2012).
(5) Xu, J. J. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 21–32.
(6) Müller, L.; Breidenbach, A.; Funk, C.; Muster, W.; Pähler, A.
In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 545–580.
(7) Kramer, J. A.; Sagartz, J. E.; Morris, D. L. Nat. Rev. Drug Discovery 2007, 6, 636–649.
(8) Kola, I.; Landis, J. Nat. Rev. Drug Discovery 2004, 3, 711–715.
(9) U S Food and Drug Administration Home Page. http://www.fda.gov/ (accessed January 11 2012).
(10) Qureshi, Z. P.; Seoane‐Vazquez, E.; Rodriguez‐Monguio, R.; Stevenson, K. B.; Szeinbach, S. L. Pharmacoepidemiol. Drug Saf. 2011, 20, 772–777.
(11) Lazarou, J.; Pomeranz, B. H.; Corey, P. N. JAMA, J. Am. Med. Assoc. 1998, 279, 1200–1205.
(12) Vitola, J.; Vukanovic, J.; Roden, D. M. J. Cardiovasc. Electrophysiol. 1998, 9, 1109–1113.
(13) Wysowski, D. K.; Bacsanyi, J. N. Engl. J. Med. 1996, 335, 290–291.
(14) Nigsch, F.; Lounkine, E.; McCarren, P.; Cornett, B.; Glick, M.; Azzaoui, K.; Urban, L.; Marc, P.; Müller, A.; Hahne, F.; Heard, D. J.; Jenkins, J. L. Expert Opin. Drug Metab. Toxicol. 2011, 7, 1497–1511.
(15) Gavaghan, C. L.; Arnby, C. H.; Blomberg, N.; Strandlund, G.; Boyer, S. J. Comput.-Aided Mol. Des. 2007, 21, 189–206.
(16) Johnson, D. E.; Rodgers, A. D. Curr. Opin. Drug Discovery Dev. 2006, 9, 29–37.
(17) Olson, H. M.; Davies, T. S. In Predictive Toxicology in Drug Safety; Xu, J. J.; Laszlo, U., Eds.; Cambridge University Press: New York, 2011; pp. 1–17.
(18) Bleicher, K. H.; Böhm, H.-J.; Müller, K.; Alanine, A. I. Nat. Rev. Drug Discovery 2003, 2, 369–378.
(19) Mitchell, J. B. O. Future Med. Chem. 2011, 3, 451–467.
(20) Gleeson, M. P.; Modi, S.; Bender, A.; Marchese Robinson, R. L.; Kirchmair, J.; Promkatkaew, M.; Hannongbua, S.; Glen, R. C. Curr. Pharm. Des. 2012, 18, 1266–1291.
(21) Davenport, A. J.; Möller, C.; Heifetz, A.; Mazanetz, M. P.; Law, R. J.; Ebneth, A.; Gemkow, M. J. Assay Drug Dev. Technol. 2010, 8, 781–789.
(22) Valerio, L. G. Jr. Toxicol. Appl.
Pharmacol. 2009, 241, 356–370.
(23) Egan, W. J.; Zlokarnik, G.; Grootenhuis, P. D. J. Drug Discovery Today: Technol. 2004, 1, 381–387.
(24) Valerio, L. G. Jr. Hum. Exp. Toxicol. 2008, 27, 757–760.
(25) Worth, A.; Lapenna, S.; Piparo, E. L.; Mostrag-Szlichtyng, A.; Serafimova, R. A Framework for Assessing In Silico Toxicity Predictions: Case Studies with Selected Pesticides; JRC Scientific and Technical Reports; European Commission, Joint Research Centre, 2011.
(26) Johnson, D. E.; Rodgers, A. D.; Sudarsanam, S. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 725–749.
(27) Yap, C. W.; Cai, C. Z.; Xue, Y.; Chen, Y. Z. Toxicol. Sci. 2004, 79, 170–177.
(28) Clark, M.; Wiseman, J. S. J. Chem. Inf. Model. 2009, 49, 2617–2626.
(29) Frid, A. A.; Matthews, E. J. Regul. Toxicol. Pharmacol. 2010, 56, 276–289.
(30) Bender, A.; Scheiber, J.; Glick, M.; Davies, J. W.; Azzaoui, K.; Hamon, J.; Urban, L.; Whitebread, S.; Jenkins, J. L. ChemMedChem 2007, 2, 861–873.
(31) Pharmatrope Press Release. http://pharmatrope.com/?q=node/9 (accessed January 12 2012).
(32) Nigsch, F.; Mitchell, J. B. O. Toxicol. Appl. Pharmacol. 2008, 231, 225–234.
(33) Matthews, E. J.; Frid, A. A. Regul. Toxicol. Pharmacol. 2010, 56, 247–275.
(34) Foster, J. R. In Carcinogenesis; Comprehensive Toxicology, Vol. 14, 2nd ed.; Roberts, R., Ed.; Elsevier Limited: Kidlington, United Kingdom, 2010; pp. 1–10.
(35) Boyle, P.; Ferlay, J. Ann. Oncol. 2005, 16, 481–488.
(36) Jansen, J. D. Mutat. Res. 1988, 205, 3–12.
(37) Benigni, R.; Bossa, C.; Jeliazkova, N.; Netzeva, T.; Worth, A. The Benigni / Bossa Rulebase for Mutagenicity and Carcinogenicity - A Module of Toxtree; JRC Scientific and Technical Reports; European Commission, Joint Research Centre, 2008.
(38) Benigni, R.; Bossa, C. Chem. Rev. 2011, 111, 2507–2536.
(39) Guidance on Genotoxicity Testing and Data Interpretation for Pharmaceuticals Intended for Human Use, S2(R1), Step 4 Version; ICH Harmonised Tripartite Guideline; International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 2011.
(40) Zeiger, E. Mutat. Res., Genet. Toxicol. Environ. Mutagen. 2001, 492, 29–38.
(41) Shelby, M. D.; Bishop, J. B.; Mason, J. M.; Tindall, K. R. Environ. Health Perspect. 1993, 100, 283–291.
(42) Custer, L. L.; Kreatsoulas, C.; Durham, S. K. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 391–401.
(43) S1B Testing for Carcinogenicity of Pharmaceuticals; Guidance for Industry; International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 1997.
(44) Jacobs, A. Toxicol. Sci. 2005, 88, 18–23.
(45) Cohen, S. M. Toxicol. Sci. 2004, 80, 225–229.
(46) Ames, B. N. Cancer 1984, 53, 2034–2040.
(47) Sushko, I.; Novotarskyi, S.; Körner, R.; Pandey, A. K.; Cherkasov, A.; Li, J.; Gramatica, P.; Hansen, K.; Schroeter, T.; Müller, K.-R.; Xi, L.; Liu, H.; Yao, X.; Öberg, T.; Hormozdiari, F.; Dao, P.; Sahinalp, C.; Todeschini, R.; Polishchuk, P.; Artemenko, A.; Kuz’min, V.; Martin, T. M.; Young, D. M.; Fourches, D.; Muratov, E.; Tropsha, A.; Baskin, I.; Horvath, D.; Marcou, G.; Muller, C.; Varnek, A.; Prokopenko, V. V.; Tetko, I. V. J. Chem. Inf. Model. 2010, 50, 2094–2111.
(48) Kamber, M.; Flückiger-Isler, S.; Engelhardt, G.; Jaeckh, R.; Zeiger, E. Mutagenesis 2009, 24, 359–366.
(49) Zeiger, E. Cancer Res. 1987, 47, 1287–1296.
(50) Cohen, S. M. Exp. Toxicol. Pathol. 2010, 62, 497–502.
(51) ISSCAN: Istituto Superiore di Sanita, Chemical Carcinogens: Structures and Experimental Data. http://www.epa.gov/ncct/dsstox/sdf_isscan_external.html (accessed January 16 2012).
(52) Backus, G.
S.; Wolf, M. A.; Burch, J.; Richard, A. M. DSSTox EPA Integrated Risk Information System (IRIS) Toxicity Review Data: SDF File and Documentation. http://www.epa.gov/ncct/dsstox/sdf_iristr.html (accessed January 16 2012).
(53) Guidelines for Carcinogen Risk Assessment; FRL-2984-1; Risk Assessment Forum, US Environmental Protection Agency: Washington, DC, 1986.
(54) Guidelines for Carcinogen Risk Assessment; EPA/630/P-03/001F; Risk Assessment Forum, US Environmental Protection Agency: Washington, DC, 2005.
(55) Hansen, K.; Mika, S.; Schroeter, T.; Sutter, A.; ter Laak, A.; Steger-Hartmann, T.; Heinrich, N.; Müller, K.-R. J. Chem. Inf. Model. 2009, 49, 2077–2081.
(56) Leadscope Toxicity Databases. http://www.leadscope.com/toxicity_databases/ (accessed January 16 2012).
(57) Hillebrecht, A.; Muster, W.; Brigo, A.; Kansy, M.; Weiser, T.; Singer, T. Chem. Res. Toxicol. 2011, 24, 843–854.
(58) Sanguinetti, M. C.; Tristani-Firouzi, M. Nature 2006, 440, 463–469.
(59) Crumb, W.; Cavero, I. Pharm. Sci. Technol. Today 1999, 2, 270–280.
(60) MedTerms Medical Dictionary. http://www.medterms.com/script/main/hp.asp (accessed January 19 2012).
(61) Hoffmann, P.; Warner, B. J. Pharmacol. Toxicol. Methods 2006, 53, 87–105.
(62) Yao, X.; Anderson, D. L.; Ross, S. A.; Lang, D. G.; Desai, B. Z.; Cooper, D. C.; Wheelan, P.; McIntyre, M. S.; Bergquist, M. L.; MacKenzie, K. I.; Becherer, J. D.; Hashim, M. A. Br. J. Pharmacol. 2008, 154, 1446–1456.
(63) Scuderi, P. E. Anesthesiology 2010, 113, 772–775.
(64) Hondeghem, L. M. Acta Cardiol. 2011, 66, 685–689.
(65) S7B The Non-Clinical Evaluation of the Potential for Delayed Ventricular Repolarization (QT Interval Prolongation) By Human Pharmaceuticals, Step 4 Version; ICH Harmonised Tripartite Guideline; International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 2005.
(66) Friedrichs, G. S.; Patmore, L.; Bass, A. J. Pharmacol. Toxicol. Methods 2005, 52, 6–11.
(67) Polak, S.; Wiśniowska, B.; Brandys, J. J. Appl. Toxicol. 2009, 29, 183–206.
(68) Redfern, W. S.; Carlsson, L.; Davis, A. S.; Lynch, W. G.; MacKenzie, I.; Palethorpe, S.; Siegl, P. K. S.; Strang, I.; Sullivan, A. T.; Wallis, R.; Camm, A. J.; Hammond, T. G. Cardiovasc. Res. 2003, 58, 32–45.
(69) Thai, K.-M.; Ecker, G. F. Bioorg. Med. Chem. 2008, 16, 4107–4119.
(70) Murphy, S. M.; Palmer, M.; Poole, M. F.; Padegimas, L.; Hunady, K.; Danzig, J.; Gill, S.; Gill, R.; Ting, A.; Sherf, B.; Brunden, K.; Stricker-Krongrad, A. J. Pharmacol. Toxicol. Methods 2006, 54, 42–55.
(71) Williams, M.; Jarvis, M. F. In Drug Discovery Technologies; Clark, C. R.; Moos, W. H., Eds.; Ellis Horwood Series in Pharmaceutical Technology; Ellis Horwood Limited: Chichester, England, 1990; pp. 129–166.
(72) Chiu, P. J. S.; Marcoe, K. F.; Bounds, S. E.; Lin, C.-H.; Feng, J.-J.; Lin, A.; Cheng, F.-C.; Crumb, W. J.; Mitchell, R. J. Pharmacol. Sci. 2004, 95, 311–319.
(73) Diaz, G. J.; Daniell, K.; Leitza, S. T.; Martin, R. L.; Su, Z.; McDermott, J. S.; Cox, B. F.; Gintant, G. A. J. Pharmacol. Toxicol. Methods 2004, 50, 187–199.
(74) Price, G. W.; Riley, G. J.; Middlemiss, D. N. In Medicinal Chemistry: Principles and Practice, 2nd ed.; King, F. D., Ed.; The Royal Society of Chemistry: Cambridge, United Kingdom, 2005; pp. 91–117.
(75) Cheng, Y.-C.; Prusoff, W. H. Biochem. Pharmacol. 1973, 22, 3099–3108.
(76) Wood, C.; Williams, C.; Waldron, G. J. Drug Discovery Today 2004, 9, 434–441.
(77) Dunlop, J.; Bowlby, M.; Peri, R.; Vasilyev, D.; Arias, R. Nat. Rev. Drug Discovery 2008, 7, 358–368.
(78) Bridgland-Taylor, M. H.; Hargreaves, A. C.; Easter, A.; Orme, A.; Henthorn, D. C.; Ding, M.; Davis, A. M.; Small, B. G.; Heapy, C. G.; Abi-Gerges, N.; Persson, F.; Jacobson, I.; Sullivan, M.; Albertson, N.; Hammond, T. G.; Sullivan, E.; Valentin, J.-P.; Pollard, C. E. J. Pharmacol. Toxicol. Methods 2006, 54, 189–199.
(79) Guo, L.; Guthrie, H. J. Pharmacol. Toxicol. Methods 2005, 52, 123–135.
(80) Tang, Q.; Jin, M.-W.; Xiang, J.-Z.; Dong, M.-Q.; Sun, H.-Y.; Lau, C.-P.; Li, G.-R. Biochem. Pharmacol. 2007, 74, 1596–1607.
(81) Johnson, S. R.; Yue, H.; Conder, M. L.; Shi, H.; Doweyko, A. M.; Lloyd, J.; Levesque, P. Bioorg. Med. Chem. 2007, 15, 6182–6192.
(82) Kirsch, G. E.; Trepakova, E. S.; Brimecombe, J. C.; Sidach, S. S.; Erickson, H. D.; Kochan, M. C.; Shyjka, L. M.; Lacerda, A. E.; Brown, A. M. J. Pharmacol. Toxicol. Methods 2004, 50, 93–101.
(83) Tie, H. H. Cellular Mechanism of QT Prolongation and Proarrhythmia Induced by Non-antiarrhythmic Drugs. MD Thesis, University of New South Wales, Australia, 2002.
(84) Wible, B. A.; Hawryluk, P.; Ficker, E.; Kuryshev, Y. A.; Kirsch, G.; Brown, A. M. J. Pharmacol. Toxicol. Methods 2005, 52, 136–145.
(85) Ko, C. M.; Ducic, I.; Fan, J.; Shuba, Y. M.; Morad, M. J. Pharmacol. Exp. Ther. 1997, 281, 233–244.
(86) Zhou, Z.; Vorperian, V. R.; Gong, Q.; Zhang, S.; January, C. T. J. Cardiovasc. Electrophysiol. 1999, 10, 836–843.
(87) Fenichel, R. R.; Malik, M.; Antzelevitch, C.; Sanguinetti, M.; Roden, D. M.; Priori, S. G.; Ruskin, J. N.; Lipicky, R. J.; Cantilena, L. R. J. Cardiovasc. Electrophysiol. 2004, 15, 475–495.
(88) Witchel, H. J.; Milnes, J. T.; Mitcheson, J. S.; Hancox, J. C. J. Pharmacol. Toxicol. Methods 2002, 48, 65–80.
(89) Su, B.-H.; Shen, M.; Esposito, E. X.; Hopfinger, A. J.; Tseng, Y. J. J. Chem. Inf. Model. 2010, 50, 1304–1318.
(90) Li, Q.; Jørgensen, F. S.; Oprea, T.; Brunak, S.; Taboureau, O. Mol. Pharmaceutics 2008, 5, 117–127.
(91) Dubus, E.; Ijjaali, I.; Petitet, F.; Michel, A. ChemMedChem 2006, 1, 622–630.
(92) Marchese Robinson, R. L.; Glen, R. C.; Mitchell, J. B. O. Mol. Inf. 2011, 30, 443–458.
(93) Doddareddy, M. R.; Klaasse, E. C.; Shagufta; IJzerman, A. P.; Bender, A. ChemMedChem 2010, 5, 716–729.
(94) O’Brien, S. E.; de Groot, M. J. J. Med. Chem. 2005, 48, 1287–1291.
(95) AID 376 - PubChem BioAssay Summary.
http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=376 (accessed January 19 2012).
(96) CHEMBL240 Target Report Card. https://www.ebi.ac.uk/chembldb/target/inspect/CHEMBL240 (accessed January 19 2012).
(97) Du, F.; Yu, H.; Zou, B.; Babcock, J.; Long, S.; Li, M. Assay Drug Dev. Technol. 2011, 9, 580–588.
(98) QSAR World Home Page. http://www.qsarworld.com/ (accessed January 4 2012).
(99) Tox-Portal Home Page (English Version). http://www.tox-portal.net/index.html (accessed January 19 2012).
(100) Hishigaki, H.; Kuhara, S. Database 2011, 2011, bar017.
(101) Fenichel, R. R. Receptor Binding Database. http://www.fenichel.net/pages/Professional/subpages/QT/Tables/pbydrug.htm (accessed June 21 2012).
(102) Aureus Sciences Home Page. http://www.aureus-sciences.com/aureus/web/guest (accessed Jan 19 2012).
(103) Sunset Molecular Home Page. http://www.sunsetmolecular.com/ (accessed January 19 2012).
(104) Obiol-Pardo, C.; Gomis-Tena, J.; Sanz, F.; Saiz, J.; Pastor, M. J. Chem. Inf. Model. 2011, 51, 483–492.
(105) hERG Central Home Page. http://www.hergcentral.org/ (accessed January 19 2012).
(106) Varró, A.; Baczkó, I. Br. J. Pharmacol. 2011, 164, 14–36.
(107) Yang, P.-C.; Kurokawa, J.; Furukawa, T.; Clancy, C. E. PLoS Comput. Biol. 2010, 6, e1000658.
(108) Gupta, A.; Lawrence, A. T.; Krishnan, K.; Kavinsky, C. J.; Trohman, R. G. Am. Heart J. 2007, 153, 891–899.
(109) Rosen, M. R.; Janse, M. J. J. Cardiovasc. Pharmacol. 2010, 55, 428–437.
(110) Sibbald, B. Can. Med. Assoc. J. 2003, 168, 1035.
(111) Recanatini, M.; Cavalli, A.; Masetti, M. ChemMedChem 2008, 3, 523–535.
(112) Viskin, S.; Rosso, R. J. Am. Coll. Cardiol. 2010, 56, 1585–1588.
(113) A Guide to Drug Safety Terms at FDA; FDA Consumer Health Information; US Food and Drug Administration, 2008.
(114) Smalley, W.; Shatin, D.; Wysowski, D. K.; Gurwitz, J.; Andrade, S. E.; Goodman, M.; Chan, K. A.; Platt, R.; Schech, S. D.; Ray, W. A. JAMA, J. Am. Med. Assoc. 2000, 284, 3036–3039.
(115) Lee, N.; Authier, S.; Pugsley, M. K.; Curtis, M. J. Toxicol. Appl. Pharmacol. 2010, 243, 146–153.
(116) Yan, G.-X.; Antzelevitch, C. Circulation 1998, 98, 1928–1936.
(117) Curtis, M. J. Cardiovasc. Res. 2012, 93, 10–11.
(118) De Ponti, F.; Poluzzi, E.; Montanaro, N. Eur. J. Clin. Pharmacol. 2001, 57, 185–209.
(119) E14 The Clinical Evaluation of QT/QTc Interval Prolongation and Proarrhythmic Potential for Non-Antiarrhythmic Drugs, Step 4 Version; ICH Harmonised Tripartite Guideline; International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use, 2005.
(120) Ursem, C. J.; Kruhlak, N. L.; Contrera, J. F.; MacLaughlin, P. M.; Benz, R. D.; Matthews, E. J. Regul. Toxicol. Pharmacol. 2009, 54, 1–22.
(121) Arizona Center for Education and Research on Therapeutics. Process for Assigning Risk. http://www.azcert.org/medical-pros/drug-lists/why-lists.cfm (accessed January 20 2012).
(122) Xue, Y.; Li, Z. R.; Yap, C. W.; Sun, L. Z.; Chen, X.; Chen, Y. Z. J. Chem. Inf. Comput. Sci. 2004, 44, 1630–1638.
(123) Ivanciuc, O. Internet Electron. J. Mol. Des. 2006, 5, 488–502.
(124) Yang, S.-Y.; Huang, Q.; Li, L.-L.; Ma, C.-Y.; Zhang, H.; Bai, R.; Teng, Q.-Z.; Xiang, M.-L.; Wei, Y.-Q. Artif. Intell. Med. 2009, 46, 155–163.
(125) Fu, G.-H.; Cao, D.-S.; Xu, Q.-S.; Li, H.-D.; Liang, Y.-Z. J. Chemom. 2011, 25, 92–99.
(126) Cao, D.-S.; Xu, Q.-S.; Liang, Y.-Z.; Chen, X.; Li, H.-D. Chemom. Intell. Lab. Syst. 2010, 103, 129–136.
(127) Bhavani, S.; Nagargadde, A.; Thawani, A.; Sridhar, V.; Chandra, N. J. Chem. Inf. Model. 2006, 46, 2478–2486.
(128) Gepp, M. M.; Hutter, M. C. Bioorg. Med. Chem. 2006, 14, 5325–5332.
(129) Fermini, B.; Fossa, A. A. Nat. Rev. Drug Discovery 2003, 2, 439–447.
(130) Arizona Center for Education and Research on Therapeutics. QT Drug List by Risk Groups. http://www.azcert.org/medical-pros/drug-lists/drug-lists.cfm (accessed November 25 2011).
(131) Micromedex Home Page.
http://www.micromedex.com/ (accessed January 22 2012).
(132) Meyler’s Side Effects of Drugs: The International Encyclopedia of Adverse Drug Reactions and Interactions, 15th ed.; Online Version. http://www.sciencedirect.com/science/referenceworks/9780444510051 (accessed January 22 2012).
(133) Adverse Event Reporting System (AERS). http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm (accessed January 4 2012).
(134) Lindquist, M. Drug Inf. J. 2008, 42, 409–419.
(135) Uppsala Monitoring Centre Home Page. http://who-umc2010.phosdev.se/ (accessed January 21 2012).
(136) Kuhn, M.; Campillos, M.; Letunic, I.; Jensen, L. J.; Bork, P. Mol. Syst. Biol. 2010, 6, 343.
(137) MedWatch Home Page. http://www.fda.gov/Safety/MedWatch/default.htm (accessed January 22 2012).
(138) IUPAC Glossary of Terms Used in Toxicology, 2nd ed.; Online Version, Terms Starting with C. http://sis.nlm.nih.gov/enviro/iupacglossary/glossaryc.html (accessed December 19 2011).
(139) Matthews, E. J.; Daniel Benz, R.; Contrera, J. F. J. Mol. Graphics Modell. 2000, 18, 605–615.
(140) Gombar, V. K.; Mattioni, B. E.; Zwicki, C.; Deahl, J. T. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 183–195.
(141) Brown, A. C.; Fraser, T. R. J. Anat. Physiol. 1868, 2, 224–242.
(142) Kubinyi, H. Quant. Struct.-Act. Relat. 2002, 21, 348–356.
(143) Lipnick, R. L.; Filov, V. A. Trends Pharmacol. Sci. 1992, 13, 56–60.
(144) Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Nature 1962, 194, 178–180.
(145) Hansch, C.; Muir, R. M.; Fujita, T.; Maloney, P. P.; Geiger, F.; Streich, M. J. Am. Chem. Soc. 1963, 85, 2817–2824.
(146) Hansch, C.; Fujita, T. J. Am. Chem. Soc. 1964, 86, 1616–1626.
(147) Fujita, T.; Iwasa, J.; Hansch, C. J. Am. Chem. Soc. 1964, 86, 5175–5180.
(148) Free, S.
M.; Wilson, J. W. J. Med. Chem. 1964, 7, 395–399. (149) Maran, U.; Sild, S. Artif. Intell. Rev. 2003, 20, 13–38. (150) Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007. (151) Enoch, S. J.; Cronin, M. T. D.; Schultz, T. W.; Madden, J. C. Chem. Res. Toxicol. 2008, 21, 513–520. (152) Bassan, A.; Worth, A. P. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 751– 775. (153) Tropsha, A. Mol. Inf. 2010, 29, 476–488. (154) Judson, P. N. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 521–543. 200 (155) Kontijevskis, A.; Komorowski, J.; Wikberg, J. E. S. J. Chem. Inf. Model. 2008, 48, 1840– 1850. (156) Du, L.; Li, M.; You, Q.; Xia, L. Biochem. Biophys. Res. Commun. 2007, 355, 889–894. (157) Du-Cuny, L.; Chen, L.; Zhang, S. J. Chem. Inf. Model. 2011, 51, 2948–2960. (158) Reisfeld, B.; Mayeno, A. N.; Lyons, M. A.; Yang, R. S. H. In Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals; Ekins, S., Ed.; Wiley Series on Technologies for the Pharmaceutical Industry; John Wiley and Sons: Hoboken, New Jersey, 2007; pp. 33–70. (159) Mirams, G. R.; Cui, Y.; Sher, A.; Fink, M.; Cooper, J.; Heath, B. M.; McMahon, N. C.; Gavaghan, D. J.; Noble, D. Cardiovasc. Res. 2011, 91, 53–61. (160) Judson, P. N.; Marchant, C. A.; Vessey, J. D. J. Chem. Inf. Comput. Sci. 2003, 43, 1364– 1370. (161) Judson, P. N.; Vessey, J. D. J. Chem. Inf. Comput. Sci. 2003, 43, 1356–1363. (162) Jeliazkova, N. Toxtree; Ideaconsult Limited: Sofia, Bulgaria. 
http://toxtree.sourceforge.net/ (accessed June 22 2012). (163) Patlewicz, G.; Jeliazkova, N.; Safford, R. J.; Worth, A. P.; Aleksiev, B. SAR QSAR Environ. Res. 2008, 19, 495–524. (164) OncoLogic; US Environmental Protection Agency: Washington, DC. http://www.epa.gov/oppt/sf/pubs/oncologic.htm (accessed December 22 2011). (165) Fourches, D.; Muratov, E.; Tropsha, A. J. Chem. Inf. Model. 2010, 50, 1189–1204. (166) Schattel, V.; Hinselmann, G.; Jahn, A.; Zell, A.; Laufer, S. J. Chem. Inf. Model. 2011, 51, 670–679. (167) The Cheminformatics and QSAR Society: Data Sets. http://www.qsar.org/resource/datasets.htm (accessed May 16 2012). (168) ChEMBLdb Home Page. https://www.ebi.ac.uk/chembldb/ (accessed May 16 2012). (169) The PubChem Project Home Page. http://pubchem.ncbi.nlm.nih.gov/ (accessed November 24 2011). 201 (170) DrugBank Home Page. http://drugbank.ca/ (accessed March 16 2011). (171) ChemSpider Home Page. http://www.chemspider.com/ (accessed May 16 2012). (172) Young, D.; Martin, T.; Venkatapathy, R.; Harten, P. QSAR Comb. Sci. 2008, 27, 1337– 1345. (173) Williams, A. J.; Ekins, S. Drug Discovery Today 2011, 16, 747–750. (174) Muresan, S.; Petrov, P.; Southan, C.; Kjellberg, M. J.; Kogej, T.; Tyrchan, C.; Varkonyi, P.; Xie, P. H. Drug Discovery Today 2011, 16, 1019–1030. (175) Cannon, E. O.; Nigsch, F.; Mitchell, J. B.O. Chem. Cent. J. 2008 2, 3. (176) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Adv. Drug Delivery Rev. 1997, 23, 3–25. (177) O’Boyle, . M.; Morley, C.; Hutchison, G.R. Chem. Cent. J. 2008, 2, 5. (178) The IUPAC International Chemical Identifier; International Union of Pure and Applied Chemistry: North Carolina, USA. http://www.iupac.org/home/publications/e- resources/inchi.html (accessed June 22 2012). (179) Weininger, D. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. (180) Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J. J. Chem. Inf. Comput. Sci. 1992, 32, 244–255. 
(181) Tripos Mol2 File Format; Tripos: St Louis, USA. http://tripos.com/data/support/mol2.pdf (accessed June 22 2012).
(182) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Methods and Principles in Medicinal Chemistry, Vol. 11; Wiley-VCH: Weinheim, Germany, 2000.
(183) Bender, A. Expert Opin. Drug Discovery 2010, 5, 1141–1151.
(184) Nigsch, F.; Mitchell, J. B. O. J. Chem. Inf. Model. 2008, 48, 306–318.
(185) Rogers, D.; Hahn, M. J. Chem. Inf. Model. 2010, 50, 742–754.
(186) Esposito, E. X.; Hopfinger, A. J.; Madura, J. D. In Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery; Bajorath, J., Ed.; Methods in Molecular Biology, Vol. 275; Humana Press: New Jersey, USA, 2004; pp. 131–213.
(187) Zhu, H.; Rusyn, I.; Richard, A.; Tropsha, A. Environ. Health Perspect. 2008, 116, 506–513.
(188) Sedykh, A.; Zhu, H.; Tang, H.; Zhang, L.; Richard, A.; Rusyn, I.; Tropsha, A. Environ. Health Perspect. 2011, 119, 364–370.
(189) Low, Y.; Uehara, T.; Minowa, Y.; Yamada, H.; Ohno, Y.; Urushidani, T.; Sedykh, A.; Muratov, E.; Kuz’min, V.; Fourches, D.; Zhu, H.; Rusyn, I.; Tropsha, A. Chem. Res. Toxicol. 2011, 24, 1251–1262.
(190) Zhu, H.; Ye, L.; Richard, A.; Golbraikh, A.; Wright, F. A.; Rusyn, I.; Tropsha, A. Environ. Health Perspect. 2009, 117, 1257–1264.
(191) Cunningham, A. R.; Qamar, S.; Carrasquer, C. A.; Holt, P. A.; Maguire, J. M.; Cunningham, S. L.; Trent, J. O. SAR QSAR Environ. Res. 2010, 21, 463–479.
(192) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
(193) Bishop, C. M. Pattern Recognition and Machine Learning; Information Science and Statistics; Springer Science+Business Media, LLC: New York, USA, 2006.
(194) Lowe, R.; Mussa, H. Y.; Mitchell, J. B. O.; Glen, R. C. J. Chem. Inf. Model. 2011, 51, 1539–1544.
(195) Simon-Hettich, B.; Rothfuss, A.; Steger-Hartmann, T. Toxicology 2006, 224, 156–162.
(196) Peterson, K. L. In Reviews in Computational Chemistry, Vol. 16; Lipkowitz, K. B.; Boyd, D. B., Eds.; John Wiley and Sons, Incorporated: Hoboken, New Jersey, USA, 2000; pp. 53–140.
(197) Jensen, F. Introduction to Computational Chemistry, 2nd ed.; John Wiley and Sons Limited: Chichester, England, 2007.
(198) Hinselmann, G.; Rosenbaum, L.; Jahn, A.; Fechner, N.; Ostermann, C.; Zell, A. J. Chem. Inf. Model. 2011, 51, 203–213.
(199) Arodź, T.; Yuen, D. A.; Dudek, A. Z. J. Chem. Inf. Model. 2006, 46, 416–423.
(200) Benigni, R.; Bossa, C. J. Chem. Inf. Model. 2008, 48, 971–980.
(201) Gleeson, M. P.; Hersey, A.; Hannongbua, S. Curr. Top. Med. Chem. 2011, 11, 358–381.
(202) Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. J. Chem. Inf. Comput. Sci. 2004, 44, 1912–1928.
(203) Keerthi, S. S.; Lin, C.-J. Neural Comput. 2003, 15, 1667–1689.
(204) Strobl, C.; Malley, J.; Tutz, G. An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests; Technical Report Number 55; Department of Statistics, University of Munich, 2009.
(205) Burbidge, R.; Trotter, M.; Buxton, B.; Holden, S. Comput. Chem. 2001, 26, 5–14.
(206) Cartwright, H. In Reviews in Computational Chemistry, Vol. 25; Lipkowitz, K. B.; Cundari, T. R., Eds.; John Wiley and Sons, Incorporated: Hoboken, New Jersey, USA, 2007; pp. 349–389.
(207) Golbraikh, A.; Tropsha, A. J. Mol. Graphics Modell. 2002, 20, 269–276.
(208) Zheng, W.; Tropsha, A. J. Chem. Inf. Comput. Sci. 2000, 40, 185–194.
(209) Ivanciuc, O. In Reviews in Computational Chemistry, Vol. 23; Lipkowitz, K. B.; Cundari, T. R., Eds.; John Wiley and Sons, Incorporated: Hoboken, New Jersey, USA, 2007; pp. 291–400.
(210) Müller, K.-R.; Mika, S.; Rätsch, G.; Tsuda, K.; Schölkopf, B. IEEE T. Neural Networ. 2001, 12, 181–201.
(211) Breiman, L. Mach. Learn. 2001, 45, 5–32.
(212) Xia, X.; Maliski, E. G.; Gallant, P.; Rogers, D. J. Med. Chem. 2004, 47, 4463–4470.
(213) Nigsch, F.; Bender, A.; Jenkins, J. L.; Mitchell, J. B. O. J. Chem. Inf. Model. 2008, 48, 2313–2325.
(214) Labute, P. In Proceedings of the Pacific Symposium on Biocomputing ’99; Altman, R. B.; Dunker, A. K.; Hunter, L.; Klein, T. E.; Lauderdale, K., Eds.; World Scientific: New Jersey, USA, 1999; pp. − .
(215) Mussa, H. Y.; Hawizy, L.; Nigsch, F.; Glen, R. C. J. Chem. Inf. Model. 2011, 51, 4–14.
(216) Littlestone, N. Mach. Learn. 1988, 2, 285–318.
(217) Åberg, K. M.; Jacobsson, S. P. J. Chemom. 2010, 24, 650–654.
(218) Fisher, R. A. Annals of Eugenics 1936, 7, 179–188.
(219) Mitchell, J. M. O. In Machine Learning, Neural and Statistical Classification; Michie, D.; Spiegelhalter, D. J.; Taylor, C. C., Eds.; Ellis Horwood Series in Artificial Intelligence; Ellis Horwood Limited: Hemel Hempstead, England, 1994; pp. 17–28.
(220) Henery, R. J. In Machine Learning, Neural and Statistical Classification; Michie, D.; Spiegelhalter, D. J.; Taylor, C. C., Eds.; Ellis Horwood Series in Artificial Intelligence; Ellis Horwood Limited: Hemel Hempstead, England, 1994; pp. 107–124.
(221) Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification; Department of Computer Science, National Taiwan University, 2010.
(222) Song, M.; Clark, M. J. Chem. Inf. Model. 2006, 46, 392–400.
(223) Menze, B. H.; Kelm, B. M.; Masuch, R.; Himmelreich, U.; Bachert, P.; Petrich, W.; Hamprecht, F. A. BMC Bioinf. 2009, 10, 213.
(224) Rusinko, A.; Farmen, M. W.; Lambert, C. G.; Brown, P. L.; Young, S. S. J. Chem. Inf. Comput. Sci. 1999, 39, 1017–1026.
(225) R: A Language and Environment for Statistical Computing; R Development Core Team, R Foundation for Statistical Computing: Vienna, Austria. http://www.r-project.org/ (accessed June 23 2012).
(226) Hugill, M. Advanced Statistics; Bell and Hyman Limited: London, England, 1985.
(227) Strobl, C.; Boulesteix, A.-L.; Zeileis, A.; Hothorn, T. BMC Bioinf. 2007, 8, 25.
(228) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer Series in Statistics; Springer Science+Business Media, LLC: New York, USA, 2009.
(229) Kuz’min, V. E.; Polishchuk, P. G.; Artemenko, A. G.; Andronati, S. A. Mol. Inf. 2011, 30, 593–603.
(230) Baldi, P.; Brunak, S.; Chauvin, Y.; Andersen, C. A. F.; Nielsen, H. Bioinformatics 2000, 16, 412–424.
(231) Thai, K.-M.; Ecker, G. Mol. Diversity 2009, 13, 321–336.
(232) Gorodkin, J. Comput. Biol. Chem. 2004, 28, 367–374.
(233) DeGroot, M. H. Probability and Statistics, 2nd ed.; Addison-Wesley Series in Statistics; Addison-Wesley Publishing Company, Incorporated: USA, 1989.
(234) Cohen, J. Psychol. Bull. 1968, 70, 213–220.
(235) Demel, M. A.; Janecek, A. G. K.; Thai, K.-M.; Ecker, G. F.; Gansterer, W. N. Curr. Comput.-Aided Drug Des. 2008, 4, 91–110.
(236) Konovalov, D. A.; Llewellyn, L. E.; Vander Heyden, Y.; Coomans, D. J. Chem. Inf. Model. 2008, 48, 2081–2094.
(237) Best, D. J.; Roberts, D. E. J. Roy. Stat. Soc. C-App. 1975, 24, 377–379.
(238) Hughes, L. D.; Palmer, D. S.; Nigsch, F.; Mitchell, J. B. O. J. Chem. Inf. Model. 2008, 48, 220–232.
(239) Palmer, D. S.; O’Boyle, N. M.; Glen, R. C.; Mitchell, J. B. O. J. Chem. Inf. Model. 2007, 47, 150–158.
(240) Martin, J. K.; Hirschberg, D. S. Small Sample Statistics for Classification Error Rates I: Error Rate Measurements; Technical Report No. 96-21; Department of Information and Computer Science, University of California, Irvine, USA, 1996.
(241) Cawley, G. C.; Talbot, N. L. C. J. Mach. Learn. Res. 2010, 11, 2079–2107.
(242) Zhu, J. X.; McLachlan, G. J.; Ben-Tovim Jones, L.; Wood, I. A. J. Stat. Plan. Infer. 2008, 138, 374–386.
(243) Hawkins, D. M. J. Chem. Inf. Comput. Sci. 2004, 44, 1–12.
(244) Hansen, K.; Rathke, F.; Schroeter, T.; Rast, G.; Fox, T.; Kriegl, J. M.; Mika, S. J. Chem. Inf. Model. 2009, 49, 1486–1496.
(245) Molinaro, A. M.; Simon, R.; Pfeiffer, R. M. Bioinformatics 2005, 21, 3301–3307.
(246) Parker, B. J.; Günter, S.; Bedo, J. BMC Bioinf. 2007, 8, 326.
(247) Rücker, C.; Rücker, G.; Meringer, M. J. Chem. Inf. Model. 2007, 47, 2345–2357.
(248) Dragos, H.; Gilles, M.; Alexandre, V. J. Chem. Inf. Model. 2009, 49, 1762–1776.
(249) Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; Van De Sandt, J. J. M.; Tong, W.; Veith, G.; Yang, C. ATLA, Altern. Lab. Anim. 2005, 33, 155–173.
(250) Kühne, R.; Ebert, R.-U.; Schüürmann, G. J. Chem. Inf. Model. 2009, 49, 2660–2669.
(251) Tetko, I. V.; Bruneau, P.; Mewes, H.-W.; Rohrer, D. C.; Poda, G. I. Drug Discovery Today 2006, 11, 700–707.
(252) Roberts, D. W.; Patlewicz, G.; Kern, P. S.; Gerberick, F.; Kimber, I.; Dearman, R. J.; Ryan, C. A.; Basketter, D. A.; Aptula, A. O. Chem. Res. Toxicol. 2007, 20, 1019–1030.
(253) Gourley, D. G.; Shrive, A. K.; Polikarpov, I.; Krell, T.; Coggins, J. R.; Hawkins, A. R.; Isaacs, N. W.; Sawyer, L. Nat. Struct. Biol. 1999, 6, 521–525.
(254) Blomberg, L. M.; Mangold, M.; Mitchell, J. B. O.; Blumberger, J. J. Chem. Theory Comput. 2009, 5, 1284–1294.
(255) Tizón, L.; Otero, J. M.; Prazeres, V. F. V.; Llamas-Saiz, A. L.; Fox, G. C.; van Raaij, M. J.; Lamb, H.; Hawkins, A. R.; Ainsa, J. A.; Castedo, L.; González-Bello, C. J. Med. Chem. 2011, 54, 6063–6084.
(256) Hong, H.-J.; Hutchings, M. I.; Hill, L. M.; Buttner, M. J. J. Biol. Chem. 2005, 280, 13055–13061.
(257) Koul, A.; Arnoult, E.; Lounis, N.; Guillemont, J.; Andries, K. Nature 2011, 469, 483–490.
(258) Lipinski, C. A. J. Pharmacol. Toxicol. Methods 2000, 44, 235–249.
(259) ZINC Subsets. http://zincdocking.org/browse/subsets/ (accessed January 24 2012).
(260) Irwin, J. J.; Shoichet, B. K. J. Chem. Inf. Model. 2005, 45, 177–182.
(261) RCSB PDB Home Page. http://www.pdb.org/ (accessed November 24 2011).
(262) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. Nucleic Acids Res. 2000, 28, 235–242.
(263) Ballester, P. J.; Richards, W. G. J. Comput. Chem. 2007, 28, 1711–1723.
(264) Ballester, P. J. Future Med. Chem. 2011, 3, 65–78.
(265) GOLD; The Cambridge Crystallographic Data Centre: Cambridge, United Kingdom. http://www.ccdc.cam.ac.uk/products/life_sciences/gold/ (accessed June 23 2012).
(266) Ballester, P. J.; Mitchell, J. B. O. Bioinformatics 2010, 26, 1169–1175.
(267) Ballester, P. J.; Mitchell, J. B. O. J. Chem. Inf. Model. 2011, 51, 1739–1741.
(268) Robinson, D. A.; Stewart, K. A.; Price, N. C.; Chalk, P. A.; Coggins, J. R.; Lapthorn, A. J. J. Med. Chem. 2006, 49, 1282–1290.
(269) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. Nucleic Acids Res. 2012, 40, D1100–D1107.
(270) Everitt, B. S. In An R and S-Plus Companion to Multivariate Analysis; Springer-Verlag Limited: London, United Kingdom, 2005; pp. 115–136.
(271) Derek for Windows; Lhasa Limited: 22-23 Blenheim Terrace, Woodhouse Lane, Leeds, LS2 9HD. https://www.lhasalimited.org/ (accessed June 22 2012).
(272) Hayashi, M.; Kamata, E.; Hirose, A.; Takahashi, M.; Morita, T.; Ema, M. Mutat. Res., Genet. Toxicol. Environ. Mutagen. 2005, 588, 129–135.
(273) Contrera, J. F.; Kruhlak, N. L.; Matthews, E. J.; Benz, R. D. Regul. Toxicol. Pharmacol. 2007, 49, 172–182.
(274) Matthews, E. J.; Ursem, C. J.; Kruhlak, N. L.; Daniel Benz, R.; Sabaté, D. A.; Yang, C.; Klopman, G.; Contrera, J. F. Regul. Toxicol. Pharmacol. 2009, 54, 23–42.
(275) Python Programming Language. http://www.python.org/ (accessed June 22 2012).
(276) Ellison, C. M.; Sherhod, R.; Cronin, M. T. D.; Enoch, S. J.; Madden, J. C.; Judson, P. N. J. Chem. Inf. Model. 2011, 51, 975–985.
(277) Provost, F.; Fawcett, T. Mach. Learn. 2001, 42, 203–231.
(278) Karlberg, A.-T.; Bergström, M. A.; Börje, A.; Luthman, K.; Nilsson, J. L. G. Chem. Res. Toxicol. 2008, 21, 53–69.
(279) Pichler, W. J. Ann. Intern. Med. 2003, 139, 683–693.
(280) Burman, W. J. Clin. Infect. Dis. 2010, 50, S165–S172.
(281) Lubasch, A.; Erbes, R.; Mauch, H.; Lode, H. Eur. Respir. J. 2001, 17, 641–646.
(282) Chekmarev, D. S.; Kholodovych, V.; Balakin, K. V.; Ivanenkov, Y.; Ekins, S.; Welsh, W. J. Chem. Res. Toxicol. 2008, 21, 1304–1314.
(283) Roche, O.; Trube, G.; Zuegge, J.; Pflimlin, P.; Alanine, A.; Schneider, G. ChemBioChem 2002, 3, 455–459.
(284) Tobita, M.; Nishikawa, T.; Nagashima, R. Bioorg. Med. Chem. Lett. 2005, 15, 2886–2890.
(285) MOE (Molecular Operating Environment); Chemical Computing Group Incorporated: Montreal, Canada. http://www.chemcomp.com/ (accessed September 23 2012).
(286) Nisius, B.; Göller, A. H. J. Chem. Inf. Model. 2009, 49, 247–256.
(287) SciFinder Scholar; Chemical Abstracts Service: Columbus, Ohio, USA. http://www.cas.org/ (accessed September 23 2012).
(288) Fanoe, S.; Jensen, G. B.; Sjøgren, P.; Korsgaard, M. P. G.; Grunnet, M. Br. J. Clin. Pharmacol. 2009, 67, 172–179.
(289) Aronov, A. M. J. Med. Chem. 2006, 49, 6917–6921.
(290) Morik, K.; Brockhausen, P.; Joachims, T. In Proceedings of the 16th International Conference on Machine Learning; Bratko, I.; Dzeroski, S., Eds.; Morgan Kaufmann Publishers Incorporated: San Francisco, USA, 1999; pp. 268–277.
(291) Joachims, T. In Advances in Kernel Methods: Support Vector Learning; Scholkopf, B.; Burges, C.; Smola, A., Eds.; MIT Press, 1999; pp. 169–184.
(292) Bender, A.; Mussa, H. Y.; Glen, R. C.; Reiling, S. J. Chem. Inf. Comput. Sci. 2004, 44, 1708–1718.
(293) Fayyad, U. M.; Irani, K. B. In Proceedings of the 13th International Joint Conference on Artificial Intelligence; 1993; pp. 1022–1029.
(294) Demsar, J.; Zupan, B. Orange: From Experimental Machine Learning to Interactive Data Mining; White Paper; Faculty of Computer and Information Science, University of Ljubljana, 2004. http://orange.biolab.si/wp/orange.pdf (accessed June 23 2012).
(295) Labute, P. J. Mol. Graphics Modell. 2000, 18, 464–477.
(296) Thai, K.-M.; Ecker, G. F. Chem. Biol. Drug Des. 2008, 72, 279–289.
(297) Wiener, H. J. Am. Chem. Soc. 1947, 69, 17–20.
(298) ChemAxon Home Page. http://www.chemaxon.com/ (accessed November 26 2011).
(299) Jamieson, C.; Moir, E. M.; Rankovic, Z.; Wishart, G. J. Med. Chem. 2006, 49, 5029–5046.
(300) Zachariae, U.; Giordanetto, F.; Leach, A. G. J. Med. Chem. 2009, 52, 4266–4276.
(301) Pipeline Pilot Student Edition; Accelrys Incorporated: San Diego, California, USA. http://accelrys.com (accessed June 23 2012).
(302) Weininger, D.; Weininger, A.; Weininger, J. L. J. Chem. Inf. Comput. Sci. 1989, 29, 97–101.
(303) Dudoit, S.; Popper Shaffer, J.; Boldrick, J. C. Statist. Sci. 2003, 18, 71–103.
(304) Kubat, M.; Matwin, S. In Proceedings of the 14th International Conference on Machine Learning; Fisher, D. H., Ed.; Morgan Kaufmann Publishers: San Francisco, USA, 1997.
(305) Imai, Y. N.; Ryu, S.; Oiki, S. J. Med. Chem. 2009, 52, 1630–1638.
(306) Mitcheson, J. S. Br. J. Pharmacol. 2003, 139, 883–884.
(307) Thai, K.-M.; Windisch, A.; Stork, D.; Weinzinger, A.; Schiesaro, A.; Guy, R. H.; Timin, E. N.; Hering, S.; Ecker, G. F. ChemMedChem 2010, 5, 436–442.
(308) Daylight Theory: SMILES. http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html (accessed March 25 2012).
(309) Kortagere, S.; Krasowski, M. D.; Ekins, S. Trends Pharmacol. Sci. 2009, 30, 138–147.
(310) Nicholls, A.; McGaughey, G. B.; Sheridan, R. P.; Good, A. C.; Warren, G.; Mathieu, M.; Muchmore, S. W.; Brown, S. P.; Grant, J. A.; Haigh, J. A.; Nevins, N.; Jain, A. N.; Kelley, B. J. Med. Chem. 2010, 53, 3862–3886.
(311) Grant, J. A.; Gallardo, M. A.; Pickup, B. T. J. Comput. Chem. 1996, 17, 1653–1666.
(312) Good, A. C.; Richards, W. G. J. Chem. Inf. Comput. Sci. 1993, 33, 112–116.
(313) Grant, J. A.; Pickup, B. T. J. Phys. Chem. 1995, 99, 3503–3510.
(314) Nicholls, A.; MacCuish, N. E.; MacCuish, J. D. J. Comput.-Aided Mol. Des. 2004, 18, 451–474.
(315) ROCS; OpenEye Scientific Software: Santa Fe, New Mexico, USA. http://www.eyesopen.com/rocs/ (accessed June 23 2012).
(316) Haque, I. S.; Pande, V. S. J. Comput. Chem. 2010, 31, 117–132.
(317) Kalliokoski, T.; Ronkko, T. P.; Poso, A. Mol. Inf. 2010, 29, 293–296.
(318) Kubinyi, H. Drug Discovery Today 1997, 2, 457–467.
(319) Zauhar, R. J.; Moyna, G.; Tian, L.; Li, Z.; Welsh, W. J. J. Med. Chem. 2003, 46, 5674–5690.
(320) Mak, L.; Grandison, S.; Morris, R. J. J. Mol. Graph. Model. 2008, 26, 1035–1045.
(321) Wilson, J. A.; Bender, A.; Kaya, T.; Clemons, P. A. J. Chem. Inf. Model. 2009, 49, 2231–2241.
(322) Armstrong, M. S.; Morris, G. M.; Finn, P. W.; Sharma, R.; Richards, W. G. J. Mol. Graph. Model. 2009, 28, 368–370.
(323) Zhou, T.; Lafleur, K.; Caflisch, A. J. Mol. Graphics Modell. 2010, 29, 443–449.
(324) Venkatraman, V.; Pérez-Nueno, V. I.; Mavridis, L.; Ritchie, D. W. J. Chem. Inf. Model. 2010, 50, 2079–2093.
(325) Ballester, P. J.; Finn, P. W.; Richards, W. G. J. Mol. Graph. Model. 2009, 27, 836–845.
(326) Ballester, P. J.; Westwood, I.; Laurieri, N.; Sim, E.; Richards, W. G. J. R. Soc. Interface 2010, 7, 335–342.
(327) Bissantz, C.; Kuhn, B.; Stahl, M. J. Med. Chem. 2010, 53, 5061–5084.
(328) Hawkins, P. C. D.; Skillman, A. G.; Nicholls, A. J. Med. Chem. 2007, 50, 74–82.
(329) Kirchmair, J.; Distinto, S.; Markt, P.; Schuster, D.; Spitzer, G. M.; Liedl, K. R.; Wolber, G. J. Chem. Inf. Model. 2009, 49, 678–692.
(330) Sheridan, R. P.; Kearsley, S. K. Drug Discov. Today 2002, 7, 903–911.
(331) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1996, 36, 572–584.
(332) Foloppe, N.; Chen, I.-J. Curr. Med. Chem. 2009, 16, 3381–3413.
(333) CORINA; Molecular Networks GmbH – Computerchemie: Erlangen, Germany. http://www.molecular-networks.com/products/corina (accessed February 8 2012).
(334) Jones, G.; Willett, P.; Glen, R. C. J. Mol. Biol. 1995, 245, 43–53.
(335) Chen, I.-J.; Foloppe, N. Drug Dev. Res. 2011, 72, 85–94.
(336) Rupp, M. Kernel Methods for Virtual Screening. PhD Thesis, Goethe Universität, Frankfurt am Main, 2009.
(337) Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 1997, 37, 1–9.
(338) Shamovsky, I.; Connolly, S.; David, L.; Ivanova, S.; Nordén, B.; Springthorpe, B.; Urbahns, K. J. Med. Chem. 2008, 51, 1162–1178.
(339) Rohrer, S. G.; Baumann, K. J. Chem. Inf. Model. 2009, 49, 169–184.
(340) Daylight Theory: SMARTS. http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed June 23 2012).
(341) Open Babel SMARTS. http://openbabel.org/wiki/SMARTS (accessed March 1 2012).
(342) Clayden, J. P.; Greeves, N.; Warren, S.; Wothers, P. D. In Organic Chemistry; Oxford University Press Incorporated: New York, 2001; pp. 1136–1137.
(343) Wojciechowski, M.; Grycuk, T.; Antosiewicz, J. M.; Lesyng, B. Biophys. J. 2003, 84, 750–756.
(344) Herr, R. J. Bioorg. Med. Chem. 2002, 10, 3379–3393.
(345) Babić, S.; Horvat, A. J. M.; Mutavdžić Pavlović, D.; Kaštelan-Macan, M. TrAC, Trends Anal. Chem. 2007, 26, 1043–1061.
(346) Mills, J. E. J.; Dean, P. M. J. Comput.-Aided Mol. Des. 1996, 10, 607–622.
(347) Böhm, M.; Klebe, G. J. Med. Chem. 2002, 45, 1585–1597.
(348) Goldgur, Y.; Craigie, R.; Cohen, G. H.; Fujiwara, T.; Yoshinaga, T.; Fujishita, T.; Sugimoto, H.; Endo, T.; Murai, H.; Davies, D. R. PNAS 1999, 96, 13040–13043.
(349) Mason, J. S.; Morize, I.; Menard, P. R.; Cheney, D. L.; Hulme, C.; Labaudiniere, R. F. J. Med. Chem. 1999, 42, 3251–3264.
(350) Steiner, T. Angew. Chem., Int. Ed. 2002, 41, 48–76.
(351) Marcou, G.; Rognan, D. J. Chem. Inf. Model. 2007, 47, 195–207.
(352) Wildman, S. A.; Crippen, G. M. J. Chem. Inf. Comput. Sci. 1999, 39, 868–873.
(353) Böhm, H.; Banner, D.; Bendels, S.; Kansy, M.; Kuhn, B.; Müller, K.; Obst-Sander, U.; Stahl, M. ChemBioChem 2004, 5, 637–643.
(354) Landrum, G. RDKit. http://www.rdkit.org/ (accessed June 23 2012).
(355) Politzer, P.; Lane, P.; Concha, M. C.; Ma, Y.; Murray, J. S. J. Mol. Model. 2007, 13, 305–311.
(356) Auffinger, P.; Hays, F. A.; Westhof, E.; Ho, P. S. PNAS 2004, 101, 16789–16794.
(357) Humphrey, W.; Dalke, A.; Schulten, K. J. Mol. Graphics 1996, 14, 33–38.
(358) VMD (Visual Molecular Dynamics); Theoretical and Computational Biophysics Group, Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign: Illinois, USA. http://www.ks.uiuc.edu/Research/vmd/ (accessed March 5 2012).
(359) Armstrong, M. S.; Morris, G. M.; Finn, P. W.; Sharma, R.; Moretti, L.; Cooper, R. I.; Richards, W. G. J. Comput.-Aided Mol. Des. 2010, 24, 789–801.
(360) Gedeck, P.; Rohde, B.; Bartels, C. J. Chem. Inf. Model. 2006, 46, 1924–1936.
(361) Maestro and MacroModel; Schrödinger, LLC: New York, USA. http://www.schrodinger.com/productsguide/ (accessed June 23 2012).
(362) GOLDSuite; Cambridge Crystallographic Data Centre: Cambridge, United Kingdom. http://www.ccdc.cam.ac.uk/products/gold_suite/ (accessed June 23 2012).
(363) Halgren, T. A. J. Comput. Chem. 1999, 20, 720–729.
(364) Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127–6129.
(365) Ponder, J. W.; Richards, F. M. J. Comput. Chem. 1987, 8, 1016–1024.
(366) Chang, G.; Guida, W. C.; Still, W. C. J. Am. Chem. Soc. 1989, 111, 4379–4386.
(367) Zambre, V. P.; Murumkar, P. R.; Giridhar, R.; Yadav, M. R. J. Mol. Graph. Model. 2010, 29, 229–239.
(368) Schreyer, A.; Blundell, T. Chem. Biol. Drug Des. 2009, 73, 157–167.
(369) Clark, R. D.; Norinder, U. WIREs Comput. Mol. Sci. 2012, 2, 108–113.
(370) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. J. Comput.-Aided Mol. Des. 1997, 11, 425.
(371) Baxter, C. A.; Murray, C. W.; Clark, D. E.; Westhead, D. R.; Eldridge, M. D. Proteins 1998, 33, 367–382.
(372) SciFinder; Chemical Abstracts Service: Columbus, Ohio, USA. https://scifinder.cas.org (accessed November 24 2011).
(373) Bouckaert, R. R.; Frank, E. In Proceedings of the 8th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining ’04; Dai, H.; Srikant, R.; Zhang, C., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2004; pp. 3–12.
(374) Benjamini, Y.; Hochberg, Y. J. R. Statist. Soc. 1995, 57, 289–300.
(375) Benjamini, Y.; Yekutieli, D. Ann. Statist. 2001, 29, 1165–1188.
(376) Benjamini, Y. Am. Stat. 1988, 42, 257–262.
(377) Cannon, E. O. Chemical Informatics of Prohibited Substances. PhD Thesis, University of Cambridge, United Kingdom, 2008.
(378) OpenEye Home Page. http://www.eyesopen.com/ (accessed March 18 2012).
(379) Martin, Y. C.; Porter, W.; Wu, J.; Metz, J.; Fu, W. Log10D octanol/buffer Computation: Which Combination of Computational Predictors Gives the Best Prediction of Experimental Results? Oral Presentation, ChemAxon US User Group Meeting, 2010. http://www.chemaxon.com/library/evaluation-of-software-for-property-predictions/ (accessed April 26 2012).
(380) Platts, J. A.; Butina, D.; Abraham, M. H.; Hersey, A. J. Chem. Inf. Comput. Sci. 1999, 39, 835–845.
(381) Abraham, M. H.; McGowan, J. C. Chromatographia 1987, 23, 243–246.
(382) Yap, C. W. J. Comput. Chem. 2011, 32, 1466–1474.
(383) Yap, C. W. PaDEL-Descriptor; National University of Singapore: Singapore. http://padel.nus.edu.sg/software/padeldescriptor/ (accessed June 23 2012).
(384) Abraham, M. H. Chem. Soc. Rev. 1993, 22, 73–83.
(385) Clark, M. J. Chem. Inf. Model. 2005, 45, 30–38.
(386) Liaw, A. R News 2002, 2, 3:18-22.
(387) Grant, A. O. Circ.: Arrhythmia Electrophysiol. 2009, 2, 185–194.
(388) Markandeya, Y. S.; Fahey, J. M.; Pluteanu, F.; Cribbs, L. L.; Balijepalli, R. C. J. Biol. Chem. 2011, 286, 2433–2444.
(389) Jackson, C. M.; Blass, B.; Coburn, K.; Djandjighian, L.; Fadayel, G.; Fluxe, A. J.; Hodson, S. J.; Janusz, J. M.; Murawsky, M.; Ridgeway, J. M.; White, R. E.; Wu, S. Bioorg. Med. Chem. Lett. 2007, 17, 282–284.
(390) Darbar, D.; Kimbrough, J.; Jawaid, A.; McCray, R.; Ritchie, M. D.; Roden, D. M. J. Am. Coll. Cardiol. 2008, 51, 836–842.
(391) Arizona Center for Education and Research on Therapeutics. http://www.azcert.org/ (accessed November 19 2011).
(392) Ligand.Info Home Page. http://ligand.info/ (accessed March 12 2011).
(393) von Grotthuss, M.; Koczyk, G.; Pas, J.; Wyrwicz, L. S.; Rychlewski, L. Comb. Chem. High Throughput Screening 2004, 7, 757–761.
(394) Bolton, E. E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. In Annual Reports in Computational Chemistry, Vol. 4; Cornell, W., Ed.; Elsevier BV., 2007; Chapter 12.
(395) Chu, J.; Zhang, S.; Zhuang, Y.; Chen, J.; Li, Y. Process Biochem. 2002, 38, 815–820.
(396) Yoshizawa, S.; Fourmy, D.; Puglisi, J. D. EMBO J. 1998, 17, 6437–6448.
(397) Stockman, B. J.; Scahill, T. A.; Strakalaitis, N. A.; Brunner, D. P.; Yem, A. W.; Deibel Jr., M. R. FEBS Lett. 1994, 349, 79–83.
(398) England, N. W. Automatic Analysis and Validation of Open Polymer Data. PhD Thesis, University of Cambridge, United Kingdom, 2011.
(399) Kaul, M.; Barbieri, C. M.; Srinivasan, A. R.; Pilch, D. S. J. Mol. Biol. 2007, 369, 142–156.
(400) Bass, A. S.; Vargas, H. M.; Valentin, J.-P.; Kinter, L. B.; Hammond, T.; Wallis, R.; Siegl, P. K. S.; Yamamoto, K. J. Pharmacol. Toxicol. Methods 2011, 64, 7–15.
(401) Aslanian, R.; Piwinski, J. J.; Zhu, X.; Priestley, T.; Sorota, S.; Du, X.-Y.; Zhang, X.-S.; McLeod, R. L.; West, R. E.; Williams, S. M.; Hey, J. A. Bioorg. Med. Chem. Lett. 2009, 19, 5043–5047.
(402) Zhu, B.-Y.; Jia, Z. J.; Zhang, P.; Su, T.; Huang, W.; Goldman, E.; Tumas, D.; Kadambi, V.; Eddy, P.; Sinha, U.; Scarborough, R. M.; Song, Y. Bioorg. Med. Chem. Lett. 2006, 16, 5507–5512.
(403) Rohrer, S. G.; Baumann, K. J. Chem. Inf. Model. 2008, 48, 704–718.
(404) Celik, L.; Lund, J. D. D.; Schiøtt, B. Chem. Res. Toxicol. 2008, 21, 2195–2206.
(405) Shanle, E. K.; Xu, W. Chem. Res. Toxicol. 2010, 24, 6–19.
(406) Hinselmann, G.; Rosenbaum, L.; Jahn, A.; Fechner, N.; Zell, A. J. Cheminf. 2011, 3, 3.
(407) Psyco Home Page.
http://psyco.sourceforge.net/ (accessed March 3 2012).
(408) Python-Statlib Home Page. http://code.google.com/p/python-statlib/ (accessed June 23 2012).
(409) MySQL for Python Home Page. http://sourceforge.net/projects/mysql-python/ (accessed June 23 2012).
(410) Numpy Home Page. http://numpy.scipy.org/ (accessed June 23 2012).
(411) Cygwin Home Page. http://www.cygwin.com/ (accessed September 23 2012).

Appendix A. Supplementary Files

All code, additional results and data files referred to in this thesis are to be found, along with explanatory README files, in chapter-specific subdirectories of the ZIP file (Final_PhD_Thesis_SI_RLMarcheseRobinson.zip) saved onto the DVD attached to the inside cover.

Appendix B. Performance of Toxicity Models Previously Reported in the Literature

| Study | Descriptors | Algorithm | Training (no. of compounds, details) | Validation (no. of compounds, details) | MCC | P-Value (MCC) | Recall (A) | Recall (I) |
|---|---|---|---|---|---|---|---|---|
| Li et al. (90) | GRIND | SVM (WEKA), linear | 476 (54 Active, 422 Inactive) | LOOCV (internal validation)* | 0.28 | 1.0E-09 | 0.33 | 0.92 |
| Li et al. (90) | GRIND | SVM (WEKA), linear | 476 (54 Active, 422 Inactive) | WOMBAT-PK: 66 (19 Active, 47 Inactive), external test | 0.40 | 1.2E-03 | 0.68 | 0.74 |
| Li et al. (90) | GRIND | SVM (WEKA), non-linear | 476 (54 Active, 422 Inactive) | LOOCV (internal validation) | 0.18 | 8.6E-05 | 0.16 | 0.96 |
| Li et al. (90) | GRIND | SVM (WEKA), non-linear | 476 (54 Active, 422 Inactive) | WOMBAT-PK | 0.29 | 1.8E-02 | 0.83 | 0.47 |
| Dubus et al. (91) | P_VSA | QuaSAR-Classify | 160 (80 Active, 80 Inactive) | 43 (16 Active, 27 Inactive), external test of P_VSA models | 0.66 | 1.5E-05 | 0.94 | 0.74 |
| Dubus et al. (91) | Dubus-Rel | QuaSAR-Classify | 160 (80 Active, 80 Inactive) | 43 (16 Active, 27 Inactive), external test of P_VSA models | 0.56 | 2.4E-04 | 0.94 | 0.63 |
| Dubus et al. (91) | P_VSA | QuaSAR-Classify | 160 (80 Active, 80 Inactive) | Training set | 0.95 | 2.9E-33 | 0.98 | 0.98 |
| Dubus et al. (91) | Dubus-Rel | QuaSAR-Classify | 160 (80 Active, 80 Inactive) | Training set | 0.91 | 1.2E-30 | 0.98 | 0.94 |
| Tobita et al. (284) | MOE descriptors (2D) + MACCS (feature selection) | SVM (WEKA), RBF | 73 (28 Active, 45 Inactive) | 10-fold CV (internal?) | 0.80 | 8.2E-12 | 0.86 | 0.93 |
| Thai and Ecker 2008 (69) | P_VSA | Binary QSAR | 240 (81 Active, 159 Inactive) | LOOCV | 0.57 | 1.0E-18 | 0.58 | 0.94 |
| Thai and Ecker 2008 (69) | Thai-Rel | Binary QSAR | 240 (81 Active, 159 Inactive) | LOOCV | 0.56 | 4.1E-18 | 0.61 | 0.91 |
| Thai and Ecker 2008 (69) | P_VSA | Binary QSAR | 240 (81 Active, 159 Inactive) | 73 (19 Active, 54 Inactive) | 0.70 | 2.2E-09 | 0.68 | 0.96 |
| Thai and Ecker 2008 (69) | Thai-Rel | Binary QSAR | 240 (81 Active, 159 Inactive) | 73 (19 Active, 54 Inactive) | 0.82 | 2.5E-12 | 0.79 | 0.98 |
| Thai and Ecker 2008 (69) | P_VSA | Binary QSAR | 223 (81 Active, 142 Inactive), R-COOH excluded | LOOCV | 0.51 | 2.6E-14 | 0.56 | 0.91 |
| Thai and Ecker 2008 (69) | Thai-Rel | Binary QSAR | 223 (81 Active, 142 Inactive), R-COOH excluded | LOOCV | 0.58 | 4.78E-18 | 0.62 | 0.92 |
| Thai and Ecker 2008 (69) | P_VSA | Binary QSAR | 223 (81 Active, 142 Inactive), R-COOH excluded | 64 (19 Active, 45 Inactive), R-COOH excluded | 0.78 | 4.4E-10 | 0.68 | 1.00 |
| Thai and Ecker 2008 (69) | Thai-Rel | Binary QSAR | 223 (81 Active, 142 Inactive), R-COOH excluded | 64 (19 Active, 45 Inactive), R-COOH excluded | 0.85 | 1.0E-11 | 0.84 | 0.98 |
| Thai and Ecker 2009 (231) | SIBAR (based on Thai-Rel) | Binary QSAR | 194 | 58 (5 Active, 53 Inactive), external test | 0.56 | 2.0E-05 | 0.60 | 0.96 |
| Gavaghan et al. (15) | DRONE (2D+3D) + SELMA (2D) | Hierarchical PLS | 436 | 7,520 (605 Active, 6915 Inactive), external test | 0.55 | 0.0E+00 | 0.63 | 0.96 |

* Private correspondence with Dr Olivier Taboureau.

Table B.1 Performance of hERG blocker binary classification models for which the two classes were defined as per Chapter 4 (IC50 threshold: 1 μM). N.B.: Gavaghan et al. (15) specifically applied their regression model to binary classification. All values highlighted in bold were calculated by this author. Where not explicitly provided, the TP, TN, FP and FN values required to compute the MCC were estimated, to the nearest integer, using the reported recall for each class, in conjunction with the number of validation compounds in each class. All p-values corresponding to the MCC values were computed from the corresponding chi-squared statistic, as per Baldi et al. (230), supposing one degree of freedom (see Chapter 2, section 2.6.4.1), using the CHIDIST( ) function in Excel 2007 (32-bit).

Appendix C. Additional Computational Details

Unless stated otherwise in the main text,* the following software and hardware were employed.

* Details explicitly specified in the main text are not repeated here.

Software Tools

Python

Save for control scripts run on 'non-local' machines (see below), this author primarily used Python (275) version 2.5.2 (32-bit) and, if applicable, the following non-standard modules.
• Pybel (177). Version 1.4 (Chapter 3 and Chapter 4). Version 1.6 (Chapter 5 and Chapter 6). See below for corresponding Open Babel versions.
• Psyco (407). Version 1.6.
• Statlib (408). Version 1.1.0.
• MySQLdb (409). Version 1.2.2. See below for corresponding MySQL versions.
• NumPy (410). Version 1.2.1 (Chapter 3 and Chapter 4). Version 1.6.0 (Chapter 5 and Chapter 6).

Python version 2.7.3 (32-bit) was used to run all scripts employed for calling, and parsing the output of, the R scripts used for generation and analysis of the FOM values obtained in Chapter 5 and Chapter 6.

R

The versions of R (225) and the corresponding non-standard packages used for the following calculations and analyses are presented in Table C.1.

| Calculations/Analyses | Version of R | Non-Standard R Packages |
|---|---|---|
| Chapter 4: Random Forest model generation and PCA plot generation; Chapter 5: Training RF-Score | 2.7.0, 32-bit | randomForest (version 4.5-30) |
| Chapter 3: Plot generation; Chapter 5 and Chapter 6: Machine Learning | 2.12.1, 64-bit | randomForest (version 4.6-2); caret (version 4.98), with its dependencies: iterators, itertools, plyr, reshape (versions 1.0.4, 0.1-1, 1.5.2 and 0.8.4 respectively); gtools (version 2.6.2); gplots (version 2.10.1), with its dependencies: caTools, gdata and bitops (versions 1.12, 2.8.2 and 1.0-4.1 respectively) |
| Chapter 5 and Chapter 6: Computing, plotting and statistical analysis of FOM values | 2.15.1, 64-bit | gplots (version 2.11.0), with its dependencies: caTools, gdata, bitops and gtools (versions 1.13, 2.11.0, 1.0-4.1 and 2.7.0 respectively) |

Table C.1 R setup for all relevant calculations/analyses presented in this thesis.

Molecular Operating Environment (MOE)

MOE (285) versions 2008.10 (Chapter 4) and 2011.10 (Chapter 5) were employed, unless noted otherwise, using their default settings.

Pipeline Pilot

Pipeline Pilot Student Edition (301) (version 6.1.5) was used throughout.

Open Babel

Open Babel versions 2.2.1 (Chapter 3 and Chapter 4) and 2.3.0a (Chapter 5 and Chapter 6) were employed.

ChemAxon

All references to ChemAxon 298 tools in Chapter 4 refer to version 5.1.3. Otherwise, all references to ChemAxon tools refer to those installed via the jchem-5.5.0.1-windows_with_jre installer.

MySQL

MySQL version 5.5.13 was used to load the ChEMBL database as required in Chapter 6.

Cygwin

The Unix-like utilities (e.g. rm, cp) made available with Cygwin 411 were employed in various scripting workflows run on machines with Windows operating systems (see below).

Computational Resources

| Description | Calculations | Specifications |
|---|---|---|
| Standalone machine (local workstation) | Chapter 3 (save for plot generation in R); Chapter 4 (save for Winnow, Winnow derivatives and MOE calculations); RF-Score training and validation on the PDBbind database, plus Pipeline Pilot calculations, in Chapter 5 | Windows XP Professional, Service Pack 3, 32-bit |
| Standalone machine (local workstation) | Chapter 5 (save where otherwise noted in this table); Chapter 6 (save where otherwise noted in this table) | Windows 7 Professional, Service Pack 1, 64-bit; Intel® Core™2 Duo CPU E7500 @ 2.93 GHz |
| Standalone machine (local workstation) | Chapter 5 and Chapter 6: Computing, plotting and statistical analysis of FOM values | Windows 7 Enterprise, Service Pack 1, 64-bit; Intel® Core™2 Duo CPU P8600 @ 2.40 GHz |
| Various standalone machines | Winnow feature selection, external validation, OSB generation and raw feature importance computation (Chapter 4); CORINA (Chapter 5); MOE (Chapter 4) | Linux |
| University of Cambridge's distributed computing facility, CAMGRID | SVMlight (Chapter 4); Winnow training cycles selection (Chapter 4) | Linux |
| Cluster | GOLD docking runs (Chapter 5) | Linux; Intel Xeon X5650 |

Table C.2 Relevant details for all computational resources used.
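
The estimation procedure described in the note to Table B.1 (Appendix B) can be sketched in Python as follows. This is a minimal illustration rather than the code used for this thesis: the function names are hypothetical, the relation chi-squared = N x MCC^2 is the standard one for a 2x2 confusion matrix and is assumed to be the one referred to in Chapter 2, and math.erfc is used in place of the Excel CHIDIST( ) function (for one degree of freedom, the upper-tail chi-squared probability equals erfc(sqrt(chi-squared / 2))).

```python
import math


def confusion_from_recall(recall_active, recall_inactive, n_active, n_inactive):
    """Estimate TP, TN, FP, FN (nearest integer) from the per-class recall."""
    tp = round(recall_active * n_active)
    tn = round(recall_inactive * n_inactive)
    fp = n_inactive - tn  # inactives wrongly predicted active
    fn = n_active - tp    # actives wrongly predicted inactive
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mcc_p_value(mcc_value, n_total):
    """P-value for an MCC via chi-squared = N * MCC^2, one degree of freedom."""
    chi2 = n_total * mcc_value ** 2
    # Upper-tail probability of chi-squared with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Example inputs: the WOMBAT-PK external test row for the linear SVM of
# Li et al. in Table B.1 (19 Active, 47 Inactive; recall 0.68 and 0.74).
tp, tn, fp, fn = confusion_from_recall(0.68, 0.74, 19, 47)
value = mcc(tp, tn, fp, fn)
print(tp, tn, fp, fn, round(value, 2), mcc_p_value(value, tp + tn + fp + fn))
```

For this row the sketch gives an MCC of about 0.40, in line with the tabulated value. The p-value is of the same order as that tabulated; small differences may arise from, for example, whether the MCC is rounded before the chi-squared statistic is computed.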