This readme file was generated on 2023-03-26 by Andrew Wedlake

GENERAL INFORMATION

Title of Dataset: Curated public bioactivity data for pharmacologically important off-targets.

Author/Principal Investigator Information
Name: Andrew Wedlake
ORCID: 0000-0002-1013-6556
Institution: University of Cambridge
Address: Department of Chemistry, Lensfield Rd, Cambridge CB2 1EW
Email: ajw259@cam.ac.uk

Author/Associate or Co-investigator Information
Name: Jonathan Goodman
ORCID: 0000-0002-8693-9136
Institution: University of Cambridge
Address: Department of Chemistry, Lensfield Rd, Cambridge CB2 1EW
Email: jmg11@cam.ac.uk

Date of data collection: 2018-04-01

Geographic location of data collection: Cambridge, UK

Information about funding sources that supported the collection of the data: N/A

SHARING/ACCESS INFORMATION

Licenses/restrictions placed on the data: N/A

Links to publications that cite or use the data:
https://pubs.acs.org/doi/abs/10.1021/acs.chemrestox.9b00325
https://pubs.acs.org/doi/abs/10.1021/acs.chemrestox.0c00332

Was data derived from another source? If yes, list source(s): ChEMBL, ToxCast

Recommended citation for this dataset: Chem. Res. Toxicol. 2020, 33, 2, 388–401

DATA & FILE OVERVIEW

File List:
Folders: "Training Sets" and "Test Sets".
Each folder contains the following files:
Acetylcholinesterase.csv
Adenosine A2a receptor.csv
AGTR1.csv
AKT1.csv
Alpha-2a adrenergic receptor.csv
Androgen receptor.csv
BACE1.csv
BCHE.csv
Beta-1 adrenergic receptor.csv
Beta-2 adrenergic receptor.csv
CASP1.csv
CASP2.csv
CASP3.csv
CASP5.csv
CASP8.csv
CASP10.csv
CHRM5.csv
CHUK.csv
CSF1R.csv
CSNK1D.csv
Delta opioid receptor.csv
Dopamine D1 receptor.csv
Dopamine D2 receptor.csv
Dopamine transporter.csv
EDNRB.csv
ELANE.csv
Endothelin receptor ET-A.csv
EPHA2.csv
EPHB2.csv
FGFR1.csv
FKBP1A.csv
FLT1.csv
FLT4.csv
FYN.csv
Glucocorticoid receptor.csv
GSK3B.csv
HDAC3.csv
HERG.csv
Histamine H1 receptor.csv
IGF1R.csv
INSR.csv
KDR.csv
LTB4R.csv
LYN.csv
MAPK1.csv
MAPK3.csv
MAPK9.csv
MAPKAPK2.csv
MET.csv
MMP2.csv
MMP3.csv
MMP9.csv
MMP13.csv
Mu opioid receptor.csv
Muscarinic acetylcholine receptor M1.csv
Muscarinic acetylcholine receptor M2.csv
Muscarinic acetylcholine receptor M3.csv
NEK2.csv
Norepinephrine transporter.csv
NR1I3.csv
P2RY1.csv
PAK4.csv
PDE4A.csv
PDE5A.csv
PIK3CA.csv
PPARG.csv
PPP1CA.csv
PPP2CA.csv
PTEN.csv
PTPN1.csv
PTPN2.csv
PTPN11.csv
PTPN13.csv
PTPN14.csv
RAF1.csv
RARA.csv
RARB.csv
ROCK1.csv
RPS6KA5.csv
Serotonin 2a (5-HT2a) receptor.csv
Serotonin 3a (5-HT3a) receptor.csv
Serotonin transporter.csv
SIRT2.csv
SIRT3.csv
SRC.csv
TACR2.csv
TBXA2R.csv
TEK.csv
Tyrosine-protein kinase LCK.csv
Vasopressin V1a receptor.csv

Training sets were used for training models. Models were applied to test sets and predictions were evaluated. Data were split randomly into a 75% training set and a 25% test set.

METHODOLOGICAL INFORMATION

Description of methods used for collection/generation of data:
Bioactivity data from in vitro assays were extracted from ChEMBL version 23 (data extracted April 2018) [25]. Activity reports with a confidence score of less than eight were removed, leaving only reports from assays that measure changes to the function of a single protein target directly or through a homologous single protein target.
Following precedent [7, 26], we combined data for different aspects of biological activity through protein binding (EC50, IC50, Kd, Ki). Common salts and counterions were stripped with the RDKit Salt Stripper [27]. Any chemicals with more than 100 atoms were removed: these chemicals can cause problems for the common substructure algorithm and tend to be less biologically relevant in terms of exposure. For each chemical, the mean of the reported activities was taken so as to use all the data available; values reported as "greater than" a certain value were excluded from this calculation. Chemicals with a mean activity of 10 µM or lower were assigned as active; those with a mean activity over 10 µM were assigned as inactive.

People involved with sample collection, processing, analysis and/or submission: Andrew Wedlake

DATA-SPECIFIC INFORMATION FOR: all files

Number of variables: 4

Variable List:
SMILES - Canonical SMILES, standardised and canonicalised with RDKit in KNIME.
Binary Activity - Activity class defined by a classification threshold of 10 µM on mean activity (10 µM or lower is active; over 10 µM is inactive).
STANDARD_VALUE - Mean activity in µM.
STANDARD_TYPE - The types of raw measurements in the aggregation, e.g.
IC50, EC50

Number of cases/rows:
AGTR1: training set 1480 rows - test set 508 rows
AKT1: training set 2972 rows - test set 1015 rows
Acetylcholinesterase: training set 3471 rows - test set 1106 rows
Adenosine A2a receptor: training set 4539 rows - test set 1487 rows
Alpha-2a adrenergic receptor: training set 1418 rows - test set 441 rows
Androgen receptor: training set 7458 rows - test set 2463 rows
BACE1: training set 6449 rows - test set 2173 rows
BCHE: training set 2651 rows - test set 894 rows
Beta-1 adrenergic receptor: training set 1766 rows - test set 577 rows
Beta-2 adrenergic receptor: training set 2963 rows - test set 996 rows
CASP1: training set 3438 rows - test set 1130 rows
CASP10: training set 801 rows - test set 249 rows
CASP2: training set 830 rows - test set 264 rows
CASP3: training set 2235 rows - test set 770 rows
CASP5: training set 1283 rows - test set 405 rows
CASP8: training set 1095 rows - test set 367 rows
CHRM5: training set 1322 rows - test set 442 rows
CHUK: training set 1031 rows - test set 355 rows
CSF1R: training set 1789 rows - test set 600 rows
CSNK1D: training set 1300 rows - test set 437 rows
Delta opioid receptor: training set 3144 rows - test set 1081 rows
Dopamine D1 receptor: training set 2513 rows - test set 830 rows
Dopamine D2 receptor: training set 5117 rows - test set 1716 rows
Dopamine transporter: training set 3315 rows - test set 1113 rows
EDNRB: training set 1535 rows - test set 510 rows
ELANE: training set 2612 rows - test set 897 rows
EPHA2: training set 1216 rows - test set 414 rows
EPHB2: training set 810 rows - test set 291 rows
Endothelin receptor ET-A: training set 1821 rows - test set 616 rows
FGFR1: training set 2556 rows - test set 815 rows
FKBP1A: training set 994 rows - test set 368 rows
FLT1: training set 2385 rows - test set 783 rows
FLT4: training set 1317 rows - test set 440 rows
FYN: training set 1127 rows - test set 370 rows
GSK3B: training set 2863 rows - test set 944 rows
Glucocorticoid receptor: training set 7517 rows - test set 2474 rows
HDAC3: training set 1684 rows - test set 508 rows
HERG: training set 6038 rows - test set 2103 rows
Histamine H1 receptor: training set 1785 rows - test set 597 rows
IGF1R: training set 2698 rows - test set 919 rows
INSR: training set 1487 rows - test set 494 rows
KDR: training set 7032 rows - test set 2364 rows
LTB4R: training set 1042 rows - test set 340 rows
LYN: training set 1137 rows - test set 366 rows
MAPK1: training set 13053 rows - test set 4233 rows
MAPK3: training set 865 rows - test set 304 rows
MAPK9: training set 1739 rows - test set 578 rows
MAPKAPK2: training set 1488 rows - test set 500 rows
MET: training set 3014 rows - test set 1002 rows
MMP13: training set 2634 rows - test set 866 rows
MMP2: training set 3446 rows - test set 1170 rows
MMP3: training set 2085 rows - test set 712 rows
MMP9: training set 3324 rows - test set 1107 rows
Mu opioid receptor: training set 4384 rows - test set 1535 rows
Muscarinic acetylcholine receptor M1: training set 2421 rows - test set 836 rows
Muscarinic acetylcholine receptor M2: training set 2731 rows - test set 938 rows
Muscarinic acetylcholine receptor M3: training set 2029 rows - test set 622 rows
NEK2: training set 1009 rows - test set 349 rows
NR1I3: training set 2600 rows - test set 822 rows
Norepinephrine transporter: training set 3672 rows - test set 1181 rows
P2RY1: training set 1273 rows - test set 390 rows
PAK4: training set 1127 rows - test set 356 rows
PDE4A: training set 1263 rows - test set 407 rows
PDE5A: training set 2086 rows - test set 641 rows
PIK3CA: training set 5089 rows - test set 1721 rows
PPARG: training set 8785 rows - test set 2862 rows
PPP1CA: training set 776 rows - test set 258 rows
PPP2CA: training set 770 rows - test set 279 rows
PTEN: training set 1509 rows - test set 491 rows
PTPN1: training set 2775 rows - test set 878 rows
PTPN11: training set 1181 rows - test set 385 rows
PTPN13: training set 792 rows - test set 277 rows
PTPN14: training set 774 rows - test set 270 rows
PTPN2: training set 1166 rows - test set 382 rows
RAF1: training set 1832 rows - test set 603 rows
RARA: training set 2718 rows - test set 887 rows
RARB: training set 2751 rows - test set 895 rows
ROCK1: training set 1823 rows - test set 590 rows
RPS6KA5: training set 944 rows - test set 320 rows
SIRT2: training set 1249 rows - test set 400 rows
SIRT3: training set 919 rows - test set 307 rows
SRC: training set 3220 rows - test set 1016 rows
Serotonin 2a (5-HT2a) receptor: training set 3568 rows - test set 1223 rows
Serotonin 3a (5-HT3a) receptor: training set 1126 rows - test set 381 rows
Serotonin transporter: training set 3909 rows - test set 1269 rows
TACR2: training set 2011 rows - test set 782 rows
TBXA2R: training set 2180 rows - test set 726 rows
TEK: training set 1450 rows - test set 470 rows
Tyrosine-protein kinase LCK: training set 1688 rows - test set 568 rows
Vasopressin V1a receptor: training set 1243 rows - test set 435 rows