Automated Analysis and Validation of Chemical Literature

Joseph Andrew Townsend
Corpus Christi College

A dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy

Unilever Centre for Molecular Science Informatics
Department of Chemistry
Lensfield Road, Cambridge, CB2 1EW, United Kingdom

August 5, 2007

Disclaimer

This thesis is the result of my own work and includes nothing which is the outcome of work done in collaboration, except where specifically indicated. This thesis does not exceed the specified word limit (60000) as defined by the Chemistry Degree Committee. This thesis has been typeset in 12pt font using LaTeX2ε according to the specifications defined by the Board of Graduate Studies and the Chemistry Degree Committee.

Abstract

Methods for automatically extracting data held in legacy formats in the chemical literature, converting it to machine-understandable forms and validating it are examined. The work focuses on three types of data: analytical data reported in articles, computational chemistry output files and crystallographic information files (CIFs). It is shown that machines are capable of reading and extracting analytical data from the current legacy formats with high recall and precision. Regular expressions cannot identify chemical names with high precision or recall but non-deterministic methods perform significantly better. The lack of machine-understandable connection tables in the literature has been identified as the major issue preventing molecule-based data-driven science from being performed in the area. The extraction of data from computational chemistry output files using parser-like approaches is shown to be not generally possible, although such methods work well for input files. A hierarchical regular-expression-based approach can parse > 99.9% of the output files correctly, although significant human input is required to prepare the templates.
CIFs may be parsed with extremely high recall and precision; they contain connection tables and their data is of high quality. The comparison of bond lengths calculated by two computational chemistry programs shows good agreement in general, but structures containing specific moieties cause discrepancies. An initial protocol for the high-throughput geometry optimisation of molecules extracted from the CIFs is presented and the refinement of this protocol is discussed. Differences of less than 0.03 Å between calculated bond lengths and the experimentally determined values from the CIFs are shown to be expected from random error. The final protocol is used to find high-quality structures from crystallography which can be reused for further science.

Acknowledgements

I would like to thank Dr Jonathan Goodman, Dr Peter Murray-Rust, Sam Adams, Fraser Norton and Chris Waudby for their help and advice on the OSCAR project. Dr Murray-Rust, Dr Simon “Billy” Tyrrell, Dr Yong “YY” Zhang and Nick Day are also thanked for their support and cooperation during the course of my PhD. Dr Charlotte Bolton deserves special mention for providing the computational support without which this work would not have been possible. The Pickerel and Helen Clubb also deserve mention for allowing me to sanity check my ideas and for general support. The Unilever Centre for Molecular Science Informatics and the Royal Society of Chemistry are thanked for funding.

Contents

Disclaimer
Abstract
Acknowledgements
Table of contents
List of tables
List of figures
Glossary
1 Introduction
  1.1 Data-Driven Science
  1.2 Metadata, Syntax and Semantics
  1.3 Data Validation
  1.4 eScience
  1.5 Machine-Understandable Data
  1.6 Formats for Reporting Chemistry
  1.7 Compute Processes
  1.8 Analysis and Visualisation
  1.9 Publication
  1.10 eXtensible Markup Language
    1.10.1 Validation
    1.10.2 Namespaces
    1.10.3 Data Display
  1.11 Chemical Markup Language
    1.11.1 JUMBO
  1.12 Scalable Vector Graphics
    1.12.1 SVG Graphing
  1.13 Java
  1.14 Jmol
2 The Quality of Data in the Chemical Literature
  2.1 Information Extraction
  2.2 The Experimental Data Checker
    2.2.1 The legacy format of organic chemistry articles
    2.2.2 Common Chemical Concepts
    2.2.3 Regular Expressions
    2.2.4 Structure of the EDC
  2.3 OSCAR: Blurring the Line Between Authoring Tool and IE Tool
    2.3.1 Information Extraction Tests
  2.4 OSCAR2: the Importance of Chemical Names
    2.4.1 The Importance of Connection Tables
  2.5 OSCAR3 and OPSIN: Parsing Chemical Names to CTs
  2.6 Conclusions
3 Parsing Program Input: Compilers
  3.1 Compilers and Classification Techniques
    3.1.1 The Phases of a Compiler
    3.1.2 Context-Free Grammars
    3.1.3 Parse Trees
    3.1.4 Predictive Parsers
    3.1.5 Lexical analysis
    3.1.6 Regular Expressions
    3.1.7 Non-deterministic parsing
    3.1.8 Bayesian Classification
  3.2 JFlex and CUP
  3.3 Parsing Program Input
    3.3.1 Parsing Program Output
4 Parsing Program Output: JUMBOMarker
  4.1 Design of JUMBOMarker
  4.2 JUMBOMarker: Single-Pass, Single-Parse
  4.3 JUMBOMarker: Two-Pass, Two-Parse
  4.4 JUMBOMarker: Multi-Pass, Multi-Parse
  4.5 Conclusions
5 High-Throughput Computing
  5.1 Condor
  5.2 MOPAC and NCI HT Computing
    5.2.1 InChI
    5.2.2 Proteus Molecules
  5.3 MOPAC and GAMESS
  5.4 Results
    5.4.1 All Bonds
    5.4.2 C–C bonds
    5.4.3 C–N bonds
    5.4.4 C–O and N–O bonds
    5.4.5 C–S bonds
  5.5 Time
  5.6 Conclusions
6 X-Ray Crystallography
  6.1 Determining the Structure
  6.2 Derived Results
    6.2.1 β, B and U Parameters
    6.2.2 Libration
    6.2.3 Minor Conformations and Incorrectly Assigned Atoms
    6.2.4 Atomic Scattering Factors
  6.3 The CIF Format
  6.4 Quality Indicators
7 Creating a Workflow
  7.1 Existing Workflow Technology
  7.2 CIF Repositories
  7.3 Download
  7.4 Create Input
  7.5 Run
    7.5.1 The Clusters
    7.5.2 Schedulers
  7.6 Retrieve Results
  7.7 Results Repository
  7.8 Designing a Robust Analysis Method
    7.8.1 Altering the File Structure
    7.8.2 Altering the File
  7.9 Creating a Protocol
  7.10 Molecule Selection Parameters
  7.11 Refining the Protocol
  7.12 Conclusions
8 Results
  8.1 Failure Analysis
    8.1.1 Insufficient Time
    8.1.2 SCF Did Not Converge
    8.1.3 Bad Delocalised Coordinates Generated
    8.1.4 Incorrect Charge or Multiplicity
  8.2 Proteus Molecules
  8.3 CIF Analysis
  8.4 Bond Lengths
    8.4.1 S–X Bonds
    8.4.2 All Bonds
    8.4.3 C–C Bonds
    8.4.4 C–N and C–O Bonds
    8.4.5 PLATON
  8.5 Uiso,bond
  8.6 Time
  8.7 Applying the Protocol
  8.8 Conclusions
A Computational Chemistry
  A.1 ab initio Calculations
    A.1.1 Closed Shell Self Consistent Field Theory
  A.2 Density Functional Theory
  A.3 Semi-Empirical Methods
B Regular Expressions in Java
C Backus-Naur Form
D JFlex Lexical Rules
E GROMACS topology file
F Solvents and counter ions
G Molecules
H Molecule Optimisations
I Published Work
Bibliography

List of Tables

2.1 Comparison of spell checkers
2.2 OSCAR results
2.3 Chemical name recall and precision of OSCAR2
5.1 Comparison of run times
5.2 Failure rates
5.3 Predicting run times
6.1 Refinement method comparison
8.1 Calculation statistics
8.2 Insufficient time
8.3 SCF not converged
8.4 Bad delocalised coordinates generated
8.5 Crystal system analysis
8.6 Space group statistics
8.7 R-factor statistics
8.8 GAMESS GAUSSIAN03 comparison
8.9 S–X (X=C,N) bond lengths
8.10 Run times
8.11 Crystallographic filters
B.1 Regular expression constructs as specified by Java
E.1 The topology file
E.2 Intramolecular actions definitions
G.1 Final molecules
H.1 The geometry of bv6006molecule3
H.2 The geometry of ci6067molecule2
H.3 The geometry of rz6070molecule1

List of Figures

1.1 Extracting data from different domains
1.2 A typical organic synthesis
1.3 Article and thesis structure
1.4 Multiple properties associated with a molecule
1.5 Comparing different instances of a property
1.6 Hexacyclinol
1.7 Validating properties prior to publication
1.8 An eScience experiment
1.9 A chemical reaction
1.10 A chemical reaction
1.11 A chemical reaction
1.12 A section of a GROMACS topology file
1.13 A section of a MOPAC output file
1.14 An example of XML
1.15 XSLT flow
2.1 Typical organic article structure
2.2 Typical organic analytical data
2.3 Organic synthetic methodology
2.4 Chemical discourse
2.5 Narrative discourse
2.6 Aspirin
2.7 The structure of the original EDC
2.8 The GUI of the original EDC
2.9 The structure of the EDC version 2
2.10 Regular expressions
2.11 EDC highlighting data
2.12 EDC expert analysis
2.13 Backus-Naur form of a paragraph from OBC
2.14 EDC tabulating data
2.15 OSCAR parsing strategies
2.16 An example of an HNMR spectrum
2.17 EDC spectrum plotting
2.18 Synthesis of similar molecules
2.19 The OSCAR workflow
2.20 The parse tree of a chemical name
2.21 Ambiguous chemical names
3.1 Overview of a compiler
3.2 The phases of a compiler
3.3 Compiler assignments
3.4 Compiler structure
3.5 A simple parse tree
3.6 Lexical analyser example
3.7 Producer-consumer pair
3.8 A parse tree generated by Chomskian analysis
3.9 Creating a lexical analyser with JFlex
3.10 The specifications of a JFlex file
3.11 The comment remover pre-processor in JFlex
3.12 An example production
3.13 Definitions of terminal and non terminal tokens
3.14 GROMACS topology file
3.15 MOPAC output matched by JFlex
4.1 Various program designs
4.2 Linking programs with CML
4.3 The DTD for JUMBOMarker
4.4 A section of a GAMESS output file
4.5 Point groups in MOPAC
4.6 A JUMBOMarker template
4.7 Section of MOPAC output
4.8 Section of MOPAC output
4.9 JUMBOMarker: single-pass, single-parse
4.10 JUMBOMarker: two-pass, two-parse
4.11 JUMBOMarker: multi-pass, multi-parse
5.1 NCI MOPAC workflow
5.2 MOPAC mismatches
5.3 InChI example
5.4 Proteus molecule example
5.5 Proteus molecule example
5.6 Cross-checking MOPAC
5.7 GAMESS input
5.8 Protocol development cycle
5.9 Unusual behaviour shown by the Z-matrix
5.10 QQ plot C–Cl
5.11 QQ plot N–N
5.12 QQ plot C–C
5.13 X-Y plot C–C bonds
5.14 QQ plot all bonds
5.15 C–C major outlier
5.16 X-Y plot C–CFn containing molecules
5.17 QQ plot C–N
5.18 QQ plot C–O
5.19 QQ plot N–O
5.20 Density plots for C–O and N–O
5.21 Molecular fragments
5.22 QQ plot C–S
5.23 Mean calculation step time
5.24 Basis functions against number of atoms
5.25 Total time against non-hydrogen atoms
5.26 s against the mean time per non-hydrogen atom
6.1 Structures in the CCDC by year
6.2 Crystal structure determination workflow
6.3 Riding motion
7.1 A simple workflow
7.2 High-level calculation workflow
7.3 Organisation of supplementary data
7.4 The data structure for downloaded files
7.5 The data structure for downloaded files
7.6 The data structure for downloaded files
7.7 Retrieve results workflow
7.8 X-Y plot all bonds
7.9 QQ plot all bonds
7.10 QQ plot no solvents
7.11 QQ plot no solvent/ion/guest molecules and no possible hydrogen bonds
7.12 QQ plot R-factor ≤ 0.05
7.13 QQ plot no constrained bonds
7.14 QQ plot no constrained molecules
7.15 QQ plot only specified atoms
7.16 QQ plot manual removal
7.17 QQ plot cyclic
7.18 QQ plot cyclic
7.19 X-Y plot cyclic
8.1 Energy profile
8.2 Incorrect CT
8.3 InChI in a CML document
8.4 R-factor over time
8.5 Histogram of bond length esds
8.6 Long bonds to sulphur
8.7 QQ plot all bonds
8.8 QQ plot C–C cyclic bonds
8.9 Loss of H-bonding
8.10 QQ plot C–N cyclic bonds
8.11 QQ plot C–O cyclic bonds
8.12 Uiso,bond against Temperature
8.13 Displacement Ellipsoids
8.14 Uiso,bond against Temperature
8.15 CIF GAMESS times
8.16 Filters
8.17 Maximum torsion angle
8.18 Torsion angles of toluene

Glossary

BNF     Backus-Naur form
CAS     Chemical Abstracts Service
CCDC    Cambridge Crystallographic Data Centre
CIF     Crystallographic Information File
CML     Chemical Markup Language
CNMR    Carbon Nuclear Magnetic Resonance
CSD     Cambridge Structural Database
CSS     Cascading Style Sheet
CT      Connection Table
DFT     Density Functional Theory
DOM     Document Object Model
DTD     Document Type Definition
EDC     Experimental Data Checker
GTO     Gaussian Type Orbital
HT      High-Throughput
HNMR    Proton Nuclear Magnetic Resonance
HRMS    High Resolution Mass Spectroscopy
HTML    Hyper Text Markup Language
IAM     Independent Atom Model
IE      Information Extraction
IUCr    International Union of Crystallography
IUPAC   International Union of Pure and Applied Chemistry
m_u     Unified Atomic Mass Unit, 1.6605 × 10^-27 kg
MP2     Møller-Plesset perturbation theory, second order
NCI     National Cancer Institute
NMR     Nuclear Magnetic Resonance
OBC     Organic & Biomolecular Chemistry
OOP     Object-Orientated Programming
OSCAR   Open Source Chemistry Analysis Routines
QQ      Quantile-Quantile
RSC     Royal Society of Chemistry
SCF     Self Consistent Field
SGML    Standard Generalised Markup Language
STO     Slater Type Orbital
SVG     Scalable Vector Graphics
UCC     Unilever Centre for Molecular Science Informatics
W3C     World Wide Web Consortium
WS      WebService
XML     eXtensible Markup Language
XSL     eXtensible Style Language
XSLT    eXtensible Style Language Transformations

x̄       The arithmetic mean of the data type x, given by
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{1} \]

s       The standard deviation of a sample of n instances of data type x, given by
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{2} \]

ρ       Spearman’s rank correlation coefficient, given by
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \tag{3} \]
        where $d_i$ is the difference between each rank of corresponding values of x and y.

W       The test statistic for the Shapiro-Wilk normality test, given by
\[ W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{4} \]
        where $x_{(i)}$ is the ith order statistic, i.e. the ith-smallest number in the sample, and the constants $a_i$ are given by
\[ (a_1, \ldots, a_n) = \frac{m^{T} V^{-1}}{\sqrt{m^{T} V^{-1} V^{-1} m}} \tag{5} \]
        where
\[ m = (m_1, \ldots, m_n)^{T} \tag{6} \]
        and $m_i$ are the expected values of the order statistics of independent and identically-distributed random variables sampled from the standard normal distribution and V is the covariance matrix of those order statistics.

Chapter 1

Introduction

The volume of scientific data∗ being produced is exploding and phrases such as data deluge, data avalanche and digital deluge are now commonly seen in the literature [1]. As an illustration of the scale of the current problem, Lesk estimates that the Bible requires 5 × 10^6 bytes of storage [2] whereas the annual production of refereed journal literature is estimated by Hey et al. to require 1 × 10^12 bytes [1]. The volumes of data produced in the fields of astronomy and bioscience are doubling every 12 and 9 months respectively [3]. This is faster than the measured rate of increase in performance of computer chips (Moore’s Law), which doubles every 18 months. However, modern analysis methods are also much faster, which should result in an ideal situation.

∗The word data is used frequently in this thesis both as a synonym for a collection of information and as the plural of datum; the meaning should be apparent from context.

Figure 1.1: Data from different domains can be extracted (and combined) to a more usable form.

The information explosion is causing problems for many scientific communities because the data production and data analysis processes are currently disconnected [4]. That is, the data is not available in suitable formats for re-use, often as a result of the publication process. Data is seen as an integral part of any scientific discipline and the ability to view and analyse another scientist’s data is essential. The current reliance on paper publishing prevents much of the data produced during an experiment from ever being published. It is now common to see phrases such as

    . . . one hundred and five optimised geometries had been obtained
    in the present work. Reporting all of these optimised geometries is nonsense [5]

in an article and it is estimated that ca. 80% of crystal structures determined are never published [6, 7]. The publication of theses presents similar problems: approximately 60 gigabytes of analysis, programs and, most importantly, data were created as a part of this thesis. It would be possible to include this as supplementary data by requesting that a CD is included as part of the thesis, although to include all the data would actually require nearly 100 CDs [8].

The publication process forces the decoupling of the interpretation and analysis (which is published) from the data (which often is not published in full). There is a desire amongst the community to improve this situation, enabling more rapid publication and dissemination processes. In future, it is hoped that everything can be captured and published [10]; Hagdorn’s recent thesis is an example of how theses may appear in the future [9]. However, solving this problem will increase the information overload.

    4-Acetoxy-6-hydroxymethyl-3-methyl-5-propyl-1H-pyridin-2-one (26). NaBH4 (0.127 g, 3.36 mmol) was added to a solution of aldehyde 24 (0.662 g, 2.79 mmol) in EtOH (18 mL) at 0 °C and the mixture was stirred for 2.5 h at rt. After addition of sat. NH4Cl solution (10 mL), the mixture was concentrated in vacuo and then extracted with EtOAc (5 × 25 mL). The combined organic phases were dried over Na2SO4, then filtered and concentrated in vacuo. The residue was recrystalized from EtOAc to yield 0.58 g (87 %) of alcohol 26 as colorless crystals. – 1H NMR (250 MHz, CDCl3): δ 0.92 (t, 3J = 7.2 Hz, 3 H), 1.41 (mc, 2 H), 1.91 (s, 3 H), 2.22 (t, 3J = 7.4 Hz, 2 H), 2.34 (s, 3 H), 4.65 (s, 2 H).

Figure 1.2: A typical paragraph from the supplemental data of an article in the Journal of Organic Chemistry [14].
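
The analytical data at the end of the paragraph in figure 1.2 follows a regular pattern: a chemical shift, then a bracketed multiplicity, an optional coupling constant and a proton count. As a minimal sketch of how such a string might be read by a machine, the Python fragment below pulls those fields out with a single regular expression. The pattern and the names PEAK and extract_peaks are illustrative assumptions made here; they are not the expressions used by the tools described in this thesis, and real articles vary far more than this one example.

```python
import re

# Analytical data copied from the paragraph in figure 1.2.
NMR_TEXT = (
    "1H NMR (250 MHz, CDCl3): δ 0.92 (t, 3J = 7.2 Hz, 3 H), 1.41 (mc, 2 H), "
    "1.91 (s, 3 H), 2.22 (t, 3J = 7.4 Hz, 2 H), 2.34 (s, 3 H), 4.65 (s, 2 H)."
)

# One peak: a shift followed by "(multiplicity[, coupling constant], n H)".
# Hypothetical pattern written for this example only.
PEAK = re.compile(
    r"(?P<shift>\d+\.\d+)\s*"                                     # chemical shift in ppm
    r"\(\s*(?P<mult>[a-z]+)\s*,"                                  # multiplicity, e.g. s, t, mc
    r"(?:\s*\d?J\s*=\s*(?P<coupling>\d+(?:\.\d+)?)\s*Hz\s*,)?"    # optional coupling constant
    r"\s*(?P<protons>\d+)\s*H\s*\)"                               # number of protons
)


def extract_peaks(text):
    """Return one dictionary per peak found in an HNMR data string."""
    peaks = []
    for match in PEAK.finditer(text):
        coupling = match.group("coupling")
        peaks.append({
            "shift_ppm": float(match.group("shift")),
            "multiplicity": match.group("mult"),
            "coupling_hz": float(coupling) if coupling else None,
            "protons": int(match.group("protons")),
        })
    return peaks


peaks = extract_peaks(NMR_TEXT)
total_protons = sum(peak["protons"] for peak in peaks)
```

For the paragraph in figure 1.2 this yields six peaks accounting for 15 protons. Once the data is held in this structured form, simple checks (for example comparing the proton count against the molecular formula) become possible without a human reading the text.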

1.1 Data-Driven Science

To allow data-driven science (and further re-use), methods to extract data in legacy formats to more usable forms are required, and the sheer volume of data requires that any process to do this must be automated. This work considers possible approaches for the creation of such transformation processes and their applicability, with the premise that

    . . . the current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science. [11]

The data deluge is not confined to any one scientific domain; therefore multiple tools may be required to extract it to a reusable form (figure 1.1). A human is clearly not capable of performing an analysis of all the annual journal literature unaided, or even just that of the chemical literature: Chemical Abstracts Service reported over 1 million articles in 2006 and now contains the abstracts of over 25 million articles [12, 13]. A typical article in the Journal of Organic Chemistry (ca. 8 pages) frequently has supporting supplemental data (which may not be refereed). The volume of the supplementary data is often five to ten times that of the original article, further exacerbating the problem. However, once the data is in the relevant form a computer can be used to help analyse it, allowing humans to interpret abnormalities, trends and other features of interest. This is a prime motivation for representing the data in a machine-understandable† form.

1.2 Metadata, Syntax and Semantics

Metadata‡ is essential for both the automation of the extraction process and the subsequent machine-understandability. Metadata such as those developed by the Dublin Core Metadata Initiative allow the provenance and reusability of data to be interpreted by a machine [16].

For extraction to be possible, the semantics and syntax of the legacy format must be understood: semantics refers to the meaning of the data and syntax to its structure.
The chemical field (especially organic chemistry) contains well-understood semantics and syntax which have been stable for a long time and can be represented in a machine-understandable format using Chemical Markup Language. There is also a large legacy corpus. As a result, organic chemistry represents an attractive area for data extraction. The most semantically and syntactically rich area of organic chemistry is the synthesis of a molecule and the associated analytical data; an example of such data is shown in figure 1.2.

The legacy corpus consists of two major document types: journal articles and theses. The structure of the two types differs but the data is expected to be recoverable using natural language techniques (figure 1.3). Whilst significant progress was made in this area, including the development of ideas applicable to other scientific fields, the extraction did not prove to be as tractable as expected.

†The term machine-understandable is explained in section 1.5.

‡Metadata is commonly defined as data that is used to describe other data; this might range from who authored the data to the fact that the document conforms to a particular DTD [15].

Figure 1.3: The general structure of journal articles and theses differs, but both should be amenable to natural language parsing techniques.

1.3 Data Validation

Chemistry is based on the synthesis, structure and properties of molecules; it is therefore vital that a molecule can be uniquely and globally identified. IUPAC’s International Chemical Identifier (InChI) provides an ideal solution [17]. There are often multiple instances of a particular molecule, and of the properties§ associated with it, reported in the literature. By using unique identifiers and appropriate metadata all the instances can be combined to give an overall picture (figure 1.4).

In general, the ideal value of a property associated with a molecule can never be known.
However, properties can be observed or calculated, and agreement between multiple instances of a property reinforces the belief that the ideal value is being approached. Conversely, disagreement between the values can lead to improvements of the measurement method or calculation (figure 1.5). Stewart found significant errors in the literature pertaining to experimentally-determined enthalpies of formation which were detected by comparison with calculated values [18]. In his own words

. . . the accuracy of experimental enthalpies of formation can be investigated by using semiempirical methods. Where agreement between calculated and reported enthalpies of formation exists, it is reasonable to assume that the experimental value is, indeed, accurate. Where there is a large difference, the error is likely to be either in the computational method or in the experiment; the probability of both being equally incorrect is small. [18]

§A property in this case refers to physical quantities relating to that molecule, such as the melting point or chemical formula.

Figure 1.4: A molecule may have multiple properties associated with it; repeated instances of a particular property may be reported independently.

Figure 1.5: The agreement between different instances of a property can reinforce the belief that the value is approaching the ideal value. Conversely, differences between the values can lead to refinement of the calculation or experimental processes so a more accurate value can be found.

More recently, the structure of hexacyclinol (figure 1.6) has been the matter of debate. Gräfe proposed a structure for the molecule in 2002 [19]. Following a total synthesis, La Clair confirmed this structure [20]. Rychnovsky simulated the compound's 13C NMR spectrum to see if it agreed with that reported and came to the conclusion that

. . . the original structure assignment doesn't fit the data. [21]

Figure 1.6: The structures of hexacyclinol proposed by Gräfe (left) and Rychnovsky (right).

Using the same computational method, a revised structure was proposed [21]. It is expected that the increased availability and speed of computing resources will in future allow such a validation to become a matter of course, leading to the situation shown in figure 1.7. Such a situation requires the computational chemistry output to be machine-understandable.

Figure 1.7: The increased availability of computational resources will hopefully lead to the validation of both calculated and observed properties prior to publication.

1.4 eScience

The UK eScience program, launched by John Taylor (Director General of Research Councils), is designed to

. . . develop advances in scientific data curation and analysis and to be a primary source of top quality systems and repositories that enable management, sharing and best use of research data. [22]

eScience builds upon accessible structured knowledge resources and information, a requirement which is a severe limitation in eChemistry. This work examines methods to allow large volumes of chemical information to be converted rapidly to structured data by automated methods, requiring ideally no human intervention, thereby making it machine-readable and machine-understandable. One of the goals of the community is to create a Semantic Web of data, described in 2001 by Tim Berners-Lee as:

. . . an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. [23]

The critical part of the above quotation is that information is given well-defined meaning.
In relation to chemistry, this means that a computer must understand the concept of a solvent, for example, which requires the development of an ontology (a full description of the properties applicable to, and relationships between, each term). This is the difference between machine-readable and machine-understandable data. With these technologies in place, proponents of the Semantic Web assert that computers will be able to reason [24, 25, 26].

The outline of a typical eScience experiment is shown in figure 1.8 in the most general terms possible; this form of experiment is commonly referred to as a workflow. It is important to note that it is not always necessary for all of the processes to be present for an experiment to take place; for instance, if the data is already in an appropriate form then no translation process would be required.

eScience experiments often deal with large quantities of data (especially before the cleaning process), with the volume likely to increase in future; automation of the various processes (and of the methods of connecting them) is therefore becoming increasingly necessary. This work examines all the components of such an experiment but in essence attempts to answer the question

can a machine read the scientific literature as it is currently presented and re-use it for scientific research?

Figure 1.8: An outline of a typical eScience experiment; it is expected that many refinement cycles occur before the final publication.

It should be noted that the final publication process does not necessarily indicate that an article in a journal will be the result.

To illustrate a possible experiment, imagine a chemist trying to answer the question:

what percentage of the molecules with a reported synthesis published in the last year in Organic & Biomolecular Chemistry contained a non-conjugated ketone?
The collection of the data involves the automated reading of all the journal articles from the relevant year; the transformation and cleaning might require converting all the articles into a machine-understandable form and removing all those compounds that do not have a reported synthesis; the compute process would determine (perhaps by examining the molecular structure or infra-red spectrum) which of these contained a non-conjugated ketone; the analysis simply converts the counts of the molecules of each type to a percentage, which is finally relayed to the chemist in a suitable manner.

Each process in the workflow can be thought of as a workflow in its own right, with a very similar structure to the overall experiment. In effect each process must consider

• how the data comes in
• how it can be translated to a usable form
• which transform is to be performed
• how the transform is analysed (for example, did the transform work)
• how the data should be emitted

1.5 Machine-Understandable Data

It is important to clarify what is meant by machine-understandable data and this can be done most easily by example. Figure 1.9 shows a picture of an organic reaction; this is human-readable and human-understandable (at least for someone acquainted with basic organic chemistry). It is also machine-readable, in that a program has determined that the binary file holding the data should be rendered as a picture on screen. The data contained in the picture is, however, not machine-understandable: the picture could be anything and would be treated the same way. To extract structured data from such a representation is difficult and methods for this are not considered in this work.

Figure 1.10 shows the same reaction in a text-based representation which is still human-readable, human-understandable and machine-readable.
The data is still not machine-understandable in that the text could still be anything, but it is now in a more accessible form because text can be more easily manipulated and searched than the pictorial representation.

Figure 1.9: A chemical reaction represented pictorially.

Figure 1.10: A chemical reaction represented in a text-based form.

Figure 1.11 again shows the same reaction and again the representation is text-based, but it now contains XML elements that are used to mark up the data. This representation is probably the least human-friendly. It contains a lot of information that the reader already knows, which makes it difficult to read, but it is still human-understandable. It is this (human-redundant) information which allows this representation to be machine-understandable, although it should be noted that simply marking up the data does not mean that the data is machine-understandable. However, if the syntax and semantics used to mark up the data are predefined then a chemically aware program can implicitly understand that a reaction is occurring between the two reagents to form the two products.

Chemical Markup Language (CML) provides the set of elements and semantics required to hold most chemical data. Holding the data in this form also has the advantage that it is relatively easy to convert it to other forms; details of how this might be achieved and further background on XML and CML are given in section 1.10. The translation process shown in figure 1.8 therefore represents the conversion of data from another form into CML, a process commonly referred to as parsing. The other forms of data representation are considered below.

Figure 1.11: A chemical reaction represented using structured text.

1.6 Formats for Reporting Chemistry

Chemistry is reported in a variety of formats ranging from highly structured documents with well-defined fields to narrative discourse.
One of the goals of this study is to explore the feasibility of extraction of data into a semantically-rich and thereby machine-understandable form. In general, documents with more structure are simpler to parse because there are fewer ambiguities and many of the underlying semantics are given explicitly. The formats covered in this thesis fall into two main categories: human-authored data, and data relating to computational chemistry programs, which is usually machine-authored. An example of human-authored data has already been encountered in figure 1.2. The techniques developed to parse data in this form are examined in chapter 2.

The data relating to computational chemistry programs can be split into a further two categories: program input and program output. Input data (figure 1.12) is inherently machine-readable and must be machine-understandable, albeit only by the specific program. Techniques adapted from compiler theory allow such data to be transformed into more generally machine-understandable forms (see chapter 3). Program output (figure 1.13) is generated by a machine and thus consists of a finite vocabulary. Typically this has well-defined structure but error messages may appear at any point, making perfect parsing almost impossible. It is designed to be human-readable and human-understandable but not necessarily machine-understandable. Parsing data in this form required the development of new methods, the details of which are given in chapter 4.

    [ atoms ]
    ;  nr  type  resnr  resid  atom  cgnr  charge
        1     O      1    DRG    O8     1   0.000
        2    CB      1    DRG    C3     1   0.000
    . . .
        8     H      1    DRG   HAB     1   0.280
    [ bonds ]
    ;ai  aj  fu  c0  c1
      1   2   1  0.123  502080.0  0.123  502080.0  ;  O8  C3
    . . .
      7   8   1  0.100  374468.0  0.100  374468.0  ;  N4  HAB

Figure 1.12: A section of a GROMACS topology file, representing data in a machine-readable and machine-understandable form.

1.7 Compute Processes

There are two compute processes involved in this work reflecting the two major sources of data considered.
The first process involves checking that the analytical data reported for a molecule is self-consistent and reasonable; the second uses computational chemistry programs to find an optimised structure of a molecule and requires a little more introduction.

Quantum chemistry has emerged as an important tool for investigating a wide range of problems in chemistry. With the development of computational methods and the availability of more powerful computers, it has become possible to solve chemical problems that until relatively recently could not be solved. Quantum-mechanical methods are now routinely applied to problems related to molecular structure and reactivity [27].

    *******************************************************************************
    **                         MOPAC2002 (c) Fujitsu                             **
    *******************************************************************************
    *  MOPAC2002 Version 1.01   CALC.'D. Mon May 19 01:25:16 2001
    *  PM5 - THE PM5 HAMILTONIAN TO BE USED
    46
      ATOM    CHEMICAL   BOND LENGTH    BOND ANGLE    TWIST ANGLE
     NUMBER   SYMBOL     (ANGSTROMS)    (DEGREES)     (DEGREES)
        1       Cl        0.000000       0.000000      0.000000
        2       N         5.069306  *    0.000000      0.000000      1
    46
       48       H         1.090015  *  109.466551  *  54.661165  *  23  22  18
    EMPIRICAL FORMULA: C18 H24 N3 Cl3
    MOLECULAR POINT GROUP : C1
    SIGMA BONDS 49
    LONE PAIRS 12
    FINAL HEAT OF FORMATION = -0.90975 KCAL = -3.80639 KJ
    ELECTRONIC ENERGY = -30264.46015 EV    POINT GROUP: C1
    CORE-CORE REPULSION = 26026.82711 EV
    MOLECULAR WEIGHT = 388.767

Figure 1.13: A section of a MOPAC output file.

There are three common approaches to computational chemistry: ab initio methods, semi-empirical methods and density functional theory (DFT). As described in section A.1, ab initio methods are 100% mathematical, meaning that all of the information generated about an atom, molecule or reaction comes from the fundamental quantum mechanical calculations (specifically, the Schrödinger equation).
This requires significant computing resources, hence most calculations are limited to small molecules, typically those consisting of fewer than 100 atoms. Semi-empirical methods provide a way to study larger molecules. As the name suggests, semi-empirical methods are a combination of ab initio methods coupled with the use of data from empirical studies: they are based on the Hartree-Fock formalism but make many approximations and obtain some parameters from empirical data rather than from theoretical principles. DFT is often considered to be an ab initio method for determining molecular electronic structure, despite the fact that most of the common functionals use parameters derived from empirical data.

It is usual for the limiting factor in calculations to be the amount of time available for the calculation to run. It is therefore important to be able to predict how long a particular calculation will take. The ability to do this becomes increasingly important if the process is to be automated. The background and mathematical forms (where appropriate) of the three methods are given in appendix A. From equation A.25 it is seen that basic ab initio methods require a four-way integration over all the basis functions to be performed; such methods are expected to scale as n^4, where n is the number of basis functions. Improvements to the basic theory such as second order Møller-Plesset perturbation theory (MP2) or coupled cluster single and double excitation calculations (CCSD) scale to higher powers (n^5 and n^7 respectively). DFT implementing a standard Coulomb integral scales as n^4, but only as n^3 when using an auxiliary basis. However, in the limit of a large molecule, the pairwise interaction dominates (which scales quadratically) and the overall behaviour typically scales as n^2.2. Semi-empirical methods tend to scale as n^2 [28]. The use of these scaling relationships in predicting calculation times is shown in sections 5.5 and 8.6.
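The scaling relationships above can be turned into a rough run-time estimate. The following is a minimal sketch in Python (chosen for brevity; it is not the code used in sections 5.5 and 8.6), assuming the run time follows a pure power law t = c * n^k with the prefactor c fixed by a single reference calculation. The function name, the reference timing and the dictionary of exponents are illustrative assumptions; only the exponents themselves are taken from the text.

```python
def estimate_time(ref_time_s: float, ref_n: int, n: int, exponent: float) -> float:
    """Extrapolate a run time, assuming t = c * n**exponent.

    ref_time_s -- measured time of a reference calculation (seconds)
    ref_n      -- number of basis functions in the reference calculation
    n          -- number of basis functions in the new calculation
    exponent   -- scaling power k for the chosen method
    """
    if ref_n <= 0 or n <= 0:
        raise ValueError("basis-function counts must be positive")
    # The prefactor c cancels when the ratio of the two sizes is taken.
    return ref_time_s * (n / ref_n) ** exponent


# Exponents quoted in the text above (hypothetical labels).
SCALING = {
    "basic ab initio": 4.0,
    "MP2": 5.0,
    "DFT, large-molecule limit": 2.2,
    "semi-empirical": 2.0,
}

if __name__ == "__main__":
    # Hypothetical reference: 60 s for 100 basis functions; the size is then doubled.
    for method, k in SCALING.items():
        print(f"{method:28s} {estimate_time(60.0, 100, 200, k):10.0f} s")
```

A pure power law ignores fixed start-up costs and the number of optimisation cycles, so an estimate of this kind is probably only useful as an ordering of jobs or as an upper bound for a queueing system.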
1.8 Analysis and Visualisation

The process of analysing the data in an eScience experiment can vary greatly in complexity and in some cases can be almost trivial. For instance, the analysis process in chapter 2 only requires the list of problems found in the analytical data to be sorted in order of decreasing severity. It is often the case that the type of analysis required will depend largely on the visualisation and the manner of publication. There have been two approaches taken in this work and these are discussed separately below.

The Experimental Data Checker (EDC) and OSCAR programs were originally designed to aid chemists and editors in finding mistakes in the analytical data; the intention was to create a system where possible errors in the data were highlighted for subsequent human curation. The publication process therefore involves the reporting of the possible errors to the user. The visualisation process began as a plain text list of the errors found, although subsequently, to aid comprehension, these errors were highlighted in different colours (reflecting the severity of the perceived error). As the program evolved, a tool to recreate the various reported spectra was incorporated into the visualisation process and the user was given the ability to view the analysed data in various forms. All the various parts of the programs were written in Java (see section 1.13) and further details are given in chapter 2.

The analysis process in the remaining chapters focuses mainly on comparing the three-dimensional structures of molecules before and after a geometry optimisation calculation, thereby allowing the refinement of the cleaning process. The geometries are compared primarily by examining the change in bond length, angle and torsion between the two structures, usually by using graphs to detect anomalies and trends.
Unfortunately no tools were found that were able to represent the data in the required manner, hence a graphing tool was written to allow the appropriate visualisation (see section 1.12.1). This program allowed each data point to be linked to an external web page which contained the input and output structures in Jmol applets (see section 1.14). Jmol is a molecular viewer which allows the user to manipulate the structure in three dimensions (by changing the orientation, for example) and was used to visually compare the molecules. The manipulation of the data and creation of the web pages was again performed by programs written in Java and designed to be generally applicable.

1.9 Publication

Clearly the ultimate publication of this work is as a thesis, although, owing to the constraints of the paper medium, the work will finally be published in a more interactive format. However, there are multiple publication processes utilised from chapter 3 onwards; for instance, the inability to parse program output using only compiler theory techniques is itself a result. Similarly, the archival of the machine-understandable form of a document (automatically allowing the re-use of the data) can also be interpreted as part of the publication process. These results are seen as byproducts, albeit useful byproducts, of the final publication, which is a set of tools and protocols that allow high quality data to be extracted and reused for research.

An introduction to the underlying tools and technologies that are used throughout this work for holding, manipulating, visualising and analysing the data is given below.

1.10 eXtensible Markup Language

eXtensible Markup Language (XML) [29] is based on Standard Generalised Markup Language (SGML) [30], the international standard for defining the descriptions of the structure and content of different types of electronic content (standardised in ISO 8879:1986).
SGML was created by Goldfarb in the 1970s and is a very powerful metalanguage whose primary purpose is to create other markup languages, such as HyperText Markup Language (HTML) [31]. A markup language defined using SGML or XML has a specified vocabulary (the labels for elements and attributes) and a declared syntax (the grammar defining the hierarchy and other features).

XML first appeared in 1996 with the first World Wide Web Consortium (W3C) recommendation published in 1998. The W3C is a vendor-neutral body which specifies protocols for the Web infrastructure and develops interoperable technologies (specifications, guidelines, software and tools) to lead the Web to its full potential. [32]

The major advantage of XML over HTML is the ability to define whatever elements are necessary to express and support the requirements of the application (see figure 1.14), as opposed to the elements¶ in HTML, which are predefined.

¶The tag is the text between the angle brackets, whereas the element is the start and end tag and all the content between them. The terms are often used interchangeably and the meaning should be apparent from the context.

Figure 1.14: An example of an XML document.

1.10.1 Validation

All XML documents must be well-formed‖; they may also be valid. Validation is performed against a Document Type Definition (DTD) or, more recently, an XML Schema [34, 35]. A DTD defines which elements are permitted for documents of that type, what their names are, where they may occur, if they are optional or required, what types of values they can hold (although this mostly applies to attributes, not elements) and how they are related to each other (for example, parent, child or sibling). XML Schemas also enable specifications to be placed on the type of data present in any given element or attribute.
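The distinction between well-formed and valid can be made concrete with a short sketch. The Python fragment below is an illustration only: it uses the standard-library parser, which checks well-formedness but does not validate against a DTD or Schema (a validating parser would be needed for that). The function name and the CML-like element and attribute names in the sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as XML, i.e. it is well-formed.

    This says nothing about validity: no DTD or Schema is consulted.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments with CML-like names, used only for illustration.
GOOD = "<molecule id='m1'><atom id='a1' elementType='C'/></molecule>"
BAD = "<molecule id='m1'><atom id='a1' elementType='C'></molecule>"  # <atom> is never closed

if __name__ == "__main__":
    print(is_well_formed(GOOD), is_well_formed(BAD))
```

A document such as GOOD could still be invalid, for example if the governing DTD did not permit an atom element directly inside molecule; detecting that requires the second, validating step described above.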
1.10.2 Namespaces

The ability for every author and user of XML to create their own element names means that tag names may be replicated but the meaning might be very different in each case. Such ambiguities may not be resolved even if both tags are defined in a DTD. To prevent any collision of DTDs the W3C has defined a mechanism for namespacing each document definition [36]. This process allows each XML name to be globally unique; this is achieved by mapping a namespace attribute to a Uniform Resource Identifier (URI). These URIs exist solely to provide a globally unique string and need not represent a physical Web address; furthermore, they do not require an Internet connection to function. It is common for URIs to be based on the creator's domain name (to provide the necessary uniqueness). This also allows separate vocabularies and ontologies to be defined for each area (for example, CML or Mathematical Markup Language).

‖Some browsers allow HTML documents not to be well-formed although this can lead to ambiguous interpretations; eXtensible HyperText Markup Language (XHTML) must be well-formed [33].

Currently only elements may be namespaced but the use of dictionaries or ontologies (which represent formal descriptions of concepts) has required the ability to namespace individual attributes. The Scientific Technical Medical Markup Language (STMML) provides the basic concepts for creating dictionaries. STMML was conceived by Murray-Rust and Rzepa and is a domain-independent language designed to manage the infrastructure of (mainly numeric) disciplines [37].

There are no limits to the number of dictionaries that can be associated with a given application and references to the dictionaries are identified by a prefix (e.g. iucr:). The use of dictionaries can greatly improve human comprehension of the data in a document (once converted to a human-readable form).
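Returning to the namespace mechanism described at the start of this section, the sketch below shows two vocabularies that both define an element called bond and how the namespace URI keeps them apart. It is a minimal Python illustration; the URIs under example.org, the prefixes and the attribute names are hypothetical and do not correspond to any published vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: they only need to be unique strings.
CHEM_NS = "http://example.org/chemistry"
FIN_NS = "http://example.org/finance"

DOC = f"""<doc xmlns:chem="{CHEM_NS}" xmlns:fin="{FIN_NS}">
  <chem:bond order="2"/>
  <fin:bond issuer="hypothetical"/>
</doc>"""


def qualified_names(xml_text: str) -> list:
    """Return the namespace-qualified names of the root's child elements."""
    root = ET.fromstring(xml_text)
    # ElementTree expands each prefix to the form '{URI}localname'.
    return [child.tag for child in root]


if __name__ == "__main__":
    for name in qualified_names(DOC):
        print(name)
    # The prefix used in a query need not match the prefix in the document;
    # only the URI it is mapped to matters.
    root = ET.fromstring(DOC)
    print(len(root.findall("c:bond", {"c": CHEM_NS})))
```

Because matching is done on the URI, a program looking for chemical bonds would never pick up the financial one, even though both share the local name bond.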
Improved comprehension of this kind might be realised by showing the reader the corresponding dictionary entry for a particular data item on mouse over∗∗.

Data definitions are sometimes referred to as metadata; thus the fact that an XML document conforms to a specified DTD or Schema might be an example of metadata. The ability for XML to store metadata (such as how to handle the corresponding data) allows machines to understand the data and also makes it inherently easier to retrieve a particular data item. The Dublin Core Metadata Initiative is an organization engaged in the development of interoperable online metadata standards [16]. The recommendations of this initiative have been formally endorsed [38, 39] and include the ability to specify metadata such as the creator of the data, contributors, associated dates, rights held in and over the data and the location of related data. There are also numerous entities for describing how the data was collected and archived.

∗∗Mouse over means that an action is initiated when the mouse pointer is held over a defined area of the screen.

1.10.3 Data Display

Unlike HTML, in which almost all the elements affect only the presentation of the document and hence make it more human-readable, XML has been primarily designed to hold and pass information in a machine-understandable form. To allow data held in XML to be easily read by humans it is usual to apply a stylesheet to the document. A stylesheet essentially tells the computer how each tag, or group of tags, should be processed; multiple stylesheets can exist for any given document [40].

eXtensible Stylesheet Language (XSL) is used to define a set of primitives which describe a document transformation; conversions using this method are usually referred to as XSL transformations (XSLT). XSL provides a versatile and powerful language for transforming an XML document into something else, using complex transformations and dynamic operations.
XSLT is based on pattern matching: each rule specifies a particular action that is performed when the associated pattern is encountered in the source document. XSLT can be used to merge documents, apply various filters to documents, inter-convert between various XML dialects, and sort a document on its content. Figure 1.15 shows the workflow for an XSLT.

Figure 1.15: XSLT flow; the XML is read and a corresponding tree structure created, the XSL transforms are then applied to the tree structure to create the result tree which is subsequently translated into the required output format.

1.11 Chemical Markup Language

Murray-Rust and Rzepa first presented the Chemical Markup Language (CML) to the world at the 1995 American Chemical Society August Meeting in Chicago [41, 42]. The following year the W3C began work on the XML project and in 1997 CML became the first ever XML DTD. The first version of the CML (CML 1.0) specification was formally published in 1999 [43]. CML is an extensible base for chemically-aware markup languages.

Historically, CML only focused on molecules (i.e. discrete entities representable by a formula and, usually, a connection table). It supported a hierarchy for molecules as well as reactions and macromolecular structures or sequences. It allowed for quantities and properties to be specifically attached to molecules, atoms or bonds, but originally it had no support for physicochemical concepts, reaction schemes, mechanisms, reactive centres, or spectator molecules.

CML was developed to support both the presentational aspects and semantic content of chemistry. It is designed to be ontologically neutral. This neutrality is important as most molecular file formats, such as the MDL molfile [44], contain complex, often implicit, ontologies that are not necessarily convertible into other formats.
By keeping CML ontologically neutral, this problem is minimised and as such, it is possible to convert CML into other formats.

CML also uses abstract data types wherever possible. It can be seen that a melting point has the same abstract structure as the price of an item or a person's age. This data type might be described as a floating point number, with units, allowed range and links to metadata. This approach greatly widened the applicability, support and tools available for CML.

CML was designed to be fully compatible with XML and to re-use its ideas and technologies. As such it captures the content of chemistry rather than the presentation. CML was originally cast as a DTD, but with the advancing technologies of XML and XML Schema Language it was decided that CML needed to be recast in a more tightly modular system. This led to the creation of CML 2.0 [45]. CML 2.0 also aimed to address additional aspects that CML 1.0 ignored, e.g. chemical substances, and also to describe some of the elements in more detail, such as formula and electron, which were not well-described in CML 1.0.

A modular approach has been taken for the language specifications. This allows subsets of the language to be used independently; for instance, CMLComp represents concepts of computational chemistry. The use of dictionaries allows CML to be extended still further. The following CML modules are available:

STMML: domain-independent specification for general scientific data (including units, metadata, dictionaries, data types and data structures)
CMLAll: the complete CML language definition
CMLCore: the core part of molecular structure representation
CMLComp: CML for computational chemistry
CMLReact: support for chemical reactions including enzymes
CMLSpect: CML for spectra including NMR, MS and infra-red; this interoperates, rather than competes, with the more formal industry activities such as AnIML [46], GAML [47] and SpectroML [48]
CMLSnap: CML to describe and handle dynamic (animated) reactions
CMLQuery: a query language for chemistry which is currently under development
CMLCM: condensed matter systems, also under development

CML is generally considered to be the target format for data storage in this work. In other words we try to translate data into a form that is represented using the elements, attributes and other data items as specified in CML. In some cases this is done via the application of stylesheets to intermediate XML representations.

1.11.1 JUMBO

CML does not provide chemical perception; it is only designed to hold data. JUMBO began as the Java Universal Molecular Browser for Objects but subsequently evolved into a generic name for the software that manages schemas for CML [49]. The JUMBO package can also be used to check the chemical validity of the data held in a CML document or data being added to an existing document. For instance, if a user were to attempt to specify a new bond in a molecule, JUMBO will check that all the atoms involved in the bond exist, are part of the molecule and (if desired) whether they are within reasonable bonding distance.

JUMBO initially implemented the CMLDOM (DOM is a Document Object Model [50, 51]) to provide the required restrictions on data and data types that were unavailable using the CML DTD [52]. Since the introduction of XML Schemas this is no longer as necessary; the base classes for JUMBO are now created directly from the CML Schema. JUMBO is structured such that the data structure is separated from the tools as proposed by Knuth [53].
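The kind of check described above for a newly specified bond can be sketched as follows. This is a Python illustration of the idea only and is not JUMBO's API: the Atom class, the function name, and the 2.0 Å cut-off used in the usage example are all assumptions introduced for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    atom_id: str
    element: str
    xyz: tuple  # Cartesian coordinates in angstroms


def check_new_bond(atoms, id1, id2, max_length=None):
    """Return a list of problems with a proposed bond; an empty list means none found.

    atoms      -- the atoms that make up the molecule
    id1, id2   -- identifiers of the two atoms to be bonded
    max_length -- optional distance cut-off in angstroms (hypothetical value
                  supplied by the caller; no chemistry-specific table is used)
    """
    problems = []
    by_id = {a.atom_id: a for a in atoms}
    # Both atoms must exist and belong to this molecule.
    for atom_id in (id1, id2):
        if atom_id not in by_id:
            problems.append(f"atom {atom_id} is not part of the molecule")
    if id1 == id2:
        problems.append("a bond needs two distinct atoms")
    # The distance test is optional and only meaningful if both atoms were found.
    if not problems and max_length is not None:
        distance = math.dist(by_id[id1].xyz, by_id[id2].xyz)
        if distance > max_length:
            problems.append(
                f"distance {distance:.2f} A exceeds the cut-off of {max_length:.2f} A"
            )
    return problems


if __name__ == "__main__":
    molecule = [
        Atom("a1", "C", (0.0, 0.0, 0.0)),
        Atom("a2", "O", (1.2, 0.0, 0.0)),
        Atom("a3", "H", (5.0, 0.0, 0.0)),
    ]
    print(check_new_bond(molecule, "a1", "a2", max_length=2.0))
    print(check_new_bond(molecule, "a1", "a3", max_length=2.0))
    print(check_new_bond(molecule, "a1", "a9"))
```

Returning a list of problems, instead of raising on the first failure, mirrors the idea of reporting every issue found so that a human or a later process can decide how to act on them.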
The program is Open Source [54] and available on the SourceForge repository [55]; this work has used both JUMBO 4.6 and JUMBO 5.3.

1.12 Scalable Vector Graphics

Scalable Vector Graphics (SVG) is a language for describing two-dimensional graphics and graphical applications in XML [56]. SVG 1.1 became a W3C Recommendation in January 2003 and forms the core of the current SVG developments. Sun Microsystems [57], Adobe [58], Apple [59], IBM [60], and Kodak [61] are some of the organizations that have been involved in defining SVG.

Advantages of using SVG over other image formats (such as Joint Photographic Experts Group (JPEG) and Graphics Interchange Format (GIF)) are:

• SVG files can be read and modified by a large range of tools, including any text editor
• SVG files are smaller and more compressible than JPEG and GIF images
• SVG images are scalable
• SVG images can be printed with high quality at any resolution
• SVG images are zoomable: any part of the image can be magnified without degradation
• text in SVG is selectable and searchable
• SVG works with Java technology
• SVG is an open standard
• SVG files are pure XML
• all the attributes of SVG elements can be animated

The main competitor to SVG is Adobe Flash. The biggest advantage SVG has over this is the compliance with other standards (e.g. XSL and the DOM) whilst Adobe Flash relies on proprietary technology that is not Open Source.

SVG is an application of XML and as such is compatible with the XML 1.0 recommendation; it uses the XML Linking Language (XLink) [62] for URI referencing and requires support for base URI specifications as defined in XML Base [63]. The content of an SVG document can be styled using either CSS or XSL, where external stylesheets can be referenced [64]. SVG includes a complete DOM (level 1) and supports or incorporates many of the facilities described in the DOM level 2 specification including the CSS object model and event handling.
The animation features incorporate and extend the general-purpose XML animation capabilities described in the SMIL Animation specification [65, 66].

1.12.1 SVG Graphing

Scientific analysis often involves the creation of graphs to detect trends or outliers. Whilst graphs can be easily and quickly created using applications such as R [67] or Microsoft Excel [68], the resultant graphs are neither interactive nor can they be immediately mounted as web pages. To simplify the analysis process, a program was written to create graphs in SVG. This application can generate basic x-y plots, histograms, smoothed density plots and quantile-quantile (QQ) plots.

The primary motivation for the creation of this software was to allow any data point (or anything on the graph) to be linked to an external document which displays further information about that point; in this work this has usually been a web page displaying the 3D representations of the molecules. The user is also permitted to specify associated data for a point, which can be anything that is representable by a string. This data is displayed when the mouse pointer is over the area of the graph represented by that point.

The use of SVG to create graphs allows the immediate integration into web pages (whether or not the interaction functionality is used) as well as ensuring that there is no loss of quality if a user desires to examine a particular area of the graph at a larger magnification. The text in any SVG document is searchable and therefore all the data and metadata contained in a graph mounted in or as a web page can be indexed by internet search engines.

The SVG graphing application was designed to be used either from within a Java programming environment or as a standalone program with a graphical user interface (GUI). The options available for the presentation of the graph are based on those provided by R and most are accessible through the GUI.
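The central idea of the graphing tool, a data point that carries both a link and associated text, can be sketched in a few lines. The Python fragment below is an illustration under stated assumptions and is not the graphing application described here: the function name, the plot size and the file names in the usage example are hypothetical. It relies on two SVG 1.1 features, the a element with an xlink:href attribute for the link and a title child element, which many viewers display as a tooltip.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"
# Serialise with SVG as the default namespace and the conventional xlink prefix.
ET.register_namespace("", SVG_NS)
ET.register_namespace("xlink", XLINK_NS)


def scatter_svg(points, size=200, margin=20):
    """Build a minimal SVG scatter plot and return it as a string.

    points -- iterable of (x, y, url, label) with x and y already scaled to 0..1;
              url is the page linked from the point, label its associated text.
    """
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(size), "height": str(size), "viewBox": f"0 0 {size} {size}"},
    )
    span = size - 2 * margin
    for x, y, url, label in points:
        # Wrapping the marker in <a> makes the point itself a hyperlink.
        link = ET.SubElement(svg, f"{{{SVG_NS}}}a", {f"{{{XLINK_NS}}}href": url})
        dot = ET.SubElement(
            link,
            f"{{{SVG_NS}}}circle",
            {
                "cx": f"{margin + x * span:.1f}",
                "cy": f"{size - margin - y * span:.1f}",  # the SVG y axis points down
                "r": "3",
            },
        )
        # A <title> child holds the text associated with the point.
        ET.SubElement(dot, f"{{{SVG_NS}}}title").text = label
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical data: two points linked to hypothetical pages.
    print(scatter_svg([(0.1, 0.2, "mol1.html", "molecule 1"),
                       (0.8, 0.7, "mol2.html", "molecule 2")]))
```

Because the output is plain XML, the labels and link targets remain searchable text, which is the property the section above relies on for indexing.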
The application is currently in a functional state and was used to analyse all the data produced during this work, but it remains a work in progress, requiring greater CML support (especially CMLTable) and synchronisation with the pelote specification [69].

1.13 Java

Java [70, 71] consists of three equally important parts: the Java language, the Java Virtual Machine (JVM) and the Java platform. Java is a ‘write once, run anywhere’ technology, i.e. so long as the destination system has a JVM, the program will run on that system. This makes Java a very versatile and portable language.

The Java programming language is object-oriented. In object-oriented programming (OOP) a program may be seen as a collection of cooperating objects, as opposed to the traditional view in which a program is a list of instructions to the computer. In OOP, each object is capable of receiving messages, processing data, and sending messages to other objects. Each object can be viewed as an independent little machine with a distinct role or responsibility.

OOP is intended to promote greater flexibility and maintainability in programming, and is widely popular in large-scale software engineering. By virtue of its strong emphasis on modularity, object-oriented code is intended to be simpler to develop and easier to understand subsequently, lending itself to more direct analysis, coding, and understanding of complex situations and procedures than less modular programming methods.

The developers of Java tried to make the language powerful, but also to avoid overly complex features that can bog down an object-oriented language. By keeping the language simple, it is easier to write robust and (hopefully) bug-free code.

The JVM is also known as the Java interpreter. Without the virtual machine the code will not run on a system, as it is this virtual machine that interprets and runs the code.
The Java platform is important in that all Java code relies on the set of predefined classes (modules of Java code that define a data structure and a set of methods that operate on that data) that comprise the Java platform. Java classes are organised into packages (related groups) and the Java platform defines packages for functionalities such as input/output, networking, graphics and regular expressions.

The most common Java programs written are applets and applications. Applets are programs that adhere to certain conventions, allowing the program to run within a Java-enabled browser. An application is a standalone program that runs directly on the Java platform. Java is also designed to be a powerful software platform.

Wherever possible, all the tools created as part of this work have been written in Java and in such a way that they should be reusable in other applications. There are already several instances of classes and methods being incorporated into, or used by, programs written for other purposes. The programs are entirely Open Source and are available to all [54].

1.14 Jmol

Jmol is a free, Open Source molecule viewer written in Java and available as both an applet and an application [72]. Jmol is capable of extracting the 3D coordinates of molecules stored in various file formats including CML, CIF and GAMESS. It also allows the user to export the rendered molecule to graphical formats including PovRay [73]. All the 3D images of molecules in this work have been rendered using Jmol to create the PovRay file, which was subsequently stored in the Encapsulated PostScript format. Jmol applets were also used throughout the analysis of the data to visually compare the geometries of molecules.

Chapter 2

The Quality of Data in the Chemical Literature

Languages are used to communicate information; chemistry is a language without native speakers and has developed as a written rather than spoken one.
This leads to a level of communication that is adequate but not optimal. The work presented in this chapter examines techniques to extract data from the literature in its current form into formats that are more comprehensible for both humans and machines, and to validate and re-use that data.

The most common method to disseminate results in the chemistry field is via publication in peer-reviewed journals. This process is designed to allow only valid and accurate science and data into the public domain. Some errors (albeit mostly minor) can, and do, make it through the process and into the published work. These errors are often trivial typographic mistakes, but others are more serious and may lead to the work being withdrawn from publication. In future, the increase in volume of articles taken in conjunction with human error must result in an increased number of mistakes (and possibly more serious ones than at present) permeating through the process.

Typographic errors are generally trivial for a human reader to absorb and correct without actually being aware of them. For example, the recent meme that was found on many blogs and in inboxes

Aoccdrnig to rscheearch at Cmabrigde Uinervtisy, it deos not mttaer in waht oredr the ltteers in a wrod are, the olny iprmoatnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a total mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

should actually read

According to research at Cambridge University, it does not matter in what order the letters in a word are, the only important thing is that the first and last letter be at the right place. The rest can be a total mess and you can still read it without a problem. This is because the human mind does not read every letter by itself, but the word as a whole.

Whilst a human can read and interpret the quotation, a computer will not in general be able to do so.
The original quotation contains 34 unique spelling mistakes (wrod and taht are both repeated twice). The quotation was run through the spell checking tools from two different authoring programs (WinEdt [74] and Microsoft Word [75]). Both programs suggest a list of alternatives to each misspelled word. The words at the top of the list are considered by the program to be the more likely alternatives; consequently, the results have been broken down into three categories:

Top: the correct word is at the top of the suggested alternatives;
Present: the correct word is present in the suggested alternatives, but is not the most likely;
Absent: the correct word does not appear in the suggested alternatives.

The results are presented in table 2.1; whilst both programs correctly identified the intended word in over 75% of the cases, neither program would have been able to reconstruct the entire sentence correctly.

           Microsoft Word   WinEdt
Top              27           26
Present           1            4
Absent            6            4

Table 2.1: Comparison of spell checking between Microsoft Word and WinEdt.

The quotation, as originally cited, is not machine-understandable; that is, it would be impossible for even a natural language processing program to determine the meaning of the quotation. The increased desire for automation of the categorisation, searching and interpretation of the published literature requires that the literature be machine-understandable. Although it is extremely unlikely that any published literature would contain the kind of gross errors present in the quotation above, a minor error (for example in an abstruse chemical name) is far more likely.

2.1 Information Extraction

In computer science, information extraction (IE) or text mining is a type of information retrieval where the goal is to automatically extract structured information from unstructured machine-readable documents; an example of IE can be found in the analysis of gene expression data.
An association rule represents a set of items that are likely to be seen together; for example, the rule {cancer} ⇒ {gene A↑, gene B↓, gene C↑} states that whenever cancer was found, gene A and gene C were highly expressed, but gene B was highly repressed and all three genes occurred together [76]. Typical subtasks of IE are:

Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions;
Coreference: identification of chains of noun phrases that refer to the same object. For example, anaphora is a type of coreference;
Terminology extraction: finding the relevant terms for a given corpus.

The development of ontologies, and hence text mining, in other spheres has advanced rapidly, particularly in the biosciences. Bioinformatics has embraced the concept of literature data mining and much work has been done in this area, ranging from the simple recognition of terms to the extraction of interaction relationships from complex sentences [77, 78]. This thesis details how such methods can be applied to chemical documents to create structured, semantically-rich, machine-understandable versions of the documents, and describes the development of various text mining processes and chemical ontologies.

The extraction of data from legacy formats is simplified by increasing structure; thus, if every chemist were to write articles in XML (with a predefined DTD or Schema), all the concepts would be immediately recoverable. For example, an infra-red spectrum of butan-2-ol might be represented as:

...

However, many synthetic organic chemists use Microsoft Word or a similar program as their authoring tool. These programs tend to focus on the presentation of the data, rather than content of the data.
Whilst it might be unreasonable to expect authors to create their documents in XML with fully integrated CML, the advent of online publication makes a more integrated version of an article (a datument [79]) more attractive [80]. Even if datuments become the accepted publication method there will still be a huge legacy corpus which will require extraction.

2.2 The Experimental Data Checker

The Royal Society of Chemistry (RSC), recognising this potential problem, sponsored a project∗ to develop a tool both to improve the quality of manuscripts submitted to them for publication and to allow reviewers to detect errors more easily. Initially, two students (F. R. Norton and this author) were involved for a three-month period to develop a proof-of-concept Experimental Data Checker (EDC).

The brief for the EDC project was to analyse a corpus of typical organic chemistry articles supplied by the RSC (both those submitted to the RSC for publication and those already published), and to determine to what extent the extraction of data was a tractable problem. A proof-of-concept cross-platform program that could extract (to a greater or lesser extent) the data present was to be developed, which would enable human reviewers to detect errors more easily. It was envisaged that this would involve performing self-validation tests on the data in the analytical data section. The RSC also requested that a DTD be created to represent the extracted data.

The specifications for the project required that the checking process should not affect the way in which the author could create the manuscript, and that only a minimal amount of extra work would be involved in the checking process†.

2.2.1 The legacy format of organic chemistry articles

An analysis of the corpus provided showed that most organic chemistry articles (and articles for submission to the journal) are divided into well-defined sections. The published article structure is shown in figure 2.1.
The same sections are found in the unpublished articles but often in a different order. Examples of each section are given below.

∗ The project was collaborative; S. E. Adams, F. R. Norton, C. A. Waudby and this author have all been involved; overall this author contributed approximately 50% of the work. Where work was performed by a specific individual this is indicated.
† Such a design brief (that the user should not be expected to do significant further work) is sometimes referred to as the requirement for a Big Red Button solution.

Figure 2.1: The typical structure of an organic chemistry article.

. . .Mp 62–65 °C. FTIR (NaCl, thin film) 3025 (m), 2952 (s), 2881 (m), 1734 (s), 1707 (s), 1249 (m), 1184 (m), 1132 (m), 1054 (m), 950 (w) cm−1. 1H NMR (100 MHz, CDCl3) δ 1.10 (m, 1H), 1.25 (m, 6H), 1.33 (s, 3H), 1.61 (m, 6H), 2.07 (dt, J = 13.5, 5.1 Hz, 1H), 2.41 (m, 1H), 2.59, (dt, J = 14.3, 6.2 Hz, 1H), 2.77 (dd, J = 12.0, 2.8 Hz, 1H), 3.92 (m, 4H), 4.19 (m, 2H). 13C NMR (400 MHz, CDCl3) δ 210.6, 173.2, 112.7, 65.7, 65.3, 61.5, 61.5, 45.3, 42.3, 35.0, 30.8, 29.4, 23.6, 23.0, 17.4, 16.3, 14.6. [α]20D +11.1 (c, 7.15, CHCl3). Anal. Calcd for C17H26O5: C, 65.78; H, 8.44. Found: C, 66.07; H, 8.35.

Figure 2.2: An example of typical analytical data found in an organic chemistry paper [82].

5’,6’,8a’-Trimethyloctahydro-2’H-spiro[[1,3]dioxolane-2,1’-naphthalene]-5’-carbaldehyde (20). A CH2Cl2 (10 mL) solution of alcohol 19 (270 mg, 1.01 mmol), 4-methylmorpholine N-oxide (130 mg, 1.10 mmol), and 4 Å MS (300 mg) was stirred for 10 min. At this time, tetra-n-propylammonium pyruthenate (17.5 mg, 0.05 mmol) was added in one portion and the reaction was stirred for 1 h. The reaction was found complete by TLC, passed through a pad of silica (1 × 20 cm2 with 1:1 hexanes:Et2O), and concentrated to provide aldehyde 20 (260 mg, 98% yield) as a clear oil.

Figure 2.3: An example of a typical synthetic methodology found in an organic chemistry paper [82].
Analytical Data

Analytical data (figure 2.2) are physical properties used to determine the identity of a compound, such as NMR spectra and elemental analysis. When reported in journals or theses this data is highly formalised. The format is presentational rather than semantic and may include human-created errors arising from transcription, omission, spelling mistakes [81], unforeseen microstructure and vocabulary. Ambiguity is often present; hence recall may not be perfect. This is an attractive area for consideration because the data is semi-structured and when extracted would allow error checking (for self-consistency).

Starting from commercially available 3-furaldehyde, we accessed the known homoallylic alcohol 11 in 82% yield and 93% ee through Brown’s asymmetric allylboration (Scheme 2). Alternatively, alcohol 11 could be prepared in 82% ee and 64% yield in a proline oxide-catalyzed asymmetric addition of trichloroallyl silane into 3-furaldehyde. In either case, O-alkylation of 11 with 2,3-dibromopropene 12 afforded allylic ether 13 in 75% yield. Lithium-halogen exchange followed by isopropoxy borolane 14 trap provided the pinacol borolane 10, which was carried on to the borolane fragment 9 through a RCM by using Grubbs second generation catalyst 15 in 45% overall yield for the two steps. With access to pyran 9, attention was turned toward generating coupling partner 6.

Figure 2.4: An example of a typical chemical discourse found in an organic chemistry paper [82].

Synthetic Methodology

The synthetic methodology (figure 2.3) comprises a highly formal language and many stock phrases that may be valuable for tokenising. This format requires a lexicon to allow entity and microstructure recognition; shallow parsing and part-of-speech recognition techniques must also be employed. However, currently CML does not have the required elements or semantics to describe such data.
This data represents an extremely large resource which, if it can be searched more effectively (for instance on reaction type), can be made vastly more useful for both eScientists and the more traditional chemist. If the semantics of this area can be well-defined it should be possible for robots to entirely reconstruct (and re-perform) the synthesis. Parsing the synthetic methodology did not form part of the original EDC project but subsequent work has shown this area to be somewhat tractable.

The marine sponge metabolite (+)-cacospongionolide B (+)-1 is a member of a class of compounds bearing a γ-hydroxybutenolide moiety. This functionality has been suggested to be important in the inhibition of several forms of secretary phospholipase A2 (sPLA2), enzymes involved in events leading to inflammation. Given the role of chronic inflammation in diseases such as asthma, psoriasis, cancer, atherosclerosis, and rheumatoid arthritis, it is becoming increasingly important to discover and develop more effective agents that can mediate these pro-inflammatory signaling events. With a successful route to cacospongionolide B already in hand, efforts have turned toward understanding the structural features of the natural product responsible for inhibiting sPLA2 activity. Our previous findings indicated that furan 2 possessed comparable sPLA2 inhibitory activity to (+)-1, while the enantiomer of the natural product (−)-1 was less active. In addition, several unnatural diastereomers of the natural product were identified that displayed improved sPLA2 inhibitory activity over (+)-1.

Figure 2.5: An example of a typical narrative discourse found in an organic chemistry paper [82].

Chemical Discourse

Chemical discourse encompasses, for example, the results and discussion section of an article (figure 2.4). This requires deep parsing and a substantial English lexicon as well as a chemical one.
There are many reserved words that may allow parsing but currently this is outside the scope of this work.

Narrative Discourse

Narrative discourse comprises, for example, the introduction section of an article (figure 2.5). This also requires deep parsing and a complete English lexicon as well as a chemical one. This is outside the scope of this work.

Chemical Names

Chemical names are often expressed in a formal language and as such should follow specifications. However, these often contain ambiguity and in some cases allow multiple (and equally valid) names to be given to a species (see figure 2.6). Systematic chemical names are not designed to be machine-understandable but may be described by a grammar. Non-deterministic methods must be used to parse chemical names in general, allowing ambiguities to remain unresolved for as long as possible. The ability to parse a systematic chemical name is essential as it is often the only way in which a connection table for the compound can be recovered from a paper. The identification and parsing of chemical names is explored in section 2.4.

IUPAC name        2-Acetoxy-benzoic acid
CAS Index name    Benzoic acid, 2-(acetyloxy)-
Trivial name      Acetylsalicylic acid
Trivial name      Aspirin

Figure 2.6: Various names for the same connection table.

2.2.2 Common Chemical Concepts

The analysis sections for each compound were examined and the most common well-defined concepts identified. These concepts, which represent what a typical organic chemist is expected to include in a report to show that they have made the intended substance, were found to be:

• chemical name
• yield
• boiling point / melting point
• proton nuclear magnetic resonance (HNMR)
• carbon nuclear magnetic resonance (CNMR)
• infra-red spectrometry
• (high resolution) mass spectrometry (HRMS)
• elemental analysis
• optical rotation
• refractive index
• Rf value
• ultraviolet spectrometry
• nature (colour, state, modifiers, description, etc.)
The RSC provides guidelines on how each of the various analytical data should be displayed [83]. However, they do not insist that these are rigidly adhered to, which results in multiple (sometimes very similar) representations. For example, the suggested presentation of High Resolution Mass Spectrometry data is:

[Found: C, 63.1; H, 5.4%; M (mass spectrum), 352. C13H13NO4 requires C, 63.2; H, 5.3%; M, 352]

Analysis of the corpus revealed the following representations, as well as the recommended version:

• Calculated for C13H8N5Cl3: m/z 338.98. Found 338.98
• HRMS (CI mode, CH4): C23H24Si (M); found: m/z 328.1647. Calc.: m/z 328.1647
• [Found: m/z (HRMS-FAB) 359.1993. C21H23N6 requires MH+ 359.1984]
• exact mass 292.0828, C14H14NO6 (M/2 + H+) requires 292.08
• (Found M+, 257.9545. C9H7IO requires 257.9544)
• HR-LSIMS (m-nitrobenzyl alcohol) m/z 464.99649 [M(79Br) + H+], C15H18 79BrN2O8S requires m/z 464.99673
• [MALDI-TOF-MS Calc. for C127H107NNaO38 (M + Na)+: 2276.6. Found: m/z 2276.9 (M+Na)+]
• HRMS (CI+) C13H16BrO2 requires 283.0334, found 283.0329
• (M + H)+ found = 423.1060, C21H25ClO5P requires 423.1128 for the 35Cl isotope
• [M−] m/z 801.0935. Calc. for [C37H33Cl3N2O10P]−: 801.0938
• ESI-MS m/z: 805.44 (M + H)+; anal. calcd for C52H60N4S2: 804.43
• m/z (EI) 414.1660 (M+, 100%. C23H26O7 requires 414.1678), 385 (16.8), 278 (7.1), 217 (6.2), 195 (34.8), 167 (86.5), 135 (66.0), 131 (10.5)

The identification of each type of analytical data therefore requires various representations to be taken into account. Regular expressions provide exactly this functionality and are supported by Java, which is a cross-platform language. Java does not support the Microsoft Word document format, but the text can be entered into a Java text area by using the cut and paste translation facility. This process results in the complete loss of formatting (other than new lines) and, in some cases, of special characters.
For instance, the ‘δ’ character is sometimes translated as ‘d’ and in other cases is omitted entirely. An infra-red spectrum such as

νmax (CHCl3)/cm−1 2954, 2900, 1655 and 1603 (C=C), 876;

is typically translated to

max (CHCl3)/cm-12954, 2900, 1655 and 1603 (C=C), 876;

Thus the translation into plain text introduces further variation of representation, complicating the extraction process.

2.2.3 Regular Expressions

A regular expression is a pattern or template for matching a set of text strings [84]. The origins of regular expressions lie in mathematics: they arose from finite state automata theory in the 1950s. The mathematical background of regular expressions is discussed in section 3.1.6; however, they rapidly migrated into early text editors such as qed [85], where they are no longer regular expressions in the mathematical sense. Many different variations have occurred in both syntax and semantics; this work has concentrated on the regular expression package implemented by Java [86].

Regular expressions match a pattern against a subject (a string of characters); most characters in the regular expression stand for themselves. For example, the name of a straight-chain hydrocarbon with up to ten carbon atoms and one optional multiple bond would match against the following regular expression:

chain (locantgroup)? saturation

where:

• chain is defined as (meth|eth|prop|but|pent|hex|hept|oct|non|dec)
• locantgroup is defined as -number-
• number is defined as (1|2|3|4|5|6|7|8|9)
• saturation is defined as (ane|ene|yne)

The vertical bar means or; the parentheses are used to group sub-expressions; the question mark means zero or one instances of the preceding expression; and the juxtaposition of the parenthesised expressions means concatenation. Some common predefined character classes and boundary matches used in the Java regular expression package are given in appendix B.
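The hydrocarbon expression above can be assembled directly with `java.util.regex`. The following is a minimal sketch; the class name `HydrocarbonNameSketch` and its constant names are chosen here for illustration and are not taken from the EDC or OSCAR source code.

```java
import java.util.regex.Pattern;

final class HydrocarbonNameSketch {

    // The four sub-expressions exactly as defined in the text.
    static final String CHAIN = "(meth|eth|prop|but|pent|hex|hept|oct|non|dec)";
    static final String NUMBER = "(1|2|3|4|5|6|7|8|9)";
    static final String LOCANT_GROUP = "-" + NUMBER + "-";
    static final String SATURATION = "(ane|ene|yne)";

    // chain (locantgroup)? saturation
    static final Pattern NAME =
            Pattern.compile(CHAIN + "(" + LOCANT_GROUP + ")?" + SATURATION);

    /** True if the whole string matches the hydrocarbon pattern. */
    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"butane", "but-2-ene", "pent-1-yne", "meth-9-yne", "benzene"};
        for (String s : candidates) {
            System.out.println(s + " -> " + isName(s));
        }
    }
}
```

With this pattern butane, but-2-ene and pent-1-yne are accepted and benzene is rejected, but meth-9-yne is accepted as well: the expression checks only the shape of the name, not whether the locant is chemically possible.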
It is clear that the regular expression given above would match against both possible and impossible chemical names, for example meth-9-yne. This limitation requires that regular expressions be used only to tokenise data, not to deal with the semantics (see section 3.2).

Figure 2.7: The structure of the original EDC.

Figure 2.8: The GUI of the original EDC. An experimental paragraph has been pasted into the upper window and the big red button pressed to produce the report in the bottom window.

2.2.4 Structure of the EDC

The structure of the original EDC and the user interface are shown in figures 2.7 and 2.8 respectively. The two-panel display means that redundant input information (the unparsed text) is always displayed. Whilst this version of the EDC fulfilled the project requirements, there were significant weaknesses:

• the design of the code was poor: each of the types of data identified (melting point, name, HNMR etc.) was held in a purpose-built class and did not utilise interfaces
• there was no separation of function and display
• each paragraph of the experimental section had to be individually cut and pasted
• the program was only available as an applet
• editing or extending the regular expressions required editing the Java source code

2.3 OSCAR: Blurring the Line Between Authoring Tool and IE Tool

The original EDC sufficed to show proof of concept. However, it had weaknesses and areas where the implementation and functionality could be improved. The RSC sponsored three further students (S. E. Adams, C. A. Waudby and this author) to continue the development of the EDC. The following areas were identified as those most in need of refactoring:

• separation of function and display
• analysis of entire article at once
• overview of entire article
• creation of an application
• improved spectrum representation
• improved regular expressions

Figure 2.9: The structure of the EDC version 2.

(\(|\[)?(F|f)ound\\s\(%\):.*(C|c)alcd?\.?\s for.*\d{1,2}\.\d{1,2}(\]\.|.;)

[\[\(]?(((?:anal\.?\s+)?found|(?:(?:(?:(?=\w{4,})\b(?x-i:(?:Zr|Zn|Yb|Y|Xe|W|V|Uuu|
Uut|Uus|Uuq|Uup|Uuo|Uuh|Uub|U|Tm|Tl|Ti|Th|Te|Tc|Tb|Ta|Sr|Sn|Sm|Si|Sg|Se|Sc|Sb|S|Ru|Rn|Rh|Rf|Re|Rb|Ra|
Pu|Pt|Pr|Po|Pm|Pd|Pb|Pa|P|Os|O|Np|No|Ni|Ne|Nd|Nb|Na|N|Mt|Mo|Mn|Mg|Md|Lu|Lr|Li|La|Kr|K|Ir|In|I|Hs|Ho|
Hg|Hf|He|H|Ge|Gd|Ga|Fr|Fm|Fe|F|Eu|Es|Er|Dy|Ds|Db|Cu|Cs|Cr|Co|Cm|Cl|Cf|Ce|Cd|Ca|C|Br|Bk|Bi|Bh|Be|Ba|
B|Au|At|As|Ar|Am|Al|Ag|Ac|Me|Et|Pr|Ph|Bn|Bz|Bu|t-?Bu|n-?Bu|Ts|Ms|Tr|D)\d*[--\-\=]?)+
(?:\s*[·.]\s*(?:\d+(?:\/\d+)?\s*)?(?x-i:[A-Z][a-z]?\d*)+)?\b\s*)?
require[sd]?\s*|(?:anal[\.:\s]+)?calcd?\.?(?:\s+for)?\W*(?:(?=\w{4,})\b
(?x-i:(?:Zr|Zn|Yb|Y|Xe|W|V|Uuu|Uut|Uus|Uuq|Uup|Uuo|Uuh|Uub|U|Tm|Tl|Ti|Th|Te|Tc|Tb|Ta|Sr|Sn|Sm|Si|Sg|
Se|Sc|Sb|S|Ru|Rn|Rh|Rf|Re|Rb|Ra|Pu|Pt|Pr|Po|Pm|Pd|Pb|Pa|P|Os|O|Np|No|Ni|Ne|Nd|Nb|Na|N|Mt|Mo|Mn|
Mg|Md|Lu|Lr|Li|La|Kr|K|Ir|In|I|Hs|Ho|Hg|Hf|He|H|Ge|Gd|Ga|Fr|Fm|Fe|F|Eu|Es|Er|Dy|Ds|Db|Cu|Cs|Cr|
Co|Cm|Cl|Cf|Ce|Cd|Ca|C|Br|Bk|Bi|Bh|Be|Ba|B|Au|At|As|Ar|Am|Al|Ag|Ac|Me|Et|Pr|Ph|Bn|Bz|Bu|t-?Bu|n-?Bu|
Ts|Ms|Tr|D)\d*[--\-\=]?)+(?:\s*[·.]\s*(?:\d+(?:\/\d+)?\s*)?(?x-i:[A-Z][a-z]?
\d*)+)?\b\s*)?(?:(?:\((?:[^\(\)]|\((?:[^\(\)]|\([^\(\)]+\))+\))+\))\s*)?|
for\W*)))(?:\W*(?:M\w*)?\W+)?((?:\W*?(?x-i:[A-Z][a-z]?)[\s,:]+\d+(?:\.\d+)?\s*
\%??){2,})%?[;\.\s]*?){2}[\)\]]?

Figure 2.10: Regular expressions to extract the elemental analysis from the analytical section. The top expression is that used in the EDC; the bottom is that used in OSCAR.

A complete rewrite of the code produced the revised program design shown in figure 2.9. Each of the types of data now implements a DataInterface and those data representing spectra a SpectrumInterface. The separation of function and display resulted in OSCAR (Open Source Chemistry Analysis Routines) and the EDC, which is now simply a particular front end through which a user can access OSCAR [87].
Figures 2.11 and 2.12 show the EDC application highlighting data and the results of the ‘expert analysis’ routines. An example of the regular expressions developed for the original EDC and the improved version for OSCAR are shown in figure 2.10.

Figure 2.11: The EDC application v2.4 showing the data identified.

Figure 2.12: The EDC application v2.4 showing the results of the expert analysis routines. Warnings are in the left hand panel; clicking on a warning brings the relevant explanation up in the right hand column. Items in red represent serious errors (for example more hydrogen atoms reported in the HNMR than there are in the elemental analysis). Blue items are warnings (for example solids being reported without melting points) and green items are possible warnings (for example an infra-red spectrum being reported without the plate type).

ReportParagraph ::= ReportParagraphHeader AnalyticalDataBlock FULLSTOP ReportParagraphEnd ;
ReportParagraphHeader ::= ChemicalName | ChemicalNameAndIdentifier | Identifier ;
ReportParagraphEnd ::= WhiteSpaces | MaybeWhiteSpaces LINEEND ;
WhiteSpaces ::= WHITESPACE | WhiteSpaces WHITESPACE ;
MaybeWhiteSpaces ::= BLANK | WhiteSpaces ;

Figure 2.13: The BNF for paragraph recognition in OBC papers. Terminal tokens are shown in capitals.

In order that an entire article could be parsed at once, the document structure and paragraph recognition are important. OSCAR firstly identifies the experimental section of the article and then splits this into ReportParagraphs, where a ReportParagraph is defined in Backus-Naur form (BNF) in figure 2.13 (for a full explanation of BNF see appendix C). Each of these ReportParagraphs is then examined for analytical data. The ability to parse an entire article at once, rather than individual paragraphs, allows the creation of a table listing all the compounds identified and the analytical data for each.
This facility is considered particularly useful for chemists authoring a thesis in organic chemistry, as a way to ensure that they have included all the required data [88]. Figure 2.14 shows an example of the tabulation facility.

Figure 2.14: The EDC application v2.4 showing the tabulation of identified data. Clicking on a row brings up the relevant paragraph with more detailed information.

The original EDC used regular expressions to identify fine-grained data immediately (for example matching the HNMR peak types from an entire paragraph). Such immediate fine-grained parsing is difficult because it is not easy to differentiate between, for example, HNMR and CNMR peak types; thus the regular expressions are extremely complex. The approach in OSCAR was to use hierarchical parsing (not coincidentally reflecting the structure of the target XML output). With reference to HNMR this process would involve the following steps: identifying the block of HNMR data, then the block of peaks within that, then individual peaks and finally the peak type within the peak fragment. The two processes are illustrated in figure 2.15.

Figure 2.15: All-at-once (left) and hierarchical (right) parsing strategies.

Figure 2.16: An example of an HNMR spectrum.

Organic chemists are familiar with graphical representations of spectra such as the HNMR spectrum shown in figure 2.16. However, one of the drawbacks of the current publication system is the destruction of information; spectra are not given in a graphical form, but in an abbreviated textual representation as shown in figure 2.2. A system was therefore created that attempted to recreate a spectrum from the textual form. A mass spectrum is inherently easier to recreate than an infra-red spectrum as it consists of single lines of differing height, whilst the infra-red spectrum has peaks of different widths and shapes. Figure 2.17 shows an example of a recreated spectrum.
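The hierarchical parsing steps described above for HNMR data can be sketched with three small regular expressions, one per level, each applied only to the text matched by the level above. This is an illustrative sketch and not the OSCAR implementation: the class name `HierarchicalHnmrSketch`, the patterns and the output format are assumptions, and the patterns cover only well-formed data of the kind shown in figure 2.2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HierarchicalHnmrSketch {

    // Level 1: the HNMR block, from "1H NMR (...)" and the delta symbol up to
    // the first full stop that is followed by whitespace or the end of the text.
    static final Pattern BLOCK = Pattern.compile(
            "1H NMR \\(([^)]*)\\)\\s*\\u03B4\\s*(.*?)\\.(?=\\s|$)");

    // Level 2: one peak, i.e. a chemical shift followed by a parenthesised fragment.
    static final Pattern PEAK = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*\\(([^)]*)\\)");

    // Level 3: the peak type at the start of the fragment, and the integral.
    static final Pattern TYPE = Pattern.compile("^\\s*(s|d|t|q|m|dd|dt|td|ddd)\\b");
    static final Pattern INTEGRAL = Pattern.compile("(\\d+)H\\b");

    /** Returns one "shift type integral" string per peak found in the paragraph. */
    static List<String> parse(String paragraph) {
        List<String> peaks = new ArrayList<>();
        Matcher block = BLOCK.matcher(paragraph);
        if (!block.find()) {
            return peaks; // no HNMR block in this paragraph
        }
        Matcher peak = PEAK.matcher(block.group(2)); // search inside the block only
        while (peak.find()) {
            String fragment = peak.group(2);         // search inside the peak only
            Matcher type = TYPE.matcher(fragment);
            Matcher integral = INTEGRAL.matcher(fragment);
            String t = type.find() ? type.group(1) : "?";
            String h = integral.find() ? integral.group(1) : "?";
            peaks.add(peak.group(1) + " " + t + " " + h + "H");
        }
        return peaks;
    }

    public static void main(String[] args) {
        String text = "Mp 62-65 C. 1H NMR (100 MHz, CDCl3) \u03B4 1.10 (m, 1H), "
                + "2.07 (dt, J = 13.5, 5.1 Hz, 1H), 3.92 (m, 4H). "
                + "13C NMR (400 MHz, CDCl3) \u03B4 210.6, 173.2.";
        System.out.println(parse(text));
    }
}
```

Because the peak patterns are only ever applied to the text inside the HNMR block, the CNMR shifts later in the same paragraph cannot be mistaken for proton peaks, which is the simplification that the hierarchical strategy is intended to provide.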
Figure 2.17: The EDC applet v2.3 displaying a recreated infra-red spectrum.

2.3.1 Information Extraction Tests

In 2003 the RSC created a new journal, Organic & Biomolecular Chemistry (OBC), which was formed by the merger of the Journal of the Chemical Society, Perkin Trans 1, Perkin Trans 2 and Natural Product Reports. Fortunately the regular expressions developed to identify and extract data from the original corpus proved sufficiently broad (and the structure of the analytical sections sufficiently rigid) that the desired data could still be extracted from the new journal. It is interesting to note that when OSCAR was run over organic chemistry articles in German it still typically identified more than 80% of the data correctly.

To determine the recall and precision rates for OSCAR it is necessary to consider the two types of document for which it is applicable: published OBC articles and papers received by the RSC for publication in OBC. There are only minor differences between the two document types, the two most relevant being structure (order) and format. The first of these should not present any parsing problems because, although the experimental section may occur in a different position in the two documents, the content is the same in both. However, the format of the data often changes slightly in the publication process. In general slight errors (such as missing commas or mismatched brackets) found in the analytical sections of the submitted documents will be corrected before publication.

The EDC was originally intended to function as an authoring tool; thus it was decided at the inception of the project that the program should not make inferences from the data or correct it but that a human must perform curation functions.
Therefore an error such as the mislabelling of an HNMR spectrum as a CNMR spectrum, which would be clear to a chemist reading the paper because of the assignment of the integrals as protons and the values of the chemical shifts, would be interpreted by OSCAR as a CNMR spectrum. Errors of this type are usually corrected during the review and editing process. Hence it was expected that the recall rate would be lower than the precision rate (a single missing comma in a spectrum would result in a false negative) and that tight specifications‡ for a data type would result in higher recall.

‡A tight specification is one where there is little or no ambiguity possible.

The test set comprised seven articles randomly selected from OBC 2003 [89, 90, 91, 92, 93, 94, 95] and three documents [96, 97, 98] that the RSC had received for submission to OBC. Because the focus of the current project is information extraction from published literature, the test set was biased in favour of articles. The recall and precision statistics for the classification (identification) obtained from this sample are given in table 2.2, where:

• TP is the number of X that the system correctly identified and that were present in the corpus
• FN is the number of X that the system failed to identify
• FP is the number of X that were recognised by the system but were not in the corpus
• X is a particular data type

Data type                 TP     FN    FP    Recall %   Precision %
overall                   1554   240   96    86.62      94.18
CNMR                      187    14    5     93.03      97.40
elemental analysis        103    15    0     87.29      100.00
HNMR                      212    23    4     90.21      98.15
HRMS                      126    1     0     99.21      100.00
infra-red spectroscopy    186    19    8     90.73      95.88
mass spectroscopy         145    20    0     87.88      100.00
melting point             151    11    2     93.21      98.69
chemical name             171    72    47    70.37      78.44
nature                    100    22    21    81.97      82.64
yield                     173    43    9     80.09      95.05

Table 2.2: Recall and precision rates for OSCAR.
Recall and precision are defined by

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2.1} \]

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.2} \]

A partially identified fragment has been classified as a false positive, hence missing commas or mismatched brackets in a spectrum would lead to a false positive. It is observed that the recall of the first seven data types is about 90%, but for the final three the recall is significantly lower. It is also noted that the precision for two of these data types (chemical name and nature) is significantly lower than for the other types. In general, yields were missed when multiple yields were quoted in one synthetic paragraph. This occurred because more than one synthetic route to the same compound was attempted, or the reaction gave a mixture of products, or the same reaction scheme had been employed on multiple starting compounds (see figure 2.18).

Typical procedure for the synthesis of 2-alkoxy-9-benzyl-8-hydroxyadenine derivatives 7a-e
A solution of 12 (0.24 mmol) in c. HCl was stirred at room temperature for 4 h. After evaporation, the residue was chromatographed on silica gel to give 7, which was identified by comparison with a standard sample synthesized from 6. Yield: 7a (74%); 7b (81%); 7c (82%); 7d (87%); 7e (84%). [90]

Figure 2.18: A paragraph showing how the synthesis of a set of similar compounds is described.

The relatively low recall and precision rates for identification of the chemical names and nature are not surprising. For chemical names an extremely simple regular expression was used: a match for various starts of chemical names, various ways of separating names (such as hyphens) and a match for various ends of chemical names. The nature of a compound is the reported data type that has the least-controlled vocabulary and hence presents more problems to parse.
Although it would be possible to build up larger lists to match the state, colour and colour-modifiers of a compound that would increase the recall, an entirely different approach such as entity recognition would be preferable.

The semantic and syntactic features that caused the majority of the false negatives were:

• multiple instances of the same data type reported in the same paragraph
• inability to recognise the end of the reported data for a compound

The first of these would be simple to correct: OSCAR was designed to search for only one instance of a particular data type in a particular paragraph; however, the inclusion of a recursive descent call would allow it to identify multiple instances of the same data type. The second is far more difficult because the chunking of the article into report-paragraphs (see above) is not sufficiently accurate.

         General                    Experimental
γ        Precision %   Recall %     Precision %   Recall %
-2       68.5          94.3         85.9          95.3
-5       72.1          92.5         93.8          95.3
-8       75.0          90.6         96.8          95.4
-11      81.0          88.7         98.4          95.3
-14      80.8          79.2         98.2          87.5

Table 2.3: The effect of the γ parameter on the recall and precision rates of OSCAR2 for chemical names in the general and experimental sections. The values in bold font are those representing the optimal value chosen [99].

There are two inherent weaknesses to the method of paragraph recognition used by OSCAR. The first occurs if the author omitted a full stop at the end of one analytical report section, which means that all the subsequent ReportParagraphs will be passed over until a full stop followed by optional whitespace then a newline is encountered. This produces significant numbers of false negatives and also leads to any previously un-encountered data types present in the subsequent paragraphs being reported for the incorrect compound. The second weakness is that this construction may also match sections earlier in the experimental section, giving rise to false positives.
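As a check on equations 2.1 and 2.2, the following short Python sketch recomputes the recall and precision for three rows of table 2.2 from their TP, FN and FP counts. The function and variable names are illustrative only; the counts are those reported in the table.

```python
def recall(tp, fn):
    """Equation 2.1, expressed as a percentage."""
    return 100.0 * tp / (tp + fn)


def precision(tp, fp):
    """Equation 2.2, expressed as a percentage."""
    return 100.0 * tp / (tp + fp)


# (TP, FN, FP) for three rows of table 2.2.
COUNTS = {
    "overall": (1554, 240, 96),
    "HNMR": (212, 23, 4),
    "chemical name": (171, 72, 47),
}

for data_type, (tp, fn, fp) in COUNTS.items():
    print(f"{data_type}: recall {recall(tp, fn):.2f}%, "
          f"precision {precision(tp, fp):.2f}%")
```

The three rows reproduce the tabulated values (86.62/94.18, 90.21/98.15 and 70.37/78.44 respectively), and the gap between the chemical name row and the spectral data types is visible directly in the counts: 72 false negatives and 47 false positives against 171 true positives.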
2.4 OSCAR2: the Importance of Chemical Names

Waudby continued the development of the OSCAR toolkit, focusing on improvements to chemical name recognition and maintaining more of the article structure§. The version Waudby developed produced documents with inline XML and separated the chemical name recognition from analytical data identification. The advantage of using inline XML markup is that it allows the preservation of the original document structure.

§OSCAR2 is entirely the work of Waudby.

A naïve Bayesian classifier based on n-grams and a simple grammar (see section 3.1) were implemented to determine whether a word or phrase was likely to be a chemical name. Briefly, this involved breaking words up into three- or four-letter tokens and determining which tokens occur more often in chemical names. For example, ybd, eth, alk and yne might all be expected to occur more frequently in chemical names than in general English. The grammar allows the context of the word to be taken into account. For example, a word is likely to be a chemical if it is followed by a quantity:

methanol (10 cm³)

Single compounds with space-separated chemical names can also be correctly identified by context. For example, if a word ends in ‘ic’, is it followed by ‘acid’?

periodic acid, periodic table

The first instance of ‘periodic’ is followed by ‘acid’ and is therefore assumed to be a chemical name, whereas the second is followed by ‘table’ and so is not identified as a chemical name.

The system included a variable parameter γ that could be tuned to adjust the balance between recall and precision. Table 2.3 shows the recall and precision rates for five values of this parameter. A more complete explanation of the approach can be found in the article by Townsend et al. [100], and the approaches on which it builds are described in the article by Vasserman [101]. Figure 2.19 shows the workflow for the OSCAR2 toolkit.
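The n-gram idea can be made concrete with a small Python sketch. It is an illustration only and not Waudby's implementation: the two word lists are tiny hypothetical training sets, add-one smoothing is an assumed choice, and the threshold argument is a generic decision boundary on the log-odds score. How the γ parameter of OSCAR2 enters the calculation is not specified here, so no correspondence between threshold and γ is implied.

```python
import math
from collections import Counter

# Tiny hypothetical training sets (assumptions for illustration only).
CHEMICAL = ["methanol", "ethanol", "propanol", "butyne", "ethyne",
            "hexane", "dichlorohexane", "benzene", "toluene"]
ENGLISH = ["table", "solution", "stirred", "mixture", "residue",
           "comparison", "standard", "temperature", "sample"]


def ngrams(word, n=3):
    """Split a word into overlapping n-letter tokens, marking start and end."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n=3):
    """Count how often each n-gram occurs in a list of words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


def log_odds(word, chem_counts, eng_counts, n=3):
    """Naive Bayes log-odds that a word is chemical, with add-one smoothing."""
    vocabulary = len(set(chem_counts) | set(eng_counts)) + 1
    chem_total = sum(chem_counts.values())
    eng_total = sum(eng_counts.values())
    score = 0.0
    for gram in ngrams(word, n):
        p_chem = (chem_counts[gram] + 1) / (chem_total + vocabulary)
        p_eng = (eng_counts[gram] + 1) / (eng_total + vocabulary)
        score += math.log(p_chem / p_eng)
    return score


def is_chemical(word, chem_counts, eng_counts, threshold=0.0):
    """Classify a word; raising the threshold trades recall for precision."""
    return log_odds(word, chem_counts, eng_counts) > threshold


chem_counts = train(CHEMICAL)
eng_counts = train(ENGLISH)
for candidate in ["pentanol", "picture"]:
    print(candidate, is_chemical(candidate, chem_counts, eng_counts))
```

Neither test word appears in the training lists; "pentanol" scores positively because it shares the tokens ano, nol and ol$ with the alcohols, while "picture" scores negatively because of tur, ure and re$. A context grammar of the kind described above would then be applied on top of such word-level scores.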
2.4.1 The Importance of Connection Tables

Chemical structures form the basis of most organic chemistry; the two-dimensional representations of molecules (sometimes with an indication of the three-dimensional structure included) are the form which chemists use when describing a molecule or a reaction mechanism. The structure of an organic molecule, whether two- or three-dimensional, can usually be described by a connection table (CT). Systematic chemical names are often avoided (until the molecule is to be included in a formal report). In common use they are often replaced by abbreviated forms or non-systematic names because systematic names are often lengthy, difficult to interpret and less memorable (see figure 2.6).

Figure 2.19: The OSCAR workflow.

The identification of chemical names is vital because it is often the only way that the CT of the molecule can be recovered. The CT forms the basis of most organic chemistry, whether it be in predicting a reaction pathway or determining the likely appearance of infra-red, HNMR or CNMR spectra. Data of this type is frequently reported to characterise the molecule. It would be desirable to use all available data reported in a paper to perform crosschecks and self-validation. Whilst electronic publications provide the means to encapsulate a compound’s CT in a document, this is most commonly achieved by including an MDL molfile [44] or a ChemDraw file [102] in a Microsoft Word document (which are rendered to the desired form when the file is viewed). During the publication process the CT is often converted into a picture, with the associated loss of data, leaving only the name available as a basis from which to recreate the CT. This is also the case with older articles where there is no electronic form in existence.

There are good commercial chemical name-to-structure converters available (for example ChemDraw).
However, their analysis techniques are not published, but it is believed that the algorithms are rule-based with no machine-learning capability. It is instructive to consider why a new approach to chemical name-to-structure conversion is useful and why a technique incorporating some machine-learning aspects should be employed.

2.5 OSCAR3 and OPSIN: Parsing Chemical Names to CTs

It is now common to use programs to generate the systematic name for a compound. This has resulted in improved compliance with the IUPAC specifications and greatly increased quality in the reported names. As computers are used to generate the names it seems sensible also to use them to perform the reverse translation.

There have been a significant number of attempts to produce automated methods to convert a chemical name to a CT. The first use of a computerised grammar analysis process to convert chemical nomenclature to CTs was by Elliot in 1969 [103]. Before this Garfield produced a system to calculate a compound’s molecular formula from its name [104]. This algorithm did not include a complete grammatical description of chemical nomenclature, but all the basic facets of such a grammar were examined.

Kirby et al. have been active in this area since 1985, when they presented a program that could convert an IUPAC systematic name to a chemical structure [105, 106, 107, 108, 109, 110]. The approach taken by this group was to create a formal grammar from the informal IUPAC rules and subsequently to modify a Simple LR (SLR) parser generator to apply to the context-free grammar. (A full discussion of how such a parser works and can be created can be found in section 3.1.) The program has since been extended to find, and automatically correct, errors found in chemical names and to parse some semi-systematic names.
However, in their own words, the work only focused on certain classes of compounds of industrial importance, including some cases of semi-systematic and trivial nomenclature. The areas considered were, perhaps understandably, those that are the most tractable. The work covered much of the hydrocarbon nomenclature and has since been extended to recognise many acids, alcohols, aldehydes, ketones and ethers.

Ultimately, a program is envisaged that can identify (in a paper or thesis) a chemical name, a synthetic route and the analytical data for this compound. A CT would be created for the title compound and for all chemicals identified in the synthesis. Automated validation could be achieved by generating various chemical properties, using either ab initio or semi-empirical methods. These would be checked against the reported data and any anomalies reported; this operation could also remove any ambiguities present in the structure generated. Calls to an outside program for reaction-prediction could also be made to verify that the synthetic route reported did, in fact, lead to the compound reported. It is hoped that such a program will prevent problems such as that described in section 1.3.

The current methodology approaches the problem from both ends. Lexemes are currently being developed to tokenise a chemical name. During the period of this author’s involvement with the project, only sections A to C of the IUPAC blue book [111] were considered, but eventually the entire book will be encoded. Methods are also being created to store, join and otherwise manipulate molecular fragments. Methods to deal with brackets were developed, allowing the program to know which bracket level it is processing. Although a formal grammar to describe the syntax of chemical names must be fully developed, collaboration with Natural Language Processing groups has led to the consideration of including machine learning routines in conjunction with this.
With such an implementation it would no longer be necessary to manually update and improve the lexemes, the grammar or the library of known fragments.

A combination of lexical and syntactic analysis should be able to process the characters in the chemical name 1,2-dichlorohexane into the following tokens:

1. The locant 1
2. Comma
3. The locant 2
4. Hyphen
5. The multiplier di
6. The halogen chlor
7. omark
8. The chain hex
9. The saturation ane

The blanks separating the characters of these tokens would normally be eliminated during preprocessing. Recursive hierarchical syntactic analysis allows the construction of the parse tree (see section 3.1) shown in figure 2.20. From this, the CT of the molecule is immediately recoverable.

Figure 2.20: The parse tree of the chemical name 1,2-dichlorohexane.

Chemists often prefer to have the ability to perform graphical as well as text searching for complex chemical structures and sub-structures mentioned in scientific literature. To date this has only been possible if the structure has been rendered searchable by the inclusion of the appropriate CT or SMILES [112, 113, 114] string to represent the molecule. If a compound, chemical name or component has only been mentioned in the document as a text string it has not been possible to search for these using a graphical search engine.

Figure 2.21: Two structures that might be referred to as 2-Chloroethyl benzene.

The SciBorg project [115] has focused on improving and implementing much of the functionality mentioned above, with Corbett, Copestake et al. demonstrating proof of concept, or better, implementations of much of the technology [116, 117, 118]. This has included taking a non-deterministic approach toward chemical name parsing. 2-Chloroethyl benzene is an ambiguous name because insufficient locants have been specified.
The name might be used to describe both the compounds in figure 2.21 (which also shows a grammatically-correct name for each [119]); both structures can be reconstructed using non-deterministic approaches.

2.6 Conclusions

The work above shows that it is possible for a machine to read and extract data from the highly-structured analytical data section in the legacy formats currently used to publish organic chemistry. The extraction can be performed with very high recall and precision rates using (hierarchical) regular expressions, although the process was more difficult than expected. However, whilst the recall and precision rates are high (and are being improved by implementing new techniques) they are not currently sufficiently high to allow the full automation of the process. Parsing the synthetic methodology also proved to be far less tractable than expected.

The work has also shown that chemical names cannot be identified reliably solely by using regular expressions. However, other methods do allow high rates of both recall and precision. The lack of a CT in machine-understandable form and the inability to reliably produce the correct CT from the information available has been highlighted as a major problem which must be addressed: for molecule-based data-driven science to be possible, it is vital that a CT is available.

Supplementary data provided with articles may include the input files and results of computational chemistry calculations or the crystal structures determined. These data formats are more structured and, importantly, should necessarily contain the CT for the molecule (or molecules) in an almost trivially recoverable form. Data in such forms is considered in the subsequent chapters.

Chapter 3
Parsing Program Input: Compilers

The previous chapter dealt with the extraction of data from the semi-structured experimental section of chemical papers which did not conform to a formal grammar.
It should, however, be possible to create a formal grammar for data produced by, or for, a computer program. Data in such a form should therefore lend itself to automated extraction with extremely high rates of both recall and precision, thus eliminating, or at the least drastically reducing, the requirement for human intervention.

A compiler is a program that reads a program in one language, the source language, and translates it into another language, the target language (figure 3.1). Both the source and target languages should have fully specified grammars; part of the translation process involves the compiler reporting any errors in the source program (deviations from the specified grammar). To avoid having to use proprietary software, only ASCII files are considered as suitable for use as the source language. In all cases the eventual target language is CML, although in some cases XML is used as the primary target language which is then transformed to CML using stylesheets. There follows a general introduction to compiler theory and how it has been applied to extract data from input files for computational chemistry programs [120].

Figure 3.1: Overview of a compiler.

3.1 Compilers and Classification Techniques

There are two parts to the compilation process: analysis and synthesis. The analysis breaks the source program up into constituent pieces and may create an intermediate representation of the data. The synthesis part constructs the desired target program from the constituent pieces of the intermediate representation. This is often the most complex part for code compilation. In contrast, when dealing with chemistry held in an XML form, the synthesis is almost trivial and involves the application of stylesheets. During analysis the operations implied by the source program are determined and recorded in a hierarchical structure called a tree.
A specialised kind of tree called a syntax tree is often used, in which each node represents an operation and the children of the node represent the arguments of the operation. Syntax trees are not required for processing computational chemistry but are vital for less structured chemical data.

A compiler itself is often not sufficient to create an entire target program because the source program may be stored in separate files, or have include statements for brevity. A pre-processor usually deals with the task of pulling together all the relevant pieces of the source program. It was necessary to implement a pre-processor in some cases, although it was decided that include statements would not be expanded.

3.1.1 The Phases of a Compiler

The traditional view of the phases of a compiler and the way in which they interact is shown in figure 3.2. This project only required five of these phases: lexical analysis, syntactic analysis, semantic analysis, the symbol-table manager and the error handler. A symbol-table is a data structure containing a record for each identifier with fields for the attributes of the identifier, allowing the rapid recall and modification of data relevant to that record. These tasks are managed by using an XML infrastructure.

Every phase may encounter or generate errors. All errors must be dealt with so that the compilation (transformation) process can proceed, allowing further errors to be identified, although some particular errors would immediately stop the process. The phases which give rise to the largest fraction of the errors are syntax and semantic analysis. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language.

The analysis of the source program consists of three phases:

1.
Linear analysis, in which the stream of characters making up the source program is read from left to right and grouped into tokens that are sequences of characters having a collective meaning.
2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with collective meaning.
3. Semantic analysis, in which certain checks are performed to ensure that the components of the program fit together meaningfully.

In a compiler, linear analysis is called lexical analysis or scanning. For the Java assignment

    int position = initial + rate * 60;

the lexical analyser would produce a stream of tokens, where each token represents a logically cohesive sequence of characters, such as an identifier, a keyword (for example if, while and int) or a punctuation character. The character sequence forming a token is called the lexeme for the token. In this example id1, id2 and id3 will be used to represent position, initial and rate respectively, emphasising that the internal representation of an identifier is different from the lexeme for that identifier. Figure 3.3 shows how this assignment would be processed by the three phases. The semantic analyser allows the compiler to know that an integer must be formed by the addition or multiplication of two integers. This allows the assignment of id2, id3 and 60 to type int.

Figure 3.2: The phases of a compiler.

The division between lexical and syntactic parsing is chosen to simplify the overall analysis task, although the division becomes necessary if the source language is inherently recursive. Lexical constructs do not require recursion, while syntactic constructs often do. Context-free grammars are a formalisation of the recursive rules that can be used to guide syntactic analysis (discussed later in this chapter). For example, recursion is not necessary to recognise a number, but is required to match parentheses in expressions.
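The lexical analysis of the Java assignment above can be sketched in a few lines of Python. This is a minimal illustration rather than the analyser used in this project: the token classes, the regular expressions and the function name tokenise are assumptions chosen for the example. Identifiers are replaced by entries id1, id2, ... in a small symbol table, and input that forms no token raises an error, which corresponds to the lexical error case described earlier.

```python
import re

# Token classes tried in order; keywords must precede identifiers.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:int|if|while)\b"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[=+*]"),
    ("PUNCT", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))


def tokenise(source):
    """Return (tokens, symbol_table) for a single statement."""
    symbols = {}   # lexeme -> internal name (id1, id2, ...)
    tokens = []
    position = 0
    while position < len(source):
        match = MASTER.match(source, position)
        if match is None:
            # The remaining characters do not form any token of the language.
            raise SyntaxError(f"no token at position {position}: "
                              f"{source[position]!r}")
        kind, lexeme = match.lastgroup, match.group()
        position = match.end()
        if kind == "SKIP":
            continue                      # whitespace is discarded
        if kind == "IDENT":
            internal = symbols.setdefault(lexeme, f"id{len(symbols) + 1}")
            tokens.append((kind, internal))
        else:
            tokens.append((kind, lexeme))
    return tokens, symbols


tokens, symbols = tokenise("int position = initial + rate * 60;")
print(tokens)
print(symbols)
```

No recursion is needed at this stage: each token is recognised by a single regular expression, and the nesting implied by operator precedence is left to the syntactic phase.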
For code generation, the semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code generation phase.

Figure 3.3: Assignments produced by the first three phases of a compiler.

The Number of Passes

In the compilation of computer code, several phases are usually implemented in a single pass, where a pass is defined as reading an input file and writing an output file. It is desirable to keep the number of passes to a minimum, but whilst this used to be a necessity it is now more of a guideline. The lack of definite reserved-words in chemical literature means that a one-pass compiler is impossible.

A Simple Compiler

The syntax of a language can be represented using a notation called context-free grammars or Backus-Naur Form (BNF) [121, 122]; the specification is shown in appendix C. BNF came about as part of the creation process for ALGOL. At the first World Computer Congress, which took place in Paris in 1959, Backus presented a formal description of the international algebraic language which was later called ALGOL 58 [123]. The formal language he presented, which would later evolve into BNF, was based on Post’s production system [124]. When such a context-free grammar exists it may be used to guide the translation of a program; this is called syntax-directed translation. The structure for such a compiler is shown in figure 3.4.

Figure 3.4: The structure of a compiler incorporating a syntax-directed translator.

3.1.2 Context-Free Grammars

A context-free grammar (grammar for short) is a way of specifying the syntax of a language (or any data). This is most easily illustrated with reference to a programming language. An if-else statement in Java has the form:

    if ( expression ) statement else statement

In other words, the statement is a concatenation of the keyword if, an opening parenthesis, an expression, a closing parenthesis, a statement, the keyword else, and another statement.
Using the variables expr and stmt to represent an expression and a statement respectively, this may now be written

    stmt ⇒ if ( expr ) stmt else stmt

where the arrow should be read ‘may have the form’. Rules of this form are termed productions. In productions, lexical elements such as the keywords and the parentheses are called tokens, whilst variables such as expr and stmt comprise collections of tokens and are called nonterminals. A context-free grammar has four components:

1. A set of tokens, known as terminal symbols.
2. A set of nonterminals.
3. A set of productions where each production consists of a nonterminal, called the left side of the production, an arrow, and a sequence of tokens and/or nonterminals, called the right side of the production.
4. A designation of one of the nonterminals as the start symbol.

3.1.3 Parse Trees

A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If nonterminal A has a production A ⇒ X Y Z then a parse tree may have an interior node labelled A with three children X, Y and Z from left to right (figure 3.5).

Figure 3.5: A simple parse tree.

Formally, given a context-free grammar, a parse tree is a tree with the following properties:

• the root is labelled by the start symbol
• each leaf is labelled by a token or by ε
• each interior node is labelled by a nonterminal
• if A is the label of an interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then A ⇒ X1 X2 ... Xn is a production. Here X1, X2, ..., Xn each stand for a symbol that is either a terminal or a nonterminal. If A ⇒ ε then each node labelled A may only have a single child labelled ε.

A simple example of a parse tree was seen in figure 3.5. The leaves of a parse tree read from left to right form the yield of the tree. Most parsing methods fall into one of two classes, called top-down and bottom-up methods.
These terms refer to the order in which the nodes in the parse tree are constructed. In the former, construction starts at the root and proceeds toward the leaves, and vice versa for the latter. Top-down parsers are usually easier to construct by hand and this is the general approach taken for this project.

Ambiguity

A grammar is said to be ambiguous if it is possible to represent a particular token string by more than one parse tree. Since this would generally mean that the token string’s meaning can be interpreted in more than one way, it is usual to attempt to remove all ambiguity in a grammar. It is possible to use ambiguous grammars, but these require additional rules to resolve the ambiguities.

3.1.4 Predictive Parsers

Recursive-descent parsing is a top-down method of syntax analysis in which a set of recursive procedures is executed to process the input. Predictive parsing is a special case of this where the lookahead symbol unambiguously determines the procedure selected for each nonterminal. The lookahead symbol is the next token in the token stream. A predictive parser consists of a procedure for every nonterminal:

1. Each procedure decides which production to use by looking at the lookahead symbol; the production with right side α is used if the lookahead symbol is in FIRST(α). If there is a conflict between two right hand sides for any lookahead symbol then the method cannot be used. A production with ε on the right hand side is used if the lookahead symbol is not in the FIRST set for any other right hand side.
2. The procedure uses a production by mimicking the right hand side. A nonterminal calls the procedure for that nonterminal, and a token matching the lookahead symbol results in the next input token being read. If at some point the token in the production does not match the lookahead symbol, an error is declared.

FIRST is defined such that, if a nonterminal α has the production α ⇒ βχδ...ζ, then FIRST(α) is β.
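The two rules above can be illustrated with a minimal predictive parser in Python for a toy grammar built around the if-else production of section 3.1.2. The grammar, the class name and the tuple-based parse tree are assumptions made for this sketch and are not part of the project code. The productions are stmt ⇒ if ( expr ) stmt else stmt, stmt ⇒ id = expr ; and expr ⇒ id | num, so the lookahead symbol (the keyword if or an identifier) selects the production for stmt unambiguously.

```python
KEYWORDS = {"if", "else"}


def is_identifier(token):
    return token.isidentifier() and token not in KEYWORDS


class PredictiveParser:
    """One procedure per nonterminal; one token of lookahead."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks the end of input
        self.position = 0

    @property
    def lookahead(self):
        return self.tokens[self.position]

    def match(self, expected):
        """Consume a token, or declare an error if it does not match."""
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, "
                              f"found {self.lookahead!r}")
        self.position += 1

    def stmt(self):
        if self.lookahead == "if":            # 'if' starts the first production
            self.match("if")
            self.match("(")
            condition = self.expr()
            self.match(")")
            then_branch = self.stmt()
            self.match("else")
            else_branch = self.stmt()
            return ("stmt", "if", condition, then_branch, else_branch)
        if is_identifier(self.lookahead):     # an id starts the second
            target = self.lookahead
            self.position += 1
            self.match("=")
            value = self.expr()
            self.match(";")
            return ("stmt", "assign", target, value)
        raise SyntaxError(f"no production for stmt starts with "
                          f"{self.lookahead!r}")

    def expr(self):
        if is_identifier(self.lookahead) or self.lookahead.isdigit():
            token = self.lookahead
            self.position += 1
            return ("expr", token)
        raise SyntaxError(f"expected an expression, found {self.lookahead!r}")


def parse(tokens):
    parser = PredictiveParser(tokens)
    tree = parser.stmt()
    parser.match("$")
    return tree


print(parse("if ( x ) y = 1 ; else y = z ;".split()))
```

The approach works here only because the two productions for stmt begin with different symbols. If several nonterminals could begin with the same symbol, a single token of lookahead would no longer identify the production to apply.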
It is impossible to construct a predictive parser for chemical documents owing to the number of nonterminals that share the same FIRST symbol (see section 3.3).

3.1.5 Lexical analysis

A lexical analyser reads and converts the input into a stream of tokens to be analysed by the parser. From the definition of a grammar, a sentence of a language consists of a string of tokens. A sequence of input characters that comprises a single token is called a lexeme. Examples of possible lexemes for chemical names are seen later in this chapter. A lexical analyser insulates a parser from the lexeme representation of tokens. For instance, instead of passing the actual value of a number in the source to the parser, the lexical analyser would identify the number and pass a token indicating that a number was present to the parser, with a pointer to the actual value (see figure 3.6). It may also remove or normalise whitespace and comments.

Figure 3.6: An example of how a lexical analyser passes tokens to the parser.

Figure 3.7: The lexical analyser and parser acting as a producer-consumer pair.

The interaction of a lexical analyser, the input and the parser is shown in figure 3.7. The analyser reads characters from the input and determines what token will represent them. In some cases it is necessary to look ahead several characters to determine what the current token should be. Once these characters have been read they must be pushed back on to the input in case they form the start of a new token. The lexical analyser and parser form a producer-consumer pair. The analyser produces tokens and the parser consumes them.

3.1.6 Regular Expressions

Lexemes are usually determined by matching against a regular expression. Although regular expressions have previously been discussed informally, a more rigorous definition follows. A regular expression is built up from simpler regular expressions using a set of defining rules. Each regular expression r denotes a language L(r).
The defining rules specify how ÃL(r) is formed by combining, in various ways, the languages denoted by the sub-expressions of r. The following rules show the definition of the languages denoted by the regular expression being defined: 1. ² is a regular expression that denotes ², that is, the set containing the empty string. 2. If a symbol a is a symbol in Σ, then a is a regular expression that denotes a, i.e., the set containing the string a. Although the same notation is used for all three, technically the regular expression a is different from the string a and the symbol a. It should be clear from the context whether a is being treated as a regular expression, string or symbol. 3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then: (a) r|s is a regular expression denoting L(r) ⋃ L(s). (b) rs is a regular expression denoting L(r)L(s). (c) r* is a regular expression denoting (L(r))*. (d) (r) is a regular expression denoting L(r). This rule states that extra parentheses may be placed around regular expressions if desired. A language denoted by a regular expression is said to be a regular set. The specification of regular expression is an example of a recursive definition. Rules (1) and (2) form the basis of the definition and rule (3) provides the inductive step. To avoid unnecessary parentheses in regular expressions, the following conventions have been adopted. 1. The unary operator * has the highest precedence 78 2. Concatenation has the second highest precedence 3. The or operator | has the lowest priority All the operators are left associative. Under these conventions, (a)|((b)*(c)) is equivalent to a|b*c. Both expressions denote the set of strings that are either a single a or zero or more b’s followed by one c. Some languages cannot be described by any regular expression. 
One example is the set of all strings of balanced parentheses (in fact regular expressions cannot be used to describe any arbitrary set of balanced or nested constructs, but a context-free grammar can). Regular expressions can only be used to denote a fixed or arbitrary number of repetitions of a given construct. Thus Hollerith strings of the form nHa1a2...an from early versions of Fortran cannot be described, because the number of characters following H must match the decimal number n before H. Matching strings of this type is possible using a library such as Perl Compatible Regular Expressions, which allows both lookahead and recursion. However, such regular expressions no longer correspond to the original mathematical definition of regular expressions [125].

3.1.7 Non-deterministic parsing

Context-free grammars and regular expressions are deterministic methods of identifying data, i.e. the behaviour of the system is described completely without probabilities (other than zero or one). Non-deterministic methods (such as Naïve Bayesian), which are largely used for classification, are discussed below.

Hidden Markov Model

Hidden Markov Models (HMMs) are often used in biochemistry to predict which amino acid residue is most likely to occur next in a chain [126, 127]. The technique uses a training set that has been manually marked up to determine initial transition probabilities, which become modified as the program sees more examples.

Formally, an HMM is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions between states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated, according to the associated probability distribution. Only the outcome, not the state, is visible to an external observer; hence the name Hidden Markov Model.

The theory is based on three assumptions:
1.
That the next state is dependent only upon the current state (the Markov assumption); the resulting model is a first-order HMM. More generally, the next state may depend on the past k states, giving a model called a kth-order HMM. A higher-order HMM has a higher complexity. Although first-order HMMs are the most common, some attempts have been made to use higher-order HMMs.
2. That state transition probabilities are independent of the actual time at which the transitions take place (the stationary assumption).
3. That the current output (observation) is statistically independent of the previous outputs (observations). Unlike the other two, this assumption has very limited validity; in some cases it does not hold well and therefore becomes a severe weakness of HMMs.

3.1.8 Bayesian Classification

In the Bayesian approach to statistical inference, probability is a model of scientific knowledge [128]. One of the strengths of the Bayesian method is that it allows expert knowledge, in the form of a prior probability distribution, to be formally incorporated into the statistical analysis. The Bayesian paradigm views both the data and the underlying parameters that generated the data as random variables, random because they are unknown [129].

A Bayesian classifier is commonly employed in email filters to separate spam from non-spam. The expert knowledge is incorporated by producing a list of the features that are, in the experts' opinion, most likely to indicate that an email is spam. The list is likely to contain words such as Viagra and porn. These would be described as information-rich words because their presence is highly indicative that the email can be classified as spam. Information-poor words might be dear or cost, which are likely to appear with approximately equal frequency in emails of both types.
Once the initial list of features has been created it may be augmented by the classification program if it identifies other information-rich words.

Natural Language Processing

The goal of Natural Language Processing (NLP) is to allow machines to analyse, understand and generate languages that humans use naturally [130]. This is a hard task. Human language contains much ambiguity and sentence construction often makes comprehension difficult. The sentences below are both human-understandable but present problems to machines:

    I can can a can.
    A well-dressed man was speaking; he had a foreign accent.

The first sentence contains the word can in three different forms: as a modal modifier of a verb, as a verb and as a noun. Syntax analysis would group the first two occurrences as a modal verb. The second sentence contains anaphora; this is coreference of one expression with its antecedent. Anaphora is common because it avoids repetition of words, making sentences more elegant. Part-of-speech analysis and tagging allows many of the common linguistic problems to be resolved.

Shallow parsing is the process of identifying syntactic phrases (such as noun phrases) in natural language sentences, for instance, Chomsky's system of transformational grammar. As outlined in Syntactic Structures [131], it comprised three sections, or components: the phrase-structure component, the transformational component, and the morphophonemic component.

Figure 3.8: The parse tree generated by the Chomskian analysis of the phrase 'The man will hit the ball'.

In the following system of rules, S stands for Sentence, NP for Noun Phrase, VP for Verb Phrase, Det for Determiner, Aux for Auxiliary (verb), N for Noun, and V for Verb stem.

1. S → NP + VP
2. VP → Verb + NP
3. NP → Det + N
4. Verb → Aux + V
5. Det → the, a, . . .
6. N → man, ball, . . .
7. Aux → will, can, . . .
8. V → hit, see, . . .

This is a simple phrase-structure grammar.
It generates and thereby defines as grammatical many sentences and it assigns to each sentence that it generates a structural description. Figure 3.8 shows the parse tree generated by Chomskian analysis of the phrase 'The man will hit the ball'.

3.2 JFlex and CUP

Several tools exist to create lexical analysers from special purpose notation based on regular expressions. This project used JFlex [132], which is a Java implementation of the C program flex [133, 134]. Figure 3.9 shows a typical example of this process. A specification of a lexical analyser is created in the JFlex language; this is run through the JFlex compiler to create the Java code. This code consists of a tabular representation of a transition diagram constructed from the regular expressions specified, together with a standard routine to recognise lexemes. The Java code is then compiled using the javac command. The resultant class is the lexical analyser that transforms an input stream into a sequence of tokens.

Figure 3.9: Creating a lexical analyser with JFlex.

    User code
      - Comments and import statements
      - Anything in here is placed verbatim in Lex.java
    %%
    Options and declarations
      - Directives and macros
    %%
    Lexical rules
      - Regular expressions and Java actions
      - Rules section

Figure 3.10: The specifications of a JFlex file.

    %%
    %class CommentRemover
    %standalone
    %unicode

    LineTerminator = \r|\n|\r\n
    Comment        = ";".*{LineTerminator}
    AnythingElse   = [^;]+

    %%
    {Comment}        { ; /* don't print these out */ }
    {LineTerminator} { System.out.println(); }
    {AnythingElse}   { System.out.println(yytext()); }
    <<EOF>>          { return 0; }

Figure 3.11: The comment remover pre-processor in JFlex. The token <<EOF>> is pre-defined and represents the end of the file. yytext() returns the text that was matched by the current rule.

Figure 3.10 shows the JFlex file specification (the BNF of the lexical rules is shown in appendix D).
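The behaviour of the comment remover in figure 3.11 can be approximated without a scanner generator. The following is a minimal sketch using java.util.regex rather than JFlex; the class and method names are hypothetical, and, unlike the JFlex specification, it keeps the line terminator that follows each comment so that the line structure of the input is preserved.

```java
import java.util.regex.Pattern;

final class CommentRemoverSketch {
    // A comment runs from ';' to the end of the line.
    // The line terminator itself is not part of the match, so it is kept.
    private static final Pattern COMMENT = Pattern.compile(";[^\\r\\n]*");

    /** Returns the input with every ';' comment removed. */
    static String removeComments(String input) {
        return COMMENT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "first line ; a trailing comment\n"
                     + "; a whole-line comment\n"
                     + "second line\n";
        System.out.print(removeComments(input));
    }
}
```

A generated scanner is still preferable once more than one kind of lexeme has to be recognised, because the rules can then be listed separately and resolved by the scanner itself.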
If the %standalone directive is present the Java code in the lexical rules usually creates the target language directly; otherwise it returns a token for the parser to consume. An example of a standalone scanner is shown in figure 3.11, where the target language is simply the original file with the comments removed.

When consuming its input, the scanner determines the regular expression that matches the longest portion of the input (longest match rule). If there is more than one regular expression that matches the longest portion of input (i.e. they all match the same input), the generated scanner chooses the expression that appears first in the specification. After determining the active regular expression, the associated action is executed. If there is no matching regular expression, the scanner terminates the program with an error message.

The scanner is used to determine that only permissible lexemes are present in the input. However, the meaning of each lexeme (and how it should be processed) is not always apparent without context; for instance, a token representing a number might represent the atomic mass, a coordinate, the year or the amount of memory used. Therefore a syntactic analyser must be introduced. A Look-Ahead LR (LALR) parser is a specialised

    PairBlock     ::= PAIR:t1 BLANKLINES PairLineBlock ;
    PairLineBlock ::= PairLine | PairLineBlock PairLine ;
    PairLine      ::= PairLine1 | LongPairLine1 | PairLine2 ;
    PairLine1     ::= SPACE INT:i1 SPACE INT:i2 SPACE INT:i3 SPACE FLOAT:f1 SPACE FLOAT:f2 EOL
                      {: System.out.println(""); :} ;
    LongPairLine1 ::= SPACE INT:i1 SPACE INT:i2 SPACE INT:i3 SPACE FLOAT:f1 SPACE FLOAT:f2 SPACE FLOAT:f3 SPACE FLOAT:f4 EOL
                      {: System.out.println("

Figure 4.3: The DTD for JUMBOMarker.

HTML tables for human-friendly display.
The process involves the use of a set of templates (consisting of groups of regular expressions) to identify various sections of the output, and processing instructions which specify how to represent the extracted data in XML. The application of stylesheets allows the user to transform the data to other formats for display.

4.1 Design of JUMBOMarker

JUMBOMarker is an XML-based language supporting structured regular expressions for parsing semi-structured documents. The regular expression to match a single line is enclosed in a primitive element. The DTD for JUMBOMarker is shown in figure 4.3. Figures 1.13 and 4.4 show extracts from MOPAC and GAMESS output files; there is a lot of implicit structure in the documents and the components are well labelled (an expert human practitioner could understand what all the numbers mean).

Figure 4.4: A section of a GAMESS output file.

    MOLECULAR POINT GROUP : C2
    MOLECULAR POINT GROUP : D3h

Figure 4.5: The point group of two different molecules as reported by MOPAC.

Comparing multiple output files from the same program allows the identification of parts of the output as boilerplate∗ which does not vary between documents. Figure 4.5 shows the point group reported by MOPAC for two different molecules; from this it is determined that the string:

    MOLECULAR POINT GROUP :

is boilerplate. The template to extract the point group is shown in figure 4.6. The extraction is a two-stage process. Firstly the document is split into chunks; in this case, the point group (which is captured) and everything else (which is discarded). The more boilerplate present, especially where it provides well-defined start and end points for each section, the easier the chunking process is. Once the chunk has been identified, whatever is matched within the capture group is available for JUMBOMarker to process. The data in the first capturing group is then recalled using {$1}, the second {$2} and so on.
When run on the text in figure 4.5, this would produce:

    C2
    D3h

∗Boilerplate text consists of standard phrases that may be combined or recalled to create new documents; it is commonly used in contracts or other agreements detailing terms and conditions.

Figure 4.6: The template to extract and mark up the point group from MOPAC output.

    CARTESIAN COORDINATES

     NO.  ATOM        X             Y             Z
       1   O      0.00210000   -0.00410000    0.00200000
       2   O     -0.06910000    5.24140000    0.03230000
       .
       .
       .
      14   H     -3.16080000    1.41050000    0.94380000
      15   H     -3.18530000    1.42060000   -0.83600000

Figure 4.7: Section of MOPAC output: cartesian coordinates.

Unforeseen character strings present in the output, such as error messages, usually prevent regular expressions from matching. The grouping of regular expressions into templates dictates that if one does not match then the entire group fails to match. Further processing is dependent on whether or not the template is mandatory. If the template is mandatory (minOccurs ≥ 1) the parser produces an error message, discards all the data for the job and moves on to subsequent jobs. In other words any job containing unexpected data is lost. However, if the template is optional (minOccurs = 0) the parser attempts to match the next template against the output. This only results in loss of the data associated with the section in which the mismatch occurred.

    NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS

     ATOM NO.   TYPE     CHARGE     No. of ELECS.    s-Pop      p-Pop
        1        O     -0.278945       6.2789       1.86320    4.41574
        2        O     -0.282193       6.2822       1.86399    4.41820
        .
        .
        .
       14        H      0.055649       0.9444       0.94435
       15        H      0.055699       0.9443       0.94430

Figure 4.8: Section of MOPAC output: net atomic charges and dipole contributions.

The removal of the block structure might be considered as a way to avoid discarding mismatched jobs. In such a system, if an unexpected message were encountered, the parser would skip the line containing the message and then continue trying to match the current regular expression.
However, although MOPAC output contains many stock-phrases, an unambiguous individual line recognition technique is not possible. For instance, consider the two types of data shown in figures 4.7 and 4.8; a regular expression to match a numeric line of the CARTESIAN COORDINATES would have the following form:

    \s+DIGIT+\s+CHEM_ELEMENT\s+FLOAT\s+FLOAT\s+FLOAT

where:

    CHEM_ELEMENT = (H|He|Li|Be|...|Uuo)
    DIGIT        = [0-9]
    FLOAT        = (\+|-)?DIGIT+\.DIGIT+

This regular expression would also match numeric lines in the NET ATOMIC CHARGES AND DIPOLE CONTRIBUTIONS block, if the element in question has only s electrons. The only way of removing the ambiguity between the two lines is by context. Context is primarily provided by textual headings and position in the document. The only way to determine context in such documents is by chunking, hence the block structure of JUMBOMarker is necessary.

4.2 JUMBOMarker: Single-Pass, Single-Parse

The original design of JUMBOMarker consisted of a single-pass, single-parse parser; this means that the input document is only read once (single-pass) and that the data is extracted in a single process (single-parse). This design (shown in figure 4.9) resulted in an overly large, complex and extremely fragile system.

Figure 4.9: For single-pass, single-parse parsing JUMBOMarker is an overarching program that determines what output should be created, all the temporary files being held in memory.

Much of the complexity was a result of the need to control which routine should be run as a result of the presence or absence of particular matched groups in the output document (held in memory) before the final files are created. Because the templates to parse each computational code differ, specific methods are required to control the output for each. Therefore, to adapt JUMBOMarker to parse the output from a different computational chemistry program required editing the source code and effectively creating a new program.
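The line-level ambiguity described in section 4.1 can be reproduced directly. The following minimal sketch builds the coordinate-line expression from the definitions given there, using java.util.regex; the class name, the shortened element list and the leading whitespace in the sample lines are assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class LineAmbiguitySketch {
    // Assumption: a handful of symbols stands in for the full element list.
    static final String CHEM_ELEMENT = "(?:H|He|Li|Be|C|N|O|Sb)";
    static final String DIGIT = "[0-9]";
    static final String FLOAT = "(?:\\+|-)?" + DIGIT + "+\\." + DIGIT + "+";

    // \s+DIGIT+\s+CHEM_ELEMENT\s+FLOAT\s+FLOAT\s+FLOAT
    static final Pattern COORDINATE_LINE = Pattern.compile(
            "\\s+" + DIGIT + "+\\s+" + CHEM_ELEMENT
            + "\\s+" + FLOAT + "\\s+" + FLOAT + "\\s+" + FLOAT);

    /** True if the whole line matches the coordinate-line expression. */
    static boolean looksLikeCoordinateLine(String line) {
        return COORDINATE_LINE.matcher(line).matches();
    }

    public static void main(String[] args) {
        String coordinate     = "   14 H -3.16080000 1.41050000 0.94380000";
        String hydrogenCharge = "   14 H 0.055649 0.9444 0.94435";
        String oxygenCharge   = "    1 O -0.278945 6.2789 1.86320 4.41574";

        // A genuine coordinate line matches, as expected.
        System.out.println(looksLikeCoordinateLine(coordinate));
        // A hydrogen charge line has three numeric columns and also matches.
        System.out.println(looksLikeCoordinateLine(hydrogenCharge));
        // An oxygen charge line has a fourth (p-Pop) column and does not.
        System.out.println(looksLikeCoordinateLine(oxygenCharge));
    }
}
```

Only the heading that precedes each block distinguishes the first two sample lines, which is why the expression has to be applied within a chunk rather than line by line.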
The single-pass, single-parse approach yielded ca. 95% correct parsing. All the incorrectly-parsed documents resulted from the presence of unexpected strings in the output. To improve and simplify the parsing process, a two-pass, two-parse parser was introduced. This had the added benefit of separating the various components of the program. The separation of the components resulted in a more robust program and greatly simplified the process of extending JUMBOMarker to parse other program output.

4.3 JUMBOMarker: Two-Pass, Two-Parse

In the two-pass, two-parse approach, JUMBOMarker is only used as a parser. In effect this implementation uses two single-pass, single-parse parsers which are controlled by an external program (figure 4.10). The first of these parsers is used to determine whether or not the job has failed. If the job has failed, the second parser is not used. However, if the job has run successfully, the control program runs the second parser to extract the data. Although this description refers to two separate parsers, they are in actuality the same parser (JUMBOMarker) using different templates.

The first parser runs JUMBOMarker over the output log file with the failure template. The failure template serves two purposes: to determine that the output log file is complete and to check that there are no error or fail messages in the log file. Each of the primitiveGroups has the minOccurs attribute set to 0 and maxOccurs set to 1. Thus an XML file is always produced, even if it only contains an empty root element. A stylesheet is then run over the XML files produced to determine if the job is complete and contained no error messages. If this is the case, the output log file is suitable for further parsing. If it is not, a further file (fail.xml) containing the failure reason is created. There are two sources of failure: incomplete output files and programmatic errors.
If the failure reason is that the output file is incomplete then the job can simply be resubmitted for calculation. Jobs which cause programmatic errors are analysed and the appropriate refinements are made to the protocol (see chapter 7). The list below shows the sections present in a GAMESS failure template:

• gamess-begun
• too-many-steps
• wrong-charge-and-multiplicity
• scf-not-converged
• too-little-time-to-do-another-point
• atoms-too-close
• general-error
• general-failure
• gamess-terminated
• error-termination

Figure 4.10: JUMBOMarker is only used to parse the document; each step is controlled by external programs or scripts. The second parse is dependent on the results of the first.

The presence of both the gamess-begun and gamess-terminated sections indicates that the entire document is likely to be complete. If any of the other sections are also present this indicates that the job has failed for an identifiable programmatic reason. If there is no failure, the control program then applies the second parser to extract the data.

The introduction of the two-pass, two-parse JUMBOMarker greatly simplified the detection of errors in the log files and hence the parsing process. Whilst this system was sufficient for the extraction of the required data from MOPAC output, it was incapable of fully parsing a GAMESS output file because of the repeating groups present in these files. To address this issue a multi-pass, multi-parse approach was adopted.

4.4 JUMBOMarker: Multi-Pass, Multi-Parse

The limitations of the two-pass, two-parse approach are those of ordering and choice. JUMBOMarker can parse the following document structure:

    A* B? C+

where A, B and C are all sections of the document. However, it does not support:

    ( A | B ) C ( D & E ) F

where D, E and F are all sections of the document and the ampersand means at least one of each but in any order.
It is in general unnecessary for the ampersand connector to be supported to parse output files and the 'or' connector (whilst useful) is not vital provided that the type of output is known. For example, different (though largely very similar) templates can be created to parse the output from different run types. By using optional repeating groups JUMBOMarker can parse documents with the structures:

    A B A B A B A A A A B B B A B B A A A B B B A

where A and B represent primitives within a primitiveGroup (with maxOccurs > 1). However, there cannot be any unparsed data between A and B. Thus documents of the form:

    A C B A C B

where C represents a section of the document for which no template exists cannot be parsed. All quantifiers in JUMBOMarker are greedy, that is, they match as much as they possibly can; thus there is no way of 'skipping' over a section because a general template to match an unspecified number of lines () would continue to match to the end of the document and B would never match.

The introduction of multi-parse parsing eliminates this problem. Figure 4.11 shows the multi-parse process. The first parse through the document

    A C B A C B

would mark up the A sections producing

    ~A ~A

(where ~A represents the XML markup of section A). The second parse would mark up the B sections producing

    ~B ~B

These sections would then be merged using stylesheets to produce

    ~A ~B ~A ~B

Figure 4.11: The out.txt will only be parsed if it has already passed the failure test. In multi-pass, multi-parse parsing JUMBOMarker traverses the input file multiple times, each time producing a different file; this is necessary for overlapping repeating groups.

This version of JUMBOMarker implements the same initial single-pass, single-parse parser as that in the two-pass, two-parse version.
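The mark-up-then-merge idea can be sketched in plain Java. This is an illustration only: JUMBOMarker itself uses XML templates and merges the per-pass results with stylesheets, whereas here each "template" is a single hypothetical regular expression and the merge is a sort on the offset at which each section was found. The line formats "A: ..." and "B: ..." are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiPassSketch {
    /** A marked-up section: its label, captured value and offset in the input. */
    static final class Section {
        final String label;
        final String value;
        final int offset;

        Section(String label, String value, int offset) {
            this.label = label;
            this.value = value;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "~" + label + "(" + value + ")";
        }
    }

    /** One pass over the document: mark up every section matching one template. */
    static List<Section> pass(String label, Pattern template, String document) {
        List<Section> found = new ArrayList<>();
        Matcher m = template.matcher(document);
        while (m.find()) {
            found.add(new Section(label, m.group(1), m.start()));
        }
        return found;
    }

    /** Merge the results of several passes back into document order. */
    static List<Section> merge(List<List<Section>> passes) {
        List<Section> all = new ArrayList<>();
        for (List<Section> p : passes) {
            all.addAll(p);
        }
        all.sort(Comparator.comparingInt((Section s) -> s.offset));
        return all;
    }

    public static void main(String[] args) {
        // A document of the form A C B A C B, where no template exists for C.
        String document = "A: 1\nC: unparsed\nB: 2\nA: 3\nC: unparsed\nB: 4\n";
        Pattern templateA = Pattern.compile("(?m)^A: (\\S+)$");
        Pattern templateB = Pattern.compile("(?m)^B: (\\S+)$");

        List<Section> first = pass("A", templateA, document);   // ~A ~A
        List<Section> second = pass("B", templateB, document);  // ~B ~B
        System.out.println(merge(List.of(first, second)));
    }
}
```

Because each pass looks for only one kind of section, the unparsed C sections never need a template of their own.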
However, if the job has not failed the extraction of the data is no longer performed in a single process but by multiple parsers, the results of which are then combined to give all the extracted data.

The multi-pass, multi-parse approach took an average of five seconds to check that a file had not failed, extract the data and coordinates and extract the times reported into CML; this figure includes the initial loading time for the program and all the reading and writing to disk that is necessary. The parsing was performed on a Sony VAIO laptop computer with an Intel Pentium 4 2.8 GHz CPU and 512 MB of RAM.

4.5 Conclusions

The current multi-pass, multi-parse version of JUMBOMarker achieves greater than 99.9% correct parsing of all files with no human curation. The files which are not correctly parsed are simply not parsable without a complete rewrite of the underlying code and might still require human input. The main problem remaining is that of encoding particular characters in the XML representation. This is a general issue for any XML-based system [143]. Fortunately, this problem occurs only rarely; only one instance was found during the course of this work. The extremely low error rate of JUMBOMarker allows it to be incorporated into automated processes, thereby reducing the amount of work to be done by humans. The use of JUMBOMarker as part of a workflow to parse program output for analysis forms an integral part of the following chapters.

Although very little human intervention is required once JUMBOMarker has been set up, the creation of the templates still requires a large amount of user input and it is desirable to reduce this. Methods to automatically generate templates, or parts of templates which could be fleshed out by the user, were examined (using HMMs and extended natural language parsing techniques).
However, whilst such techniques could provide a rough outline of the required template, a large amount of debugging was required before they were usable. In fact, because the user was not involved in the initial creation of the templates, it was often found that more time was required to work out what was required than if these methods were not employed. In general, it proved simpler and faster to create the templates by hand.

Collaborations with Martin Braendle [144], René Kanters [145] and Miguel Howard [146] have led to the creation of templates for GAUSSIAN03 [147] output and the implementation of the JUMBOMarker approach to parse user-defined extra data for programs such as Jmol.

Chapter 5 High-Throughput Computing

High-Throughput (HT) computing involves allowing users to run large numbers of independent jobs simultaneously over long periods of time. Conversely, High-Performance computing is aimed at providing large amounts of computing power for relatively short periods of time. HT computing is particularly applicable to the process of optimising molecular geometries, because the same algorithm is applied repeatedly to different molecules independently.

A typical computer is used very inefficiently; it is often idle for a very large percentage (>80%) of its available cycles. Various groups have attempted to make use of these idle cycles, such as the Berkeley students behind the SETI@home [148] project and CANCER@HOME [149]. However, these schemes have all involved users downloading a purpose-built program which runs whilst the computer would otherwise be idle. These systems only allow a specific program to be run. Therefore a group at the University of Wisconsin-Madison developed Condor; once installed on a machine this allows many (any) programs to be run whilst the machine would otherwise be idle [150].

5.1 Condor

The idea behind Condor is simple; it matches any computational jobs that users have with spare power in other owners' computers.
Computer owners do not have to modify their programs to use Condor, they just have to agree to become part of a Condor network. This network is a group of computers connected in such a way that messages and data can pass between them. Condor remains in the background of the network, on a computer called the central manager (which can be any computer in the network), where it searches for inactive computers. When such a machine is found, Condor claims it and adds it to the list of available nodes. By default, a physical user has higher priority than Condor so when the owner resumes using the computer, Condor stops any currently running programs and removes the computer from the available nodes list.

In practice Condor is more complicated. Every computer in the pool must continually run parts of the Condor software that track activity on both the central manager and the local system. To do this, the computer must be capable of multi-tasking, or running more than one piece of software simultaneously. The overall effect is that by using Condor previously unusable computing resources and time become available for free. Condor implements an internal scheduling system so that large jobs do not drain the pool of cycles; this is based on the priority Up-Down algorithm developed by Livny et al. [151].

    This system makes it possible for scientists who are running time-consuming research to coexist with people running shorter jobs. The priority that the system assigns to a scientist's project decreases as the number of cycles the scientist uses increases. [152]

The Unilever Centre for Molecular Science Informatics (UCC) allowed Zhang of the Murray-Rust group to establish a Condor pool on 24 teaching machines in 2002. This provided approximately three months of uninterrupted run time over the summer vacation (6 cpu years), effectively using the Condor system as a scheduler.
During term (when interruptions may occur as physical users claim the machines) this reverted back to an idle-cycle scavenging system as per the original design.

5.2 MOPAC and NCI HT Computing

MOPAC is a general-purpose semi-empirical molecular orbital package for the study of solid state and molecular structures and reactions. Semi-empirical Hamiltonians are used in the electronic part of the calculation to obtain molecular orbitals, the heat of formation and its derivative with respect to molecular geometry. Using these results MOPAC calculates the vibrational spectra, thermodynamic quantities, isotopic substitution effects and force constants for molecules, radicals, ions, and polymers.

The National Cancer Institute (NCI) makes available the connection tables of 250,251 structures for download and re-use [153]. As described by Zhang in 2004 [154], these structures were downloaded in SDF format and converted to CML using OpenBabel [155]; the resultant CML was combined with a set of job controls to create a MOPAC input file. The input file specified the type of calculation to run (a geometry optimisation using PM3) and the charge on the molecule. Where there were multiple molecules in the MOL file the largest fragment was selected∗. The Condor submission node does not support 250,251 individual jobs and therefore the input files were collected into 500 groups of 500 and a single group of 251 inputs. These were then submitted to the Condor infrastructure for calculation†. Figure 5.1 shows the overview of this process.

The structure and properties determined by MOPAC were extracted and parsed to CML using JUMBOMarker, combined with the input structure previously generated (thus already in CML) and stored in an XML database (Xindice [156]). This process was non-trivial. Whilst MOPAC is in general a very stable program, the large number of jobs revealed some problems that were previously not encountered.
An example of such a problem was a memory overflow issue that affected every 64,000th job, causing it to crash.

∗The largest fragment was taken to be that which contained the most atoms.

†The calculation and analysis of the NCI molecules using MOPAC was a collaborative project between Zhang and this author. The submission of the jobs was entirely the work of Zhang, the subsequent analysis the work of this author.

Figure 5.1: Overview of the MOPAC calculation of the NCI molecules.

Figure 5.2: Mismatches in the MOPAC input and output.

The JUMBOMarker used for this process was the single-pass, single-parse design as defined in section 4.2. The templates developed for this process were designed to be fault tolerant; that is, if a job failed JUMBOMarker should skip over this job and continue parsing. Hence, the simple matching system for input and output files often resulted in mismatches as shown in figure 5.2. The detection of this fault was relatively trivial because the matching program reported the mismatch, for example: no output structure was found for input molecule 500. The fault was not easy to rectify because the files did not contain unique identifiers. Hence determining which was the appropriate output file to match with a particular input file relied on matching the structures contained in each; InChIs are ideal for this purpose.

5.2.1 InChI

InChI is the IUPAC International Chemical Identifier and is a string of characters that is designed to canonically represent a chemical substance [17, 157]. It is derived from a structural representation of the substance in a way designed to be independent of the structural representation. Therefore a single compound will always produce the same basic identifier.

Figure 5.3: Various structures that all are represented by the same basic InChI.

Figure 5.3 shows various structural representations of the same molecule which are normalised to a single basic InChI.
The InChI technical manual states:

    It was agreed at IUPAC meetings prior to the start of this project that the first version of the InChI should cover well-defined, covalently-bonded organic molecules. It was also agreed to include substances with mobile hydrogen atoms (tautomers, for instance). In the course of this project, it was found that straightforward extension organometallic compounds could be represented. Methods were found to also include variable protonation. Also, the present version only considers traditional organic stereochemistry (double bond – sp2 and tetrahedral – sp3) and the most common forms of H-migration (tautomerism). However, the layered structure of the InChI allows future refinements with little or no change to the layers currently used. [17]

It is possible to generate InChIs for molecules that the program was not designed for and those that are not chemically valid; for example C(CH3)6 generates

    InChI=1/C7H18/c1-7(2,3,4,5)6/h1-6H3

but with the warning message:

    type=warning value=Accepted unusual valence(s): C(6)

Figure 5.4: Garbage-In, Good-Out: a molecule from the NCI that was mended by MOPAC (panels: input, calculated).

The use of InChIs to index and identify documents (especially web-based documents) was described by Coles et al. in 2005 [158]. It was hoped that creating InChIs for the structures contained in both the input and output documents should allow the correct molecules to be matched up; unfortunately this did not always prove to be the case. In several cases the connection table (from which the InChI is derived) changed as a result of the MOPAC calculation. The term proteus was coined to describe this type of molecule‡. It is interesting to note that the occurrence of proteus molecules in the dataset would likely have remained undiscovered if the mismatching problem had not occurred.
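The matching step, and the way in which a changed connection table shows up, can be sketched as follows. This is a minimal illustration rather than the program used in this work: the class and method names are hypothetical, the InChI strings are assumed to have been generated beforehand by an InChI program (not shown), and the example in which ethanol goes in and dimethyl ether comes out is contrived purely to give two identifiers with the same formula but different connectivity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InchiMatchSketch {
    /**
     * For each output structure, returns the index of the input structure
     * with the same InChI, or -1 if no input has that InChI. An unmatched
     * output is a candidate for a changed connection table.
     */
    static List<Integer> matchOutputsToInputs(List<String> inputInchis,
                                              List<String> outputInchis) {
        Map<String, Integer> inputIndexByInchi = new HashMap<>();
        for (int i = 0; i < inputInchis.size(); i++) {
            // Simplification: if two inputs share an InChI, the first is kept.
            inputIndexByInchi.putIfAbsent(inputInchis.get(i), i);
        }
        List<Integer> matches = new ArrayList<>();
        for (String inchi : outputInchis) {
            matches.add(inputIndexByInchi.getOrDefault(inchi, -1));
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> inputs = List.of(
                "InChI=1/CH4/h1H4",                   // methane
                "InChI=1/H2O/h1H2",                   // water (job assumed to have failed)
                "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3");  // ethanol
        List<String> outputs = List.of(
                "InChI=1/CH4/h1H4",                   // unchanged, so it matches input 0
                "InChI=1/C2H6O/c1-3-2/h1-2H3");       // connectivity changed, no match
        System.out.println(matchOutputsToInputs(inputs, outputs));
    }
}
```

An input with no matching output may simply be a failed job, whereas an output with no matching input cannot be explained in that way and is worth inspecting by hand.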
5.2.2 Proteus Molecules

Analysis of the proteus molecules showed that in some cases the MOPAC calculation had mended a bad input structure (Garbage-In, Good-Out) whilst in other cases the MOPAC calculation broke a good input structure (Good-In, Garbage-Out). Examples of these are shown in figures 5.4 and 5.5 respectively. In general the Good-In, Garbage-Out molecules were a result of attempting to calculate properties for molecules that MOPAC is not optimised to deal with (under the protocol that was in use), i.e. those containing heavy metals. The identification of good and bad structures might be possible using sufficiently chemically-aware programs but is currently performed by human experts.

‡ Proteus was a shape-changing ocean deity in Greek mythology.

Figure 5.5: Good-In, Garbage-Out: a molecule from the NCI that was broken by MOPAC; the five-coordinate atom is antimony. (Panels: input, calculated.)

As previously discussed, it is important to verify that data is of sufficient quality before it is publicly disseminated. Determining which of the calculated structures were Good-Out (regardless of whether or not the input structure was good) is difficult because of the inconsistent quality and the volume of the NCI dataset. To provide a measure of the quality of the MOPAC structures, comparison with structures obtained from a different source is necessary (figure 5.6). Ideally the comparisons would be between the molecules in the MOPAC calculated set and the same molecules determined experimentally.

Figure 5.6: To cross-check the MOPAC results the structures can be compared to experimentally-determined structures or those determined by a different computational program.

Experimentally-determined 3D structures are available from X-ray crystallography reported in Crystallographic Information Files (CIFs). The format of a CIF is discussed in chapter 6 and the processing necessary to extract the CTs in section 7.3.
For the moment it is sufficient to state that the extraction of the CTs is non-trivial and, in general, not possible without downloading the entire CIF. The lack of the CT, or an identifier directly derived from it (such as InChI), in an immediately machine-understandable form means that it is impossible for a machine to automatically search the literature to find appropriate structures for comparison. It was decided that it would be more tractable to compare the geometries determined by MOPAC with those calculated by a different program at a higher level of theory.

5.3 MOPAC and GAMESS

GAMESS (General Atomic and Molecular Electronic Structure System) allows the optimisation of molecular geometries using the energy gradient (calculated analytically for SCF or DFT wavefunctions) [159]. The original code split in 1981 into two variants, GAMESS (US) and GAMESS (UK), which now differ substantially. This work has been performed using GAMESS (US), which is available at no cost to both academic and industrial users.

This work investigates the behaviour of GAMESS in a HT environment and uses the results to validate the MOPAC optimisations. To do this, a general purpose protocol was required: the protocol should be able to cope with a wide range of different molecules and still converge the SCF and optimise the geometry.

GAMESS input is directed by what are called ‘$’ or ‘$ groups’. The input components directing what type of calculation should be performed are included in the $CONTRL group. The $DATA section provides the specific molecular set up, including both the symmetry type and the 3D coordinate information necessary to construct the molecule. This can be provided as a set of Cartesian components or as a Z-matrix, which describes the molecule using internal coordinates. Figure 5.7 shows a typical input file.

Whilst the protocol was designed to reduce the number of failures, some failed jobs were anticipated.
The failures were expected to fall into four categories:

System crashes: The calculation can either be regarded as lost or resubmitted for calculation.

Calculations overrun: The protocol can be adapted to give more time for each job (the time is effectively free) but this might be a sign of a pathological molecule which will always cause problems.

The wrong answer is produced: There are four possibilities:
• The protocol is not capable of dealing with molecules of this type (for example open shell systems). Either filters should be implemented to remove molecules of this type or the protocol should be changed to accommodate them.
• This is a general problem (everyone gets the same wrong answer). As such it should not be considered a problem here, because the protocol should be developed to get consistent answers rather than right ones.
• The result produced by GAMESS is different from that from other programs; such results are a GAMESS community concern and would be reported to the developers.
• The result produced by this system differs from that calculated by another GAMESS system using the same level of theory. This is probably the most important wrong answer to identify, but does not necessarily mean that the answer is wrong; for example different starting geometries may lead to different local minima. Repeating calculations of the same molecule on different systems (in this case Linux- and Windows-based architectures) allows the determination of acceptable variation.

Unforeseen failures: Although every attempt is made to predict how calculations might fail, previous experience suggests that unforeseen failures do occur. These are often useful because they highlight new ways of looking at the data and adapting the protocol.

Figure 5.7: An example GAMESS input file. The molecule specified in Cartesian coordinates (COORD=CART) should be geometry-optimised (RUNTYP=OPTIMIZE) using automatically-generated internal coordinates ($ZMAT group). The maximum time (TIMLIM) allowed for the calculation is 10080 minutes (one week) and a maximum of 262144000 8-byte words of memory are to be used (MEMORY). The energies are calculated using the Restricted Hartree Fock wavefunction (SCFTYP=RHF) with the B3LYP exchange function (DFTTYP=B3LYP). Pople’s N-31G split valence basis set is to be used (GBASIS=N31), available for H-Zn when NGAUSS=6, with one polarisation function included on atoms of atomic number 3 and over (NDFUNC=1). The basis set and exchange function were chosen because they represent the best trade-off between calculation time and accuracy in common use [160].

Chapter 7 details how a protocol is created and refined, which involves choosing an appropriate level of theory and determining which structures are suitable for calculation at that level. However, the method presented relies on an assumed knowledge of the behaviour of the calculation program. The initial creation of a suitable protocol for HT computing with an unfamiliar program is extremely difficult and the first attempt should be expected to fail. The development cycle relies on the detection of failures to identify problems with the protocol, which can then be refined or have filters put in place. Figure 5.8 shows a typical protocol development cycle.

The intention is for humans to validate protocols rather than individual data items (when large datasets are being used, dealing with individual data items is often impossible). Thus when developing a protocol, a molecule that is correctly computed is of no immediate interest because it does not highlight any problems. Once the protocol has been established, all the molecules that were correctly computed using that protocol can be analysed to produce derived data and, in some cases, further refinements for that protocol.
Figure 5.8: A typical development cycle for a protocol.

Typical examples of the types of error encountered during the protocol development and the resultant action taken are described below:

System crash: A thunderstorm caused a power cut that prevented the air conditioning working, thus the computers overheated and stopped; all affected jobs were restarted.

Science error: A large number of the results failed because the incorrect charge was specified; a bug was found in the code that created the input file which resulted in all molecules being submitted with an overall charge of zero. The code was corrected and all the jobs were resubmitted.

Unsuitable data: Open shell systems were being submitted for calculation to a protocol for closed shell systems only; a filter was created to prevent open shell systems being submitted.

Program crash: Partial log files were produced for some results, the cause of this remains unknown; the affected jobs were resubmitted.

Further refinements to the protocol resulted from analysis of the parsed data. To increase the number of molecules that can be calculated, it is desirable to reduce the calculation time for each one. The most obvious way to do this for molecules that contain symmetry is to calculate only the symmetry-unique atoms. Unfortunately this can lead to problems:

“When the point group contains a 3-fold or higher rotation axis, the degenerate moments of inertia often cause problems choosing correct symmetry unique axes, in which case you must use COORD=UNIQUE rather than Z-matrices. Warning: The reorientation into principal axes is done only for atomic coordinates, and is not applied to the axis dependent data in the following groups: $VEC, $HESS, $GRAD, $DIPDR, $VIB, nor Cartesian coords of effective fragments in $EFRAG. COORD=UNIQUE avoids reorientation, and thus is the safest way to read these.”
[161]

Table 5.1: Comparing the effects of not implementing and implementing internal coordinates (Z-matrix) on the calculation time. The statistics are derived from the successfully completed geometry optimisations of 685 molecules containing between three and eight non-hydrogen atoms.

                                    no Z-matrix    Z-matrix
Total time (s)                      12090251       8124016
Mean time per molecule (s)          17650          11860
Mean time per basis function (s)    154            104
Mean time per non-H atom (s)        2752           1849

It was felt that exploiting symmetry in this way would not be a worthwhile time-saving approach, largely because it would require many man-hours to create the new protocol and because the resultant protocol would be less general. Other methods are available to reduce the calculation time (lowering the level of theory for instance) but in general:

“. . . geometry optimizations depend on the coordinates being used (not the starting values, but rather the type). In general the most satisfactory behavior for the least human effort comes from putting cartesian coordinates in $DATA, selecting NZVAR=3N-6, and then using $ZMAT DLC=.TRUE. AUTO=.TRUE. $END.” [162]

The creation of the internal coordinates is automated within GAMESS [163] but may fail:

“. . . in cases where the automatic coordinate generation fails, you will have to play with the NONVDW, IXZMAT, and IRZMAT keywords in the same group. In addition, you always have to set the keyword NZVAR in the $CONTRL group to 3N − 6 (or 3N − 5 in case of linear molecules) when you use the DLC option, with N being the number of atoms.” [164]

Figure 5.9: Although in general using a Z-matrix reduces the calculation time, this was not always found to be the case. Both molecules were undergoing geometry optimisation using 6-31G*/B3LYP. (Panel labels: no Z-matrix, 53 minutes and 204 minutes to compute; with Z-matrix, 16 minutes and 10078 minutes to compute.)
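The NZVAR rule quoted above (3N − 6, or 3N − 5 for a linear molecule) can be illustrated with a short sketch. This is not part of the programs used in this work; the function names is_linear and nzvar, the tolerance and the collinearity test are all assumptions made for the purpose of illustration.

```python
import math


def is_linear(coords, tol=1e-6):
    """Return True if all atoms lie on a single straight line.

    coords: list of (x, y, z) Cartesian positions in any consistent unit.
    tol: largest perpendicular distance from the line treated as zero
         (an arbitrary illustrative choice).
    """
    if len(coords) < 3:
        return True
    x0, y0, z0 = coords[0]
    # Unit vector from the first atom to the first atom not coincident with it.
    direction = None
    for x, y, z in coords[1:]:
        v = (x - x0, y - y0, z - z0)
        length = math.sqrt(sum(c * c for c in v))
        if length > tol:
            direction = tuple(c / length for c in v)
            break
    if direction is None:
        return True  # every atom coincides with the first one
    dx, dy, dz = direction
    for x, y, z in coords[1:]:
        vx, vy, vz = x - x0, y - y0, z - z0
        # |v x d| is the perpendicular distance of the atom from the line.
        cross = (vy * dz - vz * dy, vz * dx - vx * dz, vx * dy - vy * dx)
        if math.sqrt(sum(c * c for c in cross)) > tol:
            return False
    return True


def nzvar(coords, tol=1e-6):
    """Number of internal coordinates: 3N - 5 if linear, otherwise 3N - 6."""
    n = len(coords)
    if n < 2:
        raise ValueError("at least two atoms are required")
    return 3 * n - 5 if is_linear(coords, tol) else 3 * n - 6
```

A test of this kind would only matter for datasets that actually contain linear molecules; for non-linear structures the function simply returns 3N − 6.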
The programs used to create the input file contained no methods for ascertaining the linearity of a molecule, therefore NZVAR was always set to 3N − 6. In fact, no linear molecules that were suitable for calculation have been found in the dataset to date.

Although in general the calculation time was found to decrease when a Z-matrix was used (see table 5.1), this was not the case for all the molecules. Figure 5.9 shows the effects of using internal coordinates on two structural isomers. Whilst such cases may be interesting on an individual basis, from the point of view of the protocol developer they are exceptions to the general case and have little overall effect. To quote one of the developers of GAMESS:

“. . . it is possible for some molecules, some times, to work better in other coordinates. This is numerical work, and exceptions always turn up.” [162]

The possibility of a small number of automatic coordinate generation failures was deemed an acceptable trade-off for the general decrease in calculation time, thus a Z-matrix was implemented in subsequent protocols.

Figure 5.10: QQ plot for the 106 C–Cl bonds in the dataset; the data are normally distributed over the entire length with mean ∆R = 0.041 Å and s = 0.013 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

5.4 Results

The difference in bond length (∆R(GAMESS−MOPAC)) is expected to be normally distributed; this can be assessed graphically by using Quantile-Quantile plots (QQ plots). The QQ plot is a graphical technique for diagnosing differences between distributions; the normal theoretical quantiles are plotted on the x-axis and the quantiles found in the data on the y-axis. If the sample data is normally distributed it should form a straight line, the gradient of which is equal to the standard deviation of the sample and the value at x = 0 the mean. Figures 5.10 and 5.11 show cases where all the data appears to be normally distributed.
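The QQ construction described above can be sketched using only the Python standard library. This is an illustrative sketch rather than the plotting procedure used to produce the figures: the plotting positions (i + 0.5)/n and the least-squares fit are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical, observed) quantile lists for a normal QQ plot."""
    observed = sorted(sample)
    n = len(observed)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i + 0.5) / n are one common convention (assumed here).
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed


def fit_line(xs, ys):
    """Least-squares straight line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

For normally distributed differences the fitted gradient estimates the standard deviation and the intercept at x = 0 estimates the mean; points that fall well away from the fitted line are candidates for the threshold analysis.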
Normally distributed data of this sort provide little scope for further refinement of the protocol but do reinforce the belief that the ideal value is being approached. QQ plots also provide an excellent indication of the threshold values above or below which bond length differences no longer appear to behave normally (see figure 5.12). QQ plots for all bond types with more than 100 instances in the dataset are given below. The order of the bond between the atoms is irrelevant in this analysis because the difference in bond length is being considered; this is beneficial as it allows for alternative bonding patterns being applied.

Outliers can also be identified by plotting a basic x-y graph of the bond length calculated by MOPAC against the bond length calculated by GAMESS. However this becomes more difficult with larger datasets. Figure 5.13 shows the C–C bond lengths in this form; one outlier is plain but the unusual behaviour evident in figure 5.12 is no longer clear. A combination of the two methods is therefore used; once threshold values had been determined, the SVG graphing program described in section 1.12.1 was used to examine those molecules giving rise to the outliers.

Figure 5.11: QQ plot for the 324 N–N bonds in the dataset; the data are normal over the entire length with mean ∆R = −0.007 Å and s = 0.022 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Figure 5.12: QQ plot for the 9115 C–C bonds in the dataset; the data are normally distributed within ∆R = ±0.05 Å with mean ∆R = 0.005 Å and s = 0.013 Å. Outliers are apparent at ∆R ≈ 0.1 Å and ∆R < −0.05 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Figure 5.13: The x-y plot of all the C–C bonds shows a clear outlier but little further immediate information. (Axes: MOPAC bond length / Angstrom; GAMESS bond length / Angstrom.)

Figure 5.14: QQ plot for all the bonds in the dataset (16978); the data are very approximately normally distributed within ∆R = ±0.1 Å with mean ∆R = 0.002 Å and s = 0.020 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Figure 5.15: The molecule giving rise to the major outlier in figure 5.12 (left) and the amended entry now in the NCI database. (Panels: original, amended.)

5.4.1 All Bonds

Figure 5.14 shows that overall there is good agreement between the bond lengths calculated by MOPAC and by GAMESS. Outliers are present (|∆R| > 0.1 Å) but these are usually caused by the presence of heavier elements (aluminium, silicon, phosphorus and sulfur). While this graph gives an overall picture, analysis based on specific bond types is preferable.

5.4.2 C–C bonds

Figure 5.12 shows the QQ plot for all C–C bonds. The major outlier (∆R ≈ 0.1 Å) was caused by the molecule in figure 5.15. The charge and connection table for this molecule were incorrectly specified in the NCI dataset, although the entry has now been amended. The bonds giving rise to the tail of the distribution (∆R < −0.05 Å) were found to involve carbon atoms bonded to fluorine. The 29 molecules with a C–CFn fragment (n = 1, 2, 3) were examined; the ∆R(GAMESS−MOPAC) of the C–CFn bond ranged from −0.003 Å to −0.081 Å. There appears to be a correlation between the number of fluorine atoms and the change in bond length (ρ = −0.75; see figure 5.16).

Figure 5.16: The variation in ∆R(GAMESS−MOPAC) for the 29 molecules containing a C–CFn (n = 1, 2, 3) fragment. (Axes: number of fluorine atoms; ∆R(GAMESS−MOPAC) / Angstrom. Line of best fit: y = −0.0285x + 0.0106.)

5.4.3 C–N bonds

The C–N bonds show generally good normality (figure 5.17); of those below the threshold value of ∆R < −0.075 Å, one was from the molecule in figure 5.15, four from N-hydroxy-amide derivatives and two from molecules containing N–N bonds. 18 instances of N-hydroxy-amides were found in the corpus, with the ∆R(GAMESS−MOPAC) of the nitrogen–carbonyl carbon bond varying between −0.03 Å and −0.12 Å. The consistent negative sign of ∆R suggests that the discrepancies in the bond lengths are caused by systematic problems with one or both of the methods used.

Figure 5.17: QQ plot for the 3615 C–N bonds in the dataset; the data are normally distributed for ∆R > −0.075 Å with mean ∆R = −0.019 Å and s = 0.014 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

5.4.4 C–O and N–O bonds

The QQ plots of C–O and N–O bond length changes (figures 5.18 and 5.19) both show kinks indicating that there might be overlapping distributions present. This is evident in the density plots in figure 5.20. An analysis to determine the molecular features that characterise molecules giving inconsistent bond lengths between MOPAC and GAMESS was therefore undertaken.

A dataset of 282 molecules which contained a bond with a large bond length deviation§ (the bad set) and a dataset of 2302 molecules which did not (the good set) were prepared, and 146 MOE descriptors [165] were created for the molecules in the two sets. A correlation analysis was performed to indicate which variables are likely to contribute to agreement between the two programs and which to disagreement. This showed that higher carbon valence connectivity, larger hydrophobic Van der Waals surface area, larger numbers of carbon atoms, higher total negative partial charge and a higher number and surface area of hydrogen atoms correlate with the good set.
Conversely, molecules in the bad set typically contained a larger number of nitrogen atoms, larger fractional polar surface area, a large total polar surface area and high molecular mass density. Overall, many of the variables are related implicitly to the nitrogen content of the molecules (a property which was explicitly observed).

§ A large bond length deviation was taken to be a bond with |∆R| > threshold value for that bond type.

Figure 5.18: QQ plot for the 3019 C–O bonds in the dataset; the data are approximately normally distributed over the entire length with mean ∆R = 0.011 Å and s = 0.014 Å, although there is a kink which indicates the possibility of overlapping distributions (see figure 5.20). (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Figure 5.19: QQ plot for the 193 N–O bonds in the dataset; the data does not appear to be normally distributed, possibly because two or more overlapping distributions are present (see figure 5.20). Overall, mean ∆R = 0.033 Å and s = 0.023 Å. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Figure 5.20: The C–O bonds appear to come from two distinct distributions and the N–O bonds from three. (Panels: C–O bonds, N–O bonds. Axes: ∆R(GAMESS−MOPAC) / Angstrom; density.)

In order to establish which structural features are responsible for disagreement between the two datasets, circular fingerprints (the representation of molecular structures by atom neighborhoods) were employed in combination with information-gain feature selection [166]. The ten features most able to discriminate between the datasets, their relative frequencies in the datasets and the corresponding information gain are given in figure 5.21.
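The information gain of a single fingerprint feature, as used in the feature selection described above, can be sketched as follows. This is a minimal illustration and not the implementation of reference [166]: it assumes that each feature is treated as simply present or absent in a molecule, that entropies are measured in bits (base-2 logarithms), and that the class labels are membership of the bad and good sets. The function names are hypothetical.

```python
from math import log2


def entropy(count_a, count_b):
    """Shannon entropy (in bits) of a two-class split given the class counts."""
    total = count_a + count_b
    if total == 0:
        return 0.0
    h = 0.0
    for count in (count_a, count_b):
        if count:
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(bad_with, bad_total, good_with, good_total):
    """Reduction in class entropy obtained by knowing whether a feature is present.

    bad_with / good_with: molecules in each set that contain the feature.
    bad_total / good_total: sizes of the bad and good sets.
    """
    total = bad_total + good_total
    n_with = bad_with + good_with
    n_without = total - n_with
    h_before = entropy(bad_total, good_total)
    h_with = entropy(bad_with, good_with)
    h_without = entropy(bad_total - bad_with, good_total - good_with)
    return h_before - (n_with / total) * h_with - (n_without / total) * h_without
```

Under these assumptions, a fragment present in 19 of the 282 bad-set molecules (a frequency of about 0.067) and in none of the 2302 good-set molecules gives an information gain of roughly 0.024 bits. The count of 19 is itself inferred by rounding and is not stated in the text.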
The fragments selected are in good agreement with those determined via decision trees based on the MOE descriptors. Aromatic moieties are much more frequent in the bad dataset, in particular those containing nitrogen.

The identification of discrepancies in the bond lengths and the subsequent elucidation of the underlying cause, (nitrogen-containing) aromatic moieties, was performed by methods which could be entirely automated. Optimising the geometry of nitrogen, especially when the lone pair is delocalised, has been found to be problematic in the past [167]. The results presented above therefore reinforce the idea that machines can be used to validate results and furthermore that there may be undiscovered science in the legacy literature.

5.4.5 C–S bonds

The most apparent outlier (∆R ≈ −0.05 Å) was caused by trifluoromethanethiol. It was thought that this might be related to the effect seen for fluorinated carbon bond lengths but the corpus only contained one molecule with an S–C–F fragment, therefore no further analysis was possible. The remaining outliers arose from bonds in aromatic heterocycles containing N–N bonds.

5.5 Time

The prediction of calculation times and likely failure rates is vital for the integration of a program into a workflow and for the development of a suitable protocol. The run times of the GAMESS jobs have been examined and models of the data proposed. One of the initial constraints imposed in future protocols (that molecules containing more than 15 non-hydrogen atoms are not suitable for calculation) is a direct result of this work (see chapter 7) and the performance of the models is reported in section 8.6.
Figure 5.21: The ten molecular fragments most able to discriminate between the good and bad datasets. Fb is the frequency of the fragment in the bad dataset, Fg the frequency in the good dataset and Ig is the information gain. (Values shown with the ten fragments, in the order in which they appear:)

Fb = 0.067, Fg = 0.000, Ig = 0.024
Fb = 0.089, Fg = 0.004, Ig = 0.020
Fb = 0.057, Fg = 0.000, Ig = 0.024
Fb = 0.057, Fg = 0.004, Ig = 0.020
Fb = 0.074, Fg = 0.004, Ig = 0.017
Fb = 0.043, Fg = 0.000, Ig = 0.013
Fb = 0.103, Fg = 0.017, Ig = 0.013
Fb = 0.035, Fg = 0.000, Ig = 0.012
Fb = 0.053, Fg = 0.003, Ig = 0.012
Fb = 0.043, Fg = 0.001, Ig = 0.012

Figure 5.22: QQ plot for the 358 C–S bonds in the dataset; the data appears to be approximately normally distributed for ∆R > 0 Å with mean ∆R = 0.045 Å and s = 0.025 Å. There might be two overlapping distributions present (the behaviour changing at ∆R ≈ 0.075 Å). However, the decreasing density of points for ∆R > 0.1 Å suggests that these are outliers rather than a different distribution. (Axes: theoretical quantiles; ∆R(GAMESS−MOPAC) / Angstrom.)

Table 5.2: The number of calculations that failed to complete because of insufficient time, broken down by the number of non-hydrogen atoms. The statistics are based on the 685 completed jobs reported in table 5.1 and the 12 jobs that failed while producing that data.

number of non-H atoms    total jobs    number of failures    failure %
3                        2             0                     0
4                        23            1                     4.3
5                        125           3                     2.4
6                        93            3                     3.2
7                        448           5                     1.1
8                        6             0                     0
overall                  697           12                    1.7

The percentage of jobs that failed to complete because they ran out of time was found to be approximately 2% (see table 5.2) but this was expected to increase with increasing molecule size. A greater proportion of the calculations on such molecules would be expected to run for the total specified run time (a week), with the effect of increasing the reported total average time.
The analysis of the run times below is based only on those jobs that successfully completed the geometry optimisation within the time limit specified.

The time required for a single step of a calculation is predicted by the DFT equations to rise with the fourth power of the number of basis functions. However these algorithms have been rewritten with efficiency in mind and it is found that it is often the matrix manipulations which dominate (which scale as the cube of the number of basis functions). Figure 5.23 shows the behaviour of the mean step time for the calculations in this study, which scales more favourably than the expected behaviour. The exponent is likely to have been artificially lowered because of the removal of all the non-terminating calculations, although pseudo-diagonalisation of the matrices and symmetry effects are also likely to have contributed.

Figure 5.23: The mean step time for a calculation in this study scales more favourably than the expected n^3. (Axes: number of basis functions; mean step time / s. Fitted curve: y = 0.0011x^2.6753.)

It is more convenient to work with the number of atoms in a molecule than the number of basis functions because it is readily apparent and more generalisable, being independent of the basis set used. Fortunately the number of basis functions is correlated with the number of atoms (ρ = 0.81) but it is more strongly correlated (ρ = 0.95) with the number of non-hydrogen atoms (see figure 5.24).

The total time to complete a geometry optimisation is not accurately predictable by theory because of the chaotic behaviour displayed near the minimum on the potential energy surface¶. However the total time does show a correlation with the number of non-hydrogen atoms (see figure 5.25); the calculated exponent was felt to be sufficiently close to three that a cubic dependence could be used to model the data.
The use of a cubic exponent (or possibly higher) rather than the value of 2.9469 shown in figure 5.25 was also justified by recalling that the longest running jobs have been removed from the data; a cubic exponent is also the expected value for DFT calculations implementing an auxiliary basis rather than a standard coulomb integral. The equation of the line of best fit was found to be:

\sqrt[3]{t} = 3.095n + 1.5268    (5.1)

where t is the expected calculation time in seconds and n is the number of non-hydrogen atoms.

There is a variation of several orders of magnitude in the calculation time per non-hydrogen atom, which results in large standard deviations. It should be noted that the standard deviation is not constant but in fact appears to rise linearly with the mean time per non-hydrogen atom, as shown in figure 5.26. This is advantageous because in combination with equation 5.1 it allows the prediction of the standard deviation for larger molecules, as shown in table 5.3. The ability to predict run times (with some measure of the spread of these times) allows suitable protocols, which may form part of a workflow, to be created.

¶ Close to minima on potential energy surfaces the gradient with respect to all coordinates tends to zero. This often causes gradient-following algorithms to oscillate around the true minimum for an unpredictable number of steps.

Figure 5.24: The number of basis functions can be more accurately modelled as a function of non-hydrogen atoms than of the total number of atoms. (Left panel: number of basis functions against number of atoms; y = 5.0459x + 55.25, ρ = 0.81. Right panel: number of basis functions against number of non-hydrogen atoms; y = 16.315x + 10.198, ρ = 0.95.)

Figure 5.25: The total calculation time appears to be dependent on the number of non-hydrogen atoms present (ρ = 0.66), although the run times for a particular number of non-hydrogen atoms vary by several orders of magnitude. (Axes: number of non-hydrogen atoms; total time / s, logarithmic scale. Fitted curve: y = 36.107x^2.9469.)

Figure 5.26: The standard deviation of run times of jobs containing a specified number of non-hydrogen atoms can be modelled by the mean run time for those jobs (ρ = 0.98). (Axes: mean total time per non-hydrogen atom / s; standard deviation / s. Line of best fit: y = 0.6336x − 720.13.)

5.6 Conclusions

The MOPAC–GAMESS comparisons show good agreement in general and have shown that machines are capable of reproducing the results found by humans. However, comparing only two computational methods on data of inconsistent and indeterminable quality means that it is impossible to determine which is right when a consistent discrepancy is found between the two sets of results (for example the fluorinated compounds). It then becomes necessary to validate the calculated results against those that are experimentally determined and of measurable quality (see chapter 7).

The analysis of the run times has shown that they appear to be accurately modelled using simple cubic relationships. It does not appear to be possible to predict the run time for an individual molecule; the run time for a particular number of non-hydrogen atoms varies by at least an order of magnitude. However, if a sufficiently large number of calculations are run, their behaviour may be modelled by using the mean run time (for which the standard deviation also appears to be predictable).
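Equation 5.1 and the linear fit of figure 5.26 can be combined into a short sketch of how run-time predictions of the kind listed in table 5.3 might be generated. The sketch assumes that the x variable of the figure 5.26 fit is the predicted mean total time for a given number of non-hydrogen atoms, and that the upper bound is the mean plus two standard deviations; the function names are hypothetical and the fits are only meaningful within, or modestly beyond, the range of molecule sizes studied.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800 s, the time limit used in the protocol


def predicted_mean_time(n_heavy):
    """Mean total run time in seconds from equation 5.1 (cube-root fit)."""
    return (3.095 * n_heavy + 1.5268) ** 3


def predicted_std_dev(n_heavy):
    """Standard deviation in seconds from the linear fit of figure 5.26.

    Assumption: the fit is applied to the predicted mean total time.
    For very small molecules the fitted line gives negative values,
    so the result should not be trusted there.
    """
    return 0.6336 * predicted_mean_time(n_heavy) - 720.13


def upper_bound(n_heavy):
    """Approximate 95% upper bound on run time: mean plus two standard deviations."""
    return predicted_mean_time(n_heavy) + 2.0 * predicted_std_dev(n_heavy)


def largest_size_within(limit=SECONDS_PER_WEEK, max_atoms=50):
    """Largest number of non-hydrogen atoms whose upper bound fits within the limit."""
    largest = None
    for n_heavy in range(1, max_atoms + 1):
        if upper_bound(n_heavy) > limit:
            break
        largest = n_heavy
    return largest
```

With a one-week limit this sketch returns 20 non-hydrogen atoms as the largest size whose upper bound still fits, which agrees with the statement in the caption of table 5.3.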
Table 5.3: The mean run time for a calculation involving a specified number of non-hydrogen atoms may be predicted using equation 5.1 and the standard deviation calculated from this value. The 95% confidence interval is calculated by t+2s where t is the mean runtime per non-hydrogen atom and s is the standard deviation of the sample. There are 604800 seconds in a week, thus ca. 95% of calculations involving 20 non-hydrogen atoms would be expected to complete in this time.

predicted    number of      mean total    standard         upper bound
             non-H atoms    time / s      deviation / s    95% confidence
             3              400           100              1400
             4              2100          1200             4700
             5              4500          2700             9700
             6              6400          5000             17000
             7              15000         6800             27000
             8              21000         10000            40000
             9              30000         18000            56000
             10             42000         23000            76000
             11             45000         33000            100000
X            12             58000         36000            130000
X            13             73000         45000            160000
X            14             90000         56000            200000
X            15             110000        69000            250000
X            16             130000        84000            300000
X            17             160000        100000           360000
X            18             190000        120000           420000
X            19             220000        140000           500000
X            20             260000        160000           580000
X            21             290000        190000           670000
all values given to 2 significant figures

Chapter 6

X-Ray Crystallography

X-ray crystallography uses the diffraction pattern produced by bombarding a single crystal with X-rays to solve the crystal structure [168]. The diffraction pattern is recorded and then analysed or solved to reveal the nature of the crystal. This technique is widely used in chemistry and biochemistry to determine the structures of an immense variety of molecules. When single crystals are not available, related techniques such as powder diffraction or thin film X-ray diffraction coupled with lattice refinement algorithms such as Rietveld refinement [169] may be used to extract similar, though less complete, information about the nature of the crystal.

The crystallographic field also suffers from data deluge. Figure 6.1 shows the trend in the number of crystal structures published per year [170]. Currently only ca.
20% of the crystal structures determined are actually published; this figure is likely to rise in the future, which will further exacerbate the problems associated with the data deluge. These problems may be alleviated if the data could be more easily handled by automated processes, which would also allow it to be searched more precisely and easily.

Figure 6.1: Number of crystal structures published in each of the years 1965 – 2006; the data for 2006 are as yet incomplete. [Plot of number of structures against publication year.]

6.1 Determining the Structure

The spacing in the crystal lattice can be determined using Bragg's law [171]:

\[ n\lambda = 2d\sin\theta \tag{6.1} \]

where λ is the wavelength of the incident radiation, θ is the angle of incidence, d is the distance between atomic layers and n is an integer. The electrons that surround the atoms, rather than the atomic nuclei themselves, are the entities that physically interact with the incoming X-ray photons. The contribution to the diffracted X-ray intensity is larger for atoms with a greater electron density than for the lighter elements (especially hydrogen), which can make it difficult to determine the positions of the lighter elements accurately when heavy elements are present. Figure 6.2 shows the flowchart for crystal structure determination [172].

Figure 6.2: The steps involved in a crystal structure determination.

In order to solve the three-dimensional structure of a molecule, it must first be crystallized. This is because a single molecule in solution has insufficient scattering power alone, and the scattering of multiple molecules in a concentrated solution is too convoluted to yield high resolution information (although methods such as small angle X-ray scattering can be used for the determination of particle systems in terms of averaged particle sizes or shapes in such cases).
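As a numerical illustration of Bragg's law (equation 6.1), the sketch below converts between a diffraction angle and a lattice spacing. The function names are hypothetical, and the wavelength used in the example (1.5418 Å, a commonly quoted value for Cu Kα radiation) is an assumption chosen for illustration rather than a value taken from this work.

```python
import math


def d_spacing(wavelength, theta_deg, order=1):
    """Lattice spacing d from n*lambda = 2*d*sin(theta); same units as wavelength."""
    return order * wavelength / (2.0 * math.sin(math.radians(theta_deg)))


def bragg_angle(wavelength, d, order=1):
    """Angle theta in degrees at which planes of spacing d diffract."""
    ratio = order * wavelength / (2.0 * d)
    if ratio > 1.0:
        # sin(theta) cannot exceed 1: the spacing is too small for this wavelength.
        raise ValueError("no diffraction possible for this spacing and wavelength")
    return math.degrees(math.asin(ratio))


if __name__ == "__main__":
    cu_k_alpha = 1.5418  # angstroms (assumed example wavelength)
    print(d_spacing(cu_k_alpha, 30.0))
    print(bragg_angle(cu_k_alpha, 2.0))
```

The `ValueError` branch reflects the physical limit that planes closer together than half the wavelength cannot satisfy the Bragg condition for first-order diffraction.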
A crystal can be considered to be an (effectively) infinite repeating array of the molecule of interest. The Laue conditions and Bragg's law show that diffracted X-rays that are in phase reinforce each other through constructive interference, so that the diffraction pattern becomes detectable [173]. The geometric conditions where diffraction occurs can be visualised using Ewald's sphere [174].

Once prepared, the crystals are harvested and then mounted. Several methods of mounting exist: it is possible to hold the crystal in a thin glass tube using grease, or to use superglue or epoxy resin to hold the crystal to a glass fibre. A more recent alternative is to use a drop of oil and liquid nitrogen to fix the crystal to the fibre. By cooling crystals, the radiation damage incurred during data collection is reduced and thermal motion within the crystal is decreased, giving rise to better diffraction limits and higher quality data.

Crystals are then mounted on a diffractometer coupled with a machine that emits a beam of X-rays. The X-rays are diffracted by their interaction with the electrons in the crystal, and the pattern of diffraction is recorded on film or, more recently, on charge-coupled device detectors and scanned into a computer. Successive images are recorded as a crystal is rotated within the X-ray beam.

Before the advent of cryocooling, data was usually collected at room temperature. Increased radiation damage to the crystal meant that sometimes several crystals had to be used to obtain a single dataset. Cryocooling has reduced this problem. Moreover, instead of collecting the data spots one at a time, many modern machines use an array of X-ray detectors to collect data over a large range of angles at once.

The data collected from a diffraction experiment is a reciprocal space representation of the crystal lattice. The position of each diffraction spot is governed by the size and shape of the unit cell, and the inherent symmetry within the crystal.
The intensity of each diffraction spot is recorded, and is proportional to the square of the structure factor amplitude. The structure factor is a complex number containing information relating to both the amplitude and phase of a wave. In order to obtain an interpretable electron density map, phase estimates must be obtained (an electron density map allows a crystallographer to build a starting model of the molecule). This is known as the phase problem, and can be solved in a variety of ways:

• molecular replacement: if a structure exists of a related protein, it can be used as a search model in molecular replacement to determine the orientation and position of the molecules within the unit cell. The phases obtained this way can be used to generate electron density maps.

• heavy atom methods: if high molecular weight atoms (not usually found in proteins) can be soaked into the crystal, direct methods or Patterson-space methods can be used to determine their location and to obtain initial phase estimates.

• ab initio phasing: if high resolution data exists (with accuracy better than 1.6 Å) direct methods can be used to obtain phase information.

Having obtained initial phases, an initial model (the hypothesis) can be built. The Cartesian coordinates of atoms and their respective Debye-Waller factors (accounting for the thermal motion of the atom) can then be refined to best fit the observed diffraction data. This generates a new (and hopefully more accurate) set of phases and a new electron density map is generated. The model is then revised and updated by the crystallographer and a further round of refinement is carried out. This continues until the correlation between the diffraction data and the model is maximised.

6.2 Derived Results

The primary numerical results of a structure determination are the parameters obtained in the least-squares refinement.
Usually these consist of:

• three positional coordinates for each atom

• a number (often one, for isotropic, or six, for anisotropic models) of temperature factors, thermal parameters or displacement parameters for each atom

• other parameters for effects such as extinction and overall scaling of the observed and calculated datasets

These parameters may be refined as supposedly independent values, or there may be various constraints and/or restraints applied to their refinement. The refinement supplies not only values for the parameters but also an estimated standard deviation (esd) for each one. The structure determination also gives the atom types.

The primary results are usually not the main object of the structure determination experiment; rather, it is the molecular geometry and possibly intermolecular interactions which are of interest. Therefore secondary results must be derived (such as bond lengths, bond angles and torsion angles). Each of these derived values should also have an associated esd.

It has been suggested that the thermal parameters describe both the time-averaged temperature-dependent movement of the atoms about their mean equilibrium positions and their random distribution over different sets of equilibrium positions from one unit cell to another. These parameters may therefore be called atomic displacement parameters. The interpretation and analysis of these atomic displacement parameters is often not undertaken.

6.2.1 β, B and U Parameters

Atomic displacements are described by a variety of different parameters, all of which are mathematically related. Thus, for an isotropic model, a single parameter is used, but this may be called B or U. These are related by

\[ f'(\theta) = f(\theta)\exp\left(-\frac{B\sin^2\theta}{\lambda^2}\right) = f(\theta)\exp\left(-\frac{8\pi^2 U\sin^2\theta}{\lambda^2}\right) \tag{6.2} \]

where f(θ) is the scattering factor for a stationary atom and f′(θ) the scattering factor for the vibrating atom.
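The relationship in equation 6.2 implies B = 8π²U, which can be checked numerically. The following is a minimal sketch with hypothetical function names; the numerical inputs in the example are arbitrary illustrative values, not data from this work.

```python
import math


def b_from_u(u_iso):
    """Convert an isotropic U (angstrom^2) to B (angstrom^2) via B = 8*pi^2*U."""
    return 8.0 * math.pi ** 2 * u_iso


def u_from_b(b_iso):
    """Convert an isotropic B (angstrom^2) back to U (angstrom^2)."""
    return b_iso / (8.0 * math.pi ** 2)


def vibrating_scattering_factor(f_stationary, u_iso, theta_deg, wavelength):
    """Apply equation 6.2: f' = f * exp(-B * sin^2(theta) / lambda^2)."""
    s = math.sin(math.radians(theta_deg)) / wavelength
    return f_stationary * math.exp(-b_from_u(u_iso) * s ** 2)


if __name__ == "__main__":
    # Arbitrary example: f = 6.0, U = 0.05 angstrom^2, theta = 20 degrees,
    # wavelength = 1.5418 angstroms (assumed values for illustration only).
    print(b_from_u(0.05))
    print(vibrating_scattering_factor(6.0, 0.05, 20.0, 1.5418))
```

At θ = 0 the exponential term is 1, so the vibrating and stationary scattering factors coincide; the attenuation grows with scattering angle.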
B and U both have units of Å² and U represents a mean-square amplitude of vibration. For an anisotropic model, six parameters are used and the exponent −B sin²θ/λ² becomes

\[ -\left(\beta_{11}h^2 + \beta_{22}k^2 + \beta_{33}l^2 + 2\beta_{23}kl + 2\beta_{13}hl + 2\beta_{12}hk\right) \tag{6.3} \]

although other forms are also used. These parameters are often represented graphically as thermal ellipsoids. This is possible only if certain inequality relationships among the six parameters are satisfied; otherwise they are said to be non-positive definite and the corresponding ellipsoid does not have three real principal axes. Such a situation may indicate a real problem in the structural model (e.g. a disordered atom), or it may be due to imprecise U^ij parameters (high esds), in which case the anisotropic model for this atom is perhaps not justified.

The anisotropic displacement parameters are not published in most chemical journals and their significance is difficult to assess immediately. A simpler parameter to assess atomic motions is the equivalent isotropic parameter Ueq. There are different definitions of Ueq and some appear to be inappropriate [175]. One version of Ueq is that corresponding to a sphere of volume equal to the ellipsoid representing, on the same probability scale, the anisotropic parameters. A commonly used definition is

\[ U_{eq} = \tfrac{1}{3}\,(\text{trace of the orthogonalised } U^{ij} \text{ matrix}) \]

although its meaning is not entirely clear. The effect of temperature on Ueq is examined in section 8.5.

6.2.2 Libration

Thermal motion is known to produce an apparent shrinkage in molecular dimensions. Figure 6.3 shows how this shrinkage occurs for riding motion but is extensible to all libration (rotary oscillation). If a molecule has only small internal molecular vibrations compared with its movement as a whole about its mean position in the crystal structure, then it can be treated approximately as a rigid body.
In this case the atomic displacements are not independent and the U^ij parameters of the atoms must be consistent with the overall molecular motion.

Figure 6.3: A light atom bonded to a heavy one will oscillate about its equilibrium position, causing the electron density to be spread out (black lines). The electron density is modelled as an ellipsoid (in the three-dimensional case) with the atom at the centre of this ellipsoid (red). This causes the observed bond length, r1, to be shorter than the true bond length, r2.

The molecular motion can be described by a combination of three tensors (3 × 3 matrices): T, L and S. T describes the translation and is symmetric, L describes libration (also symmetric) and S describes a screw motion which is not symmetric. T and L both have 6 independent parameters and S has 8 because there is a constraint on the three diagonal terms. From the rigid body parameters, corrections can be calculated for bond lengths within the molecule; these depend only on the libration tensor. Although many molecules are not rigid, certain groups of atoms within them may be, and it is possible to treat these as rigid bodies and to calculate the subsequent bond length corrections. The Hirshfeld test [176] is used to determine whether a molecule or part of a molecule can be regarded as a rigid body. The effect of rigid body corrections and the agreement with the calculated bond length is shown in section 8.4.5.

6.2.3 Minor Conformations and Incorrectly Assigned Atoms

It is sometimes hard to distinguish whether an atom has a high thermal motion or is statically disordered (minor conformations are present). The presence of disorder in a structure, unless it is very simple and can be well modelled, reduces to some extent the overall precision of the structure, not just that of the particular atoms affected. The lowest reported site occupancy in the crystallographic dataset used in this thesis was 0.0107.
It is therefore possible that there were some minor conformations (< 1% frequency) which were not reported.

It is possible, though unlikely, that a completely incorrect structure is identified (essentially the wrong chemical compound is found) when the crystal structure is solved. An example of wrongly assigned atom types being reported was identified by von Schnering and Vu in 1986 [177]. They questioned the reported structure of '[ClF6][CuF4]' which in reality was probably [Cu(H2O)4][SiF6]. The mistake was likely caused by the similar scattering powers of Si and Cl, and of O and F. Subsequent improvements in crystallography mean that such incorrect assignments are now extremely unlikely, especially when only light atoms are present in the crystal.

6.2.4 Atomic Scattering Factors

The scattering factors commonly used are reasonably accurate representations of the scattering power of individual, isolated atoms at rest. This is commonly referred to as the independent atom model (IAM). The scattering factors have spherical symmetry and probably their greatest limitation is the lack of allowance for distortion of this spherical electron density when atoms are placed together and bonded to each other. This is one reason why bonds to hydrogen atoms are found to be systematically shortened in X-ray diffraction studies.

The invariom concept [178] provides a definition of a pseudoatom electron density that is transferable between molecules, and employs the multipole formalism introduced by Hansen and Coppens [179]. In invariom refinement, the multipole parameters are predicted by a procedure involving theoretical calculations and can be described as providing aspherical scattering factors. Hence, the number of parameters to be refined in the least-squares procedure does not increase when compared with a standard IAM refinement. In 2005 Dittrich et al.
demonstrated that the determination of molecular geometry by conventional X-ray single-crystal diffraction experiments on organic molecules can be improved by invariom modelling [180]. The lengths of bonds involving hydrogen atoms were particularly affected. The structure of L-valinol determined by the IAM, invariom refinement and quantum chemical geometry optimisation (using GAUSSIAN98, D95++(3df,3pd)) was reported in 2006 [181]. Table 6.1 shows the comparison of the bond length differences between each of the methods. The effect of using spherical rather than aspherical scattering factors for atoms (other than hydrogen) is thought to be significantly less than the effects caused by libration, minor disorder and incorrectly assigned atoms.

X-C bonds, X = C, O, N
  esd / Å     invariom                      0.0007
              IAM                           0.0012
  ∆R / Å      ∆R(invariom − IAM)            0.0010
              ∆R(invariom − GAUSSIAN98)    −0.0034

X-H bonds, X = C, O, N
  esd / Å     invariom                      0.0089
              IAM                           0.0118
  ∆R / Å      ∆R(invariom − IAM)            0.1284
              ∆R(invariom − GAUSSIAN98)    −0.0280

Table 6.1: The effect of using invariom refinement compared to the IAM and the GAUSSIAN98 geometry optimised structure. The values are based on data reported by Dittrich et al. in 2006 for L-valinol [181].

6.3 The CIF Format

As mentioned previously, X-ray crystallography, in the form of Crystallographic Information Files (CIFs), is a source of high quality, experimentally determined, 3D coordinates and connection tables for structures. The CIF format was introduced in 1991 to be used for the electronic transmission of crystallographic data between laboratories, journals and databases and was adopted by the International Union of Crystallography (IUCr) as the recommended medium for this purpose [182].
The development of a dictionary of crystallographic data items was a major feature of this work; each data item has been assigned a self-explanatory (within a 32 character limit) name for use in a CIF and the precise definition of the item is given in the appendix to the paper (now available online) [183].

The CIF is based on the Self-Defining Text Archive and Retrieval (STAR) file syntax procedure defined by Hall in 1991 [184] and subsequently defined in Backus-Naur form in 1993 [185]. The CIF format places nine further restrictions on the STAR file syntax as detailed below:

1. Lines may not exceed 80 characters

2. Data names and block codes may not exceed 32 characters. All data names are case insensitive

3. The CIF dictionary defines particular data items as number or character types

4. A data item is assumed to be a number if it starts with a digit, plus, minus or period and is not bounded by matching single quotes, double quotes or semicolons as the first character on a line

5. A number may be specified as an integer, floating-point number, or in scientific notation. When concatenated with a number in parentheses, that integer is assumed to be the esd in the final digit(s) of the number. For example: 34.5, 3.45E1, 34.5(12), 3.45E1(12) are all versions of 34.5 with and without an esd of 1.2

6. A data item is assumed to be of data type text if it extends over more than one line, i.e. it starts and ends with a semicolon as the first character of a line

7. A data item is assumed to be of data type character if it is not a number or text

8. Only one level of loop is permitted. Additional levels of repeated data must be stored as lists within a text field

9. The CIF dictionary specifies default units for CIF data items. If the data item is not stored in the default units, the units code is appended to the data name. Only those units defined in the CIF dictionary are acceptable.
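Restriction 5 above can be illustrated with a small parser for numeric values carrying an esd in parentheses. This is a minimal sketch following the convention described in the list (the esd applies to the final digits of the mantissa); the function name and regular expression are hypothetical and are not taken from any CIF library.

```python
import re
from decimal import Decimal

# Mantissa, optional exponent, optional esd in parentheses, e.g. 3.45E1(12).
_CIF_NUMBER = re.compile(
    r"^(?P<mantissa>[+-]?(?:\d+\.?\d*|\.\d+))"
    r"(?:[eE](?P<exponent>[+-]?\d+))?"
    r"(?:\((?P<esd>\d+)\))?$"
)


def parse_cif_number(text):
    """Return (value, esd) as floats; esd is None when no esd is given."""
    match = _CIF_NUMBER.match(text.strip())
    if match is None:
        raise ValueError(f"not a CIF number: {text!r}")
    mantissa = match.group("mantissa")
    exponent = int(match.group("exponent") or 0)
    # Number of digits after the decimal point in the mantissa.
    decimals = len(mantissa.partition(".")[2])
    value = Decimal(mantissa).scaleb(exponent)
    esd_digits = match.group("esd")
    if esd_digits is None:
        return float(value), None
    # The esd digits line up with the last digits of the mantissa.
    esd = Decimal(esd_digits).scaleb(exponent - decimals)
    return float(value), float(esd)


if __name__ == "__main__":
    for example in ("34.5", "3.45E1", "34.5(12)", "3.45E1(12)"):
        print(example, parse_cif_number(example))
```

Decimal arithmetic is used for the scaling so that the esd is positioned by digit count rather than by floating-point multiplication.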
The data names defined for use in a CIF are separated into components representing the internal hierarchy of data categories. For instance, data relating to atoms begins with _atom and is further categorised into those data relating to the position of the atom in the crystal (_atom_site_) and the properties of the atom type that occupies that position (_atom_type_).

The _chemical_conn_ data items specify the 2D chemical structure of the molecular species and allow a 2D chemical diagram to be constructed; a complete chemical entity must be described, so symmetry-generated atoms must be included. The connection tables of all the molecules in the CIF are thus trivially recoverable. However, whilst this facility is provided, it is not required and authors usually do not supply this data (no instances were found in the 6738 CIFs examined in this study).

6.4 Quality Indicators

There are many data items that can be used to assess and validate a crystallographic structure. UNIMOL [186] implements a variety of checks including searching for voids and consistency of bond lengths. The IUCr host a free service, checkCIF [187], that provides a report to authors of possible mistakes in the CIF submitted. There are three versions of checkCIF: basic structural check, full structural check and full publication check. The full structural check performs 321 tests to determine the self-consistency of the data and that it falls within expected parameters [188]. Data that do not pass a test are reported as alerts of differing levels (A to D, most to least severe). Whilst CIFs with level A alerts are published, the author has to justify that the reason for the alert is valid; in general authors are advised to resolve as many as possible of the problems giving rise to alerts before submission.
The automatic checkCIF validation helps to ensure that there are no errors present in the published data (and that it is at least self-consistent) but this still allows large variation in the quality of the data. Fortunately the CIF contains data items that can be used as a measure of the data quality, the most obvious being the R-factor. Various methods are used to calculate the R-factor, but all are reported under the _refine_ls_ data names (_refine_ data names describe the structure refinement parameters and _ls_ data names refer to the least-squares method). The conventional R-factor is _refine_ls_R_factor_gt (which succeeded _refine_ls_R_factor_obs). This is the residual factor for reflection data classified as observed (the reflection is sufficiently intense) and is calculated by

\[ R = \frac{\sum \bigl|\,|F_m| - |F_c|\,\bigr|}{\sum |F_m|} \tag{6.4} \]

where F_m and F_c are the measured and calculated structure factors. The calculated structure factor for atoms with diffracting power f situated at different points in the unit cell, as specified by the fractional coordinates x, y and z, for the reflecting plane hkl is given by

\[ F_c^2 = A^2 + B^2 \tag{6.5} \]

where

\[ A = \sum f\cos 2\pi(hx + ky + lz) \tag{6.6} \]

\[ B = \sum f\sin 2\pi(hx + ky + lz) \tag{6.7} \]

Atoms in crystals vibrate at ordinary temperatures with frequencies very much lower than those of X-rays; at any one instant, some atoms are displaced from their mean positions in one direction while those in another part of the crystal are displaced in another direction. Consequently, diffracted X-rays which would be exactly in phase if the atoms were at rest are actually not quite in phase, and the intensity of the diffracted beam is thus lower than it would be if all atoms were at rest. The thermal displacements of atoms in crystals with large plane spacings (those giving reflections at small angles) are small fractions of the plane spacing, hence the intensities are largely unaffected. However, crystals with closely spaced planes are more affected.
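Equations 6.4 to 6.7 can be illustrated with a short calculation. This is a minimal sketch with hypothetical function names; it treats the scattering power f of each atom as a fixed number (ignoring its dependence on scattering angle and any thermal attenuation), and the input values in the example are invented for illustration.

```python
import math


def structure_factor_amplitude(atoms, hkl):
    """|F_c| for one reflection, from F_c^2 = A^2 + B^2 (equations 6.5 to 6.7).

    atoms is a list of (f, x, y, z) with fractional coordinates x, y, z.
    """
    h, k, l = hkl
    a_sum = 0.0
    b_sum = 0.0
    for f, x, y, z in atoms:
        phase = 2.0 * math.pi * (h * x + k * y + l * z)
        a_sum += f * math.cos(phase)
        b_sum += f * math.sin(phase)
    return math.hypot(a_sum, b_sum)


def r_factor(measured, calculated):
    """Conventional R-factor (equation 6.4) over paired structure factor amplitudes."""
    if len(measured) != len(calculated):
        raise ValueError("measured and calculated lists must have equal length")
    numerator = sum(abs(abs(fm) - abs(fc)) for fm, fc in zip(measured, calculated))
    denominator = sum(abs(fm) for fm in measured)
    return numerator / denominator


if __name__ == "__main__":
    # Two identical atoms half a cell apart along x (invented example).
    atoms = [(1.0, 0.0, 0.0, 0.0), (1.0, 0.5, 0.0, 0.0)]
    print(structure_factor_amplitude(atoms, (1, 0, 0)))
    print(structure_factor_amplitude(atoms, (2, 0, 0)))
    print(r_factor([10.0, 20.0], [9.0, 22.0]))
```

In the example the two atoms scatter exactly out of phase for the (1 0 0) reflection and exactly in phase for (2 0 0), which shows how atomic positions determine the calculated amplitudes that enter the R-factor.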
The effect of thermal displacement on the intensities also increases with rising temperature; lower temperature experiments are therefore likely to be of higher quality.

Neutrons do not interact with matter in the same manner as X-rays. X-rays interact primarily with the electron cloud surrounding each atom. Neutrons interact directly with the nucleus of the atom, and the contribution to the diffracted intensity is different for each isotope; for example, hydrogen and deuterium are distinguishable. It is also often the case that light atoms contribute strongly to the diffracted intensity even in the presence of heavy atoms. Non-magnetic neutron diffraction is directly sensitive to the positions of the nuclei of the atoms. Neutron diffraction is therefore likely to provide extremely high quality structures with the positions of all the atoms well determined. Unfortunately, such experiments are not performed with the same regularity as the more traditional X-ray based studies.

Chapter 7

Creating a Workflow

A workflow may be described as a collection of steps and data that define the paths taken to complete a task. Such a workflow might contain processes such as displaying content to users, collecting information from users or computer systems, performing calculations and sending messages to external computer systems. The tasks under consideration in this chapter are those of calculating the minimised geometries for the connection tables (CTs) contained in CIFs and organising these results in a suitable way for analysis to take place.

Previously in this thesis, phrases such as 'the molecules were submitted for calculation' and 'a protocol was created', together with workflow diagrams such as figure 5.1, have been used to indicate what processes were performed. The amount of work necessary to create these processes and to link them together is often not appreciated. This chapter examines in more detail these processes and the work necessary for their creation.
A high-level overview of a workflow is very simple to construct (figure 7.1 shows a typical example). Such a high-level (or coarse-grained) view is unlikely to reflect what is actually required to complete the specified task. Whilst much effort is typically put into ensuring that the calculations performed are valid and useful (these parameters are incorporated in the protocol, see section 7.9) it is vital that the infrastructure is in place to allow such a protocol to be adopted.

Figure 7.1: A simple high-level overview of a workflow.

7.1 Existing Workflow Technology

Murray-Rust et al. have examined various methods for the creation and development of workflow technology for chemistry [189, 190]. Their approach has built on the myGrid Taverna project [191] and focuses on creating WebServices (WS) for each of the processes. A WS provides a standardised programmatic interface for a particular piece of functionality. This has the advantage that large libraries or platform-dependent code need not be distributed. Instead the processes that they perform are encapsulated within a single WS.

A finer-grained workflow for the required tasks is shown in figure 7.2. Whilst this is more realistic than that shown in figure 7.1, none of the processes or repositories shown were immediately available in usable states. Previous experience showed that it was unrealistic to attempt to control the entire process with a single workflow; a modular approach was therefore adopted. This required the creation of lightweight and independent modules for each of the processes. The modules created would then be encapsulated in a WS and distributed for re-use. The glue between each of the modules is XML/CML and the JUMBO toolkit, which allows the output of one activity to be the input for the next, as outlined by Zhang et al. in 2004 [192].
To develop the required modules (which are often entire workflows themselves) each of the processes and repositories must be considered at a lower level.

Figure 7.2: A workflow describing the ideal situation for calculating properties of a collection of CIFs and analysing the results. [Diagram components: CIF repository, download, create input, run (MOPAC, GAMESS...), results repository, retrieve results, analysis, query; locations: web based, web service, cluster, local machine.]

7.2 CIF Repositories

The Cambridge Crystallographic Data Centre (CCDC) was conceived in 1965 as a non-profit, charitable institution whose objectives are the general advancement and promotion of the science of chemistry and crystallography for the public benefit. As part of this, the CCDC compile the Cambridge Structural Database (CSD), which is the world repository of small-molecule crystal structures.

The CSD, whilst not web-based, should therefore provide an ideal repository of CIFs. Unfortunately the conditions of use of the CIFs provided from the CCDC CIF archive state that

    Individual CIF datasets are provided freely by the CCDC on the understanding that they are used for bona fide research purposes only. They may contain copyright material of the CCDC or of third parties, and may not be copied or further disseminated in any form, whether machine-readable or not, except for the purpose of generating routine backup copies on your local computer system. [193]

The intention is for all the results of this project to be openly available (and repeatable), thus it was determined that the CCDC could not be used as a CIF resource.

The electronic supplementary data provided for an article in a chemical journal may contain a CIF. This data can (often) be accessed without subscribing to the journal, is electronically searchable and can be reused. It is therefore ideal for this project except that there is no longer a single access point or access protocol.
Figure 7.3: Overview of how journals organise supplementary data. [Diagram levels: list of journals and journals by year; list of papers from this journal by year; papers (each with an experimental section); supplementary data for this paper (CIF).]

To create a repository of CIFs for calculation it was necessary to determine how to retrieve the CIFs from a journal in a way that would allow automated electronic download. Figure 7.3 shows how the supplementary data is typically organised by a journal. Day described a workflow for this process in 2005 [189]. This workflow was used to provide a repository of CIFs which were provided as supplementary data for articles published in Acta Crystallographica, Section E between January 2001 and September 2005.
These were attractive because only a single experiment is reported in each CIF (which simplifies the parsing process) and because all the CIFs submitted to this journal must contain certain data items as specified by the IUCr [194].

7.3 Download

The download process shown in figure 7.2 suggests that a molecule's CT is immediately retrievable from the CIF, or at least that it is possible to determine whether or not a particular CIF is useful before downloading it∗. The nature of the download process thus depends largely on the availability of an online CIF repository which could be queried for structures fitting the chosen protocol. Ideally only those CIFs containing structures that will be calculated would be downloaded. Following the previous discussion this was clearly impossible. This section thus focuses on the creation of a suitable data structure to store the set of downloaded CIFs and how the CTs contained in the CIFs were made searchable.

∗ In this case useful means that the CIF contains the CT for a structure that would be appropriate for calculation.

The downloaded CIFs were provided in the structure shown in figure 7.4. Each of the CIFs is uniquely identified within the journal by a two letter, four number combination. Although unique, the identifier does not contain any information about the CIF, but it is believed that each reviewer has their own two letter code and the number is simply an increment. Thus it should be possible to determine which structures were cleared for publication by which reviewer.

Figure 7.4: The data structure for the downloaded CIFs, the general case (left) and an example of the values for this work (right).

Figure 7.5: The hierarchical data structure adopted for the downloaded CIFs; the .cif.cml.xml contains the complete CIF parsed to CML and the .cif.cml1.xml contains only the molecules from the CIF for display purposes.
Although CIF is based on the STAR syntax that was designed to be machine-understandable, the CIFs must be converted to CML to be integrated into the rest of the workflow and processed with the available tools. A hierarchical structure was chosen to store the data pertaining to each CIF; thus each CIF was moved to a subdirectory of the same name. This allows all the data relating to a particular CIF to be stored under a separate subdirectory.

The CT and the 3D coordinates of the molecules in the CIF are obtained by the following processing sequence, based on that published by Murray-Rust et al. in 2004 [195]:

1. Read the CIF into an XMLDOM [52] (achieved using a CIF2CML library [196])

2. Discard minor disordered components

3. Convert fractional coordinates to Cartesian

4. Join bonds using 'reasonable' covalent radii

5. Apply symmetry operations to generate the minimum number of molecular fragments

6. Generate CT(s) for the molecule(s)

7. Check against chemical formula (often not given)

8. Serialise the result as CML (.cif.cml.xml)

The process is not foolproof, as CIFs do not always include correctly specified molecular charges and any disorder may be difficult to interpret. The serialised CML produced should contain all the information specified in the CIF, but for display purposes a second file was created (.cif.cml1.xml) which contained only the molecules. The results of this process with the new file structure are shown in figure 7.5.

CIFs may contain multiple molecules in the unit cell (whether they are repeated instances of the same CT or different CTs). The workflow is designed to deal with each molecule individually, thus subdirectories were created for each of the molecules in the CIF and the relevant file created (.inp.cml.xml). These files contained the CT and 3D coordinates of the relevant molecule together with other data extracted from the CIF used to determine if they were suitable for calculation.
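Steps 3 and 4 of the processing sequence above (conversion to Cartesian coordinates and bond joining) can be sketched as follows. This is a minimal illustration and not the CIF2CML or JUMBO implementation: the function names are hypothetical, the covalent radii and the 0.4 Å tolerance are assumed illustrative values, and symmetry-generated atoms (step 5) are not handled.

```python
import math

# Illustrative covalent radii in angstroms (assumed values, not from the thesis).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}


def fractional_to_cartesian(frac, cell):
    """Convert fractional (u, v, w) to Cartesian (x, y, z) in angstroms.

    cell is (a, b, c, alpha, beta, gamma) with angles in degrees; the a axis is
    placed along x and the b axis in the xy plane (a common convention).
    """
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(t)) for t in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Cell volume divided by a*b*c.
    vol = math.sqrt(1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg)
    u, v, w = frac
    x = a * u + b * cg * v + c * cb * w
    y = b * sg * v + c * (ca - cb * cg) / sg * w
    z = c * vol / sg * w
    return (x, y, z)


def find_bonds(atoms, tolerance=0.4):
    """Return index pairs closer than the sum of covalent radii plus a tolerance.

    atoms is a list of (element_symbol, (x, y, z)) in Cartesian angstroms.
    """
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (el_i, pos_i), (el_j, pos_j) = atoms[i], atoms[j]
            limit = COVALENT_RADII[el_i] + COVALENT_RADII[el_j] + tolerance
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds


if __name__ == "__main__":
    cell = (10.0, 10.0, 10.0, 90.0, 90.0, 90.0)  # invented cubic cell
    atoms = [
        ("C", fractional_to_cartesian((0.0, 0.0, 0.0), cell)),
        ("O", fractional_to_cartesian((0.143, 0.0, 0.0), cell)),
        ("H", fractional_to_cartesian((0.5, 0.5, 0.5), cell)),
    ]
    print(find_bonds(atoms))
```

The pairwise loop scales quadratically with the number of atoms, which is acceptable for small molecules; the choice of radii and tolerance determines which contacts are treated as bonds, which is why the text describes the radii as 'reasonable' rather than exact.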
The .inp.cml.xml files can be easily recovered by searching for all files of that filetype under the actae directory. The final data structure adopted is shown in figure 7.6.

Figure 7.6: The hierarchical data structure adopted for the downloaded CIFs; the moleculeN.inp.cml.xml contains all the data relating to the Nth molecule found in the CIFs in CML. In this case molecule1 was suitable for calculation (so the .inp file was created) whilst molecule2 was not.

A traditional database has only one instance of a particular piece of information with many links to it. This is a normalised approach, whereas using the file system as described above promotes a denormalised approach. A copy of each piece of information is present wherever it is required (for example the atom site occupancies are in the CIF, the .cif.cml.xml and the .inp.cml.xml files), thus no links are required, which removes a level of complication. The only relational idea in the file system representation employed is that everything in a particular subdirectory is a finer-grained representation of what is in the parent directory.

7.4 Create Input

Once a structure has been determined to be suitable for calculation, an input file must be generated for submission to the compute node or program. This file should contain the unique ID of the molecule. An ant [197] script was used to combine the coordinates from the .inp.cml.xml file with the job controls (determined by the protocol), creating the input file (.inp). The input files are then copied to submission nodes for computation.

To speed up the input creation process, the files for submission were split into two sets and the input creation script run on both concurrently. These files were submitted for calculation and many failed very rapidly. A brief examination of the log files showed that the correct input had not been created; for example, certain keywords were missing.
Further analysis revealed that some of the input files contained the coordinates for a different molecule. The entire set of jobs was cancelled. The input file creation process was repeated and the files examined before submission to the compute nodes. The same types of mistake were found in these files, although not always in the same files that contained errors the first time. The cause of the errors was found to be that the script used a temporary file to hold the data and the process was not designed to be multi-threaded, so concurrent runs could overwrite each other's data. When run as a serial process the creation of input files succeeded.

It is desirable that a standard format to hold input data is implemented (e.g. CMLComp) with all the concepts linked to dictionaries [139]; these could either be read directly by the program or automatically transformed into program-specific input files as described in chapter 4.

7.5 Run

Once the input has been created, running the computation requires moving the requisite jobs to a suitable computation node and starting the job. This should be a trivial process but actually proved to be much more complex than expected. The calculations were to be performed on three separate clusters: Kellogg, Corona and Vendian.

7.5.1 The Clusters

Cluster is a widely-used term meaning independent computers combined into a unified system through software and networking. At the most fundamental level, when two or more computers are used together to solve a problem, it is considered a cluster.

Kellogg is the home-built Beowulf cluster in the UCC, so-called because it runs only serial jobs. The cluster consists of seven single-CPU nodes, each with a 2.53 GHz Pentium 4 processor and 1 GB memory. It is intended primarily for running long serial (single CPU) jobs; there is no parallel environment set up.

Corona is the name for the IBM x-series cluster at the UCC, intended primarily for running parallel jobs.
The submission node of Corona is an IBM x-series 345, a 2U, dual-Pentium-Xeon machine with 4 GB of memory. The sixteen compute nodes are IBM x-series 335s, 2.4 GHz dual-Pentium-Xeon machines with 4 GB memory. Although the cluster was designed for parallel jobs, serial jobs may be run, although they have very low priority.

The Condor pool, for which Vendian is the central manager, consists of this central manager and 16 iPaqs. All the machines are set up as dedicated Condor hosts and are not used for interactive work. The central manager is a Dell OptiPlex GX150 (1 GHz Pentium III, 512 MB memory) and the iPaqs have 500 MHz Pentium III processors and 512 MB memory.

The jobs were apportioned between the clusters in the ratio 1/5 : 2/5 : 2/5 respectively. The numbers of nodes stated above represent the maximum available on each cluster; the scheduler limits a user's allocation to allow other users computer time if required. Based on the data presented in table 5.3, the geometry optimisation of 1000 molecules containing 15 non-hydrogen atoms would require a total run time of 1.1 × 10^8 seconds (3.5 years) to complete on average, which would take approximately 5 weeks of real time on the 39 available nodes. The actual real time required was over a year.

Unfortunately, the Corona cluster suffered an unrecoverable crash not long after this work started, resulting in the loss of 16 of the available nodes. The data, which was backed up, was recovered, following which the uncalculated jobs were reapportioned between the remaining available nodes and resubmitted.

The Vendian Condor pool was a recent addition to the UCC computing facilities following the success of the pool set up on the teaching machines (which had to be uninstalled). It is effectively a dedicated computing cluster which uses Condor as the scheduler and networking mechanism.
The infrastructure to provide a stable environment was still being put in place whilst this work was being carried out, which resulted in a very high percentage (>99%) of the submitted calculations failing. This was found to be caused by corruption of the GAMESS executable binary file whilst it was being sent to the compute node, although this remained undiagnosed for some months.

Alternatives to the Vendian cluster were available in the form of the University-wide Condor pool set up as part of the CamGRID project [198]. Unfortunately this system, which uses Condor in the originally intended cycle-scavenging way, does not provide guaranteed uninterrupted runtime and is only available between the hours of 10pm and 6am. This is because the University Computing Service will only allow the Linux-based version of Condor to be used (because of security issues in the Windows-based version) and the computers which make up CamGRID all run Windows but have a dual-boot system. The computers are automatically remotely rebooted into the Linux operating system at 10pm (if they are not currently claimed by a physical user), when the Condor service becomes available, and then rebooted into the Windows operating system at 6am. Of course, at any time between 10pm and 6am a user can reboot a machine into Windows if desired.

The lack of guaranteed uninterrupted runtime was not an insuperable problem because GAMESS has a checkpointing facility which allows the current state to be saved and the calculation restarted from that point. However, the remote rebooting of the machines does not allow the system to finish what it is currently doing, and therefore does not allow the Condor system to retrieve the checkpoint data. Overall it was felt that CamGRID did not provide a suitable platform for these calculations; therefore any uncalculated jobs (effectively all of them) were submitted to the only remaining resource, the Kellogg cluster.
7.5.2 Schedulers

When a job is submitted for calculation on a cluster, it is not run directly but enters a scheduler, which performs the process of moving the job (together with any other specified files) to a particular node and starting the calculation. The scheduler may also implement a queuing system so that all the users of a particular cluster are treated equally badly [199].

The scheduler provides both advantages and disadvantages: all users are treated fairly and the submission of a large number of calculations en masse is possible, but it is a further level of complexity which can crash. In cases where the scheduler crashed, all the completed jobs were removed from the cluster for parsing and analysis and those which were either unsubmitted or incomplete were re-entered into the queue.

[Flowchart. Decisions: Is job submitted? Is job finished? Terminated correctly? Actions: resubmit; get next job; move job to local machine.]
Figure 7.7: The decision diagram for retrieving jobs. Ovals represent decisions, each of which requires a separate method; rectangles represent simple instructions or other workflows. It should be noted that this diagram is simplified.

7.6 Retrieve Results

This process is actually a separate workflow in its own right. Figure 7.7 shows the steps required for this sub-workflow. The rules governing the three decisions are shown below.

Is Job Submitted? Although the submission script adds all the jobs to the queue, the actual calculation may not have started. The submission for a job consists of two files:

Input File (.inp): This consists of the starting coordinates of the atoms and control information for GAMESS.

Submission File (.sub): This consists of a set of instructions for the node running the job, such as where the input file is located, where the GAMESS program resides and what to do with the various outputs.
If the job is not in the queue but these two files (and only these two files) are present, then the job has not been submitted and should be added to the queue. If the job is in the queue then it simply has not begun to execute and no further action should be taken.

Is Job Finished? Once a job has begun executing, four further files are created (the .inp and .sub files must already be present for execution to occur):

Log file (.log): The information that would usually go to 'System.out', designed to be slightly human-understandable.

Data file (.dat): Essentially a more compressed version of the data that is sent to the log file. It is less human-understandable and also more difficult to machine-parse completely unambiguously, because the start and end points are less well defined. This file contains only the data, not failure messages.

Queue manager log file (.pbs.log): An empty file, as all the information is sent to the error file.

Queue manager error file (.pbs.err): A complete list of where GAMESS stores particular files etc.

Any job that has only these six files present is treated as finished and is suitable for further processing. However, GAMESS also creates a series of F files. These all have file types of the form .F\d\d?

The presence of these F files indicates either that the job is still running or that the node has crashed during the execution of the job. Unfortunately such information is not present in the .pbs.err file. It is thus necessary to interrogate the list of currently executing jobs to determine if this job is present. If it is, then no further action should be taken; if it is not, then the job has crashed during execution and should either be immediately resubmitted automatically or moved to another location ready for manual resubmission. Although there is a desire to automate the entire process, it may not be sensible to have automatic resubmission of jobs of this sort because an analysis of the failed job might lead to useful knowledge.
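The two decision rules above amount to classifying a job from the set of files present and from the scheduler's job lists. The following is a minimal sketch under stated assumptions, not the code used in this work: it assumes every file is named by appending its suffix to the job ID, and that the caller supplies the IDs of waiting and executing jobs (obtaining these from the queue manager is not shown). The function name and the returned state labels are hypothetical.

```python
import re

INPUT_SUFFIXES = {".inp", ".sub"}
OUTPUT_SUFFIXES = {".log", ".dat", ".pbs.log", ".pbs.err"}
# GAMESS scratch files, matching the form .F\d\d? given in the text.
SCRATCH_SUFFIX = re.compile(r"\.F\d\d?")


def classify_job(job_id, filenames, queued_ids, running_ids):
    """Classify one job from the files present and the scheduler's job lists."""
    prefix = job_id + "."
    # Keep only the suffix of each file belonging to this job, e.g. ".pbs.err".
    suffixes = {name[len(job_id):] for name in filenames if name.startswith(prefix)}
    has_scratch = any(SCRATCH_SUFFIX.fullmatch(s) for s in suffixes)

    if suffixes == INPUT_SUFFIXES:
        # Only .inp and .sub exist: either waiting in the queue or never submitted.
        return "queued" if job_id in queued_ids else "unsubmitted"
    if has_scratch:
        # F files mean the job is still running or its node crashed.
        return "running" if job_id in running_ids else "crashed"
    if suffixes == INPUT_SUFFIXES | OUTPUT_SUFFIXES:
        return "finished"
    return "unknown"
```

Returning a label rather than acting on it keeps the decision separate from the response, so that a crashed job can be set aside for manual inspection instead of being resubmitted automatically.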
If finished then move else wait. If the job has finished then move the files to a permanent store, otherwise look at the next job; if all the jobs have been looked at then wait for a specified time before restarting the process.

Determine if job has terminated correctly. JUMBOMarker was used to determine that the log file produced was complete, as described in section 4.3. If the log file is complete then the file can be moved to the results repository, otherwise the job should be resubmitted.

7.7 Results Repository

Web-based repositories do exist, such as DSpace, which captures, stores, indexes, preserves, and distributes digital research material [200]. However, whilst analysing the data produced, a local repository was preferred, because until the analysis was completed the files containing the data were being frequently accessed and modified. A local repository reduces access times and bandwidth requirements, and removes a level of complication from the analysis.

7.8 Designing a Robust Analysis Method

Once the calculations have been performed and the data is in the local repository, the majority of the analysis can occur. Chapter 5 details various examples in which the calculation was changed owing to peculiarities discovered in the data created, and such changes were also necessary in this study. This involved finding molecules which were inappropriate for calculation and tightening the protocol so that such instances would be filtered off before submission (see section 7.11).

Increasing the severity of the filters imposed on the data for calculation is useful for increasing the efficiency and validity of future calculations but not for those molecules already calculated; these must be retrofitted with the improved protocol.
Two possible ways of retrofitting the data are

• altering the file structure
• altering the file

7.8.1 Altering the File Structure

Altering the file structure could range from changing the name of a file (so that it would no longer be included in an iterative call for all files of a particular type) to the complete deletion of a file and all data derived from it. The second of these examples would decrease the amount of storage required for the data, and the subsequent corpus would more closely resemble the data structure generated in future using the protocol.

It was felt that the needless and wanton destruction of data in this manner should be avoided and therefore all results (however ridiculous) should be kept in the dataset and eventually archived along with the rest. However, the aims of this study include checking the validity of the data available in the public domain, allowing re-use of the data and allowing others to recreate experiments; thus it is necessary to provide an indication of the quality of the data during this process, together with the necessary tools to analyse it.

7.8.2 Altering the File

Altering the file is a logical progression from the denormalisation of data; all the data used to determine whether or not a molecule in a particular file is suitable for calculation should be present in that file. The files already contain elements or attributes from the original CIF indicating, for example, the presence of disorder; further elements were included during analysis to indicate whether or not a molecule passed a particular filter, or to hold the data required to ascertain if it would pass.

The elements added to a file need not be those pertaining to the requirements of the protocol. For instance, if a molecule was found to be protean this can also be indicated.
Elements added purely to determine if the molecule has passed the requirements for the protocol, or is suitable for further analysis are in the form; ACYCLIC or and are referred to generally as flags.

7.9 Creating a Protocol

When performing HT calculations it is important to reduce, as far as possible, the error rate of those calculations and also to maximise the amount of useful work done in the time available. Previous experience of designing protocols, and analysis of the calculations performed, suggested that it is a better use of the available resources to construct a fairly simple protocol and refine it after calculations have been performed rather than to attempt to design the perfect protocol before starting any calculations. It is more efficient to perform more radical but crude filtration than to create very specific filters; although this might prevent a few interesting structures from being calculated, overall, more structures will be determined.

A protocol may be divided into two separate sets of parameters:

Job controls: such as the time allowed, the memory available, the size of the basis set and the level of theory to use.

Molecule selection: is it reasonable to attempt to calculate the geometry of this molecule given the specified job controls?

The job controls used were those which had been determined as a result of the analysis of the MOPAC molecules, i.e. the 6-31G* basis set, the B3LYP exchange-correlation functional, internal coordinates and a maximum runtime of one week (an example of a full input file showing all the job controls is shown in figure 5.7).

7.10 Molecule Selection Parameters

Molecules that are attractive for submission for both high- and low-throughput computation are ones that will provide interesting results and that run to completion rapidly. Finding such molecules is both difficult and time consuming.
Whilst low-throughput computation might focus more on a few interesting molecules and allow such calculations to run for extremely long times, HT computation focuses on getting as many results as possible in the given time. The initial parameters selected to determine if a molecule was suitable for calculation were

• no disorder
• element type
• number of non-hydrogen atoms

Disorder in CIFs can present a problem, especially when the disordered groups are not correctly identified, or when the occupancies of the disordered sites are equal (because then it is not trivial to determine which of the sites go together to form each group). To prevent such problems, any structure that contained disorder was determined to be ineligible for calculation. It should be noted that this check was performed on a per-molecule basis rather than a per-CIF basis, hence some molecules in a CIF might be eligible for submission whilst others are not.

This study focussed primarily on organic structures with well-described bonding schemes containing only those elements which can be represented using the basis set and level of theory stated above. Therefore only molecules containing the following element types were considered for submission: H, B, C, N, O, F, Si, P, S, Cl and Br.

Table 5.3 shows the predicted run time required for a calculation involving a certain number of non-hydrogen atoms. This suggests that ca. 95% of the calculations on molecules containing 20 non-hydrogen atoms are likely to complete within a week. Molecules containing fewer than 20 non-hydrogen atoms are likely to run to completion in less time. To prevent calculating the structures of too many solvent-type molecules (for example water and dichloromethane) which, because of their abundance in the CIFs, would dominate the available computing time and provide little of interest, a minimum number of non-hydrogen atoms was also imposed.
The minimum value chosen was four; this removes many small solvents and counter ions (such as ammonium) without preventing some of the more interesting small molecules from being calculated (especially α-substituted carbonyl groups with one to three fluorine and chlorine atoms present).

The imposition of the initial parameters on the dataset yielded ca. 2400 molecules. It was decided that this should be split into two sets of input data: those molecules with four to 15 non-hydrogen atoms, and those with 16 to 20. The sets contained approximately 1200 and 1000 molecules respectively. The submission of the latter set (with more non-hydrogen atoms and hence longer run times) was dependent on the results derived from analysing the first set. Unfortunately, although the results of the initial dataset were positive, the loss of available compute resources and time constraints prevented the second set from being calculated. Therefore the following calculations consider molecules containing four to 15 non-hydrogen atoms, with no disordered atoms present and containing only the elements specified above.

It was fully expected that further refinement of the selection parameters would be required as the analysis progressed. Methods for implementing these are described in section 7.8.

A filter to remove molecules with unpaired electrons was not implemented because it was thought that the limit imposed on the number and type of atoms would make it impossible for such molecules to be present. Only very stable radicals form crystals, and these are typically found in large structures or structures containing heavy metals [201].

The total number of CIFs in the corpus was 6738, from which 6676 were successfully parsed to CML. The 63 which were not parsed lacked particular required data items or contained the data items in a form which was not understandable by the parser (they did not conform to the CIF dictionary within acceptable tolerance).
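The three initial selection parameters can be expressed as a single predicate. The sketch below is illustrative only: the function name and its arguments are hypothetical, and it assumes that disorder has already been detected per molecule when the CIF was parsed. The element list and the limits of four and 15 non-hydrogen atoms are those stated above.

```python
# Element types permitted by the protocol, as listed in the text.
ALLOWED_ELEMENTS = frozenset(
    {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Br"}
)


def passes_initial_protocol(elements, has_disorder, min_heavy=4, max_heavy=15):
    """Apply the three initial selection parameters to one molecule.

    elements is the list of element symbols of every atom in the molecule;
    has_disorder is True if any atom of this molecule is disordered.
    """
    if has_disorder:
        return False
    if not set(elements) <= ALLOWED_ELEMENTS:
        return False
    heavy_atoms = sum(1 for symbol in elements if symbol != "H")
    return min_heavy <= heavy_atoms <= max_heavy
```

Setting max_heavy to 20 would reproduce the selection originally intended before the larger set had to be abandoned.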
The parsed CIFs yielded a total of 6455 molecules. Using the three initial parameters, 1181 molecules were found that were suitable for calculation.

7.11 Refining the Protocol

The comparison of MOPAC and GAMESS bond lengths suggested that performing geometry optimisations at 6-31G* with GAMESS provides consistent and high-quality data. Crystallographic data is considered to be of a consistent quality (as opposed to the NCI database), therefore any outliers found between the bond lengths reported in the CIF and those calculated by GAMESS are indicative of possible problems in either the experimental or calculated values, or that like is not being compared with like. All bonds to hydrogen atoms are excluded from the analysis because their positions are poorly determined by crystallographic methods in general. Proteus molecules were also removed from the dataset before the geometries were compared. The analysis of these molecules is presented in section 8.2.

[x-y plot: CIF bond length / Å (x) against GAMESS bond length / Å (y).]
Figure 7.8: One clear outlier is visible, as is the appearance of bands, for example at a calculated bond length of 1.5 Å. The x = y line is shown.

As before, both x-y and QQ plots were used to determine outliers. Figure 7.8 shows all the 9512 bond lengths reported in the CIF against those calculated by GAMESS (excluding bonds to hydrogen); the bonds are from 973 molecules. There are two features of note: a clear outlier and an apparent banding. The molecule giving rise to the outlier was examined and found to be N,N-dimethylformamide. The bonds giving rise to the banding at a GAMESS bond length of 1.5 Å were all found to be from perchlorate ions. The determination of the positions of atoms in solvent/ion/guest molecules (such as the perchlorate anion) is not given the same priority as those from the major structure and therefore these molecules should not form part of the analysis.
A program was written to check whether or not a molecule was a solvent/ion/guest and to add a flag indicating the result of this test. Appendix F shows the list of molecules, or molecular fragments, considered to fall into this category.

Figure 7.9 shows the QQ plot of all the 9512 bond length changes. The outliers are not as clearly apparent but there appear to be overlapping sets of data, with the behaviour changing at ∆R ≈ ±0.05 Å. The cause of this is examined later.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.9: All bonds in the corpus excluding those to hydrogen; the data appear to be normally distributed within ∆R ≈ ±0.05 Å. It is possible that there are several overlapping distributions. Overall ∆R = 0.016 Å and s = 0.022 Å.

After the solvent/ion/guest molecules were removed from the dataset, 8728 bonds (from 785 molecules) remain; the QQ plot of these is shown in figure 7.10. The major outliers are caused by the loss of hydrogen bonding (on moving from the crystalline state to a single molecule in vacuum). Effectively this means that the comparison of bond lengths is not comparing like with like, resulting in large outliers.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.10: The QQ plot of all bonds excluding those from solvent/ion/guest molecules; the data is mostly normally distributed with ∆R = 0.014 Å and s = 0.017 Å.

To remove such outliers from the dataset, all bonds involving oxygen atoms that were not bonded to two non-hydrogen atoms are marked with a flag to indicate that they might be affected by hydrogen bonding and are removed from further analysis. Similarly, all nitrogen atoms that are bonded to one or more hydrogen atoms and are sp3-hybridised may be hydrogen bond donors and are therefore also flagged.
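The two flagging rules above can be sketched as follows. This is an illustration rather than the program used in this work: the function name and data layout are hypothetical, the text flags nitrogen atoms and the sketch assumes this means that bonds involving such atoms are flagged, and the set of sp3-hybridised nitrogen atoms is supplied by the caller because deriving hybridisation requires bond orders that are not modelled here.

```python
def hbond_affected_bonds(elements, bonds, sp3_nitrogens):
    """Return the bonds that may be affected by hydrogen bonding.

    elements: element symbol of each atom, indexed by atom number.
    bonds: list of (i, j) atom index pairs.
    sp3_nitrogens: set of indices of nitrogen atoms taken to be sp3-hybridised.
    """
    neighbours = {i: [] for i in range(len(elements))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def is_suspect(atom):
        bonded_elements = [elements[n] for n in neighbours[atom]]
        if elements[atom] == "O":
            # Oxygen not bonded to exactly two non-hydrogen atoms.
            heavy = sum(1 for symbol in bonded_elements if symbol != "H")
            return heavy != 2
        if elements[atom] == "N":
            # sp3 nitrogen carrying at least one hydrogen: a possible donor.
            return atom in sp3_nitrogens and "H" in bonded_elements
        return False

    return [bond for bond in bonds if is_suspect(bond[0]) or is_suspect(bond[1])]
```

Under these rules both a hydroxyl oxygen and a carbonyl oxygen are treated as suspect, whereas an ether oxygen bonded to two carbon atoms is not.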
It should be noted that the hydrogen-bonding flags are applied on a per-bond basis, thus other bonds in the molecule are still included and the molecule would still be considered suitable for calculation.

After the removal of possible hydrogen bonding effects, 7177 bonds (from 782 molecules) remain; the QQ plot of these (figure 7.11) shows that there are still outliers. On examination many of the outliers were caused by molecules with large R-factors; following advice from Alison Edwards, all CIFs with R-factor > 0.05 were removed from the dataset, and the protocol altered to filter off such structures in future because these are less likely to be high-quality structures [202]. Figure 7.12 shows the QQ plot of the 5034 bonds from 571 molecules following the imposition of this filter.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.11: The QQ plot of 8728 bonds excluding those from solvent/ion/guest molecules and those which are possibly affected by hydrogen bonding; the data is mostly normally distributed with ∆R = 0.014 Å and s = 0.015 Å.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.12: R-factor ≤ 0.05; the data is mostly normally distributed with ∆R = 0.013 Å and s = 0.015 Å.

It is possible to place constraints on particular atom sites during the determination of the crystal structure, which are labelled in the CIF using the atom site refinement data item. It was felt that structures with constrained atoms (except hydrogen) were not sufficiently experimentally determined. The initial filtering of constrained atoms was performed on a per-bond basis, with only those bonds containing constrained atoms removed from the analysis, because the remaining bonds in the molecule would still reflect experimentally determined values; this proved to be ineffectual. Therefore, all molecules containing constrained atoms were removed from the dataset. Figure 7.13 shows the QQ plot resulting from removing only constrained bonds rather than the entire molecule (4264 bonds from 547 molecules). Figure 7.14 shows the QQ plot of the 3658 bonds (547 molecules) which remain following the imposition of the filters described above.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.13: No constrained bonds; the data is mostly normally distributed with ∆R = 0.013 Å and s = 0.015 Å.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.14: No molecules containing constrained non-hydrogen atoms; the data is mostly normally distributed with ∆R = 0.013 Å and s = 0.014 Å.

The outliers with ∆R < −0.05 Å have two causes: loss of aromaticity and incorrectly specified charges in the CIF.

1,3,5,7-cyclooctatetraene in vacuo has been calculated to have alternating bond lengths of 1.47 Å and 1.34 Å, reflecting the non-aromatic nature of the molecule. The crystal structure contains the molecule complexed with thulium, allowing metal-ligand electron donation. The reported C–C bond lengths in the CIF are between 1.40 Å and 1.42 Å, reflecting the aromatic nature of the bonding in this state. There is no reason to believe that the bond lengths obtained from the CIF and the GAMESS calculation are incorrect, but they describe the molecules in different states. The outlier should therefore not be interpreted as being caused by bad data.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.15: Only molecules from CIFs containing the specified elements; the data is mostly normally distributed with ∆R = 0.012 Å and s = 0.013 Å.

The process of determining which atoms are part of a molecule is not currently sufficiently chemically aware.
This frequently results in coordinated molecules and organometallic molecules becoming separated (for example ferrocene will become two cyclopentadienyl molecules and an iron atom). The presence of heavy atoms (in the crystallographic sense) in the crystal also reduces the accuracy of the determination of the positions of the lighter nuclei. Therefore CIFs which contained nuclei other than those permitted (H, B, C, N, O, F, Si, P, S, Cl and Br) were removed from further analysis. Figure 7.15 shows the QQ plot of the 5253 bonds (348 molecules) that fulfil this requirement.

The CIF containing the molecule with incorrectly specified charge contained two charged molecules and the authors had mistakenly swapped the charges over. There is no simple algorithmic method for determining that this has occurred, so a manual removal flag was used. 5251 bonds (347 molecules) remain in the dataset; the QQ plot of these is shown in figure 7.16.

The temperature at which the crystal structure is determined affects the quality of the resultant structure, as mentioned in chapter 6. Traditionally structures were resolved at room temperature but the introduction of cryocooling has allowed low temperature studies to be performed. The temperatures reported in the entire corpus for this work ranged from 5 K to 573 K (the upper bound is highly dubious; see section 8.5). Following advice from Bernd Schweizer, all crystal structures with a reported temperature greater than 200 K were removed and a filter created to prevent the calculation of such structures in future [203]. Figure 7.17 shows the 1397 bonds (150 molecules) still in the dataset.

To more completely remove the crystal packing effects from the analysis, only bonds which are part of a ring are considered.
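One way of deciding whether a bond is part of a ring uses only the connection table: a bond is cyclic if its two atoms remain connected when that bond is ignored. The sketch below illustrates this graph-based test with a breadth-first search; it is an assumption about how such a flag could be computed, not a description of the tools used in this work, and the function names are hypothetical.

```python
from collections import deque


def cyclic_bonds(n_atoms, bonds):
    """Return the bonds (as given) that are part of at least one ring."""
    adjacency = {i: set() for i in range(n_atoms)}
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)

    def connected_without_bond(start, target):
        # Breadth-first search from start that never walks the direct
        # start-target bond; reaching target proves an alternative path.
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if current == start and nxt == target:
                    continue  # skip the bond being tested
                if nxt == target:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    return [(i, j) for i, j in bonds if connected_without_bond(i, j)]
```

For methylcyclopropane, with ring atoms 0 to 2 and the methyl carbon as atom 3, the three ring bonds are returned and the exocyclic bond is not.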
Restricting the analysis to ring bonds involved creating a flag for each bond to designate whether or not it was a cyclic bond. This has the sole effect of minimising the crystal packing effects on the analysis of the data; molecules containing no cyclic bonds are still suitable for calculation.

990 bonds (128 molecules) remain after the implementation of all the filters (including that for cyclic bonds). Figure 7.18 shows the QQ plot of the bond length comparisons. There is a slight positive skew present: GAMESS bond lengths are found to be consistently longer than the corresponding CIF bond lengths, with this effect becoming more pronounced for longer bonds. The lengthening is more apparent in the x-y plot (figure 7.19) and is examined in chapter 8. Table G.1 shows the 112 unique connection tables of the 128 molecules which pass the final protocol.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.16: Manual removal; the data is mostly normally distributed with ∆R = 0.012 Å and s = 0.013 Å.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.17: T ≤ 200 K; the data is mostly normally distributed with ∆R = 0.009 Å and s = 0.012 Å.

[QQ plot: theoretical quantiles (x) against ∆R (GAMESS − CIF) / Å (y).]
Figure 7.18: Cyclic bonds only; the data is mostly normally distributed with ∆R = 0.009 Å and s = 0.011 Å.

[x-y plot: CIF bond length / Å (x) against GAMESS bond length / Å (y).]
Figure 7.19: The calculated bond lengths appear to be consistently longer than those determined experimentally, with the effect becoming more pronounced for longer bonds. The x = y line is shown.

7.12 Conclusions

As expected, the analysis of the data led to further filtering and refinement of the original protocol.
The final protocol is given below:

• no disordered atoms
• molecules from CIFs consisting only of the following element types are permitted: H, B, C, N, O, F, Si, P, S, Cl, Br
• molecules containing more than 4 non-hydrogen atoms are suitable, but the more of these atoms are present the longer the run time is likely to be (see tables 5.3 and 8.10)
• solvent/ion/guest molecules are not suitable
• only molecules with an R-factor (_refine_ls_R_factor_gt) ≤ 0.05 are suitable
• molecules containing constrained atoms (_atom_site_constraints) other than hydrogen atoms are not suitable
• structures determined at a temperature greater than 200 K are not suitable
• manual removal of author error

A further two factors should be considered when comparing the bond lengths reported in the CIF with those calculated in vacuo. Firstly, hydrogen bonding in the crystalline form can affect the bond lengths of the atoms involved and of adjacent bonds (see section 8.4.3); the most common hydrogen bond acceptors found in the dataset were carbonyl and imine groups, and the most common donors were alcohol and amine groups. Algorithms to detect bonds to these groups have been created and implemented. Secondly, the effects of crystal packing forces can largely be negated by considering only bonds which are part of a cyclic system. The identification of cyclic bonds is possible using the current tools and has been implemented.

Developing a suitable architecture for HT computing is non-trivial. Workflows must be developed with flexibility and tolerance to failures in mind. The protocol should be expected to undergo considerable refinement before it can be considered suitable. This refinement is likely to be directed by the analysis of the results; to analyse the data produced without knowing the limitations of the protocol is meaningless.

Traditional workflow programs (such as Taverna) are not well suited to the kind of HT computing detailed above.
It might be possible to incorporate an entire workflow in such a program, but to do so increases the complexity of the code required. Much of this complexity results from the limited input and output formats for each of the processes in such programs. The program writers are therefore forced to implement both the algorithms needed to perform the intended process and the transformation of the data to and from the specified workflow format. Traditional workflow programs are also not designed to run processes on multiple platforms; this is often an absolute necessity in HT computing.

In general, it is the belief of this author that workflows should be constructed from simple, lightweight components, each of which should perform one process. Each process should be written so that it is as general as possible and clearly reports any errors encountered. Passing data between the processes is most easily accomplished by correct use of the file system; a denormalised approach (possibly with each process creating a new instance of a file) is desirable. The processes may be linked together using a script if desired. Even if each of the processes is encapsulated in a WS and made publicly available, they are unlikely to be globally usable, although local re-use is possible. Examples of processes that may be re-usable locally are input creation, submission and retrieval of results.

Chapter 8

Results

Some analysis of the results has already been presented in the previous chapter. This analysis was necessary to determine how the original protocol should be modified, and only considered non-proteus molecules that had completed the geometry optimisation. To give a complete picture of the applicability of the workflow, the failed calculations and the proteus molecules must also be examined.

8.1 Failure Analysis

The 1181 jobs submitted for calculation under the original protocol yielded 180 instances where the calculation failed (see table 8.1).
There were four sources of failure:

• insufficient time to finish the minimisation (21)
• bad delocalised coordinates generated (16)
• SCF did not converge (78)
• incorrect charge and/or multiplicity specified (65)

The molecules giving rise to each of these failures were examined to ascertain whether it would have been possible to create a filter to remove jobs of this type before submission.

number of . . .
CIFs                                                       6738
CIFs parsed to CML                                         6738
molecules extracted                                        6455
molecules suitable for calculation (original protocol)     1181
disordered molecules otherwise suitable for calculation      65
calculations (total)                                       1181
calculations failed (original protocol)                     180
calculations failed (final protocol)                         14

Table 8.1: The breakdown of the calculation statistics.

8.1.1 Insufficient Time

Table 8.2 shows the unique CTs of the molecules that did not have sufficient time to complete the minimisation process. The dataset of completed calculations was searched for the CTs of each of the molecules which failed. This search revealed cases where the calculation of the same CT (with very similar initial geometries) resulted in both failures and successfully completed computations. Figure 8.1 shows an example of the energy profiles of the successful and failed calculations. The profiles suggest that there is a very small radius of convergence for the energy minimisation algorithm implemented in GAMESS.

The 21 failures of this type represent a failure rate of 1.8%. This figure matches that found for the calculation of the MOPAC optimised structures which, taken in conjunction with the large spread of run times for a given number of non-hydrogen atoms, suggests that a failure rate of ca. 2% should be expected. There does not appear to be any way to differentiate the CTs giving rise to failures of this type from those which successfully complete in the time limit. Therefore no further refinements were made to the protocol.
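The failure counts listed at the start of section 8.1 can be converted into rates with a few lines of arithmetic. The following is a minimal sketch only; the dictionary keys and the function name are illustrative and are not taken from the workflow code. It reproduces the 1.8% rate quoted above for time-limit failures.

```python
# Failure counts under the original protocol, as listed in section 8.1.
SUBMITTED_JOBS = 1181
FAILURES = {
    "insufficient time": 21,
    "bad delocalised coordinates": 16,
    "SCF did not converge": 78,
    "incorrect charge/multiplicity": 65,
}


def failure_rates(failures, submitted):
    """Return each failure count as a percentage of the submitted jobs."""
    return {cause: 100.0 * n / submitted for cause, n in failures.items()}


rates = failure_rates(FAILURES, SUBMITTED_JOBS)
total_failed = sum(FAILURES.values())

for cause, pct in rates.items():
    print(f"{cause:32s} {pct:4.1f}%")
print(f"{'all failures':32s} {100.0 * total_failed / SUBMITTED_JOBS:4.1f}%")
```

The four counts sum to the 180 failed calculations reported in table 8.1, which is a useful consistency check on the bookkeeping.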
Table 8.2: The molecules which did not have sufficient time to complete. The numbers below each molecule are: the number of instances of this molecule which did not have sufficient time to complete and, in brackets, the number of instances of the molecule which successfully completed in the time limit.

Counts, in the order in which the structures appear in the table: 1 (0), 1 (0), 1 (0), 2 (32), 1 (0), 2 (2), 1 (1), 1 (0), 1 (0), 1 (9), 1 (0), 1 (0), 1 (0), 1 (0), 1 (0), 1 (1), 1 (0), 2 (2).

8.1.2 SCF Did Not Converge

The SCF failed to converge for 78 molecules under the original protocol. These 78 failures were caused by the 16 unique CTs shown in table 8.3. Again, the dataset was searched for instances where the same CT resulted in completed calculations; these are indicated in the table. It was observed that the molecules for which the SCF did not converge all contain localised charges. However, the dataset included many molecules with localised charges for which the SCF did converge. It is therefore not reasonable to include a filter in the protocol to remove such molecules. The maximum number of iterations permitted for the SCF to converge is 30 by default in GAMESS. Increasing this value might allow more calculations to complete. However, these calculations are likely to require more time than is allowed by the protocol and therefore alteration was thought to be unnecessary.
Figure 8.1: The energy profiles for four calculations on the same connection table (N-benzyl-N-methylprop-2-yn-1-aminium) with similar initial geometries. The y-axis scale is different for each graph. (Three panels of total energy / Hartree against step; series: worked 1, worked 2, failed 1, failed 2.)

Table 8.3: The molecules for which the SCF did not converge. The number of times each molecule was observed is indicated.

Counts, in the order in which the structures appear in the table: 10 (2), 1 (0), 7 (0), 40 (0), 1 (0), 1 (0), 6 (0), 1 (0), 1 (0), 2 (0), 2 (0), 1 (0), 1 (0), 1 (0), 2 (0), 1 (0).

8.1.3 Bad Delocalised Coordinates Generated

It was noted in section 5.3 that the automated creation of internal coordinates by GAMESS may fail. There were 16 instances of this failure from 9 unique CTs. These unique CTs are shown in table 8.4; as before, the number of instances where the same connection table resulted in completed calculations is indicated in brackets. This is a previously reported GAMESS error and there is no easy way to determine which molecules it will affect. Therefore, no changes were made to the protocol.

Table 8.4: The molecules for which bad delocalised coordinates were generated. The number of times each molecule was observed is indicated.

Counts, in the order in which the structures appear in the table: 1 (0), 3 (0),
Table 8.4 (continued): 1 (4), 1 (3), 5 (0), 1 (4), 1 (0), 1 (0), 1 (0).

8.1.4 Incorrect Charge or Multiplicity

There were found to be three reasons for this failure, all of which occurred in molecules suitable for calculation under the final protocol. The most common cause was the charge being specified incorrectly in the CIF, or being specified in a manner that could not be parsed correctly by the CIF2CML process. There is no way to account for author error. The checkCIF process does not report these as severe alerts, although the mistakes were detected by the calculation process.

The CIF2CML process might be modified so that it places looser restrictions on the format of the formulae that it can parse. This would allow the specified formula to be parsed even if it did not correspond to the format specified by the CIF dictionary [205]. However, the results of the OSCAR project suggest that it is impossible to correctly interpret loosely-defined data types with high precision. Failures of this type would therefore still be expected.

Two instances of this problem were caused by a mistake in the input file: one of the lines defining an atom in the file was longer than 80 characters. This causes GAMESS to skip the next input line, which results in the next atom not being included in the calculation. The cause of the long lines is the use of the Java double-precision type to hold the coordinates. Whilst the CIF2CML program holds the fractional coordinates to the precision defined in the CIF, the cartesian coordinates created by the process are held to the maximum precision allowed by the double-precision type. Detection of this problem is possible before submission and was included in later protocols. However, it is still desirable that the underlying cause should be addressed; namely, the CIF2CML parser should ensure that derived values are given to an appropriate precision.

The final reason for this problem was incorrectly reported structures.
For example, Newton et al. report the structure and formula of 2-C-hydroxymethyl-2,3-O-isopropylidene-d-ribono-1,5-lactam as C9H14NO5, which matches the structure reported in the CIF [208]. However, this structure does not match the CT given in the article, which is C9H15NO5 (see figure 8.2).

Figure 8.2: The CTs of 2-C-hydroxymethyl-2,3-O-isopropylidene-d-ribono-1,5-lactam as reported by Newton et al. The intended structure is shown on the left and the structure reported in the CIF on the right [208].

Failures owing to the incorrect charge and/or multiplicity being specified accounted for 5.5% of the submitted calculations and were all caused by incorrect or inconsistent CIFs. Failures of this type take extremely little time to calculate (typically a few tenths of a second) and therefore such computations could be used to automatically validate the reported crystal structures.

8.2 Proteus Molecules

Previous work highlighted the existence of molecules which, when undergoing geometry optimisation, showed a change in the CT (proteus molecules). The analytical tools used to determine differences in the geometry between the input and output structures require that there is no change of CT between the two. Therefore, before further processing, an InChI was generated for each molecule in each of the parsed data-and-coords.xml files. The InChI was added as a child of the molecule element as a separately namespaced identifier element (see figure 8.3). A program was written to iterate through each of the InChIs and compare them to the InChI of the original (input) molecule. If the basic InChIs did not exactly match, a flag was added to the file to indicate that the molecule was protean.

Initial analysis of the proteus molecules presented some unexpected results; namely, although some of the molecules marked as protean did change CT at some point during the calculation, the input and output molecules had identical InChIs. To rectify this, another flag was introduced to mark such molecules as semi-proteus.
To rectify this another flag was introduced 212 ... ... InChI=1/C6H11NO2/c8-6(9)5-3-1-2-4-7-5/h5,7H,1-4H2,(H,8,9)/t5-/m0/s1 AuxInfo=1/1/N:11,14,8,17,6,20,3,1,2/E:(8,9)/it:im/rA:20OONHHCHCHHCHHCHHCHHC /rB:... Figure 8.3: An example of how a molecule’s InChI was incorporated into the CML document. These molecules showed a change of CT during the minimisation process but the optimised structure had an identical InChI to the input, whereas a true proteus molecule has a different InChI for the input and output structures. 28 proteus molecules were found, most of which were a result of the move- ment of a hydrogen atom to neutralise the charge on the molecule. A conse- quence of this was that most of the amino acids and derivative structures were removed from the dataset. These molecules tend to exist in the zwitterion form in the solid state and solution where they are able to make favourable ionic interactions. However, individual molecules (particularly in vacuo) are not usually stable in this form with respect to the neutral species. Of the 28 proteus molecules 26 were as a result of charge reduction, in all but one instance this involved the movement of hydrogen atom(s). The other cause was the distance between a nitrogen and a boron atom increasing from 1.66A˚ to 1.84A˚ which was no longer considered to be within bonding distance by the analysis tools used. The calculated bond order reported by GAMESS for the N–B bond was 0.325, with overall charges of +0.64 and −0.77 on the nitrogen and boron atoms respectively. The two proteus molecules which were not caused by charge reduction were both found to be molecules that should have been charged but were not reported to be so in the CIF. Table H.1 shows an example of the geometries adopted during the calculation of a proteus molecule. 213 15 semi-proteus molecules were found. 13 of these were caused by 6- or 7-member ring formation between charged and uncharged carboxylic groups in 1,3 or 1,4 relative positions. 
A typical example of this can be seen in table H.2. The remaining two instances involved the distance between bonded atoms lengthening beyond the value considered to constitute a bond by the analysis software (see table H.3). Although the semi-proteus molecules are interesting, overall they do not affect the protocol because only the initial and final geometries are being compared.

8.3 CIF Analysis

To verify that the repository of CIFs used in this work was a representative subset of small molecule crystallography (as recorded in the CSD), a series of comparisons was drawn; these are presented below. The statistics are derived from the 6675 CIFs which were parsed into CML to form the corpus for this work (set A) and the 775 CIFs which contained at least one molecule deemed suitable for calculation under the original protocol (set B). The CCDC publishes yearly statistics, based on the structures in the CSD as of the 1st of January each year. The statistics used below are taken from the 2007 publication. The derivation of the statistics from the datasets used in this thesis has been performed by automated processes wherever possible. This means that minor errors are counted as errors even if they were trivially human-parsable.

Table 8.5 shows the analysis of crystal systems in the CSD and in the corpus of CIFs used in this thesis. The 6675 parsed CIFs contained 6673 instances where the crystal system was correctly defined. The two which did not contained incorrect formatting and/or labels that were not defined in the CIF dictionary. One further result was excluded because it contained a spelling mistake in the definition (‘rombohedral’).
System          % of CSD   Entries set A   % of set A   Entries set B   % of set B
Triclinic         23.5        1543           23.1          160            20.6
Monoclinic        52.7        3690           55.3          439            56.6
Orthorhombic      18.9        1203           18.0          165            21.3
Tetragonal         2.2         108            1.6            7             0.9
Trigonal           1.7          59            0.9            4             0.5
Hexagonal          0.5          38            0.6            0             0
Cubic              0.4          29            0.4            0             0
Rhombohedral       n/a           2            0.0            0             0
Total             99.9        6672          100            775           100

Table 8.5: The cell symmetries for the CSD and the corpus for this work. The CSD results are obtained from the 397,293 CSD structures for which the space group is fully defined.

Table 8.6 shows the distribution of Hermann-Mauguin space groups for the datasets. The 6675 parsed CIFs yielded 6671 instances with the Hermann-Mauguin space group correctly defined. The four files for which the space group could not be identified used labels which were not defined in the CIF dictionary.

Table 8.7 shows the R-factor statistics for the datasets. The CIF dictionary allows several ways to define the R-factor, all of which are valid. This work has only considered the conventional R-factor defined by the CIF dictionary, which is labelled _refine_ls_R_factor_gt. The required R-factor was contained in 6650 CIFs; the remaining 25 files had the R-factor given in an alternative format. In the CSD, 3.8% of structures have unreported R-factors, although it should be noted that these are from short communications and most commonly from the earlier literature [204]. In the corpus for this work, 0.4% of structures have the conventional R-factor missing, although a valid alternative form was always present.
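The handling of the conventional R-factor described above can be illustrated with a short sketch. It is not the CIF2CML implementation: the function names are hypothetical, the regular expression only handles simple single-line data items, and the use of _refine_ls_R_factor_all as the alternative form is an assumption made for illustration. The 0.05 limit is the one used by the protocol.

```python
import re

# Data names tried in order. The first is the conventional R-factor used in
# this work; the second is an assumed example of an "alternative form".
R_FACTOR_TAGS = ("_refine_ls_R_factor_gt", "_refine_ls_R_factor_all")


def read_r_factor(cif_text, tags=R_FACTOR_TAGS):
    """Return (tag, value) for the first R-factor data item found, else None."""
    for tag in tags:
        # Accept "tag 0.0412" or "tag 0.0412(5)"; "?" and "." are not matched.
        pattern = (r"^\s*" + re.escape(tag)
                   + r"\s+([0-9]*\.?[0-9]+)(?:\(\d+\))?\s*$")
        match = re.search(pattern, cif_text, flags=re.MULTILINE | re.IGNORECASE)
        if match:
            return tag, float(match.group(1))
    return None


def passes_r_factor_filter(cif_text, limit=0.05):
    """Apply the protocol filter using the conventional R-factor only."""
    found = read_r_factor(cif_text, tags=R_FACTOR_TAGS[:1])
    return found is not None and found[1] <= limit


# Hypothetical CIF fragments (not taken from the corpus).
conventional = """data_example
_refine_ls_R_factor_all   0.0612
_refine_ls_R_factor_gt    0.0431
_refine_ls_wR_factor_gt   0.1102
"""
alternative_only = """data_example
_refine_ls_R_factor_all   0.0390
"""

print(read_r_factor(conventional))
print(passes_r_factor_filter(conventional))
print(read_r_factor(alternative_only))
print(passes_r_factor_filter(alternative_only))
```

In this sketch a file that supplies only an alternative form is reported by read_r_factor but rejected by the filter; whether such files should instead be accepted is a design decision that the sketch does not settle.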
Space group      % of CSD   Entries set A   % of set A   Entries set B   % of set B
P2(1)/c            35.1        1463           21.9          181            23.4
P2(1)/n            n/a          997           14.9          134            17.3
P2(1)/a            n/a           68            1.0           12             1.6
P-1                22.6        1498           22.5          151            19.5
P2(1)2(1)2(1)       8.1         458            6.9           83            10.7
C2/c                7.9         552            8.3           36             4.7
P2(1)               5.5         298            4.5           54             7.0
Pbca                3.5         257            3.9           37             4.8
Pna2(1)             1.4          95            1.4           21             2.7
Pnma                1.3          92            1.4            0             0
Cc                  1.1          46            0.7            3             0.4
P1                  1.0          43            0.6            8             1.0
Pbcn                0.9          41            0.6            4             0.5
C2                  0.9          54            0.8            3             0.4
Pca2(1)             0.7          61            0.9           10             1.3
R-3                 0.6          24            0.4            1             0.1
P2/c                0.6          14            0.2            0             0
P2(1)/m             0.6          34            0.5            0             0
C2/m                0.5          54            0.8            0             0
P2(1)2(1)2          0.4          18            0.3            3             0.4
Pc                  0.4          16            0.2            5             0.6
Pccn                0.4          26            0.4            3             0.4
Fdd2                0.3          21            0.3            2             0.3
I4(1)/a             0.3          15            0.2            3             0.4
Total              94.1        6671           93.6          773            97.5

Table 8.6: Hermann-Mauguin space group statistics for the CSD and the test corpus. The CSD results are obtained from the 397,293 CSD structures for which the space group is fully defined; the space groups listed each represent ≥ 0.3% of the structures.

R-factor              % of CSD   Entries set A   % of set A   Entries set B   % of set B
R ≤ 0.030                9.2         940           14.1           86            11.1
0.030 < R ≤ 0.040       19.8        1766           26.5          219            28.3
0.040 < R ≤ 0.050       22.4        1890           28.3          246            31.7
0.050 < R ≤ 0.070       27.9        1709           25.6          190            24.5
0.070 < R ≤ 0.090       10.5         279            4.2           27             3.48
0.090 < R ≤ 0.100        2.4          46            0.7            6             0.8
0.100 < R ≤ 0.150        3.4          18            0.3            1             0.1
0.150 < R                0.7           2            0.0            0             0
not reported             3.8          25            0.4            0             0

Table 8.7: R-factor statistics for the CSD and the corpus for this thesis.

Figure 8.4: The mean R-factor of all the CIFs accepted by Acta Crystallographica Section E by month. The error bars show the standard deviation of the values. (Axes: R-factor against month, August 2001 to December 2004 marked.)

Tables 8.5 and 8.6 suggest that the dataset used for this thesis does form a reasonable subset of small molecule crystallography. However, it was observed that the corpus of CIFs used in this study contained a greater percentage of low R-factors than the CSD.
This was expected because this study only considers CIFs published between the years 2000 and 2005 inclusive, whereas the CSD contains much older structures, and there have been significant improvements in crystallographic experiments since those were determined. Figure 8.4 shows the mean R-factor per month of all the CIFs accepted by Acta Crystallographica Section E for publication between November 2000 and August 2005. The data is derived from the 6123 CIFs used in this study that contained both the required R-factor and the journal acceptance date. The graph shows that both the mean and the standard deviation of the R-factor have remained approximately constant since November 2000. The graph also shows that the implementation of the R-factor ≤ 0.05 filter in the protocol removes at most 50% of the structures published in a particular month.

8.4 Bond Lengths

Figure 8.5 shows a histogram of the reported estimated standard deviations (esds) for the 990 bonds which pass the final protocol. It is important to have an idea of the average esd to determine whether or not differences between the bond lengths reported in the CIF and those calculated by GAMESS are significant. The mean esd is 0.003 Å with s = 0.002 Å; differences of approximately 0.003 Å between bond lengths determined by the two methods may therefore be explained by the expected spread of the data. The graph shows that most of the esds are less than 0.005 Å, which might be a suitable filter to introduce in future protocols.

8.4.1 S–X Bonds

The comparison of the bond lengths in the previous chapter indicated that the GAMESS bond length was consistently longer than that reported in the CIF. This lengthening effect was most noticeable for longer bonds, which were identified as S–X bonds (X=C,N).
The seven molecules containing cyclic S–X bonds in the dataset are shown in figure 8.6. There is no clear reason for the discrepancy in the bond lengths, so it was thought that a higher level calculation may be required. The molecules were submitted for calculation by GAMESS using a larger basis set (6-311G**) and the B3LYP functional. Unfortunately these calculations did not complete in the time limit and therefore GAUSSIAN03 was used for the calculations instead.

The GAUSSIAN03 program was a very recent addition to the computing facilities at the UCC and runs on the recently acquired Dove cluster. This is an Opteron-based compute cluster intended for parallel work. It has a head node with dual Opteron 246 CPUs and 4 GB of RAM. There are 8 compute nodes with dual, dual-core Opteron 265 CPUs and 4 GB of RAM. This hardware is significantly faster than that of the Kellogg cluster, which reduces the run times approximately eight-fold. This assumes that the implementation of the DFT code does not scale significantly differently in the two programs. The standard 6-31G*/B3LYP calculation was repeated on GAUSSIAN03 to verify that the two methods produced similar bond lengths for the molecules (see table 8.8).

Figure 8.5: The reported esds of the 990 bonds which pass the final protocol, mean esd = 0.003 Å, s = 0.002 Å. (Histogram: frequency against esd / Å.)

                       bond length / Å
molecule name        GAMESS             GAUSSIAN03
                     a        b         a        b
lh6438molecule1      1.718    1.791     1.719    1.790
lh6379molecule1      1.742    1.732     1.730    1.741
ac6153molecule1      1.759    1.800     1.762    1.804
ob6428molecule1      1.771    1.769     1.766    1.767
bt6436molecule1      1.839    1.893     1.839    1.888
bt6436molecule2      1.838    1.893     1.839    1.890
cv6296molecule1      1.783    1.759     1.783    1.759

Table 8.8: The bond lengths of the S–X (X=C,N) bonds for the seven molecules shown in figure 8.6. Both calculations were performed using 6-31G*/B3LYP.
Three further calculations were performed on each molecule: 6-311G**/B3LYP, 6-31G*/MP2 and 6-311G**/MP2. The results of these calculations are shown in table 8.9. It was observed that simply increasing the size of the basis set did not improve the agreement with the experimental values. However, the calculations at a higher level of theory did show a general improvement. This suggests that 6-31G*/B3LYP is not sufficient for accurate calculations involving second row elements. Unfortunately the time required for calculations at higher levels of theory is often prohibitive. The protocol must therefore be modified to only include bonds involving first row elements.

Figure 8.6: Long bonds to sulphur. The indicated bonds show a large difference between the calculated and CIF bond lengths. (Molecules shown: lh6438molecule1, lh6379molecule1, ac6153molecule1, ob6428molecule1, bt6436molecule1, bt6436molecule2, cv6296molecule1.)

                     CIF              6-311G**/B3LYP    6-31G*/MP2        6-311G**/MP2
                     a       b        a       b         a       b         a       b
lh6438molecule1      1.664   1.756    1.716   1.792     1.707   1.772     1.701   1.772
lh6379molecule1      1.720   1.709    1.782   1.739     1.713   1.726     1.709   1.721
ac6153molecule1      1.739   1.771    1.762   1.802     1.744   1.781     -       -
ob6428molecule1      1.744   1.730    1.763   1.766     1.752   1.754     1.749   1.750
bt6436molecule1      1.788   1.825    1.840   1.893     1.810   1.843     -       -
bt6436molecule2      1.789   1.826    1.840   1.894     1.810   1.843     -       -
cv6296molecule1      1.752   1.749    1.782   1.759     1.750   1.739     1.743   1.734
all values are given in Å to four significant figures

Table 8.9: The S–X (X=C,N) bond lengths as reported in the CIF and calculated at various levels and methods of theory. The labels a and b indicate the bond referred to (see figure 8.6). Where no bond length is shown, the calculation failed to complete in the required time limit.

8.4.2 All Bonds

To this point, all the QQ graphs have been plotted using the same axes limits to make it easier to detect the effect of implementing the various filters.
The limits were chosen so that all outliers would be visible and were −0.25 ≤ ∆R / Å ≤ 0.25 on the y-axis and ± four standard deviations of the standard normal distribution on the x-axis. However, now that the major outliers have been identified and removed for justifiable reasons, it is instructive to examine the distributions in more detail.

Figure 8.7: QQ plot of the 976 bonds in the dataset after the imposition of all the filters. The data is mostly normally distributed with mean ∆R = 0.009 Å and s = 0.010 Å, although unusual behaviour is still observed at both long and short bond lengths. (Axes: ∆R (GAMESS−CIF) / Å against theoretical quantiles.)

The Shapiro-Wilk W test [207] tests the null hypothesis that a sample x1, . . . , xn is drawn from a normally distributed population. A W statistic of 1 is found for data that is perfectly normally distributed. The test also provides a p-value which is used to assess whether or not the observed deviation from normality is significant. For example, a W statistic of 0.97 with a p-value greater than 0.05 gives no evidence to reject the null hypothesis at the 5% significance level. A lower p-value would indicate that the W value is too extreme to be explained by chance variation (i.e. evidence against normal distribution) and the null hypothesis should be rejected.

Figure 8.7 shows the QQ plot of all the bonds in the dataset after the imposition of all the filters. It is observed that although the data is mostly normally distributed there are significant outliers at both ends of the distribution. The Shapiro-Wilk W test gives W = 0.960, p-value = 1.04 × 10^−15; the null hypothesis is therefore rejected, although the W value near unity suggests that the data is almost normally distributed. The deviation from normality is likely to be caused by the tails of the distribution, which are examined in the next sections.
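The construction of a QQ plot of the kind used throughout this chapter can be sketched using only the Python standard library. The sketch below is illustrative and is not the code used to produce the figures: the ∆R values are invented, the function name is hypothetical, and the plotting positions (i − 0.5)/n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pair each sorted sample value with a standard-normal quantile.

    The plotting positions (i - 0.5) / n are an assumed convention; other
    conventions shift the points slightly without changing the picture.
    """
    ordered = sorted(values)
    n = len(ordered)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (standard_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


# Hypothetical bond-length differences in Angstrom (not thesis data).
delta_r = [0.004, 0.012, 0.009, -0.003, 0.015, 0.007, 0.010, 0.041]

points = qq_points(delta_r)
print(f"mean = {mean(delta_r):.4f}, s = {stdev(delta_r):.4f}")
for theoretical, observed in points:
    print(f"{theoretical:+.3f}  {observed:+.3f}")
```

Points lying close to a straight line indicate approximately normal data, while curvature at either end corresponds to the tails discussed above. Where SciPy is available, scipy.stats.shapiro can be applied to the same list of values to obtain the W statistic and p-value.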
8.4.3 C–C Bonds

Figure 8.8 shows the 752 C–C cyclic bonds in the dataset after the imposition of all the filters. The molecules giving rise to the bonds with ∆R > 0.03 Å were examined, and all of these bonds were found to be involved in hydrogen bonds in the crystal that had subsequently been lost in the calculation. An example of this effect is shown in figure 8.9. As expected, the C–O bond length decreases when hydrogen bonding is lost whilst the adjacent C–C bond lengths both increase. The W value was 0.928 and the p-value < 2.2 × 10^−16; the null hypothesis is therefore rejected, although the W value indicates that the data is nearly normally distributed. The tails of the distribution are the likely cause of this deviation from normality and may be explained by the loss of crystal packing effects when the calculation is performed.

Figure 8.8: QQ plot of the 752 C–C cyclic bonds in the dataset after the imposition of the final protocol. The data is mostly normally distributed with mean ∆R = 0.010 Å and s = 0.008 Å. There appears to be a discontinuity at ∆R > 0.03 Å. (Axes: ∆R (GAMESS−CIF) / Å against theoretical quantiles.)

Figure 8.9: The loss of hydrogen bonding can affect the bond length of cyclic bonds. The bond lengths indicated are in Å.

Figure 8.10: QQ plot of the 160 C–N cyclic bonds in the dataset after the imposition of the final protocol. The data appears normally distributed with mean ∆R = 0.007 Å and s = 0.013 Å. (Axes: ∆R (GAMESS−CIF) / Å against theoretical quantiles.)

8.4.4 C–N and C–O Bonds

There are 160 C–N bonds and 54 C–O bonds in the dataset after all the filters are imposed. The QQ plots of these are shown in figures 8.10 and 8.11 respectively. Both bond types show approximately normally distributed data. The Shapiro-Wilk W tests for the C–N and C–O ∆R gave W = 0.962, p-value = 2.00 × 10^−4 and W = 0.953, p-value = 3.23 × 10^−2 respectively.
These tests indicate that the data is not normally distributed, but is almost so. Again the outliers are caused by the loss of crystal packing effects on calculation. There are no other bond types with more than 13 instances in the dataset.

8.4.5 PLATON

Rigid-body model libration corrections can be calculated using PLATON [209]. Unfortunately, these calculations cannot easily be automated because the program cannot be run entirely from the command line. However, to investigate the effect of the possible corrections, the program was run manually on the 129 molecules which contained cyclic bonds that passed the entire protocol. This dataset comprised 990 bonds, but PLATON was unable to calculate libration corrections for 30 of these bonds. In all cases the PLATON corrected bond length was longer than that reported in the CIF (the minimum correction was 0.001 Å and the maximum 0.014 Å); mean ∆R(PLATON−CIF) = 0.003 Å with s = 0.002 Å. The X–S bonds (X=C,N) were lengthened by an average of 0.003 Å, which improved the agreement with the calculated values for these bonds. Mean ∆R(GAMESS−PLATON) = 0.006 Å with s = 0.010 Å; the maximum and minimum ∆R(GAMESS−PLATON) found were 0.064 Å and −0.031 Å respectively. These results suggest that the libration corrected bond lengths agree well with the calculated bond lengths, although the spread of the data is significantly larger than the esd of the bond lengths. Much of this variation is expected to be a result of the loss of crystal packing effects when the calculation is performed on isolated molecules in vacuo.

Figure 8.11: QQ plot of the 54 C–O cyclic bonds in the dataset after the imposition of the final protocol. The data appears normally distributed with mean ∆R = −0.002 Å and s = 0.012 Å. (Axes: ∆R (GAMESS−CIF) / Å against theoretical quantiles.)
Figure 8.12: The Uiso,bond for the 9512 bonds (excluding those to hydrogen) calculated by GAMESS under the original protocol plotted against the temperature at which the experiment was conducted. (Axes: Uiso,bond / Å² against temperature / K.)

8.5 Uiso,bond

The _atom_site_U_iso_or_equiv data item is used to report the equivalent isotropic parameter Ueq for each atom in a CIF. The parameter is expected to depend on temperature. The Uiso,bond between two bonded atoms i and j was defined as:

\[ U_{\mathrm{iso,bond}} = \sqrt{U_{\mathrm{eq},i}^{2} + U_{\mathrm{eq},j}^{2}} \qquad (8.1) \]

where Ueq,i is the reported Ueq for atom i.

Figure 8.12 shows Uiso,bond plotted against temperature for all 9512 bonds (excluding those to hydrogen) that were calculated by GAMESS under the initial protocol. There are several features of interest. Firstly, there are structures which have reportedly been determined at 573, 566, 546 and 393 K. These are not reported as high temperature studies in the literature and look far more reasonable when 273 K is subtracted from the values. It is extremely likely that the authors had incorrectly reported the temperature in Celsius rather than Kelvin. The Uiso,bond appears to be only weakly correlated with temperature (ρ = 0.45).

The bonds giving rise to some of the highest values of Uiso,bond were examined and were found to be in solvent molecules. An example of a structure giving rise to a high Uiso,bond is shown in figure 8.13. The extreme eccentricity of the thermal displacement ellipsoids in the solvent molecule suggests that there is likely to be unreported disorder (minor conformations) present.

Figure 8.14 shows Uiso,bond against temperature for the bonds which pass the final protocol (although both cyclic and acyclic bonds are permitted). It is observed that there is still only weak correlation (ρ = 0.48) between temperature and Uiso,bond. The largest values of Uiso,bond were from the C–F bonds in 2,2,2-trifluoroethanol molecules.
These molecules should have been removed as solvents but were missed. The list of solvent/ion/guest molecules was compiled by hand and is therefore likely to have some omissions. The list of molecules currently identified as solvent/ion/guests is given in full in appendix F.

Figure 8.13: Displacement ellipsoids at the 50% probability level showing extremely large thermal motion for the N,N-dimethylformamide molecule. The size and extreme eccentricity of the O5 ellipsoid in particular suggests that minor conformations may be present. The figure has been taken from the article by Li and Xiao [206].

Figure 8.14: The Uiso,bond for the 990 bonds (excluding those to hydrogen) that pass the final protocol plotted against the temperature at which the experiment was conducted. (Axes: temperature / K against Uiso,bond / Å².)

Figure 8.15: The total calculation time for the geometry optimisation of structures taken from crystallography scales less favourably than the predicted cubic dependence (red line). However, the predicted run time consistently overestimates the actual value for n < 15. (Axes: number of non-hydrogen atoms against total time / s on a logarithmic scale; fitted curve y = 12.03x^3.334, predicted curve y = (3.095x + 1.5268)^3.)

Figure 8.14 suggests that a further filter might be added to the protocol, namely that bonds with Uiso,bond > 0.1 Å² should be omitted from analysis. Unfortunately Uiso,bond is a derived quantity and is not reported in the CIF, which makes the filter more difficult to implement.
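Because Uiso,bond must be derived from the per-atom Ueq values, the proposed filter needs a small amount of extra computation. The following is a minimal sketch of equation 8.1 and the suggested 0.1 Å² cut-off. The function names and the Ueq values are hypothetical and purely illustrative; they are not taken from any CIF or from the software used in this work.

```python
import math

def u_iso_bond(u_eq_i: float, u_eq_j: float) -> float:
    """Equation 8.1: combine the Ueq values (in Å^2) of two bonded atoms."""
    return math.sqrt(u_eq_i ** 2 + u_eq_j ** 2)

def passes_u_iso_filter(u_eq_i: float, u_eq_j: float, threshold: float = 0.1) -> bool:
    """Hypothetical filter: keep a bond only if Uiso,bond does not exceed the threshold."""
    return u_iso_bond(u_eq_i, u_eq_j) <= threshold

# Made-up Ueq values (Å^2) for two bonds, chosen only to exercise the filter.
bonds = {
    "C1-C2": (0.030, 0.040),  # modest displacement parameters
    "C9-F1": (0.090, 0.120),  # large displacement, e.g. a solvent molecule that was missed
}

# Bonds that would be retained for analysis under the proposed filter.
kept = [name for name, (u_i, u_j) in bonds.items() if passes_u_iso_filter(u_i, u_j)]
```

In this sketch the first bond has Uiso,bond of about 0.05 Å² and is kept, while the second has about 0.15 Å² and is omitted.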
8.6 Time

Figure 8.15 shows that the total run times for the geometry optimisation of the structures obtained from crystallography scale less favourably (n^3.334) than those that were already optimised using MOPAC, which scale as n^2.9469. It is observed that there is still a large variation of run times for a particular number of non-hydrogen atoms. Table 8.10 shows that run times predicted using equation 5.1 are reasonable, the largest difference between the predicted and actual average run times being approximately 4 hours.

Number of     Mean total   Standard        Predicted mean   Predicted standard
non-H atoms   time / s     deviation / s   total time / s   deviation / s
 4              2100         1600            2700              980
 5              3500         2900            4900             2400
 6              9300        12000            8100             4400
 7             10000         8100           12000             7300
 8             16000        11000           18000            11000
 9             32000        35000           25000            15000
10             25000        15000           34000            21000
11             40000        23000           45000            28000
12             59000        35000           58000            36000
13             80000        44000           73000            45000
14             76000        48000           90000            56000
15            110000        61000          110000            84000

Table 8.10: The observed run times are reasonably well predicted using equation 5.1. The largest observed deviation occurs for molecules containing 14 non-hydrogen atoms, with the predicted average run time overestimating the true average run time by almost four hours. The predicted standard deviation consistently underestimated the observed values. All values are given to 2 significant figures.

These comparisons support the theory that it is possible to predict the average run times for calculations using simple models, although the standard deviation was consistently underestimated. It is important to note that attempting to predict the time required for a single calculation is nonsensical, given the extremely large standard deviations observed and that very similar starting structures may require significantly different calculation times (see figure 8.1). However, the upper bound on the time to calculate a sufficiently large dataset may be predicted with reasonable confidence.
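The comparison in Table 8.10 can be reproduced with a few lines of code. The sketch below assumes that the cubic curve shown in Figure 8.15, y = (3.095x + 1.5268)^3, is the prediction of equation 5.1; that equation is not reproduced in this section, so this is an assumption, although the predicted means in the table are consistent with it. The function and variable names are hypothetical.

```python
import math

# Observed mean total run times (s) from Table 8.10, keyed by number of non-hydrogen atoms.
observed_mean = {4: 2100, 5: 3500, 6: 9300, 7: 10000, 8: 16000, 9: 32000,
                 10: 25000, 11: 40000, 12: 59000, 13: 80000, 14: 76000, 15: 110000}

def predicted_time(n: int) -> float:
    """Cubic curve from Figure 8.15 (assumed here to correspond to equation 5.1)."""
    return (3.095 * n + 1.5268) ** 3

def round_sig(x: float, sig: int = 2) -> float:
    """Round x to the given number of significant figures."""
    return round(x, sig - 1 - math.floor(math.log10(abs(x))))

# Predicted mean run times, rounded to 2 significant figures as in the table.
predicted_mean = {n: round_sig(predicted_time(n)) for n in observed_mean}

# Atom count with the largest gap between predicted and observed mean run time.
worst_n = max(observed_mean, key=lambda n: abs(predicted_mean[n] - observed_mean[n]))
worst_gap_hours = (predicted_mean[worst_n] - observed_mean[worst_n]) / 3600
```

Under this assumption the largest discrepancy falls at 14 non-hydrogen atoms, where the prediction exceeds the observed mean by a little under four hours, in line with the caption of Table 8.10.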
It is suggested that such a dataset should contain at least 100 molecules.

8.7 Applying the Protocol

Figure 8.16 shows the final protocol developed, with each of the filters colour coded to indicate the reason for its implementation. There are five reasons:

Crystallographic effects: These filters remove many of the poorly determined experimental values and are mostly based on data items reported in the CIF; the exception is the removal of solvent/ion/guest molecules.

GAMESS / time limitation: These filters are required for the calculation to be completed within the time limit or for it to give meaningful results. For example, there is no point in attempting to compute the properties of atoms which are not well described by the basis set being used.

Methodology: The calculation is performed on single molecules in vacuo, which removes many of the short contacts that are made in the crystalline form. These filters remove bonds which are likely to be affected by these disparities.

Author error: Such errors are unfortunately unavoidable. Although the use of validation tools before publication should reduce these in future, they will be present in the legacy literature.

X–Y (Y = Si, P, S, Cl, Br): The basis set and level of theory chosen (6-31G*/B3LYP) do not appear to be able to accurately calculate the bond lengths of bonds involving second row (or heavier) elements. Increasing the level of theory, to MP2 for example, would allow the calculation of heavier elements, but such calculations require much longer run times.

Figure 8.16: The filters imposed on the data colour coded by reason.

The identification of particular filters that relate to crystallographic effects allows the literature to be searched for high-quality structures (i.e. those
that pass all the crystallographic effect filters). The percentage of molecules extracted from the CIFs which pass each crystallographic filter and all the filters is shown in table 8.11.

Filter                   % pass
R-factor ≤ 0.05          66.5
Not solvent/ion/guest    89.4
No non-H refinement      83.0
Temperature ≤ 200 K      32.9
No disorder              92.8
Pass all filters         18.0

Table 8.11: The percentage of the molecules which pass each individual crystallographic filter and the percentage which pass all the filters.

Using the esd of bond lengths it is possible to estimate the maximum torsion angle in a toluene molecule that may be accounted for by random error in the coordinates as 0.85° (see figure 8.17). This agrees well with the esd reported for the torsion angles in the CIF, which is of the order of 0.5°.

Figure 8.17: The maximum torsion angle of a phenyl ring that can be accounted for by the average esd in the atomic coordinates and a bond length of 1.4 Å is 0.85°.

The 6455 molecules obtained from the CIFs were searched for mono-substituted phenyl rings (the substituent being a carbon atom) contained in molecules that pass all the crystallographic filters of the final protocol. The largest deviation from planarity was found to be 4.1° for the cyclic atoms and 7.5° between the external carbon and the ring.

GAUSSIAN03 allows the calculation of optimised geometries with specified constraints. These constraints allow the torsion angle between particular atoms to be fixed at a particular value. Figure 8.18 shows the chosen constraints placed on the torsion angles of toluene molecules. The energies of the geometry-optimised structures relative to the completely flat molecule were found to be 0.35 kJ mol⁻¹ and 2.25 kJ mol⁻¹ for torsion angles t1 = t2 = 175° and t1 = t2 = 170° respectively. These calculations suggest that the energy required to produce the torsion angle of 7.5° is approximately 2 kJ mol⁻¹. However, there is no simple interaction apparent in the crystal structure to
explain this distortion from planarity.

Figure 8.18: The torsion angles which were constrained during the geometry optimisation of toluene. The values shown are in degrees.

8.8 Conclusions

The information in CIFs can be parsed to more generally machine-understandable formats with extremely high recall and precision. This data includes the connection tables of the molecules and some derived data, which allows molecule-based data-driven science to be performed. The use of validation tools such as checkCIF before publication means that there are few errors in the data.

The comparison of the reported bond lengths to those calculated by GAMESS suggests that variations in bond length of less than 0.03 Å are the result of random error. The agreement between the values is improved by using rigid-body libration-corrected bond lengths. Differences in bond length greater than 0.03 Å may in general be explained by identifiable effects, but may merit further examination. The results have shown that the poor agreement between the bond lengths calculated by MOPAC and GAMESS for aromatic nitrogen-bearing moieties is likely to be caused by MOPAC.

The protocol developed allowed the identification of high-quality crystallographic structures (increasing precision at the expense of recall). Day is conducting further work which implements this protocol and involves inorganic structures [210]. The bond lengths in these high-quality structures, which account for ca. 20% of the recently reported structures, have an esd of 0.003 Å. These structures form a dataset that can be used for the identification of interesting structures which can be reused to form the basis of future experiments.
Appendix A Computational Chemistry

A.1 ab initio Calculations

Using atomic units, the time-independent molecular Schrödinger Hamiltonian is (ignoring all relativistic terms)

\[
H = -\frac{1}{2}\sum_i \nabla_i^2 - \sum_{i,A}\frac{Z_A}{|\mathbf{r}_i-\mathbf{R}_A|} + \sum_{i>j}\frac{1}{|\mathbf{r}_i-\mathbf{r}_j|} + \sum_{A>B}\frac{Z_A Z_B}{|\mathbf{R}_A-\mathbf{R}_B|} \tag{A.1}
\]

where i, j denote electrons at r_i, r_j and A, B denote nuclei with charges Z_A, Z_B at R_A, R_B. Solutions (energies and wavefunctions) of the Schrödinger equation are obtained from

\[
H\Psi = E\Psi \tag{A.2}
\]

for fixed positions of the nuclei. E ≡ E(R) is therefore the potential energy surface. The fundamental expansion functions used to find approximate solutions of Schrödinger's equation are Slater determinants, which have the form

\[
\Psi = \frac{1}{\sqrt{n!}}
\begin{vmatrix}
\phi_1(1) & \phi_1(2) & \cdots & \phi_1(n) \\
\phi_2(1) & \phi_2(2) & \cdots & \phi_2(n) \\
\vdots & & & \vdots \\
\phi_n(1) & \phi_n(2) & \cdots & \phi_n(n)
\end{vmatrix} \tag{A.3}
\]

\[
\phantom{\Psi} = \mathcal{A}(\phi_1\phi_2\phi_3\cdots\phi_n) \tag{A.4}
\]

\[
\mathcal{A} = \frac{1}{\sqrt{n!}}\sum_{u}^{n!}\sigma_u P_u \tag{A.5}
\]

where P_u is a permutation of the coordinates in φ1φ2φ3···φn. A permutation is even (σ_u = +1) or odd (σ_u = −1) according to whether it is made up of an even or odd number of single interchanges. These determinants obey the Pauli principle.

Ideally hydrogenic type functions

\[
R_{nl}(r)\exp(-Z_n r)\,Y_{lm}(\theta,\phi) \tag{A.6}
\]

would be used as expansion functions for molecular orbitals. These can also be written as

\[
r^{n}x^{p}y^{q}z^{s}\exp(-Z_n r) \tag{A.7}
\]

from which it is seen that such a set includes s, p, d, f, etc. atomic orbitals. However, it is impossible to (non-numerically) evaluate the one- and two-electron integrals which arise in the evaluation of matrix elements if such functions are used. These functions are usually referred to as Slater functions, or Slater Type Orbitals (STOs). Boys [211] suggested using Gaussian basis functions, or Gaussian Type Orbitals (GTOs)

\[
x^{p}y^{q}z^{s}\exp(-a r^{2}) \tag{A.8}
\]

with p, q, s integers, and r² = x² + y² + z². The angular parts of these functions are the same as the STOs but the radial part is different. The derivative of an s Gaussian is zero at the origin, unlike the STO.
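The difference in behaviour at the origin can be checked numerically. The following minimal sketch compares a one-sided finite-difference slope at r = 0 for an s-type Slater function and an s-type Gaussian (equation A.8 with p = q = s = 0). The exponents and function names are arbitrary, illustrative choices rather than values from any basis set.

```python
import math

def sto(r: float, zeta: float = 1.0) -> float:
    """s-type Slater function exp(-zeta * r)."""
    return math.exp(-zeta * r)

def gto(r: float, a: float = 1.0) -> float:
    """s-type Gaussian function exp(-a * r^2)."""
    return math.exp(-a * r * r)

def slope_at_origin(f, h: float = 1e-6) -> float:
    """One-sided finite-difference estimate of df/dr at r = 0."""
    return (f(h) - f(0.0)) / h

sto_slope = slope_at_origin(sto)  # close to -zeta: the STO has a cusp at the nucleus
gto_slope = slope_at_origin(gto)  # close to zero: the Gaussian is flat at the origin
```

With these illustrative exponents the Slater function has a slope close to −1 at the origin, whereas the slope of the Gaussian is numerically indistinguishable from zero.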
The Gaussian dies off with exponential quadratic dependence compared to the STO's linear dependence for large r. Thus GTOs have a totally different behaviour to STOs at both small r and large r. However, the key advantage of GTOs is that all the required integrals are easy to calculate. This follows from the fact that the product of two Gaussians is another Gaussian.

To overcome the less desirable short- and long-range behaviour of Gaussians, it is common to use fixed combinations of one to six Gaussians as basis functions, chosen to make the combination look more like STOs. STO-3G, for example, means the use of a contracted combination of three Gaussians to represent a Slater function. A specific notation has been developed in computational chemistry programs for basis sets of this sort, which are often called Pople basis sets. The notation of the basis set is of the form N-ijG or N-ijkG, where N is the number of Gaussian primitives (GTOs) for the inner shells, and ij or ijk are the numbers of Gaussian primitives for contractions in the valence shell. N-ijG* denotes a polarized basis set augmented with d-type functions on heavy atoms only, whilst N-ijG** or N-ijG(d,p) specifies a basis set with p-functions on hydrogen atoms as well.

A.1.1 Closed Shell Self Consistent Field Theory

The energy expression is

\[
E = \langle\Psi|H|\Psi\rangle = 2\sum_i h_{ii} + \sum_{ij}\left[2(ii|jj) - (ij|ij)\right] \tag{A.9}
\]

with \(\Psi = \mathcal{A}(\psi_1^2\psi_2^2\cdots\psi_n^2)\), where the superscript 2 indicates dual occupancy. Each orbital ψ_i is expressed in terms of the basis functions

\[
\psi_i = \sum_{\alpha=1}^{m} c_{\alpha i}\,\eta_\alpha \tag{A.10}
\]

The orbitals which make the energy stationary with respect to variations of the molecular orbital coefficients c_αi, maintaining orbital orthonormality, are then found. If n of these orbitals φ_i have been found, there will be (m − n) other orbitals φ_a (called unoccupied or virtual orbitals) which obey ⟨φ_a|φ_i⟩ = 0, because there are m basis functions in total. The condition that the energy is stationary with respect to the variation

\[
\phi_k \rightarrow \phi_k + \epsilon\phi_a \qquad (k = 1, 2, \ldots, n;\; a = n+1, n+2, \ldots, m) \tag{A.11}
\]

must therefore be found. Substituting A.11 in A.9 and setting everything with a coefficient of ε to zero gives the stationary condition. For the one-electron part

\[
\langle\phi_k + \epsilon\phi_a|h|\phi_k + \epsilon\phi_a\rangle = h_{kk} + \epsilon(h_{ak} + h_{ka}) + \epsilon^{2}h_{aa} \tag{A.12}
\]

the coefficient is therefore 2h_ak (using hermiticity). Similarly for the two-electron part

\[
(k{+}\epsilon a\;\, k{+}\epsilon a\,|\,jj) = (kk|jj) + \epsilon\left[(ka|jj) + (ak|jj)\right] + \cdots \tag{A.13}
\]
\[
(ii\,|\,k{+}\epsilon a\;\, k{+}\epsilon a) = (ii|kk) + \epsilon\left[(ii|ka) + (ii|ak)\right] + \cdots \tag{A.14}
\]
\[
(k{+}\epsilon a\;\, j\,|\,k{+}\epsilon a\;\, j) = (kj|kj) + \epsilon\left[(aj|kj) + (kj|aj)\right] + \cdots \tag{A.15}
\]
\[
(i\;\, k{+}\epsilon a\,|\,i\;\, k{+}\epsilon a) = (ik|ik) + \epsilon\left[(ik|ia) + (ia|ik)\right] + \cdots \tag{A.16}
\]

Using the properties of two-electron integrals and replacing \(\sum_i\) by \(\sum_j\) where appropriate, the stationary condition is obtained

\[
4h_{ak} + \sum_j\left[8(ak|jj) - 4(aj|kj)\right] = 0 \tag{A.17}
\]

or

\[
h_{ak} + \sum_j\left[2(ak|jj) - (aj|kj)\right] = 0 \tag{A.18}
\]

The Fock hamiltonian F is defined such that

\[
\langle\phi_a|F|\phi_k\rangle = h_{ak} + \sum_j\left[2(ak|jj) - (aj|kj)\right] \tag{A.19}
\]

which is more recognisable as a hamiltonian when written as

\[
F(1) = h(1) + \sum_j 2\int\frac{\phi_j^{2}(2)}{r_{12}}\,d\mathbf{r}_2 - \sum_j\int\frac{\phi_j(2)\phi_j(1)}{r_{12}}\,d\mathbf{r}_2\,P_{12} \tag{A.20}
\]

where P_12 φ_k(1) = φ_k(2). The Fock hamiltonian is therefore an effective one-electron hamiltonian, including a kinetic term, nuclear attraction and an average potential term made up of an n-electron Coulomb part and an electron exchange part. Thus from A.18 and A.19, the condition that the energy is stationary with respect to variations of the molecular orbital coefficients is

\[
F_{ak} \equiv (\phi_a|F|\phi_k) = 0 \qquad (k = 1, 2, \ldots, n;\; a = n+1, n+2, \ldots, m) \tag{A.21}
\]

The orbitals that satisfy this condition may be obtained by solving the canonical secular equations

\[
\sum_\beta(\eta_\alpha|F - \epsilon_i|\eta_\beta)\,c_{\beta i} = 0 \tag{A.22}
\]

where ε_i is the i-th energy, for which it is known that the resulting orbitals obey

\[
F_{pq} = \epsilon_p\delta_{pq} \tag{A.23}
\]

Thus the solutions of the secular equations obey the conditions of A.21.
In practice, because F is a hamiltonian and ε_i are the corresponding energies of the orbitals φ_i, the n lowest eigensolutions of A.22 are identified as the occupied orbitals and the (m − n) remaining solutions as unoccupied orbitals. F is defined in terms of its solution, so an iterative procedure is required to solve the Self Consistent Field equations A.22:

(i) Select the geometry of the molecule and the basis set.

(ii) Evaluate the basis function integrals h_αβ, (αβ|γδ) and S_αβ.

(iii) Guess some coefficients c_αi for the occupied orbitals.

(iv) Form the density matrix \(D_{\alpha\beta} = \sum_{i}^{n} c_{\alpha i}c_{\beta i}\).

(v) Construct the Fock matrix

\[
(\eta_\alpha|F|\eta_\beta) = h_{\alpha\beta} + \sum_{\gamma\delta}D_{\gamma\delta}\left(2(\alpha\beta|\gamma\delta) - (\alpha\gamma|\beta\delta)\right) \tag{A.24}
\]

(vi) Solve the secular equations A.22. Go to step (iv).

In step (iv), other than in the first iteration, check whether D has changed (to within a sufficiently small tolerance) from the previous iteration; if it has not, the equations have converged and the energy is then calculated using

\[
E = 2\sum_{\alpha\beta}D_{\alpha\beta}h_{\alpha\beta} + \sum_{\alpha\beta\gamma\delta}D_{\alpha\beta}D_{\gamma\delta}\left(2(\alpha\beta|\gamma\delta) - (\alpha\gamma|\beta\delta)\right) \tag{A.25}
\]

A.2 Density Functional Theory

Since 1993 DFT has become the most often used approach of computational quantum chemistry for the study of ground state molecular properties. In DFT, the total energy is expressed in terms of the total electron density rather than the wavefunction. In this type of calculation, there is an approximate Hamiltonian and an approximate expression for the total electron density. DFT methods can be very accurate for comparatively little computational cost. The drawback is that, unlike ab initio methods, there is no systematic way to improve the methods by improving the form of the functional.

Physicists have been promoting the use of DFT since Slater's contribution in 1951 [212], which suggested the replacement of the exchange term in the Hartree-Fock method by the Dirac potential [213], which he argued contained both the exchange and correlation effects. This original form made the molecules significantly overbound.
There are now no problems with matrix element evaluation, and DFT codes which use local functionals are now less expensive to use than Hartree-Fock codes.

Traditional methods in electronic structure theory, in particular Hartree-Fock theory and its descendants, are based on the complicated many-electron wavefunction. The main objective of DFT is to replace the many-body electronic wavefunction with the electronic density as the basic quantity. Whereas the many-body wavefunction is dependent on 3N variables, three spatial variables for each of the N electrons, the density is only a function of three variables and is a simpler quantity to deal with both conceptually and practically.

If N is the number of electrons then the density ρ(r) is defined by

\[
\rho(\mathbf{r}_1) = N\int\cdots\int|\Psi|^{2}\,ds_1\,d\mathbf{x}_2\cdots d\mathbf{x}_N \tag{A.26}
\]

where Ψ(x_1, x_2, …, x_N) is the electronic wavefunction for the molecule. It is observed that

\[
\int\rho(\mathbf{r})\,d\mathbf{r} = N \tag{A.27}
\]

The most common implementation of density functional theory is through the Kohn-Sham method [214]. The Kohn-Sham equations for the Kohn-Sham orbitals φ_i are

\[
\left(-\frac{1}{2}\nabla^{2} + v(\mathbf{r}) + \int\frac{\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}' + v_{xc}(\mathbf{r})\right)\phi_i(\mathbf{r}) = \epsilon_i\phi_i(\mathbf{r}) \tag{A.28}
\]

where v_xc is the exchange-correlation potential. If this can be exactly determined, the exact density is accessible. However, this remains unlikely and currently semi-empirical functionals are used instead. One of the most frequently used functionals is B3LYP [215, 216, 217]. B3LYP is a hybrid functional in which the exchange energy, in this case from Becke's exchange functional, is combined with the exact exchange energy from Hartree-Fock theory. Three parameters define the hybrid functional, specifying how much of the exact exchange is mixed in. The adjustable parameters in hybrid functionals are generally fitted to a training set of molecules.
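For illustration, the three-parameter mixing can be written out explicitly. The form below is the commonly quoted expression for B3LYP and is given here as a sketch of how the parameters enter, not as a statement of the exact definition used by any particular program; implementations differ, for example, in which VWN correlation functional is used.

```latex
\[
E_{xc}^{\mathrm{B3LYP}} = (1 - a_0)\,E_{x}^{\mathrm{LSDA}} + a_0\,E_{x}^{\mathrm{HF}}
  + a_x\,\Delta E_{x}^{\mathrm{B88}}
  + a_c\,E_{c}^{\mathrm{LYP}} + (1 - a_c)\,E_{c}^{\mathrm{VWN}},
\qquad a_0 = 0.20,\quad a_x = 0.72,\quad a_c = 0.81
\]
```

Here a_0 sets the fraction of exact (Hartree-Fock) exchange, a_x scales Becke's gradient correction to the local exchange, and a_c mixes the LYP and VWN correlation functionals.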
Unfortunately, although the results obtained with these functionals are usually sufficiently accurate for most applications, there is no systematic way of improving them (in contrast to some of the traditional wavefunction-based methods like configuration interaction or coupled cluster theory). Hence in the current DFT approach it is not possible to estimate the error of the calculations without comparing them to other methods or experiment.

Within the framework of Kohn-Sham DFT, the intractable many-body problem of interacting electrons in a static external potential is reduced to a tractable problem of non-interacting electrons moving in an effective potential. The effective potential includes the external potential and the effects of the Coulomb interactions between the electrons.

A.3 Semi-Empirical Methods

Semi-empirical quantum chemistry methods are based on the Hartree-Fock formalism but make many approximations and obtain some parameters from empirical data. They are very important in computational chemistry for treating large molecules where the full Hartree-Fock method without approximations is too expensive. The use of empirical parameters appears to allow some inclusion of electron correlation effects into the methods.

Within the framework of Hartree-Fock calculations, some pieces of information (such as two-electron integrals) are sometimes approximated or completely omitted. In order to correct for this loss, semi-empirical methods are parameterised. That is, their results are fitted by a set of parameters, normally in such a way as to produce results that best agree with experimental data, but sometimes to agree with ab initio results.

Semi-empirical calculations are much faster than ab initio methods but the results can be very wrong if the molecule being computed is not similar enough to the molecules in the database used to parameterise the method.
Semi-empirical calculations have been most successful in the description of organic chemistry, where only a few elements are used extensively and molecules are of moderate size.

AM1 (Austin Model 1) is a semi-empirical method for the quantum calculation of molecular electronic structure in computational chemistry. It is based on the Neglect of Diatomic Differential Overlap (NDDO) integral approximation [218]. Specifically, it is a generalization of the Modified Neglect of Diatomic Overlap (MNDO) approximation. AM1 was developed by Dewar and co-workers and published in 1985 [219].

AM1 is an attempt to improve the MNDO model by reducing the repulsion of atoms at small separation. The atomic core terms in the MNDO equations were modified through the addition of off-centre attractive and repulsive Gaussian functions. The complexity of the parameterisation problem increased in AM1, as the number of parameters per atom increased from 7 in MNDO to 13–16 in AM1.

The PM3 method (Parameterised Model 3) is also based on the NDDO integral approximation. The PM3 method uses the same formalism and equations as the AM1 method. The only differences are that PM3 uses two Gaussian functions for the core repulsion function, instead of the variable number used by AM1 (which uses between one and four Gaussians per element), and that the numerical values of the parameters are different. Other differences lie in the methodology used during the parameterisation: whereas AM1 takes some of the parameter values from spectroscopic measurements, PM3 treats them as values which may be optimised.

The PM3 method was developed by Stewart and first published in 1989 [220, 221]. It is implemented in the MOPAC program, along with the related AM1, MNDO and MINDO methods. The original PM3 publication included parameters for the following elements: H, C, N, O, F, Al, Si, P, S, Cl, Br, and I. Many other elements, mostly metals, have subsequently been parameterised.
Appendix B Regular Expressions in Java

The following is taken from the Java documentation on regular expressions as defined in the package java.util.regex [222].

Table B.1: Regular expression constructs as specified by Java

Summary of regular-expression constructs

Construct            Matches

Characters
x                    The character x
\\                   The backslash character
\0n                  The character with octal value 0n (0 <= n <= 7)
\0nn                 The character with octal value 0nn (0 <= n <= 7)
\0mnn                The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh                 The character with hexadecimal value 0xhh
\uhhhh               The character with hexadecimal value 0xhhhh
\t                   The tab character ('\u0009')
\n                   The newline (line feed) character ('\u000A')
\r                   The carriage-return character ('\u000D')
\f                   The form-feed character ('\u000C')
\a                   The alert (bell) character ('\u0007')
\e                   The escape character ('\u001B')
\cx                  The control character corresponding to x

Character classes
[abc]                a, b, or c (simple class)
[^abc]               Any character except a, b, or c (negation)
[a-zA-Z]             a through z or A through Z, inclusive (range)
[a-d[m-p]]           a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]         d, e, or f (intersection)
[a-z&&[^bc]]         a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]        a through z, and not m through p: [a-lq-z] (subtraction)

Predefined character classes
.                    Any character (may or may not match line terminators)
\d                   A digit: [0-9]
\D                   A non-digit: [^0-9]
\s                   A whitespace character: [ \t\n\x0B\f\r]
\S                   A non-whitespace character: [^\s]
\w                   A word character: [a-zA-Z_0-9]
\W                   A non-word character: [^\w]

POSIX character classes (US-ASCII only)
\p{Lower}            A lower-case alphabetic character: [a-z]
\p{Upper}            An upper-case alphabetic character: [A-Z]
\p{ASCII}            All ASCII: [\x00-\x7F]
\p{Alpha}            An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}            A decimal digit: [0-9]
\p{Alnum}            An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}            Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}            A visible character: [\p{Alnum}\p{Punct}]
\p{Print}            A printable character: [\p{Graph}\x20]
\p{Blank}            A space or a tab: [ \t]
\p{Cntrl}            A control character: [\x00-\x1F\x7F]
\p{XDigit}           A hexadecimal digit: [0-9a-fA-F]
\p{Space}            A whitespace character: [ \t\n\x0B\f\r]

java.lang.Character classes (simple java character type)
\p{javaLowerCase}    Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}    Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}   Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}     Equivalent to java.lang.Character.isMirrored()

Classes for Unicode blocks and categories
\p{InGreek}          A character in the Greek block (simple block)
\p{Lu}               An uppercase letter (simple category)
\p{Sc}               A currency symbol
\P{InGreek}          Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]   Any letter except an uppercase letter (subtraction)

Boundary matchers
^                    The beginning of a line
$                    The end of a line
\b                   A word boundary
\B                   A non-word boundary
\A                   The beginning of the input
\G                   The end of the previous match
\Z                   The end of the input but for the final terminator, if any
\z                   The end of the input

Greedy quantifiers
X?                   X, once or not at all
X*                   X, zero or more times
X+                   X, one or more times
X{n}                 X, exactly n times
X{n,}                X, at least n times
X{n,m}               X, at least n but not more than m times

Reluctant quantifiers
X??                  X, once or not at all
X*?                  X, zero or more times
X+?                  X, one or more times
X{n}?                X, exactly n times
X{n,}?               X, at least n times
X{n,m}?              X, at least n but not more than m times

Possessive quantifiers
X?+                  X, once or not at all
X*+                  X, zero or more times
X++                  X, one or more times
X{n}+                X, exactly n times
X{n,}+               X, at least n times
X{n,m}+              X, at least n but not more than m times

Logical operators
XY                   X followed by Y
X|Y                  Either X or Y
(X)                  X, as a capturing group

Back references
\n                   Whatever the nth capturing group matched

Quotation
\                    Nothing, but quotes the following character
\Q                   Nothing, but quotes all characters until \E
\E                   Nothing, but ends quoting started by \Q

Special constructs (non-capturing)
(?:X)                X, as a non-capturing group
(?idmsux-idmsux)     Nothing, but turns match flags on - off
(?idmsux-idmsux:X)   X, as a non-capturing group with the given flags on - off

Backslashes, escapes, and quoting

The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.

It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.

Backslashes within string literals in Java source code are interpreted as required by the Java Language Specification as either Unicode escapes or other character escapes.
It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.

Character Classes

Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.

The precedence of character-class operators is as follows, from highest to lowest:

1. Literal escape   \x
2. Grouping         [...]
3. Range            a-z
4. Union            [a-e][i-u]
5. Intersection     [a-z&&[aeiou]]

Note that a different set of metacharacters is in effect inside a character class than outside a character class. For instance, the regular expression '.' loses its special meaning inside a character class, while the expression '-' becomes a range-forming metacharacter.

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

• A newline (line feed) character ('\n'),
• A carriage-return character followed immediately by a newline character ('\r\n'),
• A standalone carriage-return character ('\r'),
• A next-line character ('\u0085'),
• A line-separator character ('\u2028'), or
• A paragraph-separator character ('\u2029').

If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
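The backslash-doubling rule described earlier in this appendix for Java string literals has a close analogue in other languages. The sketch below uses Python's `re` module purely as a stand-in to illustrate the same three cases; it is not the Java API, and the two languages differ in detail (for instance, Python does not reject the literal "\(hello\)" in the way the Java compiler does).

```python
import re

# "\b" inside an ordinary string literal is a single backspace character (U+0008),
# so this pattern matches a literal backspace rather than a word boundary.
backspace_pattern = re.compile("\b")

# Doubling the backslash passes the two characters \ and b to the regex engine,
# which interprets them as a word boundary.
boundary_pattern = re.compile("\\bcat\\b")

# Doubled backslashes before the parentheses make them match literally.
paren_pattern = re.compile("\\(hello\\)")

found_backspace = backspace_pattern.search("a\bb") is not None
found_word = boundary_pattern.search("the cat sat") is not None
found_inside = boundary_pattern.search("concatenate") is not None
found_parens = paren_pattern.fullmatch("(hello)") is not None
```

Here the single-backslash pattern finds the backspace character, the doubled form matches "cat" only as a whole word, and the escaped parentheses match the string (hello).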
The regular expression '.' matches any character except a line terminator unless the DOTALL flag is specified.

By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.

Groups and capturing

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)

Group zero always stands for the entire expression.

Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

Groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total.

Appendix C Backus-Naur Form

The following definition for BNF is taken from The World of Programming Languages by Marcotty and Ledgard [122].
The meta-symbols of BNF are:

::=    meaning "is defined as"
|      meaning "or"
< >    angle brackets used to surround category names

The angle brackets distinguish syntax rule names (also called non-terminal symbols) from terminal symbols, which are written exactly as they are to be represented. A BNF rule defining a nonterminal has the form:

    <nonterminal> ::= sequence of alternatives consisting of strings of terminals or nonterminals separated by the meta-symbol |

For example, the BNF production for a mini-language is:

    <program> ::= program
                      <declaration_sequence>
                  begin
                      <statements_sequence>
                  end ;

This shows that a mini-language program consists of the keyword program followed by the declaration sequence, then the keyword begin and the statements sequence, and finally the keyword end and a semicolon.

In fact, many authors have introduced some slight extensions of BNF for ease of use:

• optional items are enclosed in the meta-symbols [ and ], for example

    <if_statement> ::= if <boolean_expression> then
                           <statement_sequence>
                       [ else
                           <statement_sequence> ]
                       end if ;

• repetitive items (zero or more times) are enclosed in the meta-symbols { and }, for example

    <identifier> ::= <letter> { <letter> | <digit> }

  this rule is equivalent to the recursive rule:

    <identifier> ::= <letter> | <identifier> [ <letter> | <digit> ]

• terminals of only one character are surrounded by quotes (‘’) to distinguish them from meta-symbols, for example:

    <statement_sequence> ::= <statement> { ‘;’ <statement> }

• terminal and non-terminal symbols are distinguished by using bold face for terminals and suppressing < and > around non-terminals. This greatly improves the readability.
The example then becomes:

if statement ::= if boolean expression then statement sequence [ else statement sequence ] end if ‘;’

BNF’s syntax may be represented with a BNF like the following:

syntax ::= { rule }
rule ::= identifier ‘::=’ expression
expression ::= term { ‘|’ term }
term ::= factor { factor }
factor ::= identifier | quoted symbol | ‘(’ expression ‘)’ | ‘[’ expression ‘]’ | ‘{’ expression ‘}’
identifier ::= letter { letter | digit }
quoted symbol ::= ‘‘’ { any character } ‘’’

Appendix D JFlex Lexical Rules

The syntax of the lexical rules section of a JFlex program is described by the following BNF grammar (terminal symbols are enclosed in ‘quotes’). This has been taken from the JFlex manual [132].

LexicalRules ::= Rule+
Rule ::= [StateList] [‘^’] RegExp [LookAhead] Action
       | [StateList] ‘<<EOF>>’ Action
       | StateGroup
StateGroup ::= StateList ‘{’ Rule+ ‘}’
StateList ::= ‘<’ Identifier (‘,’ Identifier)* ‘>’
LookAhead ::= ‘$’ | ‘/’ RegExp
Action ::= ‘{’ JavaCode ‘}’ | ‘|’
RegExp ::= RegExp ‘|’ RegExp
         | RegExp RegExp
         | ‘(’ RegExp ‘)’
         | (‘!’|‘~’) RegExp
         | RegExp (‘*’|‘+’|‘?’)
         | RegExp "{" Number ["," Number] "}"
         | ‘[’ [‘^’] (Character|Character‘-’Character)* ‘]’
         | PredefinedClass
         | ‘{’ Identifier ‘}’
         | ‘"’ StringCharacter+ ‘"’
         | Character
PredefinedClass ::= ‘[:jletter:]’
                  | ‘[:jletterdigit:]’
                  | ‘[:letter:]’
                  | ‘[:digit:]’
                  | ‘[:uppercase:]’
                  | ‘[:lowercase:]’
                  | ‘.’

The grammar uses the following terminal symbols:

JavaCode a sequence of BlockStatements as described in the Java Language Specification.
Number a non negative decimal integer.
Identifier a letter [a-zA-Z] followed by a sequence of zero or more letters, digits or underscores [a-zA-Z0-9_]
Character an escape sequence or any unicode character that is not one of these meta characters: | ( ) { } [ ] < > \ . * + ? ^ $ / . " ~ !
StringCharacter an escape sequence or any unicode character that is not one of these meta characters: \ "

An escape sequence consists of:
• \n \r \t \f \b
• a \x followed by two hexadecimal digits [a-fA-F0-9] (denoting a standard ASCII escape sequence)
• a \u followed by four hexadecimal digits [a-fA-F0-9] (denoting a unicode escape sequence)
• a backslash followed by a three digit octal number from 000 to 377 (denoting a standard ASCII escape sequence)
• a backslash followed by any other unicode character that stands for this character

Appendix E GROMACS topology file

The GROMACS input parser was created by encoding all the allowable combinations shown in this appendix. The following is extracted from the GROMACS 3.1 manual [223] and is included for completeness.

The topology file is built following the GROMACS specification for a molecular topology. All possible entries in the topology file are listed in Table E.1 and Table E.2. Also listed are all the units of the parameters, which interactions can be perturbed for free energy calculations, which bonded interactions are used by the GROMACS preprocessor (grompp) for generating exclusions and which bonded interactions can be converted to constraints by grompp.
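As a rough illustration of the layout rules listed in this appendix, the following Python sketch joins lines that end with a backslash, strips comments that run from a semicolon to the end of the line, and groups the remaining rows under the directive in square brackets that precedes them. It is not the parser described in this work: the function name read_topology and the sample fragment are hypothetical, and the sketch makes no attempt to check directive order or parameter counts.

```python
import re

# Matches a directive line such as "[ atoms ]" and captures its name.
DIRECTIVE = re.compile(r"^\[\s*(\w+)\s*\]$")


def read_topology(text):
    """Group the data rows of a topology-like text under their directives.

    Returns a list of (directive_name, rows) pairs in file order, where each
    row is a list of whitespace-separated items. Hypothetical helper, for
    illustration only.
    """
    # A line ending with a backslash continues on the next line.
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            pending += raw[:-1] + " "
            continue
        logical_lines.append(pending + raw)
        pending = ""
    if pending:
        logical_lines.append(pending)

    sections = []
    for line in logical_lines:
        # Everything from a semicolon to the end of the line is a comment.
        line = line.split(";", 1)[0].strip()
        if not line:
            continue
        match = DIRECTIVE.match(line)
        if match:
            sections.append((match.group(1), []))
        elif sections:
            # Items are separated by spaces or tabs, not commas.
            sections[-1][1].append(line.split())
        # Rows that appear before any directive are ignored in this sketch.
    return sections


# A tiny, made-up fragment; the values are placeholders, not real parameters.
example = """\
; a tiny, made-up topology fragment
[ moleculetype ]
water  2
[ atoms ]
1  OW  1  SOL  OW  1  -0.82 \\
   15.9994
[ system ]
example system
[ molecules ]
water  1
"""

sections = read_topology(example)
for name, rows in sections:
    print(name, rows)
```

Keeping the directives in file order, rather than in a dictionary, preserves the information needed to apply the ordering rules (for example that [ molecules ] must follow [ system ]) in a later validation step.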
Description of the file layout:
• semicolon (;) and newline surround comments
• on a line ending with \ the newline character is ignored
• directives are surrounded by [ and ]
• the topology consists of three levels:
  – the parameter level (see Table E.1)
  – the molecule level, which should contain one or more molecule definitions (see Table E.2)
  – the system level: [ system ], [ molecules ]
• items should be separated by spaces or tabs, not commas
• atoms in molecules should be numbered consecutively starting at 1
• the file is parsed once only which implies that no forward references can be treated: items must be defined before they can be used
• exclusions can be generated from the bonds or overridden manually
• the bonded force types can be generated from the atom types or overridden per bond
• it is possible to apply multiple bonded interactions of the same type on the same atoms
• descriptive comment lines and empty lines are highly recommended
• starting with GROMACS version 3.1.3 all directives at the parameter level can be used multiple times and there are no restrictions on the order, except that an atom type needs to be defined before it can be used in other parameter definitions
• if parameters for a certain interaction are defined multiple times for the same combination of atom types the last definition is used; starting with GROMACS version 3.1.3 grompp generates a warning for parameter redefinitions with different values
• using one of the [ atoms ], [ bonds ], [ pairs ], [ angles ], etc. without having used [ moleculetype ] before is meaningless and generates a warning
• using [ molecules ] without having used [ system ] before is meaningless and generates a warning
• after [ system ] the only allowed directive is [ molecules ]
• using an unknown string in [ ] causes all the data until the next directive to be ignored, and generates a warning

Parameters
interaction type | directive | # at. | f. tp | parameters | F.E.
mandatory | defaults | | | non-bonded function type; combination rule(a); generate pairs (no/yes); fudge LJ (); fudge QQ () |
mandatory | atomtypes | | | atom type; m (u); q (e); particle type; V(a); W(a) |
 | bondtypes | see Table E.2, directive bonds
 | constrainttypes | see Table E.2, directive constraints
 | pairtypes | see Table E.2, directive pairs
 | angletypes | see Table E.2, directive angles
proper dih. | dihedraltypes | 2/4(b) | 1 | φs (deg); kφ (kJ mol⁻¹); multiplicity | φ, k
improper dih. | dihedraltypes | 2/4(c) | 2 | ζ0 (deg); kζ (kJ mol⁻¹ rad⁻²) | all
RB dihedral | dihedraltypes | 2/4(b) | 3 | C0, C1, C2, C3, C4, C5 (kJ mol⁻¹) | all
LJ | nonbond_params | 2 | 1 | V(a); W(a) |
Buckingham | nonbond_params | 2 | 2 | a (kJ mol⁻¹); b (nm⁻¹); c6 (kJ mol⁻¹ nm⁶) |

Molecule definition(s)
interaction type | directive | # at. | f. tp | parameters | F.E.
mandatory | moleculetype | | | molecule name; exclude neighbours # bonds away for non-bonded interactions |
mandatory | atoms | 1 | | atom type; residue number; residue name; atom name; charge group number; q (e); m (u) | q, m
intramolecular interaction definitions as described in Table E.2

System
mandatory | system | | | system name |
mandatory | molecules | | | molecule name; number of molecules |

‘# at.’ is the number of atom types
‘f. tp’ is function type
‘F.E.’ indicates which parameters can be interpolated during free energy calculations
(a) the combination rule determines the type of LJ parameters
(b) the inner two or all four atoms in the dihedral
(c) the outer two or all four atoms in the dihedral
For free energy calculations, the parameters for topology ‘B’ (λ = 1) should be added on the same line, after the normal parameters, in the same order as the normal parameters.

Table E.1: The topology file

Intramolecular interaction definitions
interaction type | directive | # at. | f. tp | parameters | F.E.
bond | bonds(b,c) | 2 | 1 | b0 (nm); kb (kJ mol⁻¹ nm⁻²) | all
G96 bond | bonds(b,c) | 2 | 1 | b0 (nm); kb (kJ mol⁻¹ nm⁻⁴) | all
morse | bonds(b,c) | 2 | 3 | b0 (nm); D (kJ mol⁻¹); β (nm⁻¹) |
cubic bond | bonds(b,c) | 2 | 4 | b0 (nm); Ci=2,3 (kJ mol⁻¹ nm⁻ⁱ) |
connection | bonds(b) | 2 | 5 | |
harmonic pot. | bonds | 2 | 6 | b0 (nm); kb (kJ mol⁻¹ nm⁻²) | all
FENE bond | bonds | 2 | 7 | bm (nm); kb (kJ mol⁻¹ nm⁻²) |
LJ/Coul. 1-4 | pairs | 2 | 1 | V(a); W(a) | all
LJ/C. 1-4 A | pairs | 2 | 2 | V(a); W(a) |
LJ/C. pair A | pairs | 2 | 3 | |
angle | angles(c) | 3 | 1 | θ0 (deg); kθ (kJ mol⁻¹ rad⁻²) | all
G96 angle | angles(c) | 3 | 2 | θ0 (deg); kθ (kJ mol⁻¹) | all
quartic angle | angles(c) | 3 | 6 | θ0 (deg); Ci=0,1,2,3,4 (kJ mol⁻¹ rad⁻ⁱ) |
proper dih. | dihedrals | 4 | 1 | φs (deg); kφ (kJ mol⁻¹); multiplicity | all
improper dih. | dihedrals | 4 | 2 | ζ0 (deg); kζ (kJ mol⁻¹ rad⁻²) | all
RB dihedral | dihedrals | 4 | 3 | C0, C1, C2, C3, C4, C5 (kJ mol⁻¹) | all
constraint | constraints(b) | 2 | 1 | b0 (nm) | all
constr. n.c. | constraints | 2 | 2 | b0 (nm) | all
settle | settles | 3 | 1 | dOH, dHH (nm) |
vsite2 | virtual_sites2 | 3 | 1 | a () |
vsite3 | virtual_sites3 | 4 | 1 | a, b () |
vsite3fd | virtual_sites3 | 4 | 2 | a (); d (nm) |
vsite3fad | virtual_sites3 | 4 | 3 | θ (deg); d (nm) |
vsite3out | virtual_sites3 | 4 | 4 | a, b (); c (nm⁻¹) |
vsite4fd | virtual_sites4 | 5 | 1 | a, b (); d (nm) |
position res. | position_restraints | 1 | 1 | kx, ky, kz (kJ mol⁻¹ nm⁻²) | all
distance res. | distance_restraints | 2 | 1 | type; label; low, up1, up2 (nm); weight () |
orient. res. | orientation_restraints | 2 | 1 | exp.; label; α; c (U nm^α); obs. (U); weight (U⁻¹) |
angle res. | angle_restraints(c) | 4 | 1 | θ0 (deg); kc (kJ mol⁻¹); multiplicity | θ, k
angle res. z | angle_restraints_z(c) | 2 | 1 | θ0 (deg); kc (kJ mol⁻¹); multiplicity | θ, k
exclusions | exclusions(c) | 1 | | one or more atom indices |

‘# at.’ is the number of atom types
‘f. tp’ is function type
‘F.E.’ indicates which parameters can be interpolated during free energy calculations
(a) the combination rule determines the type of LJ parameters
(b) used by grompp for generating exclusions
(c) can be converted to constraints by grompp
For free energy calculations, the parameters for topology ‘B’ (λ = 1) should be added on the same line, after the normal parameters, in the same order as the normal parameters.
Table E.2: Intramolecular interaction definitions

Appendix F Solvents and counter ions

There follows a list of all the structures deemed to be solvent/ion/guest molecules.

Inorganic molecules and counter ions
• SiF₆
• CO₃²⁻
• CO₃H⁻
• CO₃H₂
• NO₃
• HNO₃⁺
• SO₄²⁻
• HSO₄⁻
• H₂SO₄
• PF₆⁻
• PO₄³⁻
• HPO₄²⁻
• H₂PO₄⁻
• PO₃³⁻
• HPO₃²⁻
• H₂PO₃⁻
• H₃PO₃
• ClO₃⁻
• ClO₄⁻
• HClO₃
• HClO₄
• BrO₃⁻
• BrO₄⁻
• HBrO₃
• HBrO₄
• IO₃⁻
• IO₄⁻
• HIO₃
• HIO₄
• BF₄⁻

Small solvent molecules
All acids and alcohols are included in both their protonated and deprotonated forms.
• trichloromethane
• dicyanoamine
• oxalic acid (ethanedioic acid)
• acetic acid
• fluorinated acetic acid
• chlorinated acetic acid
• brominated acetic acid
• sulfonic acid
• fluorinated sulfonic acid
• chlorinated sulfonic acid
• brominated sulfonic acid
• trifluoroethanol
• trichloroethanol
• tribromoethanol
• propanol
• acetone
• dimethylsulfoxide
• ether
• furan
• tetrahydrofuran
• N,N-dimethylformamide
• trimethylammonia
• trimethylammonium
• dimethyl sulphate
• aminoethanoic acid
• benzene
• toluene

Included for completeness
These molecules are too small (fewer than four heavy atoms) to be suitable for calculation, but are included for completeness.
• F⁻
• Cl⁻
• Br⁻
• I⁻
• NH₄⁺
• dichloromethane
• hydrogen cyanide
• CN⁻
• H₂O

Appendix G Molecules

Table G.1: The 112 unique connection tables of the molecules that fulfil the final protocol. The number of occurrences of each of the connection tables in the final dataset is indicated.
[Molecules that pass the final protocol: the structure diagrams and their occurrence counts are not reproduced here.]
Appendix H Molecule Optimisations

Table H.1: The reported geometries adopted by bv6006molecule3 during the geometry optimisation calculation. All steps of the optimisation are shown.
[Geometry optimisation of bv6006molecule3: the structure images for steps 0 to 88 are not reproduced here.]

Table H.2: The reported geometries adopted by ci6067molecule2 during the geometry optimisation calculation. The first 20 (of 27) geometries are shown; no further protean behaviour was observed after this point.
[Geometry optimisation of ci6067molecule2: the structure images for steps 0 to 19 are not reproduced here.]

Table H.3: The reported geometries of rz6070molecule1 during the geometry optimisation calculation. The first 12 (of 52) geometries are shown; no further protean behaviour was observed after this point.
[Geometry optimisation of rz6070molecule1: the structure images for steps 0 to 11 are not reproduced here.]

Appendix I Published Work

The following papers and communications have been published as a result of work contained in this thesis.

P. Murray-Rust, R. C. Glen, H. S. Rzepa, J. J. P. Stewart, J. A. Townsend, E. L. Willighagen, Y. Zhang, A semantic GRID for molecular science, Proceedings of UK e-Science All Hands Conference 2003

Y. Zhang, P. Murray-Rust, M. T. Dove, R. C. Glen, H. S. Rzepa, J. A. Townsend, S. Tyrrell, J. Wakelin, E. L. Willighagen, JUMBO – An XML Infrastructure for eScience, Proceedings of UK e-Science All Hands Conference 2004

S. E. Adams, J. M. Goodman, R. J. Kidd, A. D. McNaught, P. Murray-Rust, F. R. Norton, J. A. Townsend, C. A. Waudby, Experimental data checker: better information for organic chemists, Org. Biomol. Chem., 2004, 2, 3067–3070

J. A. Townsend, S. E. Adams, C. A. Waudby, V. K. de Souza, J. M. Goodman, P. Murray-Rust, Chemical documents: machine understanding and automated information extraction, Org. Biomol. Chem., 2004, 2, 3294–3300

J. A. Townsend, P. Murray-Rust, Capturing chemistry in XML, Abstr. Am. Chem. Soc. 2004

Y. Zhang, R. C. Glen, P. Murray-Rust, H. S. Rzepa, J. A. Townsend, Semantic grid computing – The WorldWideMolecularMatrix, Abstr. Am. Chem. Soc. 2004

J. A. Townsend, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, Computational chemistry robots, Abstr. Am. Chem. Soc. 2005

P. Murray-Rust, H. S. Rzepa, J. A. Townsend, D.
Wilson, Computational chemistry in XML, Abstr. Am. Chem. Soc. 2006

P. T. Corbett, P. Murray-Rust, N. E. Day, J. A. Townsend, H. S. Rzepa, Chemistry publications in CML, Abstr. Am. Chem. Soc. 2006

Bibliography

[1] T. Hey, A. Trefethen, The Data Deluge: An e-Science Perspective, Wiley, 2003, 809–824
[2] M. Lesk, Practical digital libraries: Books, bytes, and bucks, Morgan Kaufmann 1997
[3] M. Atkinson, Grid Infrastructure meets Biological Research Challenges, 2002, http://www.nesc.ac.uk/presentations/
[4] F. Berman, Viewpoint: From TeraGrid to knowledge grid, Comm. ACM, 2001, 44, 27–28
[5] M. R. Helal, Y. A. Yousef, A. T. Afaneh, Ab Initio Calculations of the Stabilization Energies of the Conformational and the Structural Isomers of C3H7X where X = F, Cl, and Br, J. Comp. Chem., 2003, 23, 966–976
[6] J. E. Davies, oral presentation, Unpublished Structures, CCG Autumn Meeting, 2003
[7] F. Allen, oral presentation, The Future of Crystallographic ‘Publication’, CCG Autumn Meeting, 2003
[8] http://www.admin.cam.ac.uk/offices/gradstud/current/submitting/phd/cdrom.html
[9] http://hdl.handle.net/1842/433
[10] http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=362
[11] P. Murray-Rust, Data-driven Science – A Scientist’s View, NSF/JISC Repositories Workshop, 2007
[12] http://www.cas.org/newsevents/releases/milliondocs1206.html
[13] http://www.cas.org/ASSETS/836E3804111B49BFA28B95BD1B40CD0F/casstats.pdf
[14] H. Shojaei, Z. Li-Bohmer, P. von Zezschwitz, Iromycins: A New Family of Pyridone Metabolites from Streptomyces sp. II. Convergent Total Synthesis, J. Org. Chem., 2007, 72, 5091–5097
[15] http://en.wikipedia.org/wiki/Metadata
[16] http://dublincore.org
[17] http://www.iupac.org/inchi
[18] J. J. P. Stewart, On the use of Semiempirical Methods for Detecting Anomalies in Reported Heats of Formation of Organic Compounds, J. Phys. Chem. Ref. Data, 2004, 33, 713–724
[19] B. Schlegel, A. Härtl, H. M. Dahse, F. A. Gollmick, U. Gräfe, H. Dörfelt, B.
Kappes, Hexacyclinol, a new antiproliferative metabolite of Panus rudis HKI 0254, J. Antibiot., 2002, 55, 814–817
[20] J. J. La Clair, Total Syntheses of Hexacyclinol, 5-epi-Hexacyclinol, and Desoxohexacyclinol Unveil an Antimalarial Prodrug Motif, Angew. Chem. Int. Ed., 2006, 45, 2769–2773
[21] S. D. Rychnovsky, Predicting NMR Spectra by Computational Methods: Structural Revision of Hexacyclinol, Org. Lett., 2006, 8, 2895–2898
[22] http://www.nesc.ac.uk/nesc/mission.html
[23] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web, Sci. Am., 2001, 284, 34–43
[24] G. V. Gkoutos, P. Murray-Rust, H. S. Rzepa, M. Wright, Chemical Markup, XML, and the World-Wide Web. 3. Toward a Signed Semantic Chemical Web of Trust, J. Chem. Inf. Comput. Sci., 2001, 41, 1124–1130
[25] P. Murray-Rust, H. S. Rzepa, M. J. Williamson, E. L. Willighagen, Chemical Markup, XML, and the World-Wide Web. 5. Applications of Chemical Metadata in RSS Aggregators, J. Chem. Inf. Comput. Sci., 2004, 44, 462–469
[26] R. D. King, M. Young, A. J. Clare, K. E. Whelan, J. Rowland, The Robot Scientist Project, Springer Berlin 2005
[27] T. Helgaker, P. Jørgensen, J. Olsen, Molecular Electronic-Structure Theory, Wiley 2004
[28] R. R. Gotwals, S. C. Sendlinger, A Chemistry Educator’s Guide to Molecular Modeling, 2007, http://chemistry.ncssm.edu/book/
[29] http://www.w3.org/TR/REC-xml
[30] J. H. Coombs, A. H. Renear, S. J. DeRose, Markup Systems and the Future of Scholarly Text Processing, Comm. ACM, 1987, 30, 933–947
[31] http://www.w3.org/TR/html401
[32] http://www.w3.org
[33] http://www.w3.org/TR/xhtml1
[34] http://www.w3.org/TR/REC-xml/#dt-doctype
[35] http://www.w3.org/XML/Schema
[36] http://www.w3.org/TR/REC-xml-names
[37] P. Murray-Rust, H. S. Rzepa, STMML.
A markup language for scientific, technical and medical publishing, Data Sci., 2002, 1, 128–192
[38] NISO Standard Z39.85-2001, http://www.niso.org
[39] ISO Standard 15836-2003, http://www.iso.org
[40] http://www.w3.org/TR/xslt
[41] P. Murray-Rust, H. S. Rzepa, Handbook of Chemoinformatics, Wiley-VCH, 2003
[42] http://www.xml-cml.org/information/position.html
[43] P. Murray-Rust, H. S. Rzepa, Chemical markup, XML, and the World-wide Web. 1. Basic principles, J. Chem. Inf. Comput. Sci., 1999, 39, 928–942
[44] A. Dalby, J. G. Nourse, W. Douglas Hounshell, A. K. I. Gushurst, D. L. Grier, A. Leland, J. Laufer, Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci., 1992, 32, 244–255
[45] P. Murray-Rust, H. S. Rzepa, Chemical markup, XML, and the World-wide Web. 4. CML Schema, J. Chem. Inf. Comput. Sci., 2003, 43, 757–772
[46] G. W. Kramer, ANIML: Analytical information markup language for spectroscopy and chromatography data, Abstr. Am. Chem. Soc. 2003
[47] http://www.gaml.org
[48] A. D. T. Nguyen, A. Arslan, J. Travis, M. Smith, R. Schäfer, G. W. Kramer, Molecular spectrometry data interchange applications for NIST’s SpectroML, J. Assoc. Lab. Auto., 2004, 9, 346–354
[49] J. Wakelin, P. Murray-Rust, S. M. Tyrrell, Y. Zhang, H. S. Rzepa, A. Garcia, Mol. Simulations, 2005, 31, 315–322
[50] http://www.w3.org/TR/REC-DOM-Level-1
[51] http://www.w3.org/TR/DOM-Level-2-Core
[52] P. Murray-Rust, H. S. Rzepa, Chemical markup, XML, and the World-wide Web. 2. Information Objects and the CMLDOM, J. Chem. Inf. Comput. Sci., 2001, 41, 1113–1123
[53] D. E.
Knuth, The Art of Computer Programming, Addison-Wesley, 1997
[54] http://www.opensource.org/docs/osd
[55] http://sourceforge.net/projects/cml
[56] http://www.w3.org/Graphics/SVG
[57] http://www.sun.com
[58] http://www.adobe.com
[59] http://www.apple.com
[60] http://www.ibm.com
[61] http://www.kodak.com
[62] http://www.w3.org/TR/xlink
[63] http://www.w3.org/TR/xmlbase
[64] http://www.w3.org/TR/xml-stylesheet
[65] http://www.w3.org/TR/REC-smil
[66] http://www.w3.org/TR/2001/REC-smil-animation-20010904
[67] http://www.r-project.org
[68] http://office.microsoft.com/excel
[69] http://www.uszla.me.uk/software/pelote.html
[70] J. Bishop, Java Gently: Programming Principles Explained, Addison-Wesley 1998
[71] http://java.sun.com
[72] http://jmol.sourceforge.net
[73] http://www.povray.org
[74] http://www.winedt.com
[75] http://office.microsoft.com/word
[76] C. Creighton, S. Hanash, Mining gene expression databases for association rules, Bioinformatics, 2003, 19, 79–86
[77] M. Andrade, A. Valencia, Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families, Bioinformatics, 1998, 14, 600–607
[78] J. M. Temkin, M. R. Gilder, Extraction of protein interaction information from unstructured text using a context-free grammar, Bioinformatics, 2003, 19, 2046–2053
[79] P. Murray-Rust, H. S. Rzepa, The Next Big Thing: From Hypermedia to Datuments, J. Digital Information, 2004, 5, 248
[80] L. R. Garson, Communicating original research in chemistry and related sciences, Acc. Chem. Res., 2004, 37, 141–148
[81] F. Damerau, A technique for Computer Detection and Correction of Spelling Errors, Comm. ACM, 1964, 7, 171–176
[82] R. P. Murelli, A. K. Cheung, M. L. Snapper, Conformationally Restricted (+)-Cacospongionolide B Analogues. Influence on Secretory Phospholipase A2 Inhibition, J. Org. Chem., 2007, 72, 1545–1552
[83] http://www.rsc.org/Publishing/ReSourCe/AuthorGuidelines/ArticleLayout/sect3.asp
[84] Jeffrey E. F.
Friedl, Mastering Regular Expressions, O’Reilly and Associates 2002
[85] L. P. Deutsch, B. W. Lampson, An online editor, Comm. ACM, 1967, 10, 793–799
[86] http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html
[87] S. E. Adams, J. M. Goodman, R. J. Kidd, A. D. McNaught, P. Murray-Rust, F. R. Norton, J. A. Townsend, C. A. Waudby, Experimental data checker: better information for organic chemists, Org. Biomol. Chem., 2004, 2, 3067–3070
[88] Private communication with J. Brazier and Dr J. Burton, Unilever Centre for Molecular Science Informatics 2005
[89] P. Wiklund, J. Bergman, Ring forming reaction of imines of 2-aminobenzaldehyde and related compounds, Org. Biomol. Chem., 2003, 1, 367–372
[90] K. Hirota, K. Kazaoka, I. Niimoto, H. Sajiki, Efficient synthesis of 2,9-disubstituted 8-hydroxyadenine derivatives, Org. Biomol. Chem., 2003, 1, 1354–1365
[91] E. T. Gallagher, D. H. Grayson, Reactions of lithiated (E)-3-halo-1-phenylsulfonylprop-1-enes and (Z)-1-halo-3-phenylsulfonylprop-1-enes with aldehydes, Org. Biomol. Chem., 2003, 1, 1374–1381
[92] K. Smith, G. A. El-Hiti, A. J. Jayne, M. Butters, Acylation of aromatic ethers over solid acid catalysts: scope of the reaction with more complex acylating agents, Org. Biomol. Chem., 2003, 1, 2321–2325
[93] X. Peng, N. Fukui, M. Mizuta, H. Suzuki, Nitration of moderately deactivated arenes with nitrogen dioxide and molecular oxygen under neutral conditions. Zeolite-induced enhancement of regioselectivity and reversal of isomer ratios, Org. Biomol. Chem., 2003, 1, 2326–2335
[94] P. Wiklund, I. Romero, J. Bergman, Products from dehydration of dicarboxylic acids derived from anthranilic acid, Org. Biomol. Chem., 2003, 1, 3396–3403
[95] F. Lecornué, J. Ollivier, Construction of medium-ring oxacycloalkenones. Extension towards benzo-fused cyclic ethers, Org. Biomol. Chem., 2003, 1, 3600–3604
[96] F. Jeannot, G. Gosselin, C.
Mathé, Synthesis and antiviral evaluation of 2’-deoxy-2’-C-trifluoromethyl β-D-ribonucleoside analogues bearing the five naturally occurring nucleic acid bases, Submitted for publication in Org. Biomol. Chem.
[97] R. Hunter, P. Richards, Stereoselective tetrapyrido[2,1-a]isoindolone synthesis via carbanionic and radical intermediates: a model study for the Tacaman alkaloid D/E ring fusion, Submitted for publication in Org. Biomol. Chem.
[98] M. D. Toscano, M. Frederickson, D. P. Evans, J. R. Coggins, C. Abell, C. González-Bello, Design, synthesis and evaluation of bifunctional inhibitors of type II dehydroquinase, Submitted for publication in Org. Biomol. Chem.
[99] C. A. Waudby, Unpublished work 2006
[100] J. A. Townsend, S. E. Adams, C. A. Waudby, V. K. de Souza, J. M. Goodman, P. Murray-Rust, Chemical documents: machine understanding and automated information extraction, Org. Biomol. Chem., 2004, 2, 3294–3300
[101] A. Vasserman, Identifying Chemical Names in Biomedical Text: An Investigation of the substring co-occurrence based approaches, Proceedings of the Student Research Workshop at HLT-NAACL 2004
[102] http://www.cambridgesoft.com/software/ChemDraw
[103] P. M. Elliott, Translation of Chemical Nomenclature by Syntax Controlled Techniques, MSc Thesis, The Ohio State University 1969
[104] E. Garfield, An Algorithm for Translating Chemical Names to Molecular Formulas, PhD Thesis 1962
[105] D. I. Cooke-Fox, G. H. Kirby, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 1. Introduction and Background to a Grammar-Based Approach, J. Chem. Inf. Comput. Sci., 1989, 29, 101–105
[106] D. I. Cooke-Fox, G. H. Kirby, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 2. Development of a Formal Grammar, J. Chem. Inf. Comput. Sci., 1989, 29, 106–112
[107] D. I. Cooke-Fox, G. H. Kirby, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 3.
Syntax Analysis and Semantic Processing, J. Chem. Inf. Comput. Sci., 1989, 29, 112–118
[108] D. I. Cooke-Fox, G. H. Kirby, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 4. Concise connection tables to structure diagrams, J. Chem. Inf. Comput. Sci., 1990, 30, 122–127
[109] D. I. Cooke-Fox, G. H. Kirby, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 5. Steroid nomenclature, J. Chem. Inf. Comput. Sci., 1990, 30, 128–132
[110] G. H. Kirby, M. R. Lord, J. D. Rayner, Computer Translation of IUPAC Systematic Organic Chemical Nomenclature. 6. (Semi)automatic Name correction, J. Chem. Inf. Comput. Sci., 1991, 31, 153–160
[111] A Guide to IUPAC Nomenclature of Organic Chemistry, Recommendations 1993, (including Revisions, Published and hitherto Unpublished, to the 1979 Edition of Nomenclature of Organic Chemistry), IUPAC 1993
[112] D. Weininger, SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules, J. Chem. Inf. Comput. Sci., 1988, 28, 31–36
[113] D. Weininger, A. Weininger, J. L. Weininger, SMILES. 2. Algorithm for Generation of Unique SMILES Notation, J. Chem. Inf. Comput. Sci., 1989, 29, 97–101
[114] D. Weininger, SMILES. 3. DEPICT. Graphical Depiction of Chemical Structures, J. Chem. Inf. Comput. Sci., 1990, 30, 237–243
[115] A. Copestake, M. A. Parker, S. Teufel, P. Murray-Rust, Extracting the Science from Scientific Publications, EPSRC, EP/C010035/1
[116] P. Corbett, P. Murray-Rust, High-throughput identification of chemistry in life science texts, Computational Life Sciences II, Proceedings Lecture Notes in Computer Science, 2006, 4216, 107–118
[117] P. Corbett, Unpublished work 2006
[118] A. Copestake, P. Corbett, P. Murray-Rust, C. J. Rupp, A. Siddharthan, S. Teufel, B. Waldron, An Architecture for Language Processing for Scientific Texts, Proceedings of the UK e-Science All Hands Conference 2006
[119] J.
Brecher, Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical Nomenclature, J. Chem. Inf. Comput. Sci., 1999, 39, 943–950
[120] A. V. Aho, R. Sethi, J. D. Ullman, Compilers: Principles, Techniques, and Tools, Prentice-Hall 2003
[121] P. Naur, Revised Report on the Algorithmic Language ALGOL 60, Comm. ACM, 1960, 3, 299–314
[122] M. Marcotty, H. Ledgard, The World of Programming Languages, Springer-Verlag 1986
[123] J. W. Backus, The Syntax and Semantics of the Proposed International Algebraic Language of the Zürich ACM-GAMM Conference, ICIP Paris, 1959
[124] E. L. Post, Formal Reductions of the General Combinatorial Decision Problem, Am. J. Mathematics, 1943, 65, 197–215
[125] http://www.pcre.org
[126] A. K. McCallum, Reinforcement Learning with Selective Perception and Hidden State, PhD Thesis 1995
[127] http://jedlik.phy.bme.hu/~gerjanos/HMM/node4.html
[128] P. C. Austin, L. J. Brunner, J. E. Hux, Bayeswatch: an overview of Bayesian statistics, J. Eval. Clin. Pract., 2002, 8, 277–286
[129] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, Chapman & Hall 1995
[130] http://research.microsoft.com/nlp
[131] N. Chomsky, Syntactic Structures, Walter de Gruyter 1957
[132] http://jflex.de
[133] B. W. Kernighan, D. M. Ritchie, The C Programming Language, Prentice-Hall 1988
[134] J. R. Levine, T. Mason, D. Brown, Lex & Yacc, O’Reilly & Associates 1992
[135] S. E. Hudson, CUP LALR Parser Generator for Java, http://www.cs.princeton.edu/~appel/modern/java/CUP/
[136] S. C. Johnson, YACC – Yet Another Compiler Compiler, CS Technical Report, Bell Telephone Laboratories, 1975, 32
[137] E. Lindahl, B. Hess, D. van der Spoel, GROMACS 3.0: a package for molecular simulation and trajectory analysis, J. Mol. Mod., 2001, 7, 306–317
[138] J. J. P. Stewart, MOPAC: A semiempirical molecular orbital program, J. Comput. Aided Mol.
Des., 1990, 4, 1–45
[139] Private communication with Andrew Walkingshaw, Unilever Centre for Molecular Science Informatics 2007
[140] J. M. Soler, E. Artacho, J. D. Gale, A. García, J. Junquera, P. Ordejón, D. Sánchez-Portal, The SIESTA method for ab initio order-N materials simulation, J. Phys. Condens. Matter, 2002, 14, 2745–2779
[141] J. D. Gale, GULP: A computer program for the symmetry adapted simulation of solids, J. Chem. Soc. Faraday Trans., 1997, 93, 629–637
[142] I. T. Todorov, W. Smith, DL_POLY_3: the CCP5 National UK Code for molecular-dynamics simulations, Phil. Trans., 2004, 362, 1835–1852
[143] A. García, P. Murray-Rust, J. Wakelin, The use of XML and CML in Computational Chemistry and Physics Programs, Proceedings of UK e-Science All Hands Conference, 2004
[144] Private communication with Dr M. Braendle, Infozentrum Chemie Biologie Pharmazie 2005
[145] Private communication with Dr R. Kanters, Department of Chemistry, University of Richmond 2005
[146] Private communication with M. Howard, Jmol project 2005
[147] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, V. Bakken, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T.
Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, J. A. Pople, Gaussian 03, Gaussian, Inc., Wallingford, CT, 2004
[148] http://www.seti.org
[149] W. G. Richards, Virtual screening using grid computing: the screensaver project, Nat. Rev. Drug Discovery, 2002, 1, 551–555
[150] J. Basney, M. Livny, T. Tannenbaum, High Throughput Computing with Condor, HPCU news, 1, (1997)
[151] M. W. Mutka, M. Livny, Scheduling remote processing capacity in a workstation-processor bank network, Proc. 7th Int. Conf. Distributed Comput. Syst., 1987, 2–9
[152] M. J. Litzkow, M. Livny, M. W. Mutka, Condor – a hunter of idle workstations, Proc. 8th Int. Conf. Distributed Computing Systems, 1988, 104–111
[153] http://cactus.nci.nih.gov/ncidb2/download.html
[154] Y. Zhang, R. C. Glen, P. Murray-Rust, H. S. Rzepa, J. A. Townsend, Semantic grid computing – The WorldWideMolecularMatrix, Abstr. Am. Chem. Soc. 2004
[155] http://openbabel.sourceforge.net
[156] http://xml.apache.org/xindice
[157] http://wwmm.ch.cam.ac.uk/inchifaq
[158] S. J. Coles, N. E. Day, P. Murray-Rust, H. S. Rzepa, Y. Zhang, Enhancement of the chemical semantic web through the use of InChI identifiers, Org. Biomol. Chem., 2005, 3, 1832–1834
[159] M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, J. H. Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. J. Su, T. L. Windus, M. Dupuis, J. A. Montgomery, General Atomic and Molecular Electronic Structure System, J. Comput. Chem., 1993, 14, 1347–1363
[160] K. K. Irikura, D. J. Frurip, Computational Thermochemistry: Prediction and Estimation of Molecular Thermodynamics, ACS Symp. Ser., 1998
[161] http://www.msg.ameslab.gov/GAMESS/GAMESS Manual
[162] Private communication with Prof. K. K. Baldridge, University of Zürich, 2005
[163] J. Baker, A. Kessi, B. Delley, The generation and use of delocalized internal coordinates in geometry optimization, J. Chem.
Phys., 1996, 105, 192–212
[164] Private communication with Dr W. Sudholt, University of Zürich, 2004
[165] Molecular Operating Environment, Version 2000.03, Chemical Computing Group Inc.
[166] R. C. Glen, A. Bender, C. H. Arnby, L. Carlsson, S. Boyer, J. Smith, Circular fingerprints: Flexible molecular descriptors with applications from physical chemistry to ADME, IDrugs, 2006, 9, 199–204
[167] O. Ludwig, H. Schinke, W. Brandt, Reparametrisation of Force Constants in MOPAC 6.0/7.0 for Better Description of the Activation Barrier of Peptide Bond Rotations, J. Mol. Mod., 1996, 2, 341–350
[168] P. P. Ewald, Fifty Years of X-Ray Diffraction, Springer 1962
[169] H. M. Rietveld, A profile refinement method for nuclear and magnetic structures, J. Appl. Cryst., 1969, 2, 65–71
[170] F. H. Allen, The Cambridge Structural Database: a quarter of a million crystal structures and rising, Acta Cryst., 2002, B58, 380–388
[171] W. L. Bragg, The Diffraction of Short Electromagnetic Waves by a Crystal, Proc. Camb. Philos. Soc., 1912, 17, 43–57
[172] W. Clegg, A. J. Blake, R. O. Gould, P. Main, Crystal Structure Analysis: Principles and Practice, Oxford University Press 2002
[173] W. Friedrich, P. Knipping, M. Laue, Interferenzerscheinungen bei Röntgenstrahlen, Annalen der Physik, 1913, 346, 971–988
[174] P. P. Ewald, Zur Theorie der Interferenzen der Röntgenstrahlen in Kristallen, Physik. Z., 1913, 14, 465–472
[175] D. Watkin, Uequiv: its past, present and future, Acta Cryst., 2000, B56, 747–749
[176] F. L. Hirshfeld, Can X-ray data distinguish bonding effects from vibrational smearing?, Acta Cryst., 1976, A32, 239–244
[177] H. G. von Schnering, D. Vu, Are the Previously Described [ClF6][CuF4] and [Cu(H2O)][SiF6] Identical?, Angew. Chem. Int. Ed., 1983, 22, 408
[178] B. Dittrich, T. Koritsánszky, P. Luger, A Simple Approach to Nonspherical Electron Densities by Using Invarioms, Angew. Chem. Int. Ed., 2004, 43, 2718–2721
[179] N. K. Hansen, P.
Coppens, Testing Aspherical Atom Refinements on Small-Molecule Data Sets, Acta Cryst., 1978, A34, 909–921
[180] B. Dittrich, C. B. Hübschle, M. Messerschmidt, R. Kalinowski, D. Girnt, P. Luger, The invariom model and its application: refinement of D,L-serine at different temperatures and resolution, Acta Cryst., 2005, A61, 314–320
[181] B. Dittrich, P. Munshi, M. A. Spackman, Invariom-model refinement of l-valinol, Acta Cryst., 2006, C62, 633–635
[182] S. R. Hall, F. H. Allen, I. D. Brown, The Crystallographic Information File (CIF): a New Standard Archive File for Crystallography, Acta Cryst., 1991, A47, 655–685
[183] http://www.iucr.org/iucr-top/cif/cif core/index.html
[184] S. R. Hall, The STAR file: a new format for electronic data transfer and archiving, J. Chem. Inf. Comput. Sci., 1991, 31, 326–333
[185] S. R. Hall, N. Spadaccini, The STAR File: Detailed Specifications, J. Chem. Inf. Comput. Sci., 1994, 34, 505–508
[186] F. H. Allen, O. Kennard, W. D. S. Motherwell, W. G. Town, D. G. Watson, T. J. Scott, A. C. Larson, The Cambridge Crystallographic Data Centre. Part 3. The Unique Molecule Program, J. Appl. Cryst., 1974, 7, 73–78
[187] http://checkcif.iucr.org
[188] http://journals.iucr.org/services/cif/datavalidation.html
[189] N. E. Day, P. Murray-Rust, H. S. Rzepa, S. M. Tyrrell, Y. Zhang, Automatic aggregation of open chemical data, Abstr. Am. Chem. Soc. 2005
[190] R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Steinbeck, J. Wegner, E. L. Willighagen, The Blue Obelisk — Interoperability in Chemical Informatics, J. Chem. Inf. Model., 2006, 46, 991–998
[191] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, P. Li, Taverna: a tool for the composition and enactment of bioinformatics workflows, Bioinformatics, 2004, 20, 3045–3054
[192] Y. Zhang, P. Murray-Rust, M. T. Dove, R. C. Glen, H. S. Rzepa, J. A. Townsend, S. Tyrrell, J. Wakelin, E. L.
Willighagen, JUMBO – An XML Infrastructure for eScience, Proceedings of UK e-Science All Hands Conference, 2004
[193] http://www.ccdc.cam.ac.uk/products/csd/request/request.php4
[194] http://journals.iucr.org/services/cif/reqditems.html
[195] P. Murray-Rust, H. S. Rzepa, S. M. Tyrrell, Y. Zhang, Representation and use of chemistry in the global electronic age, Org. Biomol. Chem., 2004, 2, 3192–3203
[196] P. Murray-Rust, S. Tyrrell, CIF2CML — Automatic Processing in XML/CML, Acta Cryst., 2005, A61, C109
[197] http://ant.apache.org
[198] M. Calleja, B. Beckles, M. Keegan, M. A. Hayes, A. Parker, M. T. Dove, CamGrid: Experiences in constructing a university-wide, Condor-based, grid at the University of Cambridge, Proceedings of UK e-Science All Hands Conference, 2004
[199] Private communication with Dr C. Bolton, Unilever Centre for Molecular Science Informatics
[200] http://www.dspace.org
[201] L. Zorina, S. Khasanov, B. Narymbetov, R. Shibaeva, A. Kotov, É. Yagubskii, Crystal structure of radical cation salt, (BEDT-TTF)4(GaCl4)2 C6H5CH3, 2001, 46, 219–224
[202] Private communication with Dr Alison Edwards, Bragg Institute 2006
[203] Private communication with Dr W. Bernd Schweizer, Eidgenössische Technische Hochschule Zürich 2006
[204] http://www.ccdc.cam.ac.uk/products/csd/statistics
[205] http://www.iucr.org/iucr-top/cif/cifdic html/1/cif core.dic/Cchemical formula.html
[206] X.-H. Li, H.-P. Xiao, catena-Poly[[[(2,2’-bipyridine)copper(II)]-µ-terephthalato] N,N-dimethylformamide solvate], Acta Cryst., 2004, E60, 898–900
[207] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality (complete samples), Biometrika, 1965, 52, 591–611
[208] C. R. Newton, I. S. Michela, G. W. J. Fleet, Y. Blériot, D. J. Watkin, 2-C-Hydroxymethyl-2,3-O-isopropylidene-D-ribono-1,5-lactam, Acta Cryst., 2004, E60, 909–910
[209] A. L. Spek, Single-crystal structure validation with the program PLATON, J. Appl. Cryst., 2003, 36, 7–13
[210] N. E.
Day, Unpublished work 2006
[211] S. F. Boys, Electronic wavefunctions. I. A general method of calculation for stationary states of any molecular system, Proc. Roy. Soc., 1950, 200, 542–554
[212] J. C. Slater, A Simplification of the Hartree-Fock Method, Phys. Rev., 1951, 81, 385–390
[213] P. A. M. Dirac, Note on exchange phenomena in the Thomas-Fermi atom, Proc. Camb. Philos. Soc., 1930, 26, 376–385
[214] W. Kohn, L. J. Sham, Self-Consistent Equations Including Exchange and Correlation Effects, Phys. Rev., 1965, A140, 1133–1138
[215] A. D. Becke, Density-functional thermochemistry. III. The role of exact exchange, J. Chem. Phys., 1993, 98, 5648–5652
[216] C. Lee, W. Yang, R. G. Parr, Development of the Colle-Salvetti correlation-energy formula into a functional of the electron density, Phys. Rev., 1988, B37, 785–789
[217] P. J. Stephens, F. J. Devlin, C. F. Chabalowski, M. J. Frisch, Ab initio calculation of vibrational absorption and circular dichroism spectra using density functional force fields, J. Phys. Chem., 1994, 98, 11623–11627
[218] J. Pople, D. L. Beveridge, P. A. Dobosh, Approximate Self-Consistent Molecular-Orbital Theory. V. Intermediate Neglect of Differential Overlap, J. Chem. Phys., 1967, 47, 2026–2033
[219] M. J. S. Dewar, E. G. Zoebisch, E. F. Healy, J. J. P. Stewart, AM1: A New General Purpose Quantum Mechanical Molecular Model, J. Am. Chem. Soc., 1985, 107, 3902–3909
[220] J. J. P. Stewart, Optimization of parameters for semiempirical methods. 1. Method, J. Comp. Chem., 1989, 10, 209–220
[221] J. J. P. Stewart, Optimization of parameters for semiempirical methods. 2. Applications, J. Comp. Chem., 1989, 10, 221–264
[222] http://java.sun.com/j2se/1.5.0/docs/api
[223] http://alpha2.bmc.uu.se
References are given in the style adopted by the Journal of Chemical Information and Computer Sciences.