Information Extraction from Chemical Patents

David Matthew Jessop
Fitzwilliam College

This dissertation is submitted for the degree of Doctor of Philosophy

Preface

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed the word limit (60000) set by the Degree Committee.

Abstract

Information Extraction from Chemical Patents
David Matthew Jessop

The automated extraction of semantic chemical data from the existing literature is demonstrated. For reasons of copyright, the work is focused on the patent literature, though the methods are expected to apply equally to other areas of the chemical literature.

Hearst Patterns are applied to the patent literature in order to discover hyponymic relations describing chemical species. The acquired relations are manually validated to determine the precision of the determined hypernyms (85.0%) and of the asserted hyponymic relations (94.3%). It is demonstrated that the system acquires relations that are not present in the ChEBI ontology, suggesting that it could function as a valuable aid to the ChEBI curators. The relations discovered by this process are formalised using the Web Ontology Language (OWL) to enable re-use.

PatentEye – an automated system for the extraction of reactions from chemical patents and their conversion to Chemical Markup Language (CML) – is presented. Chemical patents published by the European Patent Office over a ten-week period are used to demonstrate the capability of PatentEye: 4444 reactions are extracted with a precision of 78% and recall of 64% with regard to determining the identity and amount of reactants employed, and an accuracy of 92% with regard to product identification. NMR spectra are extracted from the text using OSCAR3, which is developed to greatly increase recall.
The resulting system is presented as a significant advancement towards the large-scale and automated extraction of high-quality reaction information.

Extended Polymer Markup Language (EPML), a CML dialect for the description of Markush structures as they are presented in the literature, is developed. Software to exemplify and to enable substructure searching of EPML documents is presented. Further work is recommended to refine the language and code to publication-quality before they are presented to the community.

Acknowledgments

I would like to thank Prof. Robert Glen and Prof. Peter Murray-Rust for supervision. I would also like to thank all those who have contributed to the creation of the software that has made this project possible – most notably Dr Peter Corbett for his work on OSCAR3, Dr Lezan Hawizy for her work on ChemicalTagger and Daniel Lowe for his work on OPSIN. Further thanks go to all those too numerous to name at the Unilever Centre, past and present, who have contributed to discussions and supported me in my work. Special thanks go to Dr Sam Adams, who volunteered to proofread this thesis, and to Jo for her love and support. I am grateful to Unilever for funding.

Contents

Preface
Abstract
Acknowledgments
Contents
List of Figures
Glossary
1. Introduction
  1.1 Open and Closed Data
  1.2 The Semantic Web
  1.3 Semanticizing Chemistry
  1.4 Information Extraction from Chemical Documents
  1.5 Development Environment
2. Sources of Chemical Documents & Technologies for their Semantic Enrichment
  2.1 Availability of Documents
    2.1.1 Journal Articles
    2.1.2 Theses
    2.1.3 Patents
  2.2 Key Technologies
    2.2.1 XML & XPath
    2.2.2 Regular Expressions
    2.2.3 Machine-Understandable Chemical Formats
    2.2.4 Chemical Markup Language
    2.2.5 CMLXOM & JUMBO
    2.2.6 OSCAR3
    2.2.7 ChemicalTagger
    2.2.8 OSRA
  2.3 Conclusions
3. Representation and Manipulation of Markush Structures
  3.1 Markush Structures
  3.2 Polymer Markup Language
    3.2.1 Representation of Polyethylene Oxide
    3.2.2 Representation of Polystyrene
    3.2.3 Representing Variability in PML
    3.2.4 The Cambridge Polymer Builder
  3.3 Extension of PML for Markush Structures
    3.3.1 Frequency Variation
    3.3.2 Homology Variation
    3.3.3 Position Variation
    3.3.4 Position and Count Variation
    3.3.5 Inline Connection Tables
  3.4 Building Representative Examples of a Markush Structure
  3.5 Substructure Searching of Markush Structures
    3.5.1 Implementing Extended Connection Tables
    3.5.2 Building Extended Connection Tables
    3.5.3 The Relaxation Algorithm
    3.5.4 Examples
  3.6 Conclusions
4. Automatic Acquisition of Hyponymic Relations from the Chemical Literature
  4.1 Hyponymic Relations
  4.2 Hearst Patterns
    4.2.1 OSCAR3 Implementation
  4.3 Acquiring Hyponymic Relations
    4.3.1 HearstFinder
    4.3.2 Recording Hyponymic Relations
    4.3.3 Content of the Derived Relations & Sources of Error
    4.3.4 Trimming the Relations
    4.3.5 HearstFinder Validation
  4.4 Uses of Derived Data
    4.4.1 Automatic Classification of Structural & Non-Structural Classes
    4.4.2 Detection of Useful Relationships
    4.4.3 Application to Data Searching
  4.5 Conclusions
5. High-Throughput Abstraction of Chemical Reactions – PatentEye
  5.1 Downloading Patents
    5.1.1 EPO Web Interface
    5.1.2 Automated Downloading of EPO patents
    5.1.3 Formation of the Patent Corpus
  5.2 Document Enhancement
    5.2.1 Paragraph Deflattening
    5.2.2 Document Segmentation
    5.2.3 Data Annotation
    5.2.4 Experimental Paragraph Classification
    5.2.5 Image Analysis
    5.2.6 Back Reference Annotation
  5.3 Extraction of Reactions
    5.3.1 Conventional Format of Experimental Sections
    5.3.2 Implementation of Automatic Reaction Extraction
    5.3.3 Reaction Extraction Performance
  5.4 Conclusions
6. Results
  6.1 Quality of Extracted Reactions
    6.1.1 Corpus Formation
    6.1.2 Product Validation
    6.1.3 Reagent Validation
    6.1.4 Spectra Validation
    6.1.5 Automated Verification Validation
  6.2 Enabling Reuse of the Extracted Data
7. Conclusions
8. Bibliography
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E

List of Figures

Figure 2-1: InChI representations of limonene
Figure 2-2: CML representation of acetaldehyde
Figure 2-3: Hydration of acetaldehyde
Figure 2-4: CML representation of a chemical reaction
Figure 2-5: OSCAR3 Architecture
Figure 2-6: Example inline document as produced by OSCAR3
Figure 2-7: OSCAR3 Data Annotations
Figure 2-8: 1H NMR regular expression
Figure 2-9: ChemicalTagger Architecture
Figure 2-10: Sample ChemicalTagger output
Figure 2-11: Example reaction scheme
Figure 3-1: Generic structure representing the monochlorinated toluenes
Figure 3-2: PML representation of poly(ethylene oxide)
Figure 3-3: Atomistic representation for the g:o fragment
Figure 3-4: The creation of bonds between fragments in PML
Figure 3-5: PML representation of polystyrene
Figure 3-6: PML representation of a statistical copolymer
Figure 3-7: The front page of the Cambridge Polymer Builder
Figure 3-8: Designing a polymer
Figure 3-9: Results of polymer building
Figure 3-10: PML representation of the monochlorinated toluenes
Figure 3-11: Homology variation in EPML
Figure 3-12: Formal description of the alkoxy template
Figure 3-13: Position variation in EPML
Figure 3-14: Markush structure employing simultaneous position and count variation
Figure 3-15: Simultaneous position and count variation in EPML
Figure 3-16: Markush structure featuring variable cyclic unit
Figure 3-17: Inline connection tables in EPML
Figure 3-18: Example Markush structure
Figure 3-19: EPML representation of the example Markush structure
Figure 3-20: 3D (left) and 2D (right) views of a randomly-generated example compound
Figure 3-21: Superimposed structure representing the monochlorinated toluenes
Figure 3-22: Relaxation match of 3-aminopropanoyl chloride
Figure 3-23: Inconclusive results of relaxation matches
Figure 3-24: Example Markush structure (left) and corresponding ECT (right)
Figure 4-1: Acquisition and Storage of Hearst Patterns
Figure 4-2: Grammatical structure of a Hearst Pattern
Figure 4-3: Distribution of Hearst Patterns across the patent corpus
Figure 4-4: Individual Hearst Pattern frequency across the patent corpus
Figure 4-5: The customised OSCAR3 ScrapBook
Figure 4-6: Annotated Hearst Pattern as produced by the OSCAR3 ScrapBook
Figure 4-7: Indirect ChEBI classification of acetone as a solvent
Figure 5-1: Search results for EP 1777210
Figure 5-2: Variation of downloaded and unique patents in the corpus
Figure 5-3: Identification of and Document Restructuring Using Consecutive Headings
Figure 5-4: Software architecture for the application of OSCAR3 data annotations to patent XML documents
Figure 5-5: Embedded images in the patent XML. The text has been shortened for the sake of brevity
Figure 5-6: EP1620437B1 Image 413
Figure 5-7: Types of images present in experimental sections
Figure 5-8: Input image (left) and correctly interpreted structure (right)
Figure 5-9: Input image (top) and correctly interpreted structure (bottom)
Figure 5-10: Input image (left) and correctly interpreted structure
Figure 5-11: Input image (top) and incorrectly interpreted structure (bottom)
Figure 5-12: Input image (top) and unbuildable result (bottom)
Figure 5-13: Input image (top) and unbuildable result (bottom)
Figure 5-14: Runtime required for image analysis
Figure 5-15: Analogous reactions from EP1326865
Figure 5-16: Indexed and Tokenised Headings
Figure 5-17: Local annotation of sub-headings
Figure 5-18: EP1326865 - Example 79, Step 1
Figure 5-19: Abstracting reactions from patent text
Figure 5-20: Example NMR spectrum in CML
Figure 5-21: Sample ChemicalTagger markup of a reactant
Figure 5-22: ChemicalTagger output for mixed content
Figure 5-23: Automatically generated reactantList and spectatorList. For the sake of brevity, atom and bond elements have been removed
Figure 5-24: Identification of key reactants
Figure 5-25: Significant substructures in the analogous reaction
Figure 5-26: Proton environments in a non-trivial system
Figure 6-1: Sample RDF from the PatentEye Repository
Figure 6-2: Diagrammatic illustration of PatentEye Repository RDF

Glossary

API  Application Programming Interface
CAS  Chemical Abstracts Service
ChEBI  Chemical Entities of Biological Interest
CML  Chemical Markup Language
DTD  Document Type Definition
EPML  Extended Polymer Markup Language
EPO  European Patent Office
ECT  Extended Connection Table
HTML  HyperText Markup Language
InChI  IUPAC International Chemical Identifier
JUMBO  Java Universal Molecular Browser for Objects
JVM  Java Virtual Machine
MEMM  Maximum Entropy Markov Model
NLP  Natural Language Processing
OCR  Optical Character Recognition
OPSIN  Open Parser for Systematic IUPAC Nomenclature
OSCAR  Open Source Chemistry Analysis Routines
OSRA  Optical Structure Recognition Application
OWL  Web Ontology Language
MPT  Mean Pairwise Tanimoto Coefficient
NMR  Nuclear Magnetic Resonance
PDF  Portable Document Format
PML  Polymer Markup Language
RDF  Resource Description Framework
SMILES  Simplified Molecular Input Line Entry Specification
URI  Uniform Resource Identifier
USPTO  United States Patent and Trademark Office
WIPO  World Intellectual Property Organization
XML  Extensible Markup Language

1. Introduction

Isaac Newton once wrote to Robert Hooke:

“If I have seen a little further it is by standing on the shoulders of giants.”

This oft-quoted adage contains a great truth: in modern science, all new works are based upon something that has come before, and so if we are to carry out new work, we must first know what has come before. In the modern age this can be difficult – research today is carried out on a huge scale, and a scientist searching for the answer to a simple question may find it impossible to find the appropriate needle in a vast, electronic haystack. Modern information systems give him at least a fighting chance, but it is by no means guaranteed that a computer system will contain the data he seeks, or have indexed the information it holds in sufficient depth to allow him to find his answer.
This thesis seeks to address the problem of information flow in chemistry: the question of how to make information available to those who need it, when they need it.

1.1 Open and Closed Data

The scale of information output in modern chemistry is huge (1). The CAplus database (2) holds more than 32 million references to patents and journal articles and indexes more than 1500 current journals on a weekly basis, while the CAS REGISTRY (3) holds more than 54 million chemical compounds and the CASREACT (4) database more than 39 million single and multi-step reactions. But even these numbers do not do justice to the scale on which the research is conducted – inevitably, these databases will not be complete indexes of the published data, while published authors will limit the data they include in their documents to that which is directly relevant to their work, omitting much that they have generated during the course of it.

Much chemical information is not freely available – it may be locked to a paper format in a researcher’s lab book, effectively lost to the community, while the traditional business model of a journal requires the erection of paywalls. Such closed data obstructs the work of scientists, though they may not know it. Data may be closed with good reason – in commercial research, for example, revealing the detail of one’s work too early may risk the patentability of an invention and thereby its commercial value – but often data is closed that need not be.

The availability of data is vital for data-driven science such as spectra prediction and Quantitative Structure-Activity Relationships (QSAR), which has become increasingly important to the pharmaceutical industry as it seeks to control the spiralling costs of drug development. Open data – data that is freely available to the community – supports and enables such work. The more that the culture of open data spreads, the more such work becomes viable.

1.2 The Semantic Web

Tim Berners-Lee first described the concept of the Semantic Web (5). The idea is simple – the World Wide Web comprises a vast collection of information, but information that is largely meaningless to a computer. If it were to be made machine-understandable, then software agents could be developed that would be able to use this information as a basis for reasoning and to make decisions. This concept, tied to that of open data, would allow for computerised scientists conducting their own data-driven research and reporting their conclusions back to humans. The concept of a machine performing research is not one for the world of science fiction – indeed, the robot scientist Adam has conducted its own hypothesis-driven research, reaching conclusions that were later validated by human researchers (6).

In order to make our information machine-understandable, it is necessary to formalise the semantics of the medium in which it is stored. For the semantic web, such formalisation is typically performed by encoding the data using eXtensible Markup Language (XML). A bookshop, for example, might encode its catalogue as follows:

Of Mice and Men  John Steinbeck  0141023570  £4.00
War and Peace  Leo Tolstoy  1853260622  £1.99

In this example, the four pieces of information that are stored for each book – the title, author, ISBN and price – are enclosed within appropriate XML tags. XML tags may be opening, closing or empty. A computer may read this document and see that the catalogue contains a book, the title of which is “Of Mice and Men” and the author of which is “John Steinbeck”. By specifying the semantics in this way, we have made the data machine-understandable.

Of course, concepts of the same name can mean different things to different people – or even to the same person. The concept “book” may be a collection of bound pages to a publisher or a collection of bets placed by gamblers to a bookmaker.
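As a rough illustration of the catalogue described above, the following Python sketch encodes the two books as XML and reads them back. The element names (catalogue, book, title, author, isbn, price) are assumptions made for this sketch; only the values are taken from the text, and the function name read_catalogue is hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names are assumed for illustration; the values come from the text.
CATALOGUE = """
<catalogue>
  <book>
    <title>Of Mice and Men</title>
    <author>John Steinbeck</author>
    <isbn>0141023570</isbn>
    <price>£4.00</price>
  </book>
  <book>
    <title>War and Peace</title>
    <author>Leo Tolstoy</author>
    <isbn>1853260622</isbn>
    <price>£1.99</price>
  </book>
</catalogue>
"""


def read_catalogue(xml_text):
    """Return one dict per <book>, keyed by the name of each child element."""
    root = ET.fromstring(xml_text.strip())
    return [
        {child.tag: child.text for child in book}
        for book in root.findall("book")
    ]


books = read_catalogue(CATALOGUE)
for book in books:
    # The tag names tell the program which value is the title and which the author.
    print(f'{book["title"]} by {book["author"]}')
```

Because each value is labelled by its enclosing element, the program never has to guess which string is the title and which is the author.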
A “title” may be the name of a book or a prefix to a person’s name in polite conversation. In order to allow a machine to differentiate between different concepts of the same name, XML elements are assigned namespaces, for example:

<book xmlns="http://www.amazon.co.uk" />

The attribute xmlns on this book element defines the term “book” as having the meaning defined by Amazon. The namespaces used in XML are Internationalised Resource Identifiers (IRIs), as opposed to the more common Uniform Resource Locator (URL) – the difference being that an IRI may, but need not, point to the actual location of a resource and may contain characters chosen from a larger set. By using a unique namespace, an author may create his own XML vocabulary suitable for whatever task he has in mind. Concepts from differing namespaces may then be combined within the same document to perform the role required by the document’s author. This combination of flexibility and precision of concepts has led in recent years to XML becoming a de facto standard with, for example, XML-based formats being adopted by Microsoft Office and OpenOffice.

Though we may be some years away from computers routinely performing our science for us, some benefits of making our data machine-understandable are immediate – it is, for example, immediately made far more discoverable. Perhaps our frustrated researcher needs to know the glass transition temperature of polyvinyl alcohol. A text-based search on the web for the terms “glass transition temperature” and “polyvinyl alcohol” may be successful, or it may not. The glass transition temperature is often abbreviated as “Tg”, while polyvinyl alcohol is also known as poly(vinyl alcohol), poly(ethenol) and PVA – which is also the abbreviation for polyvinyl acetate. Moreover, the value of Tg for a polymer depends on the precise composition of the polymer and the method of measurement employed.
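The naming problem just described can be sketched in a few lines of Python. The synonym table below is a hypothetical, hand-built structure containing only the names mentioned in the text; it is not drawn from any real chemical dictionary, and the function name substances_for is likewise an assumption.

```python
# Hypothetical synonym table holding only the names mentioned in the text.
SYNONYMS = {
    "polyvinyl alcohol": {
        "polyvinyl alcohol", "poly(vinyl alcohol)", "poly(ethenol)", "PVA",
    },
    "polyvinyl acetate": {"polyvinyl acetate", "PVA"},
}


def substances_for(term):
    """Return, in sorted order, every substance a search term could refer to."""
    wanted = term.casefold()
    return sorted(
        name
        for name, labels in SYNONYMS.items()
        if wanted in {label.casefold() for label in labels}
    )


# A systematic-style name resolves to one substance.
print(substances_for("poly(ethenol)"))
# The abbreviation is ambiguous: it matches both polymers.
print(substances_for("PVA"))
```

A plain text search has no such table to consult, so it either misses documents that use a different name or returns documents about the wrong polymer.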
The difficulties of locating key information in the literature have been previously noted (7). Our researcher would be greatly advantaged if he could define the property and substance of interest in terms that the machine can understand and allow the correct value to be discovered automatically from a set of precisely defined data.

1.3 Semanticizing Chemistry

The description of chemistry in terms understandable to a machine is supported by the XML dialect Chemical Markup Language (CML). CML allows for the description of atoms, molecules, spectra, reactions and more, and is discussed in section 2.2.4. By embedding CML inside another XML document, it is possible to produce a document which is readable to humans and in which the chemistry may be understood by machines – a datument (8). This process may be carried out either by the original author, on the author’s behalf by an editor during the process of document publication, or post-publication. The creation of XML is best supported by authoring tools, such as the Microsoft Word plug-in Chem4Word, which allows CML to be embedded directly into Word documents (9). Publishers are beginning to see the value of semantically enriching their output, such as in the Royal Society of Chemistry’s Project Prospect (10; 11; 12), but there remains a vast amount of published chemical literature, both past and present, which is entirely unintelligible to a machine. This presents the central problem that this thesis seeks to address; to what extent can we identify and semantically enhance chemical information in published documents and extract it to form novel collections?

1.4 Information Extraction from Chemical Documents

Chemical documents take a variety of types – the most common being journal articles, patents and theses. Such documents are widely available, though terms of usage agreements may restrict the uses to which they may be put.
They typically contain large amounts of chemical information, such as syntheses, characterisation data and properties of a wide range of chemical substances. Such information is typically structured in purely natural (i.e. human) language, which is manually scraped from the literature in order to populate chemical databases such as the aforementioned CAS systems at the cost of much time and effort. The potential automation of this process offers two great advantages. Firstly, the much greater efficiency of an automated process will allow the creation of free data aggregation services such as CrystalEye (13) – enabling data-driven science, saving money for those who depend on the information and widening access to such information. Secondly, by reducing the marginal cost of data acquisition to a matter of machine time, the potential scale on which the process operates will be widened. The goal is unquestionably worthwhile, but the technology is as yet too immature to fully supplant the human aspect. Various systems have been developed to address specific aspects of the goal of information extraction from the chemical literature. In the 1980s, CAS developed an experimental system to aid in their work abstracting chemical reactions from the literature, while the 1990s saw the development of optical structure recognition – software to identify and interpret chemical structure diagrams – which remains an active area of research today. Recent times have seen the development of the Open Source Chemistry Analysis Routines (OSCAR) and ChemicalTagger (14) toolkits in the Unilever Centre. OSCAR (15) allows for the semantic annotation of chemical documents, identifying chemical terminology and data within text, and is described in section 2.2.6.
ChemicalTagger combines the chemical name recognition aspects of OSCAR with standard natural language parsing techniques to analyse the grammar of chemical texts as a precursor to machine understanding of the texts, and is described in section 2.2.7. These two toolkits between them perform key roles in the software developed as part of the current work and provide a platform for the development of large-scale systems for the liberation of chemical information from its containing documents.

1.5 Development Environment

The work for this thesis required the integration and development of a number of pre-existing open-source technologies such as OSCAR; XOM (16), a library for the manipulation of XML; CMLXOM (17), a library for the manipulation of CML and JUMBO (17), a library of CML-compatible cheminformatics tools. All of these libraries are written in the Java programming language – and so, for compatibility reasons, is the code that underlies the present work. It is hoped that, in time, this code will also be released under an open-source licence to allow its further use and development by third parties.

2. Sources of Chemical Documents & Technologies for their Semantic Enrichment

The current work is built upon two sine qua non. The first is that a suitable, and preferably large, set of chemical documents be found that may be used as source documents. The second is that, wherever possible, the relevant pre-existing tools and technologies must be employed. Accordingly, this chapter discusses the sources of chemical documents (section 2.1) and the existing technologies that are used in the current work (section 2.2) such as machine-understandable chemical formats, Chemical Markup Language and the software libraries that operate on it and the tools OSCAR3, ChemicalTagger and OSRA.

2.1 Availability of Documents

Crucial to the success of this work is the availability and usability of suitable source documents.
Chemical documents are typically supplied in one of two types of format: a text format or an image format. In an image format, the supplied data encodes an image of the document, rendering the text of the document not readily accessible. A computer must first perform Optical Character Recognition (OCR) before operations involving the text may occur. OCR technology is highly error-prone, and so documents that are supplied in an image format are to be avoided. Conversely, in a text format the text of the document is encoded using a character set – making it directly available to a computer program. Such documents are obviously preferable for the current work. XML offers an ideal format for text documents, allowing for the explicit definition of text formatting, document sections, etc. in an unambiguous manner. The popular Portable Document Format (PDF) is notoriously difficult to work with. The format is designed to produce electronic copies of a document that describe its appearance rather than its content. While allowing for the creation of documents that are quite suitable for displaying information to human users, PDF does not lend itself to the easy interpretation of the enclosed text. Indeed, the process of extracting text from a PDF has been compared to “converting hamburgers into cows” (18). Consequently, the PDF format is also to be avoided where possible.

2.1.1 Journal Articles

Perhaps the most familiar chemical document is the journal article. Short and focused upon a specific subject, journal articles are published frequently throughout an academic’s career. With the coming of the digital age, journal articles are widely available in digital formats from the journal’s website. Journal articles are most frequently supplied as PDF downloads and in an HTML format for direct viewing on the website.
Since a journal’s publisher will frequently operate a business model that charges for access to its articles, the terms of use of the subscription will typically prohibit the automated downloading of documents that the current work requires. Though open-access journals, such as Acta Crystallographica Section E, exist, they are very much atypical at present. The question of whether such arrangements should or will continue in the digital age is open and important, but beyond the scope of this thesis. So while journal articles constitute an important route for the communication of chemical information, the automated abstraction of information from journal articles was not attempted during the current work.

2.1.2 Theses

The standard route by which a PhD is gained requires the preparation of a thesis, and so a great number of chemical theses are produced around the world every year. After the conclusion of the examination process, a physical copy of the thesis is deposited with the university’s library and becomes a public document. It is curious that in the modern day there is no requirement for the deposition of a digital copy of a thesis, though the very great majority of PhD candidates will have prepared their thesis as a digital document. At the time of writing, the University of Cambridge’s digital repository of scholarly works, DSpace@Cambridge (19), contains just three theses from the Department of Chemistry, suggesting a low take-up rate among candidates. The British Library operates the Electronic Theses Online Service (EThOS) (20) which offers access to those theses that the British Library has digitised or has been supplied with a digital copy of, though coverage is incomplete and not all of the UK’s universities participate in the programme. Due to the difficulty of accessing a sufficient quantity of usable documents, theses were not considered a suitable source of documents for the current work.
2.1.3 Patents

In order to gain a patent, an inventor must disclose and describe the subject of his invention, and detail examples of the invention. In the field of chemistry, this frequently requires the claimant to describe the synthesis and characterisation of a number of example compounds, as well as to describe the background and subject of the invention. As a result, chemical patents contain a great deal of potentially useful information such as synthetic routes and compound properties. In order to allow the public to know what has and what has not been patented, the documents are made public and, in the digital age, are widely accessible. Patent authorities and numerous other websites host copies of the documents. Though these documents are not supplied entirely without copyright protection, the restrictions are certainly less prohibitive than those that apply to journal articles. The United States Patent and Trademark Office, for example, permits a patent author to claim copyright over his work, provided that he allows for the facsimile reproduction of the original (21). The European Patent Office asserts copyright over the content of its website, but allows for its adaptation and reuse without the need to acquire a licence subject to certain conditions (22). Patents therefore provide an ideal source of data for the current work.

2.1.3.1 EPO Patents

The European Patent Office (EPO) has begun publishing its patents in an XML format. This format uses standardised XML tags such as heading and p (paragraph) to define the formatting of the document, to indicate the location of images within the text as well as to link to a specified image file and for some elementary definition of sections of the documents. Crucially, these patents are published in a text format.
The XML files may be downloaded from the EPO website (23), and are packaged into a ZIP file along with a PDF-formatted copy of the patent and a set of files corresponding to the images that are present in the patent. These images are supplied in the TIFF format and are given sequential file names that correspond to the image IDs that are used in the XML-formatted copy of the patent document, e.g. imgb0006.tif. The content of the patent XML files is governed by a Document Type Definition (DTD) file that can be downloaded from the EPO website (24). The composition of the XML-formatted patents, whether used by convention or enforced by the DTD, is subsequently discussed.

Root Element

The root element of the XML documents is ep-patent-document. The common children of this element include SDOBI, abstract, description, claims and ep-reference-list. The only required child of ep-patent-document is abstract, although the other children mentioned will generally be present as well – description, for example, will in practice only be absent in those documents that do not contain a description of the invention e.g. patent search reports. Alternatively, the children of ep-patent-document are permitted to be a series of doc-page elements, but this format is only employed when the pages of the application are included in an image format, and has not been encountered during the preparation of this thesis.

Abstract

The abstract element can be composed either of an abst-problem and an abst-solution element, or of one or more p elements. The abst-problem and abst-solution elements themselves consist of one or more p elements. The p (paragraph) elements contain text as well as formatting tags such as br and sup that perform the same roles as their namesakes in HTML, and further elements such as tables, maths and chemistry that enclose further content of a specific type.
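The structure described above can be illustrated with a minimal sketch that collects the text of every p element beneath abstract, whether the paragraphs are direct children or wrapped in abst-problem and abst-solution. The sample document is a hypothetical, heavily trimmed stand-in for a real EPO file, the class and method names are invented for illustration, and the JDK's built-in DOM and XPath support is used here rather than XOM. A real EPO file may also carry a DOCTYPE declaration that the parser would need to resolve, which this sketch does not address.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EpoAbstractSketch {

    // Hypothetical sample: element names follow the text above, content is invented.
    static final String SAMPLE =
          "<ep-patent-document>"
        + "<abstract>"
        + "<p>An area of 5 cm<sup>2</sup> is coated.</p>"
        + "<p>The coating is a polymer.</p>"
        + "</abstract>"
        + "<description><p>Not part of the abstract.</p></description>"
        + "</ep-patent-document>";

    // Returns the text of each paragraph found anywhere beneath the abstract element.
    static List<String> abstractParagraphs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/ep-patent-document/abstract//p", doc, XPathConstants.NODESET);
            List<String> paragraphs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent flattens formatting children such as sup into plain text.
                paragraphs.add(nodes.item(i).getTextContent());
            }
            return paragraphs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the sample document", e);
        }
    }

    public static void main(String[] args) {
        abstractParagraphs(SAMPLE).forEach(System.out::println);
    }
}
```

Flattening formatting tags in this way loses the superscript information, so a fuller treatment would need to handle elements such as sup explicitly.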
SDOBI

The Sub-DOcument for Bibliographic (SDOBI) data uses proprietary tags to encode a wealth of metadata related to the patent, e.g. the tag B110 contains the patent number, B140 contains the date of publication of the patent and B542 contains the title of the patent. The specification of these tags is contained in the patent DTD, but since this metadata is not used at any point in this thesis it shall not be further discussed.

Claims

By convention, each document contains three claims sections – one each in English, French and German. Each claims element contains one or more claim elements, and each claim element contains one or more claim-text elements. A claim-text element is composed of text, HTML-style formatting tags and further tags in a similar manner as for the p element.

ep-reference-list

The ep-reference-list element contains one or more sets of a heading element followed by one or more p elements. The heading elements contain text and HTML-style formatting tags, and the p elements have been discussed previously.

Description

The description element contains the majority of the text of the patent, and the DTD allows it to be composed of one or more sets of a heading followed by a number of p elements. The DTD also allows for a number of elements to be used that correspond to well-defined sections of the patent document, e.g. technical-field, industrial-applicability and description-of-embodiments – unfortunately the comments in the DTD state that “these elements must NOT be used by contractors” and they do not occur in the patents that comprise the corpus used to prepare this thesis. As a result, the identification of the different sections of the patent documents is not the trivial task that the DTD allows for, and this task is discussed later.

2.1.3.2 Other Patents

Patents are, of course, published by organisations other than the EPO.
The World Intellectual Property Organisation (WIPO) publishes patent documents through its website (25), primarily as images that are not suitable for the current work, though also as HTML-formatted text with minimal markup of document sections. The United States Patent and Trademark Office (USPTO) produces text versions of its patents which are converted to XML and made available for bulk download via Google Patents (26). At the time of writing the documents available for download date from 1976 to the present day, and are claimed to number approximately 7 million, across all subjects. The format of the patent text within the USPTO XML files is similar to that used by the EPO, with heading and p elements containing the text of headings and paragraphs respectively, and these elements forming a single, unstructured list. References to external files, including images, ChemDraw and Mol files are present in the XML, and the supporting files are available as part of the bulk downloads. Though the USPTO patents were not used in the current work it is believed that the techniques and technology developed for the EPO patents should be directly applicable to them, with the need for only some minor customisation. The set of USPTO patents would therefore constitute a second bulk set of source documents for a scaled-up version of the current work. The WIPO patents could also be used as source documents with the caveat that it would be necessary first to reformat the HTML versions of the patents and identify headings within the text before they could be subjected to the same process as those patents produced by the EPO and the USPTO.

2.2 Key Technologies

Of course, the work presented in this thesis has not been conducted in an intellectual vacuum. Much of it is built upon technologies that have been developed over the preceding years or decades, and the most important of these technologies are subsequently discussed.
2.2.1 XML & XPath

The basic terminology and format of XML, as well as the capacity it provides to render information machine-understandable, have already been discussed in section 1.2. Much of the functionality for writing and handling XML documents in the current work is provided by the XOM library (16). XOM provides such basic and essential tools as the ability to read, operate upon, and serialise XML documents while constantly ensuring that the document is well-formed – the requirement that, among other points, the document must contain a single root element from which all other elements in the document descend, and that the elements be correctly nested, i.e. they must not overlap. Key operations supported by XOM include the addition to and removal from the document of XML elements and attributes and of text content. XOM further supports the use of the XML query language XPath (27). XPath is a means to select XML nodes (elements, attributes, etc.) from a document. The language permits the user to formulate simple, context-independent queries such as;

//molecule

which selects all elements named “molecule” that are descended from the starting node, the prefix “//” indicating that the position of the selected node in the document is unimportant, provided that it descends from the starting point. Queries may be more complex and involve context-dependent terms, for example;

/molecule//atom

which selects all elements named “atom” that are descended from an element named “molecule” that is itself a child of the starting node. The query may be further specified by the requirement that elements carry named attributes or that named attributes have specific values, for example;

/molecule[@name='benzene']//atom

which operates as for the previous example, with the added requirement that the molecule element must carry a name attribute with the value “benzene”.
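The three queries above can be tried against a toy document with the following minimal sketch. It uses the XPath engine bundled with the JDK (javax.xml.xpath) rather than XOM, so it illustrates the query language only and not the XOM API; the sample document, class name and helper method are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical toy document: one molecule element holding two atom elements.
    static final String XML =
          "<molecule name='benzene'>"
        + "<atomArray><atom id='a1'/><atom id='a2'/></atomArray>"
        + "</molecule>";

    // Counts the nodes selected by an XPath expression evaluated from the document node.
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//molecule"));                        // 1
        System.out.println(count("/molecule//atom"));                   // 2
        System.out.println(count("/molecule[@name='benzene']//atom"));  // 2
        System.out.println(count("/molecule[@name='toluene']//atom"));  // 0
    }
}
```

The last query returns nothing because the attribute predicate filters out the only molecule element before its descendants are considered.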
The example XPath expressions given here are simple examples, but are sufficient to give an impression of the uses to which XPath is put in the current work.

2.2.2 Regular Expressions

Regular expressions (“regexes”) are a powerful method for string matching, for which support has been added to a number of programming languages including Java. When using regular expressions, literal characters in the search regex match to characters in the target text, e.g. the regex “mol” matches the substring “mol” in each of the strings “mol”, “molecule”, “salbutamol” and “Smolensk”. Certain characters are used as metacharacters and have roles other than to literally denote a character. For example, the metacharacter “$” marks the end of a line, the metacharacters “(” and “)” denote the beginning and the end of a group respectively, and the metacharacter “?” notes that the preceding character or group is optional. Character classes may be used in regexes to match a single character from a well-defined set. Certain character classes are built in to the language, such as “.”, which matches any character and “\s”, which matches any whitespace character. Other character classes may be defined by the user by enclosing the permitted characters in square brackets, e.g. “[abc]” matches any one of the characters “a”, “b” or “c”. Further metacharacters permit iteration of the preceding character or group, such as “*”, which specifies that any number (including zero), and “+”, which specifies that one or more, of the preceding character or group must occur in a match. Regular expressions may also define lookaheads and lookbehinds, which require that a matching substring must be followed or preceded, respectively, by a specified regex, e.g. a lookahead is used in the regex “mol(?=\s)” to match “mol” in “salbutamol is used by asthmatics” but not in “Smolensk is in Russia”. Regular expressions may be used to match complex patterns of text and to extract this text from a document.
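The lookahead and character-class examples above can be reproduced with java.util.regex, as in the following minimal sketch. The class name, helper methods and the third test string are hypothetical and chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSketch {

    // Lookahead: "mol" matches only when immediately followed by whitespace.
    static final Pattern MOL_BEFORE_SPACE = Pattern.compile("mol(?=\\s)");

    // User-defined character class with iteration: one or more of a, b or c.
    static final Pattern ABC_RUN = Pattern.compile("[abc]+");

    // True if the pattern matches anywhere in the text.
    static boolean containsMatch(Pattern pattern, String text) {
        return pattern.matcher(text).find();
    }

    // Returns the first matching substring, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(containsMatch(MOL_BEFORE_SPACE, "salbutamol is used by asthmatics")); // true
        System.out.println(containsMatch(MOL_BEFORE_SPACE, "Smolensk is in Russia"));            // false
        System.out.println(firstMatch(ABC_RUN, "xxcabbagex"));                                   // cabba
    }
}
```

Note that the lookahead does not consume the whitespace: the matched substring is still just “mol”.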
In the current work, regular expressions are most notably used by OSCAR3 in order to match the highly stylised reports of spectral data that occur in chemical texts. This operation is discussed later, in section 2.2.6.5.

2.2.3 Machine-Understandable Chemical Formats

As the usage of computers to manage chemical information expanded, it became necessary to be able to represent chemical structures in a format that was interpretable by the machine. While the simple text strings “ethyl acetate” or “Oseltamivir” are easily understood by humans with sufficient domain knowledge, the chemical structures they represent cannot be trivially identified by a computer. Simple tasks such as substructure searching or calculation of molecular weights therefore cannot be automated if the input to the program is not machine-understandable. Machine-understandable chemical formats have proliferated over the years – the open-source format-conversion tool Open Babel supports more than 90 such formats (28). Two simple formats – SMILES and InChI – are employed in the current work and are subsequently discussed.

2.2.3.1 SMILES

Simplified Molecular Input Line Entry Specification (SMILES) (29) is a popular form of line notation – a method for representing chemical structures in which a single string encodes the structure. In SMILES, atoms are represented using the abbreviated forms of their chemical elements. Atoms may be indicated to be bonded to one another when their element symbols are adjacent to one another, and bonds are assumed to be single bonds unless the two symbols are connected by the symbol = or #, marking double and triple bonds respectively, and unless the atoms are from the limited subset permitted to be marked as aromatic by using their lowercased element symbol, e.g. “n” and “c”. To keep SMILES strings readable, it is assumed that hydrogen atoms are present in sufficient number and appropriate positions to occupy free valencies.
Thus, propanal may be represented by the SMILES string “CCC=O”. Of course, not all chemical structures take the topological form of lines. Branches in a molecule may be indicated by enclosing the side chain within brackets, while ring closures may be indicated by following the two atoms between which a bond is present with the same number. Thus, 1,1-dimethylcyclohexane may be represented by the SMILES string “CC1(C)CCCCC1”. SMILES strings provide a concise means of representing chemical structures in a machine-understandable format, and have the advantage that it is possible to produce representations for simple structures such as those above that are comprehensible to humans. SMILES strings have the disadvantage, however, that it is possible to represent a single chemical structure with a number of different SMILES strings. Propanal, for example, may be represented as “CCC=O” as above, as “O=CCC” or as “C(C=O)C” among many other permutations. It is therefore not possible to determine if the structure represented by two SMILES strings is the same by simple text comparison of the two strings. Though various proprietary algorithms exist for the canonicalization of SMILES strings, such as CANGEN (30), the differing algorithms produce differing canonical SMILES strings for the same connection table. Consequently, recent attention has focused on an open standard – the International Chemical Identifier (InChI).
2.2.3.2 InChI

The InChI technical manual (31) states that; “The objective of the Identifier is to provide a string of characters capable of uniquely representing a chemical compound… Since InChI is intended to serve as a precise digital signature of a compound, it must have two properties: 1) different compounds (as defined by their ‘connection tables’) must have different identifiers and 2) a single compound must have a single identifier, regardless how its structure is drawn.” The InChI (32; 33) represents a chemical structure in a series of layers and sub-layers. Sub-layers are indicated by preceding their content with the string “/?” where “?” is a single character code that identifies the sub-layer. The connection layer, indicating connectivity of the atoms in the molecular graph, for example, is indicated by the string “/c”. Sub-layers are present only when required to describe the structure the InChI represents. For example, consider the InChIs for limonene, (R)-limonene and (S)-limonene as shown in Figure 2-1.

Limonene: InChI=1/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10H,1,5-7H2,2-3H3
(R)-limonene: InChI=1/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10H,1,5-7H2,2-3H3/t10-/m0/s1
(S)-limonene: InChI=1/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10H,1,5-7H2,2-3H3/t10-/m1/s1

Figure 2-1: InChI representations of limonene

The InChI for limonene is composed as follows;
• InChI=1 – a declaration stating that the InChI is version 1
• /C10H16 – the molecular formula sub-layer (prefixed “/”), stating that the molecular formula is C10H16
• /c1-8(2)10-6-4-9(3)5-7-10 – the connectivity layer (prefixed “/c”), defining connections between atoms in the molecular graph; in this case atom 1 is connected to atom 8, which is connected to atoms 2 and 10 etc.
• /h4,10H,1,5-7H2,2-3H3 – the hydrogen layer (prefixed “/h”), defining the positions of hydrogen atoms in the molecule; in this case, atoms 4 and 10 have one hydrogen atom each, atoms 1 and 5-7 have two hydrogen atoms each etc.
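The layered composition described above lends itself to a simple illustration: splitting an InChI string on “/” and keying each sub-layer by its single-character prefix. The following is a minimal sketch only, with a hypothetical class and method name; it assumes a version 1 identifier in which the formula sub-layer is present and each prefix occurs at most once, which holds for the limonene examples but not for every InChI. It is not a substitute for the official InChI software or the JNI-InChI library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InchiLayersSketch {

    // Splits an InChI string into its declared version, formula and prefixed sub-layers.
    // Simplifying assumption: a repeated prefix would overwrite the earlier entry.
    static Map<String, String> splitLayers(String inchi) {
        if (!inchi.startsWith("InChI=")) {
            throw new IllegalArgumentException("Not an InChI string: " + inchi);
        }
        String[] parts = inchi.split("/");
        Map<String, String> layers = new LinkedHashMap<>();
        layers.put("version", parts[0].substring("InChI=".length()));
        if (parts.length > 1) {
            // The formula sub-layer carries no letter prefix.
            layers.put("formula", parts[1]);
        }
        for (int i = 2; i < parts.length; i++) {
            if (parts[i].isEmpty()) {
                continue;
            }
            // First character identifies the sub-layer, the remainder is its content.
            layers.put(parts[i].substring(0, 1), parts[i].substring(1));
        }
        return layers;
    }

    public static void main(String[] args) {
        String limonene = "InChI=1/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10H,1,5-7H2,2-3H3";
        splitLayers(limonene).forEach((key, value) -> System.out.println(key + " -> " + value));
    }
}
```

Keeping insertion order with a LinkedHashMap preserves the mandated ordering of the layers when the result is printed.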
When the stereochemistry of the chiral centre is specified as (R), the InChI is extended as follows;
• /t10- – the sp3 stereo sub-layer; in this case indicating that atom 10 has stereochemistry of parity “-” according to the InChI algorithm
• /m0 – indicating whether all defined sp3 stereochemistry should be inverted, allowing enantiomers of molecules with multiple stereocentres to share an identical sp3 stereo sub-layer
• /s1 – defining the type of stereochemistry as absolute or relative; in this case, absolute

When the InChI for (S)-limonene is calculated, it can be seen to be identical to that of (R)-limonene with the exception that the “/m0” becomes “/m1”, indicating the inversion of the stereocentre. In order to produce canonical representations, the ordering of layers in an InChI is mandated. As a result, InChIs for related structures will start identically – for example, InChIs for all the isomers of limonene will start “InChI=1/C10H16”, while all InChIs for all the stereoisomers of limonene will start “InChI=1/C10H16/c1-8(2)10-6-4-9(3)5-7-10/h4,10H,1,5-7H2,2-3H3”, as seen above. The content of the various sub-layers is assured to be canonical by various procedures as detailed in the InChI technical manual; for example the molecular formulae are formatted according to the Hill system (34). The generation of InChIs is supported by a number of popular software packages such as ChemDraw (35) and OpenBabel (28). Functionality for the automatic generation of InChIs from within Java programs is provided by the JNI-InChI library (36) and it is this library to which JUMBO delegates much of the process. Consequently, InChIs are a convenient canonical identifier for small molecules and are used where appropriate throughout the current work.

2.2.4 Chemical Markup Language

Chemical Markup Language (CML) is an XML-based language for the description of chemistry (37; 38; 39; 40; 41).
As such, CML allows for the description of machine-understandable connection tables and much more besides. For example, the connection table of acetaldehyde may be represented as in Figure 2-2.

<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <name dictRef="nameDict:unknown">acetaldehyde</name>
  <atomArray>
    <atom id="a1" elementType="C" />
    <atom id="a2" elementType="C" />
    <atom id="a4" elementType="O" />
    <atom id="a5" elementType="H" />
    <atom id="a6" elementType="H" />
    <atom id="a7" elementType="H" />
    <atom id="a8" elementType="H" />
  </atomArray>
  <bondArray>
    <bond id="a1_a2" atomRefs2="a1 a2" order="S" />
    <bond id="a1_a4" atomRefs2="a1 a4" order="D" />
    <bond id="a1_a5" atomRefs2="a1 a5" order="S" />
    <bond id="a2_a6" atomRefs2="a2 a6" order="S" />
    <bond id="a2_a7" atomRefs2="a2 a7" order="S" />
    <bond id="a2_a8" atomRefs2="a2 a8" order="S" />
  </bondArray>
</molecule>

Figure 2-2: CML representation of acetaldehyde

The individual atoms that make up the molecule are represented by the atom elements, which are contained within the atomArray element. Bonds are represented by the bond elements, contained within the bondArray element. The chemical element of each of the atoms is defined by the elementType attribute on the atom element, while the bond order is specified by the order attribute on the bond element and the two (assuming a conventional two-centre bond) atoms between which the bond exists are identified by the atomRefs2 attribute on the bond element – the value being a concatenation of the unique ids of the atoms between which the bond exists. The example above does not include any information that could not be encoded in most if not all machine-understandable formats, but the great advantage of CML is that it provides a platform for the description of chemical information that is more complex than simply molecular connectivity. For example, CML vocabularies exist for the description of chemical reactions (42) and spectral data (43).
Furthermore, these vocabularies may co-exist within the same document rather than require, for example, one document describing a molecular structure to link to a separate document that describes its NMR spectrum. For example, the hydration of acetaldehyde, as shown in Figure 2-3, may be represented in CML as in Figure 2-4, in which the connection tables of the reaction components have been omitted for the sake of brevity.

Figure 2-3: Hydration of acetaldehyde

<reaction xmlns="http://www.xml-cml.org/schema">
  <reactantList>
    <reactant>
      <molecule id="m1">
        <name dictRef="nameDict:unknown">acetaldehyde</name>
      </molecule>
    </reactant>
    <reactant>
      <molecule id="m2">
        <name dictRef="nameDict:unknown">water</name>
      </molecule>
    </reactant>
  </reactantList>
  <spectatorList>
    <spectator>
      <molecule id="m3">
        <name dictRef="nameDict:unknown"> hydrogen chloride </name>
      </molecule>
    </spectator>
  </spectatorList>
  <productList>
    <product>
      <molecule id="m4">
        <name dictRef="nameDict:unknown"> 1,1-ethanediol </name>
        <spectrum type="hnmr">
          <peakList>
            <peak xValue="1.40" integral="3.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:doublet">
              <peakStructure type="coupling" value="6.8" units="unit:hz" />
            </peak>
            <peak xValue="3.65" integral="2.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:singlet" />
            <peak xValue="5.13" integral="1.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:quartet">
              <peakStructure type="coupling" value="6.8" units="unit:hz" />
            </peak>
          </peakList>
        </spectrum>
      </molecule>
    </product>
  </productList>
</reaction>

Figure 2-4: CML representation of a chemical reaction

It can be seen that the reaction is described by its component reactant, spectator (e.g. solvents and catalysts) and product molecules. The 1H NMR spectrum of the product molecule is also contained within the document, being present as a spectrum child of the product molecule.
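As an illustration of how such a document might be consumed, the following minimal sketch lists the molecule names attached to each role in a trimmed copy of the reaction in Figure 2-4. It uses the JDK's namespace-aware DOM parser and XPath with local-name() tests instead of CMLXOM or JUMBO, so it should not be read as the API of those libraries; the class name and the names method are hypothetical, and the role argument is assumed to be one of reactant, spectator or product.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CmlReactionSketch {

    // Trimmed copy of the reaction in Figure 2-4 (spectrum and dictRef attributes omitted).
    static final String CML =
          "<reaction xmlns='http://www.xml-cml.org/schema'>"
        + "<reactantList>"
        + "<reactant><molecule id='m1'><name>acetaldehyde</name></molecule></reactant>"
        + "<reactant><molecule id='m2'><name>water</name></molecule></reactant>"
        + "</reactantList>"
        + "<spectatorList>"
        + "<spectator><molecule id='m3'><name> hydrogen chloride </name></molecule></spectator>"
        + "</spectatorList>"
        + "<productList>"
        + "<product><molecule id='m4'><name> 1,1-ethanediol </name></molecule></product>"
        + "</productList>"
        + "</reaction>";

    // Returns the trimmed molecule names for one role: reactant, spectator or product.
    static List<String> names(String role) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the CML elements live in a namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(CML)));
            // local-name() sidesteps namespace prefix binding for this short sketch.
            String query = "//*[local-name()='" + role + "']"
                    + "/*[local-name()='molecule']/*[local-name()='name']";
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the CML sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(names("reactant"));
        System.out.println(names("spectator"));
        System.out.println(names("product"));
    }
}
```

Matching on local-name() ignores the namespace altogether, which is convenient here but would also match same-named elements from other vocabularies; binding the CML namespace explicitly would be the stricter choice.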
The examples given above demonstrate a small but, in the context of this thesis, significant subset of the capacity of CML to define chemical information. By rendering chemical information machine-understandable, CML allows for the creation of systems that integrate data of a variety of types and from a variety of sources to perform novel research – a semantic web for chemistry.

2.2.5 CMLXOM & JUMBO

Since the purpose of Chemical Markup Language is to allow computers to understand chemistry, it follows that there is a need for tools for reading, writing and manipulating CML to allow programmers to implement CML-compliant software. These roles are filled by the open-source libraries CMLXOM and JUMBO (44). CMLXOM is intended as a CML-specific extension to the popular open-source XML manipulation library XOM (16), while JUMBO serves as an open-source collection of CML-compliant cheminformatics utilities. Further JUMBO-related projects exist to provide specific pieces of functionality, such as the JUMBOConverters (45) project that provides conversion between CML and a number of other formats. While the great majority of code development that was undertaken as part of the current work occurred in projects specific to the task at hand, functionality already provided as part of CMLXOM and JUMBO was reused where possible and functionality likely to be of use to others was added to the shared libraries. The current work used versions 5.5.1 of JUMBO, 2.5.1 of CMLXOM, 0.3 of JUMBOConverters and 1.2.3 of XOM.

2.2.6 OSCAR3

The OSCAR project has, over the last few years, resulted in the creation of a number of tools for the analysis of chemical texts. OSCAR3 (46) is the latest incarnation of this project, and is a tool for the automated annotation and semantic enrichment of chemical texts. The primary function of OSCAR3 is to recognise chemical named entities and, where possible, to resolve chemical names to their corresponding connection tables.
Where OSCAR3 recognises a named entity, annotations may be created in two forms – standoff, where the annotations are held in a separate document, and inline, where the annotations are directly embedded in the document under analysis. The system architecture of OSCAR3 is summarised in Figure 2-5, and the functioning of the program is subsequently discussed. Where not otherwise noted, the version of OSCAR3 used in the current work is revision 877, dated 2010-07-21.

Figure 2-5: OSCAR3 Architecture

2.2.6.1 Input

OSCAR3 accepts input in a variety of formats. The system architecture is strongly dependent upon the SciXML format – much of the code relies upon the subject document being held in this format. As an initial step, other input formats are converted to SciXML and tools exist within OSCAR3 for this conversion to be effected from plain text and from other XML formats, and for the conversion of HTML to plain text from which the SciXML document may be created, as shown in Figure 2-5.

2.2.6.2 Key OSCAR3 Resources

The various stages of processing implemented in the OSCAR3 workflow make use of a number of key resources that are shared by the different modules. These resources provide examples of chemical and non-chemical terminology and associated structures. The resources are subsequently outlined.

ChemNameDict

ChemNameDict is a dictionary of chemical names that is largely derived from the ChEBI Ontology (47; 48) as at August 2009. ChemNameDict contains records for 13068 unique compounds, and each record contains precisely one InChI, at most one SMILES string and at least one chemical name. The large majority of the records that are derived from the ChEBI Ontology also contain at least one ChEBI ID, while the 52 that have been manually added by the OSCAR3 developers and were not present in the ChEBI Ontology do not.

Ontology.txt

Ontology.txt provides a list of terms from the ChEBI, FIX (49) and REX (50) ontologies and their corresponding ontology IDs.
Lexicon

The Lexicon provides a list of chemical elements, both names and atomic symbols, from Hydrogen to Darmstadtium, element 110.

ChemAses & NonChemAses

ChemAses comprises a list of the names of 924 enzymes that are intended to be recognised by OSCAR3 – those of named entity type ASE, as described in section 2.2.6.3 – that was manually collected from the Gene Ontology (GO) (51), while NonChemAses is a list of 126 enzyme names from GO that are not of type ASE, e.g. “epimerase” and “transferase”.

UsrDictWords

UsrDictWords is a list of approximately 96000 English words, included to provide as wide a variety of non-chemical words as possible. This dictionary contains chemical terms commonly used in English speech, such as “aspirin” and “benzene”. Terms from UsrDictWords are therefore only taken as examples of non-chemical terms if they do not also appear in one of the lists of chemical terms.

Further Training Data

Further training data is provided from a number of chemical documents in which named entities have been hand annotated by the OSCAR3 developers. This data includes the lists of tokens that have occurred in the texts and their state as chemical or non-chemical terms.

2.2.6.3 Named Entity Recognition

The key function of OSCAR3 is the recognition of named entities. A number of classes of named entities are recognised by OSCAR3. These classes are fully described in the OSCAR3 annotation guidelines, and can be summarised as follows:

• Chemical (CM) – used for terms that denote a specific chemical structure, substructure or generic structure e.g. “water”, “methyl” or “ketone”. The CM class is not applied to terms that describe function or properties, such as “ligand” or “base”.
• Reaction (RN) – used for names of reactions that are derived from terms of type CM or named after chemists, e.g. “ozonolysis” – derived from “ozone” – or “Hofmann degradation” – named after the 19th century chemist August Wilhelm von Hofmann. RNs may occur in any part of speech e.g.
“hydrolyse” or “methylated”.
• Chemical Adjective (CJ) – used for adjectives that are derived from words of type CM and do not occur as part of a CM, e.g. “aqueous” or “citric” in “citric acidity”.
• Enzyme (ASE) – used for enzyme names that are derived from words of type CM or RN, e.g. “beta-lactamase” or “hydrolase”.
• Chemical Prefix (CPR) – used for prefixes that would normally occur as part of a CM but are not being used as part of one, e.g. “cis-” in “cis-isomer” or “1,3-” in “1,3-dipolar”.
• Ontology term (ONT) – used for a set of dictionary terms that do not belong to any of the above types but are of sufficient importance to have been included in the ontology.txt resource, e.g. “acid” or “base”.

Previous work has involved the manual annotation of a corpus according to the OSCAR3 annotation guidelines by separate human annotators and has demonstrated inter-annotator agreement of around 90% (52).

The named entity classes are further divided into subclasses. The assignment of named entities to subclasses is not typically of importance to the user and does not form a part of the standard process of named entity recognition. Indeed, the performance of automatic subclass assignment is poor compared to that of the named entity recognition (53). Each of the classes CM, RN, CJ and ASE is divided into subclasses while the classes CPR and ONT are not.

The class CM has three major and three minor subclasses. The major subclasses are EXACT, for specific compounds such as “acetic acid”, CLASS, for classes of compounds such as “carboxylic acid” and PART, for parts of specific compounds and classes thereof such as “carbonyl” and “alkyl”. The three minor subclasses are intended for terms that do not easily fit into one of the three major subclasses and are SPECIES, for elements used in the sense of “atmospheric carbon”, SURFACE and POLYMER, used for surfaces and polymers respectively.
The subclasses of RN are REACT, for specific named reactions such as “methylation”, DESC, for descriptions of compounds such as in “the chlorinated solvent” and MOVE, for words that denote the movement of chemical compounds such as in “the chlorinated swimming pool”. The subclasses of CJ are EXACT, CLASS and PART, as for CM, plus ACID, for adjectives referring to an acid such as “citric”, SOLUTION, for adjectives referring to solutions such as “aqueous” and RECEPTOR for adjectives that refer to receptors such as “nicotinic”. The subclasses of ASE are not relevant to the current work and are detailed in the literature for the interested reader.

Once named entity recognition has been performed, the results are used to create a set of standoff annotations. If desired by the user, these standoff annotations are then passed to the NameResolver class (see section 2.2.6.4) in order to generate and incorporate connection tables for named entities of type CM where possible. Finally, the set of standoff annotations may be used to create an inline document, embedding named entity annotations into the subject document. For example, the input text “Methyl bromide is my favourite alkyl halide.” is rendered as shown in Figure 2-6.

<PAPER>
  <BODY>
    <DIV>
      <HEADER />
      <P>
        <ne id="o1" surface="Methyl bromide" type="CM" confidence="0.8851147923587711" SMILES="[H]C([H])([H])Br" InChI="InChI=1/CH3Br/c1-2/h1H3" cmlRef="cml1" ontIDs="CHEBI:39275">
          Methyl bromide
        </ne>
        is my favourite
        <ne id="o2" surface="alkyl halide" type="CM" confidence="0.8596820024542137" rightPunct="." ontIDs="CHEBI:24469">
          alkyl halide
        </ne>
        .
      </P>
    </DIV>
  </BODY>
  <cmlPile>
    <cml convention="cmlDict:cmllite" id="cml1" xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-cml.org/dictionary/cml/name/">
      <molecule id="m1">
        <name>Methyl bromide</name>
        <identifier convention="iupac:inchi">
          InChI=1/CH3Br/c1-2/h1H3
        </identifier>
        <name dictRef="nameDict:unknown">Methyl bromide</name>
        <atomArray>
          <atom id="a1" elementType="C" />
          <atom id="a3" elementType="H" />
          <atom id="a4" elementType="H" />
          <atom id="a5" elementType="H" />
          <atom id="a2" elementType="Br" hydrogenCount="0" />
        </atomArray>
        <bondArray>
          <bond id="a1_a3" atomRefs2="a1 a3" order="S" />
          <bond id="a1_a4" atomRefs2="a1 a4" order="S" />
          <bond id="a1_a5" atomRefs2="a1 a5" order="S" />
          <bond id="a2_a1" atomRefs2="a2 a1" order="S" />
        </bondArray>
      </molecule>
    </cml>
  </cmlPile>
</PAPER>

Figure 2-6: Example inline document as produced by OSCAR3

OSCAR3 takes a modularised approach to named entity recognition and implements an extensible architecture. Two approaches are currently implemented – the PatternRecogniser and the MEMMRecogniser. Both approaches operate by breaking an input string of text into tokens – continuous sets of characters that may or may not correspond to words. The PatternRecogniser and MEMMRecogniser attempt to identify within the resulting token sets any examples of the named entity classes described above.

PatternRecogniser

The strategy employed by the PatternRecogniser attempts to classify tokens as chemical or non-chemical before joining sequential tokens into multi-token named entities where appropriate. The classification is carried out by comparing the token to the lists of chemical and non-chemical tokens derived from the OSCAR3 resources.
If the token is contained in a list of chemical tokens derived from the resources, it is known to be chemical, while if it is not contained in a chemical list but is contained in the UsrDictWords resource then it is known to be non-chemical. If the token is previously unseen then a naïve Bayesian model is used to classify the token. This model is trained using 4-grams – sequential substrings of the token of length 4 – of the chemical and non-chemical terms in the OSCAR3 resources. For example, for the token “pyridine” the 4-grams would be “^pyr pyri yrid ridi idin dine ine$”. The first and last 4-grams here indicate specifically that the characters “pyr” and “ine” were found at the beginning and the end of the string respectively. Once trained, the model uses the 4-grams of an unseen token to predict the probability that it belongs to the chemical class, and if this probability exceeds a user-modifiable threshold then the token is classified as chemical.

Once the tokens have been determined to be chemical or non-chemical, a pre-classifier then putatively assigns each of the chemical tokens to one of the named entity classes, based on the ending of the token. For example, tokens ending in “ic” are putatively assigned as CJ, with the exception of “arsenic” which is putatively assigned as CM. Adjacent chemical tokens are then joined together into single named entities where appropriate. For example, “ethyl acetate” should be recognised as a single named entity, whereas “carbonyl carbon” should not. This process is controlled by regular expression style patterns based upon common naming conventions such as “*yl *ate”, “*ium *ide” or “*ic acid”. Each of these patterns defines the named entity class to which the term should be assigned, for example terms of the form “CPR and RN” such as “α- and β-methylation” are assigned as RN. Those tokens that have not been combined to form a multi-token name are then confirmed as members of their putative class.
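The 4-gram classification of unseen tokens can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the OSCAR3 implementation: the class name NGramNaiveBayes, the add-one smoothing and the eight training words are all choices made for this example, whereas OSCAR3 trains on its full resources.

```python
import math
from collections import Counter


def four_grams(token):
    """Split a token into 4-grams, marking its start with ^ and its end with $."""
    padded = f"^{token}$"
    return [padded[i:i + 4] for i in range(len(padded) - 3)]


class NGramNaiveBayes:
    """Toy naive Bayes model over 4-grams (hypothetical, add-one smoothing)."""

    def __init__(self, chemical, non_chemical):
        self.counts = {
            "chem": Counter(g for t in chemical for g in four_grams(t)),
            "non": Counter(g for t in non_chemical for g in four_grams(t)),
        }
        self.totals = {k: sum(c.values()) for k, c in self.counts.items()}
        self.vocab = len(set(self.counts["chem"]) | set(self.counts["non"]))
        n = len(chemical) + len(non_chemical)
        self.priors = {"chem": len(chemical) / n, "non": len(non_chemical) / n}

    def _log_score(self, label, grams):
        # log P(label) + sum of log P(gram | label), with add-one smoothing.
        score = math.log(self.priors[label])
        denominator = self.totals[label] + self.vocab + 1
        for g in grams:
            score += math.log((self.counts[label][g] + 1) / denominator)
        return score

    def p_chemical(self, token):
        """Return the estimated probability that the token is chemical."""
        grams = four_grams(token)
        chem = self._log_score("chem", grams)
        non = self._log_score("non", grams)
        m = max(chem, non)  # subtract the maximum for numerical stability
        e_chem, e_non = math.exp(chem - m), math.exp(non - m)
        return e_chem / (e_chem + e_non)


model = NGramNaiveBayes(
    chemical=["pyridine", "pyrimidine", "piperidine", "aniline"],
    non_chemical=["table", "window", "garden", "running"],
)
THRESHOLD = 0.5  # stands in for the user-modifiable threshold
is_chemical = model.p_chemical("pyrazine") > THRESHOLD
```

Here the unseen token “pyrazine” shares the 4-grams “^pyr” and “ine$” with the chemical training words, which is what pushes its estimated probability towards the chemical class.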
When identifying named entities using the PatternRecogniser, the assignment of named entity subclasses is not supported.

MEMMRecogniser

A second NameRecogniser implementation, the MEMMRecogniser (54), replaces the regular expressions employed by the PatternRecogniser with a Maximum Entropy Markov Model (MEMM), using the implementation supplied in the OpenNLP package Maxent (55). The pre-classification of tokens is carried out as described above, then for each token a number of features are generated. Such features may describe the token itself, such as the named entity class assigned by the pre-classifier and the 4-grams used in the pre-classification procedure, or contextual information about the usage of the token such as the tokens that surround the token in question. These features are then used by a rescorer to identify named entities. The rescorer consists of four individual maximum entropy classifiers, one for each of the classes CM, RN, CJ and ASE. Each of these four classifiers, trained on data extracted from manually annotated documents, predicts the probability that a given term is a named entity of the class for which the classifier has been trained and the most likely named entity subclass. Named entities are then identified within the text by discounting “blocked” entities – those that overlap with another entity with a higher confidence score. In the term “ethyl acetate”, for example, the entities “ethyl” and “acetate” would be considered individual named entities with low confidence scores, while the named entity “ethyl acetate” would have a far higher confidence score and therefore supersede the single-token named entities. The MEMMRecogniser is considered to provide higher performance than the PatternRecogniser and is the NameRecogniser that OSCAR3 uses by default.

2.2.6.4 NameResolver

The function of the NameResolver is, where possible, to generate a connection table for a named entity in InChI, SMILES and CML formats.
In a conventional OSCAR3 workflow, the NameResolver modifies the set of standoff annotations that has been produced by the NameRecogniser to include the connection tables, though as part of the work done in preparation of this thesis its capability was widened to allow its use as part of a library and the resolution of individual chemical names. Consequently, it is now possible to use the NameResolver to resolve individual chemical names to structures without the need to create and parse a SciXML document.

When attempting to resolve a name, a number of strategies are employed. First, if the named entity is found in the NameResolver cache – the set of terms previously resolved for the document – then the previously generated results are used. Secondly, the term is looked up in the Lexicon and ChemNameDict resources and, if the name being queried is found, the connection tables stored therein are returned. Finally, the name is passed to the Open Parser for Systematic IUPAC Nomenclature (OPSIN). The results generated at any stage are cached to facilitate more rapid functioning of the code.

The differing methods for resolving names produce connection tables in various formats. OPSIN, for example, generates a CML representation of the molecule, while a name resolved from ChemNameDict will have been resolved to InChI and, most likely, to SMILES. The formats to which the name has not been resolved are generated from the known formats and the three formats are added to the document being processed, in a conventional OSCAR3 workflow, or are returned to the calling function if OSCAR3 is being used as an external library.

2.2.6.5 DataParser

OSCAR3 also has a secondary but highly useful function implemented in the DataParser class – the capacity to recognise and annotate sections of analytical data such as NMR and mass spectra by using regular expressions to match the highly stylised formats authors use when reporting them.
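To give a flavour of this regular-expression approach, the following Python sketch extracts the spectrometer frequency and the peaks from one common reporting style for 1H NMR data. The two patterns and the function parse_hnmr are hypothetical simplifications written for this example; the expressions used by the DataParser itself are considerably more permissive.

```python
import re

# Spectrometer frequency, e.g. "1H NMR(400MHz)" (simplified, illustrative).
FREQUENCY = re.compile(r"1H\s*NMR\s*\(\s*(\d+(?:\.\d+)?)\s*MHz\s*\)")

# A single peak, e.g. "1.20 (3H, t)": shift, integral and multiplicity.
PEAK = re.compile(r"(\d+\.\d+)\s*\(\s*(\d+)H\s*,\s*([a-z]+)\s*\)")


def parse_hnmr(text):
    """Return the frequency and peak list found in text, or None if absent."""
    freq = FREQUENCY.search(text)
    if freq is None:
        return None
    # Only look for peaks after the end of the frequency declaration.
    peaks = [
        {"shift": float(shift), "integral": int(count), "peaktype": kind}
        for shift, count, kind in PEAK.findall(text, freq.end())
    ]
    return {"frequency_mhz": float(freq.group(1)), "peaks": peaks}


sample = ("Recorded spectrum of ethyl acetate: "
          "1H NMR(400MHz) 1.20 (3H, t), 1.97 (3H, s), 4.10 (2H, q)")
spectrum = parse_hnmr(sample)
```

The sketch only succeeds because authors report such data in a highly regular form; text that departs from the expected layout is simply not matched.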
This functionality was the origin of the OSCAR project, which began life as the Experimental Data Checker – a tool to allow authors of papers to be published in RSC journals to check that their experimental sections both conform to the journal’s style guidelines and contain no identifiable errors (56; 57). The functionality has been retained within the OSCAR3 project, though is not integrated with the name recognition and resolution functionality. As shown in Figure 2-5, the DataParser produces an inline annotation document of its own, in which recognised chunks of data are annotated but chemical names are not. For example, the annotations generated for the input text “Recorded spectrum of ethyl acetate: 1H NMR(400MHz) 1.20 (3H, t), 1.97 (3H, s), 4.10 (2H, q)” are shown in Figure 2-7.

<PAPER>
  <BODY>
    <DIV>
      <HEADER />
      <P>
        Recorded spectrum of ethyl acetate:
        <spectrum type="hnmr">
          1H NMR(
          <quantity type="frequency">
            <value> <point>400</point> </value>
            <units>MHz</units>
          </quantity>
          )
          <peaks type="..">
            <peak>
              <quantity type="shift">
                <value> <point>1.20</point> </value>
              </quantity>
              (
              <quantity type="integral">
                <value> <point>3</point> </value>
                <units>H</units>
              </quantity>
              ,
              <quantity type="peaktype">t</quantity>
              )
            </peak>
            ,
            <peak>
              <quantity type="shift">
                <value> <point>1.97</point> </value>
              </quantity>
              (
              <quantity type="integral">
                <value> <point>3</point> </value>
                <units>H</units>
              </quantity>
              ,
              <quantity type="peaktype">s</quantity>
              )
            </peak>
            ,
            <peak>
              <quantity type="shift">
                <value> <point>4.10</point> </value>
              </quantity>
              (
              <quantity type="integral">
                <value> <point>2</point> </value>
                <units>H</units>
              </quantity>
              <quantity type="peaktype">q</quantity>
              )
            </peak>
          </peaks>
        </spectrum>
      </P>
    </DIV>
  </BODY>
</PAPER>

Figure 2-7: OSCAR3 Data Annotations

It can be seen that the annotations that have been applied to the text by OSCAR3 have created a machine-understandable spectrum, in which the peak shifts, integrals and multiplicities have been individually marked up,
allowing for a machine that recognises the format of the OSCAR3 data annotations to read the spectrum. A converter exists within the JUMBOConverters (45) project to transform the output thus produced by OSCAR3 for NMR spectra into Chemical Markup Language, facilitating the easy reuse of the data.

The regular expressions that form the core of this module are, for ease of comprehension and reuse, formatted in XML and implemented as a cascading set in which regular expressions are referred to by ids and can embed further regular expressions. For example, the top-level regular expression for matching a 1H NMR spectrum is as follows:

<node type="spectrum" id="hnmr" value="hnmr">
  <regexp parsegroup="0">
    <insert idref="nmrDelta" />? \b
    <insert idref="hNmr.Prolog"/>
    (?: \W* for\s+\w+ (?: (![\(\);]).)*?)?
    <insert idref="nmrMethod"/>?
    (\W+(<insert idref="nmrDelta"/>|H)+\b)?
    [\s:=]+?
    (?: \W*ppm\W*?)?
    (?: peaks\s+at\s+)?
    (?:\s*<insert idref="nmrDelta"/>\s+)?
    <insert idref="nmrPeakBlock"/>
  </regexp>
  <child type="quantity" id="hnmrSolvent"/>
  <child type="quantity" id="hnmrStandard"/>
  <child type="quantity" id="hnmrFrequency"/>
  <child type="quantity" id="hnmrTemperature"/>
  <child type="peaks" id="hnmr"/>
</node>

Figure 2-8: 1H NMR regular expression

The node element denotes a concept to be annotated, the value of the type attribute gives the name of the tag to be assigned to the annotation and the id attribute gives the value of the type attribute to set on the annotation, as can be seen above in the annotated 1H NMR – in this case, <spectrum type="hnmr">. Other regular expressions are inserted into the top-level regex using the insert elements, and important subsections of the matching text that are also to be annotated are defined using the child tag – where the regular expressions defined for these children are matched then the annotations produced by OSCAR3 will have children accordingly.
For illustration, note that the frequency and peaks in the annotated spectrum above are children of the spectrum element, as defined by the third and fifth children of the top-level 1H NMR regular expression given above.

2.2.7 ChemicalTagger

ChemicalTagger (14) is a tool, developed within the Unilever Centre, for the syntactic analysis of chemical documents. By combining the capacity of OSCAR3 for the identification of chemical named entities with common linguistic tools it is possible to identify the grammatical structure of chemical text and then use this information to infer the meaning. This is achieved using the Natural Language Processing techniques of tagging (the process of assigning tags to denote the meaning of individual tokens) and chunking (the process of building the grammatical structure according to a set of production rules). The functioning of ChemicalTagger is summarised in Figure 2-9.

Figure 2-9: ChemicalTagger Architecture

ChemicalTagger accepts plain text as its input format. This text is then passed separately through each of three taggers to create three separate tag sets, which are then combined using heuristics to ascertain priority of the tags from each set. ANTLR (Another Tool for Language Recognition) (58) is then used to parse this single tag set according to a predefined formal grammar to determine its structure in a process known as chunking, and the result is produced as an Abstract Syntax Tree (AST), which is subsequently converted to XML and returned to the user.

2.2.7.1 Tagging

Tagging is the process of selecting an appropriate label to denote the meaning of each token in a sequence. In the ChemicalTagger workflow, the input text is tokenised on whitespace before three separate taggers are applied to create three separate tag sets. The three taggers are employed as they categorise the token based on different criteria, as described below.
OpenNLP Tagger

The OpenNLPTagger class passes the input text to the part-of-speech (POS) tagger provided by OpenNLP (59). This implementation is based on a MEMM and is trained with data from the Wall Street Journal and the Brown corpus. The POS tags that are applied to the text describe the function of the word within the sentence, e.g. noun or verb. These broad categories are subdivided into specific classes e.g. singular proper noun or past participle. The POS tags provide much of the key information that is required in order to identify the grammatical structure of the text. For example, the input string “The cat sat on the mat” is tagged as follows:

The   cat   sat   on   the   mat
DT    NN    VBD   IN   DT    NN

where the POS tags are those from the Penn Treebank tagset (60), and have the following meaning:

DT – determiner
NN – noun, singular or mass
VBD – verb, past tense
IN – preposition/subordinating conjunction

OSCAR3 Tagger

The OscarTagger class passes the text to OSCAR3 to be annotated, using the default parameters. Tokens that are annotated as named entities by OSCAR3 are tagged as being of the type assigned to them by OSCAR3. Tokens that occur as part of a multiple token named entity e.g. “ethyl acetate” are further tagged with their position in that named entity, i.e. start, middle or end, to enable simpler reconstruction later in the ChemicalTagger workflow.

Regex Tagger

The RegexTagger class applies a series of regular expressions to each token – those that match a regular expression are given the tag associated with that regular expression. For example, the token “mol” and variations thereon are given the tag “NN-AMOUNT” and are matched by the regular expression:

(\d\d+)?(m|k)?mol(s)?$

By tagging the token “mol” in this way, it is possible to recognise a molar amount in a chemical text as a number followed by a token of type NN-AMOUNT. Such multi-token entities are defined in a set of formal production rules and are recognised during the chunking process.
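The behaviour of this expression can be explored with a short Python sketch. The “mol” pattern is the one quoted above; the NUMBER pattern, the use of “CD” as the tag for a bare number, and the functions regex_tag and find_amounts are assumptions made for this example and are not taken from the ChemicalTagger source.

```python
import re

# The "mol" pattern quoted in the text, applied from the start of each token.
NN_AMOUNT = re.compile(r"(\d\d+)?(m|k)?mol(s)?$")

# Hypothetical stand-in for number recognition, e.g. "12", "0.14" or ".96".
NUMBER = re.compile(r"\d*\.?\d+$")


def regex_tag(token):
    """Return the tag of the first matching pattern, or None if none match."""
    if NN_AMOUNT.match(token):
        return "NN-AMOUNT"
    if NUMBER.match(token):
        return "CD"
    return None


def find_amounts(text):
    """Find (number, unit) pairs: a CD token followed by an NN-AMOUNT token."""
    tokens = text.split()  # tokenise on whitespace
    tags = [regex_tag(t) for t in tokens]
    return [
        (tokens[i], tokens[i + 1])
        for i in range(len(tokens) - 1)
        if tags[i] == "CD" and tags[i + 1] == "NN-AMOUNT"
    ]


amounts = find_amounts("To the flask was added .96 mmol of the thiol")
```

Note that “molecule” is not tagged, because the trailing “$” requires the token to end immediately after “mol” or “mols”. Tokens that carry adjoining punctuation, such as “mmol)”, would also be missed by this simplified whitespace tokenisation.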
2.2.7.2 Combination of Tag Sets

The combination of the three separate tag lists into one combined set takes place in two stages. Initially, a simple rule of priority is used – if the token has received a tag from OSCAR3, other than ONT, then that tag takes priority. Otherwise, tags assigned by the regex tagger take priority over those from the OpenNLP tagger. A second step of processing then corrects a number of common errors. For example, the token “M” will be annotated as CM by OSCAR3, being a common abbreviation for “metal”. However, if the preceding token is a number, as in “2 M”, then it is likely that it is being used as an abbreviation for “molar”. This error is corrected by reassigning the tag as “NN-MOLAR”. Such error correction serves to allow correct chunking in the next stage of the ChemicalTagger workflow.

2.2.7.3 Chunking – Determination of Grammatical Structure

Using a pre-defined regular grammar, ChemicalTagger is able to determine the grammatical structure of a given input text. In this grammar, the format of the expected language is defined using a series of production rules that reference one another and the tags assigned to the input tokens. For example, the quantity of a chemical compound as commonly reported in the literature, such as “0.14 mol, 2.89 g”, is specified as follows:

nnamount:'NN-AMOUNT' TOKEN;

meaning that an “nnamount” is a token that has been tagged as “NN-AMOUNT” in the combined tag set.

amount : cd nnamount -> ^(NODE["AMOUNT"] cd nnamount );

meaning that an “amount” is composed of a cardinal number (“cd”) followed by an “nnamount”. The part of the rule following the symbol “->” is an instruction to group the matched content into a single “AMOUNT” entity.

measurementtypes : molar|amount|mass|percent|volume ;

defining the five types of common measurements.

measurements :(cd nn)? measurementtypes dt?;
quantity2 : measurements (comma measurements)* ;
quantity : (quantity1|quantity2) -> ^(NODE["QUANTITY"] quantity1? quantity2?);

meaning that a “quantity” may be formed from one or more “measurementtypes” in a comma separated list. Similar sets of rules define grammatical structures such as noun phrases and verb phrases. As a result, the input text “Into a pressure reactor was placed 4-(dimethylamino)-benzenethiol (.147 g, .96 mmol).” produces the output shown in Figure 2-10, once converted into XML.

<Document>
  <PrepPhrase>
    <IN-INTO>Into</IN-INTO>
    <NounPhrase>
      <DT>a</DT>
      <APPARATUS>
        <NN-PRESSURE>pressure</NN-PRESSURE>
        <NN-APPARATUS>reactor</NN-APPARATUS>
      </APPARATUS>
    </NounPhrase>
  </PrepPhrase>
  <VerbPhrase>
    <VBD>was</VBD>
    <VB-SUSPEND>placed</VB-SUSPEND>
  </VerbPhrase>
  <NounPhrase>
    <MOLECULE>
      <OSCAR-CM>4-(dimethylamino)-benzenethiol</OSCAR-CM>
      <QUANTITY>
        <_-LRB->(</_-LRB->
        <MASS>
          <CD>.147</CD>
          <NN-MASS>g</NN-MASS>
        </MASS>
        <COMMA>,</COMMA>
        <AMOUNT>
          <CD>.96</CD>
          <NN-AMOUNT>mmol</NN-AMOUNT>
        </AMOUNT>
        <_-RRB->)</_-RRB->
      </QUANTITY>
    </MOLECULE>
  </NounPhrase>
  <STOP>.</STOP>
</Document>

Figure 2-10: Sample ChemicalTagger output

Once the meaning of chemical text has been determined and formally encoded in this manner, it is possible for a machine to automatically interpret the meaning of the input text. In the example above, by using XPath it is possible to identify the presence and amount used of 4-(dimethylamino)-benzenethiol in the reaction from which the input text was taken. In this way, ChemicalTagger may be used as the foundation of a system for the semanticizing of texts that report chemical syntheses. In the current work, ChemicalTagger revision 321256e8b429 (dated 2010-06-10) is used throughout.

2.2.8 OSRA

In spite of the proliferation of machine-understandable chemical formats, much structural information is still communicated in image formats.
To the document’s author and primary readership, this method is not without its advantages – in many situations, a chemical structure may be communicated to a trained chemist better and faster by using a structure diagram than by a systematic name. Indeed, when describing a chemical reaction the use of a reaction diagram greatly clarifies matters as compared to describing it in text. To a human, the reaction scheme shown in Figure 2-11 is preferable to the phrase “Isopropyl 3-(hydroxymethyl)pyridine-2-carboxylate plus methyl N-[(4-methylphenyl)sulfonyl]glycinate plus diethyl azodicarboxylate plus triphenylphosphine gives 3-{[Methoxycarbonylmethyl-(toluene-4-sulfonyl)-amino]-methyl}-pyridine-2-carboxylic acid isopropyl ester”. The use of image formats to communicate structural information is thus understandable, though in the context of this work unhelpful.

Figure 2-11: Example reaction scheme

The resolution of structure diagrams to connection tables is a simple task for a human; however, to a machine the images do not directly contain any structural information. The creation of software for this task has been attempted by a number of groups, resulting in a number of applications including Kekulé (61), CLiDE (62; 63; 64), OSRA (65; 66; 67), chemOCR and ChemReader (68). This research dates back to the 1990s, and has in the last few years been undergoing a renaissance – in particular the development of the Optical Structure Recognition Application (OSRA), the first open-source application for the interpretation of chemical structure diagrams. The published systems follow similar methodology in the analysis of input images, employing modules to carry out the following tasks:

• Optical Character Recognition (OCR) – the identification of text within the image, which is typically used to denote atoms other than carbon.
• Vectorisation – the process of reducing the width of lines and transforming the raster (bitmap) image to a vector format.
• Bond detection – the identification of which graphical elements constitute bonds, and which types of bonds are represented e.g. wedge bonds, double bonds, and aromatic bonds.
• Connection table compilation – the process of turning the information gathered in the previous steps into a connection table.

Comparison of the currently available applications was performed by Park et al. (68), who concluded that on a corpus of 362 images ChemReader performed best, followed by OSRA, Kekulé and CLiDE. This result should be treated with caution, however, since new versions of both OSRA and CLiDE have been released since the work was performed – and it should be noted that Park et al. are the authors of ChemReader. Indeed, it is only due to this fact that they could compare the performance of ChemReader, since their software is not available externally pending the removal of open-source components that prevent the release of a commercial version. Since OSRA is, at present, the only open-source solution for chemical image interpretation it was selected for the current work, as the use of one of the alternative pieces of software would preclude the release of the software created during the preparation of this thesis as open-source.

2.3 Conclusions

This chapter discussed the current situation of chemical document availability and the leading technologies that have been brought to bear on the problems of information extraction from the literature and the reuse of such information – two areas of critical importance for the current work. The applications of these technologies and the information that can be automatically recovered from the literature are subsequently described in chapters 4 and 5.

3. Representation and Manipulation of Markush Structures

To be effective, a chemical patent must not only claim protection over those compounds that the authors have prepared and demonstrated to be of value, but also prevent competitors from introducing trivial modifications and thereby circumventing the protection that the patent grants. To achieve this goal, chemical patents describe the subjects of the invention in terms of generic structures – known as Markush structures after Eugene Markush, the first inventor to employ them – in which a number of related chemical species are simultaneously described by a shared scaffold with a number of variable features. For example, in a trivial case the set of three monochlorinated toluenes may be described by the generic structure shown in Figure 3-1.

Figure 3-1: Generic structure representing the monochlorinated toluenes

The Markush structures employed in patents are far more complex than that depicted in Figure 3-1, featuring multiple independently variable features to simultaneously describe hundreds of thousands of specific chemical compounds or more that could not feasibly be enumerated within the document. Since the Markush structures describe the area of chemical space for which a patent claims protection, the information they contain is of crucial importance. For that reason, information systems that deal with chemical patents require the capability to represent and handle the Markush structures that the patents contain.

The 1980s saw the development of a number of systems for the representation of Markush structures including the two main commercial systems (69), Markush DARC and MARPAT (70; 71). In addition, a large project at the University of Sheffield led to the development of the GENSAL language (72; 73) which has since fallen into disuse. The current state of the art is sufficient to allow the creation of Markush database searching tools such as STN, but incompatible with the semantic web of chemistry.
Since CML does not permit the description of Markush structures, it was a goal of the current work to explore how the existing technology of Polymer Markup Language (PML) could be employed to create this capability. The work presented in this chapter is therefore intended to serve as a proof of principle rather than as a fully-functional system that is ready for a large-scale implementation.

3.1 Markush Structures

The manner in which Markush structures are used to describe areas of chemical space is consistent across the patent literature. Structural diagrams are used to illustrate the common scaffolds and how the variable features interact with them. Pseudoatoms are used to indicate the presence of complex features, while accompanying text describes the chemical identity of these groups. The classes of variability used in Markush structures are described by Barnard and Downs (74) as follows:

• Substituent variation, where a group represents one unit chosen from an explicitly enumerated list of possibilities, e.g. “R1 is fluorine or chlorine”.
• Homology variation, where a group represents one unit chosen from an implied list of possibilities by use of a class name, e.g. “R1 is alkyl” or “R1 is a halogen”.
• Frequency variation, where the number of repetitions of a motif is allowed to vary, e.g. “R1 is -(CH2)nOH, where n is 0 to 12”.
• Position variation, where the position to which a substituent is bonded is not fixed, e.g. the case of the chlorinated toluenes seen in Figure 3-1.

It should be appreciated, however, that these four classes of variation are conceptually rather than fundamentally different from one another since the sets of substituents which match a given example of homology, frequency or position variation are in principle enumerable and could therefore be expressed in terms of substituent variation.
Of course, such an enumeration is not always practical – the problem of combinatorial explosion renders the enumeration of long-chain branched alkyl groups unworkable, for example. The usage of the other forms of variation where appropriate is also of value to the creation of an understandable Markush structure if these forms are a true representation of the variation which the author intends to describe. 3.2 Polymer Markup Language Earlier work in the Unilever Centre for Molecular Science Informatics has resulted in the development of Polymer Markup Language (PML), a CML vocabulary for the fragment-based description of polymers (75). In PML, polymer structures are described as assemblages of fragment elements. A fragment acts as a container for further fragment elements and for molecule elements. These molecule elements define atomistic information and represent molecular substructures in that one or more of the atoms in the molecule is a pseudoatom, intended to denote the position of a free valency. These free valencies are referenced by join elements, which describe the molecular connectivity between the substructures. The usage of PML is further discussed with reference to specific examples in sections 3.2.1, 3.2.2 and 3.2.3. 3.2.1 Representation of Polyethylene Oxide As a simple example of the usage of PML, the representation of a pentamer of polyethylene oxide is shown in Figure 3-2. 
 1  <fragment xmlns:g="http://www.xml-cml.org/mols/geom1"
 2            xmlns="http://www.xml-cml.org/schema">
 3    <fragment countExpression="*(5)">
 4      <join order="1" moleculeRefs2="PREVIOUS NEXT"
 5            atomRefs2="r2 r1">
 6        <torsion>180</torsion>
 7      </join>
 8      <fragment>
 9        <fragment>
10          <molecule ref="g:o"/>
11        </fragment>
12        <join id="j1" order="1" moleculeRefs2="PREVIOUS NEXT"
13              atomRefs2="r2 r1">
14          <torsion>180</torsion>
15        </join>
16        <fragment>
17          <molecule ref="g:ch2"/>
18        </fragment>
19        <join id="j2" order="1" moleculeRefs2="PREVIOUS NEXT"
20              atomRefs2="r2 r1">
21          <torsion>180</torsion>
22        </join>
23        <fragment>
24          <molecule ref="g:ch2"/>
25        </fragment>
26      </fragment>
27    </fragment>
28  </fragment>

Figure 3-2: PML representation of poly(ethylene oxide)

The repeat unit of poly(ethylene oxide) (PEO) is specified by the fragment on line 8 of Figure 3-2. This fragment contains three further fragment elements, on lines 9, 16 and 23. Each of these three fragment elements contains a molecule element. The atomistic representations (connection tables) for the subunits are contained within separate resources, and are referred to using namespaced identifiers. An example of one of these substructures is given in Figure 3-3.

 1  <molecule xmlns="http://www.xml-cml.org/schema" id="o">
 2    <atomArray>
 3      <atom id="a1" elementType="O" x3="-0.000000"
 4            y3="-50.065000" z3="0.000000"/>
 5      <atom id="r1" elementType="R" x3="-0.776000"
 6            y3="0.512000" z3="-0.000000"/>
 7      <atom id="r2" elementType="R" x3="0.776000" y3="0.512000"
 8            z3="-0.000000"/>
 9    </atomArray>
10    <bondArray>
11      <bond atomRefs2="a1 r1" order="1"/>
12      <bond atomRefs2="a1 r2" order="1"/>
13    </bondArray>
14  </molecule>

Figure 3-3: Atomistic representation for the g:o fragment

This molecule contains three atoms – an oxygen atom which is bonded to two pseudoatoms of elementType R.
Similarly, the other fragment used in the description of polyethylene oxide, the methylene fragment, may be found to consist of a -CH2- unit, the carbon atom of which is connected to two of these pseudoatoms. The role of these pseudoatoms, referred to as R-groups, is to indicate the free valencies present in a fragment through which joins may be used to connect it to further fragments and thereby to construct a macromolecule.

The three molecule elements used in the PML document shown in Figure 3-2 are connected by the two join elements on lines 12 and 19. Each of these joins specifies a bond order, as well as a moleculeRefs2 attribute and an atomRefs2 attribute. The value of moleculeRefs2 used here, PREVIOUS NEXT, indicates that the fragments should be connected sequentially as opposed to as a side-chain. The atomRefs2 attribute indicates which of the pseudoatoms in the individual fragments should be used in the construction of the bond. The value used in this example, r2 r1, indicates that the dummy atoms to use are the one with an id of r2 in the PREVIOUS fragment and the one with an id of r1 in the NEXT fragment. These dummy atoms are to be deleted and new bonds between the fragments created as shown in Figure 3-4. The join elements may contain a child torsion element, which defines an appropriate torsion angle for the new bond.

Figure 3-4: The creation of bonds between fragments in PML

The fragment on line 8 is itself contained by the fragment on line 3. The fragment in this position – a child of the root element – defines the constitutional repeating unit of the polymer and is hereafter referred to as the primary fragment. The optional countExpression attribute on the primary fragment specifies the number of repeat units present, in this case five, while the join on line 4 specifies how the repeat units are connected to one another using the syntax covered previously.
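The join operation described above can be illustrated with a short Python sketch. It is not the JUMBO FragmentTool implementation: the dictionary-based fragment model, the `p_`/`n_` id prefixes and the function name `join_fragments` are assumptions made purely for this example, and 3D coordinates, torsion angles and implicit hydrogens are ignored.

```python
def join_fragments(prev, nxt, prev_r, next_r, order=1):
    """Join two fragments by consuming one R pseudoatom on each side.

    Each fragment is a dict with 'atoms' (id -> element symbol) and
    'bonds' (list of (id, id, order)). This data model is hypothetical.
    """
    def neighbour(frag, r_id):
        # The real atom carrying the free valency is the one bonded to R.
        for a, b, _ in frag["bonds"]:
            if a == r_id:
                return b
            if b == r_id:
                return a
        raise ValueError(f"pseudoatom {r_id} is not bonded")

    def strip(frag, r_id, prefix):
        # Delete the pseudoatom and its bond; prefix ids to keep them unique.
        atoms = {prefix + i: e for i, e in frag["atoms"].items() if i != r_id}
        bonds = [(prefix + a, prefix + b, o)
                 for a, b, o in frag["bonds"] if r_id not in (a, b)]
        return atoms, bonds

    left = "p_" + neighbour(prev, prev_r)
    right = "n_" + neighbour(nxt, next_r)
    p_atoms, p_bonds = strip(prev, prev_r, "p_")
    n_atoms, n_bonds = strip(nxt, next_r, "n_")
    # The new bond replaces the two deleted pseudoatoms.
    return {"atoms": {**p_atoms, **n_atoms},
            "bonds": p_bonds + n_bonds + [(left, right, order)]}


# Simplified stand-ins for the g:o and g:ch2 fragments (heavy atoms only).
oxygen = {"atoms": {"a1": "O", "r1": "R", "r2": "R"},
          "bonds": [("a1", "r1", 1), ("a1", "r2", 1)]}
methylene = {"atoms": {"a1": "C", "r1": "R", "r2": "R"},
             "bonds": [("a1", "r1", 1), ("a1", "r2", 1)]}

# atomRefs2="r2 r1": use r2 on the PREVIOUS fragment and r1 on the NEXT one.
joined = join_fragments(oxygen, methylene, "r2", "r1")
```

The joined fragment retains one free valency at each end (r1 on the oxygen, r2 on the methylene carbon), so it can take part in further joins in the same way.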
3.2.2 Representation of Polystyrene

The example of polyethylene oxide discussed above is simple in that the macromolecule may be described as a continuous chain with no side chains. This topology is by no means ubiquitous, and consequently PML supports the description of branched polymers such as comb polymers and dendrimers. The methods which permit the description of branched polymers may also be employed to describe polymers with a single backbone but which also carry side chains. This phenomenon is illustrated in Figure 3-5.

 1  <fragment xmlns:g="http://www.xml-cml.org/mols/geom1"
 2            xmlns="http://www.xml-cml.org/schema">
 3    <fragment countExpression="*(5)">
 4      <join order="1" moleculeRefs2="PREVIOUS NEXT"
 5            atomRefs2="r2 r1">
 6      </join>
 7      <fragment>
 8        <fragment>
 9          <molecule ref="g:ch">
10            <join order="1" moleculeRefs2="PARENT CHILD"
11                  atomRefs2="r3 r1">
12              <fragment>
13                <molecule ref="g:benzene"/>
14              </fragment>
15            </join>
16          </molecule>
17        </fragment>
18        <join order="1" moleculeRefs2="PREVIOUS NEXT"
19              atomRefs2="r2 r1">
20        </join>
21        <fragment>
22          <molecule ref="g:ch2"/>
23        </fragment>
24      </fragment>
25    </fragment>
26  </fragment>

Figure 3-5: PML representation of polystyrene

It can be seen that the usage of the join element on line 10 of the PML document in Figure 3-5 differs from that seen previously. Here, the moleculeRefs2 attribute takes the value PARENT CHILD, which indicates the presence of a side chain. The fragment on line 12 that contains the molecular subunit corresponding to the side chain is present as a child of the join element that connects it to the polymer backbone, while the join element is itself a child of the molecule on line 9 where previously the three elements would have been siblings.
This format provides a convenient means to represent branching in a macromolecule, as it permits an arbitrary number of branches from the same position, and the branching in the XML document directly matches that in the macromolecule described, creating a comprehensible document.

3.2.3 Representing Variability in PML

The examples of polyethylene oxide and polystyrene illustrate how PML may be used to represent a macromolecule with a precisely defined structure; however, it is rare for the composition of a polymer to be precisely known. Polymerisation processes can be difficult to control and resultant samples of polymers are polydisperse, i.e. composed of macromolecules of varying chain length, while specific mechanisms of polymerisation may lead to variable degrees of branching or cause asymmetric monomers such as propylene to be incorporated in a head-to-tail fashion. This variability of polymer structure is further demonstrated by the cases of random and statistical copolymers, in which the positioning of the monomeric units is determined by a stochastic process. In order to fully support these more complex cases, PML must permit these forms of variability to be fully described. This support is provided by describing some of the fragment elements in the document as lists of fragments from which one is to be selected. The usage is demonstrated in Figure 3-6, which describes a PML representation of a statistical copolymer of ethylene oxide and propylene oxide.
 1  <fragment xmlns='http://www.xml-cml.org/schema'
 2            xmlns:g='http://www.xml-cml.org/mols/geom1'>
 3    <fragmentList>
 4      <fragment id='eo'>
 5        <molecule ref='g:ethyleneOxide' />
 6      </fragment>
 7      <fragment id='po'>
 8        <molecule ref='g:propyleneOxide' />
 9      </fragment>
10      <fragment id='eE'>
11        <fragment ref='eo' />
12        <join atomRefs2='r2 r1' moleculeRefs2='PREVIOUS NEXT' />
13        <fragment ref='EE' />
14      </fragment>
15      <fragment id='eP'>
16        <fragment ref='eo' />
17        <join atomRefs2='r2 r1' moleculeRefs2='PREVIOUS NEXT' />
18        <fragment ref='PP' />
19      </fragment>
20      <fragment id='EE'>
21        <fragmentList role='markushMixture'>
22          <fragment ref='eo'>
23            <scalar dictRef='cml:ratio' dataType='xsd:double'>
24              0.01
25            </scalar>
26          </fragment>
27          <fragment ref='eE'>
28            <scalar dictRef='cml:ratio' dataType='xsd:double'>
29              0.84
30            </scalar>
31          </fragment>
32          <fragment ref='eP'>
33            <scalar dictRef='cml:ratio' dataType='xsd:double'>
34              0.15
35            </scalar>
36          </fragment>
37        </fragmentList>
38      </fragment>
39      <fragment id='pE'>
40        <fragment ref='po' />
41        <join atomRefs2='r2 r1' moleculeRefs2='PREVIOUS NEXT' />
42        <fragment ref='EE' />
43      </fragment>
44      <fragment id='pP'>
45        <fragment ref='po' />
46        <join atomRefs2='r2 r1' moleculeRefs2='PREVIOUS NEXT' />
47        <fragment ref='PP' />
48      </fragment>
49      <fragment id='PP'>
50        <fragmentList role='markushMixture'>
51          <fragment ref='po'>
52            <scalar dictRef='cml:ratio' dataType='xsd:double'>
53              0.01
54            </scalar>
55          </fragment>
56          <fragment ref='pP'>
56            <scalar dictRef='cml:ratio' dataType='xsd:double'>
57              0.84
58            </scalar>
59          </fragment>
60          <fragment ref='pE'>
61            <scalar dictRef='cml:ratio' dataType='xsd:double'>
62              0.15
63            </scalar>
64          </fragment>
65        </fragmentList>
66      </fragment>
67    </fragmentList>
68
69    <fragment id='f0'>
70      <fragment ref='EE' />
71    </fragment>
72  </fragment>

Figure 3-6: PML representation of a statistical copolymer

In the PML document shown in Figure 3-6, a number of the fragments are defined by means of reference, such as that on line
70. The content of these fragments, such as the monomeric units from which the polymer is constructed on lines 4 and 7, is then defined in the primary fragment list which is a child of the root element and is defined in the PML document above on line 3. The fragment elements that define the monomeric units are used in the construction of the eE, eP, pE and pP fragments on lines 10, 15, 39 and 44 respectively, each of which consists of an ethylene oxide (e) or propylene oxide (p) unit followed by a list from which to draw the next unit (EE or PP, defined on lines 20 and 49 respectively). These lists, called “Markush mixtures”, specify that the next unit to be used is either one of the eE, eP, pE and pP fragments, representing a continuation of the polymer chain, or a single ethylene oxide or propylene oxide unit, representing a termination condition. The probabilities with which each of the possible fragments is selected are specified in the scalar elements that are children of the fragment concerned, such as that on line 23. The primary fragment is defined on line 69 and points to a single EE element, indicating that the chains start with an ethylene oxide unit, though this could just as easily have been replaced by a further Markush mixture allowing the initial unit to be either ethylene or propylene oxide.

The capacity afforded by the markushMixture to represent a list of potential fragments allows a variety of the forms of variability exhibited by polymers to be described in PML. The head-to-head or tail-to-tail addition of propylene monomers, for example, could have been described in the example above by the definition of the po fragment as a markushMixture list consisting of a head-first and a tail-first propylene unit, while branched polymers may be described by the inclusion of a branching point in a markushMixture list that specifies the repeating units of the backbone of a polymer.
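The stochastic selection described for Figure 3-6 may be sketched as follows. This Python fragment is illustrative only and does not reproduce the FragmentTool: the `MIXTURES` and `FRAGMENTS` tables are a hand transcription of the ratios and references in Figure 3-6, and the function name `sample_chain` and the `max_units` safety limit are assumptions introduced for the example.

```python
import random

# Markush mixtures from Figure 3-6: mixture id -> (fragment id, ratio) pairs.
MIXTURES = {
    "EE": [("eo", 0.01), ("eE", 0.84), ("eP", 0.15)],
    "PP": [("po", 0.01), ("pP", 0.84), ("pE", 0.15)],
}

# Fragment id -> (monomer emitted, mixture to draw the next unit from).
# None marks a terminating unit.
FRAGMENTS = {
    "eo": ("EO", None), "po": ("PO", None),
    "eE": ("EO", "EE"), "eP": ("EO", "PP"),
    "pE": ("PO", "EE"), "pP": ("PO", "PP"),
}


def sample_chain(start="EE", rng=None, max_units=10_000):
    """Draw one chain of monomer labels exemplifying the copolymer."""
    rng = rng or random.Random()
    chain, mixture = [], start
    while mixture is not None and len(chain) < max_units:
        ids, weights = zip(*MIXTURES[mixture])
        choice = rng.choices(ids, weights=weights, k=1)[0]
        monomer, mixture = FRAGMENTS[choice]
        chain.append(monomer)
    return chain


chain = sample_chain(rng=random.Random(0))
```

Because the primary fragment refers to the EE mixture, every sampled chain begins with an ethylene oxide unit; with a termination ratio of 0.01 the chain length is geometrically distributed with a mean of about 100 units.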
This flexibility is key to the representation of complex sets of macromolecules and provides a means to describe the variability exhibited by Markush structures, as shall be discussed in section 3.3.

3.2.4 The Cambridge Polymer Builder

In order to demonstrate the usage of PML, a demo application was developed (76). The Cambridge Polymer Builder is a frontend application that permits a user to create atomistic CML macromolecules with 3D-coordinates from a restricted set of pre-generated fragments. The application runs as a web service and is accessed using a web browser. The front page of the Cambridge Polymer Builder is shown in Figure 3-7.

Figure 3-7: The front page of the Cambridge Polymer Builder

The front page of the application prompts the user to select a polymer topology from the list of linear homopolymer, block copolymer, alternating copolymer, random copolymer, comb polymer and dendrimer. When the user selects one of these topologies, he is presented with a form (such as that illustrated in Figure 3-8) into which he enters the required parameters, i.e. the identities of the repeat units and end groups to be used, the degrees of polymerisation and the torsion angles to be used around the bonds formed between repeat units.

Figure 3-8: Designing a polymer

The parameters selected by the user are used to populate a template PML document for the selected topology. The FragmentTool class from the JUMBO library is then used to construct either the atomistic CML macromolecule that is described by the PML document or an atomistic CML macromolecule that corresponds to the description given in the PML document if the input document defines a variable polymer. This process requires dereferencing the molecule elements that define the structural subunits of the polymer and combining them together in the manner described in the input document.
As the polymer chain is built, each structural fragment is automatically rotated to align it with the chain and to set the torsion angle about the new bond equal to that specified by the user, and positioned such that the atoms from either fragment are a distance apart equal to the sum of their covalent radii. Where appropriate, such as in the building of a random copolymer, individual structural subunits are selected at random from a markushMixture to create a single macromolecule that exemplifies the polymer described in the PML document. In addition, where possible, molecular properties are calculated using the van Krevelen group contribution method (77). The user is then presented with a page depicting these results, such as that shown in Figure 3-9, and from which the CML description of the macromolecule can be downloaded.

Figure 3-9: Results of polymer building

3.3 Extension of PML for Markush Structures

When commencing this work, it was thought that Polymer Markup Language would provide a suitable basis from which to develop a CML-based method for Markush structure representation. The capacity, using PML, to represent molecular substructures by the fragment element and to describe, using a markushMixture, a list of the allowed substituents at a given position provides at a basic level the functionality required to describe Markush structures. As discussed previously, Barnard’s four classes of variability within Markush structures may be reduced to substituent variation, which PML is equipped to handle, if the user is willing to fully enumerate the set of substituents that a patent author claims for a given position. This strategy, however, can lead to extremely verbose descriptions that are unlikely to be acceptable to a user.
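The scale of the enumeration problem for alkyl groups can be reproduced with a short calculation. The Python sketch below is illustrative only and is not part of the software described in this thesis. It counts constitutional isomers of alkyl radicals (stereoisomers are ignored) by treating each radical as a carbon atom that carries an unordered set of three smaller substituents, any of which may be hydrogen; the function name `alkyl_radical_counts` is an assumption made for the example.

```python
def alkyl_radical_counts(max_carbons):
    """Return a list a where a[n] is the number of alkyl radicals with n carbons.

    a[0] = 1 stands for hydrogen (the empty substituent). For n >= 1 the
    attachment carbon carries an unordered triple of smaller radicals whose
    carbon counts sum to n - 1; unordered triples are counted with
    Burnside's lemma over the six permutations of three positions.
    """
    a = [1]
    for n in range(1, max_carbons + 1):
        m = n - 1  # carbons to distribute over the three branches
        # Identity permutation: all ordered triples (i, j, k) with i+j+k = m.
        ordered = sum(a[i] * a[j] * a[m - i - j]
                      for i in range(m + 1) for j in range(m + 1 - i))
        # Three transpositions: two branches identical, one free.
        paired = sum(a[i] * a[(m - i) // 2]
                     for i in range(m + 1) if (m - i) % 2 == 0)
        # Two 3-cycles: all three branches identical.
        tripled = a[m // 3] if m % 3 == 0 else 0
        a.append((ordered + 3 * paired + 2 * tripled) // 6)
    return a


counts = alkyl_radical_counts(20)
```

Under these assumptions `counts[10]` is 507 and `counts[20]` is 5622109, which illustrates why substituents such as “alkyl” cannot reasonably be enumerated in full within a document.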
The number of monosubstituted alkane derivatives, and therefore also alkyl radicals, containing n carbon atoms exceeds 500 for n = 10 and 5,000,000 for n = 20 (78), while full enumeration of position variation in PML obfuscates the author’s intention. Consider, for example, the case of the monochlorinated toluenes shown in Figure 3-1; this simple Markush structure could be represented in PML as shown in Figure 3-10, in which the substituted benzene ring is defined three times – once each for the ortho-, meta- and para-substituted motif.

<fragment xmlns='http://www.xml-cml.org/schema'
          xmlns:g='http://www.xml-cml.org/mols/geom1'>
  <fragment id='f0'>
    <fragment ref='methyl' />
    <join atomRefs2='r1 r1' moleculeRefs2='PREVIOUS NEXT' />
    <fragment ref='chlorinatedBenzene' />
  </fragment>
  <fragmentList>
    <fragment id='methyl'>
      <molecule ref='g:ch3' />
    </fragment>
    <fragment id='chlorinatedBenzene'>
      <fragmentList role='markushMixture'>
        <fragment ref='orthoChlorinatedBenzene'>
          <scalar dictRef='cml:ratio' dataType='xsd:double'>
            0.4
          </scalar>
        </fragment>
        <fragment ref='metaChlorinatedBenzene'>
          <scalar dictRef='cml:ratio' dataType='xsd:double'>
            0.4
          </scalar>
        </fragment>
        <fragment ref='paraChlorinatedBenzene'>
          <scalar dictRef='cml:ratio' dataType='xsd:double'>
            0.2
          </scalar>
        </fragment>
      </fragmentList>
    </fragment>
    <fragment id='orthoChlorinatedBenzene'>
      <fragment>
        <molecule ref='g:benzene' />
      </fragment>
      <join order='1' moleculeRefs2='PREVIOUS NEXT' atomRefs2='r2 r1'>
      </join>
      <fragment>
        <molecule ref='g:cl' />
      </fragment>
    </fragment>
    <fragment id='metaChlorinatedBenzene'>
      <fragment>
        <molecule ref='g:benzene' />
      </fragment>
      <join order='1' moleculeRefs2='PREVIOUS NEXT' atomRefs2='r3 r1'>
      </join>
      <fragment>
        <molecule ref='g:cl' />
      </fragment>
    </fragment>
    <fragment id='paraChlorinatedBenzene'>
      <fragment>
        <molecule ref='g:benzene' />
      </fragment>
      <join order='1' moleculeRefs2='PREVIOUS NEXT' atomRefs2='r4 r1'>
      </join>
      <fragment>
        <molecule ref='g:cl' />
      </fragment>
    </fragment>
  </fragmentList>
</fragment>

Figure 3-10: PML representation of the monochlorinated toluenes

Instead of mandating full enumeration of acceptable substituents, it is preferable to introduce novel features into the language that permit a more natural and concise expression of the variability in a Markush structure. The remainder of this chapter discusses the solutions that were devised to assist in the description of Markush structures and that make up Extended Polymer Markup Language – EPML.

3.3.1 Frequency Variation

It has already been discussed how PML may be used to describe a specific number of repeat units using the countExpression attribute, e.g.

<fragment countExpression="*(5)">
  <join order="1" moleculeRefs2="PREVIOUS NEXT" atomRefs2="r2 r1">
  </join>
  <fragment> … </fragment>
</fragment>

In EPML, the countExpression is allowed to specify a range of permitted values using the format:

<fragment countExpression="range(2,5)">

This usage specifies that the fragment in question is permitted to be repeated any number of times within the specified (inclusive) range, thereby describing frequency variation.

3.3.2 Homology Variation

The phenomenon of homology variation, defined earlier as where “a group represents one unit chosen from an implied list of possibilities by use of a class name, e.g. ‘R1 is alkyl’ or ‘R1 is a halogen’”, may be viewed as a combination of two separate forms of variation – one in which the substituent is enumerable as a precise and well-understood list such as “R1 is a halogen”, and one in which the substituent is defined in terms of a precise or generic structural feature such as “R1 is an alkoxy group” or “R1 is a substituted or unsubstituted heteroaryl ring”. The description of such features is illustrated in Figure 3-11.
 1  <fragment xmlns='http://www.xml-cml.org/schema'
 2            xmlns:g='http://www.xml-cml.org/mols/geom1'>
 3    <fragment id='f0'>
 4      <molecule ref='g:benzene'>
 5        <join atomRefs2='r1 r1' moleculeRefs2='PARENT CHILD'>
 6          <fragment homology='halogen' />
 7        </join>
 8        <join atomRefs2='r2 r1' moleculeRefs2='PARENT CHILD'>
 9          <fragment template='alkoxy' branched='true'
10                    minC='1' maxC='4' />
11        </join>
12      </molecule>
13    </fragment>
14  </fragment>

Figure 3-11: Homology variation in EPML

This example shows a benzene ring (specified on line 4) substituted by a halogen (line 6) and in the ortho- position by an alkoxy group (lines 9-10) – the R-groups of the g:benzene fragment being numbered consecutively and contiguously around the ring from 1 to 6. The halogen and alkoxy substituents are specified using the homology and template attributes respectively, the values of which give the ids of, respectively, the list of permitted substituents and the CML molecule against which they should be resolved. The alkoxy group is further specified using the branched, minC and maxC attributes, which respectively define whether or not the carbon chain of the alkoxy group is permitted to be branched and the minimum and maximum number of carbon atoms in the alkoxy group, providing support for commonly-used restrictions. The CML molecule to which the alkoxy template resolves is shown in Figure 3-12.
 1  <molecule id='alkoxy' xmlns='http://www.xml-cml.org/schema'>
 2    <atomArray>
 3      <atom id='r1' elementType='R' />
 4      <atom id='a1' elementType='O' />
 5      <atom id='r2' elementType='Q' />
 6    </atomArray>
 7    <bondArray>
 8      <bond id="r1_a1" atomRefs2="r1 a1" order="1" />
 9      <bond id="a1_r2" atomRefs2="a1 r2" order="1" />
10    </bondArray>
11  </molecule>

Figure 3-12: Formal description of the alkoxy template

The CML molecule that defines the alkoxy template contains three atoms – an oxygen atom on line 4, a pseudoatom of type ‘R’ on line 3 that acts as a free valency by which the substituent may be connected to a parent structure and a pseudoatom of type ‘Q’ on line 5. This new pseudoatom indicates the presence of a carbon chain of variable length. The usage of such pseudoatoms allows the description of such generic substituents as alkyl, alkenyl and cycloalkyl groups, while coverage for more complex structure-based homology variation such as “R1 is a substituted or unsubstituted heteroaryl ring” is not provided.

Support for such terminology in state of the art commercial systems is patchy, with Markush DARC allowing the usage of 22 specific “superatoms”, each representing a specific class of substituent such as “aromatic carbocyclic system” and “fused heterocycle”, while MARPAT adopts a similar system of “generic groups” (79). The correct handling of such terminology is a highly complex problem, for which Welford et al. (80) propose the usage of a generative grammar to produce a system capable of recognising, for example, a 2-chloro-pyridin-3-yl substituent as being a substituted heteroaryl ring, though this work appears not to have been fully developed. Consequently, there is no available means to fully describe homology variation as it is used by patent authors, and the facilities provided by EPML represent an acceptable approach for such an experimental language.
3.3.3 Position Variation

Position variation, defined earlier as where “the position to which a substituent is bonded is not fixed”, is represented formally in EPML in much the same way as it is represented graphically. The mobile substituent is represented in the same way as it would be were it to have a fixed position in the structure, while the join that attaches it to the main structure has the connectivity information defined in a novel manner as illustrated in Figure 3-13, a representation of the set of monochlorinated toluenes.

 1  <fragment xmlns='http://www.xml-cml.org/schema'
 2            xmlns:g='http://www.xml-cml.org/mols/geom1'>
 3    <fragment id='f0'>
 4      <molecule ref='g:benzene'>
 5        <join atomRefs2='r1 r1' moleculeRefs2='PARENT CHILD'>
 6          <molecule ref='g:me' />
 7        </join>
 8        <join atomRef1Array='r2 r3 r4' atomRef2Array='r1'
 9              moleculeRefs2='PARENT CHILD'>
10          <molecule ref='g:cl' />
11        </join>
12      </molecule>
13    </fragment>
14  </fragment>

Figure 3-13: Position variation in EPML

As can be seen above, the join that connects the mobile substituent, specified on lines 8-9, does not carry the atomRefs2 attribute that would normally describe the connectivity between the two fragments. Instead, this information is specified in the atomRef1Array and the atomRef2Array attributes, which list the ids of the free valencies the join is permitted to occupy on the PREVIOUS/PARENT and NEXT/CHILD fragment respectively. If either of these attributes is missing, then it is assumed that the substituent is permitted to be attached to any of the free valencies on the appropriate fragment. For example, if the atomRef1Array were missing from the join on line 8 of the EPML document in Figure 3-13, it would be assumed that the chlorine group could be attached to any of the free valencies other than r1, which is occupied by the methyl substituent.
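The rule for resolving the permitted attachment points can be sketched in Python using the standard library XML parser. This is a minimal illustration rather than the EPML implementation: the function name `allowed_positions` is hypothetical, the list of free valencies on the benzene fragment is supplied by hand, and the filtering of already occupied valencies from an explicitly listed array is an assumption of the example.

```python
import xml.etree.ElementTree as ET

# The mobile join from Figure 3-13, and a variant lacking atomRef1Array.
explicit_join = ET.fromstring(
    "<join atomRef1Array='r2 r3 r4' atomRef2Array='r1' "
    "moleculeRefs2='PARENT CHILD'/>")
open_join = ET.fromstring(
    "<join atomRef2Array='r1' moleculeRefs2='PARENT CHILD'/>")


def allowed_positions(join, attribute, free_valencies, occupied=()):
    """List the free valencies a join may use on one of its fragments.

    If the array attribute is absent, every free valency of the fragment
    is a candidate; valencies already used by other joins are excluded.
    """
    listed = join.get(attribute)
    candidates = listed.split() if listed is not None else list(free_valencies)
    return [r for r in candidates if r not in occupied]


benzene_valencies = ["r1", "r2", "r3", "r4", "r5", "r6"]
explicit = allowed_positions(explicit_join, "atomRef1Array",
                             benzene_valencies, occupied={"r1"})
implicit = allowed_positions(open_join, "atomRef1Array",
                             benzene_valencies, occupied={"r1"})
```

Here `explicit` contains r2, r3 and r4, the positions named in the join, while `implicit` contains every position except r1, which carries the methyl group.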
This approach to position variation produces EPML documents that are far more concise than their explicitly-enumerated PML counterparts, which may be seen by comparing Figure 3-13 to the PML representation of the same Markush structure given in Figure 3-10.

3.3.4 Position and Count Variation

In addition to position variation, Markush structures sometimes employ position variation that is repeated a variable number of times, as illustrated in Figure 3-14. This Markush structure is subsequently described in EPML in Figure 3-15.

Figure 3-14: Markush structure employing simultaneous position and count variation

 1  <fragment xmlns='http://www.xml-cml.org/schema'
 2            xmlns:g='http://www.xml-cml.org/mols/geom1'>
 3    <fragment id='f0'>
 4      <molecule ref='g:me' />
 5      <join atomRefs2='r1 r1' moleculeRefs2='PREVIOUS NEXT' />
 6      <fragment ref='benzene'>
 7        <join countExpression='range(2,5)' atomRef2Array='r1'
 8              moleculeRefs2='PARENT CHILD'>
 9          <fragment>
10            <molecule ref='g:cl' />
11          </fragment>
12        </join>
13      </fragment>
14    </fragment>
15    <fragmentList>
16      <fragment id='benzene'>
17        <molecule ref='g:benzene' />
18      </fragment>
19    </fragmentList>
20  </fragment>

Figure 3-15: Simultaneous position and count variation in EPML

The EPML description of the polychlorinated toluene above differs from that of the monochlorinated toluene in Figure 3-13 in that the chlorine substituent is contained within a fragment element, on line 9, which is in turn contained within a join, on line 7, that carries a countExpression attribute. The value of this attribute in the example above specifies a range of between 2 and 5 – specifying the value of the variable n from Figure 3-14, i.e. the number of times the substitution unit is repeated.
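The two countExpression forms used in this chapter, the fixed count "*(5)" and the inclusive range "range(2,5)", can be interpreted with a few lines of Python. The sketch below is an assumption about how such values might be parsed and is not taken from the JUMBO or EPML code; the function names `parse_count_expression` and `choose_count` are hypothetical, and any other syntax is rejected.

```python
import random
import re


def parse_count_expression(expr):
    """Return the inclusive (low, high) repeat range for a countExpression."""
    fixed = re.fullmatch(r"\*\((\d+)\)", expr)
    if fixed:
        n = int(fixed.group(1))
        return n, n
    ranged = re.fullmatch(r"range\((\d+),\s*(\d+)\)", expr)
    if ranged:
        low, high = int(ranged.group(1)), int(ranged.group(2))
        if low > high:
            raise ValueError(f"empty range in countExpression: {expr!r}")
        return low, high
    raise ValueError(f"unsupported countExpression: {expr!r}")


def choose_count(expr, rng=None):
    """Pick one permitted repeat count, uniformly at random for ranges."""
    low, high = parse_count_expression(expr)
    return (rng or random).randint(low, high)


fixed_range = parse_count_expression("*(5)")
variable_range = parse_count_expression("range(2,5)")
n_chlorines = choose_count("range(2,5)", random.Random(1))
```

Selecting a single value in this way corresponds to producing one representative member of the set described by Figure 3-15, in which the chlorine substituent occurs between two and five times.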
3.3.5 Inline Connection Tables

The additional features of EPML as compared to PML assist a user in producing concise definitions of Markush structures, but the usage of structural units that are defined in separate documents is a slow process if the units have not previously been defined since a document author must then create atomistic CML representations of the missing units. In order to provide a user with a means to work around this problem, EPML permits the usage of inline connection tables to define the molecular substructures represented by fragment elements. The format in which these connection tables are specified is derived from SMILES, and differs from pure SMILES in two regards. Firstly, the free valencies of the fragment are specified as though they were atoms using the codes “R1”, “R2”, etc. Secondly, since PML lacks a means by which to cyclise a structure the inline connection tables are used as a means to incorporate variability in a cyclised unit. Two forms of variation may be described by an inline connection table – substituent variation and frequency variation. The usage of these features is illustrated in Figure 3-17, which describes the Markush structure shown in Figure 3-16.

Figure 3-16: Markush structure featuring variable cyclic unit

1  <fragment xmlns='http://www.xml-cml.org/schema'>
2    <fragment id='f0'>
3      <fragment smiles='R1C' />
4      <join atomRefs2='r1 r2' moleculeRefs2='PREVIOUS NEXT' />
5      <fragment smiles='R1c1c{O|N}cc1R2' />
6      <join atomRefs2='r1 r2' moleculeRefs2='PREVIOUS NEXT' />
7      <fragment smiles='R1[C[1-4]]' />
8    </fragment>
9  </fragment>

Figure 3-17: Inline connection tables in EPML

The fragment on line 3 of Figure 3-17, specified by the string “R1C”, represents a methyl group since “C” represents a carbon atom, “R1” represents a free valency and hydrogen atoms are assumed to fill unspecified positions in the SMILES language.
In the fragment on line 5, substituent variation is indicated by separating the permitted groups with the pipe character (“|”) and wrapping the list in braces (“{” and “}”). Thus, the substring “{O|N}” indicates the presence of either an oxygen or a nitrogen atom. The full string, “R1c1c{O|N}cc1R2”, therefore represents a 3,4-disubstituted pyrrole or furan, i.e. the ring from Figure 3-16. Finally, the fragment on line 7 uses frequency variation to define a carbon chain of between one and four units with the inline connection table “[C[1-4]]”, in which the inner square brackets define the permitted range of integers for the frequency variation, while the text before the inner square brackets defines the connection table for the repeated unit – in this case methylene, defined by the string “C”. While the usage of frequency variation in an inline connection table in this example could equally have been replaced by a standard markushMixture, the functionality is useful when describing rings of variable size as an alternative to enumeration.

3.4 Building Representative Examples of a Markush Structure

As discussed earlier, the FragmentTool provides the functionality to construct atomistic CML representations of macromolecules described by a PML document. It was desired to create a similar demonstration application for EPML, and it was decided that rather than attempt to re-implement much of the functionality of the FragmentTool, this application should instead operate by reducing an EPML document to a PML document which describes a single chemical structure, and from which an atomistic representation may be produced using the FragmentTool. This process requires the removal of all of the additional features introduced into EPML and described in section 3.3, and is carried out in a number of steps as follows:

1. Those fragment elements that carry a homology attribute, i.e. that describe homology variation, are selected using XPath.
For each such instance, the corresponding moleculeList is looked up and one of the substituents from this list is selected at random. The connection table for this substituent is copied from the moleculeList and the working document is modified by removing the homology attribute and inserting a reference to the selected substituent.

2. Instances of fragments that use inline connection tables are identified by XPath. Those that employ either substituent or frequency variation are selected from this list and the variation defined within each of the inline connection tables is fully enumerated to give the full set of inline connection tables represented by the original. The enumerated list is then used to create a markushMixture of the fragments represented by the set of enumerated inline connection tables, and this markushMixture replaces the original fragment in the working document.

3. Those join elements which carry countExpression attributes, i.e. those used for multiple position variation as discussed in section 3.3.4, are selected by XPath. The countExpression attribute is removed from the join, which is subsequently copied in position a number of times corresponding to that specified by the countExpression attribute. If the value of this attribute specified a range, the number of times to duplicate the join is selected at random from those integers specified by the range.

4. Those fragment elements which carry template attributes are identified using XPath. The connection tables for the templates to which they refer are dereferenced and used to create an inline connection table specifying a single substructure representative of the restrictions carried by the fragment in question, i.e. that matches the values of the maxC, minC and branched attributes. The template attribute is then replaced with this inline connection table in the working document.

5.
Those fragment elements employing inline connection tables, including those that have been generated during earlier stages of processing, are selected by XPath. For each of these fragments, the inline connection table is converted to a corresponding atomistic CML molecule. Since the FragmentTool does not generate 3D co-ordinates, instead requiring them to be provided as an input, they are generated and added to the CML. The inline connection tables are then replaced with references to the newly-created molecules.

The current implementation generates 3D co-ordinates using the CORINA software (81). Since CORINA is commercial software, interfacing with the software is achieved by connecting to a computer in the Unilever Centre on which it is installed. This method does not scale with the number of machines running the MarkushBuilder but is sufficient for demonstration purposes. A preferred method would involve using a distributable open-source solution such as the CDK (82; 83; 84) or OpenBabel but has not been implemented as part of the current work.

6. The markushMixture elements in the working document are selected using XPath. From each of the markushMixtures, a single fragment is selected at random. The markushMixture is then replaced in the working document with this fragment. Though this step is not necessary to produce a PML document that can be processed by the FragmentTool, it is necessary in order to produce a PML document that defines one and only one structure.

7. The working document is searched by iterative descent for join elements. If a join does not carry an atomRefs2 attribute, i.e. if the join represents position variation, then it is processed to assign a specific atomRefs2. Where the join carries an atomRef1Array or an atomRef2Array attribute, i.e. where the allowed attachment points on the PREVIOUS/PARENT and NEXT/CHILD fragment respectively have been explicitly stated, then one is selected at random from the list.
Where one or both of these lists is missing, a suitable attachment point is determined by examining the fragment concerned, determining a full list of the free valencies on that fragment and removing from this list those valencies that have previously been used by another join. Once this process has been carried out, one of the attachment points is selected at random and the join undergoing processing is assigned a specific atomRefs2 attribute.

Once these stages of processing have been carried out, the input EPML document has been reduced to a PML document that describes a single, specific chemical structure. This PML document is then passed to the FragmentTool in order to generate an atomistic CML representation of the specified structure.

The results of this process are illustrated below. A simple Markush structure is described in Figure 3-18 and specified in EPML in Figure 3-19. By applying the MarkushBuilder to this document, an atomistic CML molecule with 3D co-ordinates was produced, which is visualised in Jmol (85) and illustrated as a 2D structure in Figure 3-20.
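The enumeration of inline connection tables in step 2 of the reduction above can be illustrated with a short Python sketch. It is not the MarkushBuilder code: the function name enumerate_inline_table and the regular expression are hypothetical, and the sketch assumes that only the two forms of variation described in section 3.3.5 occur, namely substituent variation written as {A|B} and frequency variation written as [UNIT[m-n]]. The strings are treated purely as text; no SMILES parsing or chemical validation is attempted.

```python
import itertools
import re

# Hypothetical pattern covering the two forms of variation from section 3.3.5.
_VARIATION = re.compile(
    r"\{(?P<alts>[^{}]*)\}"                                   # e.g. {O|N}
    r"|\[(?P<unit>[^\[\]]+)\[(?P<low>\d+)-(?P<high>\d+)\]\]"  # e.g. [C[1-4]]
)


def enumerate_inline_table(table):
    """Return every plain connection table described by an inline table."""
    choices = []   # one list of alternatives per segment of the string
    position = 0
    for match in _VARIATION.finditer(table):
        if match.start() > position:
            # Literal text between two variation tokens has a single option.
            choices.append([table[position:match.start()]])
        if match.group("alts") is not None:
            # Substituent variation: one option per pipe-separated group.
            choices.append(match.group("alts").split("|"))
        else:
            # Frequency variation: repeat the unit for each permitted count.
            unit = match.group("unit")
            low, high = int(match.group("low")), int(match.group("high"))
            choices.append([unit * n for n in range(low, high + 1)])
        position = match.end()
    if position < len(table):
        choices.append([table[position:]])
    # The Cartesian product of the per-segment options is the full enumeration.
    return ["".join(parts) for parts in itertools.product(*choices)]


print(enumerate_inline_table("R1c1c{O|N}cc1R2"))
print(enumerate_inline_table("R1[C[1-4]]"))
```

Each string in the returned list would correspond to one member of the markushMixture that replaces the original fragment. Standard SMILES bracket atoms such as [NH] do not match the frequency pattern and are passed through unchanged.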
Figure 3-18: Example Markush structure

<fragment xmlns="http://www.xml-cml.org/schema"
    xmlns:g="http://www.xml-cml.org/mols/geom1">
  <fragment id='f0'>
    <fragment ref='benzene'>
      <join atomRef2Array='r1' moleculeRefs2='PARENT CHILD'>
        <fragment homology='halogen' />
      </join>
    </fragment>
    <join atomRefs2='r1 r2' moleculeRefs2='PREVIOUS NEXT' />
    <fragment ref='oxygenOrNitrogen' />
    <join atomRefs2='r1 r2' moleculeRefs2='PREVIOUS NEXT' />
    <fragment ref='carbonyl' />
    <join atomRefs2='r1 r2' moleculeRefs2='PREVIOUS NEXT' />
    <fragment smiles='R1C1C(R2)C(R3)[C[1-4]]C1'>
      <join atomRefs2='r1 r1' moleculeRefs2='PARENT CHILD'>
        <fragment ref='alkoxy' />
      </join>
      <join atomRefs2='r3 r1' moleculeRefs2='PARENT CHILD'>
        <fragment ref='alkoxy' />
      </join>
    </fragment>
  </fragment>
  <fragmentList>
    <fragment id='benzene'>
      <molecule ref='g:benzene' />
    </fragment>
    <fragment id='o'>
      <molecule ref='g:o' />
    </fragment>
    <fragment id='n'>
      <molecule ref='g:nsp2' />
    </fragment>
    <fragment id='oxygenOrNitrogen'>
      <fragmentList role='markushMixture'>
        <fragment ref='o' />
        <fragment ref='n' />
      </fragmentList>
    </fragment>
    <fragment id='carbonyl'>
      <molecule ref='g:carbonyl' />
    </fragment>
    <fragment id='alkoxy'>
      <fragment template='alkoxy' minC='1' maxC='4'
          branched='true' />
    </fragment>
  </fragmentList>
</fragment>

Figure 3-19: EPML representation of the example Markush structure

Figure 3-20: 3D (left) and 2D (right) views of a randomly-generated example compound

3.5 Substructure Searching of Markush Structures

Markush structures typically represent simultaneously a number of specific chemical structures with a shared substructure, and may therefore be considered as a superposition of the specific structures they represent.
Such superpositions may be visualised as in Figure 3-21, in which the elements that are conserved and those that vary across the set of specific compounds are shown in black and red respectively.

Figure 3-21: Superimposed structure representing the monochlorinated toluenes

Such Extended Connection Tables (ECTs) may be used for substructure searching, in which case the query may be asking one of two questions. If the query structure for the substructure search represents a complete chemical structure, then a hit signifies that the query structure is one of the specific compounds represented by the Markush structure defined by the ECT, while if the query structure is an incomplete chemical structure then a hit signifies that one or more of the underlying specific structures contains that substructure. Of course, such searching of ECTs must be subject to certain constraints to ensure that the correct solutions are reached – for example, no specific structure represented by the ECT above contains more than one chlorine atom and no carbon atom in the specific structures is pentavalent, in spite of such substructures being superficially present in the ECT.

The use of ECTs to represent Markush structures has been previously discussed (86) and strategies for their substructure searching developed. Previous work has described the atom-by-atom searching (87) of ECTs, as well as the use of bitscreens (88) and reduced chemical graphs (89) as methods of filtering by which to reduce the computer time required to calculate search results.

In order to demonstrate the possibility of performing substructure searching of Markush structures encoded in EPML, a system was implemented that first builds ECTs from EPML documents then employs the relaxation algorithm for atom-by-atom searching as described by von Scholley (87). The details of this system are subsequently discussed.
3.5.1 Implementing Extended Connection Tables

The implementation of Extended Connection Tables used in the current work was created from scratch for the purpose. ECTs are represented by the ECT class and modelled as graphs, comprising edges (bonds) and nodes (atoms and pseudoatoms). Nodes are represented by the abstract class Node, of which there are three implementations – AtomNode, TemplateNode and AlkyleneNode. The AtomNode class represents a specific atom, while the AlkyleneNode class represents an alkylene (e.g. -CH2-, -CH2CH2-, etc.) chain and the TemplateNode class represents the connection table of a template used to describe homology variation. Edges are represented by the Edge class, which records the connectivity of the graph and the order of the bonds which the edges represent.

3.5.1.1 Node

The abstract class Node exists to provide functionality that is required by all types of Node. Specifically, it keeps track of which edges contain the current node, which other nodes are ligands of the current node, which ECT the node belongs to, the id of the Node, the markushId of the Node (if any) and the maximum number of variable edges that may be simultaneously connected to the Node. This final property is necessary in the case of multiple position variation to enforce a limit on the number of substituents present at a single position.

3.5.1.2 AtomNode

The AtomNode class is intended to represent an atom of a specific type and is additionally used to represent the R-groups that indicate free valencies. In addition to the information stored by all nodes, an AtomNode records the element type of the node.

3.5.1.3 TemplateNode

The TemplateNode class represents a template as used to represent a class of substituents in homology variation and keeps track of the parameters used to specify the template, i.e. the minimum and maximum number of carbon atoms permitted.
The connection table of the template is stored as an internal ECT of the TemplateNode, and when a TemplateNode is added to an ECT the internal nodes are added to the parent ECT to simplify the representation of the connectivity of the ECT.

3.5.1.4 AlkyleneNode

The AlkyleneNode class represents an alkylene chain as used in homology templates. It holds atom nodes internally to represent the carbon and hydrogen atoms that comprise the chain and records the connectivity between the atoms. It stores the ligands of the first carbon in the chain separately from those of the final carbon in order to facilitate substructure searching.

3.5.1.5 Edge

The Edge class represents a bond between two nodes in the ECT. It keeps track of the order of the bond, of the nodes that the edge connects and of whether or not the edge is variable, i.e. if it is not present in all example compounds of the Markush structure represented by the ECT.

3.5.2 Building Extended Connection Tables

The capacity to construct an ECT representing a Markush structure described by an EPML document is provided by the EpmlParser class. This process is broken down into a number of steps which are subsequently described:

1. Those fragment elements in the source document that refer to substructures defined elsewhere in the document are modified to directly include this content. To achieve this, a copy is made of the referenced content which is then used to replace the reference.

2. Those fragment elements in the working document that carry countExpression attributes are selected by XPath. Those that specify a range, i.e. those that describe frequency variation, are expanded into a markushMixture in which the substructures corresponding to each of the permitted values in the range are represented separately.

3. Those fragment elements in the working document that employ inline connection tables are identified by XPath.
Those that employ variation within the inline connection tables are fully enumerated and a markushMixture is constructed that represents each of the permitted substructures as a separate inline connection table. This markushMixture then replaces the original fragment element in the working document.

4. Those fragment elements in the working document that employ inline connection tables are identified by XPath, including those generated in the previous step. The inline connection tables are built into CML molecules using the SMILESTool class from JUMBO, and these molecules are then appended as children to the original fragment elements.

5. Those fragment elements in the working document that carry homology attributes are selected by XPath. For each such fragment, the corresponding CML molecules are loaded from the homology dictionary and used to create a markushMixture that defines the permitted substructures, which is then appended to the original fragment element.

6. Those fragment elements in the working document that carry template attributes are selected by XPath. For each such fragment, the corresponding definitions from the template dictionary are loaded and these CML molecules are appended as children to the original fragment elements.

7. Those join elements in the working document that carry countExpression attributes, i.e. those that describe multiple substitution of a PARENT substructure, are selected by XPath. For each such join, the countExpression attribute is removed and a markushMixture is created in which each of the permitted counts is represented by a copy of the parent fragment of the join carrying the specified number of copies of the join. This markushMixture then replaces the parent fragment of the join to produce an enumerated list of the allowed substituted fragments.
For example, the fragment:

    <fragment ref="benzene">
      <join countExpression="range(1,2)">
        <fragment ref="methyl" />
      </join>
    </fragment>

would be converted to:

    <fragmentList role='markushMixture'>
      <fragment ref='benzene'>
        <join>
          <fragment ref='methyl' />
        </join>
      </fragment>
      <fragment ref='benzene'>
        <join>
          <fragment ref='methyl' />
        </join>
        <join>
          <fragment ref='methyl' />
        </join>
      </fragment>
    </fragmentList>

8. All fragment elements in the working document are selected using XPath. These elements are assigned unique id attributes, numbered from 0 to n. Molecule elements that are contained within markushMixtures are assigned a unique markushId attribute of the format “x_y” where x is the id attribute of the parent fragment of the markushMixture and y is a unique id number from 0 to n as before. These markushIds are vital to the correct searching of the ECTs as they are used to keep track of which nodes are not permitted to appear simultaneously in an example structure.

9. Those fragment elements with a countExpression attribute, i.e. those that indicate a repeated substructure such as -(CH2)n-, are selected by XPath and expanded by copying the content of the fragment the number of times specified by the countExpression and replacing the original fragment with this enumeration.

10. The molecule elements in the working document are assigned unique id attributes in the same manner as was previously done for the fragment elements. A fragmentRefs2 attribute is then added to each join element that contains the unique ids of the two fragment or molecule elements that the join connects, to facilitate adding the edges defined by the join elements when building the ECT.

Once these transformations have been carried out on the working document, it has been transformed into a format from which an ECT may be constructed.

1. A list is compiled of all molecule elements in the working document that descend from the primary fragment. Those that contain atomistic descriptions, e.g.
those derived from homology variation, are directly added to this list, while those that contain references to other CML documents are dereferenced, and the atomistic descriptions are added to the list.

2. The connection tables contained in the list generated in step 1 are added in turn to the ECT. Each atom from these connection tables is represented by a Node as described previously, while each bond is represented by an Edge.

3. The edges represented by the join elements in the working document are added to the ECT. Those join elements that represent edges between two specific atom nodes result in one Edge in the ECT, while those that connect to a markushMixture or that represent position variation result in sufficient edges to connect to all members of the markushMixture or all permitted connection points respectively. Such edges are marked as variable.

4. Nodes to which a variable edge is connected have an additional atom node representing a hydrogen atom added as a ligand to represent the hydrogen atom that occupies the position if the variable substituent is not attached in that position. The edge to this hydrogen is similarly marked as variable.

Upon the completion of this process, the EpmlParser has constructed an ECT that represents the superposition of the explicit structures represented by the Markush structure as described by the EPML input document. Such an ECT may then be searched to determine if it contains a specific example structure or substructure using the relaxation algorithm, as described in the next section.

The API of the ECT-related classes also permits a user to construct his own ECTs programmatically, and convenience methods are also provided in the ECT class to allow a user to construct ECT representations directly from SMILES strings and from CML fragments and molecules.
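To make the data model of sections 3.5.1 and 3.5.2 more concrete, the following Python sketch builds a small ECT by hand for the monochlorinated toluenes of Figure 3-21. It is an illustration only and does not reproduce the API of the classes described above: the class names Node, Edge and ECT follow the text, but every method name, the sites field and the function build_monochlorotoluene_ect are hypothetical. The sketch further assumes that bond orders may be simplified to 1 and that the hydrogens of the methyl group and of the unsubstituted ring positions can be left implicit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    node_id: int
    element: str                               # "C", "Cl", "H", ...
    markush_id: Optional[str] = None           # "x_y" tag for markushMixture members
    max_variable_edges: Optional[int] = None   # None: no limit recorded


@dataclass
class Edge:
    a: int
    b: int
    order: int = 1
    variable: bool = False   # True if absent from some example compounds


@dataclass
class ECT:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
    sites: Set[int] = field(default_factory=set)   # attachment points of variable edges

    def add_node(self, element, markush_id=None, max_variable_edges=None):
        node_id = len(self.nodes)
        self.nodes[node_id] = Node(node_id, element, markush_id, max_variable_edges)
        return node_id

    def add_edge(self, a, b, order=1):
        self.edges.append(Edge(a, b, order))

    def add_variable_edge(self, site, substituent, order=1):
        self.edges.append(Edge(site, substituent, order, variable=True))
        self.sites.add(site)

    def add_placeholder_hydrogens(self):
        # One variable hydrogen per attachment point, standing in for the
        # substituent when it is attached elsewhere (step 4 above).
        for site in sorted(self.sites):
            hydrogen = self.add_node("H")
            self.edges.append(Edge(site, hydrogen, 1, variable=True))


def build_monochlorotoluene_ect():
    ect = ECT()
    ring = [ect.add_node("C") for _ in range(6)]
    for i in range(6):
        # Bond orders are simplified to 1; aromaticity is not modelled here.
        ect.add_edge(ring[i], ring[(i + 1) % 6])
    methyl = ect.add_node("C")
    ect.add_edge(ring[0], methyl)
    # A single chlorine that may sit on any of the five remaining ring carbons.
    chlorine = ect.add_node("Cl", max_variable_edges=1)
    for carbon in ring[1:]:
        ect.add_variable_edge(carbon, chlorine)
    ect.add_placeholder_hydrogens()
    return ect


ect = build_monochlorotoluene_ect()
print(len(ect.nodes), sum(e.variable for e in ect.edges))
```

The limit of one variable edge recorded on the chlorine node is what would prevent a search from treating the superposition as a polychlorinated compound, in line with the constraint noted in section 3.5.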
3.5.3 The Relaxation Algorithm

The relaxation algorithm is a simple, iterative means by which a target structure may be determined to contain or not to contain a query structure. The basic method may be summarised as follows:

1. Assign to each atom in the query structure a unique label.

2. Assign to each atom in the target structure the set of labels of all query atoms to which that target atom could correspond in a match between the two structures.

3. Iterate through the atoms of the target structure. For each label on each atom, check that the ligands of the target atom carry all of the labels carried by the ligands of the query atom that carries the label in question. If this condition does not hold, remove the label.

4. Repeat step 3 until either no labels remain on the target structure or a complete iteration through the target atoms results in no labels being removed.

If step 4 results in an unlabelled target structure, then the target does not contain the query structure. If it results in a target structure in which each label from the query structure is carried by one and only one target atom then the target structure contains the query structure. If the relaxation procedure results in any other stable state then the target structure may or may not contain the query structure.

This process is illustrated in Figure 3-22, in which at each step the labels on the atom undergoing inspection are shown in blue. Initial labelling is carried out by assigning to each target atom all labels carried by a query atom of the same element type. In the first step of relaxation, the label “4” is removed from the leftmost carbon as in the query structure the atom labelled “4” neighbours atoms labelled “5” and “6”, and these labels are not carried by a ligand of the leftmost carbon atom, while the conditions to retain the labels “2” and “3” are met.
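The four steps summarised above can be sketched in Python as follows. This is a minimal illustration of the basic method only, not the implementation used in the current work: the function names adjacency, relax and classify are hypothetical, atoms are represented by list indices (so labels run from 0 rather than 1), and hydrogens and bond orders are ignored. The classify function also adds one guard that is an assumption of this sketch, requiring that no target atom carries more than one label before a match is reported.

```python
def adjacency(n_atoms, bonds):
    """Build a list of neighbour sets from a list of (a, b) index pairs."""
    adj = [set() for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def relax(query_elements, query_bonds, target_elements, target_bonds):
    q_adj = adjacency(len(query_elements), query_bonds)
    t_adj = adjacency(len(target_elements), target_bonds)
    # Steps 1 and 2: query indices act as labels; each target atom starts
    # with the labels of all query atoms of the same element type.
    labels = [
        {q for q, q_el in enumerate(query_elements) if q_el == t_el}
        for t_el in target_elements
    ]
    changed = True
    while changed:                      # step 4: repeat until stable
        changed = False
        for t, own in enumerate(labels):
            # All labels carried by the ligands of target atom t.
            nearby = set().union(*(labels[n] for n in t_adj[t]))
            for q in sorted(own):
                # Step 3: the ligands of t must carry every label that is
                # adjacent to q in the query structure.
                if not q_adj[q] <= nearby:
                    own.discard(q)
                    changed = True
    return labels


def classify(labels, n_query_atoms):
    if not any(labels):
        return "no match"
    counts = [sum(q in s for s in labels) for q in range(n_query_atoms)]
    # Extra guard (an assumption of this sketch): one label per target atom.
    if all(c == 1 for c in counts) and all(len(s) <= 1 for s in labels):
        return "match"
    return "inconclusive"


# Heavy atoms of 3-aminopropanoyl chloride, N-C-C-C(=O)-Cl, searched
# against itself.
acyl = (["N", "C", "C", "C", "O", "Cl"], [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5)])
print(classify(relax(*acyl, *acyl), 6))
```

With this input the first pass removes the carbonyl carbon's label from the carbon bonded to nitrogen, mirroring the removal of label “4” described above, and the procedure ends with every label carried exactly once. A cyclopropane query run against a cyclobutane target leaves every label in place, which is the inconclusive outcome permitted by the summary.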
The process continues until a situation is reached in which each label from the query structure occurs once and only once in the target structure, and a complete iteration through the target atoms results in no labels being removed as shown, demonstrating that a match has been found.

Figure 3-22: Relaxation match of 3-aminopropanoyl chloride

As mentioned previously, the algorithm outlined above is not guaranteed to produce a conclusive answer. For certain combinations of query and target structures, a stable state can be reached in which unambiguous mapping between query and target atoms has not been reached, but nor has the target structure been demonstrated not to be a match to the query. Two such examples are illustrated in Figure 3-23. In the first case, the searching of the query structure cyclopropane against the target structure cyclobutane, the initial labelling is as shown in the target structure. The relaxation process removes no labels from the target structure, since the requirement for each label in the query structure is that it be adjacent to each of the other two and this condition is satisfied for all labels on the target structure. In the second case, the searching of dimethyl formamide against itself, the symmetry of the two N-methyl groups results in the relaxation algorithm terminating with both carrying the labels “4” and “5”, and so without producing an unambiguous assignment.

The implementation of the algorithm used in the current work resolves such situations by selecting a node that carries more than one label, removing all but one label from the node and continuing the relaxation process. If no match is found, the process is repeated, retaining a different label until either a solution is found or it is shown that no match exists.

Figure 3-23: Inconclusive results of relaxation matches

The exact process used to perform relaxation matching of structural queries against ECTs in the current work is as follows:

1. Used R-groups, i.e.
those defined in the source fragments of the EPML document that have been used to connect a further fragment to the structure, are removed from the ECT. Unused R-groups are retained and later permitted to be matched to hydrogen atoms.

2. Unique labels are arbitrarily assigned to the atom nodes that make up the query ECT. No other types of nodes are permitted to form a part of the query ECT.

3. Each node in the target ECT is labelled against the query ECT. Atom nodes are given the label of a query node if they are of the same element type or the query node is hydrogen and the target node is an R-group, and if the target node has at least as many ligands of each element type as the query node in question. Alkylene nodes that are ligands of the target atom node are considered as carbon atoms for this purpose. The length of the carbon chain embedded within an AlkyleneNode is set to be equal to the number of carbon atoms in the query structure and each internal carbon atom is labelled with all of the corresponding labels from the query structure. Each embedded carbon atom is given two ligand hydrogen atoms which are given all the labels carried by the ligand hydrogens of the query structure’s carbons.

4. An initial label reduction identifies those nodes in the target structure that carry the only instance of a certain label. Where found, those nodes are stripped of all other labels, since a match may only be found in circumstances where that node carries that label. The process repeats until no such nodes are found.

5. The nodes of the target ECT are relaxed. In each iteration, a given label is removed from an atom node in the target structure that carries it unless the node’s ligands carry all of the labels adjacent to the given label in the query structure and the orders of the edges that connect the candidate equivalent nodes in the target structure match those in the query structure.
The procedure for the removal of labels from the embedded nodes in an alkylene node follows a similar method. The internal carbon atom nodes are held in a list in which the first is considered to be the “leftmost” and the last the “rightmost”, and those nodes adjacent in the list are considered to be connected. During relaxation, for each internal carbon atom node sets of adjacent labels are computed in which one label is selected from each of the nodes connected to the current node, the leftmost ligands if the current node is the first in the internal list and the rightmost ligands for all internal nodes. If none of the label sets so generated contains the set of adjacent labels from the query structure then the label in question is removed from the embedded target atom node. Whenever the rightmost node becomes unlabelled it is removed from the internal list, allowing the alkylene node’s internal carbon chain to shrink until it is of the correct size to match the query structure or until it reaches zero length and carries no labels, indicating that the alkylene node is not involved in any potential match to the query structure. Step 5 repeats until a stable state is reached.

6. The target ECT is checked to determine whether a premature stable state has been reached. If so, a multiply labelled node is selected and n copies of the ECT, where n is the number of labels on the selected node, are created in which the selected node carries only a single label. These ECTs are returned to step 5 for further processing. If the ECT carries each query label once and only once, the candidate solution is checked to ensure its validity. It is checked that the solution does not use nodes derived from more than one member of each markushMixture using the nodes’ markushIds, each TemplateNode contained in the ECT checks that its carbon count is permitted and each node is checked to ensure that its variable edge limit is not exceeded.
If these checks are passed, a solution to the search has been found and is returned; if not, the ECT is discarded and the next ECT is considered.

3.5.4 Examples

To demonstrate the building and searching of ECTs, a number of examples are subsequently discussed. A simple Markush structure and a corresponding ECT are shown in Figure 3-24. In the ECT, the halogen atoms (shown in blue) bear markushIds that indicate that not more than one of them may be used simultaneously when searching, and the variable edges in the ECT – those that connect to the variably positioned methyl group and to the hydrogen atoms added to attachment points of said methyl group – are shown in red. The alkylene node representing the alkyl chain R is denoted in the ECT by the symbol “Q” and other nodes are atom nodes of the element type shown.

Figure 3-24: Example Markush structure (left) and corresponding ECT (right)

Examples of searching the ECT shown in Figure 3-24 are subsequently discussed. In each case, the query structure should not be construed to indicate the presence of any implicit hydrogen atoms beyond those that make up the methyl and ethyl substituents. When matches are found for the given query, the matching substructure of the ECT is shown in red. When discussing these examples, the atoms of the benzene ring are numbered clockwise from the top, starting at 1.

    Example no.   Query                        Match
    1             [structure not reproduced]   [structure not reproduced]
    2             [structure not reproduced]   [structure not reproduced]
    3             [structure not reproduced]   [structure not reproduced]

In the first example the scaffold of the Markush structure, the benzene ring, is easily identified as matching the query structure. In the second example, the methyl group in the 4-position of the query structure is matched to the methyl group with the unspecified locant, while the two hydrogen atoms in the query structure are matched to those hydrogen atoms automatically added to the ECT at positions permitted to be connection points for such mobile substituents. Similarly, this mobile methyl group may be found in the 2-position, as in the third example.

    Example no.   Query                        Match
    4             [structure not reproduced]   No match
    5             [structure not reproduced]   No match

The mobile methyl group may not, however, be found in two positions simultaneously and the search algorithm does not allow it to be simultaneously labelled for the two search methyl groups in a result; hence, the query in the fourth example does not produce a hit. Likewise, in the fifth example, while the carbon from the mobile methyl group may appear in both the 3- and 4-positions, it may not appear in both simultaneously. In this case a result that matches the query structure violates the variable edge limit of the mobile carbon, and so no hits are found.

    Example no.   Query                        Match
    6             [structure not reproduced]   [structure not reproduced]
    7             [structure not reproduced]   No match

In the sixth example, the ethyl group is correctly matched to the alkylene node, while in the seventh example only one of the halogen atoms is permitted to be used at a time in a search result, as enforced by the markushIds carried by the halogen atom nodes, and so the query produces no hits.

3.6 Conclusions

The work presented in this chapter has demonstrated an outline implementation of a CML-based system for the representation and manipulation of Markush structures which supports the major features of Markush structures as they are used in the patent literature. Polymer Markup Language, an existing CML vocabulary, has been extended to permit the more convenient description of Markush structures and systems have been developed to enable the creation of example molecules and substructure searching of the Markush structures.

While the current work has shown how it is possible to produce machine-understandable Markush structures that are compatible with the semantic web of chemistry, it is in need of further development before it can be said to be a competitor to the currently available commercial systems. In particular, the implementation of substructure searching provided by the current system does not have an understanding of aromaticity or of stereochemistry – features which would likely be required in any production system.
Since the commencement of the current work, ChemAxon have begun to add functionality to their software, Marvin, to permit the drawing, representation and manipulation of Markush structures (90) and have further demonstrated the potential for the automatic conversion from the Markush DARC format to their own. While the Marvin format is loosely based upon CML, it must be considered essentially to be another proprietary format that is not suitable for the semantic web of chemistry. It is encouraging, however, to see that the automatic conversion of Markush formats has been shown to be possible as it allows for a transition from the current situation without the need to abandon the data that has been collected up to the current point.

4. Automatic Acquisition of Hyponymic Relations from the Chemical Literature

When considering a problem, the human thought process makes much use of background knowledge, whether in the chemical sciences or in everyday life. Chemical Markup Language provides a means to describe defined chemical concepts such as molecules, reactions and spectra but does not give the freedom to define the relationships between novel concepts. In the Semantic Web, this capacity is provided by ontologies – and much effort has been devoted to their construction. The most basic elements of these formal representations of knowledge are a set of hierarchical hyponymic relations – descriptions of which concepts form supersets or subsets of which other concepts, which may be considered at the most basic level as dictionaries of terms. This chapter considers how these dictionaries may be automatically derived, using the published literature as a source of knowledge, and discusses some of the applications to which the derived knowledge may be put.

4.1 Hyponymic Relations

Hyponymic (“is-a”) relations exist between two terms where one, the hyponym, is a subset of the other, the hypernym.
For example, “vehicle” is a hypernym of “car”, and “Ford Fiesta” is a hyponym of “car”. Knowledge of such relations forms a key part of the way that we reason and form deductions. Consider, for example, the following statements:

i. Reactions between carboxylic acids and alcohols that form esters are esterifications.
ii. Ethanol is an alcohol.
iii. Acetic acid is a carboxylic acid.
iv. Ethyl acetate is an ester.

It therefore follows that the reaction between acetic acid and ethanol that forms ethyl acetate is an esterification. In order to facilitate the automation of such, and other, reasoning – the ultimate goal of the semantic web – it is necessary first to encode the starting axioms in a formal representation such as an ontology.

The creation of a formal knowledge representation is typically a slow process as manual curators determine and verify the information to be curated, and such work is confined to fields that the curators consider to be within the scope of their work. This work may be both quickened and broadened by the automatic acquisition of the hyponymic relations.

4.2 Hearst Patterns

Hearst first proposed the use of lexico-syntactic patterns, thereafter known as Hearst Patterns, for the automatic acquisition of hyponymic relations (91). She described six patterns that could be employed:

    Format                  Example                                                   Pattern Name
    HYPER such as HYPO      Apolar solvents such as THF and hexane                    SUCH_AS
    such HYPER as HYPO      Such bases as NaOEt or LDA                                SUCH_FOO_AS
    HYPO or other HYPER     MeCl, EtBr or other organohalides                         OR_OTHER
    HYPO and other HYPER    Benzene, ethylene oxide and other carcinogens             AND_OTHER
    HYPER including HYPO    Methyl ketones including acetone                          INCLUDING
    HYPER especially HYPO   Grignard reagents, especially methyl magnesium chloride   ESPECIALLY

Table 4-1: Hearst Patterns and their usage in chemical texts

wherein HYPER and HYPO represent noun phrases that denote the hypernym and hyponym(s) respectively.
Each of these patterns has been assigned a name for ease of reference. Taking the example for the SUCH_AS pattern, it can be seen that the text communicates the information that THF and hexane are examples of apolar solvents. This information is readily available to any fluent speaker of the English language, regardless of whether or not they are aware of what “THF”, “hexane” or an “apolar solvent” are. The application of these six patterns to a large corpus of text therefore provides a powerful method for the identification of hyponymic relations between lexical terms, and the approach has been broadly applied, including in the biomedical sciences (92) – though not previously to the chemical sciences.

Of course, chemical documents contain Hearst Patterns that are not relevant to the chemical field. In order to limit the domain of hyponymic relations identified by any automated system, it is necessary to apply some form of filtering. The capacity to separate “chemical” from “non-chemical” words is provided by the OSCAR3 toolkit, as previously discussed. Indeed, a basic approach to the problem is implemented within OSCAR3 itself.

4.2.1 OSCAR3 Implementation

The OSCAR3 application of Hearst Patterns employs token-based, regular-expression-style matching of the kind employed in the PatternRecogniser (see section 2.2.6.3), and uses as a core feature the named-entity recognition and the experimental subclass classification provided by OSCAR3. The SUCH_AS pattern described above, for example, is specified by OSCAR3 as;

    $CM:CLASS<hyper> $MAYBECOMMA $SUCHAS $CMEXACTHYPO

$CM:CLASS refers to a named entity of type CM (Chemical – see section 2.2.6.3) and of subclass CLASS, while the meanings of $MAYBECOMMA, $SUCHAS and $CMEXACTHYPO are defined separately as follows;

    $MAYBECOMMA = $( , $) $?

In this definition, the $ symbol before the brackets and before the question mark indicates that they are to be interpreted as in standard regular expressions, i.e. the brackets define a group and the question mark states that the preceding group is optional, while the comma matches a literal comma. Thus, the $MAYBECOMMA expression matches either one comma or nothing at all.

    $CMEXACTHYPO = $( $( $CM:EXACT<hypo> $) $| $( $( $CM:EXACT<hypo> , $) $* $CM:EXACT<hypo> $MAYBECOMMA $ANDOR $CM:EXACT<hypo> $) $)

    $ANDOR = $( and $| or $)

The whole expression for $CMEXACTHYPO thus matches either a single named entity of type CM and subtype EXACT, or a comma-separated list of arbitrary length terminated by “and” or “or” and a final CM:EXACT, e.g. “methane, ethane, propane, butane and pentane”, or simply “methane”.

    $SUCHAS = $( such as $| including $| excluding $| particularly $| especially $| mainly $| primarily $| chiefly $| specifically $)

Here we see a number of Hearst Patterns, including the SUCH_AS pattern described above, being implemented at once – thus widening the scope for hyponymic relation acquisition. A number of other Hearst Patterns are similarly implemented by OSCAR3, and they are not confined to hyponymic (“is-a”) relations but also cover “has-a” relations, e.g. “the ester carbonyl” – indicating that a carbonyl is a part of an ester.

Because the OSCAR3 implementation depends on a token-level match it lacks some of the flexibility that is afforded by the standard natural language approach of chunking the tokens into phrases and identifying hypernyms and hyponyms as noun phrases. The hypernym is required to be of type CM or of a specific subtype of CM, causing useful relations to be ignored – the class CM is intended to refer to words that have structural or substructural meaning. As a result, hypernyms based on usage, function or properties, etc. are ignored, for example the relation “bases such as LDA and n-butyl lithium”.
This capability could be added by selecting a number of hypernyms that, while not classified as CM by OSCAR3, are of sufficient chemical importance to be included, but this approach would not be truly unsupervised, and would therefore miss subtleties in the text e.g. “strong bases such as LDA and n-butyl lithium” or “5-HT1A antagonists such as Lecozotan or Spiperone”, in which the crucial function of the drugs is not that they are antagonists, but that they are antagonists of the 5-HT1A receptor.

4.3 Acquiring Hyponymic Relations

In order to address the issues described in the previous section, a new system was developed to apply OSCAR3’s name recognition capabilities to the problem of the detection of Hearst Patterns in chemical texts. This system aims to carry out unsupervised detection of molecular classifications, such as those exemplified in Table 4-1, and is based around ChemicalTagger (described in section 2.2.7) as outlined below;

Figure 4-1: Acquisition and Storage of Hearst Patterns

In this system, the text from the EPO patents is passed into the HearstFinder one paragraph at a time. The HearstFinder then uses ChemicalTagger to analyse the grammar of the input text – returning a tagged and chunked document. From these documents it is possible to identify sections of the input text that correspond to a given Hearst Pattern, i.e. the noun phrases that denote hypernym and hyponym as well as the invariant text characteristic of the Hearst Pattern e.g. “such as”. By checking that the hyponym phrase consists of a sequence of entities of type CM, it is possible to narrow the set of identified hyponymic relations to those that define relations between chemical structures. The set of hyponymic relations derived from this process is then formally encoded into the Web Ontology Language (OWL) (93). Using appropriate heuristics, these relations are then trimmed with the intention of removing unreliable or inaccurate assertions.
The full process is subsequently discussed in detail.

4.3.1 HearstFinder

The application of Hearst Patterns to and identification of hyponymic relations within an input text is handled by the HearstFinder class. The operation of this class is subsequently discussed.

4.3.1.1 Specification of Hearst Patterns

Hearst Patterns to be identified are defined to the HearstFinder as space-separated string representations of literal and meta-tokens, for example, the SUCH_AS pattern is defined as “$HYPER such as $HYPO”. The HearstFinder treats “$HYPER” and “$HYPO” as meta-tokens that denote the location of the noun phrases that define the hypernym and hyponym respectively, while other tokens in the specification are treated as literal tokens that define the structure of the Hearst Pattern and must be present in the source text in the same positions relative to the noun phrases as they are in the specification relative to the meta-tokens in order for the source text to be considered to match the Hearst Pattern. The matching of phrases with meta-tokens is illustrated in Figure 4-2.

This approach offers a greater flexibility than that of the OSCAR3 system since it allows for a user to easily apply the existing code to novel Hearst Patterns and is not limited to the discovery of pre-defined or recognisably chemical hypernyms and hyponyms. Indeed, the sections of input text that are identified by the HearstFinder are not in the first instance restricted to the chemical domain in any way. The software is therefore potentially reusable in a general context.

4.3.1.2 Matching Hearst Patterns

Initially, the input text is passed to ChemicalTagger for grammatical analysis. The target Hearst Pattern is specified as described above and passed to the HearstFinder as an argument.
Instances of this pattern are identified by locating instances of the first literal token from the target pattern within the text and simultaneously advancing through the tokens of the target pattern and source text. When a literal token is found in the target pattern, this token must be the next token of the source text; when a meta-token is found in the target pattern the containing phrase is noted and the pointer that tracks which position in the text is being examined is advanced to the end of this phrase. If this process can be completed successfully then meta-tokens in the target pattern that occur prior to the first literal token are matched by looking backwards in the source text from the position of the match to the first literal token. For example, the input text “Apolar solvents such as THF and hexane may be employed” produces the grammar tree shown in Figure 4-2, wherein NP indicates a noun phrase, VP indicates a verb phrase and PP indicates a prepositional phrase.

Figure 4-2: Grammatical structure of a Hearst Pattern

When the SUCH_AS pattern ($HYPER such as $HYPO) is applied to this result, the hypernym and hyponyms are identified as the immediate parent phrases of the tokens before and after the literal pattern text (“such as”), i.e. “Apolar solvents” and “THF and hexane” respectively. At this stage, in order to eliminate non-chemical hyponymic relations e.g. “large cities such as Tokyo and London”, the content of the hyponym is checked using OSCAR3; each of the tokens making up the text of the hyponym must either have been tagged as CM by OSCAR3, or must belong to a predefined list of allowable tokens including articles (“a”, “an” and “the”), appropriate punctuation (comma and semicolon) and conjunctions (“or” and “and”).

4.3.2 Recording Hyponymic Relations

A hyponymic relation defines one concept to be a subclass of another.
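As an aside on the matching procedure of section 4.3.1.2, the following is a minimal Python sketch of the forward and backward scan over a chunked sentence, followed by the chemical filter on the hyponym phrase. It is not the HearstFinder code: the chunk layout (a flat list of labelled phrases), the function names `match_pattern` and `is_chemical_hyponym`, and the use of a plain set of names in place of OSCAR3's CM tagging are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Chunk = Tuple[str, List[str]]  # (phrase label, tokens in that phrase); assumed layout

# Non-chemical tokens that the text allows inside a hyponym phrase.
ALLOWED = {"a", "an", "the", ",", ";", "or", "and"}


def match_pattern(spec: str, chunks: List[Chunk]) -> Optional[Dict[str, str]]:
    """Return the phrase bound to each meta-token, or None if nothing matches."""
    pattern = spec.split()
    # Flatten the chunked sentence, remembering which phrase owns each token.
    tokens = [tok for _, words in chunks for tok in words]
    owner = [i for i, (_, words) in enumerate(chunks) for _ in words]
    first_literal = next(i for i, p in enumerate(pattern) if not p.startswith("$"))

    for start, token in enumerate(tokens):
        if token.lower() != pattern[first_literal].lower():
            continue
        found: Dict[str, str] = {}
        pos, ok = start, True
        # Advance through the pattern and the text together.
        for item in pattern[first_literal:]:
            if pos >= len(tokens):
                ok = False
                break
            if item.startswith("$"):
                phrase = owner[pos]  # note the containing phrase
                found[item] = " ".join(chunks[phrase][1])
                while pos < len(tokens) and owner[pos] == phrase:
                    pos += 1  # jump to the end of that phrase
            elif tokens[pos].lower() == item.lower():
                pos += 1
            else:
                ok = False
                break
        # Meta-tokens before the first literal are matched by looking backwards.
        back = start - 1
        for item in reversed(pattern[:first_literal]):
            if not ok or back < 0:
                ok = False
                break
            phrase = owner[back]
            found[item] = " ".join(chunks[phrase][1])
            while back >= 0 and owner[back] == phrase:
                back -= 1
        if ok:
            return found
    return None


def is_chemical_hyponym(phrase: str, chemical_names: Set[str]) -> bool:
    """Every token must be a chemical name or belong to the allowed list."""
    return all(t in chemical_names or t.lower() in ALLOWED for t in phrase.split())


solvents = [("NP", ["Apolar", "solvents"]), ("PP", ["such", "as"]),
            ("NP", ["THF", "and", "hexane"]), ("VP", ["may", "be", "employed"])]
cities = [("NP", ["large", "cities"]), ("PP", ["such", "as"]),
          ("NP", ["Tokyo", "and", "London"])]
names = {"THF", "hexane"}  # stand-in for tokens tagged CM

solvent_match = match_pattern("$HYPER such as $HYPO", solvents)
city_match = match_pattern("$HYPER such as $HYPO", cities)
```

In this sketch both sentences match the SUCH_AS pattern, but only the first survives the filter, since “Tokyo” and “London” are neither chemical names nor allowed tokens.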
Since such relationships perform a key role in ontologies – formal representations of knowledge – there are existing tools that provide a means for their storage and manipulation. In the current work, hyponymic relations are converted to, and stored in, the Web Ontology Language (OWL) (93). The required functionality for reading, writing and creating OWL is provided by the open-source OWL API library (94), version 3.0.0.

Each extracted hyponymic relation is represented by a SubClassOf axiom in OWL, which denotes that one class – the subclass – is a subset of another – the superclass. Both hypernyms and hyponyms are represented by OWL classes. Hypernyms are represented by unique classes if the text string composing the hypernym phrase is novel once it has been lowercased and stripped of leading determiners and the terminal character ‘s’, as a naïve approach to the problem of pluralisation. Thus, the hypernyms “Esters”, “some ester” and “an ester” would be considered equivalent and represented by the same OWL class.

Individual chemical names have already been identified within the hyponym phrase by ChemicalTagger, and the MOLECULE elements therein are taken to be individual hyponyms. It is not desirable to create multiple OWL subclasses for a single chemical substance, so non-novel hyponyms are identified both by string equivalence and by resolution of chemical names to structures using the NameResolver class from OSCAR3. The resulting connection table is converted to InChI using the InChIGeneratorTool class from JUMBO. If the resulting InChI has been previously seen then the class representing this structure is used as the subclass for the new axiom. Using this method, the hyponyms “chloroform” and “trichloromethane” are represented by the same OWL class.
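The two equivalence rules just described can be illustrated with a short Python sketch. It is not the code used in this work: the determiner list, the `HyponymRegistry` class and the dictionary that stands in for name-to-InChI resolution are assumptions introduced for the example.

```python
from typing import Dict, Optional

# Assumed determiner list; the text gives "some" and "an" as examples only.
DETERMINERS = {"a", "an", "the", "some"}


def hypernym_key(phrase: str) -> str:
    """Lowercase, strip leading determiners, then drop one terminal 's'."""
    words = phrase.lower().split()
    while words and words[0] in DETERMINERS:
        words.pop(0)
    key = " ".join(words)
    return key[:-1] if key.endswith("s") else key


class HyponymRegistry:
    """Maps chemical names to one class per structure (hypothetical helper)."""

    def __init__(self, resolver: Dict[str, str]) -> None:
        self.resolver = resolver  # name -> InChI; stand-in for real name resolution
        self.by_name: Dict[str, str] = {}
        self.by_inchi: Dict[str, str] = {}

    def class_for(self, name: str) -> Optional[str]:
        key = name.lower()
        if key in self.by_name:  # string equivalence
            return self.by_name[key]
        inchi = self.resolver.get(key)
        if inchi is None:  # name could not be resolved to a structure
            return None
        # Reuse the class already created for this InChI, if any.
        class_id = self.by_inchi.setdefault(inchi, key)
        self.by_name[key] = class_id
        return class_id


CHLOROFORM = "InChI=1S/CHCl3/c2-1(3)4/h1H"
registry = HyponymRegistry({"chloroform": CHLOROFORM, "trichloromethane": CHLOROFORM})

keys = [hypernym_key(p) for p in ("Esters", "some ester", "an ester")]
first = registry.class_for("chloroform")
second = registry.class_for("Trichloromethane")
```

Here all three hypernym strings reduce to the key "ester", and both chloroform names map to the same class. The terminal-'s' rule is as naïve as the text says: a singular term that happens to end in 's' would be truncated as well.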
Since the current intent is to derive classifications of chemical compounds from the patent texts, relations for which it is not possible to resolve the hyponym to a connection table and to an InChI are discarded. This may occur, for example, in situations where the hyponym is systematic nomenclature that is not supported by OPSIN, a trivial term that does not occur in OSCAR’s dictionary or a term denoting a chemical fragment, such as “ethyl”.

The OWL file created by this method contains more information than purely which terms are hyponyms of which other terms. The OWL classes corresponding to the hyponyms are annotated to include all the identified synonyms for a structure and the InChI for that structure, as well as the text of the paragraph(s) from which the hyponymic relation was inferred. Access to the source text is vital for the identification of sources of error in this procedure, as will be seen later. In addition, the SubClassOf axioms are annotated with the number of patents from which the relation has been inferred. Since relations that are asserted by a large number of sources may be considered to be more reliable or more important than those asserted by fewer, this allows for the elimination of unreliable and unimportant information at a later stage.

4.3.3 Content of the Derived Relations & Sources of Error

The full texts of the patents from the corpus of 667 unique, full-text patent documents (collected as described in section 5.1.3) were passed through the system. One of these patents, EP1651230, was found to contain text that caused ChemicalTagger to freeze and so was omitted from the procedure. The system identified 5624 Hearst Patterns across the remaining 666 patents, with each patent containing between 0 and 96 individual Hearst Patterns. The distributions of Hearst Patterns across the corpus and the contributing Hearst Patterns are summarised in Figure 4-3 and Figure 4-4 respectively.
Figure 4-3: Distribution of Hearst Patterns across the patent corpus (histogram; x-axis: Hearst Patterns per patent, binned from 0 to >50; y-axis: Patent Count)

Figure 4-4: Individual Hearst Pattern frequency across the patent corpus (SUCH_AS 89.8%, AND_OTHER 4.5%, INCLUDING 2.1%, OR_OTHER 2.0%, ESPECIALLY 1.4%, SUCH_FOO_AS 0.2%)

It is particularly noteworthy in Figure 4-4 that the SUCH_AS pattern is dominant in terms of usage, comprising around 90% of the Hearst Patterns identified within the source texts. The OWL file derived from these Hearst Patterns defines 1001 superclasses (i.e. hypernyms), 984 subclasses (i.e. hyponyms) and a further 13 classes denoting terms that appeared as both hypernyms and hyponyms such as “pyridine” – being both a specific compound, C5H5N, and the class of compounds that contain a pyridine ring. Each superclass (including those that were also subclasses) had between 1 and 35 subclasses, collectively defining a total of 2738 unique hyponymic relations.

The derived hyponymic relations contain a number of sets of superclasses in which one is a superset of the others e.g. “base” and “strong base” or “solvent” and “chlorinated solvent”. In such cases, the superclasses are formally related only in that they may share some or all subclasses. In the chemical field it is dangerous to make the assumption that a class that fits the pattern adjective-noun is a subset of the corresponding class noun, since while a chlorinated solvent is a solvent, a chlorinated hydrocarbon is not a hydrocarbon. Consequently, the system implemented here makes no such assumptions in order to avoid introducing such errors.

Of course, automated systems make mistakes and so the hyponymic relations derived from the source texts are not perfectly reliable.
It was not feasible to manually examine and validate a knowledge base of this size and so no metrics are available for the raw performance of the system on the full set of input texts, though a validation was carried out on a subset of the input text and is discussed in section 4.3.5. The full set of derived hyponymic relations includes molecular classifications that may be considered common and generic (e.g. “polar aprotic solvent” and “strong base”), those that define common structural classes (e.g. “alcohol” and “amino acid”), those that define specific classes likely to be relevant to only a small subset of the patent documents (e.g. “antibiotic” and “antipsychotic drug”) and those that are entirely meaningless and have been included as a result of a mistaken parse of the input text such as those shown in Table 4-2.

Hypernym | Source text
abbreviated word | In Table 2, DHP-Cz represents 3,6-dihydroxyphenyl-9-decyl-carbazole, and other abbreviated words are the same as described in Table 1.
maybe used | A wide range of reducing agents maybe used, such as sodium borohydride, formaldehyde, formic acid, sodium formate, hydrazine hydrochloride, hydroxylamine, and hypophosphorous acid.

Table 4-2: False hypernyms and their source texts

Further, and potentially more serious, false assertions are contained where the input text lists examples of two or more molecular classes at once. For example, for the input text;

“Compounds of formula I wherein R8 is NRcRd and R9 is hydrogen may be prepared by treatment of the appropriate precursor containing the C31-C32 unsaturation with HNRcRd or HCl; HNRcRd in an appropriate protic or aprotic solvents such as methanol, ethanol, benzene, toluene, dimethylformamide, dioxane, water and the like.”

ChemicalTagger identifies the noun phrase preceding “such as” as “aprotic solvents”, resulting in the misclassification of methanol, ethanol and water as aprotic solvents.
Linguistic constructions of this form are problematic to the system as it is only possible to identify which of the molecular classes the hyponyms belong to by the application of the domain-specific knowledge that we seek to identify from the source texts.

4.3.4 Trimming the Relations

In order to produce a knowledge base of manageable size and of higher quality, it was necessary to remove a number of classes and associated axioms. The full OWL file derived from the previous process was therefore trimmed based on a requirement that all axioms should have been derived from a minimum of three separate patent documents. Following the removal of axioms in this way, orphaned classes, i.e. those that had no remaining superclasses or subclasses, were also deleted. In this way, it was hoped that many of the mistaken classifications would be eliminated since it would be less common for a mistake to be repeated than for a correct hyponymic relation to be specified across multiple documents. It was further hypothesised that this removal of invalid axioms could be improved by increasing the number of input documents and the minimum source threshold, though this was not tested.

Following the trimming of the derived relations in this way, the resultant OWL file defined 133 superclasses and 330 subclasses. Each superclass had between 1 and 24 subclasses, collectively defining a total of 516 hyponymic relations.

These 516 hyponymic relations were subjected to manual inspection and verification. Since the purpose of this exercise was to assess the validity of the hyponymic relations extracted from the source patents, hypernyms were judged to be acceptable if the extracted term was a grammatically and semantically valid description of a class of chemical structures while hyponymic relations were judged to be acceptable if the molecule generated from the hyponym term could be correctly described as a member of the class in question.
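The trimming rule of section 4.3.4 (a minimum of three source patents per axiom, followed by deletion of orphaned classes) can be sketched in a few lines of Python. This is an illustration only: the dictionary representation, the `trim` function and the patent identifiers "P1" to "P5" are hypothetical, whereas the work itself operates on annotated OWL axioms.

```python
from typing import Dict, Set, Tuple

Axiom = Tuple[str, str]  # (subclass, superclass)


def trim(axioms: Dict[Axiom, Set[str]], min_patents: int = 3):
    """Keep axioms seen in at least min_patents patents and drop orphaned classes."""
    kept = {ax: srcs for ax, srcs in axioms.items() if len(srcs) >= min_patents}
    # A class survives only if it still takes part in at least one kept axiom.
    classes = {name for ax in kept for name in ax}
    return kept, classes


# Hypothetical patent identifiers, for illustration only.
axioms = {
    ("ethanol", "alcohol"): {"P1", "P2", "P3"},
    ("methanol", "alcohol"): {"P1", "P4"},
    ("water", "aprotic solvent"): {"P5"},
}
kept, classes = trim(axioms)
```

In this example only the ethanol axiom is retained, and the classes "methanol", "water" and "aprotic solvent" are removed as orphans.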
The validity of hyponymic relations was only assessed where the hypernym concerned had already been judged to be acceptable. The results of this process are summarised in Table 4-3.

Task | Acceptable | Not acceptable | Total | % acceptable
Hypernym verification | 113 | 20 | 133 | 85.0%
Hyponymic relation verification | 459 | 28 | 487 | 94.3%

Table 4-3: Verification of trimmed hyponymic relations

Of the retained hypernyms, 85% were judged to be acceptable according to the preceding criteria. Of the 20 hypernyms judged not to have been acceptable, 5 were found to have been included due to the incorrect interpretation of the term “methyl” as denoting the chemical structure of methane, allowing such hypernyms as “radical” and “group” to enter the molecular structure classification. One unacceptable hypernym, “dien”, was included as a result of typos where “diene” was intended. One hypernym, “feedstock”, was judged to be unacceptable as the term is a description of a bulk material as opposed to of specific chemical compounds. The remaining 13 unacceptable hypernyms were generated as a result of incorrect grammatical parsing by ChemicalTagger. In 10 of these cases, ChemicalTagger included too much text in the hypernym while in 3 cases too little text was included. Examples of these cases are illustrated in Table 4-4.

Hypernym | Source text
Can be formed from a variety of phospholipids | Liposomes can be formed from a variety of phospholipids such as cholesterol, stearylamine or phosphatidylcholines.
Are typically diluted with an inert carrier | Such compositions are typically diluted with an inert carrier, such as water, before application.
Agent | …for example, sweetening agents such as fructose, aspartame or saccharin…
Inhibitor | …adenosine diphosphate (ADP) inhibitors such as clopidogrel…

Table 4-4: Unacceptable hypernyms and sample source texts

When examining the hyponyms of those molecular classes considered acceptable, a high proportion – 94% – were found to be valid examples of the hypernym concerned. The interpretation of six of the hypernyms – “non-toxic pharmaceutically acceptable inert carrier”, “suitable solvent”, “suitable base”, “pharmaceutically acceptable solvent”, “appropriate solvent” and “low molecular weight aliphatic alcohol” – was found to involve a degree of subjectivity. In these cases the acceptability of the hyponyms was adjudicated without reference to the subjective qualification – thus, hyponyms of “suitable solvent” were considered acceptable if they were solvents, etc. This high success rate suggests that the technique offers a very powerful means by which to identify and formalise chemical classifications.

4.3.5 HearstFinder Validation

In order to quantify the performance of the HearstFinder, an annotation task was undertaken to permit the comparison of different humans’ opinions of what constituted a chemical Hearst Pattern with each other and with the performance of the machine. Annotation guidelines were written, and are attached in Appendix A.

The goal of the task was to derive performance metrics within a defined scope. Since, on average, around 8.5 Hearst Patterns were curated from each patent, it was not feasible to verify against manually annotated sections of patent text selected randomly and without limits. Instead, it was decided to focus the task on the SUCH_AS pattern, which provided 90% of the curated hyponymic relations, and to select the corpus from among those paragraphs that contained the string “such as”.
Since the HearstFinder implementation requires the inclusion of the invariant text of a Hearst Pattern in the input text in order to return any results, this filtering of the corpus paragraphs served only to remove a large proportion of paragraphs that would be of no interest to a manual annotator and could not affect the performance metrics. The limitation of the task to the SUCH_AS pattern allowed the filtering to produce a corpus in which the relevant content was highly enriched, and the exclusion of the other patterns – which collectively contributed only 10% of the total curated relations – was considered an acceptable trade-off.

The annotation guidelines are intended to cause the annotators to behave in the way that HearstFinder is intended to operate. The annotator is free to identify the hypernym according to his or her judgement of what is correct, while hyponyms are required to be terms that have structural meaning – though not artificially limited solely to those classed as CM by the OSCAR3 annotation guidelines. The annotated patterns thereby produced consist of those that fit the form that the HearstFinder is intended to identify, while the manually annotated hypernyms and hyponyms allow for the validation of those that the system automatically annotates against what a human considers to be the correct term.

The corpus for this exercise was assembled by randomly selecting 300 paragraphs of text that contained the phrase “such as” from the set of unique, full-text patent documents assembled as described in section 5.1.3. A further, non-overlapping, set of 30 paragraphs selected by the same method was used in advance of the full annotation task to ensure that the annotators understood and could implement the annotation guidelines. Three annotators were used for the task; the present author, an academic with many years’ experience of the chemical domain and a summer student who had completed two years’ undergraduate study of chemistry¹.
Each annotator’s results were compared to those of the other annotators and to those of the machine.

¹ In the presentation of the results, “annotator A” was the current author, “annotator B” was the summer student Shaoming Chen and “annotator C” was the academic Peter Murray-Rust. All three sets of annotations are available for inspection on the attached disk.

Human annotation of the corpus was carried out using a customised version of the OSCAR3 ScrapBook functionality. The OSCAR3 ScrapBook allows a user to manually annotate the OSCAR3 named entities within a sample text using a web browser, by selecting the text to be annotated and clicking the button associated with the desired named entity class. By replacing the OSCAR3 named entities with the named entities “pattern”, “hyper” and “hypo”, a tool for annotating the Hearst Patterns within the corpus was created. This tool is shown in Figure 4-5, while an example of the XML annotation produced thereby is shown in Figure 4-6.

Figure 4-5: The customised OSCAR3 ScrapBook

    <P>
      <snippet id="s1" fileno="p0">
        The inorganic substance includes hydrochlorides of
        <ne type="pattern">
          <ne type="hyper">metals</ne> such as
          <ne type="hypo">potassium</ne> ,
          <ne type="hypo">sodium</ne> ,
          <ne type="hypo">magnesium</ne> ,
          <ne type="hypo">iron</ne> ,
          <ne type="hypo">manganese</ne> ,
          <ne type="hypo">cobalt</ne> ,
          <ne type="hypo">zinc</ne>
        </ne>
        and the like, sulfates of the above-described metals, and phosphates of the above-described metals. More specifically, potassium chloride, sodium chloride, magnesium sulfate, ferrous sulfate, manganese sulfate, cobalt chloride, calcium chloride, zinc sulfate, potassium phosphate, sodium phosphate and the like.
      </snippet>
    </P>

Figure 4-6: Annotated Hearst Pattern as produced by the OSCAR3 ScrapBook

The comparison of the results produced by the different annotators with one another and with the machine was carried out automatically using software built specifically for the purpose.
Before two annotations can be compared, the corresponding annotations from the two sets undergoing the comparison must first be identified. This was achieved by numbering the instances of the phrase “such as” in the input texts sequentially from 0 to n, then identifying the pattern annotation, i.e. the ne element of type pattern which contained each instance of the phrase “such as”, if any. Each instance of the phrase “such as” may have been included in an annotation by both annotators, not included in an annotation by either annotator, or included in an annotation by one annotator and not the other. Where both annotators applied an annotation, the annotations were compared on a number of criteria;

• Did the annotators apply the pattern annotation to the same section of the text?
• Did the annotators apply the hypernym annotation to the same section of text?
• How many hyponym annotations were applied by both annotators, and how many were applied by one annotator and not the other?

To answer the first two questions above, the raw text content of the appropriate annotations was stripped of all whitespace – to eliminate errors where one annotator had mistakenly included a space at the beginning or end of a word, and errors potentially introduced by the handling of whitespace in XML by the various software components involved in the execution of the task – before being compared. Where the whitespace-stripped strings were equivalent it was judged that the annotators had annotated the same text, and otherwise that they had not. To answer the third question, the sections of text to which the hyponym annotations had been applied were subjected to the same process of whitespace removal, then the sets of hyponyms annotated by each annotator were compared and the number occurring in both sets and the numbers occurring in one and not the other were calculated.
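The three comparison criteria above can be sketched as follows. This is a minimal Python illustration rather than the comparison software itself: the `Annotation` container and the function names are assumptions, and the second annotation is an invented variant of the Figure 4-6 example that omits one hyponym and contains stray whitespace.

```python
from dataclasses import dataclass
from typing import Dict, List


def squash(text: str) -> str:
    """Remove all whitespace so that stray spaces cannot cause a mismatch."""
    return "".join(text.split())


@dataclass
class Annotation:  # hypothetical container for one annotated Hearst Pattern
    pattern: str
    hyper: str
    hypos: List[str]


def compare(first: Annotation, second: Annotation) -> Dict[str, object]:
    a = {squash(h) for h in first.hypos}
    b = {squash(h) for h in second.hypos}
    return {
        "matching_pattern": squash(first.pattern) == squash(second.pattern),
        "matching_hypernym": squash(first.hyper) == squash(second.hyper),
        "matching_hyponyms": len(a & b),  # annotated by both
        "mismatched_1": len(a - b),       # annotated by the first annotator only
        "mismatched_2": len(b - a),       # annotated by the second annotator only
    }


metals = ["potassium", "sodium", "magnesium", "iron", "manganese", "cobalt", "zinc"]
first = Annotation(
    pattern="metals such as potassium , sodium , magnesium , iron , manganese , cobalt , zinc",
    hyper="metals",
    hypos=metals,
)
second = Annotation(
    pattern="metals such as potassium, sodium, magnesium, iron, manganese, cobalt, zinc",
    hyper=" metals",                              # stray leading space
    hypos=[" " + m for m in metals[:-1]],         # "zinc" left unannotated
)
result = compare(first, second)
```

For this pair the pattern and hypernym both match once whitespace is removed, six hyponyms match, and one hyponym is counted against the first annotator only.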
The comparison of the performance of the human annotators with that of the machine proceeded according to the same criteria but using a slightly different method. Since the machine does not create XML annotations of the form produced by the OSCAR3 ScrapBook, instead holding in memory references to the appropriate sections of a ChemicalTagger output document, it was necessary to store the value of n – the record of which instance of the phrase “such as” is the subject of the annotation – in memory as the Hearst Pattern was identified. This allowed the Hearst Patterns recognised by the machine to be aligned with those annotated by the human annotators, then the two were compared as previously.

During the analysis of the annotations, it was discovered that on a number of occasions the human annotators had produced annotations of the Hearst patterns in which the hypernym was not marked. Such annotations were discounted from the annotation comparison procedure and the resultant metrics.

First annotator | Metric | B | C | M
A | Annotated By Both Annotators | 269 (45.1%) | 229 (38.4%) | 133 (22.3%)
A | Skipped By Both Annotators | 234 (39.2%) | 267 (44.7%) | 289 (48.4%)
A | Skipped By One Annotator | 88 (14.7%) | 84 (14.1%) | 174 (29.1%)
A | Missing Hypernym | 6 | 17 | 1
A | Matching Patterns | 219 (81.4%) | 183 (79.9%) | 58 (43.6%)
A | Matching Hypernyms | 248 (92.2%) | 206 (90.0%) | 61 (45.9%)
A | Matching Hyponyms | 713 | 533 | 336
A | Mismatched Hyponyms 1 | 71 (9.1%) | 190 (26.3%) | 17 (4.8%)
A | Mismatched Hyponyms 2 | 68 (8.7%) | 184 (25.7%) | 45 (11.8%)
B | Annotated By Both Annotators | - | 248 (41.5%) | 143 (24.0%)
B | Skipped By Both Annotators | - | 228 (38.2%) | 236 (39.5%)
B | Skipped By One Annotator | - | 100 (16.8%) | 218 (36.5%)
B | Missing Hypernym | - | 21 | 0
B | Matching Patterns | - | 197 (79.4%) | 59 (41.3%)
B | Matching Hypernyms | - | 216 (87.1%) | 64 (44.8%)
B | Matching Hyponyms | - | 542 | 327
B | Mismatched Hyponyms 1 | - | 171 (24.0%) | 40 (10.9%)
B | Mismatched Hyponyms 2 | - | 174 (24.3%) | 69 (17.4%)
C | Annotated By Both Annotators | - | - | 125 (20.9%)
C | Skipped By Both Annotators | - | - | 288 (48.2%)
C | Skipped By One Annotator | - | - | 176 (29.5%)
C | Missing Hypernym | - | - | 8
C | Matching Patterns | - | - | 53 (42.4%)
C | Matching Hypernyms | - | - | 57 (45.6%)
C | Matching Hyponyms | - | - | 243
C | Mismatched Hyponyms 1 | - | - | 84 (25.7%)
C | Mismatched Hyponyms 2 | - | - | 106 (30.4%)

Table 4-5: Results of the HearstFinder validation exercise

The results of the HearstFinder validation exercise are presented in Table 4-5, wherein the human annotators are labelled A, B and C and the machine annotator is labelled M. Each row group gives the first annotator of a comparison and each column the second. The metrics are calculated as follows;

• Annotated By Both Annotators – the instances of the phrase “such as” that formed part of an annotation in both annotation sets.
• Skipped By Both Annotators – the instances of the phrase “such as” that formed part of an annotation in neither annotation set.
• Skipped By One Annotator – the instances of the phrase “such as” that formed part of an annotation in one annotation set and not the other.
• Missing Hypernym – the occasions on which a matching pair of Hearst pattern annotations were not compared as a result of one or both annotators failing to annotate the hypernym.
• Matching Patterns – the occasions on which the two annotators applied the pattern annotation to a whitespace-stripped equivalent section of text.
• Matching Hypernyms – the occasions on which the two annotators applied the hyper annotation to a whitespace-stripped equivalent section of text.
• Matching Hyponyms – the occasions on which the two annotators applied the hypo annotation to a whitespace-stripped equivalent section of text.
• Mismatched Hyponyms 1 – the occasions on which the first annotator applied the hypo annotation to a section of text which was not matched by the second annotator.
• Mismatched Hyponyms 2 – the occasions on which the second annotator applied the hypo annotation to a section of text which was not matched by the first annotator.

These metrics are presented as raw numbers and, where appropriate, as percentages. For the metrics concerned with whether or not both annotators annotated a Hearst pattern, the percentages are calculated as a proportion of the total number of instances of the phrase “such as” in the corpus.
For the pattern and hypernym metrics, the percentages are calculated as a proportion of the total number of Hearst patterns for which the comparison procedure took place, i.e. those which were annotated by both annotators. For the mismatched hyponym metrics, the percentages are calculated as a proportion of the total number of hyponyms annotated by the annotator in question.

The inter-annotator agreement scores in Table 4-5 show that the humans’ interpretation and implementation of the annotation guidelines were not uniform; these scores set a baseline by which the performance of the machine may be judged. The machine performance is comparable to the inter-annotator agreement in terms of the “skipped by both annotators” metric, while in terms of the “skipped by one annotator” metric the machine exhibits a higher rate (29% to 37%) than the human annotators (14% to 17%). This suggests that the machine has a tendency not to annotate Hearst patterns where the guidelines suggest that they should be annotated. This observation is to be expected as a consequence of the conservative strategy implemented by the HearstFinder. Notably, the requirement that the noun phrase that follows the key phrase “such as” be composed of terms identified as CM by OSCAR3 eliminates those Hearst patterns that contain complex hyponyms involving generic adjectives, as in “higher alcohols”. OSCAR3 does not recognise the word “higher” in this context and consequently the entire Hearst pattern is discounted, while a human annotator does not make this mistake. The machine also scores significantly worse than the inter-annotator agreement in terms of the precise matching of the pattern and hyper annotations.
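The percentage definitions given for Table 4-5 can be checked with a short calculation. The following is a minimal sketch in Python that recomputes the A-B column from its raw counts. It assumes that the total number of instances of “such as” is the sum of the four instance-level counts (annotated by both, skipped by both, skipped by one, missing hypernym) and that an annotator's hyponym total is the matching hyponyms plus that annotator's mismatched hyponyms; neither is stated explicitly in the text, but both reproduce the tabulated values. The function and key names are illustrative only.

```python
def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Raw counts for the A-B comparison, copied from Table 4-5.
ab = {
    "annotated_by_both": 269,
    "skipped_by_both": 234,
    "skipped_by_one": 88,
    "missing_hypernym": 6,
    "matching_patterns": 219,
    "matching_hypernyms": 248,
    "matching_hyponyms": 713,
    "mismatched_hyponyms_1": 71,
    "mismatched_hyponyms_2": 68,
}

def table_percentages(c: dict) -> dict:
    # Assumption: every instance of "such as" falls into exactly one of these four bins.
    total = (c["annotated_by_both"] + c["skipped_by_both"]
             + c["skipped_by_one"] + c["missing_hypernym"])
    # Pattern and hypernym metrics use the patterns that were actually compared.
    compared = c["annotated_by_both"]
    # Assumption: hyponyms annotated by an annotator = matched + that annotator's mismatched.
    hypo_1 = c["matching_hyponyms"] + c["mismatched_hyponyms_1"]
    hypo_2 = c["matching_hyponyms"] + c["mismatched_hyponyms_2"]
    return {
        "total_instances": total,
        "annotated_by_both": pct(c["annotated_by_both"], total),
        "skipped_by_both": pct(c["skipped_by_both"], total),
        "skipped_by_one": pct(c["skipped_by_one"], total),
        "matching_patterns": pct(c["matching_patterns"], compared),
        "matching_hypernyms": pct(c["matching_hypernyms"], compared),
        "mismatched_hyponyms_1": pct(c["mismatched_hyponyms_1"], hypo_1),
        "mismatched_hyponyms_2": pct(c["mismatched_hyponyms_2"], hypo_2),
    }

result = table_percentages(ab)
```

Under these assumptions the same total of 597 instances is obtained for every pair of annotators in the table, which supports the reading of the denominators used here.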
The failure of the machine to correctly identify the hypernym can be attributed to one of two causes: the noun phrase that precedes the key phrase “such as” having been incorrectly recognised by ChemicalTagger, or the noun phrase having been correctly identified but not corresponding precisely with what the human annotator considered to be the correct hypernym. The annotation guidelines instruct the human annotators to discount from the hypernym adjectives that do not form a part of the structural class concerned, as in “suitable solvent”, and the correct interpretation of this instruction requires an understanding of the English language and of the chemical domain that the machine does not possess. Consequently it is difficult to devise a system that is not prone to the second source of error, while the development of ChemicalTagger to eliminate the first source of error was beyond the scope of the current work.

The metrics for the rates of disagreement between the machine and the human annotators present a mixed picture. The inter-annotator agreement scores for mismatched hyponyms are significantly higher for the A-C comparison (26% and 26%) and the B-C comparison (24% and 24%) than for A-B (9% and 9%), and the scores for the C-M comparison (26% and 30%) are significantly higher than for A-M (5% and 12%) and B-M (11% and 17%). These results are suggestive of a discrepancy in the implementation of the annotation guidelines between annotator C and annotators A and B. The scores for the comparisons A-M and B-M are similar to those of the comparison A-B, while the scores for the comparison C-M are similar to those for the comparisons A-C and B-C, suggesting that the machine’s performance in this regard is comparable to that of the humans.
It should be remembered that these metrics are computed based solely on the Hearst patterns which have been annotated by both annotators, so although the rate at which the machine includes incorrect hyponyms may be comparable to the humans, the rate at which it excludes correct hyponyms is much higher – as evidenced by the much higher rates at which a human and the machine disagree over whether to annotate a Hearst pattern (29%, 37% and 30%) compared to the rates at which two humans disagree on the issue (15%, 14% and 17%). This conservatism is considered to be entirely acceptable, even desirable, as it assists in the production of a reliable set of relations, as seen in section 4.3.4.

4.4 Uses of Derived Data

To demonstrate the utility of the system described in this chapter, the derived relations were put to several different uses. These use cases are subsequently discussed.

4.4.1 Automatic Classification of Structural & Non-Structural Classes

In the set of derived relations, the hypernyms correspond to the names of classes of chemical compounds. Such classes may be based on structural features, e.g. esters, or on non-structural features, e.g. base. Since the compounds that form a structural class must, by definition, share structural features while those that form non-structural classes need not, these classes should be differentiable by considering the chemical similarity of their members. In order to investigate this hypothesis, the set of relations produced as described in section 4.3.4 was further trimmed by removing superclasses that had fewer than six subclasses, together with the orphaned subclasses this process produced. The chemical similarity of each of the remaining 28 chemical classes was calculated by using the Chemistry Development Kit (CDK) version 1.0.1 to calculate fingerprints for each of the class members and the mean pairwise Tanimoto coefficient (MPT) for the class.²
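The mean pairwise Tanimoto calculation can be sketched as follows. This is a minimal illustration in Python rather than the CDK-based implementation used in the work: fingerprints are represented as sets of "on" bit positions, the mean is taken over distinct pairs of class members (self-pairs are excluded), and the function names and toy fingerprints are hypothetical.

```python
from itertools import combinations

def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient |a & b| / |a | b| for fingerprints given as sets of set bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints (an assumption)
    return len(a & b) / union

def mean_pairwise_tanimoto(fingerprints) -> float:
    """Mean Tanimoto coefficient over all distinct pairs of class members."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy class of three members; the bit positions are invented for illustration.
toy_class = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 6, 7}),
]
mpt = mean_pairwise_tanimoto(toy_class)  # (3/5 + 2/6 + 2/6) / 3
```

A class whose members share most of their fingerprint bits gives an MPT close to 1, while a class of structurally unrelated members gives a value close to 0, which is the property the hypothesis relies on.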
The class names were distributed to each of five manual annotators, each of whom held at least an undergraduate degree in chemistry. Each annotator selected an appropriate label for each class from the following:

• Structural – the name of the class indicates that all members contain a specific substructure e.g. ketone or methyl ester.
• Functional – the name of the class indicates that all members share a common function, usage, property or other non-structural feature e.g. antibiotic or surfactant.
• Semi-structural – the name of the class indicates something about the structure or composition of the members, but not that they share a specific substructure e.g. isomers of C6H10O or bicyclic systems.

The annotators carried out this task by being presented with a list of the hypernyms as they were curated from the original patent texts, the order of which was randomised between the annotators, along with a short set of instructions. An example set of instructions and class names is included in Appendix B.

² The CDK fingerprinter operates in a similar manner to Daylight fingerprints, by generating a comprehensive set of paths through a given connection table as opposed to by comparison to a pre-specified set of fragments. Consequently, the results are not biased by decisions made by the CDK authors regarding which fragments are appropriate for inclusion.

The results of this process are presented in Table 4-6, in which the chemical classes are sorted in ascending order of their mean pairwise Tanimoto (MPT) scores. Each chemical class has been assigned a label of structural, semi-structural or functional according to the consensus of the manual annotators and the classes are coloured green (structural), yellow (semi-structural) and red (functional) accordingly.
Hypernym                                   MPT      Manual annotator votes (Structural, Semi-structural, Functional)
base                                       0.145    5
inert solvent                              0.147    5
solvent                                    0.159    5
organic solvent                            0.195    1 4
halogenated hydrocarbon                    0.236    1 4
mineral and carboxylic acid                0.244    3 2
organic acid                               0.256    1 3 1
hydrocarbon                                0.264    1 4
amine                                      0.279    5
tertiary amine                             0.288    5
halogenated α-olefin                       0.288    4 1
aromatic ether compound                    0.336    5
cyclic olefin                              0.411    4 1
diolefin                                   0.431    4 1
suitable solvent                           0.448    5
trihydrocarbon-substituted phosphine       0.467    4 1
halogenated styrene                        0.499    4 1
alkylamine                                 0.508    5
olefin                                     0.511    4 1
dihydrocarbon-substituted phosphine        0.527    4 1
alcohol                                    0.534    5
aliphatic unsaturated ether compound       0.546    5
α-olefin                                   0.569    4 1
aliphatic monoether compound               0.587    5
monohydrocarbon-substituted phosphine      0.590    4 1
alkylstyrene                               0.594    5
ether compound                             0.684    4 1
straight monoolefin                        0.702    5

Table 4-6: Classification of structural & non-structural classes

It can be seen from Table 4-6 that in 13 cases the manual annotators agreed unanimously on a label and on a further 13 occasions they agreed by a 4-1 margin. Of the remaining two cases, the class “mineral and carboxylic acid” likely indicates a union of two separate classes, “mineral acid” and “carboxylic acid”, and it is unsurprising if this caused some confusion among the annotators with regard to how to proceed. The one further class, “organic acid”, was voted semi-structural by a 3-1-1 margin.

It can be seen from Table 4-6 that ordering the classes by their MPT results in a perfect separation of the structural, semi-structural and functional classes with the exception of the classes “mineral and carboxylic acid” and “suitable solvent”. The first of these classes, as previously discussed, represents a union of two classes and was the subject of disagreement among the annotators over its nature.
The second, “suitable solvent”, was unanimously voted as a functional class but has an MPT score that places it firmly among the structural classes, indicating that the process outlined in this section does not work in all cases.

The ability to automatically determine if a molecular class is structural or non-structural does not have an immediate application within the current work, but it is thought that it may prove useful in the future in several ways. Firstly, it may prove a useful pre-screen to identify structural classes for a system that attempts to automatically determine the functional groups that define them, e.g. esters are those compounds that contain the substructure CC(=O)OC. Secondly, it may assist in the development of a system for automatic categorisation of reactions in the sense that a reaction that operates on one member of a structural class may be reasonably expected to work on other members of the class, while this is not the case for non-structural classes. No attempt was made to implement the systems described here as part of the current work, and due to the relatively low number of molecular classes examined here further work is warranted to demonstrate and validate the performance of the current methodology before attempting to build a production system.

4.4.2 Detection of Useful Relationships

It was hypothesised that the hyponymic relations identified by the application of Hearst Patterns to chemical documents may be of use for the supervised or unsupervised creation or enrichment of formal knowledge bases. In order to test this hypothesis, the set of trimmed molecular classifications identified as acceptable in section 4.3.4 was compared to the ChEBI Ontology (47; 48). The domain of ChEBI and that of the patent corpus do not perfectly overlap.
The patents cover a broad area of chemistry, while the species present in the ChEBI Ontology are described as those that are “used to intervene in the processes of living organisms (either on purpose, as for drugs, or by accident, as for chemicals in the environment)” (95). Consequently, many of the automatically acquired relations are irrelevant to ChEBI and would not be expected to be present therein. The comparison was therefore restricted to fields relevant to ChEBI. Those selected were solvents, bases and the various classes of drugs exemplified in the molecular classifications.

During this work, two questions were addressed for each of the automatically identified hyponymic relations – was the chemical species represented by the hyponym present in the ChEBI Ontology and, if so, was the containing hyponymic relation defined? The first question was addressed by searching for the chemical substance concerned by the chemical names used in the original patents, by other common names of the substance known to the author and by the generated InChIs attached to the hyponym classes. Searching was performed using the web interface available at http://www.ebi.ac.uk/chebi/init.do. The second question was answered by considering whether the chemical substance was classified directly or indirectly (i.e. via one or more intermediate classifications, as in Figure 4-7) as belonging to a lexically or semantically equivalent classification. The results of these investigations are presented in Table 4-7, Table 4-8 and Table 4-9.

Figure 4-7: Indirect ChEBI classification of acetone as a solvent [classification chain shown in the figure: Solvent, Aprotic solvent, Polar aprotic solvent, Acetone]

Chemical name          Structure present in ChEBI?    Hyponymic relation present in ChEBI?
Methanol               Yes                            Yes
Ethanol                Yes                            Yes
DMSO                   Yes                            Yes
Water                  Yes                            Yes
Acetone                Yes                            Yes
Benzene                Yes                            Yes
Toluene                Yes                            Yes
Glyme                  Yes                            Yes
1,2-dichloroethane     Yes                            Yes
Chloroform             Yes                            No
Xylene³                Yes                            No
Dioxane                Yes                            No
NMP                    Yes                            No
Anisole                Yes                            No
DMF                    Yes                            No
THF                    Yes                            No
Acetonitrile           Yes                            No
Pyridine               Yes                            No
DCM                    Yes                            No

Table 4-7: Comparison of solvent hyponyms with ChEBI

³ The name “xylene” may refer to ortho-, meta- or para-xylene. All three were present in the ChEBI Ontology and none was recorded as being a solvent.

It can be seen from Table 4-7 that a number of common solvents, notably DMF, THF, DCM and acetonitrile, are included in the ChEBI Ontology but are not defined therein as being solvents. This is not due to solvents being considered irrelevant to ChEBI, since a number of solvents are present and defined as such and the ontology defines the solvent classes “polar solvent”, “non-polar solvent”, “protic solvent” and “aprotic solvent” among others. It is surprising that of the 19 solvents identified, 10 were not defined as such in ChEBI in spite of the presence of suitable superclasses.

Chemical name                    Structure present in ChEBI?    Hyponymic relation present in ChEBI?
Sodium hydroxide                 Yes                            Yes⁴
Potassium hydroxide              Yes                            Yes⁵
Triethylamine                    Yes                            No
Pyridine                         Yes                            No
n-BuLi                           Yes                            No
Imidazole                        Yes                            No
Potassium carbonate              No                             N/A
Sodium hydride                   No                             N/A
N,N-diisopropyl-N-ethylamine     No                             N/A
DMAP                             No                             N/A

Table 4-8: Comparison of base hyponyms with ChEBI

⁴ Recorded in ChEBI as a “metallic base”
⁵ Recorded in ChEBI as a “metallic base”

Table 4-8 demonstrates that the formal definition of bases in ChEBI is poor. Indeed, a manual search indicates that the “metallic base” class to which sodium and potassium hydroxide belong is the only class of bases defined.

Drug class                          Chemical name      Structure present in ChEBI?    Hyponymic relation present in ChEBI?
Biguanide                           Metformin          Yes                            Yes
Cholesterol absorption inhibitor    Ezetimibe          Yes                            Yes⁶
Anti-estrogen                       Tamoxifen          Yes                            Yes⁷
                                    Toremifene         Yes                            Yes⁸
                                    Raloxifene         Yes                            Yes⁹
H+, K+ ATPase inhibitor             Omeprazole         Yes                            No
                                    Lansoprazole       Yes                            Yes¹⁰
Antibiotic                          Tetracycline       Yes                            Yes
                                    Metronidazole      Yes                            No¹¹
                                    Amoxicillin        Yes                            Yes
Tricyclic¹²                         Amitriptyline      Yes                            Yes
                                    Imipramine         Yes                            Yes
                                    Doxepin            Yes                            Yes
Antipsychotic drug                  Thioridazine       Yes                            Yes
                                    Haloperidol        Yes                            Yes
Psychostimulant                     Methylphenidate    Yes                            No
Diuretic                            Amiloride          Yes                            No
                                    Furosemide         Yes                            Yes

Table 4-9: Comparison of drug hyponyms with ChEBI

⁶ Recorded in ChEBI as an “anticholesteremic drug”
⁷ Recorded in ChEBI as an “estrogen receptor antagonist”
⁸ Recorded in ChEBI as an “estrogen antagonist”
⁹ Recorded in ChEBI as an “estrogen receptor modulator”
¹⁰ Recorded in ChEBI as a “proton pump inhibitor”
¹¹ Recorded in ChEBI as an “antitrichomonal drug”
¹² A common abbreviation of “tricyclic antidepressant”, of which class the three examples are recorded as members in ChEBI

It should be expected that the drugs included in the molecular classifications should all be present in the ChEBI Ontology, since in order for a hyponymic relation and therefore its hyponym to be included, the hyponym must be resolvable to a connection table by OSCAR3 – which uses ChEBI as a source of linked trivial names and associated structures. It might also therefore be expected that the drug classifications deduced from the literature would similarly be formally encoded into ChEBI, though this is not the case for 4 of the 18 drugs present in the corrected set of molecular classifications. This observation, as well as the high proportion of identified solvents and bases that were not present in or defined as such in ChEBI, suggests that the technology developed for the automatic acquisition of hyponymic relations is likely to be of use in the creation of such formal classifications of chemical substances.
Though it is likely that a human curator would not want to enable fully automatic deposition of hyponymic relations due to justifiable concerns about accuracy, such a system may act as a highly useful tool to identify both common molecular classes and examples thereof.

4.4.3 Application to Data Searching

The automatic generation of dictionaries of common molecular classes provides a means of support for data searching in that it permits the formulation of queries such as “show me reactions that use a strong base in a chlorinated solvent” without the need for the user to define the terms “strong base” or “chlorinated solvent”. The combination of the hyponymic relations derived in the current work with data derived from other sources is made possible by the use of the web standard OWL for their formal descriptions. This work has been carried out by Dr Lezan Hawizy and is discussed in section 6.2.

4.5 Conclusions

The work in this section has demonstrated the implementation of a system capable of the automatic detection of molecular classes based on their specification in the literature that operates with a high degree of accuracy by combining established methods with novel technology. It has further discussed ways in which the information produced may be of use to the human curators of the ChEBI ontology and to the broader community. The techniques used in this process are ones that scale acceptably with the available volume of literature and it is believed that they will be of value in the future.

5. High-Throughput Abstraction of Chemical Reactions – PatentEye

This chapter describes the implementation of a novel framework – tentatively named PatentEye – for the high-throughput and automatic abstraction of chemical reactions. As discussed in the introduction to the current work, the liberation of scientific data and its conversion to machine-understandable forms holds great promise.
The reactions that chemists perform and report in great number are a key part of the chemical sciences, and the goal of the creation of PatentEye was to demonstrate the potential to create an automated system capable of extracting reactions from the literature, creating machine-understandable representations using Chemical Markup Language (CML) and sharing them as open data.

To increase the reliability of the extracted syntheses, PatentEye attempts to validate the identified product molecules. This is achieved by comparison of a candidate product molecule with any accompanying structure diagram, using the package OSRA (see section 2.2.8) for image interpretation, and with any accompanying NMR and mass spectra, using the OSCAR3 data recognition functionality (see section 2.2.6.5). The identified NMR spectra are considered to be valuable data in their own right and are extracted and retained for use in later works.

The implemented system is automated to the degree that it is capable of operating with minimal user interaction, and consequently the PatentEye workflow consists of a number of stages of processing. First, chemical patents are identified within the online archive of the European Patent Office (EPO) and are downloaded. The XML documents supplied by the EPO are then semantically enhanced so as to delimit sections and subsections of the text and to introduce additional metadata such as SMILES strings representing the content of structure diagrams and OSCAR3 data markup to describe identified spectra. Finally, reactions are extracted from these semantically enhanced documents using ChemicalTagger (see section 2.2.7) and are converted to CML. The working of each of these steps is described in this chapter.

5.1 Downloading Patents

As discussed in section 2.1.3, the patents published by the European Patent Office were selected for the current work.
In order to simplify the acquisition of a suitably large corpus of documents from which to work, as well as to facilitate future increases in the scale of PatentEye’s operation, software was written to automate the process of identification and downloading of the patent files. The operation of this software is subsequently discussed.

5.1.1 EPO Web Interface

The European Patent Office (EPO) publishes patent documents through the European Publication Server, hosted at https://data.epo.org/publication-server/. This platform is designed for a human, interacting with it using a web browser. A user performs a search by entering their search parameters – a patent ID, a date range within which to search and a list of document kinds (see Table 5-1) to search for – into an HTML form, and upon submission an HTTP POST request to https://data.epo.org/publication-server/search is executed, specifying the required parameters. The server responds by redirecting the browser to https://data.epo.org/publication-server/result-list using the HTTP 302 status code, and the page at this address uses Asynchronous JavaScript and XML (AJAX) to retrieve a table of results for the search, an example of which is shown in Figure 5-1.

Figure 5-1: Search results for EP 1777210

This table comprises a list of the available documents related to the specified patent. The documents are each assigned a “kind code”, which classifies the contents of the documents according to the definitions shown in Table 5-1.

Kind code    Definition
A1           European patent application published with European search report
A2           European patent application published without European search report
A3           Separate publication of the European search report
A8           Corrected title page of an A document, i.e. A1 or A2 document
A9           Complete reprint of an A document, i.e. A1, A2 or A3 document
B1           European patent specification (granted patent)
B2           New European patent specification (amended specification)
B3           European patent specification (after limitation procedure)
B8           Corrected title page of a B document, i.e. B1 or B2 document
B9           Complete reprint of a B document, i.e. B1 or B2 document

Table 5-1: Definitions of EPO kind codes, taken from EPO website (96)

The search results table also provides links to the XML, PDF and ZIP files for the document, if these are available. Not all file types are available for all patent documents – if the patent is published under the Patent Cooperation Treaty (PCT), for example, then the PDF link is replaced by a PCT link and a full-text XML version is unavailable.

5.1.2 Automated Downloading of EPO patents

In order to download documents using the web interface described previously, it is first necessary to determine an appropriate patent ID number. A user may already know the ID of the patent in which they are interested if, for example, it has been cited by another document or if it has been returned in the hit list of a patent searching tool. The European Publication Server, in addition to hosting the search interface, also publishes a series of weekly patent index files (97). Each of these files lists the patent documents published by the EPO in a certain week and provides further information such as the language in which the document is written and the International Patent Classification (IPC) codes which have been assigned to the document. The IPC is a subject-based hierarchical classification scheme for patent documents, e.g. chemistry and metallurgy are assigned the code “C”, organic chemistry is assigned the code “C07”, acyclic or carbocyclic compounds are assigned the code “C07C” etc. These index files thereby provide the means for an automated system such as PatentEye to identify chemical patents, or indeed those related to any other field.
PatentEye identifies patents to download as those that are written in English and that have been assigned one or more of the IPC codes listed in Table 5-2.

IPC Code    Description
C07B        General methods of organic chemistry; apparatus therefor
C07C        Acyclic or carbocyclic compounds
C07D        Heterocyclic compounds
C07F        Acyclic, carbocyclic or heterocyclic compounds containing elements other than carbon, hydrogen, halogen, oxygen, nitrogen, sulfur, selenium or tellurium

Table 5-2: Relevant IPC codes

The implementation of this process in PatentEye is provided by the EpoCrawler class. Once the list of chemical patent IDs has been derived from a patent index file, the IDs are passed to the PatentGrabber class, which uses the web crawler developed for the CrystalEye project (13) and interacts with the patent-searching web interface described above. The search request is submitted, and then the table of results is retrieved by replicating the AJAX request. This table is formatted in XHTML, allowing the PatentGrabber to identify which kinds of documents are available for the specified patent by reading the individual table rows. For the documents of kind A1, A2, A9 and B1 – each of which contains the full text of a patent – the URLs from which the ZIPs may be downloaded are identified and passed to the web crawler for download.

5.1.3 Formation of the Patent Corpus

The EpoCrawler was used to download the zipped form of the chemical patents from the EPO website for the ten weeks dated from May 6th 2009 to July 8th 2009. To prevent duplication of patents within the corpus, files were then deleted such that only one document remained within the corpus for each patent ID. The order of priority used to determine which document to retain was A9 > A1 > A2 > B1. In total, the patent corpus comprises 690 documents from across the ten weeks, and the number of ZIP files originally downloaded and the number of unique patents from which they are drawn are shown in Figure 5-2.
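The selection and de-duplication rules described above can be sketched as follows. This is an illustrative Python sketch, not the EpoCrawler or PatentGrabber code: the record layout (dictionaries with `id`, `language`, `ipc` and `kind` keys), the `"en"` language value and the example patent IDs are all hypothetical, and IPC codes are assumed to be matched on their four-character subclass prefix.

```python
RELEVANT_IPC = ("C07B", "C07C", "C07D", "C07F")  # subclasses listed in Table 5-2
FULL_TEXT_KINDS = ("A9", "A1", "A2", "B1")       # ordered by retention priority

def is_chemical_patent(record: dict) -> bool:
    """True if the index record is in English and carries a relevant IPC code."""
    return record["language"] == "en" and any(
        code.startswith(RELEVANT_IPC) for code in record["ipc"]
    )

def deduplicate(documents: list) -> list:
    """Keep one full-text document per patent ID, preferring A9 > A1 > A2 > B1."""
    best = {}
    for doc in documents:
        if doc["kind"] not in FULL_TEXT_KINDS:
            continue  # other kinds are not downloaded as full text
        rank = FULL_TEXT_KINDS.index(doc["kind"])
        current = best.get(doc["id"])
        if current is None or rank < FULL_TEXT_KINDS.index(current["kind"]):
            best[doc["id"]] = doc
    return list(best.values())

# Hypothetical index records and downloaded documents.
index = [
    {"id": "EP0000001", "language": "en", "ipc": ["C07D 401/12", "A61K 31/00"]},
    {"id": "EP0000002", "language": "de", "ipc": ["C07C 1/00"]},
    {"id": "EP0000003", "language": "en", "ipc": ["H04L 9/00"]},
]
selected = [r["id"] for r in index if is_chemical_patent(r)]

docs = [
    {"id": "EP0000001", "kind": "B1"},
    {"id": "EP0000001", "kind": "A1"},
    {"id": "EP0000004", "kind": "A2"},
    {"id": "EP0000004", "kind": "A3"},
]
kept = {d["id"]: d["kind"] for d in deduplicate(docs)}
```

Encoding the priority as the order of a tuple keeps the rule in one place, so a change to the retention order would only require the tuple to be edited.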
Figure 5-2: Variation of downloaded and unique patents in the corpus [chart of the number of downloaded zips and the number of unique patents for each weekly publication date from 2009-05-06 to 2009-07-08; vertical axis from 0 to 120]

Of these 690 zips, it was found that 23 did not contain the XML version of the patent under the expected file name. Further work that uses the patent corpus is therefore based on a reduced corpus of 667 unique, full-text patent documents where the XML files are used as input.

5.2 Document Enhancement

As discussed previously, in section 2.1.3.1, the different sections of the XML-formatted patent documents are not clearly defined. The content of the description element is relatively flat – that is to say, the heading and p (paragraph) children are siblings of one another, as in the following:

<description>
  <p>...</p>
  <heading>Heading 1</heading>
  <p>...</p>
  <p>...</p>
  <heading>Heading 1.1</heading>
  <p>...</p>
  <heading>Heading 1.2</heading>
  <p>...</p>
  <heading>Heading 2</heading>
  <p>...</p>
  <p>...</p>
</description>

To a human reader, it is a simple task to realise that the headings 1.1 and 1.2 are subsections of Heading 1, and that each of the paragraphs belongs to a section of the document that begins with the preceding heading. Since this is not made explicit in the structure of the XML, however, it is not trivially obvious to a machine that the document should be read in such a way. For this reason it is desirable to deflatten the XML – to rewrite the document such that as much of the implicit structure is made explicit as possible. This rewritten document is then saved to disk in order to prevent unnecessary repetition of the task.

A number of other semantic enhancements are performed on the patent documents at this stage.
These tasks include the application of OSCAR3 data recognition to identify spectral data within the text, the application of OSRA to add SMILES representations of the chemical structure images contained within the documents, the recognition and annotation of references in the text to other sections of the document, e.g. “the reaction was performed as in example 12”, and the identification and labelling of the paragraphs in the text that form part of an experimental section. The theory and application of these steps are subsequently discussed.

5.2.1 Paragraph Deflattening

In this step, the description element of the patent document is checked for paragraph children. Any p elements that are found are detached from the document and re-attached as a child of the heading element that most recently precedes them. Any p elements that occur before the first heading child of the description element are ignored by this process. For example, the example of XML in the preceding section would be reformatted as follows:

<description>
  <p>...</p>
  <heading>Heading 1
    <p>...</p>
    <p>...</p>
  </heading>
  <heading>Heading 1.1
    <p>...</p>
  </heading>
  <heading>Heading 1.2
    <p>...</p>
  </heading>
  <heading>Heading 2
    <p>...</p>
    <p>...</p>
  </heading>
</description>

Before this reformatting, the heading element was acting as an annotation on the heading text. While it can still be inferred that the text inside a heading element and preceding the first p element is the heading text, the reformatting process has destroyed the explicit declaration and created mixed content.
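The re-parenting step can be sketched as follows. This is a minimal illustration using Python's standard `xml.etree.ElementTree` module, not the code used in PatentEye; the function name is hypothetical and only the `description`, `heading` and `p` element names are taken from the text.

```python
import xml.etree.ElementTree as ET

def deflatten_paragraphs(description: ET.Element) -> None:
    """Move each <p> child of <description> under the most recent preceding <heading>."""
    current_heading = None
    for child in list(description):  # iterate over a copy, as the tree is modified
        if child.tag == "heading":
            current_heading = child
        elif child.tag == "p" and current_heading is not None:
            description.remove(child)
            current_heading.append(child)
        # <p> elements that occur before the first heading are left in place

flat = ET.fromstring(
    "<description><p>a</p><heading>Heading 1</heading><p>b</p><p>c</p>"
    "<heading>Heading 1.1</heading><p>d</p></description>"
)
deflatten_paragraphs(flat)
result = ET.tostring(flat, encoding="unicode")
```

In the resulting tree the heading text remains as the leading text of each heading element, followed by the re-attached p children, which is the mixed content referred to above.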
To remove the requirement to infer the heading title, the heading text is removed from the document and made into a title attribute on the heading element, to form a document of the following form:

<description>
  <p>...</p>
  <heading title='Heading 1'>
    <p>...</p>
    <p>...</p>
  </heading>
  <heading title='Heading 1.1'>
    <p>...</p>
  </heading>
  <heading title='Heading 1.2'>
    <p>...</p>
  </heading>
  <heading title='Heading 2'>
    <p>...</p>
    <p>...</p>
  </heading>
</description>

5.2.2 Document Segmentation

As previously discussed, the EPO do not attempt to explicitly demarcate in their XML the existence of sections of a patent document. Headings in the document are denoted by use of the heading tag, but otherwise the reader is left to infer for themselves where subheadings occur and to which headings they belong. This lack of formal structure in the document is a barrier to the automated processing of the patent documents as it prevents a machine from making context-specific decisions about how to behave. At this stage in the semantic enrichment process, an attempt is made to formalise the document’s implicit structure.

5.2.2.1 Primary Sections

The EPO’s instruction document, “How to get a European Patent – Guide for Applicants” (98), states:

In the description you must:
(a) Specify the technical field to which the invention relates. You may do this for example by reproducing the first ("prior art") portion of the independent claims in full or in substance or by simply referring to it.
(b) Indicate the background art of which you are aware, to the extent that it is useful for understanding the invention, preferably citing source documents reflecting such art. This applies in particular to the background art corresponding to the prior art portion of the independent claims. Source document citations must be sufficiently complete to be verifiable: patent specifications by country and number; books by author, title, publisher, edition, place and year of publication and page numbers; periodicals by title, year, issue and page numbers.
(c) Disclose the invention as claimed.
(d) The disclosure must indicate the technical problem that the invention is designed to solve (even if it does not state it expressly) and describe the solution.
(e) To elucidate the nature of the solution according to the independent claims you can repeat or refer to the characterising portion of the independent claims (see example) or reproduce the substance of the features of the solution according to the relevant claims.
(f) At this point in the description you need only give details of embodiments of the invention according to the dependent claims if you do not do so when describing ways of performing the claimed invention or describing what the drawings show.
(g) You should state any advantageous effects your invention has compared with the prior art, but without making disparaging remarks about any specific previous product or process.
(h) Briefly describe what is illustrated in any drawings, making sure you give their numbers.
(i) Describe in detail at least one way of carrying out the claimed invention, typically using examples and referring to any drawings and the reference signs used in them.
(j) Indicate how the invention is susceptible of industrial application within the meaning of Article 57.

Between them, these points define six topics that must be addressed in the patent. Consequently, European patents tend to follow a regular structure – primary sections of the documents are commonly headed according to the areas mandated in the instructions using relatively standardised terms. As a result, these heading titles may be matched using regular expressions. The regular expressions used for matching the most common section titles are given in Table 5-3.
Each of these section titles corresponds to one of the areas mandated by the guide for applicants, with the exception of the final entry – while a summary of the invention is not mandated by the EPO it is a very common feature, and one for which the terminology is sufficiently consistent that it can be matched by regular expressions.

Section Title              Regular Expression
Field                      (.*\\s+)?(technical\\s+)?field(\\b.*)?
Prior Art                  (.*\\s+)?(prior|background|related)(\\b.*(art(s)?|invention)(\\b.*)?)?
Disclosure of Invention    (.*\\s+)?(detailed description|(disclosure|description)\\b.*\\s+invention)\\b.*
Description of Drawings    (.*\\s+)?description of (.*\\s+)?drawing(s)?(\\b.*)?
Mode of Carrying Out       (.*\\s+)?modes?(\\s+.*carry.+(\\s+.*)?)?
Industrial Applicability   industrial applicability
Summary of Invention       (.*\\s+)?summary(\\s+.*invention(\\b.*)?)?

Table 5-3: Regular expressions for identifying primary section headings

These regular expressions are used to identify the primary sections of the patent description. Once this has occurred, the structure of the descendant elements of the patent’s description element is modified with the intent that the only child elements of the description should be the identified primary section headings. Non-primary headings that are found after a primary heading are detached from the document and reattached as a child element of the preceding primary heading, leaving those that occur before the first identified primary heading in place. In this manner, the flat structure of the XML provided by the EPO is converted into a tree that reflects the implicit structure of the document.

In addition to this alteration of the XML structure, the names of the primary heading elements are normalised to enable their trivial location within the document during later work.
Each of the primary headings has a separate keyword to which the element name is changed, for example the disclosure of invention headings are renamed “disclosureOfInvention”, while the summary of invention headings are renamed “summaryOfInvention”. Similarly, the remaining heading elements within the description that match the regular expression “(.*\\b)?example(:|\\-)?(\\s+.*)?” are renamed “example” in the same manner as the primary section headings.

Though the PatentEye functionality implemented as part of the current work does not ultimately make use of the primary section headings, the ability to identify the primary sections is thought to be of sufficient potential use to future applications that it has been retained within the current codebase. Since the functionality was not used in the current work, the accuracy of the regular expressions at recognising variations in the wording of the primary section headings by individual authors was not assessed.

5.2.2.2 Identification of Consecutive Headings

Very often, a document contains a set of headings that are intended to form a series of consecutive headings. Most notably in the field of patents, if a document contains the headings “example 1”, “example 2” and “example 3”, it is apparent that these headings form a series and that, in general, other headings that occur within such a series are in fact subheadings of those headings that form the series. This phenomenon provides a useful means by which a machine may identify the implicit structure within the document, and by restructuring the document it is possible to make this structure explicit. This procedure is implemented as part of the current work and is subsequently discussed.

Recognisable Consecutive Headings

Two formats of consecutive headings were identified for this work: those where the heading text is invariant save for an incrementing index, e.g. “example 1”, “example 2”, etc., and those in which the headings do not necessarily share any text, but the sequential nature of the headings is identifiable from the presence of the incrementing token at the beginning of the headings, e.g. “A. Introduction”, “B. Methods”, etc. It was also noted that it is common for headings describing example compounds to include the name of the compound, e.g. “Step 2: Preparation of methyl 5,8-dihydroxy-1,6-naphthyridine-7-carboxylate”, and that this should not be allowed to prevent the formation of lists of consecutive headings.

In order to recognise these sets of consecutive headings, a representation is first generated of the title text of each of the headings in which the incrementable tokens and chemical names have been normalised. Firstly, chemical names are identified by passing the title text to OSCAR3 for named entity recognition and by replacing instances of chemical names with the string “$CM”. Subsequently, incrementable tokens are identified by matching against the regular expression;

(\\d+[a-z]?(?!-)\\b|^[A-Z](?!-)\\b)

This regular expression matches two types of incrementables. Firstly, it may match a number optionally followed by a single letter (e.g. “7” or “12b”) followed by a word boundary and not a dash (to avoid matching pieces of chemical nomenclature such as in “2-methyl”). Secondly, it may match a single uppercase character at the beginning of a string that is, again, followed by a word boundary but not a dash (to avoid matching, for example, “N-methyl”). The substrings that are matched by this regular expression are replaced with the string “$IN”.

Representations created in this way may then be compared by simple string equivalence to determine if the headings for which they were generated are of a common format. If so, and if the incrementing tokens are consecutive, then the headings may constitute consecutive headings.
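The normalisation step described above can be sketched in Python as follows. The pattern is the one quoted in the text, written with single backslashes; the chemical_names argument is a hypothetical stand-in for OSCAR3 named entity recognition, and the function names are illustrative rather than taken from PatentEye.

```python
import re

# Incrementable-token pattern quoted in the text (single backslashes in Python).
INCREMENTABLE = re.compile(r"(\d+[a-z]?(?!-)\b|^[A-Z](?!-)\b)")


def normalise_heading(title, chemical_names=()):
    """Replace chemical names with $CM, then incrementable tokens with $IN.

    chemical_names is supplied by the caller here; in the text this step is
    performed by OSCAR3 (assumption made purely for illustration).
    """
    for name in chemical_names:
        title = title.replace(name, "$CM")
    return INCREMENTABLE.sub("$IN", title)


def same_format(title_a, title_b):
    """Two headings share a format if their normalised forms are identical."""
    return normalise_heading(title_a) == normalise_heading(title_b)
```

Chemical names are replaced before the incrementable tokens because locants such as the “5,8” in a compound name would otherwise be matched as incrementables. The sketch does not check that the matched tokens are consecutive.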
This procedure must be applied with care in order to correctly identify lists of consecutive headings. Consider the following list of headings;

Example 1
Step 1
Step 2
Example 2
Step 1
Step 2
Step 3

It should be clear to the reader that these headings describe two examples, each of which is broken down into a number of steps. The first example has two subheadings while the second has three. The heading “Step 2” in “Example 1” has no subsequent heading in its consecutive heading list, and it is important not to identify “Step 3” in “Example 2” as such. This is achieved by dividing the list of headings in the document into smaller lists after each identification of consecutive headings, using the headings in the consecutive heading list as the splitting points. This procedure is illustrated in Figure 5-3.

Example 1        -> Example $IN      -> Example $IN      -> Example $IN
Step 1           -> Step $IN         -> Step $IN         -> Step $IN
Method           -> Method           -> Method           -> Method
Characterisation -> Characterisation -> Characterisation -> Characterisation
Step 2           -> Step $IN         -> Step $IN         -> Step $IN
Method           -> Method           -> Method           -> Method
Characterisation -> Characterisation -> Characterisation -> Characterisation
Example 2        -> Example $IN      -> Example $IN      -> Example $IN
Step 1           -> Step $IN         -> Step $IN         -> Step $IN
Method           -> Method           -> Method           -> Method
Characterisation -> Characterisation -> Characterisation -> Characterisation
Step 2           -> Step $IN         -> Step $IN         -> Step $IN
Method           -> Method           -> Method           -> Method
Characterisation -> Characterisation -> Characterisation -> Characterisation
Step 3           -> Step $IN         -> Step $IN         -> Step $IN
Example 3        -> Example $IN      -> Example $IN      -> Example $IN

Figure 5-3: Identification of and Document Restructuring Using Consecutive Headings

In the first step, the title text of each of the headings is normalised by replacement of chemical names and incrementable elements, as previously discussed.
The first heading, “Example 1”, is then identified as being part of a consecutive heading list with “Example 2” and so the list of headings is divided into two, shown in blue and green, using these headings as the splitting points. Now, when the headings that follow the blue “Step 1” heading are sought, none of the subheadings of “Example 2” can be included, as these headings are no longer contained in the same heading list. This results in the correct identification of lists of step headings, as shown in the example above.

Once the lists of consecutive headings have been created, the document is restructured accordingly. Firstly, a check is carried out to ensure that the headings in each list are siblings, i.e. that they share a common parent element. If this is found not to be the case, the process aborts by throwing an UnrecognisedStructureException. Otherwise, the document restructuring proceeds by iterating through the list of all headings that are siblings of those in the list. Once the first heading in the list is reached, subsequent headings that are not also members of the list are detached from the document and reattached as children of the preceding member. This process is terminated upon reaching the final member of the list, with the result that the subheadings of this heading are not affected. This omission is accepted because no other method is available to determine the subheadings of the final member.

5.2.3 Data Annotation

PatentEye takes advantage of the previously-developed and previously-described functionality for the recognition and annotation of experimental data that exists within OSCAR3. To enable the usage of this functionality, software was developed to support the application of OSCAR3 data annotations to an original source XML document and improvements were made to the performance of the OSCAR3 data recognition. This work is subsequently described.
5.2.3.1 Development of Annotation Framework

OSCAR3 is used to annotate reports of spectral data within the patent documents. OSCAR3, however, does not directly allow for the annotation of arbitrary XML documents. Instead, the patent documents must first be converted to SciXML, as described previously, before annotation is performed. This produces an annotated SciXML document, but the process of format conversion has destroyed much of the valuable markup that is included in the patent XML files. To work around this problem, software was written to identify the sections of text in the original document that correspond to the sections annotated by OSCAR3 in its SciXML documents in order to allow the annotations to be added to the original document. It is hoped that this process will be made redundant by the addition of the capability to annotate arbitrary XML documents in a future version of the OSCAR software. The process of creating and transferring OSCAR3’s data annotations is illustrated in Figure 5-4.

Figure 5-4: Software architecture for the application of OSCAR3 data annotations to patent XML documents

The DataAnnotator class operates on a single paragraph of text, held in memory as a nu.xom.Element. The text content of the paragraph is handed to OSCAR3 to create a SciXML document, and this document is used by the DataParser class in OSCAR3 to produce a set of inline annotations. A user-modifiable subset of the inline annotations, by default comprising those indicating the presence of NMR, mass spectrometry, high-resolution mass spectrometry and IR spectroscopy, are then identified. The DataAligner class then locates the equivalent, unannotated, sections in the XML source of the input paragraph. This process does not proceed by simple substring matching, since the process of conversion to SciXML has normalised whitespace and removed markup from the source.
Consider the example of the following input paragraph;

<p>
  The compound was characterised as follows; MS <i>m/z</i> 189(M<sup>+</sup>). This confirms the experimental product as the desired product.
</p>

In the SciXML document on which OSCAR3 works, the i and sup tags will have been removed, and so they will be absent from the annotation produced by OSCAR3. The source XML of the input paragraph and the annotation, however, both share the text “MS m/z 189(M+)”. The DataAligner class, where possible, identifies the substring of the source XML that contains the text value of the annotation and, where present, XML tags from a predetermined set that correspond to style markup (such as in the example above) or surplus whitespace. The data annotation that has been created by OSCAR3 may then be copied into the input paragraph by replacing the section of source XML matched by the DataAligner with the source XML for the annotation. If this process cannot complete successfully, the DataAnnotator throws a FailedInlineException to indicate failure.

It is important to realise that the process described above cannot be guaranteed to produce well-formed XML. Consider the following input paragraph;

<p>
  <sup>1</sup>H NMR (400MHz): δ 1.20 (2H, s), 2.34 (1H, t, J=2.3Hz), 3.78 (2H, d, J=2.3Hz).
</p>

In this case, the substring identified as a match by the DataAligner will include the closing </sup> tag but not the opening <sup> tag. When this substring is replaced by the source XML for the annotation, this will result in the presence of an unbalanced opening tag. As a result, before the source XML produced by the method outlined above can be built to produce a paragraph suitable for insertion into the original document, it is necessary to identify and remove any such unbalanced tags. This process is handled by the DataAnnotator.
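The alignment step can be illustrated with a minimal Python sketch. It is not the DataAligner implementation: the set of style tags, the function names and the choice to ignore all whitespace are assumptions made for illustration. The sketch builds a pattern from the annotation text that tolerates style tags and whitespace between any two characters, then reports any tag in the matched fragment that lacks a partner.

```python
import re

# Assumed set of style tags; the text only says the set is predetermined.
GAP = r"(?:\s|</?(?:i|b|u|sub|sup)>)*"


def align(source_xml, annotation_text):
    """Return the substring of source_xml whose text equals annotation_text,
    ignoring whitespace and style tags, or None if no such substring exists."""
    chars = [c for c in annotation_text if not c.isspace()]
    pattern = GAP.join(re.escape(c) for c in chars)
    match = re.search(pattern, source_xml)
    return match.group(0) if match else None


def unbalanced_tags(fragment):
    """List tags in fragment that have no partner inside the fragment.
    Closing tags without an opener are reported with a leading slash."""
    open_tags, stray = [], []
    for closing, name in re.findall(r"<(/?)(\w+)>", fragment):
        if not closing:
            open_tags.append(name)
        elif open_tags and open_tags[-1] == name:
            open_tags.pop()
        else:
            stray.append("/" + name)
    return open_tags + stray
```

Applied to the second example paragraph above, align returns a fragment beginning “1</sup>H NMR”, and unbalanced_tags reports the stray closing tag, which is the situation that must be repaired before the paragraph is rebuilt.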
5.2.3.2 Measurement of OSCAR3 Performance

In order to test the performance of OSCAR3’s DataParser on the EPO patents, a corpus of paragraphs was constructed from those patents that had successfully passed through the paragraph deflattening and document segmentation phases of the semantic enrichment procedure. These paragraphs were selected at random from those that were descendants of an example element, with each paragraph having an equal probability of selection. It was decided to limit the domain in this way in order to enrich the proportion of paragraphs containing experimental data and thereby reduce the time required to produce an annotated corpus of appropriate size. Of the paragraphs selected in this way, 1400 were divided into two sets of 700 each – one for development of the OSCAR3 regular expressions and one for validation of the development process. These paragraphs were manually examined, and sections of experimental text within the corpora were annotated according to the guidelines in Appendix C.

The manual annotation of the documents was intended to show the position of the spectral data within the text and the type of spectrum present, but not to fully identify the components of the spectra e.g. peaks, shifts, etc. As a result, the annotated paragraphs were of the form;

<p id="p0166" num="0166" patent="EP 1343782B1.xml">
  Intermediate Example 17 (7 g, .037 mol) and 10% Pd/C (.7g) in a concentrated methanol solution were shaken under approximately 40 psi of H <sub>2</sub> in appropriate pressure vessel using a Parr Hydrogenator. When the reaction was judged to be complete based upon the consumption of the nitrobenzimidazole, it was diluted with EtOAc and filtered through Celite and silica gel, which was washed with a mixture of EtOAc and MeOH and concentrated. The product was carried on without purification.
  <spectrum type='hnmr'> 1H NMR (300 MHz, d<sub>6</sub> DMSO) δ 7.11 (d, J = 8.38 Hz, 1H), 6.69 (d, J = 1.51 Hz, 1H), 6.53 (dd, J = 8.38, 1.51 Hz, 1H), 4.65 (s, 2H), 3.62 (s, 3H), 2.43 (s, 3H) </spectrum> .
</p>

Accuracy of the OSCAR3 data annotations is assessed by the OscarValidator class. Each manually annotated paragraph is read into memory as a XOM Element and an un-annotated copy is made by removing the spectrum tags and rebuilding the XML as a XOM Element. This un-annotated paragraph is then passed to a DataAnnotator to produce an automatically-annotated paragraph that can be compared against the original manually-annotated copy. This comparison is performed by string comparison of the text of the annotations in the manually and automatically annotated versions of the paragraph and by comparing the values of the type attribute of the annotation. If an annotation of the same type with the same text is found in both the automatic and manually annotated paragraphs, a true positive (TP) is recorded. If an annotation from the manually-annotated paragraph is not matched in this way in the automatically-annotated paragraph, a false negative (FN) is recorded, and if an annotation from the automatically-annotated paragraph is not matched by one in the manually-annotated paragraph, a false positive (FP) is recorded.

Validation of the version of the DataParser contained in the OSCAR3 α5 release (99), the version that existed prior to the commencement of this work, using the development paragraphs produced the results shown in Table 5-4, where precision measures the proportion of the spectra annotated by the machine that were correctly annotated and recall measures the proportion of the spectra present in the text that were correctly annotated by the machine.
The two measures are defined as follows;

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]

Spectrum type   # in corpus   TP    FP   FN    Precision   Recall
MassSpec        227           104   51   123   67.1%       45.8%
HNMR            206           97    21   109   82.2%       47.1%
CNMR            12            6     2    6     75.0%       50.0%
IR              10            0     0    10    -           0.0%
HRMS            5             0     0    5     -           0.0%

Table 5-4: Performance of OSCAR3 α5 on the development corpus

The low rate of occurrence of CNMR, IR and HRMS data makes it difficult to draw conclusions from the performance of the data annotation, but the rates of recall for the MassSpec and HNMR data are disappointingly low compared to previously published values (100). This is due in part to the origin of the data recognition as part of the Experimental Data Checker application. The purpose of this application was to highlight mistakes made by authors in preparing their manuscripts to RSC requirements. It is desirable in that context to fail to identify data sections that do not conform to the style guidelines of the journal concerned in order to highlight the author’s mistake, but in the context of the current task it is appropriate to loosen this strict requirement in order to increase rates of recall.

5.2.3.3 Development of OSCAR3 Data Recognition

By manual comparison of the human and OSCAR3 annotations, a number of classes of common errors were identified. These errors are subsequently discussed. Illustrative examples are taken from the corpus of paragraphs that were examined during the development process.

Unspecified NMR Type

OSCAR3 recognises both 1H and 13C NMR and distinguishes between the two by requiring the presence of the literal string “1H” or “13C” respectively near the beginning of the text to be matched. Many of the reported NMR from the patents omitted this declaration, instead taking the form “NMR(CDCl3): δ 0.88(d, J = 6.6 Hz, 6H)…”. A trained chemist will easily infer that this is a 1H NMR since the integral is specified as “6H”; however, this form was not matched by any of the OSCAR3 regexes and it therefore went unannotated.
This problem was fixed by the creation of a new type of spectrum to be annotated – the unknownNmr. A spectrum is annotated as unknownNmr only if the isotope under examination is not identified, and no attempt is made to infer it from the text. Furthermore, it was noticed that some NMR spectra specify simply the element rather than the isotope under investigation, and were not being matched by the regular expressions. This problem was resolved by making the “1” and “13” preceding “H” and “C”, respectively, optional.

Inclusion of “ppm”

This problem occurred when the author appended “ppm” to the end of a spectrum, e.g. “13C-NMR (CDCl3): 12.02 (CH3), 23.64 (2C), 32.28 (2C), 38.81 (2C), 49.64, 54.09 (CH), 211.83 (C=O) ppm”. In these cases, OSCAR3 would annotate the spectrum as far as the final peak but omit to include the “ppm” in the annotation. This problem was trivially fixed by allowing the inclusion of this final token.

Negative Shifts

While uncommon, negative values of NMR shift are entirely valid, but were not accepted as such by OSCAR3. Upon examination, this was discovered to be part of a wider bug in the data regexes – both the character “+” and “-” were intended to be allowed before numbers to indicate sign. In regular expressions, however, the hyphen is a metacharacter when it occurs within a character class, where it denotes a range, and must therefore be escaped to give a literal hyphen in these circumstances. On a number of occasions throughout the data regexes unescaped hyphens were unintentionally used inside character classes, and where identified this problem was resolved.

Word Boundaries around the ‘Delta’ Character

It was noted that OSCAR3 was failing to annotate NMR spectra which contained the substring “δ:”, e.g. “1H-NMR (CDCl3) δ: 2.34 (3H, s), 2.66 (3H, s), 8.12 (1H, s)”. The top-level regular expression that captures a 1H NMR spectrum in OSCAR3 α5 is as follows;

 1 <node type="spectrum" id="hnmr" value="hnmr">
 2   <regexp parsegroup="0">
 3     <insert idref="nmrDelta" />?
 4     \b
 5     <insert idref="hNmr.Prolog"/>
 6     (?: \W* for\s+\w+ (?: (![\(\);]).)*?)?
 7     <insert idref="nmrMethod"/>?
 8     (\W+(<insert idref="nmrDelta"/>|H)+\b)?
 9     [\s:=]+?
10     (?: \W*ppm\W*?)?
11     (?: peaks\s+at\s+)?
12     (?:\s*<insert idref="nmrDelta"/>\s+)?
13     <insert idref="nmrPeakBlock"/>
14   </regexp>
15   <child type="quantity" id="hnmrSolvent"/>
16   <child type="quantity" id="hnmrStandard"/>
17   <child type="quantity" id="hnmrFrequency"/>
18   <child type="quantity" id="hnmrTemperature"/>
19   <child type="peaks" id="hnmr"/>
20 </node>

As can be seen above, the NMR Delta is matched in one (or more, if necessary) of three places – on lines 3, 8 and 12. Following the first two of these is a non-optional word boundary (“\b”). A word boundary in regular expressions is a zero-length match that can be made at the transition between a word character and a non-word character, a word character being any one of the lower case letters a-z, the upper case letters A-Z, the digits 0-9 and the underscore, “_”. The NMR Delta is subsequently defined as;

<def id="nmrDelta" type="const">(?:d|đ|δ|ä)</def>

When the NMR Delta is represented by the character “d” it is a word character; when it is represented by the character “δ” or one of the other alternatives it is not. Thus, a “d” is followed by a word boundary only if the next character is a non-word character, whereas a “δ” is followed by a word boundary only if the next character is a word character. The NMR Delta is rarely immediately followed by a word character, thus the literal delta, δ, is generally only matched on line 12 of the preceding regular expression. This is problematic where the delta is followed by one of the characters “=” and “:”, since these are only matched on line 9, i.e. before the literal delta can be matched. This problem was resolved by removing the requirement on line 8 for the NMR Delta to be followed by a word boundary, allowing for the character “δ” to be matched at this point.
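The word-boundary behaviour described above can be reproduced with a short Python sketch. Python's \b is Unicode-aware by default, so the re.ASCII flag is used here to restrict word characters to the set given in the text; this is an assumption made to mirror the behaviour described, not a statement about the regular expression engine used by OSCAR3. The function names are illustrative.

```python
import re


def delta_then_boundary(text):
    """True if text starts with an NMR delta ('d' or 'δ') that is immediately
    followed by a word boundary, with word characters limited to
    [A-Za-z0-9_] via re.ASCII (assumption: mirrors the definition in the text)."""
    return re.match(r"(?:d|δ)\b", text, re.ASCII) is not None


def delta_without_boundary(text):
    """Relaxed form: the delta need only be followed by whitespace, ':' or '='."""
    return re.match(r"(?:d|δ)[\s:=]+", text, re.ASCII) is not None
```

With the boundary in place, “d: 2.34” is accepted but “δ: 2.34” is rejected, because neither “δ” nor “:” is a word character and so no boundary lies between them. Dropping the boundary requirement accepts both forms.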
Typos

It was observed that a number of the manually annotated spectra were not recognised by OSCAR3 simply because they contained typos. Missing commas in lists of peaks, missing close brackets following a peak assignment and inappropriately positioned spaces, e.g. in the middle of a number, were observed, along with less frequent typos, and caused OSCAR3 to fail to annotate the spectrum correctly. Frequently, these mistakes would also cause OSCAR3 to incorrectly annotate a subsection of the full spectrum, producing a false positive in addition to the false negative. For example;

False negative: 1H-NMR(CDCl3,TMS) δ(ppm):2.55(3H,s),6.75-6.85(1H,m),7 .03(2H,d,J=8.4Hz),7.1- 7.2(1H,m),7.32(2H,d,J=8.4Hz)
False positive: 1H-NMR(CDCl3,TMS) δ(ppm):2.55(3H,s),6.75-6.85(1H,m),7

The errors produced by the missing separators in peak lists were simple to resolve, by making the presence of the separators in these lists optional. To produce regular expressions that thoroughly compensate for typos that authors may make would be inadvisable, however, since this would require the creation of regular expressions that can ignore much of the common structure that reports of spectra share – and that the strategy of regular expression matching exploits. Consequently, it would result in a far higher rate of false positives – increasing the recall of the system at the cost of precision.

Mass Spectrometry Peak Inversion

The format expected by OSCAR3 α5 for reports of mass spectra is as follows;

m/z: 597 (M + H)+

That is to say, the fragment mass should precede the assignment (if present). A number of spectra found in this analysis, however, exhibited an inversion of this format, i.e. the assignment preceded the mass of the fragment, e.g. “MS: MH+ = 445.2” or “ESIMS (M+H)+ = 624”. This problem was solved by the creation of regular expressions to match this alternative template.
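As an illustration of the inverted template, the following Python sketch matches the two examples quoted above. The pattern is hypothetical and much simpler than anything that would be needed in practice; it is not the regular expression added to OSCAR3. Note that the hyphen is placed last in each character class so that it is read as a literal character, which avoids the range problem described under Negative Shifts.

```python
import re

# Hypothetical pattern: an assignment, an equals sign, then the fragment mass.
INVERTED_PEAK = re.compile(
    r"(?P<assignment>\(M\s*[+-]\s*\w+\)[+-]?|MH[+-])"  # e.g. (M+H)+ or MH+
    r"\s*=\s*"
    r"(?P<mass>\d+(?:\.\d+)?)"                          # e.g. 624 or 445.2
)


def parse_inverted_peak(text):
    """Return (assignment, mass) for an inverted mass spectrum report,
    or None if the text does not follow the inverted template."""
    match = INVERTED_PEAK.search(text)
    if match is None:
        return None
    return match.group("assignment"), float(match.group("mass"))
```

The conventional form “m/z: 597 (M + H)+” is deliberately not matched by this sketch, since the mass precedes the assignment and no equals sign is present.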
Mass Spectrometry Partial Annotation

It was found that frequently the OSCAR3 annotation would exclude part of the manual annotation. For example;

Manual annotation: MS (ESI) m/z 467.5 [M+1]+
Automatic annotation: m/z 467.5 [M+1]+

Manual annotation: ESI-MSm/z: 455(M + H) +
Automatic annotation: m/z: 455(M + H)+

In these cases it was common that the part of the spectrum omitted from the annotation, as in the examples above, was a description of the method by which the spectrum was measured. However, in such cases the peak(s) and assignment(s) were included in the annotation. For this reason, the correction of these errors was not attempted as the annotation obtained by OSCAR3, while technically incorrect according to the method of measurement chosen, was of sufficient accuracy for the practical purposes of this thesis.

Mass Spectrometry Assignments

When a mass spectrum is reported, it is common that the reported mass is not that of the molecule for which the spectrum is being recorded. As in the examples above, it is common to report instead the mass of the most intense peak in the spectrum and to assign to this mass a chemical species, e.g. “288*M+Na+” or “385 (M+1)”. In these cases, if the data identified in the source document is to be put to good use, it is necessary to capture this assignment in a machine-understandable way such that the reported mass becomes meaningful. This process was not supported in OSCAR3 in advance of the current work – consider the inline annotations produced by OSCAR3 α5 for the spectrum “MS m/z 362 (M+1)”;

<spectrum type="massSpec">
  MS m/z
  <peaks type="..">
    <peak>
      <quantity type="mass">
        <value>
          <point>362</point>
        </value>
      </quantity>
      (M+
      <quantity type="intensity">
        <value>
          <point>1</point>
        </value>
      </quantity>
      )
    </peak>
  </peaks>
</spectrum>

While the assignment has been included in the spectrum annotation, it has not itself been specifically annotated.
In advance of this work, the data regexes in OSCAR3 allowed virtually any bracketed text to follow the numerical value in a mass spec peak, and did not attempt to identify assignments in mass spectra. This caused the numerical value, 1, in the assignment “M+1” to be interpreted as being the intensity of the given peak within the spectrum. Clearly, this is both mistaken and unhelpful for a client programmer who wishes to extract and work with the spectral data. To address this issue, an additional child of the peak node was created – the massSpecAssignment. This required the creation of regular expressions to match specifically the forms of assignment that are used rather than accept generic text contained within brackets. For example, assignments of the form “*M-H++” are matched by the regular expression;

(?: [\(\{\[] M (?!\s) [\+<insert idref="HYPHENCHARACTERS"/>]* (?: (?x-i:<insert idref="ELEMENTS"/>\d*)* | \d+ ) [\)\}\]] [\+<insert idref="HYPHENCHARACTERS"/>] ) (?!\w)

As a result of this work, OSCAR3 now produces the following annotations for the previous example, “MS m/z 362 (M+1)”;

<spectrum type="massSpec">
  MS m/z
  <peaks type="..">
    <peak>
      <quantity type="mass">
        <value>
          <point>362</point>
        </value>
      </quantity>
      <quantity type="assignment">(M+1)</quantity>
    </peak>
  </peaks>
</spectrum>

With OSCAR3 now producing output in this format, it is possible for a machine to trivially identify that the peak at a mass of 362 has been given the assignment “M+1”, just as it was for a trained chemist to identify this information from the original text. This additional information can be used to predict the expected mass of the compound for which the spectrum was measured, as described in section 5.3.2.5.

5.2.3.4 Measurement of Improved OSCAR3 Performance

Once the work described in the previous section was completed, the analysis described previously was re-performed using the upgraded regular expressions.
Results are also presented for the analysis of the validation corpus, using both the original and the updated regular expressions. For the purposes of comparison, the results of the analysis on the development corpus using the original regular expressions are reproduced;

Spectrum type   # in corpus   TP    FP   FN    Precision   Recall
MassSpec        227           104   51   123   67.1%       45.8%
HNMR            206           97    21   109   82.2%       47.1%
CNMR            12            6     2    6     75.0%       50.0%
IR              10            0     0    10    -           0.0%
HRMS            5             0     0    5     -           0.0%

Table 5-5: Performance of OSCAR3 α5 on the development corpus

Spectrum type   # in corpus   TP    FP   FN    Precision   Recall
MassSpec        227           146   67   81    68.5%       64.3%
HNMR            206           175   16   31    91.6%       85.0%
CNMR            12            10    2    2     83.3%       83.3%
IR              10            0     0    10    -           0.0%
HRMS            5             0     3    5     0.0%        0.0%

Table 5-6: Performance of the improved OSCAR3 on the development corpus

Spectrum type   # in corpus   TP    FP   FN    Precision   Recall
MassSpec        199           80    55   119   59.3%       40.2%
HNMR            202           103   16   99    86.6%       51.0%
CNMR            24            12    2    12    85.7%       50.0%
IR              8             0     0    8     -           0.0%
HRMS            14            6     7    8     46.2%       42.9%

Table 5-7: Performance of OSCAR3 α5 on the validation corpus

Spectrum type   # in corpus   TP    FP   FN    Precision   Recall
MassSpec        199           122   53   77    69.7%       61.3%
HNMR            202           179   40   23    81.7%       88.6%
CNMR            24            22    4    2     84.6%       91.7%
IR              8             0     0    8     -           0.0%
HRMS            14            6     7    8     46.2%       42.9%
UnknownNMR      -             -     3    -     -           -

Table 5-8: Performance of the improved OSCAR3 on the validation corpus

The manually-annotated corpus does not, of course, contain any spectra of type ‘unknownNmr’. A strict application of the previous procedure therefore produces false positives wherever OSCAR3 annotates as such. The metrics produced in this way are misleading – if OSCAR3 annotates an NMR spectrum that fails to identify the isotope under investigation and if the correct text is annotated then it has correctly performed its function. As a result, the results given above have been computed allowing for a manually-annotated HNMR or CNMR spectrum to be matched by an automatically-annotated unknownNmr.
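The scoring procedure, including the relaxation for unknownNmr, can be sketched in Python as below. This is an illustration rather than the OscarValidator code: annotations are represented as (type, text) pairs, and the type name “cnmr” is assumed by analogy with the “hnmr” and “massSpec” values that appear in the examples.

```python
def types_match(manual_type, auto_type):
    """Types match if equal, or if an automatic unknownNmr annotation is
    compared against a manual hnmr or cnmr annotation."""
    if manual_type == auto_type:
        return True
    return auto_type == "unknownNmr" and manual_type in ("hnmr", "cnmr")


def score(manual, automatic):
    """Count (TP, FP, FN) for lists of (type, text) annotation pairs."""
    unmatched = list(automatic)
    tp = fn = 0
    for m_type, m_text in manual:
        hit = next((a for a in unmatched
                    if a[1] == m_text and types_match(m_type, a[0])), None)
        if hit is None:
            fn += 1          # manual annotation with no automatic counterpart
        else:
            tp += 1
            unmatched.remove(hit)
    return tp, len(unmatched), fn   # leftovers are false positives


def precision(tp, fp):
    """TP / (TP + FP); undefined (None) when nothing was annotated."""
    return tp / (tp + fp) if tp + fp else None


def recall(tp, fn):
    """TP / (TP + FN); undefined (None) when the corpus holds no such spectra."""
    return tp / (tp + fn) if tp + fn else None
```

For the HNMR row of Table 5-8 (TP 179, FP 40, FN 23), these functions give a precision of 81.7% and a recall of 88.6%, in agreement with the table.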
In the results given above, it can be seen that the performance of OSCAR3 on this task has been significantly improved by the work described previously. The recall on previously unseen data has been significantly improved (88.6% up from 51.0% for HNMR, 61.3% up from 40.2% for Mass Spec), while the precision has improved for Mass Spec (69.7% up from 59.3%) and fallen only slightly for HNMR (81.7% down from 86.6%). These modifications have greatly increased the potential for using OSCAR3 as a means for large-scale automated collection of spectral data from the literature.

5.2.4 Experimental Paragraph Classification

While it is common for the experimental sections, i.e. those that describe the process and results of a chemical reaction, of a patent to occur as examples of the invention, it is not necessarily the case that the method of identifying document sections described in section 5.2.2.1 will result in their occurrence as part of an example element in the semantically enhanced patent documents. As a result, the semantic enhancement at this point has done nothing to identify the presence or location of some or all of the experimental sections in a number of documents. To address this concern the sections of the text, as contained by opening and closing heading tags, are classified as being either experimental or non-experimental by use of a naïve Bayesian classifier. This classification allows for a greater proportion of the experimental sections within the patent corpus to be recognised as such and treated appropriately during the later stages of the workflow.

5.2.4.1 Classifier Implementation

The implementation of the naïve Bayesian classifier used here is supplied by the BayesianClassifier class in the third-party Java library Classifier4J (101), version 0.6. The API provided by BayesianClassifier allows for the binary classification of text strings, i.e. belonging or not belonging to a given class.
This class is determined by the client programmer and is taught to the Bayesian classifier by the provision of a number of examples of matches and non-matches. Once the training process is complete, the classifier will predict the likelihood of a previously unseen string belonging to the class.

5.2.4.2 Training and Validation

A corpus was assembled by selecting 800 p elements (i.e. paragraphs, for the most part) from those patents that had successfully passed through the paragraph deflattening and document segmentation phases of the semantic enrichment procedure, using a random process in which each paragraph had an equal chance of selection. These paragraphs were given file names numbering them sequentially from para000 to para799, and were manually inspected and determined to be experimental, non-experimental or empty according to the following criteria:

• The paragraph is empty if it has no text content. Such empty paragraphs generally occur in the patent documents as containers for images.
• The paragraph is experimental if:
    o It is an account of a reaction or a part of a reaction, including by way of reference to another section of text, e.g. “The reaction was carried out as in example 12”.
    o It is a report of spectral or other characterisation data.
    o It is some combination of the above.
• The paragraph is non-experimental if it is not empty or experimental.

The manually-classified paragraphs may be summarised as follows:

Class               Frequency of Occurrence
Empty               117
Experimental        238
Non-experimental    445

In order to produce experimental and non-experimental sets of equal size, non-experimental paragraphs after the 238th were ignored for the remainder of this work. The first 119 paragraphs of each set (50% of the full set) were then used to train the Bayesian classifier, before it was asked to predict the probabilities of the remaining experimental and non-experimental paragraphs belonging to the experimental class.
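The teach-then-predict procedure described above can be illustrated with a small self-contained classifier. The sketch below is written in Python for brevity and is an assumption-laden illustration only: it is not Classifier4J's implementation and does not reproduce its scoring, the class name TinyNaiveBayes and its methods are hypothetical, and the non-experimental training sentences are invented for the example. It uses a word-count naïve Bayes model with add-one smoothing and returns the estimated probability that a string belongs to the taught class.

```python
import math
import re
from collections import Counter


def tokenise(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyNaiveBayes:
    """Hypothetical binary naive Bayes text classifier (illustration only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.examples = {True: 0, False: 0}

    def teach(self, text, is_match):
        """Record one training example as a match or a non-match."""
        self.word_counts[is_match].update(tokenise(text))
        self.examples[is_match] += 1

    def probability(self, text):
        """Estimate the probability that the text belongs to the class."""
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        totals = {c: sum(self.word_counts[c].values()) for c in (True, False)}
        # Smoothed prior odds, then one smoothed likelihood ratio per known word.
        log_odds = math.log((self.examples[True] + 1) / (self.examples[False] + 1))
        for word in tokenise(text):
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            p_match = (self.word_counts[True][word] + 1) / (totals[True] + len(vocabulary))
            p_other = (self.word_counts[False][word] + 1) / (totals[False] + len(vocabulary))
            log_odds += math.log(p_match / p_other)
        log_odds = max(-50.0, min(50.0, log_odds))  # guard against overflow
        return 1.0 / (1.0 + math.exp(-log_odds))


classifier = TinyNaiveBayes()
classifier.teach("The mixture was stirred at room temperature for 4 hours and "
                 "the solvent was removed under reduced pressure.", True)
classifier.teach("1H NMR (CDCl3, 400 MHz) 7.75 (2H, m), 3.93 (3H, s) ppm.", True)
classifier.teach("The present invention relates to compounds useful as inhibitors.", False)
classifier.teach("Pharmaceutical compositions of the invention may be administered orally.", False)

p_experimental = classifier.probability(
    "The solvent was evaporated and the residue was stirred in methanol.")
p_descriptive = classifier.probability(
    "The invention relates to pharmaceutical compositions.")
print(p_experimental > 0.5, p_descriptive > 0.5)
```

With equal numbers of examples taught for each class, as in the balanced training set described above, the prior odds are even and the prediction is driven entirely by the words of the paragraph.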
The predicted likelihoods may be summarised as follows:

Experimental                          Non-experimental
Predicted likelihood    Frequency     Predicted likelihood    Frequency
0.99                    115           0.01                    102
0.98 ≥ p > 0.95         1             0.01 < p ≤ 0.06         3
0.05 ≥ p > 0            4             0.06 < p < 0.5          2
                                      0.99                    12

Thus, when classifying paragraphs as experimental if p > 0.5 and non-experimental if p < 0.5, the experimental paragraphs were correctly classified at a rate of 96.6% and the non-experimental paragraphs at a rate of 89.9%. These rates were deemed high enough to continue into production.

5.2.4.3 Integration into Workflow

The naïve Bayesian classifier is integrated into the PatentEye code through the ParagraphClassifier class. This class handles the loading from disk of the training data described in the previous section, the training of the BayesianClassifier, and the delegation of calls to classify sections of text to the underlying implementation. Heading elements in the patent documents are identified by use of the XPath “//heading”. If a heading has text content, the text content is passed to the ParagraphClassifier for a prediction to be made. If the predicted likelihood is greater than 0.5, the section is classified as being experimental, and this is noted in the XML by the addition of a classifier4j attribute with the value “experimental”. Otherwise, the opposite is recorded by setting the value of the classifier4j attribute to “nonExperimental”.

5.2.5 Image Analysis

As previously discussed, the XML documents supplied by the EPO frequently employ images to communicate chemical information, including Markush structures, reaction schemes and illustrations of single chemical structures. These images contain information that is important for understanding the content of the document, but this information is unavailable to a machine.
It is desirable for PatentEye to identify the compounds represented in structural diagrams where possible so that they may be used to confirm or to query the identity of the example compounds of the patent. The images are supplied in TIF format, and are embedded in the XML document as follows:

<p id="p0473" num="0473">
  <chemistry id="chem0413" num="0413">
    <img id="ib0413" file="imgb0413.tif" wi="87" he="48" img-content="chem" img-format="tif" orientation="portrait" inline="no" />
  </chemistry>
  Diisobutylaluminium hydride (7mL, 1 M in tetrahydrofuran , 7mmol) was added to a cooled (-10°C) solution of the ester from preparation 184 (1.2g, 2.7mmol) in tetrahydrofuran (25mL), and the reaction stirred for an hour at -10°C, followed by 1Hour at 0°C. Tlc analysis showed starting material remaining, so additional diisobutylaluminium hydride (5.4mL, 1 M in tetrahydrofuran, 5.4mmol) was added and the reaction stirred at 10°C for 10 minutes.
</p>

Figure 5-5: Embedded images in the patent XML. The text has been shortened for the sake of brevity.

Figure 5-6: EP1620437B1 Image 413

In this case, the image is a chemical structure diagram for the product of the reaction. The information contained within such images may be crucial in correctly identifying the product of a reaction. Mention is made in the DTD for the EPO patents that chemistry elements may at some point in the future support the inclusion of CML, but for the moment, in order to make use of this information, it is necessary first to convert the images into a machine-understandable format. As previously discussed, this problem has been addressed before and applications for this purpose have been developed. The generation of connection tables for the images embedded within the patents is achieved by interfacing with the application OSRA (65; 66; 67), and this process is subsequently discussed.
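To make the structure of Figure 5-5 concrete, the following sketch shows how the image files referenced by a paragraph might be located with a path expression. It is an illustration in Python using the standard library, not the PatentEye code; the function name embedded_images and the enclosing description element are assumptions introduced so that the fragment is a well-formed document, and the paragraph text is abbreviated.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the paragraph from Figure 5-5, wrapped in a
# hypothetical <description> root element for the sake of the example.
PATENT_FRAGMENT = """
<description>
  <p id="p0473" num="0473">
    <chemistry id="chem0413" num="0413">
      <img id="ib0413" file="imgb0413.tif" wi="87" he="48" img-content="chem"
           img-format="tif" orientation="portrait" inline="no" />
    </chemistry>
    Diisobutylaluminium hydride was added to a cooled solution of the ester.
  </p>
</description>
"""


def embedded_images(xml_text):
    """List (paragraph id, image file, image content type) for each image."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for paragraph in root.iter("p"):
        # img elements that are children of a chemistry child of the paragraph
        for img in paragraph.findall("./chemistry/img"):
            found.append((paragraph.get("id"), img.get("file"), img.get("img-content")))
    return found


print(embedded_images(PATENT_FRAGMENT))
```

The file attribute gives the name of the TIF image that accompanies the XML, and it is this file that must be passed to an image analysis tool.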
5.2.5.1 Technical Aspects

OSRA version 1.2.2, the most recent version at the time that the work commenced, was used for the current work. OSRA is implemented in C++ and is distributed as a pre-compiled executable for the Windows environment. It operates as a command line facility, and using default parameters the command “osra <filename>” instructs the program to process the specified file and print at the command line a series of line-separated SMILES strings for the chemical structures found within it. The output produced by OSRA is then added to the patent XML as an attribute on the img element. For example, the enhancement produces, for the example from Figure 5-5, the following output:

<p id="p0473" num="0473">
  <chemistry id="chem0413" num="0413">
    <img id="ib0413" file="imgb0413.tif" wi="87" he="48" img-content="chem" img-format="tif" orientation="portrait" inline="no" osraResult=" OCc1nn(CCOCC(F)(F)F)c2c(Nc3cccc(C)n3)nc(Cl)nc12"/>
  </chemistry>
  Diisobutylaluminium hydride (7mL, 1 M in tetrahydrofuran , 7mmol) was added to a cooled (-10°C) solution of the ester from preparation 184 (1.2g, 2.7mmol) in tetrahydrofuran (25mL), and the reaction stirred for an hour at -10°C, followed by 1Hour at 0°C. Tlc analysis showed starting material remaining, so additional diisobutylaluminium hydride (5.4mL, 1 M in tetrahydrofuran, 5.4mmol) was added and the reaction stirred at 10°C for 10 minutes.
</p>

Where OSRA identifies the presence of more than one structure in an image, it returns a set of line-separated SMILES strings. So that they may be embedded into the XML document, this set of SMILES strings is concatenated into a single pipe (“|”) separated string that can be used as the value for a single osraResult attribute. While it is possible to produce a single, valid SMILES string that represents two or more disconnected molecular structures by using the dot character (“.”), e.g.
“CC(=O)O.CCO” to represent acetic acid and ethanol, the dot-separated molecular graphs can be reconnected by using ring closures, e.g. “C1CCCCC1” represents cyclohexane and “C1CC.CC1” represents pentane. In order to prevent the possibility of malformed SMILES output from OSRA effecting bond formation between disconnected structures, it was decided to use the pipe character, which does not occur in the SMILES vocabulary, to denote separated structures.

Since OSRA is not implemented in Java, the PatentEye application interacts with it by using the capacity provided by the Java system libraries to execute arbitrary applications, wait for their termination and read into memory any output produced. The generation of connection tables from image files using OSRA is a time-consuming operation, as discussed in section 5.2.5.3, and so the results that are generated by OSRA are immediately cached and written to disk such that the task need not be repeated.

In order to reduce the amount of time taken to parse the images for a given patent, only a subset of the images present in a patent undergoes the image analysis procedure. This subset is composed of those images that are selected by the XPaths “//example//chemistry/img” and “//heading[@classifier4j='experimental']//chemistry/img”, i.e. those img elements that are children of a chemistry element that is in turn a descendant of an example element or of a heading whose content has been classified as experimental. This selection is performed for a practical reason: only the images in these positions will be used later to corroborate a candidate product for a reaction, and so to analyse other images would be to waste computing resources.

OSRA is claimed to support the TIF format, but it was discovered that an apparent bug in OSRA version 1.2.2 prevents the direct interpretation of TIF files.
To work around this, the application ImageMagick (102) is first used to convert the TIF images into the PNG format, which OSRA accepts as input. ImageMagick is executed using the same procedure for calling external applications as described previously for OSRA.

5.2.5.2 OSRA Performance

It was desired to validate the performance of OSRA in the context of this work. To produce a corpus reflective of the task in question, the complete set of images contained within experimental sections of the ten-week corpus of semantically enhanced patents was identified using the XPaths given above, and from these 14697 images a subset of 300 was randomly selected. The TIF files for these images were extracted from their parent ZIP files, numbered sequentially and manually inspected. These images were found to conform to one of a number of different types, by far the most common of which was a single chemical structure diagram. The set of 300 images was used to create a corpus of 200 images of single chemical structures by selecting, in ascending order, the first 200 such images. The types of images contained in the range image0 to image271, image271 being the 200th single chemical structure image, are summarised in Figure 5-7.

Figure 5-7: Types of images present in experimental sections

Single Chemical Structure       73.5%
Multiple Chemical Structures    11.0%
Chemical Reaction               9.6%
Chemical Fragment               2.2%
Markush Structure               1.5%
Generic Reaction                1.1%
No Chemical Structure           0.7%
Biochemical                     0.4%

The chemical structures contained within the set of 200 single chemical structure images were manually converted to SMILES strings, chiefly by redrawing the structure with ChemDraw 12.0 (35) and exporting it as SMILES, or by manual conversion in the case of simple structures. These SMILES strings were recorded in an index of the corpus. OSRA was used to analyse each of the 200 single chemical structure images, and the results of this analysis were appended to the index.
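The storage convention described in section 5.2.5.1, in which the line-separated SMILES strings printed by OSRA are joined with the pipe character into a single osraResult value, can be sketched as follows. This is an illustrative Python sketch rather than the Java code of PatentEye; the function names are hypothetical. The run_osra function assumes that an osra executable is available on the path and uses only the “osra <filename>” invocation quoted in the text; it is shown for completeness and is not exercised by the examples.

```python
import subprocess


def run_osra(image_path):
    """Run "osra <filename>" and return its standard output as text.

    Assumes an osra executable is installed and on the PATH.
    """
    completed = subprocess.run(["osra", image_path],
                               capture_output=True, text=True, check=True)
    return completed.stdout


def join_osra_output(stdout_text):
    """Join line-separated SMILES strings into one pipe-separated value."""
    smiles = [line.strip() for line in stdout_text.splitlines() if line.strip()]
    for entry in smiles:
        if "|" in entry:
            # The separator must never occur inside an individual result.
            raise ValueError("unexpected pipe character in: " + entry)
    return "|".join(smiles)


def split_osra_result(attribute_value):
    """Recover the individual SMILES strings from an osraResult value."""
    return [entry for entry in attribute_value.split("|") if entry]


joined = join_osra_output("CC(=O)O\nCCO\n")
print(joined)
print(split_osra_result(joined))
```

Because each structure remains a separate string after splitting, a ring-closure digit in one malformed result cannot be paired with a digit in another, which is the risk that a dot-separated representation would carry.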
Previous authors in the field have suggested subjective metrics of success, such as less than 30 seconds of human editing being required to correct errors in the structure (61), while Filippov and Nicklaus propose measuring success by calculating a similarity metric between the machine-produced structure and the correct structure. Such measures are of limited utility in the present work; manual correction of structures or determination of correct structures cannot be implemented within a fully automated workflow. What is desired of the image analysis process is the correct identification of the product molecules of chemical syntheses. While a high similarity between a structure believed to be the product (the “candidate product”) and a structure produced by OSRA may indicate that the image analysis has made a minor error and that the candidate product should be accepted, it may equally indicate that the image analysis is correct and that the candidate product should be rejected. As a result, there is no threshold of similarity, short of the two structures being identical, at which the structure derived from the image analysis becomes “good enough”.

The manually-generated and OSRA-generated SMILES strings for each image were thus used to generate InChIs using JUMBO. The performance of OSRA was measured by comparing these InChIs by string equivalence; where the two InChIs were identical, OSRA was counted as having correctly deduced the chemical structure contained within the image and the result was considered a match. Where the InChIs differed, the result was considered a non-match. As discussed in section 2.2.3.2, InChIs are composed of layers that describe the molecule in increasing levels of detail, and so the two were examined side-by-side to determine the level of agreement, i.e. the highest layer of the InChIs at which the two disagreed. In a number of cases, it was not possible to generate an InChI from the SMILES string produced by OSRA.
The causes of these problems were also examined and determined to be primarily that the SMILES string contained the wildcard character, *, which is valid SMILES but is not supported by JUMBO or by InChI. In a further two cases the SMILES string returned by OSRA was found not to be valid, suggesting a bug within the OSRA program itself. The results from this work were as follows:

Result                                      Frequency   %
Match                                       68          34.0
Non-match                                   79          39.5
Unbuildable SMILES (containing wildcard)    51          25.5
Invalid SMILES                              2           1.0

Table 5-9: OSRA performance

Cause                                              Frequency   %
Differing Hydrogen Count                           12          15.2
Differing Molecular Formula (excluding H count)    57          72.2
Differing Regioisomers                             4           5.1
Differing Stereochemistry                          5           6.3
Differing Charges                                  1           1.3

Table 5-10: Causes of InChI disagreement

The rate at which the OSRA-produced structure and the manually-produced structure match is, at 34%, significantly lower than that reported for OSRA 1.1.0 by Filippov and Nicklaus (103), in which the rate was reported as 26 matches out of 42 (61.9%) structures and 107 matches out of 215 (50.0%) structures on two data sets. Such rates will of course be highly dependent upon the images that form the test corpus, and the images supplied by the EPO are of highly variable quality. Many of the images that form the test corpus used in this work are severely pixelated, indistinct or contain background noise; some are only barely legible to a human skilled in the art. Examples from the single chemical image corpus are subsequently illustrated, along with the resultant structures produced by OSRA. All input images are the work of the European Patent Office and the original patent authors.

Figure 5-8: Input image (left) and correctly interpreted structure (right)

Figure 5-9: Input image (top) and correctly interpreted structure (bottom)

Figure 5-8 and Figure 5-9 show examples of images from the test corpus that were correctly interpreted by OSRA.
Note that in Figure 5-9 the wedge bond to fluorine was correctly interpreted even though it is not immediately noticeable to the human eye.

Figure 5-10: Input image (left) and correctly interpreted structure

In Figure 5-10 we see that although the image contains a mistake, namely the missing hydrogen atoms in the amine substituent, OSRA has correctly identified the structure in the image.

Figure 5-11: Input image (top) and incorrectly interpreted structure (bottom)

Figure 5-11 shows an example of OSRA producing buildable but incorrect output. The label for the fluorine atom has been missed, giving a methyl substituent on the benzene ring instead of the correct fluorine substituent. Furthermore, the chlorine substituent has been misrecognised as a hydroxyl group, presumably due to the letters “C” and “l” overlapping slightly so as to be recognised as an “O”, with the hydrogen being added automatically as in Figure 5-10. The analysis of the causes of structural disagreements such as these, presented in Table 5-10, suggests that when these errors occur they are major errors, with 15.2% being due to discrepancies in the hydrogen count of the molecules, and a further 72.2% being due to other discrepancies in the molecular formula, such as in Figure 5-11.

Figure 5-12: Input image (top) and unbuildable result (bottom)

In Figure 5-12, the input image is of extremely low quality. Consequently, the output produced by OSRA comprises three disconnected subsections of the molecule and it is difficult to identify which sections of the output structures correspond to which sections of the input structure. The presence of the wildcard character, “*”, in the structure generated by the image analysis is generally indicative that OSRA has not correctly identified the full chemical structure and is returning a partial result.
Consequently, it is evident that the result returned by OSRA is not a correct representation of the structure contained in the image and may be safely disregarded in any workflow.

Figure 5-13: Input image (top) and unbuildable result (bottom)

Figure 5-13 shows a further example in which OSRA returns a partial interpretation of the input image. In this case, the structure produced by OSRA is entirely correct apart from the missing bromine atom. In such cases it is possible to determine that a candidate product subsumes the partial structure defined by the OSRA result, and thus that the structure determined by the image analysis agrees with the candidate structure to the fullest extent possible. No attempt was made, however, to implement this process within the current work.

5.2.5.3 Computational Expense

The analysis of the illustrative images was the most computationally expensive part of the PatentEye workflow, but was still capable of running on a standard desktop PC (Pentium 4, 3.40 GHz). The time taken to compute a SMILES string from an input image on this platform varied from 890 milliseconds to 12,026,543 milliseconds, i.e. over three hours. The variation in time required for this task is illustrated in Figure 5-14.

Figure 5-14: Runtime required for image analysis

To produce a legible graph, Figure 5-14 ignores the 216 images for which the required runtime exceeded 100 seconds. The graph of the remaining 9271 images (97.7% of the total) that were processable in under 100 seconds shows a roughly linear increase in required runtime over the fastest 80-90% before the required runtime begins to increase markedly. The fastest 80% of the images may be processed in less than around 32 seconds each, while the fastest 90% may be processed in less than around 42 seconds each. This shows that the existing software tools and standard hardware may compute the bulk of the information that is to be gained from the images within reasonable periods of time.
[Figure 5-14 axis labels: Images processable in time t (%); Time (ms)]

5.2.6 Back Reference Annotation

By the nature of a chemical patent document, the examples of the invention frequently consist of the syntheses of a number of related compounds. The methods of synthesis of these compounds are frequently the same, producing differing products by varying one or more of the reactants.

Figure 5-15: Analogous reactions from EP1326865

This phenomenon is illustrated in Figure 5-15. The second reaction is described in the patent document as follows: “The title compound was prepared using the procedure described in Example 2, Step 2 replacing 2,5 dichlorobenzylamine with 1(R,S) aminoindane.” The patent author has saved himself from the need to repeat the description of procedure, conditions and reagents that has already been included for the first reaction. Authors also frequently refer to a reactant as being “the ester from step 3”, or similar, instead of by name.

For a machine reading the document, these constructions create a problem: the information necessary to understand a reaction is not present in the description of that reaction, instead being present elsewhere in the document. A human reader will recognise the intent of the Back Reference to the earlier section of the document and will understand that he should refer back to obtain the information that has been omitted from the immediate section (indeed, this is what happened during the drawing of Figure 5-15), but for a machine this process is not so trivial. In order to allow the same process to be performed automatically, the document is searched for text strings that may be such references to previous sections.
These Back References, where found, are annotated in the semantically enhanced documents with a link back to the referenced section so that the machine will later be able to look up the relevant information when attempting to process the reactions described in the patent. The process of Back Reference annotation is subsequently discussed.

5.2.6.1 Hierarchical Indexing of Reaction Headings

The document to be annotated is passed to the BackReferenceAnnotator class, which is intended to identify and annotate potential back references in the input document. Before any annotation can be carried out, the BackReferenceAnnotator must first identify the terms to be annotated. This is achieved by locating the example headings in the document using XPath and tokenising the title text of the headings on whitespace. Subheadings, i.e. heading or example elements that are children of the previously indexed example element, are similarly indexed and are recorded in the index of headings as children of their parent heading.
For example, the following document structure would be indexed as shown:

<example id="h0005" title="EXAMPLE 1">
  <heading id="h0006" title="N-(3,5-dichlorobenzyl)-8-hydroxy-1,6-naphthyridine-7-carboxamide">
    <p id="p0135" num="0135">...</p>
  </heading>
  <heading id="h0007" title="Step 1: Preparation of 3-{[Methoxycarbonylmethyl-(toluene-4-sulfonyl)-amino]-methyl}-pyridine-2-carboxylic acid isopropyl ester">
    <p id="p0136" num="0136">...</p>
  </heading>
  <heading id="h0008" title="Step 2: Preparation of methyl 8-hydroxy-1,6-naphthyridine-7-carboxylate">
    <p id="p0137" num="0137">...</p>
  </heading>
  <heading id="h0009" title="Step 3: Preparation of N-(3,5-dichlorobenzyl)-8-hydroxy-1,6-naphthyridine-7-carboxamide">
    <p id="p0138" num="0138">...</p>
  </heading>
</example>

“EXAMPLE”, “1”
    “N-(3,5-dichlorobenzyl)-8-hydroxy-1,6-naphthyridine-7-carboxamide”
    “Step”, “1:”, “Preparation”, “of”, “3-{[Methoxycarbonylmethyl-(toluene-4-sulfonyl)-amino]-methyl}-pyridine-2-carboxylic”, “acid”, “isopropyl”, “ester”
    “Step”, “2:”, “Preparation”, “of”, “methyl”, “8-hydroxy-1,6-naphthyridine-7-carboxylate”
    “Step”, “3:”, “Preparation”, “of”, “N-(3,5-dichlorobenzyl)-8-hydroxy-1,6-naphthyridine-7-carboxamide”

Figure 5-16: Indexed and Tokenised Headings

Figure 5-16 shows the index generated from the preceding XML. Note that the structure of the XML has been preserved in the index, with one example heading containing four child headings. For each of the indexed headings, the tokens generated by tokenising the title text on whitespace are shown within quotation marks. The index generated in this way is subsequently used to annotate references to these sections of text in the patent text.

5.2.6.2 Text Annotation

Paragraphs that are contained by example headings or by headings that have been classified as experimental by the ParagraphClassifier are identified within the patent document using XPath.
The text content of these paragraphs is tokenised on whitespace, as was the text of the headings, and then the token sets are compared to identify common tokens between the two. Common tokens in this case are identified by allowing for differences in case and for the punctuation at the end of the tokens to differ, such that the heading for “EXAMPLE 1” in the previous example would find a match within the string “Example 1, Step 2”. To avoid the annotation of common strings such as “1” or “the” that do not indicate the presence of a Back Reference, matches must be of two or more consecutive tokens from the heading.

In the previous example paragraph text, “Example 1, Step 2”, it is clear that the reference is not to the whole of example 1, but specifically to the second step of that example. This is identified by the BackReferenceAnnotator by allowing the child headings of a parent heading to attempt to continue matching from the position in the paragraph token set where the parent heading stopped matching.

Once this process of identifying the paragraph text that matches the indexed headings is complete, the match is checked to see if it is a Compound Reference, a specific type of Back Reference that indicates the use of a previously defined chemical, e.g. “the product of example 3”. This is achieved by checking if the text that precedes the matched text itself matches the regular expression:

the (.*?) (from|of|(synthesi[sz]ed|produced) in)$

Furthermore, the text matched by the capture group (.*?) must be either the word “product” or a single entity of type CM as annotated by OSCAR3. The preceding text is therefore required to be of the form “the aldehyde from” or “the product synthesised in”, with the anchor character “$” requiring that the matching text occurs at the end of the string.
If the preceding text matches this regex then both matches are annotated as a single Compound Reference, otherwise the text matched to the example headings is annotated as a Back Reference.

Compound References are sometimes used by patent authors with a local rather than global scope; that is to say, the compound referred to by a certain phrase will not be the same as the compound referred to by the same phrase in another part of the document. This is observed, for example, in the phrase “the product of step 1”, where it is assumed and inferred that this refers to the first step of the current reaction. For this reason, the BackReferenceAnnotator also annotates Compound References by identifying sections of text that match to headings that are siblings of the parent heading of the paragraph containing the text. This is illustrated in Figure 5-17, where the colour coding indicates the section of the document for which each paragraph is annotated.

Example 1
    Step 1
    Step 2
        “The product of step 1 is added to…”
Example 2
    Step 1
    Step 2
        “The product of step 1 is added to…”

Figure 5-17: Local annotation of sub-headings

The XML produced by this process is of the following form:

<p id="p0141" num="0141">
  Triphosgene (0.556g, 1.87 mmol) was added over 20 mins to a solution of <compoundRef targetId="h0012">the acid from step 1.</compoundRef> (0.89g, 4.68 mmol) and diisopropylethylamine 3.26 ml, 18.7 mmol) in DMF (22 ml) at 0°C.
</p>
<p id="p0143" num="0143">
  The title compound was prepared using the procedure described in <backReference targetId="h0013">Example 2, Step 2</backReference> replacing 2,5 dichlorobenzylamine with 1(R,S) aminoindane.
</p>

In each of these cases, the value of the targetId attribute on the reference annotation is equal to the value of the id attribute on the heading element to which the reference points, introducing into the document a means by which a machine may easily resolve the reference.
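The two tests described in section 5.2.6.2, the matching of two or more consecutive heading tokens and the Compound Reference check on the preceding text, can be sketched as follows. This is an illustrative Python sketch and not the BackReferenceAnnotator itself. The function names are hypothetical, matching is assumed to start from the first token of the heading, and the small set CHEMICAL_CLASS_WORDS is a stand-in for the recognition of a single CM entity by OSCAR3. The regular expression is the one quoted in the text.

```python
import re

# Regular expression quoted in section 5.2.6.2.
COMPOUND_REF = re.compile(r"the (.*?) (from|of|(synthesi[sz]ed|produced) in)$")

# Hypothetical stand-in for "a single entity of type CM as annotated by OSCAR3".
CHEMICAL_CLASS_WORDS = {"acid", "ester", "aldehyde", "amine", "ketone"}


def normalise(token):
    """Ignore case and trailing punctuation when comparing tokens."""
    return token.lower().rstrip(".,:;")


def find_heading_match(heading_tokens, paragraph_tokens, minimum=2):
    """Return (start, end) of the longest run of leading heading tokens found
    in the paragraph, or None if no run reaches the minimum length."""
    heading = [normalise(t) for t in heading_tokens]
    paragraph = [normalise(t) for t in paragraph_tokens]
    best = None
    for start in range(len(paragraph)):
        length = 0
        while (length < len(heading) and start + length < len(paragraph)
               and paragraph[start + length] == heading[length]):
            length += 1
        if length >= minimum and (best is None or length > best[1] - best[0]):
            best = (start, start + length)
    return best


def classify_reference(preceding_text, matched_text):
    """Decide whether a heading match is a Compound Reference or a Back Reference."""
    found = COMPOUND_REF.search(preceding_text.rstrip())
    if found and (found.group(1) == "product" or found.group(1) in CHEMICAL_CLASS_WORDS):
        return "compoundRef", found.group(0) + " " + matched_text
    return "backReference", matched_text


paragraph = "prepared using the procedure described in Example 1, Step 2".split()
print(find_heading_match(["EXAMPLE", "1"], paragraph))
print(classify_reference("to a solution of the acid from", "step 1"))
print(classify_reference("using the procedure described in", "Example 2, Step 2"))
```

Requiring at least two consecutive tokens is what prevents an isolated “1” or “the” in the paragraph from being annotated as a reference to a heading.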
5.3 Extraction of Reactions

Chemical patents are a rich source of chemical reactions due to the requirement for a patent claimant to detail examples of the invention. The reactions published in this way are routinely manually indexed and added to databases such as CASREACT (4). A project at the Chemical Abstracts Service (CAS) in the 1980s aimed to produce a system capable of automating or partially automating the indexing process by application of Natural Language Processing (NLP) technologies (104; 105; 106). These works claimed to “satisfactorily” process 36 out of 40 synthetic paragraphs from the Journal of Organic Chemistry (104) and to produce “usable results” for 80-90% of simple synthesis paragraphs and 60-70% of complex paragraphs (106), where complex paragraphs are defined as describing general procedures, instances of general procedures, analogous syntheses and parallel syntheses. The size of the corpus used to produce this second set of results was not given, nor in either case was the procedure used for corpus creation. Accordingly, it is not possible to regard this area as a solved problem.

The work at CAS produced a system capable of summarising a reaction by identifying its product, yield and component reaction steps. Discrete steps in the reaction were identified by the occurrence of verbs in the text, with each verb indicating an event and mapped to one of eight types of event: COMBINE, REACT, PREPARE, WORKUP, RESULT, TITLE, MISC and UNKNOWN. Within these events, further information such as chemicals, their amounts and roles, and reaction conditions were identified. Chemical syntheses were thereby extracted from the text as a sequence of the identified events.

The era in which this technology was developed was very different. As a division of the American Chemical Society, CAS was in the privileged position of having access to a large body of published work in an electronic format.
The situation today is different: the ubiquity of electronic publication and the explosion of the scale of publication have granted such access far more widely, though publishers may very well supply the works subject to restrictive terms of use. The EPO patents, however, are subject to no such restrictions and so the time for a re-examination of the subject of automated extraction of chemical reactions has come.

5.3.1 Conventional Format of Experimental Sections

In order to devise a system for automated extraction from reported syntheses, it is important to first consider the nature and common structure of such text. Fortunately, the reporting of chemical syntheses is highly stylised. By convention, chemists report syntheses using the past tense and the agentless passive voice. Consider the following:

Step 1: Preparation of 5-iodo-8-(1-phenyl-methanoyloxy)-[1,6]naphthyridine-7-carboxylic acid methyl ester

To a solution of 8-hydroxy-5-iodo-[1,6]naphthyridine-7-carboxylic acid methyl ester (9.41 g, 28.5 mmol, from Example 112 Step 1) and cesium carbonate (13.9 g, 42.8 mmol) in dry DMF (250 ml), was added benzoic anhydride (9.67g, 42.8 mmol) and the solution stirred at room temperature for 4 hours. The solvent was removed under reduced pressure and the residue was partitioned between saturated aq. ammonium chloride and extracted into CHCl3. The combined organic extracts were washed with brine, dried over Na2SO4, filtered and and the solvent was evaporated in vacuo. The residue was purified by flash chromatography (40% EtOAc/Hexane gradient elution switching to 1% MeOH/CHCl3) to provide the desired product was a yellow solid. 1H NMR (CDCl3, 400MHz) 9.11 (1H, d, J=4.21Hz), 8.48 (1H, d, J=8.51 Hz), 8.32 (2H, d, J=8.33 Hz), 7.75-7.67 (2H,m), 7.56 (2H, t, J=7.69 Hz), and 3.93 (3H, s) ppm.
Figure 5-18: EP1326865 - Example 79, Step 1

Such descriptions of syntheses may be conceptually divided into three parts: the primary reaction, in which the target compound is completely or substantially produced; the work-up, in which the reaction is quenched and neutralised, solvents are removed, the product purified and suchlike; and the characterisation, in which spectral data is afforded to demonstrate that the product of the reaction is that intended. In the description of the primary reaction, reactants (“a substance that is consumed during the course of a reaction” (107)) are detailed by giving a name or other reference (e.g. “ketone 12b” or “the compound from step 2”) together with the quantity used, generally stated by mass and by molar amount. Solvents are typically detailed by giving a name and the volume used. In the description of the work-up these quantities are commonly omitted, as in Figure 5-18. The identity of the product of the synthesis may be specified in one of two typical ways: in the heading of the section, as in Figure 5-18, or by statement at the end of the description of the work-up, e.g. “to yield 1,6-naphthyridine-8-carboxylic acid”.

5.3.2 Implementation of Automatic Reaction Extraction

A system for the automated extraction of reaction information from the EPO patents is implemented. While this system is implemented in such a way as to integrate directly with the enhanced patents produced by the process discussed in section 5.2, the methods and software employed are generic and could be reused to produce a system suitable for alternative inputs. The operation of the system is summarised in Figure 5-19.

Figure 5-19: Abstracting reactions from patent text

The enhanced patent XML documents are read into memory, and the headings that have been classed as experimental by the ParagraphClassifier as well as those converted into example elements are identified by means of XPath.
The sections of the document either contained within the heading or example element or, if the heading has sub-headings, each subheading individually, are passed into the ExperimentParser class. Identities and amounts of reagents are identified either by passing the text to ChemicalTagger, or by analogy with a previous reaction if a back reference is present. The DataAnnotator class in OSCAR3 is used to identify spectral data, and NMR spectra are converted to CML using a converter from JUMBOConverters (45). The product of the reaction is identified by using OSCAR3 to identify chemical names in the heading title. The product identity is then validated by comparison with the results of the OSRA analysis of any image present (see section 5.2.5), and with any 1H NMR or mass spectrum that is reported. The results of these processes are combined into a CML Reaction which is saved to disk and retained in memory in order that further reactions which refer back to this reaction may be resolved. This process is subsequently discussed in greater depth.

5.3.2.1 Identification of Products

In order to identify the product of a reaction, the title text of the document section under examination is passed to OSCAR3 for named entity annotation. If OSCAR3 does not identify a single named entity of type CM in the title text, then the process of reaction extraction fails and the ExperimentParser throws a RuntimeException. If a single CM is found in the title text, then the name is resolved to a CML Molecule, which is added to the productList of the CML Reaction.

5.3.2.2 Attachment of Spectral Data

As discussed in section 5.2.3, the most common spectra types found in the patent corpus were 1H NMR, 13C NMR and mass spectra.
Reports of mass spectra generally give only the mass of the molecular ion, optionally plus or minus a defined offset, and so provide a useful source of information for validating a candidate product molecule but little information worth preserving – it is, after all, a simple task to calculate a molecular mass and the patents do not report fragmentation patterns. The NMR spectra, however, in addition to providing a means by which the product molecule may be verified, are themselves data of potential importance and are worth preserving for future re-use. The format in which they are preserved in the enhanced patent XML documents, using inline annotation to identify features within the original patent text, is ideal in that context as it retains the original document text. It does not, however, enable trivial machine interpretation of the spectrum since it is not valid CML and tools do not exist for its easy manipulation. The OSCAR-annotated spectrum is therefore converted into a CML Spectrum by use of the OSCAR2CMLSpectConverter class in the JUMBOConverters library. This converter was created specifically for this task and was largely written by Peter Murray-Rust with some contribution from the present author. It does not attempt to perform any further text-mining on its input, instead relying entirely on the OSCAR3 annotations to fully identify features of interest such as peaks, integrals, multiplicities and coupling constants. Example output from this converter is given in Figure 5-20.
<spectrum xmlns="http://www.xml-cml.org/schema" type="hnmr">
  <parameterList>
    <parameter dictRef="cmlx:solvent">
      <molecule>
        <name>CDCl3</name>
      </molecule>
    </parameter>
    <parameter dictRef="cmlx:frequency">
      <scalar dataType="xsd:double" units="unit:mhz">400.0</scalar>
    </parameter>
  </parameterList>
  <peakList>
    <peak xValue="9.18" integral="1.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:doubletdoublet">
      <peakStructure type="coupling" value="1.6" units="unit:hz" />
      <peakStructure type="coupling" value="4.2" units="unit:hz" />
    </peak>
    <peak xValue="8.53" integral="1.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:doubletdoublet">
      <peakStructure type="coupling" value="1.6" units="unit:hz" />
      <peakStructure type="coupling" value="8.5" units="unit:hz" />
    </peak>
    <peak xValue="8.26" integral="1.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:multiplet" />
    <peak xValue="7.72" integral="1.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:doubletdoublet">
      <peakStructure type="coupling" value="4.2" units="unit:hz" />
      <peakStructure type="coupling" value="8.5" units="unit:hz" />
    </peak>
    <peak xMin="6.84" xMax="7.04" integral="3.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:multiplet" />
    <peak xValue="4.72" integral="2.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:doublet">
      <peakStructure type="coupling" value="6.2" units="unit:hz" />
    </peak>
    <peak xValue="3.97" integral="3.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:singlet" />
    <peak xValue="3.89" integral="3.0" yUnits="unit:hydrogen" peakMultiplicity="cmlx:singlet" />
  </peakList>
</spectrum>

Figure 5-20: Example NMR spectrum in CML

5.3.2.3 Identification of Reagents

As previously discussed, reagents used during the primary reaction section of a chemical synthesis are, by convention, reported along with the quantity used.
Such lexical patterns are easily identified using ChemicalTagger – the output format is as follows;

<MOLECULE>
  <OSCAR-CM>4-(dimethylamino)-benzenethiol</OSCAR-CM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS>
      <CD>.147</CD>
      <NN-MASS>g</NN-MASS>
    </MASS>
    <COMMA>,</COMMA>
    <AMOUNT>
      <CD>.96</CD>
      <NN-AMOUNT>mmol</NN-AMOUNT>
    </AMOUNT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>

Figure 5-21: Sample ChemicalTagger markup of a reactant

The quantities of a MOLECULE may be a MASS, AMOUNT or VOLUME. These elements occur either as a child of a QUANTITY element, as seen above, or of a MIXTURE element if further text content is present, e.g. “4-(dimethylamino)-benzenethiol (.147g, .96mmol, prepared in step 2)”;

<MOLECULE>
  <OSCAR-CM>4-(dimethylamino)-benzenethiol</OSCAR-CM>
  <MIXTURE>
    <_-LRB->(</_-LRB->
    <MASS>
      <CD>.147</CD>
      <NN-MASS>g</NN-MASS>
    </MASS>
    <COMMA>,</COMMA>
    <AMOUNT>
      <CD>.96</CD>
      <NN-AMOUNT>mmol</NN-AMOUNT>
    </AMOUNT>
    <COMMA>,</COMMA>
    <VB-SYNTHESIZE>prepared</VB-SYNTHESIZE>
    <IN-IN>in</IN-IN>
    <NN>step</NN>
    <CD>2</CD>
    <_-RRB->)</_-RRB->
  </MIXTURE>
</MOLECULE>

Figure 5-22: ChemicalTagger output for mixed content

It is presumed that any MOLECULE that occurs in the synthesis text and has a MASS, AMOUNT or VOLUME is a reagent. CML representations including connection tables are generated for reagents identified in this way by using the NameResolver class of OSCAR3 – which was modified in order to create the capacity to access its functionality from external applications. A solvent is then distinguished by one of two criteria – that it is a member of a list of known solvents, or that the quantity quoted for the MOLECULE is given only as a volume. In order to avoid the necessity of creating a large dictionary of synonyms for common solvents, e.g.
“DCM”, “dichloromethane” or “methylene chloride”, the check against the list of known solvents is performed by generating an InChI from the connection table using the InChIGeneratorTool class in JUMBO and checking this InChI against a known list. The reagents identified in this way are then used to populate the ReactantList and SpectatorList children of the CML Reaction. For example, the lists in Figure 5-23 are generated for the following example;

Into a round bottom flask fitted with a gas inlet, condenser and a magnetic stirring bar was placed methyl 8-(benzoyloxy)-5-bromo-1,6-naphthyridine-7-carboxylate (.4 g, 1.03 mmol), 2,3-dimethoxybenzylamine (.432 g, .38 mL, 2.58 mmol) and 10 mL toluene: This mixture was refluxed for 18 hours, after which the reaction was cooled and the solvent removed in vacuo. The resulting residue was triturated with diethyl ether and filtered to yield 5-bromo-N-(2,3-dimethoxybenzyl)-8-hydroxy-1,6-naphthyridine-7-carboxamide as a light yellow solid.

<reactantList xmlns="http://www.xml-cml.org/schema">
  <reactant>
    <molecule id="m1" title="methyl 8-(benzoyloxy)-5-bromo-1,6-naphthyridine-7-carboxylate">
      <identifier convention="iupac:inchi">
        InChI=1/C17H11BrN2O4/c1-23-17(22)13-14(24-16(21)10-6-3-2-4-7-10)12-11(15(18)20-13)8-5-9-19-12/h2-9H,1H3
      </identifier>
      <atomArray> ... </atomArray>
      <bondArray> ... </bondArray>
    </molecule>
    <amount units="cml:g">0.4</amount>
    <amount units="cml:mmol">1.03</amount>
  </reactant>
  <reactant>
    <molecule id="m1" title="2,3-dimethoxybenzylamine">
      <identifier convention="iupac:inchi">
        InChI=1/C9H13NO2/c1-11-8-5-3-4-7(6-10)9(8)12-2/h3-5H,6,10H2,1-2H3
      </identifier>
      <atomArray> ... </atomArray>
      <bondArray> ... </bondArray>
    </molecule>
    <amount units="cml:g">0.432</amount>
    <amount units="cml:mmol">2.58</amount>
    <amount units="cml:mL">0.38</amount>
  </reactant>
</reactantList>
<spectatorList>
  <spectator role="solvent" hasOnlyVolume="true">
    <molecule xmlns:cmlx="http://www.xml-cml.org/schema/cmlx" cmlx:explicitHydrogens="true" title="toluene">
      <atomArray> ... </atomArray>
      <bondArray> ... </bondArray>
    </molecule>
    <amount units="cml:mL">10.0</amount>
  </spectator>
</spectatorList>

Figure 5-23: Automatically generated reactantList and spectatorList. For the sake of brevity, atom and bond elements have been removed

It can be seen that, using the techniques discussed above, the software has correctly identified the reactants (methyl 8-(benzoyloxy)-5-bromo-1,6-naphthyridine-7-carboxylate and 2,3-dimethoxybenzylamine), the solvent (toluene), and the quantities of each that were used. This approach, however, is only applicable to syntheses that are described in a straightforward manner, i.e. those that directly identify the reagents employed and do so using interpretable nomenclature. Those that require the addition of information from other sections of the document by use of the previously discussed back reference require alternative techniques to be employed. These references, where identified, have by this stage in the workflow been annotated (see section 5.2.6). The treatment of reactions that are described by analogy to another synthesis is an involved process that is described in full later, while syntheses that use compound references are resolved by direct text substitution of the reference text with the name of the compound to which the reference refers prior to running ChemicalTagger over the text. For the example;

Triphosgene (0.556g, 1.87 mmol) was added over 20 mins to a solution of <compoundRef targetId="h0012"> the acid from step 1. </compoundRef> (0.89g, 4.68 mmol) and diisopropylethylamine (3.26 ml, 18.7 mmol) in DMF (22 ml) at 0°C.
The identity of “the acid from step 1” is determined by examining the referenced section of the document. It is assumed that such references refer to the product of the referenced reaction rather than to any other compound involved. It would be preferable, in cases such as the above where the referenced compound could indeed be some compound other than the product, for the appropriate compound to be selected from the list of those compounds involved in the reaction in order to ensure that this assumption is correct. The product of the referenced reaction is determined in the same manner as described previously, by passing the title text of the heading to which the compoundRef element points to OSCAR3 in order to identify the name of the target compound. This compound name is then added to the subject text in place of the compound reference, which for the above example yields;

Triphosgene (0.556g, 1.87 mmol) was added over 20 mins to a solution of 8-hydroxy-1,6-naphthyridine-7-carboxylic acid (0.89g, 4.68 mmol) and diisopropylethylamine (3.26 ml, 18.7 mmol) in DMF (22 ml) at 0°C.

This is a format from which the reagents may be correctly identified using the previously described method.

5.3.2.4 Resolution of Analogous Reactions

The method for identifying reagents described in section 5.3.2.3 is, of course, dependent upon the patent author directly describing the synthesis, as opposed to defining it by analogy to a previous reaction. The patent EP1326865 contains a number of reactions of the form;

This reaction is fully described once; in Example 2, Step 2 in which RNH2 is 2,5-dichlorobenzylamine. Subsequent analogous reactions may be described in such a manner as;

The title compound was prepared using the procedure described in Example 2, Step 2 replacing 2,5 dichlorobenzylamine with 1(R,S) aminoindane.

or;

The title compound was prepared using the procedure described in Example 2, Step 2 from 8-hydroxy-1,6-naphthyridine-7-carboxylic acid and 1(R,S) aminoindane.
The first of these formulations explicitly instructs the reader which of the reagents from the reference reaction is to be replaced, while the second does not. Examples of the first formulation could be dealt with to an acceptable degree of success by matching phrases of the form “replacing/substituting/replacement of/substitution of X for/with Y” etc. (where X and Y are chemical names) and modifying the CML Reaction produced from the source reaction accordingly. This approach, however, would not be helpful in the resolution of examples of the second formulation. Since the text does not inform the reader which of the reagents was replaced by the aminoindane, it is necessary to apply a degree of chemical knowledge in order for it to be determined. The algorithm implemented to resolve this problem is similar to the mental process that a chemist might use. First, consider the reagents employed and product produced in the reference reaction. A trained chemist, even without applying any knowledge of the reactivities of the species involved, will be trivially able to identify that a coupling has occurred between the carboxylic acid and the amine. This assertion may be made based on the high degree of common substructure between the two reagents and the product, and is illustrated in Figure 5-24.

Figure 5-24: Identification of key reactants

The analogous reaction tells us that the product, in this case N-[(1R,S)-2,3-dihydro-1H-inden-1-yl]-8-hydroxy-1,6-naphthyridine-7-carboxamide, is prepared “from 8-hydroxy-1,6-naphthyridine-7-carboxylic acid and 1(R,S) aminoindane”.
We thus know that the carboxylic acid and the aminoindane are employed in the analogous reaction, and by considering which of the reagents share a significant substructure with the product, we observe the following;

Figure 5-25: Significant substructures in the analogous reaction

The product of the reference reaction is composed of substructures derived from the dichlorobenzylamine and the carboxylic acid, while the product of the analogous reaction is composed of substructures derived from the aminoindane and the carboxylic acid. It may therefore be concluded, since the dichlorobenzylamine played a role in the reference reaction that it does not in the analogous reaction, that the dichlorobenzylamine is not used as a reagent in the analogous reaction. This procedure is implemented as follows;

• First, the reference reaction is analysed by the ReactionMapper class. The CML Reaction that was generated for the reference reaction states the product of the reaction and the reagents from which it is synthesised. For each of the reagents in turn, the maximum common substructure with the product molecule is calculated using the Chemicx library (108), employing a bond-directed search and disregarding hydrogen atoms. From these is determined the smallest set of reagents which between them map to all of the non-hydrogen atoms of the product molecule. This replicates the mental procedure illustrated in Figure 5-24.

• Chemicals named in the description of the analogous reaction are identified by OSCAR3. These names are resolved to connection tables using OSCAR3’s NameResolver class and, if necessary, converted to CML. These CML Molecules, along with the ReactionMapper from the previous step and the CML Molecule for the new product of the analogous reaction (determined as in section 5.3.2.1), are passed to the ModifiedReactionMapper class.
This class creates a new ReactionMapper for the new reaction and provides the additional functionality to resolve the analogous reaction while delegating previously implemented functions to its new ReactionMapper. Each of the reagents identified as being part of the smallest complete mapping set in the previous step has its maximum common substructure with the new product molecule determined as before, as does each of the molecules mentioned in the reaction description of the analogous reaction. The smallest complete mapping set for the analogous reaction may then be determined as before. It is then possible to identify the replaced reagents as being those that were members of the first smallest complete mapping set but not of the second. A CML Reaction for the analogous reaction is then compiled. The CML Reaction for the reference reaction is copied and updated to reflect the differences between the two reactions. The product molecule of the original reaction is replaced with the new product molecule, while the new reagent molecules that are mentioned in the description of the analogous reaction and contained within the smallest complete mapping set are added to the reactantList and those identified as not occurring in the analogous reaction by the ModifiedReactionMapper are removed.

This method of resolving analogous reactions is suitable for application both to reaction descriptions that make explicit statements of a reagent to be replaced and to those that rely on the reader to be able to infer this information. When a reagent that is already present in the smallest complete mapping set for the reference reaction is again mentioned in the description of the analogous reaction, such as the carboxylic acid in the previous examples, it will be duplicated and passed to the ModifiedReactionMapper as though it were a novel reagent.
Since only one of the instances of the molecule is required in the ModifiedReactionMapper’s smallest complete mapping set, however, it can be guaranteed that only one of the copies of the molecule will occur in the CML Reaction produced for the analogous reaction.

5.3.2.5 Automated Verification of Product

Of course, it cannot be guaranteed that the connection table generated for the product molecule is correct. There are several potential sources of error – the name used in the heading in the original patent document may have been incorrect, the wrong heading may have been associated with the experimental section being processed, OSCAR3 may have misidentified a chemical name (e.g. it may have annotated only “ethanoic acid” in the string “ethanoic acid methyl ester”), or the connection table may have been incorrectly generated from the chemical name. As a result, it is desirable to be able to automatically verify the product in some way. This can be achieved by comparing the determined product to the extracted spectral data and, if present, any accompanying chemical images. The process of acquiring these sources of information must also be regarded as potentially inaccurate, and so it is not possible to definitively confirm or refute any candidate product. Nonetheless, these checks provide potentially useful information regarding the validity of the assigned product and of the assigned spectral data.

Checking Against 1H-NMR

Given the 1H NMR spectrum of an unknown compound, it is possible for one skilled in the art to discount certain candidate structures. Most trivially, the proton count in the candidate structure should agree with the total integral of the NMR spectrum. Each unique chemical environment in the candidate structure should give rise to a distinct peak in the NMR spectrum and it should be possible to assign for each of the chemical environments a peak that is in the correct region of the NMR spectrum.
The peak multiplicities should be explained by potential couplings in the candidate molecule, and protons that couple to one another should share coupling constants. The application of these rules is subject to a large amount of subtlety, however. While each chemical environment should give rise to an individual peak, these peaks can overlap and be indistinguishable from one another, most notably in the case of aromatic protons. The determination of unique chemical environments is complicated by the need to consider three dimensional effects, as illustrated in Figure 5-26. Here, a plane of symmetry defined by R, Ha and the carbon atom that connects them exists through the molecule. Thus the protons Hb are equivalent with one another but not with their geminal protons Hc, which sit on the other face of the ring, made inequivalent by the group R.

Figure 5-26: Proton environments in a non-trivial system

As a result of these effects, it is not possible to compute chemical environments based solely on the 2D connectivity of a molecule and confidently assert that this will equal the number of peaks reported in the molecule’s NMR spectrum. Whether there are more or fewer peaks in the NMR spectrum than predicted, the candidate molecule may be correct. Conversely, if the prediction matches the observation the candidate molecule may still be incorrect. The resolution of this problem falls outside of the scope of this project, and so the checking of structures against 1H NMR spectra is limited to the first method mentioned – ensuring that the proton count of the molecule agrees with the total integral of the spectrum. The result of this check is recorded in the automatically generated CML Reaction by adding a matchesHnmr attribute to the product molecule, with a value of true or false as appropriate.
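The proton-count check described above can be illustrated with a minimal Python sketch. This is not the PatentEye code (which is built on CML and JUMBO); the function names, the simple formula parser and the tolerance are assumptions made for illustration. The formula C17H11IN2O4 is likewise an assumption, obtained by replacing Br with I in the formula of the bromo analogue shown in Figure 5-23, and the integrals are those quoted in Figure 5-18.

```python
import re


def hydrogen_count(formula: str) -> int:
    """Count hydrogen atoms in a simple molecular formula, e.g. 'C17H11IN2O4'.

    Hypothetical helper: handles only flat formulae (no brackets or charges).
    """
    total = 0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            total += int(count) if count else 1
    return total


def matches_hnmr(formula: str, integrals: list[float], tolerance: float = 0.01) -> bool:
    """Return True if the proton count equals the total integral of the spectrum."""
    return abs(hydrogen_count(formula) - sum(integrals)) <= tolerance


# Integrals reported in Figure 5-18: 1H, 1H, 2H, 2H, 2H and 3H.
figure_5_18_integrals = [1, 1, 2, 2, 2, 3]
print(matches_hnmr("C17H11IN2O4", figure_5_18_integrals))
```

As the text notes, agreement here does not confirm the structure; the check can only flag candidates whose proton count disagrees with the reported integrals.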
Checking Against Mass Spectrum

As described in section 5.2.3.3, the regular expressions used by OSCAR3 to annotate mass spectra were developed in order to facilitate the specific annotation of peak assignments. Coupled with the reported mass, this provides the information required to check the candidate product against the reported mass spectrum. In this context, assignments are given as an offset from the molecular ion, e.g. “M+1” or “M–H”. The functionality to parse the assignment text and calculate the mass offset it implies is implemented in the MassSpecAssignment class, which supports both the numeric and formula offsets illustrated above. Regular expression matching determines the type of offset employed, or lack of an offset, e.g. “m/z: 271 (M+)”. Where a formula offset is used, the mass of the corresponding molecular fragment is determined using the CMLFormula class in CMLXOM. The reported mass is annotated as a numeric value by OSCAR3, and so once the mass offset has been determined it may be trivially added to or subtracted from the mass of the reported peak to calculate an expected mass for the candidate product molecule. This mass is not necessarily the same as the average molecular mass of the product, since in mass spectrometry it is the individual masses of the isotopomers that are measured. The discrepancy between the two would be of critical importance when considering, for example, any chlorine-containing product, and so it is necessary instead to be able to determine the masses of the isotopomers of a given molecule. This functionality was not previously available in JUMBO, and so was implemented and added to the MoleculeTool class for this purpose. The result of comparing the calculated mass of the reported molecular ion with the calculated mass of the most abundant isotopomer is recorded by adding a matchesMassSpec attribute to the product molecule in the automatically produced CML Reaction, with a value of true or false as appropriate.
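The offset handling described above might be sketched in Python as follows. This is an illustration only and does not reproduce the MassSpecAssignment or MoleculeTool classes: the regular expression, the function names, the 0.5 mass unit tolerance and the small table of isotope masses are all assumptions. As a further simplification, the sketch uses the monoisotopic mass (the most abundant isotope of every element), which coincides with the most abundant isotopomer for molecules such as chlorobenzene but not necessarily for molecules containing several chlorine or bromine atoms.

```python
import re

# Masses of the most abundant isotope of a few elements (illustrative subset).
MONOISOTOPIC = {"H": 1.007825, "C": 12.0, "N": 14.003074,
                "O": 15.994915, "Na": 22.989770, "Cl": 34.968853}

# Hypothetical pattern: "M", "M+", "M+1", "M-H", "M\u2013H" (en dash), "M+Na".
ASSIGNMENT = re.compile(r"^M\s*(?:([+\-\u2013])\s*(\d+|[A-Z][a-z]?)?)?$")


def mass_offset(assignment: str) -> float:
    """Return the offset from the molecular ion implied by an assignment."""
    match = ASSIGNMENT.match(assignment.strip(" ()"))
    if match is None:
        raise ValueError(f"unrecognised assignment: {assignment!r}")
    sign, term = match.groups()
    if sign is None or term is None:      # "M" or "M+": no offset
        return 0.0
    magnitude = float(term) if term.isdigit() else MONOISOTOPIC[term]
    return magnitude if sign == "+" else -magnitude


def monoisotopic_mass(formula: str) -> float:
    """Sum the monoisotopic masses for a flat formula such as 'C6H5Cl'."""
    return sum(MONOISOTOPIC[symbol] * (int(count) if count else 1)
               for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


def matches_mass_spec(formula: str, peak: float, assignment: str,
                      tolerance: float = 0.5) -> bool:
    """Compare the mass implied by the reported peak with the candidate."""
    expected = peak - mass_offset(assignment)
    return abs(monoisotopic_mass(formula) - expected) <= tolerance


# Chlorobenzene reported as "113 (M+1)": the monoisotopic mass is close to 112,
# whereas the average molecular mass (about 112.56) would fail this tolerance.
print(matches_mass_spec("C6H5Cl", 113.0, "M+1"))
```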
Checking Against Embedded Images

When the experimental section includes a chemical image it is possible to compare the connection table of the candidate product with the results from the OSRA analysis of that image, which by this stage in the workflow have been appended to the patent document XML (see section 5.2.5). If a chemical image is found within the source experimental section, the recorded SMILES strings for the image are built into CML Molecules which are then used to generate InChIs, using the SMILESTool and InChIGeneratorTool classes from JUMBO respectively. An InChI for the candidate product molecule is similarly generated, and the InChIs are subsequently compared. Since the image included in the experimental section may be a reaction diagram it is possible for OSRA to have identified more than one molecule. Since the analysis of the often low-quality images is an error prone procedure, it is possible that the structures identified in the image may not contain the correct product. As previously discussed, when OSRA fails to correctly deduce a connection table from a drawn structure, it frequently reports a result containing the wildcard character, “*”. This character is not recognised by JUMBO’s SMILESTool, causing it to throw an Exception. As a result, the following rules are applied when checking the candidate product against embedded images;

1. If the InChI generated for the candidate product matches one generated for the structures identified by OSRA, the product is considered to match the image.
2. If all of the structures identified by OSRA can be built into InChIs and the InChI for the candidate product does not match one of these, the product is considered not to match the image.
3. If some or all of the structures identified by OSRA cannot be built into InChIs and the InChI for the candidate product is not matched by one of those generated from the chemical image, no conclusion is drawn.
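The three rules above amount to a three-valued decision, which can be sketched in Python as follows. The function name and the use of None to represent both an OSRA structure that could not be built into an InChI and the "no conclusion" outcome are assumptions made for illustration; the InChI strings used in the example are those of toluene and benzene.

```python
from typing import Optional


def matches_image(product_inchi: str,
                  image_inchis: list[Optional[str]]) -> Optional[bool]:
    """Apply the three rules for comparing a candidate product with an image.

    image_inchis holds one entry per structure reported by OSRA; an entry is
    None when that structure could not be built into an InChI (for example
    because the SMILES string contained the wildcard character).
    """
    if not image_inchis:
        return None                       # no chemical image: nothing recorded
    if product_inchi in image_inchis:
        return True                       # rule 1: a structure matches
    if all(inchi is not None for inchi in image_inchis):
        return False                      # rule 2: all resolved, none matches
    return None                           # rule 3: unresolved structures remain


TOLUENE = "InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3"
BENZENE = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"

print(matches_image(TOLUENE, [BENZENE, TOLUENE]))   # rule 1
print(matches_image(TOLUENE, [BENZENE]))            # rule 2
print(matches_image(TOLUENE, [BENZENE, None]))      # rule 3
```

Returning None rather than False for rule 3 mirrors the decision not to add a matchesImage attribute when no conclusion can be drawn.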
If a conclusion is drawn from this process, it is recorded in the CML Reaction by the addition of a matchesImage attribute on the product CML Molecule, with a value of true or false as appropriate. If no conclusion is drawn, or if the source experimental section does not contain a chemical image, the CML Reaction is not modified.

5.3.3 Reaction Extraction Performance

As previously described, the inputs to the reaction extraction procedure are those sections of the semantically enhanced patent documents that are expected to describe experimental procedures. XPath is used to select the appropriate headings from the patent, and the section of the document that is subjected to the analysis is this heading along with all of its descendant nodes. A successful analysis of the experimental section results in a CML Reaction, as discussed in section 5.3.2. It is not the case, however, that every section of the document that is used as an input will successfully result in a CML Reaction. The reaction extraction procedure may fail to generate a CML Reaction for a number of reasons. These causes were assessed by monitoring the results of the procedure when run on the 26287 input sections selected from the set of semantically enhanced patent documents. From these input sections, 4444 CML reactions were successfully extracted, representing around 17% of the total input. The major causes of failure are subsequently discussed;

• No text-containing paragraph children – it is expected that text should be contained inside p elements; if the input section does not contain recognisable text and does not, therefore, describe an experimental procedure in a manner that the system is designed to handle then it does not attempt to produce a CML Reaction.
This situation arises where the restructuring of the patent document XML has not correctly organised the complete description of an experimental procedure into a single section of the semantically enhanced XML document, and was observed on 2595 occasions – around 10% of the reaction extraction attempts.

• Too many paragraph children – if the input section has more than one text-containing paragraph child, then the system does not attempt to process it. This strategy is employed since a situation where one heading has multiple paragraph children may describe a single-step synthesis split across the paragraphs or may describe a multi-step synthesis where each paragraph describes a different reaction. In this second scenario, the techniques currently used to abstract the reaction would be inappropriate and so in the absence of an available method to distinguish between the two cases, it is impossible to determine how to proceed. Recent developments in the ChemicalTagger project that classify the role of each phrase in an experimental description, for example to produce “dissolve”, “heat” and “yield” phrases, may provide a means by which the paragraphs may be determined to describe either a single or multiple synthesis though this has not been implemented within the current work. This situation was observed on 2747 occasions, or around 10% of the total.

• No product identified – the system attempts to determine the product by using OSCAR3 to recognise a single chemical name in the heading of the experimental section. If this process cannot be successfully completed, whether because the author has not named a single chemical compound in the heading or because OSCAR3 has failed to correctly identify the chemical name therein, the process aborts. This situation was observed on 5236 occasions, or around 20% of the total.

• No reagents identified – if the system fails to identify any reagents in the source text, i.e.
if there are no chemical names with associated amounts as recognised by ChemicalTagger, then no CML Reaction is produced. This situation may occur if the input describes an analogous reaction in which the backreference has not been annotated, if the patent author described the amounts of reactants in a format which is not supported by ChemicalTagger or if the input text does not describe a reaction and therefore does not describe amounts of chemicals. This situation was observed on 2562 occasions, or around 10% of the total.

Further to the causes examined and discussed above, the failure to resolve backreferenced reactions caused a significant number of failures. These failures occurred primarily because the reference reaction had not been successfully extracted, because the extracted reference reaction failed to correctly identify and resolve to a structure all of the reagents required for the resolution procedure to succeed, or because the maximum common substructure searching element of the resolution procedure timed out. Though these errors were not automatically categorised, they are believed to have occurred for around 30% of the inputs.

5.4 Conclusions

The work in this chapter has demonstrated the potential to create a system that downloads and processes the chemical literature and extracts machine-understandable data from it without the need for user interaction. The resulting CML reactions are further described in the next chapter. Much of the software that comprises PatentEye is likely to be of use to related projects in the future.
Though the code is not currently modularised, it has already seen reuse as part of the Green Chain Reaction (109) – a distributed experiment to answer the question “are chemical reactions in the literature getting greener?” The Green Chain Reaction reused the PatentEye code for the identification, downloading and semantic enhancement of chemical patents and demonstrated the value of the work described in this chapter – as well as suggesting areas in which the PatentEye code and some of the libraries upon which it depends would benefit from refactoring. Since the community is unlikely to adopt software tools that are not both robust and usable, this exercise has proved valuable and it is hoped that the points raised will be addressed in the future.

6. Results

The previous chapter described the means by which it was possible to accomplish high-throughput extraction of reactions from the patent literature. This chapter discusses the analysis of these extracted reactions to assess the performance of the PatentEye system (section 6.1) and briefly discusses the work of Dr Lezan Hawizy to enable the reuse of the results of the current work by third-parties (section 6.2).

6.1 Quality of Extracted Reactions

To assess the accuracy of the semantified reactions, the output of the reaction extraction process was manually examined and compared with the source text. Each CML reaction was assessed on a number of criteria to determine the performance of the different modules of the reaction extraction system. These criteria included the accuracy of identified products, reagents and spectra, and the performance of the systems for automated product verification was tested by comparing the results of the automated verification with those of the manual verification. The methods employed for this process and the results obtained are subsequently discussed.
6.1.1 Corpus Formation

Since the manual inspection of each and every reaction extracted from the patent texts was not a feasible task, a subset was selected to serve as a corpus from which to derive performance metrics. From the 4444 reactions successfully extracted from the 667 unique, full-text patent documents, 100 reactions were selected at random. This reaction corpus was then used in the subsequently described validation procedures.

During the manual inspection of the reaction corpus, it was discovered that two of the 100 CML Reactions were derived from multi-step syntheses that were described within a single paragraph. Since these cases did not reflect the kind of input for which the current software was designed, as discussed in section 5.3.3, they were excluded from the analysis process. A further two CML Reactions in the reaction corpus were found to have been derived from examples of their respective inventions that did not describe chemical syntheses – instead describing assays. These CML Reactions were similarly excluded from the analysis process; consequently, the process is based upon a reduced corpus of 96 CML Reactions.

6.1.2 Product Validation

The source from which the reaction was extracted was examined to determine whether the chemical name identified in the heading text by OSCAR3, and from which the product CML molecule was generated, agreed with that stated in the heading text. Since the name-to-structure conversion process is not a perfect procedure, this is no guarantee that the attached connection table is also correct. However, OPSIN was not developed as part of the current work and is reported to operate at an extremely high rate of performance (110), and so it was not considered necessary to measure the accuracy of this process. Each extracted reaction is required by the extraction process to contain one and only one product molecule and so the product validation for each reaction was recorded as a simple boolean.
The manual inspection of the reaction corpus showed that the correct product was identified on 88 of the 96 occasions, a success rate of around 92%. It was further noted that on each of the 8 occasions on which the correct product was not identified, the term identified as the product name could not be successfully resolved to a connection table, suggesting a means by which the errors may be automatically removed. Generally, the failure to identify the correct product arose because the product of the reaction was named in the accompanying text, and hence was not present in the section heading of the source; instead, a term from the heading was falsely identified as a chemical name, which allowed for the creation of a CML Reaction from the source.

The high rate of successful product identification suggests that the current techniques are sufficient to extract high-quality data from the literature in this regard. Those errors in product identification that exist at the present time may be largely eliminated in the future by the application of NLP techniques to permit the identification of products from the reaction description text.

6.1.3 Reagent Validation

The sources from which the reaction corpus was extracted were examined, and for each the reagents employed and the amounts thereof were identified. These were then compared with those automatically extracted. Instances where the same chemical name and amount were both manually identified and automatically extracted were counted as true positives. Where the automatically extracted reagent list contained an instance that was not matched by both chemical name and amount in those manually identified, a false positive was counted. Where a reagent was manually identified that was not automatically extracted, a false negative was counted.
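The counting scheme described above can be made concrete with a short sketch. It is illustrative only and is not part of PatentEye: the function names score_reagents and precision_recall and the example reagent lists are assumptions introduced here, and each reagent is represented simply as a (name, amounts) pair that must match exactly.

```python
from collections import Counter


def score_reagents(manual, extracted):
    """Count true positives, false positives and false negatives.

    Each reagent is a (name, amounts) tuple; a match requires both the
    name and every amount to be identical, mirroring the strict criterion
    described in the text.
    """
    gold = Counter(manual)
    found = Counter(extracted)
    tp = sum((gold & found).values())   # present in both lists
    fp = sum((found - gold).values())   # extracted but not manually identified
    fn = sum((gold - found).values())   # manually identified but not extracted
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Standard definitions: TP / (TP + FP) and TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example data, invented for illustration.
manual = [
    ("triphenylphosphine", ("3.08 g", "11.78 mmol")),
    ("toluene", ("50 mL",)),
]
extracted = [
    ("triphenylphosphine", ("11.78 mmol",)),  # right name, one amount missing
    ("toluene", ("50 mL",)),
]

counts = score_reagents(manual, extracted)
scores = precision_recall(*counts)
```

In this example the extracted entry for triphenylphosphine carries the correct name but only one of its two amounts, so it is counted once as a false positive and once as a false negative; the counts are (1, 1, 1) and both precision and recall are 0.5.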
This work required the formalisation of the concept of a reagent to a sufficient degree that any subjectivity in determining what did and did not constitute a reagent could as far as possible be minimised. The IUPAC definition, “a test substance that is added to a system in order to bring about a reaction or to see whether a reaction occurs” (107), does not match the common usage of the term, which further includes the chemical species involved in a reaction, i.e. reactants, solvents, catalysts, etc. It is this wider definition that fits the goal of the current work – to automatically determine how a reaction is carried out.

It was observed when considering this task that the chemical literature frequently underspecifies the work-up stage of a reaction. That is to say, the reagents employed may be stated without reference to their amounts, as in:

“The reaction mixture was stirred at 25 °C for 4 days and then diluted with ethyl acetate. The mixture was then washed with a dilute aqueous hydrochloric acid solution. At this time, methanol was added to the organic layer. A precipitate formed and was removed by filtration. The organics were further washed with a saturated aqueous sodium chloride solution, dried over magnesium sulfate, filtered, and concentrated in vacuo. The resulting solid was triturated with diethyl ether. The solid was collected by filtration and washed again with diethyl ether to afford…” (111)

While the work-up is an undeniably important phase of a reaction, the techniques used in the current work are reliant on the specification of amounts in order to identify reagents.
This technique is well-suited to identification of primary reagents but not those used in work-up, and so in order to produce a metric that indicates the performance of the software in the role for which it was designed it was decided to entirely omit reagents mentioned in the work-up phase, and inert atmospheres under which reactions were performed, from the current analysis.

The manual inspection of the reaction corpus identified 249 true positives, 71 false positives and 139 false negatives – the system having a precision of around 78% and recall of around 64%. When considering these results, it should be remembered that the requirement for an identified reagent to be considered a true positive – that not only the chemical name but also the amounts employed in the reaction be identical to those described in the source text – is a rigorous standard. It was commonly the case during the analysis that the system identified the correct chemical name as a reagent but failed to correctly add one or more amounts, creating both a false positive and a false negative. These situations occurred where one or more of the amounts in the source text were not recognised by the regular grammar employed by ChemicalTagger for amount recognition. Frequently these situations were caused by the patent author employing a structure that may be considered incorrect, e.g. “triphenylphosphine (3.08g., 11.78 mmol)” or “1-Phenylpiperazine (16.2 g, 0.10 mole)”. The non-standard full stop indicating the abbreviation of “grams” in the first example and the failure to contract the unit “mole” to its standard symbol “mol” in the second result in the failure to recognise and convert these amounts to CML. The data gathered in the current exercise permit the improvement of the ChemicalTagger grammar to recognise a greater variety of the reporting formats used by authors and thereby improve the precision and recall for the identification of reagents as measured by the current methods.
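As an indication of the kind of tolerance that would be needed, the following is a minimal sketch of an amount recogniser that accepts both of the problematic formats quoted above. It is a standalone regular expression written for illustration and is not the ChemicalTagger grammar; the unit table UNIT_ALIASES and the function parse_amounts are hypothetical names introduced here, and the table is deliberately small.

```python
import re

# Hypothetical alias table: spelling variants mapped to a canonical symbol.
UNIT_ALIASES = {
    "g": "g", "gram": "g", "grams": "g",
    "mg": "mg",
    "mol": "mol", "mole": "mol", "moles": "mol",
    "mmol": "mmol", "mmole": "mmol", "mmoles": "mmol",
    "ml": "mL", "l": "L",
}

# A number, optional whitespace, an alphabetic unit and an optional
# trailing full stop (as in "3.08g.").
AMOUNT = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)\.?")


def parse_amounts(text):
    """Return (value, canonical_unit) pairs for every recognised amount."""
    amounts = []
    for match in AMOUNT.finditer(text):
        unit = UNIT_ALIASES.get(match.group("unit").lower())
        if unit is not None:  # ignore number+letters that are not known units
            amounts.append((float(match.group("value")), unit))
    return amounts


first = parse_amounts("triphenylphosphine (3.08g., 11.78 mmol)")
second = parse_amounts("1-Phenylpiperazine (16.2 g, 0.10 mole)")
```

Normalising the unit through a lookup table rather than encoding every spelling in the pattern keeps the pattern short, and locants in chemical names such as the leading “1-” are skipped because they are not followed by a recognised unit.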
These improvements, however, are not sufficient on their own to produce a system that operates at the level of a human operator. The current system requires further development before the data it produces are of sufficient quality to be considered reliable by the community at large.

6.1.4 Spectra Validation

The extracted reactions contain, where identified and successfully converted to CML, the 13C and 1H NMR spectra of the products. In the patents used for this work, 1H NMR spectra are far more common than 13C (see section 5.2.3.4) – indeed, the manually examined subset of the reaction corpus was found to contain only two 13C NMR spectra. Consequently, only the validity of the attached 1H NMR spectra in the reaction corpus was considered. Where these spectra were present, the content was compared to the reported spectra in the original sources. In order to be considered correct, the attached spectra were required to fully describe the original spectra in terms of the shifts, integrals, multiplicities and coupling constants of each peak – any deviation from what was reported in the original text resulted in the attached spectrum being judged to be incorrect.

The manual inspection identified 25 occasions on which the 1H NMR spectrum attached to a product molecule precisely replicated the information presented in the source text and 8 occasions on which it did not, i.e. a success rate of around 76%. The primary causes of the inclusion of incorrect 1H NMR spectra were the failure to fully convert peak metadata, e.g. multiplicities, as identified by OSCAR3 to CML and the conversion to CML spectra of sections of input text that did not indicate 1H NMR spectra, i.e. false positives in the data recognition procedure.
The first of these issues indicates a bug in JUMBOConverters that could be relatively trivially identified and fixed. The second issue is expected to produce 1H NMR spectra that could be automatically distinguished from genuine NMR spectra in a majority of cases, since false positives will rarely contain expected peak metadata such as integrals and multiplicities. This was not, however, attempted in the current work since the identified sample was so small as to be statistically meaningless. Though the 1H NMR spectra validation is based on a small set of data, it is believed that the spectra identified by PatentEye are of nearly sufficient quality that they constitute a resource of value to the community.

6.1.5 Automated Verification Validation

For each extracted reaction in the corpus, the results of the automated product verification, as described in 5.3.2.5, were compared with those of the manual product validation. For each of the methods of automated product verification – comparison of the product structure with chemical images, with its mass spectrum and with its 1H NMR – the automated verification was judged to be correct where the automated and manual verifications were in agreement and judged to be incorrect where they were not. The results of this examination of the reaction corpus are subsequently discussed.

6.1.5.1 1H-NMR Verification

As discussed in section 5.3.2.5, the verification of products using their 1H NMR spectra is performed by matching the proton count of the candidate product molecule to the total integral of its 1H NMR spectrum. Of the 33 products in the reaction corpus with attached 1H NMR spectra, 23 (70%) were considered to have matched the product molecule while 10 (30%) were not. In all of the 33 cases, the product was manually judged to have been identified correctly.
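The proton-count comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the PatentEye implementation: the candidate molecule is represented by a simple molecular formula string (no brackets or charges), and the function names hydrogen_count and integral_check and the tolerance value are hypothetical.

```python
import re


def hydrogen_count(formula):
    """Count hydrogen atoms in a simple formula such as 'C3H8O'.

    Assumes a flat formula with no parentheses, isotopes or charges.
    """
    total = 0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "H":
            total += int(count) if count else 1
    return total


def integral_check(formula, integrals, tolerance=0.5):
    """Compare the proton count with the total integral of a spectrum.

    Returns (is_match, shortfall), where shortfall is the number of
    protons in the structure that the spectrum does not account for.
    """
    expected = hydrogen_count(formula)
    observed = sum(integrals)
    shortfall = expected - observed
    return abs(shortfall) <= tolerance, shortfall


# Isopropanol, C3H8O: a spectrum reporting all eight protons matches ...
full = integral_check("C3H8O", [6, 1, 1])
# ... whereas one reporting only seven protons is flagged as a non-match.
short = integral_check("C3H8O", [6, 1])
```

A check of this form treats any shortfall as a failure, so a spectrum that reports fewer protons than the structure contains is rejected even when the product itself has been identified correctly.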
The 10 cases in which the extracted 1H NMR spectrum was judged not to match the candidate product molecule were examined to determine the cause. In one of these cases, the spectrum had been derived from a false positive in the OSCAR3 data recognition process. In the remaining nine cases, the total integral of the recorded spectrum was less than would be expected for the candidate product molecule. This is generally presumed to have been a consequence of the method by which the spectrum was acquired rather than indicative of an incorrect spectrum or mistaken product. The nine candidate product molecules and the solvents in which the spectra were reported to have been acquired are shown in Table 6-1.

Candidate Product                 Solvent
[structure not reproduced]        CDCl3
[structure not reproduced]        CDCl3
[structure not reproduced]        CDCl3
[structure not reproduced]        DMSO-d6
[structure not reproduced]        DMSO-d6
[structure not reproduced]        D2O
[structure not reproduced]        CD3OD
[structure not reproduced]        CD3OD
[structure not reproduced]        CD3OD

Table 6-1: Candidate product molecules for which the 1H NMR was considered a non-match, with likely missing hydrogen atoms indicated in red

For each of the candidate product molecules in Table 6-1, the n most acidic protons, where n is the difference between the proton count of the candidate product molecule and the total integral of the 1H NMR spectrum, are shown in red. It is supposed that in each case, those protons were either dissociated or exchanged with acidic deuterons from the solvent and so were not reflected in the observed spectrum. This demonstrates the fragility of the chosen method for validation of products via their 1H NMR spectra, and so it is recommended that in the future a more advanced method of structure-spectrum comparison be employed.

6.1.5.2 Mass Spectrum Verification

Of the 24 products in the reaction corpus for which automated verification against a mass spectrum was attempted, 18 (75%) were considered to have matched the candidate product molecule while 6 (25%) were not. In all of the 24 cases, the product was manually judged to have been identified correctly.
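The details of the mass spectrum check are given in section 5.3.2.5 and are not reproduced here. Purely as an illustration of the general idea, the sketch below assumes that a reported m/z value is compared with the monoisotopic mass of the candidate molecule plus one proton; the element table, the tolerance and the function names monoisotopic_mass and matches_protonated_ion are assumptions made for this example.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of a few elements.
# The table is intentionally small; other elements would need to be added.
MONOISOTOPIC = {
    "C": 12.000000,
    "H": 1.007825,
    "N": 14.003074,
    "O": 15.994915,
    "F": 18.998403,
    "Cl": 34.968853,
}
PROTON = 1.007276  # mass of a proton (u)


def monoisotopic_mass(formula):
    """Mass of the most common isotopomer for a flat formula such as 'C3H8O'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def matches_protonated_ion(formula, reported_mz, tolerance=0.5):
    """True if reported_mz agrees with the [M+H]+ ion within the tolerance."""
    expected = monoisotopic_mass(formula) + PROTON
    return abs(reported_mz - expected) <= tolerance


# Isopropanol, C3H8O, used here only as a convenient small example.
mass = monoisotopic_mass("C3H8O")
ok = matches_protonated_ion("C3H8O", 61.1)
bad = matches_protonated_ion("C3H8O", 60.1, tolerance=0.3)
```

A check restricted to a single ion type in this way cannot interpret other assignments, such as fragment ions or isotope patterns, and would report them as mismatches.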
The cases in which the reported mass spectrum was judged not to be a match to the candidate product molecule represent a combination of cases in which the OSCAR3 methodology was insufficient to deal with the reported mass spectrum, e.g. “m/z: 594.9 (M+H-HCl)” and “MS m/z 471, 473 (M+1, M+3)”, and those in which the reported spectrum and assignment are apparently wrong, e.g. for the case of 2-fluoro-4-(4-fluoro-benzyloxy)-nitrobenzene (where the mass of the most common isotopomer is 265.1), the spectrum was reported as “MS: m/e = 265.1 (M++H)”.

6.1.5.3 Image Verification

Of the 26 products in the reaction corpus for which automated verification against an attached image was attempted, 8 (31%) were considered to have matched the candidate product molecule while 18 (69%) were not. In all of the 26 cases, the product was manually judged to have been identified correctly. That the automated verification of candidate product molecules against attached images is highly error-prone should not be surprising, since the chemical image recognition package employed in the current work has previously been shown to perform poorly on a corpus of images taken from EPO patents (see section 5.2.5.2).

6.1.5.4 Conclusions

Given the high accuracy (92%) of the PatentEye system in terms of correctly identifying the product of a chemical reaction compared to the accuracy of the implemented systems of automated product verification, it may be considered that this verification process is at best unnecessary and should be discontinued in future versions of the PatentEye system. However, since the NMR spectra reported in the literature are themselves data that are likely to be of interest and of use to the community, it is desirable to continue to collect and to validate these. The current, naïve, system of validation has been shown to be insufficient for the purpose and so a more capable method must be developed for the future.
It is envisioned that such a system would operate in two parts: a “sanity check” would identify those spectra that are clearly in error and included as a result of a false positive in the OSCAR3 data recognition, e.g. those that occur in the middle of a chemical name, before a chemically intelligent system ensures that the extracted spectrum could feasibly correspond to the candidate product molecule under the specified conditions. Such a system would be capable of fully automatic production of high volumes of reliable data to be distributed to the community.

6.2 Enabling Reuse of the Extracted Data

This thesis began by discussing the need for machine-understandable and open data. The work described up to this point has shown how it is possible to extract machine-understandable data from the chemical literature, but the issue of how this data may be disseminated in a way that supports the semantic web of chemistry has not been addressed. Of course, the XML files produced as part of the current work could be hosted on a webserver from where they could be downloaded by interested members of the community. This approach, however, neglects to provide interoperability between the data produced by PatentEye and that produced by any other source. In this scenario, drawing together data from different sources is a problem that would need to be solved and re-solved by any number of different users.

The issue of how best to produce linked data is addressed by the Resource Description Framework (RDF) (112), a web-standard technology for the encoding of knowledge. In RDF, statements about resources (concepts) are made using the subject-predicate-object format, e.g. “Marlon Brando starred in The Godfather”, in which the subject is “Marlon Brando”, the predicate is “starred in” and the object is “The Godfather”.
Resources, such as “Marlon Brando” and “The Godfather”, are defined by Uniform Resource Identifiers (URIs), allowing other authors of RDF to make further statements about the same resources such as “Marlon Brando married Anna Kashfi” or “Al Pacino starred in The Godfather” and for the knowledge represented by the statements to be automatically combined – provided that the authors use the same URIs to define the same resources. Using RDF for the communication of chemical knowledge has the additional advantage that a host of tools for its usage exist, such as the open-source framework Sesame (113) and the query language SPARQL (114).

This conversion of the information extracted from the chemical literature to RDF has been addressed by Dr Lezan Hawizy, with the creation of the PatentEye Repository (115). In the RDF files upon which the PatentEye Repository is based, the reactions extracted from the literature in section 5.3 are combined with the molecular classifications derived in section 4.3. Each instance of each chemical from the reactions is represented by a separate resource. Each of these resources links to a resource that defines the molecule concerned, thus allowing the different instances of the same chemical from across the literature to be drawn together and to be combined with data from other sources. The URIs for the resources used to define the molecules are provided by the Chemical RDF project hosted by OpenMolecules (116) and are of the form http://rdf.openmolecules.net/?XXX where XXX is the InChI of the molecule concerned. Some example RDF from the PatentEye Repository is shown in Figure 6-1, and the structure of this data is illustrated in Figure 6-2.
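The way in which shared URIs allow statements from different authors to be combined can be illustrated with a few lines of code. The sketch below models triples as plain tuples and uses no RDF library, Sesame or SPARQL; the namespace http://example.org/ and the helper names merge and subjects are hypothetical and chosen only for this example.

```python
EX = "http://example.org/"  # hypothetical namespace for the example resources

# Statements made independently by two authors as (subject, predicate, object).
author_a = {
    (EX + "MarlonBrando", EX + "starredIn", EX + "TheGodfather"),
    (EX + "MarlonBrando", EX + "married", EX + "AnnaKashfi"),
}
author_b = {
    (EX + "AlPacino", EX + "starredIn", EX + "TheGodfather"),
}


def merge(*graphs):
    """Combine graphs; identical URIs denote the same resource."""
    return set().union(*graphs)


def subjects(graph, predicate, obj):
    """All subjects linked to obj by predicate, in sorted order."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


combined = merge(author_a, author_b)
cast = subjects(combined, EX + "starredIn", EX + "TheGodfather")
```

Because both authors use the same URI for “The Godfather”, the query over the combined set returns both actors without any reconciliation step; had they used different URIs for the film, the two statements would have remained unconnected.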
<rdf:Description rdf:about="http://www.patenteye.com/#ep01778686b1">
  <j.0:hasReaction rdf:resource="http://www.patenteye.com/#ep01778686b1-h0023" />
  <rdf:type rdf:resource="http://www.patenteye.com/#Patent" />
</rdf:Description>
<rdf:Description rdf:about="http://www.patenteye.com/#ep01778686b1-h0023">
  <rdf:type rdf:resource="http://www.patenteye.com/#Reaction" />
  <j.0:hasMolecule rdf:resource="http://www.patenteye.com/#ep01778686b1h0023isopropylalcohol200" />
</rdf:Description>
<rdf:Description rdf:about="http://www.patenteye.com/#ep01778686b1h0023isopropylalcohol200">
  <j.0:hasCompound rdf:resource="http://rdf.openmolecules.net/?inchi=1/c3h8o/c1-3(2)4/h3-4h,1-2h3" />
  <j.0:hasTitle rdf:datatype="http://www.w3.org/2001/XMLSchema#string">isopropyl alcohol</j.0:hasTitle>
  <j.0:hasAmount rdf:resource="http://www.xml-cml.org/schema#mg_d0660b66763f4dd1" />
  <rdf:type rdf:resource="http://www.patenteye.com/#spectator" />
</rdf:Description>
<rdf:Description rdf:about="http://rdf.openmolecules.net/?inchi=1/c3h8o/c1-3(2)4/h3-4h,1-2h3">
  <j.0:hasName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Isopropanol</j.0:hasName>
  <j.0:hasInChI rdf:datatype="http://www.w3.org/2001/XMLSchema#string">InChI=1/C3H8O/c1-3(2)4/h3-4H,1-2H3</j.0:hasInChI>
  <rdfs:subClassOf rdf:resource="http://www.patenteye.com/#ClassName=EPO01015" />
</rdf:Description>
<rdf:Description rdf:about="http://www.patenteye.com/#ClassName=EPO01015">
  <j.0:hasName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">low molecular weight aliphatic alcohol</j.0:hasName>
</rdf:Description>
<rdf:Description rdf:about="http://www.xml-cml.org/schema#mg_d0660b66763f4dd1">
  <j.0:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema#float">72.20</j.0:hasValue>
  <rdf:type rdf:resource="http://www.xml-cml.org/schema#mg" />
</rdf:Description>

Figure 6-1: Sample RDF from the PatentEye Repository

[Diagram not reproduced. Node labels: ep01778686b1; ep01778686b1-h0023; ep01778686b1h0023isopropylalcohol200; http://rdf.openmolecules.net/?inchi=1/c3h8o/c1-3(2)4/h3-4h,1-2h3; mg_d0660b66763f4dd1; 72.20 (mg); isopropyl alcohol; http://www.patenteye.com/#ClassName=EPO01015; low molecular weight aliphatic alcohol. Edge labels: hasReaction, hasMolecule, hasCompound, hasTitle, hasAmount, hasName, hasValue, subClassOf.]

Figure 6-2: Diagrammatic illustration of PatentEye Repository RDF

The examples shown in Figure 6-1 have been heavily trimmed and selected to exemplify the primary features of the PatentEye Repository RDF. This data format will, in the future, allow external parties to access and to interact with the information produced by PatentEye and to incorporate their own data in a highly automated fashion – removing the need for much of the data entry that is currently required when combining datasets. This functionality is predicated upon the usage of the InChI, and the OpenMolecules URIs, to define the chemical species. InChIs lack the ability to represent complex organometallics and polymers (33), for example, and so this technique will not currently be appropriate in all cases but for the majority of small molecules it is likely to be successful.

7. Conclusions

This work has demonstrated how existing and novel technology can be combined to produce machine-understandable chemical information that could be used in support of the semantic web for chemistry and, in particular, how this information can be automatically liberated from existing digital documents. Given the restrictions that surround copyrighted documents, this work was directed towards the patent literature.

Chapter 1 gave an introduction to the problem to be addressed in this thesis, while chapter 2 discussed the existing technologies that were used in the current work and the potential sources of documents that could be used.

Chapter 3 described the extension of Chemical Markup Language to describe Markush structures.
Markush structures are of crucial importance to chemical patents, and there currently exists no open standard for describing them. The work in this thesis has demonstrated the creation of semantics, in the form of Extended Polymer Markup Language (EPML), that allow for the description of much of the variability employed in the patent literature, and has demonstrated functioning software for the exemplification and substructure searching of these EPML descriptions. The software and the semantics developed as part of the current work require further refinement before they are published for the community to use, but represent an important proof of concept and initial demonstration of technologies.

Chapter 4 described the implementation of a system for the automatic acquisition of hyponymic relations from the literature. Hyponymic relations form a key part of ontologies – formal representations of knowledge – which play an important role in the Semantic Web. The current manual curation of chemical ontologies such as ChEBI is a time-consuming procedure and the creation of a technology that allows semi-automatic curation holds great potential. The system has been validated by extracting relations from a small patent set, and shown both to operate at a high rate of performance and to discover relations that do not currently form a part of the ChEBI ontology. The technology is consequently recommended for adoption by the community, and current collaboration between the Unilever Centre and the European Bioinformatics Institute is moving in this direction.

Chapters 5 and 6 described the development, implementation and validation of PatentEye – a system for the automatic extraction of chemical reactions from the literature. The extracted data describes the primary products, with attached NMR spectra where possible, and lists the reagents used in the syntheses.
The reactions extracted by this method have been manually validated and it is shown that, while the system produces encouraging results, further work is required if there is to be a high level of confidence that the extracted data fully describes the reported reactions. The problem of automatic extraction of chemical reactions is highly non-trivial, and it is unlikely that any system will ever be infallible, but recent developments in the ChemicalTagger tool that assign roles (e.g. “add”, “heat”, “wash”) to reaction phrases will enable the assignment of roles to chemical entities detected in an experimental write-up. In turn, this will enable a means to extract not just the components of the reaction but the method too – the recipe as well as the ingredients – in addition to a more reliable means to identify the reagents than at present.

Once an acceptably reliable means of reaction extraction has been implemented, the potential scale on which data can be rapidly extracted from the literature is immense. The current work extracted 4444 reactions from EPO publications covering a period of 10 weeks. With further development increasing the rate of recall from the literature, and the inclusion of patents published by the USPTO and the WIPO, the scale on which PatentEye extracts reactions could easily reach 100,000 reactions/year, while the open access literature offers a further opportunity to increase the scale on which PatentEye operates. Based on the sample of reactions analysed in section 6.1.4, around a third of these might be expected to contain NMR spectra. With the available archive of digitised patent documents dating back many years, such a collection would immediately dwarf the open NMR database NMRShiftDB (117; 118; 119), which contains around 44000 spectra as of October 2010, and would be of great use to organic chemistry students and researchers as well as to creators of NMR prediction and structure elucidation software.
The reactions derived by this system would provide an excellent overview of the types of chemistry being performed on a small scale in industry. This would allow for the answering of questions that are fascinating on an intellectual and organisational level, such as “how long after a novel method is published is it adopted by industry?” and “what differences exist in the synthetic methods used, by company and by country?” It would also permit the immediate determination of all synthetic routes for a given transformation and the identification of those functional groups in the reactant that are compatible with it and, by their omission, those implied to be incompatible. For example, the data would be expected to show that ketones may be reduced to alcohols both by LiAlH4 and NaBH4 and that their reaction with NaBH4 is compatible with the presence of an ester group but that the reaction with LiAlH4 is not. Again, such information is likely to prove useful to organic chemistry students and researchers, and the provision of the data in an open, linked and re-usable form as described in section 6.2 will be of great use to other informatics specialists in the short term and to automated researchers in the longer term.

The software developed during the current work is also likely to prove useful to the community. As discussed in section 5.4, the PatentEye software has already seen use as part of the Green Chain Reaction project. It is hoped that in the future, opportunities will arise to further develop, refine and modularise the software used in the current work so that it may be released under an open licence for the community to reuse freely where appropriate.

8. Bibliography

1. CAS Database Content at a Glance. [Online] [Cited: 10 August 2010.] http://www.cas.org/expertise/cascontent/ataglance/index.html.
2. CAS Databases - CAPlus, Journal and Patent References. [Online] [Cited: 3 November 2010.] http://www.cas.org/expertise/cascontent/caplus/index.html.
3.
CAS REGISTRY - The gold standard for substance information. [Online] [Cited: 3 November 2010.] http://www.cas.org/expertise/cascontent/registry/index.html. 4. CAS Databases - CASREACT, Chemical Reactions. [Online] [Cited: 3 November 2010.] http://www.cas.org/expertise/cascontent/casreact.html. 5. The Semantic Web. T. Berners-Lee, J. Hendler, O. Lassila. 17 May 2001, Scientific American. 6. The Automation of Science. R. D. King, J. Rowland, S. G. Oliver, M. Young, W. Aubrey, E. Byrne, M. Liakata, M. Markham, P. Pir, L. N. Soldatova, A. Sparkes, K. E. Whelan, A. Clare. 2009, Science, Vol. 324, pp. 85-89. 7. Mining chemical structural information from the drug literature. D. L. Banville 2006, Drug Discovery Today, Vol. 11, Issue 1, pp. 35-42. 8. The Next Big Thing: From Hypermedia to Datuments. P. Murray-Rust, H. S. Rzepa. 2004, Journal of Digital Information, Vol. 5, Issue 1. 9. Chemistry Add-in for Word. [Online] [Cited: 4 November 2010.] http://research.microsoft.com/en- us/projects/chem4word/. 10. Computers learn chemistry. R. van Noorden. 2007, Chemistry World. Vol. 4, Issue 2, pp. 10. 11. Project Prospect. [Online] [Cited: 4 November 2010.] http://www.rsc.org/Publishing/Journals/ProjectProspect/. 12. Semantic enrichment of journal articles using chemical named entity recognition. C. R. Batchelor, P. T. Corbett. 2007. Proceedings of the ACL 2007 Demo and Poster Sessions, Stroudsburg, PA, USA. pp. 45-48. 13. CrystalEye. [Online] [Cited: 1 September 2010.] http://wwmm.ch.cam.ac.uk/crystaleye/. 14. ChemicalTagger. [Online] [Cited: 4 November 2010.] http://bitbucket.org/lh359/chemicaltagger. 15. OSCAR. [Online] [Cited: 4 November 2010.] http://sourceforge.net/projects/oscar3-chem/. 16. XOM. [Online] [Cited: 4 November 2010.] http://www.xom.nu/. 17. Chemical Markup Language. [Online] [Cited: 4 November 2010.] http://sourceforge.net/projects/cml/. 217 18. CHIC – Converting Hamburgers Into Cows. J. A. Townsend, J. Downing, P. Murray-Rust. 2009. 
Appendix A

Hyponymic Relations

Hyponymic (“is-a”) relations exist between two terms where one, the hyponym, is a subset of the other, the hypernym. For example, “vehicle” is a hypernym of “car”, and “Ford Fiesta” is a hyponym of “car”. Hypernyms and hyponyms may take the form of single words or of phrases, as shall be seen later. This task aims to quantify the performance of a system for the automatic acquisition of such relations based on Hearst Patterns.

Hearst Patterns

Hearst first proposed the use of lexico-syntactic patterns for the automatic acquisition of hyponymic relations, thereafter known as Hearst Patterns. She described six patterns that could be employed:

Format                   Example                                                   Pattern Name
HYPER such as HYPO       Apolar solvents such as THF and hexane                    SUCH_AS
such HYPER as HYPO       Such bases as NaOEt or LDA                                SUCH_FOO_AS
HYPO or other HYPER      MeCl, EtBr or other organohalides                         OR_OTHER
HYPO and other HYPER     Benzene, ethylene oxide and other carcinogens             AND_OTHER
HYPER including HYPO     Methyl ketones including acetone                          INCLUDING
HYPER especially HYPO    Grignard reagents, especially methyl magnesium chloride   ESPECIALLY

wherein HYPER indicates the hypernym, indicated in bold, and HYPO the hyponym(s), indicated in italics.
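As an illustration of the SUCH_AS row of the table, the following is a minimal Python sketch that splits a short, already isolated phrase into a hypernym and its hyponyms. It is not the HearstFinder implementation: the function name `split_such_as`, the determiner list and the separator expression are assumptions made for this example, and the input is assumed to contain nothing but the pattern itself.

```python
import re

# Leading determiners that are dropped from hyponyms (illustrative list).
DETERMINERS = {"a", "an", "the", "some", "any"}

# Separators allowed between hyponyms: a comma followed by whitespace
# (optionally followed by "and"/"or"), or a bare "and"/"or".
# Requiring whitespace after the comma keeps names such as
# "1,2-dichloroethane" in one piece.
SEPARATOR = re.compile(r"\s*,\s+(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def split_such_as(phrase):
    """Split 'HYPER such as HYPO[, HYPO ...]' into (hypernym, hyponyms).

    Returns None unless the phrase contains exactly one "such as",
    so nested patterns are rejected rather than guessed at.
    """
    parts = re.split(r"\s+such as\s+", phrase.strip())
    if len(parts) != 2:
        return None
    hypernym, tail = parts
    hyponyms = []
    for item in SEPARATOR.split(tail.rstrip(" .")):
        words = item.split()
        # Drop a single leading determiner, e.g. "the tertiary carbocation".
        if words and words[0].lower() in DETERMINERS:
            words = words[1:]
        if words:
            hyponyms.append(" ".join(words))
    if not hypernym or not hyponyms:
        return None
    return hypernym, hyponyms


if __name__ == "__main__":
    print(split_such_as("Apolar solvents such as THF and hexane"))
```

A purely lexical split of this kind cannot decide where a hypernym begins in running text, nor can it handle cases such as “metal oxides such as potassium and calcium oxide”, which is why manual annotation is used to judge such cases.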
Each of these patterns has been assigned a name for ease of reference. In the example for the SUCH_AS pattern, the text communicates the information that THF and hexane are examples of apolar solvents. This information is readily available to a fluent speaker of the English language, regardless of whether or not they are aware of what “THF”, “hexane” or an “apolar solvent” are.

The Task

• This task focuses solely on the SUCH_AS pattern. You will be presented with a series of paragraphs, each containing one or more instances of the phrase “such as”. You should read each paragraph and identify the Hearst Pattern(s). For each Hearst Pattern you should identify the hypernym and the hyponym(s). You should use your own judgement to determine where in the text the pattern and the hypernym and hyponym(s) begin and end, based on the guidelines that follow and using your background knowledge or other sources to ensure that the hyponymic relations you find are correct. Thus, in the case of “esters of unsaturated carboxylic acids such as maleic acid”, the hypernym is “unsaturated carboxylic acids”, not “esters of unsaturated carboxylic acids”.
• A Hearst Pattern must be composed of a single hypernym, the phrase “such as”, a single hyponym or a list of hyponyms, optionally including leading determiners (e.g. “a”, “the”, “some”, “any”), and nothing else. You should not annotate a leading determiner as part of a hyponym unless you consider it vital to the meaning of the hyponym term. Thus, for example, you should annotate “stable carbocations such as the tertiary carbocation”.
• Where more than one hyponym is found the list must be continuous. Whitespace, punctuation and conjunctions (“and” and “or”) are allowed to separate the list; other words are not. Thus, “polar solvents such as, for example, DMSO” should not be annotated. Where this extra text occurs inside a list of hyponyms, e.g.
“polar solvents such as DMSO, THF – the most commonly used solvent – and acetone”, you should annotate the hyponyms that occur prior to the extra text and the Hearst Pattern as far as the end of the last annotated hyponym.
• We are looking specifically for Hearst Patterns that inform us about the chemical domain. You should only include patterns in which all of the hyponyms are terms that have structural meaning, such as chemical structures, structural features or structural classes. Hyponyms thus need not correspond to specific chemicals, so for example may include “methyl group” and “beta-lactams” as well as specific molecules. Specific molecules may be identified by, for example, trivial names, systematic names, semi-systematic names, abbreviations such as “DMAP” or formulae such as “C3H8” or “MeCOCH2Cl”.
• Hypernyms should therefore denote classes of chemical structures, structural features or structural classes. Hypernyms may themselves be structural classes (e.g. “beta-lactams”), but may also be based on function (e.g. “5-HT antagonists”), usage (e.g. “solvents”), properties (e.g. “visible-light absorbing molecules”) or something far more ephemeral (e.g. “interesting functional groups”). These examples should be considered illustrative rather than restrictive.
• Hypernyms may include adjectives where the adjective forms a part of the hypernym, e.g. in “…using a non-polar solvent such as cyclohexane”, but not in “…by dissolution in cyclohexane or an alternative solvent such as benzene”.
• Where suitable patterns are found, annotate the entire text of the pattern by using “click-and-drag” to select the appropriate text within the OSCAR scrapbook and clicking the “pattern” button. Then annotate the hypernym and the individual hyponyms similarly by using the “hyper” and “hypo” buttons respectively.
• All annotations must begin and end at word boundaries.
• It is not necessarily the case that all Hearst Patterns may be annotated in this way.
Consider, for example, “metal oxides such as potassium and calcium oxide”. “Potassium and calcium oxide” describes the hyponyms, but potassium is not a metal oxide, and “potassium oxide” cannot be annotated as the words do not occur together. In this case you should annotate “metal oxides” as the hypernym, and “calcium oxide” as the only hyponym. This point should be applied where the final word significantly modifies the meaning of the non-final items in the list and not to cases such as “alkoxide anions such as methoxide and ethoxide anions”.
• If the phrasing used in the text makes understanding the meaning of a Hearst Pattern impossible, you should not annotate anything.
• Hearst Patterns may use a hypernym denoting multiple classes, e.g. “polar or non-polar solvents such as DMSO or hexane”. In this case, annotate the hypernym as “polar or non-polar solvents” and the hyponyms as normal.
• If there are nested Hearst Patterns, e.g. “…antibiotics such as beta-lactams such as amoxicillin…”, then copy the source file as many times as necessary and annotate the patterns separately.
• If there is a typo present in a hypernym or hyponym, you should treat the word as though the typo were not there. If there is a typo in the phrase “such as” you should not annotate the Hearst Pattern.

Appendix B

The table below contains a list of classes of molecules derived from the application of Hearst Patterns to chemical texts. In this task, each class should be assigned one and only one of the following labels:

• Structural – the name of the class indicates that all members contain a specific substructure e.g. ketone or methyl ester.
• Functional – the name of the class indicates that all members share a common function, usage, property or other non-structural feature e.g. antibiotic or surfactant.
• Semi-structural – the name of the class indicates something about the structure or composition of the members, but not that they share a specific substructure e.g. isomers of C6H10O or bicyclic systems.

Having decided which of the labels fits the class name best, tick the appropriate box. In making your decision you may use your background knowledge and any reference sources you consider appropriate but you must not confer with anyone.

Class name                               Structural   Functional   Semi-structural
olefin
straight monoolefin
amine
inert solvent
ether compound
mineral and carboxylic acid
alkylamine
suitable solvent
trihydrocarbon-substituted phosphine
aromatic ether compound
halogenated styrene
hydrocarbon
organic solvent
halogenated α-olefin
organic acid
monohydrocarbon-substituted phosphine
diolefin
Base
α-olefin
solvent
alkylstyrene
alcohol
dihydrocarbon-substituted phosphine
cyclic olefin
tertiary amine
halogenated hydrocarbon
aliphatic monoether compound
aliphatic unsaturated ether compound

Appendix C

1. Annotate textual reports of spectral data. Do not annotate spectra that are reported in image form. Include spectra found within tables where it is possible to produce well-formed XML and where they should be annotated according to these guidelines.
2. Spectra should be annotated such that each spectrum tag contains one and only one spectrum.
3. Annotate spectra that are reported in a regular format, e.g. “1H NMR: 2.30 (s, 2H), 2.45 (d, 1H, J=2.8Hz)”. Do not include spectra reported in natural language e.g. “1H NMR found to be identical as for previous example”.
4. Spectra containing typos should be annotated as though the typo were not present.
5. Do not include leading or trailing whitespace or punctuation.
6. Annotate only text corresponding to spectra produced from 1H NMR (HNMR), 13C NMR (CNMR), Mass Spectrometry (MassSpec), High-Resolution Mass Spectrometry (HRMS) and Infra-Red Spectroscopy (IR).
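The distinction drawn in guideline 3 between a regular format and natural language can be illustrated with a short Python sketch. It is a simplification for illustration only and is not the OSCAR3 data parser: the function name `parse_peaks` and the regular expression are assumptions, and only the “shift (details)” layout of the 1H NMR example is handled.

```python
import re

# One peak is modelled as a number followed by a parenthesised detail list,
# e.g. "2.45 (d, 1H, J=2.8Hz)".
PEAK = re.compile(r"(\d+(?:\.\d+)?)\s*\(([^()]*)\)")


def parse_peaks(report):
    """Return one dict per 'shift (details)' group found after the label.

    A report with no label separator or no recognisable peaks gives an
    empty list, which is how natural-language reports show up here.
    """
    # Discard the leading label, e.g. "1H NMR".
    _, _, body = report.partition(":")
    peaks = []
    for shift, details in PEAK.findall(body):
        fields = [field.strip() for field in details.split(",")]
        peaks.append({"shift": float(shift), "fields": fields})
    return peaks


if __name__ == "__main__":
    for peak in parse_peaks("1H NMR: 2.30 (s, 2H), 2.45 (d, 1H, J=2.8Hz)"):
        print(peak)
```

Applied to “1H NMR found to be identical as for previous example”, the sketch returns an empty list, matching the instruction that such text is not annotated.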
Appendix D

Supporting information is contained on the attached disk. This information comprises:

• /code/markush - the implementation of EPML tools as described in chapter 3
• /code/markush/examples - example EPML describing different types of variation
• /code/markush/fragments - CML fragments describing assorted key units
• /code/markush/markushStructures - example Markush structures encoded in EPML
• /code/markush/polyinfo - PML descriptions of polymers from the PoLyInfo database and their associated CML fragments and atomistic CML descriptions
• /code/markush/src - the source code for the implementation
• /code/markush/test - test code that really should have been in src/test/java
• /code/markush/testResources - test files that really should have been in src/test/resources
• /code/patentanalysis - the implementations of Hearst Patterns for relation extraction, and of reaction extraction, described in chapters 4 and 5 & 6 respectively
• /code/patentanalysis/classifier - the experimental sections, sorted according to their status as "experimental", "non-experimental" and "empty", used for model training and validation, as described in section 5.2.4
• /code/patentanalysis/downloadsNoDuplicates - the ten weeks' worth of EPO patent downloads from which duplicated documents have been removed, as described in section 5.1.3
• /code/patentanalysis/osra - the OSRA executable and supporting libraries used for image recognition
• /code/patentanalysis/osraCache - the cache of OSRA results generated during the work
• /code/patentanalysis/src - the source code for the implementation
• /code/patentanalysis/testResources - test files that really should have been in src/test/resources
• /data - data sets produced during the work
• /data/chemImageCorpus - the corpus of chemical images used to measure the accuracy of OSRA in section 5.2.5, produced by random selection from the files in /data/processedPatents
• /data/dataAnnotations - the corpus of experimental paragraphs annotated
for experimental data as described in section 5.2.3.2, produced by random selection from the files in /data/processedPatents
• /data/finalReactions - the resulting CML reactions from reaction mining the patent corpus, and the reasons for failure as discussed in section 5.3.3, by executing ReactionExtractor on the files in /data/processedPatents
• /data/hearstAnnotations - the sets of annotations produced by the three annotators as discussed in section 4.3.5
• /data/processedPatents - the set of semantically enhanced patents produced from the set of unique, full text documents as described in section 5.2, produced by executing ProcessPatents on the files in /code/patentanalysis/downloadsNoDuplicates
• /data/reactionCorpus - the set of automatically extracted reactions used to validate the performance of PatentEye in section 6.1 (produced by random selection from the files in /data/finalReactions), and an index file containing the manually-determined tp, fp and fn scores per patent
• /data/generatedRelations - the OWL files containing the sets of relations used in chapter 4
• /data/generatedRelations/structureOntology.owl - the full set of relations extracted from the patents, generated by running StructureOntologyCreator on the patents contained in /code/patentanalysis/downloadsNoDuplicates as described in section 4.3
• /data/generatedRelations/trimmedBySourceCount.owl - the set of relations produced by trimming the relations to those asserted in 3 separate source documents, as described in section 4.3.4, by executing TrimStructureOntology on structureOntology.owl
• /data/generatedRelations/trimmedBySubclasses.owl - the set of relations produced by further trimming the relations to only include those that refer to classes with at least six example structures, as described in section 4.4.1, by executing TrimStructureOntology on structureOntology.owl

Appendix E

The following comprises a list of the major software components written solely by the current author
as part of the current work. Much of the code is available on the attached disk.

• The MarkushBuilder component for producing example structures corresponding to Markush structures encoded in EPML, as described in section 3.4.
• The software for compiling and searching Extended Connection Tables corresponding to Markush structures encoded in EPML, as described in section 3.5.
• The HearstFinder component for identifying chemical Hearst Patterns in text using ChemicalTagger, as described in section 4.3.1.
• The EPOCrawler component for automatically downloading chemical patents from the European Patent Office (EPO) website, as described in section 5.1.2.
• The EPOProcessor component for the semantic enhancement of EPO patent documents, as described in section 5.2.
• The ExperimentParser component for the automatic extraction and semantic description of chemical syntheses from plain text, as described in section 5.3.