DEPARTMENT OF CHEMISTRY
Extraction of chemical
structures and reactions
from the literature
Daniel Mark Lowe
Pembroke College
This dissertation is submitted for the degree of Doctor of Philosophy
June 2012
Disclaimer
This dissertation is the result of my own work and includes nothing which is the outcome of
work done in collaboration except where specifically indicated in the text.
This dissertation does not exceed the word limit (60000) set by the Chemistry Degree
Committee.
Abstract
The ever increasing quantity of chemical literature necessitates the creation of automated
techniques for extracting relevant information. This work focuses on two aspects: the conversion of
chemical names to computer readable structure representations and the extraction of chemical
reactions from text.
Chemical names are a common way of communicating chemical structure information. OPSIN
(Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for
converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct
tokenisation and parsing leading to the generation of an XML parse tree. Nomenclature operations
are applied successively to the tree with many requiring the manipulation of an in-memory
connection table representation of the structure under construction. Areas of nomenclature
supported are described with attention being drawn to difficulties that may be encountered in name
to structure conversion. Results on sets of generated names and names extracted from patents are
presented. On generated names, recall of between 96.2% and 99.0% was achieved with a lower
bound of 97.9% on precision with all results either being comparable or superior to the tested
commercial solutions. On the patent names OPSIN’s recall was 2-10% higher than the tested
solutions when the patent names were processed as found in the patents. The uses of OPSIN as a
web service and as a tool for identifying chemical names in text are shown to demonstrate the direct
utility of this algorithm.
A software system for extracting chemical reactions from the text of chemical patents was
developed. The system relies on the output of ChemicalTagger, a tool for tagging words and
identifying phrases of importance in experimental chemistry text. Improvements to this tool
required to facilitate this task are documented. The structures of chemical entities are, where possible,
determined using OPSIN in conjunction with a dictionary of name to structure relationships.
Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom
mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100
of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision.
Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases,
whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system
captured the essence of the reaction in 95% of cases. This system is expected to be useful in the
creation of searchable databases of reactions from chemical patents and in facilitating analysis of
the properties of large populations of reactions.
Acknowledgements
I would like to thank my supervisors, Professor Robert Glen and Professor Peter Murray-Rust,
for their guidance and advice. I would also like to thank Dr Peter Corbett, for his initial work on the
OPSIN codebase which was the precursor to the system that I developed, Dr Lezan Hawizy for her
work on ChemicalTagger and many useful discussions on extending it, Dr David Jessop for his
paragraph classifier, Albina Asadulina for her contribution to fused ring nomenclature support and
Dr Sam Adams for many fruitful discussions on cheminformatics algorithms. I would also like to
thank my colleagues at the Unilever Centre for providing such an enjoyable working environment. I
am very grateful to Boehringer Ingelheim for funding my research.
Table of Contents
Disclaimer
Abstract
Table of Contents
Glossary
Chapter 1 Introduction
1.1 Where can text mining be performed?
1.2 What can be text mined?
1.3 Overview of research project
Chapter 2 Tools and Methods
2.1 XML
2.2 Chemical Markup Language
2.3 SMILES
2.4 InChI
2.5 Formal grammars
2.6 Automata
2.7 Regular expressions
2.8 OSCAR4
2.9 ChemicalTagger
2.10 Apache Maven
2.11 Distributed version control
2.12 Continuous integration testing
Chapter 3 Conversion of Chemical Names to Structures
3.1 Introduction
3.1.1 History of systematic nomenclature
3.1.1 Classes of chemical name
3.1.2 General construction of systematic names
3.1.3 History of programmatic name to structure conversion
3.1.4 Current solutions
3.2 Development and implementation of OPSIN
3.2.1 Strategy for development of OPSIN
3.2.2 Architecture
3.2.3 Pre-processing
3.2.4 Tokenisation and parsing
3.2.4.1 Introduction
3.2.4.2 Tokenisation algorithm
3.2.4.3 Looking up tokens in the lexicon
3.2.4.4 Generation of parses
3.2.4.5 Drawbacks of a regular grammar
3.2.4.6 Right to left parsing
3.2.4.7 XML generation
3.2.5 CAS index name uninversion
3.2.6 Chemical word rule assignment
3.2.7 Component generation
3.2.7.1 XML Transformations
3.2.7.2 Generation of alkanes
3.2.7.3 Generation of heteroatom hydrides
3.2.7.4 Generation of heterogeneous heteroatom hydrides
3.2.7.5 Generation of hydrocarbon ring systems
3.2.7.5a Von Baeyer nomenclature
3.2.7.5b Monocyclic Spiro nomenclature
3.2.7.5c Other hydrocarbon ring nomenclature
3.2.7.6 Rejection of parses caused by nomenclature ambiguity
3.2.7.7 Handling of nomenclature irregularities
3.2.8 Connection table generation
3.2.9 Specific nomenclature handling
3.2.9.1 Groups with indeterminately positioned structural features
3.2.9.2 Traditional alkane/carboxylic acid locants
3.2.9.3 Skeletal replacement nomenclature
3.2.9.4 Conjunctive nomenclature
3.2.9.5 Suffix handling
3.2.9.6 Charge and oxidation numbers
3.2.9.7 Indication of saturation and unsaturation
3.2.9.7a Unsaturation terms
3.2.9.7b Hydro, dehydro, indicated hydrogen and added hydrogen
3.2.9.8 Subtractive nomenclature
3.2.9.9 Functional replacement
3.2.9.9a Infix Functional Replacement
3.2.9.9b Prefix Functional Replacement
3.2.9.10 Hantzsch-Widman nomenclature
3.2.9.11 Lambda convention
3.2.9.12 Fused Ring nomenclature
3.2.9.12a Fused Ring System Construction
3.2.9.12b Benzo fusions
3.2.9.12c Multi-parent systems
3.2.9.12d Idealised grid construction
3.2.9.12e Grid orientation
3.2.9.12f Peripheral numbering
3.2.9.13 Bridges for fused ring systems
3.2.9.14 Ring assemblies
3.2.9.15 Polycyclic spiro nomenclature
3.2.9.16 ᴅ/ʟ stereochemistry
3.2.9.17 Amino acid nomenclature
3.2.9.18 Carbohydrate nomenclature
3.2.9.18a Systematic carbohydrate chains
3.2.10 Structure assembly
3.2.10.1 Substitutive nomenclature
3.2.10.2 Additive nomenclature
3.2.10.3 Multiplicative nomenclature
3.2.10.4 Functional class nomenclature
3.2.10.5 Structure-based polymer nomenclature
3.2.11 Kekulisation
3.2.12 Valency checking
3.2.13 Application of stoichiometry
3.2.13.1 Mixtures
3.2.13.2 Charge balancing
3.2.14 Stereochemistry handling
3.2.14.1 Detection of stereocentres
3.2.14.2 Applying stereochemistry
3.2.14.2a R/S/E/Z stereochemistry
3.2.14.2b Cis/trans stereochemistry
3.2.14.2c Alpha/beta stereochemistry
3.2.14.2d Carbohydrate stereochemistry
3.2.15 Ambiguous and formally incorrect chemical names
3.2.15.1 Implicit bracketing
3.2.15.2 Implicit spaces
3.2.16 Output formats
3.2.16.1 CML
3.2.16.2 SMILES
3.2.16.3 InChI
3.3 Results and discussion
3.3.1 Methodology
3.3.1.1 Generated name test sets
3.3.1.2 Chemical patents test set
3.3.2 Data obtained
3.3.2.1 ACD/Name generated names
3.3.2.2 ChemBioDraw generated names
3.3.2.3 Lexichem generated names
3.3.2.4 Marvin generated names
3.3.2.5 Compounds from headings in USPTO Patents
3.3.3 Discussion
3.4 Implementations
3.4.1 Java library
3.4.2 Command-line interface
3.4.3 OPSIN web service
3.4.4 OPSIN Document Extractor
3.5 Areas for future work
3.5.1 Vocabulary
3.5.2 Carbohydrate nomenclature
3.5.3 Inorganic nomenclature
3.5.4 Stereochemistry
3.5.5 Nomenclature variants
3.5.6 Detection and handling of ambiguous names
3.5.7 Detection of typographical errors
3.5.8 Foreign language support
3.6 Conclusions
Chapter 4 Extraction of Chemical Reactions from the Patent Literature
4.1 Introduction
4.2 Previous attempts at text mining chemical reactions
4.2.1 Chemical Abstracts Service
4.2.2 University of Cambridge
4.2.3 University of Toronto
4.3 Corpus choice
4.4 Sectioning the relevant text within a patent
4.4.1 Archetypal experimental chemistry section
4.4.2 Sectioning workflow
4.4.3 Identifying paragraphs and headings
4.4.4 Paragraph classification
4.4.5 Chemical tagging
4.4.5.1 Improved tokenisation
4.4.5.2 Improved robustness of sentence parser
4.4.5.3 Recognition of new concepts
4.4.5.4 Improved recognition of existing concepts
4.4.5.5 Improved action phrase assignment
4.4.5.6 Improved extensibility
4.4.6 Identification of inline headings
4.4.7 Processing of headings
4.4.8 Processing of paragraphs
4.5 Section Parsing
4.5.1 Processing of chemical entities
4.5.1.1 Name to structure
4.5.1.2 Anaphora identification and resolution
4.5.1.3 Property Extraction
4.5.1.4 Chemical type assignment
4.5.2 Identification of discourse type
4.5.3 Chemical role assignment
4.5.3.1 Product Role
4.5.3.2 Reactant Role
4.5.3.3 Solvent Role
4.5.3.4 Catalyst Role
4.6 Reaction mapping
4.6.1 Indigo reaction creation
4.6.2 Atom-atom mapping
4.6.3 Stoichiometry calculation
4.6.4 Output
4.7 Evaluation
4.7.1 Methodology
4.7.2 Results
4.7.2.1 Errors encountered
4.7.2.2 Overall statistics
4.7.2.3 Evaluated reaction quality
4.8 Discussion
4.9 Comparison to other approaches
4.10 Example use: solvent analysis
4.11 Limitations and areas for future work
4.11.1 Interrelation between taggers
4.11.2 Chemical entity type assignment
4.11.3 Solvents contained within another entity
4.11.4 Acid/Base workup steps
4.11.5 Additional roles
4.11.6 Structurally unknown intermediates
4.11.7 Presentation of reactions
4.11.8 Reaction conditions
4.12 Conclusions
Chapter 5 Overall Summary of Results and Conclusions
References
Appendix A
Appendix B
Appendix C
Glossary
AST = Abstract Syntax Tree
CAS = Chemical Abstracts Service
ChEBI = Chemical Entities of Biological Interest
CIP = Cahn-Ingold-Prelog
CML = Chemical Markup Language
DTD = Document Type Definition
EPO = European Patent Office
InChI = IUPAC International Chemical Identifier
IPC = International Patent Classification
IUPAC = International Union of Pure and Applied Chemistry
MEMM = Maximum-Entropy Markov Model
OCR = Optical Character Recognition
OPSIN = Open Parser for Systematic IUPAC Nomenclature
OSCAR = Open Source Chemistry Analysis Routines
POS = Part Of Speech
SMILES = Simplified Molecular Input Line Entry Specification
USPTO = United States Patent and Trademark Office
WIPO = World Intellectual Property Organization
XML = eXtensible Markup Language
Chapter 1 Introduction
The scientific literature, comprising journal articles (Figure 1-1), patents (Figure 1-2) and
theses, is continuing to grow rapidly.
Figure 1-1 PubMed articles indexed per year from 1950-2011 [1]
Figure 1-2 World-wide chemistry patent applications per year from 2000-2009 [2]
Due to the size of the literature, automated methods must be employed to allow identification
of relevant resources. Fortunately, much of the literature is available in digital form, whether by being
natively created as such or, in the case of legacy material, by being scanned. Optical character
recognition (OCR) is routinely employed on scanned documents to allow their text to be computer
readable. Using modern search engines, employing technologies such as Apache Lucene [3], full text
searching is now routine; however, such searches are not sufficient to allow domain specific queries,
as traditional search engines have limited understanding of the content of the documents over
which they are searching.
Ontologies may be employed to formally encode the relation between entities in a domain,
e.g. those that are hyponyms of other entities, and to encode terms that are synonymous with a
given concept in the ontology. Examples of such ontologies include the Gene Ontology [4] and the
ChEBI [5] (Chemical Entities of Biological Interest) ontology. Unlike some fields, the number of possible
entities in chemistry is essentially unbounded. For example, the number is of the order of 10^60–10^100
just for drug-like small molecule entities [6]. This, coupled with the use of various forms of systematic
nomenclature that lead to many names for the same chemical entity, makes the existence of an
ontology describing all possible chemical entities and their possible synonyms impractical.
For small molecules, a natural identifier is the chemical structure itself and hence much text
mining effort in chemistry has focused on the identification and conversion of textual and graphical
entities into chemical structures.
Textual chemical entities may be expressed in many ways including systematic nomenclature
such as IUPAC nomenclature, trivial names, chemical line identifiers e.g. InChI (Section 2.4) and
chemical formulae. In biomedical text mining, identification of entities can be primarily achieved by
dictionary-based approaches [7]. In chemical text mining, however, dictionary-based approaches are
insufficient to recognise much systematic nomenclature, line identifiers and chemical formulae. As a
result, recent research on the identification of chemical entities from text has focused primarily on
machine-learning approaches [8–15]. More recently, grammar-based approaches have also been shown
to be applicable [16].
Just as for the identification of textual chemical entities, resolution to chemical structures cannot,
in the general case, be accomplished by dictionary approaches. As a result, chemical name to
structure algorithms are required to allow the interpretation of systematic chemical nomenclature.
Corresponding efforts exist to extract chemical structures from images. Eight different
solutions that are under active development have been reported, two of which are currently open
source17–24. The area is rapidly progressing with new versions and new solutions significantly
increasing the percentage of chemical structure diagrams that can be recognised correctly. For
example, on one particular test set this percentage rose from 69% (OSRA, 2009)25 to 88% (MolRec,
2012)22. With the growing maturity
of image to structure software, future research can be expected to increasingly leverage the
combination of the results of text mining and image to structure26.
Due to concerns over the accuracy of image to structure software at the time the project was
initiated, and the pre-existence of an actively developed open source image to structure solution, it
was decided to focus this research project solely on extracting information from text.
The lack of an open source name to structure algorithm with useful levels of performance
necessitated the development of such an algorithm, which forms the first part of this project.
1.1 Where can text mining be performed?
Much of the chemical literature published in journals remains behind pay-walls, with policies
on text mining that differ significantly between publishers and often with restrictions and/or
charges attached27. A further problem is that no standardised data format exists for the
representation of journal articles, meaning that some level of adaptation is likely to be required for a
tool to work with articles from a particular journal.
Patents also provide a vast resource of chemical information, yet have the key advantage of
being in the public domain. More recent patents also have the advantage of being presented in
standardised data formats. Historically, access to bulk patent downloads has been difficult, but, with
the recent collaboration between the USPTO (United States Patent and Trademark Office) and
Google Patents28, the entire archive of US Patents can now be downloaded trivially.
Other important sources of patents include the European Patent Office (EPO), Japan Patent
Office, Korean Intellectual Property Office and the State Intellectual Property Office of the People's
Republic of China. It should, however, be considered that the most important patents are likely to
be filed at multiple patent offices. The World Intellectual Property Organization (WIPO) is an
important source of patent applications that are intended to be processed at multiple patent offices
but have not yet reached the “national phase” where they are examined by the national patent
offices.
Due to their ease of access and lack of OCR mistakes, USPTO patent applications from 2008 to
2011 form the corpus used for text mining in this project.
1.2 What can be text mined?
Text mining has seen widespread use in bioinformatics for discovering relationships between
entities e.g. chemical interactions with Cytochrome P450 29–31 or protein-protein interactions32. Uses
in chemistry have been more limited including annotation of entities33 (cf. the RSC’s Project
Prospect), the association of linked and/or calculated data with identified entities34,35 (cf.
ChemAxon’s Chemicalize), and allowing patents to be structure searchable through large scale
extraction of chemical structures (cf. SureChem36, IBM BAO strategic IP insight platform37).
No large scale automated extraction of reactions from the literature has previously been
attempted; this is the problem this project will address. Such a system has the potential to allow
more precise queries of the mined reactions and to improve knowledge driven reaction prediction
algorithms38.
1.3 Overview of research project
Chapter 2 describes the theory and software that underlie the solutions that have
been developed. Topics covered include computer readable chemical structure serialisations,
grammars and automata, software developed for identifying chemical entities and annotating
experimental chemistry text, and techniques that help provide a productive software development
environment.
Chapter 3 describes the development of OPSIN, a chemical name to structure algorithm. Other
existing and historic attempts at name to structure are discussed followed by a detailed description
of the processes that allow OPSIN to convert a name into a computer readable structure
representation. The forms of nomenclature supported by OPSIN are described, exemplified and,
where of sufficient complexity, the algorithms used to process them are described. OPSIN’s
performance is evaluated on sets of generated chemical names and names extracted from patents.
The various ways that OPSIN is used including as a command-line interface, web service and tool for
identifying systematic chemical names in free text are described.
Chapter 4 describes the development of software for the automated extraction of reactions
from patents. Previous attempts are discussed followed by a detailed description of the reaction
extraction system that has been developed. This covers the steps of identifying experimental
sections, determining the type and role of chemical entities and finally producing an atom-atom map
between the reactants and product/s. Precision and recall estimates are derived from a subset of the
four years of USPTO patents over which the system has been run.
Chapter 5 summarises the outcomes and future directions of this project.
Chapter 2 Tools and Methods
2.1 XML
XML (eXtensible Markup Language) is a standard for encoding information using mark-up in a
way that is machine-readable. An XML document may contain elements, attributes, text nodes,
comments, processing instructions, namespace declarations and doctype declarations. To explain the
first three of these, Figure 2-1 will be used as an example. The document is formed of elements,
made of labelled start and end tags, in this case inventory and vehicle. An element may be
associated with zero or more attributes e.g. wheels. An element may also have zero or more text
children e.g. ‘car’. To be well formed, an XML document must have a single root element of which
all other elements are ultimately descendants.
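<inventory>
  <vehicle wheels="4">car</vehicle>
  <vehicle wheels="3">tricycle</vehicle>
  <vehicle wheels="2">bicycle</vehicle>
</inventory>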
Figure 2-1 A simple XML document. The inventory element is the root node of this document.
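The element, attribute and text-node structure described above can be explored with any standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree` purely for illustration (the work itself uses the XOM Java API); the wheel counts are assumed values, and `is_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed inventory document: the wheel counts are illustrative values only.
DOCUMENT = """
<inventory>
  <vehicle wheels="4">car</vehicle>
  <vehicle wheels="3">tricycle</vehicle>
  <vehicle wheels="2">bicycle</vehicle>
</inventory>
""".strip()

root = ET.fromstring(DOCUMENT)

# Each vehicle element has one attribute (wheels) and one text child.
vehicles = [(v.text, int(v.get("wheels"))) for v in root.findall("vehicle")]


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(root.tag, vehicles)
# A document with two root elements is rejected by the parser.
print(is_well_formed("<vehicle>car</vehicle><vehicle>bicycle</vehicle>"))
```

The second call illustrates the single-root requirement: the parser refuses a document whose top level contains more than one element.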
Comments appear in XML documents outside of other mark-up, enclosed between ‘<!--’ and ‘-->’ strings,
and are often used to give additional information to a human reader.
A processing instruction is enclosed within ‘<?’ and ‘?>’ strings and is intended to give an
instruction to the application processing the document, for example, that the document should be
rendered using a certain style sheet.
A namespace is declared using the reserved attribute name ‘xmlns’ and is used to uniquely
define element and attribute names (Figure 2-2). A good use for namespaces is when merging XML
content from two sources that may have conflicting element names or attributes with the same
name but different semantics. Assuming different namespaces were used in the two different
documents, elements with the same name can be distinguished.
Figure 2-2 Example of namespaces to differentiate between elements with the same name
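As a hedged illustration of how namespaces keep identically named elements apart, the sketch below parses a small document in Python. The namespace URIs, prefixes and element names are all hypothetical and are not taken from the original figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two sources both define a "table" element.
DOCUMENT = """
<catalogue xmlns:chem="http://example.org/chemistry"
           xmlns:furn="http://example.org/furniture">
  <chem:table>periodic</chem:table>
  <furn:table>dining</furn:table>
</catalogue>
""".strip()

root = ET.fromstring(DOCUMENT)

# Prefix-to-URI map used only for querying; the prefixes need not match the document.
NS = {
    "chem": "http://example.org/chemistry",
    "furn": "http://example.org/furniture",
}

chem_tables = [e.text for e in root.findall("chem:table", NS)]
furn_tables = [e.text for e in root.findall("furn:table", NS)]

# Internally each tag is qualified by its namespace URI, so the names never clash.
qualified_tags = [child.tag for child in root]
print(chem_tables, furn_tables, qualified_tags)
```

The parser stores each element name together with its namespace URI, which is why the two `table` elements can be queried independently.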
A doctype may appear at the start of a document and associates the XML document with a
Document Type Definition (DTD). A DTD defines the basic structure of a document e.g. which
elements are allowed, which elements an element may have as children, what the content of a
particular attribute may be etc.
XML is a versatile data format with uses including web pages, databases and information
exchange over HTTP. While the format is fairly verbose this comes with the advantage that
semantics are explicit rather than implicit as in some other formats. As the format is typically not
compressed to a binary format, the format also has the advantage of often being human
understandable and being editable in standard text editing tools.
XML is employed extensively in OPSIN (Chapter 3) for encoding on-disk resources and as an in-
memory representation of a parse tree. Reading XML files and manipulating in-memory
representations of XML is achieved using the XOM Java XML API39,40.
2.2 Chemical Markup Language
Chemical Markup Language (CML) 41 is the application of XML to hold chemical data. CML was
developed by Professors Murray-Rust and Rzepa and initially announced in 199542. Since then the
format has evolved through numerous revisions and is now supported by many commercial and
open source chemistry applications.
A simple use case of CML is to encode molecular structure (Figure 2-3). The
elements/attributes that are allowed in CML, their allowed values for attributes and the allowed
children for each element are encoded in a schema43,44 which may be used for validation of CML45. If
no appropriate elements are available in CML then additional information may be recorded by the
use of elements or attributes outside of the CML namespace.
Figure 2-3 A CML document describing the connectivity of ethanol
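To give a feel for what such a document contains, the following sketch builds a CML-style description of the heavy-atom connectivity of ethanol. It is an illustrative approximation rather than a reproduction of the figure: the atom identifiers, the use of `hydrogenCount` and the bond order value `1` are assumptions, and the output has not been validated against the CML schema.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
# Serialise the CML namespace as the default namespace (no prefix).
ET.register_namespace("", CML_NS)


def cml(tag):
    """Qualify a tag name with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"


molecule = ET.Element(cml("molecule"), id="ethanol")

# Heavy atoms of ethanol with assumed identifiers a1..a3.
atom_array = ET.SubElement(molecule, cml("atomArray"))
for atom_id, element, hydrogens in [("a1", "C", 3), ("a2", "C", 2), ("a3", "O", 1)]:
    ET.SubElement(
        atom_array,
        cml("atom"),
        id=atom_id,
        elementType=element,
        hydrogenCount=str(hydrogens),
    )

# Two single bonds: C-C and C-O.
bond_array = ET.SubElement(molecule, cml("bondArray"))
for atom_refs in ["a1 a2", "a2 a3"]:
    ET.SubElement(bond_array, cml("bond"), atomRefs2=atom_refs, order="1")

cml_text = ET.tostring(molecule, encoding="unicode")
print(cml_text)
```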
CML has been extended to cover computational chemistry46, spectral data47 and polymers48. It
has also been extended to cover chemical reactions49. This method of encoding chemical reactions is
employed in Section 4.6.4 for serialising reactions that have been extracted from the patent
literature.
2.3 SMILES
SMILES50 (Simplified Molecular Input Line Entry System), published by Daylight in 1988,
is now a widely supported convention for representing a chemical connection table as a string of
ASCII text. The proliferation of applications that can read and/or write SMILES can be explained by its
favourable properties compared to other contemporary line formats. These include its terseness, the
ease with which readers and writers can be implemented, and the fact that SMILES are relatively
intelligible to, and writeable by, humans.
Figure 2-4 shows a SMILES string; the string is read from left to right to generate the atoms
and bonds of the molecule. The format contains many optimisations to reduce the length of the
representation and improve readability; hydrogen may be implicit on organic atoms, and bonds are
implicitly single or implicitly of type aromatic between aromatic atoms. Full details of the SMILES
specification are available from the OpenSMILES project51 and Daylight52.
N[C@H](C(=O)O)Cc1ccccc1
Figure 2-4 SMILES and structure for L-phenylalanine
The main limitation of SMILES is that it is not intrinsically a canonical format, meaning that a
single connection table can be represented by multiple SMILES (Figure 2-5). While implementations
exist that produce a canonical representation, including implementations in open source toolkits
such as the CDK53 and OpenBabel54, no standard implementation has been agreed upon, making
canonical SMILES unsuitable as an interoperable canonical descriptor.
Figure 2-5 Examples of legal SMILES: CCO, OCC, C(O)C, C(C)O, CC1.O1, C1.O2.C12
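That these strings all describe the same connection table can be checked with a very small reader. The sketch below is an assumption-laden toy, not a SMILES implementation: it handles only one-letter atoms, implicit single bonds, branches, single-digit ring closures and the `.` separator, and it compares molecules by their sorted element pairs per bond, which is sufficient for this example but is not a general graph comparison.

```python
def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, bonds)."""
    atoms, bonds = [], []
    previous = None       # atom index that the next atom will bond to
    branch_points = []    # stack of atoms to return to after ')'
    open_closures = {}    # ring-closure digit -> atom awaiting its partner
    for ch in smiles:
        if ch.isalpha():
            atoms.append(ch)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current))
            previous = current
        elif ch == "(":
            branch_points.append(previous)
        elif ch == ")":
            previous = branch_points.pop()
        elif ch.isdigit():
            if ch in open_closures:
                bonds.append((open_closures.pop(ch), previous))
            else:
                open_closures[ch] = previous
        elif ch == ".":
            previous = None   # next atom is not bonded to the preceding one
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def bond_signature(smiles):
    """Sorted list of element pairs, one per bond (a crude comparison key)."""
    atoms, bonds = parse_simple_smiles(smiles)
    return sorted(tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds)


ETHANOL_SMILES = ["CCO", "OCC", "C(O)C", "C(C)O", "CC1.O1", "C1.O2.C12"]
signatures = {smi: bond_signature(smi) for smi in ETHANOL_SMILES}
print(signatures)
```

Every string yields one C-C bond and one C-O bond, even though the atom ordering and the use of branches or ring-closure digits differ.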
Other limitations stem from the approximation of molecules as static graphs with well-defined
bond orders. These include the inability to recognise that two molecules are tautomers of each
other or, in the case of mesomers and bonds to metals, that two molecules are identical but simply
represented differently, as exemplified in Figure 2-6. These problems, as well as the lack of
a universal canonical form, are largely addressed by InChI (Section 2.4).
CC[Mg]Br CC[Mg+].[Br-]
Figure 2-6 Two representations of Ethylmagnesium bromide and their canonical SMILES (generated by
OpenBabel)
SMILES are employed by OPSIN as representations for fragments of chemical names and are
one of the program’s output formats. They are also employed in the reaction extraction code for use
as output and to allow data exchange with the Indigo toolkit55.
2.4 InChI
InChI or IUPAC International Chemical Identifier56 is a canonical identifier for chemical
compounds. InChIs are formed of layers, in such a way that layers may be removed from right to left
of the InChI without affecting the meaning of the remaining layers (Figure 2-7). This unique feature
of InChI can be utilised to determine at what layer two molecules differ. For example if the InChIs of
two molecules failed to match as is, but matched with the stereochemical layer removed, it can be
deduced that the two molecules differ just in stereochemistry.
Figure 2-7 Standard InChI string with layers annotated
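The layer comparison described above can be sketched with simple string handling. The code below is a hedged illustration only: it assumes that stereochemical layers are exactly those beginning with `b`, `t`, `m` or `s`, ignores any stereo sublayers nested in an isotopic layer, and uses what are believed to be the standard InChIs of L- and D-alanine as example input. The function names are hypothetical.

```python
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi):
    """Remove the stereochemical layers from an InChI string."""
    parts = inchi.split("/")
    # parts[0] is the version string and parts[1] the chemical formula;
    # every later layer starts with a one-letter prefix.
    kept = parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]
    return "/".join(kept)


def differ_only_in_stereo(inchi_a, inchi_b):
    """True if the InChIs differ, but match once stereo layers are removed."""
    return inchi_a != inchi_b and strip_stereo(inchi_a) == strip_stereo(inchi_b)


# Illustrative inputs (assumed to be the InChIs of L- and D-alanine).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(strip_stereo(L_ALANINE))
print(differ_only_in_stereo(L_ALANINE, D_ALANINE))
```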
Unlike SMILES, InChI does not suffer from problems with the representation of organic groups
for which multiple valence bond representations are possible (Figure 2-8):
SMILES: [O-][N+](=O)c1ccccc1
InChI=1S/C6H5NO2/c8-7(9)6-4-2-1-3-5-6/h1-5H
SMILES: O=N(=O)c1ccccc1
InChI=1S/C6H5NO2/c8-7(9)6-4-2-1-3-5-6/h1-5H
Figure 2-8 InChIs and SMILES for two different representations of nitrobenzene. Both representations
yield the same InChI, whereas the SMILES differ.
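Layers of the InChI string annotated in Figure 2-7:
InChI=1S/C4H7ClFN/c1-4(5,7)2-3-6/h2-3H,7H2,1H3/p+1/b3-2+/t4-/m0/s1/i1D
Version string: 1S
Chemical formula: C4H7ClFN
Atom connections: /c1-4(5,7)2-3-6
Hydrogen atoms: /h2-3H,7H2,1H3
Charge: /p+1
Stereochemical (double bond, tetrahedral, parity, overall chirality): /b3-2+ /t4- /m0 /s1
Isotopic: /i1D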
For simple inorganic compounds the same InChI will be produced regardless of whether ionic
or covalent representation is used (cf. Figure 2-6). However, as illustrated in Figure 2-9, multiple
InChIs are possible for those inorganic compounds with bonding not found in organic compounds,
such as the haptic covalent bonding between the π electrons of the cyclopentadienyl rings and the d
electrons of the iron in ferrocene.
InChI=1S/2C5H5.Fe/c2*1-2-4-5-3-1;/h2*1-5H;
InChI=1S/2C5H5.Fe/c2*1-2-4-5-3-1;/h2*1-5H;/q2*-1;+2
Figure 2-9 Two possible depictions of ferrocene and the two different InChIs they produce
InChIs may be standard or non-standard, with standard InChIs being distinguished by the
presence of the ‘S’ at the end of the version string (Figure 2-7). Unlike a standard InChI, a non-
standard InChI may include a fixed-H layer that allows specification of a particular tautomer and/or a
reconnected layer that explicitly includes bonds to metal atoms. A non-standard InChI may also have
experimental InChI flags enabled such as those for detecting more forms of tautomerisation.
2.5 Formal grammars
A formal grammar is formed of disjoint sets of terminal and non-terminal symbols, with
production rules specifying the replacements allowed for each non-terminal symbol. A terminal
symbol is a literal character in the language to be recognised, whilst a non-terminal symbol has
an associated production rule which defines it in terms of terminal and/or non-terminal
symbols.
An example of a formal grammar for some simple mathematical expressions could be:
equation ::= bracketed-expression | expression
bracketed-expression ::= “(” expression “)”
expression ::= ( bracketed-expression | digit+ ) operator ( bracketed-expression | digit+ )
digit ::= “0” | “1” | “2” | “3” | “4” | “5” | “6” | “7” | “8” | “9”
operator ::= “+” | “-” | “×” | “÷”
In this grammar, each line is a production rule. The terminal symbols are the following
characters: ()0123456789+-×÷ whilst the non-terminal symbols are all the other terms. A
formal grammar must also have a non-terminal symbol which is designated as the start symbol,
which in this case is equation.
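One way to see how such production rules constrain the accepted strings is to write a recogniser in which each non-terminal becomes a function. The following recursive-descent sketch is offered as an illustration under that assumption; the function names are hypothetical, and each function returns the index just past what it matched, or `None` on failure.

```python
DIGITS = "0123456789"
OPERATORS = "+-×÷"


def parse_operand(text, i):
    """( bracketed-expression | digit+ )"""
    if i < len(text) and text[i] == "(":
        return parse_bracketed(text, i)
    j = i
    while j < len(text) and text[j] in DIGITS:
        j += 1
    return j if j > i else None


def parse_bracketed(text, i):
    """bracketed-expression ::= "(" expression ")" """
    if i < len(text) and text[i] == "(":
        j = parse_expression(text, i + 1)
        if j is not None and j < len(text) and text[j] == ")":
            return j + 1
    return None


def parse_expression(text, i):
    """expression ::= operand operator operand"""
    j = parse_operand(text, i)
    if j is None or j >= len(text) or text[j] not in OPERATORS:
        return None
    return parse_operand(text, j + 1)


def is_equation(text):
    """equation ::= bracketed-expression | expression (the start symbol)"""
    end = len(text)
    return parse_bracketed(text, 0) == end or parse_expression(text, 0) == end


for candidate in ["1+2", "(1+2)", "(1+2)×3", "1+2+3", "12"]:
    print(candidate, is_equation(candidate))
```

Note that `1+2+3` is rejected: the grammar as written allows exactly one operator per expression unless brackets are used.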
The types of languages that a grammar can express are related to what restrictions, if any, are
put on the allowed form of the production rules. Chomsky57 defined four types of grammar which in
order of increasing expressivity are regular, context-free, context-sensitive and unrestricted (Table
2-1).
Grammar            Language recognised      Production rules
Unrestricted       Recursively enumerable   α → β
Context-sensitive  Context-sensitive        αAβ → αγβ
Context-free       Context-free             A → α
Regular            Regular                  A → a and A → Ba, or A → a and A → aB
Table 2-1 The different classes of grammars, the languages they recognise and the production rules
they support. Greek symbols are any combination of terminals or non-terminal symbols, capital letters are
non-terminals and lower case symbols are terminals (including the empty string).
Regular grammars and context-free grammars will be returned to when discussing the
grammar employed by OPSIN (Section 3.2.4) and ChemicalTagger (Section 2.9), respectively.
2.6 Automata
An automaton is a mathematical construct formed of states, and transitions between those
states. An automaton has an alphabet which includes the symbols that may occur in the input to the
automaton. An automaton of the appropriate type (Table 2-2) may be used to check that a given
input is acceptable to a given grammar.
Grammar type       Automaton type
Unrestricted       Turing machine
Context-sensitive  Linear bounded automaton
Context-free       Pushdown automaton
Regular            Finite state automaton
Table 2-2 Automaton types required to process the archetypal grammar types
As this work only employs regular and context-free grammars this exposition will focus on the
properties of the corresponding automata for these grammars.
A finite state automaton (FSA) is the simplest automaton and is demonstrated by the example
shown in Figure 2-10. Each circle is a state and every arrow is a transition. When the automaton is
“run” over a given input, the input is consumed character by character with an appropriate
transition being attempted after consumption of each character. The character that must be
consumed for a transition to be allowed is indicated next to the arrow. If no appropriate transition is
possible then the input is not acceptable to the grammar. The automaton continues through the
input until all characters have been consumed at which point examination of whether the FSA is in
an accept state (double circle in diagram) indicates whether the input was accepted by the grammar.
Figure 2-10 A finite state automaton that matches “methane”, “ethane” and “ethene”
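The behaviour described above can be sketched as a transition table. The automaton below is one possible construction that accepts exactly “methane”, “ethane” and “ethene”; its state names are hypothetical and it is not necessarily the automaton drawn in the figure.

```python
# (current state, consumed character) -> next state
TRANSITIONS = {
    ("start", "m"): "m",
    ("m", "e"): "me",
    ("me", "t"): "met",
    ("met", "h"): "meth",
    ("meth", "a"): "vowel",   # only "a" is allowed after "meth"
    ("start", "e"): "e",
    ("e", "t"): "et",
    ("et", "h"): "eth",
    ("eth", "a"): "vowel",    # "ethane"
    ("eth", "e"): "vowel",    # "ethene"
    ("vowel", "n"): "n",
    ("n", "e"): "accept",
}
ACCEPT_STATES = {"accept"}


def accepts(word):
    """Run the automaton over the input, one character per transition."""
    state = "start"
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:     # no appropriate transition: input rejected
            return False
    return state in ACCEPT_STATES


for word in ["methane", "ethane", "ethene", "methene", "ethan"]:
    print(word, accepts(word))
```

Keeping the “meth” and “eth” branches separate until the vowel has been consumed is what prevents “methene” from being accepted.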
A pushdown automaton is an FSA augmented with a stack. The top of this stack may be inspected
to determine which transition to make and, as part of performing a transition, an entry may be
added to or removed from the stack. Figure 2-11 demonstrates a context-free grammar and
Figure 2-12 shows a pushdown automaton that could recognise it.
expression = bracketed-expression | “a” ;
bracketed-expression = “(” , ( bracketed-expression | “a” ) , “)” ;
Figure 2-11 A context-free grammar; a stack is required to keep track of the nesting
Figure 2-12 A possible state machine for the grammar from Figure 2-11
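The role of the stack can be made concrete with a small recogniser for the grammar of Figure 2-11. This is a sketch of one possible pushdown automaton, with hypothetical state names: opening brackets are pushed, the single “a” switches state, and each closing bracket pops one entry.

```python
def accepts(text):
    """Recognise n opening brackets, an 'a', then n closing brackets (n >= 0)."""
    stack = []
    state = "opening"              # before the "a" has been consumed
    for ch in text:
        if state == "opening" and ch == "(":
            stack.append(ch)       # append bracket to stack
        elif state == "opening" and ch == "a":
            state = "closing"
        elif state == "closing" and ch == ")" and stack:
            stack.pop()            # remove bracket from stack
        else:
            return False           # no transition available
    # Accept only if the "a" was seen and every bracket was closed.
    return state == "closing" and not stack


for text in ["a", "(a)", "((a))", "((a)", "(a))", "()"]:
    print(text, accepts(text))
```

A finite state automaton cannot do this for arbitrary nesting depth, because it has no way of counting how many brackets remain open.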
2.7 Regular expressions
Regular expressions are a widely supported method of encoding patterns for finding strings. A
true regular expression can always be expressed using a regular grammar (cf. Section 2.5) and the
expression it describes can be matched by a finite state automaton (cf. Section 2.6). Due to the
widespread usage of Perl-esque regular expressions, which are capable of matching languages that
are less restrictive than even context-free languages, it is useful to draw a distinction between true
regular expressions and “regexes”. Regexes may contain operators that necessitate such operations
as look ahead, look behind and references to named capture groups.
At its simplest, a regular expression is just the string for which one wishes to search.
Metacharacters may be used to achieve more expressive searches (Table 2-3). To match the literal
metacharacter the character is preceded by a backslash. Backslashes also precede
abbreviated character classes e.g. \d for digits, \D for non-digits, \s for whitespace and \S for
non-whitespace.
[Figure 2-12 transition labels: append bracket to stack; remove bracket from stack; accept empty string if stack is empty]
Metacharacter/s  Meaning
.        Match any character
[ ]      Mark the start and end of the description for a single character e.g. [ab] is either ‘a’ or ‘b’; [a-z] is any of the 26 lower case characters
[^ ]     As above but matches a character that does not meet the description
^        Start of string
$        End of string
( )      Demarcate a sub expression. In regexes this is by default also a capturing group
*        Indicates the preceding expression should be repeated 0 or more times
?        Indicates the preceding expression is optional
+        Indicates the preceding expression should be repeated 1 or more times
{m,n}    Indicates a range of number of times the preceding expression should be repeated
|        Indicates a choice between the expressions either side of the operator
\        Used to indicate a literal metacharacter or a shorthand character class
Table 2-3 Regular expression metacharacters
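A few of these metacharacters are exercised in the sketch below, written with Python's `re` module for illustration. The patterns themselves are invented examples rather than patterns used in this work; the last one uses a back-reference, a feature of “regexes” that goes beyond true regular expressions.

```python
import re

# ^ and $ anchor the match; \d+ is one or more digits; \s? an optional
# whitespace character; ( | ) offers a choice between unit strings.
QUANTITY = re.compile(r"^(\d+)\s?(mL|g|mmol)$")

# \( and \) match literal brackets; the inner ( ) is a capturing group and
# [lL] is a character class accepting either case.
BRACKETED_VOLUME = re.compile(r"\((\d+) ?m[lL]\)")

# \1 refers back to whatever the first group captured.
REPEATED_WORD = re.compile(r"^(\w+) \1$")

quantity_groups = QUANTITY.match("40 mL").groups()
volume = BRACKETED_VOLUME.search("pyridine in THF (40mL)").group(1)

print(quantity_groups, volume)
print(bool(REPEATED_WORD.match("ab ab")), bool(REPEATED_WORD.match("ab ba")))
```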
Regular expressions are used as the input to build OPSIN’s grammar, describing the form of
systematic chemical names. Regexes are employed in various places throughout all projects.
2.8 OSCAR4
OSCAR (Open Source Chemistry Analysis Routines) began as a tool for checking experimental
data58. The library, now in its fourth major revision, contains functionality for extracting chemical
entities from free text as well as, where possible, resolving them to structures or ontology identifiers
(e.g. ChEBI ids5). The functionality for identifying and interpreting experimental data sections is also
retained.
As OSCAR has developed, so have the algorithms employed. This is especially pronounced in
the area of identifying chemical names. The original dictionary lookup approach was supplemented
by the addition of heuristic identification of chemical entities through regular expressions59, N-gram
analysis60 and finally by a maximum-entropy Markov model (MEMM)9. In this context, N-gram
analysis refers to calculating a probability that a word is chemical from the analysis of occurrences of
the constituent one to four letter sequences in the word as compared to known occurrences in a
training set of chemical and non-chemical words. A MEMM is used to predict the labels for a
sequence, in this case a sequence of tokens. OSCAR has a separate MEMM model for each of the
entity types that are not found by string matching. These entity types are chemical, reaction e.g.
hydroxylation, chemical adjective and enzyme. Features employed by the MEMM models include 1-4
character N-grams, the suffix of the token, whether it appears in any word lists e.g. English words,
and adjacent tokens to a given token.
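The idea behind the N-gram analysis can be illustrated with a deliberately naive scorer. This sketch is not OSCAR's algorithm: the training words are a tiny invented sample, and the score is simply a sum of add-one smoothed log-odds over the one to four letter sequences of a word, which is an assumption made here for brevity.

```python
import math
from collections import Counter


def ngrams(word, max_n=4):
    """All character sequences of length 1..max_n in the word."""
    word = word.lower()
    return [word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


# Tiny invented training sets; a real system would use far larger lists.
CHEMICAL_WORDS = ["methane", "ethanol", "propane", "benzene", "pyridine", "toluene"]
ENGLISH_WORDS = ["window", "garden", "simple", "people", "house", "water"]

chem_counts = Counter(g for w in CHEMICAL_WORDS for g in ngrams(w))
eng_counts = Counter(g for w in ENGLISH_WORDS for g in ngrams(w))
vocabulary_size = len(set(chem_counts) | set(eng_counts))


def chemical_log_odds(word):
    """Positive scores suggest a chemical word, negative a non-chemical one."""
    chem_total = sum(chem_counts.values()) + vocabulary_size
    eng_total = sum(eng_counts.values()) + vocabulary_size
    score = 0.0
    for gram in ngrams(word):
        p_chem = (chem_counts[gram] + 1) / chem_total   # add-one smoothing
        p_eng = (eng_counts[gram] + 1) / eng_total
        score += math.log(p_chem) - math.log(p_eng)
    return score


for word in ["ethane", "window"]:
    print(word, round(chemical_log_odds(word), 2))
```

Although “ethane” is absent from the training words, its letter sequences are shared with “methane” and “ethanol”, so it scores as chemical.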
The current version of OSCAR is OSCAR4 (ref. 14), which differs from previous incarnations by being
divided into modules using Maven (cf. Section 2.10 and Figure 2-13). This allows for independent
aspects of the program to be utilised without bringing in the entirety of OSCAR4.
Figure 2-13 Interdependencies between the modules of OSCAR4
OSCAR4 is indirectly employed in this work as a tagger and tokeniser for use in
ChemicalTagger (Section 2.9). To improve ChemicalTagger’s performance some improvements were
made to OSCAR4’s tokeniser (Section 4.4.5.1). Input was also put into developing the OPSIN
dictionary which is an implementation of OSCAR4’s IChemNameDict interface backed by OPSIN.
2.9 ChemicalTagger
ChemicalTagger61 is a tool for annotating chemical text that was developed in the Murray-Rust
group of the Unilever Centre, Cambridge. Its overall function is to attempt to extract semantic
information from chemistry documents and, in particular, from experimental sections. Figure 2-14
gives a schematic of the program’s architecture.
Figure 2-14 Architecture of ChemicalTagger
The input to the program is a string of text, typically a paragraph. The first step in the
workflow involves the Formatter normalising the input. For example, all hyphens are normalised
to a single hyphen type.
The next step is tokenisation. ChemicalTagger has a tokenisation interface which is
implemented by both an OSCAR4 tokeniser and a simpler whitespace tokeniser. For synthetic
chemistry text, the OSCAR4 tokeniser typically performs better and hence was used in this work.
Some specific tokenisations that are unlikely to be performed by more general tokenisers are
performed by the sub-tokeniser, the most important of which is the tokenising of numbers directly
concatenated to a unit. For example ‘50ml’ is tokenised to [‘50’, ‘ml’]. This is necessary to allow such
cases to be recognised as two tokens, hence allowing the numeric value and unit to be separately
tagged.
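The number and unit split can be sketched with a single regular expression. The unit list and function name below are hypothetical and much smaller than anything a real sub-tokeniser would need; the sketch is not ChemicalTagger's implementation.

```python
import re

# Hypothetical, deliberately short list of units.
UNITS = ("mL", "ml", "L", "g", "mg", "mmol", "mol", "h", "min")

# A number (optionally with a decimal part) directly followed by a known unit.
NUMBER_UNIT = re.compile(
    r"^(\d+(?:\.\d+)?)(" + "|".join(re.escape(u) for u in UNITS) + r")$"
)


def sub_tokenise(tokens):
    """Split tokens such as '50ml' into ['50', 'ml']; leave others unchanged."""
    result = []
    for token in tokens:
        match = NUMBER_UNIT.match(token)
        if match:
            result.extend(match.groups())
        else:
            result.append(token)
    return result


print(sub_tokenise(["THF", "(", "40mL", ")"]))
print(sub_tokenise(["50ml", "2a", "1.5mmol"]))
```

Restricting the second group to known units avoids splitting tokens such as compound labels (‘2a’) that merely begin with a digit.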
The tokenised input is then passed through a series of taggers which implement a tagger
interface allowing easy addition of more taggers. By default, the system employs a regex tagger, an
OSCAR4 tagger and a part of speech (POS) tagger, provided by OpenNLP62.
The regex tagger identifies chemistry related terms, for example verbs relating to chemical
processes such as ‘heated’, and adjectives relating to chemicals e.g. ‘anhydrous’. For greater
specificity later in the workflow, some prepositions are explicitly tagged, e.g. ‘in’ has its own tag. The
OSCAR4 tagger tags chemicals, as well as some enzymes, chemical adjectives and reaction
adjectives. The POS tagger tags all tokens with a POS tag using the Penn Treebank POS tags63. For
the cases where the POS tag is the literal character, the tag is changed e.g. ‘.’ becomes ‘STOP’. The
results from the taggers are then combined using a user tuneable order of preference as a decider in
the case where more than one tagger produces a tag for a token.
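The combination step can be sketched as taking, for each token, the tag from the first tagger in the preference order that produces one. The three toy taggers below are dictionary lookups standing in for the real regex, OSCAR4 and POS taggers, and the preference order shown is an assumption for illustration rather than ChemicalTagger's documented default.

```python
# Toy stand-ins for the real taggers: each maps a token to a tag or None.
REGEX_TAGS = {"solution": "NN-CHEMENTITY", "of": "IN-OF", "in": "IN-IN"}
OSCAR_TAGS = {"pyridine": "OSCAR-CM", "THF": "OSCAR-CM"}
POS_TAGS = {"a": "DT", "solution": "NN", "of": "IN",
            "pyridine": "NN", "in": "IN", "THF": "NNP"}


def regex_tagger(token):
    return REGEX_TAGS.get(token)


def oscar_tagger(token):
    return OSCAR_TAGS.get(token)


def pos_tagger(token):
    return POS_TAGS.get(token)


def combine_tags(tokens, taggers):
    """For each token keep the tag from the most preferred tagger that has one."""
    combined = []
    for token in tokens:
        candidates = (tagger(token) for tagger in taggers)
        combined.append(next((tag for tag in candidates if tag is not None), None))
    return combined


tokens = ["a", "solution", "of", "pyridine", "in", "THF"]
tags = combine_tags(tokens, [regex_tagger, oscar_tagger, pos_tagger])
pos_first = combine_tags(tokens, [pos_tagger, regex_tagger, oscar_tagger])
print(tags)
print(pos_first)
```

Changing the order of the list changes which tagger wins a conflict, which is the sense in which the preference is user tuneable.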
A limitation that is quickly encountered with naive regex tagging of key words is polysemy, the
capacity for the same word to have multiple meanings dependent on context. One of the more
common and important distinctions that needs to be made is between a word being a verb or an
adjective e.g. ‘the solution was concentrated using’ as compared to ‘concentrated sulfuric acid’.
These problems, as well as a few special cases indicative of excessive tokenisation, are identified and
corrected by hand crafted rules, prior to producing the final lists of tokens and tags. Table 2-4 shows
an example input after tokenisation and then after tagging.
Input To a vigorously stirred solution of pyridine in THF (40mL)
Tokenised To a vigorously stirred solution of pyridine in THF ( 40 mL )
Tagged TO DT RB JJ-CHEM NN-CHEMENTITY IN-OF OSCAR-CM IN-IN OSCAR-CM -LRB- CD NN-VOL -RRB-
Table 2-4 Example of ChemicalTagger output after tokenisation and output after tagging. Bold terms
are identified by the regex tagger, italicised by the OSCAR4 tagger and the remaining two tokens by the POS
tagger. Note that ‘stirred’ is initially tagged as a VB-STIR before subsequently being corrected, during tag
correction, to a JJ-CHEM due to its adjacency to an NN-CHEMENTITY.
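The correction of ‘stirred’ noted in the caption can be sketched as a single hand crafted rule. The exact conditions used by ChemicalTagger are not given here, so the rule below (a `VB-` tag immediately followed by `NN-CHEMENTITY` becomes `JJ-CHEM`) is an assumed simplification.

```python
def correct_tags(tags):
    """Retag a verb as a chemical adjective when it directly precedes NN-CHEMENTITY."""
    corrected = list(tags)
    for i in range(len(corrected) - 1):
        if corrected[i].startswith("VB-") and corrected[i + 1] == "NN-CHEMENTITY":
            corrected[i] = "JJ-CHEM"
    return corrected


# 'To a vigorously stirred solution': the verb reading is corrected.
before = ["TO", "DT", "RB", "VB-STIR", "NN-CHEMENTITY"]
after = correct_tags(before)

# 'The solution was stirred for': the verb reading is kept.
unchanged = correct_tags(["DT", "NN-CHEMENTITY", "VBD", "VB-STIR", "IN-FOR"])
print(after)
print(unchanged)
```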
The lists of tags and tokens are then interlaced and given to the chemical sentence parser.
This parser is prebuilt from an ANTLR3 grammar64. This grammar has been hand-written to describe
the makeup of experimental chemistry paragraphs. ANTLR3 grammars nominally describe a subset
of context-free grammars but may also include “semantic predicates”. A semantic predicate allows
the execution of a Boolean method to determine whether or not a production rule in the grammar
may be used. Such a method may examine the current token or any previous or future token and
execute arbitrary code to make its decision. In principle, the method could even have persistent
state allowing the parsing of languages requiring greater expressivity than a context-free grammar.
The chemical sentence parser will attempt to break the input down into a tree structure called
an abstract syntax tree (AST). This is formed of sentences which in turn are formed of phrases e.g.
“NounPhrase”, “VerbPhrase”, “PrepPhrase”. Phrases are formed of other phrases, compound
constructs such as “MOLECULE”s or of tags. At its deepest level the tree will be formed entirely of
tags which correspond to the tags originally input to the sentence parser.
The AST is subsequently converted to XML and enriched by the annotation of phrase types
and roles for some molecules. Assignment of phrases is achieved by looking for the presence of
certain tags. For example a phrase would be annotated as a “Yield” if it contained a VB-Yield.
Solvents are identified by their presence after key words when in certain phrases, such as “Dissolve”
and “Wash” phrases. Roles may also be assigned from a successful match with the Hearst pattern65
[MOLECULE] ‘as a’ [NN-CHEMENTITY], or from a molecule being mentioned as being ‘in’ another
molecule, indicating the latter molecule to likely be a solvent. Figure 2-15 shows an example of the
final XML output.
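[Figure 2-15 content: the XML element mark-up was lost in extraction. The surviving text nodes are ‘To’, ‘a’, ‘vigorously’, ‘stirred’, ‘solution’, ‘of’, ‘pyridine’, ‘in’, ‘THF’, ‘(’, ‘40’, ‘mL’ and ‘)’, with the brackets held in _-LRB- and _-RRB- elements]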
Figure 2-15 Final ChemicalTagger output for the input from Table 2-4.
The output from ChemicalTagger is vital to many aspects of the reaction extraction system
described in Chapter 4. As a result significant effort was made, during this project, to improve
ChemicalTagger’s performance and extend it to identify concepts of importance for extracting
reactions as further described in Section 4.4.5.
2.10 Apache Maven
Science often works by building on prior achievements and the same is true of software
development. When building more advanced software, it is typically easier and quicker to employ
existing solutions to problems as dependencies rather than attempting to re-implement their
functionality. As the number of dependencies of a project increases, managing these dependencies
can become time consuming. The Apache Maven build system66 offers numerous advantages,
especially in managing larger projects with many dependencies.
In Maven, each project corresponds to an artifact, or to multiple artifacts in the case
of a project made from multiple modules. Each artifact is assigned a groupId, artifactId and version
description e.g.
<artifactId>chemicalTagger</artifactId>
<groupId>uk.ac.cam.ch</groupId>
<version>1.3.1</version>
The artifactId is the name of the project/module and the groupId is a unique string which is
typically a domain name that you control. The combination of these fields should be sufficient to
uniquely identify a particular artifact. For released artifacts, the version number will typically be
numeric and for a given artifactId/groupId must be unique. For rapid development, it is often useful
for downstream projects to be able to depend on the latest version of a dependency without the need
to constantly update the dependency version. This can be accomplished by adding the special suffix
‘-SNAPSHOT’ to the version description string of the artifact. As a snapshot version of an artifact
changes with time, snapshot versions of dependencies should not be relied upon for releases.
Artifacts are stored in Maven repositories which, typically, are internet accessible. From these
repositories, the artifacts that a project requests are downloaded for local use. These dependencies
may in turn have their own dependencies which are recursively acquired. Figure 2-16 shows an
example of the result of this for the InChI module of OPSIN.
Figure 2-16 Dependency hierarchy for the opsin-inchi module
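The recursive acquisition of dependencies can be sketched as a depth-first walk over a repository index. The artifact names below are hypothetical, and the sketch ignores versions, scopes and Maven's conflict resolution; it only shows how indirect dependencies are reached through direct ones.

```python
# Hypothetical repository index: artifact -> its direct dependencies.
REPOSITORY = {
    "app": ["parser-lib", "logging-lib"],
    "parser-lib": ["xml-lib", "logging-lib"],
    "xml-lib": [],
    "logging-lib": [],
}


def resolve(artifact, repository):
    """Return all direct and indirect dependencies in order of discovery."""
    resolved = []

    def visit(name):
        for dependency in repository.get(name, []):
            if dependency not in resolved:   # acquire each artifact only once
                resolved.append(dependency)
                visit(dependency)            # then acquire its own dependencies

    visit(artifact)
    return resolved


print(resolve("app", REPOSITORY))
```

Here `xml-lib` is an indirect dependency of `app`, reached only through `parser-lib`, while `logging-lib` is requested twice but acquired once.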
All the projects involved in this work were either natively available via Maven, or were
manually added to our Maven repository. As can be seen from Figure 2-17, advanced projects can
easily require a non-trivial number of dependencies.
[Figure 2-16 legend: direct dependencies; indirect dependencies; scope in which a dependency is required, e.g. test is just for running unit tests]
Figure 2-17 How the open source projects involved in this project relate to each other. Green projects
are projects developed predominantly for this project whilst yellow projects are those in which significant
improvements were undertaken. Utility and unit testing libraries are not shown.
2.11 Distributed version control
In larger software development projects, such as the ones involved in this work, version
control is essential. Version control allows a developer to associate a message with each set of
changes that are made to a project’s code or dependent files. Reverting the state of particular files
or the whole project to an older revision is then trivial and is useful when one needs to investigate
the effect that a particular change had on a program’s output or to reinstate previously removed
functionality. On larger projects, the version control system must support multiple developers
committing changes, to which an elegant solution is a distributed version control system.
Under a distributed version control system each developer has a local repository into which
their changes are committed. A notable advantage of this approach is that significant changes to a
program can be done incrementally and only distributed when the program is once again stable.
Change sets can be transferred between repositories by “pushing” or “pulling”. For accessibility,
backup purposes and for clarity as to what is the latest code, it is useful to have a central repository
on a web-based site such as Bitbucket67 or GitHub68 that is readily accessible to all involved. It should
be noted that such a repository is only the central repository by convention as it does not differ in
structure from any of the other repositories.
The software developed as part of this thesis employs Mercurial69 for version control, due to
its wide support and ease of use, and Bitbucket for code hosting.
2.12 Continuous integration testing
As projects get larger it is highly useful to be able to define tests indicating the expected
output of methods for a given input. These can be used to ensure that changes have not broken
existing functionality and that added functionality behaves as expected. When coding in Java,
an easy way of implementing such tests is by using JUnit70. A continuous integration service can be
set up to constantly, or periodically, query a repository for changes. If a change has been committed,
the service automatically builds the project and runs its unit tests. If a failure in building the project
or running its unit tests is detected, the developers of the project can be emailed allowing them to
immediately look into the cause of the failure. The Jenkins71 continuous integration service was used
for this purpose (Figure 2-18).
For Maven projects, Jenkins can be configured to automatically deploy snapshot versions of
the project to a Maven repository allowing the project’s dependents to instantly benefit from the
updated version. Additionally, Maven projects that Jenkins is aware of, and which depend on such a
project, can be automatically built and tested to check that the updated dependency has not caused
problems in any of these downstream projects. For example, whenever OSCAR4 is updated, OSCAR4
will be built and tested. The same will then happen for ChemicalTagger, and then the patent
reaction extraction project, as each depends on the previous project.
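The cascade of builds can be sketched as a breadth-first walk over the projects that depend on the one that changed. This is only an illustration and not how Jenkins is implemented; the `DEPENDS_ON` table is an assumed simplification of the chain described above, and a general dependency graph would need a proper topological sort.

```python
from collections import deque

# Assumed, simplified dependency chain: project -> projects it depends on.
DEPENDS_ON = {
    "OSCAR4": [],
    "ChemicalTagger": ["OSCAR4"],
    "patent-reaction-extraction": ["ChemicalTagger"],
}


def build_order(updated):
    """List the updated project followed by every project that depends on it."""
    order, queue = [], deque([updated])
    while queue:
        project = queue.popleft()
        if project in order:
            continue
        order.append(project)
        # Queue every project that lists this one as a dependency.
        queue.extend(p for p, deps in DEPENDS_ON.items() if project in deps)
    return order


print(build_order("OSCAR4"))
```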
Figure 2-18 View of selected projects from Jenkins. The coloured orb indicates at a glance whether the
project’s last build was successful (green), had unit test failures (yellow) or failed (red). The “weather”
pictogram indicates the stability of previous builds.
Chapter 3 Conversion of Chemical Names to Structures
This chapter describes the work undertaken in this project on the successful development of
OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source chemical name to
structure algorithm. A paper based on the work more fully described in this chapter was published in
the Journal of Chemical Information and Modeling72. The paper was the journal’s most accessed
paper during the month it was published and was among the top 20 most accessed for the year in
March 2012. OPSIN was also included in a review of open source cheminformatics applications73.
3.1 Introduction
3.1.1 History of systematic nomenclature
Compared to most spoken languages, systematic chemical nomenclature is a relatively recent
invention, with initial codification at the 1892 Geneva conference74. This conference defined a system
of nomenclature, known as the Geneva system, allowing the specification of simple compounds (e.g.
hydrocarbons) with substituents and common functional groups. This system evolved through
multiple recommendations75–80 into what is now known as IUPAC nomenclature. The interested
reader is referred to Smith Jr.’s review paper documenting the history of systematic organic
nomenclature81.
The Chemical Abstracts Service (CAS)82,83 and Beilstein have also developed systematic
nomenclature for use in Chemical Abstracts and Beilstein’s Handbook of Organic Chemistry
respectively. These nomenclature systems employ for the most part the same vocabulary and
operations as IUPAC nomenclature but are designed to uniquely assign a name to each compound.
As a result the nomenclature operations they describe are approximately a subset of those
documented by the IUPAC.
An additional consideration when producing an alphabetical listing of chemicals is that
structurally related compounds should be listed closely together. This is achieved in CAS
nomenclature through the use of inverted index names which place the parent group in front of the
group’s substituents e.g. ‘4-aminobenzenesulfonamide’ becomes ‘benzenesulfonamide, 4-amino-’.
Traditionally some substitutions of common parent groups yielded different trivial names which
would likely end up far apart in the index. To alleviate this problem, CAS index names have become
steadily more systematic by the removal of trivial names in favour of systematic parent group names
(Figure 3-1).
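The relationship between an inverted index name and its uninverted form can be sketched for the simplest case only, in which a single substituent block follows the parent. The function below is a hypothetical illustration, not the procedure used by CAS or by OPSIN (OPSIN's handling is the subject of Section 3.2.5).

```python
def uninvert(index_name):
    """Move the trailing substituent block of a simple inverted name to the front."""
    parent, separator, substituents = index_name.partition(", ")
    # Only handle the simple "parent, substituents-" pattern; leave anything else alone.
    if not separator or not substituents.endswith("-"):
        return index_name
    # Drop the trailing hyphen and lower-case the parent's leading capital.
    return substituents[:-1] + parent[0].lower() + parent[1:]


print(uninvert("Benzenesulfonamide, 4-amino-"))
```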
Figure 3-1 CAS index name: Benzenesulfonamide, 4-amino- (trivial name: sulfanilamide)
3.1.1 Classes of chemical name
Chemical names may be broadly categorised as systematic, semi-systematic and trivial (Figure
3-2). A trivial name cannot be decomposed into morphemes and can only be understood by
dictionary lookup. If a trivial name is substituted in a logical manner then the name is semi-
systematic. Names in which the parent group and all substituents are named in such a way that their
structures may be deduced by breaking them down into morphemes are systematic.
Figure 3-2 1,3,7-trimethyl-1H-purine-2,6(3H,7H)-dione (systematic), 1,3,7-trimethylxanthine (semi-
systematic), caffeine (trivial)
This categorisation is made far more blurry by the presence of retained trivial names in IUPAC
nomenclature. These are groups that are preferred over their systematic alternatives (e.g. ‘purine’)
or even in some cases the only allowed option. For example the alkane chains of length less than five
e.g. methane and ethane are trivial as they do not start with a morpheme indicating the number of
carbons in the chain. Names such as ‘monane’ and ‘diane’ would be systematic but are unknown.
3.1.2 General construction of systematic names
The order of construction of a systematic substitutive name is outlined in Figure 3-3.
Figure 3-3 Components of a substitutive name
Detachable prefixes:
These are terms such as ‘ethyl’, ‘chloro’ etc. They describe groups that will be
attached to the parent compound. Formally these describe radicals and are referred
to as substituents in this text.
Hydro/dehydro prefixes:
E.g. ‘dihydro’. These describe the addition or removal of hydrogen from a system.
They are used inconsistently, being treated sometimes as detachable and sometimes as
non-detachable prefixes.
Non-detachable prefixes:
E.g. ‘aza’, ‘1H-’, ‘cyclo’, ‘methano’, ‘benzo’ etc. These prefixes are used to modify the
parent group e.g. changing atom element type, cyclising the structure, adding
bridges etc.
Name of parent:
E.g. ‘meth’, ‘benzene’, ‘acet’ etc. This is the name of the parent group.
Endings:
These are employed on alkanes and some natural products to indicate unsaturation
e.g. ‘ene’.
Suffixes:
There are two types of suffixes, cumulative suffixes that may be used in combination
with other suffixes e.g. ‘ium’, ‘yl’ and functional suffixes that describe the
functionality of the group and only one of which may be present e.g. ‘amide’, ‘oic
acid’.
Other components that may appear in multiple parts of a chemical name include:
Locants:
These may be numeric, Greek characters or an element symbol (which may be used
in conjunction with a non-element symbol locant). Locants indicate the position on
the parent group referred to by the operation that the locant precedes.
Multipliers:
These indicate that an operation should be performed multiple times.
Stereodescriptors:
These are used to indicate the stereochemistry of a detachable prefix or parent
group.
3.1.3 History of programmatic name to structure conversion
Efforts to employ computers to extract information from chemical names date back to the
work of Garfield84,85 in 1962. His program decomposed simple substitutive chemical names, formed
of acyclic components, into their constituent morphemes. Each morpheme can then be treated as
either specifying a molecular formula e.g. ‘prop’ = C3 or modifying the molecular formula e.g. ‘ene’
indicates the presence of a double bond. A relatively simple formula was then employed to calculate
the hydrogen count from the heavy atom composition and the number of double bonds, hence
yielding a molecular formula. A molecular formula could then be used as a search parameter, for
example to search for compounds of identical composition in formula indexes of resources such as
Chemical Abstracts.
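The kind of calculation involved can be sketched as follows. This is not Garfield's program; it is a minimal illustration that assumes a tiny hypothetical morpheme table and the standard relationship for acyclic hydrocarbons, in which a chain of n carbons has 2n + 2 hydrogens and each double bond removes two of them.

```python
# Hypothetical morpheme tables, limited to a few stems and endings.
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4}   # stem -> number of carbons
ENDINGS = {"ane": 0, "ene": 1}                        # ending -> number of double bonds


def molecular_formula(name):
    """Derive the molecular formula of a simple acyclic hydrocarbon name."""
    for stem, carbons in STEMS.items():
        for ending, double_bonds in ENDINGS.items():
            if name == stem + ending:
                hydrogens = 2 * carbons + 2 - 2 * double_bonds
                carbon_part = "C" if carbons == 1 else "C%d" % carbons
                return "%sH%d" % (carbon_part, hydrogens)
    raise ValueError("name not covered by this sketch: " + name)


print(molecular_formula("propene"))
```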
In 1967 and 1974, CAS described86,87 an in-house tool for converting chemical names in CAS
nomenclature to structures. The aims were to verify that the names were syntactically valid and
ultimately that they agreed with the structure that was present in the CAS registry. The approach
interpreted chemical names in a left to right manner, without the use of a formal grammar. The
program is documented as supporting fused ring nomenclature (via dictionary lookup), bridges,
hydro prefixes, prefix functional replacement of oxygen by sulfur, von Baeyer nomenclature, spiro
nomenclature, conjunctive nomenclature, skeletal replacement, ring assemblies of two rings, and
special cases of subtractive nomenclature. As the inverted index form of chemical names is most
often used in CAS nomenclature, this tool is able to handle uninverted names through a special case
that inverts them.
In the 1980s, work at the University of Hull by Kirby et al. yielded a name to structure
algorithm based upon a formal grammar88–93. Their solution parsed chemical names from right to left
using a context-free grammar. The series of publications noted that technically IUPAC nomenclature
is a context-sensitive language if one enforces the order of enclosing marks (from outermost to
innermost: “{”, “[”, “(”), but if one does not enforce this order of bracket nesting the language is
then context-free. These publications also acknowledged that, beyond bracketing, much of IUPAC
nomenclature can be expressed by a regular grammar; which is the approach that is utilised by
OPSIN (Section 3.2.4).
Another interesting solution was CHEMNAME, developed by Chugai Pharmaceuticals in Japan.
This solution has been published through a series of seven Japanese conference papers from 1991-
2005. All these papers are written in Japanese making it difficult for non-Japanese readers to
understand the details of the solutions and algorithms described. The most recent papers in the
series describe advanced capabilities such as algorithmic handling of fused ring systems and support
for natural product nomenclature94–96.
3.1.4 Current solutions
At the time of starting this project in 2008, two open source attempts at performing chemical
name to structure conversion were identified: ChemNomParse97 from the University of Manchester and OPSIN60
from the University of Cambridge. ChemNomParse had not been updated since 2003 and in testing
was found to only support alkanes, cycloalkanes, simple substitution from common substituents and
common suffixes. Precision was also found to be poor primarily due to terminal suffixes being
systematically misplaced (Figure 3-4).
Figure 3-4 Output for 3-chloropropanamide from the ChemNomParse GUI. The suffix is placed at the
end of the chain rather than the beginning.
OPSIN (Open Parser for Systematic IUPAC nomenclature) was a component of OSCAR3, a tool
for performing chemical text mining. The program supported alkanes, unsaturation and cyclisation
of alkanes, most Hantzsch-Widman nomenclature, bicyclo von Baeyer systems, mono spiro systems
with two rings, substitutive nomenclature, common suffixes, hydro and indicated hydrogen prefixes
and multi word names for esters/acyl halides/salts.
A review of the area98 gave the following contemporary description of OPSIN:
“OPSIN is presently limited to the decoding of basic IUPAC nomenclature but can handle
bicyclic systems, and saturated heterocycles. OPSIN does not currently deal with stereochemistry,
organometallics and many other expected domains of nomenclature”
Commercial solutions are available from ACD/Labs99, Bio-Rad Laboratories100, PerkinElmer
(formerly CambridgeSoft)101, ChemAxon102, ChemInnovation103, InfoChem104 and OpenEye105. With
the exception of PerkinElmer’s solution none of these approaches have been detailed in the
literature.
PerkinElmer’s solution, Name=Struct106, takes a lenient approach to chemical nomenclature
with the intention of supporting not just well formed names but names with minor mistakes, names
that explicitly contradict nomenclature recommendations and even names conforming to no formal
nomenclature recommendations. To this end Name=Struct takes a more relaxed approach to
tokenisation choosing to always recognise the longest allowed token at each character with ad hoc
rules preventing incorrect tokenisations and any punctuation being allowed to delimit tokens.
Tokens are associated with one or more meanings ordered in a hierarchy of preference with
disambiguation being achieved by examination of the token’s local environment. This process is
described in detail in a patent from CambridgeSoft107. The Name=Struct algorithm is incorporated
into ChemBioDraw.
MDL have a European patent that covers software to extract chemical information and in
particular reactions from text108. The patent describes a software application named Reverse
AutoNom. This software does not appear to be commercially available to the public.
The Heidelberg Institute for Theoretical Studies (formerly European Media Laboratories) has
also investigated this problem and produced CLP(name2structure)109. This application differs
somewhat from the previous solutions as it aims to be able to represent ambiguous chemical names.
This software is not currently distributed and it is unclear from the paper where the described
system lies between a proof of concept and a comprehensive solution.
3.2 Development and implementation of OPSIN
3.2.1 Strategy for development of OPSIN
The current version of OPSIN was arrived at by the incremental addition of areas of
nomenclature. The most important aspect when adding support for a new area of nomenclature was
to make sure that, as new names became parsable, they either produced the intended
interpretation or the case was recognised as not being currently supported and hence no structure
was returned. In this way new nomenclature can be added whilst maintaining high precision.
As new nomenclature often required new functionality e.g. the ability to specify
stereochemistry, the underlying capabilities of the program were also incrementally updated. As
much nomenclature was not considered in the program’s original design, refactoring of existing
functionality was frequently required to provide a framework that could elegantly support both the
added and existing nomenclature.
The version of OPSIN documented in this thesis ultimately shares very little in common with
the version of OPSIN inherited in 2008. The current codebase of approximately 27,000 lines of Java is
nearly an order of magnitude larger and even areas of nomenclature nominally supported by the
original program have been overhauled to allow more complete support.
All subsequent references to OPSIN refer to version 1.2.0. This is the latest released version as of
the time of writing and was released on the 6th December 2011. Unless explicitly stated to the
contrary, all chemical names and nomenclature mentioned are supported and correctly
interpretable. As OPSIN is designed to be employed on real world names, not all names given as
exemplars strictly conform to IUPAC recommendations. Depictions for exemplars are generated by
ChemBioDraw12 using SMILES produced by OPSIN.
3.2.2 Architecture
OPSIN is written in Java with grammar and token definitions described in XML. A schematic of
OPSIN’s workflow is presented in Figure 3-5. The following subsections describe and discuss the
implementation of the components of this workflow, with specific examples used to exemplify the
types of nomenclature that are handled.
Figure 3-5 Components of OPSIN’s architecture, showing the process from chemical name through to a
structure
3.2.3 Pre-processing
To simplify the recognition of terms by the parser a normalisation step based on simple string
manipulation was incorporated. This manipulation is used to normalise the majority of
representations for Greek characters, primes and other miscellaneous symbols. Additionally the
traditional British spelling of sulfur is normalised to remove the requirement of having two lexical
variants for all terms incorporating the substring ‘sulf’. After normalisation all characters in the
chemical name are printable ASCII characters. This property is utilised as a speed optimisation during
tokenisation (Section 3.2.4.3). Examples of the result of this string normalisation can be seen in Table
3-1.
Input string        Normalised output string
λ                   lambda
lambda              lambda
.lambda.            lambda
$l                  lambda
sulphuric acid      sulfuric acid
`                   '
′                   '
“                   ''
⁗                   ''''
±                   +-
ᴅ                   D
æ                   ae
é                   e
Table 3-1 Example input and normalised output
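A normalisation step of this kind can be sketched with plain string replacement. The following is an illustrative approximation rather than OPSIN's implementation: the `REPLACEMENTS` list covers only some rows of Table 3-1, and the use of Unicode decomposition to strip accents is an assumption made for brevity.

```python
import unicodedata

# Assumed subset of the substitutions in Table 3-1, applied in order.
REPLACEMENTS = [
    ("λ", "lambda"),
    (".lambda.", "lambda"),
    ("$l", "lambda"),
    ("sulph", "sulf"),
    ("`", "'"),
    ("′", "'"),
    ("⁗", "''''"),
    ("±", "+-"),
    ("ᴅ", "D"),
    ("æ", "ae"),
]


def normalise(name):
    """Map common variant spellings and symbols onto an ASCII form."""
    for variant, replacement in REPLACEMENTS:
        name = name.replace(variant, replacement)
    # Decompose accented letters and drop the combining marks (e.g. é -> e).
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(normalise("sulphuric acid"))
```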
3.2.4 Tokenisation and parsing
3.2.4.1 Introduction
Chemical nomenclature can be thought of as being an artificial language. However unlike most
artificial languages, such as programming languages, the morphemes of chemical names are often
not delimited. As a result, tokenisation of chemical names must, at least to some extent, rely on a
predefined lexicon of what may be found in a chemical name. The approach taken by Corbett and
Murray-Rust60 was to create all possible tokenisations for a chemical name based on the program’s
lexicon. However, as such an approach does not take into account the context, many tokenisations
will ultimately be found to be incorrect. For example ‘propan-2-ol’ would be tokenised to [‘prop’,
‘an’, ‘-’, ‘2-’, ‘ol’] and [‘propa’, ‘n-’, ‘2-’, ‘ol’], for which the latter is clearly wrong (the ‘n-’ token is the
same that would be found in ‘n-butane’). As all possible tokenisations must be generated, assuming
the number of places where tokenisation is ambiguous is n, and that each instance results in two
possibilities, this approach leads to 2^n possible tokenisations. In longer systematic names, this can
cause an impractically large number of tokenisations to be generated and for the tokenisation
process to take an unacceptably long time.
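The lexicon-only approach can be sketched as a recursive enumeration. The code is an illustration of the general idea and not the implementation of Corbett and Murray-Rust; the seven-entry `LEXICON` is hypothetical and chosen to reproduce the ‘propan-2-ol’ example. Every additional ambiguous position multiplies the number of candidates that must be generated.

```python
# Hypothetical lexicon, just large enough for the 'propan-2-ol' example.
LEXICON = ["prop", "propa", "an", "n-", "-", "2-", "ol"]


def all_tokenisations(name):
    """Enumerate every way of splitting name into lexicon tokens, ignoring context."""
    if not name:
        return [[]]
    results = []
    for token in LEXICON:
        if name.startswith(token):
            # Tokenise the remainder and prepend the token just matched.
            for rest in all_tokenisations(name[len(token):]):
                results.append([token] + rest)
    return results


for tokenisation in all_tokenisations("propan-2-ol"):
    print(tokenisation)
```

Both tokenisations are produced because nothing in the lexicon alone says that ‘n-’ is only meaningful at the start of a name.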
The approach taken by Kirby et al.90 avoided this problem by associating the morphemes with
terminal symbols from their formal grammar. During parsing, identified morphemes may then be
restricted to those with terminal symbols that are valid at that point in the grammar. The approach
favoured the longest identified morpheme, with backtracking and selection of a shorter morpheme
being performed if at a point in the chemical name no appropriate morphemes can be identified. An
apparent drawback of this approach is that in cases where the grammar is ambiguous and selection
of a shorter morpheme would yield an alternate parse this tokenisation would not be discovered.
The approach taken by OPSIN is similar to the approach of Kirby et al. except that all possible
parses are evaluated.
3.2.4.2 Tokenisation algorithm
Before this system is explained, it should first be clarified what is actually being tokenised.
Rather than attempting to write a grammar that describes all possible chemical names OPSIN’s
grammar instead describes a chemical “word” which in this context refers to the smallest meaningful
unit of language. A chemical name relates to these words by the following grammar:
Chemical ::= Word+
Word ::= Substituent | Full | FunctionalTerm
Substituent ::= Token+
Full ::= Substituent* MainGroup
MainGroup ::= Token+
FunctionalTerm ::= Token+
Where a “substituent” word describes a fragment of a chemical compound e.g. ‘ethyl’, a “full”
word describes a standalone chemical word e.g. ‘benzene’ or ‘ethylbenzene’ and a “functional term”
describes a modification term e.g. ‘ester’.
Table 3-2 gives a few examples of the result of dividing chemical names into words. ‘Vitamin
C’ is interpreted as one word, as only when considered as a single unit does it have its intended
meaning. Similarly ‘acetic acid’ is treated as one word as ‘acid’ on its own is not currently treated as
being meaningful. It should be emphasised that a word is not defined by whitespace;
instead, the tokenisation process determines the word boundaries.
Name                Number of Words     Word Types
Ethanoate           1                   Full
Ethyl               1                   Substituent
Ethylethanoate      1                   Full
Ethyl ethanoate     2                   Substituent, Full
Acetic acid         1                   Full
Acetic anhydride    2                   Full, Functional term
Vitamin C           1                   Full
Table 3-2 Examples of chemical names with the number of words and types, as determined by OPSIN.
OPSIN’s grammar, as of v1.2.0, describes 123 discrete classes of token. Each token class can
either correspond to a list of tokens (e.g. ‘benzen’, ‘pyridin’ etc.) or, for classes that are not practical
to enumerate, to a regular expression that describes all tokens of that class (e.g. an expression for a
locant or for von Baeyer nomenclature). These lists of tokens and regular expressions are present in
external XML resource files allowing the easy addition of new vocabulary. Individual tokens are
associated in this XML with attributes containing semantic information, such as for ‘pyridin’ the
structure of pyridine and for ‘tetra’ that its value is 4. Additionally, the type of element that will
ultimately be created for this token when OPSIN produces its XML parse tree is indicated, e.g.
“group” and “multiplier” for ‘pyridin’ and ‘tetra’ respectively.
The 123 token classes in the grammar are represented for convenience as single characters
(which to avoid confusion with characters in the chemical name will be referred to as token
characters) and are each associated with a short textual description of their meaning (cf. Appendix
C). The grammar dictates which arrangements of these token characters are allowed.
The arrangements of the token characters are expressed as a large regular expression. This is
then compiled into a deterministic finite-state automaton using the dk.brics.automaton package110.
A deterministic finite-state automaton is formed of states, with each state having a set of allowed
transitions. These transitions correspond to the set of token characters that the automaton may
consume when in that state. Each transition leads the automaton to a new state. States that
correspond to an acceptable end point are called “accept states”. In OPSIN’s grammar, these always
correspond to a transition involving the endOfSubstituent, endOfMainGroup or
endOfFunctionalGroup token character.
To make maintaining and updating this regular expression tractable, it is expressed in terms of
the descriptions of the grammar token characters, with aliases used for complex expressions. For
example, there is an expression called “ringGroup” that describes any single ring, any von Baeyer
ring system or any trivial ring system. “ringGroup” is then used as part of the expression for ring
assemblies, fused systems and certain spiro systems. In v1.2.0 the “ringGroup” expression alone is 1103
characters long and the complete grammar is a regular expression 251,224 characters long.
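The use of aliases to keep a large regular expression maintainable can be sketched as follows. The alias syntax (`%name%`), the alias names other than “ringGroup”, and the single-letter token characters are all assumptions made for illustration; they are not taken from OPSIN's resource files.

```python
import re

# Hypothetical aliases over hypothetical token characters:
# R = single ring, V = von Baeyer system, T = trivial ring, M = multiplier, E = end of word.
ALIASES = {
    "%ringGroup%": "(R|V|T)",
    "%ringAssembly%": "M%ringGroup%",
    "%chemical%": "(%ringAssembly%|%ringGroup%)E",
}


def expand(expression, aliases):
    """Substitute aliases repeatedly until only token characters remain."""
    placeholder = re.compile(r"%\w+%")
    # Assumes the aliases are not cyclic; a cycle would never terminate.
    while placeholder.search(expression):
        expression = placeholder.sub(lambda m: aliases[m.group()], expression)
    return expression


grammar = expand("%chemical%", ALIASES)
print(grammar)
```

Because “ringGroup” is defined once and reused, a change to it propagates to every expression that refers to it when the grammar is next expanded.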
As previously mentioned, OPSIN does not treat whitespace as a hard delimiter; whether a given
white space breaks a word or is part of a token is instead determined by how the name has been
tokenised. Tokenisation and parsing occur simultaneously as follows:
1. From the current state in the deterministic finite-state automaton, the list of allowed
transitions is checked to determine which token characters are allowable next.
2. For each allowed next token character, an attempt is made to match all corresponding
tokens and regular expressions against the start of the untokenised remainder of the chemical name.
3. If a match is successful, the grammar token character and token are recorded and this
process is repeated with the state now being the state to which the transition led.
4. The process terminates when no more of the name can be tokenised. The
tokenisations that include the largest part of the name and end in an accept state are
returned.
This process is performed iteratively and multiple routes may be found through the
automaton yielding multiple parses. A parse is only successful if in addition to the requirement of
ending in an accept state the next character in the name is a white space or the end of the name
(Figure 3-6).
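The steps above can be sketched with a toy automaton. Everything in this example is hypothetical: the token classes, the state numbers and the lexicon are invented to reproduce the ‘propan-2-ol’ case, and the sketch is simplified in that it returns only complete parses of a word rather than the longest partial tokenisations.

```python
import re

# Hypothetical token classes: enumerated lists and one regular expression.
LEXICON = {
    "nPrefix": ["n-"],
    "stem": ["prop", "propa"],
    "unsaturation": ["an", "en"],
    "hyphen": ["-"],
    "suffix": ["ol"],
    "endOfMainGroup": [""],          # consumes no characters
}
REGEX_CLASSES = {"locant": re.compile(r"\d+-")}

# Hypothetical automaton: state -> {token class: next state}.
TRANSITIONS = {
    0: {"nPrefix": 1, "stem": 2},
    1: {"stem": 2},
    2: {"unsaturation": 3},
    3: {"hyphen": 4, "endOfMainGroup": 7},
    4: {"locant": 5},
    5: {"suffix": 6},
    6: {"endOfMainGroup": 7},
}
ACCEPT_STATES = {7}


def matches(token_class, text):
    """Yield each token of the given class that matches the start of text."""
    for token in LEXICON.get(token_class, []):
        if text.startswith(token):
            yield token
    pattern = REGEX_CLASSES.get(token_class)
    if pattern is not None:
        found = pattern.match(text)
        if found:
            yield found.group()


def parse(name, state=0, parsed=()):
    """Return every complete parse as a list of (token, token class) pairs."""
    results = []
    # Success needs an accept state AND the end of the name or a white space next.
    if state in ACCEPT_STATES and (name == "" or name[0].isspace()):
        results.append(list(parsed))
    for token_class, next_state in TRANSITIONS.get(state, {}).items():
        for token in matches(token_class, name):
            step = parsed + ((token, token_class),)
            results.extend(parse(name[len(token):], next_state, step))
    return results


for found in parse("propan-2-ol"):
    print([token for token, _ in found])
```

The ‘propa’ branch dies immediately because no unsaturation token matches ‘n-2-ol’, and the branch that ends the main group after ‘an’ is rejected because the next character is a hyphen rather than a white space.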
Figure 3-6 Example of how OPSIN’s parser can quickly reject ungrammatical tokenisations. Note the
paths through the automaton, and how only one parse reaches an acceptable end point.
3.2.4.3 Looking up tokens in the lexicon
To make the tokenisation/parsing process as fast as possible OPSIN employs a radix trie to
store vocabulary tokens. A trie is a tree data structure in which strings sharing a common prefix
share a common node (Figure 3-7). Each node accepts a single character and may have up to as
many children as there are characters in the alphabet. The time taken to look up whether a string is present in a
trie is practically independent of the number of lexicon entries and instead scales linearly with the
length of the string that is being looked up hence yielding excellent performance even with a large
lexicon.
[Figure 3-6 content: propan-2-ol → [prop]an-2-ol → [prop][an]-2-ol → [prop][an][-]2-ol → [prop][an][-][2-]ol → [prop][an][-][2-][ol][] (end of main group). Rejected branches (X): [propa]n-2-ol and [prop][an][]-2-ol.]
Figure 3-7 A trie describing indol, indolizin, indolin, inden, indazol and indan. Circular nodes are
accepting nodes
Due to the wide lexical variety in chemical names, especially trivial names, a standard trie is
memory inefficient and hence a radix trie is employed by OPSIN. In a radix trie, nodes with only a
single child are merged with their child such that all non-accepting nodes have more than one child
(Figure 3-8).
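The relationship between the two structures can be sketched by building a standard trie and then merging single-child, non-accepting nodes into their children. This is a minimal illustration using nested dictionaries, not OPSIN's data structure; the function names are hypothetical.

```python
def build_trie(words):
    """Build a standard trie with one character per edge."""
    root = {"children": {}, "accepting": False}
    for word in words:
        node = root
        for character in word:
            node = node["children"].setdefault(
                character, {"children": {}, "accepting": False})
        node["accepting"] = True
    return root


def compress(node):
    """Turn a trie into a radix trie by merging non-accepting single-child nodes."""
    merged = {}
    for label, child in node["children"].items():
        while len(child["children"]) == 1 and not child["accepting"]:
            (extra, child), = child["children"].items()
            label += extra
        compress(child)
        merged[label] = child
    node["children"] = merged
    return node


def prefixes(node, text, matched=""):
    """Return every stored word that is a prefix of text."""
    found = [matched] if node["accepting"] else []
    for label, child in node["children"].items():
        if text.startswith(label):
            found += prefixes(child, text[len(label):], matched + label)
    return found


WORDS = ["indol", "indolizin", "indolin", "inden", "indazol", "indan"]
radix = compress(build_trie(WORDS))
print(sorted(radix["children"]))
print(prefixes(radix, "indolin-2-one"))
```

The `prefixes` lookup reflects how a tokeniser would use the structure: it walks the name once and collects every lexicon entry that matches its start.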
Figure 3-8 A radix trie describing indol, indolizin, indolin, inden, indazol and indan. Circular nodes are
accepting nodes
A trie data structure can also be used to efficiently check for the existence of prefixes that
differ by a small change such as a character insertion, deletion or transposition. Whilst not
investigated in this work, extending OPSIN to suggest spelling corrections in an efficient manner is
hence believed to be highly tractable.
The regular expressions that correspond to the non-enumerable token classes are compiled in
advance into deterministic finite-state automata which, like the trie, have run time solely dependent
on the length of string matched i.e. independent of the complexity of the regular expression the
state machine describes. The process of compiling regular expressions to deterministic finite-state
automata can be quite slow especially for the regular expression that describes the grammar; hence
the resultant deterministic finite-state automata are serialised and only updated if the regular
expressions that generate them are altered.
3.2.4.4 Generation of parses
Each word that OPSIN is able to parse will produce one or more parses; a parse being formed
of a list of token/token class pairs. All possible combinations of the parses for each word are then
generated. For the majority of chemical names, this process results in only one parse for the
chemical name as each constituent word could be parsed unambiguously (Figure 3-9).
Figure 3-9 Number of parses generated from parsable IUPAC names in the December 2011 ChEBI
database
An example of a name in the 4 parses category was: ‘3,7,11,15-tetramethylhexadeca-
2,6,10,14-tetraen-1-yl diphosphate’ for which the following tokenisations were generated:
[3,7,11,15-, tetra, meth, yl, hexa, deca, -, 2,6,10,14-, tetra, en, -, 1-, yl] [di, phosphate]
[3,7,11,15-, tetra, meth, yl, hexa, deca, -, 2,6,10,14-, tetra, en, -, 1-, yl] [diphosphate]
[3,7,11,15-, tetra, meth, yl, hexadeca, -, 2,6,10,14-, tetra, en, -, 1-, yl] [di, phosphate]
[3,7,11,15-, tetra, meth, yl, hexadeca, -, 2,6,10,14-, tetra, en, -, 1-, yl] [diphosphate]
As the same token may appear in multiple token classes, the existence of multiple parses does not
necessarily indicate ambiguity in tokenisation.
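The way per-word parses combine can be sketched with a Cartesian product. The token lists are those given above; the variable names are hypothetical, and token classes are omitted for brevity.

```python
from itertools import product

# Alternative tokenisations found for each of the two words.
first_word = [
    ["3,7,11,15-", "tetra", "meth", "yl", "hexa", "deca", "-",
     "2,6,10,14-", "tetra", "en", "-", "1-", "yl"],
    ["3,7,11,15-", "tetra", "meth", "yl", "hexadeca", "-",
     "2,6,10,14-", "tetra", "en", "-", "1-", "yl"],
]
second_word = [["di", "phosphate"], ["diphosphate"]]

# Every combination of one parse per word is a parse of the whole name.
name_parses = [list(combination) for combination in product(first_word, second_word)]
print(len(name_parses))
```

Two alternatives for each of two words give the four parses listed above.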
While the path through the finite-state automaton described by OPSIN’s grammar is
unambiguous for a given sequence of token characters, in practice neither the token classes
involved nor even the tokenisation is known a priori. The parser instead investigates all token
class/token pairs that are acceptable to the grammar and match the chemical name. Ambiguity can
arise for two reasons. The first is that a word may have different senses, e.g. oxide can be a synonym
for ether (‘diethyl oxide’) or mean the addition of oxygen (‘trimethylphosphine oxide’). This can only
be disambiguated in the next step, where the relationship between the words is considered. The
second is that a term could exhibit ambiguity that is non-trivial to disambiguate. For example
‘tetradecyl’ is parsed as [tetradec][yl] or [tetra][dec][yl]. Cases of this type are rare and have been
dealt with, on a case by case basis, as part of the Component Generation component (cf. Section 3.2.7.6).
3.2.4.5 Drawbacks of a regular grammar
The only significant drawback encountered in representing chemical nomenclature using a
regular grammar has been the problem of representing recursive bracketing. Areas in which
bracketing is not recursive, such as bracketed ring assemblies, are handled precisely by the
grammar. In a regular grammar, one cannot express (to an infinite depth) a language of the form
…((a))… where the number of open and close brackets is identical. It is, however, possible to
write the same expression in a form where the number of open and close brackets can be any
number i.e. not necessarily matched. Hence, OPSIN only matches opening and closing brackets with
each other after the parsing stage (Section 3.2.7.1).
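Matching brackets after parsing is straightforward once the tokens are available, because a stack can track nesting to any depth. The sketch below is a generic illustration of that idea and is not OPSIN's code; the function name and the token list are hypothetical.

```python
CLOSING_TO_OPENING = {")": "(", "]": "[", "}": "{"}


def match_brackets(tokens):
    """Pair each opening bracket token with its closing partner by index."""
    stack, pairs = [], []
    for index, token in enumerate(tokens):
        if token in CLOSING_TO_OPENING.values():
            stack.append(index)
        elif token in CLOSING_TO_OPENING:
            # The most recently opened bracket must be of the matching type.
            if not stack or tokens[stack[-1]] != CLOSING_TO_OPENING[token]:
                raise ValueError("unmatched closing bracket at token %d" % index)
            pairs.append((stack.pop(), index))
    if stack:
        raise ValueError("unmatched opening bracket at token %d" % stack[-1])
    return pairs


print(match_brackets(["8-", "(", "chloro", "meth", "yl", ")"]))
```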
3.2.4.6 Right to left parsing
By default, OPSIN parses names from left to right, but by reversing the automata that
describe the grammar/non-enumerable token classes and employing tries with reversed strings, it
can also be used from right to left with near identical results. Differences arise primarily from
reasons outside of the parser e.g. the rightToLeft tokenisation routine currently does not remove
whitespace within brackets and cannot handle the presence of extraneous words such as ‘compound
with’. Genuine differences in principle may arise from the fact that the non-enumerable token class
automata are greedy e.g. historically 3,4'-Bi-pyridinyl could only be parsed from right to left because,
from left to right, 3,4'-Bi- was parsed as two locants where 4'-Bi means the atom of bismuth that is
attached to an atom with locant 4'!
The number of states in the reversed chemical grammar automaton is significantly lower,
4885 states as compared to 10747 in the left to right variant, indicating that there should be fewer
routes through the automaton. This did not however translate into any improvement in tokenisation
speed. The ability to parse from right to left is currently solely employed to assist in debugging which
part of an unparsable name is at fault. For example, in a name such as ‘1,3-dimethyl-4-unknownyl-
benzene’ OPSIN from left to right would be able to say ‘1,3-dimethyl-’ was a substituent and the
name was parsable up to ‘1,3-dimethyl-4-’. From right to left, OPSIN would determine that ‘benzene’
was a full term and that the name was parsable up to ‘yl-benzene’. Hence, this combination would
single out the point of failure and identify “unknown(yl)” as a potential vocabulary term.
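The combination of the two parse directions can be illustrated with a minimal Python sketch. It assumes that the two parsers have already reported how far they got; the function name and its arguments are hypothetical and are not part of OPSIN.

```python
def unparsed_region(name, parsed_from_left, parsed_from_right):
    """Return the part of `name` that neither parse direction could consume.

    `parsed_from_left` is the prefix accepted by the left to right parse and
    `parsed_from_right` the suffix accepted by the right to left parse.
    """
    if not name.startswith(parsed_from_left):
        raise ValueError("parsed_from_left is not a prefix of the name")
    if not name.endswith(parsed_from_right):
        raise ValueError("parsed_from_right is not a suffix of the name")
    start = len(parsed_from_left)
    end = len(name) - len(parsed_from_right)
    # If the two parses overlap, there is nothing left to blame.
    return name[start:end] if start < end else ""


print(unparsed_region("1,3-dimethyl-4-unknownyl-benzene",
                      "1,3-dimethyl-4-", "yl-benzene"))
```

For the example name above this isolates ‘unknown’ as the candidate vocabulary term.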
3.2.4.7 XML generation
After parsing, an XML element is created for every token, except those that lack semantic
meaning (e.g. an optional ‘e’ or an optional hyphen), to yield an XML parse tree. It should be noted
that tokens from different token classes need not create different elements. For example, ‘chloro’
and ‘meth’ are tokens in different token classes but both produce a “group” element. These
elements become children of substituent, root and functionalTerm elements with the
special end of word grammar token characters being used to facilitate this chunking. These in turn
are children of word elements. This is best illustrated with an example (Figure 3-10).
Figure 3-10 XML parse tree produced for ‘ethyl (1R,5S)-8-(chloromethyl)-8-azabicyclo[3.2.1]oct-2-ene-3-carboxylate’. For this name, only one parse is produced.
3.2.5 CAS index name uninversion
CAS index names are employed by CAS to allow the parent group that denotes the most senior
functionality of a molecule to be at the front of the name and hence used for indexing. This offers
significant advantages for alphabetic indexing in which the same group with different substituents
could end up in completely different places in the index. The process for inversion of chemical
names is briefly documented in the CAS nomenclature guidelines [83]. For the simple cases the index
name is simply the parent group followed by a comma, termed the inversion comma, and then the
substituents each ending in a hyphen (e.g. Figure 3-11).
Figure 3-11 CAS name: benzene, ethyl- IUPAC name: ethylbenzene
The inversion is somewhat more complicated when functional class nomenclature is employed
(Figure 3-12), when multiplicative nomenclature is involved (Figure 3-13) or when esters are
involved (Figure 3-14).
Figure 3-12 CAS name: Disulfide, bis(2-chloroethyl) IUPAC name: Bis(2-chloroethyl) disulfide or 1,2-bis(2-chloroethyl)disulfane. Note that the substituent does not have a hyphen, indicating that it is not a prefix of the ‘disulfide’.
Figure 3-13 CAS name: Benzoic acid, 4,4'-methylenebis[2-chloro- IUPAC name: 4,4'-Methylenebis[2-chlorobenzoic acid]. Note that the index name has unbalanced brackets as compared to the uninverted name!
Figure 3-14 CAS name: Phosphoric acid, ethyl dimethyl ester IUPAC name: ethyl dimethyl phosphate. Note the change from phosphoric acid to phosphate.
OPSIN supports CAS index names by performing an uninversion step prior to parsing.
Uninversion is attempted when ‘, ’ is found in a chemical name. A comma followed by a space should
be present in all well-formed CAS index names. The uninversion process involves the following steps:
1. Split the name on ‘, ’; the first entry in the resulting array should be the parent group.
2. If the parent group contains the space character, verify that the words beyond the first are either ‘acid’ or something OPSIN’s parser understands.
3. Iterate through the other members of the array. There should only be one but this is not enforced.
4. A phrase like ‘compound with’ is ignored and a flag is set indicating that subsequent words should be appended to the final name.
5. The array entry under consideration is split into words by splitting on the space character.
6. If a word ends with a hyphen it is a substituent. If the substituent is missing a closing bracket, a closing bracket is added to the parent group.
7. If a word does not end in a hyphen, OPSIN’s parser is used to determine its word type, which in turn determines how the word is added to the name. Substituents will be substituents involved in functional class nomenclature and hence go at the front of the name. Functional terms are appended to the end of the name. If the functional term is ‘ester’, the suffix of the parent group is modified. Full terms are treated in the same way as functional terms if they end in ‘ate’ or ‘ite’, or are a hydrohalide. If they do not, uninversion fails.
8. If the word is a CAS collective index it is ignored e.g. ‘(9CI)’.
9. The final name is formed from space separated substituents for functional class nomenclature, then concatenated substituents and the parent group, followed by space separated functionalTerms and mixture components.
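The simplest of these steps can be sketched in Python as follows. This is an illustrative sketch only, not OPSIN's implementation: it handles just hyphen-terminated substituents, unbalanced brackets and collective index markers, and it omits functional class nomenclature, esters and ‘compound with’. The function names are hypothetical, and keeping the trailing hyphen of each substituent is an assumption that the downstream parser tolerates it.

```python
CLOSER = {"(": ")", "[": "]", "{": "}"}


def unclosed_brackets(text):
    """Return the closing brackets needed to balance `text`, innermost first."""
    stack = []
    for ch in text:
        if ch in CLOSER:
            stack.append(CLOSER[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    return "".join(reversed(stack))


def uninvert_simple(cas_name):
    """Uninvert a CAS index name that consists only of a parent and prefixes."""
    parts = cas_name.split(", ")
    if len(parts) < 2:
        return cas_name  # no inversion comma: nothing to do
    parent = parts[0]
    prefixes = []
    for entry in parts[1:]:
        for word in entry.split(" "):
            if word.startswith("(") and word.endswith("CI)"):
                continue  # CAS collective index marker, e.g. (9CI)
            if not word.endswith("-"):
                raise ValueError("unsupported word in this sketch: " + word)
            # A bracket opened in the substituent is closed after the parent.
            parent += unclosed_brackets(word)
            prefixes.append(word)
    return "".join(prefixes) + parent


print(uninvert_simple("Benzoic acid, 4,4'-methylenebis[2-chloro-"))
```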
3.2.6 Chemical word rule assignment
After parsing has been completed, a chemical name will have been tokenised into substituent,
full and functional term words. “Word Rules” describe the interactions between these words. For
example, in ‘ethyl ethanoate’ (a substituent and a full word), the word rule ‘ester’ will be assigned
indicating that the ethyl group should be connected to the charged oxygen on the ethanoate with
the charge removed. Without word rules, OPSIN would not know how the ethyl fragment and
ethanoate group interact. OPSIN’s current word rules are listed in Table 3-3. Most word rules are
only employed by chemical names using functional class nomenclature (Section 3.2.10.4).
Word Rule | Example
acetal | Propanal dimethyl acetal
additionCompound | Carbon tetrachloride
acidHalideOrPseudoHalide | Cyanic chloride
amide | Nitrous amide
anhydride | Acetic anhydride
biochemicalEster | Adenosine 5'-triphosphate
carbonylDerivative | Propanone oxime
divalentFunctionalGroup | Diethyl ether
ester | Ethyl ethanoate
functionalClassEster | Acetic acid ethyl ester
functionGroupAsGroup | Cyanide
glycol | Ethylene glycol
glycolEther | Ethylene glycol monomethyl ether
hydrazide | Phosphoric hydrazide
monovalentFunctionalGroup | Ethyl alcohol
multiEster | Ethyl propyl methylphosphonate
oxide | Thiophene 1,1-dioxide
polymer | Poly(ethylene)
simple | Ethylbenzene
substituent | Chloro
Table 3-3 Word rules and example names that correspond to them
Word rule assignment is achieved by a mixture of looking at the string value of words, in
particular the functional terms e.g. ‘ester’, and looking in more detail at the XML OPSIN has
generated for a particular word. The following are two examples of word rules employed by OPSIN:
Additional word rules can be added trivially by adding entries such as the above to the
appropriate XML file but must be backed up by code within the program, describing the operations
that the word rule requires.
When a word rule matches, the matched word elements are nested within a new
containing wordRule element.
Word rules may be nested, allowing the interpretation of nested functional class
nomenclature. For example, ‘choline hydrogen sulfate’ is first matched by the ester word rule and
then by the biochemicalEster word rule (Figure 3-15).
Figure 3-15 choline hydrogen sulfate and its corresponding XML after word rule assignment. Contents
of word elements not shown for clarity.
If the molecule element has multiple wordRule children this indicates that the name
describes either an ionic substance or a mixture. For (semi)metal halides/oxides it may be unclear
whether it is best to represent the structure covalently or ionically. This is determined at the word
rule assignment stage using cut-offs on a quantitative van Arkel diagram [111]. For giant covalent
structures a known limitation is that neither the ionic nor the covalent form is a good representation.
Stoichiometry is determined by multipliers at the start of the words or specified after the name (cf.
Section 3.2.13.1).
If no word rules match, a rule exists that allows substituents to be combined with other
substituents or full words so that for example ‘ethyl benzene’ is interpreted initially as a substituent
and a full word but then is converted to just one full word ‘ethyl-benzene’. At the end of word rule
assignment, all words should have been assigned to a word rule; otherwise an error is thrown.
Whether this error is thrown for names that correspond to the substituent word rule (names
formally representing radicals), e.g. ‘ethyl’, is controlled by a user-configurable switch.
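The overall flow of word rule assignment can be sketched as below. This is a deliberately reduced illustration: the Word class, the function name and the two hard-coded rules are hypothetical, whereas OPSIN's actual rules are defined in an XML file and backed by program code. Only the ‘ester’ rule, the ‘simple’ and ‘substituent’ rules and the fallback merging of a substituent with a full word are modelled.

```python
from dataclasses import dataclass


@dataclass
class Word:
    kind: str  # "substituent", "full" or "functionalTerm"
    text: str


def assign_word_rule(words, allow_radicals=False):
    """Return (rule name, words) for a tokenised name, or raise ValueError."""
    kinds = [w.kind for w in words]
    if kinds == ["full"]:
        return ("simple", words)
    if kinds == ["substituent", "full"]:
        if words[1].text.lower().endswith("ate"):
            return ("ester", words)  # e.g. 'ethyl ethanoate'
        # Fallback: no rule matched, so merge the substituent into the full word.
        merged = Word("full", words[0].text + "-" + words[1].text)
        return ("simple", [merged])
    if kinds == ["substituent"] and allow_radicals:
        return ("substituent", words)  # names formally representing radicals
    raise ValueError("words could not be assigned to a word rule")


rule, grouped = assign_word_rule([Word("substituent", "ethyl"), Word("full", "ethanoate")])
print(rule, [w.text for w in grouped])
```

The allow_radicals argument plays the role of the user-configurable switch mentioned above.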
3.2.7 Component generation
Component generation deals with processing nomenclature that can be efficiently acted upon
without access to a connection table representation of the fragments.
3.2.7.1 XML Transformations
Some terms which are described in the grammar by regular expressions are not monolithic in
nature and become more amenable if broken down further. Additionally some terms can benefit
from normalisation. The XML parse tree may be manipulated to achieve this. Table 3-4 summarises
these transformations/normalisations.
Term | Example | How it is handled
Superscript indication in locants | N^4 → N4 | Superscript indication removed as ambiguity is not introduced
Provisional recommendation for indicating a heteroatom attached to a numeric locant | 4-N → N4 | Transformed into the older nomenclature for this type of locant
Greek character name in locant | ALPHA → alpha | Lower cased (OPSIN locants are case sensitive)
Added hydrogen in locant | 2(9H) → 2 and 9H | Added hydrogen removed from locant and added hydrogen element created
Locant that also indicates stereochemistry | 1(S) → 1 and 1S | Stereochemistry removed from locant and locanted stereochemistry element created
Carbohydrate style locants | 2,4,6 tri O → O2,O4,O6 tri | Transformed into more general form
Ortho/meta/para locants | o → 1,ortho | Normalisation to full lower case word. Context sensitive addition of implicit ‘1’ locant
Indicated hydrogen | 1H,2H → 1H 2H | Indicated hydrogen blocks split up and locant attributes set
Stereochemistry | (1R, 2R) → 1R 2R | Converted to individual stereochemistry elements with locant attribute where locants provided
Infixes | thi oic acid → oic acid | Infixes become an attribute of the following suffix except in cases where multiplier use is ambiguous (Section 3.2.9.9a)
“Suffix prefixes” | sulfonic acid → ic acid | “suffix prefix” becomes an attribute of the following suffix
Lambda Convention | 1lambda4,5 → 1lambda4 4,5 | Lambda Convention either assigned as an attribute of an adjacent heteroatom replacement term or formed into a new element with appropriate locant attribute
Table 3-4 Summary of XML transformations performed
As OPSIN does not employ a context-free grammar, no attempt at bracket matching is done at
the grammar level. The only depth present in the XML parse tree is the division of a name into
substituent, root and functionalTerm elements (Figure 3-16). Bracketing depth is
subsequently added in by matching openbracket and closebracket elements (Figure 3-17).
The type of bracket i.e. round, curly or square is currently ignored. Any unmatched brackets will be
reported and the parse will be rejected.
Figure 3-16 XML parse tree for ‘(chloromethyl)’ prior to bracket matching
Figure 3-17 XML parse tree for ‘(chloromethyl)’ after bracket matching
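Bracket matching after parsing amounts to nesting a flat token sequence with a stack. The following Python sketch works on a plain list of token strings rather than on OPSIN's XML elements; the function name and the list-based representation are assumptions made for illustration. As described above, the type of bracket is ignored and unmatched brackets cause rejection.

```python
OPEN_BRACKETS = set("([{")
CLOSE_BRACKETS = set(")]}")


def nest_brackets(tokens):
    """Turn a flat token list into nested lists, one level per bracket pair."""
    root = []
    stack = [root]
    for token in tokens:
        if token in OPEN_BRACKETS:
            child = []
            stack[-1].append(child)
            stack.append(child)
        elif token in CLOSE_BRACKETS:
            if len(stack) == 1:
                raise ValueError("unmatched closing bracket")
            stack.pop()  # bracket type is deliberately not compared
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unmatched opening bracket")
    return root


print(nest_brackets(["(", "chloro", "meth", "yl", ")"]))
```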
3.2.7.2 Generation of alkanes
Example: dodectetractkiliane
General Syntax: units? tens? hundreds? thousands? unsaturation
The vast majority of alkane names are systematic in nature (Table 3-5) and hence, as the
syntax for generating them is straightforward, it makes sense to generate them algorithmically
rather than via enumeration. Even though only the alkanes of lengths 1-4 and 11 are trivial in nature, for
implementation purposes it is simplest to just consider alkanes of lengths 1-9 as trivial. All other
alkanes of lengths 10+ can then be considered as systematic, allowing OPSIN to support creation of
alkanes of length up to 9999 [112]. The length of the chain may be calculated from summing the
number of thousands/hundreds/tens/units in the name e.g. dodectetractkiliane is 2 + 10 + 400 +
1000 = 1412. OPSIN allows the creation of an alkane of length 11 either systematically (hendecane)
or trivially (undecane). Numbering of alkanes is achieved by simply numbering the chain from one
end to the other.
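The chain length calculation can be sketched with a regular expression that mirrors the general syntax given above. The vocabulary below is a small illustrative subset using the elided spellings of the example (‘tetract’, ‘kili’); it is an assumption made for this sketch and not OPSIN's token list, and connecting vowels are not handled.

```python
import re

UNITS = {"hen": 1, "do": 2, "tri": 3, "tetra": 4, "penta": 5,
         "hexa": 6, "hepta": 7, "octa": 8, "nona": 9}
TENS = {"dec": 10, "icos": 20, "cos": 20, "triacont": 30, "tetracont": 40}
HUNDREDS = {"hect": 100, "dict": 200, "trict": 300, "tetract": 400}
THOUSANDS = {"kili": 1000, "dili": 2000, "trili": 3000, "tetrali": 4000}
GROUPS = (("units", UNITS), ("tens", TENS),
          ("hundreds", HUNDREDS), ("thousands", THOUSANDS))

# units? tens? hundreds? thousands?  (each part is an optional named group)
PATTERN = re.compile("".join(
    "(?P<%s>%s)?" % (name, "|".join(sorted(table, key=len, reverse=True)))
    for name, table in GROUPS))


def chain_length(stem):
    """Sum the numerical parts of a systematic alkane stem (without 'ane')."""
    match = PATTERN.fullmatch(stem)
    if not stem or match is None:
        raise ValueError("not a recognised systematic alkane stem: " + repr(stem))
    return sum(table[match[name]] for name, table in GROUPS if match[name])


print(chain_length("dodectetractkili"))  # 2 + 10 + 400 + 1000
```

Backtracking in the regular expression engine resolves prefixes shared between classes, for example ‘tetra’ (units) versus ‘tetract’ (hundreds).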
Alkane stem | Chain Length | Systematic?
meth (systematic = hen) | 1 | ✗
eth (systematic = do) | 2 | ✗
prop (systematic = tri) | 3 | ✗
but (systematic = tetr) | 4 | ✗
pent | 5 | ✓
hex | 6 | ✓
hept | 7 | ✓
oct | 8 | ✓
non | 9 | ✓
dec | 10 | ✓
undec (systematic = hendec) | 11 | ✗
dodec | 12 | ✓
n/a | 13+ | ✓
Table 3-5 Alkane stems and whether they can be formed systematically
Isomers of alkanes in general are formed by systematic nomenclature but for a limited set of
isomers a traditional method involving modifiers may be employed (Table 3-6). OPSIN implements
these modifiers systematically by generating appropriate SMILES for the branched alkane chain. Care
is taken to avoid generating nonsensical structures e.g. ‘isopropane’ and to respect cases where these
modifiers are not used systematically e.g. ‘t-octyl’.
Modifier | Meaning | Example
n or normal | Straight chain (default behaviour) | n-butane
t or tert | Atom that suffixes apply to is bonded to two methyl groups and the remaining atoms in the chain | tert-pentyl
i or iso | The opposite end of the chain to which suffixes apply has two methyl groups attached to the penultimate atom | isopentyl
s or sec | The second atom in the chain is used for suffixes (hence meaningless if the alkane does not have a suffix) | sec-pentyl
neo | The opposite end of the chain to which suffixes apply has three methyl groups attached to the penultimate atom | neohexane
Table 3-6 Modifier prefixes for producing alkane isomers
3.2.7.3 Generation of heteroatom hydrides
Example: pentaphosphane
General Syntax: multiplier heteroatomhydride
For chains of non-metals other than carbon and boron the chain may be named by the
combination of a multiplier with the name of the hydride [77] (Rule 2.2.2). OPSIN implements this
algorithmically to generate appropriate SMILES. Care is taken to avoid confusion between this
nomenclature and cases where multiple copies of the hydride are being referred to e.g.
‘1,4-diazanyl-benzene’. Numbering is the same as for alkanes.
3.2.7.4 Generation of heterogeneous heteroatom hydrides
Example: disilazane
General Syntax: multiplier heteroatom heteroatom unsaturation
If a chain is made of alternating heteroatoms it may be named by a multiplier in front of a
heteroatom, where the multiplier indicates the count of that heteroatom in the chain, followed by
the other heteroatom in the chain [77] (Rule 2.2.3). OPSIN implements this algorithmically to generate
appropriate SMILES. The SMILES generated depend on whether or not the chain is prefixed with
‘cyclo’ as this changes the composition of the chain (Figure 3-18). Numbering is the same as for
alkanes.
Figure 3-18 disiloxane (left) and cyclodisiloxane (right)
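A minimal sketch of how SMILES for such alternating chains might be generated is given below, including the change in composition caused by ‘cyclo’. It emits standard SMILES with explicit hydrogen counts rather than OPSIN's internal SMILES; the function names and the small valency table are assumptions made for illustration.

```python
VALENCY = {"Si": 4, "Ge": 4, "N": 3, "P": 3, "O": 2, "S": 2}
ORGANIC_SUBSET = {"B", "C", "N", "O", "P", "S"}


def atom_smiles(symbol, bonds):
    """SMILES for one atom; bracketed atoms need an explicit hydrogen count."""
    if symbol in ORGANIC_SUBSET:
        return symbol
    hydrogens = VALENCY[symbol] - bonds
    if hydrogens == 0:
        return "[%s]" % symbol
    if hydrogens == 1:
        return "[%sH]" % symbol
    return "[%sH%d]" % (symbol, hydrogens)


def alternating_chain(count, multiplied, other, cyclic=False):
    """SMILES for e.g. disilazane: `count` atoms of `multiplied` alternating with `other`."""
    if count < 1 or (cyclic and count < 2):
        raise ValueError("chain is too short")
    # An acyclic chain starts and ends with the multiplied heteroatom;
    # a cyclic one contains equal numbers of both heteroatoms.
    n_atoms = 2 * count if cyclic else 2 * count - 1
    parts = []
    for i in range(n_atoms):
        symbol = multiplied if i % 2 == 0 else other
        is_end = i in (0, n_atoms - 1)
        bonds = 0 if n_atoms == 1 else (1 if is_end and not cyclic else 2)
        token = atom_smiles(symbol, bonds)
        if cyclic and is_end:
            token += "1"  # ring closure between the first and last atoms
        parts.append(token)
    return "".join(parts)


print(alternating_chain(2, "Si", "O"))               # disiloxane
print(alternating_chain(2, "Si", "O", cyclic=True))  # cyclodisiloxane
```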
The nomenclature of heterogeneous heteroatom hydrides overlaps with the syntax of
Hantzsch-Widman nomenclature for six-membered rings (Section 3.2.9.10). The two nomenclatures
are distinguishable by considering that in Hantzsch-Widman nomenclature the first heteroatom is of
higher priority than the second, whilst for heterogeneous heteroatom hydrides, the opposite is
always true (Figure 3-19).
Figure 3-19 dioxathiane. Left: a HW interpretation. Right: incorrect heterogeneous heteroatom hydride
interpretation
3.2.7.5 Generation of hydrocarbon ring systems
3.2.7.5a Von Baeyer nomenclature
Example: bicyclo[3.2.1]octane
General Syntax: multiplier cyclo von Baeyer descriptor alkane
The von Baeyer system [113] is used to name polyalicyclic ring systems. This differs from fused
ring nomenclature (Section 3.2.9.11) which is generally applied to systems containing at least one
unsaturated ring. This distinction arises primarily from the reduction in comprehensibility when a
nomenclature is applied outside of its domain rather than any difference in expressive power.
To understand von Baeyer nomenclature, first some terms need to be defined:
Bridgehead: An atom which is bonded to three or more atoms of the ring system
Bridge: A connection between two bridgeheads. This could be an unbranched chain of atoms, an
atom or a bond. The latter two can be thought of as bridges of length 1 and 0 respectively.
For a system to be polycyclic it must necessarily have at least two bridgeheads and three
bridges. Two bridgeheads are selected and the lengths of three bridges between them form the start
of the von Baeyer descriptor. If there are no further bridges then naming is complete e.g. Figure
3-20.
Figure 3-20 bicyclo[2.2.2]octane. This structure can be clearly seen to contain 3 bridges of length 2
between its bridgehead atoms.
Any bridges beyond the third are called secondary bridges and require locants to indicate
which bridgehead atom they are between. Numbering is assigned in the order that bridges are
created (Figure 3-21).
Figure 3-21 tricyclo[2.2.1.1^2,5]octane. The numbers correspond to the numbering the von Baeyer
descriptor defines for the system. Red is the first bridge, blue is the second bridge, green is the third and
purple is the fourth bridge (a secondary bridge).
The von Baeyer descriptor is almost always followed by a description of an alkane, although a
heteroatom hydride (Section 3.2.7.3) is also allowed (Figure 3-22). The length of the alkane chain
can be, and indeed is by OPSIN, checked to ensure that it is equal to the sum of the lengths of the bridges
plus two. Additionally the multiplier preceding the von Baeyer descriptor is verified as being equal to
the number of bridges listed in the descriptor minus one (e.g. the three bridges of
bicyclo[2.2.2]octane correspond to the multiplier ‘bi’).
Figure 3-22 tetracyclo[3.3.1.0^2,4.0^6,8]nonaphosphane
OPSIN interprets von Baeyer nomenclature by algorithmically generating appropriate SMILES.
Where superscript indication is missing OPSIN heuristically attempts to determine what is a locant
and what is a bridge length indication by assuming the locant will be larger (secondary bridges are
typically short in length as longer bridges are preferred for the earlier bridges in the name).
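The two consistency checks and the construction of the ring system can be sketched as follows. The sketch produces a list of bonds between numbered atoms rather than SMILES, and it requires superscripts to be written with ‘^’ (e.g. ‘2.2.1.1^2,5’). The function names, the input notation and the choice to number each secondary bridge starting from its higher numbered bridgehead are assumptions made for illustration and are not taken from OPSIN.

```python
def parse_descriptor(descriptor):
    """'2.2.1.1^2,5' -> [(2, None), (2, None), (1, None), (1, (2, 5))]"""
    bridges = []
    for part in descriptor.split("."):
        length, _, locants = part.partition("^")
        ends = tuple(int(x) for x in locants.split(",")) if locants else None
        bridges.append((int(length), ends))
    return bridges


def von_baeyer_bonds(multiplier, descriptor, chain_length):
    """Return the bonds (pairs of atom numbers) of a von Baeyer ring system."""
    bridges = parse_descriptor(descriptor)
    if len(bridges) < 3:
        raise ValueError("at least three bridges are required")
    if chain_length != sum(length for length, _ in bridges) + 2:
        raise ValueError("chain length must equal the sum of bridge lengths plus two")
    if multiplier != len(bridges) - 1:
        raise ValueError("multiplier must equal the number of bridges minus one")

    a, b, c = (length for length, _ in bridges[:3])
    second_bridgehead = a + 2
    # Main ring: atoms 1 .. a+b+2, closed back onto atom 1.
    bonds = [(i, i + 1) for i in range(1, a + b + 2)]
    bonds.append((a + b + 2, 1))
    next_atom = a + b + 3

    def add_bridge(start, end, length):
        nonlocal next_atom
        previous = start
        for _ in range(length):
            bonds.append((previous, next_atom))
            previous = next_atom
            next_atom += 1
        bonds.append((previous, end))  # a bridge of length 0 is a single bond

    add_bridge(1, second_bridgehead, c)  # main bridge
    for length, ends in bridges[3:]:     # secondary bridges need locants
        if ends is None:
            raise ValueError("secondary bridges need superscripted locants")
        low, high = sorted(ends)
        add_bridge(high, low, length)
    return bonds


print(von_baeyer_bonds(2, "3.2.1", 8))
```

For bicyclo[3.2.1]octane this yields nine bonds over eight atoms, i.e. two rings, with atom 8 bridging atoms 1 and 5.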
All features of von Baeyer nomenclature are supported with the exception of alternating
heteroatom chains. This limitation is due to the difficulty in determining which heteroatom should
be used such as to have the correct number of each heteroatom in the system with no atom being
the neighbour of a heteroatom of the same element and also due to the negligible usage of this
nomenclature. This difficulty can be seen in the fact that specialised nomenclature is required in
some cases to actually specify which heteroatom is at locant 1! (Figure 3-23).
Figure 3-23 1N-tricyclo[3.3.1.1^2,4]pentasilazane (not OPSIN interpretable) or 1,3,5,7,10-pentaaza-2,4,6,8,9-pentasilatricyclo[3.3.1.1^2,4]decane (preferred name [78], OPSIN interpretable)
3.2.7.5b Monocyclic Spiro nomenclature
Example: dispiro[4.2.4.2]tetradecane
General Syntax: multiplier? spiro von Baeyer descriptor alkane
A spiro fusion is one in which two rings share a single atom. Ring systems formed of
monocyclic rings (i.e. not fused rings) may be named in a similar way to von Baeyer
nomenclature [114] (Rule SP-1).
The von Baeyer descriptor is interpreted from left to right using a carbon atom that will
become a spiro centre upon construction of the ring system. The first number in the descriptor
describes the number of carbon atoms forming the link from the starting atom back to itself.
Subsequent numbers describe the number of carbon atoms forming a link to a new spiro atom or
back to a previous spiro atom. Algorithmically, the point at which the numbers begin describing links
back to previous spiro centres may be determined by examination of the starting multiplier which
indicates how many spiro centres are expected in the ring system. Numbering proceeds in the order
that the links are created (Figure 3-24).
Figure 3-24 dispiro[4.3.2.1]dodecane. Atom 1 is the starting spiro atom. Atoms 2-4 are described by the
‘4’ from the descriptor, atoms 6-8 by the ‘3’, atoms 10-11 by the ‘2’ and atom 12 by the ‘1’.
For tri and higher spiro systems it is in some cases recommended and other cases required
that superscripted locants be used to indicate which spiro atom a link connects to (Figure 3-25).
Figure 3-25 trispiro[2.2.2.2.2.2]pentadecane (left) and trispiro[2.2.2^6.2.2^11.2^3]pentadecane (right)
showing the effect of superscripted locants
The use of superscripts is also essential if a spiro atom is visited more than twice e.g. Figure
3-26.
Figure 3-26 7λ^6-thiatrispiro[2.0.2.2^7.3^7.2^4.3^3]heptadecane
OPSIN interprets this nomenclature by algorithmically generating the SMILES described by the
von Baeyer descriptor. A check is performed to verify that the number of atoms in the von Baeyer
descriptor plus the indicated number of spiro atoms as given by the multiplier is equal to the number of
atoms in the alkane following the von Baeyer descriptor. Unlike in the case of von Baeyer
nomenclature, OPSIN requires indication that numbers are superscripted; the reason for this is that it
is not possible to know from the name’s syntax whether or not a number is expected to be followed
by a superscripted number (cf. Figure 3-25).
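For descriptors without superscripts, the interpretation and the atom count check can be sketched as below. The sketch returns bonds between numbered atoms instead of SMILES. It assumes that links back to previous spiro atoms are made in reverse order of creation and that the first terminal ring is numbered before its spiro atom; the function name and both of these choices are assumptions made for illustration, and superscripted descriptors are not handled.

```python
def spiro_bonds(n_spiro, descriptor, chain_length):
    """Bonds of a monocyclic spiro system, e.g. spiro_bonds(2, [4, 2, 4, 2], 14)."""
    if len(descriptor) != 2 * n_spiro:
        raise ValueError("sketch only handles descriptors with 2 numbers per spiro atom")
    if sum(descriptor) + n_spiro != chain_length:
        raise ValueError("descriptor atoms plus spiro atoms must equal the chain length")
    first = descriptor[0]
    if first < 2:
        raise ValueError("the terminal ring needs at least two non-spiro atoms")

    # Terminal ring: atoms 1 .. first, then the first spiro atom closes the ring.
    first_spiro = first + 1
    bonds = [(i, i + 1) for i in range(1, first_spiro)]
    bonds.append((first_spiro, 1))
    next_atom = first_spiro + 1

    def link(start, length, end=None):
        """Add `length` new atoms after `start`, finishing on `end` or a new spiro atom."""
        nonlocal next_atom
        previous = start
        for _ in range(length):
            bonds.append((previous, next_atom))
            previous = next_atom
            next_atom += 1
        if end is None:
            end = next_atom
            next_atom += 1
        bonds.append((previous, end))
        return end

    spiro_atoms = [first_spiro]
    for length in descriptor[1:n_spiro]:          # links to new spiro atoms
        spiro_atoms.append(link(spiro_atoms[-1], length))
    current = spiro_atoms[-1]
    for length, target in zip(descriptor[n_spiro:], reversed(spiro_atoms)):
        link(current, length, end=target)         # links back to earlier spiro atoms
        current = target
    return bonds


print(len(spiro_bonds(2, [4, 2, 4, 2], 14)))
```

For dispiro[4.2.4.2]tetradecane this gives 16 bonds over 14 atoms, i.e. three rings, with spiro atoms 5 and 8.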
OPSIN supports all rules for spiro systems formed from monocyclic rings with the exception of
the generalisation to heteroatom hydrides instead of alkanes.
3.2.7.5c Other hydrocarbon ring nomenclature
The IUPAC nomenclature of fused rings [115] (Rule Fr-2.1) defines numerous micro syntaxes for
naming specific types of hydrocarbon ring systems. OPSIN has complete support for all of these
micro syntaxes. They are implemented by using the value of the required locant/multiplier to
algorithmically generate the SMILES for the ring system (Table 3-7). All of these systems have a
minimum value below which they are undefined e.g. [2]annulene is undefined as a ring of size 2 is
impossible.
Example | Nomenclature description
[8]annulene | Defines a ring with the maximum number of non-cumulative double bonds of size given by the bracketed number
hexacene | A chain of n linearly fused benzene rings, where n is defined by the multiplier
hexaphene | A chain of (n/2) + 1 (or (n + 1)/2 if n is odd) linearly fused benzene rings fused at 120° to (n/2) - 1 (or ((n + 1)/2) - 1 if n is odd) more linearly fused benzene rings, where n is defined by the multiplier
octalene | Two rings of size n, with the maximum number of non-cumulative double bonds, where n is defined by the multiplier
triphenylene | n benzene rings fused to alternating sides of a ring of size 2n, where n is defined by the multiplier
tetranaphthylene | n naphthalene rings 2,3-fused to alternating sides of a ring of size 2n, where n is defined by the multiplier
hexahelicene | n benzene rings fused in a helical arrangement, where n is defined by the multiplier
Table 3-7 Micro syntaxes for generating hydrocarbon ring systems
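As an illustration of the simplest of these micro syntaxes, the following sketch generates a Kekulé SMILES string for [n]annulene. OPSIN itself is described as using lower case SMILES atoms to denote the maximum number of non-cumulative double bonds; writing out alternating double bonds explicitly, the function name and the minimum ring size of three are assumptions made here.

```python
def annulene_smiles(ring_size):
    """Kekule SMILES for [ring_size]annulene with alternating double bonds."""
    if ring_size < 3:
        raise ValueError("[%d]annulene is undefined" % ring_size)
    parts = ["C1"]
    for position in range(2, ring_size + 1):
        # Double bonds are placed between atoms 1=2, 3=4, 5=6, ...
        parts.append("=C" if position % 2 == 0 else "C")
    parts.append("1")
    return "".join(parts)


print(annulene_smiles(8))
```

For odd ring sizes one atom necessarily remains saturated, corresponding to an indicated hydrogen.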
3.2.7.6 Rejection of parses caused by nomenclature ambiguity
While for most names with multiple parses, all bar one will fail when performing detailed
processing of the name’s nomenclature, there exist some cases where multiple interpretations are
plausible. Usually one of these interpretations can be readily seen to be more likely than the other,
often due to one interpretation not unambiguously describing a single structure. Known cases of this
type can typically be dealt with early in the name to structure process.
One example is the ambiguity between longer alkane chains and shorter multiplied alkanes
(Figure 3-27). This arises due to OPSIN allowing a multiplier to apply to any group including an
alkaneStem. IUPAC nomenclature recognises this ambiguity and solves it by the use of group
multipliers when multiple shorter chains are desired e.g. ‘tetrakis(decyl)’. OPSIN follows these
recommendations except in the case where the multiplier is immediately preceded by as many
locants as the multiplier’s value in which case the multiple shorter chains interpretation is used.
Figure 3-27 tetradecyl correct (left) and incorrect (right) interpretations
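The rule for choosing between the two interpretations can be written down directly. In this sketch the function name and the returned tuples are hypothetical; only the decision rule itself is taken from the text above.

```python
def interpret_multiplied_stem(multiplier_value, stem_length, locants):
    """Decide between one long chain and several shorter chains.

    E.g. 'tetradecyl': multiplier_value=4 ('tetra'), stem_length=10 ('dec').
    """
    if len(locants) == multiplier_value:
        # As many locants as the multiplier's value: multiple shorter chains.
        return ("multiplied", multiplier_value, stem_length)
    return ("chain", multiplier_value + stem_length)


print(interpret_multiplied_stem(4, 10, []))                    # one C14 chain
print(interpret_multiplied_stem(4, 10, ["1", "2", "3", "4"]))  # four decyl groups
```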
A similar, but undocumented, ambiguity occurs between multiplied phenyl rings and
“polyaphene” rings (Section 3.2.7.5c). As the polyaphene interpretation is ambiguous it is not chosen
unless prefixed by a locant which would locate the “yl” to a specific atom.
Figure 3-28 Interpretations of tetraphenyl. Left: [tetra][phenyl] Right: [tetra][phen][yl]
Figure 3-29 shows another ambiguity. In this case the phenol derivative is preferred unless the
“ol” is locanted.
Figure 3-29 Interpretations of thiophenol. Left: [thio][phenol] Right: [thiophen][ol]
Another undocumented ambiguity occurs with heteroatom hydrides (Section 3.2.7.3)
combining both the “ene” and “ium” suffix with elision of the ‘e’ on the “ene” (Figure 3-30). This
clash could be avoided through the use of locants.
Figure 3-30 Incorrect interpretation of diselenium
3.2.7.7 Handling of nomenclature irregularities
IUPAC nomenclature, due to its inclusion of so many recommendations, has accumulated a
large number of oddities which for the most part can be dealt with prior to conversion of SMILES to
structures. Handling of irregularities generally involves modification of the SMILES for a group or
rejection of the parse. The following are examples of aspects of nomenclature that are considered
irregular by OPSIN and hence are special-cased.
• Presence of ‘acid’ after ‘ic’ is enforced except when followed by another word within
the same word rule e.g. ‘acetic anhydride’ is allowed
• Methylenedioxy is treated as a single group and may be used to form bridges
• Multiplied ‘ethylene’s and ‘propylene’s when followed by ‘glycol’ indicate a chain
interspersed with oxygens (Figure 3-31)
Figure 3-31 tetraethylene glycol
• ‘Xanthic acid’ and chalcogen analogues are entirely unrelated to ‘xanthene’. ‘Xanthyl’
is related to ‘xanthene’.
• Some groups have implicit locants by convention e.g. ‘anthrone’ = ‘9(10H)-anthrone’
• ‘phospho’ has a different meaning in a biochemical context to an organic chemistry
context (Figure 3-32). OPSIN determines this by looking at whether the next group is
an amino acid/biochemical group/carbohydrate.
Figure 3-32 Organic interpretation of ‘phospho’ (left) and biochemical interpretation (right)
• ‘cysteic acid’ is not a synonym for ‘cysteine’
• ‘acrylamide’ is not a substituted amide ion (amide can mean [NH2-])
• If a group directly follows ‘azo’, or the like, it is implicitly multiplied (Figure 3-33)
Figure 3-33 azobenzene = azodibenzene
• Acids bonded to Coenzyme A always connect via an acyl even if the name states ‘yl’
rather than ‘oyl’. Additionally even if the acid is a di-acid only one end is an acyl group
(Figure 3-34).
Figure 3-34 Malonyl-CoA
• ‘keto’ can be a synonym of ‘oxo’ or mean that a carbohydrate, specifically a ketose, is
in the open chain form.
• Fluoroantimonic acid is not a derivative of antimonic acid (Figure 3-35). Similar
problems exist with some other inorganic acids.
Figure 3-35 fluoroantimonic acid (left) antimonic acid (right). Systematically fluoroantimonic acid would
be antimonic acid with a hydroxyl replaced by fluorine
• ‘-quinone’ is treated in the same way as ‘-dione’ e.g. it may be prefixed with two
locants.
• ‘-ylium’ may mean the removal of a hydride ion or the formation of an acylium group
(Figure 3-36)
Figure 3-36 acetylium (left) and ethanylium (right)
• Multiplied phosphates may refer either to a chain of phosphates or to multiple
phosphate ions (Figure 3-37). OPSIN uses the chain interpretation in preference up to
a length of five phosphates.
Figure 3-37 triphosphate, most likely (left); less likely (right)
3.2.8 Connection table generation
For processing more advanced nomenclature it is necessary to generate an in memory
connection table representation for the fragments of the name under consideration. OPSIN achieves
this using a custom SMILES reader. SMILES are read in character by character using a stack to keep
track of the atom to which the next atom will be bonded. All common features of SMILES including
stereochemistry are supported with some nonstandard extensions to include information not
allowed in standard SMILES.
The most significant difference between normal SMILES readers and OPSIN’s SMILES reader is
in its interpretation of hydrogen counts. OPSIN’s hydrogen model assumes that all substitutable
hydrogens are implicit; hence one can instead just consider the valency of an atom and from that
calculate the number of hydrogens. In SMILES hydrogen may be implicit, treated as a property of an
atom or treated in the same way as other atoms. The implicit case does not require explicit handling
as OPSIN knows about the expected valences for organic atoms. When hydrogen atoms are treated
like normal atoms, or they are the property of a non-p block metal, they are considered
unsubstitutable.
In the case where hydrogens are a property of an atom OPSIN attempts to determine whether
the total incoming bond order including hydrogens is consistent with the atom being in one of its
standard valences. In the case that the atom is charged OPSIN attempts to understand the atom as
the uncharged atom in a standard valency with a certain number of protons added or removed e.g.
[NH4+] is interpreted as being in its normal valency with 1 proton added. If it is not possible to
consider the atom as being in one of the expected valences a hint about the minimum final valency
of the atom is set.
OPSIN supports two extensions that more naturally map to the concepts present in OPSIN’s
hydrogen handling model. The first is the use of ‘H?’ within a square bracket, which is interpreted as
indicating that the atom has implicit hydrogens in the same way as atoms of the organic subset. For
example ‘[SiH?]’ is interpreted to be the same as [SiH4].
The other is the ability to explicitly set the valency of an atom using the Lambda Convention
(Section 3.2.9.11). This is done using the pipe character followed by the Lambda Convention valency
e.g. [P|3] = [PH3]. The Lambda Convention extension is useful for specifying the valency of atoms in
square brackets that when finally used will have valency higher than the sum of the intra-fragment
bond orders e.g. the fragment describing ‘selenoether’ is [Se|2]. Describing this fragment as [Se] is
incorrect as this implies a valency of 0, whilst [SeH2], although also acceptable, is somewhat misleading as the
fragment is really a bare selenium known to form 2 bonds. The Lambda Convention extension also
allows OPSIN’s valency check to be bypassed e.g. F(=O)O is rejected but [F|3](=O)O is accepted as it
has been made explicit that the fluorine is expected to be that valency.
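The effect of the two extensions on the valency assigned to a bracketed atom can be sketched as follows. The regular expression covers only an element symbol, an optional hydrogen count (including ‘H?’) and an optional ‘|valency’ suffix; charges, isotopes and stereochemistry are ignored. The function names and the valency table are assumptions made for illustration and do not reproduce OPSIN's SMILES reader.

```python
import re

DEFAULT_VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1, "Si": 4, "P": 3, "S": 2, "Se": 2}

BRACKET_ATOM = re.compile(
    r"\[(?P<element>[A-Z][a-z]?)"    # element symbol
    r"(?P<h>H(?P<count>\d+|\?)?)?"   # optional H, Hn or H?
    r"(?:\|(?P<lam>\d+))?\]")        # optional |valency (Lambda Convention)


def valency_of(atom):
    """Valency implied by an isolated bracket atom such as '[Se|2]' or '[SiH?]'."""
    match = BRACKET_ATOM.fullmatch(atom)
    if match is None:
        raise ValueError("unsupported atom in this sketch: " + atom)
    if match["lam"]:
        return int(match["lam"])                  # explicitly specified valency
    if match["h"] is None:
        return 0                                  # e.g. [Se]: no hydrogens, no bonds
    if match["count"] == "?":
        return DEFAULT_VALENCY[match["element"]]  # behave like the organic subset
    return int(match["count"]) if match["count"] else 1


def substitutable_hydrogens(atom, bond_order_sum=0):
    """Hydrogens left on the atom once its bonds to other atoms are counted."""
    remaining = valency_of(atom) - bond_order_sum
    if remaining < 0:
        raise ValueError("bond orders exceed the valency of " + atom)
    return remaining


print(valency_of("[P|3]"), valency_of("[PH3]"))
print(substitutable_hydrogens("[Se|2]", bond_order_sum=2))
```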
Lower case symbols in SMILES correspond to aromaticity. In OPSIN’s SMILES reader they instead
directly correspond to the IUPAC concept of the maximum number of non-cumulative double bonds.
This allows OPSIN to know that it may assign double bonds to atoms that cannot in their normal
valency accept double bonds (Figure 3-38). OPSIN permits aromatic antimony and tellurium so that
rings with such atoms can be treated analogously to those containing arsenic and selenium.
Figure 3-38 SMILES for pyrrolizine ([cH2]1ccn2cccc12) and structures that may be ultimately generated
(1H-pyrrolizine, 3H-pyrrolizine and pyrrolizin-4-ium). Note that OPSIN
would also accept c1ccn2cccc12 even though this is not valid SMILES.
3.2.9 Specific nomenclature handling
The majority of nomenclature manipulation occurs in the section named Component
Processing in the architecture diagram (Section 3.2.2). To allow locanted operations to precede
unlocanted operations, skeletal replacement nomenclature and all indications of
saturation/unsaturation are handled during Structure Assembly. Note that in the cases of ring
assemblies and polycyclic spiro systems (which fall under Component Processing) these pieces of
nomenclature are applied prior to Structure Assembly so that the complete ring assembly or
polycyclic spiro system may be assembled prior to Structure Assembly.
3.2.9.1 Groups with indeterminately positioned structural features
Example
1,3-xylene
General Syntax locant trivialGroup
Some trivial names do not describe a particular structure but instead multiple structures. In
these cases locants preceding the trivial name may be used to specify a specific structure. In some
cases such as for ‘camphorsulfonic acid’ (Figure 3-39) unless specified otherwise a particular locant is
assumed.
Figure 3-39 camphorsulfonic acid or more precisely 10-camphorsulfonic acid
To avoid the need to enumerate all possible combinations of locants in front of a group, OPSIN
includes attributes for adding groups (cf. 1,3-xylene), higher order bonds (Figure 3-40) and
heteroatoms (Figure 3-41) to a group.
Figure 3-40 1-pyrazoline (left) and 3-pyrazoline (right)
Figure 3-41 1,8-naphthyridine (left) and 2,7-naphthyridine (right)
3.2.9.2 Traditional alkane/carboxylic acid locants
Greek locants may be used instead of numbers for locants on simple alkanes (Figure 3-42) and
carboxylic acids (Figure 3-43). To allow application to systematic as well as trivial groups OPSIN adds
these locants algorithmically. OPSIN starts numbering from the first atom/atom at which the suffix
applies. Care is taken to skip the first atom in the case where acid functionality is bonded to this
atom e.g. ‘ic acid’ but NOT ‘carboxylic acid’. Labelling proceeds along the chain as long as each atom
has one unvisited carbon neighbour i.e. branches terminate labelling. Cyclic atoms are not labelled
and terminate labelling. Groups with more than one acid group are not labelled.
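The labelling walk described above can be sketched in Python as follows. The graph representation (dictionaries of neighbours and element symbols), the function name and its parameters are all hypothetical. The sketch also assumes that an atom at a branch point is itself labelled before labelling stops, which the text does not specify.

```python
GREEK = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]


def assign_greek_locants(neighbours, elements, first, skip_first=False,
                         ring_atoms=frozenset()):
    """Label a chain with Greek locants, starting at (or just after) `first`.

    neighbours: dict atom id -> list of bonded atom ids
    elements:   dict atom id -> element symbol
    skip_first: True for suffixes such as 'ic acid', where the acid carbon
                itself is not labelled
    """
    visited = set()
    current = first
    if skip_first:
        visited.add(first)
        carbons = [n for n in neighbours[first] if elements[n] == "C"]
        if len(carbons) != 1:
            return {}
        current = carbons[0]
    labels = {}
    for greek in GREEK:
        if current in ring_atoms:  # cyclic atoms terminate labelling
            break
        labels[current] = greek
        visited.add(current)
        carbons = [n for n in neighbours[current]
                   if elements[n] == "C" and n not in visited]
        if len(carbons) != 1:  # chain end or branch terminates labelling
            break
        current = carbons[0]
    return labels


# Butanoic acid: C1 is the acid carbon, O5 and O6 its oxygens.
butanoic_neighbours = {1: [2, 5, 6], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1], 6: [1]}
butanoic_elements = {1: "C", 2: "C", 3: "C", 4: "C", 5: "O", 6: "O"}
butanoic_labels = assign_greek_locants(
    butanoic_neighbours, butanoic_elements, first=1, skip_first=True)
```

For butanoic acid the acid carbon is skipped, so carbons 2, 3 and 4 receive alpha, beta and gamma respectively.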
Figure 3-42 Traditional Greek locants on pentane
Figure 3-43 Traditional Greek locants on butyric acid/butanoic acid
3.2.9.3 Skeletal replacement nomenclature
Skeletal replacement nomenclature 76 (Rule B-4) or “a” nomenclature refers to replacing carbon
atoms in a parent structure with heteroatoms. The name “a” nomenclature comes from the fact
that all the prefixes employed end with ‘a’ (Table 3-8). These prefixes, typically locanted, indicate
which carbons should be replaced by heteroatoms e.g. Figure 3-44. OPSIN supports the full
complement of “a” prefixes.
Heteroatom   “a” prefix   Plus proton   Minus hydride   Minus proton   Plus hydride
Oxygen       oxa          oxonia        oxidanylia      oxidanida      oxidanuida
Sulfur       thia         thionia       sulfanylia      sulfanida      sulfanuida
Nitrogen     aza          azonia        azanylia        azanida        azanuida
Phosphorus   phospha      phosphonia    phosphanylia    phosphanida    phosphanuida
Table 3-8 Sample “a” prefixes for skeletal heteroatom replacement and charged analogues
Figure 3-44 1-thia-4-aza-2,6-disilacyclohexane
This nomenclature may be used in combination with the Lambda Convention if the
replacement heteroatom is not in its standard valency (cf. Section 3.2.9.11).
3.2.9.4 Conjunctive nomenclature
Example
benzeneethanol
General Syntax ring acyclic group with functionality
Conjunctive nomenclature 76 (Rules C-51 – C-58) may be applied to systems formed of a cyclic
component and an acyclic component containing the principal functional group. The acyclic
component is numbered using Greek letters to avoid ambiguity with locants on the cyclic
component (Figure 3-45). OPSIN implements conjunctive nomenclature by resolving the
nomenclature that defines the ring system, resolving suffixes onto the acyclic component,
renumbering the acyclic component, cloning the acyclic component if necessary (Figure 3-46) and
then merging the fragments using appropriate locants or heuristically. To support the rule 76 (Rule C-812.3)
that radicals terminated by ‘amine’ may be used in conjunctive nomenclature, ‘ylamine’ is an
allowed suffix in the grammar for conjunctive nomenclature (Figure 3-47).
Figure 3-45 α-chloro-β-methyl-1-naphthalenepropionic acid
Figure 3-46 2,3-naphthalenediacetic acid
Figure 3-47 fluorene-2-ethylamine. This is handled in the same way as ‘fluorene-2-ethanamine’ by
giving ‘ylamine’ the same meaning as ‘amine’ in this context.
3.2.9.5 Suffix handling
In IUPAC nomenclature suffixes are used to describe the principal functional group, to indicate
addition/removal of charge and to indicate the presence of radicals.
OPSIN categorises suffixes into three types: normal suffixes, radical adding suffixes and charge
modification suffixes. Further subdivisions are made to discriminate which are allowed to be
preceded by “suffix prefixes” and/or infixes. Modification of suffixes comes under functional
replacement (Section 3.2.9.9).
In general, the grammar is only specific enough to say which types of suffix are allowed on a
group but is incapable of saying specifically which suffixes are allowed. This is insufficiently specific:
not all suffixes are valid on all groups (Figure 3-48) and the same suffix may have different meanings
on different groups (Figure 3-49).
Figure 3-48 An incorrect interpretation of ‘acetal’ (correct name acetaldehyde)
Figure 3-49 ‘yl’ has different meanings on different acidStems. Acetyl (left) and lauryl (right)
To define the effects of suffixes and to enforce rules about which groups suffixes may apply
to, OPSIN uses external rules files (Figure 3-50). Adding more suffixes can be done by modifying
these files but this is seldom required as the number of base suffixes in IUPAC names is finite with
the vast number of possible suffixes coming from the use of infixes which OPSIN implements
algorithmically (Section 3.2.9.9).
Figure 3-50 Flowchart for handling suffixes. The group in question is the group to which the suffix will
apply.
OPSIN defines suffixes through the use of one or more of the rules outlined in Table 3-9.
Suffix Rule                             Description
addgroup                                Adds a group defined by SMILES. Optionally the groups may be labelled, have radicals indicated or have “functional atoms” indicated
addSuffixPrefixIfNonePresentAndCyclic   Typically used to add a carbon atom before a suffix e.g. ‘pyrazinoic acid’ is interpreted the same as ‘pyrazinecarboxylic acid’ would be
setOutAtom                              Sets an atom to be a radical. The valency may also be specified
changecharge                            Used to specify the change in charge and number of protons added/removed
addFunctionalAtomsToHydroxyGroups       Makes all hydroxyl oxygens functional
chargeHydroxyGroups                     Makes all hydroxyl oxygens negatively charged
removeOneDoubleBondedOxygen             Removes a double bonded oxygen
convertHydroxyGroupsToOutAtoms          For each hydroxyl group removed a radical is added
convertHydroxyGroupsToPositiveCharge    For each hydroxyl group removed the charge is increased by one
Table 3-9 Rules used to define the effects of suffixes
OPSIN handles most suffixes as being the addition of a small group to a parent group but for
inorganic acids OPSIN instead uses the suffix to mutate the structure of the acid (Figure 3-51).
Figure 3-51 carbonic acid (left), carbonyl (middle) and carbonate (right). ‘yl’ and ‘ate’ employ the
convertHydroxyGroupsToOutAtoms and chargeHydroxyGroups rules respectively
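The two rules named in Figure 3-51 can be illustrated with a deliberately simplified Python sketch. The `Fragment` class, its string-based substituent list and the function names are hypothetical stand-ins for this illustration; OPSIN itself manipulates a full connection table.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    centre: str          # central atom symbol
    substituents: list   # e.g. ["=O", "OH", "OH"]
    out_atoms: int = 0   # number of radical positions
    charge: int = 0


def charge_hydroxy_groups(frag):
    """Mimics chargeHydroxyGroups: every OH becomes a negatively charged O."""
    count = frag.substituents.count("OH")
    subs = ["O-" if s == "OH" else s for s in frag.substituents]
    return Fragment(frag.centre, subs, frag.out_atoms, frag.charge - count)


def convert_hydroxy_groups_to_out_atoms(frag):
    """Mimics convertHydroxyGroupsToOutAtoms: each OH removed adds a radical."""
    count = frag.substituents.count("OH")
    subs = [s for s in frag.substituents if s != "OH"]
    return Fragment(frag.centre, subs, frag.out_atoms + count, frag.charge)


carbonic_acid = Fragment("C", ["=O", "OH", "OH"])
carbonate = charge_hydroxy_groups(carbonic_acid)
carbonyl = convert_hydroxy_groups_to_out_atoms(carbonic_acid)
```

Under these assumptions carbonate carries a charge of -2 and carbonyl retains only the double bonded oxygen with two radical positions.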
3.2.9.6 Charge and oxidation numbers
Example
methylmercury(1+) or methylmercury(II)
General Syntax element ( chargeSpecification | oxidationNumber )
Elements, especially inorganic elements, may have their charge specified by a bracketed
signed number. This can be trivially interpreted by setting the appropriate charge on the referenced
atom.
Oxidation numbers are indicated using a bracketed Roman numeral and indicate the charge
that the atom would have if all its ligands were removed along with the electron pairs that were
shared with the atom. OPSIN does not in general support inorganic coordination nomenclature so
attempts have only been made to make sure oxidation numbers are interpreted correctly when used
in conjunction with ligands that share names with prefixes used in organic nomenclature.
Generally neutral and cationic ligands are simply the name of the ligand. As there is no suffix
present to indicate that the group is a substituent, OPSIN cannot currently support these names.
Hence OPSIN is allowed to make the assumption that all substituents are negative ligands with the
exception of ‘carbonyl’ and ‘nitrosyl’ (Figure 3-52).
Figure 3-52 dichlorotetracarbonylmolybdenum(II). The two chloro ligands each formally donate a 1- charge to
the molybdenum, which would be 2+ with no ligands, making the compound overall neutral
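The charge bookkeeping behind Figure 3-52 can be written as a short Python sketch. The function name and the `ligand_charges` override table are assumptions for illustration. Ligands default to a charge of -1, reflecting the stated assumption that substituents are negative ligands, and carbonyl is passed explicitly as neutral.

```python
def overall_charge(oxidation_number, ligand_counts, ligand_charges=None):
    """Sum the oxidation number and the formal charges of all ligands.

    ligand_counts:  dict ligand name -> number of such ligands
    ligand_charges: optional dict of exceptions; ligands not listed are
                    assumed to carry a charge of -1
    """
    charges = ligand_charges or {}
    total = oxidation_number
    for ligand, count in ligand_counts.items():
        total += count * charges.get(ligand, -1)
    return total


# dichlorotetracarbonylmolybdenum(II)
molybdenum_complex = overall_charge(
    2, {"chloro": 2, "carbonyl": 4}, ligand_charges={"carbonyl": 0})
```

The result for this example is 0, matching the overall neutral compound described in the caption.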
3.2.9.7 Indication of saturation and unsaturation
3.2.9.7a Unsaturation terms
Example
hexa-1,3-dien-5-yne
General Syntax alkaneStem (locant? multiplier? unsaturator)+
Unsaturation on hydrocarbons76(Rule A-3) is indicated using ‘en(e)’ and ‘yn(e)’. Grammatically
‘an(e)’ is similar but distinct as it may not be locanted and adds no information e.g. ‘ethyl’ =
‘ethanyl’. Unsaturation of natural products ending in ‘an’, ‘ane’ or ‘anine’ is also indicated in this way
(Figure 3-53).
Figure 3-53 Pregn-4-en-20-yne
Where not all single bonds are equivalent, a locant is used to specify one atom in the bond to
be unsaturated. The atom at the other end of the bond is implicitly the one with the locant that is 1
higher. If this is not the case, a compound locant may be used to explicitly specify the atom at the
other end of the bond (Figure 3-54). OPSIN treats compound locants as if they were normal locants
until the point where unsaturation is applied, at which point such locants are inspected to determine
if they have a bracketed section.
Figure 3-54 bicyclo[8.5.1]hexadec-1(15)-ene. The double bond goes between the atoms with locants 1
and 15
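Inspecting a locant for a bracketed section can be sketched with a regular expression. The pattern and function name below are hypothetical. The sketch assumes that a locant without a bracketed section is purely numeric, so that the implicit partner atom is simply the next number.

```python
import re

# Hypothetical pattern: a locant optionally followed by a bracketed locant.
_COMPOUND_LOCANT = re.compile(r"^(\d+[a-z]?'*)(?:\((\d+[a-z]?'*)\))?$")


def bond_atoms_from_locant(locant):
    """Return the locants of the two atoms of the bond to be unsaturated."""
    match = _COMPOUND_LOCANT.match(locant)
    if match is None:
        raise ValueError(f"unrecognised locant: {locant}")
    first, second = match.groups()
    if second is None:
        if not first.isdigit():
            raise ValueError("implicit partner needs a purely numeric locant")
        second = str(int(first) + 1)  # implicitly the locant one higher
    return first, second
```

For the name in Figure 3-54, `bond_atoms_from_locant("1(15)")` gives the atoms with locants 1 and 15, whereas a plain locant such as "4" gives 4 and 5.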
3.2.9.7b Hydro, dehydro, indicated hydrogen and added hydrogen
Example
2,7-dihydro-1H-azepine
General Syntax (locant? multiplier hydro)* (indicatedHydrogen)? ringSystem
Cyclic compounds are saturated and unsaturated using the prefixes hydro and dehydro
respectively. As these prefixes refer to the atoms of a bond they should be in multiples of two.
OPSIN implements hydro not by actually creating double bonds but by unsetting the flag
indicating that the atom may be involved in a conjugated π-system. Dehydro conversely sets the
flag. These flags are used when performing kekulisation on the ring system (cf. Section 3.2.11).
When dehydro is applied to atoms that already have the flag set an explicit triple bond is indicated
(Figure 3-55). Care is taken when applying unlocanted hydro prefixes to take into account which
atoms will implicitly have their flag unset due to having insufficient valency for a double bond due to
the addition of a suffix or substituent. Hydro prefixes are in preference used on atoms that would
otherwise be capable of supporting double bonds.
Figure 3-55 1,2-didehydrobenzene (trivial name: benzyne)
Indicated hydrogen atoms are used to indicate an atom in a ring system not involved in a
double bond (Figure 3-56). Added hydrogen atoms are used to indicate the addition of a hydrogen to
an atom in a ring system as a result of the addition of a suffix (Figure 3-57). Both indicated and
added hydrogen are implemented by unsetting the aforementioned flag.
Figure 3-56 1H-pyrrole (left) and 3H-pyrrole (right)
Figure 3-57 isoquinolin-4a(2H)-yl
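The flag-based treatment of hydro, dehydro and indicated hydrogen can be sketched in Python. The dictionary of per-atom flags and the function names are hypothetical simplifications. Kekulisation, which would later turn the remaining flags into double bonds, is not shown.

```python
def apply_hydro(pi_flags, locants):
    """Hydro (and indicated/added hydrogen): unset the flag on each atom."""
    for locant in locants:
        pi_flags[locant] = False


def apply_dehydro(pi_flags, atom_a, atom_b, triple_bonds):
    """Dehydro: set the flags, or record a triple bond if both are already set."""
    if pi_flags[atom_a] and pi_flags[atom_b]:
        triple_bonds.add(frozenset((atom_a, atom_b)))
    else:
        pi_flags[atom_a] = pi_flags[atom_b] = True


# 2,7-dihydro-1H-azepine: a seven-membered ring, every atom initially flagged.
azepine = {locant: True for locant in range(1, 8)}
apply_hydro(azepine, [1])       # indicated hydrogen, 1H
apply_hydro(azepine, [2, 7])    # 2,7-dihydro
still_flagged = sorted(l for l, flagged in azepine.items() if flagged)

# 1,2-didehydrobenzene: both atoms already flagged, so a triple bond results.
benzene = {locant: True for locant in range(1, 7)}
benzyne_triple_bonds = set()
apply_dehydro(benzene, 1, 2, benzyne_triple_bonds)
```

In the azepine example atoms 3 to 6 remain flagged, leaving room for two double bonds. In the benzyne example a triple bond between atoms 1 and 2 is recorded.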
‘Perhydro’ has historically 76 (Rule A-23.1) been used to indicate that all atoms in a ring system are
saturated (Figure 3-58).
Figure 3-58 perhydroanthracene
3.2.9.8 Subtractive nomenclature
Subtractive nomenclature is used to indicate the removal of a group. It is comparatively rare in
organic nomenclature and hence OPSIN only supports the most common usage: the removal of a
hydroxyl using ‘deoxy’ (Figure 3-59). Addition of other single atom subtractive terms such as
‘desmethyl’ can be done at the vocabulary level but is likely to be of limited benefit due to such
terms most often being used to modify trivial names e.g. drug names, that are absent from OPSIN’s
vocabulary.
Figure 3-59 2-deoxy-ᴅ-ribose, an example of subtractive nomenclature
OPSIN implements subtractive nomenclature by normalising subtractive terms to be non-
detachable prefixes (an assumption is made that such terms are likely to apply to a
biochemical/carbohydrate fragment) then applying the term to the adjacent group. Care must be
taken when removing atoms at chiral centres; if the centre is subsequently substituted this can be
thought of as replacing the atom by a hydrogen atom which in turn is replaced hence preserving
stereochemistry (Figure 3-60). As OPSIN has implicit hydrogens this is achieved by inserting a
reference to a dummy “deoxyHydrogen” which will be replaced by a reference to either a hydrogen
atom or a substituent atom when hydrogens are made explicit. At this point it can be determined
whether the centre still is a stereocentre (cf. Section 3.2.14).
Figure 3-60 2-amino-2-deoxy-ᴅ-ribose, note that stereochemistry is retained at position 2
3.2.9.9 Functional replacement
Functional replacement involves the replacement of oxygen atoms/hydroxyl groups with
other atoms or groups77(Rule 3.4 and Table 8). This replacement may be indicated by the use of either
prefixes or infixes with more recent recommendations tending to encourage infixes due to reduced
possibilities for ambiguities. OPSIN treats functional replacements as far as possible as systematic
operations. For example, OPSIN does not have ‘thiol’ in its list of suffixes as this can be systematically
derived by combination of ‘thi’ with the suffix ‘ol’.
3.2.9.9a Infix Functional Replacement
Example
methanedithioic acid
General Syntax (multiplier? infix o?)+ suffix
Infix replacement replaces oxygen atoms/hydroxyl groups within a following suffix. OPSIN
implements all IUPAC infixes with the exception of chalcogen analogues of peroxo involving two
different chalcogens as these require currently unsupported nomenclature to be used
unambiguously.
Infix                   Transformation
amid(o)                 -O → N
azid(o)                 -O → N=[N+]=[N-]
bromid(o)               -O → Br
chlorid(o)              -O → Cl
cyanatid(o)             -O → OC#N
cyanid(o)               -O → C#N
dithioperox(o)          -O- → SS
diselenoperox(o)        -O- → [Se][Se]
ditelluroperox(o)       -O- → [Te][Te]
fluorid(o)              -O → F
hydrazid(o)             -O → NN
hydrazon(o)             =O → =NN
imid(o)                 =O → =N
iodid(o)                -O → I
isocyanatid(o)          -O → N=C=O
isocyanid(o)            -O → [N+]#[C-]
isothiocyanatid(o)      -O → N=C=S
isoselenocyanatid(o)    -O → N=C=[Se]
isotellurocyanatid(o)   -O → N=C=[Te]
nitrid(o)               =O and -O → #N
perox(o)                -O- → OO
selen(o)                =O or -O → =[Se] or -[SeH]
tellur(o)               =O or -O → =[Te] or -[TeH]
thi(o)                  =O or -O → =S or -[SH]
thiocyanatid(o)         -O → SC#N
selenocyanatid(o)       -O → [Se]C#N
tellurocyanatid(o)      -O → [Te]C#N
hydroxim(o)             =O → =NO
Table 3-10 OPSIN supported infixes and the transformations they describe. Hydroxim is not an IUPAC
endorsed infix.
A multiplied infix may be formally ambiguous if no brackets are used to clarify whether the
infix is multiplied or the infixed suffix is multiplied (Figure 3-61). OPSIN disambiguates by inspecting
the multiplier type (e.g. ‘bis’ implies multiplication of the infixed suffix) and by examining the number of
available oxygen atoms (Figure 3-62).
Figure 3-61 ethanedithioic acid (left, OPSIN interpretation). Incorrect interpretation (right)
Figure 3-62 butandithione. The name clearly indicates two thione suffixes as the ‘one’ suffix only
describes one oxygen atom.
Due to some infixes accepting more than one bond order to an oxygen, these must be acted
on last to avoid problems with more specific infixes failing to apply (Figure 3-63).
Figure 3-63 ethanoic acid (left) ethanthioimidic acid (right). The thio could apply to either oxygen whilst
the imid may only apply to the double bonded oxygen
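The ordering requirement illustrated by Figure 3-63 can be sketched in Python. The `ACCEPTS` table is a small subset of Table 3-10 with simplified infix names, and both functions are hypothetical. The sketch handles at most one occurrence of each infix.

```python
# Bond types each infix may replace; a small subset of Table 3-10.
ACCEPTS = {
    "imido": ["=O"],
    "amido": ["-O"],
    "thio": ["=O", "-O"],
}


def order_infixes(infixes):
    """Place infixes that accept several bond orders last (stable sort)."""
    return sorted(infixes, key=lambda name: len(ACCEPTS[name]) > 1)


def apply_infixes(infixes, oxygens):
    """Assign each infix to one of the available oxygens."""
    remaining = list(oxygens)
    assignment = {}
    for infix in order_infixes(infixes):
        target = next((o for o in remaining if o in ACCEPTS[infix]), None)
        if target is None:
            raise ValueError(f"no oxygen left for {infix}")
        remaining.remove(target)
        assignment[infix] = target
    return assignment


# ethanthioimidic acid: 'thio' is written first but must be applied last.
thioimidic = apply_infixes(["thio", "imido"], ["=O", "-O"])
```

Had ‘thio’ been applied first it would have consumed the double bonded oxygen and ‘imido’ would have failed. With the reordering ‘imido’ takes the double bonded oxygen and ‘thio’ the hydroxyl oxygen.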
If the atom to which infix replacement applies is ambiguous this ambiguity needs to be
recorded as it may be resolvable later (Figure 3-64).
Figure 3-64 S-methyl ethanthioate (left) and O-methyl ethanthioate (right)
3.2.9.9b Prefix Functional Replacement
Example
1-chloro-2,4-diimidotricarbonic acid
General Syntax (locant? multiplier? prefix)+ group
The prefixes in Table 3-11 may be employed for functional replacement. It can be quickly seen
that many are identical to those employed as substituents in substitutive nomenclature. Hence, to
avoid ambiguity as far as possible, prefix functional replacement is typically only recommended for
certain non-carboxylic acids.
Prefix OPSIN classification
amido dedicatedFunctionalReplacementPrefix
azido halideOrPseudoHalide
bromo halideOrPseudoHalide
chloro halideOrPseudoHalide
cyanato halideOrPseudoHalide
cyano halideOrPseudoHalide
dithioperoxy Currently Unsupported
diselenoperoxy Currently Unsupported
ditelluroperoxy Currently Unsupported
fluoro halideOrPseudoHalide
hydrazido dedicatedFunctionalReplacementPrefix
hydrazono hydrazono
imido dedicatedFunctionalReplacementPrefix
iodo halideOrPseudoHalide
isocyanato halideOrPseudoHalide
isocyano halideOrPseudoHalide
isothiocyanato halideOrPseudoHalide
isoselenocyanato halideOrPseudoHalide
isotellurocyanato halideOrPseudoHalide
nitrido dedicatedFunctionalReplacementPrefix
peroxy peroxy
seleno chalcogen
telluro chalcogen
thio chalcogen
thiocyanato Currently Unsupported
selenocyanato Currently Unsupported
tellurocyanato Currently Unsupported
Table 3-11 Prefixes for functional replacement listed by the IUPAC. Each of these prefixes corresponds
to and has the same meaning as the infixes described in the previous section.
OPSIN implements this nomenclature by first classifying whether a substituent may be a
functional replacement prefix and if it is classifies it as one of the following: chalcogen,
halideOrPseudoHalide, dedicatedFunctionalReplacementPrefix, hydrazono or peroxy.
OPSIN restricts chalcogen replacement to non-carboxylic acids, the suffixes of trivial carboxylic
acid stems and to aldehyde suffixes. In the case that a group has no oxygen within applicable suffixes
oxygen atoms within the group may be replaced. Allowing replacement on oxygen atoms within
groups allows for support of chalcogen analogues of trivial names (Figure 3-65). Explicitly adding
chalcogen analogues of trivial names to OPSIN’s vocabulary is generally not preferred as, besides the
extra effort in generating such entries, two parses will be produced: one with the chalcogen
replacement prefix as a separate token and one in which it is part of the trivial name. For chalcogen
analogues of rings used as components in fused ring nomenclature, including the chalcogen analogue
in the program’s vocabulary is necessary as the grammar does not allow a prefix to precede a ring
within a fused ring system.
Figure 3-65 Phenol (left) and thiophenol (right). Note that ‘thiophenol’ is not in OPSIN’s vocabulary
As with infix replacement, chalcogen replacement may be ambiguous and this ambiguity is
noted as it may be resolvable later.
Peroxy replacement is treated in the same way as chalcogen replacement except that only
functional oxygen atoms and etheric oxygens are considered. OPSIN prefers etheric oxygen atoms to
functional oxygen atoms allowing the intended interpretation for names like peroxydicarbonic acid
to be generated (Figure 3-66).
Figure 3-66 peroxydicarbonic acid
OPSIN only supports the use of dedicatedFunctionalReplacementPrefixes on non-carboxylic
acids and enforces that they must be used for functional replacement.
Hydrazono and halideOrPseudoHalide functional replacement terms are also restricted to non-
carboxylic acids with the additional restriction that insufficient substitutable hydrogen should be
present on the atoms indicated hence precluding the substitution interpretation (Figure 3-67).
Figure 3-67 chlorophosphoric acid (functional replacement) or chlorophosphonic acid (substitution as
the phosphorus in phosphonic acid has a hydrogen atom) or phosphorochloridic acid (infix functional
replacement)
Care is taken when performing both infix and prefix functional replacement to have the
correct charges on the modified section of the molecule and to correctly annotate which atoms are
“functional atoms” (Figure 3-68).
Figure 3-68 Acetate (left) and peroxyacetate (right). Note that the charge on the replacement
functionality is dependent on the original functionality and that the functional atom has effectively moved.
3.2.9.10 Hantzsch-Widman nomenclature
Example
1,3,5-triazine
General Syntax locant? (multiplier? heteroatom)+ HWstem
Hantzsch-Widman nomenclature116 is used to describe the structure of heteromonocycles i.e.
individual rings containing at least one heteroatom. The system initially applied only to nitrogen,
oxygen, sulfur and selenium but through various recommendations has been extended such that it
now can be applied to all p-block elements except the noble gases. Traditionally mercury has also
been included in the system although the 2004 provisional recommendations do not recommend its
use78. A Hantzsch-Widman name is formed of one or more prefixes (Table 3-12), describing the
heteroatoms in the ring, followed by a stem (Table 3-13) describing the size of the ring and whether
or not it is unsaturated e.g. ‘1,3-oxazole’. Prefixes are preceded by locants indicating the position of
the heteroatoms.
Element Prefix*
fluorine fluora
chlorine chlora
bromine broma
iodine ioda
oxygen oxa
sulfur thia
selenium selena
tellurium tellura
nitrogen aza
phosphorus phospha
arsenic arsa
antimony stiba
bismuth bisma
silicon sila
germanium germa
tin stanna
lead plumba
boron bora
aluminium aluma
gallium galla
indium indiga
thallium thalla
Table 3-12 Hantzsch-Widman system prefixes. *Note that the final ‘a’ is elided prior to a vowel.
Ring Size   Unsaturated                           Saturated
3           irene; irine (nitrogen containing)    irane; iridine (nitrogen containing)
4           ete                                   etane; etidine (nitrogen containing)
5           ole                                   olane; olidine (nitrogen containing)
6           ine/inine                             ane/inane
7           epine                                 epane
8           ocine                                 ocane
9           onine                                 onane
10          ecine                                 ecane
Table 3-13 Hantzsch-Widman system stems. Note that the final ‘e’ on the stems is optional.
The choice of stem for six-membered rings is dependent on the heteroatoms present in the
ring and is required to avoid conflicts between Hantzsch-Widman rings and heteroatom hydrides or
heteroatom chains (cf. Section 3.2.7.3).
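The way a Hantzsch-Widman name decomposes into locants, heteroatom prefixes and a stem can be sketched with a small Python parser. It is a toy illustration under several assumptions: only three heteroatom prefixes and three multipliers are included, the stems must carry their final ‘e’, elided prefixes are accepted regardless of the following letter, and all function and table names are hypothetical. It makes no attempt to block the heteroatom hydride conflicts mentioned above, so it would wrongly accept ‘azane’ as a ring.

```python
HETEROATOMS = {"oxa": "O", "thia": "S", "aza": "N"}   # subset of Table 3-12
MULTIPLIERS = {"di": 2, "tri": 3, "tetra": 4}
# stem -> (ring size, unsaturated?), following Table 3-13
STEMS = {
    "irene": (3, True), "irine": (3, True), "irane": (3, False),
    "iridine": (3, False), "ete": (4, True), "etane": (4, False),
    "etidine": (4, False), "ole": (5, True), "olane": (5, False),
    "olidine": (5, False), "ine": (6, True), "inine": (6, True),
    "ane": (6, False), "inane": (6, False), "epine": (7, True),
    "epane": (7, False), "ocine": (8, True), "ocane": (8, False),
    "onine": (9, True), "onane": (9, False), "ecine": (10, True),
    "ecane": (10, False),
}


def _split_prefixes(text):
    """Return element symbols for a run of prefixes, or None on failure."""
    if text == "":
        return []
    for mult_name, mult in [("", 1)] + list(MULTIPLIERS.items()):
        if not text.startswith(mult_name):
            continue
        rest = text[len(mult_name):]
        for prefix, symbol in HETEROATOMS.items():
            for form in (prefix, prefix[:-1]):  # full form, then elided 'a'
                if rest.startswith(form):
                    tail = _split_prefixes(rest[len(form):])
                    if tail is not None:
                        return [symbol] * mult + tail
    return None


def parse_hantzsch_widman(name):
    locants, _, body = name.rpartition("-")
    for stem in sorted(STEMS, key=len, reverse=True):  # longest stem first
        if body.endswith(stem):
            atoms = _split_prefixes(body[:-len(stem)])
            if atoms:
                size, unsaturated = STEMS[stem]
                return {"locants": locants.split(",") if locants else [],
                        "heteroatoms": atoms, "ring_size": size,
                        "unsaturated": unsaturated}
    raise ValueError(f"not recognised as a Hantzsch-Widman name: {name}")
```

For example, `parse_hantzsch_widman("1,3,5-triazine")` yields three nitrogen atoms at locants 1, 3 and 5 in an unsaturated six-membered ring.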
OPSIN has complete support for Hantzsch-Widman nomenclature including the now
deprecated names for rings with one double bond 76 (Rule B-1.2) (Figure 3-69) and deprecated
heteroatom prefixes.
Figure 3-69 2-oxazoline; Note that the 2 refers to the position of the double bond. The position of the
oxygen and nitrogen are 1,3 by widely accepted convention.
To avoid the previously mentioned ambiguity between heteroatom hydrides and Hantzsch-
Widman nomenclature, OPSIN has explicit categories in its grammar for heteroatom prefixes that
may be used with the ‘ine’ and ‘ane’ stems (Figure 3-70).
Figure 3-70 The correct interpretation of azane (left) and an incorrect Hantzsch-Widman interpretation
(right). OPSIN only generates one parse for ‘azane’ which corresponds to the former.
The stem is used to generate a ring with appropriate saturation onto which heteroatoms are
substituted. Exceptions are made to support certain ring systems having certain locants by
convention e.g. ‘oxazole’ = ‘1,3-oxazole’. Additionally certain systems which will rarely mean the
Hantzsch-Widman ring are blocked e.g. ‘thiol’ or ‘seleninic acid’. The complete ring system may be
subsequently used as a component in other nomenclature such as fused ring nomenclature (Section
3.2.9.12) or polycyclic spiro nomenclature (Section 3.2.9.15).
OPSIN handles the names of fused ring components recommended in FR-2.2.1(c)115, for
heteromonocycles of size greater than 10 atoms, as an extension of the Hantzsch-Widman system
(Figure 3-71).
Figure 3-71 [1,4,9,12]oxatriazacyclopentadecine
3.2.9.11 Lambda convention
IUPAC nomenclature has a concept of standard bonding number (Table 3-14) where bonding
number is defined by the sum of the bond orders of all bonds to an atom. Typically an atom will have
this standard bonding number hence no attempt needs to be made to specify it. If the atom is not in
its standard bonding number, and does not implicitly have a non-standard bonding number (e.g.
phosphorane is defined as a phosphorus atom with bonding number 5), the Lambda Convention117
should be employed.
Standard bonding number Elements
3 B Al Ga In Tl
4 C Si Ge Sn Pb
3 N P As Sb Bi
2 O S Se Te Po
1 F Cl Br I At
Table 3-14 Standard bonding numbers
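Table 3-14 lends itself to a simple lookup, sketched here in Python. The function name is hypothetical, and the sketch ignores groups such as phosphorane whose names already imply a non-standard bonding number.

```python
# Standard bonding numbers, transcribed from Table 3-14.
STANDARD_BONDING_NUMBER = {}
for number, symbols in [(3, "B Al Ga In Tl"), (4, "C Si Ge Sn Pb"),
                        (3, "N P As Sb Bi"), (2, "O S Se Te Po"),
                        (1, "F Cl Br I At")]:
    for symbol in symbols.split():
        STANDARD_BONDING_NUMBER[symbol] = number


def lambda_descriptor(element, bonding_number):
    """Return e.g. 'λ6' when the bonding number is non-standard, else None."""
    if STANDARD_BONDING_NUMBER[element] == bonding_number:
        return None
    return f"λ{bonding_number}"
```

A sulfur atom with bonding number 6 therefore requires the descriptor λ6, whereas sulfur with its standard bonding number of 2 requires none.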
The Lambda Convention may be applied to heteroatoms used in skeletal
replacement/Hantzsch-Widman nomenclature or directly to a group (Figure 3-72). OPSIN fully
supports the Lambda Convention. In OPSIN’s implementation care must be taken to distinguish
between the case where the locant before the λ is needed to locate a heteroatom and cases where
the locant is purely for use by the Lambda Convention.
Figure 3-72 Examples of the Lambda Convention: 2λ⁶-trisulfane (left) and 1,6,6aλ⁴-trithiapentalene (right)
3.2.9.12 Fused Ring nomenclature
Fused ring nomenclature115 is used to name polycyclic ring systems especially ones containing
unsaturated rings. All rings in the ring system will be ortho-fused (i.e. have a bond in common) to at
least one other ring.
3.2.9.12a Fused Ring System Construction
Example
furo[3,2-b]thieno[2,3-e]pyridine
General Syntax (fusionComponent fusionBracket)+ parentComponent
Fused ring nomenclature names are created using the names of trivially named fused ring
systems, ring systems named as in Section 3.2.7.5c and individual ring names. These will henceforth be
referred to as components with the rightmost component being the parent component. Atoms
shared by multiple components are considered to be part of both components for the purpose of
name construction (Figure 3-73). Cyclised alkanes when used as fusion components are treated as
implicitly unsaturated in this nomenclature.
Figure 3-73 pyrano[2',3':4,5]cyclohepta[1,2-g]quinoline showing the components
The fusion brackets in the name describe how each component is connected to the next
component. All references except those to the parent component employ numbers to refer to atoms. The
parent component is instead treated as if it had no locants and its bonds are referred to using
letters (Figure 3-74). The letters typically are in the same order as the original numeric locants,
except in cases where the numeric locants of the peripheral atoms are not continuous (e.g. acridine).
In these cases the letters use the order the atoms would be in if the system were systematically
numbered. It should be noted that purine is an exception to this.
Figure 3-74 pyrano[2',3':4,5]cyclohepta[1,2-g]quinoline showing the internal ring numbering. This
numbering is unrelated to the final numbering of the complete system
To reconstruct the fused ring system, the name is read from right to left and the components
are successively fused. Components are primed, double primed etc. depending on how many
components removed from the parent component they are.
Components may also be multiplied (Figure 3-75). Multiplied components are primed but may
not be fused onto hence avoiding the introduction of numbering ambiguity.
[Figure 3-73 component labels: pyran, quinoline, cyclohepta-1,3,5-triene]
Figure 3-75 difuro[3,2-b:3',4'-e]pyridine
A fusion bracket specifies the atoms that are to be fused in each fusion (except for fusion
brackets to the parent compound which specify bonds). However there are many cases in which
some locants may be omitted or the entire fusion bracket omitted. In these cases OPSIN internally
generates a fusion bracket with the missing locants added before proceeding as normal (e.g. Figure
3-76 and Figure 3-77).
Figure 3-76 pyrazino[g]quinoxaline becomes pyrazino[2,3-g]quinoxaline
Figure 3-77 1H-naphtho[2,3][1,2,3]triazole becomes 1H-naphtho[2,3-d][1,2,3]triazole
Bridgehead atoms are typically omitted from fusion brackets (Figure 3-78). To get a complete
list of atoms to be used in a fusion OPSIN iterates over the component's atoms from the atom
indicated by the starting locant to the atom indicated by the ending locant. This list will include all
atoms including bridgeheads.
Figure 3-78 naphtho[2,1,8-def]quinoline (interpreted as if it were written naphtho[2,1,8a,8-
def]quinoline)
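The expansion of a fusion bracket to include bridgehead atoms can be sketched in Python. The function name and the list-of-locants representation of the ring periphery are hypothetical. The direction of travel between two consecutive locants is taken here to be the shorter way around the periphery, which is an assumption of this sketch rather than a statement of OPSIN's behaviour.

```python
def expand_fusion_locants(periphery, locants):
    """Fill in atoms skipped between consecutive fusion locants.

    periphery: locants of the component's peripheral atoms in ring order
    locants:   the (possibly abbreviated) locants from the fusion bracket
    """
    size = len(periphery)
    index = {locant: i for i, locant in enumerate(periphery)}
    expanded = [locants[0]]
    for start, end in zip(locants, locants[1:]):
        i, j = index[start], index[end]
        forward = (j - i) % size
        backward = (i - j) % size
        step = 1 if forward <= backward else -1  # assumed: shorter path
        while i != j:
            i = (i + step) % size
            expanded.append(periphery[i])
    return expanded


naphthalene = ["1", "2", "3", "4", "4a", "5", "6", "7", "8", "8a"]
naphtho_locants = expand_fusion_locants(naphthalene, ["2", "1", "8"])
```

For the naphtho component of Figure 3-78 this gives 2, 1, 8a, 8, i.e. the bridgehead 8a is restored.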
This procedure for handling missing locants appears to be similar to that described by
Matsuura94 with the exception that OPSIN never employs numeric locants to describe atoms to use
on the parent component.
With the exception of benzo fusions (Section 3.2.9.12b) and multi-parent systems (Section
3.2.9.12c), OPSIN does not support the inclusion of fused ring systems created by fused ring
nomenclature as fusion components. All other aspects of fused ring system construction are
believed to be supported.
3.2.9.12b Benzo fusions
Example
3-benzoxepine
General Syntax locant benz(o) parentComponent
As a special case heterobicyclic systems containing a benzene ring are named using a different
syntax. Benzene is fused to the parent component and then the locant before the ‘benzo’ is used to
assign the position of heteroatoms in the complete system. Although not made explicit in the
nomenclature recommendations, for implementation purposes the heteroatoms must be
considered to be repositioned, as their absence would mean that numbering does not necessarily
start on the parent component.
Such systems may be used as fusion components and hence are processed prior to other
fusion nomenclature.
3.2.9.12c Multi-parent systems
Example
benzo[1,2-f:4,5-g']diindole
General Syntax (fusionComponent fusionBracket multiplier)+ parentComponent
When there are multiple candidates for the parent component and all candidates are fused to
the same component, this part of the ring system may be named as a multi-parent system. For these
systems, a multiplier is used to indicate that a component is replicated and locants are used to
indicate where these components are fused.
As long as the pairs of inter-parent components are identical, the system can also be used in
cases involving more than one inter-parent component (Figure 3-79).
Figure 3-79 anthra[2'',1'',9'':4,5,6;6'',5'',10'':4',5',6']diisoquinolino[2,1-a:2',3'-a']diperimidine
Multi-parent systems may have further fusion performed on them and hence are processed
after benzo fusions, but prior to other fusion nomenclature. The whole multi-parent system can be
thought of as being the parent component for further fusion.
3.2.9.12d Idealised grid construction
Once the fused ring system has been constructed it is numbered by OPSIN. This is achieved by
determining the preferred layout of the ring system on an idealised 2D grid, determining the
preferred orientation and then determining the preferred peripheral numbering. The first of these
steps is significantly complicated by not all sizes of rings tessellating. The system works perfectly for
six-membered rings, but for all other ring sizes manipulation of ring shape or multiple orientations
are possible (Figure 3-80).
Figure 3-80 Ring shapes considered by OPSIN. Ring shapes that are recommended by Fused Ring and
Bridged Fused Ring Nomenclature (1998) 115 but not by the 2004 draft recommendations 78 are not considered
by OPSIN.
First, OPSIN determines the rings that comprise the fused ring system. This is achieved by
calculating the smallest set of smallest rings (SSSR) from the complete ring system. These rings are
then associated with their neighbouring rings.
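Associating rings with their neighbours, and choosing a starting ring, can be sketched in Python. The rings are assumed to have been found already (the SSSR calculation itself is not shown) and are given as ordered lists of atom identifiers. All names are hypothetical.

```python
from itertools import combinations


def ring_bonds(ring):
    """Bonds of a ring given as an ordered list of atom ids."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}


def neighbouring_rings(rings):
    """Map each ring name to the names of rings sharing a bond with it."""
    bonds = {name: ring_bonds(atoms) for name, atoms in rings.items()}
    neighbours = {name: set() for name in rings}
    for a, b in combinations(rings, 2):
        if bonds[a] & bonds[b]:  # ortho-fused: at least one bond in common
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


# Three linearly fused six-membered rings.
rings = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [5, 6, 7, 8, 9, 10],
    "C": [9, 10, 11, 12, 13, 14],
}
neighbours = neighbouring_rings(rings)
start_ring = min(neighbours, key=lambda name: len(neighbours[name]))
```

Here ring B neighbours both A and C, so a terminal ring (A in this sketch) has the minimum number of neighbours and is a suitable starting point.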
Starting from a ring with the minimum number of neighbouring rings, ring connection tables
are created. Multiple ring connection tables may be created as multiple orientations of the ring
shapes may need to be considered for 5- and 7-membered rings (Figure 3-81). OPSIN considers the
minimum possible number of orientations needed to enumerate all possible grid layouts. For
example, when a ring is only involved in one fusion, only one orientation needs to be considered as
the different orientations only affect the calculated position of other rings relative to the starting
fusion bond. Additionally, OPSIN does not consider orientations of 5-membered rings involving
fusion to the elongated bond if other orientations are possible. An example of a complete ring
connection table may be seen in Figure 3-82.
Figure 3-81 Orientations potentially considered for 5- and 7-membered rings
Ring         Ring shape           Direction   Neighbouring Ring
benzene      standard             1           pyridine
pyridine     standard             0           cyclopenta
cyclopenta   enterFromLeftHouse   4           pyridine
pyridine     standard             -3          benzene
Figure 3-82 Ring connection table for cyclopenta[c]isoquinoline. The depiction shows the orientation
described by this table. The depiction often, such as in this case, does not represent the final orientation of the
system. Numbers are used to indicate the directions from a ring to a neighbouring ring on the idealised grid.
An example of a case where multiple orientations of a 5-membered ring need to be
considered to evaluate all possible grid layouts is shown in Figure 3-83.
Figure 3-83 Layouts for benzo[b]cycloocta[jk]fluorene ignoring rotations and reflections. These layouts
are indistinguishable until peripheral numbering is considered (right layout preferred).
Next, OPSIN attempts to eliminate those connection tables with more distorted rings.
Distorted rings are recognised by the direction from one ring to another not being the opposite of
the direction from the other ring back to the first ring. Generation of connection tables for ring
systems that may only be drawn with distorted rings is a known limitation in OPSIN’s
implementation (Figure 3-84).
Figure 3-84 Distorted ring possibilities for cyclobuta[def]phenanthrene. OPSIN only currently considers
the left one. The right one is the preferred layout for numbering.
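The distortion test described above can be illustrated with a short sketch. It assumes, consistent with the table in Figure 3-82, that directions are integers in the range -3 to 4 and that opposite directions differ by 4 (with -4 and 4 being the same direction); the function names are hypothetical and are not taken from OPSIN.

```python
def opposite(direction):
    """Return the reverse of a grid direction in the range -3..4."""
    return direction + 4 if direction <= 0 else direction - 4

def is_distorted(dir_a_to_b, dir_b_to_a):
    """A ring pair is distorted if the two directions are not opposites."""
    return opposite(dir_a_to_b) != dir_b_to_a

# Direction pairs read from the ring connection table in Figure 3-82.
benzene_pyridine = (1, -3)
pyridine_cyclopenta = (0, 4)
```

Both pairs from Figure 3-82 are undistorted under this test, as the direction back from each neighbouring ring is the reverse of the direction towards it.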
For rings of sizes greater than 8, OPSIN supports the special case where all such rings are only
fused to one other ring (Figure 3-85) but does not support the general case as this would require
consideration of multiple ring orientations. As the IUPAC recommendations115 acknowledge
potential problems with the naming of systems containing such rings, and as these systems are also
quite rare, this is not considered a significant limitation to OPSIN’s implementation.
Figure 3-85 1-methyl-5H-cyclotrideca[b]naphthalene
3.2.9.12e Grid orientation
At this stage there may still be multiple ring connection tables corresponding to different grid
layouts. Typically, the rules for orientation of the ring system may be used to rule out all grid layouts
bar one.
The rules are as follows:
Maximum number of rings in a horizontal row
This is implemented by iterating over a ring connection table and counting the number of
rings where the direction between the rings is identical. The directions which yield the largest
number of rings in a line are returned. When multiple ring connection tables are considered, only
the combinations of ring connection table and directions that meet this criterion are considered
further.
Figure 3-86 Example of ring counts for different directions
For each applicable ring connection table/direction combination, a grid layout is generated
with the given direction defining the horizontal row, i.e. a rotation may be required as compared to
the starting ring connection table. If the grid layout has overlapping atoms, that grid layout is
rejected. This is a minor limitation in OPSIN’s implementation, as rejection is only valid when a
layout without overlapping atoms actually exists.
Maximum number of rings in upper right quadrant
Minimum number of rings in lower left quadrant
Maximum number of rings above the horizontal row
To check a grid layout against these criteria the grid must be divided up into quadrants. The
horizontal divider is defined by the horizontal row and the vertical divider by the mid-point of the
horizontal row. The horizontal row is the row with maximum rings in a line. A given grid layout may
have multiple rows of rings that meet this requirement hence the division of the system into
quadrants must be performed using each possible horizontal row (Figure 3-87).
Figure 3-87 Lines showing the quadrants for the two possible horizontal row candidates of this grid
layout. The right interpretation is preferred as more rings are in the upper right quadrant.
Counting the occupancy of quadrants is relatively simple with rings contributing a ¼ if the
origin is located within a ring, ½ if an axis passes through a ring or otherwise 1 to a particular
quadrant. With the quadrant occupancies calculated one can calculate which combinations of grid
layout, horizontal row and quadrant give the preferred upper right quadrant.
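The occupancy calculation can be sketched as follows. The representation of a ring by the (x, y) position of its centre relative to the origin (the mid-point of the horizontal row), and the function name, are assumptions made for this illustration; only the signs of the coordinates matter here.

```python
from fractions import Fraction

def quadrant_occupancy(ring_positions):
    """Sum the contribution of each ring to each quadrant.

    `ring_positions` holds (x, y) ring centres relative to the origin
    (hypothetical representation). Keys are UR, UL, LL and LR.
    """
    occupancy = {q: Fraction(0) for q in ("UR", "UL", "LL", "LR")}
    for x, y in ring_positions:
        # A zero coordinate means the ring straddles the corresponding axis.
        horizontal = ["R"] if x > 0 else ["L"] if x < 0 else ["R", "L"]
        vertical = ["U"] if y > 0 else ["L"] if y < 0 else ["U", "L"]
        # 1 for a single quadrant, 1/2 across one axis, 1/4 at the origin.
        share = Fraction(1, len(horizontal) * len(vertical))
        for v in vertical:
            for h in horizontal:
                occupancy[v + h] += share
    return occupancy

# A horizontal row of three rings with a fourth ring above and to the right.
layout = [(-1, 0), (0, 0), (1, 0), (1, 1)]
occupancy = quadrant_occupancy(layout)
```

For this layout the upper right quadrant has the highest occupancy (¼ + ½ + 1), so no flipping would be needed to place the preferred quadrant in the upper right.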
For each of these combinations, the grid layout is then flipped appropriately such as to place
the preferred quadrant in the upper right. The peripheral atoms are then evaluated starting from the
uppermost, rightmost ring. These criteria apply entirely to the idealised grid layout and are not
affected by how the system would look if drawn. If the candidate ring has no non-fusion atoms the
next ring in a clockwise direction is used. The peripheral atoms of the system are then visited
starting from the most counter-clockwise atom in this ring and proceeding in a clockwise manner around
the periphery of the ring system.
Especially for simple fused ring systems, multiple numberings (Figure 3-88) are
possible and hence the criteria to determine the preferred peripheral numbering must be applied.
Figure 3-88 Possible orientations of thiopyrano[3,2-b]pyridine without considering peripheral
numbering rules. Numbering in all cases would start at the top of the right ring.
3.2.9.12f Peripheral numbering
Preferred peripheral numbering is determined by comparing the lists of possible periphery
atoms. Rules include prioritising the list with the earliest heteroatom, the highest priority
heteroatom, the earliest fusion carbon atom etc.
OPSIN then iterates over the preferred list numbering the atoms. Numbering increases
monotonically except when carbon bridge heads are encountered which are instead labelled with
the current number followed by an ascending letter (Figure 3-89).
Figure 3-89 7H-difuro[2,3-e:2',3'-g]indole
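The labelling scheme can be illustrated with a minimal sketch. It assumes that the preferred list of peripheral atoms is already available in numbering order and starts at a non-fusion atom; the tuple representation and the function name are hypothetical.

```python
from string import ascii_lowercase

def number_periphery(atoms):
    """Assign locants to peripheral atoms supplied in numbering order.

    `atoms` is a list of (element, is_fusion_atom) tuples (hypothetical
    representation). Fusion carbons take the current number plus a letter.
    """
    locants = []
    number = 0
    letter_index = 0
    for element, is_fusion_atom in atoms:
        if element == "C" and is_fusion_atom:
            locants.append(f"{number}{ascii_lowercase[letter_index]}")
            letter_index += 1
        else:
            number += 1
            letter_index = 0  # letters restart after each new number
            locants.append(str(number))
    return locants

ch = ("C", False)
naphthalene = [ch] * 4 + [("C", True)] + [ch] * 4 + [("C", True)]
indolizine = [ch] * 3 + [("N", True)] + [ch] * 4 + [("C", True)]
```

Applied to naphthalene this yields 1, 2, 3, 4, 4a, 5, 6, 7, 8, 8a, whereas in indolizine the fusion nitrogen is a heteroatom and so receives the ordinary locant 4.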
OPSIN does not implement all of the rules for numbering interior atoms (i.e. those not on the
periphery) of fused ring systems meaning incorrect numbering may be produced for these atoms in
some cases.
3.2.9.13 Bridges for fused ring systems
Example
4a,8a-propanoquinoline
General Syntax locant? bridge bridgeFormingO fusedring
Bridges may be used, in IUPAC nomenclature, on trivially named fused ring systems or those
systematically named fused ring systems that could not be named if the bridge were considered part
of the fused ring system. The bridge is a non-detachable feature and should be placed adjacent to
the fused ring system. A bridge may be a divalent alkyl group, a heteroatom equivalent, a divalent
trivial ring or even a mixture of these.
OPSIN only supports the case where a bridge is an alkane or a divalent oxygen atom (or
chalcogen equivalent) (Figure 3-90). Practically, this did not appear to be a significant source of
failure during evaluation. However, adding further support for bridging nomenclature should not be
technically difficult if required in the future.
Figure 3-90 6,12-epoxy-5,13-methanobenzo[4,5]cyclohepta[1,2-f]isochromene
3.2.9.14 Ring assemblies
Example
2,2':6',2''-terpyridyl
General Syntax locant? multiplier ring radicalSuffix?
A ring assembly76(Rules A-51 – A-56) is defined as a system comprising two or more cyclic systems
joined together by single or double bonds such that the number of bonds between the rings is one
less than the number of rings. A cyclic system could be any ring or a fused ring system.
For the case when the rings involved are constitutionally identical, IUPAC recommends
specific nomenclature that clearly indicates this relationship between the rings. Two slightly
different methods have been recommended for naming such systems: one employs additive
operations (no atoms added or removed when forming a bond) and the other employs conjunctive
operations (one hydrogen removed from each group when forming a bond) (Figure 3-91).
Figure 3-91 3,3'-bipyridine (conjunctive) or 3,3'-bipyridyl (additive)
The naming methods involve using the name of the ring system (conjunctive) or the name of
the radical of the ring system (additive) preceded by a Latin multiplier and, where necessary, a
locant to indicate which atoms in the rings are connected. As an exception, benzene rings always
use the radical name ‘phenyl’, e.g. ‘biphenyl’. An ortho/meta/para locant may be used instead of a
normal locant for six-membered rings with bonds that are all in the same relative positions (Figure
3-92).
Figure 3-92 p-quaterphenyl
OPSIN has support for all common ring assembly nomenclature. It does not support the use of
delta convention to specify a double bond between rings or the new locant system employing
superscripts introduced in the provisional recommendations78 (Figure 3-93). As superscripts are
represented in text in a multitude of ways, rather than as actual superscripted numbers,
it is hoped that this recommendation will not be included in the final recommendations.
Figure 3-93 1¹,2¹:2²,3¹-tercyclopropane (new locant system; not OPSIN interpretable);
1,1′:2′,1′′-tercyclopropane (current locant system; OPSIN interpretable)
Ring assemblies are handled by first converting an ortho/meta/para locant (if present) into the
explicit locant form normally used. Non-detachable features are then resolved onto the ring system
before the ring system is duplicated the appropriate number of times. Care is taken to distinguish
between features that apply to the individual ring or to the ring assemblage as the latter should not
be processed at this stage. The cloned ring systems are then bonded via the atoms indicated by the
supplied locants, or by heuristically chosen atoms if no locants are provided.
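The first step, converting an ortho/meta/para locant into explicit locants, can be sketched as follows for six-membered rings. The output format follows the conventional primed-locant style (cf. the second name in Figure 3-93); the exact internal form used by OPSIN is not specified here, and the function name is hypothetical.

```python
PRIME = "'"
OFFSETS = {"o": 2, "m": 3, "p": 4}

def expand_omp_locant(prefix, ring_count):
    """Turn an o/m/p prefix into explicit ring assembly locants."""
    offset = OFFSETS[prefix]
    bonds = []
    for i in range(ring_count - 1):
        # The first ring bonds from position 1; each later ring bonds onward
        # from the position fixed by the ortho/meta/para relationship.
        start = 1 if i == 0 else offset
        bonds.append(f"{start}{PRIME * i},1{PRIME * (i + 1)}")
    return ":".join(bonds)

para_quaterphenyl = expand_omp_locant("p", 4)
```

Under these assumptions p-quaterphenyl (Figure 3-92) corresponds to the locants 1,1':4',1'':4'',1'''.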
3.2.9.15 Polycyclic spiro nomenclature
Example
spiro[piperidine-4,9'-xanthene]
General Syntax multiplier? spiro openBracket ring (locant ring)+ closeBracket
To name spiro systems made from one or more polycyclic rings, nomenclature employing the
names of the constituent ring systems is used114(Rules Sp-2 – Sp-6). The general nomenclature for these
systems is to state the number of spiro centres followed by a bracketed section listing the
constituent ring systems and the locants of the atoms on them that are involved in spiro fusions
(Figure 3-94).
Figure 3-94 2"H,4"H-trispiro[cyclohexane-1,1'-cyclopentane-3',3"-cyclopenta[b]pyran-6",1'''-
cyclohexane]
When the ring systems involved are identical a contracted form is employed to avoid
repetition (Figure 3-95).
Figure 3-95 Examples of spiro systems with repeated ring systems: 3,3'-spirobi[indole] (left) and
3,3':6',6"-dispiroter[bicyclo[3.1.0]hexane] (right)
OPSIN supports the majority of common polycyclic spiro nomenclature, but its support is not
complete. OPSIN currently lacks support for systems formed of a mixture of identical and
non-identical rings in which the identical rings are mentioned using multipliers e.g.
trispiro[1,3,5-trithiane-2,2':4,2":6,2'''-tris(bicyclo[2.2.1]heptane)]. Another limitation is that
locants on ring systems beyond the first should be in square brackets. As OPSIN uses the same
expression for rings inside and outside spiro systems, this behaviour is supported only in cases
where OPSIN allows locants to be enclosed in square brackets outside of a spiro system; for example,
3H-spiro[1-benzofuran-2,1'-cyclohex[2]ene] is unsupported.
OPSIN fully supports an older method of naming spiro systems76(Rule A-42) which instead has the
term ‘spiro’ and locants indicating the atoms involved in the spiro fusion between the ring systems
involved (Figure 3-96).
Figure 3-96 2H-indene-2-spiro-1'-cyclopentane
3.2.9.16 ᴅ/ʟ stereochemistry
ᴅ/ʟ stereochemistry is used to describe how the stereochemistry of a compound compares to
the stereochemistry of the two enantiomers of glyceraldehyde; ᴅ-glyceraldehyde and ʟ-
glyceraldehyde (Figure 3-97).
Figure 3-97 ᴅ-glyceraldehyde (left) and ʟ-glyceraldehyde (right)
As one chiral form of both monosaccharides and amino acids is significantly more prevalent in
nature, it may be assumed that, when unspecified, this is the form that is referred to (ᴅ for
monosaccharides and ʟ for amino acids). OPSIN supports this convention by storing such compounds
with their stereochemistry defined as in their natural form. ᴅ/ʟ stereochemistry can then be simply
treated as a modification of this stereochemistry e.g. ᴅ- indicates that the stereochemistry of an
amino acid should be inverted whilst ʟ- indicates it may be left as is.
Due to this implementation, ᴅ/ʟ stereochemistry’s rare use in general organic nomenclature
e.g. ᴅ-α-Amino-β-phenylpropionic acid is unsupported.
3.2.9.17 Amino acid nomenclature
Example
ʟ-leucinamide
General Syntax ᴅ/ʟ? trivialAminoAcidName suffix?
Amino acid nomenclature118 provides succinct names for amino acids, amino acid derivatives
and polymeric amino acids in peptides. The nomenclature essentially consists of the trivial names for
the common amino acids in conjunction with suffix rules that differ slightly from those of general
organic nomenclature.
As compared to other carboxylic acids, amino acid nomenclature is only codified for a subset
of the suffixes supported in general organic nomenclature. A few quirks that needed to be taken into
account when implementing suffix rules for amino acids were:
‘ol’ and ‘al’ are valid suffixes (e.g. glycinol). It should also be noted that on di-acids
these suffixes must be locanted.
The absence of a suffix is the equivalent of the ‘ic acid’ suffix
‘yl’ means acyl i.e. what ‘oyl’ often means
Locanted ‘yl’ means add a radical
‘o’ may be used to add a radical to the amino nitrogen e.g. glycino
When constructing a peptide the names of the acyl groups of amino acids may be
concatenated (Figure 3-98). As brackets are not required, to assure the correct interpretation OPSIN
adds implicit brackets (Figure 3-99).
Figure 3-98 threonylglycylglycylglycine
Figure 3-99 ʟ-arginyl-O-phosphono-ʟ-seryl-ʟ-alanyl-ʟ-proline, interpreted as ((ʟ-arginyl-O-phosphono-ʟ-
seryl)-ʟ-alanyl)-ʟ-proline
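The addition of implicit brackets can be illustrated with a small sketch. It assumes the peptide name has already been split into its acyl groups (each with any prefixes that belong to it) followed by the final amino acid; this splitting, the function name and the use of plain 'L-' in place of small capitals are assumptions made for the example.

```python
def add_implicit_brackets(parts):
    """Left-nest acyl group names that precede a final amino acid name."""
    *acyl_groups, parent = parts
    if not acyl_groups:
        return parent
    nested = acyl_groups[0]
    for group in acyl_groups[1:]:
        # Each further acyl group substitutes the bracketed chain so far.
        nested = f"({nested}-{group})"
    return f"{nested}-{parent}"

parts = ["L-arginyl", "O-phosphono-L-seryl", "L-alanyl", "L-proline"]
bracketed = add_implicit_brackets(parts)
```

For the name in Figure 3-99 this reproduces the bracketing shown in the caption, so that each acyl group is interpreted as substituting the residue that immediately follows it.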
OPSIN supports the majority of common amino acid nomenclature. OPSIN does not support
the use of ᴅ/ʟ on achiral amino acids that are made chiral by substitution.
3.2.9.18 Carbohydrate nomenclature
Example
α-ᴅ-glucopyranose
General Syntax α/β? ᴅ/ʟ? carbohydrateStem suffix+
Carbohydrate nomenclature119 may be employed to more succinctly name saccharides. All
aldoses and 2-ketoses (Figure 3-100) of length up to 6 carbons have trivial names with each
diastereomer having a different name. A specific enantiomer is indicated by the use of ᴅ/ʟ in front of
the trivial name, which relates the configuration of the highest-numbered carbon stereocentre (the
configurational atom) to that of ᴅ/ʟ-glyceraldehyde (Figure 3-101, cf. Section 3.2.9.16).
Figure 3-100 Structure of aldoses (left) and ketoses (right) where n is 1 or more and m is 0 or more. A 2-
ketose is one in which at least one of the m’s is 0.
Figure 3-101 ᴅ-glucose (left) and ʟ-glucose (right)
To create carbohydrate derivatives either the trivial name of the carbohydrate without the
terminal ‘se’ or a systematically defined stem may be used (Section 3.2.9.18a). This is then followed
by suffixes indicating additions or modifications to the chain.
Monosaccharides most commonly are found in a cyclic form as a hemiacetal or a hemiketal so
one of the most common suffixes employed indicates the ring size formed when the carbohydrate
cyclises. For example furanose for a 5-membered ring or pyranose for a 6-membered ring (Figure
3-102). Cyclisation forms an additional stereocentre referred to as the anomeric centre. The
configuration of this centre is specified using either α or β. These specify the relationship between
the stereochemistry at the anomeric reference atom and the anomeric centre. The anomeric
reference atom and configurational atom are always synonymous unless the carbohydrate stem has
been systematically defined.
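Figure 3-102 α-ᴅ-galactofuranose. ᴅ-galactose cyclised to form a 5-membered ring. The figure labels the
anomeric centre and the configurational atom, which is also the anomeric reference atom.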
OPSIN supports cyclising all IUPAC endorsed trivial carbohydrate names but does not currently
support cyclisation of systematically defined stems, or any other suffixes.
3.2.9.18a Systematic carbohydrate chains
Example
ʟ-ribo-ᴅ-manno-nonose
General Syntax (ᴅ/ʟ? configurationalPrefix)+ chainLength suffix+
Monosaccharides lacking trivial names may be named using configuration prefixes derived
from the names of the trivial aldoses. These prefixes specify that the defined stereocentres have the
same stereochemistry as the aldose from which the prefix was derived. The number of stereocentres
the prefixes define should be exactly the same as the number of stereocentres that are in the sugar.
Of note from an implementation perspective, the configuration prefixes refer to the final structure of
the sugar e.g. after subtractive nomenclature has been performed (Figure 3-103).
Figure 3-103 3,6-Dideoxy-ʟ-threo-ʟ-talodecose. threo specifies the configuration at 2 centres and talo at
4. A decose has 8 stereocentres but two are removed by removal of hydroxyl groups.
Any trivial carbohydrate chain name may be specified using this nomenclature e.g. ᴅ-glucose =
ᴅ-gluco-hexose.
3.2.10 Structure assembly
3.2.10.1 Substitutive nomenclature
Example
2,4,6-trinitrotoluene
General Syntax (locant? multiplier? substituent)+ parentGroup
The substitutive operation is the most common method of connecting fragments in organic
chemical nomenclature. A substitutive operation involves the replacement of one, or more,
hydrogen atoms by another fragment. The number of hydrogens replaced is determined by the
valency of the radical on the replacement fragment e.g. ‘yl’ =1, ‘ylidene’ =2 etc. Substitutive
nomenclature is performed recursively on the substituents/brackets respecting bracketing so as to
ensure the correct groups are substituted.
OPSIN supports two special cases of substitutive nomenclature. Perhalogeno terms e.g.
‘perchloro’ indicate that all substitutable hydrogens have been replaced by the indicated halogen
(Figure 3-104).
Figure 3-104 perfluoro(decahydro-1-methylnaphthalene)
The other special case is for bridging substituents such as epoxy, epithio and methylenedioxy.
Unlike fused ring bridges these may be applied to systems that are not fused rings and additionally
are often treated as detachable prefixes. Bridging substituents differ from normal substituents in
that they may be preceded by two locants indicating the two atoms to which the bridging
substituent attaches (Figure 3-105).
Figure 3-105 3',4'-Methylenedioxy-α-pyrrolidinopropiophenone. The substitution of the benzene ring
by methylenedioxy yields the 1,3-Benzodioxole ring system.
OPSIN supports alphanumeric (e.g. ‘1’, ‘3a’), Greek (e.g. ‘beta’), element symbol locants (e.g.
‘N’) and element symbol locants in combination with alphanumeric locants (e.g. ‘N4’ or ‘4-N’). All
locant types may contain primes. Element symbol locants are assigned algorithmically using a series
of empirically defined heuristics that reproduce the labelling IUPAC has specified for certain nitrogen
containing suffixes. Locants that combine an alphanumeric component with an element symbol
locant are not assigned, as in most cases such locants will never be referred to. Instead, when such a
locant is requested, the atom corresponding to the alphanumeric part of the locant is looked up and
then a search for an atom that is connected either directly or through atoms without alphanumeric
locants is initiated to find the atom matching the element symbol portion of the locant. The
expected element symbol locant may differ from that assigned when the molecule was considered
as a whole as the element symbol locant will be based on just the atoms connected to the point in
the molecule being investigated (cf. the two different atoms referred to as N’ in Figure 3-106).
Figure 3-106 1-N′,1-N′-diethyl-1-N′′′′′,1-N′′′′′,3-N'-trimethylcyclohexane-1,1,3-tricarbohydrazonohydrazide
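The lookup of a combined locant such as ‘N4’ or ‘4-N’ can be sketched as a breadth-first search. The dictionaries used to represent atoms and bonds, and the function name, are hypothetical; the sketch assumes the search may pass only through atoms that lack an alphanumeric locant, as described above.

```python
from collections import deque

def find_compound_locant(atoms, bonds, number, element):
    """Find the `element` atom reachable from the atom with locant `number`.

    `atoms` maps atom id -> (element, alphanumeric locant or None) and
    `bonds` maps atom id -> neighbouring atom ids (hypothetical layout).
    """
    start = next(a for a, (_, loc) in atoms.items() if loc == number)
    queue = deque([start])
    seen = {start}
    while queue:
        current = queue.popleft()
        for neighbour in bonds[current]:
            if neighbour in seen:
                continue
            seen.add(neighbour)
            neighbour_element, neighbour_locant = atoms[neighbour]
            if neighbour_locant is not None:
                continue  # do not walk through other numbered atoms
            if neighbour_element == element:
                return neighbour
            queue.append(neighbour)
    return None

# Chain C1-C2-C3 with an unnumbered C-N branch on C1 and an N on C3.
atoms = {0: ("C", "1"), 1: ("C", "2"), 2: ("C", "3"),
         3: ("C", None), 4: ("N", None), 5: ("N", None)}
bonds = {0: [1, 3], 1: [0, 2], 2: [1, 5], 3: [0, 4], 4: [3], 5: [2]}
```

Here a request for ‘N1’ finds the nitrogen reached through the unnumbered carbon, ‘N3’ finds the nitrogen bonded directly to atom 3, and ‘N2’ finds nothing.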
3.2.10.2 Additive nomenclature
An additive operation involves the joining of two fragments together without loss of atoms. In
the context of joining fragments this typically applies to the bonding of radicals together (e.g. Figure
3-107). Some operations in functional class nomenclature (Section 3.2.10.4) are also formally
additive operations.
Figure 3-107 Methyl and sulfonyl are combined via an additive operation to create the prefix
methylsulfonyl
As additive operations may only occur between substituents that are adjacent within a
chemical name, OPSIN performs additive operations prior to performing substitutive operations.
Without this ordering of operations some names that are not perfectly formed, but are in common
parlance considered unambiguous, e.g. methylsulfonylcyclohexane become ambiguous. In this case
the methyl may be interpreted as a substituent of the cyclohexane if it is not first additively bonded
to the sulfonyl.
Implementation of this nomenclature is significantly complicated by ambiguity in some
substituents as to whether or not they are multi-valent radicals (Figure 3-108). The left-hand
interpretation for these two substituents implies a substitutive operation interpretation in which a
double bond is formed.
Figure 3-108 Interpretations of methylene (left) and imino (right)
Another ambiguity that affects a small number of substituents relates to the valency of the
radicals that the substituent possesses (Figure 3-109). This ambiguity occurs most often in
multiplicative nomenclature. The draft 2004 recommendations now only recommend the name
nitrilo for the left interpretation in Figure 3-109.
Figure 3-109 Interpretations of nitrilo in general organic nomenclature. The left interpretation is
preferred.
Disambiguation can be achieved in most cases by examining the adjacent substituent e.g.
whether it is a multi-valent radical.
3.2.10.3 Multiplicative nomenclature
Example
4,4'-methylenedioxydibenzoic acid
General Syntax locant? (substituent multiplier)+ parentGroup
Multiplicative nomenclature76(Rules C-72 to C-74) is used when a structure may be assembled from
multiple identical components. All substituents involved are multi-valent radicals with additive
operations connecting the substituents.
Multiplicative nomenclature is implemented as a special case of additive nomenclature. It is
detected by the presence of a multi-valent radical group followed by a multiplier equal to the
valency of the multi-valent radical group. Once multiplicative nomenclature has been detected,
groups are joined from left to right until the parent group is reached.
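The detection criterion can be illustrated with a minimal sketch. The token lists and the two small lookup tables are assumptions made for the example (in particular, ‘methylenedioxy’ is treated as a single divalent token); they are not OPSIN's actual data.

```python
# Illustrative lookup tables (assumed for this sketch).
MULTIPLIERS = {"di": 2, "tri": 3, "tetra": 4}
RADICAL_VALENCY = {"methyl": 1, "methylene": 2, "methylenedioxy": 2, "nitrilo": 3}

def is_multiplicative(tokens):
    """Detect a multi-valent radical followed by a matching multiplier."""
    for group, multiplier in zip(tokens, tokens[1:]):
        valency = RADICAL_VALENCY.get(group, 0)
        if valency > 1 and MULTIPLIERS.get(multiplier) == valency:
            return True
    return False

example = ["methylenedioxy", "di", "benzoic acid"]
```

A divalent group followed by ‘di’ or a trivalent group followed by ‘tri’ is accepted, whereas a multiplier that does not match the valency of the preceding group is not.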
Additionally, a special case is required to handle a substituent that is not
obviously a multi-valent radical but acts as one e.g. the benzylidene in Figure 3-110.
Figure 3-110 4,4'-benzylidenedi-o-toluidine
3.2.10.4 Functional class nomenclature
Example
ethyl alcohol
General Syntax groupOrSubstituent functionalTerm
Functional class nomenclature involves a group, often a substituent, followed by a class name.
Traditionally, in the case where the group is a substituent, this nomenclature was called
radicofunctional nomenclature. OPSIN has specific rules for dealing with different types of functional
class nomenclature which roughly parallel the word rules mentioned in Section 3.2.6. For example
the same code may be used for all “carbonyl derivatives” e.g. oximes, hydrazones, semicarbazones
and imides.
Some class names may be related to a chemical structure that will either be bonded onto the
preceding fragment (e.g. ‘cyanate’ or ‘ketone’) or replace an atom on a preceding fragment (e.g.
‘oxime’). Some of these fragments may even be substituted and hence are best treated as normal
groups prior to being incorporated into the preceding fragment (Figure 3-111).
Figure 3-111 hexan-3-one 4,4-diphenylsemicarbazone
Other class names such as ‘ester’ (Figure 3-112) or ‘acetal’ (Figure 3-113) are purely used to
determine what operation needs to be performed on the groups that are present.
Figure 3-112 ʟ-alanine methyl ester, constituent parts (left) and final structure (right)
Figure 3-113 propanal dimethyl acetal, constituent parts (left) and final structure (right)
3.2.10.5 Structure-based polymer nomenclature
Example
poly[oxyethylene]
General Syntax poly substituent+
Polymers may be represented in IUPAC nomenclature by naming the repeat unit preceded by
‘poly’120. With the addition of only a few special cases, OPSIN is able to support the nomenclature
used to describe a repeat unit as part of its general handling of additive nomenclature.
Figure 3-114 poly[(benzo[1,2-d:4,5-d']bis[1,3]thiazole-2,6-diyl)-1,4-phenyleneoxy-1,3-phenylene(1,3,5,7-
tetraoxo-1,2,3,5,6,7-hexahydrobenzo[1,2-c:4,5-c']dipyrrole-2,6-diyl)-1,3-phenyleneoxy-1,4-phenylene]
A special case was required to handle the fact that ‘imino’ and ‘methylene’ are used nearly
exclusively as linkers in polymer nomenclature whilst in general nomenclature they can often refer
to a double bonded atom (Figure 3-115).
Figure 3-115 OPSIN’s interpretations of poly(imino-2,2-dimethylpentamethyleneiminoazelaoyl) (left)
and imino-2,2-dimethylpentamethyleneiminoazelaoyl (right)
Another special case was required for those groups with three or more connections that only
have two in polymer nomenclature (Figure 3-116).
Figure 3-116 Interpretations of nitrilo in polymer nomenclature. Note that for nitrilo this is in direct
contradiction with the 2004 draft recommendations which specify that nitrilo should refer only to the
interpretation with three connections cf. Figure 3-109
3.2.11 Kekulisation
During the assembly of fragments, double bonds on atoms in rings are not explicit. Instead
they are represented using the flag indicating that the atom may be involved in a π system, that was
mentioned previously in connection with SMILES reading (Section 3.2.8) and operations that
add/remove hydrogen (Section 3.2.9.7b). Before performing kekulisation this flag is removed from
any atoms which by forming a double bond would end up in an unusual valency. It is also removed
from atoms that are adjacent only to atoms that may not form double bonds.
For kekulisation to be successful there must be an even number of atoms possessing the flag.
If there is an odd number of atoms, an atom with the flag is selected via a series of heuristics to be
eliminated from use in double bond formation. These heuristics are in order of priority:
An atom that was indicated as having hydrogen in the original fragment
An atom that is adjacent to only one atom with the flag set
An atom adjacent to two bridgehead atoms
A heteroatom
An arbitrarily chosen atom
The algorithm adds double bonds first to atoms that have only one neighbour to which they
are capable of being double bonded. Subsequently bonds in which at most one atom is a bridgehead
may be considered followed finally by bonds in which both are bridgeheads. A more rigorous
solution allowing backtracking when placing double bonds, such that an earlier misplaced double
bond will not prevent kekulisation, is a possible future improvement. Nonetheless, except in cases
where the position of indicated hydrogen has been underspecified and the name is hence
ambiguous, cases of this algorithm failing are extremely rare.
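A simplified version of this greedy strategy can be sketched as follows. The sketch places double bonds first on atoms with a single possible partner and otherwise makes an arbitrary choice; the prioritisation by bridgehead atoms described above is omitted for brevity, and the data layout and function name are hypothetical.

```python
def kekulise(flagged, bonds):
    """Greedily pair flagged atoms with double bonds; None on failure.

    `flagged` is the set of atoms that may join the pi system and `bonds`
    maps each atom to its neighbours (hypothetical representation).
    """
    remaining = set(flagged)
    double_bonds = []
    while remaining:
        options = {a: sorted(n for n in bonds[a] if n in remaining)
                   for a in remaining}
        if any(not partners for partners in options.values()):
            return None  # an atom is stranded; no backtracking is attempted
        # Prefer atoms that have exactly one possible partner.
        forced = sorted(a for a, p in options.items() if len(p) == 1)
        atom = forced[0] if forced else min(remaining)
        partner = options[atom][0]
        double_bonds.append(tuple(sorted((atom, partner))))
        remaining -= {atom, partner}
    return double_bonds

# Benzene: a six-membered ring in which every atom carries the flag.
benzene = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
# Naphthalene: a ten-atom periphery plus the fusion bond between 4 and 9.
naphthalene = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
naphthalene[4].add(9)
naphthalene[9].add(4)
```

With an odd number of flagged atoms the sketch returns None, which mirrors the requirement that one atom must first be eliminated from double bond formation.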
3.2.12 Valency checking
Once all the fragments have been assembled a check is performed on the valency of each
atom. The valency is checked either against the highest known stable valency for that atom’s
element/charge, or against the Lambda Convention specified valency (taking into account protons
added/removed by charge modifying suffixes). If a valency check fails, then the name is rejected.
A rationalisation for the decision to reject such structures rather than producing a hypervalent
interpretation is that in substitutive nomenclature it is impossible to generate a hypervalent
structure (without the Lambda Convention) as only as many hydrogens as are present on the atom
when in its standard valency may be substituted. This means that a name that produces a
hypervalent structure is not only chemically suspect but also formally incorrect.
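The check can be illustrated with a minimal sketch. The table of maximum valencies holds illustrative values for neutral atoms only and is an assumption of this example, as are the dictionary keys and the function name; charge handling is omitted.

```python
# Illustrative maximum valencies for neutral atoms (assumed values).
MAX_VALENCY = {"H": 1, "C": 4, "O": 2, "F": 1, "S": 6}

def check_valency(atoms):
    """Return True if no atom exceeds its permitted valency.

    Each atom is a dict with 'element', 'bond_order_sum' and an optional
    'lambda_valency' (hypothetical representation).
    """
    for atom in atoms:
        limit = atom.get("lambda_valency")
        if limit is None:
            # No Lambda Convention valency: fall back to the element maximum.
            limit = MAX_VALENCY[atom["element"]]
        if atom["bond_order_sum"] > limit:
            return False  # the name would be rejected
    return True

methane_carbon = {"element": "C", "bond_order_sum": 4}
pentavalent_carbon = {"element": "C", "bond_order_sum": 5}
lambda4_sulfur = {"element": "S", "bond_order_sum": 5, "lambda_valency": 4}
```

A valency specified by the Lambda Convention takes precedence over the element's maximum in this sketch, so a λ4 sulfur with a bond order sum of 5 is rejected.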
3.2.13 Application of stoichiometry
3.2.13.1 Mixtures
Example
methylene chloride compound with octanol (2:1)
General Syntax component (compound with)? component+ stoichiometry
Mixtures may be specified by stating the components of the mixture followed by indication of
stoichiometry. Often the components are separated by a term like ‘compound with’. OPSIN has a
small list of terms that are accepted between chemical names and subsequently ignored to achieve
this.
Indication of mixture stoichiometry is recognised and stripped from the name prior to
tokenisation/parsing. Once word rules have been assigned, the indicated stoichiometry is added as
an attribute of each top level wordRule. As top level word rules correspond to separate structures,
a stoichiometry value is expected for each top level word rule. Once processing of the word rules
has been completed, their contents are multiplied out
appropriately.
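Recognising and stripping the stoichiometry indication can be sketched with a regular expression. The pattern below handles only a trailing bracketed ratio such as ‘(2:1)’ and is an assumption made for this illustration, not OPSIN's actual expression; the function name is hypothetical.

```python
import re

# A trailing bracketed ratio of two or more integers, e.g. "(2:1)".
STOICHIOMETRY = re.compile(r"\s*\((\d+(?::\d+)+)\)\s*$")

def strip_stoichiometry(name):
    """Split a name into (name without ratio, list of ratios or None)."""
    match = STOICHIOMETRY.search(name)
    if match is None:
        return name, None
    ratios = [int(part) for part in match.group(1).split(":")]
    return name[:match.start()], ratios

stripped, ratios = strip_stoichiometry(
    "methylene chloride compound with octanol (2:1)")
```

The ratios returned would then be attached, in order, to the top level word rules. A charge such as ‘(3+)’ is not matched because the pattern requires a colon-separated ratio.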
3.2.13.2 Charge balancing
Example
magnesium chloride
(fully specified this name would be magnesium(2+) dichloride)
Compounds described in the chemical literature are typically intended to be overall charge
neutral. As a result indication of explicit stoichiometry is often omitted. The problem is further
complicated by metals, which often have their charge omitted. If the compound is formed of more
than one component and is not charge neutral, OPSIN goes through a series of heuristics to attempt
to balance the charge on the compound. These are:
If a metal is uncharged and has fewer bonds than its typical oxidation state, it is
treated as a candidate for being made into a cation.
Potential cationic metals are set to their typical charges (Figure 3-117)
Figure 3-117 sodium chloride. The sodium is set to its standard charge of +1 resulting in a neutral
compound
If setting the metal to its typical charge doesn’t satisfy the charge imbalance a higher
charge is tried if a higher charge is known to be possible (Figure 3-118)
Figure 3-118 thallium trichloride. Thallium is typically thallium(1+) but as there are known to be three
chlorides thallium(3+) is assumed.
Where stoichiometry is undefined and the choice of component/s to multiply is
unambiguous, components are multiplied. Components may only be multiplied by
integers (Figure 3-119).
Figure 3-119 iron(3+) sulfate. Typically only one component needs to be multiplied, but in some cases
such as this both are.
A metal has its charge set lower than its typical charge (Figure 3-120)
Figure 3-120 magnesium monochloride. As there is explicitly only one chloride the number of chlorides
may not be adjusted. Hence the charge on the magnesium is adjusted
A salt is neutralised76(Rule C-816.4) (Figure 3-121)
Figure 3-121 caffeine citrate. Citrate in isolation would be treated as a trianion but as there is another
compound present it is treated as if it were citric acid.
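The multiplication heuristic (cf. Figure 3-119) can be illustrated for the simple case of one cation and one anion. This sketch uses the least common multiple of the two charges; the function name is hypothetical and the other heuristics in the list are not modelled.

```python
from math import gcd

def balance_by_multiplication(cation_charge, anion_charge):
    """Smallest integer multipliers that give an overall neutral compound.

    `cation_charge` must be positive and `anion_charge` negative.
    """
    magnitude = -anion_charge
    common = cation_charge * magnitude // gcd(cation_charge, magnitude)
    return common // cation_charge, common // magnitude

iron_sulfate = balance_by_multiplication(3, -2)
```

For iron(3+) sulfate this gives two iron cations and three sulfate anions, the case in which both components are multiplied.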
3.2.14 Stereochemistry handling
3.2.14.1 Detection of stereocentres
Tetrahedral (e.g. Figure 3-122) and double bond (Figure 3-123) stereochemistry are commonly
found in organic chemicals.
Figure 3-122 (R)-bromochlorofluoromethane (left) and (S)-bromochlorofluoromethane (right)
Figure 3-123 (E)-but-2-ene (left) and (Z)-but-2-ene (right)
As locants are often omitted from stereochemistry prefixes when it is unambiguous to do so, any
rigorous solution to this area must be capable of detecting stereocentres. For this purpose, OPSIN
employs a derivative of the InChI canonicalisation algorithm121,122 to label atom environments.
Hydrogens are made explicit prior to stereocentre perception and hence do not present a problem.
Higher bond orders are handled, in an analogous way to the Cahn-Ingold-Prelog (CIP) sequence
rules123–125, by treating all bonds as if they were single and adding additional atoms to the atoms at
both ends of the higher order bond. The end result is that each constitutionally distinct atom
environment is given its own environment number.
These atom environments are then used to identify true stereocentres126 i.e. stereocentres
that do not require the existence of other stereocentres in the molecule to be stereocentres. For
detecting tetrahedral stereocentres, a list of atoms to consider is generated by finding those that
correspond to known atom/bond configurations that may be tetrahedral stereocentres (Figure
3-124). This approximately corresponds to the stereocentres detected by InChI122(Table 8).
Figure 3-124 Examples of tetrahedral stereocentres recognised by OPSIN. X and Y are two atoms in
different environments that are bonded together.
OPSIN ignores those centres that nominally meet these criteria but in reality would not be
stereocentres due to simple resonance or tautomerism. Again this approximately corresponds to the
specification of InChI although the case depicted in Figure 3-125 is not explicitly mentioned in the
specifications.
Figure 3-125 Due to resonance this structure is achiral
The list of true stereocentres is then produced by checking that all atoms neighbouring the
potential stereocentre are in different atom environments.
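The final filtering step can be sketched as follows. This is an illustrative simplification rather than OPSIN's implementation: the function name and data layout are hypothetical, the environment numbers are assumed to have been computed already, and only candidates with four explicit neighbours are handled (lone pairs are ignored).

```python
from typing import Dict, List


def true_tetrahedral_centres(neighbours: Dict[int, List[int]],
                             environment: Dict[int, int],
                             candidates: List[int]) -> List[int]:
    """Keep the candidates whose four neighbours all lie in different atom environments."""
    centres = []
    for atom in candidates:
        envs = [environment[n] for n in neighbours[atom]]
        if len(envs) == 4 and len(set(envs)) == 4:
            centres.append(atom)
    return centres


# Atom 0 is the carbon of bromochlorofluoromethane: Br, Cl, F and H all differ.
chiral = true_tetrahedral_centres({0: [1, 2, 3, 4]}, {1: 10, 2: 11, 3: 12, 4: 13}, [0])
# Atom 0 is the carbon of dichlorofluoromethane: the two chlorines share an environment.
achiral = true_tetrahedral_centres({0: [1, 2, 3, 4]}, {1: 11, 2: 11, 3: 12, 4: 13}, [0])
print(chiral, achiral)
```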
Double bond stereocentres are found by analysing the atom environments at either end of a
double bond. Each atom in the double bond is expected to be bonded to a total of 3 atoms unless
the atom is nitrogen in which case 2 is acceptable with the third “atom” being a lone pair.
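A minimal sketch of this test is given below, assuming explicit hydrogens and pre-computed environment numbers. The function name and the dictionary-based molecule representation are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def is_double_bond_stereocentre(a: int, b: int,
                                neighbours: Dict[int, List[int]],
                                environment: Dict[int, int],
                                element: Dict[int, str]) -> bool:
    """Check both ends of the double bond a=b for distinguishable substituents."""

    def end_is_stereogenic(atom: int, other_end: int) -> bool:
        substituents = [n for n in neighbours[atom] if n != other_end]
        if len(substituents) == 2:
            return environment[substituents[0]] != environment[substituents[1]]
        # Nitrogen may have a single substituent; the lone pair acts as the third "atom".
        return len(substituents) == 1 and element[atom] == "N"

    return end_is_stereogenic(a, b) and end_is_stereogenic(b, a)


# Ethanimine CH3-CH=N-H: atom 1 is the sp2 carbon, atom 2 the nitrogen.
neighbours = {1: [0, 2, 4], 2: [1, 3]}
environment = {0: 1, 3: 2, 4: 3}
element = {1: "C", 2: "N"}
print(is_double_bond_stereocentre(1, 2, neighbours, environment, element))
```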
OPSIN does not currently detect cumulene stereochemistry although doing so would not be
technically challenging.
If an atom in a fragment has defined stereochemistry but is not identified as a stereocentre
this information is removed as it is assumed that the atom is no longer a stereocentre in the final
molecule e.g. substitution of a hydrogen atom may have made two substituents equivalent.
3.2.14.2 Applying stereochemistry
OPSIN performs stereochemistry operations in the order: locanted stereochemistry,
carbohydrate stereochemical prefixes, unlocanted stereochemistry; whilst tracking which
stereocentres have had their configuration set. As it is not uncommon for a structure with implicit
stereochemistry to have this stereochemistry overridden, these cases are not considered as having
set the configuration of the stereocentres. OPSIN currently considers five distinct types of
stereochemistry: R and S, E and Z, cis and trans, alpha and beta, and carbohydrate stereochemistry.
3.2.14.2a R/S/E/Z stereochemistry
For R, S, E and Z stereochemistry once an appropriate stereocentre has been identified the
“ligands” i.e. connected atoms, must be ranked using the CIP system. OPSIN’s implementation
includes support for rules 1 (higher atomic number preferred to lower) and 2 (higher isotope
preferred to lower), which deal with constitutional differences between ligands. A failure is reported
if ligands cannot be distinguished.
The 1982 revision to the CIP system124 introduced the concept of hierarchical digraphs. A
hierarchical digraph is an acyclic graph representation of the bonding within a ligand. The
transformation from the connection table of a ligand to a digraph involves two transformations:
Bonds of order greater than 1 are represented as single bonds with attached
duplicated atoms (called ghost atoms) e.g.:
Bonds that join to an atom previously visited by that branch of the digraph instead
join to a ghost atom which is not further bonded e.g.:
Rules 1 and 2 involve comparing the hierarchical digraphs for each ligand with rule 2 only
being invoked if rule 1 fails to distinguish the ligands. This comparison starts from the first layer of
atoms from the stereocentre. Evaluation proceeds on a layer by layer basis with a subsequent layer
only being investigated if the prior layer failed to distinguish the ligands. It should be noted that the
ordering of atoms in each layer is determined by the priority of atoms in the previous layer, and only
when a tie is encountered by the relative priority of the atoms within the layer.
OPSIN’s implementation is notable in that it only lazily evaluates the digraph. As typically
ranking may be determined within the first couple of layers, this approach is computationally faster
and more memory efficient, especially for larger molecules (Figure 3-126).
Figure 3-126 Hierarchical digraph for piperidin-2-yl and cyclohexyl ligands. The two ligands are
distinguished by OPSIN at the 2nd level as [N,C,H] has higher priority than [C,C,H] with no further
enumeration of the digraph required.
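The benefit of lazy evaluation can be illustrated with a short sketch built on Python generators. It is a deliberately simplified stand-in for OPSIN's implementation: each layer is reduced to a descending list of atomic numbers, so only rule 1 is considered and the ordering of branches within a layer is not tracked. The names `layers` and `compare_ligands` and the dictionary-based molecule are hypothetical; hydrogens are assumed to be explicit.

```python
from functools import cmp_to_key
from itertools import zip_longest
from typing import Dict, Iterator, List, Tuple

Bonds = Dict[int, Dict[int, int]]  # atom index -> {neighbour index: bond order}


def layers(centre: int, start: int, z: Dict[int, int], bonds: Bonds) -> Iterator[List[int]]:
    """Lazily yield each layer of a ligand's digraph as atomic numbers, highest first."""
    # Each frontier entry is a real atom plus the path taken from the stereocentre to it.
    frontier: List[Tuple[int, Tuple[int, ...]]] = [(start, (centre, start))]
    yield [z[start]]
    while frontier:
        next_frontier: List[Tuple[int, Tuple[int, ...]]] = []
        numbers: List[int] = []
        for atom, path in frontier:
            parent = path[-2]
            for nbr, order in bonds[atom].items():
                ghosts = order - 1  # duplicate atoms standing in for a multiple bond
                if nbr == parent:
                    pass  # the real parent is already present one layer up
                elif nbr in path:
                    ghosts += 1  # ring closure: a ghost replaces the revisited atom
                else:
                    next_frontier.append((nbr, path + (nbr,)))
                    numbers.append(z[nbr])
                numbers.extend([z[nbr]] * ghosts)  # ghosts are never expanded further
        if numbers:
            yield sorted(numbers, reverse=True)
        frontier = next_frontier


def compare_ligands(centre: int, a: int, b: int, z: Dict[int, int], bonds: Bonds) -> int:
    """Return 1 if ligand a outranks ligand b, -1 if b outranks a, 0 if undistinguished."""
    pairs = zip_longest(layers(centre, a, z, bonds), layers(centre, b, z, bonds), fillvalue=[])
    for layer_a, layer_b in pairs:  # deeper layers are only generated when needed
        if layer_a != layer_b:
            return 1 if layer_a > layer_b else -1
    return 0


# butan-2-ol with explicit hydrogens; atom 0 is the stereocentre
atomic_numbers = [6, 8, 1, 6, 1, 1, 1, 6, 1, 1, 6, 1, 1, 1, 1]
bond_list = [(0, 1), (1, 2), (0, 3), (3, 4), (3, 5), (3, 6), (0, 7), (7, 8),
             (7, 9), (7, 10), (10, 11), (10, 12), (10, 13), (0, 14)]
z = dict(enumerate(atomic_numbers))
bonds: Bonds = {i: {} for i in z}
for a, b in bond_list:
    bonds[a][b] = 1
    bonds[b][a] = 1

ranked = sorted(bonds[0], key=cmp_to_key(lambda a, b: compare_ligands(0, a, b, z, bonds)),
                reverse=True)
print(ranked)  # ligand atoms of the stereocentre, highest priority first
```

Because the layers come from generators, the ethyl and methyl ligands of butan-2-ol are distinguished at the second layer and the terminal methyl of the ethyl group is never visited.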
OPSIN also implements a corner case in rule 1 (rule 1b125) in which two ligands may be
constitutionally different but have identical hierarchical digraphs (Figure 3-127). In this case ghost
atoms must be distinguished from non-ghost atoms and the position of the atom the ghost atom is a
duplicate of in the digraph is taken into account.
Figure 3-127 (5S)-bicyclo[3.1.0]hex-2-ene
From the combination of ordered ligands and a stereodescriptor, e.g. R/S/E/Z, it is then simple
to define the stereochemistry of a tetrahedral centre or double bond.
3.2.14.2b Cis/trans stereochemistry
Cis and trans are initially interpreted as referring to the relative stereochemistry of two
substituents on a ring. OPSIN does not have general support for detecting pseudo-asymmetric atoms
but has support for such stereocentres in this particular case. A ring system is investigated to find
tetrahedral atoms that either have one hydrogen or are connected to a fragment outside of the ring
system. If there are exactly two of them, their configuration may be set to be relatively cis or trans.
To do this the smallest set of smallest rings is calculated, allowing a list of all bonds not involved in
fusions to be compiled. From these bonds two paths joining the stereocentres should be
discoverable (Figure 3-128). This knowledge of the positioning of atoms at one stereocentre relative
to the positioning of atoms at the other stereocentre allows OPSIN to construct descriptions of the
stereochemistry that ensure that the two centres will be cis/trans to each other. If one atom has
predefined stereochemistry, care is taken to leave that stereocentre as defined and have the other
stereocentre’s configuration be relative to the predefined stereocentre.
Figure 3-128 trans-2,6-dimethyl-2,6-dihydronaphthalene. Coloured atoms show the paths defining the
periphery of the molecule. By using the atoms at either end of the blue path and at either end of the green
path in the same place in the generated stereochemistry descriptions one can tie the configuration of the two
stereocentres together.
OPSIN also allows cis/trans to be used as an alternative to E/Z to specify double bond
stereochemistry but only in the special case where one group at either end of the double bond is
hydrogen. Without this criterion, it is formally ambiguous as to which groups are being defined as
cis/trans to each other.
3.2.14.2c Alpha/beta stereochemistry
Alpha/beta stereochemistry is used to indicate on which side of a plane a group is positioned.
OPSIN currently only supports alpha/beta stereochemistry in conjunction with natural product
nomenclature127 (RF-10). In natural product nomenclature, a particular depiction of the molecule is
designated as the preferred orientation and it is with respect to this that alpha/beta stereochemistry
is defined. OPSIN encodes this information by associating each natural product that supports
alpha/beta stereochemistry with a list of the peripheral atoms of the natural product when read in a
clockwise direction. The positioning within this list of the adjacent periphery atoms to the
stereocentre, the atom to which alpha/beta is referring and the alpha/beta itself, is sufficient to
define the stereo configuration (Figure 3-129).
Figure 3-129 17β-Hydroxy-8α,9β,10α-androst-4-en-3-one
3.2.14.2d Carbohydrate stereochemistry
Carbohydrate stereochemistry is only employed on the systematic carbohydrate stems
described previously in Section 3.2.9.18a. OPSIN’s vocabulary has these carbohydrate stems with
their stereocentres configured such that the hydroxyl groups would point right on a Fischer
projection (Figure 3-130). The configuration prefixes can then be simply implemented as a list of ‘r’s
and ‘l’s indicating whether or not the configuration at each centre should be retained or flipped. For
example ᴅ-gluco is expressed as “r/l/r/r”. To be valid a carbohydrate name must have every
stereocentre in its stem, which still exists after substitutive and subtractive nomenclature operations
have been applied, defined by configurational prefixes.
“hexose” ᴅ-gluco- ᴅ-gluco-hexose
Figure 3-130 Fischer projection for ᴅ-gluco-hexose showing the method of constructing the
stereochemistry for the complete name
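The retain/flip encoding can be sketched as below. The function name is hypothetical, and the sketch assumes that the list of flags runs from the lowest-numbered stereocentre of the stem and that every stem stereocentre starts with its hydroxyl group on the right of the Fischer projection.

```python
from typing import List


def apply_configuration_prefix(stereocentre_count: int, code: str) -> List[str]:
    """Return the side of the Fischer projection for the hydroxyl at each stereocentre."""
    flags = code.split("/")
    if len(flags) != stereocentre_count:
        # every remaining stereocentre of the stem must be defined by the prefixes
        raise ValueError("prefix defines %d centres but stem has %d"
                         % (len(flags), stereocentre_count))
    sides = []
    for flag in flags:
        if flag == "r":
            sides.append("right")  # configuration of the stem is retained
        elif flag == "l":
            sides.append("left")   # configuration of the stem is flipped
        else:
            raise ValueError("unexpected flag: " + flag)
    return sides


# D-gluco applied to the four stereocentres of a hexose stem
print(apply_configuration_prefix(4, "r/l/r/r"))
```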
3.2.15 Ambiguous and formally incorrect chemical names
When a chemical name is underspecified e.g. lacking sufficient brackets or locants it may
become ambiguous and formally describe multiple structures. OPSIN has been empirically tuned to
attempt to generate the interpretation of a name that is most likely in common usage, with an
implicit assumption that an input chemical name is intended to describe a particular structure. This
is very similar to one of Brecher’s principles for a chemical nomenclature interpretation system106:
“The meaning of logically ambiguous names is determined by common usage”. The addition of
implicit brackets or spaces may be sufficient to give a formally ambiguous or highly unlikely name an
unambiguous and likely interpretation. Heuristics for making these alterations are dealt with in the
following subsections.
3.2.15.1 Implicit bracketing
Implicit bracketing is employed by OPSIN in cases where substitution onto the rightmost
group, in the current scope, of a chemical name is not intended (Figure 3-131).
Figure 3-131 Allowed interpretations of aminomethylbenzene. The boxed interpretation is produced by
OPSIN by implicitly bracketing the name to (aminomethyl)benzene
Figure 3-131 depicts four structures consistent with the name, aminomethylbenzene. There is
only one possible structure where the aminomethyl is a substituent on the benzene ring, whereas if
the amino and methyl groups are direct substituents of the ring, there are three structural isomers.
In general, OPSIN adds implicit brackets to attempt to yield a name with only one possible (non-
degenerate) structural isomer, although perception of atom environments is not currently done to
rigorously achieve this.
OPSIN implicitly brackets names when two substituents are directly adjacent to each other (e.g.
no intervening locants/multipliers) and the latter substituent has the
usableAsAJoiner attribute. This attribute is generally present on substituents which possess
only one substitutable hydrogen (e.g. formyl), have all substitutable hydrogens on the same atom
(e.g. sulfamoyl) or are a multi-radical accepting additive bonds (e.g. carbonyl).
OPSIN distinguishes between the case in which substituents are directly concatenated and the
case in which they are separated by a hyphen; only the former are implicitly bracketed. This heuristic
was found to be useful for interpreting chemical names generated by Lexichem.
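The basic decision described above can be sketched for the simplest case of two substituents in front of a parent group. The class and function names are hypothetical and the sketch ignores scope, locants inside the bracket and multipliers; only the usableAsAJoiner attribute is taken from the text. Methyl is flagged as a joiner here because all of its substitutable hydrogens are on the same atom.

```python
from dataclasses import dataclass


@dataclass
class Substituent:
    text: str
    usable_as_a_joiner: bool = False
    separated_by_hyphen: bool = False       # hyphen between this and the previous substituent
    has_locant_or_multiplier: bool = False  # locant/multiplier between this and the previous one


def should_bracket_with_previous(latter: Substituent) -> bool:
    """Directly concatenated substituents are bracketed if the latter one is a joiner."""
    return (latter.usable_as_a_joiner
            and not latter.separated_by_hyphen
            and not latter.has_locant_or_multiplier)


def bracket_name(first: Substituent, second: Substituent, parent: str) -> str:
    """Rewrite the name with the implicit bracket made explicit, where one applies."""
    if should_bracket_with_previous(second):
        return "(" + first.text + second.text + ")" + parent
    joiner = "-" if second.separated_by_hyphen else ""
    return first.text + joiner + second.text + parent


amino = Substituent("amino")
print(bracket_name(amino, Substituent("methyl", usable_as_a_joiner=True), "benzene"))
print(bracket_name(amino, Substituent("methyl", usable_as_a_joiner=True,
                                      separated_by_hyphen=True), "benzene"))
```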
When implicit brackets are added, locants could apply to the implicit bracket or the contents
within it (Figure 3-132).
Figure 3-132 4-dimethylaminotoluene, interpreted as 4-(dimethylamino)toluene (left) but
2-aminopropylbenzene, interpreted as (2-aminopropyl)benzene (right)
Similarly multipliers could apply to the implicit bracket or to the contents within it (Figure
3-133).
Figure 3-133 1,3,4-trimethylthiobenzene, interpreted as 1,3,4-tri(methylthio)benzene (left) but
1,3,4-trimethylbutylbenzene, interpreted as (1,3,4-trimethylbutyl)benzene (right)
Whether the locants and multipliers of the first substituent should be placed within the
implicit bracket is determined heuristically: OPSIN considers whether the locant may
apply to the other groups within the implicit bracket, the group itself or a group onto which it may
be substituted. If a multiplier is a group multiplier e.g. ‘bis’ this is used as a hint that the multiplier
describes multiplication of the implicit bracket (Figure 3-134).
Figure 3-134 bismethylaminomethane, interpreted as bis(methylamino)methane (left) but
dimethylaminomethane interpreted as (dimethylamino)methane (right)
3.2.15.2 Implicit spaces
Spaces are used in functional class nomenclature to separate the functional class of the
compound from the substituent group. In most cases the absence of the space, with strict
application of this rule, leads either to a name with a highly unlikely interpretation (Figure 3-135) or
to a name with no interpretation e.g. ethylalcohol.
Figure 3-135 ethylchloride. Strictly this interpretation is not allowed as chloride possesses no
substitutable hydrogen.
Hence, OPSIN does not enforce the presence of a space before a functional term and instead
will treat such examples as if there were implicitly a space between the substituent and functional
class term. This is done by having this construct of a substituent directly followed by a functional
term actually present in the chemical grammar. The reason for this choice is that the parser is greedy
and will consume as much input as it can interpret. A consequence of this is that if this construct
were not in the grammar, chalcogen analogues of functional class terms would not be considered.
This is because the chalcogen prefix would always be parsed by the grammar as a substituent
instead of being considered as part of the functional term (Figure 3-136).
Figure 3-136 ethylthiocyanate or ethyl thiocyanate (left), ethylthio cyanate (right). For the
space-omitted name OPSIN generates parses for both interpretations before disambiguating in favour of the
left-hand interpretation on the basis of having a longer functional term.
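The disambiguation in Figure 3-136 can be sketched with a toy vocabulary. This is an illustration only: the two vocabularies and the function names are hypothetical, and OPSIN's actual grammar is far richer than a pair of word lists.

```python
from typing import List, Tuple

# Toy vocabularies, assumed purely for this example
SUBSTITUENTS = {"ethyl", "thio"}
FUNCTIONAL_TERMS = {"cyanate", "thiocyanate"}


def is_substituent_run(text: str) -> bool:
    """True if the text can be segmented into one or more known substituents."""
    if text in SUBSTITUENTS:
        return True
    return any(text[:i] in SUBSTITUENTS and is_substituent_run(text[i:])
               for i in range(1, len(text)))


def candidate_parses(name: str) -> List[Tuple[str, str]]:
    """Every split of the name into substituent(s) directly followed by a functional term."""
    return [(name[:i], name[i:]) for i in range(1, len(name))
            if is_substituent_run(name[:i]) and name[i:] in FUNCTIONAL_TERMS]


def choose_parse(name: str) -> Tuple[str, str]:
    """Prefer the parse with the longest functional term."""
    parses = candidate_parses(name)
    if not parses:
        raise ValueError("no parse found for " + name)
    return max(parses, key=lambda parse: len(parse[1]))


print(candidate_parses("ethylthiocyanate"))
print(choose_parse("ethylthiocyanate"))
```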
For esters disambiguation is more difficult as the space omitted form also produces a distinct
chemically sensible interpretation (Figure 3-137).
Figure 3-137 tert-butylacetate (left) and tert-butyl acetate (right)
Analysis of patents made it clear that strictly applying the IUPAC rules and treating such
names as substituted anions was inappropriate.
OPSIN employs the following heuristics to distinguish between the cases where the omission
of the space was intended and those in which an ester interpretation was intended. These criteria
are applied before substituents are multiplied e.g. diethyl would be treated as one substituent.
The first substituent in the name must have no locant and must be univalent. The
multiplier (if present) in front of the substituent must not exceed the number of
functional atoms present in the ‘ate’/ ‘ite’ group.
If the parent group has exactly one substituent the ester interpretation is preferred if:
o Substitution onto the ‘ate’/ ‘ite’ group would lead to ambiguity. Ambiguity is
determined through an analysis of the environments in which substitutable
hydrogens are found, using the same environment labelling as is employed
during stereochemistry handling
o It is prefixed with the multiplier ‘mono’
o The substituent is a straight chain alkyl chain followed by
formate/methanoate/acetate/ethanoate. Such names produce an
unambiguous anion interpretation but would not normally be named like this
e.g. ethylethanoate would be called butanoate
If the parent group has multiple substituents the ester interpretation is preferred if:
o All substituents other than the first have locants (Figure 3-138)
o The ‘ate’/ ‘ite’ group has insufficient substitutable hydrogen atoms if and
only if the substitution interpretation is assumed
Figure 3-138 tert-butyl-4-vinylperbenzoate is interpreted as tert-butyl 4-vinylperbenzoate
Spaces may also be omitted in functional class names where the functional group is a divalent
group and hence two substituents are expected. A long standing exception allows for one
substituent to be omitted if both substituents are identical (Figure 3-139).
Figure 3-139 diethyl ether or ethyl ether (omitted substituent) or diethylether (omitted space) or
ethylether (omitted substituent and space)
When two concatenated substituents are present before such functional groups OPSIN
assumes that a space is omitted unless a locant is provided on the first substituent indicating that it
connects to the second substituent.
3.2.16 Output formats
After a name has been interpreted, an OPSIN Fragment will have been generated that
includes the molecule(s) described by the chemical name. This internal format may then be
serialised to CML, SMILES or InChI.
3.2.16.1 CML
OPSIN’s Fragment, Atom, Bond, AtomParity and BondStereo classes all contain a
method to produce a CML serialisation which can be useful for debugging. The process of serialising
a Fragment incorporates the results of serialising the constituent Atoms, Bonds, AtomParitys
and BondStereos. The CML serialisation differs from the other serialisations in that it also includes
the locants associated with each atom (Figure 3-140).
propane
Figure 3-140 Example of CML output
3.2.16.2 SMILES
OPSIN includes a SMILES writer that can convert its internal format to SMILES. The SMILES
writer includes support for everything that OPSIN’s internal format can represent about the
structure of a molecule, including stereochemistry. So as to produce shorter, more aesthetically
pleasing SMILES, hydrogens are suppressed on all organic atoms except for nitrogens with double
bond stereochemistry (Figure 3-141).
Figure 3-141 (Z)-ethanimine: C(/C)=N/[H]. Note that without mentioning the hydrogen it is not possible
to express this stereochemistry.
SMILES descriptions for individual atoms and bonds can usually be generated in isolation from
the rest of the molecule e.g. for an atom from its properties and hydrogen count. When
stereochemistry is involved it is more complex as the serialisation is affected by the ordering of
atoms within the SMILES string; hence the first step that the SMILES writer performs is a depth-first
traversal of the molecule defining the order in which the atoms will be serialised. Double bond
stereochemistry in conjugated systems is especially difficult as one must take into account the
direction of slashes used for the previous double bonds as the same slash is used in the definition of
the stereochemistry of both double bonds. OPSIN solves this by assigning consistent slash characters
to all bonds to non-implicit atoms, which are adjacent to double bonds with defined stereochemistry
before beginning writing of the SMILES string. In cases where neither group is an implicit hydrogen
this leads to superfluous slashes but as they are not contradictory this is not incorrect (Figure 3-142).
Figure 3-142 (1Z,3Z)-1-bromo-1-chloropenta-1,3-diene: Br\C(=C/C=C\C)\Cl
3.2.16.3 InChI
To create InChIs OPSIN employs the JNI-InChI library128. This allows the usage of InChI, a
natively C library, through Java, on the majority of systems. The conversion from OPSIN’s internal
format to JNI-InChI’s format is straightforward due to their near identical representations employed
for describing stereochemistry. OPSIN can produce either standard InChIs or InChIs with fixed
hydrogen layers. As IUPAC names generally specify a specific tautomer, including the fixed hydrogen
layer is preferred.
As JNI-InChI is a very large dependency compared to OPSIN’s other dependencies, OPSIN is
divided into two Maven modules. One of these contains OPSIN’s core functionality, including CML
and SMILES output, whilst the other solely adds the ability to do InChI serialisation.
3.3 Results and discussion
Evaluating chemical name to structure performance, while theoretically simple, is impeded by
the difficulty of finding sufficiently large sets of accurately annotated name/structure pairs that are
representative of the names of interest.
A study by Eller129 found 26% of names in the analysed sample from the published literature to
be formally unacceptable. When testing name to structure performance it is important to be able to
know that conversion failures or unexpected name conversions are not just the result of the input
name being incorrect. Eller also noted that machine generated names from the three pieces of name
generation software tested (AutoNom 2000, ChemBioDraw and ACD/Name) produced formally
incorrect names in only 1% of cases. For this reason all testing on the precision of chemical name to
structure software has been performed on machine generated names. It should be noted that many
major chemical drawing programs (e.g. ChemBioDraw, Marvin Sketch, Accelrys Draw, ACD
ChemSketch) now incorporate structure to name algorithms, so finding machine generated chemical
names in the literature is becoming increasingly common.
It is important to know that the findings on generated names are still applicable to chemical
names “in the wild”. One of the most commercially important applications of name to structure is
locating chemical patents from the chemicals described within them. For this it is important to have
high recall on the names used in such patents to describe exemplified compounds.
3.3.1 Methodology
3.3.1.1 Generated name test sets
The SMILES and InChIs for 30,000 randomly selected compounds were downloaded from
PubChem, a database of more than 25 million small molecules. To randomly select the compounds,
PubChem IDs were generated by random number generation in the range of valid IDs with removal
of duplicates and revoked IDs until 30,000 valid IDs were generated.
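The selection procedure amounts to rejection sampling and can be sketched as follows. The function name and the `is_valid` callback are hypothetical stand-ins; in particular the callback represents the check that an ID has not been revoked, which the sketch does not implement against PubChem itself.

```python
import random
from typing import Callable, Optional, Set


def sample_valid_ids(max_id: int, target: int, is_valid: Callable[[int], bool],
                     seed: Optional[int] = None) -> Set[int]:
    """Draw random IDs in the valid range, discarding duplicates and invalid IDs."""
    rng = random.Random(seed)
    chosen: Set[int] = set()
    # Note: this loops forever if fewer than `target` valid IDs exist in the range.
    while len(chosen) < target:
        candidate = rng.randint(1, max_id)
        if candidate not in chosen and is_valid(candidate):
            chosen.add(candidate)
    return chosen


# Toy run: IDs divisible by 7 play the role of revoked IDs
ids = sample_valid_ids(1000, 50, lambda i: i % 7 != 0, seed=1)
print(len(ids))
```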
The SMILES were then converted to names by ACD/Name 12.02, ChemBioDraw12, Lexichem
2.1.0 and Marvin 5.8.2. Due to an issue with ACD/Name’s SMILES to name conversion including
stereochemistry for double bonds, which did not have defined stereochemistry, an SDF generated by
Lexichem from the SMILES was instead used as input to ACD/Name. InChIs were generated from
these names by OPSIN 1.2.0, ChemBioDraw12, and Marvin 5.8.2. To give an indication of the
difference in performance between OPSIN 1.2.0 and the version of OPSIN available at the
commencement of this project, a version from November 2008 is included. As this version did not
directly output InChIs, these were instead generated from the program’s CML output using a simple
Pybel130 script as an interface to OpenBabel 2.3.1.
Determination of whether or not the InChIs were considered identical was made by
comparison of the layers that are present in standard InChIs. Where the InChIs were not identical it
was determined whether the layers that define the constitution of the molecule were identical. If
they were, this was classed as a “Stereochemical Discrepancy”, and, if they were different, this was
classed as a “Constitutional Discrepancy”.
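This classification can be sketched on standard InChI strings as below. The sketch is not the script used in this work: the function names are hypothetical, and it assumes that the formula, connectivity (/c), hydrogen (/h), charge (/q) and protonation (/p) layers together define the constitution. Isotopic layers, which can repeat layer prefixes, are not handled.

```python
from typing import Dict, Optional

CONSTITUTION_KEYS = ("formula", "c", "h", "q", "p")  # assumed constitutional layers


def split_layers(inchi: str) -> Dict[str, str]:
    """Split a standard InChI into its formula and single-letter-prefixed layers."""
    parts = inchi.split("/")
    layers = {"formula": parts[1]}  # parts[0] is the 'InChI=1S' version prefix
    for layer in parts[2:]:
        layers[layer[0]] = layer[1:]
    return layers


def classify(expected: str, actual: Optional[str]) -> str:
    """Classify a name to structure result against the InChI of the starting structure."""
    if actual is None:
        return "No Result"
    exp, act = split_layers(expected), split_layers(actual)
    if exp == act:
        return "Correctly Interpreted"
    if all(exp.get(key) == act.get(key) for key in CONSTITUTION_KEYS):
        return "Stereochemical Discrepancy"
    return "Constitutional Discrepancy"


e_butene = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"
z_butene = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3-"
butane = "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3"
print(classify(e_butene, z_butene))
print(classify(e_butene, butane))
```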
As generated names are not expected to be correct in absolutely all cases, a possible heuristic
for detecting such cases is to look at the consensus of name to structure solutions. For the cases
where OPSIN failed to produce an identical InChI, the results of the other two name to structure
programs were examined to determine whether either of them arrived at the correct InChI. If no
solution could interpret a given name correctly this implies that the name may be suspect.
3.3.1.2 Chemical patents test set
USPTO patent applications that were filed in 2011 were downloaded from Google Patents131.
The patents were filtered to just those containing organic chemistry (IPC code: C07). For each
patent, heading elements were identified and their textual content passed to OSCAR4. Where
OSCAR4 identified exactly one entity of type chemical, the surface of the entity, i.e. the name, was
recorded. In the special case that the name (ignoring case) had been seen previously in the same or
a previous patent, the name was not recorded. This filtering step helps with the problem that not all
names present in headings will be exemplified compound names. A set of 248,846 names was
extracted in this manner. Manual inspection indicated that the names are predominantly systematic
in nature.
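The filtering of headings can be sketched as below. The function `collect_names` is hypothetical, and the `find_chemical_entities` callback is a stand-in for the entity recognition performed by OSCAR4; the toy recogniser in the example exists only to make the sketch runnable.

```python
from typing import Callable, Iterable, List


def collect_names(headings: Iterable[str],
                  find_chemical_entities: Callable[[str], List[str]]) -> List[str]:
    """Record a name when a heading holds exactly one chemical entity not seen before."""
    seen = set()
    names = []
    for heading in headings:
        entities = find_chemical_entities(heading)
        if len(entities) != 1:
            continue  # no entity, or more than one: the heading is skipped
        name = entities[0]
        key = name.lower()  # previously seen names are detected ignoring case
        if key in seen:
            continue
        seen.add(key)
        names.append(name)
    return names


def toy_recogniser(heading: str) -> List[str]:
    """Toy stand-in: treats every word ending in 'ane' or 'ene' as a chemical entity."""
    words = heading.replace(",", " ").split()
    return [w for w in words if w.lower().endswith(("ane", "ene"))]


headings = ["Example 1: Propane", "Example 2: propane", "Preparation of Butane and Ethene",
            "General methods", "Example 3: Toluene"]
print(collect_names(headings, toy_recogniser))
```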
3.3.2 Data obtained
3.3.2.1 ACD/Name generated names
Figure 3-143 Comparison of performance on 29,718 ACD/Name 12.02 generated names
Names were tested as outputted by ACD/Name, with the exception that where present the
string ‘(non-preferred name)’ was removed from the end of names.
3.3.2.2 ChemBioDraw generated names
Figure 3-144 Comparison of performance on 29,414 ChemBioDraw12 generated names
[Charts for Figures 3-143 and 3-144. Percentage axis from 0% to 100%. Bars: OPSIN1.2.0, Marvin5.8.2, ChemBioDraw12 and OPSIN (11/11/08). Categories: Correctly Interpreted, Stereochemical Discrepancy, Constitutional Discrepancy and No Result.]
Names were tested as outputted by ChemBioDraw.
3.3.2.3 Lexichem generated names
Figure 3-145 Comparison of performance on 29,301 Lexichem 2.1.0 generated names
Names were tested as outputted by Lexichem. On one exceptionally long systematic name
Marvin failed to produce a result within 30 minutes necessitating the manual exclusion of that
name.
3.3.2.4 Marvin generated names
Figure 3-146 Comparison of performance on 29,961 Marvin 5.8.2 generated names
[Charts for Figures 3-145 and 3-146. Percentage axis from 0% to 100%. Bars: OPSIN1.2.0, Marvin5.8.2, ChemBioDraw12 and OPSIN (11/11/08). Categories: Correctly Interpreted, Stereochemical Discrepancy, Constitutional Discrepancy and No Result.]
Names were tested as outputted by Marvin.
3.3.2.5 Compounds from headings in USPTO Patents
Figure 3-147 Comparison of recall on 248,846 names extracted from USPTO patents by OSCAR4. Pre-
processed names were the result of passing the names through OPSIN’s pre-processor.
The results in Figure 3-147 are intentionally not presented as a percentage of the size of the
test set as at least 10% of the identified names are expected to be either false positives or contain
insufficient information to generate a connection table. Unlike in the generated names, UTF-8
characters beyond the ASCII subset were frequently encountered e.g. Greek letters (α) and primes
(′). A significant percentage of ChemBioDraw’s failures were purely due to the use of these
characters; hence the names were passed through OPSIN’s pre-processor (Section 3.2.3) to allow
assessment of the level of nomenclature coverage rather than of ChemBioDraw12’s ability to
recognise non-ASCII characters. For names containing characters unrecognised by OPSIN’s
pre-processor the original name was retained.
3.3.3 Discussion
[Chart for Figure 3-147. Vertical axis: number of names for which SMILES were generated, from 0 to 200,000. Bars: ChemBioDraw12, Marvin5.8.2, OPSIN1.2.0 and OPSIN (11/11/08). Series: Names as written and Pre-processed names.]
The results show that OPSIN has consistently high levels of recall (96.2%-99.0%) and precision
(97.9%-99.3%) across all the sets of generated names. While precision as stated is high, many of the
failures may be expected to be the result of the names being incorrect. Table 3-15 shows that the
majority of the names that OPSIN incorrectly interpreted were also incorrectly interpreted by
Marvin and/or ChemBioDraw. In the paper on OPSIN72, an older version of OPSIN achieved precision
in excess of 99.8% on different sets of generated names, from which incorrect and ambiguous names
had been identified and excluded before the precision calculations. Different sets of names were
used here than in the paper because, in the course of creating the paper, the author manually checked
all names that produced discrepant results to determine whether the fault lay with the name. This
analysis allowed the majority of the genuine errors made by OPSIN to be corrected subsequent to
the paper, but as a result these sets cannot be considered unseen test sets.
                                                               Sets of names
                                                               ACDName12.02  ChemBioDraw12  Lexichem2.1.0  Marvin5.8.2
Can be converted correctly by a solution                       4.1%          45.5%          7.7%           17.0%
Can't be correctly converted but can be incorrectly converted  71.8%         51.8%          83.3%          68.4%
Can't be converted by either solution                          24.1%         2.6%           9.0%           14.6%
Table 3-15 Analysis of how the union of ChemBioDraw and Marvin handled the names that OPSIN 1.2.0
produced discrepant results on.
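The three rows of Table 3-15 correspond to a simple consensus classification, sketched below. The function name is hypothetical; results are represented as InChI strings, with None standing for a program that produced no result, and exact string equality stands in for the layer comparison used in this work.

```python
from typing import List, Optional


def consensus_category(expected: str, other_results: List[Optional[str]]) -> str:
    """Classify a name OPSIN handled discrepantly by how the other solutions fared."""
    if any(result == expected for result in other_results):
        return "Can be converted correctly by a solution"
    if any(result is not None for result in other_results):
        return "Can't be correctly converted but can be incorrectly converted"
    return "Can't be converted by either solution"


expected = "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3"
wrong = "InChI=1S/C4H8/c1-3-4-2/h3H,1,4H2,2H3"
print(consensus_category(expected, [wrong, expected]))
print(consensus_category(expected, [wrong, None]))
print(consensus_category(expected, [None, None]))
```

Names falling into the second and third categories are the ones for which the name itself may be suspect.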
OPSIN’s names showed a lower level of agreement with the starting structures when using
names generated by ACD/Name (Figure 3-143), as compared to when using names generated by the
other software. This arises from the use, by ACD/Name, of amino acid names without ᴅ/ʟ prefixes to
describe amino acid components of the structure without defined stereochemistry. The IUPAC
recommendations118(Rule 3AA-3.3) state that the meaning of an amino acid name without the prefix
depends on the context e.g. if the amino acid is known to come from a natural source it may be
assumed to be ʟ whilst if it is known to be synthetic it may be assumed to indicate a racemate. OPSIN,
and indeed the other name to structure solutions tested, assumes the ʟ configuration in all cases
leading to apparent discrepancies in results.
In OPSIN’s publication72 it was found that ChemBioDraw was sensitive to the representation of
superscripts and Greek letters used by other structure to name packages e.g. $a for alpha or ^ to
indicate superscripts. Pre-processing the chemical names to use representations understood by
ChemBioDraw may slightly improve its performance on the ACD/Name, Lexichem and Marvin
generated names.
Across all four sets of names Marvin can be seen to have generated stereochemically
discrepant results in a large percentage of cases. This appeared to a large extent to be caused by
difficulties in its algorithm correctly identifying the stereocentre to which indicated stereochemistry
should be applied. For example ‘(S)-bromo(chloro)fluoromethane’ was interpreted without
stereochemistry whereas ‘bromo(chloro)fluoro-(S)-methane’ was correctly interpreted.
The results on the names extracted from patents (Figure 3-147) also showed excellent
performance from OPSIN, giving significantly higher recall than Marvin. Comparison to
ChemBioDraw is more difficult as, depending on whether or not the names are pre-processed, OPSIN
had either slightly higher or slightly lower recall. Correspondence with a ChemBioDraw developer
indicated that the lack of support for non-ASCII Unicode characters was a bug that would be
corrected in the next version.
The difference between OPSIN’s current performance and the level potentially achievable
with ChemBioDraw is likely to be explained by OPSIN’s lack of support for some areas of
carbohydrate nomenclature as well as ChemBioDraw’s greater leniency in handling names that do
not conform to codified nomenclature practices.
3.4 Implementations
3.4.1 Java library
OPSIN’s main mode of distribution is as a Java library typically including both the core and
InChI modules. The API has been designed to offer convenience methods for the most commonly
required capabilities in conjunction with more advanced configurability. The methods in the public
API of NameToStructure are listed below:
Method                                                           Output
parseToCML(String name)                                          nu.xom.Element
parseToSmiles(String name)                                       String
parseChemicalName(String name)                                   OpsinResult
parseChemicalName(String name, NameToStructureConfig n2sConfig)  OpsinResult
getOpsinParser()                                                 ParseRules
The parseToCML and parseToSmiles methods are convenience methods that allow the direct
conversion of a chemical name to the relevant format, i.e. a CML document and a SMILES string
respectively, using the program’s default options. A CML document is returned as a XOM Element
object allowing in-memory manipulation or trivial serialisation to XML.
Alternatively the output may be an OpsinResult. This records whether name
interpretation was successful, the error message that was returned (if applicable) and the name that
was interpreted. An OpsinResult may be lazily serialised to either CML or SMILES using the class’s
methods.
If greater configurability is desired, a NameToStructureConfig object can be provided
that allows configuration of OPSIN’s options (Table 3-16).
Option | Explanation | Default value
allowRadicals | Should names that formally describe radicals be accepted, e.g. ethyl | false
detailedFailureAnalysis | If a chemical name is uninterpretable, should OPSIN parse it from right to left to attempt to generate a more informative error message | false
Table 3-16 OPSIN’s configurable options
The ParseRules object returned by getOpsinParser allows the parsing of words using
OPSIN’s grammar. This functionality is employed extensively by the OPSIN Document Extractor
(Section 3.4.4) but is not known to be employed elsewhere. Note that generally only a single word
may be parsed at a time e.g. ‘ethyl ethanoate’ will not be fully parsable but ‘ethyl’ or ‘ethanoate’ are
parsable.
To debug OPSIN’s behaviour, an end user may set the Log4J log level to either debug or trace,
depending on the level of detail required.
Library functions for InChI generation reside in the NameToInchi class in the InChI
module. Functions are available for the generation of an InChI with fixed-H layer or a StdInChI from
an OpsinResult. Convenience methods are also available to go directly from a name to either
form of InChI.
The library is available either from the project’s download page on BitBucket132 or from the
Maven central repository.
3.4.2 Command-line interface
When OPSIN is distributed in library form as an executable jar file, execution yields a
command line interface. Flags are available to set all of OPSIN’s configurable options, the desired
output format and verbosity (Figure 3-148). Verbose output corresponds to a Log4J log level of
debug. The same command-line interface is employed regardless of whether the InChI module is
included; hence, to avoid the command-line interface depending on the InChI module, reflection is
used to check for the presence of the InChI functionality on the classpath. The command-line interface may
be used to perform batch processing by piping in a file of chemical names and directing the output
to an appropriate output file.
Figure 3-148 Screenshot of OPSIN command line help dialog showing available flags
3.4.3 OPSIN web service
The OPSIN web service133 provides access to OPSIN’s functionality to convert names to CML,
SMILES and InChI via a convenient web interface. Additionally the web interface can generate
depictions using the Indigo toolkit55. The Indigo toolkit is also used to enrich the CML with generated
2D coordinates.
Requests to the web interface may be made either using a browser, by entering a chemical
name at opsin.ch.cam.ac.uk, or programmatically, by sending requests to opsin.ch.cam.ac.uk/opsin.
Requests may be made using content negotiation or by adding a suitable file extension to the
request (Table 3-17).
Request type | Internet media type | File extension
CML | chemical/x-cml | .cml
CML without 2D coordinates | n/a* | .no2d.cml
SMILES | chemical/x-daylight-smiles | .smi
InChI | chemical/x-inchi | .inchi
Depiction | image/png | .png
Table 3-17 Request types supported by the OPSIN web service. *chemical/x-no2d-cml is accepted but is
not a recognised Internet media type
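The two request styles can be illustrated with a minimal Python sketch that only builds the request and does not send it. The host and the /opsin path come from the text and the media types and extensions from Table 3-17. The http scheme, the layout of the name within the URL and the helper name build_request are assumptions made for illustration and are not taken from the service's documentation.

```python
from urllib.parse import quote

# Host and path are taken from the text; the "<name><extension>" layout and
# the http scheme are assumptions made for this sketch.
BASE_URL = "http://opsin.ch.cam.ac.uk/opsin/"

# Request type -> (Internet media type, file extension), copied from Table 3-17.
FORMATS = {
    "CML": ("chemical/x-cml", ".cml"),
    "CML without 2D coordinates": ("chemical/x-no2d-cml", ".no2d.cml"),
    "SMILES": ("chemical/x-daylight-smiles", ".smi"),
    "InChI": ("chemical/x-inchi", ".inchi"),
    "Depiction": ("image/png", ".png"),
}


def build_request(name, request_type, use_extension=True):
    """Return (url, headers) for a name to structure request (hypothetical helper)."""
    media_type, extension = FORMATS[request_type]
    encoded = quote(name, safe="")  # percent-encode spaces, brackets, etc.
    if use_extension:
        # Style 1: select the output format with a file extension.
        return BASE_URL + encoded + extension, {}
    # Style 2: select the output format by content negotiation.
    return BASE_URL + encoded, {"Accept": media_type}
```

Either tuple could then be handed to any HTTP client; the sketch deliberately stops short of performing the request.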
The web service is employed by the Chemistry Add-in for Word134, a joint development
between the Unilever Centre and Microsoft, as a means of converting chemical names to chemical
objects.
The web service’s logs were analysed over a one week period in early December 2011, showing
requests from 171 unique IP addresses. Usage patterns varied from single names all the way through
to automated requests for thousands of names. Analysis of failing web service requests has revealed that
the vast majority of failures have been caused by unrecognised trivial names (e.g. drug names),
spelling mistakes, non-English chemical names and non-names (e.g. SMILES, molecular formulae
etc.). The few genuine failings have proven of some use in finding “bugs” and areas of unsupported
nomenclature.
When a failure is encountered the web service employs OPSIN’s reverse parsing to attempt to
identify the exact part of a name that is uninterpretable in the error response. Users of the service
have reported this to be useful in identifying and correcting errors in chemical names135.
3.4.4 OPSIN Document Extractor
The OPSIN Document Extractor136 attempts to find all sequences of words that are parsable by
OPSIN. This is assumed to indicate that, with a high degree of confidence, the identified strings are
chemical names. The program works as follows on a string of text:
1. Whitespace tokenisation to form an array of words. The character indices of these words in the
   original string are recorded.
2. OPSIN’s pre-processor is employed to generate an array of normalised words which will be
   operated on henceforth.
3. Identification of stop words e.g. ‘on’, ‘one’, ‘at’. These are English words that can also be the
   ending of chemical names (often German chemical names) and should be prevented from
   forming chemical names.
4. The words are parsed by OPSIN in pairs. Depending on whether or not OPSIN believes a word to
   be interpretable on its own, the program may add one or both words to a buffer of successfully
   parsed name fragments e.g. ‘ethyl benzene’ would be consumed in two cycles but ‘benzoic acid’
   or ‘chloral hydrate’ would be consumed as one.
5. If a pair of words is partially interpretable and the point of failure does not occur at a word
   boundary, spaces are removed until either no improvement in the length of name that is
   interpretable is noticed or the chemical name ends at a word boundary.
6. As OPSIN knows the role of chemical words and whether they are valid on their own, intelligent
   choices can be made as to whether space removal should be attempted. For example ‘benzene
   sulfonamide’ should be ‘benzenesulfonamide’ but ‘pyridine acetic acid’ should be interpreted as
   is, rather than treating the acetic acid as a conjunctive substituent of the pyridine ring.
7. Punctuation at the end of a chemical name, or a bracketed section immediately following a
   chemical name is ignored and indicates the chemical name is complete. A chemical name is also
   indicated as being complete if a subsequent word cannot be interpreted as being chemical or
   the end of the array of words is reached.
8. Identified chemical names are classified as “complete”, “part”, “family” or “polymer”. “part”
   names are names classified by OPSIN as substituents. “family” names are classed by OPSIN as
   functional terms or are names that end in an ‘s’ which could not be interpreted by OPSIN.
   “polymer” names start with the functional term ‘poly’ or ‘oligo’.
9. An unbalanced opening bracket at the start of a chemical name, or an unbalanced closing
   bracket at the end of a chemical name, is removed. Balanced brackets surrounding a chemical
   name are removed. A terminal ‘-’ or ‘,’ is removed e.g. ‘ethyl-’ is recognised as ‘ethyl’.
10. The output is a list of identified chemical names which can be queried for the normalised
    chemical name, the raw text, the chemical name classification, the start and end character
    indices within the original string and the start and end positions within the array of words.
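Three of the simpler steps above (whitespace tokenisation with recorded character indices, stop word identification, and the trimming of brackets and terminal punctuation) can be sketched in Python as follows. This is an illustrative sketch and not the OPSIN Document Extractor's code: the function names are hypothetical, the stop word set contains only the three examples given in the text, and the parsing steps that rely on OPSIN's grammar are omitted.

```python
import re

# Only the three examples quoted in the text; the real list is assumed to be longer.
STOP_WORDS = {"on", "one", "at"}
OPEN, CLOSE = "([{", ")]}"


def tokenise(text):
    """Split on whitespace, recording (word, start, end) character indices."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def is_stop_word(word):
    """True for English words that must not be allowed to form chemical names."""
    return word.lower() in STOP_WORDS


def _balance(name):
    """Number of opening brackets minus number of closing brackets."""
    return sum(c in OPEN for c in name) - sum(c in CLOSE for c in name)


def _wrapped(name):
    """True if one balanced pair of brackets surrounds the whole name."""
    if len(name) < 2 or name[0] not in OPEN:
        return False
    if CLOSE[OPEN.index(name[0])] != name[-1]:
        return False
    depth = 0
    for i, c in enumerate(name):
        if c in OPEN:
            depth += 1
        elif c in CLOSE:
            depth -= 1
        if depth == 0 and i < len(name) - 1:
            return False  # the first bracket closed before the end of the name
    return depth == 0


def trim_name(name):
    """Remove a terminal '-' or ',', surrounding brackets and unbalanced edge brackets."""
    changed = True
    while changed and name:
        changed = True
        if name[-1] in "-,":
            name = name[:-1]
        elif _wrapped(name):
            name = name[1:-1]
        elif name[0] in OPEN and _balance(name) > 0:
            name = name[1:]
        elif name[-1] in CLOSE and _balance(name) < 0:
            name = name[:-1]
        else:
            changed = False
    return name
```

Note that brackets belonging to the name itself, as in ‘(2-chloroethyl)benzene’, are left untouched because they are balanced and do not surround the whole name.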
As the program knows whether punctuation is valid as part of a chemical name, individual
chemical names may still be extracted from lists of chemical names even in the presence of
erroneous whitespace (Table 3-18).
Input: ‘indane, 1,2, 3,4- tetrahydroquinoline, 3, 4-dihydro-2H-1, 4-benzoxazine, 1,5-naphthyridine, 1, 8- naphthyridine’
Identified chemical name | Text value
indane | indane
3,4-tetrahydroquinoline | 3,4- tetrahydroquinoline
3,4-dihydro-2H-1,4-benzoxazine | 3, 4-dihydro-2H-1, 4-benzoxazine
1,5-naphthyridine | 1,5-naphthyridine
8-naphthyridine | 8- naphthyridine
Table 3-18 Output from OPSIN Document Extractor on a list of chemical names containing erroneous
whitespace
The OPSIN Document Extractor is utilised as a tagger for use with ChemicalTagger (as
described in Section 4.4.5.6) and as an aid in name type assignment (as described in Section 4.5.1.4).
It should be emphasised that, whilst the approach taken by the OPSIN Document Extractor is rather
brute force in nature, it is still typically an order of magnitude faster than performing entity
recognition with OSCAR4. Hence, using the OPSIN Document Extractor as a complement to OSCAR4,
as is done in the work on reaction extraction described in Chapter 4 of this thesis, may be done with
minimal effect on performance.
3.5 Areas for future work
3.5.1 Vocabulary
Chemical name to structure conversion is impossible when vocabulary unrecognised by the program is
encountered. Hence, an obvious improvement to OPSIN is the addition of more terms to its
vocabulary. This is especially important in the area of natural products, for which even the majority
of IUPAC recommended alkaloid and terpenoid trivial names, listed in the natural product
recommendation127 (Appendix), are yet to be added. A site offering an extensive list of trivial names
with corresponding systematic names and Japanese names was identified but unfortunately there
was insufficient time to add more than the acyclic trivial names137. Addition of trivial names is
complicated by the question of what category in the grammar to add the name to e.g. can the name
have suffixes, if so which suffixes? Other concerns are getting the numbering of the compound
correct when the compound has an accepted numbering system, and, especially in the case of
natural products, making sure the structure has correct stereochemistry. The stereochemistry
problem is made more difficult by the proliferation in databases of structures with slightly different
stereochemistry that have become erroneously associated with the same trivial name138.
The addition of vocabulary can also assist with the problem of trivial names that are
composed of understood morphemes which are then deconstructed into their apparent
morphemes. This poses a problem as the apparent morphemes may not precisely describe the
structure or may be wholly misleading (Figure 3-149). The addition of appropriate trivial names
allows the systematic interpretation to be overridden.
Figure 3-149 Methanophenazine (left) and a systematic interpretation of it (right)
3.5.2 Carbohydrate nomenclature
OPSIN currently possesses support for carbohydrates with the suffix ‘ose’ optionally infixed
with a ring size specifier to give suffixes such as ‘pyranose’. Adding support for more suffixes
especially those that allow groups named by carbohydrate nomenclature to be used as substituents
would yield a significant improvement in recall. Adding further suffix support is not entirely trivial
due to only some suffixes being locantable and some suffixes applying to multiple atoms but
extension of OPSIN’s existing mechanisms for handling similar cases should be sufficient.
Oligosaccharide nomenclature, which employs arrows to indicate the linkage between
saccharides, could be relatively trivially supported by internal conversion to normal locants e.g.
α-ᴅ-glucopyranosyl-(1→4)-β-ᴅ-glucopyranose could be internally converted to:
O4-(α-ᴅ-glucopyranosyl)-β-ᴅ-glucopyranose
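The proposed conversion can be sketched as a single regular expression substitution. The Python sketch below is hypothetical: it is not part of OPSIN, it handles only a name containing one arrow linkage, and it assumes the linkage is written as ‘(1→4)’ between a glycosyl substituent ending in ‘yl’ and the parent saccharide.

```python
import re

# <substituent ending in 'yl'>-(<from>→<to>)-<parent saccharide>
LINKAGE = re.compile(r"^(?P<substituent>.+yl)-\(\d+→(?P<locant>\d+)\)-(?P<parent>.+)$")


def arrow_to_locant(name):
    """Rewrite a single arrow linkage as an O-locant on the parent saccharide."""
    match = LINKAGE.match(name)
    if match is None:
        return name  # no arrow linkage recognised: leave the name unchanged
    return "O{locant}-({substituent})-{parent}".format(**match.groupdict())
```

Names with several linkages would need the substitution to be applied from the innermost saccharide outwards, which this sketch does not attempt.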
3.5.3 Inorganic nomenclature
Inorganic nomenclature is mostly unsupported by OPSIN. The reason for this stems from two
problems:
1. Datively bonded substituents are often named by the name of the group e.g. ‘amine’,
   ‘pyridine’ etc. To classify these as substituents rather than parent groups one needs to
   know that they are followed by an inorganic parent. Treating them as parent groups
   will not work as they may be preceded by ligands that are expressed as substituents
   e.g. ‘chloro’ which will be referring to the metal rather than the datively bonded
   substituent e.g. ‘dichlorodipyridine platinum(II)’.
2. OPSIN’s current internal format does not have good support for representing dative
   bonds and other non-covalent interactions e.g. the interaction between the π
   electrons and the iron atom in ferrocene. The same is also true to varying extents of
   the formats to which OPSIN writes.
The issues of representing inorganics are dealt with by Clark139 who recommended the
introduction of a zero-order bond to the commonly used MDL chemical table file formats. This
would mostly solve the problems of representation although the exact semantics of the interactions
would be lost. It would also be insufficient to correctly represent systems with three-centre
two-electron covalent bonds e.g. diborane. In these systems representing one bond as a single bond and
the other as a zero-order bond artificially introduces asymmetry.
Until support for file formats that allow better specification of inorganics becomes more
widespread, improving support for inorganic nomenclature is unlikely to benefit most
cheminformatics applications.
3.5.4 Stereochemistry
Many forms of stereochemistry remain unsupported including endo, exo, syn, anti, r, s, e, z
and α/β stereochemistry on arbitrary ring systems. Adding rigorous detection for pseudo-
asymmetric centres would be an important precursor to further improving stereochemistry handling
by OPSIN. Currently only a limited subset of the pseudo-asymmetric centres are detected. For r, s, e
and z, extension of OPSIN’s implementation of the CIP rules would also be required.
3.5.5 Nomenclature variants
Even if OPSIN were to support all codified nomenclature, there would still remain the long tail of
nomenclature variants that appear, both intentionally and unintentionally, in “real world” use. This
is an unbounded problem as the chemistry community can always think of new ways to construct
chemical names that will be unexpected to a computer program. In this respect an approach like
that taken by Name=Struct may be advantageous over a grammar-based approach although this
must be weighed against the increase in erroneous structure conversions which will inevitably occur
when odd nomenclature is encountered.
3.5.6 Detection and handling of ambiguous names
OPSIN has not been designed to detect ambiguous chemical names and hence introducing
such functionality would involve substantial changes especially if alternative structure
interpretations were to be enumerated. The codebase now includes code for atom environment
detection and indeed this is actually employed for detection of ambiguity in the very specific case of
determining whether an ester interpretation is desired (Section 3.2.15.2). The application of this
technique to substitutive nomenclature operations would be sufficient to detect a significant
number of cases of ambiguity although structural ambiguity may be introduced by many other
nomenclature operations for which a similar analysis would also need to be performed.
A significant complicating factor is determining whether a name that is formally ambiguous
should be considered unambiguous by convention. In the example of p-aminomethylbenzene-
sulfonamide (Figure 3-150) the name is formally ambiguous as ‘aminomethyl’ is not bracketed. In
practice there is only one likely interpretation, as the name would otherwise contain a methyl group
that could be placed at multiple positions on the benzene ring whilst still being consistent with the
name.
Figure 3-150 p-aminomethylbenzene-sulfonamide
3.5.7 Detection of typographical errors
The problem of detecting and correcting typographical mistakes was investigated, during the
course of this project, yielding a proof of concept system. This worked by parsing the chemical name
to the point at which no further tokens could be found then determining if one operation could
change the chemical name such that it would match one of the tokens present in the allowed token
classes. These operations were substitution, insertion, deletion and transposition, which have been
found to account for 80% of typographical errors140. Due to the existence of morphemes in the same
token class that differ by only a single letter e.g. ‘amino’, ‘imino’, only substitutions between letters
that were adjacent on a US keyboard were allowed. Where multiple suggestions were
possible, a heuristic that chose the token appearing most often in a training corpus was
invoked.
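The candidate selection just described can be illustrated with a small Python sketch. It is not the proof of concept system itself: it compares an unparsable fragment against a flat dictionary of tokens with corpus counts instead of walking the grammar's token classes, the keyboard adjacency is a simple approximation of a staggered US layout, and all names are hypothetical.

```python
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def _build_adjacency():
    """Approximate key adjacency on a staggered US keyboard (letters only)."""
    adjacency = {key: set() for row in ROWS for key in row}

    def link(a, b):
        adjacency[a].add(b)
        adjacency[b].add(a)

    for r, row in enumerate(ROWS):
        for i, key in enumerate(row):
            if i + 1 < len(row):
                link(key, row[i + 1])  # horizontal neighbour
            if r + 1 < len(ROWS):
                below = ROWS[r + 1]
                for j in (i - 1, i):  # the two keys diagonally below
                    if 0 <= j < len(below):
                        link(key, below[j])
    return adjacency


ADJACENT = _build_adjacency()


def is_single_edit(wrong, right):
    """True if one allowed operation turns `wrong` into `right`."""
    if wrong == right:
        return False
    if len(wrong) == len(right):
        diffs = [i for i in range(len(wrong)) if wrong[i] != right[i]]
        if len(diffs) == 1:  # substitution, restricted to adjacent keys
            i = diffs[0]
            return right[i] in ADJACENT.get(wrong[i], ())
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:  # transposition
            i, j = diffs
            return wrong[i] == right[j] and wrong[j] == right[i]
        return False
    if abs(len(wrong) - len(right)) == 1:  # insertion or deletion
        short, longer = sorted((wrong, right), key=len)
        return any(longer[:i] + longer[i + 1:] == short for i in range(len(longer)))
    return False


def suggest(wrong, token_counts):
    """Pick the most frequent token reachable from `wrong` by a single edit."""
    candidates = [token for token in token_counts if is_single_edit(wrong, token)]
    return max(candidates, key=token_counts.get) if candidates else None
```

With this restriction ‘amino’ and ‘imino’ are never suggested for one another, since ‘a’ and ‘i’ are not neighbouring keys.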
This work yielded promising results, the only significant drawback being that, if the
typographic mistake was before the point in the name at which parsing failed, the mistake could not
be corrected as backtracking through the grammar’s finite state machine had not been
implemented. This work was not taken forward primarily due to concerns that the results would still
not be accurate enough for automated use in text mining. Nonetheless adding such functionality to
applications such as the OPSIN web service could be useful.
3.5.8 Foreign language support
OPSIN includes some support for German names purely by making the terminal ‘e’ at the end
of many chemical names optional. An experiment with adding further German vocabulary flagged up
an ambiguity that would be introduced in the parsing of ‘chloro-’. With the addition of the German
‘chlor’ this could then also be parsed as [chlor][o-] where ‘o-’ is an ortho locant, hence the German
specific vocabulary is currently not enabled to avoid introducing ambiguity into unambiguous English
names.
Figure 3-151 English: 2-Chloropyridine; German: 2-Chlorpyridin (unsupported due to ‘chlor’ not being in
vocabulary, ‘pyridin’ is allowed as OPSIN considers the ‘e’ optional)
A small proof of concept attempt was made to support Chinese chemical names indicating
that for Chinese many English morphemes could be simply replaced with Chinese characters due to
the underlying grammar being mostly the same. In some areas though the syntax was found not to be
identical e.g. alkanes are ordered by hundreds, then tens, then units whilst in English IUPAC names
the ordering is reversed. Modifying OPSIN’s grammar or enumerating such systematic constructions
are not especially elegant solutions. In languages such as French the ordering of words may be
different e.g. ‘acide formique’ which poses further problems.
Sayle141 described a method whereby names in foreign languages could be translated from,
and to, English through a mixture of word order rearrangement and morpheme string substitutions
(some of which were context sensitive). This is expected to be the more elegant solution although
the inherent disambiguation that a grammar-based system like OPSIN provides may give more
elegant solutions in cases where context sensitive substitutions are required.
3.6 Conclusions
This project has resulted in the creation of a fast, precise and extensible chemical name to
structure interpretation algorithm. By employing a strict grammar, OPSIN can elegantly fail on
chemical names that include nomenclature that is not yet supported. OPSIN is known to be
employed by AMBIT142, Cinfony143, the National Cancer Institute’s Chemical Identifier Resolver144,
Bioclipse145, LICSS146, OCMiner147, Digital Science’s SureChem36 and at the International Union of
Crystallography148, Dupont149, AstraZeneca150 and IBM151. This wide range of users encompasses text
mining efforts and more general applications in which name to structure can be a time saving
mechanism.
Newly synthesised compounds, and to a lesser extent reagents, are often referred to by
systematic names; hence the success of the work described in the next chapter on extracting
reactions from patents was only possible due to the high recall and precision afforded by OPSIN.
Chapter 4 Extraction of Chemical Reactions from the Patent
Literature
4.1 Introduction
Reaction databases are primarily employed by synthetic chemists to find ways to perform a
particular synthesis or synthetic step. They may already know the reaction they are interested in
performing and hence want to investigate the conditions employed in successful instances of the
reaction. Alternatively they may be interested in identifying reactions that would, or have the
potential to, lead to the formation of a particular moiety.
The largest reaction databases are the commercial CASREACT152,153 and Reaxys154,155 databases
each containing in excess of 30 million reactions. There are many smaller commercial databases e.g.
SPRESI156,157, Current Chemical Reactions158, Science of Synthesis159 and SORD160 (free to academics).
As compared to structural databases, where freely accessible databases like ChemSpider and
PubChem rival the size of the leading commercial databases, freely accessible reaction databases are
currently comparatively small in size. Such databases include the journal Organic Syntheses161 and
WebReactions162 which uses the ChemReact database, a subset of the SPRESI database.
Reaction databases are generally populated by manual abstraction of reactions from the
chemical literature. This is highly time consuming work and hence, due to the associated costs, large
scale abstraction is only practical for the largest commercial databases. Automated techniques for
reaction extraction have the potential, where primary literature is readily text minable as is the case
for patents, to allow the creation of large reaction databases with extremely low costs. Such
techniques may also find utility in expediting work to manually extract reactions from the literature
by providing crude extracted reactions which could then be tweaked by human curators.
This chapter describes the development of an open source system for the automatic
extraction of reactions from the chemical literature, especially patents. The developed system was
presented at the spring 2012 ACS conference163. The description of the system, unless specified
otherwise, refers to v1.0 of the developed software.
4.2 Previous attempts at text mining chemical reactions
4.2.1 Chemical Abstracts Service
Blower et al.164–167 from the Chemical Abstracts Service published a series of papers spanning
the period 1983-1990 in which they discuss automated methods for extracting reactions from the
American Chemical Society’s Journal of Organic Chemistry. Their system modelled experimental
sections as being formed of a heading, a synthesis, a workup and a characterisation section with only
one resultant product. This model was found to describe over half the experimental sections they
encountered.
The original system165,166 worked by tokenising on common delimiters with appropriate rules
to differentiate between hyphens within chemical names and within other words. Words were
assigned part of speech tags or as chemical words by a mixture of dictionary lookup and looking at
the stem and suffix of words. A rule based system was used to disambiguate in cases where context
is necessary to accurately determine the part of speech. Assigning roles to reagents was partially
achieved using a “word expert” system that would be able to use surrounding words e.g. ‘in’ or
‘under’ to assign a likely role to each reagent. The words preceding and following each reagent were
then scanned for quantities which were associated with the appropriate reagent.
The discourse (heading/synthesis/workup/characterisation) was determined by a set of
criteria. The heading was determined by the absence of a verb in the words that made it up. The
synthesis section was not identified directly but instead assumed to be the content between the
heading and workup section. The workup section was identified by the presence of words from a list
of common operations performed at this stage e.g. crystallise, wash etc. The characterisation
section was identified by the presence of acronyms commonly associated with characterisation e.g.
mp, m/e etc.
The 1990 paper167 described a more refined approach taking inspiration from the previously
described system. Partial parsing of sentences is achieved using Augmented Transition Network
parsing; a parsing method based on the use of a finite state automaton that can accept words as
transitions and that allows nondeterministic transitions hence allowing recognition of context-free
languages. The parsing attempts to identify substance information, references to procedures,
time/temperature data, verb phrases and characterisation data. The system included some support
for general procedures (where a template for a reaction is given, optionally followed by specific
instances of the reaction in which not all reagents are specified), analogous syntheses (where a
compound is obtained in the same way as a previously described procedure) and parallel synthesis
(where the synthesis of several analogous compounds is given at once). The program was originally
intended to assist in abstracting for CASREACT. However, it was not deemed sufficiently accurate,
being able to produce “usable” results from 80-90% of simple synthesis paragraphs and only 60-70%
of the more complex cases.
While the reaction extraction system developed as part of this project is more sophisticated in
terms of chemical entity recognition and chemical entity resolution (something not even attempted
by the CAS program), the range of experimental paragraph types supported by that program still goes
beyond the scope of what has been developed for this project.
4.2.2 University of Cambridge
Jessop et al.168 developed a system, named PatentEye, which was employed to extract
reactions from EPO patents. Verification of the structures of reaction products was attempted
through comparison of the result from a chemical name to structure algorithm (OPSIN), with those
obtained from an image to structure algorithm (OSRA). The structure could also be checked for
consistency with extracted NMR and mass spectra. With the version of OSRA utilised, the results
from image to structure conversion were insufficiently precise to allow verification of the products
with only 34% of a set of 200 images being converted exactly to the human reproduced structures.
Although the majority of NMR spectra could be successfully extracted, exactly predicting the peaks
in an NMR spectrum is a complicated process. This made it difficult to be sure whether or not a
spectrum was consistent with a given product structure. While a case was firmly made for the utility of
capturing spectral information, the case for using this information, or image to structure results, to
verify product information was less clear considering the relatively high accuracy of chemical name
to structure software; hence this was not pursued in the current work.
The workflow (sectioning of a document into experimental sections, derivation of chemical
structures using OPSIN/OSCAR/OSRA and application of ChemicalTagger to identify reagents and
assign them roles and quantities) is broadly similar to the work described in this chapter. While, with
the exception of the paragraph classifier (Section 4.4.4), there is no code in common between the
projects, this project can be considered a spiritual successor. The most significant difference between
these projects is that the current work puts a far greater emphasis on the extracted structures. The
structures are used to assist in role assignment and facilitate the atom-mapping step that checks
that a reaction is feasible.
4.2.3 University of Toronto
Since 2001, ChemDraw binary CDX files and MDL Molfiles have been available with USPTO patents.
The CDX files are submitted by the patent applicant and hence offer another source of information
from which reactions may be extracted. Work at the University of Toronto has culminated in the
production of the SCRIPDB database of structures derived from these CDX files169. As of the end of
2010, SCRIPDB contained 10,840,646 molecule instances (molecules were de-duplicated on a per
patent basis). CDX files may also contain an indication that compounds are involved in a reaction step
and the relationship between the reaction steps. 341,764 reaction steps were identified up to the
end of 2010. Correspondence with the author indicated that the number of reactions present in the
CDX files may be potentially up to double these values due to the CDX file in many cases having the
appropriate graphical elements (e.g. reaction arrow) but lacking the semantic indication that a
reaction is described.
4.3 Corpus choice
USPTO patents were chosen for this task due to the ease of acquiring large numbers of them
through Google Patents131 and due to the absence of optical character recognition induced noise in
post 1976 patents. Whilst USPTO patent applications are used throughout this chapter, the
described system would be equally applicable to patent grant text and is also known to work with
recent EPO patents due to the same XML tags being employed to designate headings and
paragraphs. For evaluating the effect of changes and identifying areas of weakness in the reaction
extraction system, a set of 106 patents that had been manually ascertained to contain reactions was
formed from the USPTO patent applications for the first week of 2008. Patents from that week were
not used when testing the final system.
4.4 Sectioning the relevant text within a patent
4.4.1 Archetypal experimental chemistry section
Experimental chemistry sections, whilst still being free text, are usually arranged in a
predictable manner. Typically, they start with a heading indicating the compound to be synthesised,
followed by a description of the synthesis, the workup steps undertaken and finally the
characterisation of the compound. Where the synthesis of a compound necessitates the synthesis of
intermediate compounds, typically each step of the synthesis is described separately with the final
step giving the overall target compound. An example of the first step of an experimental section is
shown in Figure 4-1, with its comprising sections annotated. A paragraph number is associated with
each paragraph in USPTO and EPO patents and may be used to uniquely identify a paragraph within
a given patent.
Figure 4-1 The start of a typical experimental section from a patent. The key features are annotated.
4.4.2 Sectioning workflow
Once a patent has been read in, the first challenge is to identify the experimental sections,
which entails discriminating experimental chemistry text from non-experimental chemistry text. If a
section is formed of multiple steps these must be associated with their parent section for the
purpose of later allowing anaphora that reference particular sections/steps to be resolved. This
process is shown schematically in Figure 4-2.
[Figure 4-1 annotation labels: paragraph number, section heading, section target compound, step target compound, synthesis, characterisation, workup, step identifier]
Figure 4-2 Schematic of processes employed in the segmentation of a document into steps, step headings
and section headings
4.4.3 Identifying paragraphs and headings
The majority of headings and paragraphs are identifiable in the XML provided by the USPTO
and EPO patent offices by the use of the element names heading and p. Headings that are present
at the start of paragraphs are only detected after chemical tagging (cf. Section 4.4.6). Empirically it
was found that paragraphs with ids starting with ‘h-’ followed by a number were often subheadings.
Paragraphs matching this criterion were considered as sub-headings rather than as paragraphs when
both a new line character was absent and ChemicalTagger found them to contain either a procedure
name or a chemical name.
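The identification rules in this section can be sketched in Python as below. The element names heading and p and the ‘h-’ id prefix come from the text. The attribute name id, the example ids and the looks_chemical callback, which stands in for ChemicalTagger finding a procedure name or chemical name, are assumptions made for illustration, and the sketch is not the code used in this work.

```python
import re
import xml.etree.ElementTree as ET

# Paragraph ids starting with 'h-' followed by a number were often sub-headings.
SUBHEADING_ID = re.compile(r"h-\d+")


def classify_blocks(xml_text, looks_chemical):
    """Yield (kind, text) for each heading or p element, in document order."""
    for element in ET.fromstring(xml_text).iter():
        text = "".join(element.itertext())
        if element.tag == "heading":
            yield "heading", text
        elif element.tag == "p":
            is_subheading = (
                SUBHEADING_ID.match(element.get("id", "")) is not None
                and "\n" not in text       # no new line character
                and looks_chemical(text)   # stand-in for ChemicalTagger
            )
            yield ("subheading" if is_subheading else "paragraph"), text
```

Headings that occur at the start of a paragraph are outside the scope of this sketch, as they are only detected after chemical tagging.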
4.4.4 Paragraph classification
When a paragraph is encountered, determination of whether or not it is an experimental
chemistry paragraph is made using a Naïve-Bayes classifier. This classifier is that previously described
by Jessop et al.168. The classifier was trained by splitting a manually classified corpus of paragraphs
evenly between training and testing. Once trained in this way, the classifier correctly identified
96.6% of experimental paragraphs as experimental and 89.9% of non-experimental as non-
experimental in the test set.
For this work, the entire corpus of paragraphs was used to train the Bayesian classifier. Leave-
one-out cross-validation gave results of 96.6% for identifying experimental paragraphs and 90.7% for
non-experimental paragraphs, indicating that the performance of this classifier is likely to be
negligibly better than that of the one employed by Jessop.
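Leave-one-out cross-validation of a Naïve-Bayes classifier can be illustrated with a small sketch. It is not the classifier of Jessop et al.: the bag-of-words features, the add-one smoothing, the function names and the input format are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label) pairs. Returns the counts needed for prediction."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(docs):
    """Train on all documents but one, test on the held-out one, repeat for each."""
    correct, totals = Counter(), Counter()
    for i, (tokens, label) in enumerate(docs):
        model = train(docs[:i] + docs[i + 1:])
        totals[label] += 1
        if predict(model, tokens) == label:
            correct[label] += 1
    return {label: correct[label] / totals[label] for label in totals}
```

The function returns one accuracy per class, which corresponds to the way the two percentages are reported above (experimental and non-experimental paragraphs separately).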
4.4.5 Chemical tagging
The text of both headings and paragraphs are presented to ChemicalTagger to be marked up.
The general operation of ChemicalTagger is described in Section 2.9. The output from
ChemicalTagger is the primary input to the reaction extraction workflow and hence much effort has
been made as part of this project to improve the output of ChemicalTagger. By improving
ChemicalTagger it is also hoped that any other applications that rely on ChemicalTagger may benefit
from the improvements that have been implemented.
4.4.5.1 Improved tokenisation
ChemicalTagger has a tokeniser interface which, for experimental chemistry text, is best
served by an implementation based on OSCAR4’s tokeniser. Improvements were made to OSCAR4’s
tokeniser, including the addition of more common abbreviations and the correction of cases in which
chemical entities were erroneously split on hyphens and colons (Table 4-1).
Input OSCAR 4.0.2 OSCAR 4.1
conc. [conc][.] [conc.]
2,2':6',2''-terpyridine [2,2]['][:][6',2''-terpyridine] [2,2':6',2''-terpyridine]
NH4OH(aq) [NH4OH(aq)] [NH4OH][(][aq][)]
D-glycero-D-manno-heptose [D-glycero-D-manno][-][heptose] [D-glycero-D-manno-heptose]
Table 4-1 Examples of improvements made to OSCAR’s tokenisation
4.4.5.2 Improved robustness of sentence parser
When the ANTLR3-generated parser encounters input that is unacceptable to the grammar,
whether due to being unlexable or not conformant to the grammar, input is skipped until an
acceptable token may be consumed. The unrecognised input in such scenarios is, depending on the
version of ChemicalTagger, either ignored completely or captured in an UnmatchedPhrase. The
value of this element is the interleaved concatenation of the tokens and their tag values, i.e. adjacent
tokens may have been merged and the relationship between tags and tokens has been lost for the
affected tokens. This undesirable behaviour can also lead to unexpected element content if, whilst in
a rule, no suitable input can be found, e.g. a MOLECULE element without any elements
corresponding to a chemical name.
To address this problem, the lexer was simplified to a whitespace tokeniser and the
Unmatched alternative was expanded to cover all tags present in the grammar. The grammar is
ordered such that this alternative is only tried once all other rules for what a Sentence may
contain have failed (Figure 4-3).
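[Figure 4-3: ChemicalTagger XML output; the element markup was lost in extraction and only the token text ‘but’, ‘not’, ‘benzene’ survives]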
Figure 4-3 Example of output from a phrase with tokens that may only be recognised by falling back to
the Unmatched rule. The unexpected tokens are present in the output and remain associated with their tags.
4.4.5.3 Recognition of new concepts
Additions to ChemicalTagger’s regex tagger and chemical sentence parser facilitated the
recognition of yields, experimental procedures, chemical compound anaphora, pH conditions, the
number of equivalents of a compound used and the physical state of a compound. To allow the
detection of procedure/step names at the start of a paragraph that can be assumed to have that
purpose only from context e.g. ‘1)’, the grammar contains rules for detecting such cases that are
applied specifically to the first phrase of input.
4.4.5.4 Improved recognition of existing concepts
Significantly more variants of units used to define quantities associated with reagents are now
recognised. For example, improved tokenisation and recognition of the non-standard spelling ‘mole’
has corrected the exemplar issues given by Jessop170. The vocabulary for other terms recognised by
ChemicalTagger e.g. yield verbs, has also been improved.
The recall and precision of reagents/products is affected most by the MOLECULE and
UNNAMEDMOLECULE grammar rules. The former detects chemical entities with an associated name
whilst the latter detects chemical entities that are defined purely by an anaphora. In both cases data,
especially quantities such as amounts and volumes, must be contained within the rule so that the
grammar places them as children of the MOLECULE/UNNAMEDMOLECULE element, hence showing the
association. Significant effort has been put into improving the coverage of these rules to
mitigate problems with entities either not being recognised or not being associated with the
quantities that refer to them (cf. Table 4-2).
Input | ChemicalTagger rev 166 (14/1/2011) | ChemicalTagger 1.3
sodium hydroxide solution (50ml) | [XML output; element markup lost in extraction] | [XML output; element markup lost in extraction]
title compound as a colourless solid (52 mg, 23% yield) | [XML output; element markup lost in extraction] | [XML output; element markup lost in extraction]
Table 4-2 Comparison of output from an older version of ChemicalTagger and the improved version
4.4.5.5 Improved action phrase assignment
The noun forms of certain verbs may in some cases be used to indicate an action e.g.
‘purification by gas chromatography’. Such phrases are now annotated with the action the noun
confers; in this case the phrase is a “Purify” phrase.
4.4.5.6 Improved extensibility
In collaboration with the primary author of ChemicalTagger, Dr. Hawizy, changes were made
to allow the program to accept an arbitrary number of taggers rather than having a hard coded
expectation of an OSCAR4 tagger, regex tagger and a POS tagger. This is achieved by passing a list of
taggers to ChemicalTagger, in which the position of the tagger in the list determines its priority.
For this work, the default regex tagger and POS tagger were used in conjunction with an
OSCAR4 tagger customised with a small stop word list, an OPSIN tagger and a trivial chemical name
tagger. The stop word list consisted of a small set of common mistakes that OSCAR4 was found to
make on the evaluation set of patents.
The OPSIN tagger is an implementation of the OPSIN Document Extractor and is included
primarily to identify cases where OSCAR4 might otherwise identify two entities within a single
chemical name. A common cause of such problems is the presence of erroneous whitespace causing
an apparently unmatched bracket to be tokenised separately from the rest of a chemical name. The
OPSIN Document Extractor is presented with the untokenised input string and can often recognise
chemical names containing erroneous whitespace as well as some complex chemical names
incorrectly recognised by OSCAR4.
The trivial chemical name tagger is designed to recognise chemical names that OSCAR4 does not
currently recognise and those for which the regex tagger produces competing tags, e.g. ‘Lawesson's
reagent’, in which ‘reagent’ would otherwise be tagged as an NN-CHEMENTITY.
The prioritisation of the taggers is summarised in Table 4-3.
Priority | Tagger | Description
1 (highest) | Trivial Chemical | Finds chemicals that neither OPSIN nor OSCAR4 recognise
2 | OPSIN | Finds chemicals that are parsable by OPSIN
3 | Regex | Tags keywords e.g. yield words
4 | OSCAR4 | Finds chemicals using a machine-learning approach
5 (lowest) | OpenNLP | Tags part of speech
Table 4-3 Taggers employed and their priority
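The effect of position-based priority can be sketched as follows. This is a simplified illustration and not ChemicalTagger's API: the function names, the representation of a tagger as a function returning one tag (or None) per token, the small name list and the tag ‘CHEMICAL’ are all hypothetical. Only NN-CHEMENTITY is taken from the text.

```python
TRIVIAL_NAMES = {("lawesson's", "reagent")}  # hypothetical dictionary

def trivial_name_tagger(tokens):
    """Tag every token of a known multi-word trivial name; None elsewhere."""
    tags = [None] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for name in TRIVIAL_NAMES:
        n = len(name)
        for i in range(len(tokens) - n + 1):
            if tuple(lowered[i:i + n]) == name:
                tags[i:i + n] = ["CHEMICAL"] * n
    return tags

def regex_tagger(tokens):
    """Stand-in for the regex tagger: only knows the keyword 'reagent'."""
    return ["NN-CHEMENTITY" if t.lower() == "reagent" else None for t in tokens]

def pos_tagger(tokens):
    """Stand-in for the part-of-speech tagger: always supplies a fallback tag."""
    return ["NN"] * len(tokens)

def merge_tags(tokens, taggers):
    """taggers is ordered from highest to lowest priority; first tag wins."""
    merged = [None] * len(tokens)
    for tagger in taggers:
        for i, tag in enumerate(tagger(tokens)):
            if merged[i] is None and tag is not None:
                merged[i] = tag
    return merged
```

With the trivial name tagger placed first, ‘reagent’ in ‘Lawesson's reagent’ keeps the chemical tag; without it, the regex tagger's NN-CHEMENTITY tag wins for that token.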
4.4.6 Identification of inline headings
Headings present at the start of paragraphs (Figure 4-4) must be detected and handled
separately from the rest of the paragraph. This is achieved by examination of ChemicalTagger’s
output for the start of the paragraph. Constructs such as a procedure identifier followed by a
suitable delimiter and phrases following patterns like “Synthesis of xxx” are identified as headings
and removed from the paragraph. All patterns operate on ChemicalTagger’s tags to allow more
lexical variations to be accepted. For example the “Synthesis of xxx” pattern would be implemented
as an examination of an initial NounPhrase for an NN-SYNTHESIZE tag followed by a
PrepPhrase element containing an IN-OF tag and a NounPhrase element. Identified inline
headings are then treated analogously to other headings.
4-Fluoro-2-methylbenzonitrile (31). A mixture of 2-bromo-5fluorotoluene (3.5 g, 18.5 mmol) and…
Figure 4-4 Example of a paragraph containing an inline heading (bold text)
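The “Synthesis of xxx” pattern described above can be sketched as a check on the first two children of a tagged sentence. The element and tag names NounPhrase, PrepPhrase, NN-SYNTHESIZE and IN-OF are taken from the text; the function name and the assumption that the phrases are direct children of the sentence element are hypothetical.

```python
import xml.etree.ElementTree as ET

def starts_with_synthesis_of(sentence):
    """Return True if the sentence opens with a 'Synthesis of xxx' construct."""
    children = list(sentence)
    if len(children) < 2:
        return False
    first, second = children[0], children[1]
    if first.tag != "NounPhrase" or second.tag != "PrepPhrase":
        return False
    # The noun phrase must contain a synthesis noun anywhere below it.
    has_synthesis_noun = any(el.tag == "NN-SYNTHESIZE" for el in first.iter())
    # The prepositional phrase must contain 'of' and a noun phrase as children.
    child_tags = [el.tag for el in second]
    return has_synthesis_noun and "IN-OF" in child_tags and "NounPhrase" in child_tags
```

Because the check operates on tags rather than on the words themselves, any noun tagged NN-SYNTHESIZE (for example ‘Preparation’) would be accepted without a separate pattern.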
4.4.7 Processing of headings
After a heading has been run through ChemicalTagger it is examined for molecule entities and
procedure names. Entities known to present as false positives in OSCAR4’s output are filtered out
using the regexes used for identifying molecules as being of type “false positive” (cf. Section 4.5.1.4).
Additionally, as OSCAR4 is known to classify strings of capital letters as chemicals, if the entirety of
the heading is formed of capital letters e.g. ‘ABSTRACT’, no molecule entities are recognised. If the
heading has a molecule entity and/or a procedure name the heading is assumed to be part of an
experimental section; otherwise the heading serves as a delimiter between experimental sections.
A procedure name is either associated with an experimental section or a reaction step
dependent on whether it is believed to be a sub-heading. A procedure name is determined to be a
sub-heading if it contains neither an NN_METHOD nor NN_EXAMPLE word or the procedure’s
NN_METHOD word is ‘stage’ or ‘step’ (Table 4-4). As previously mentioned, paragraphs with ids
starting with ‘h-’ are treated as sub-headings.
Examples of heading procedure names | Examples of sub-heading procedure names
Example 5 | Step b
General procedure 3 | 1)
Method 2a | 2.
Table 4-4 Examples of headings and subheadings
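The sub-heading rule for procedure names can be written as a short predicate. This is a sketch under assumptions: the function name and the representation of a procedure name as (tag, word) pairs are hypothetical, and only the NN_METHOD and NN_EXAMPLE tags are taken from the text.

```python
def is_subheading_procedure(tagged_words):
    """tagged_words: list of (tag, word) pairs making up one procedure name."""
    tags = {tag for tag, _ in tagged_words}
    # No method or example word at all, e.g. '1)' or '2.'
    if "NN_METHOD" not in tags and "NN_EXAMPLE" not in tags:
        return True
    # A method word that denotes a step within a larger procedure.
    return any(
        tag == "NN_METHOD" and word.lower() in {"stage", "step"}
        for tag, word in tagged_words
    )
```

Applied to the entries of Table 4-4, ‘Step b’ and ‘1)’ are classed as sub-headings whilst ‘Example 5’ and ‘Method 2a’ are not.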
If a molecule entity is detected in a heading an attempt is made to identify a string within the
heading that appears to be an alias for the compound so that subsequent use of the alias may
resolve to that compound. A heading molecule is associated with a reaction step or an experimental
section dependent on whether or not an appropriate procedure name had been found indicating the
start of a reaction step.
4.4.8 Processing of paragraphs
Paragraphs that were classified as experimental are associated with the current reaction
step when the current step, or the current experimental section, is associated with either a molecule
entity or a procedure name. The requirement for an appropriate preceding heading allows further
non-experimental paragraphs that passed through the paragraph classifier to be ignored. As a special
case, paragraphs containing a yield phrase within which resides a molecule entity are always added
to the current reaction step to allow for the case where the paragraph fully describes a reaction
whilst not being preceded by a heading.
4.5 Section Parsing
The identified sections are processed sequentially in the order that they were defined in the
patent using the scheme in Figure 4-5.
Figure 4-5 Schematic of processes employed to extract reactions from an experimental chemistry
section.
4.5.1 Processing of chemical entities
4.5.1.1 Name to structure
The workflow relies on OSCAR4.1 for the resolution of chemical names, which in turn relies on
OPSIN 1.2.0, the chemical names present in the ChEBI database as of December 2011 and a
manually created dictionary of chemical formulae and common chemical abbreviations. This latter
dictionary was increased from 81 entries to 260 entries to afford better coverage of the
abbreviations used for common reagents in organic chemistry.
To allow better support for the combination of a systematic name with an adjacent
abbreviated name, where such cases are identified by the presence of a non-chemical hyphen, the
different parts of the name are handled separately and the SMILES and InChI then constructed by
merging the output for the two names. Merging of InChIs is achieved using the InChI library by
constructing an input containing all the structures. In the case where a name is uninterpretable and
has no delimiters identified by ChemicalTagger, if the name is found to contain exactly one slash or
dot or space, parsing of the substrings either side of the delimiter is attempted.
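The fallback for uninterpretable names can be sketched as below. The reading that the name must contain exactly one delimiter in total (rather than one of each kind) is an assumption, as is the function name; in the workflow each returned substring would then be passed to name to structure conversion.

```python
import re

DELIMITER = re.compile(r"[/. ]")  # slash, dot or space

def split_on_single_delimiter(name):
    """Return (left, right) if name contains exactly one delimiter, else None."""
    if len(DELIMITER.findall(name)) != 1:
        return None
    left, right = DELIMITER.split(name)
    if not left or not right:  # delimiter at the very start or end
        return None
    return left, right
```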
4.5.1.2 Anaphora identification and resolution
Chemical entities may be referred to by anaphora; that is terms that reference a previous
entity. Four types of anaphora are recognised: references to compound identifiers, references to
procedures, textual aliases and textual references to heading compounds.
A reference to a compound identifier is identified in ChemicalTagger’s output by the
encapsulating REFERENCETOCOMPOUND element (Figure 4-6).
[Figure 4-6: ChemicalTagger XML output; element markup lost in extraction. Surviving token text: compound 92 ( 107 mg , 0.24 mmol )]
Figure 4-6 Compound 92 in this example is a reference to a previously defined chemical entity
A reference to a procedure is identified by the encapsulating PROCEDURE element (Figure
4-7). Only procedures mentioned within a molecule are assumed to be the source of the chemical
entity.
[Figure 4-7: ChemicalTagger XML output; element markup lost in extraction. Surviving token text: Chloropyrimidine ( 0.5 g , 1.75 mmol ) from step ( c )]
Figure 4-7 Chloropyrimidine in this example is an anaphora for a particular chloropyrimidine from a
previous step.
If a chemical entity is associated with a bracketed chemical entity the two are assumed to be
synonyms (Figure 4-8). As the purpose of this is to improve recall if the synonym is subsequently
used, only cases in which one chemical entity is resolvable to a structure but the other is not are
considered. Subsequent mentions of the unresolvable name will yield the same structure as the
resolvable name.
[Figure 4-8: ChemicalTagger XML output; element markup lost in extraction. Surviving token text: N-Ethoxycarbonyl-2-ethoxy-1,2-dihydroquinoline ( EEDQ )]
Figure 4-8 Example of a systematic chemical name and its abbreviation
Entities with a name matching the case insensitive regex:
(crude|desired|title[d]?|final|aimed|expected|anticipated) (compound|product)
are assumed to refer to heading compounds. Typically, this is the compound associated with the
heading of the current step. However, if this is the final step, or the step is not associated with a
compound, then the current section heading compound is assumed.
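The matching and resolution of heading compound references can be sketched as follows. The regex is the one given above; the function names and the use of a full match (rather than a search within a longer name) are assumptions.

```python
import re

HEADING_COMPOUND = re.compile(
    r"(crude|desired|title[d]?|final|aimed|expected|anticipated) (compound|product)",
    re.IGNORECASE,
)

def refers_to_heading_compound(name):
    """True if the whole entity name matches the case-insensitive regex."""
    return HEADING_COMPOUND.fullmatch(name) is not None

def resolve_heading_compound(step_compound, section_compound, is_final_step):
    """Pick the step heading compound unless the fallback conditions apply."""
    if is_final_step or step_compound is None:
        return section_compound
    return step_compound
```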
4.5.1.3 Property Extraction
Where present, the volume, amount (i.e. number of moles), mass, molarity (i.e. concentration),
number of equivalents, pH, percent yield and physical state of a compound may be extracted
from ChemicalTagger’s output. These properties are associated with a chemical entity by virtue of
the relevant elements being nested within the chemical entity in the ChemicalTagger output.
4.5.1.4 Chemical type assignment
Every chemical entity is assigned a type (Table 4-5).
Chemical Entity Type | Description | Examples
exact | Describes a specific compound | 2-chloroethanol, pyridine
definite reference | Describes a specific compound but relies on information described elsewhere in the document | Compound 5, the pyridine from example 2
chemical class | Describes a series of compounds | ether, pyridines
fragment | Describes a radical or substructure of a compound | ethyl, pyridine ring
false positive | Not a chemical entity or one that would not be expected to be part of a chemical reaction (e.g. an NMR solvent) | CDCl3, TLC
Table 4-5 Description of chemical entity types assigned by the system
False positives are recognised by the entity’s presence within an APPARATUS or
AtmospherePhrase phrase (as identified by chemical tagging) or by its being followed by a word
indicating that the chemical entity is a surface, e.g. ‘silica surface’. Additionally, a series of regular
expressions is used to match NMR solvents as well as characterisation terms known to be
misidentified by OSCAR4 as chemicals.
Entities are recognised as being of type “chemical class” by being prefixed by the determiners
‘a’ or ‘an’, by being followed by the word ‘compound’ or ‘derivative’, by being a known function
class e.g. ‘aldehyde’, by being assigned as such by the OPSIN Document Extractor or by ending in a
plural ending.
Entities are recognised as type “fragment” if they are followed by words like ‘group’ or ‘ring’
or are assigned as such by the OPSIN Document Extractor e.g. ‘ethyl’.
Any entity not explicitly assigned a type is assumed to be of type “exact”.
4.5.2 Identification of discourse type
Paragraphs are broken down into phrases by the top level phrase elements into which
ChemicalTagger has grouped the input. Each phrase is then classified as either synthesis or workup.
Phrases that form the characterisation section are not explicitly identified as the boundary between
workup and characterisation may occur within a phrase complicating exact identification of the
boundary. As it is practical to filter out the vast majority of chemical entities that are associated with
characterisation, characterisation sections are indirectly ignored by virtue of contributing no allowed
chemical entities.
The approach used to identify the discourse type assumes that all text up to the start of the
workup section relates to synthesis and hence discourse analysis concentrates on the identification
of phrases that relate to workup. It was found that phrases of types "Concentrate", "Degass",
"Dry", "Extract", "Filter", "Partition", "Precipitate", "Purify", "Recover",
"Remove", "Wash", "Quench" were associated with workup. Where a phrase does not fit into one
of these roles, the assumption is made that the phrase is of the same type (synthesis/workup) as the
preceding phrase.
In contrast to the literature solutions, the presence of a molecule possessing an associated
amount, yield or number of equivalents is used to indicate the return to a synthesis section. This
heuristic arises from the observation that the amounts of workup reagents are rarely precisely
specified, and it allows support for multi-step reactions within the same paragraph. Additionally,
the presence of a "Synthesize" or "Yield" phrase is assumed to indicate the end of a workup
section.
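The discourse rules described in this section can be sketched as a single pass over the phrases of a paragraph. The workup phrase types are those listed above; the function name, the input representation and the precedence given to a quantified molecule over a workup phrase type are assumptions, and the phrase types ‘Add’ and ‘Stir’ mentioned in the usage note are used for illustration only.

```python
WORKUP_PHRASES = {
    "Concentrate", "Degass", "Dry", "Extract", "Filter", "Partition",
    "Precipitate", "Purify", "Recover", "Remove", "Wash", "Quench",
}
SYNTHESIS_PHRASES = {"Synthesize", "Yield"}

def classify_phrases(phrases):
    """phrases: list of (phrase_type, has_quantified_molecule) in text order.

    has_quantified_molecule is True when the phrase contains a molecule with
    an associated amount, yield or number of equivalents.
    """
    current = "synthesis"  # text before the first workup phrase is synthesis
    result = []
    for phrase_type, has_quantified_molecule in phrases:
        if phrase_type in SYNTHESIS_PHRASES or has_quantified_molecule:
            current = "synthesis"
        elif phrase_type in WORKUP_PHRASES:
            current = "workup"
        # Any other phrase inherits the type of the preceding phrase.
        result.append(current)
    return result
```

For a paragraph tagged as Add (with quantities), Stir, Dry, Add (no quantities), Yield, the sketch assigns synthesis, synthesis, workup, workup, synthesis.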
Figure 4-9 Typical experimental chemistry paragraph showing the different phrase types identified by
ChemicalTagger. In this case the dry phrase is used to indicate the start of the workup section.
[Figure 4-9 legend: phrase types Synthesis and Workup]
4.5.3 Chemical role assignment
A putative role in the reaction is assigned for each chemical entity that has not been excluded
due to being in a workup section or being of type “false positive” (Table 4-6).
Chemical role | Description
product | This is a compound produced as a result of a reaction
reactant | A substance that undergoes a chemical change in a reaction
solvent | A compound in which reactants are dissolved
catalyst | A compound which is not consumed by a reaction and accelerates a reaction
Table 4-6 Roles considered for chemical entities in a reaction
Role assignment is achieved through a mixture of the output of ChemicalTagger, analysis of
the local textual environment and lists of known solvents/catalysts.
4.5.3.1 Product Role
A chemical entity is assigned as being a product in these situations:
• It is associated with a percentage yield
• It is part of a noun phrase followed by ‘is synthesised’ (or a similar phrase)
• It is part of a yield phrase
• It is identified as being an anaphora to the current heading compound e.g. ‘title compound’
4.5.3.2 Reactant Role
This is the default role assigned if no clear indication of an alternate role can be determined
from the text or textual environment.
4.5.3.3 Solvent Role
A chemical entity is assigned as being a solvent in these situations:
• ChemicalTagger assigns it as a solvent
• It corresponds to an InChI-less solvent e.g. brine
• It is preceded by the words ‘in’, ‘in a mixture of’ or either of these followed by a chemical entity and the word ‘and’
Once a reaction has been constructed, sanity checks also ensure that if a reagent is listed
as both a solvent and a reactant, all instances are reclassified as a solvent. Additionally, if a
reaction does not have a solvent, a reagent without a specified amount may be assigned as a solvent
if its InChI matches that of a known solvent or its volume is imprecisely defined.
4.5.3.4 Catalyst Role
A chemical entity is assigned as a catalyst in these situations:
• ChemicalTagger assigns it as a catalyst
• Its name corresponds to a known catalyst
• Its InChI corresponds to a known catalyst
A chemical entity that is found to contain a transition metal atom which is absent from the
product molecule/s is also considered to be a catalyst, with two exceptions: where the transition
metal is part of an oxidising agent, and organocopper/zinc/mercury chemistry. These two exceptions
are enforced by identifying particular transition metals in high oxidation states and the presence of
carbon-metal bonds respectively.
4.6 Reaction mapping
Figure 4-10 Schematic of processes employed in converting putative reactions to atom-mapped
reactions
4.6.1 Indigo reaction creation
The extracted reactions are loaded into the Indigo toolkit (version 1.1-beta9) to provide atom-
mapping and depiction. This is accomplished using SMILES as the input format. To allow more
efficient atom-mapping and more aesthetic depictions, for each role, chemical entities that share
the same InChI are considered to be the same compound and hence are only added once to the
Indigo reaction.
Upon creation of the Indigo reaction, a check is done to ensure that the reaction has a
product, a total of at least two reactants/solvents/catalysts, and that none of the reactants have the
same structure as the product. It should be noted that these conditions may fail for correctly
identified reactions if SMILES could not be obtained for reactants and/or product.
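The de-duplication and the check described in this section can be sketched without reference to the Indigo API. The dictionary-based representation and the function names are hypothetical; in the actual workflow the SMILES strings would be loaded into an Indigo reaction.

```python
INPUT_ROLES = ("reactant", "solvent", "catalyst")

def build_reaction(entities):
    """entities: list of dicts with 'role', 'inchi' and 'smiles' keys.

    Within each role, entities sharing an InChI are added only once.
    Entities for which no SMILES could be obtained are left out.
    """
    reaction = {"product": {}, "reactant": {}, "solvent": {}, "catalyst": {}}
    for entity in entities:
        if entity["smiles"] is None:
            continue
        reaction[entity["role"]].setdefault(entity["inchi"], entity["smiles"])
    return reaction

def passes_check(reaction):
    """A product, at least two inputs, and no reactant identical to a product."""
    has_product = bool(reaction["product"])
    n_inputs = sum(len(reaction[role]) for role in INPUT_ROLES)
    unchanged = set(reaction["product"]) & set(reaction["reactant"])
    return has_product and n_inputs >= 2 and not unchanged
```

As noted above, a correctly identified reaction can still fail this check when SMILES are missing, because unresolved entities never reach the reaction.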
4.6.2 Atom-atom mapping
In a well formed chemical reaction all atoms in the product/s must have come from the
reactants and hence any “reactions” for which this is not true should be rejected. One way of
achieving this is by performing atom-atom mapping (AAM). This is a technique for relating the atoms
of the reactants to those of the product. This is typically implemented using a maximum common
subgraph algorithm to find the maximum number of atoms in the product that may be accounted for
by a given reactant. The resultant mapping is not necessarily unique in terms of the atoms picked
within a reactant or even in terms of which reactants are used to provide atoms.
By default, Indigo attempts to match atoms with identical charge and valency in the reactant
and products and similarly attempts to match bonds with the same bond order. Hence for greater
leniency and to reflect some of the operations that may occur in real chemical reactions these
conditions are relaxed to allow changes in charge, valency and bond order.
In some reactions it was found that the solvent was also a reactant. To prevent such cases
resulting in incomplete atom mapping, when atom mapping fails an attempt is made to reclassify a
solvent as a reactant and AAM is repeated. Currently there is no method for reporting this dual role
and instead the solvent will be reported as a reactant.
4.6.3 Stoichiometry calculation
Where AAM was successful it may be used to calculate the stoichiometry of the reaction. The
stoichiometry of each reactant is assumed to be equal to the greatest number of times a particular
atom from the reactant appears in the product. This approach has several limitations; namely, that a
reactant will not be considered to contribute to the reaction if it either only contributes non-heavy
atoms or only contributes to an unstated side product and that the atom mappings may be wrong,
especially when the system has identified too many reactants.
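The stoichiometry rule can be sketched as a count over the atom mapping. The input representation (one (reactant, atom) pair per mapped product atom) and the function name are assumptions; the rule itself is the one stated above.

```python
from collections import Counter

def stoichiometry(product_atom_origins):
    """product_atom_origins: iterable of (reactant_id, reactant_atom_index),
    one entry for every product atom that the mapping accounts for.

    Returns, for each contributing reactant, the greatest number of times any
    single one of its atoms appears in the product.
    """
    occurrences = Counter(product_atom_origins)
    result = {}
    for (reactant_id, _atom_index), count in occurrences.items():
        result[reactant_id] = max(result.get(reactant_id, 0), count)
    return result
```

Reactants that contribute no mapped atoms do not appear in the result at all, which is one of the limitations noted above.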
4.6.4 Output
The list of all reactions and a list of mappable reactions are retrievable after a patent has been
processed. These reactions may be serialised to a graphical depiction (Figure 4-11) and CML (Figure
4-12). For mappable reactions, the graphical depiction will contain the results of the AAM.
Figure 4-11 Graphical depiction of an extracted reaction. The numbers indicate the mapping between
atoms in the reactants and products. The solvent is present above the arrow.
[CH2:1]([n:3]1[cH:7][c:6](-
[c:8]2[cH:13][cH:12][n:11][c:10]3[nH:14][cH:15][cH:16][c:9]23)[c:5](-
[c:17]2[cH:23][cH:22][c:20]([NH2:21])[cH:19][cH:18]2)[n:4]1)[CH3:2].[O:24]=[C:25]=[N:26][c:27]
1[cH:32][cH:31][cH:30][cH:29][cH:28]1>c1cc[n]cc1>[CH2:1]([n:3]1[cH:7][c:6](-
[c:8]2[cH:13][cH:12][n:11][c:10]3[nH:14][cH:15][cH:16][c:9]23)[c:5](-
[c:17]2[cH:23][cH:22][c:20]([NH:21][C:25]([NH:26][c:27]3[cH:32][cH:31][cH:30][cH:29][cH:28]3)=
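[O:24])[cH:19][cH:18]2)[n:4]1)[CH3:2]
[Remaining CML markup lost in extraction. Surviving text content, in order: title product, 50.0, definiteReference, powder; 4-[1-ethyl-4-(1H-pyrrolo[2,3-b]pyridin-4-yl)-1H-pyrazol-3-yl]aniline, 2.2, exact; phenyl isocyanate, 2.4, exact; pyridine, 4, exact]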
Figure 4-12 CML output for the extracted reaction depicted in Figure 4-11
The CML output includes all the information extracted for a given reaction. For each chemical
entity this includes a role, an entity type (cf. Section 4.5.1.4), and where possible a chemical
structure (as SMILES and InChI), quantities e.g. volumes, amounts, weights etc. and the physical
state. If available, the yield of the product is also recorded. Where AAM was successful, the
atom-mapped reaction is included as reaction SMILES and the stoichiometry of the reactants is captured
by the count attribute; reactants that contribute no atoms to the products lack a count
attribute.
4.7 Evaluation
4.7.1 Methodology
USPTO patent applications for the period from 2008 through to the end of 2011 were
downloaded from Google Patents131. The XML representation of the patents was inspected to
determine the IPC (International Patent Classification) codes associated with each patent; a
patent is associated with one or more IPC codes. Only patents containing the IPC code ‘C07’ were
selected for processing. The ‘C’ refers to section C, which covers chemistry and metallurgy, whilst
the ‘07’ refers to the class for organic chemistry. Additionally, patents from the first week of 2008
were not used as these had been used to identify limitations in older versions of the reaction
extraction system.
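The selection by IPC code amounts to a prefix test on each code. The sketch below assumes the codes have already been read from the patent XML as strings such as ‘C07D 401/04’; the function name and this input format are hypothetical.

```python
def is_organic_chemistry_patent(ipc_codes):
    """True if at least one of the patent's IPC codes falls under C07."""
    return any(code.strip().upper().startswith("C07") for code in ipc_codes)
```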
This yielded a set of 65,034 patents on which the reaction extraction system was run. For each
patent the reactions were serialised to depictions and CML with segregation of the output based on
whether or not AAM was successful. The file names of the serialised reactions include the paragraph
from which the reactions were extracted.
Whilst a successful atom mapping is a good indicator that a found reaction really is a reaction,
other aspects of the output can be used to filter out dubious reactions. Additional criteria were
therefore applied to produce a smaller but higher quality set of reactions. These were:
• Reactions containing any products that could not be resolved to structures were
excluded. This helps with some cases where the product is not resolved to a structure
but instead the counter ion from a salt is resolved. Cases where a product is described
in such a way that ChemicalTagger associates both a MOLECULE and an
UNNAMEDMOLECULE with different parts of the product’s description may be
unnecessarily excluded by this criterion.
• Reactions containing any entities of types fragment or chemical class were excluded
in order to exclude generic rather than specific reactions.
A sample of 100 reactions was randomly selected from this set to evaluate the quality
of the extracted reactions. For each selected reaction, chemical entities were manually identified
and associated with a role. Unless the entity was a workup/characterisation reagent, the entity type
and the quantities that the reaction extraction system attempts to find were also
manually identified.
The correctness of name to structure conversion was not evaluated as it is likely to be more
accurate than manual conversion by the average chemist. Cases where the reaction extraction
system missed reagents that were only implicitly described, for example, in a reaction being
performed analogously to a previous reaction, were not penalised as analogous reactions are
outside of the scope of the system as implemented.
4.7.2 Results
4.7.2.1 Errors encountered
Using v1.0 of the reaction extraction system, 10 of the 65,034 patents had to be manually
skipped due to either crashing the reaction extraction system (3 cases) or taking an unacceptably
long time to complete (7 cases). The results presented in this chapter are hence for the other 65,024
patents.
One crash was caused by an oversight in the way OPSIN generates parse combinations. This
occurred when processing a long series of fragments that had been erroneously identified as a single
name and resulted in an OutOfMemoryError. Another was caused by a StackOverflowError when
ChemicalTagger attempted to parse an exceptionally long sentence of bracketed molecules which
would each be associated with the previous in the parse tree. The other crash was caused by an
OutOfMemoryError when tagging a nearly 700,000 character long “sentence”. All the cases in which
a patent took an unacceptably long time to complete related to the AAM procedure. A timeout of 1
minute was specified for the AAM but a bug in Indigo-1.1-beta9 meant that a small minority of
reactions did not respect the timeout. This was reported to the developers of Indigo and fixed in
Indigo-1.1.
Using the subsequently released version of Indigo, fixing the bug in OPSIN and the OPSIN
Document Extractor, and limiting paragraphs/headings to 35,000 characters allowed the system to
run over all patents in the four year period without any manual intervention. The process took 84
hours using 1 thread for each year on an Intel Core i7-2600k. This is sufficiently fast to easily allow
the patent applications for a week to be analysed within a day of their public release.
4.7.2.2 Overall statistics
484,259 atom mapped reactions were extracted (Figure 4-13), of which 424,621 met the more
stringent criteria described in 4.7.1.
Figure 4-13 Number of patents with a given number of atom mapped reactions
4.7.2.3 Evaluated reaction quality
Table 4-7 indicates the precision/recall with which the entities involved in a chemical reaction
were identified. False positives may be workup reagents, characterisation reagents or not chemicals
at all. False negatives are those entities which are involved in the reaction but were not identified.
True Positives 474
False Positives 60
False Negatives 18
Precision 88.9%
Recall 96.4%
F1 score 92.5%
Table 4-7 Statistics for recognition of chemical entities (reagents and products)
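For reference, the standard definitions of precision, recall and F1 score in terms of the counts in Table 4-7 can be written as below. The function name is hypothetical. Evaluated on the tabulated counts, these definitions give values within 0.2 percentage points of the reported percentages.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from Table 4-7
precision, recall, f1 = precision_recall_f1(474, 60, 18)
```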
[Figure 4-13: plot of the number of patents with a given number of reactions (logarithmic axis, 1 to 100,000) against the number of extracted reactions (0 to 1000)]
Table 4-8 shows, for correctly detected chemical entities with quantities specified in
the text, whether these quantities were associated with the chemical entity. A quantity in this
context could be a yield, amount, weight, volume etc. If an entity has multiple quantities, all must
be correctly associated for the case to be considered a success.
Agent type Successful cases/total cases (%)
Reagents 317/321 (98.8%)
Products 48/74 (64.9%)
Table 4-8 Statistics for association of quantities with reagents/products possessing quantities
Table 4-9 shows for each chemical entity of a given role in the manually annotated reactions
(which is also found in the automatically extracted reactions) whether the extracted entity has that
role.
Role Successful cases/total cases (%)
Product 99/100 (99.0%)
Reactant 241/244 (98.8%)
Solvent 85/99 (85.9%)
Catalyst 10/24 (41.7%)
Other Spectator 0/7 (0%)
Overall 435/474 (91.8%)
Table 4-9 Statistics for association of roles with entities
4.8 Discussion
The number of patents containing a specified number of extracted reactions (Figure 4-13)
shows a power law distribution with respect to the number of reactions extracted from each patent.
The majority of patents have fewer than 10 reactions but a significant minority have more than 100.
Patents with more than 100 reactions account for only 1.59% of the patents processed but hold
41.26% of the extracted reactions.
Recall of chemical entities (Table 4-7) was high (96.4%), whilst precision was somewhat lower.
This was primarily due to the classification of workup reagents, especially the first workup reagent,
as reactants, which arises because the text often just says that the reagent was added without any
indication of its purpose. Heuristics involving the addition of a common solvent as the last step of a
reaction could be investigated.
Association of quantities (Table 4-8) with reagents was near perfect and can be considered a
solved problem. There appear to be only a finite number of ways in regular use for associating
quantities with chemical entities and all of them are supported by ChemicalTagger. Association of
quantities with the product was less successful. This was because this information is often present at
the end of an experimental section and only implicitly assumed to apply to the product. Heuristics
could be investigated for improving association of unassigned quantities with the product of the
reaction. This is expected to be especially applicable to unassigned yields as these will almost
invariably be the yield of the reaction.
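One possible form of such a heuristic is sketched below. It is not part of the system described here: the `Quantity` and `Entity` classes, the `kind` labels and the rule of only acting when exactly one product is present are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Quantity:
    kind: str   # hypothetical label, e.g. "yield", "mass", "amount"
    text: str   # the quantity as written in the text, e.g. "85%"


@dataclass
class Entity:
    name: str
    role: str   # e.g. "product" or "reactant"
    quantities: List[Quantity] = field(default_factory=list)


def assign_orphan_yields(entities: List[Entity], orphans: List[Quantity]) -> List[Quantity]:
    """Attach unassigned yields to the product when the reaction has exactly one.

    Returns the quantities that remain unassigned.
    """
    products = [entity for entity in entities if entity.role == "product"]
    if len(products) != 1:
        return list(orphans)  # ambiguous case: leave everything unassigned
    remaining = []
    for quantity in orphans:
        if quantity.kind == "yield":
            products[0].quantities.append(quantity)
        else:
            remaining.append(quantity)
    return remaining


entities = [Entity("benzyl bromide", "reactant"), Entity("title compound", "product")]
left_over = assign_orphan_yields(
    entities, [Quantity("yield", "85%"), Quantity("mass", "1.2 g")]
)
```

Only yields are reassigned in this sketch; an unassigned mass is left alone because it could equally belong to a reagent mentioned earlier in the sentence.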
Assignment of roles (Table 4-9) was excellent for products (99.0%) and reactants (98.8%) but
worse for spectator reagents (solvents 85.9%, catalysts 41.7%, other spectators 0%). For this
evaluation, a catalyst was strictly defined as a reagent that is not consumed in the course of the
reaction. As experimental descriptions
typically only describe the intended product and not the fate of the other reagents involved it is
often impossible to tell from just the text whether or not a reagent is a catalyst. This also made the
assignment of reagents as catalysts problematic for the manual annotations as the likely mechanism
of the reaction had to be investigated in some cases. Similarly, a reagent was considered a reactant
even if it did not contribute any heavy atoms to the product as long as it was believed to be
consumed by the reaction. Nonetheless, despite the difficulties in identifying catalysts, the addition
of more known catalysts and the application of heuristics based around the relative quantity of
reagent used would be expected to yield improved results.
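A sketch of how the two suggested improvements might be combined is shown below. The catalyst list, the 0.2 equivalent cut-off and the function name are all assumptions chosen for illustration; none of them are taken from the implemented system, and a sub-stoichiometric amount is only suggestive of catalysis, not proof of it.

```python
from typing import Optional

# Illustrative list only; a real list would be far longer.
KNOWN_CATALYSTS = {
    "palladium on carbon",
    "4-dimethylaminopyridine",
}

# Assumed cut-off: reagents used below this many molar equivalents
# (relative to the limiting reactant) are guessed to be catalysts.
CATALYST_EQUIVALENTS_CUTOFF = 0.2


def guess_role(name: str, equivalents: Optional[float]) -> str:
    """Guess 'catalyst' or 'reactant' from a name lookup and relative quantity."""
    if name.lower() in KNOWN_CATALYSTS:
        return "catalyst"
    if equivalents is not None and equivalents < CATALYST_EQUIVALENTS_CUTOFF:
        return "catalyst"
    return "reactant"


print(guess_role("Palladium on carbon", None))
print(guess_role("copper(I) iodide", 0.05))
print(guess_role("benzyl bromide", 1.1))
```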
Overall, only 22% of the extracted reactions were flawless by all the metrics evaluated, i.e.
perfect entity recognition, quantity assignment and role assignment. However, it should be borne in
mind that most failures were minor, for example a yield not assigned to a product or a workup
reagent assigned as a reactant. It should also be noted that some mistakes in chemical entity
identification are not visible in the graphic depiction. This happens when two copies of a reagent are
inadvertently identified (as happens if both a reagent and an anaphora to the reagent are
independently resolved), as they will have been merged (using InChI to check for identity) prior to
depiction.
The correct identification of product and major starting material is a somewhat more
qualitative but potentially more useful metric, especially for reaction searching, with the proviso that
there are no false positive entities that could be mistaken for either of these entities. This was true
of 95% of the evaluated reactions. There were several reasons for the failure with the other 5%. A
recurring problem was the difficulty in determining the meaning of a reference to a procedure as it
could mean:
• The compound produced at the end of that procedure
• The compound produced at the end of that procedure but only in the context of an
  analogy to the current reaction
• The procedure itself
One failure involved a reaction being erroneously split into two reactions due to
ChemicalTagger misassigning ‘starting’ as a verb rather than as an adjective (in the context of ‘to
give (i) starting material’). This caused the yield phrase to end at the word ‘starting’ and hence the
true products ended up as the reactants for a second reaction. Another failure involved a reaction in
which some of the compounds were defined using anaphora to previously defined labelled
compounds. In this case the association between these previous compounds and their numeric
identifiers had not been made; hence preventing resolution of these anaphoric references.
4.9 Comparison to other approaches
PatentEye (Section 4.2.2) was evaluated, by Jessop, on ten weeks of EPO patents which
corresponded to a corpus of 667 patents. From these, 4444 reactions were extracted, of which a
subset was evaluated to assess the recall and precision of reagent identification and the recall of
product identification. The results were 64% recall and 78% precision for reagent identification, with the
criteria for a true positive being that both the entity and associated quantities were found. Table 4-7
in conjunction with the 98.8% association of quantities with reagents suggests that the current
system may perform significantly better but a direct comparison is impossible without using the
same corpus. It should also be considered that, as both PatentEye and the system developed for this
project rely on ChemicalTagger, improvements made to ChemicalTagger in the course of this
project would likely also improve PatentEye’s performance if it were updated.
The correct product was identified by PatentEye in 92% of cases. The reason stated for the
failures was false positive chemical entities in headings that could not be converted to structures.
The fact that such “reactions” would always be rejected at the atom-mapping stage in the developed
system further complicates comparison.
Correspondence with the authors of SCRIPDB indicated that the database included 190,083
reaction steps from USPTO patents for the period of 2008-2011. Of these only 7,873 possess a
reaction arrow, a reagent and a product. As the source of the reactions is the CDX files rather than
the text it would be potentially interesting to assess the level of overlap between these resources as
a way of assessing how many reactions cannot be found from just the text. The results presently
indicate that significantly more reactions can be extracted from the text but the number present in
the CDX files may be understated due to reactions not being explicitly indicated as reactions and due
to the use of generic structures to describe multiple reactions in one diagram.
4.10 Example use: solvent analysis
Besides obvious use cases, such as reaction searching, a large database of reactions allows one
to start asking questions about the properties of the population of chemical reactions. For example,
one can ask which solvents are most commonly employed (Figure 4-14).
Figure 4-14 The top 15 solvents by frequency of occurrence in reactions
To produce Figure 4-14 unique solvent InChIs were recorded for each reaction. Where a single
name indicates a mixture of solvents and hence produced one InChI this was split into its component
InChIs using the heuristic that a mixture of solvents would be composed of neutral components.
One can observe that a few solvents occur disproportionately often. In total, 627
discrete InChIs were detected indicating a potentially long tail. The sum of all solvents beyond the
15th is still less than the instances for any of the top 3 solvents indicating that most of these solvents
are rarely used. A significant number of the InChIs that occurred very rarely are in fact not solvents
so the figure of 627 for total solvents encountered is likely to be something of an overestimate.
Investigation of such cases could be useful for improving the precision of solvent detection.
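The counting behind a figure of this kind can be sketched as follows. Each reaction contributes a solvent at most once, and the size of the tail beyond the top n solvents can then be compared with the most frequent entries. The five example reactions are invented, and the step that splits a mixture InChI into its neutral components is deliberately omitted as it requires a cheminformatics toolkit.

```python
from collections import Counter
from typing import Iterable, List

# Standard InChIs for four common solvents.
DCM = "InChI=1S/CH2Cl2/c2-1-3/h1H2"
THF = "InChI=1S/C4H8O/c1-2-4-5-3-1/h1-4H2"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
WATER = "InChI=1S/H2O/h1H2"


def solvent_frequencies(reactions: Iterable[List[str]]) -> Counter:
    """Count, for each solvent InChI, the number of reactions it occurs in."""
    counts = Counter()
    for solvent_inchis in reactions:
        counts.update(set(solvent_inchis))  # unique InChIs per reaction
    return counts


def tail_total(counts: Counter, top_n: int) -> int:
    """Sum the occurrences of every solvent ranked below the top_n."""
    return sum(count for _, count in counts.most_common()[top_n:])


# Invented example: the solvents recorded for five reactions.
example_reactions = [[DCM], [DCM, WATER], [THF, THF], [DCM, METHANOL], [THF]]
counts = solvent_frequencies(example_reactions)
tail = tail_total(counts, top_n=2)
```

Rarely occurring InChIs at the bottom of `counts.most_common()` are natural candidates for the manual inspection suggested above.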
With the growing importance of Green Chemistry171, there is increased interest in finding
alternatives to solvents, such as dichloromethane, that are known to have a negative environmental
impact172. Being able to identify analogous reactions that were run in greener solvents is a potential
use case.
4.11 Limitations and areas for future work
4.11.1 Interrelation between taggers
Conceptually, running a series of independent taggers is easy to understand and manipulate.
However, to work ideally in practice, some taggers need the knowledge provided by other taggers.
For example, the designation of words as chemicals and hence likely nouns would be useful to the
POS tagger, since it was found to occasionally tag longer chemical entities as adjectives resulting in
increased erroneous part of speech tag assignment to adjacent words.
Another example is in the regex tagger where certain words that are to be tagged may have a
different part of speech depending on the context they are used in. This means that ideally the regex
tagger should assign them different tags but, as it has no knowledge of the context, this is impossible,
necessitating the use of ad hoc post-tagging tag corrections. If the regex tagger were aware of the
tag assigned by the POS tagger this problem could often be resolved at the tagging step.
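The idea can be illustrated with a toy regex tagger whose rules take the POS tagger's output as an additional input. This is a sketch of the proposal only and does not reflect ChemicalTagger's actual rule format; the tag names `VB-YIELD` and `NN-YIELD` and the single rule shown are hypothetical.

```python
import re
from typing import List, Pattern, Tuple

# Each rule: (token pattern, tag to use if the POS tag is a verb, tag otherwise).
# Tag names are hypothetical and chosen for illustration.
RULES: List[Tuple[Pattern[str], str, str]] = [
    (re.compile(r"^(yield|yields|yielded|yielding)$", re.IGNORECASE), "VB-YIELD", "NN-YIELD"),
]


def regex_tag(token: str, pos_tag: str) -> str:
    """Return a domain-specific tag, choosing between alternatives using
    the Penn Treebank style POS tag; fall back to the POS tag itself."""
    for pattern, verb_tag, other_tag in RULES:
        if pattern.match(token):
            return verb_tag if pos_tag.startswith("VB") else other_tag
    return pos_tag


print(regex_tag("yielded", "VBD"))
print(regex_tag("yield", "NN"))
print(regex_tag("flask", "NN"))
```

With the context supplied at tagging time, no separate correction pass is needed for words covered by such rules, although errors made by the POS tagger itself would still propagate.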
4.11.2 Chemical entity type assignment
Determiners in front of chemical names are used inconsistently. One would expect to be able
to use the presence of ‘a’/‘an’ to indicate that a chemical entity was describing a class of chemicals
and to use ‘the’ to indicate that the chemical entity referred to a particular substance referenced
previously. In practice, possibly due to not all patents being written by native speakers, the presence
of a determiner is insufficient to rule out the interpretation that a chemical entity is of type “exact”.
4.11.3 Solvents contained within another entity
In some cases the specification of the solvent is included through the use of a bracketed
description of the solvent immediately after the solute chemical entity. ChemicalTagger will
associate the bracketed description with the preceding solute entity rather than considering the
solvent as a distinct chemical entity. As a result the solvent is not recorded in these cases. Another
case where the solvent is not identified is where it is only implicitly described by an adjective e.g.
‘aqueous’ or ‘methanolic’. If the meaning of the adjective were understood, determining the solvent
would be trivial.
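A minimal sketch of such a lookup is given below. The first two entries are the examples given above; ‘ethanolic’ is added by analogy, and the function name and punctuation handling are assumptions made for illustration.

```python
from typing import Optional

# Adjective -> implied solvent name. 'ethanolic' is an illustrative addition.
SOLVENT_ADJECTIVES = {
    "aqueous": "water",
    "methanolic": "methanol",
    "ethanolic": "ethanol",
}


def implied_solvent(phrase: str) -> Optional[str]:
    """Return the solvent implied by an adjective in the phrase, if any."""
    for word in phrase.lower().split():
        solvent = SOLVENT_ADJECTIVES.get(word.strip(".,;()"))
        if solvent is not None:
            return solvent
    return None


print(implied_solvent("aqueous sodium hydroxide"))
print(implied_solvent("methanolic ammonia"))
print(implied_solvent("sodium chloride"))
```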
4.11.4 Acid/Base workup steps
One of the leading causes of false positives was compounds that took part in an acid or base
workup step that were not identified as being part of the workup. This could be partially addressed
by allowing ChemicalTagger to identify the keyword ‘neutralise’ and hence identify neutralisation
steps.
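As a rough illustration of the keyword detection proposed here, the following regular expression matches common inflections of ‘neutralise’ in both British and American spellings. It operates on plain sentences rather than on ChemicalTagger's token stream, so it is a sketch of the idea and not of how the keyword would actually be integrated.

```python
import re

# Matches neutralise/neutralize and common inflections, case-insensitively.
NEUTRALISE = re.compile(r"\bneutrali[sz](?:e|ed|es|ing|ation)\b", re.IGNORECASE)


def is_neutralisation_step(sentence: str) -> bool:
    """Return True if the sentence mentions a neutralisation."""
    return NEUTRALISE.search(sentence) is not None


print(is_neutralisation_step("The mixture was neutralised with 1M HCl."))
print(is_neutralisation_step("The mixture was stirred overnight."))
```

Chemical entities in a sentence flagged in this way could then be treated as workup reagents rather than reactants.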
4.11.5 Additional roles
The role of desiccant should be added to describe the common use of compounds like sodium
sulfate as drying agents. These compounds should be present in the list of spectator chemicals, but
are neither catalysts nor solvents.
4.11.6 Structurally unknown intermediates
It is not uncommon in a multi-step reaction to have intermediate compounds. Often for
brevity these compounds are referred to only by an important functional group e.g. ‘protected
amine’ or even just as a description of the substance e.g. ‘grey powder’. Currently such reactions
cannot be atom-mapped as the product of the first step of the reaction will have no structure. The
same is true if this compound is used as a starting material in the next reaction.
4.11.7 Presentation of reactions
The results of the AAM could be used to align the depictions of the reactants and products
making it clearer to see the transformation that has occurred.
4.11.8 Reaction conditions
Reaction conditions, e.g. temperatures and the time taken for each step, are not currently
extracted. Extracting such information would be a simple extension as the two exemplified
properties are already appropriately tagged in ChemicalTagger’s output by the TempPhrase and
TimePhrase elements. Capturing such information was not seen as a high priority as reaction
searching is typically done by structure rather than by conditions and, at present, the extracted
reactions are not aimed to be a replacement for the original text.
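A sketch of how this extension might look is given below, using Python's standard XML library. Only the TempPhrase and TimePhrase element names are taken from the text above; the surrounding XML structure and the child element names in the sample are assumptions and may not match ChemicalTagger's real output.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumed sample; only TempPhrase and TimePhrase are known element names.
SAMPLE = (
    "<Sentence><ActionPhrase>"
    "<VB-STIR>stirred</VB-STIR>"
    "<TempPhrase><IN-AT>at</IN-AT><NN-TEMP>room temperature</NN-TEMP></TempPhrase>"
    "<TimePhrase><IN-FOR>for</IN-FOR><CD>2</CD><NN-TIME>hours</NN-TIME></TimePhrase>"
    "</ActionPhrase></Sentence>"
)


def extract_conditions(xml_text: str) -> Dict[str, List[str]]:
    """Collect the text of every TempPhrase and TimePhrase element."""
    root = ET.fromstring(xml_text)

    def phrases(tag: str) -> List[str]:
        return [
            " ".join(text.strip() for text in element.itertext() if text.strip())
            for element in root.iter(tag)
        ]

    return {"temperatures": phrases("TempPhrase"), "times": phrases("TimePhrase")}


conditions = extract_conditions(SAMPLE)
print(conditions)
```

Associating each phrase with the correct reaction step, rather than merely collecting the phrases, would be the remaining work.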
4.12 Conclusions
This work has shown that it is practical to use text mining to build a large reaction database
from the publicly available chemical literature, specifically patents, without human intervention.
The extracted reactions (as reaction SMILES) are publicly available from the project’s BitBucket
page173 as is the code to perform the reaction extraction. With the input of more patent documents
the creation of a database of over a million reactions should be a relatively trivial undertaking. To
the best of the author’s knowledge such a database would become the largest publicly accessible
reaction database. Such a resource could be extremely useful to both commercial and academic
institutions, particularly when they are unable to access fee-requiring systems such as Reaxys or
SciFinder.
The development of the reaction extraction system has led to many improvements in
ChemicalTagger, OSCAR4, the OPSIN Document Extractor and OPSIN. Especially in the case of
ChemicalTagger, it is hoped that the improvements made will be of benefit to other users of these
libraries.
Chapter 5 Overall Summary of Results and Conclusions
The increasing size of the chemical literature on the one hand creates problems in identifying
informational resources, but on the other hand gives access to an ever increasing amount of
information. To address both these issues, this work has centred on the development and validation
of tools to text-mine literature for chemical information.
The development of the chemical name to structure algorithm OPSIN has been a key
achievement. The system employs a regular grammar and corresponding automaton to facilitate
tokenisation and parsing of chemical names. OPSIN was shown through examples and both artificial
and real world benchmarks, to have high coverage and precision on organic chemical nomenclature,
rivalling and often exceeding the commercial solutions tested. The algorithm was shown to be
applicable to named entity recognition and has been successfully included in a frequently utilised
public web service. Already, OPSIN itself appears to have achieved wide usage in the chemistry
community and, as other comparable open-source solutions are not currently available, is likely to
increase in usage.
Building on OPSIN’s capability for reliable and precise name to structure conversion, OPSIN
was used as a critical component in the creation of a system for extracting chemical reactions from
text-based literature. The system was demonstrated to be able to identify experimental sections and
identify chemical entities. The entities are assigned roles and types and are associated with quantities
specified in the text. Finally, atom-mapping is employed, primarily to remove implausible reactions.
The system was validated by inputting text from over 65,000 patent applications, and sampling the
output of 424,621 extracted chemical reactions. In 95% of the sampled reactions the essence of the
reaction was captured.
The extraction process is fast and requires little human intervention making it highly scalable.
Hence the system could be used to facilitate much larger scale extraction of reactions from the
patent literature. This would prove useful not only to synthetic chemists
looking for routes of synthesis but also in analysing trends in the chemistry used in
organic syntheses. The system has the potential to benefit the community by allowing access to a
large number of reactions without the restrictions or costs of traditional reaction searches.
Publishers of organic chemistry journals may also be interested as a way of adding value to their
articles by making them reaction searchable.
All of the software solutions developed as part of this project are open source and made freely
available via BitBucket (cf. Appendix A). In this way these projects may be used in a complementary
manner to other open source chemistry projects73. The potential also exists for them to be modified,
improved and extended, in ways not necessarily conceived of by the author, allowing for wider
usage than with traditionally more rigid commercial solutions. The software developed in this
project is expected to prove useful to the cheminformatics community and, in the case of more
user-friendly services such as the OPSIN web service, the general chemistry community.
References
(1) Alexandru Dan Corlan. Medline trend: automated yearly statistics of PubMed results for any
query. http://www.webcitation.org/65RkD48SV (accessed Feb 14, 2012).
(2) Economics and Statistics Division, WIPO. World Intellectual Property Indicators, 2011 edition;
2011; http://www.wipo.int/ipstats/en/statistics/patents/.
(3) Apache Software Foundation. Apache Lucene. http://lucene.apache.org/ (accessed Feb 21,
2012).
(4) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.;
Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.;
Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. Gene
Ontology: tool for the unification of biology. Nature Genetics 2000, 25, 25–29.
(5) Degtyarenko, K.; de Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcantara,
R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: a database and ontology for chemical
entities of biological interest. Nucl. Acids Res. 2008, 36, D344–350.
(6) Schneider, G.; Fechner, U. Computer-based de novo design of drug-like molecules. Nat Rev
Drug Discov 2005, 4, 649–663.
(7) Liu, H.; Hu, Z. Z.; Torii, M.; Wu, C.; Friedman, C. Quantitative assessment of dictionary-based
protein named entity tagging. J Am Med Inform Assoc 2006, 13, 497–507.
(8) Wren, J. A scalable machine-learning approach to recognize chemical names within large
text databases. BMC Bioinformatics 2006, 7, S3.
(9) Corbett, P.; Copestake, A. Cascaded classifiers for confidence-based chemical named entity
recognition. BMC Bioinformatics 2008, 9, S4.
(10) Boudin, F.; Torres-Moreno, J.; El-Bèze, M. Mixing statistical and symbolic approaches for
chemical names recognition. Computational Linguistics and Intelligent Text Processing 2008,
334–343.
(11) Klinger, R.; Kolarik, C.; Fluck, J.; Hofmann-Apitius, M.; Friedrich, C. M. Detection of IUPAC
and IUPAC-like chemical names. Bioinformatics 2008, 24, i268.
(12) Grego, T.; Pęzik, P.; Couto, F. M.; Rebholz-Schuhmann, D. Identification of Chemical Entities
in Patent Documents. In Proceedings of the 10th International Work-Conference on Artificial
Neural Networks: Part II: Distributed Computing, Artificial Intelligence, Bioinformatics, Soft
Computing, and Ambient Assisted Living; IWANN ’09; Springer-Verlag: Berlin, Heidelberg,
2009; pp. 942–949.
(13) Sun, B.; Mitra, P.; Lee Giles, C.; Mueller, K. T. Identifying, Indexing, and Ranking Chemical
Formulae and Chemical Names in Digital Documents. ACM T Inform Syst 2011, 29, 12.
(14) Jessop, D. M.; Adams, S.; Willighagen, E. L.; Hawizy, L.; Murray-Rust, P. OSCAR4: a flexible
architecture for chemical text-mining. J Cheminf 2011, 3, 41.
(15) Rocktäschel, T.; Weidlich, M.; Leser, U. ChemSpot: a hybrid system for chemical named
entity recognition. Bioinformatics 2012, 28, 1633–1640.
(16) Sayle, R.; Xie, P. H.; Muresan, S. Improved Chemical Text Mining of Patents with Infinite
Dictionaries and Automatic Spelling Correction. J. Chem. Inf. Model. 2011, 52, 51–62.
(17) Park, J.; Rosania, G.; Shedden, K.; Nguyen, M.; Lyu, N.; Saitou, K. Automated extraction of
chemical structure information from digital raster images. Chem. Cent. J. 2009, 3, 4.
(18) Filippov, I. V.; Nicklaus, M. C. Optical Structure Recognition Software To Recover Chemical
Information: OSRA, An Open Source Solution. J. Chem. Inf. Model. 2009, 49, 740–743.
(19) Valko, A. T.; Johnson, A. P. CLiDE Pro: The Latest Generation of CLiDE, a Tool for Optical
Chemical Structure Recognition. J. Chem. Inf. Model. 2009, 49, 780–787.
(20) Zimmermann, M. Chemical Structure Reconstruction with chemoCR. TREC-CHEM 2011 2011.
(21) Smolov, V.; Zentsev, F.; Rybalkin, M. Imago: open-source toolkit for 2D chemical structure
image recognition. TREC-CHEM 2011 2011.
(22) Sadawi, N. M.; Sexton, A. P.; Sorge, V. Chemical Structure Recognition: A Rule Based
Approach. In 19th Document Recognition and Retrieval Conference; 2012.
(23) Fujiyoshi, A.; Nakagawa, K.; Suzuki, M. Robust Method of Segmentation and Recognition of
Chemical Structure Images in ChemInfty. In Pre-Proceedings of the 9th IAPR International
Workshop on Graphics Recognition; Seoul, South Korea, 2011.
(24) Lounnas, V.; Vriend, G. AsteriX: A Web Server To Automatically Extract Ligand Coordinates
from Figures in PDF Articles. J. Chem. Inf. Model. 2012, 52, 568–576.
(25) Filippov, I. V.; Nicklaus, M. C.; Kinney, J. Improvements in Optical Structure Recognition
Application. In Document Analysis Systems Workshop; Boston, 2010.
(26) Yan, S.; Spangler, W. S.; Chen, Y. Cross Media Entity Extraction and Linkage for Chemical
Documents. In Twenty-Fifth AAAI Conference on Artificial Intelligence; San Francisco, 2011.
(27) Van Noorden, R. Trouble at the text mine. Nature 2012, 483, 134–135.
(28) Peter Pappas, J. R. B. USPTO Teams with Google to Provide Bulk Patent and Trademark Data
to the Public. http://www.uspto.gov/news/pr/2010/10_22.jsp (accessed Jun 4, 2012).
(29) Feng, C.; Yamashita, F.; Hashida, M. Automated Extraction of Information from the
Literature on Chemical-CYP3A4 Interactions. J. Chem. Inf. Model. 2007, 47, 2449–2455.
(30) Yamashita, F.; Feng, C.; Yoshida, S.; Itoh, T.; Hashida, M. Automated Information Extraction
and Structure−Activity Relationship Analysis of Cytochrome P450 Substrates. J. Chem. Inf.
Model. 2011, 51, 378–385.
(31) Jiao, D.; Wild, D. J. Extraction of CYP Chemical Interactions from Biomedical Literature Using
Natural Language Processing Methods. J. Chem. Inf. Model. 2009, 49, 263–269.
(32) Donaldson, I.; Martin, J.; de Bruijn, B.; Wolting, C.; Lay, V.; Tuekam, B.; Zhang, S.; Baskin, B.;
Bader, G. D.; Michalickova, K.; Pawson, T.; Hogue, C. W. PreBIND and Textomy – mining the
biomedical literature for protein-protein interactions using a support vector machine. BMC
Bioinformatics 2003, 4, 11.
(33) Batchelor, C. R.; Corbett, P. T. Semantic enrichment of journal articles using chemical named
entity recognition. In Proceedings of the 45th Annual Meeting of the ACL on Interactive
Poster and Demonstration Sessions; ACL ’07; Association for Computational Linguistics:
Stroudsburg, PA, 2007; pp. 45–48.
(34) Swain, M. chemicalize.org. J. Chem. Inf. Model. 2012, 52, 613–615.
(35) Pafilis, E.; O’Donoghue, S. I.; Jensen, L. J.; Horn, H.; Kuhn, M.; Brown, N. P.; Schneider, R.
Reflect: augmented browsing for the life scientist. Nat Biotechnol. 2009, 27, 508–510.
(36) Digital Science. SureChem. https://surechem.com/ (accessed Jun 4, 2012).
(37) Chen, Y.; Spangler, S.; Kreulen, J.; Boyer, S.; Griffin, T. D.; Alba, A.; Behal, A.; He, B.; Kato, L.;
Lelescu, A.; Kieliszewski, C.; Wu, X.; Zhang, L. SIMPLE: a strategic information mining
platform for licensing and execution. In Proceedings of the 2009 IEEE International
Conference on Data Mining Workshops; 2009; pp. 270–275.
(38) Kayala, M. A.; Azencott, C.-A.; Chen, J. H.; Baldi, P. Learning to Predict Chemical Reactions. J.
Chem. Inf. Model. 2011, 51, 2209–2222.
(39) Harold, E. R. XOM Design Principles. In Proceedings of Extreme Markup Languages;
Montréal, Québec, 2004.
(40) Elliotte R. Harold. XOM. http://www.xom.nu/ (accessed Jun 4, 2012).
(41) Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML, and the Worldwide Web. 1. Basic
Principles. J. Chem. Inf. Comput. Sci. 1999, 39, 928–942.
(42) Murray-Rust, P.; Rzepa, H. S. CML: Evolution and design. J Cheminf 2011, 3, 44.
(43) Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML, and the World Wide Web. 4. CML
Schema. J. Chem. Inf. Comput. Sci. 2003, 43, 757–772.
(44) Chemical Markup Language Schema 3. http://www.xml-cml.org/schema/ (accessed Jun 4,
2012).
(45) Joe Townsend. Chemical Markup Language Validator. http://validator.xml-cml.org/
(accessed Jun 4, 2012).
(46) García, A.; Murray-Rust, P.; Wakelin, J. The use of XML and CML in computational chemistry
and physics programs. In Proceedings of the UK e-Science All Hands Meeting 2004; 2004; pp.
1111–1114.
(47) Kuhn, S.; Helmus, T.; Lancashire, R. J.; Murray-Rust, P.; Rzepa, H. S.; Steinbeck, C.;
Willighagen, E. L. Chemical Markup, XML, and the World Wide Web. 7. CMLSpect, an XML
Vocabulary for Spectral Data. J. Chem. Inf. Model. 2007, 47, 2015–2034.
(48) Adams, N.; Winter, J.; Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML and the World-
Wide Web. 8. Polymer Markup Language. J. Chem. Inf. Model. 2008, 48, 2118–2128.
(49) Holliday, G. L.; Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML, and the World Wide
Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J. Chem. Inf. Model. 2006,
46, 145–157.
(50) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to
methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
(51) OpenSMILES Specification. http://www.opensmiles.org (accessed Jun 4, 2012).
(52) Daylight Chemical Information Systems. Daylight Theory: SMILES.
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html (accessed Jun 4, 2012).
(53) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry
Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J.
Chem. Inf. Comput. Sci. 2003, 43, 493–500.
(54) O’Boyle, N.; Banck, M.; James, C.; Morley, C.; Vandermeersch, T.; Hutchison, G. Open Babel:
An open chemical toolbox. J Cheminf 2011, 3, 33.
(55) GGA Software Services. Indigo Toolkit. http://ggasoftware.com/opensource/indigo
(accessed Jun 4, 2012).
(56) IUPAC. The IUPAC International Chemical Identifier (InChI). www.iupac.org/inchi/ (accessed
Jun 4, 2012).
(57) Chomsky, N. Three models for the description of language. IRE Trans. Inf. Theory 1956, 2,
113–124.
(58) Adams, S. E.; Goodman, J. M.; Kidd, R. J.; McNaught, A. D.; Murray-Rust, P.; Norton, F. R.;
Townsend, J. A.; Waudby, C. A. Experimental data checker: better information for organic
chemists. Org. Biomol. Chem. 2004, 2, 3067.
(59) Townsend, J. A.; Adams, S. E.; Waudby, C. A.; de Souza, V. K.; Goodman, J. M.; Murray-Rust,
P. Chemical documents: machine understanding and automated information extraction.
Org. Biomol. Chem. 2004, 2, 3294.
(60) Corbett, P.; Murray-Rust, P. High-Throughput Identification of Chemistry in Life Science
Texts. Lecture Notes in Comput. Sci. 2006, 4216, 107–118.
(61) Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic
text-mining in chemistry. J Cheminf 2011, 3, 17.
(62) Apache OpenNLP. http://incubator.apache.org/opennlp/ (accessed Jun 4, 2012).
(63) Marcus, M. P.; Marcinkiewicz, M. A.; Santorini, B. Building a large annotated corpus of
English: The Penn Treebank. Computational Linguistics 1993, 19, 313–330.
(64) Parr, T. The definitive ANTLR reference: building domain-specific languages; Pragmatic
Bookshelf, 2007.
(65) Hearst, M. A. Automatic acquisition of hyponyms from large text corpora. In Proceedings of
the 14th International Conference on Computational Linguistics: Volume 2; 1992; pp. 539–
545.
(66) Apache Maven Home Page. http://maven.apache.org/ (accessed Jun 4, 2012).
(67) Atlassian. Bitbucket. https://bitbucket.org/ (accessed Jun 4, 2012).
(68) GitHub. https://github.com/ (accessed Jun 4, 2012).
(69) Mercurial SCM. http://mercurial.selenic.com/ (accessed Jun 4, 2012).
(70) JUnit. http://www.junit.org/ (accessed Jun 4, 2012).
(71) Jenkins. http://jenkins-ci.org/ (accessed Jun 4, 2012).
(72) Lowe, D. M.; Corbett, P. T.; Murray-Rust, P.; Glen, R. C. Chemical Name to Structure: OPSIN,
an Open Source Solution. J. Chem. Inf. Model. 2011, 51, 739–753.
(73) O’Boyle, N. M.; Guha, R.; Willighagen, E. L.; Adams, S. E.; Alvarsson, J.; Bradley, J.-C.;
Filippov, I. V.; Hanson, R. M.; Hanwell, M. D.; Hutchison, G. R.; James, C. A.; Jeliazkova, N.;
Lang, A. S.; Langner, K. M.; Lonie, D. C.; Lowe, D. M.; Pansanel, J.; Pavlov, D.; Spjuth, O.;
Steinbeck, C.; Tenderholt, A. L.; Theisen, K. J.; Murray-Rust, P. Open Data, Open Source and
Open Standards in chemistry: The Blue Obelisk five years on. J Cheminf 2011, 3, 37.
(74) Pictet, A. Le Congrès International de Genève pour la Réforme de la Nomenclature
Chimique. Archives des sciences physiques et naturelles: Geneva, 1892; Vol. 27, pp. 485–
520.
(75) Definitive Rules for Nomenclature of Organic Chemistry. J. Am. Chem. Soc. 1960, 82, 5545–
5574.
(76) IUPAC. Nomenclature of Organic Chemistry; Pergamon Press, Oxford, 1979.
(77) IUPAC. A Guide to IUPAC Nomenclature of Organic Compounds (Recommendations 1993);
Blackwell Scientific publications, 1993.
(78) IUPAC. Draft Nomenclature of Organic Chemistry.
http://old.iupac.org/reports/provisional/abstract04/favre_310305.html (accessed Jun 4,
2012).
(79) Nomenclature of Inorganic Chemistry: Recommendations 1990; Blackwell Scientific
Publications, 1990.
(80) Nomenclature of Inorganic Chemistry: IUPAC Recommendations 2005; Cambridge, UK: Royal
Society of Chemistry Publishing/IUPAC, 2005.
(81) Smith, H. A. The Centennial of Systematic Organic Nomenclature. J. Chem. Educ. 1992, 69,
863.
(82) Donaldson, N.; Powell, W. H.; Rowlett, R. J.; White, R. W.; Yorka, K. V. Chemical Abstracts
Index Names for Chemical Substances in the Ninth Collective Period. J. Chem. Doc. 1974, 14,
3–15.
(83) Chemical Abstracts Service. Naming and Indexing of Chemical Substances for Chemical
Abstracts. American Chemical Society 2007.
(84) Garfield, E. An Algorithm for Translating Chemical Names to Molecular Formulas. J. Chem.
Doc. 1962, 2, 177–179.
(85) Garfield, E. An Algorithm for Translating Chemical Names to Molecular Formulas. Essays of
an Information Scientist 1984, 7, 441–513.
(86) Vander Stouw, G. G.; Naznitsky, I.; Rush, J. E. Procedures for Converting Systematic Names
of Organic Compounds into Atom-Bond Connection Tables. J. Chem. Doc. 1967, 7, 165–169.
(87) Vander Stouw, G. G.; Elliott, P. M.; Isenberg, A. C. Automated Conversion of Chemical
Substance Names to Atom-Bond Connection Tables. J. Chem. Doc. 1974, 14, 185–193.
(88) Cooke-Fox, D. I.; Kirby, G. H.; Rayner, J. D. Computer translation of IUPAC systematic organic
chemical nomenclature. 1. Introduction and background to a grammar-based approach. J.
Chem. Inf. Comput. Sci. 1989, 29, 101–105.
(89) Cooke-Fox, D. I.; Kirby, G. H.; Rayner, J. D. Computer translation of IUPAC systematic organic
chemical nomenclature. 2. Development of a formal grammar. J. Chem. Inf. Comput. Sci.
1989, 29, 106–112.
(90) Cooke-Fox, D. I.; Kirby, G. H.; Rayner, J. D. Computer translation of IUPAC systematic organic
chemical nomenclature. 3. Syntax analysis and semantic processing. J. Chem. Inf. Comput.
Sci. 1989, 29, 112–118.
(91) Cooke-Fox, D. I.; Kirby, G. H.; Lord, M. R.; Rayner, J. D. Computer translation of IUPAC
systematic organic chemical nomenclature. 4. Concise connection tables to structure
diagrams. J. Chem. Inf. Comput. Sci. 1990, 30, 122–127.
(92) Cooke-Fox, D. I.; Kirby, G. H.; Lord, M. R.; Rayner, J. D. Computer translation of IUPAC
systematic organic chemical nomenclature. 5. Steroid nomenclature. J. Chem. Inf. Comput.
Sci. 1990, 30, 128–132.
(93) Kirby, G. H.; Lord, M. R.; Rayner, J. D. Computer translation of IUPAC systematic organic
chemical nomenclature. 6. (Semi)automatic name correction. J. Chem. Inf. Comput. Sci.
1991, 31, 153–160.
(94) Ikutoshi Matsuura. Development of a System for Translation of Chemical Name into 2D-
Structure (V). 26th Symposium on Chemical Information and Computer Science 2003, 101–
104.
(95) Ikutoshi Matsuura. Development of a System for Translation of Chemical Name into 2D-
Structure (VI). 27th Symposium on Chemical Information and Computer Science 2004, 63–66.
(96) Ikutoshi Matsuura. Development of a System for Translation of Chemical Name into 2D-
Structure (VII). 28th Symposium on Chemical Information and Computer Science 2005, 29–
32.
(97) University of Manchester. ChemNomParse. http://chemnomparse.sourceforge.net/
(accessed Jun 4, 2012).
(98) Banville, D. L. Chemical Information Mining: Facilitating Literature-Based Discovery; 1st ed.;
CRC Press, 2008.
(99) ACD/Name; ACD/Labs: Toronto, Canada; http://www.acdlabs.com/.
(100) Bio-Rad Laboratories. IUPAC DrawIt; Hercules, CA; http://www.bio-rad.com.
(101) Struct=Name; PerkinElmer: Cambridge, MA; http://www.cambridgesoft.com.
(102) Name to structure; ChemAxon: Budapest, Hungary; http://www.chemaxon.com/.
(103) NameExpert; ChemInnovation Software: San Diego, CA; http://www.cheminnovation.com.
(104) Name to structure; InfoChem: Munich, Germany; http://infochem.de/.
(105) Lexichem ToolKit; OpenEye Scientific Software: Santa Fe, NM; http://www.eyesopen.com.
(106) Brecher, J. Name=Struct: A Practical Approach to the Sorry State of Real-Life Chemical
Nomenclature. J. Chem. Inf. Comput. Sci. 1999, 39, 943–950.
(107) Brecher, J. S. Method, system, and software for deriving chemical structural information. US
Patent 7,054,754, May 30, 2006.
(108) Lawson, A. J.; Roller, S.; Grotz, H.; Wisniewski, J. L.; Kelkheim, L. G. Method and software for
extracting chemical data. EPO Patent EP20050252713, November 1, 2006.
(109) Engelken, H. A System for Semantic Analysis of Chemical Compound Names. In Proceedings
of the ACL-IJCNLP 2009 Student Research Workshop; Suntec, Singapore, 2009; pp. 36–44.
(110) Møller, A. dk.brics.automaton – Finite-State Automata and Regular Expressions for Java.
http://www.brics.dk/automaton/ (accessed Jun 4, 2012).
(111) Jensen, W. B. A Quantitative van Arkel Diagram. J. Chem. Educ. 1995, 72, 395.
(112) Lozac’h, N. Extension of Rules A-1.1 and A-2.5 concerning numerical terms used in organic
chemical nomenclature (Recommendations 1986). Pure Appl. Chem. 1986, 58, 1693–1696.
(113) Moss, G. P. Extension and revision of the von Baeyer system for naming polycyclic
compounds (including bicyclic compounds). Pure Appl. Chem. 1999, 71, 513–529.
(114) Moss, G. P. Extension and revision of the nomenclature for spiro compounds. Pure Appl.
Chem. 1999, 71, 531–558.
(115) Moss, G. P. Nomenclature of Fused and Bridged Fused Ring Systems (IUPAC
Recommendations 1998). Pure Appl. Chem. 1998, 70, 143–216.
(116) Powell, W. Revision of the Extended Hantzsch-Widman System of Nomenclature for
Heteromonocycles. Pure Appl. Chem. 1983, 55, 409–416.
(117) Powell, W. Treatment of Variable Valence in Organic Nomenclature (Lambda Convention).
Pure Appl. Chem. 1984, 56, 769–778.
(118) Nomenclature and symbolism for amino acids and peptides (Recommendations 1983). Pure
Appl. Chem. 1984, 56, 595–624.
(119) McNaught, A. D. Nomenclature of carbohydrates (IUPAC Recommendations 1996). Pure
Appl. Chem. 1996, 68, 1919–2008.
(120) Kahovec, J.; Fox, R. B.; Hatada, K. Nomenclature of regular single-strand organic polymers
(IUPAC Recommendations 2002). Pure Appl. Chem. 2002, 74, 1921–1956.
(121) Tchekhovskoi, D. InChI Canonicalization Algorithm.
http://sourceforge.net/mailarchive/forum.php?thread_name=5.1.1.5.2.20050708111329.02502190%40email.nist.gov&forum_name=inchi-discuss
(accessed Jun 4, 2012).
(122) InChI Technical Manual (Version 1.04). http://www.inchi-trust.org/downloads/ (accessed
Jun 4, 2012).
(123) Cahn, R. S.; Ingold, C.; Prelog, V. Specification of molecular chirality. Angew. Chem., Int. Ed.
Engl. 1966, 5, 385–415.
(124) Prelog, V.; Helmchen, G. Basic Principles of the CIP-System and Proposals for a Revision.
Angew. Chem., Int. Ed. Engl. 1982, 21, 567–583.
(125) Mata, P.; Lobo, A. M.; Marshall, C.; Johnson, A. P. The CIP sequence rules: Analysis and
proposal for a revision. Tetrahedron: Asymmetry 1993, 4, 657–668.
(126) Razinger, M.; Balasubramanian, K.; Perdih, M.; Munk, M. E. Stereoisomer generation in
computer-enhanced structure elucidation. J. Chem. Inf. Comput. Sci. 1993, 33, 812–825.
(127) Giles, P. M. Revised Section F: Natural products and related compounds. Pure Appl. Chem.
1999, 71, 587–643.
(128) Adams, S. JNI-InChI. http://jni-inchi.sourceforge.net/ (accessed Jun 4, 2012).
(129) Eller, G. A. Improving the Quality of Published Chemical Names with Nomenclature Software.
Molecules 2006, 11, 915–928.
(130) O’Boyle, N. M.; Morley, C.; Hutchison, G. R. Pybel: a Python wrapper for the OpenBabel
cheminformatics toolkit. Chemistry Central Journal 2008, 2.
(131) USPTO Patent Application Publication Full Text with Embedded Images.
http://www.google.com/googlebooks/uspto-patents-applications-text-with-embedded-images.html
(accessed Jun 4, 2012).
(132) OPSIN Source Code on Bitbucket. http://bitbucket.org/dan2097/opsin/ (accessed Jun 4,
2012).
(133) Lowe, D. M. OPSIN Web Service. http://opsin.ch.cam.ac.uk/ (accessed Jun 4, 2012).
(134) Murray-Rust, P.; Townsend, J.; Downing, J.; Dirks, L.; Wade, A.; Naim, O.; Galos, M.;
Haughton, T. Chemistry Add-in for Word - Microsoft Research.
http://research.microsoft.com/chem4word/ (accessed Jun 4, 2012).
(135) Southan, C. Synergies between ChemAxon’s chemicalize and other open resources to
extract structures from patents, discern SAR, and find intersects or similarities in PubChem.
Chemaxon UGM, Budapest, Hungary, May 23, 2012.
(136) Lowe, D. M. OPSIN Document Extractor.
https://bitbucket.org/dan2097/opsin-document-extractor (accessed Jun 4, 2012).
(137) Tomoyuki, S. English to Japanese trivial chemical names.
http://homepage1.nifty.com/nomenclator/triv/trivial.htm (accessed Jun 4, 2012).
(138) Williams, A. J.; Ekins, S. A quality alert and call for improved curation of public chemistry
databases. Drug Discovery Today 2011, 16, 747–750.
(139) Clark, A. M. Accurate Specification of Molecular Structures: The Case for Zero-Order Bonds
and Explicit Hydrogen Counting. J. Chem. Inf. Model. 2011, 51, 3149–3157.
(140) Damerau, F. J. A Technique for Computer Detection and Correction of Spelling Errors.
Communications of the ACM 1964, 7, 171–176.
(141) Sayle, R. Foreign Language Translation of Chemical Nomenclature by Computer. J. Chem. Inf.
Model. 2009, 49, 519–530.
(142) Jeliazkova, N.; Jeliazkov, V. AMBIT RESTful web services: an implementation of the OpenTox
application programming interface. Journal of Cheminformatics 2011, 3, 18.
(143) O’Boyle, N. M.; Hutchison, G. R. Cinfony–combining Open Source cheminformatics toolkits
behind a common interface. Chemistry Central Journal 2008, 2, 24.
(144) Sitzmann, M. NCI/CADD Chemical Identifier Resolver.
http://cactus.nci.nih.gov/chemical/structure (accessed Jun 4, 2012).
(145) Willighagen, E. OPSIN used for a Bioclipse wizard.
http://chem-bla-ics.blogspot.com/2011/02/opsin-used-for-bioclipse-wizard.html (accessed Jun 4, 2012).
(146) Lawson, K. R.; Lawson, J. LICSS–A chemical spreadsheet in Microsoft Excel. Journal of
Cheminformatics 2012, 4, 3.
(147) Weber, L. Chemical Ontologies for Life Sciences. Chemaxon UGM, Budapest, Hungary, May
23, 2012.
(148) International Union of Crystallography; ChemAxon. International Union of Crystallography
chooses ChemAxon Name to Structure technology.
http://www.chemaxon.com/news/international-union-of-crystallography-chooses-chemaxon-name-to-structure-technology/
(accessed Jun 4, 2012).
(149) Kinney, J. Validation and characterization of chemical structures derived from names and
images in scientific documents. 9th International Conference on Chemical Structures,
Noordwijkerhout, The Netherlands, June 7, 2011.
(150) Muresan, S. Automated spelling correction to improve recall rates of name-to-structure
tools for chemical text mining. Chemaxon UGM, Budapest, Hungary, May 17, 2011.
(151) OPSIN used for generating SMILES from extracted chemical names, Personal communication
from IBM 2011.
(152) Blake, J. E.; Dana, R. C. CASREACT: more than a million reactions. J. Chem. Inf. Comput. Sci.
1990, 30, 394–399.
(153) Chemical Abstracts Service. CAS Databases - CASREACT, Chemical Reactions.
http://www.cas.org/expertise/cascontent/casreact.html (accessed Jun 4, 2012).
(154) Goodman, J. Computer Software Review: Reaxys. J. Chem. Inf. Model. 2009, 49, 2897–2898.
(155) Elsevier Properties SA. Reaxys. https://www.reaxys.com/info/ (accessed Jun 4, 2012).
(156) Roth, D. L. SPRESIweb 2.1, a Selective Chemical Synthesis and Reaction Database. J. Chem.
Inf. Model. 2005, 45, 1470–1473.
(157) InfoChem. SPRESIweb. http://www.spresi.com/ (accessed Jun 4, 2012).
(158) Thomson Reuters. Current Chemical Reactions.
http://thomsonreuters.com/products_services/science/science_products/a-z/current_chemical_reactions/
(accessed Jun 4, 2012).
(159) Thieme Chemistry. Science of Synthesis.
http://www.science-of-synthesis.com/en/products/reference-works/science-of-synthesis.html
(accessed Jun 4, 2012).
(160) Wife, D. SORD (Selected Organic Reactions Database). http://www.sord.nl/ (accessed Jun 4,
2012).
(161) Organic Syntheses. http://www.orgsyn.org/ (accessed Jun 4, 2012).
(162) WebReactions. http://www.openmolecules.org/webreactions/ (accessed Jun 4, 2012).
(163) Lowe, D. M. Automated Extraction of Reactions from the Patent Literature. CINF#75, 243rd
ACS National Meeting & Exposition, San Diego, CA, March 27, 2012.
(164) Reeker, L. H.; Zamora, E. M.; Blower, P. E. Specialized information extraction: automatic
chemical reaction coding from English descriptions. In Proceedings of the first conference on
Applied natural language processing; Association for Computational Linguistics, 1983; pp.
109–116.
(165) Zamora, E. M.; Blower Jr, P. E. Extraction of chemical reaction information from primary
journal text using computational linguistics techniques. 1. Lexical and syntactic phases. J.
Chem. Inf. Comput. Sci. 1984, 24, 176–181.
(166) Zamora, E. M.; Blower Jr, P. E. Extraction of chemical reaction information from primary
journal text using computational linguistics techniques. 2. Semantic phase. J. Chem. Inf.
Comput. Sci. 1984, 24, 181–188.
(167) Ai, C. S.; Blower Jr, P. E.; Ledwith, R. H. Extraction of chemical reaction information from
primary journal text. J. Chem. Inf. Comput. Sci. 1990, 30, 163–169.
(168) Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open
Patents. Journal of Cheminformatics 2011, 3, 40.
(169) Heifets, A.; Jurisica, I. SCRIPDB: a portal for easy access to syntheses, chemicals and
reactions in patents. Nucl. Acids Res. 2011, 40, D428–D433.
(170) Jessop, D. M. Information extraction from chemical patents. Ph.D. Thesis, University of
Cambridge, 2011.
(171) Dunn, P. J. The importance of Green Chemistry in Process Research and Development.
Chemical Society Reviews 2012, 41, 1452.
(172) Hargreaves, C. R.; Manley, J. B. Collaboration to Deliver a Solvent Selection Guide for the
Pharmaceutical Industry. In ACS GCI Pharmaceutical Roundtable; ACS Green Chemistry
Institute, 2008.
(173) Lowe, D. M. Patent Reaction Extraction Project.
https://bitbucket.org/dan2097/patent-reaction-extraction (accessed Jun 4, 2012).
Appendix A
This project has resulted in the creation of a significant amount of code. For posterity, the
versions of the software that were current at the time of writing this thesis are attached as
supporting information. For all projects, both source code and binaries (inclusive of dependencies)
are included in the /code directory. Note that only the OPSIN binary is executable; the other projects
are used exclusively as libraries.
Software components developed for this project:
OPSIN (version 1.2.0)
OPSIN Document Extractor (version 1.0.1)
OPSIN-ws (13th March 2012)
Patent Reaction Extraction (version 1.0)
Newer versions of these software projects may be available from https://bitbucket.org/dan2097
The Patent Reaction Extraction code depends heavily on the improved versions of ChemicalTagger
and OSCAR4 that were developed for this project:
ChemicalTagger (version 1.3.1)
OSCAR4 (version 4.1)
Newer versions of these software projects may be available from https://bitbucket.org/wwmm
Appendix B
To allow reproduction of the results in this thesis, both the data sets and results are included
as supporting information for all cases where the size of the data did not make this impractical.
Chemical name to structure testing:
/data/nameToStructure/Pubchem30000_dec2011 – Includes the PubChem IDs, their
corresponding SMILES and InChIs, the names generated by ACD/Name,
ChemBioDraw, Lexichem and Marvin, and the InChIs generated by ChemBioDraw,
Marvin and OPSIN from these names. A summary of the results, which was used to
generate Figure 3-143, Figure 3-144, Figure 3-145 and Figure 3-146, is included in
30000Pubchem_Dec11.xls.
/data/nameToStructure/2011_oscar4_patentnames – Includes the chemical names
found by OSCAR4 in 2011 organic chemistry patent application headings, the names
after processing by OPSIN’s pre-processor and the SMILES generated from both sets of
names by ChemBioDraw, Marvin and OPSIN. These results were used to generate
Figure 3-147.
/data/nameToStructure/ChebiDec11 – Includes names and corresponding SMILES
from compounds in the ChEBI database in December 2011. These names were used as
the input to produce the results for Figure 3-9.
Reaction extraction testing:
/data/reactionExtraction/evaluation – Includes the automatically extracted reactions
randomly chosen for evaluation and the manual evaluation performed to test their
quality. A breakdown of the number of patent applications containing a given number of
reactions, which was used to generate Figure 4-13, is also included.
/data/reactionExtraction/solvents – Includes the complete list of solvents and their
occurrence counts. This was used to generate Figure 4-14.
Due to size constraints, including the 2008-2011 organic chemistry patent applications is not
practical, but the ExtractOrganicChemistryPatents class used to filter patents to those containing IPC
code C07 is included in the Patent Reaction Extraction project.
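The kind of filtering described here can be illustrated with a minimal sketch. This is not the ExtractOrganicChemistryPatents class itself: the function names, the representation of a patent as a dictionary with an `ipc_codes` list, and the example records are all assumptions made purely for illustration.

```python
def has_ipc_class(ipc_codes, ipc_class="C07"):
    """Return True if any IPC code falls within the given class.

    Assumption: codes are plain strings such as "C07D 213/04".
    """
    return any(code.strip().upper().startswith(ipc_class) for code in ipc_codes)


def filter_patents(patents, ipc_class="C07"):
    """Keep only patents (hypothetical dicts) with at least one matching code."""
    return [p for p in patents if has_ipc_class(p["ipc_codes"], ipc_class)]


# Hypothetical records; the identifiers and codes are made up for the example.
patents = [
    {"id": "A", "ipc_codes": ["C07D 213/04", "A61K 31/44"]},
    {"id": "B", "ipc_codes": ["H04L 29/06"]},
    {"id": "C", "ipc_codes": [" c07c 45/00"]},
]
organic = filter_patents(patents)
print([p["id"] for p in organic])
```

Matching on the class prefix rather than on a full code keeps every subclass of C07 (C07C, C07D and so on) without listing them individually.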
Appendix C
Terms in OPSIN’s grammar:
Grammar Symbol Description Examples
a An 'a'
acetalClass acetal, ketal, hemiacetal, hemiketal
acidStem acet, valer, succin
alkaneStemHundreds hect, trict
alkaneStemModifier iso, neo, tert
alkaneStemTens dec, cos, icos
alkaneStemThousands kili, dili
alkaneStemTrivial meth, undec
alkaneStemUnits hen, do, tri
alphaBetaStereochemLocant 3beta
amineMeaningNitrilo amine as a substituent in the middle of a name
aminoAcidEndsInAn tryptoph
aminoAcidEndsInIc glutam, aspart
aminoAcidEndsInIne lys, alan, glutam
aminoAcidYl yl as in glycin-2-yl
ane ane as in the ending of an alkane or heteroatom analogue
anhydrideFunctionalGroup anhydride, peroxyanhydride
annulen [8]annulen
basicFunctionalClass ester, glycol, cyanohydrin
benzo benzo as a fused ring component
bigCapitalH 5H-
bridgeFormingO 'o' as in ethano
canBeDlPrefixedSimpleGroup glucose, galactosamine
carbohydrateChain triose, hexose
carbohydrateConfigurationalPrefix glycero, gluco, manno
carbohydrateRingSize oxirose, furanose, pyranose
carbohydrateStem gluco, manno, fructo
chalcogenAcid sulfon, sulfin, tellur
chalcogenReplacement thio, seleno, telluro
chargeOrOxidationNumberSpecifier (IV), (2+)
closeBracket ], }, )
colonSeperatedLocant 1,2:3,4
comma A comma that is ignored after parsing
cyclicUnsaturableHydrocarbon menth, prism, adamant
cyclo cyclo as in cyclopropane
dispiroter 1,2':7',2''-dispiroter
divalentFunctionalGroup ketone, sulfone
dlStereochemistry D-, L-, Dg-
e An optional 'e'
elementaryAtom sodium, natrium, zirconium
elidedAMultiplier tetr, pent
endOfFunctionalGroup Indicates the end of a functional group has been reached
endOfMainGroup Indicates the end of the principal group
endOfSubstituent Indicates the end of a substituent has been reached
epoxy epoxy, epithio
FR2hydrocarbonComponent cen, len, helicen
functionalModifier poly
fusionBracket [4,5-d], [3',4':5,6]
fusionRing indolo, pyrido, pyrrolo
fusionRingAcceptsFrontLocants naphthyridino, phenanthrolino
groupMultiplier bis, tris, tetrakis
groupStemAllowingAllSuffixes hydrazin
groupStemAllowingInlineSuffixes amid, keten, formazan
hantzschWidmanSuffix iran, olan, inan
heteroAtom aza, azonia, azanylia, azanida
heteroAtomaElided az, thi
heteroStem alum, bor, oxid, sulf
hwAne ane as in the ending of a Hantzsch-Widman system
hwAneCompatible oxa, ox, thi
hwHeteroAtom aza, arsa, bisma
hwIne ine as in the ending of a Hantzsch-Widman system
hwIneCompatible oxa, thia, selena
hydro hydro, dehydro
hyphen An optional hyphen
implicitIc Added after unsuffixed amino acids to simplify systematic construction
ine ine as in the ine of glycine
infixableInlineSuffix oyl
inlineChargeSuffix ium, ylium, ide, uide
inlineSuffix yl, ylidyne, oyl, sulfonyl
inlineSuffixAllowingPrefixes amido, oyl
interSubstituentHyphen A hyphen between two substituents
lambdaConvention 3lambda5
lightRotation (+), (-), (+-)
locant 2, S-, alpha, N5
locantThatNeedsBrackets Subset of locantGroup
mono mono as in locanted monophosphate
monoNuclearNonCarbonAcid sulfam, azin, phosphon
monovalentFunctionalGroup alcohol, thiol
multipleFusor [2',3':3,4;2",3":6,7]
multiplyableFunctionalClass oxime, oxide
naturalProductRequiresUnsaturator morphin, androst
nitrogenHeteroStem az as in diazano
nonCarbonAcidNoAcyl diphosphon, boron, selen
o An optional euphonic o
oMeaningYl o as in glycino
openBracket [, {, (
optionalCloseBracket Same as closeBracket but ignored after parsing
optionalOpenBracket Same as openBracket but ignored after parsing
orthoMetaPara ortho, meta, para, o-, m-, p-
perhydro perhydro
relativeCisTrans r-5,c-5,t-7
repeatableInlineSuffix yl, ylidene
replacementInfix thi, perox, hydrazid, hydrazon
ringAssemblyMultiplier bi, ter, quarter
simpleCyclicGroup perbenzoic acid
simpleGroup hydroxide, chloroform, thiuram disulfide
simpleGroupClass amine, carboxylic acid (groups that must be substituted to not be generic)
simpleMultiplier di, tri, tetra
simpleSubstituent chloro, hydroxy, amino
spiro spiro as in a polycyclic spiro system
spiroDescriptor spiro[2.2]
spiroLocant Subset of locantGroup used between components of a spiro system
spiroOldMethod spiro as in a polycyclic spiro system (deprecated naming system)
standaloneMonovalentFunctionalGroup chloride, cyanide
stereochemistryBracket (2R), (2E,4Z)
structuralCloseBracket Same as closeBracket but used to assist in nomenclature interpretation
structuralOpenBracket Same as openBracket but used to assist in nomenclature interpretation
subtractivePrefix deoxy, desoxy
suffix one, ol, carboxylic acid
suffixableSubstituent sil, vin
suffixesThatCanBeModifiedByAPrefix amide, ate
suffixPrefix sulfon, sulfin, carbono
symPolycylicSpiro spirobi, spiroter
trivialRing benzen, pyridin, toluen
trivialRingSubstituent phenyl
trivialRingSubstituentAnySuffix pyrid, acrid
trivialRingSubstituentInlineOnly imidaz, tol
unbrackettedStereochem E, trans
unsaturator ene, en, yne
vonBaeyer cyclo[2.2.2]
vonBaeyerMultiplier bi, tri, tetra
ylamine ylamine (used in conjunctive nomenclature)
ylene ylene
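To give a feel for how grammar symbols such as those in the table above can label the pieces of a name, the following toy sketch splits a name into (token, symbol) pairs. It is an illustration only and not OPSIN's implementation: the lexicon is a small hand-picked subset of the symbols and examples listed above, the function name `tokenise` is hypothetical, and a simple backtracking search stands in for the regular grammar that OPSIN uses to direct tokenisation. In particular, the sketch does not check that the symbols occur in a grammatically valid order.

```python
# Toy lexicon: a few grammar symbols with example tokens from the table above.
LEXICON = {
    "simpleSubstituent": ["chloro", "hydroxy", "amino"],
    "simpleMultiplier": ["di", "tri", "tetra"],
    "cyclo": ["cyclo"],
    "alkaneStemTrivial": ["meth", "undec"],
    "ane": ["ane"],
    "unsaturator": ["ene", "en", "yne"],
    "suffix": ["one", "ol"],
}

# Invert the lexicon so each token maps to its grammar symbol.
TOKEN_TO_SYMBOL = {tok: sym for sym, toks in LEXICON.items() for tok in toks}


def tokenise(name):
    """Split name into (token, symbol) pairs, or return None if impossible.

    Depth-first search with backtracking: longer tokens are tried first,
    and a choice is abandoned if the remainder cannot be tokenised.
    """
    if not name:
        return []
    candidates = sorted(
        (tok for tok in TOKEN_TO_SYMBOL if name.startswith(tok)),
        key=len,
        reverse=True,
    )
    for tok in candidates:
        rest = tokenise(name[len(tok):])
        if rest is not None:
            return [(tok, TOKEN_TO_SYMBOL[tok])] + rest
    return None


print(tokenise("chloromethane"))
print(tokenise("cycloundecene"))
```

A name containing text outside the toy lexicon simply yields `None`; a real tokeniser would also need the remaining symbols and the grammar that constrains their order.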