
Extraction of chemical structures and reactions from the literature

cam.restriction: thesis_access_open
dc.contributor.advisor: Glen, Robert
dc.contributor.advisor: Murray-Rust, Peter
dc.contributor.author: Lowe, Daniel Mark
dc.date.accessioned: 2013-07-23T08:23:10Z
dc.date.available: 2013-07-23T08:23:10Z
dc.date.issued: 2012-10-09
dc.description.abstract: The ever-increasing quantity of chemical literature necessitates the creation of automated techniques for extracting relevant information. This work focuses on two aspects: the conversion of chemical names to computer-readable structure representations and the extraction of chemical reactions from text. Chemical names are a common way of communicating chemical structure information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct tokenisation and parsing, leading to the generation of an XML parse tree. Nomenclature operations are applied successively to the tree, with many requiring the manipulation of an in-memory connection table representation of the structure under construction. Areas of nomenclature supported are described, with attention being drawn to difficulties that may be encountered in name to structure conversion. Results on sets of generated names and names extracted from patents are presented. On generated names, recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9% on precision, with all results being either comparable or superior to the tested commercial solutions. On the patent names, OPSIN's recall was 2-10% higher than the tested solutions when the patent names were processed as found in the patents. The uses of OPSIN as a web service and as a tool for identifying chemical names in text are shown to demonstrate the direct utility of this algorithm. A software system for extracting chemical reactions from the text of chemical patents was developed. The system relies on the output of ChemicalTagger, a tool for tagging words and identifying phrases of importance in experimental chemistry text. Improvements to this tool required to facilitate this task are documented.
The structures of chemical entities are, where possible, determined using OPSIN in conjunction with a dictionary of name to structure relationships. Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100 of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision. Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases, whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system captured the essence of the reaction in 95% of cases. This system is expected to be useful in the creation of searchable databases of reactions from chemical patents and in facilitating analysis of the properties of large populations of reactions. [en_GB]
dc.identifier.doi: 10.17863/CAM.16293
dc.identifier.other: PhD.35691
dc.identifier.uri: http://www.dspace.cam.ac.uk/handle/1810/244727
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/244727
dc.language.iso: en [en_GB]
dc.publisher.department: Department of Chemistry [en_GB]
dc.publisher.department: Pembroke College [en_GB]
dc.publisher.institution: University of Cambridge [en_GB]
dc.rights: All Rights Reserved [en]
dc.rights.uri: https://www.rioxx.net/licenses/all-rights-reserved/ [en]
dc.subject: Name to structure [en_GB]
dc.subject: OPSIN [en_GB]
dc.subject: Chemical text mining [en_GB]
dc.subject: Text mining [en_GB]
dc.subject: Patent reaction extraction [en_GB]
dc.subject: Reaction mining [en_GB]
dc.subject: Patents [en_GB]
dc.title: Extraction of chemical structures and reactions from the literature [en_GB]
dc.type: Thesis [en_GB]
dc.type.qualificationlevel: Doctoral
dc.type.qualificationname: Doctor of Philosophy (PhD)

Files

Original bundle
Name: lowethesis.pdf
Size: 4.72 MB
Format: Adobe Portable Document Format
Licence: https://www.rioxx.net/licenses/all-rights-reserved/
Name: Extraction of chemical structures and reactions from the literature_Supporting Information.zip
Size: 514.24 MB
Format: ZIP file
Licence: https://www.rioxx.net/licenses/all-rights-reserved/
License bundle
Name: license.txt
Size: 3.84 KB
Format: Item-specific license agreed upon to submission