Automated analysis and validation of open chemical data


Type

Thesis

Authors

Day, Nicholas E 

Abstract

Methods to automatically extract Open Data from the chemical literature, validate it, and use it to validate theory are examined. Chemical identifiers which assist the automatic location of chemical structures using commercial Web search engines are investigated. The IUPAC International Chemical Identifier (InChI) gives almost 100% recall and precision, though it is shown to be too long for present search engines. A combination of InChI and InChIKey, a shorter, fixed-length hash of the InChI string, is concluded to be the best current method of identifying structures. The proportion of published, Open Crystallographic Information Files (CIFs) that are valid with respect to the specification is shown to be improving, and is around 99% in 2007. The error rate in the conversion of valid CIFs to Chemical Markup Language (CML) is less than 0.2%. The machine generation of connection tables from CIFs requires many heuristics, and in some cases it is impossible to deduce the exact connection table. CrystalEye, a fully automated system for the reformulation of the fragmented crystallographic Web into a structured XML-based repository, is described. Published, Open CIFs can be located and aggregated programmatically with almost 100% recall. It is shown that, by converting CIF data to CML, software can be created to use the latest Web standards and technologies to enhance the ability of Web users to browse, find, keep updated, download and reuse the latest published crystallography. A workflow for the high-throughput calculation of solid-state geometry using a semi-empirical method is described. A wide range of organic and inorganic systems provided by CrystalEye are used to test both the data and the method. Several errors in the method are discovered, many of which can be attributed to the parameterization process. An Open NMR experiment to perform high-throughput prediction of 13C chemical shifts using a GIAO protocol is described.
The data and analysis were provided on publicly available webpages to enable crowdsourcing, which assisted in discovering an error rate of 6.1% in the starting data. The protocol was refined during the work and shown to have an average unsigned error of 2.24 ppm for 13C nuclei of small, rigid molecules, comparable to the errors observed elsewhere for general structures using HOSE and neural network methods.
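
The average unsigned error quoted in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own code: the function name `mean_unsigned_error` and the shift values are hypothetical, invented only to show how the mean of the absolute differences between predicted and observed shifts is taken.

```python
from statistics import fmean


def mean_unsigned_error(predicted, observed):
    """Return the mean of |predicted - observed| over paired shifts (ppm)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one pair of shifts is required")
    return fmean(abs(p - o) for p, o in zip(predicted, observed))


# Hypothetical 13C shifts in ppm, invented for illustration only;
# they are not taken from the thesis data.
predicted_ppm = [128.0, 77.5, 21.0]
observed_ppm = [130.0, 76.5, 24.0]

mue = mean_unsigned_error(predicted_ppm, observed_ppm)
print(f"Mean unsigned error: {mue:.2f} ppm")
```

Because the differences are taken as absolute values, over- and under-predictions do not cancel, which is why this measure is often preferred to a signed mean when judging a prediction protocol.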

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge