Automatic analysis and validation of open polymer data

England, Nicholas William

Repository URI

http://www.dspace.cam.ac.uk/handle/1810/237228
https://www.repository.cam.ac.uk/handle/1810/237228

Repository DOI

https://doi.org/10.17863/CAM.16280

Files

thesis_index.pdf (3.05 MB)

Type

Thesis

Authors

England, Nicholas William

Abstract

A system to automatically extract, analyse, validate and model polymer data has been produced. This system is called the Polymer Informatics Knowledge System (PIKS).

Methods of storing polymer data electronically are examined. The majority of data formats are only capable of representing an idealised structure of a macromolecule rather than the actual distribution of structures present in the polymer. Polymer Markup Language (PML) is the only data format capable of storing this information. A novel extension to PML, allowing copolymers produced with a depletion of reactants to be represented, is introduced. Without the extension, only Markov chains can be produced.
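The distinction between a Markov-chain copolymer and one formed under depletion of reactants can be illustrated with a minimal sketch. This is not PML and not the PIKS implementation; the function `grow_chain`, the feed sizes and the reactivity weights are all hypothetical, and the assumption is that the chance of adding a monomer is proportional to its remaining amount in the feed times a weight that depends on the previous unit. With a constant feed the transition probabilities never change (a first-order Markov chain); with depletion they drift as monomers are consumed.

```python
import random


def grow_chain(pool, reactivity, length, deplete, seed=0):
    """Grow one copolymer chain, one monomer unit at a time.

    pool       -- dict: monomer symbol -> units available in the feed (hypothetical)
    reactivity -- dict: (previous, next) -> relative rate weight (hypothetical)
    deplete    -- False: the feed stays constant, giving a first-order Markov chain
                  True: each added unit is removed from the feed
    """
    rng = random.Random(seed)
    feed = dict(pool)
    monomers = sorted(feed)
    chain = []
    prev = None
    for _ in range(length):
        # Weight of each candidate: remaining amount times reactivity towards it.
        weights = [
            feed[m] * (reactivity[(prev, m)] if prev is not None else 1.0)
            for m in monomers
        ]
        if sum(weights) <= 0:
            break  # feed exhausted, the chain cannot grow further
        nxt = rng.choices(monomers, weights=weights)[0]
        chain.append(nxt)
        if deplete:
            feed[nxt] -= 1
        prev = nxt
    return "".join(chain)


# Illustrative values only: A prefers to add after B, and vice versa.
reactivity = {("A", "A"): 0.5, ("A", "B"): 2.0, ("B", "A"): 2.0, ("B", "B"): 0.5}
pool = {"A": 10, "B": 90}

markov_chain = grow_chain(pool, reactivity, 100, deplete=False)
depleted_chain = grow_chain(pool, reactivity, 100, deplete=True)
```

In the depleted case the chain uses up the whole feed, so its composition is fixed by the pool, whereas the constant-feed chain only matches the feed composition on average.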

An informatics analysis of Unilever data on the cleaning efficacy of polymers is performed. A representative macromolecule was produced for each polymer sample. Descriptors were calculated for these macromolecules and used for machine learning to predict the cleaning efficacy. From these models a monomer was identified which was very strongly correlated with good cleaning performance. The monomer in question cannot be revealed as it is a trade secret.

Polymer data from the PoLyInfo database are extracted and converted into XML. A summary of the data available in the PoLyInfo database is presented. The PIKS tools were used to automatically validate these data for internal consistency, as well as against another data source. The monomers and polymers were analysed for consistency, and CML reactions were produced for the polymerisation reactions in the database and also checked for consistency. The error rate in the structures was found to be 5.8% for the monomers, 7.3% for the polymers and 2.9% for the reactions. Some of the causes of the discrepancies are presented.

The property data from the PoLyInfo database were then used for machine learning. Support Vector Regression (SVR) models of the glass transition temperature were produced both with and without the inclusion of sample characterisation data. Both approaches performed similarly: the model without characterisation data gave an RMS error of 19.1 K (r^2 = 0.96), while the model with it gave an RMS error of 20.1 K (r^2 = 0.96). This means that more sample characterisation data is required than M_w and M_w/M_n alone.
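For readers unfamiliar with the two figures of merit quoted above, the following sketch shows one common way to compute them. It assumes r^2 means the coefficient of determination; the record does not say whether that or the squared correlation coefficient was used. The temperatures are made-up illustrative values, not data from the thesis.

```python
import math


def rms_error(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up glass transition temperatures in kelvin, for illustration only.
observed_tg = [300.0, 350.0, 400.0]
predicted_tg = [310.0, 345.0, 395.0]

rmse = rms_error(observed_tg, predicted_tg)
r2 = r_squared(observed_tg, predicted_tg)
```

Two models can share the same r^2 while differing slightly in RMS error, as in the abstract, because r^2 is normalised by the spread of the observed values.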

Keywords

Polymers, Informatics

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

This work was supported by Unilever

Collections

Theses - Chemistry