Accelerating Materials Discovery for Optical Applications using Machine Learning, Natural Language Processing and Density Functional Theory

This thesis presents a novel approach that combines materials informatics with theoretical calculations to accelerate the discovery of materials with desirable optical properties. Unlike traditional experimental research programmes, this work emphasises the knowledge and technological outcomes that result from the study.

The thesis begins by reviewing the fundamental physics of the optical properties of materials, together with the existing literature on materials informatics and on applications of machine learning in materials discovery (Chapter 1). The methodologies employed throughout the research are then outlined, including the use of natural language processing (NLP) tools for information extraction, various machine learning models, and techniques for quantifying π-conjugation in organic molecules (Chapter 2).

The results begin with a complete workflow that extracts refractive indices and dielectric constants from scientific publications, yielding a database of 109,880 records of experimental data on optical materials (Chapter 3). Building on this database, second-order Sellmeier equations are used to reconstruct the chromatic-dispersion relations of various compounds, and machine learning techniques are used to model the refractive indices of inorganic compounds, demonstrating the potential of auto-generated databases for property prediction and visualisation (Chapter 4).
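
As an illustration of the dispersion modelling described above, the following minimal Python sketch evaluates the standard Sellmeier form, n^2(λ) = 1 + Σ_i B_i λ^2 / (λ^2 − C_i), with two terms. It assumes that "second-order" refers to a two-term equation. The function name `sellmeier_n` and the coefficient values are hypothetical placeholders chosen for illustration; they are not coefficients fitted or reported in the thesis.

```python
import math


def sellmeier_n(wavelength_um, terms):
    """Return the refractive index n at a given wavelength (in micrometres).

    Evaluates n^2 = 1 + sum_i B_i * lam^2 / (lam^2 - C_i), where `terms`
    is an iterable of (B_i, C_i) pairs and each C_i is in micrometres squared.
    """
    lam2 = wavelength_um ** 2
    n_squared = 1.0 + sum(b * lam2 / (lam2 - c) for b, c in terms)
    if n_squared <= 0:
        # The Sellmeier form is only meaningful away from the resonances.
        raise ValueError("n^2 is not positive at this wavelength")
    return math.sqrt(n_squared)


# Hypothetical two-term coefficients (B_i, C_i); illustrative only,
# not taken from the thesis or from any measured material.
ILLUSTRATIVE_TERMS = [(1.0, 0.01), (0.5, 100.0)]

if __name__ == "__main__":
    # Tabulate a short dispersion curve across the visible and near infrared.
    for wavelength in (0.4, 0.6, 0.8, 1.0):
        print(wavelength, sellmeier_n(wavelength, ILLUSTRATIVE_TERMS))
```

With coefficients of this kind, the computed index falls as the wavelength increases between the two resonances, which is the normal-dispersion behaviour that a Sellmeier fit is meant to capture. In practice the (B_i, C_i) pairs would be fitted to the refractive-index records extracted for each compound.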

Moreover, a new algorithmic metric is introduced to characterise π-conjugation in organic molecules, which plays a crucial role in determining their nonlinear optical properties (Chapter 5). This metric enables a high-throughput computational study of more than 20,000 molecules, which identifies four commercially available organic molecules with sufficient potential for use as nonlinear optical materials (Chapter 6). It thereby accelerates the discovery of organic compounds with exceptional molecular hyperpolarisability coefficients, a domain where literature data are notably scarce.

To enhance data extraction capabilities in the field of optical materials, two novel neural network-based language models, OpticalBERT and OpticalTableSQA, are presented (Chapter 7). OpticalBERT is a BERT-based model that is pre-trained on an extensive optical-materials corpus and shows marked improvements over traditional rule-based approaches on various NLP tasks. OpticalTableSQA is a question-answering model designed specifically for tables about optical materials, which further enhances data extraction capabilities.

This thesis concludes by summarising the contributions made and outlining potential avenues for future research (Chapter 8). By leveraging materials informatics, machine learning, and novel metrics for organic compounds, this research significantly advances the discovery and understanding of materials with desirable optical properties. The developed methodologies and databases provide valuable resources for researchers in the field and pave the way for further exploration in this domain.

Cole, Jacqueline
machine learning, materials informatics, natural language processing, optical materials, π conjugation of organic molecules
Doctor of Philosophy (PhD)
Awarding Institution
University of Cambridge
China Scholarship Council Cambridge Trust