
Computer vision and deep-learning tools for automatic data extraction from chemical reaction schemes



Abstract

This thesis focuses on the application of computer vision and deep-learning techniques to the development of an artificial intelligence tool for image-mining chemical reaction schemes. In particular, the aim of this work is the development of an autonomous, high-throughput tool for converting visual data into machine-readable formats. Data extracted in this manner can be used to create databases and can provide insight into the chemical domain through a big data approach, which facilitates the chemical discovery process. The extraction process also complements the more common data mining workflows based on natural language processing, since data from reaction schemes are often inaccessible through text.

Chapter 1 reviews the current state of research in the area and its limitations. It places the work presented in this thesis in the context of previous developments while also describing more broadly the motivation behind it. Additionally, the broader context of chemical reaction database generation and recent research efforts are discussed. Chapter 2 provides background on the machine-learning and computer vision techniques used, emphasizing the unsupervised learning and deep learning paradigms that underlie most of this research.

Following the introduction, the results chapters describe the present research in depth and trace its evolution towards an intelligent and autonomous tool for image mining from chemical reaction schemes. Chapter 3 describes the first published work on ReactionDataExtractor v. 1.0, an early instance of the chemical extraction tool that combines unsupervised machine learning with symbolic artificial intelligence approaches to allow extraction from simple reaction schemes. Chapter 4 describes a second published work on the more mature ReactionDataExtractor v. 2.0 system, which has a much more robust architecture and a more sophisticated design that lifts several limitations of the earlier version. Chapter 5 presents the final improvements and integrations performed on the pipeline to enable chemical patent data mining: the integration of ReactionDataExtractor into a PDF data mining tool, and Markush structure recognition. Finally, Chapter 6 summarises the contributions made by this thesis as well as the current state and limitations of the framework, and proposes avenues for future work.

Date

2024-03-14

Advisors

Cole, Jacqueline

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwise noted, this item's license is described as All Rights Reserved

Sponsorship

BASF SE
