
Hybridising Organic Chemical and Synthetic Biological Reaction Data for Molecular Synthesis


Type

Thesis

Authors

Zhang, Chonghuan 

Abstract

Computer-assisted synthesis planning (CASP) accelerates the development of organic synthetic pathways to complex functional molecules. CASP tools are usually built either from organic synthetic chemistry rules or from reaction data. Synthetic biology, however, offers a new degree of freedom through the ability to develop new synthetic steps. This PhD thesis presents a method for hybridising conventional organic and synthetic biological reaction datasets to guide synthesis planning. A subset of the organic reactions in the Reaxys database was combined with the metabolic reactions in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database to create a hybrid dataset. The combined dataset was used to assemble synthetic pathways from multiple building blocks to a target molecule. Route assembly was performed using reinforcement learning, which was adapted to learn the values of molecular structures in synthesis planning and to develop a policy model that proposes near-optimal multistep synthetic routes from the pool of available historical reactions. To quantify the added value of synthetic biological transformations in hybrid pathways, three policy-model "decision makers" were developed from the organic, biological, and hybrid reaction pools, respectively. The near-optimal synthetic pathways predicted from the three reaction pools were evaluated and compared to assess the advantages of a hybrid decision space, combining synthetic chemical and synthetic biological reactions, for optimising reaction pathways.
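As a rough illustration of what "learning the values of molecular structures" and deriving a route-selection policy can mean, the sketch below computes a value for every molecule in a tiny reaction pool and then follows the cheapest option at each step. It is a minimal sketch under strong assumptions, not the method of the thesis: it uses plain deterministic dynamic programming instead of reinforcement learning, the value of a molecule is simply the minimum number of steps needed to reach it from building blocks, and all names (`REACTIONS`, `BUILDING_BLOCKS`, `molecule_values`, `greedy_route`) and the toy data are hypothetical.

```python
import math

# Hypothetical toy reaction pool: (product, reactants, step_cost).
# Molecule names and costs are illustrative only, not data from the thesis.
REACTIONS = [
    ("target", ("int_a", "int_b"), 1.0),  # imagined organic route
    ("target", ("int_c",), 1.0),          # imagined biocatalytic shortcut
    ("int_a", ("bb1", "bb2"), 1.0),
    ("int_b", ("bb3",), 1.0),
    ("int_c", ("bb1",), 1.0),
]
BUILDING_BLOCKS = {"bb1", "bb2", "bb3"}


def molecule_values(reactions, building_blocks, sweeps=50):
    """Return the minimum total step cost needed to make each molecule."""
    molecules = set(building_blocks)
    for product, reactants, _ in reactions:
        molecules.add(product)
        molecules.update(reactants)
    # Building blocks are free; everything else starts as unreachable.
    values = {m: (0.0 if m in building_blocks else math.inf) for m in molecules}
    for _ in range(sweeps):
        changed = False
        for product, reactants, cost in reactions:
            candidate = cost + sum(values[r] for r in reactants)
            if candidate < values[product]:
                values[product] = candidate
                changed = True
        if not changed:  # values have converged
            break
    return values


def greedy_route(target, reactions, building_blocks, values):
    """Greedy policy: expand each molecule with its lowest-cost reaction."""
    route, stack = [], [target]
    while stack:
        molecule = stack.pop()
        if molecule in building_blocks:
            continue
        options = [
            (cost + sum(values[r] for r in reactants), reactants)
            for product, reactants, cost in reactions
            if product == molecule
        ]
        if not options:
            raise ValueError(f"no reaction in the pool produces {molecule!r}")
        _, best_reactants = min(options)
        route.append((molecule, best_reactants))
        stack.extend(best_reactants)
    return route


values = molecule_values(REACTIONS, BUILDING_BLOCKS)
route = greedy_route("target", REACTIONS, BUILDING_BLOCKS, values)
```

In this toy pool the two-step route through `int_c` is preferred over the three-step route through `int_a` and `int_b`. Restricting `REACTIONS` to one class of reactions and recomputing the values mimics, in a very simplified way, the comparison of organic, biological, and hybrid reaction pools described above.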

The hybrid pathways show that biochemical transformations may allow significant gains in synthesis efficiency. However, access to biochemical reaction data is limited, which restricts the opportunity to find alternatives to, and synergies with, organic synthesis. Hence, a workflow to explore the sparse synthetic biological domain was proposed. Using biocatalytic transformations learned from recorded reactions, feasible biosynthetic reactions were proposed, expanding the KEGG reaction dataset fourfold. To find catalysts for the novel reactions, a transformer model was trained on the reaction SMILES and amino acid sequences of native enzymes to predict promiscuous enzymes for potential substrates. The proposed transformer model calibrates the feasibility of the predicted reactions and narrows the search for promiscuous enzymes in the pool. The hybrid pathways also show that the organic reaction data are noisy, a major issue being that reaction co-participants are missing from the reaction records. A heuristic-based method was developed to identify the balanced reactions in reaction databases and to complete some imbalanced reactions by adding candidate molecules. A masked language model was trained on the reaction SMILES "sentences" of these completed reactions. The model predicted missing molecules for the incomplete reactions, analogous to predicting missing words in sentences, and showed promise in predicting small and medium-sized missing molecules.
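The idea of flagging imbalanced reaction records can be sketched with a simple heavy-atom count on each side of a reaction SMILES string. This is only an illustrative sketch and not the heuristic developed in the thesis: it assumes plain `reactants>agents>products` reaction SMILES, ignores hydrogens, charges, and agents, and uses a regular-expression tokeniser in place of a cheminformatics toolkit. The function names `heavy_atom_counts` and `missing_heavy_atoms` are hypothetical.

```python
import re
from collections import Counter

# Bracket atoms such as [NH4+] or [13CH3], or bare organic-subset atoms.
# Two-letter symbols (Cl, Br) come first so they win over C and B.
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside a bracket: optional isotope number, then the element symbol.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:  # e.g. the wildcard atom [*]
                continue
            element = match.group(1).capitalize()
        else:
            element = token.capitalize()  # aromatic 'c' becomes 'C'
        if element != "H":
            counts[element] += 1
    return counts


def missing_heavy_atoms(reaction_smiles):
    """Return (atoms absent from products, atoms absent from reactants)."""
    reactants, _agents, products = reaction_smiles.split(">")
    left = heavy_atom_counts(reactants)
    right = heavy_atom_counts(products)
    # Counter subtraction keeps only positive differences.
    return left - right, right - left


# Esterification recorded without its water by-product.
lost, gained = missing_heavy_atoms("CC(=O)O.CCO>>CC(=O)OCC")
```

Here `lost` reports one oxygen atom unaccounted for on the product side, consistent with a missing water molecule, while `gained` is empty. A record for which both counters are empty is balanced in heavy atoms; a non-empty difference indicates the kind of missing co-participant that a completion step, heuristic or learned, would need to supply.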

The thesis presents the idea of hybridising organic chemical and synthetic biological reactions and delivers workflows towards a cleaner and more comprehensive hybrid reaction space to guide molecular synthesis. The methods presented in this thesis could accelerate CASP research and support the design of more sustainable and efficient molecular synthesis routes.

Date

2022-12-20

Advisors

Lapkin, Alexei

Keywords

AI for chemistry, molecular synthesis, reaction informatics

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Sponsorship

Cambridge Trust, China Scholarship Council, and the Sustainable Reaction Engineering group