Machine Learning Methods for Modeling Synthesizable Molecules

Bradshaw, John 

The search for new molecules often involves cycles of design-make-test-analyze steps, where new molecules are designed, synthesized in a lab, tested, and then analyzed to inform what is to be designed next. This thesis proposes new machine learning (ML) methods to augment chemists in the design and make steps of this process, focusing on two tasks: (a) using ML to predict chemical reaction outcomes, and (b) building generative models to search for new molecules. We take a common approach to both tasks, building our ML models around existing powerful tools and abstractions from the field of chemistry, and in doing so, show that the tasks we tackle are intrinsically linked.

Reaction prediction is important for validating synthesis plans before carrying them out. Many previous ML approaches to reaction prediction have treated reactions as either a black-box translation or a single graph edit operation. Instead, we propose a model (ELECTRO) that predicts the reaction products by modeling a sequence of electron movements. We show that modeling electron movements in this way is easy for chemists to interpret and is also a natural format in which to incorporate the constraints of chemistry, such as balanced atom counts before and after a reaction. We show that our model achieves excellent performance on an important subset of chemical reactions and recovers a basic knowledge of chemistry without explicit supervision.

In designing new models to search for molecules with particular properties, it is important that the models describe not only what molecule to make, but also, crucially, how to make it. These instructions form a synthesis plan, describing how easy-to-obtain building blocks can be combined to form more complex molecules of interest through chemical reactions. Inspired by this real-world process, we develop two machine learning approaches that incorporate reactions into the virtual generation of new molecules. We show that aligning our model with the real-world process allows us to better link up the design and make steps involved in molecule search, and permits chemists to examine the practicability of both the final molecules we suggest and their synthetic routes. Molecule search is inherently an extrapolation task, and we show that by building our methods around the inductive biases of modeling reactions, we can generalize to new chemical spaces, suggesting molecules that not only perform well, but are synthesizable too.

Hernández-Lobato, José Miguel
Ghahramani, Zoubin
Schölkopf, Bernhard
ML, machine learning, artificial intelligence, Deep generative models, de novo design, retrosynthesis, reaction prediction, chemoinformatics, cheminformatics, ml4molecules
Doctor of Philosophy (PhD)
Awarding Institution
University of Cambridge