Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity.

Accepted version
Peer-reviewed

Type

Article

Authors

Kotlyarov, Ruslan 
Papachristos, Konstantinos 
Wood, Geoffrey PF 
Goodman, Jonathan M (ORCID: https://orcid.org/0000-0002-8693-9136)

Abstract

C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
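The two classification figures quoted above measure different things: accuracy counts all correctly labelled C-H sites, while positive predictive value (PPV) asks how many of the sites predicted as reactive really are reactive. A minimal sketch of both metrics follows, assuming binary per-site labels (1 = borylated site, 0 = unreactive site). The function name and the toy labels are hypothetical illustrations and are not data or code from the paper.

```python
def accuracy_and_ppv(y_true, y_pred):
    """Return (accuracy, positive predictive value) for binary site labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # reactive, predicted reactive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # unreactive, predicted unreactive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unreactive, predicted reactive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # reactive, predicted unreactive
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # PPV is undefined when no site is predicted reactive.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return accuracy, ppv


# Toy example (hypothetical): ten aromatic C-H sites, two of them truly reactive.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
acc, ppv = accuracy_and_ppv(y_true, y_pred)
print(f"accuracy={acc:.2f}, PPV={ppv:.2f}")
```

Because unreactive sites usually outnumber reactive ones, accuracy can stay high even when many predicted sites are wrong, which is why a PPV figure is reported alongside it.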

Keywords

3404 Medicinal and Biomolecular Chemistry, 34 Chemical Sciences

Journal Title

J Chem Inf Model

Journal ISSN

1549-9596
1549-960X

Publisher

American Chemical Society (ACS)

Sponsorship

Exscientia and EPSRC via SynTech CDT

Relationships
Is supplemented by: