Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity.

Accepted version
Peer-reviewed

Abstract

C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
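The two classification metrics quoted in the abstract measure different things. Accuracy counts every correctly labelled C-H site, reactive or not, while positive predictive value (PPV, also called precision) looks only at the sites the model flagged as reactive. The sketch below is a minimal illustration of the standard definitions; the function names and the confusion-matrix counts are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all C-H sites that were labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def positive_predictive_value(tp, fp):
    """Fraction of sites predicted reactive that are truly reactive."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (NOT results from the paper):
# tp = reactive sites predicted reactive, fp = unreactive predicted reactive,
# tn = unreactive predicted unreactive, fn = reactive predicted unreactive.
tp, fp, tn, fn = 40, 10, 440, 10

print(f"accuracy = {accuracy(tp, fp, tn, fn):.2f}")
print(f"PPV      = {positive_predictive_value(tp, fp):.2f}")
```

When unreactive sites heavily outnumber reactive ones, as in these made-up counts, accuracy can remain high even if a noticeable share of the predicted reactive sites is wrong, which is why reporting PPV alongside accuracy is informative.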

Journal Title

J Chem Inf Model

Journal ISSN

1549-9596
1549-960X

Publisher

American Chemical Society (ACS)

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

Exscientia and EPSRC via SynTech CDT
