Machine learning framework for cost effective deep mutational scanning through targeted substitution profiling.
Accepted version
Peer-reviewed
Abstract
BACKGROUND: Deep mutational scanning (DMS) provides comprehensive maps of protein variant effects but remains experimentally intensive. Machine learning (ML) approaches have the potential to reduce the experimental burden of DMS by predicting the functional impact of substitutions from limited data. RESULTS: We introduce an ML classifier trained on normalised DMS scores from the SARS-CoV-2 main protease (Mpro) to categorise amino acid substitutions as functional (wild-type-like) or non-functional. Using brute-force feature selection, we identified minimal subsets of six substitution scores per residue that enable accurate classification of the remaining substitutions, with worst-case (minimum) accuracy exceeding 90%. Models including support vector machines, random forests, and logistic regression were evaluated without retraining (zero-shot prediction) against additional SARS-CoV-2 Mpro datasets and against unrelated datasets. Zero-shot performance was strongest for other enzymes and more modest for DMS systems that assess protein folding and/or protein-protein interactions. CONCLUSION: The results show that targeted DMS combined with ML can reduce sequencing and reagent costs while preserving classification accuracy, offering a practical route to accelerate variant effect prediction.
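The brute-force subset search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic, the 20 columns ignore which amino acid is wild-type at each residue, and a simple mean-threshold rule stands in for the support vector machine, random forest, and logistic regression models. All names (`make_toy_scores`, `worst_case_accuracy`, `best_subset`) and the 0.5 threshold are assumptions made for illustration.

```python
from itertools import combinations
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_toy_scores(n_residues=60, seed=0):
    """Synthetic normalised scores: one row per residue, one column per amino acid.

    Hypothetical data only; real input would be normalised DMS scores.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_residues):
        tolerance = rng.random()  # per-residue tolerance to substitution
        rows.append(
            [min(1.0, max(0.0, tolerance + rng.gauss(0, 0.15))) for _ in AMINO_ACIDS]
        )
    return rows


def worst_case_accuracy(scores, subset, threshold=0.5):
    """Minimum accuracy over all substitutions that are NOT in `subset`.

    Stand-in classifier (an assumption): a residue's unmeasured substitutions
    are all called functional if the mean of its measured scores >= threshold.
    """
    targets = [j for j in range(len(AMINO_ACIDS)) if j not in subset]
    correct = {j: 0 for j in targets}
    for row in scores:
        predicted = sum(row[i] for i in subset) / len(subset) >= threshold
        for j in targets:
            correct[j] += predicted == (row[j] >= threshold)
    return min(correct[j] / len(scores) for j in targets)


def best_subset(scores, k=2, threshold=0.5):
    """Exhaustively score every k-column subset and keep the best worst case."""
    candidates = combinations(range(len(AMINO_ACIDS)), k)
    best = max(candidates, key=lambda s: worst_case_accuracy(scores, s, threshold))
    return best, worst_case_accuracy(scores, best, threshold)


if __name__ == "__main__":
    toy = make_toy_scores()
    subset, accuracy = best_subset(toy, k=2)
    print([AMINO_ACIDS[i] for i in subset], round(accuracy, 3))
```

The demo uses `k=2` to keep the exhaustive search fast; setting `k=6` matches the subset size reported in the abstract but enumerates 38,760 candidate subsets over 20 columns, so it runs noticeably slower in pure Python.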
Journal ISSN
1471-2105

