
Machine learning framework for cost effective deep mutational scanning through targeted substitution profiling.

Accepted version
Peer-reviewed

Abstract

BACKGROUND: Deep mutational scanning (DMS) provides comprehensive maps of protein variant effects but remains experimentally intensive. Machine learning (ML) approaches have the potential to reduce the experimental burden of DMS by predicting the functional impact of substitutions from limited data. RESULTS: We introduced an ML classifier trained on normalised DMS scores from SARS-CoV-2 main protease (Mpro) to categorise amino acid substitutions as functional (wild-type-like) or non-functional. Using brute-force feature selection, we identified minimal subsets of six substitution scores per residue that enable accurate classification of the remaining substitutions, with worst-case (minimum) accuracy exceeding 90%. Models including support vector machines, random forests, and logistic regression were evaluated without retraining (zero-shot prediction) against additional SARS-CoV-2 Mpro datasets and against unrelated datasets. Zero-shot performance was strongest for other enzymes and more modest for DMS systems that assess protein folding and/or protein-protein interactions. CONCLUSION: The results show that targeted DMS combined with ML can reduce sequencing and reagent costs while preserving classification accuracy, offering a practical route to accelerate variant effect prediction.
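
The brute-force selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy eight-letter alphabet, the subset size of three (instead of six out of the full amino acid set), the 0.5 functional cut-off, the synthetic scores, and the nearest-centroid classifier (standing in for the SVM, random forest, and logistic regression models) are all assumptions chosen so the example runs quickly with the standard library only. Accuracy here is measured on the training data, with no held-out set.

```python
import itertools
import random

AMINO_ACIDS = "ADEGKLPW"  # toy alphabet (assumption); a real scan covers all 20
THRESHOLD = 0.5           # hypothetical cut-off: score >= 0.5 counts as functional


def make_toy_scores(n_residues, seed=0):
    """Synthetic normalised scores: one dict {amino acid: score} per residue."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_residues):
        tolerance = rng.random()  # per-residue mutational tolerance
        rows.append({aa: min(1.0, max(0.0, tolerance + rng.gauss(0, 0.15)))
                     for aa in AMINO_ACIDS})
    return rows


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def accuracy_for_target(rows, subset, target):
    """Training accuracy of a nearest-centroid classifier that predicts whether
    `target` is functional at each residue from the scores of `subset`."""
    X = [[row[aa] for aa in subset] for row in rows]
    y = [row[target] >= THRESHOLD for row in rows]
    pos = [x for x, label in zip(X, y) if label]
    neg = [x for x, label in zip(X, y) if not label]
    if not pos or not neg:  # only one class present: predicting it is always right
        return 1.0
    c_pos, c_neg = centroid(pos), centroid(neg)
    hits = sum((sq_dist(x, c_pos) <= sq_dist(x, c_neg)) == label
               for x, label in zip(X, y))
    return hits / len(y)


def best_subset(rows, k):
    """Exhaustively score every k-subset by its worst per-target accuracy."""
    best_acc, best_combo = -1.0, None
    for combo in itertools.combinations(AMINO_ACIDS, k):
        targets = [aa for aa in AMINO_ACIDS if aa not in combo]
        worst = min(accuracy_for_target(rows, combo, t) for t in targets)
        if worst > best_acc:
            best_acc, best_combo = worst, combo
    return best_acc, best_combo


rows = make_toy_scores(n_residues=60)
worst_acc, subset = best_subset(rows, k=3)
print(f"best subset: {''.join(subset)}  worst-case accuracy: {worst_acc:.2f}")
```

Ranking subsets by their minimum per-target accuracy, rather than the mean, mirrors the abstract's emphasis on worst-case performance: a subset is only as good as the substitution it predicts least reliably. The exhaustive search grows combinatorially with the alphabet and subset size, so a full-scale run would need a faster classifier or parallel evaluation.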

Journal Title

BMC Bioinformatics

Journal ISSN

1471-2105

Publisher

Springer Nature

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

Novo Nordisk Foundation (via Rhodes University) (NNF23SA0084504)