Prediction of protein biophysical traits from limited data: a case study on nanobody thermostability through NanoMelt.

Published version
Peer-reviewed


Abstract

In-silico prediction of protein biophysical traits is often hindered by the limited availability of experimental data and their heterogeneity. Training on limited data can lead to overfitting and poor generalizability to sequences distant from those in the training set. Additionally, inadequate use of scarce and disparate data can introduce biases during evaluation, leading to the reporting of unreliable model performance. Here, we present a comprehensive study exploring various approaches for protein fitness prediction from limited data, leveraging pre-trained embeddings, repeated stratified nested cross-validation, and ensemble learning to ensure an unbiased assessment of performance. We applied our framework to introduce NanoMelt, a predictor of nanobody thermostability trained on a dataset of 640 measurements of apparent melting temperature, obtained by integrating data from the literature with 129 new measurements from this study. We find that an ensemble model that stacks multiple regression models using diverse sequence embeddings achieves state-of-the-art accuracy in predicting nanobody thermostability. We further demonstrate NanoMelt's potential to streamline nanobody development by guiding the selection of highly stable nanobodies. We make the curated dataset of nanobody thermostability freely available and NanoMelt accessible as downloadable software and as a webserver.
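The stratified nested cross-validation named in the abstract can be illustrated with a minimal sketch. This is not NanoMelt's implementation: the data are synthetic, a one-dimensional k-nearest-neighbour regressor stands in for the regression models trained on sequence embeddings, and all function names (`stratified_folds`, `knn_predict`, `mae`, `nested_cv`) are hypothetical. Stratification is assumed here to mean that each fold spans the full range of the continuous target, which is achieved by sorting samples by target value and spreading each consecutive group across the folds.

```python
import random
from statistics import mean


def stratified_folds(y, n_folds, seed=0):
    """Split sample indices into folds that each span the full range of y.

    Samples are sorted by target value, cut into consecutive groups of
    n_folds samples, and each group is spread randomly across the folds.
    """
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(order), n_folds):
        group = order[start:start + n_folds]
        slots = list(range(n_folds))
        rng.shuffle(slots)
        for idx, slot in zip(group, slots):
            folds[slot].append(idx)
    return folds


def knn_predict(train_x, train_y, x_new, k):
    """Predict as the mean target of the k nearest training samples."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - x_new))[:k]
    return mean(train_y[i] for i in nearest)


def mae(x, y, fit_idx, eval_idx, k):
    """Mean absolute error on eval_idx for a model fitted on fit_idx."""
    fit_x = [x[i] for i in fit_idx]
    fit_y = [y[i] for i in fit_idx]
    return mean(abs(knn_predict(fit_x, fit_y, x[i], k) - y[i])
                for i in eval_idx)


def nested_cv(x, y, ks=(1, 3, 5), n_outer=5, n_inner=4, seed=0):
    """Return one held-out error per outer fold.

    The hyperparameter k is chosen in the inner loop using only the
    outer-training samples, so outer test samples never influence it.
    """
    outer = stratified_folds(y, n_outer, seed)
    outer_scores = []
    for f, test_idx in enumerate(outer):
        train_idx = [i for g, fold in enumerate(outer) if g != f for i in fold]
        # Inner folds hold positions into train_idx, not global indices.
        inner = stratified_folds([y[i] for i in train_idx], n_inner, seed + 1)

        def inner_score(k):
            scores = []
            for h, val_pos in enumerate(inner):
                val = [train_idx[p] for p in val_pos]
                fit = [train_idx[p]
                       for g, fold in enumerate(inner) if g != h for p in fold]
                scores.append(mae(x, y, fit, val, k))
            return mean(scores)

        best_k = min(ks, key=inner_score)
        outer_scores.append(mae(x, y, train_idx, test_idx, best_k))
    return outer_scores


# Synthetic stand-in data: one feature and a noisy linear target.
data_rng = random.Random(42)
x = [data_rng.uniform(0, 10) for _ in range(60)]
y = [50 + 3 * xi + data_rng.gauss(0, 1) for xi in x]

scores = nested_cv(x, y)
print("per-fold MAE:", [round(s, 2) for s in scores])
```

Repeating the procedure, as the abstract describes, would amount to calling `nested_cv` with several different `seed` values and pooling the resulting scores. The key design point is that model selection happens entirely inside each outer training set, so the outer scores estimate performance on data that played no part in tuning.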

Description

Acknowledgements: We are grateful to Dr Chris Johnson for his support in facilitating the use of the Prometheus equipment at the MRC Laboratory of Molecular Biology. We acknowledge Dr Gabriel Ortega Quintanilla for sharing his thermostability data from an unpublished study. We thank Montader Ali, Misha Atkinson, Magdalena Nowinska, Dr Mauricio Aguilar Rangel, and Dr Oded Rimon for donating samples of their purified nanobodies to expand our dataset. P.S. is a Royal Society University Research Fellow (grant no. URF\R1\201461). We acknowledge funding from UK Research and Innovation (UKRI) Engineering and Physical Sciences Research Council (grant no. EP/X024733/1, an ERC starting grant to P.S. underwritten by UKRI).


Publication status: Published

Is Part Of

Publisher

Taylor & Francis

Rights and licensing

Except where otherwise noted, this item's license is described as http://creativecommons.org/licenses/by/4.0/
Sponsorship
Engineering and Physical Sciences Research Council (EP/X024733/1)
Royal Society (URF\R1\201461)