
Learning the molecular grammar of protein condensates from sequence determinants and embeddings

Published version





Intracellular phase separation of proteins into biomolecular condensates is increasingly recognised as an important phenomenon for cellular compartmentalisation and the regulation of biological function. Several hypotheses about the parameters that determine the tendency of proteins to form condensates have been proposed, and some of them have been probed experimentally using constructs generated by sequence alterations. To broaden the scope of these observations, we established an in silico strategy for understanding, on a global level, the associations between protein sequence and phase behaviour, and we further constructed machine learning classifiers for predicting protein liquid–liquid phase separation (LLPS) from sequence. Our analysis highlighted that LLPS-prone proteins are more disordered, hydrophobic and of lower Shannon entropy than sequences in the Protein Data Bank or the Swiss-Prot database, show a fine balance in their relative abundance of polar and hydrophobic residues, and have their disordered regions enriched in polar, aromatic and charged residues. Using these features, we built a machine learning classifier that effectively distinguished LLPS-prone sequences both from structured proteins and within the human proteome. Moreover, to remove the requirement for feature engineering, we trained a neural-network-based language model (LM) to generate low-dimensional embedding vectors for protein sequences, and showed that a classifier built on these embedding vectors can learn the underlying principles of protein phase behaviour with an accuracy comparable to that of a classifier relying on knowledge-based features. Our final model, which combines engineered features with unsupervised embeddings, achieved high performance both when distinguishing LLPS-prone proteins from structured ones and when identifying them within the human proteome. These results provide a platform rooted in molecular principles for understanding protein phase behaviour.
The predictor, termed DeePhase, is accessible from
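As a rough illustration of the kind of sequence-derived features the abstract mentions, the sketch below computes the Shannon entropy of a sequence's residue composition and its fraction of hydrophobic residues. It is a minimal sketch under stated assumptions, not the DeePhase implementation: the function names and the hydrophobic residue set are hypothetical choices made for this example, and the paper's own feature definitions may differ.

```python
import math
from collections import Counter

# Assumption: an illustrative hydrophobic residue set chosen for this sketch;
# the paper's own residue grouping may differ.
HYDROPHOBIC = frozenset("AILMFVW")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a sequence."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq.upper())
    n = len(seq)
    # Sum of -p * log2(p) over the observed residue frequencies.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues that fall in the illustrative hydrophobic set."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    residues = seq.upper()
    return sum(r in HYDROPHOBIC for r in residues) / len(residues)


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to exercise the two functions.
    example = "GSYGSSGQSSYGQ"
    print(shannon_entropy(example), hydrophobic_fraction(example))
```

A low-complexity sequence dominated by a few residue types yields a low entropy value, which is the sense in which the abstract describes LLPS-prone proteins as having lower Shannon entropy.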



Journal Title

Proceedings of the National Academy of Sciences of the United States of America


National Academy of Sciences
Rhodes Trust (Unknown)