Inference frameworks in computational biology: from protein-protein interaction networks using machine learning to carbon footprint estimation.
Protein-protein interactions (PPIs) are essential to understanding biological pathways and their roles in development and disease. Computational tools have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and, overall, a low level of trust in the predicted PPIs. To better understand the mechanisms underpinning these models, I designed B4PPI, an open-source benchmarking framework that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. I use B4PPI to shed light on the impact of network topology and to understand how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models (two of the most popular approaches) on human PPIs, I show that they are complementary: the former performs best on lone proteins, while the latter specialises in interactions involving hubs. I also show that algorithm design has little impact on performance with functional genomics data. I replicate these results between human and yeast data and demonstrate that models using functional genomics are better suited to PPI prediction across species. These analyses also highlight disparities in the computing resources needed to train the prediction tools; some models run within seconds while others need hours. Longer runtimes require more energy and are responsible for more greenhouse gas emissions. Being able to quantify this impact is crucial, as climate change profoundly affects nearly all aspects of life on Earth, including human societies, economies and health. Various human activities are responsible for significant greenhouse gas emissions, including data centres and other sources of large-scale computation.
Although many important scientific milestones have been achieved thanks to the development of high-performance computing, the resultant environmental impact has been underappreciated. I present a methodological framework to estimate the carbon footprint of any computational task in a standardised and reliable way, and I define metrics to contextualise greenhouse gas emissions. I develop a freely available online tool, Green Algorithms, which enables a user to estimate and report the carbon footprint of their computation (available at www.green-algorithms.org). The tool integrates easily with computational processes: it requires minimal information, does not interfere with existing code, and accounts for a broad range of hardware configurations. Finally, I quantify the greenhouse gas emissions of algorithms used for particle physics simulations, weather forecasts, natural language processing and a wide range of bioinformatics tools. With rapidly increasing amounts of sequence and functional genomics data, this work on protein interactions provides a systematic foundation for the future construction, comparison and application of PPI networks. It also integrates essential metrics of environmental efficiency developed by the Green Algorithms project, a simple, generalisable framework and a freely available tool to quantify the carbon footprint of nearly any computation. This work also elucidates the carbon footprint of common analyses in bioinformatics and provides recommendations to empower scientists to move toward greener research.
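
As an illustration of the kind of estimate such a framework produces, the following minimal sketch converts a runtime and a hardware description into grams of CO2-equivalent. This abstract does not state the exact formula, so the structure used here (power draw of cores and memory, scaled by a data-centre efficiency factor and by the carbon intensity of the electricity) is an assumption, and the function name, parameter names and all numerical values are hypothetical placeholders rather than values taken from Green Algorithms.

```python
def carbon_footprint_g(
    runtime_h: float,
    n_cores: int,
    core_power_w: float,
    core_usage: float,
    memory_gb: float,
    memory_power_w_per_gb: float,
    pue: float,
    carbon_intensity_g_per_kwh: float,
) -> float:
    """Estimate emissions (gCO2e) of one computation.

    Assumed model (not quoted from the source):
      power  = cores * power per core * usage + memory * power per GB
      energy = runtime * power * PUE
      carbon = energy * carbon intensity of the electricity
    """
    # Instantaneous power draw of the compute resources, in watts.
    power_w = n_cores * core_power_w * core_usage + memory_gb * memory_power_w_per_gb
    # Energy drawn by the facility, in kWh; PUE accounts for data-centre overheads.
    energy_kwh = runtime_h * power_w * pue / 1000.0
    # Convert energy to emissions using the local carbon intensity.
    return energy_kwh * carbon_intensity_g_per_kwh


# Hypothetical job: 2 hours on 4 fully used cores with 16 GB of memory.
# Every number below is a placeholder chosen for easy arithmetic.
estimate = carbon_footprint_g(
    runtime_h=2.0,
    n_cores=4,
    core_power_w=10.0,
    core_usage=1.0,
    memory_gb=16.0,
    memory_power_w_per_gb=0.5,
    pue=1.5,
    carbon_intensity_g_per_kwh=400.0,
)
print(f"{estimate:.1f} gCO2e")
```

Under this assumed model, the estimate scales linearly with runtime, which is consistent with the observation above that models needing hours rather than seconds to train are responsible for proportionally more emissions on the same hardware.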