Quantifying the diversity of bioactive small molecule sets
Abstract
The concept of molecular diversity is used in a range of contexts in cheminformatics, including choosing compound sets for screening, measuring the success of generative AI models, and analysing and comparing new or existing compound sets. Despite this widespread use, there is no gold-standard method for quantifying diversity. Many molecular representations and quantification strategies are in use, and little is known about how the resulting diversity measures compare.
This thesis focuses on the use of diversity measures to quantify and compare the diversity of sets of bioactive small molecules. A representative set of diversity measures based on scaffolds and fingerprints was chosen and implemented together with a diversity measure based on reduced graphs.
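As a rough illustration of what fingerprint- and scaffold-based diversity measures can look like, the sketch below computes two simple, commonly used quantities: the mean pairwise Tanimoto distance between fingerprints and the fraction of distinct scaffolds in a set. These are not necessarily the measures implemented in the thesis. The function names, the representation of fingerprints as sets of on-bit indices, and the toy data are all assumptions made for the example. In practice, fingerprints and scaffolds would be derived from structures with a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # assumed convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def fingerprint_diversity(fingerprints: list) -> float:
    """Mean pairwise Tanimoto distance (1 - similarity) over all compound pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # a set with fewer than two compounds has no pairwise diversity
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


def scaffold_diversity(scaffolds: list) -> float:
    """Fraction of distinct scaffolds among the compounds (1.0 means all differ)."""
    if not scaffolds:
        return 0.0
    return len(set(scaffolds)) / len(scaffolds)


# Hypothetical toy data: three compounds, two of which share a scaffold.
toy_fingerprints = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
toy_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1"]

print(fingerprint_diversity(toy_fingerprints))
print(scaffold_diversity(toy_scaffolds))
```

Even on this toy set the two quantities respond to different properties: the scaffold fraction only registers whether scaffolds are identical, whereas the fingerprint distance also reflects partial overlap between compounds.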
To analyse the diversity of bioactive small molecules, a carefully designed dataset of interacting compound-target pairs was extracted from the open-source bioactivity database ChEMBL. Interacting compound-target pairs were identified by combining information from measured activities and a manually curated set of mechanisms of action in ChEMBL. The extraction workflow is fully automated and reproducible and can be performed for all ChEMBL versions from ChEMBL 26 onwards.
Before the diversity of compounds in the compound-target dataset was analysed, the ability of the chosen diversity measures to quantify the diversity of bioactive small molecules was investigated. Diversity measures are frequently used to compare compound sets of different sizes, so ChEMBL data were used to analyse whether dataset size strongly influenced any of the measures. Furthermore, most diversity measures quantify the structural diversity of the compounds in a compound set, but they are commonly used to predict its functional diversity, i.e., how many different targets the compounds can potentially interact with. Using the compound-target dataset, it was therefore examined how the measured diversity of compound sets changed with increasing functional diversity.
The measures that proved suitable for capturing the diversity of compound sets of varying sizes were then compared. Their potential advantages and disadvantages were examined, and two complementary measures were chosen to analyse the diversity of compounds in the compound-target dataset. For each target in the dataset, the diversity of the compounds interacting with the target was calculated. General trends and systematic differences between target classes and between individual targets were analysed, and examples of interest were discussed.
Additionally, the influence of time was investigated. An estimate of the first occurrence of each compound was determined based on information in the dataset and in ChEMBL. Subsequently, the development of scaffold- and reduced graph-based diversity over time was systematically explored for the targets in the dataset. Finally, the contributions of the thesis were summarised, and further practical applications of the work and potential directions for future research were outlined.

