
Theses - Computer Science and Technology


Recent Submissions

Now showing 1 - 20 of 197
  • Item (Open Access)
    Near-Memory Processing for Low-precision Deep Neural Networks
    Miralaei, Aida
    Deep neural networks (DNNs) provide many application domains with state-of-the-art performance and accuracy. However, they are compute-heavy and data-intensive which makes deploying them on resource-constrained edge devices challenging. Transforming the real-valued parameters of a DNN into a minimised-bit-width approximated version through quantising or binarising lowers their accuracy to an extent but significantly reduces their computation complexity and storage space. Moreover, it makes them good candidates to be considered for near-memory processing due to these criteria. On the other hand, when it comes to designing a hardware module (either for near-memory processing or not), the more specialised the hardware the better the performance. The downside with over-specialised modules then becomes their inability to adapt, which can be problematic in a fast-evolving field such as deep learning. To this end, designing a module with reasonable performance, area overhead and energy consumption while maintaining a good balance between the physical limitations of the design and its flexibility is a challenge. The contribution of this thesis is to introduce a processing-near-memory module (PNM) for low-precision convolution neural networks. The placement of this module is near the main memory (DDR4 DRAM) and the memory controller, with a data layout that does not require rearrangement either before or after the convolution operations. Here, two distinct design modes for the PNM module are presented: mode S, which focuses on minimizing area overhead, and mode T, aimed at optimizing data transfers between the module and DRAM. The performance, area, and energy costs of these designs were thoroughly assessed through both analytical and practical analysis. 
    Miralaei, Aida
    Deep neural networks (DNNs) deliver state-of-the-art performance and accuracy across many application domains. However, they are compute-heavy and data-intensive, which makes deploying them on resource-constrained edge devices challenging. Transforming the real-valued parameters of a DNN into a reduced-bit-width approximation through quantisation or binarisation lowers accuracy to an extent, but significantly reduces computational complexity and storage requirements, and these properties also make such networks good candidates for near-memory processing. On the other hand, when designing a hardware module (whether for near-memory processing or not), the more specialised the hardware, the better the performance; the downside of over-specialised modules is their inability to adapt, which can be problematic in a fast-evolving field such as deep learning. Designing a module with reasonable performance, area overhead, and energy consumption, while maintaining a good balance between the physical limitations of the design and its flexibility, is therefore a challenge. The contribution of this thesis is a processing-near-memory (PNM) module for low-precision convolutional neural networks. The module is placed near the main memory (DDR4 DRAM) and the memory controller, with a data layout that requires no rearrangement either before or after the convolution operations. Two distinct design modes for the PNM module are presented: mode S, which focuses on minimising area overhead, and mode T, which aims to optimise data transfers between the module and DRAM. The performance, area, and energy costs of these designs were thoroughly assessed through both analytical and practical analysis.
    These evaluations highlighted the impact of varying filters, hardware replicas, and bit-widths for each model on the stated criteria, leading to recommended design choices tailored to specific use cases' demands and constraints. An evaluation using a BCNN based on AlexNet indicates that for mode S, a configuration of one hardware replica with 16 filters per replica offers an optimal balance between area, runtime, and energy. In contrast, for mode T, the best configuration comprises 16 replicas and 32 filters. Comparatively, mode T surpasses mode S by a factor of 6.39 in performance, while mode S achieves greater savings in area and energy, by factors of 3.56 and 15.99, respectively.
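The arithmetic such low-precision designs accelerate reduces to a very simple kernel. As a hedged, software-only sketch (the thesis's PNM hardware and data layout are not reproduced here), a binarised dot product over {-1, +1} values can be computed on packed bits with XNOR and popcount:

```python
# Illustrative sketch (not the thesis's PNM design): the core operation a
# binarised CNN accelerator implements is a dot product over {-1, +1}
# weights/activations, computed on packed bits as XNOR plus popcount.

def pack_bits(values):
    """Pack a list of +1/-1 values into an int bitmask (1 bit per value)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, w_bits, n):
    """Dot product of two {-1,+1} vectors stored as bitmasks of length n.

    XNOR marks positions where the signs agree; each agreement contributes
    +1 and each disagreement -1, so dot = 2*popcount(agree) - n.
    """
    agree = ~(a_bits ^ w_bits) & ((1 << n) - 1)
    return 2 * bin(agree).count("1") - n

activations = [1, -1, 1, 1, -1, -1, 1, -1]
weights     = [1, 1, -1, 1, -1, 1, 1, -1]
a_bits, w_bits = pack_bits(activations), pack_bits(weights)

reference = sum(a * w for a, w in zip(activations, weights))
assert binary_dot(a_bits, w_bits, len(activations)) == reference
```

Packing many weights per memory word is what makes this operation attractive near DRAM: a whole filter row moves in one transfer and is consumed with bitwise logic rather than multipliers.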
  • Item (Open Access)
    Reliable and decentralised deep learning for physiological data
    Xia, Tong
    Physiological data encompass measurements from various bodily functions and processes. By employing machine learning to model these data, especially with the advancement of mobile sensing technologies, it becomes feasible to automatically and continually monitor and diagnose one's health status. This holds considerable promise for easing the burden on clinical resources and ensuring timely treatment for the wider population. Nonetheless, significant challenges related to the data and the modelling methods are yet to be resolved, obstructing the deployment of machine learning, especially deep learning, in real-world healthcare contexts. One challenge is that labelled physiological data for model development are usually insufficient and imbalanced, leading to models occasionally exhibiting bias and overconfidence in their predictions. This can result in unreliable diagnoses and costly clinical consequences. Moreover, deep learning research generally requires massive data on a centralised server, while privacy concerns hinder the aggregation of physiological data from individuals or hospitals. In order to tackle these challenges and pave the way for reliable deep learning-driven health diagnostics, this thesis proposes several novel solutions and makes the following contributions: Chapter 4 introduces an ensemble learning approach designed to handle data imbalance and model overconfidence for binary health screening. This method utilises balanced training sets derived from imbalanced physiological data, training multiple ensemble models. The predictions from these models are fused to reduce bias and calibrate the confidence of any single model, with model uncertainty measured by the inconsistency among the multiple models. This approach effectively mitigates model overconfidence, thereby facilitating reliable automated diagnoses.
In Chapter 5, an efficient uncertainty quantification approach is presented to improve the reliability of multi-class mobile health diagnostics. This approach incorporates the cutting-edge technique of evidential deep learning and introduces two novel mechanisms specifically designed to handle class imbalance. The quantified uncertainty enables accurate and efficient detection of misdiagnoses and out-of-training-distribution inputs. Chapter 6 introduces a cross-device federated learning method to address privacy concerns arising from gathering physiological data for model development. This method allows physiological data to remain on personal mobile devices, with only locally trained models aggregated into a global health diagnostic model. To mitigate bias caused by data imbalance, a novel loss-weighted model aggregation method is proposed to enhance the performance of the global model. Chapter 7 illustrates a cross-silo federated learning method that enables multiple data holders such as hospitals to collaboratively train a model without exchanging raw data. The distributional heterogeneity of these physiological data silos poses a challenge to federated learning. To address this, a novel method based on feature sharing and augmentation is proposed to balance privacy protection and model performance. All proposed methods have been validated using real-world physiological datasets and commonly used machine learning benchmark data. Specific attention is given to clinical tasks, including the modelling of respiratory audio for respiratory health screening, ECG signals for predicting cardiovascular diseases, and dermoscopic images for detecting skin cancer. Extensive experiments demonstrate that these methods effectively address challenges posed by limited, imbalanced, and decentralised physiological data, thereby enabling reliable health diagnoses.
These contributions have significant potential to advance the deployment of deep learning in real-world healthcare scenarios.
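The balanced-ensemble recipe of Chapter 4 can be sketched in a few lines. The data, the threshold "classifiers", and the variance-style uncertainty below are toy stand-ins for the deep models in the thesis, chosen only to make the resample-fuse-disagree pattern concrete:

```python
import random
random.seed(0)

# Hedged sketch of the general recipe (not the thesis's exact models):
# undersample the majority class into several balanced subsets, fit one
# simple classifier per subset, fuse their predictions, and use member
# disagreement as an uncertainty signal.

def balanced_subsets(pos, neg, k):
    """k subsets, each pairing all minority samples with an equal-size
    random draw from the majority class."""
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return [(minority, random.sample(majority, len(minority)))
            for _ in range(k)]

def fit_threshold(pos, neg):
    """A toy 1-D classifier: threshold halfway between the class means."""
    t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > t else 0

# Imbalanced 1-D data: few positives (high values), many negatives (low).
pos = [5.0, 5.2, 4.9]
neg = [0.1 * i for i in range(30)]

members = [fit_threshold(p, n) for p, n in balanced_subsets(pos, neg, k=5)]

def predict(x):
    votes = [m(x) for m in members]
    p = sum(votes) / len(votes)      # fused probability across the ensemble
    uncertainty = p * (1 - p)        # high when members disagree
    return p, uncertainty

assert predict(5.1) == (1.0, 0.0)    # confident positive, members agree
assert predict(0.0)[0] == 0.0        # confident negative
```

Each member sees a balanced view of the data, so no single decision boundary is dominated by the majority class, and the fused vote is better calibrated than any one member.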
  • Item (Open Access)
    Towards Maintainable and Explainable AI Systems with Dataflow
    Paleyes, Andrei
    Machine learning is enjoying rapid growth both as a thriving academic discipline and as a technology that has the potential to transform many aspects of our everyday lives. We have already witnessed breakthroughs in speech generation, drug discovery, recommendation algorithms, and more, all achieved with the help of machine learning. It is vital to realise that any practical application of machine learning is not limited to just creating an accurate model based on a sanitised dataset. Such real-life applications are complex software systems, in which the model is only one, albeit important, component. A significant effort is also spent on creating data collection and cleaning pipelines, quality assurance, model updating workflows, monitoring and operational maintenance of these systems. The experience of numerous practitioners shows that the translation of a well-performing machine learning model to a well-performing machine learning system is not easy. This thesis embarks on a quest to understand the pain points of this translation process and explore software architecture paradigms well suited for the needs of modern data-driven systems. We begin by surveying existing reports on ML deployment and the difficulties they describe. The identified issues and concerns are matched against a typical ML deployment workflow, and we show that there is no single bottleneck, and the entire deployment pipeline is riddled with challenges. We argue that a lot of these challenges are caused by existing software infrastructure and that more data-oriented approaches to software architecture are needed to tackle them. This observation leads us to the second contribution of this thesis, in which we examine data-oriented architecture (DOA) as a promising software architecture paradigm that machine learning systems can benefit from. 
We focus on measuring the level of adoption of DOA in practical deployments of machine learning and show that even though the paradigm itself is relatively unknown, its principles widely permeate the modern engineering of ML systems. Specifically, we identify dataflow architecture as one of the patterns that realise all DOA principles. We proceed to evaluate the benefits of the dataflow for the deployment of machine learning. The evaluation is presented in two parts. In the first part, we compare the process of deploying an ML model within the functionally equivalent codebases of applications implemented with dataflow and service-oriented approaches, the latter being used as a baseline. We identify some benefits of dataflow, such as higher discoverability and simpler data collection in the system. We also identify the limitations of the paradigm. We then present Seldon Core v2, an open-source model inference platform we designed following the dataflow architecture. We present a detailed discussion on how DOA principles can be implemented in practice, discuss the data observability features of the platform, and quantify the performance trade-offs involved. The last contribution of the thesis points out another benefit of dataflow architecture for software development: a strong relationship between dataflow software and graphical causal models. We identify a connection between dataflow graphs and causal graphs and argue that this relationship allows a straightforward application of causal inference to dataflow software. We use fault localisation as a concrete example of this idea and showcase it in a variety of dataflow systems and scenarios. The thesis closes with a discussion on research avenues that can further develop the community's understanding and adoption of Data-Oriented Architectures and dataflow for machine learning systems.
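The fault-localisation idea can be shown in miniature: because a dataflow graph is a DAG of pure functions, it can be read as a causal graph, and a fault localised by replaying a reference run and finding the first node whose output deviates. The node names and the injected fault below are invented for illustration:

```python
# Hedged toy of the idea that a dataflow graph doubles as a causal graph:
# each node is a pure function of its parents, so walking the graph in
# topological order and flagging the first deviation from a reference run
# points at the faulty node (its inputs still agreed with the reference).

def run(graph, order, funcs, source):
    values = {"src": source}
    for node in order:
        parents = graph[node]
        values[node] = funcs[node](*(values[p] for p in parents))
    return values

graph = {"clean": ["src"], "scale": ["clean"], "sum": ["scale"]}
order = ["clean", "scale", "sum"]

good = {"clean": lambda xs: [x for x in xs if x is not None],
        "scale": lambda xs: [2 * x for x in xs],
        "sum":   lambda xs: sum(xs)}
bad = dict(good, scale=lambda xs: [3 * x for x in xs])  # injected fault

def localise(reference, faulty):
    for node in order:
        if faulty[node] != reference[node]:
            return node

ref = run(graph, order, good, [1, None, 2])
obs = run(graph, order, bad, [1, None, 2])
assert localise(ref, obs) == "scale"
```

In a service-oriented system the same diagnosis requires instrumenting call sites; in a dataflow system the intermediate values are already first-class and observable, which is the data-observability benefit discussed above.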
  • Item (Open Access)
    Non-parametric modelling of signals on graphs
    Opolka, Felix
    Graphs are simple yet powerful data structures that describe entities and the relationships between them using nodes and edges, making them popular candidates for modelling a wide variety of real-world objects, ranging from molecules to social or biological networks. As a result of their suitability for various modelling scenarios, machine learning on graph-shaped data has emerged as an important field of research in the last few years. While powerful when coupled with machine learning models, graphs pose unique challenges to those models, which need to be able to adapt not only to highly diverse data but also to a highly diverse graph domain that may vary in size, connectivity patterns, and its interaction with node features, to name a few characteristics. In this work, I hypothesise that Gaussian processes, a class of Bayesian non-parametric models, are particularly well suited for modelling data on graph domains. To provide evidence for this hypothesis, I demonstrate the merits of Bayesian non-parametric modelling for graph data by deriving Gaussian process models for three of the most important tasks in graph machine learning: link prediction, graph-level prediction, and node-level prediction. The resulting models exhibit a number of strengths, including good model fit and robustness against overfitting due to their non-parametric nature, in addition to well-calibrated uncertainty estimates. Moreover, the capability of Gaussian processes to optimise hyper-parameters allows designing models that adapt to a graph's particular characteristics, such as the smoothness and multi-scale structure of a graph signal or the locality of features. These strengths of the proposed models, and in particular their competitive performance compared to a range of baseline models, are confirmed in extensive experiments on a wide range of real-world data sets.
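One way to make the idea concrete is a Gaussian process whose covariance over nodes is a graph diffusion (heat) kernel exp(-tL) built from the Laplacian L. The toy below (a 3-node path graph, a truncated matrix exponential, a closed-form 2x2 inverse) is an illustration of a GP smoothing a signal along edges, not one of the thesis's models:

```python
# Toy graph GP: covariance between nodes is the heat kernel K = exp(-t*L)
# of the graph Laplacian L, so the GP posterior smooths observed node
# values along edges.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def heat_kernel(L, t=0.5, terms=25):
    """Truncated power series for the matrix exponential exp(-t L)."""
    n = len(L)
    K = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in K]
    for k in range(1, terms):
        term = matmul(term, [[-t * L[i][j] / k for j in range(n)]
                             for i in range(n)])
        K = [[K[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return K

# Path graph 0 - 1 - 2: Laplacian L = D - A.
L = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
K = heat_kernel(L)

def posterior_mean(K, observed, y, target):
    """GP posterior mean at `target` given noise-free values y at the two
    `observed` nodes (2x2 covariance inverted in closed form)."""
    i, j = observed
    a, b, c, d = K[i][i], K[i][j], K[j][i], K[j][j]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    k_star = [K[target][i], K[target][j]]
    alpha = [inv[0][0] * y[0] + inv[0][1] * y[1],
             inv[1][0] * y[0] + inv[1][1] * y[1]]
    return k_star[0] * alpha[0] + k_star[1] * alpha[1]

# Symmetric observations cancel at the middle node; equal ones interpolate.
assert abs(posterior_mean(K, (0, 2), [1.0, -1.0], target=1)) < 1e-9
assert posterior_mean(K, (0, 2), [1.0, 1.0], target=1) > 0.5
```

The diffusion time t plays the role of a smoothness hyper-parameter; in a full model it would be optimised against the marginal likelihood, which is the adaptation mechanism the abstract refers to.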
  • Item (Open Access)
    On the evaluation and application of neural language models for grammatical error detection
    Davis, Christopher [0000-0003-4517-5851]
    Neural language models (NLMs) have become a core component in many downstream applications within the field of natural language processing, including the task of data-driven automatic grammatical error detection (GED). This thesis explores whether information from NLMs can positively transfer to GED within the domain of learning English as a second language (ESL), and looks at whether NLMs encode and make use of linguistic signals that would facilitate robust and generalisable GED performance. First, I investigate whether information from different types of neural language model can be transferred to models for GED. I evaluate five models against three publicly available ESL benchmarks, and report results showing positive transfer effects to the extent that fine-grained error detection using a single model is becoming viable. Second, I carry out a causal investigation to understand whether NLM-GED models make use of robust linguistic signals during inference – in theory, this would enable them to generalise across different data distributions. The results show a high degree of linear encoding of noun-number within each model's token-level contextual representations, but they also show markedly varying error detection performance across model types and across in- and out-of-domain datasets. Altogether, the results indicate models employ different strategies for error detection. Third, I re-frame the typically downstream GED task as an evaluation framework to test whether pre-trained NLMs implicitly encode information about grammatical errors as an artefact of their language modelling objective. I present results illustrating stark differences between masked language models and autoregressive language models – while the former seemingly encode much more information related to the detection of grammatical errors, the results also present evidence of a brittle encoding across different syntactic constructions.
Altogether, this thesis presents a holistic analysis of NLMs – how they might be applied to GED, whether they utilise linguistic information to enable robust inference, and whether their pre-training objective implicitly imbues them with knowledge about grammaticality.
  • Item (Open Access)
    Deception and defense from machine learning to supply chains
    Boucher, Nicholas
    Broad classes of modern cyberattacks are dependent upon their ability to deceive human victims. Given the ubiquity of text across modern computational systems, we present and analyze a set of techniques that attack the encoding of text to produce deceptive inputs to critical systems. By targeting a core building block of modern systems, we can adversarially manipulate dependent applications ranging from natural language processing pipelines to search engines to code compilers. Left undefended, these vulnerabilities enable many ill effects including uncurtailed online hate speech, disinformation campaigns, and software supply chain attacks. We begin by generating adversarial examples for text-based machine learning systems. Due to the discrete nature of text, adversarial examples for text pipelines have traditionally involved conspicuous perturbations compared to the subtle changes of the more continuous visual and auditory domains. Instead, we propose imperceptible perturbations: techniques that manipulate text encodings without affecting the text in its rendered form. We use these techniques to craft the first set of adversarial examples for text-based machine learning systems that are human-indistinguishable from their unperturbed form, and demonstrate their efficacy against systems ranging from machine translation to toxic content detection. We also describe a set of defenses against these techniques. Next, we propose a new attack setting which we call adversarial search. In this setting, an adversary seeks to manipulate the results of search engines to surface certain results only and consistently when a hidden trigger is detected. We accomplish this by applying the encoding techniques of imperceptible perturbations to both indexed content and queries in major search engines. 
We demonstrate that imperceptibly encoded triggers can be used to manipulate the results of current commercial search engines, and then describe a social engineering attack exploiting this vulnerability that can be used to power disinformation campaigns. Again, we describe a set of defenses against these techniques. We then look to compilers and propose a different set of text perturbations which can be used to craft deceptive source code. We exploit the bidirectional nature of modern text standards to embed directionality control characters into comments and string literals. These control characters allow attackers to shuffle the sequence of tokens rendered in source code, and in doing so to implement programs that appear to do one thing when rendered to human code reviewers, but to do something different from the perspective of the compiler. We dub this technique the Trojan Source attack, and demonstrate the vulnerability of C, C++, C#, JavaScript, Java, Rust, Go, Python, SQL, Bash, Assembly, and Solidity. We also explore the applicability of this attack technique to launching supply chain attacks, and propose defenses that can be used to mitigate this risk. We also describe and analyze a 99-day coordinated disclosure that yielded patches to dozens of market-leading compilers, code editors, and code repositories. Finally, we propose a novel method of identifying software supply chain attacks that works not only for Trojan Source attacks, but for most forms of supply chain attacks. We describe an extension to compilers dubbed the Automated Bill of Materials, or ABOM, which embeds dependency metadata into compiled binaries. Specifically, hashes of each source code file consumed by a compiler are embedded into its emitted binary, and these hashes are included recursively into all downstream dependencies. 
They are stored in a highly space- and time-efficient probabilistic data structure that requires an expected value of just 2.1 bytes to represent each unique dependency source code file. With ABOMs, it becomes possible to detect all naturally occurring and most adversarially induced vulnerabilities used for supply chain attacks in downstream software by querying binaries for the presence of poisoned dependencies, without the need to locate tangible indicators of compromise. In this thesis, we therefore demonstrate how weaknesses in a core building block of modern systems – text encodings – can cause failures in a wide range of domains including machine learning, search engines, and source code. We propose defenses against each variant of our attack, including a new tool to identify most generic software supply chain attacks. We believe that these techniques will be useful in securing software ecosystems against the next generation of attacks.
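The Trojan Source mechanism rests on a small, checkable fact about Unicode: directionality control characters change the rendered order of text without changing the logical order a compiler or interpreter sees. A minimal sketch, including the kind of bidi-control scan a defence might apply:

```python
# Bidi-reordering trick behind Trojan-Source-style attacks: Unicode
# directionality controls make the *rendered* character order differ from
# the *logical* order that tooling actually processes.
RLO, PDF = "\u202e", "\u202c"  # Right-to-Left Override / Pop Directional Fmt

deceptive = "user" + RLO + "nimda" + PDF
# Many renderers display the overridden run reversed, so a reviewer may
# read something like "useradmin" - but logically the string is:
assert deceptive == "user\u202enimda\u202c"
assert "admin" not in deceptive   # substring checks see the logical order

# A simple defence in the spirit of those proposed for Trojan Source:
# flag any source text containing bidi control characters.
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

def contains_bidi(text):
    return any(ch in BIDI_CONTROLS for ch in text)

assert contains_bidi(deceptive)
assert not contains_bidi("plain ascii source")
```

Compilers and code hosts patched in the disclosure described above adopted variants of exactly this check: warn on, or reject, unbalanced or unexpected directional formatting characters in source files.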
  • Item (Open Access)
    On the Optimality of the Lexicon
    Pimentel Martins Da Silva, Tiago
    The principle of least effort posits that a pressure towards communicative efficiency shapes natural languages. In this thesis, we investigate the existence, the nature, and the impact of such a pressure in natural languages' lexicons. We investigate the existence of this pressure by (i) estimating what optimal word lengths would be using coding theory, (ii) proposing pressure-free baselines and estimating their word lengths, and then (iii) comparing natural lexicons to both these optimal and pressure-free artificial lexicons. We investigate the nature of this pressure by comparing multiple ways in which communicative efficiency can be operationalised; we formalise it as either a pressure to shorten utterances, or a pressure to keep information rates as close as possible to an unknown communication channel capacity. Finally, we study the impact of this pressure on cross-linguistic differences in word lengths and on the ratio of homophones in natural languages. Overall, our results support a Zipfian view of communicative efficiency, in which lexicons are pressured towards having utterances that are as short as possible. Our results, however, also highlight the existence of competing constraints and pressures in how lexicons are structured: (i) a language's phonotactic complexity seems to bottleneck the extent to which economy of expression can optimise a lexicon, and (ii) a pressure for clarity seems to keep the ratio of homophones in a language close to chance.
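The coding-theoretic step (i) can be illustrated directly: a source-coding argument assigns a word of probability p an optimal length of about -log_b(p) symbols over a b-letter alphabet, so frequent words should be short. The frequencies below are invented for illustration:

```python
import math

# Sketch of the coding-theoretic bound behind step (i): under a
# source-coding view, a word with probability p needs about -log_b(p)
# symbols over an alphabet of size b, so frequent words get short forms.
def optimal_length(p, alphabet_size=26):
    return math.ceil(-math.log(p, alphabet_size))

# Made-up relative frequencies, for illustration only.
freqs = {"the": 0.40, "of": 0.30, "language": 0.002, "phonotactics": 0.0005}
total = sum(freqs.values())
lengths = {w: optimal_length(f / total) for w, f in freqs.items()}

# Frequent words receive shorter optimal lengths than rare ones - the
# Zipfian pattern the thesis compares natural lexicons against.
assert lengths["the"] <= lengths["of"] < lengths["language"] <= lengths["phonotactics"]
```

Real lexicons sit between this optimum and the pressure-free baselines described above; phonotactic constraints shrink the usable "alphabet" and so lengthen the achievable codes.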
  • Item (Open Access)
    Self-supervised learning for data-efficient human activity recognition
    Tang, Chi Ian
    Over the last decade, smart mobile devices have become ubiquitous, bringing about significant lifestyle changes worldwide. Mobile sensing, which involves obtaining and analysing data from mobile devices and the environment, has emerged as an active research area. It captures the unique opportunity for mobile devices to offer insight into user behaviours. Within mobile sensing, human activity recognition is a fundamental task that aims to identify users' physical actions. Motivated by advancements in deep learning, human activity recognition research has also widely adopted these methods. However, compared to other data modalities, human activity recognition models struggle with the limited availability of labels, due to the difficulty of ground-truth collection. These models often fail to generalise across different users, devices, and changing data distributions. This thesis tackles these challenges by developing and evaluating novel training paradigms. Our proposed paradigms leverage data from additional sources, including other devices and readily available unlabelled data that can be collected easily and often passively, to provide supervision for deep learning, enabling human activity recognition models to be more data-efficient. First, we proposed a new semi-supervised training pipeline that combines self-supervised learning and knowledge distillation to effectively leverage large-scale unlabelled datasets for human activity recognition. This helps models generalise better across different users by increasing the diversity of data that the model is trained on through augmentation and unlabelled data. Next, we designed a collaborative self-supervised learning technique that leverages unlabelled data from multiple devices carried by a user. This method is inspired by the insight that data from multiple devices capture the same physical activity from different viewpoints. 
A contrastive learning setup, which encourages the representations of samples from different devices to be similar, is used to extract high-quality features from the data. Finally, we developed continual learning methods motivated by observations that user behaviour often shifts over time due to lifestyle changes. These methods help models better adapt to changing data distributions and learn from new data. We first proposed a multi-task training method that gives models greater flexibility in adapting to new tasks. Then, we developed a continual learning strategy that balances retaining prior knowledge and learning from new data. This strategy uses self-supervised learning for knowledge retention and a carefully designed loss function to balance the different learning objectives. Through extensive evaluation on open datasets, the training paradigms proposed in this thesis provide evidence for and contribute to the development of data-efficient human activity recognition systems that leverage readily available data through self-supervised learning.
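The multi-device contrastive setup can be sketched with a plain InfoNCE-style loss, where embeddings of the same activity window from two devices form the positive pair and other windows in the batch act as negatives. Everything below (the embeddings, the temperature) is an invented toy, not the thesis's architecture:

```python
import math

# Minimal InfoNCE-style sketch of multi-device contrastive learning:
# phone[i] and watch[i] embed the *same* activity window from two devices
# (a positive pair); other batch items are negatives. The loss is low when
# positives are more similar than negatives.

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def info_nce(phone, watch, temperature=0.1):
    loss = 0.0
    for i in range(len(phone)):
        sims = [math.exp(cos(phone[i], watch[j]) / temperature)
                for j in range(len(watch))]
        loss += -math.log(sims[i] / sum(sims))
    return loss / len(phone)

# Aligned views (same window looks similar across devices) score a much
# lower loss than deliberately mismatched views.
aligned    = [[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]
misaligned = [[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]
assert info_nce(*aligned) < info_nce(*misaligned)
```

No activity labels appear anywhere in the loss: the pairing of simultaneous windows across devices is itself the supervision, which is what makes the approach data-efficient.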
  • Item (Open Access)
    Formally verifying the security properties of a proof-of-stake blockchain protocol
    Worrasangasilpa, Kawin
    In 2009, Bitcoin was brought into existence, as the first real-world application of blockchain protocols, by its pseudonymous and mysterious creator, Satoshi Nakamoto. Presented as a form of cryptocurrency, it has become widely known and is recognised as the first successful e-cash since the introduction of the e-cash idea in 1983. Many more cryptocurrencies, including Ethereum and Tether, have emerged following the success of Bitcoin. Bitcoin is secure, meaning that it satisfies persistence and liveness: without the need for a trusted third party, it appears to prevent double-spending attacks. Bitcoin's security is obtained by the use of cryptographically secure chains of blocks for time-stamping (hence the name 'blockchain') and a technique, often called Nakamoto consensus, combining the longest-chain rule with Proof-of-Work (PoW). Briefly, it allows and encourages all parties to participate in picking the longest chain in the system and solving a cryptographically difficult puzzle to declare the next block of transactions for that chain; PoW is sometimes referred to as a lottery system. PoW requires a majority of the computational power to be honest, and it consumes a gigantic amount of energy, so it is not scalable. In light of this problem, Proof-of-Stake (PoS) was suggested to replace PoW. PoS-based blockchain protocols, instead of using computational power, use in-system currency to agree on a new block to be added, but keep mostly everything else the same. Kiayias et al. were the first to propose a provably secure PoS-based blockchain protocol: Ouroboros. Ouroboros' security guarantees, persistence and liveness, can be verified by proving that Ouroboros satisfies the three elementary properties for blockchains proposed by Garay et al.: common prefix, chain growth, and chain quality. In this project, we attempt to formalise, in Isabelle/HOL, the combinatorial analysis used to prove that Ouroboros satisfies common prefix with near certainty.
We cover the case of a static stake protocol under a few assumptions: the network is synchronous, the majority of the stake is honest, and stake transfer between executions does not affect the lottery system.
  • Item (Open Access)
    Evaluating Natural Language Generation Tasks for Grammaticality, Faithfulness and Diversity
    Xie, Huiyuan
    Natural language generation (NLG) plays a vital role in many applications. Evaluating the quality of generated text is crucial for ensuring the effectiveness and user satisfaction of NLG systems. With the popularisation of deep learning in recent years, many models have been reported to achieve super-human performance on popular benchmarks. However, it has been observed that existing holistic benchmarks and evaluation metrics frequently fail to accurately assess specific evaluation factors that are of interest to the field. This thesis explores a diagnostic evaluation framework for assessing the grammaticality, faithfulness, and diversity (GFD) of generated text in NLG tasks. These three metrics are considered essential linguistic qualities which need to be present in the outputs of NLG models. Grammaticality is examined by analysing the parsability of a sentence with a well-defined formal grammar. Faithfulness is divided into two facets: grounding faithfulness and task faithfulness. These two facets investigate how well the model outputs align with both the information provided in the input and the inherent requirements of the task. Diversity is further divided into word-level and parse-level diversity measures. In the proposed GFD framework, the evaluation of the three metrics does not require task-specific references to be constructed. By clearly defining and evaluating these generation qualities, this framework aims to provide insights into the strengths and limitations of NLG models. To demonstrate the versatility of the GFD evaluation framework, three different generation tasks are explored: synthetic image captioning, football highlight generation from match statistics, and topic-shift dialogue generation. These tasks are deliberately chosen to cover a diverse range of generation scenarios.
Each task provides unique grounding information and constraints that influence the generation process, which in turn create diverse challenges for the evaluation of NLG models. Experiments on these tasks reveal the challenges in fine-grained NLG evaluation when the availability of ground truth representations diminishes or when there is a delicate balance between input groundings and task constraints. This thesis empirically demonstrates how the GFD evaluation framework, in combination with diagnostic datasets, can provide insights into model strengths and limitations to supplement standard evaluations.
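The grammaticality-as-parsability check can be made concrete with a toy context-free grammar in Chomsky normal form and CYK membership testing. The grammar and lexicon below are invented for illustration and are far smaller than the well-defined formal grammar such a framework assumes:

```python
# Toy "grammaticality = parsability" check: a sentence counts as
# grammatical if a formal grammar can derive it. CYK chart parsing over a
# tiny CNF grammar (invented for illustration) decides membership.
GRAMMAR = {            # CNF binary rules: (RHS1, RHS2) -> LHS
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def parsable(words):
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        if w not in LEXICON:          # unknown word: not derivable
            return False
        chart[i][i + 1].add(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a in chart[i][k]:
                    for b in chart[k][j]:
                        if (a, b) in GRAMMAR:
                            chart[i][j].add(GRAMMAR[(a, b)])
    return "S" in chart[0][n]

assert parsable("the dog saw the cat".split())
assert not parsable("dog the saw cat the".split())
```

Because the judgement comes from the grammar rather than from references, the same check applies to any system output, which is what makes the metric reference-free.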
  • Item (Open Access)
    Distributional and relational inductive biases for graph representation learning in biomedicine
    Scherer, Paul [0000-0002-2240-7501]
    The immense complexity with which DNAs, RNAs, proteins and other biomolecules interact amongst themselves, with one another, and with the environment to bring about life processes motivates the mass collection of biomolecular data and data-driven modelling to gain insights into physiological phenomena. Recent predictive modelling efforts have focused on deep representation learning methods, which offer a flexible modelling paradigm for handling high-dimensional data at scale and incorporating inductive biases. The emerging field of representation learning on graph-structured data opens opportunities to leverage the abundance of structured biomedical knowledge and data to improve model performance. Grand international initiatives have been coordinated to organise and structure our growing knowledge about the interactions and putative functions of biomolecular entities using graphs and networks. This dissertation considers how we may use the inductive biases within recent graph representation learning methods to leverage these structures and incorporate biologically relevant relational priors into machine learning methods for biomedicine. We present contributions in two parts, with the aim of fostering research in this multidisciplinary domain, and present novel methods that achieve strong performance through the use of distributional and relational inductive biases operating on graph-structured biomedical knowledge and data. The first part is concerned with consolidating and expanding the current ecosystem of practical frameworks dedicated to graph representation learning. Our first contribution presents Geo2DR, the first practical framework and software library for constructing methods capable of learning distributed representations of graphs.
Our second contribution, Pytorch Geometric Temporal, is the first open source representation learning library for dynamic graphs, expanding the scope of research software on graph neural networks that were previously limited to static graphs. The second part presents three methods wherein each contribution tackles an active biomedical research problem using relational structures that exist within different aspects of the data. First we present a methodology for learning distributed representations of molecular graphs in the context of drug pair scoring. Next, we present a method for leveraging structured knowledge on the variables of gene expression profiles to automatically construct sparse neural models for cancer subtyping. Finally, we present a state-of-the-art cell deconvolution model for spatial transcriptomics data using the positional relationships between observations in the dataset.
  • Item (Open Access)
    Computational criminology: at-scale quantitative analysis of the evolution of cybercrime forums
    Hughes, Jack [0000-0002-0730-1055]
    Cybercrime forums and marketplaces are used by members to share hacking techniques, hold general community-building discussions, and trade hacking tools. While there is a large corpus of literature studying these platforms, ranging from cross-forum ecosystem comparisons to smaller qualitative analyses of specific crime types within a single forum, there has been little research studying them over time. Using the CrimeBB dataset from the Cambridge Cybercrime Centre, the first contribution of the thesis explores the evolution of a large cybercrime forum, from growth to gradual decline from peak activity, with research questions grounded in the digital drift framework from criminological theory. This finds a trend towards financially-driven cybercrime over time, both by individual users and by the forum as a whole. The second contribution of the thesis presents a method for detecting trending terms, using a lightweight natural language processing method to handle queries, given the size of the dataset. Evaluation against manual annotations showed that it detected more relevant salient terms than TF-IDF. Finally, the third contribution of the thesis applies signalling theory to analyse the usage of argot (jargon and slang) on the forum, finding a negative correlation with reputation usage, and uses clustering to find a decreasing use of argot over time. Part of this contribution is a lightweight argot detection pipeline based on word embeddings, aligned with manual annotations. Overall, the combination of approaches, with criminological theory driving research directions, natural language processing to analyse forum text data, machine learning for classification, and data science techniques, provides a unique interdisciplinary perspective within the field of cybercrime community research, both drawing insights into these communities and contributing novel tools for measuring large, noisy text data.
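The thesis' trending-term detector is not specified in this abstract; as a hedged illustration of the general idea, the sketch below ranks terms by their frequency growth between two time windows (function names and thresholds are hypothetical):

```python
from collections import Counter

def trending_terms(previous_posts, current_posts, min_count=2):
    """Rank terms by relative frequency growth between two time windows.
    An illustrative baseline only, not the thesis' actual detector."""
    prev = Counter(w for post in previous_posts for w in post.lower().split())
    curr = Counter(w for post in current_posts for w in post.lower().split())
    scores = {}
    for term, count in curr.items():
        if count < min_count:
            continue  # drop rare terms to reduce noise
        # Additive smoothing avoids division by zero for unseen terms.
        scores[term] = count / (prev[term] + 1)
    return sorted(scores, key=scores.get, reverse=True)

prev_window = ["selling accounts cheap", "free accounts here"]
curr_window = ["new stealer malware", "stealer logs for sale", "stealer setup guide"]
ranked = trending_terms(prev_window, curr_window)
```

A growth-ratio detector like this is cheap enough to run repeatedly over a multi-million-post corpus, which is the constraint the abstract highlights.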
  • ItemOpen Access
    Strong metadata privacy for mobile devices and applications
    Hugenroth, Daniel [0000-0003-3413-1722]
    Smartphones have become the primary computing devices for many. Living inconspicuously in our pockets, they store our most intimate personal messages and pictures as well as sensitive corporate information and government secrets. This has already motivated widespread adoption of end-to-end encryption for mobile messaging applications, such as WhatsApp and Signal, which protect the confidentiality of messages. However, metadata, such as who has been messaging whom and when, can still be observed by platform operators, local internet providers, and other adversaries tapping into network traffic. This dissertation presents protocols and applications for mobile devices that protect not only the content of messages but also communication patterns. Anonymity networks provide metadata privacy, but the most popular ones, like Tor, remain vulnerable to traffic analysis, while strong alternatives, like Loopix, use cover traffic at the expense of higher bandwidth and latency. In this context, smartphones raise two important challenges: battery constraints dictate conservative power usage, and connectivity is often intermittent. To better understand power consumption on modern smartphones, we run experiments on real hardware and find that cryptographic operations are cheap while radio transmission can be costly. In particular, popular solutions such as VPNs and Tor are practical, with negligible impact on battery life. However, more secure designs using cover traffic are impractical, which highlights the need for protocol design that takes energy limitations into account. The latency and bandwidth requirements of protocols with strong metadata privacy are particularly challenging when sending messages to many recipients---especially on mobile devices, where users are often offline. We design Rollercoaster, a multicast scheme for mix networks which incorporates these constraints and allows better utilisation of the underlying network for sporadic group communication.
This enables decentralised applications such as group messaging and collaborative text editing while retaining efficient mix parameters. Finally, we present CoverDrop, a practical system for initial contact between whistleblowers and journalists. CoverDrop integrates into a standard news reader app such that all its users contribute cover traffic to achieve unobservable communication for sources while having negligible impact on battery life. In addition, we implement plausibly-deniable storage to keep previous usage of CoverDrop secret even if the phone is captured by an adversary. To achieve this, our key stretching scheme, called Sloth, uses the Secure Element found in many modern smartphones, preventing the adversary from parallelising brute-force attacks and therefore allowing for shorter, more memorable passphrases.
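The cover-traffic designs discussed above (Loopix and its descendants) make a client's sending pattern independent of its real message queue. A minimal sketch of that idea, with hypothetical names and parameters:

```python
import random

def poisson_send_loop(queue, rate_per_sec, duration_sec, seed=0):
    """Loopix-style sender sketch: emissions follow a Poisson process and
    each slot carries a real message if one is queued, else a dummy, so
    the network observes the same traffic pattern either way."""
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)  # exponential inter-send gap
        if t > duration_sec:
            break
        payload = queue.pop(0) if queue else "DUMMY"
        events.append((t, payload))
    return events

events = poisson_send_loop(["hello", "world"], rate_per_sec=2.0, duration_sec=10.0)
```

The dissertation's battery measurements matter precisely because such a loop keeps the radio busy even when no real messages are waiting.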
  • ItemOpen Access
    Eavesdropping risks of the DisplayPort video interface
    Erdeljan, Dimitrije [0009-0000-1863-5221]
    The switching activity of digital circuits unintentionally generates electromagnetic signals, which may leak sensitive information processed by the device to nearby radio receivers. This problem, known as *compromising emanations* or *TEMPEST*, has been demonstrated for computer displays using analog video interfaces (VGA) and older digital interfaces (LVDS, HDMI, DVI). DisplayPort is a newer interface with a significantly more complex signal structure, and in particular uses a linear-feedback shift register to scramble the transmitted pixel data. Due to scrambling, images produced by applying previously published eavesdropping techniques to DisplayPort appear as random noise, and the interface is thought to be a far more difficult target. I start by showing that DisplayPort is vulnerable to electromagnetic eavesdropping, assuming that the displayed image mainly consists of a small set of colours. The attack begins by recovering scrambler timing parameters and synthesising a replica of the scrambler synchronised with the target. This replica is then used to build templates for each of the expected colours, and to identify pixel colours from short-term cross-correlation between the received signal and templates. The two main limitations of this initial attack are the limited accuracy of the reset-timing model and the requirement that the attacker already knows which colours are present in the image. I address the former by designing a scrambler tracking algorithm based on a phase-locked loop that keeps the local replica closely synchronised with the target. For the latter, I exploit several properties of the 8b/10b encoding used together with this accurate scrambler alignment to efficiently enumerate colours and produce a list of candidate colours likely to be present in the image. Finally, I extend the tracking algorithm to also align signal phase across frames, which enables coherent periodic averaging of template correlations. 
This averaging technique further improves the signal-to-noise ratio in the reconstructed image and thus increases eavesdropping range. Accurate time alignment additionally improves horizontal resolution over that achieved using the simpler timing model. I demonstrate that the algorithms developed in this thesis can be used to recover clearly readable text from 8 m distance in realistic circumstances, even using a software-defined radio receiver with a bandwidth that is an order of magnitude lower than the bitrate used in the DisplayPort video link.
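The scrambler-replica attack hinges on the fact that scrambling is an XOR with the keystream of a linear-feedback shift register, so a synchronised replica also descrambles. The sketch below uses the polynomial x^16 + x^5 + x^4 + x^3 + 1 and an all-ones reset state, the values commonly cited for the DisplayPort scrambler; treat the exact bit-ordering conventions as assumptions:

```python
def lfsr_stream(nbytes, state=0xFFFF, taps=(16, 5, 4, 3)):
    """Bit-serial LFSR keystream. Polynomial x^16 + x^5 + x^4 + x^3 + 1
    and the all-ones reset state are the values commonly cited for the
    DisplayPort scrambler; the output-bit convention is an assumption."""
    out = []
    for _ in range(nbytes):
        byte = 0
        for _ in range(8):
            fb = 0
            for t in taps:
                fb ^= (state >> (t - 1)) & 1  # XOR of the tap positions
            state = ((state << 1) | fb) & 0xFFFF
            byte = (byte << 1) | fb
        out.append(byte)
    return bytes(out)

def scramble(data, **kw):
    """XOR with the keystream; applying it twice restores the input,
    which is why a synchronised replica of the scrambler descrambles."""
    return bytes(d ^ k for d, k in zip(data, lfsr_stream(len(data), **kw)))

pixels = bytes([0x00, 0xFF, 0x80, 0x80])
assert scramble(scramble(pixels)) == pixels  # replica inverts the target
```

An eavesdropper who recovers the reset timing can regenerate this keystream locally, turning the "random noise" back into colour-dependent templates.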
  • ItemOpen Access
    Argument mining with informal text
    Ye, Yuxiao
    The rapid growth of online discussions has led to a wealth of user-generated text, rich in arguments and diverse in nature. However, the complexity of these informal arguments presents a challenge for argument mining. Argument mining is the task of automatically analysing arguments, such that the unstructured information contained in them is converted into structured representations. Current practice in argument mining largely focuses on well-structured and edited formal text, but the annotation schemes and models developed for these simpler texts cannot account well for the phenomena found in informal text. To capture the characteristics of informal arguments, I designed an annotation scheme which includes undercuts, a counterargument device that challenges the relationship between a premise and a claim. Other computational approaches conflate undercuts with direct attacks, a device where the truth of the claim or the premise itself is challenged. I also presented the resulting large-scale Quora dataset featuring informal arguments, complemented by a layer of annotation detailing complete argument structures. I then proposed an end-to-end approach to argument mining based on dependency parsing. My approach uses new dependency representations for arguments and two new neural dependency parsers, one based on biaffine parsing and the other on graph neural networks (GNNs). It comfortably beats a strong baseline on the Quora dataset. When applied to an existing benchmark dataset of formal arguments, my approach establishes a new state of the art. It is also the first automatic argument mining approach that is able to recognise undercuts. Furthermore, I conducted a study on integrating external knowledge, such as information from syntax, discourse, knowledge graphs, and large language models, into end-to-end argument mining. I found that feature-based integration using GPT-3.5 is the most effective method among those I have surveyed. 
Overall, I hope that my work, by providing automatic analyses of arguments in online discussions, will eventually foster better understanding among people with different opinions.
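Biaffine arc scoring, as used in one of the parsers above, rates every (dependent, head) pair with a bilinear form. A minimal numpy sketch (dimensions are arbitrary, and the linear and bias terms of the full parser are omitted):

```python
import numpy as np

def biaffine_scores(heads, deps, U):
    """Biaffine arc scoring: scores[i, j] rates token j as the head of
    token i via the bilinear form d_i^T U h_j. A simplification of the
    full biaffine parser, which adds linear and bias terms."""
    return deps @ U @ heads.T

rng = np.random.default_rng(0)
n_tokens, dim = 5, 8
h = rng.normal(size=(n_tokens, dim))  # head representations
d = rng.normal(size=(n_tokens, dim))  # dependent representations
U = rng.normal(size=(dim, dim))
scores = biaffine_scores(h, d, U)
# Greedy decoding: each dependent attaches to its best-scoring head.
pred_heads = scores.argmax(axis=1)
```

Casting argument mining as dependency parsing lets attachment decisions like these recover premise-claim links, including undercuts, in one pass.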
  • ItemOpen Access
    Coding for emerging archival storage media
    Sella, Omer [0000-0002-2795-8580]
    The race between generating digital data and storing it prompted a search for new media to hold our data for centuries, with fused silica and DNA in the lead. These media are in a rapid stage of research and development. Error Correcting Codes and coding schemes must be designed for these emerging media's constraints and noise characteristics, similar to the large body of work on coding for communication applications. Unlike communication standards, digital data storage, primarily archival, can and should capitalise on longer block sizes and more complex coding. Longer blocks have the potential to reduce coding overhead and therefore cost, while longer retrieval latency allows for more complex algorithms. This cycle of noise characterisation and code design for storage media could be made more efficient by automation and generalisation. In this work, we present the use of Reinforcement Learning to construct long Error Correcting Codes. We show that Reinforcement Learning is effective when targeting the end goal of reducing Bit Error Rate rather than the proxy metrics used in state-of-the-art heuristics. In addition, we present a unified approach to handling constraints in coding data into DNA. Together these provide a practical toolbox that would allow co-design of a storage medium and its accompanying coding scheme. Finally, we show that our toolbox requires little human expert intervention, which facilitates designing coding schemes in lockstep with the media's rapid development.
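Targeting Bit Error Rate directly means the reward signal is itself a channel simulation. As a hedged, illustrative stand-in for such a signal, the sketch below estimates the BER of a simple repetition code over a binary symmetric channel (the thesis constructs far longer and more sophisticated codes):

```python
import random

def ber_repetition_code(p_flip, n_rep=3, n_bits=20000, seed=42):
    """Monte-Carlo bit-error-rate estimate for an n-repetition code over
    a binary symmetric channel: the kind of end-to-end reward an RL
    code-construction agent can optimise directly, rather than a proxy
    metric. Illustrative only; the thesis targets much longer codes."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_bits):
        bit = rng.randint(0, 1)
        received = [bit ^ (rng.random() < p_flip) for _ in range(n_rep)]
        decoded = int(sum(received) > n_rep / 2)  # majority-vote decoding
        errors += decoded != bit
    return errors / n_bits

raw = ber_repetition_code(0.1, n_rep=1)    # uncoded: BER tracks channel p
coded = ber_repetition_code(0.1, n_rep=3)  # roughly 3p^2 after decoding
```

An agent rewarded on estimates like `coded` optimises the quantity that actually matters for the medium, instead of a structural proxy such as girth or degree distribution.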
  • ItemOpen Access
    Securing encrypted communication
    Vasile, Diana-Alexandra [0000-0002-3476-3060]
    Secure messaging has led to the mass adoption of strong security principles such as end-to-end encryption and perfect forward secrecy, which had previously failed to gain traction. While this marks tremendous progress in the effort to enhance the security of communication, there are still many open challenges. This dissertation looks at two open problems: providing key transparency in secure messaging apps in an attempt to detect targeted wiretaps, and securing the initial contact for journalists. We begin by formalising the different combinations of key-to-name bindings seen in popular secure messaging apps into key-name-graphs, which we then use to verify that the key server provides the same snapshot of a key-name-graph for a user to all of their friends. This approach is proposed as a baseline gossip protocol between friends who physically co-locate; however, when coupled with some enhancements, it has broader applicability, both to different underlying network technologies and to expanding verification beyond the friendship connection. We analyse the deployability of the baseline gossip protocol using secondary data from two datasets: Wi-Fi usage and call-detail records. We also implement the protocol as a discrete-event simulator and use it to perform model checking to analyse the protocol's security. Secure messaging is not enough for everyone, though. There are certain cases in which further enhancements such as anonymity and metadata privacy are needed to protect those communicating. As such, we analysed the options available to journalists to communicate with sources who may later become whistleblowers. Through the insights from two workshops organised with large British news organisations, we show that the options available to journalists are inadequate. We identify several open problems, such as low-latency secure and anonymous communication and secure collaboration. 
We focus our efforts on initial contact, a problem that appeared during the workshop to have a significant detrimental effect on the security of sources. We discovered that often sources do not place significant emphasis on secure communication from the start, and retrospectively applying security is non-trivial by the time they are ready to share sensitive information. We thus propose a new and secure initial contact system, called CoverDrop, which is integrated as a secure library directly inside newsreader apps.
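The gossip idea above can be illustrated with a toy consistency check: friends exchange the key-name bindings the server showed them and flag any disagreement as a possible split view. The data layout below is hypothetical, not the thesis' key-name-graph format:

```python
def detect_split_view(snapshots):
    """Gossip-check sketch: each friend reports the key-name binding the
    key server showed them for a user; any disagreement suggests the
    server is presenting a split view (a possible targeted wiretap)."""
    seen = {}       # name -> (first reporting friend, key)
    conflicts = []  # (name, friend holding first key, disagreeing friend)
    for friend, bindings in snapshots.items():
        for name, key in bindings.items():
            if name in seen and seen[name][1] != key:
                conflicts.append((name, seen[name][0], friend))
            else:
                seen.setdefault(name, (friend, key))
    return conflicts

snapshots = {
    "alice": {"dave": "key-A"},
    "bob":   {"dave": "key-A"},
    "carol": {"dave": "key-X"},  # the server showed carol a different key
}
conflicts = detect_split_view(snapshots)
```

Performing this exchange opportunistically when friends physically co-locate is what makes the baseline protocol cheap to deploy.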
  • ItemOpen Access
    Large-scale inference and imputation for multi-tissue gene expression
    Viñas Torné, Ramon [0000-0003-2411-4478]
    Integrating molecular information across tissues and cell types is essential for understanding the coordinated biological mechanisms that drive disease and characterise homoeostasis. Effective multi-tissue omics integration promises a system-wide view of human physiology, with potential to shed light on intra- and multi-tissue molecular phenomena, but faces many complexities arising from the intricacies of biomedical data. This integration problem challenges single-tissue and conventional techniques for omics analysis, which are often unable to model a variable number of tissues with sufficient statistical strength, and necessitates the development of scalable, non-linear, and flexible methods. This dissertation develops inference and imputation methods for the analysis of gene expression data, an immensely rich and complex biomedical data modality, enabling integration across multiple tissues. The imputation task can strongly influence downstream applications, including performing differential expression analysis, determining co-expression networks, and characterising cross-tissue associations. Inferring tissue-specific gene expression may also play a fundamental role in clinical settings, where gene expression is often profiled in accessible tissues such as whole blood. Because gene expression is highly context-specific, imputation methods may facilitate the prediction of gene expression in inaccessible tissues, with applications in diagnosing and monitoring pathophysiological conditions. The modelling approaches presented throughout the thesis address four important methodological problems. The first work introduces a flexible generative model for the in-silico generation of realistic gene expression data across multiple tissues and conditions, which may reveal tissue- and disease-specific differential expression patterns and may be useful for data augmentation. 
The second study proposes two deep learning methods to study whether the complete transcriptome of a tissue can be inferred from the expression of a minimal subset of genes, with potential application in the selection of tissue-specific biomarkers and the integration of large-scale biorepositories. The third work presents a novel method, hypergraph factorisation, for the joint imputation of multi-tissue and cell-type gene expression, providing a system-wide view of human physiology. The fourth study proposes a graph representation learning approach that leverages spatial information to improve the reconstruction of tissue architectures from spatial transcriptomic data. Collectively, this thesis develops flexible and powerful computational approaches for the analysis of tissue-specific gene expression data.
  • ItemOpen Access
    Transient execution vulnerabilities in the security context of server hardware
    Randal, Allison
    The thesis of this work is that eliminating speculation is a feasible approach to mitigating the transient execution vulnerabilities on large-scale server hardware. Many mitigations have been proposed and implemented for many variants of the transient execution vulnerabilities, and while the Meltdown-type exception-based transient execution vulnerabilities have proven to be tractable, Spectre-type vulnerabilities and other speculation-based transient execution vulnerabilities have been far more resistant to countermeasures. After years of research and development by academia and industry, eliminating speculation is still the only reliable countermeasure against Spectre. For smaller-scale embedded systems or security-focused hardware such as a cryptographic system or a root-of-trust (RoT), eliminating speculation is widely accepted as a reasonable approach to improving security. But, for larger-scale and general-purpose hardware, eliminating speculation is often rapidly dismissed as inconceivable, though the claim that speculation is required for adequate performance is rarely supported by concrete performance results. The performance results we do have from several independent strands of research over the past few decades have shown that speculation features on large-scale server hardware do not offer the same performance advantages as on smaller-scale hardware, so eliminating speculation on large-scale server hardware does not harm performance as much as we might suspect. And selective speculation techniques have shown that speculation-based transient execution vulnerabilities can be mitigated by a partial elimination of speculation, so we can preserve some of the performance of speculation while subduing the security risk. In order to demonstrate the feasibility of eliminating speculation from modern server hardware microarchitectures, I consider three alternative approaches that partially or completely eliminate speculative execution. 
Heterogeneous multicore systems that combine speculative and non-speculative cores make it possible to entirely disable speculation for security-critical or untrusted sections of code, by running that code on a non-speculative core. Code running on a speculative core performs as well as it would on a fully speculative hardware architecture. The systems software developer has the power to choose which code runs with the performance advantage of speculation, and which code runs with the security advantage of no speculation. However, heterogeneous multicores only offer the ability to disable speculation at the process or thread level. A finer-grained approach is desirable, to limit the performance penalty of disabled speculation to the smallest possible region of code. Non-speculative cores keep the performance advantages of most common features in modern hardware architectures---such as dynamic multiple issue, dynamic pipeline scheduling, out-of-order execution, and register renaming---while avoiding the risk of speculative execution. Such processors do not perform as well as equivalent speculative processors, but the results of this work indicate that they can perform as well or better than equivalent speculative processors with all relevant mitigations for the transient execution vulnerabilities applied. The performance penalty of eliminating speculation can also be partially offset by increasing the size of fetch and issue stage components in the pipeline. Non-speculative cores do not give systems software developers the option to choose between performance and security. However, these cores may be desirable for large-scale server deployments that exclusively serve privacy-centered workloads, such as processing hospital patient data. Selective speculation combines speculative and non-speculative features on a single core. 
The performance of selective speculation cores is proportional to the use of speculative and non-speculative features, so only regions of code that disable speculation pay a performance penalty. Out of the three approaches considered in this dissertation, selective speculation cores are best for large-scale general-purpose server deployments, because they simplify resource allocation by keeping all cores identical, have no performance penalty for code run as entirely speculative, and give systems software developers the most precise control over speculation.
  • ItemOpen Access
    Context-conscious fairness throughout the machine learning lifecycle
    Lee, Seng Ah
    As machine learning (ML) algorithms are increasingly used to inform decisions across domains, there has been a proliferation of literature seeking to define “fairness” narrowly as an error to be “fixed” and to quantify it as an algorithm’s deviation from a formalised metric of equality. Dozens of notions of fairness have been proposed, many of which are both mathematically incompatible and morally irreconcilable with one another. There is little consensus on how to define, test for, and mitigate unfair algorithmic bias. One key obstacle is the disparity between academic theory and practical and contextual applicability. The unambiguous formalisation of fairness in a technical solution is at odds with the contextualised needs in practice. The notion of algorithmic fairness lies at the intersection of multiple domains, including non-discrimination law, statistics, welfare economics, philosophical ethics, and computer science. Literature on algorithmic fairness has predominantly been published in computer science, and while it has been shifting to consider contextual implications, many approaches crystallised into open source toolkits are tackling a narrowly defined technical challenge. The objective of my PhD thesis is to address this gap between theory and practice in computer science by presenting context-conscious methodologies throughout ML development lifecycles. The core chapters are organised by phase: design, test, deploy, and monitor. In the design phase, we propose a systematic way of defining fairness by understanding the key ethical and practical trade-offs. In the test phase, we introduce methods to identify and measure risks of unintended biases. In the deploy phase, we identify appropriate mitigation strategies depending on the source of unfairness. Finally, in the monitor phase, we formalise methods for monitoring fairness and adjusting the ML model appropriately to any changes in assumptions and input data. 
The primary contribution of my thesis is methodological: it improves our understanding of the limitations of current approaches and proposes new tools and interventions. It shifts the conversation in academia away from axiomatic, unambiguous formalisations of fairness towards a more context-conscious, holistic approach that covers the end-to-end ML development lifecycle. This thesis aims to provide end-to-end guidance for industry practitioners, regulators, and academics on how fairness can be considered and enforced in practice.
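Two of the formal metrics whose mutual incompatibility motivates this context-conscious stance can be computed in a few lines; the toy example below shows a classifier that satisfies demographic parity while failing equal opportunity:

```python
def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between groups
    0 and 1. One of many formal fairness metrics; the thesis argues the
    choice among them must be made in context, not axiomatically."""
    rate = lambda g: sum(p for p, gr in zip(y_pred, group) if gr == g) / group.count(g)
    return abs(rate(0) - rate(1))

def equal_opportunity_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates between groups."""
    def tpr(g):
        pos = [p for t, p, gr in zip(y_true, y_pred, group) if gr == g and t == 1]
        return sum(pos) / len(pos)
    return abs(tpr(0) - tpr(1))

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
dp = demographic_parity_gap(y_pred, group)         # equal positive rates
eo = equal_opportunity_gap(y_true, y_pred, group)  # unequal TPRs
```

Which of these gaps should be driven to zero is exactly the kind of design-phase trade-off the thesis argues cannot be settled by formalism alone.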