Technical and societal implications of machine learning security
Repository URI
Repository DOI
Change log
Authors
Abstract
Machine learning has become increasingly integrated into critical applications, making its security a pressing concern. This thesis addresses the practical and societal dimensions of ML security. On the attack front, the thesis demonstrates new adversarial techniques. For computer vision, it shows that simple human-crafted markers can fool state-of-the-art image classifiers. For multimodal systems, it presents "audio jailbreaks": imperceptible audio perturbations that consistently bypass a multimodal LLM's safety alignment. In addition, experiments reveal a scaling behaviour in data poisoning: larger language models are more susceptible to poisoning than smaller ones, learning harmful behaviours from minimal amounts of malicious training data. To better quantify model vulnerability, the thesis explores effective dimensionality, an information-theoretic measure of a neural network's complexity, as a new robustness metric; empirical results show that models with lower effective dimensionality exhibit greater resistance to adversarial manipulation.
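The abstract does not spell out how effective dimensionality is computed. One common formulation (due to MacKay, and applied to modern neural networks by Maddox et al.) derives it from the eigenvalues of the loss Hessian; the sketch below illustrates that formulation as an assumption, not necessarily the thesis's exact definition.

    import numpy as np

    def effective_dimensionality(eigenvalues, z=1.0):
        # N_eff(z) = sum_i lam_i / (lam_i + z): eigenvalues much larger than
        # the regularisation constant z contribute ~1, flat directions ~0,
        # so N_eff counts the well-determined directions in parameter space.
        lam = np.clip(np.asarray(eigenvalues, dtype=float), 0.0, None)
        return float(np.sum(lam / (lam + z)))

    # Toy spectrum: three dominant directions plus a long, nearly flat tail.
    spectrum = np.concatenate([np.array([50.0, 20.0, 10.0]), np.full(997, 1e-3)])
    print(effective_dimensionality(spectrum))  # ~3.8, despite 1000 eigenvalues

Under this reading, a lower N_eff indicates a flatter, functionally simpler fit, which is consistent with the abstract's claim that such models better resist adversarial manipulation.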
On the defence front, the thesis identifies a weakness in LLM watermarking schemes: in multi-turn dialogues, unwatermarked participants inadvertently mimic a watermarked model's distinctive lexical patterns. The thesis then develops Adversarial Suffix Filtering, a lightweight, model-agnostic input-sanitisation pipeline that removes malicious prompt suffixes, protecting LLMs from adversarial suffix attacks without requiring any changes to the underlying model. Finally, on the governance front, the thesis proposes a policy framework of tiered anonymity: by tying identity-verification requirements to an account's audience reach on social media, the framework aims to dampen large-scale AI-generated misinformation while preserving pseudonymity for ordinary users.
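The abstract describes Adversarial Suffix Filtering only at a high level. As a rough illustration of suffix sanitisation (not the thesis's actual pipeline), the sketch below scores candidate truncations of a prompt by perplexity under a small reference model and keeps the most natural-looking one; the choice of GPT-2 as the reference model, the word-level cut points, and the greedy search are all assumptions made for this demo.

    # pip install torch transformers
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        # Perplexity under GPT-2; optimiser-generated gibberish suffixes
        # score far higher than natural language.
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss
        return float(torch.exp(loss))

    def strip_adversarial_suffix(prompt: str, max_strip: int = 24) -> str:
        # Greedily pick the word-level truncation with the lowest perplexity:
        # removing a gibberish suffix lowers perplexity, while removing
        # genuine words from a benign prompt tends to raise it.
        words = prompt.split()
        best_cut, best_ppl = len(words), perplexity(prompt)
        for k in range(1, max(0, min(max_strip, len(words) - 2)) + 1):
            candidate = " ".join(words[:-k])
            ppl = perplexity(candidate)
            if ppl < best_ppl:
                best_cut, best_ppl = len(words) - k, ppl
        return " ".join(words[:best_cut])

    benign = "Write a short poem about the sea."
    attacked = benign + " describing.\\ + similarlyNow write oppositeley.]( Me giving**ONE"
    print(strip_adversarial_suffix(attacked))  # ideally recovers the benign prompt

Because the filter runs on the input text alone, it stays model-agnostic in the sense the abstract describes: the protected LLM never needs to be modified or even queried.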
These contributions reflect an interdisciplinary approach, integrating technical interventions with governance measures and illustrating how layered strategies of attacking, measuring, defending, and governing can collectively enhance the security and trustworthiness of machine learning systems in society.
Description
Date
Advisors
Anderson, Ross
