Federated Linear Dimensionality Reduction


Thumbnail Image
Change log

In recent years, the explosive rate of dataset expansion has offered the ability for researchers to access an unprecedented amount of information. Moreover, in addition to actual dataset size increases, the types and locations of data generators are more heterogeneous than ever -ranging from traditional servers to a myriad of IoT devices. These facts, coupled with recent emphasis on privacy and data-ownership, led to the creation of federated datasets. Such datasets are characterised by their massive size and are usually scattered across decentralised edge devices, each holding their local data samples. As exciting as these federated datasets might be, they introduce an astounding challenge: how to efficiently process federated data at scale? Naturally, given these constraints, centralisation of such datasets is often intractable, thus making traditional analytical methods inapplicable.

This thesis introduces a suite of mathematical advancements that makes traditional learning algorithms applicable to the federated setting, by summarising the massive amounts of information into succinct dataset-specific representations. Concretely, we focus primarily on linear dimensionality reduction and, in particular, on Principal Component Analysis (PCA) due to its pervasiveness, along with its ability to process unstructured data. The first advancement we introduce is a novel algorithm to perform streaming and memory-limited dimesionality reduction at the edge that uses a generalisation of incremental Singular Value Decomposition (SVD). Further, we provide a rank-adaptive SVD extension able to account for distribution shifts over-time. Subsequently, building upon previous constructions, we present an (ε,δ)-differentially private federated algorithm for PCA. To achieve federation, we put forth a lightweight merging algorithm that unlocks the ability to process each subproblem locally at the edge which, in turn, through merging is propagated accordingly. We are able to guarantee differential privacy via an input-perturbation scheme in which the covariance matrix of a dataset is perturbed with a non-symmetric random Gaussian matrix.

To evaluate the practicality of our innovations, we describe an algorithm able to perform task scheduling on federated data centres. The scheduler enables each decentralised node to incrementally compute its local model and independently execute scheduling decisions on whether to accept an incoming job based on the workload seen thus far. Finally, we complement our findings with an evaluation on synthetic as well as real-world datasets including sensor node measurements, hard-written images, and wine quality readings, considering a wide range of data modalities and dimensionalities.

Mascolo, Cecilia
Crowcroft, Jon
federated learning, machine learning, computer science, dimensionality reduction, principal component analysis
Doctor of Philosophy (PhD)
Awarding Institution
University of Cambridge
This thesis and associated studentship was supported by The Alan Turing Institute under grants: TU/C/000003, {TU/B/000069}, and {EP/N510129/1}.