Accelerating the Configuration Tuning of Big Data Analytics with Similarity-aware Multitask Bayesian Optimization

Fekry, A; Carata, L; Pasquier, T; Rice, A

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/312750

Repository DOI

https://doi.org/10.17863/CAM.59851

Type

Conference Object

Authors

Fekry, A

Carata, L

Pasquier, T

Rice, Andrew

https://orcid.org/0000-0002-4677-8032

Abstract

One of the key challenges for data analytics deployment is configuration tuning. Existing approaches to configuration tuning are expensive and overlook the dynamic characteristics of the analytics environment (i.e. frequent changes in workload due to evolving input sizes, or changes in the underlying cluster environment). Such workload/environment changes can cause significant performance degradation, and retuning the configuration to accommodate them can yield execution time savings of up to 85%.

We propose SimTune, an approach that accommodates such changes through efficient configuration tuning. SimTune combines workload characterization and Multitask Bayesian optimization to identify similarity across workloads and accelerate finding near-optimal configurations. Our experimental results show that SimTune reduces the search time for finding close-to-optimal configurations by 56-73% (at the median) when compared to existing state-of-the-art techniques. This means that the amortization of the tuning cost happens significantly faster, enabling practical tuning in the rapidly changing environment of distributed analytics.
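The amortization claim can be illustrated with a small back-of-the-envelope sketch. This is not SimTune itself or any calculation from the paper: the function name `break_even_runs`, the baseline search time, and the per-run saving are hypothetical values chosen only for illustration. It assumes a simple model in which tuning pays for itself once the cumulative per-run savings cover the time spent searching, and applies the two ends of the 56-73% median search-time reduction quoted above.

```python
import math


def break_even_runs(tuning_cost_s: float, saving_per_run_s: float) -> int:
    """Runs needed before cumulative savings cover the tuning (search) cost."""
    if saving_per_run_s <= 0:
        raise ValueError("tuning must save time on each run to amortize")
    return math.ceil(tuning_cost_s / saving_per_run_s)


# Hypothetical numbers, not taken from the paper.
baseline_search_s = 10_000.0  # search time spent by an existing tuner
saving_per_run_s = 350.0      # time saved per run with the tuned configuration

baseline_runs = break_even_runs(baseline_search_s, saving_per_run_s)

# Apply the two ends of the search-time reduction range quoted in the abstract.
runs_at_56 = break_even_runs(baseline_search_s * (1 - 0.56), saving_per_run_s)
runs_at_73 = break_even_runs(baseline_search_s * (1 - 0.73), saving_per_run_s)

print(baseline_runs, runs_at_56, runs_at_73)
```

Under these assumed numbers, the break-even point drops from 29 runs to 13 or 8 runs. In a setting where the workload changes every few dozen runs, that difference decides whether retuning is worthwhile at all.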

Keywords

4606 Distributed Computing and Systems Software, 46 Information and Computing Sciences

Journal Title

Proceedings - 2020 IEEE International Conference on Big Data, Big Data 2020

Conference Name

2020 IEEE International Conference on Big Data (Big Data)

Journal ISSN

2639-1589

Publisher

IEEE

Publisher DOI

https://doi.org/10.1109/BigData50022.2020.9378085

Sponsorship

Google Cloud, Amazon AWS

Collections

Cambridge University Research Outputs