Flexible and efficient computation in large data centres

Gog, Ionel Corneliu

doi:10.17863/CAM.18802

Flexible and efficient computation in large data centres

Repository URI

https://www.repository.cam.ac.uk/handle/1810/271804

Repository DOI

https://doi.org/10.17863/CAM.18802

Files

Primary Thesis (3.99 MB)

Type

Thesis

Authors

Gog, Ionel Corneliu

Abstract

Increasingly, online computer applications rely on large-scale data analyses to offer personalised and improved products. These large-scale analyses are performed on distributed data processing execution engines that run on thousands of networked machines housed within an individual data centre. These execution engines provide, to the programmer, the illusion of running data analysis workflows on a single machine, and offer programming interfaces that shield developers from the intricacies of implementing parallel, fault-tolerant computations.

Many such execution engines exist, but they embed assumptions about the computations they execute, or only target certain types of computations. Understanding these assumptions involves substantial study and experimentation. Thus, developers find it difficult to determine which execution engine is best, and even if they did, they become “locked in” because engineering effort is required to port workflows.

In this dissertation, I first argue that in order to execute data analysis computations efficiently, and to flexibly choose the best engines, the way we specify data analysis computations should be decoupled from the execution engines that run the computations. I propose an architecture for decoupling data processing, together with Musketeer, my proof-of-concept implementation of this architecture. In Musketeer, developers express data analysis computations using their preferred programming interface. These are translated into a common intermediate representation from which code is generated and executed on the most appropriate execution engine. I show that Musketeer can be used to write data analysis computations directly, and these can execute on many execution engines because Musketeer automatically generates code that is competitive with optimised hand-written implementations.

The diverse execution engines cause different workflow types to coexist within a data centre, opening up both opportunities for sharing and potential pitfalls for co-location interference. However, in practice, workflows are either placed by high-quality schedulers that avoid co-location interference, but choose placements slowly, or schedulers that choose placements quickly, but with unpredictable workflow run time due to co-location interference.

In this dissertation, I show that schedulers can choose high-quality placements with low latency. I develop several techniques to improve Firmament, a high-quality min-cost flow-based scheduler, to choose placements quickly in large data centres. Finally, I demonstrate that Firmament chooses placements at least as good as other sophisticated schedulers, but at the speeds associated with simple schedulers. These contributions enable more efficient and effective use of data centres for large-scale computation than current solutions.

Date

2017-09-01

Advisors

Watson, Robert N. M.

Keywords

large scale data processing, cluster management, dataflow systems, cluster scheduling

Qualification

Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge

Rights and licensing

Except where otherwised noted, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International

Sponsorship

Google fellowship in distributed systems

Collections

Theses - Computer Science and Technology

Flexible and efficient computation in large data centres

Repository URI

Repository DOI

Files

Type

Change log

Authors

Abstract

Description

Date

Advisors

Keywords

Qualification

Awarding Institution

Rights and licensing

Sponsorship

Collections