Flexible and efficient computation in large data centres

Ionel Corneliu Gog

University of Cambridge
Computer Laboratory
Corpus Christi College

September 2017

This dissertation is submitted for the degree of Doctor of Philosophy

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation is not substantially the same as any that I have submitted or that is being concurrently submitted for a degree, diploma, or other qualification at the University of Cambridge, or any other University or similar institution. This dissertation does not exceed the regulation length of 60,000 words, including tables and footnotes.

Flexible and efficient computation in large data centres

Ionel Corneliu Gog

Summary

Increasingly, online computer applications rely on large-scale data analyses to offer personalised and improved products. These large-scale analyses are performed on distributed data processing execution engines that run on thousands of networked machines housed within an individual data centre. These execution engines provide, to the programmer, the illusion of running data analysis workflows on a single machine, and offer programming interfaces that shield developers from the intricacies of implementing parallel, fault-tolerant computations.

Many such execution engines exist, but they embed assumptions about the computations they execute, or only target certain types of computations. Understanding these assumptions involves substantial study and experimentation. Thus, developers find it difficult to determine which execution engine is best, and even if they did, they become “locked in” because engineering effort is required to port workflows.

In this dissertation, I first argue that, in order to execute data analysis computations efficiently and to flexibly choose the best engines, the way we specify data analysis computations should be decoupled from the execution engines that run the computations. I propose an architecture for decoupling data processing, together with Musketeer, my proof-of-concept implementation of this architecture. In Musketeer, developers express data analysis computations using their preferred programming interface. These are translated into a common intermediate representation, from which code is generated and executed on the most appropriate execution engine. I show that Musketeer can be used to write data analysis computations directly, and that these can execute on many execution engines because Musketeer automatically generates code that is competitive with optimised hand-written implementations.

The diverse execution engines cause different workflow types to coexist within a data centre, opening up both opportunities for sharing and potential pitfalls of co-location interference. In practice, however, workflows are placed either by high-quality schedulers that avoid co-location interference but choose placements slowly, or by schedulers that choose placements quickly but yield unpredictable workflow run times due to co-location interference. In this dissertation, I show that schedulers can choose high-quality placements with low latency. I develop several techniques to improve Firmament, a high-quality min-cost flow-based scheduler, so that it chooses placements quickly in large data centres.
Finally, I demonstrate that Firmament chooses placements at least as good as those of other sophisticated schedulers, but at the speeds associated with simple schedulers. These contributions enable more efficient and effective use of data centres for large-scale computation than current solutions.

Acknowledgements

Foremost, I would like to thank my initial supervisor, Steve Hand, for his support over the past five years. Steve's advice, passion for research, and his high standards have been crucial in shaping the work at the core of my thesis. Likewise, I am grateful to Robert Watson, my second supervisor, for his insightful comments on this dissertation, and for his support in securing funding and arranging numerous conference visits throughout my PhD. I am also indebted to Ian Leslie for his comments and suggestions on drafts of this document.

The systems at the centre of this dissertation have grown from my close collaboration with Malte Schwarzkopf. I am grateful for the enjoyable time we spent working together, and for the time Malte took to provide feedback on countless drafts of this dissertation. I would also like to thank several other former colleagues in the Computer Laboratory for their contributions to the success of our projects: Natacha Crooks, Matthew Grosvenor, and Adam Gleave.

Throughout my PhD, I have also been privileged to work with distinguished researchers from industry. I thank Michael Isard and Derek Murray for our collaboration, and for giving me the opportunity to work with the Naiad system, which features in this dissertation. I am grateful to John Wilkes for the insights I gained from our collaboration in the Borgmaster team, which helped me refine part of my PhD work. In addition to those already mentioned above, I am grateful to Frans Kaashoek for hosting me for several months at MIT. I would also like to thank my friends from the systems research community – Allen Clement, Frank McSherry, Martin Maas, and Justine Sherry – who have welcomed me into the community.

Finally, and above all, I am profoundly grateful to my family, who have supported me through this difficult journey. I thank my mother, Maria Gog, for her continuous encouragement to follow my dreams, and my sister, Antonia Gog, for her cheerful support.

Contents

1 Introduction
  1.1 Contributions
  1.2 Dissertation outline
  1.3 Related publications

2 Background
  2.1 Cluster workloads
  2.2 Data processing
  2.3 Cluster scheduling

3 Musketeer: flexible data processing
  3.1 Musketeer overview
  3.2 Expressing workflows
  3.3 Intermediate representation
  3.4 Code generation
  3.5 DAG partitioning and automatic mapping
  3.6 Limitations and future work
  3.7 Summary
4 Musketeer evaluation
  4.1 Experimental setup and metrics
  4.2 Overhead over hand-written optimised workflows
  4.3 Impact of Musketeer optimisations on makespan
  4.4 Dynamic mapping to back-end execution engines
  4.5 Combining back-end execution engines
  4.6 Automatic back-end execution engine mapping
  4.7 Summary

5 Firmament: a scalable, centralised scheduler
  5.1 Firmament overview
  5.2 Flowlessly: a fast min-cost flow solver
  5.3 Extensions to min-cost flow-based scheduling
  5.4 Network-aware scheduling policy
  5.5 Limitations
  5.6 Summary

6 Firmament evaluation
  6.1 Experimental setup and metrics
  6.2 Scalability
  6.3 Placement quality
  6.4 Summary

7 Conclusions and future work
  7.1 Extending Musketeer
  7.2 Improving Firmament
  7.3 Summary

Bibliography

List of Figures

2.1 CDF of task runtime from a Google cluster trace
2.2 CDF of task runtime from an Alibaba cluster trace
2.3 CDF of task resource requests from a Google cluster trace
2.4 Data-centre task life cycle
2.5 Examples of front-end frameworks and back-end execution engines
2.6 Examples of different dataflow models
2.7 Examples of different graph processing models
2.8 Makespan of a PROJECT and JOIN query in different data processing systems
2.9 Makespan of PageRank in different data processing systems
2.10 Resource efficiency of different data processing systems
2.11 Analysis of task CPU versus memory consumption from a Google cluster
2.12 Service task resource requests normalised to usage from an Alibaba cluster
2.13 Resource requests normalised to usage from a Google cluster
2.14 Comparison of different cluster scheduler architectures
2.15 Stages a task proceeds through in task-by-task queue-based schedulers
2.16 Stages min-cost flow-based schedulers proceed through
2.17 Example of a simple flow network modelling a four-machine cluster
2.18 Example of a Quincy-style flow network
2.19 Quincy's scalability as cluster size grows
3.1 Coupling between front-end frameworks and back-end execution engines
3.2 Decoupling between front-end frameworks and back-end execution engines
3.3 Schematic of the decoupled architecture for data processing
3.4 Phases of a Musketeer workflow execution
3.5 PageRank workflow represented in Musketeer's IR
3.6 Max-property-price workflow represented in Musketeer's IR
3.7 GAS steps highlighted on the Musketeer IR for PageRank
3.8 Example of Musketeer's dynamic partitioning heuristic
3.9 Example of a DAG optimisation inadvertently breaking operator merging
3.10 Workflow on which Musketeer's dynamic heuristic misses a merge opportunity
4.1 Netflix movie recommendation workflow
4.2 Musketeer-generated code vs. hand-written baselines on Netflix workflow
4.3 Musketeer-generated code overhead for PageRank on the Twitter graph
4.4 Benefits of operator merging and type inference on top-shopper
4.5 Benefits of operator merging and type inference on cross-community PageRank
4.6 Musketeer versus Hive and Lindi front-ends on TPC-H query 17
4.7 Musketeer's makespan on PageRank on different graphs
4.8 Musketeer's resource efficiency on PageRank on the Twitter graph
4.9 Comparison of back-end systems and Musketeer on cross-community PageRank
4.10 Makespan overhead of Musketeer's automatic mapping decisions
4.11 Makespan of SSSP and k-means on a 100 instances EC2 cluster
4.12 Musketeer DAG partitioning algorithms runtime
5.1 Architecture of the Firmament cluster scheduler
5.2 Example of a flow arc
5.3 Example of a residual network flow arc
5.4 Example of a reduced cost flow arc
5.5 Runtime for min-cost flow algorithms on clusters of various sizes
5.6 Runtime for min-cost flow algorithms under high cluster utilisation
5.7 Example of a load-spreading flow network
5.8 Runtime for min-cost flow algorithms using a load-spreading policy
5.9 CDF of the number of scheduling events per time intervals in the Google trace
5.10 Number of misplaced tasks when using approximate min-cost flow
5.11 Comparison of incremental and from-scratch cost scaling
5.12 Runtime reductions obtained by applying problem-specific heuristics
5.13 Schematic of Flowlessly's internals
5.14 Runtime reductions obtained by applying price refine before changing algorithm
5.15 Example of min-cost flow scheduler's limitation in handling data skews
5.16 Example showing how convex arc costs can be modelled in the flow network
5.17 Examples that create dependencies between tasks' flow supply
5.18 “and” flow network construct
5.19 Rigid gang scheduling flow network construct
5.20 Generalised flow network gang scheduling construct
5.21 Example of a flow network for a network-aware scheduling policy
6.1 Task scheduling metrics
6.2 Comparison of Firmament's and Quincy's task placement delay
6.3 Firmament's algorithm runtime when cluster is oversubscribed
6.4 Firmament's scalability to short-running tasks
6.5 Firmament's latency on sped up Google trace
6.6 Percentage of tasks that achieve data locality with Firmament and Quincy
6.7 Firmament's network-aware policy outperforms state-of-the-art schedulers

List of Tables

2.1 Comparison of existing data processing systems
2.2 Comparison of existing cluster schedulers
3.1 Types of MapReduce jobs generated for Musketeer IR operators
3.2 Rate parameters used by Musketeer's cost function
4.1 Evaluation cluster and machine specifications
4.2 Modifications made to back-end execution engines deployed
5.1 Worst-case time complexities of min-cost flow algorithms
5.2 Optimality requirements of different min-cost flow algorithms
5.3 Arc changes that require solution reoptimisation
6.1 Specifications of the machines in the local homogeneous cluster

Listings

3.1 Hive query for the max-property-price workflow
3.2 BEER DSL code for the PageRank workflow
3.3 Musketeer's Lindi-like C++ interface
3.4 Gather-Apply-Scatter DSL code for PageRank
3.5 Spark code for max-property-price
3.6 Optimized Spark code for max-property-price
3.7 Musketeer scheduling – high-level overview
3.8 Dynamic programming heuristic for exploring partitionings of large workflows
4.1 Hive code for the top-shopper workflow
5.1 Algorithm for extracting task placements from the flow returned by the solver

Chapter 1

Introduction

The growing desire to optimise products based on data-driven insights has led to more (and more diverse) data analysis workflows being executed in data-centre clusters. These workflows execute on distributed data processing execution engines and conduct short interactive computations, graph analysis, or batch data processing. The execution engines run on clusters of tens of thousands of commodity machines and offer intuitive programming interfaces that developers use to write workflows without worrying about how the workflows are parallelised or on which cluster machines they are executed.

Many distributed execution engines have been developed in recent years to target the aforementioned types of computations.
Each engine promises benefits over prior solutions, but they often make different assumptions, target distinct use cases, and are evaluated under different conditions with varying workflows. For example, batch data execution engines optimise for processing huge data sets by parallelising workflows across many disks and machines. Graph processing systems, by contrast, spend significant time partitioning input graphs in order to reduce communication among machines. In practice, however, developers find it difficult to understand the trade-offs these systems make, and to determine which system, or combination of systems, is best to execute their workflows efficiently.

To make matters worse, the introduction of a faster or more efficient execution engine does not currently guarantee rapid adoption because existing workflows must be manually ported – a task not undertaken lightly, even though rewriting may bring significant performance gains. Developers must port workflows manually because user-facing front-ends that express workflows (e.g., Hive [TSJ+09], SparkSQL [AXL+15], Lindi [MMI+13]) are tightly coupled to back-end execution engines that run workflows (e.g., MapReduce [DG08], Spark [ZCD+12], Naiad [MMI+13], PowerGraph [GLG+12]). As a result, workflows implemented using front-ends cannot flexibly run on many back-end execution engines; workflows can only run on the back-end execution engines that their front-ends are coupled to.

The execution engine diversity also causes workflows with different resource requirements to coexist in data centres, which opens up the opportunity of sharing the heterogeneous data-centre hardware among workflows, but also introduces new challenges. Workflows consist of many parallel tasks whose runtime and performance can vary significantly depending on the hardware the tasks use and the interference caused by other tasks co-located on the same hardware. Thus, in order to execute workflows efficiently, it is essential to place tasks in such a way that they do not interfere, execute on preferred hardware, and run predictably.

Data-centre cluster schedulers are responsible for placing tasks on machines. These schedulers use elaborate algorithms to find high-quality task placements that take into account hardware heterogeneity and reduce task co-location interference [DK13; DK14; VPK+15]. However, these schedulers are typically centralised components that choose placements for entire clusters, and thus can take seconds or minutes to find task placements [SKA+13; DSK15]. They fail to meet the scheduling latency requirements of interactive data processing computations that must complete in a timely manner. As a result, clusters that run interactive computations distribute placement decisions across several schedulers that use simple algorithms to place workflows with low scheduling delay [OWZ+13; RKK+16]. These distributed schedulers only have partial and often stale information about the cluster's state, and thus they can choose poor task placements. Poor placements cause tasks to interfere and, consequently, to experience performance degradations that make workflows unpredictable.

In the research for this dissertation, I have first developed a new data processing architecture that decouples workflow specification from execution.
To demonstrate the practicality of this architecture, I have built Musketeer, a system that: (i) dynamically maps front-end workflow descriptions to a common intermediate representation, (ii) determines a good decomposition of workflows into jobs, and (iii) automatically generates efficient code from the intermediate representation for the chosen back-end execution engines, out of a broad range of supported engines.

Second, I have extended the Firmament centralised scheduler, which is based on an expensive min-cost flow optimisation, with new key components that reduce task placement latency at scale. In a series of experiments, I demonstrate that Firmament quickly chooses high-quality placements. To achieve this, Firmament relies on several different optimisation algorithms, uses incremental algorithms where possible, and applies problem-specific optimisations.

In this dissertation, I use Musketeer and Firmament to investigate the following thesis:

A workflow manager that decouples front-end frameworks from back-end execution engines can flexibly execute workflows on many execution engines, automatically port workflows, and potentially increase performance. Additionally, even a centralised data-centre cluster scheduler for such workflows can scale to large clusters, while choosing high-quality placements at low placement latency.

1.1 Contributions

In this dissertation, I make three principal contributions:

1. My first contribution is an architecture for decoupling data processing workflow specification from the manner in which workflows are executed. I argue that workflows should be automatically translated into an intermediate representation, which can be optimised, and from which code can dynamically be generated at runtime for the best combination of execution engines. To explore the drawbacks and benefits of my architecture, I developed Musketeer, a system that dynamically translates user-defined workflows to a range of data processing systems.

2. My second contribution is to show that centralised data-centre schedulers can scale to large clusters. Centralised data-centre schedulers choose high-quality placements. However, this comes at the cost of high placement latency at scale, which degrades runtime for interactive computations and decreases data-centre utilisation. I extended the Firmament centralised scheduler to show that this perceived scalability limitation is not fundamental. Firmament uses the flow-based scheduling approach introduced by Quincy [IPC+09], which models scheduling as a minimum-cost flow optimisation over a graph, but which is known to take minutes to place tasks on large clusters (a small worked example of this flow formulation follows this list). To address this limitation, I developed Flowlessly, a minimum-cost flow solver that makes flow-based schedulers scale to tens of thousands of machines at sub-second placement latency in the common case. Flowlessly uses multiple min-cost flow algorithms, solves the problem incrementally when possible, and applies several problem-specific optimisations.

3. My third contribution is to extend the Firmament cluster manager with a new scheduling policy that reduces end-host network interference. I show that this policy outperforms four state-of-the-art schedulers on a mixed cluster workload. The policy uses one of the several scheduling features I developed for flow-based schedulers (e.g., complex constraints, gang scheduling), which were previously thought to be incompatible with flow-based schedulers.
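To make the minimum-cost flow formulation mentioned in the second contribution concrete, the sketch below poses a toy two-task, two-machine placement problem as a flow network and solves it with an off-the-shelf solver. It only illustrates the general idea behind Quincy-style scheduling; the node names, arc costs, and the use of the networkx solver are my own assumptions, and the example does not reflect Firmament's actual flow network or Flowlessly's algorithms.

```python
# Toy illustration (not Firmament/Flowlessly): task placement as min-cost flow.
import networkx as nx

G = nx.DiGraph()

# Each task injects one unit of flow; a single sink absorbs both units.
for task in ["t0", "t1"]:
    G.add_node(task, demand=-1)
G.add_node("sink", demand=2)

# Each machine has one free slot, modelled as capacity 1 towards the sink.
for machine in ["m0", "m1"]:
    G.add_edge(machine, "sink", capacity=1, weight=0)

# Arc costs encode placement preferences (e.g., data locality):
# t0 prefers m0, t1 prefers m1.
G.add_edge("t0", "m0", capacity=1, weight=1)
G.add_edge("t0", "m1", capacity=1, weight=5)
G.add_edge("t1", "m0", capacity=1, weight=5)
G.add_edge("t1", "m1", capacity=1, weight=1)

# The cheapest feasible flow routes each task through the machine it should be
# placed on; reading the flow back gives {'t0': 'm0', 't1': 'm1'}.
flow = nx.min_cost_flow(G)
placements = {t: m for t in ["t0", "t1"] for m, f in flow[t].items() if f > 0}
print(placements)
```

Realistic scheduling flow networks typically also contain aggregator and unscheduled nodes and many more arcs (Chapters 2 and 5 describe the Quincy-style construction), which is why solving them from scratch is slow and an incremental, specialised solver pays off.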
1.1.1 Collaborations

The designs, architectures and algorithms presented in this dissertation are the result of my own work. However, colleagues in the Computer Laboratory have helped me implement several components that I later describe. In particular, Malte Schwarzkopf extended Musketeer with support for generating code for Metis jobs (§3.4). He also implemented Firmament's components that spawn tasks, and that monitor and submit resource utilisation statistics from worker agents to the centralised coordinator (§5.1). Finally, Malte also implemented the load-spreading scheduling policy that I use to show the limitations of one of Flowlessly's min-cost flow algorithms (§5.2).

Natacha Crooks contributed to the implementation of Musketeer, adding, in particular, support for a gather, apply, and scatter front-end framework (§3.2). Natacha extended Musketeer with traditional database query rewriting rules that optimise workflows in order to reduce their runtime (§3.3.1). Moreover, she also extended Musketeer to integrate with, and generate job code for, the Spark general-purpose data processing system (§3.4).

Matthew Grosvenor implemented the code that translates Hive workflows to Musketeer's intermediate workflow representation (§3.2). Adam Gleave contributed to Firmament by conducting an investigation of several minimum-cost flow algorithms in his Part II project under my co-supervision [Gle15].

In addition, Malte Schwarzkopf, Natacha Crooks, Matthew Grosvenor and Adam Gleave have co-authored papers about Musketeer [GSC+15] and Firmament [GSG+16].

1.2 Dissertation outline

This dissertation is structured as follows:

Chapter 2 gives an overview of the state-of-the-art data processing execution engines and identifies concepts shared by all of them. Moreover, in a series of experiments, I show that no execution engine always outperforms all others, and I highlight the challenges workflow developers face in choosing between them. In this chapter, I also trace the recent developments in cluster scheduling, and discuss the requirements a scheduler must satisfy in order to place workflows such that they complete as soon as possible. In the discussion, I put an emphasis on the limitations of prior centralised and distributed schedulers.

Chapter 3 describes my architecture for decoupling data processing workflow specification from execution. I describe Musketeer, a proof-of-concept data processing workflow manager I built to showcase the architecture. Musketeer supports several front-end frameworks in which developers express their workflows, translates workflows into a common intermediate representation, and generates code to execute workflows on the best combination of data processing execution engines. First, I outline Musketeer's architecture, then discuss how workflows can be expressed when using it. Next, I discuss Musketeer's intermediate representation and how code for different frameworks is generated from it. Finally, I discuss how Musketeer decides on which combination of execution engines to run a workflow.

Chapter 4 investigates Musketeer's ability to efficiently run real-world data processing workflows. In a range of experiments, I show that Musketeer (i) generates efficient workflow code that achieves comparable performance to optimised hand-written implementations;
(ii) speeds up legacy workflows by mapping them to more efficient execution engines; (iii) flexibly combines several execution engines to run workflows; and (iv) automatically decides which engines are best to use for a given workflow.

Chapter 5 describes how centralised min-cost flow schedulers can be optimised to choose good task placements and scale to tens of thousands of machines at low scheduling latency. First, I explain how Firmament, an existing min-cost flow-based scheduler, works and how it differs from traditional task-by-task schedulers. Next, I introduce Flowlessly, a min-cost flow solver I developed to make flow-based scheduling fast. Flowlessly automatically chooses between different min-cost flow algorithms, solves the optimisation incrementally, and uses problem-specific heuristics. Finally, I describe how min-cost flow-based schedulers can be extended with features that were previously thought to be incompatible with such schedulers.

Chapter 6 evaluates Firmament's performance. First, I demonstrate, in simulations using a Google workload trace from a 12,500-machine cluster, that Firmament provides low scheduling latency even at scale. Second, I show that Firmament matches the scheduling latency of state-of-the-art distributed schedulers for workloads of short tasks, and exceeds their placement quality on a real-world cluster. Finally, I show that Firmament chooses better placements than state-of-the-art centralised and distributed schedulers.

Chapter 7 highlights directions for future work and concludes this dissertation. I consider challenges in expressing more types of workflows and discuss improvements that could be made to the mechanism Musketeer uses to automatically choose execution engines. I also discuss how Firmament's scheduling policies might be improved and how Flowlessly might be extended to optimise flow networks generated by complex policies.

1.3 Related publications

Parts of the work described in this dissertation have appeared in peer-reviewed publications:

[GSC+15] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. “Musketeer: all for one, one for all in data processing systems”. In: Proceedings of the 10th ACM European Conference on Computer Systems (EuroSys). Bordeaux, France, Apr. 2015.

[GSG+16] Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, and Steven Hand. “Firmament: fast, centralized cluster scheduling at scale”. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Savannah, Georgia, USA, 2016, pp. 99–115.

I have also co-authored the following publications, which have influenced the work described in this dissertation, but did not directly contribute to it:

[GSG+15] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. “Queues don’t matter when you can JUMP them!” In: Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI). Oakland, California, USA, May 2015.

[GGS+15] Ionel Gog, Jana Giceva, Malte Schwarzkopf, Kapil Vaswani, Dimitrios Vytiniotis, Ganesan Ramalingam, Manuel Costa, et al. “Broom: sweeping out Garbage Collection from Big Data systems”. In: Proceedings of the 15th USENIX/SIGOPS Workshop on Hot Topics in Operating Systems (HotOS). Kartause Ittingen, Switzerland, May 2015.

[GIA17] Ionel Gog, Michael Isard, and Martín Abadi. “Falkirk: Rollback Recovery for Dataflow Systems”.
In submission, 2017.

Chapter 2

Background

Many modern applications rely on large-scale data analytics workflows to provide high-quality services. These workflows run on hundreds of machines deployed in large data-centre clusters, which comprise tens of thousands of networked commodity machines and are shared by multiple applications and users. Developers of modern applications must solve two distinct but related problems in order to utilise data-centre clusters efficiently and to obtain good application performance:

Application implementation: implement applications and their corresponding data processing workflows such that: (i) they achieve high-performance, efficient data processing with minimal implementation effort; and (ii) application implementations are sufficiently flexible to remain compatible with future advances in parallel data processing.

Application execution: choose a set of machines on which to place applications. The selected machines must have sufficient available resources for the applications to run efficiently, and should meet other constraints such as tolerance of machine faults.

In this dissertation, I discuss two solutions that I propose to these problems: (i) a data processing architecture that decouples data analysis workflow specification from execution, and enables automatic, flexible and dynamic translation of workflows to the best combination of data processing systems; and (ii) a centralised cluster scheduler that chooses high-quality placements with low latency, and thus executes applications efficiently and predictably. Hence, in this chapter, I describe the properties and limitations of the state-of-the-art data processing systems and cluster schedulers that are used in today's data centres.

First, in Section 2.1, I briefly describe the types of workloads that are currently executed in data centres. These workloads have different resource requirements because they conduct various computations, and also vastly different scheduling latency expectations; some have to be scheduled quickly because they provide latency-critical services, while others can wait.

Next, in Section 2.2, I describe the different programming paradigms provided by the data processing systems available in today's data centres. I divide data processing systems into two categories: (i) front-end frameworks that are used by developers to express workflows (§2.2.2), and (ii) back-end execution engines that run these workflows on hundreds of machines (§2.2.1). I look at how developers' choice of front-end framework and back-end execution engine affects the efficiency with which their workflows execute.

Finally, I focus on the challenges that arise from combining different types of workflows and applications in a single cluster. In Section 2.3, I outline the features a cluster scheduler must have to utilise resources efficiently and to place workflows in such a way that they run predictably and complete as fast as possible. I also discuss the state-of-the-art cluster scheduler architectures, and describe their limitations. Finally, I categorise schedulers by the way they process workloads: (i) queue-based schedulers that process one workflow at a time, and (ii) min-cost flow schedulers that reconsider the entire workload each time they make decisions. I give a high-level description of how min-cost flow schedulers work and how they differ from task-by-task queue-based schedulers.
2.1 Cluster workloads

The workloads executed in data centres are becoming increasingly varied as more applications rely on distributed systems to service requests with low latency and to provide insights obtained from large-scale data analysis. The workloads consist of jobs that run many tasks. A task is an instance of an application executed as one or more processes in a container or virtual machine running on a single machine.

Many types of tasks with different runtimes and resource requirements execute in data centres. For example, in one of Google's clusters, the shortest 25% of tasks run for less than 180 seconds and the longest 25% of tasks run for more than 1,000 seconds (see Figure 2.1), and in one of Alibaba's clusters the shortest 25% of tasks run for less than 8 seconds and the longest 25% of tasks run for more than 154 seconds (see Figure 2.2). Similarly, resource requirements can vary greatly across tasks (see Figure 2.3). As a result, tasks can put conflicting demands on cluster schedulers. Short-running tasks need schedulers to offer low placement latency in order to achieve low task runtime. By contrast, long-running tasks need schedulers to use expensive optimisations that choose high-quality placements, which take into account tasks' resource requirements and do not cause task runtime increases or performance degradation.

Data-centre tasks are commonly categorised by what kind of applications they run [SKA+13; VPK+15], and fall into one of the following three categories: (i) service tasks, (ii) batch tasks, or (iii) interactive tasks.

Service tasks are instances of high-priority applications or production-critical systems such as web servers, load balancers and databases. These systems must serve requests at all times, and thus service tasks run continuously until they are restarted due to hardware or software upgrades. Service tasks must be quickly migrated to other machines in case of hardware failures, and must run these user-facing systems predictably such that they meet service-level latency agreements and maintain high serving rates. Cluster schedulers must place service tasks on machines on which they do not interfere with other tasks, and must also avoid placing interfering tasks on these machines subsequently.

Figure 2.1: CDF of task runtime computed from a 30-day trace of a 12,500-machine Google cluster. Task runtimes vary greatly: 25% of tasks run for less than 180 seconds, and 75% of tasks run for less than 1,200 seconds. I do not show runtimes beyond 2,000 seconds because the trace contains service tasks that run until failure or for the entire trace duration, and these would skew the CDF.

Figure 2.2: CDF of task runtime computed from a 24-hour trace of a 1,300-machine Alibaba cluster. Task runtimes vary greatly: 50% of tasks run for less than 28 seconds, and 10% of tasks run for more than 428 seconds.

Batch processing tasks are instances of infrastructure systems that run for a limited time. Typical batch processing tasks run offline computations such as extracting, transforming and loading (ETL) data, and analyses for improving product engagement and increasing revenues.
In contrast to service tasks, batch tasks run computations whose failure does not cause application downtime. Moreover, these offline computations do not have stringent placement requirements: (i) tasks do not necessarily have to start simultaneously, (ii) computations are not sensitive to straggler tasks (i.e., tasks that take longer to complete than other tasks) because they do not have to complete in real time, and (iii) tasks do not have to be placed in different fault tolerance domains. Thus, batch processing tasks are good candidates for preemption when other, more important types of tasks must execute and no resources are available in the cluster.

Figure 2.3: CDF of task resource requests from a Google cluster trace. Resource requests are normalised to the largest request. Task resource requests vary greatly; a small fraction of tasks request several times more resources than other tasks.

Interactive data processing tasks execute simple queries submitted by data analysts, or dispatched by systems that provide personalised user responses. These tasks must complete as fast as possible (within seconds) because they are used in latency-sensitive systems such as online customer tools, monitoring systems and frameworks for interactive data exploration. These systems are becoming increasingly popular, and as a result more data-centre tasks are interactive. For example, 50% of the SCOPE data analytics tasks at Microsoft are interactive and run for less than 10 seconds [BEL+14, §2.3]. Interactive data processing tasks are challenging to schedule because: (i) they must be placed as fast as possible to avoid delaying query completion time, and (ii) they must be scheduled simultaneously, because otherwise some tasks may become stragglers and significantly delay query completion time.

2.1.1 Task life cycle

All service, batch and interactive tasks execute on cluster managers and transition through the task life cycle that I show in Figure 2.4. Tasks are first submitted to a data-centre cluster manager, which schedules, runs and monitors them. After submission, tasks wait to be considered for placement by the cluster manager's scheduler. Next, the scheduler uses a specific algorithm to schedule tasks, i.e., it chooses task placements. Upon task placement, the cluster manager downloads task binaries and sets up containers or virtual machines for the tasks to execute in. Finally, tasks run until completion.

Figure 2.4: Data-centre task life cycle: state transition events (bottom) and task-specific metrics (top).

When I refer to the durations of the different stages of a task's life cycle, I use the following metrics (see Figure 2.4):

1. Task placement latency represents the time it takes the scheduler to decide where a task will run. I measure it from the time the task is submitted until the scheduler places the task on a machine.

2. Task runtime represents the time spent executing the user-provided computation in the task. I measure it from the time a task starts running until it completes.
Task runtime does not include task setup time (e.g., installing dependencies, downloading binaries).

3. Task makespan is an end-to-end metric that measures the entire time it takes to execute a task, from the moment the task is submitted until it completes. (Task runtime and makespan do not apply to service tasks, because service tasks do not complete.)

All tasks transition through the same stages, but they have different scheduling requirements. On the one hand, service tasks run for a long time, and thus many service tasks do not require quick placement, but they expect schedulers to choose high-quality placements for them because they serve critical applications. On the other hand, interactive tasks must complete as soon as possible. They must be placed as fast as possible, and transition through the task stages quickly, because otherwise their completion time may be significantly delayed.

In an ideal data-centre cluster, tasks would share machines without affecting each other's performance; the cluster scheduler would choose placements instantaneously; and thus task makespan would equal task runtime. However, in the real world, cluster schedulers may cause unnecessary increases in task makespan because they may not choose high-quality task placements or may be unable to offer low task placement latency. Poor task placement quality decreases cluster utilisation, increases task makespan (for batch tasks), or decreases application-level performance (for service tasks). Similarly, high task placement latency increases task makespan, and decreases utilisation if tasks are waiting to be placed while resources are idle.

2.2 Data processing

Many systems for the parallel processing of the aforementioned types of tasks have been developed in recent years (see Figure 2.5). These systems seek to give developers the ability to run parallel analyses on data sets whilst shielding them from the complexity introduced by data partitioning, communication, synchronisation and fault tolerance.

Many of these systems are designed to work well for specific workloads. For example, batch data processing execution engines process huge data sets spanning many disks in order to improve user engagement, or to extract, transform and load (ETL) data into other specialised systems. Real-time stream processing systems are used by applications for tasks such as detecting spam and drawing insights from data collected from devices in the Internet of Things. Graph processing systems execute jobs for detecting fraud and improving recommendation engines. Finally, interactive data analytics systems run tasks that complete within seconds and support user-facing services such as language translation and personalised search.

These systems embrace domain-specific optimisations to offer good performance for the workloads they target while maintaining simplicity. For example, systems for large-scale batch processing prioritise data throughput over quick recovery from hardware failures [DG08]. Interactive data analytics systems use sampling or boundedly stale results [MGL+10], systems that run iterative workflows cache intermediate data in memory between iterations [ZCD+12; MMI+13], and graph processing systems adopt a vertex-centric synchronous computational model to provide simple programming interfaces without sacrificing performance [MAB+10; GLG+12; KBG12]. Additional specialised systems are being developed as new workflows and problem domains emerge.
Today, large organisations like Google, Facebook or Microsoft already run dozens of such systems. In Figure 2.5, I show a subset of the data processing systems that have been developed over the past few years. Even though I only consider batch data and graph processing systems here, companies deploy more than a dozen different systems. As a result, data processing workflow developers are faced with many choices of systems to use when implementing workflows.

These data processing systems can be categorised into two groups based on the type of abstractions they provide to developers:

Front-end frameworks are used by developers to express workflows using high-level declarative abstractions such as SQL-like querying languages or vertex-centric graph interfaces.

Back-end execution engines execute workflows expressed in front-end frameworks, or implemented using low-level abstractions (e.g., map and reduce functions, or stateful dataflow vertices).

Figure 2.5: There are many front-end frameworks (e.g., Hive, Pig, Tenzing, Sawzall, FlumeJava, Giraph, DryadLINQ, SCOPE, Lindi, GraphLINQ, SparkSQL, GraphX) and back-end execution engines (e.g., MapReduce, Metis, Dryad, Naiad, Spark, Flink, CIEL, Hyracks, Pregel, GraphChi, PowerGraph, X-Stream). Front-ends and back-ends typically have one-to-one mappings.

Front-end frameworks offer interfaces that abstract away how and where workflows run. However, in practice many front-end frameworks are coupled to a single back-end execution engine; all the front-end frameworks depicted in Figure 2.5 execute workflows on a single back-end. Developers choose a front-end framework to implement workflows, but as a result of this coupling, they do not benefit from the advantages that different back-end execution engines have. Moreover, once developers have implemented workflows, they find it tedious to manually port them to other frameworks. Developers become “locked in”, despite faster or more efficient back-ends being available.

In the following subsections, I describe the different dataflow models that the batch and graph processing execution engines are based on (§2.2.1). I also explain how these models fundamentally limit back-end execution engines in certain situations. Next, I describe several state-of-the-art front-end frameworks (§2.2.2).
Finally, I show, using several real-world experiments, that choosing the best framework is difficult and that performance varies greatly depending on: (i) the type of computation executed, (ii) the input data size, and (iii) engineering decisions made in the framework's development process (§2.2.3).

2.2.1 Back-end execution engines

I now turn to describing the evolution of batch processing and graph processing execution engines. Purpose-built data processing engines exist for other types of tasks: for example, stream processing is conducted using Heron [KBF+15] and Storm [TTS+14] at Twitter, MillWheel [ABB+13] at Google, S4 at Yahoo! [NRN+10], Puma and Stylus at Facebook [CWI+16], and Samza at LinkedIn [SAM16]. In this dissertation, however, I focus only on the evolution of batch and graph processing execution engines, although many of the concepts I introduce apply to stream processing systems as well.

2.2.1.1 Batch data processing back-end execution engines

Batch data processing execution engines are used to analyse large amounts of data to obtain data-driven insights in an “offline” fashion. Data analyses were initially conducted on single commodity or mainframe machines. However, as data increased in size, disk I/O bandwidth became a bottleneck, or memory storage capacity limited how much data could be analysed. To address these limitations, analyses were parallelised over many disks and commodity machines. However, failures occur on a regular basis when many commodity machines are used [SPW09; GOF16]. Thus, the computation model that batch data processing systems use must be robust to permanent or transient machine and network failures, must avoid fine-grained, costly coordination (e.g., no shared memory), and must not require determinism.

One model that meets these requirements is the parallel dataflow model. Initially proposed by Dennis as an alternative to the control flow architecture [Den68], dataflow was unsuccessful as a commercial computer architecture, but it inspired the dataflow programming model because of its intrinsic suitability for parallel computations. The dataflow programming model defines computations as directed graphs in which vertices conduct computations defined by users. The vertices apply pure functions (e.g., map, reduce) or complex, non-deterministic, stateful functions that may ingest data from external systems. Vertices are usually stateless, or, if they are stateful, they only store data locally. A vertex's output data are sent as input to the vertices that its outgoing arcs connect to. Thus, the flow of data is defined explicitly by the arcs. The lack of shared vertex state and the explicit data coordination represented in the graph make the dataflow model well-suited to running on large clusters of commodity machines in which failures are common.

I now describe the existing dataflow models and the back-end execution engines that are based on them (see Table 2.1 for a summary). In the prior literature, "vertices" and "tasks" are used interchangeably to refer to dataflow vertices. I call dataflow vertices "tasks" in order to distinguish between when I refer to them and when I discuss vertices in graph data.

MapReduce dataflow model. Back-end execution engines that are based on the MapReduce dataflow model execute two types of tasks: (i) tasks that read input data and process it, and (ii) tasks that aggregate the processed data and output it (see Figure 2.6a).
The programming interfaces exposed by these back-ends do not require developers to specify data dependencies among tasks because the back-ends create the dependencies automatically.

Google's MapReduce [DG08] was perhaps the first big data execution engine, and it gained wide adoption (within Google). MapReduce influenced the design of other back-end execution engines (e.g., Apache Hadoop [HAD16], HaLoop [BHB+10]). MapReduce expects a developer to express her workflow using two functions: map and reduce. The back-end executes the workflow by first running m tasks that apply the user-provided map function in parallel over input data shards. The map function is applied in turn to each input data row and outputs zero or more key-value pairs. Next, MapReduce executes an intermediate step in which it sorts and groups by key the output from all m tasks. Subsequently, MapReduce sends shards of the grouped key-value pairs to r tasks that apply the user-provided reduce function. The function is applied to every key and its list of grouped values, and outputs a key-value pair.

Data processing system   | Execution model        | Environment | In-memory | Distributed I/O | Requires pre-processing | Default sharding | Data sizes | Fault tolerance | Language
Hadoop MapReduce [HAD16] | MapReduce dataflow     | cluster     | –         | ✓               | –                       | user-def.        | large      | ✓               | Java
HaLoop [BHB+10]          | MapReduce dataflow     | cluster     | ✓         | –               | –                       | –                | med.       | –               | C++
Metis [MMK10]            | MapReduce dataflow     | machine     | ✓         | –               | –                       | user-def.        | small      | –               | C++
Spark [ZCD+12]           | acyclic dataflow       | cluster     | ✓         | ✓               | –                       | uniform          | med.       | ✓               | Scala
Dryad [IBY+07]           | acyclic dataflow       | cluster     | –         | ✓               | –                       | user-def.        | large      | ✓               | C#
Hyracks [BCG+11]         | acyclic dataflow       | cluster     | –         | ✓               | –                       | uniform          | large      | ✓               | Java
Flink [FLN16]            | acyclic dataflow       | cluster     | ✓         | ✓               | –                       | uniform          | med.       | ✓               | Java/Scala
CIEL [MSS+11]            | dynamic dataflow       | cluster     | (✓)       | ✓               | –                       | user-def.        | med.       | ✓               | various
Naiad [MMI+13]           | timely dataflow        | cluster     | ✓         | (✓)             | –                       | user-def.        | med.       | (✓)             | C#
Pregel [MAB+10]          | vertex-centric         | cluster     | –         | ✓               | –                       | uniform          | med.       | ✓               | C++
GraphChi [KBG12]         | vertex-centric         | machine     | ✓         | –               | ✓                       | –                | small      | –               | C++
Giraph [GIR16]           | vertex-centric         | cluster     | –         | ✓               | –                       | uniform          | med.       | ✓               | Java
PowerGraph [GLG+12]      | gather, apply, scatter | cluster     | ✓         | ✓               | ✓                       | power-law        | med.       | (✓)             | C++
X-Stream [RMZ13]         | edge-centric           | machine     | ✓         | –               | ✓                       | –                | med.       | –               | C++
Dremel [MGL+10]          | execution trees        | cluster     | –         | ✓               | –                       | uniform          | large      | ✓               | C++

Table 2.1: A selection of contemporary back-end execution engines with their features and properties. (✓) indicates that the system can be extended to support this feature.

Figure 2.6: Examples of the different dataflow models described in Section 2.2.1.1: (a) MapReduce dataflow; (b) acyclic dataflow; (c) driver-controlled acyclic iterative dataflow; (d) dynamic dataflow; (e) timely dataflow. Tasks are shown as circles and data flows along the arcs that connect them. Back-end execution engines target different workloads depending on which dataflow model they are based on.

MapReduce execution engines are popular because they provide a simple programming model and they can tolerate hardware failures. However, these engines cannot execute complex workflows. For example, they restrict workflows to taking a single input set and generating a single output set, and they cannot execute three-way joins on different columns in a single job. These limitations are addressed by high-level front-end frameworks (e.g., Pig [ORS+08], Hive [TSJ+09]) and workflow managers (e.g., Apache Oozie [OOZ16]).
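Before turning to these higher-level systems, the sketch below makes the map/reduce programming model described above concrete. It is a minimal, single-process illustration of the map, group-by-key and reduce phases; the word-count functions and the in-memory "shuffle" are my own assumptions and do not correspond to any particular engine's API.

```python
# Minimal single-machine sketch of the MapReduce execution pattern (illustrative
# only; real engines shard the input over m map tasks and r reduce tasks, run
# them on different machines, and persist intermediate data for fault tolerance).
from collections import defaultdict

def map_fn(line):
    # Applied to each input row; emits zero or more key-value pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Applied to a key and its list of grouped values; emits one key-value pair.
    return key, sum(values)

def run_mapreduce(input_rows):
    grouped = defaultdict(list)
    for row in input_rows:              # "map" phase
        for key, value in map_fn(row):
            grouped[key].append(value)  # sort/group-by-key ("shuffle") step
    return [reduce_fn(k, vs) for k, vs in sorted(grouped.items())]  # "reduce" phase

print(run_mapreduce(["a rose is a rose", "is a rose"]))
# [('a', 3), ('is', 2), ('rose', 3)]
```

A distributed engine runs many instances of these two functions in parallel over data shards; the data partitioning, shuffling and failure handling that the loop above trivialises is precisely what the execution engine hides from the developer.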
These systems provide interfaces to implement workflows that cannot be expressed as a single MapReduce job. For each workflow, they generate and manage a directed acyclic graph (DAG) of MapReduce jobs. However, this approach suffers from several drawbacks: (i) workflow optimisations cannot be applied across job boundaries, (ii) resources are wasted while job output data are written to disk even if other jobs read the data later, and (iii) additional systems must be managed.

Acyclic dataflow model. MapReduce execution engines cannot execute complex workflows because they are based on the MapReduce dataflow model, which only supports two types of tasks. The acyclic dataflow model avoids this limitation: it models data analysis computations as directed acyclic graphs of tasks (see Figure 2.6b). Tasks can be connected to any other tasks as long as no cycle exists in the graph. The acyclic dataflow model is suitable for parallel execution because all tasks that have input available can execute in parallel. Moreover, tasks that ingest other tasks' output can execute as soon as the latter produce output data.

Complex workflows can execute as a single job on back-ends based on the acyclic dataflow model, which avoids the need for separate workflow managers. For example, a workflow of MapReduce jobs can be modelled as a directed acyclic dataflow graph consisting of several connected MapReduce dataflow graphs.

Dryad was the first general-purpose distributed back-end execution engine based on the acyclic dataflow model [IBY+07]. A Dryad job comprises a DAG of task vertices, each executing a user-specified computation. Tasks connected by edges have communication channels between them for sending data (e.g., TCP channels, shared memory). A Dryad job completes when all tasks finish processing their input or the data sent by other tasks. Like Dryad, Hyracks executes a DAG of user-specified task vertices [BCG+11], and Spark builds a DAG of "stages" that each correspond to several parallel tasks [ZCD+12]. Spark speeds up workflows by storing data in memory using resilient distributed datasets (RDDs) rather than on disks [ZCD+12]. RDDs keep task output data in memory when possible and stream it to dependent tasks. By contrast, MapReduce execution engines write the output of every task to disk.

Nonetheless, there are still some workflows that cannot natively execute as a single job on back-ends based on the acyclic dataflow model. For example, these back-ends cannot execute workflows that contain data-dependent or unbounded iterations, because their underlying dataflow model is acyclic. This limitation is addressed using driver programs that execute each workflow iteration as a job and check for computation completion. For example, the Apache Mahout machine learning library uses a driver program that submits cluster jobs to execute the iteration body, waits for them to complete, and finally checks if the iteration termination condition is met [MAH16]. Apache Flink [ABE+14; FLN16] uses a declarative fixed-point operator that it repeatedly evaluates after each iteration. By contrast, Spark and Dryad use extensions of high-level languages. DryadLINQ [YIF+08] is a front-end framework for Dryad in which iterative workflows are expressed by wrapping the directed acyclic graph in a C# for loop. Similarly, developers can write iterative Spark workflows using Spark primitives (e.g., join, map, group by) and the for loops provided by the high-level languages Spark supports (e.g., Scala, Python). However, these solutions are not fault-tolerant: driver programs and DryadLINQ/Spark for loops execute on a single machine and maintain state (e.g., the iteration count) that is vital for correct workflow completion. A failure of the machine on which they execute causes the entire workflow to restart from scratch.
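The driver-program pattern described above is easy to see in code. The following is a schematic, single-process sketch of the pattern, not DryadLINQ, Mahout or Spark code: submit_iteration_job is a hypothetical stand-in for submitting one cluster job that runs the iteration body, and the convergence test and iteration counter live only in the driver.

```python
# Schematic sketch of a driver program for an iterative workflow on an acyclic
# dataflow engine. submit_iteration_job() is a hypothetical placeholder for
# submitting one distributed job; here it runs an in-process PageRank step.

def submit_iteration_job(ranks, links):
    new_ranks = {page: 0.15 for page in ranks}
    for page, targets in links.items():
        share = 0.85 * ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

# Driver loop: one submitted job per iteration, data-dependent termination.
for iteration in range(100):
    new_ranks = submit_iteration_job(ranks, links)
    delta = sum(abs(new_ranks[p] - ranks[p]) for p in ranks)
    ranks = new_ranks
    if delta < 1e-6:
        break

print(iteration, ranks)
```

Because the loop state (the iteration counter and delta) exists only in the driver process, losing that machine loses the workflow's progress; the dynamic dataflow model discussed next avoids relying on an external driver by letting tasks themselves spawn the next iteration's tasks.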
Similarly, developers can write iterative Spark workflows using Spark primitives (e.g., join, map, group by) and for loops provided by the high-level languages Spark supports (e.g., Scala, Python). However, these solutions are not fault-tolerant; driver programs or DryadLINQ/Spark for loops execute on a single machine and maintain state (e.g., iteration count) vital for correct workflow completion. A failure of the machine on which they execute causes the entire workflow to restart from scratch.
Dynamic dataflow model. In order to address the limitations of the acyclic dataflow model, Murray et al. developed CIEL [MSS+11], which introduces a dynamic dataflow model that natively supports workflows with data-dependent iterations. In the dynamic dataflow model, each task can spawn additional tasks at runtime (see Figure 2.6d). CIEL uses this feature to execute iterative workflows. For a given iterative workflow, CIEL first executes a set of tasks to conduct one iteration. When these tasks complete, one designated task checks whether the iteration’s convergence criterion is satisfied; if not, it spawns a new set of tasks to conduct another iteration. This process repeats until the convergence criterion is satisfied. CIEL is a general back-end execution engine, but nested-loop workflows may be tedious to implement because developers must take care of the order in which tasks spawn additional tasks.
Figure 2.7: Examples of the different graph processing models described in Section 2.2.1.2: (a) BSP graph processing, in which each circle represents a task/process that runs vertex-centric code for one or more vertices; (b) GAS graph processing, in which circles represent tasks that apply user-provided gather, apply or scatter functions and do not synchronise.
Timely dataflow model. Timely dataflow was introduced by Murray et al., and is a general model that does not require developers to implement tasks that dynamically spawn other tasks in order to run iterative workflows. Timely dataflows are directed cyclic dataflows in which edges carry both data records and logical timestamps (see Figure 2.6e). These logical timestamps are used to pass iteration or workflow progress information among task nodes. The Naiad back-end execution engine is based on this timely dataflow model. Unlike many of the systems described above, Naiad is a general back-end that can execute a wide range of workflow types: batch, iterative, incremental, streaming and graph workflows.
2.2.1.2 Graph data processing back-end execution engines
Graph-structured data is common in workflows that detect fraud, improve recommendation engines, or compute transportation routes. These workflows run a series of iterations until a convergence criterion is satisfied. Initially, developers used batch back-end execution engines that are based on dataflow models ill-suited to such workflows (e.g., MapReduce and acyclic dataflow models) [DG08; GIR16]. These back-ends perform poorly on graph workflows because they run at least a job per iteration, and do not optimise across iterations. Specialised graph processing back-ends addressed these limitations [MAB+10; GLG+12; KBG12; GIR16]. They generally use one of two computation models that have limited expressivity, but can deliver significantly better performance on graph workflows.
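To make the vertex-centric, superstep-based style described below concrete, the following sketch runs PageRank in this style within a single process. It is only an illustration of the programming model, not the API of Pregel, Giraph or any other system: real back-ends partition the vertices over many workers and exchange the per-superstep messages over the network.

```python
# Minimal, single-process sketch of vertex-centric, superstep-based PageRank.
# Illustrative only: distributed systems shard vertices across workers and
# deliver messages over the network, with a barrier between supersteps.

def pagerank_bsp(vertices, out_edges, supersteps=5, damping=0.85):
    """vertices: list of vertex ids; out_edges: dict vertex -> list of neighbours."""
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    inbox = {v: [] for v in vertices}        # messages received in the previous superstep

    for step in range(supersteps):
        outbox = {v: [] for v in vertices}   # messages delivered in the next superstep
        for v in vertices:                   # the "vertex program", run for every vertex
            if step > 0:                     # from the second superstep on, fold in contributions
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            neighbours = out_edges.get(v, [])
            if neighbours:                   # send my contribution to adjacent vertices
                share = rank[v] / len(neighbours)
                for w in neighbours:
                    outbox[w].append(share)
        inbox = outbox                       # barrier: all messages become visible together
    return rank

if __name__ == "__main__":
    vs = [0, 1, 2, 3]
    es = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
    print(pagerank_bsp(vs, es))
```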
Bulk synchronous parallel model. In 1990, Valiant introduced the bulk synchronous parallel (BSP) model for parallel computations with the purpose of bridging between software and hardware [Val90]. BSP algorithms run on a set of processes and proceed in a series of global supersteps. Each superstep comprises three steps: (i) each process independently performs local computation, (ii) processes exchange messages all-to-all, and (iii) processes wait (i.e., barrier synchronise) until all finish the previous two steps.
Pregel [MAB+10] was the first graph processing back-end based on the BSP computation model, and the first to provide a “vertex-centric” interface for developers to implement workflows. Developers provide code that Pregel executes in parallel for each graph vertex. In each superstep, the code receives data from adjacent vertices, runs the user-provided vertex computation, and sends data to adjacent vertices. Sent data are received by adjacent vertices in the following superstep (see Figure 2.7a). Pregel influenced other BSP-based systems such as Unicorn [CBB+13] and Apache Giraph [GIR16].
Despite their popularity, BSP-based graph processing back-ends are not suited to all graph workflows. First, McSherry et al. showed that, for the graph connectivity problem, a single-threaded “Union-Find with weighted union” implementation can outperform the equivalent distributed label propagation algorithm running on a specialised graph-processing back-end deployed on a 100-machine cluster by over an order of magnitude [MIM15]. Yet, frameworks that offer vertex-centric interfaces are not sufficiently expressive to support the more efficient Union-Find algorithm. Second, in BSP-based back-ends, some processes may finish work early, but idle-wait for other processes to finish work and dispatch synchronisation messages. To make matters worse, tasks may idle-wait for longer if the cluster network is under load and synchronisation messages are delayed – i.e., BSP-based back-ends are prone to be affected by other applications’ network traffic [GSG+15].
Asynchronous model. The process synchronisation step in the BSP model causes unnecessary slow-downs for some graph problems such as belief propagation and website ranking [LBG+12]. These algorithms work towards convergence criteria that can be achieved without synchronising processes at the end of each superstep. On the contrary, they converge faster if each process operates on the most recent data – even if processes have not yet received all messages from other processes.
Low et al. address this limitation in GraphLab, a graph processing back-end that supports asynchronous graph computations [LBG+12]. In GraphLab, user-defined vertex code can directly access its own vertex’s state, as well as adjacent vertices’ and edges’ state. GraphLab runs vertex code in parallel, but it ensures that neighbouring vertices do not run simultaneously in order to keep state consistent.
PowerGraph introduced the three-phase Gather, Apply, Scatter (GAS) abstraction, which can express both synchronous and asynchronous graph computations [GLG+12]. In the gather phase, data are collected from adjacent vertices using a user-defined commutative and associative gather function. The function’s output is then passed to the user-defined apply function, which updates the vertex state. Finally, the scatter phase filters the vertex state and sends it to adjacent vertices.
Optimising graph processing frameworks. Gonzalez et al.
observe that many social and web graphs have power-law degree distributions, which are difficult to partition across ma- chines [GLG+12]. They address this problem with a new vertex-cut graph partitioning algo- rithm that reduces communication. Chen et al., improve upon PowerGraph’s performance with PowerLyra, a back-end that dynamically applies different variants of a partitioning algorithm depending on the graph’s structure [CSC+15]. PowerLyra combines PowerGraph’s vertex-cut and edge-cut with a few heuristics. Zhang et al., improve further on PowerLyra’s performance with a 3D partitioning algorithm that reduces network communication for matrix-based appli- cations that are executed as graph workflows [ZWC+16]. Others optimise graph processing back-end engines for single machine execution. For example, GraphChi [KBG12] leverages SSDs for resource-efficient vertex-based graph workflows on a single machine, while X-Stream [RMZ13] uses an edge-centric approach that is optimised to sequentially read input graphs, also on a single machine. 2.2.2 Front-end frameworks Many big data workflow developers are data analysts who find it difficult to implement work- flows using low-level interfaces provided by back-end execution engines. Instead, developers use front-end frameworks that offer higher-level abstractions, which make it easy to express workflows. These frameworks translate workflows into jobs, which run on back-end execution engines. In practice, front-end frameworks are almost always coupled to a single back-end execution engine, and cannot execute workflows on other back-ends; see Figure 2.5 for examples of cou- pling between front-end frameworks and back-end execution engines. This coupling hinders the adoption of new more efficient back-ends because legacy workflows must be manually ported. Porting is tedious and is not undertaken lightly, especially in large companies where hundreds of thousands of workflows would have to be ported manually. In this dissertation, I focus on front-end frameworks for batch and graph data processing. In ad- dition, there are many front-end frameworks for representing stream and interactive data work- flows (e.g., PowerDrill [HBB+12], Peregrine [MG12]). Many of the techniques I describe can be applied to these as well. 36 2.2. DATA PROCESSING 2.2.2.1 Batch data processing front-end frameworks Batch back-end execution engines processes huge data sets quickly because they parallelise I/O and processing over many disks and machines. Hence, it is paramount for front-end frameworks to provide intuitive ways of expressing workflows without restricting parallelism. Structured Query Language (SQL). SQL is an intuitive declarative language that is exten- sively used as an interface for interacting with databases. SQL comprises of a small set of operators that are based on Codd’s relational algebra (e.g., filter, project, join) [Cod70]. These operators are data-parallel because each operator can be simultaneously applied on partitions of the data. SQL queries submitted by users are parsed and translated into logical query plans, which store queries as directed acyclic graphs of operators. SQL’s widespread use and the logical query plan DAG representation that maps to an acyclic dataflow workflow make SQL-like abstractions an attractive way of expressing workflows. For example, Sawzall [PDG+05] is a front-end similar to SQL. 
It introduces a new procedural programming language in which big data workflows are expressed in two steps: (i) a step in which operators are performed on a single data record at a time, and (ii) a step in which the output of the first step is processed using data aggrega- tors. The language is less sophisticated than SQL, but it can easily translate to MapReduce jobs. Pig [ORS+08] and Hive [TSJ+09] are widely used front-end frameworks that present a SQL-like interface to developers. They both run on top of the Hadoop MapReduce back-end and translate SQL-like workflows to directed acyclic graphs of MapReduce jobs. They retrofit support for acyclic complex workflows on top of MapReduce execution engines: workflows execute as a series of MapReduce jobs in topological order of data dependencies. Shark [XRZ+13] replaces Hive’s physical query plan generator to use Spark’s resilient distributed data sets rather than generate several MapReduce jobs. Spark SQL [AXL+15] is a redevelopment of Shark that of- fers better integration between relational SQL code and traditional Spark procedural processing code. SCOPE [CJL+08] and Tenzing [CLL+11] make the relationship to SQL more explicit; Tenzing provides an almost complete SQL implementation on top of MapReduce. The semantics of these SQL-like frameworks, however, are heavily influenced by back-end execution engines to which they translate workflows. For example, Pig relies on COGROUP clauses to delineate MapReduce jobs, while Spark SQL is tightly integrated with a Spark-centric query plan optimiser. Language-integrated solutions. Data processing workflows are often deeply integrated in applications. In such cases, applications and workflows can be developed using programming languages for which integrated workflow front-ends exist. For example, FlumeJava is a Java library that provides several immutable parallel collections that support several operations for parallel processing (e.g., ParallelDo, GroupByKey) [CRP+10]. FlumeJava defers the execu- tion of these operations, and constructs an acyclic dataflow graph. Following, it optimises the CHAPTER 2. BACKGROUND 37 dataflow graph, and executes the workflow on either the local machine or a MapReduce cluster depending on the input data size. Similarly, Language INtegratedQuery (LINQ) is a .NET framework that adds SQL-style query extensions as syntax sugar to the programming languages running on top of .NET. LINQ ab- stracts away workflow definition from execution by supporting several “query providers” (e.g., provider to XML, provider to SQL). Yu et al. developed DryadLINQ, a new LINQ query provider that translates LINQ expressions and executes them in parallel on a cluster using the Dryad back-end execution engine [YIF+08]. Similarly, Lindi is a query provider that executes workflows on the Naiad back-end execution engine [LND16]. 2.2.2.2 Graph data processing front-end frameworks Some specialised front-end frameworks offer interfaces that are well-suited for expressing graph processing workflows. Pregelix [BBJ+14] is a front-end framework that provides the same vertex-centric interface as Pregel, but internally models the graph computation as a traditional database query of relational operators. By taking this approach, Pregelix can execute workflows on the Hyracks [BCG+11] back-end execution engine based on the acyclic dataflow model. GraphX [GXD+14] offers a Scala-embedded GAS interface and runs on top of Spark. 
GraphX translates GAS workflows into traditional Spark procedural code that applies operations over RDDs. Finally, GraphLINQ [McS14] is a C# library for Naiad with LINQ-like operators optimised for graph processing. GraphLINQ offers an easy-to-use procedural programming interface in which workflows are expressed by manipulating relations storing graphs’ vertices and edges.
To sum up, front-end frameworks provide high-level declarative abstractions that developers can use to express workflows. The frameworks translate workflows into low-level abstractions and execute them on back-end engines. In theory, front-ends should be able to translate workflows to several back-end execution engines, but in practice this is not the case: each front-end is coupled to a back-end (see Figure 2.5). I now show in a series of experiments that this coupling increases workflow makespan because no back-end execution engine outperforms all others.
2.2.3 No winner takes it all
Back-end execution engines have built-in assumptions about the likely operating regime, and optimise their behaviour accordingly. As a result, back-end execution engine performance is often non-intuitive, which makes it difficult for developers to determine which back-end is best for a given workflow, input data set and cluster setup.
I show that back-end performance is non-intuitive in a set of simple benchmarks that I execute on a heterogeneous local cluster and on a medium-sized cloud cluster (see §4.1 for a description of the setup). The former represents a dedicated, fully controlled environment with low expected external variance, while the latter corresponds to a typical shared, multi-tenant data-centre cluster. In all cases, I run frameworks over a shared Hadoop file system (HDFS) installation that stores input and output data. I either implement workflows directly against a particular back-end execution engine, or use a front-end framework with its corresponding native back-end.
2.2.3.1 Performance
Batch data analytics workflows often consist of relational operators. Consequently, I consider the behaviour of two relational operators in three isolated micro-benchmarks. I use the heterogeneous local cluster of seven nodes as an example of a small-scale data analytics deployment. In later experiments, I show that my results generalise to more complex workflows and to larger clusters.
In all experiments, I measure workflow makespan – i.e., the total execution time of workflows. Makespan includes the time it takes to load input data from HDFS, to pre-process or transform data into the formats supported by back-ends (e.g., graph partitioning and sorting in PowerGraph and GraphChi), the workflow computation time, and the time it takes to write the output to HDFS. As a result, the numbers I present are not directly comparable to those presented in many papers, which focus only on the workflow computation time. I believe that makespan is a more insightful metric than computation time because it is an accurate estimate of how long users have to wait for their workflows to complete.
Input size. A key question for a workflow developer is whether to leverage parallelism within a single machine or across many machines. The answer to this question depends heavily on both input data size and back-ends’ architectural design.
A single machine can be used efficiently for a workflow if the entire working set (including any overheads) fits into its main memory and the parallelism available in the machine is sufficient. It is important to understand the trade-offs between single-machine and distributed execution because workflows that can be processed on a single machine are common in practice: 40–80% of Cloudera customers’ MapReduce jobs and 70% of jobs in a Facebook trace have ≤ 1 GB of input [CAK12].
To investigate how input size affects system performance and when single-machine back-ends are best, I look at a simple string processing workflow in which I extract one column from a space-separated, two-column ASCII input. This corresponds to a PROJECT query in SQL terms, but is also reminiscent of a common pattern in log analysis batch jobs: lines are read from storage, split into tokens, and a few are written back. I consider input sizes ranging from 128 MB up to 32 GB.
In Figure 2.8a, I compare the makespan of this workflow on five different frameworks. Two of these are developer-friendly SQL-like front-ends (Hive, Lindi), while the others are back-end execution engines that require the developer to program against a low-level API (Hadoop, Metis and Spark).
Figure 2.8: Different systems perform best for simple queries: (a) PROJECT on a relation (makespan against input size, 0.1–32 GB, log2-scale); (b) JOIN of two data sets (asymmetric and symmetric input relation sizes). Lower is better; error bars show min/max of three runs.
For small inputs (≤ 0.5 GB), the Metis single-machine MapReduce back-end performs best. Once the data size exceeds 0.5 GB, however, the distributed frameworks outperform it due to their ability to perform I/O in parallel (Metis is not bottlenecked on computation, but on reading the data from HDFS; with the data already local, Metis performs best up to 2 GB of input data).
I/O efficiency. As input data size grows, Hive, Spark and Hadoop all surpass the single-machine Metis, not least since they stream data from and to HDFS in parallel. The Lindi front-end implementation for Naiad performs surprisingly poorly; I tracked this down to an implementation decision in the Naiad back-end, which uses only a single input reader thread per machine, rather than having multi-threaded parallel reads. Since the PROJECT benchmark is primarily limited by I/O bandwidth, this decision proves detrimental.
Data structure. Next, I consider a JOIN workflow. This workflow’s output size is highly dependent on the structure of the input data: it may generate less, more, or an equal amount of output compared to its input. I therefore measure two different cases: (i) an input-skewed, asymmetric join of the 4.8M vertices and 69M edges of a social network graph (LiveJournal), and (ii) a symmetric join of two uniformly randomly generated 39M-row data sets. In Figure 2.8b, I show the makespan of this workflow on different front-ends and back-ends (plus a simple implementation in serial C code). The unchallenging asymmetric join (producing 1.28 million rows, 1.9 GB output size) works best when executed in single-threaded C code on a single machine, as the computation is too small to amortise the overheads of distributed execution engines. The far larger symmetric join (producing 1.5 billion rows, 29 GB output), however, works best when it is expressed in Hive and runs on Hadoop MapReduce, although I did need to manually optimise Hive to use a suitably high degree of concurrency.
By default, Hive uses the CombinedHiveInputFormat class to merge several input splits and thus reduce the total number of map tasks; however, this limits the potential parallelism, which in this case is detrimental. I instead use the HiveInputFormat class and manually set the parallelism – a choice that data analysts would have difficulty making without knowledge of Hive’s internals. Other systems suffer from inefficient I/O parallelism (e.g., Lindi uses a single-threaded writer), or have overhead due to constructing in-memory state and scheduling tasks sub-optimally (Spark). The takeaway here is that the best system may depend not only on the size of the data to be processed, but also on the nature of the processing conducted, and its sensitivity to data skew.
Dataflow model. The choice of dataflow model can have a major impact on back-end performance and efficiency. For example, many common workflows include iterative computations on graphs (e.g., social networks), which are ill-suited to execution on back-ends based on the MapReduce dataflow model (§2.2.1.1). In the following, I compare different back-ends and front-ends running PageRank on social network graphs. I use an EC2 cluster (m1.xlarge instances) and I vary its size in order to determine systems’ efficiency at different scales.
Many specialised graph processing systems are based on the bulk synchronous parallel or gather, apply, and scatter (GAS) models (§2.2.1.2). These systems cannot run many types of workflows, but can deliver significantly better performance on graph workflows than other back-ends. In Figure 2.9, I show the makespan of a five-iteration PageRank workflow on a small Orkut social network graph (3.0M vertices, 117M edges) and on a large Twitter graph (43M vertices, 1.4B edges). It is evident that graph-specialised systems have significant advantages for this computation: the GraphLINQ implementation of the workflow that runs on Naiad outperforms all other systems (only the GraphLINQ front-end for Naiad is shown here; Lindi is not optimised for graph computations and performs poorly). PowerGraph also performs well because its vertex-centric sharding reduces the communication overhead that dominates PageRank makespan.
Figure 2.9: Makespan of PageRank on social network graphs in different back-ends and front-ends, at 100 nodes, 16 nodes and a single node: (a) Orkut (3.0M vertices, 117M edges); (b) Twitter (43M vertices, 1.4B edges). Makespan varies depending on scale; lower is better; error bars: ±σ of 10 runs.
Engineering choices. Engineering choices such as targeted scale and back-end implementation language also have an impact. Different techniques make sense at different scales. For example, Hadoop, which is designed to run on clusters of thousands of nodes [HAD16], piggybacks task launch messages onto coarsely spaced heartbeats to avoid incast at the master node, leading to widely documented task spawn overhead [OPR+13], but improving scalability. By contrast, Spark, which is designed to run on clusters of hundreds of nodes [ZCD+12], eagerly launches tasks as soon as they are scheduled.
Finally, PowerGraph provides no fault tolerance because it targets tens of powerful machines – setups in which workflows are less likely to be affected by failures.
Other seemingly unimportant engineering choices, such as the programming language chosen for the implementation, can also significantly affect performance. For example, many back-ends are implemented in managed languages with garbage collection. These back-ends are highly sensitive to garbage collector configuration parameters. In my experiments, I frequently observed stalls when large heaps had to be garbage-collected; in some cases, garbage collection even manifested itself as a failure: garbage-collecting a 64 GB JVM heap took more than 60 s and triggered system-level timeouts. By contrast, other systems are implemented in unmanaged languages (e.g., PowerGraph in C++), and utilise the machines’ compute resources more efficiently.
2.2.3.2 Resource efficiency
The fastest back-end is not always the most resource-efficient. While PageRank implemented in GraphLINQ and running on 100 Naiad nodes has the lowest makespan in Figure 2.9b, PowerGraph performs better than GraphLINQ when using only 16 nodes, due to its improved sharding (PowerGraph requires the number of nodes to be a power of two, but running it on 32 or 64 nodes showed no benefit over running it on 16 nodes). Moreover, when the input graph is small, single-node GraphChi performs only 50% worse than Spark on 100 nodes, and only slightly worse than PowerGraph on 16 nodes (see Figure 2.9a). In such situations, it may be worthwhile to wait longer for workflows to complete, but utilise resources more efficiently.
Resource efficiency is a measure of the efficiency loss incurred due to scaling out over multiple machines and executing workflows in unoptimised execution engines. I compute resource efficiency by normalising workflows’ fastest single-machine makespan (which I assume to be maximally resource-efficient) to their aggregate execution time over all machines when running in a distributed back-end. For example, a workflow that runs for exactly 30 s on a back-end deployed on 100 machines incurs an aggregate runtime of 3,000 s. If the best single-machine back-end completes the same workflow in 1,500 s, the resource efficiency is 50% when running on 100 machines.
In Figure 2.10, I show the resource efficiency for the PageRank workflow by normalising systems’ resource usage to that of GraphChi. Only PowerGraph comes close to GraphChi in terms of total machine time used, but it still consumes significantly more resources. PowerGraph’s resource use is between 2.5× (Twitter) and 8× (Orkut) that of GraphChi.
Figure 2.10: Resource efficiency of different data processing systems running PageRank on an EC2 cluster, at 100 nodes, 16 nodes and a single node: (a) Orkut (3.0M vertices, 117M edges); (b) Twitter (43M vertices, 1.4B edges).
2.2.4 Data processing systems summary
In the previous subsection, I showed that no single front-end or back-end systematically outperforms all others.
I also highlighted that system performance depends on:
• input data size: single-machine back-ends outperform distributed back-ends for small inputs;
• the structure of the data: skew and selectivity impact I/O performance and work distribution;
• the dataflow model a back-end is based on: some models (e.g., MapReduce dataflows) are ill-suited to iterative computations, and systems based on specialised dataflow models often operate more efficiently;
• engineering decisions: overheads due to implementing back-ends in managed languages, and data loading costs, vary significantly across frameworks.
Understanding when and how these properties and decisions affect workflow performance requires detailed, in-depth knowledge of systems’ inner workings. It is unrealistic to expect the majority of workflow developers to have this knowledge. To make matters worse, even if developers have this knowledge, they cannot easily choose the most appropriate execution engines because they often lack information about key parameters at workflow implementation time (e.g., the size of generated intermediate data).
Many workflows today achieve suboptimal performance because of the following four issues:
1. Execution engines are chosen statically, ahead of runtime. Data size is an important factor in choosing an execution engine, but it is difficult to predict how much intermediate data a workflow will generate. Moreover, input data size can change significantly between two consecutive workflow runs because of temporary traffic spikes. Yet, workflows are implemented for, and executed on, an execution engine decided upon ahead of runtime. This can lead to sub-optimal performance if the data volume or resource availability does not match implementation-time expectations.
2. Static, ahead-of-time imposition of job boundaries in the workflow. Developers statically partition complex workflows into jobs for different execution engines at implementation time. However, the best workflow partitioning depends on the expressivity of the dataflow models on which the available engines are based, on data structure and skew, and on the state of the cluster – information that is rarely available before workflow runtime. Instead, workflows should be partitioned into jobs dynamically at runtime.
3. Workflow migration difficulties. Workflows must be manually ported to new front-end frameworks or engine-specific interfaces upon the introduction of new execution engines. This requires significant engineering effort; hence it discourages adoption of new execution engines, even when they offer performance gains or reduced resource demands.
4. Job-level scheduling. Workflows can comprise several jobs that have dependencies among them. However, many execution engines and cluster managers schedule workflows on a per-job basis, and do not take job dependencies into account. This can lead to unnecessary bottlenecks, resource waste and longer makespans [GKR+16].
In Chapter 3, I propose a new data processing architecture that decouples front-end frameworks from back-end execution engines. I implement a proof-of-concept of this architecture, Musketeer, which automatically generates back-end code and dynamically chooses back-ends at runtime. In Chapter 4, I show that workflows running on Musketeer use fewer resources or complete faster compared to running on a single back-end.
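To illustrate why deferring the engine choice to runtime helps, the sketch below ranks back-ends with a toy cost model once the input size and workflow shape are known. It is purely illustrative: the engine list, thresholds and cost terms are invented for this example and do not represent Musketeer’s actual mapping algorithm, which Chapter 3 describes.

```python
# Illustrative sketch only: a toy cost model for picking a back-end at runtime.
# Engine names, rates and penalties are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Workflow:
    input_gb: float     # size of the input data
    iterative: bool     # does the workflow loop until convergence?
    graph: bool         # is it a graph computation?

@dataclass
class Engine:
    name: str
    startup_cost_s: float      # per-job overhead (task spawning, JVM start-up, ...)
    gb_per_s: float            # rough processing rate
    supports_iteration: bool
    graph_optimised: bool

ENGINES = [
    Engine("single-machine", 1, 0.5, True, False),
    Engine("mapreduce-like", 30, 5.0, False, False),
    Engine("in-memory-dataflow", 10, 8.0, True, False),
    Engine("graph-specialised", 15, 6.0, True, True),
]

def estimated_makespan(engine: Engine, wf: Workflow) -> float:
    if wf.iterative and not engine.supports_iteration:
        return float("inf")    # would need an external driver program; rule it out here
    cost = engine.startup_cost_s + wf.input_gb / engine.gb_per_s
    if wf.graph and not engine.graph_optimised:
        cost *= 5              # penalty for running graph workflows on a generic dataflow engine
    return cost

def choose_backend(wf: Workflow) -> str:
    return min(ENGINES, key=lambda e: estimated_makespan(e, wf)).name

print(choose_backend(Workflow(input_gb=0.5, iterative=False, graph=False)))  # small input
print(choose_backend(Workflow(input_gb=500, iterative=True, graph=True)))    # large iterative graph job
```

Even this crude model reproduces the qualitative findings above: small inputs favour the single-machine engine, while large iterative graph workflows favour a graph-specialised back-end – but the “right” answer is only known once the input size is.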
2.3 Cluster scheduling
Cluster management systems such as Mesos [HKZ+11], YARN [VMD+13], Borg [VPK+15], and Kubernetes [KUB16] use virtualisation solutions to share cluster resources. However, the isolation techniques used by current virtualisation solutions (e.g., memory limits, disk quotas) do not offer performance isolation. In practice, two tasks that are co-located on a machine are likely to interfere, which can cause application-level performance variation. The interference happens because the underlying hardware – the machines, the network and the storage – is still fundamentally shared among tasks.
The role of the cluster scheduler is to place tasks on cluster machines such that machines are shared efficiently and application-level performance is not significantly affected. Yet, these demands are increasingly difficult to meet as workloads become more diverse and clusters grow in size.
In the following sections, I first discuss the requirements today’s workloads impose on cluster schedulers (§2.3.1). Next, I describe the main cluster scheduler architectures, and discuss whether these architectures satisfy the cluster scheduler requirements (§2.3.2). Finally, I describe how min-cost flow schedulers that choose high-quality placements work, compare them with task-by-task queue-based schedulers, and show that current min-cost flow schedulers do not scale to large clusters (§2.3.3).
2.3.1 Cluster scheduling requirements
I outline below the challenges cluster schedulers must overcome in order to utilise clusters efficiently and meet the demands of the increasingly diverse workloads they execute.
2.3.1.1 Hardware heterogeneity
Data-centre clusters are built using commodity machines purchased in bulk. Nevertheless, cluster hardware is more heterogeneous than one might expect: machines are replaced upon failure, hardware upgrades are rolled out regularly, and resources are intentionally diversified [TMV+11; BCH13; MT13].
Most clusters have memory and storage of different latency and bandwidth, and run at least three processor generations. Hence, it is common to have up to 30 different machine configurations [DK13]. An analysis of a publicly available Google trace found that a typical cluster has three different machine platforms and ten machine specifications [RTG+12; Sch16]. Similarly, a study of Amazon’s EC2 infrastructure found that m1.large instances use five different CPU models. This diversity can cause up to 60% workload runtime variation when the same type of EC2 instances are used [OZN+12], or up to a 100% runtime increase when different CPU generations are used [Sch16, §2.1.2].
The cluster scheduler must take hardware heterogeneity into account when it places tasks in order to utilise the hardware efficiently. For example, it must place machine learning tasks on machines with GPU accelerators, if possible, because these tasks benefit from the parallelism GPUs offer. Finally, the cluster scheduler must also be able to leverage hardware heterogeneity to reduce power usage. Utilising more power-efficient hardware when the workloads’ service level objectives allow it can lead to improvements of up to 20% in power efficiency [NIG07; Sch16].
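A heterogeneity-aware scheduler typically combines a hard filter (which machines can run the task at all?) with a ranking that reflects the current objective. The sketch below illustrates this shape; the machine attributes, task fields and scoring rules are invented for illustration and are not taken from any particular scheduler.

```python
# Sketch of heterogeneity-aware candidate selection. Attributes, requirements
# and scoring are hypothetical, chosen only to illustrate filtering and ranking
# over a heterogeneous machine fleet.

from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    cpu_generation: int
    has_gpu: bool
    watts_per_core: float

@dataclass
class Task:
    needs_gpu: bool
    latency_critical: bool      # tight SLO => prefer the newest, fastest CPUs

def candidates(task: Task, machines):
    # Hard filter: only keep machines that can run the task at all.
    feasible = [m for m in machines if m.has_gpu or not task.needs_gpu]

    def score(m: Machine) -> float:
        if task.latency_critical:
            return -m.cpu_generation     # newest CPU generation first
        return m.watts_per_core          # otherwise, most power-efficient first

    return sorted(feasible, key=score)

fleet = [
    Machine("m1", cpu_generation=2, has_gpu=False, watts_per_core=9.0),
    Machine("m2", cpu_generation=4, has_gpu=True,  watts_per_core=12.0),
    Machine("m3", cpu_generation=3, has_gpu=False, watts_per_core=6.5),
]
print([m.name for m in candidates(Task(needs_gpu=False, latency_critical=False), fleet)])
```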
2.3.1.2 Task co-location interference
When two or more tasks execute concurrently on a machine, they may interfere on the fundamentally shared intra-machine resources (e.g., disk bandwidth, memory bandwidth, caches, shared operating system data structures) and inter-machine cluster resources (e.g., network links). A cluster scheduler that simply bin-packs tasks to use all the CPUs, memory and disk I/O can achieve high resource utilisation, but at the cost of poor and variable application-level performance (e.g., high latency and a low number of served queries per second) due to interference.
Schwarzkopf showed in a study of task co-location that the performance of data-centre applications (e.g., PageRank, strongly connected components, image classification) degrades by up to 2.1× when tasks are co-located, compared to when tasks run on otherwise idle machines [Sch16, §2.1.3]. Similarly, Mars et al. found a 35% degradation in application-level performance for co-located Google workloads [MTH+11]. Co-location interference also affects latency-critical production applications, and can cause latency degradations of up to 3× [LCG+15].
All tasks compete for the shared resources, but certain tasks make better neighbours, while others can cause significant interference. The cluster scheduler must place and manage tasks such that they do not significantly interfere. Two different solutions have been proposed to reduce interference: (i) non-controlling solutions that consider how much a task would be affected by other running tasks before placing it on a machine, and (ii) controlling solutions that use mechanisms to limit the resource utilisation of low-priority tasks in order to avoid affecting high-priority, latency-critical tasks.
Non-controlling interference avoidance. Paragon [DK13] is a scheduler that profiles tasks to discover how much interference on various resources affects them. Paragon feeds the profiling results into a collaborative filtering algorithm that classifies submitted tasks and identifies similarities with previously scheduled tasks. Finally, Paragon greedily places submitted tasks on machines with well-suited hardware, and on which tasks do not interfere. In practice, however, task resource utilisation can vary greatly as tasks transition through different stages [CAB+12]. Paragon may miss these resource utilisation variations because it profiles tasks for only several seconds. The Quasar [DK14] scheduler uses similar techniques, but shifts from a reservation-centric approach, in which developers ask for a fixed amount of resources for each task, to a performance-centric approach, in which developers express performance requirements (e.g., queries per second, query latency). In addition to the profiling Paragon does, Quasar also studies application scale-up and scale-out behaviour. Quasar uses these profiling results to determine the least amount of resources required to meet applications’ performance requirements, given currently available machines and active workloads.
Figure 2.11: Analysis of task CPU versus memory usage in the Google trace. The values are normalised to the highest value present in the trace for the given type of metric.
Controlling interference avoidance. Non-controlling interference avoidance techniques either migrate tasks or throttle the number of tasks that execute on a machine. However, these
techniques may not be suitable for high-priority latency critical tasks. Task migration may cause application downtime or may require state to be recomputed, which can increase application la- tency. Task throttling stops additional interfering tasks from being placed on machines, but does not decrease interference among running tasks. Heracles [LCG+15] is a feedback-based controller that actively reduces task co-location inter- ference. It dynamically manages several software and hardware isolation mechanisms to reduce interference on cores, network and last-level-cache (LLC). Heracles uses Linux cpuset cgroups to pin tasks to cores, the Linux qdisc scheduler with hierarchical token bucket queuing to en- force bandwidth limits on outgoing task traffic, and Intel’s Cache Allocation Technology (CAT) hardware to partition the shared LLC. Heracles combines these techniques to achieve 90% av- erage machine utilisation without violating latency SLAs for high-priority tasks. To sum up, the cluster scheduler must co-locate tasks to achieve high cluster utilisation, but without affecting application-level performance. An ideal scheduler would take into account co-location interference when it places tasks, but it would also use controlling techniques to reduce interference among running tasks. 2.3.1.3 Multi-dimensional resource fitting Many cluster schedulers assume that tasks have uniform resource requirements, and statically partition machines into a fixed number of “slots”, in which they execute tasks [DG08; IPC+09; OWZ+13; VMD+13; DDK+15; KRC+15]. However, nowadays’ clusters execute different types of tasks (e.g., interactive, batch and service) which have diverse resource needs. In Figure 2.11, I show the normalised mean CPU and memory usage for 10,000 tasks picked at random from CHAPTER 2. BACKGROUND 47 a publicly available trace of a Google cluster [RTG+12]. Task CPU and memory usage vary greatly, and there is no correlation between the two. Cluster schedulers that statically partition machines into slots are not well suited to handle such diverse workloads because they assume that tasks have similar resource requirements (i.e., place tasks into equally-sized slots) [OWZ+13; DDD+16]. They either oversubscribe machines because they co-locate too many tasks that utilise more resources than the slots are allocated, or underutilise machines because they co-locate tasks that do not fully utilise slot resources. Fundamentally, these schedulers cannot achieve high resource utilisation and good application performance. To sum up, the cluster scheduler must dynamically allocate resources to tasks, and it must achieve a good bin-packing on different resource dimensions in order to neither under-utilise nor over-utilise resources. 2.3.1.4 Resource estimation and reclamation Some schedulers do not partition machines into slots, but expect users to specify task resource requirements at task submission time [HKZ+11; VMD+13; VPK+15; KUB16]. These sched- ulers use resource requirements to allocate resources and bin-pack tasks on machines such that CPU and memory are almost fully utilised. However, in practice, users find it difficult to pre- dict how many resources tasks require. To make matters worse, cluster managers kill tasks that utilise more resources than requested, and thus incentivise users to overestimate task resource needs. For example, 70% of jobs use less than 10% of requested resources, and 20% of jobs use less than 20% of requested resources in a Twitter production cluster [DK14]. 
Similarly, 90% of service tasks use less than 10% of requested CPU, and 75% use less than 60% of requested memory in a 1,300-machine Alibaba cluster [ALI17] (see Figure 2.12). Users also overestimate resource requirements for the tasks they run on Google’s Borg cluster manager; in Figure 2.13, I show task CPU, memory and local disk usage normalised to the resource requests from a Google cluster [RTG+12]. In this cluster, 50% of tasks use less than 50% of requested CPU, 90% of tasks use less than 50% of requested memory, and 90% of tasks use less than 20% of requested local disk space. To avoid keeping resources idle, the Borg cluster manager reclaims resources and dynamically adjusts resource reservations if tasks underutilise allocated resources [VPK+15]. Borg achieves aggregate CPU utilisation of 25–35% and aggregate memory utilisation of 40%, even though users significantly overestimate task resource requirements.
Figure 2.12: CDF of service task resource usage (CPU, memory, disk) normalised to requests in a 1,300-machine Alibaba cluster.
Figure 2.13: CDF of task resource usage (CPU, memory, disk) normalised to requests in a 12,500-machine Google cluster.
Other cluster managers automatically estimate resource requirements. Quasar [DK14] profiles tasks for several seconds to estimate the effect that scaling tasks up and out has on application performance. However, task resource utilisation varies greatly during a task’s lifetime because tasks have periods when they communicate and compute, but also periods when they wait for other tasks to do work [CAB+12]. Quasar may miss such resource utilisation variations in its short task profiling step.
By contrast, Jockey [FBK+12] requires users to specify deadlines by which jobs must complete. Jockey dynamically and automatically adjusts task resource allocations at runtime such that as many jobs as possible meet their deadlines. However, this technique cannot be applied to long-running service tasks, which do not have deadlines.
In conclusion, cluster schedulers that support either automatic resource estimation or dynamic resource reclamation, but not both, underutilise clusters. An ideal scheduler must estimate resource utilisation to predict users’ resource needs, and must reclaim resources at runtime in case it mispredicts.
2.3.1.5 Data locality
Data locality is essential for many tasks that run in today’s clusters. Data-intensive tasks can become “stragglers” if data locality preferences are ignored. Moreover, cluster schedulers that do not take data locality into account may co-locate tasks that do not read local data next to tasks that read non-replicated local input data [ZKJ+08; AGS+13]. Such placements increase makespan because tasks compete for disk bandwidth: non-local tasks write output, while local tasks read input and write output.
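One simple way to quantify locality during placement – shown here purely as an illustration, not as any specific scheduler’s algorithm – is to rank candidate machines by the fraction of a task’s input bytes they hold locally, so that placements that would stream most input over the network or the shared disks score poorly.

```python
# Illustration only: score candidate machines by the fraction of a task's
# input bytes stored locally, and rank them so the most local machine is first.

def locality_score(input_blocks, machine):
    """input_blocks: list of (size_bytes, set_of_machines_holding_a_replica)."""
    total = sum(size for size, _ in input_blocks)
    local = sum(size for size, replicas in input_blocks if machine in replicas)
    return local / total if total else 1.0

def rank_machines(input_blocks, machines):
    # Highest fraction of local input first; a real scheduler would also trade
    # this score off against machine load, or delay briefly until a better
    # machine frees up.
    return sorted(machines, key=lambda m: locality_score(input_blocks, m), reverse=True)

blocks = [
    (128 * 2**20, {"m1", "m2"}),   # 128 MB block replicated on m1 and m2
    (128 * 2**20, {"m2", "m3"}),
    (64 * 2**20,  {"m3"}),
]
print(rank_machines(blocks, ["m1", "m2", "m3"]))   # -> ['m2', 'm3', 'm1']
```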
Cluster schedulers aim to place tasks close to input data to alleviate disk and network band- width interference, but other complementary techniques also exist: (i) some schedulers de- lay task placement until better placement options become available [ZBS+10; HKZ+11], (ii) other schedulers consider pre-empting already running tasks in order to improve data local- ity [IPC+09], and (iii) distributed files systems automatically increase the replication of popular data in the hope of increasing the likelihood of achieving data locality [AAK+11]. Until recently, disks provided higher bandwidth than data-centre host network links. How- ever, today’s data centres use 10 Gbps, full-bisection bandwidth Ethernet which offers an order of magnitude higher bandwidth than disks. This increase in bandwidth combined with new low-latency switches have made the latency and bandwidth differences between local and re- mote disk accesses insignificant [AGS+11]. However, high task data locality still remains a key requirement for cluster schedulers because data centres nowadays deploy systems that store data in memory [ORS+11; ZCD+12; LGZ+14]. Ignoring data locality in such systems leads to significant network traffic, which in turn creates network congestion that affects application performance [GSG+15]. To sum up, in order to reduce network and disk interference, cluster schedulers must place tasks such that they read a high fraction of their input locally. Tasks placed by such schedulers are less affected by network congestion and have better and more predictable performance. 2.3.1.6 Placement constraints Data-centre clusters are built with increasingly heterogeneous hardware (§2.3.1.1) and run ever more diverse tasks (§2.1). Some tasks can only run on machines that have certain properties (e.g., service tasks that run web servers require machines with public IP addresses), or may benefit if they run on specialised hardware (e.g., machine learning tasks complete faster if they are placed on GPUs). Cluster schedulers represent such task requirements as placement con- straints that restrict the set of machines on which tasks can be placed. There are three types of placement constraints: • Hard constraints specify requirements that machines must satisfy. Tasks with hard con- straints must not be placed as long as there are no machines that satisfy the constraints. 50 2.3. CLUSTER SCHEDULING Hard constraints typically specify requirements on machine attributes such as kernel ver- sion and clock speed[SCH+11]. These constraints are common; over 50% of Google’s tasks have simple hard constraints [RTG+12]. Cluster schedulers use different techniques to satisfy hard constraints. Some cluster schedulers (e.g., Sparrow, Borg) sample machines until they find several that satisfy the constraints [OWZ+13; VPK+15]. Others (e.g., YARN), inform application controllers whether task hard constraints can be fulfilled or not [VMD+13]. • Soft constraints do not have to necessarily be satisfied. Tasks can execute, possibly with degraded performance, even when soft constraints are not satisfied. For example, Tensor- Flow [ABC+16] machine learning tasks, which may have soft constraints to execute on GPUs, can also run on CPUs. Quincy models soft constraints as task placement preferences [IPC+09]. It creates a flow network that contains a node for each cluster task and machine. Quincy directly connects task nodes to the nodes of the machines that satisfy task soft constraints, using preference arcs. 
Following, it runs a min-cost flow optimisation over the network that discovers task placements. By contrast, Quasar does not support soft constraints on hardware, but expects users to provide high-level application soft constraints (e.g., maximum latency, throughput, queries per second). Quasar tries to satisfy such constraints by automatically adjusting how many resources applications receive. • Complex constraints combine hard and soft constraints to model complex requirements that two or more tasks or machines have. Task affinity and task anti-affinity are two popu- lar types of complex constraints. Task affinity constraints model requirements for placing two or more dependent tasks on a resource (e.g., a web server task must be co-located with a database task it communicates with). By contrast, task anti-affinity constraints model requirements for placing tasks on distinct resources. For example, some jobs may require tasks to be placed on different machines or racks in order to decrease downtime likelihood in case of hardware failures. Few cluster schedulers support complex constraints. Borg only allows users to specify simple task anti-affinity constraints [VPK+15]. Nevertheless, 11% of tasks from a pub- licly available trace of a Google cluster have anti-affinity constraints [SCH+11; RTG+12]. Alsched [TCG+12] and TetriSched [TZP+16] are the only schedulers that fully support complex constraints. However, these schedulers trade off task placement latency and scalability for complex constraints support. To sum up, support for the different types of constraints is appealing because constraints guide schedulers to place tasks such that performance and fault tolerance are improved. However, task constraints can significantly increase task placement latency. For example, Google’s scheduler task placement latency increases by up to 6× when constraints are enabled [SCH+11]. An ideal CHAPTER 2. BACKGROUND 51 scheduler must support constraints without significantly increasing task placement latency and without scalability degradation. 2.3.1.7 Scalability and low scheduling latency High-quality task placements lead to higher machine utilisation [VPK+15], more predictable application performance [ZTH+13; DK14], and increased fault tolerance [SKA+13]. However, cluster schedulers can choose high-quality placements only if they take into account hardware heterogeneity (§2.3.1.1), task co-location interference (§2.3.1.2), achieve good bin-packing on different resources (§2.3.1.3), accurately estimate the amount of resources tasks require (§2.3.1.4), provide task data locality (§2.3.1.5), and support placement constraints (§2.3.1.6). In practice, schedulers run slow, algorithmically complex optimisations in multiple dimensions to support all these features. Placement latencies of tens of seconds or even minutes are com- mon, nevertheless, unacceptable for many workloads. For example, the availability of critical service tasks is reduced if failure recovery takes minutes because of scheduling [SKA+13]. Similarly, task response time of short-running interactive tasks can increase by up to an order of magnitude due to high placement latencies. Thus, it is critical to place tasks quickly to main- tain high cluster utilisation, keep task makespan low, and meet user expectations. Sophisticated schedulers choose high-quality placements, but the algorithms they use are computationally expensive and slow. 
For example, Quincy [IPC+09], which runs a minimum-cost flow optimisation over a flow network, is widely acknowledged to choose high-quality placements, but its placement latency increases to minutes at scale [GSG+16, §2.2; OWZ+13, §9; GZS+13, §8; GSW15, §5; BEL+14]. Quincy cannot choose any placements for incoming tasks while the optimisation runs. This leaves cluster resources idle and increases task placement latency even when resources are available. An ideal scheduler must provide low placement latency without sacrificing task placement quality.
2.3.2 Cluster scheduler architectures
All of today’s prevalent cluster scheduler architectures trade off between placement quality and placement latency. In Figure 2.14, I show a high-level view of these architectures, and in Table 2.2 I summarise state-of-the-art schedulers and emphasise which architectures they adopt.
Monolithic schedulers. Complex scheduling algorithms that choose high-quality placements often need complete, up-to-date information about cluster state (e.g., machine utilisation, running tasks, unscheduled tasks). Monolithic schedulers are a great fit for these complex algorithms because: (i) they manage and place tasks for entire clusters, (ii) use one scheduling logic to choose placements for all different types of tasks, and (iii) store information about the cluster state in a centralised component (see Figure 2.14a).
Scheduler | Workload | Architecture | Hardware heterogeneity | Co-location interference | Multi-resource fitting | Data locality | Placement constraints | Resource estimation | Fairness | Low latency at scale
LATE [ZKJ+08] | MapReduce | Centralised | ✓ | – | – | – | – | – | – | –
HFS [ZBS+10] | MapReduce | Centralised | – | – | – | ✓ | – | – | ✓ | –
H-DRF [BCF+13] | MapReduce | Centralised | – | – | ✓ | – | – | – | ✓ | –
Quincy [IPC+09] | Dryad tasks | Centralised | – | – | – | ✓ | ✓ | – | ✓ | –
Jockey [FBK+12] | SCOPE | Centralised | – | – | – | – | – | ✓ | – | –
Apollo [BEL+14] | SCOPE | Distributed | – | – | ✓ | ✓ | – | – | (✓) | ✓
alsched [TCG+12] | Simulation | Centralised | ✓ | – | ✓ | ✓ | ✓ | – | – | –
Sparrow [OWZ+13] | Spark tasks | Distributed | – | – | – | – | – | – | (✓) | ✓
KMN [VPA+14] | Spark tasks | Centralised | – | – | – | ✓ | – | – | – | –
Hawk [DDK+15] | Spark tasks | Hybrid | – | – | – | – | – | – | – | ✓
Eagle [DDD+16] | Spark tasks | Hybrid | – | – | – | – | – | – | – | ✓
Mercury [KRC+15] | Data-processing tasks | Hybrid | – | – | – | – | ✓ | – | (✓) | ✓
YAQ [RKK+16] | Data-processing tasks | Hybrid | – | – | – | ✓ | – | – | ✓ | ✓
Choosy [GZS+13] | Mixed | Centralised | – | – | (✓) | ✓ | ✓ | – | ✓ | –
Paragon [DK13] | Mixed | Centralised | ✓ | ✓ | ✓ | – | – | – | – | –
Whare-Map [MT13] | Mixed | Centralised | ✓ | ✓ | – | – | – | – | – | –
Quasar [DK14] | Mixed | Centralised | ✓ | ✓ | ✓ | (✓) | – | ✓ | – | –
Bistro [GSW15] | Mixed | Centralised | (✓) | – | ✓ | ✓ | ✓ | – | (✓) | (✓)
TetriSched [TZP+16] | Mixed | Centralised | ✓ | – | ✓ | ✓ | ✓ | – | (✓) | –
Mesos [HKZ+11] | Mixed | Two-level | – | – | ✓ | ✓ | – | – | ✓ | –
YARN [VMD+13] | Mixed | Two-level | – | – | ✓ | ✓ | – | – | ✓ | –
Omega [SKA+13] | Mixed | Shared-state | ✓ | – | ✓ | ✓ | (✓) | – | (✓) | ✓
Tarcil [DSK15] | Mixed | Shared-state | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | ✓
Table 2.2: Existing cluster schedulers and their properties. Ticks in parentheses indicate that the scheduler could attain this property via extensions.
Figure 2.14: Comparison of different cluster scheduler architectures: (a) monolithic scheduler; (b) two-level scheduler; (c) shared-state scheduler; (d) distributed scheduler; (e) hybrid scheduler. Red circles represent long-running tasks, yellow circles correspond to short-running tasks, grey boxes are machines, and schedulers that store private cluster state are shown as rectangular boxes.
However, monolithic, centralised cluster schedulers have high task placement latency [IPC+09; OWZ+13]. In practice, monolithic schedulers often stop their complex scheduling algorithms early or use simple heuristics to place tasks in a timely manner. For example, Google’s Borg centralised scheduler uses a multi-dimensional resource model to determine feasible machines, and a greedy cost-based scoring algorithm to choose among the feasible machines. Borg does not assess feasibility for each machine, but scores randomly sampled machines until a termination condition is met [VPK+15, §3.4]. Thus, Borg sacrifices placement quality for placement latencies of seconds. Likewise, Facebook’s Bistro [GSW15] runs expensive machine scoring computations (collaborative filtering and path selection), but uses simple greedy algorithms for task placement to reduce placement latency.
Alsched [TCG+12] and TetriSched [TZP+16] are centralised, monolithic schedulers that model task placement as a Mixed Integer Linear Programming (MILP) problem. Both schedulers support complex placement constraints, facilitate planning ahead in time of task resource utilisation, and choose high-quality placements. However, solving MILP problems for large clusters can take tens of seconds. Thus, both schedulers stop early, and thereby trade placement quality for latency.
Quincy [IPC+09] is a centralised, monolithic scheduler that reconsiders the entire workload each time it places tasks. Quincy achieves optimal task placement for its scheduling policy. However, Quincy is widely considered unscalable: “Quincy [...] takes over a second [...], making it too slow” [OWZ+13, §9]; “Quincy sacrifices scheduling delays for the optimal schedule” [GSW15, §5]; “Quincy suffers from scalability challenges when serving large-scale clusters” [BEL+14, §6]; “[Quincy’s] decision overhead can be prohibitively high for large clusters” [DSK15, §1].
Two-level schedulers. Data-centre clusters run an increasing suite of data processing systems that execute batch, stream, graph and iterative workflows (§2.1). Hindman et al. [HKZ+11] observe that the programming models and communication patterns these different types of workflows create give rise to diverse scheduling needs. They argue that a monolithic, centralised scheduler is unlikely to provide an API that is sufficiently flexible to express the different scheduling policies required by all these systems, and that even if such a scheduler could be built, its complexity would negatively affect scalability. Instead, Hindman et al. introduce the two-level architecture [HKZ+11] (see Figure 2.14b). Cluster managers that implement the two-level architecture still manage resources, but delegate scheduling to other systems. Mesos is the first two-level cluster scheduler [HKZ+11]. Applications that run in Mesos-managed clusters receive resource reservation offers from Mesos, and they must decide which resource reservation offers to accept and which tasks to allocate to these reservations. Similarly, the YARN [VMD+13] resource manager has a two-level architecture, but in contrast to Mesos, applications make resource reservation requests to the resource manager. Both Mesos and YARN have three key drawbacks.
First, they hide cluster state information because system schedulers only have information about the resources they reserve or they are offered. The schedulers do not have access to other relevant information such as: what other tasks run on the machines on which schedulers are offered resources, which other machines have available and possibly better suited resources. Second, tasks can experience priority in- version: high-priority tasks do not preempt low-priority tasks that are placed by another sched- uler because the scheduler that places high-priority tasks is not aware of the existence of these low-priority tasks. Finally, systems that must start multiple tasks simultaneously can hoard re- sources. For example, MPI jobs and stateful graph processing systems benefit if all job tasks start simultaneously. These systems are incentivised to accumulate resource reservations until they can execute all tasks, but these resources are wasted because no tasks execute until the schedulers accumulate sufficient resources. Shared-state schedulers. Like the two-level architecture, the shared-state scheduler archi- tecture was developed to effectively schedule different types of workflows in shared clusters, and to enable system developers to easily build custom schedulers. CHAPTER 2. BACKGROUND 55 Google’s Omega [SKA+13] was the first cluster manager to use a shared-state scheduler ar- chitecture. In Omega multiple schedulers run in parallel, and each schedules a subset of the workload (see Figure 2.14c). Schedulers implement different policies that use weakly consis- tent replicas of the entire cluster state to choose placements. In contrast to two-level schedulers, Omega schedulers have access to the entire cluster state and use optimistically-concurrent trans- actions to modify it. Two or more schedulers can directly or indirectly choose conflicting placements in Omega. Schedulers choose directly conflicting placements if they simultaneously decide to place one or more tasks on the same resource. In such cases, only one scheduler is allowed to place a task, while others have to retry. Shared-state schedulers can choose indirectly conflicting placements if they have incompatible features. For example, a co-location interference aware scheduler could place a web service task on a machine on which the task does not interfere with other running tasks. However, a low-latency co-location interference unaware scheduler could later place interfering tasks on the same machine, which would increase serving latency and decrease web serving rate. Apollo is a shared-state cluster scheduler that assumes conflicts are not always harmful [BEL+14]. In contrast to Omega, Apollo does not eagerly resolve conflicts, but it first dispatches tasks to machines, and then does conflict resolution. Apollo’s schedulers can simultaneously place tasks on a machine, and executes tasks if sufficient resources are available. However, if resources are insufficient, Apollo uses correction mechanisms to continuously re-evaluate placements using up-to-date resource utilisation statistics. When it changes task placements, the Apollo scheduler starts new task instances, which may cause inefficient resource utilisation because multiple tasks copies may run simultaneously. Nonetheless, Apollo reduces task placement latency for non- interfering directly conflicting placements, but it affects performance and increases makespan of tasks that are affected by indirectly conflicting placements. Distributed schedulers. 
Today’s challenging workloads include short-running interactive tasks that complete within seconds (§2.1). Future workloads are likely to comprise of even more short-running tasks because increasingly more applications expect fast responses from infras- tructure systems. Such workloads require cluster schedulers to have: (i) low placement latency, and (ii) high placement throughput. The distributed scheduler architecture is designed to meet these requirements. In this archi- tecture, schedulers are fully distributed: they do not share state and do not require coordina- tion [OWZ+13; RKK+16] (see Figure 2.14d). Moreover, distributed schedulers deliberately use simple algorithms that choose placements at scale using randomly sampled and gossiped information [OWZ+13; DDD+16]. However, distributed schedulers achieve poorer placement quality than other types of schedulers (e.g., monolithic, centralised schedulers) because each scheduler instance only has partial and often stale cluster state information. Moreover, dis- tributed schedulers sacrifice scheduling quality because they do not support essential scheduling 56 2.3. CLUSTER SCHEDULING features such as co-location interference awareness (§2.3.1.2 and multi-dimensional resource fitting (§2.3.1.3). Hybrid schedulers. The hybrid architecture splits workloads across a centralised scheduler and one or more distributed schedulers (see Figure 2.14e). Hawk [DDK+15] is a hybrid sched- uler that uses statistics from previous task runs to assign tasks to either a centralised scheduler that places long-running tasks or to one of several schedulers that place short-running tasks. All Hawk’s schedulers place tasks to machines that are statically partitioned into “slots”. How- ever, as I noted in Section 2.3.1.3, slot-based schedulers do not choose high-quality placements because they assume that all tasks utilise the same amount of resources and interfere equally. In contrast to shared-state schedulers, the Hawk hybrid scheduler queues tasks in machine-side queues when tasks do not fit on machines. However, tasks may queue after or wait for long- running tasks to complete, which may increase the makespan of short-running tasks by several orders of magnitude. In order to address this limitation, Hawk splits the cluster into a general pool of machines and a small dedicated pool of machines reserved for short-running tasks. Mercury [KRC+15] is a hybrid cluster manager that offers two quality of service options: guar- anteed and queable tasks. Mercury places guaranteed tasks with a centralised scheduler that has priority in case of placement conflict, and queable tasks with one of many low-priority dis- tributed schedulers it runs. Finally, Eagle [DDD+16] is a cluster manager that extends Hawk with state gossiping techniques, and Yaq-d [RKK+16] is a scheduler that reorders tasks in machine-side queues to reduce latency for short-running tasks. To sum up, there are many cluster scheduler architectures, but they all either trade between placement latency and placement quality. Monolithic, centralised schedulers choose high- quality placements, but at the cost of high placement latency. Distributed schedulers choose placements with low latency at scale, but do not choose high-quality placements because they do not avoid task co-location interference, do not take into account hardware heterogeneity, and do not achieve high data locality. 
Other scheduler architectures choose quality placements only for parts of the workload (e.g., long-running tasks placed by hybrid schedulers), or do not efficiently utilise resources (e.g., two-level schedulers hoard resources). In this dissertation, I show that there is no need to trade off placement quality for placement la- tency. I describe several algorithms and techniques I developed to make Firmament, a min-cost flow centralised scheduler that chooses high-quality placements, place tasks with low placement latency at scale. 2.3.3 Introduction to min-cost flow scheduling In this subsection, I give a high-level description of how min-cost flow-based cluster schedulers work and discuss what distinguishes them most from other types of cluster schedulers. I also CHAPTER 2. BACKGROUND 57 time Task submitted Task enqueued Task dequeued Task placed Task starts enqueueing waiting in scheduler queue scheduler running waiting in worker-side queue algorithm runtime task placement latency Figure 2.15: Stages tasks proceed through in task-by-task queue-based schedulers. Tasks can spend significant time waiting in the scheduler queue and worker-side queues. introduce the Quincy scheduling policy that I use in the following chapters to evaluate the techniques I developed to reduce Firmament’s placement latency. 2.3.3.1 Task-by-task schedulers Cluster schedulers can be categorised by how they process submitted tasks. Most cluster sched- ulers, whether centralised, shared-state, hybrid or distributed, are queue-based and place tasks one by one. In Figure 2.15, I show the stages through which a submitted task transitions in a task-by-task queue-based scheduler. Task-by-task schedulers first add submitted tasks to a queue of unscheduled tasks. Following, they dequeue tasks one by one. For each task they perform a feasibility check to identify suit- able machines, then score machines according to their suitability, and finally place the task on the best-scoring machine. Rating the different placement choices (i.e., scoring) and choosing the best-scoring machine, can be expensive, and typically dominates scheduler algorithm run- time on large clusters. To keep these tractable, cluster schedulers use a variety of techniques. For example, Google’s Borg scheduler relies on several caching and approximation optimisa- tions [VPK+15, §3.4]. The Sparrow distributed scheduler takes a more radical approach and does not score machines. Instead, it uses batch sampling to randomly select machines to dis- patch reservations for a job’s tasks. Task reservations are stored in “worker-side” queues, and tasks are only placed on a machine when one of their reservations is at the front of a queue. Other schedulers also use “worker-side” queues [OWZ+13; BEL+14; RKK+16] to which one or more schedulers add tasks to [OWZ+13; DDK+15; DSK15; DDD+16]. Task-by-task queue-based schedulers have a fundamental limitation: they cannot consider how a task placement affects the placement options available to other queued tasks. Consider, for example, a scenario in which a cluster has only one machine with a GPU available. In the scheduler queue there are two machine learning tasks among many other types of tasks. One is at the front of the queue and has a preference for running on the machine with the GPU. The other one is towards the end of the queue and has a stronger preference for running on the same machine than the first task. 
A task-by-task queue-based scheduler places the first task on the machine with the GPU, but is faced with two sub-optimal choices when placing the second machine learning task: (i) assign the second task to a sub-optimal machine, or (ii) migrate the first task to another machine – potentially losing all the work the task has done – and place the second task on the desired machine. This is a fundamental limitation, which causes task-by-task schedulers to increase task makespan and to waste work. By contrast, schedulers that simultaneously consider all tasks can directly place the second machine learning task on the machine with a GPU, and the first machine learning task on another machine.

2.3.3.2 Min-cost flow-based schedulers

An alternative approach, only suitable for monolithic, centralised cluster schedulers, is min-cost flow-based scheduling, introduced by Quincy [IPC+09]. This approach uses a placement mechanism – min-cost flow optimisation – with an attractive property: it guarantees overall optimal task placements for a given scheduling policy. In contrast to task-by-task schedulers, a flow-based scheduler not only schedules new tasks, but also reconsiders the entire existing workload ("rescheduling"), and preempts and migrates tasks if prudent.

Min-cost flow-based schedulers model the cluster state and the scheduling problem as flow networks. These networks are directed graphs whose arcs carry flow from source nodes (i.e., nodes that supply flow) to sink nodes (i.e., nodes that demand flow). A cost and capacity associated with each arc constrain the flow, and specify preferential routes for it.

In Figure 2.16, I show the stages a min-cost flow-based scheduler transitions through when scheduling workloads. When a change to the cluster state occurs (e.g., task submission, task failure), the scheduler updates its internal graph-based representation of the cluster and the scheduling problem. Next, the scheduler runs a min-cost flow optimisation over its internal graph-based representation. The min-cost flow optimisation yields an optimal minimum-cost flow from which the scheduler extracts task placements. If changes occur to the cluster state while the scheduler executes the optimisation, the scheduler updates its internal graph, but only runs a new optimisation once the prior optimisation completes.

[Figure 2.16: timeline with the events "Change detected", "Graph updated", "Solver started", "Solver finished" and "Workload rescheduled", spanning the phases "updating", "waiting", "solver running" and "extracting placements"; the algorithm runtime and the task placement latency are marked.]

Figure 2.16: Stages min-cost flow-based schedulers proceed through. Tasks can spend significant time waiting for the solver to execute expensive min-cost flow algorithms. However, min-cost flow-based schedulers amortise scheduling costs over multiple tasks because they consider entire workloads.

In Figure 2.17, I show an example of a flow network that expresses a simple cluster scheduling problem. Each task node Tj,i on the left hand side represents the ith task of job j. Each task node is a source of one unit of flow. All flow must be drained into the sink node (S) for a feasible solution to the optimisation problem.

[Figure 2.17: flow network with task nodes T0,0–T1,1, machine nodes M0–M3, unscheduled aggregator nodes U0 and U1, and sink node S; arc labels show costs.]

Figure 2.17: Flow network for a four-machine cluster with two jobs of three and two tasks. All tasks except T0,1 are scheduled on machines. Arc labels show non-zero cost, and those with flow are in red. All arcs have unit capacity, and the red arcs form the min-cost solution.
To reach S, flow from Tj,i can proceed through a machine node (Mm), which schedules the task on machine m (e.g., T0,2 on M1). Alternatively, the flow may proceed through a special "unscheduled aggregator" node (Uj for job j) to the sink, and consequently leave the task unscheduled (e.g., T0,1). In Figure 2.17, task placement preferences are expressed as costs on direct arcs that connect task nodes with machine nodes (e.g., 2 for placing T0,2 on M3). The cost to leave a task unscheduled – or to preempt it when running – is the cost on the arc that connects it to the unscheduled aggregator (e.g., 7 for T1,1). Given this flow network, a min-cost flow solver finds a globally optimal (i.e., minimum-cost) flow (highlighted in red in Figure 2.17). Task placements are extracted from this flow by tracing flow paths from the machine nodes back to the task nodes. In the example, the optimal flow expresses the best trade-off between tasks' unscheduled wait time and their placement preferences. The optimisation places tasks with strong preferences (i.e., low arc cost to machine nodes), and leaves unscheduled those tasks that can wait until resources are available (i.e., tasks with low arc cost to unscheduled aggregator nodes).

A min-cost flow-based scheduler is guaranteed to find the overall optimal placement with regard to the arcs' costs if each task node is connected to each machine node. But this requires billions of arcs in flow networks modelling large clusters. The scheduler would take tens of minutes to run the optimisation on such a large network. However, optimal placements can also be found without connecting each task node directly to each machine node: arcs can connect tasks to aggregator nodes, similar to the unscheduled aggregators. Such aggregators may, for example, group machines in a rack, or group similar tasks (e.g., tasks in a particular job, tasks with the same resource requirements). The aggregator nodes allow different scheduling policies to be expressed without requiring an inordinate number of arcs. When using aggregators, the cost of a task placement option is the sum of all arc costs on the path to the sink. In Figure 2.18, I illustrate this idea with the original Quincy scheduling policy [IPC+09, §4.2].

[Figure 2.18: the flow network of Figure 2.17 extended with a cluster aggregator node (X) and rack aggregator nodes (R0, R1); arc labels mark "running" and "locality preference" arcs.]

Figure 2.18: Adding aggregators to the flow network in Figure 2.17 enables different scheduling policies. The example shows the Quincy [IPC+09] policy with cluster (X) and rack (R) aggregators.

[Figure 2.19: box plot of scheduling latency (seconds) against cluster size (machines), for clusters of 50 to 12,500 machines.]

Figure 2.19: Quincy [IPC+09] scales poorly as cluster size grows. Simulation on subsets of the Google trace; boxes are 25th, 50th, and 75th percentile delays, whiskers 1st and 99th, and a star indicates the maximum value.

This policy is designed for batch jobs, and optimises for a trade-off between data locality, task unscheduled wait time, and task preemption cost. It uses rack aggregators (Rr) to group machines that share racks, and a cluster-level aggregator (X) to group racks. Tasks have low-cost preference arcs to machines and racks on which they achieve high data locality, but also higher-cost fall-back arcs connected to the cluster aggregator (e.g., T0,2). These are used if none of the preferred machines are available.
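To make the mechanics concrete, below is a minimal, self-contained sketch of min-cost flow scheduling over a toy network with two tasks, two machines and a single unscheduled aggregator. It is not Quincy's or Firmament's implementation: the node layout and arc costs are invented for illustration, and the solver is a simple successive-shortest-paths routine rather than the cost-scaling algorithms used in practice.

#include <climits>
#include <cstdio>
#include <vector>

// Minimal min-cost flow over a toy scheduling network (illustrative only).
struct Arc { int to, rev, cap, cost; };
std::vector<std::vector<Arc>> graph;

void AddArc(int from, int to, int cap, int cost) {
  graph[from].push_back({to, (int)graph[to].size(), cap, cost});
  graph[to].push_back({from, (int)graph[from].size() - 1, 0, -cost});  // residual arc
}

// Successive shortest paths with Bellman-Ford; fine for tiny, unit-capacity networks.
int MinCostFlow(int src, int sink, int supply) {
  int total_cost = 0;
  while (supply > 0) {
    std::vector<int> dist(graph.size(), INT_MAX), prev_node(graph.size(), -1),
        prev_arc(graph.size(), -1);
    dist[src] = 0;
    for (bool updated = true; updated; ) {
      updated = false;
      for (int u = 0; u < (int)graph.size(); ++u) {
        if (dist[u] == INT_MAX) continue;
        for (int i = 0; i < (int)graph[u].size(); ++i) {
          const Arc& a = graph[u][i];
          if (a.cap > 0 && dist[u] + a.cost < dist[a.to]) {
            dist[a.to] = dist[u] + a.cost;
            prev_node[a.to] = u;
            prev_arc[a.to] = i;
            updated = true;
          }
        }
      }
    }
    if (dist[sink] == INT_MAX) break;                  // no augmenting path left
    for (int v = sink; v != src; v = prev_node[v]) {   // push one unit of flow
      Arc& a = graph[prev_node[v]][prev_arc[v]];
      a.cap -= 1;
      graph[v][a.rev].cap += 1;
    }
    total_cost += dist[sink];
    --supply;
  }
  return total_cost;
}

int main() {
  // Nodes: 0 = source, 1-2 = tasks T0 and T1, 3-4 = machines M0 and M1,
  //        5 = unscheduled aggregator U, 6 = sink S.
  graph.assign(7, {});
  AddArc(0, 1, 1, 0); AddArc(0, 2, 1, 0);   // each task supplies one unit of flow
  AddArc(1, 3, 1, 2); AddArc(1, 4, 1, 6);   // T0's placement preferences (arc costs)
  AddArc(2, 3, 1, 1); AddArc(2, 4, 1, 8);   // T1's placement preferences
  AddArc(1, 5, 1, 5); AddArc(2, 5, 1, 5);   // cost of leaving a task unscheduled
  AddArc(3, 6, 1, 0); AddArc(4, 6, 1, 0);   // each machine can run one task
  AddArc(5, 6, 2, 0);                       // unscheduled tasks also drain to the sink
  printf("total placement cost: %d\n", MinCostFlow(0, 6, 2));
  // Placements are extracted from task->machine arcs whose residual capacity is
  // now zero, i.e. arcs that carry flow in the optimal solution.
  return 0;
}

In this toy instance the cheapest solution places T1 on M0 and leaves T0 unscheduled (total cost 6) rather than forcing T0 onto the expensive M1 – exactly the trade-off between placement preferences and unscheduled wait cost that the arc costs encode.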
However, min-cost flow-based schedulers are too slow to be suitable for large clusters even if they use policies that take advantage of aggregator nodes. In Figure 2.19, I show that schedul- ing latency is too high to place interactive tasks at scale when using the Quincy policy and a CHAPTER 2. BACKGROUND 61 state-of-the-art min-cost flow algorithm (i.e., cost scaling [Gol97]). In the experiment, I replay subsets of a public trace of one of Google’s clusters [RTG+12], which I augmented with local- ity preferences for batch processing jobs7 against my implementation of the Quincy scheduling policy. I measure the scheduling latency for clusters of increasing size. The latency increases with scale, up to a median of 64s and a 99th percentile of 83s for the full Google cluster (12,500 machines). During this time, the scheduler must wait for the optimisation to finish, and cannot place any newly submitted tasks. Moreover, tasks may finish and free resources, but the sched- uler cannot replace the tasks with new ones, and thus leaves resources idle even though there is work to do. 2.3.4 Cluster scheduling summary Data-centre clusters comprise of increasingly more machines and run more diverse types of tasks (§2.1). State-of-the-art cluster schedulers struggle to handle these workloads at scale. They either choose high-quality placements or offer low placement latency. As a result, resource utilisation in data centres is consistently below 20% [McK08; Liu12; DSK15]. In order to quickly place the workload and obtain high resource utilisation, a cluster scheduler must: 1. consider hardware heterogeneity when scheduling tasks (§2.3.1.1); 2. avoid co-locating tasks that interfere on the same resource (§2.3.1.2); 3. consider task requirements of different resources and conduct multi-dimensional resource fitting (§2.3.1.3); 4. estimate tasks’ resource requirements and automatically adjust resource reservations when tasks do not fully utilise resource reservations (§2.3.1.4); 5. obtain high data locality (§2.3.1.5); 6. support placement constraints for tasks that have hardware preferences (§2.3.1.6); 7. scale to tens of thousands of machines at low placement latency (§2.3.1.7). In this dissertation, I show that with my extensions, the Firmament centralised cluster scheduler does not suffer from the limitations other monolithic, centralised schedulers do. In Chapter 5, I describe several techniques and algorithms that I developed to improve Firmament to choose high-quality placements with low task placement latency. Finally, in Chapter 6, I evaluate how Firmament compares to other schedulers. I focus on both placement latency and quality. 7Details of my simulation are in §6.1; in the steady-state, the 12,500-machine cluster runs about 150,000 tasks comprising about 1,800 jobs. 62 2.3. CLUSTER SCHEDULING Chapter 3 Musketeer: flexible data processing Many data processing execution engines are designed to work well in their target domain (e.g., large-scale string processing, iterative machine learning training, social network graph analyt- ics), and most perform considerably less well when operating outside of this “comfort zone”. 
In Section 2.2, I evaluated a range of contemporary data processing execution engines – Hadoop, Spark, Naiad, PowerGraph, Metis and GraphChi – and I found that (i) their performance varies widely depending on the workflow type, cluster state, input data, and engineering choices; (ii) no single system always outperforms all others; and (iii) almost every system performs best under some circumstances. Choosing the best data processing execution engine requires significant expert knowledge about the programming paradigm, design goals and implementation of the many available engines. In practice, however, developers are forced to choose which front-end framework and back-end execution engine to use at implementation-time, without having sufficient information. These decisions cannot be easily changed later because each front-end is tightly coupled to a single back-end execution engine (see Figure 3.1). In this chapter, I argue that users should, in prin- ciple, be able to execute their high-level workflow on any data processing system. If front-end frameworks are decoupled from back-end execution engines (see Figure 3.2) then developers would see four main benefits: 1. Developers would write workflows only once using a front-end framework they choose, but could easily execute them on many alternative back-ends. 2. Workflows’ independent partitions (i.e., jobs) could be executed on different back-end execution engines. 3. Developers would not have to decide ahead of runtime which back-end execution engine is best; this decision could automatically be made using information available at workflow runtime. 4. Existing workflows could easily be ported to new and more performant execution engines. 63 64 Coupled Hive Giraph SparkSQL GraphX Lindi GraphLINQ Hadoop Spark Naiad Figure 3.1: State-of-the-art: front-end frameworks are coupled to a single back-end execu- tion engine. Decoupled Hive Giraph SparkSQL GraphX Lindi GraphLINQ Hadoop Spark Naiad Figure 3.2: Key insight: decoupling front-end frameworks and back-end execution engines increases flexibility. Front-end frameworks can be decoupled from back-end execution engines in one of two ways: (i) by implementing workflow translation for each front-end and back-end pair or, (ii) by trans- lating workflows implemented using front-end frameworks into a shared intermediate represen- tation, and generating back-end execution code from this representation. The first approach requires many translation shim layers (one for each front-end and back-end pair), growing quadratically in the number of systems supported. In addition, high-level front-end concepts (e.g., vertex-centric code) would have to directly be translated to low-level back-end concepts (e.g., map and reduce functions), which might be difficult to reason about. By contrast, the second approach only requires each system to be translated to or from the inter- mediate representation. For this reason I choose this approach for my data processing decoupled architecture (Figure 3.3). I break the execution of a data processing workflow into three high- level steps. First, a developer expresses her workflow using a front-end framework. Next, the workflow is translated into an intermediate representation. Third, job code is generated from this representation and jobs are executed on one or more back-end execution engines. However, I have to address five challenges for developers to see the benefits of decoupling front-end frameworks from back-end execution engines: 1. 
Front-end workflows must run without any changes. 2. The intermediate representation must be sufficiently expressive to support a broad range of workflows. 3. It must be easy to integrate new front-end frameworks and back-end execution engines. CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 65 Front-ends Hive Pig DryadLINQ GAS DSL GreenMarl SQL DSL SparkSQL GraphX Lindi GraphLINQ Intermediate representation Dataflow DAG Back-ends Hadoop Metis Dryad Spark Naiad CIEL Giraph PowerGraph GraphChi X-Stream Figure 3.3: The decoupled data processing architecture translates front-end workflow de- scriptions to a common intermediate representation from which it generates jobs for back- end execution engines. My Musketeer prototype supports the systems in bold. 4. Generated job code must be competitive with hand-written optimised workflow imple- mentations for each back-end. 5. The most appropriate back-ends must be chosen automatically using information avail- able at runtime. In this chapter, I explain how I overcome these challenges in Musketeer, my proof-of-concept implementation of the decoupled data processing architecture I advocate. In Section 3.1, I give an overview of Musketeer. Subsequently, in Section 3.2, I describe how workflows can be expressed using Musketeer-supported front-end frameworks and how legacy workflows can execute. In Section 3.3, I introduce the general intermediate representation Musketeer uses to decouple front-ends from back-ends. Following, in Section 3.4, I discuss the techniques Musketeer uses to generate efficient code to run workflows in back-end execution engines. Next, in Section 3.5, I describe how Musketeer automatically partitions and maps workflows to the best combination of back-ends. Finally, in Section 3.6, I discuss Musketeer’s limitations. 3.1 Musketeer overview In Musketeer workflows proceed from specification to execution through seven modular phases (see Figure 3.4). I now give an overview of how these phases fit in the three high-level steps of my decoupled data processing architecture. Workflow expression. User-facing high-level abstractions for workflow expression (“frame- works”) act as front-ends to Musketeer. Many such frameworks exist: SQL-like querying lan- guages and vertex-centric graph abstractions are especially popular (see §2.2.2). While design- ing Musketeer, I assumed that users write their workflows for such front-end frameworks. 66 3.1. MUSKETEER OVERVIEW SQL query GAS kernel §3.2 translate to IR §3.3 optimize IR recog- nize idioms merge operators map to systems §3.4 expand templates dispatch jobs DAG partitioning algorithm, §3.5 Figure 3.4: Phases of a Musketeer workflow execution. Dotted, grey operators show pre- vious state; changes are in red; encircled operators represent jobs. Musketeer currently supports four front-end frameworks: Hive, Lindi, a SQL-like DSL and a gather, apply and scatter DSL. Musketeer integrates front-ends by either parsing user inputs directly, or by offering an API-compatible shim for each front-end. For example, Musketeer di- rectly parses HiveQL queries and translates them into the common intermediate representation, whereas it supports Lindi workflows using an API-compatible Lindi reimplementation. In Section 3.2, I present in detail how users can express workflows, describe the front-end frameworks that Musketeer supports, and explain how Musketeer integrates them. Translation and optimisation. 
Ideally, in the future, all available front-end frameworks and back-end execution engines will agree on a single common intermediate representation (IR). In Section 3.3, I put forward an initial version of what I think this intermediate representation should look like. It is a dynamic directed acyclic graph (DAG) of dataflow operators, with edges corresponding to input-output dependencies between operators. The operators are loosely based on Codd’s relational algebra [Cod70]. This IR abstraction is general: it supports specific operator types based on relational algebra and general user-defined functions (UDFs). The IR supports iterative computations by dynamically expanding the DAG (as in CIEL [MSS+11] and Pydron [MAA+14]), and thus it does not have acyclic dataflow model’s main limitation: inability to execute iterative computations. The approach I chose is extensible: to add new front-end frameworks developers must only pro- vide translation logic from framework constructs to the intermediate representation. Similarly, back-end developers must only provide appropriate back-end job code templates and code gen- eration logic to extend Musketeer to support new back-end execution engines. This approach is similar to the one taken by the LLVM modular compiler framework [LA04], albeit in a dif- ferent domain; since Musketeer was published, similar architectures have also been adopted by Google Cloud Dataflow [ABC+15] and Weld [PTS+17]. Musketeer’s intermediate representation is also well-suited for performing query optimisations which have been extensively studied by the database community [KD98; AH00; BBD05]. CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 67 Equivalent techniques are already used in several front-ends (e.g., Spark SQL [AXL+15], Flume- Java [CRP+10]) or implemented in back-ends (e.g., Dryad’s workflows are externally optimised by Optimus [KIY13]). Musketeer applies similar query optimisations on its intermediate rep- resentation, e.g., to reduce intermediate data sizes where possible. In Musketeer, these optimi- sations only have to be implemented once on the IR, whereas prior solutions require them to be implemented in each front-end and back-end. Job generation and execution. Finally, Musketeer must generate code for distributed back- end execution engines. In Section 3.4, I describe in detail how Musketeer does this. A naïve approach would simply generate a job for each IR operator, but this approach suffers from high per-operator job startup costs, and fails to exploit opportunities for optimisation within exe- cution engines (e.g., sharing data scans among operators, in-memory storage of intermediate data). Workflows typically benefit if they execute in as few independent jobs as possible. In Section 3.4.1, I describe how Musketeer merges operators and generates code for them in a single job. However, even if Musketeer is able to merge operators, there are some execution engines that have limited expressivity (e.g., MapReduce cannot compute a three-way join on differ- ent columns in a single job). Therefore, the IR DAG may still have to be partitioned into multiple jobs. Many valid partitioning options exist, depending on the workflow and the execu- tion engines available. In Section 3.5, I show that exploring this space is an instance of k-way graph partitioning, an NP-hard problem, and introduce a heuristic to solve it efficiently for large DAGs. 
Given a suitable partitioning, Musketeer generates jobs for the chosen execution en- gines and dispatches them for execution. Finally, when jobs complete they write output in a distributed file system (e.g., HDFS) from which dependent jobs read it. 3.2 Expressing workflows Distributed execution engines simplify data processing on clusters of commodity machines by shielding developers from the intricacies of writing parallel, fault-tolerant code, and from the challenges of partitioning data and computation scheduling over thousands of machines. Never- theless, they require developers to express computations in terms of low-level primitives, such as map and reduce functions [DG08] or message-passing vertices [MMI+13]. As a result, higher-level front-end frameworks that expose more convenient abstractions are regularly used (see §2.2.2). This is common in industry: according to a recent study [CAK12], up to 80% of workflows running in production clusters are expressed using front-end frameworks such as Pig [ORS+08], Hive [TSJ+09], DryadLINQ [YIF+08] or Shark [XRZ+13]. These front-ends can be grouped into three types: (i) SQL-like query languages (Hive, SQL- like DSL), (ii) language integrated queries (LINQ), and (iii) vertex-centric interfaces (“Gather- 68 3.2. EXPRESSING WORKFLOWS Apply-Scatter” DSL). In the decoupled data processing architecture, there are three approaches to translate front-end workflows into the common intermediate representation: Query parsing: Several front-end frameworks (e.g., Hive, Pig) use the ANTLR parser gen- erator to transform each workflow query into an abstract syntax tree out of which they generate MapReduce jobs. One way to decouple these front-ends from back-ends is to translate the abstract syntax tree into the IR rather than generate code. An alternative is to implement a parser that directly parses workflow queries and translates them to the IR. Automatic translation of workflow code: Workflows implemented in imperative and func- tional languages could be automatically translated to the IR using symbolic execution. Prior systems exist that do this for single back-ends. For example, HadoopToSQL [IZ10] uses symbolic execution to derive preconditions and postconditions forMapReduce work- flows, and generates SQL queries out of these. Similarly, QBS [CSM13] automatically translates imperative code into SQL queries. QBS uses constraint-based synthesis to ex- tract queries from code invariants and postconditions expressed in relational algebra. API-compatible shim: Some interfaces for expressing workflows (e.g., vertex-centric, and gather, apply and scatter) require developers to implement several functions. These func- tions cannot directly be translated to the IR because they do not have one-to-one map- pings to the IR operators. For example, automatic translation solutions cannot translate the gather function from a GAS interface because they cannot infer from invariants and preconditions that both a JOIN and a GROUP BY operator are required to express the function [IZ10; CSM13]. These workflows also cannot be easily parsed because they are implemented in low-level languages such as C++. However, an API-compatible shim that offers a subset of the low-level language’s features, but is sufficiently rich to allow unmodified workflow code to be translated to the IR could be implemented. My Musketeer prototype supports four front-end frameworks (Hive, Lindi, a custom SQL-like DSL with support for iterations, and a graph-oriented “Gather-Apply-Scatter” DSL). 
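As a hedged illustration of the first approach (query parsing), the sketch below lowers a hypothetical, hand-rolled abstract syntax tree into simplified IR operator nodes; the types and the mapping are invented for exposition and do not reflect Musketeer's actual Hive parser or IR classes.

#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified AST node for a SQL-like query (not Musketeer's real classes).
struct AstNode {
  std::string kind;                                  // e.g. "TABLE", "SELECT", "JOIN"
  std::vector<std::string> args;                     // table names, columns, join keys
  std::vector<std::shared_ptr<AstNode>> children;
};

// Simplified IR operator: a DAG vertex with its input operators.
struct IrOperator {
  std::string op;                                    // e.g. "INPUT", "PROJECT", "JOIN"
  std::vector<std::string> params;
  std::vector<std::shared_ptr<IrOperator>> inputs;
};

// Recursively lower the AST into the IR. A front-end only needs this one
// translation step; code generation for back-ends happens from the IR.
std::shared_ptr<IrOperator> Lower(const std::shared_ptr<AstNode>& ast) {
  auto node = std::make_shared<IrOperator>();
  if (ast->kind == "TABLE")       node->op = "INPUT";
  else if (ast->kind == "SELECT") node->op = "PROJECT";   // column selection
  else if (ast->kind == "JOIN")   node->op = "JOIN";
  else                            node->op = "UDF";       // fall back to a UDF operator
  node->params = ast->args;
  for (const auto& child : ast->children) node->inputs.push_back(Lower(child));
  return node;
}

In the shim approach, by contrast, the framework's API calls can build such IR nodes directly, without a parsing step.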
I now turn to describing in detail these front-ends and how Musketeer translates them to the IR. 3.2.1 SQL-like data analytics queries Front-end frameworks that provide query languages based on SQL are widely used to express relational queries on data of tabular structure. For example, in 2009, Facebook was storing and processing 2 PB of data using Hive [TSJ+09; HIV16]. All data analysis computations at Microsoft are written in a SCOPE, a SQL-based front-end, and are executed on clusters of tens of thousands of commodity machines [CJL+08; BEL+14]. Similarly, many data analytics workflows are implemented in SQL-like front-ends such as Tenzing [CLL+11] at Google and Pig [ORS+08] at Yahoo!. CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 69 1 SELECT id, street, town FROM properties AS locs; 2 locs JOIN prices ON locs.id = prices.id AS id_price; 3 SELECT street, town, MAX(price) FROM id_price 4 GROUP BY street AND town AS street_price; Listing 3.1: Hive query for the max-property-price workflow. Properties’ locations are stored in the properties table and property prices are stored in the prices table. These SQL-like front-end frameworks are well-suited to express and run non-iterative batch workflows. However, many data analytics computations are iterative and only complete when a data-dependent convergence criterion is met. For example, the k-means algorithm iterates until no points change clusters between iterations. Similarly, PageRank iterates until the differences for all pages between previous page rank value and current rank is smaller than an error factor. Such iterative workflows cannot be expressed in these SQL-like front-ends because they model workflows as directed acyclic graphs of SQL-like operators. In practice, developers express workflows’ loop bodies using these front-ends and write driver programs to check convergence criteria and dispatch workflows (§2.2.1). I integrated Hive as one of Musketeer’s front-ends to show Musketeer’s suitability to front-ends that use relational high-level workflow descriptions, but I also developed a SQL-like domain- specific language (DSL) with iteration to show Musketeer is sufficiently expressive to execute iterative workflows. Hive. In the Hive front-end framework workflows are expressed using HiveQL, a SQL-like language that provides query operators such as SELECT, PROJECT, JOIN, AGGREGATE and UNION. HiveQL has a basic type system that supports tables with primitive types, and simple collections such as arrays and maps. In Listing 3.1, I show an example Hive analytics workflow that computes the most expensive property on each street for a real-estate data set. My Musketeer prototype directly parses most HiveQL operators and translates them to the in- termediate representation. However, Musketeer does not currently support tables with non- primitive types, nor is able to execute Hive user-defined functions (UDFs). These limitations are not fundamental; they could be addressed with extra engineering effort. BEER. I implemented my own SQL-based domain-specific language that can express itera- tive workflows. Like other SQL-like query front-ends, BEER operates on typed tables, which can be accessed by columns. BEER is based on relational algebra and offers operators such as SELECT, PROJECT, JOIN, UNION and DIFFERENCE. It also supports aggregate operators MAX, MIN, COUNT, and arithmetic operators MUL, DIV, SUB and SUM. 
In contrast to other SQL-like front-ends, BEER also provides a WHILE operator that can be used to implement iterative work- flows with data-dependent convergence criteria. In Listing 3.2, I show a BEER implementation of the PageRank iterative workflow that uses the WHILE operator. 70 3.2. EXPRESSING WORKFLOWS 1 CREATE RELATION edges WITH COLUMNS (INTEGER, INTEGER), 2 // Relation storing (page_id, rank) pairs. 3 CREATE RELATION page_rank WITH COLUMNS (INTEGER, DOUBLE), 4 // Relation storing current iteration number. 5 CREATE RELATION iter WITH COLUMNS (INTEGER), 6 // Computes number of outgoing edges for each graph node. 7 COUNT [edges_1] FROM (edges) GROUP BY [edges_0] AS node_cnt, 8 // Pre-computes 3-tuples of (src_node, dst_node, src_outgoing_degree). 9 (edges) JOIN (node_cnt) ON edges_0 AND node_cnt_0 AS edges_cnt, 10 WHILE [(iter_0 < 20)] ( 11 // Joins the pre-computed 3-tuples with the current page ranks. 12 (edges_cnt) JOIN (page_rank) ON edges_cnt_0 AND page_rank_0 AS edges_pr, 13 DIV [edges_pr_3, edges_pr_2] FROM (edges_pr) AS rank_cnt, 14 PROJECT [rank_cnt_1, rank_cnt_3] FROM (rank_cnt) AS links, 15 AGG [links_1, +] FROM (links) GROUP BY [links_0] AS page_rank1, 16 MUL [0.85, page_rank1_1] FROM (page_rank1) AS page_rank2, 17 // Updates each node’s page rank value. 18 SUM [0.15, page_rank2_1] FROM (page_rank2) AS page_rank, 19 // Increases iteration counter. 20 SUM [iter_0,1] FROM (iter) AS iter) Listing 3.2: BEER DSL code for PageRank. The workflow conducts 20 PageRank itera- tions in which it updates each vertex’s page rank value stored in page_rank table. In my BEER implementation, I use the ANTLR parser generator to create a parser out of the language grammar I defined. Following, I use this parser to transform workflow queries into abstract syntax trees. Finally, I map these trees of BEER relational algebra operators to Mus- keteer’s intermediate representation; most operators have directly corresponding Musketeer IR operators (§3.3). Workflows sometimes contain user-defined functions that are used to express specialised data processing tasks [ORS+08; KPX+11]. Thus, it is important to support UDFs in Musketeer. BEER has a UDF construct that can be used to specify code paths to UDFs. The code to which these paths point to is included in the back-end specific code Musketeer generates. This ap- proach works well for workflows that run on back-ends that execute or interface with code im- plemented in the users’ programming language of choice, but may restrict the back-end choice because some workflows can only be executed in less efficient back-ends that support the pro- gramming language in which the UDFs are implemented. 3.2.2 Language integrated queries Workflows are sometimes fully integrated into application logic. In such cases, it is appealing to use a language integrated query front-end such DryadLINQ [YIF+08] or Lindi [MMI+13]. These front-ends add native SQL-like querying expressions to common programming languages. CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 71 Lindi. The Lindi framework is a C# language extension that can be used to express workflows as sequential programs of LINQ operators [LND16]. These LINQ operators run user-provided code and conduct transformations of data sets. Like many other front-ends, Lindi can currently only run workflows on a single back-end, the Naiad execution engine. Internally, Lindi imple- ments a Naiad vertex for each LINQ operator it offers. 
Supporting the complete Lindi front-end in Musketeer is challenging because Lindi's operators execute user-defined functions. These functions could be applied by Musketeer using the same mechanism as in BEER, but this would restrict the back-ends on which Lindi workflows can run: the UDFs would have to execute in back-ends that support the programming languages in which the workflows are implemented. However, this limitation can be addressed either by (i) automatically translating UDFs to other languages using source-to-source compilers, or (ii) implementing a Lindi API-compatible shim that offers a subset of C# features, but is sufficiently rich to express the workflows. I chose the latter approach in Musketeer because state-of-the-art source-to-source compilers can only translate limited language subsets [Pla13]. I implemented a Lindi-like API shim in C++, which I also extended with an operator that can be used for iterative workflows. In Listing 3.3, I show the Lindi-like API supported by Musketeer.

3.2.3 Graph data analysis

Domain-specific front-end frameworks for expressing graph computations are popular: Pregel and Giraph run computations on MapReduce [MAB+10], GraphLINQ offers a graph-specific API for the Naiad execution engine [McS14], and the GreenMarl DSL emits code for a few multi-threaded and distributed runtimes [HCS+12]. Many of these front-ends provide a "vertex-centric" API in which users provide code that is concurrently executed for each vertex [MAB+10; KBG12]. This "vertex-centric" programming pattern is generalised by the Gather, Apply and Scatter (GAS) model in PowerGraph [GLG+12]. In this paradigm, data are first gathered from neighbouring nodes, then vertex state is updated and, finally, the new state is scattered to the neighbours.

In Musketeer, graph processing workflows can be expressed using the BEER DSL. However, BEER requires workflows to be modelled using relational algebra operators, which can often be non-intuitive for graph computations (see Listing 3.2). To address this, I developed a proof-of-concept domain-specific front-end framework that combines the GAS programming model with my BEER DSL. To express graph workflows in my graph "Gather-Apply-Scatter" DSL, users must define the three GAS steps. For each step, they use BEER relational operators or user-defined functions (UDFs). In Listing 3.4, I show the implementation of the PageRank workflow in my GAS front-end framework.

My GAS DSL does not have directly corresponding operators in Musketeer's IR. Thus, the DSL requires both syntactic translation and transformation from its gather, apply, and scatter paradigm to the IR.
OperatorNode Select(OperatorNode op_node, vector<Column> select_columns,
                    ConditionTree cond_tree, string rel_output_name);

OperatorNode Where(OperatorNode op_node, ConditionTree cond_tree, string rel_output_name);

OperatorNode SelectMany(OperatorNode op_node, ConditionTree cond_tree, string rel_output_name);

OperatorNode Concat(OperatorNode left_op_node, string rel_output_name, OperatorNode right_op_node);

OperatorNode GroupBy(OperatorNode op_node, vector<Column>& grouped_columns,
                     GroupByType group_reducer, Column group_by_column, string rel_output_name);

OperatorNode Join(OperatorNode left_op_node, string rel_output_name, OperatorNode right_other_op_node,
                  vector<Column> left_join_cols, vector<Column> right_join_cols);

OperatorNode Distinct(OperatorNode op_node, string rel_output_name);

OperatorNode Union(OperatorNode left_op_node, string rel_output_name, OperatorNode right_op_node);

OperatorNode Intersect(OperatorNode left_op_node, string rel_output_name, OperatorNode right_op_node);

OperatorNode Except(OperatorNode left_op_node, string rel_output_name, OperatorNode right_op_node);

OperatorNode Count(OperatorNode op_node, Column cnt_column, string rel_output_name);

OperatorNode Min(OperatorNode op_node, Column group_by_column,
                 vector<Column>& min_columns, string rel_output_name);

OperatorNode Max(OperatorNode op_node, Column group_by_column,
                 vector<Column> max_columns, string rel_output_name);

// Extension for iterative computations.
OperatorNode Iterate(OperatorNode op_node, ConditionTree cond_tree, string rel_output_name);

Listing 3.3: Musketeer's Lindi-like C++ interface.

1 GATHER = {
2   SUM (vertex_value)
3 }
4 APPLY = {
5   MUL [vertex_value, 0.85]
6   SUM [vertex_value, 0.15]
7 }
8 SCATTER = {
9   DIV [vertex_value, vertex_degree]
10 }
11 ITERATION_STOP = (iteration < 20)
12 ITERATION = {
13   SUM [iteration, 1]
14 }

Listing 3.4: Gather-Apply-Scatter DSL code for PageRank.

Musketeer achieves this by modelling message exchanges in each iteration (i.e., the scatter step) as a join between a relation storing vertices and a relation storing edges. Moreover, it models gather steps by appending a group-by clause to each operator used in these steps.
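To make this join-plus-group-by modelling concrete, the standalone sketch below executes one PageRank-style GAS round relationally: the scatter step is a join of the edge relation with per-vertex state, the gather step is a group-by on the destination vertex with a SUM aggregate, and the apply step rescales the result. It is an illustration of the mapping, not code generated by Musketeer.

#include <cstdio>
#include <unordered_map>
#include <vector>

struct Edge { int src, dst; };

// One PageRank-style iteration expressed relationally: the scatter step is a
// join of the edge relation with the vertex (rank, degree) relation, and the
// gather step is a group-by on the destination vertex with a SUM aggregate.
std::unordered_map<int, double> Iterate(
    const std::vector<Edge>& edges,
    const std::unordered_map<int, double>& rank,
    const std::unordered_map<int, int>& out_degree) {
  std::unordered_map<int, double> gathered;
  for (const Edge& e : edges) {                                    // JOIN edges with vertex state
    double contribution = rank.at(e.src) / out_degree.at(e.src);   // SCATTER
    gathered[e.dst] += contribution;                               // GROUP BY dst, SUM (GATHER)
  }
  for (auto& [vertex, sum] : gathered) sum = 0.15 + 0.85 * sum;    // APPLY
  return gathered;
}

int main() {
  std::vector<Edge> edges = {{1, 2}, {2, 1}, {2, 3}, {3, 1}};
  std::unordered_map<int, double> rank = {{1, 1.0}, {2, 1.0}, {3, 1.0}};
  std::unordered_map<int, int> out_degree = {{1, 1}, {2, 2}, {3, 1}};
  rank = Iterate(edges, rank, out_degree);
  for (const auto& [vertex, r] : rank) printf("vertex %d: rank %.3f\n", vertex, r);
  return 0;
}

Repeating rounds of this function until the ranks stop changing corresponds, in the IR, to wrapping the join and group-by in the WHILE operator introduced in the next section.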
This technique is similar to the one used by Pregelix to offer Pregel-like semantics on top of Hyracks, a general purpose shared-nothing dataflow engine [BBJ+14]. 3.3 Intermediate representation The common intermediate representation (IR) lays at the core of my decoupled data processing architecture. Workflows implemented in different front-ends are translated to the IR, and back- end execution engine code is generated from the IR. To successfully decouple front-ends from back-ends, the intermediate representation must have the following three properties: Expressivity: the IR must be sufficiently general to express all the different types of workflows supported by front-end frameworks (i.e., batch, iterative and graph processing). Efficiency: the IR must not lose information associated with workflows whose absence may affect performance. For example, for a graph processing workflow, it must be possible to infer from the intermediate representation that the workflow is a graph computation. Straightforward mappings to back-ends: the IR must not be too high-level (e.g., operators for common algorithms) to miss on optimisation opportunities, but it also must not be too low-level to have no direct mapping to operators supported by back-ends. Common intermediate representations are used in other problem domains as well. For example, the LLVM [LA04] compiler framework compiles a range of programming languages (e.g., C, C++, Haskell) into a light-weight and low-level IR. LLVM’s IR is typed and sufficiently ex- pressive to be used to efficiently conduct compiler optimisations, transformations and analysis, 74 3.3. INTERMEDIATE REPRESENTATION which must only be implemented once because the LLVM IR abstracts away the details of target machines. I considered a similar approach for the intermediate representation, but I concluded that while LLVM IR is efficient and sufficiently general to express the workflows implemented in front- ends, there are no straightforward ways to map workflows from it to back-end execution en- gines’ APIs. Even for a simple workflow that joins two data sets, the IR bytecode would have to be statically analysed to rediscover that the code conducts a join and that the workflow should be executed in a join-specialised execution engine. Fundamentally, the LLVM IR is a low-level abstraction optimised for generating binaries. By contrast, the common data processing IR must generate code for back-end execution engine APIs, which abstract away dataflow models (e.g., MapReduce, acyclic, dynamic). Higher-level intermediate representations based on dataflow models might be better suited. The MapReduce dataflow model is simple, but it is not sufficiently expressive to represent workflows that shuffle data more than once (i.e., workflows that contain two or more join operators). The acyclic dataflow model can express these workflows, but it cannot represent workflows that require data-dependent or unbounded iterations. By contrast, the dynamic dataflow model addresses this limitation. In this model, operators can spawn additional operators at runtime, and thus operators can decide at the end of each iteration if more operators must be spawned for an additional iteration. In my Musketeer prototype, I choose the IR to be an acyclic directed graph of dataflow opera- tors, but like in the dynamic dataflow model, Musketeer can dynamically extend the graph for iterative workflows 1. 
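As a rough sketch of such a dynamically extended operator DAG (the types and names here are invented; this is not Musketeer's IR implementation), a WHILE construct can be modelled as appending another copy of the loop body to the DAG for as long as a data-dependent condition, evaluated on the latest iteration's output, holds:

#include <functional>
#include <memory>
#include <string>
#include <vector>

// Invented, simplified IR node: an operator with its input dependencies.
struct IrNode {
  std::string op;                                   // e.g. "JOIN", "AGG", "WHILE"
  std::vector<std::shared_ptr<IrNode>> inputs;      // dataflow dependencies (DAG edges)
};

// Dynamic expansion for iteration: while the condition (a stand-in for
// evaluating WHILE's condition on the data produced so far) holds, append
// another copy of the loop body to the DAG.
std::shared_ptr<IrNode> ExpandWhile(
    std::shared_ptr<IrNode> loop_input,
    const std::function<std::shared_ptr<IrNode>(std::shared_ptr<IrNode>)>& make_body,
    const std::function<bool()>& condition_holds) {
  std::shared_ptr<IrNode> frontier = std::move(loop_input);
  while (condition_holds()) {
    frontier = make_body(frontier);   // one more unrolled iteration in the DAG
  }
  return frontier;
}

Each unrolled copy keeps the graph acyclic, which is why iterative workflows can be handled with the same machinery as batch ones.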
The initial set of operators I use is loosely based on Codd's relational algebra [Cod70] and covers the most common operations from industry workflows [CAK12]. Musketeer's relational algebra-based set of operators includes SELECT, PROJECT, UNION, INTERSECT, JOIN and DIFFERENCE, aggregators (AGG, GROUP BY), column-level algebraic operations (SUM, SUB, DIV, MUL), and extremes (MAX, MIN).

However, standard relational algebra is not sufficiently expressive to meet my requirements because it cannot represent queries that compute the transitive closure of a relation [Mur11]. Nonetheless, this limitation can be addressed by extending the set of operators with a fixed point operator that enables recursive computation [AU79]. I extend the IR with a WHILE operator that can be used to express dynamic, data-dependent iterations. The operator dynamically extends the IR DAG based on the output of the operators used in WHILE's condition. As shown by Murray [Mur11, §3.3.3], a DAG with this facility is sufficient to achieve Turing completeness and it can express all while-programs (though not necessarily efficiently). Finally, I also include a special UDF operator in Musketeer's IR to support user-defined functions.

Musketeer's IR set of operators is, in my experience, already sufficient to model many widely-used data processing paradigms and amenable to analysis and optimisation [KIY13; KAA+13]. For example, MapReduce workflows can be directly modelled with the MAP, GROUP BY and AGG operators. Similarly, complex graph workflows can be mapped to a specific JOIN, MAP, GROUP BY pattern, as shown by GraphX [GXD+14] and Pregelix [BBJ+14]. In Figure 3.5, I show how such a graph processing workflow (PageRank) is represented in Musketeer's IR.

1 Instead of an acyclic directed graph of dataflow operators, Musketeer could use the timely dataflow model to avoid dynamically unrolling loops.

[Figure 3.5: IR dataflow DAG over the edges and vertices inputs, comprising COUNT, two JOIN operators, a WHILE, and DIVIDE, PROJECT, AGGREGATE, MULTIPLY and SUM nodes.]

Figure 3.5: PageRank workflow represented in Musketeer's intermediate representation.

3.3.1 Optimising workflows

Many front-end frameworks optimise workflows before execution. Pig [ORS+08], Hive [TSJ+09], Shark [XRZ+13] and SparkSQL use rewriting rules to optimise relational queries. Similarly, FlumeJava [CRP+10], Optimus [KIY13] and RoPE [AKB+12a] apply optimisations to operator DAGs. Yet, each optimisation is implemented independently for each front-end framework.

One of the advantages of decoupling front-ends from back-ends is that optimisations can be implemented and applied on the intermediate representation. This is leveraged in the LLVM modular compiler framework, in which many optimisation passes are made over the IR to produce programs that run faster [LA04]. Musketeer can likewise provide benefits to all supported front-ends – and future ones – by applying optimisations on the intermediate representation.

My Musketeer prototype performs a small set of standard query rewriting optimisations on the IR. Most of these move selective operators (e.g., PROJECT, INTERSECT) closer to the start of the workflow and push generative operators to the end in order to reduce the amount of data processed by subsequent operators.

More advanced optimisations could also be applied on Musketeer's IR. These optimisations could, e.g., prune unused columns of intermediate operators whose output must be written to disk or kept in memory.
Similarly, the optimisations could move highly selective JOIN operators closer to the start of workflows, and push JOIN operators that generate large amounts of data closer to the end of workflows. In comparison to front-end frameworks or database systems, however, it is more difficult to predict the effect of query rewriting rules in Musketeer. Optimisation rules that are guaranteed to reduce makespan in other systems might not always have the same effect because Musketeer dynamically maps workflows to back-ends at runtime. In particular, optimisations may increase the minimum number of jobs required to run a workflow and thus potentially increase makespan. In Section 3.6, I discuss how Musketeer could address this limitation. 3.4 Code generation Efficient back-end code must be generated from the intermediate workflow representation for developers to benefit from the decoupled data processing architecture. This is a challenging undertaking because back-ends offer a diverse set of interfaces for expressing workflows. For example, both simple stateless map and reduce functions (for Hadoop MapReduce) and code for complex stateful vertices (for Naiad) must be generated. The code generation mechanism must meet the following three requirements to efficiently inte- grate back-end execution engines: 1. generate workflow code that is competitive with hand-written optimised implementations for each back-end; 2. generate code for a range of back-ends which provide diverse APIs and are implemented in programming languages with different syntax; 3. offer an intuitive and easy-to-use interface for integrating new back-end execution en- gines. In the following, I discuss three approaches to implement the code generation mechanism. One approach is to generate code for each IR operator in a low-level language (e.g., C++). Follow- ing, the code would either be directly executed in back-ends that support the chosen program- ming language, or the code would be plugged using foreign function interfaces (FFI) in jobs that execute on back-ends supporting other programming languages. With this approach, the intermediate representation would have to be translated only to a single language, but it could be challenging to use foreign function interfaces with workflows that use complex data types. Moreover, FFIs may be expensive to use because most back-ends create an object for each input CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 77 record. These objects would have to be transformed to FFI-compatible objects. Bacila demon- strated that foreign function interfaces can add up to 10% overhead to generated workflow code compared to baselines without FFIs [Bac15]. In an alternative approach, the code generation mechanism could use source-to-source com- pilers to translate the generated code to different programming languages. Source-to-source compilers would add a small per-operator overhead, but would not add overhead to each in- put record. However, source-to-source compilers can only translate limited language sub- sets [Pla13]. Finally, the code generation mechanism could use code templates for each IR operator and back-end pair. These templates would be instantiated and concatenated to produce executable workflow code. The templates only have to be implemented once and can be used many times to generate code in the supported programming languages and using the interfaces provided by back-ends. 
However, this approach has two disadvantages: (i) it requires templates to be implemented for each operator when a new back-end is integrated, and (ii) optimisations must be applied across template boundaries to generate efficient code.

I choose to follow the third approach in my Musketeer proof-of-concept, but I also combine it with several optimisations I developed to make the performance of Musketeer-generated code competitive with hand-written baselines. Musketeer merges operators and executes them as a single job when possible to reduce job start-up costs (§3.4.1), applies traditional database optimisations such as sharing data scans in the context of large data processing (§3.4.3), and uses compiler techniques such as idiom recognition (§3.4.2) and type inference (§3.4.4) to optimise across template boundaries. I explain these optimisations with respect to the max-property-price Hive workflow (see Listing 3.1) and the BEER PageRank workflow (see Listing 3.2). However, these optimisations generalise to other workflows, and could be applied to other workflow managers. Musketeer can currently generate efficient code for seven back-end execution systems (Hadoop MapReduce, Spark, Naiad, PowerGraph, GraphChi, Metis and serial C code) by using templates and the above-mentioned optimisations. For the time being, I assume that the user explicitly specifies which back-end execution engines to use; in Section 3.5, I describe how Musketeer can automatically map workflows to back-ends.

3.4.1 Merging operators

Generating code for and executing each operator in a separate job has prohibitive cost: regardless of back-end, each operator incurs a significant job start-up cost, as the job must load input data, build up internal data structures, and coordinate parallel workers. This cost is especially high, in relative terms, for general-purpose back-ends that store intermediate data in memory (e.g., Spark, Naiad): they can run most workflows in a single job, and thus do not usually incur per-operator data loading costs at all.

MapReduce job class          Operators                                    Shuffle key
map-only                     PROJECT, SELECT, SUM, SUB, DIV, MUL, UNION   –
no-op-map and reduce         INTERSECT, DIFFERENCE                        entire input row
aggregating-map and reduce   AGG, COUNT, MAX, MIN                         GROUP BY columns, if present
map and reduce               JOIN                                         values in columns joined on

Table 3.1: Musketeer's IR operators fall under different MapReduce job classes. Some (e.g., PROJECT) do not need the reduce phase, while others (e.g., INTERSECT) only need the reduce phase.

Therefore, operators should be merged to reduce the number of jobs executed and, implicitly, workflow makespan. Ideally, all operators within a workflow would be merged into a single job, but in practice this is not always possible. The types of operators that can be merged depend on the chosen back-end execution engine and the dataflow model it uses. For example, specialised graph processing back-ends based on the bulk synchronous parallel model (e.g., Pregel) or the asynchronous model (e.g., PowerGraph) only support a few operator idioms (see Section 3.4.2). Similarly, back-ends based on the restrictive MapReduce dataflow model must execute some workflows using multiple jobs. Hadoop MapReduce jobs comprise two explicit phases (map and reduce) and an implicit shuffle phase. The shuffle phase sorts the key-value pairs returned from the map phase, groups them by key, and sends them to the reduce phase.
As a result, workflows that contain several operators that each require a shuffle cannot be executed in a single Hadoop MapReduce job unless the shuffle phases can be composed (i.e., unless they use the same key to sort the key-value pairs). I divide the IR operators into four classes depending on the phases they use (see Table 3.1):

1. map-only operators (e.g., PROJECT, SELECT, DIV, MUL, SUB, SUM, UNION) need no shuffle and reduce phases. Two or more map-only operators can be merged by concatenating the operators' map phase code. Moreover, map-only operators can be merged with any other class of operators by concatenating their map phase code into either the map or the reduce phase of the other operators.

2. no-op-map and reduce operators (e.g., INTERSECT, DIFFERENCE) have a no-op map phase, but use the shuffle phase to group all identical input rows and to dispatch them to a reducer. Two or more map-only and no-op-map and reduce operators can be merged into a job. However, no-op-map and reduce operators cannot be merged with other classes of operators that require a shuffle phase.

3. aggregating-map and reduce operators (e.g., GROUP BY, COUNT, MAX, MIN) use the map phase to locally aggregate rows within an input data partition, the shuffle phase to group the aggregated results, and the reduce phase to aggregate the data across input partitions. These operators can be merged with map-only operators and with aggregating-map and reduce operators that are associative and group on the same column (e.g., MAX, MIN).

4. map and reduce operators (e.g., JOIN) use all three MapReduce phases. They cannot be merged with any other operator that has a shuffle phase, unless both operators use the same key in the shuffle phase.

Figure 3.6: Max-property-price workflow represented in Musketeer's IR. Musketeer generates code for two jobs when users desire to run the workflow on Hadoop MapReduce.

Musketeer uses these rules to merge operators when possible. In Figure 3.6, I show how Musketeer divides the max-property-price workflow into jobs if it were to run the workflow on Hadoop MapReduce. Lines 1–3 in Listing 3.1 (§3.2.1) result in a job that selects columns from the properties relation and joins the result with the prices relation using id as the key. The job comprises a map-only SELECT operator and a map and reduce JOIN operator. Lines 4–5 group by a different key than the prior join, and thus they are grouped into a second job that consists of an aggregating-map and reduce GROUP BY operator, followed by an associative aggregating-map and reduce MAX operator and a map-only PROJECT operator.

Hadoop MapReduce, however, may be a poor back-end choice for the max-property-price workflow because of the additional job start-up overheads, and because data must be serialised to disk at the end of the first job and deserialised from disk at the start of the second job. By contrast, other back-ends (e.g., Naiad, Spark) can merge all classes of operators. In Listing 3.5, I show the Spark code that Musketeer generates for the max-property-price workflow.

To summarise, Musketeer merges operators to reduce job start-up costs and to reduce data serialisation costs. Operator merging is essential for good performance: in Section 4.3, I show that it reduces workflow makespan by 2–5×.
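To make these merge rules concrete, the Python sketch below encodes them for the Hadoop MapReduce back-end. The Op class, its fields and the can_merge helper are illustrative assumptions rather than Musketeer's actual data structures, and the sketch deliberately simplifies some corner cases.

from dataclasses import dataclass

# The four MapReduce job classes from Table 3.1.
MAP_ONLY, NO_OP_MAP_REDUCE, AGG_MAP_REDUCE, MAP_AND_REDUCE = range(4)

@dataclass
class Op:
    cls: int                      # one of the four job classes above
    key: tuple = ()               # shuffle key (grouping/join columns), if any
    is_associative: bool = False  # e.g., MAX and MIN are associative

def can_merge(a: Op, b: Op) -> bool:
    """Can operators a and b share a single Hadoop MapReduce job?"""
    # Map-only operators need no shuffle, so their code can be concatenated
    # into the map (or reduce) phase of any other operator.
    if a.cls == MAP_ONLY or b.cls == MAP_ONLY:
        return True
    # Associative aggregations that group on the same columns can share a
    # shuffle phase (e.g., a GROUP BY followed by MAX on the same key).
    if a.cls == AGG_MAP_REDUCE and b.cls == AGG_MAP_REDUCE:
        return a.is_associative and b.is_associative and a.key == b.key
    # Otherwise both operators need a shuffle; the shuffles only compose if
    # the operators are of the same class and use the same shuffle key.
    return a.cls == b.cls and a.key == b.key

For example, under these rules a SELECT can always join a JOIN's job, whereas two JOINs on different keys cannot share one.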
1 prop_locations =
2   properties.map(property => (property.uid, property.street, property.town))
3 prop_prices = prop_locations
4   .map(prop => (prop.uid, (prop.street, prop.town)))
5   .join(prices)
6   .map((key, (left_rel, right_rel)) => (key, left_rel, right_rel))
7 street_price = prop_prices
8   .map(prp_price => ((prp_price.street, prp_price.town), prp_price.price))
9   .reduceByKey((left, right) => Max(left, right))

Listing 3.5: Spark code for max-property-price. Four map phases are required because data structures must be transformed to match the expected input format of dependent phases (optimisations help avoid this, see §3.4.3).

3.4.2 Idiom recognition

Some back-end execution engines are based on restrictive dataflow models and are specialised for a single workflow type. For example, GraphChi offers a vertex-centric computation interface, and PowerGraph uses the gather, apply, scatter (GAS) decomposition to run graph analysis workflows. Neither back-end can execute workflows that are not graph computations or cannot be mapped to their model. Workflows that are suitable for specialised back-ends can be detected by recognising computational idioms in the IR operator DAG. Idiom recognition is a technique used in parallelising compilers to detect computational idioms that allow performance-improving transformations to be applied [PE95].

My Musketeer prototype detects vertex-centric graph-processing workflows represented in its IR, even when workflows are originally expressed using relational front-ends, e.g., BEER instead of the GAS DSL. The idiom Musketeer uses is a reverse variant of the idiom GraphX and Pregelix use to translate graph workflows to dataflow operators [BBJ+14; GXD+14]. Musketeer looks for a combination of the WHILE, JOIN and AGGREGATE operators: the body of the WHILE loop must contain a JOIN operator with two inputs (i.e., the vertices and the edges). The JOIN operator must be followed by an AGGREGATE operator that aggregates data by the vertex column.

This structure maps to graph computation paradigms: the JOIN on the vertex column is equivalent to sending messages to neighbour vertices (vertex-centric model), or the "scatter" phase (GAS decomposition); the AGGREGATE is equivalent to receiving messages, or the "gather" step (GAS); and any other operators in the WHILE body are part of the per-vertex computation (vertex-centric model) or the "apply" step (GAS), and are equivalent to updating vertex state. In Figure 3.7, I show the GAS steps on the Musketeer IR of the PageRank workflow.

Musketeer may occasionally fail to detect graph workloads and, consequently, execute them in less efficient back-ends. For example, a graph triangle counting workflow that computes all 3-tuples of vertices that form a triangle may not be recognised. A vertex forms a triangle with two other vertices if it is their neighbour and these other vertices are connected by an edge as well.

Figure 3.7: GAS steps highlighted on the Musketeer IR for the PageRank workflow. Musketeer applies idiom recognition to detect these steps.
Musketeer's idiom recognition algorithm does not reliably detect this graph workflow because it can be expressed in two ways: (i) as a graph workflow using Musketeer's GAS DSL, and (ii) as a batch workflow that joins a table (i.e., the edges table) with itself twice and filters the results (i.e., selects the triangles). In the second case, Musketeer fails to take advantage of the opportunity to run the computation in a specialised graph execution engine because the workflow does not contain a WHILE operator. This limitation could partly be addressed with a "reverse loop unrolling" heuristic that detects when multiple operators take the same input and produce the same output (or a closure thereof).

While my simple idiom recognition technique does not catch every opportunity of using specialised graph processing back-ends, it yields no false positives, and is thus safe to apply even across complex workflows.

3.4.3 Sharing data scans

Where possible, Musketeer merges operators and executes them as a single job, eliminating job creation overheads. However, this optimisation is not sufficient to generate code that is competitive with hand-written baselines. Merged workflows do not incur the overhead of starting multiple jobs, but in some back-ends (e.g., Spark, Naiad) they still conduct unnecessarily many data scans, one for each operator. For example, Musketeer translates the first SELECT and the JOIN operator from the max-property-price workflow into two Spark map transformations and a join (Listing 3.5, lines 3–6). The first map implements the first SELECT operator, while the second map establishes a key → 〈tuple〉 mapping over which the join is conducted. Even though Spark holds the intermediate RDDs in memory, scanning over the data twice yields a significant performance penalty if the data are large.

Musketeer reduces data scans if the operators, or several of their steps, can be composed into a single function. It applies the composed function to each input record in a single pass over the input data. I illustrate this optimisation using a Spark code example, but the optimisation is not limited to Spark code. Moreover, the same technique was later used in Weld to fuse loops [PTS+17]. Musketeer generates optimised Spark code (Listing 3.6) for the max-property-price workflow in which it composes the anonymous lambdas from the first two map transformations (Listing 3.5, lines 4 and 6) into a single lambda (Listing 3.6, lines 5–6). Thus, the generated code selects the required columns and prepares the relation for the join transformation with only one data scan.

1 locs =
2   properties.map(c => (c.uid, (c.street, c.town)))
3 id_price = locs
4   .join(prices)
5   .map((key, (l_rel, r_rel)) =>
6     ((l_rel.street, l_rel.town), r_rel.price))
7 street_price = id_price
8   .reduceByKey((left, right) => Max(left, right))

Listing 3.6: Optimised Spark code for max-property-price. Scan sharing and type inference reduce the maps to two.

3.4.4 Look-ahead and type inference

Many execution engines (e.g., Spark and Naiad) expose a rich API for manipulating data types. For example, the SELECT . . . GROUP BY clause in the max-property-price workflow (Listing 3.1, lines 4–5) can be implemented directly in Spark using a reduceByKey transformation. However, such API calls often require specific input data formats. In my example, Spark's reduceByKey requires the data to be represented as a set of 〈key, value〉 tuples. Unfortunately, the preceding join transformation outputs the data in a different format (viz.
〈key, 〈left_rela- tion, right_relation〉〉). The naïve generated code for Spark is inefficient because it contains two map transformations: one to flatten the output of the join (Listing 3.5, line 6), and another to key the relation by a 〈town, street〉 tuple (line 8). In order to avoid generating superfluous data transformations, Musketeer looks ahead and uses type inference to determine the input format of the operators that ingest another operator’s output. With this optimisation, Musketeer expresses the two map transformations as a single CHAPTER 3. MUSKETEER: FLEXIBLE DATA PROCESSING 83 transformation (Listing 3.6, lines 5–6). 3.5 DAG partitioning and automatic mapping Many execution engines cannot run complex workflows in a single job (§3.4.1). For example, execution engines based on the MapReduce dataflow model can execute only one GROUP BY operator per job, and vertex-centric engines (e.g., PowerGraph, GraphChi) are restricted to the idiom described in Section 3.4.2. Moreover, even general execution engines have built-in assumptions about the likely operating regime and the types of workflows they expect to run, and their implementation is optimised based on these assumptions (§2.2.3). In the decoupled data processing architecture, deciding which combination of systems is best to run a workflow is made by partitioning the IR DAG into sub-DAGs, each representing a job. However, this decision is difficult: on the one hand, the choice of back-ends is affected by the DAG partitioning, and on the other hand, the best partitioning depends on which back-ends are available and what properties these have (e.g., optimised for graph or batch workflows, operator merging abilities). In order for an automatic DAG partitioning mechanism to be effective, it must: 1. take into account back-end expressivity limitations, and partition the IR DAG only into sub-DAGs that can execute on available back-ends; 2. predict with reasonable accuracy the relative performance of available back-ends, and use this prediction to automatically decide which back-ends are best for a given workflow; 3. re-consider IR DAG partitioning once jobs complete and new statistics are available (e.g., intermediate data size); and 4. be able to generalise over limitations of future back-ends in order to not hinder the adop- tion of new execution engines. In this section, I explain how my Musketeer prototype meets these requirements. In algo- rithm 3.7, I give a high-level overview of the steps Musketeer takes to make decisions. Muske- teer starts with the optimised IR operator DAG and iterates as long as there are operators left to execute. In each iteration, Musketeer scores the goodness of different DAG partitions using a simple cost function that considers information specific to both workflows and back-ends (see §3.5.1). When the input workflow is small or when there are only several operators left to ex- ecute, Musketeer compares all possible DAG partitions. However, when this is too expensive, Musketeer applies an efficient heuristic based on dynamic programming (§3.5.2). Following, Musketeer generates code for the first sub-DAG (i.e., the job that does not depend on other jobs) and executes it in the back-end engine chosen by the partitioning algorithm. When the job completes, Musketeer removes the job’s operators from the IR DAG and updates statistics (e.g., intermediate data sizes). Musketeer repeats these steps until all operators complete. 84 3.5. 
DAG PARTITIONING AND AUTOMATIC MAPPING

# Transform the workflow into an IR DAG
DAG_IR = translate(workflow)
# Apply optimisations on the IR DAG
DAG_IR_OPT = optimise(DAG_IR)
while DAG_IR_OPT has unexecuted operators:
    # Partition the DAG
    partitions = dag_partitioning(DAG_IR_OPT)
    # Get the first partition's operators and the back-end they've
    # been mapped to
    (operators, back_end) = partitions[0]
    # Generate code for the operators
    job_code = generate_code(operators, back_end)
    execute(job_code)
    # Update expected intermediate data sizes
    update_statistics()
    # Remove executed operators from the IR DAG
    remove_operators(DAG_IR_OPT, operators)

Listing 3.7: Musketeer partitions the IR DAG, executes the first job, updates statistics, and updates the partitioning of the remaining operators.

3.5.1 Back-end mapping cost function

Musketeer uses a simple cost function to compare different DAG partitioning options and to decide which back-end is best for each partition. The cost function scores the combination of an operator sub-DAG, input data size and back-end system. The cost is finite and represents how long it will take to execute the merged sub-DAG of operators in the given back-end. However, the cost can be a large number if the back-end is not sufficiently expressive to run the operator sub-DAG as a single job. The total cost of a DAG partitioning is the sum of the costs of running each sub-DAG partition in its most suitable back-end. I chose a weighted sum of four high-level components as the cost:

1. Operator performance. In a one-off calibration, Musketeer measures each operator in each back-end and records the rate at which it processes input data.

2. Data volume. Operators have bounds on their output size based on their behaviour (e.g., whether they are generative or selective). Musketeer applies these bounds to runtime input data sizes to predict intermediate data sizes. Moreover, it refines these predictions after each intermediate workflow job completes and outputs data.

3. Workflow history. Many data-centre workflows are recurrent (e.g., 40% of Bing's workflows [AKB+12b]). Musketeer collects information about each job it runs (e.g., runtime, and input, intermediate and output data size), and uses this information to refine the scores for subsequent runs of the same workflow.

4. Back-end resource utilisation efficiency. If required, Musketeer's cost function can bias the score towards back-ends that do not complete fastest, but utilise resources efficiently.

Parameter   Description
PULL        Rate of data ingest from HDFS.
LOAD        Rate of loading or transforming data.
PROCESS     Rate of processing an operator on in-memory data.
PUSH        Rate of writing output to HDFS.

Table 3.2: Rate parameters used by Musketeer's cost function.

Out of the above-mentioned components, the operator performance provides the most important signal. The operator performance calibration supplies Musketeer with the four rates I list in Table 3.2. PULL and PUSH quantify a back-end's HDFS read and write throughput. Musketeer measures them using a "no-op" operator. LOAD, by contrast, corresponds to back-end-specific data loading or transformation steps (e.g., partitioning the input in PowerGraph). Finally, PROCESS approximates the rate at which the operator's computation proceeds. In some systems (e.g., Naiad), Musketeer measures the PROCESS rate directly, while in others (e.g., Hadoop MapReduce), Musketeer subtracts the estimated duration of the ingest (from PULL) and output (from PUSH) stages from the overall runtime to obtain PROCESS. Musketeer uses this information to estimate the benefit of shared scans: the cost of PULL, LOAD and PUSH is paid just once (rather than once per operator), and it is combined with the PROCESS costs of all the operators.
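As a hedged illustration of how these calibrated rates might combine into the score of a single merged sub-DAG, consider the sketch below. The rate values, dictionary layout and formula are assumptions made for exposition, not Musketeer's exact cost function.

# Illustrative cost estimate for running one merged sub-DAG as a single job
# on one back-end.
RATES = {  # calibrated processing rates in bytes/second (made-up values)
    ("hadoop", "PULL"): 400e6, ("hadoop", "LOAD"): 350e6,
    ("hadoop", "PUSH"): 300e6,
    ("hadoop", "JOIN"): 120e6, ("hadoop", "AGG"): 200e6,
}

def estimate_cost(operators, back_end, input_bytes, output_bytes):
    """Predicted runtime (seconds) of the merged operators as one job."""
    # PULL, LOAD and PUSH are paid once per job rather than once per
    # operator, which is how shared data scans are credited.
    cost = (input_bytes / RATES[(back_end, "PULL")]
            + input_bytes / RATES[(back_end, "LOAD")]
            + output_bytes / RATES[(back_end, "PUSH")])
    # Each operator adds its PROCESS time on its estimated input size.
    for op_name, op_input_bytes in operators:
        cost += op_input_bytes / RATES[(back_end, op_name)]
    return cost

# For example: a JOIN over 10 GB whose output (estimated at 4 GB) feeds an
# AGG, with 1 GB written back to HDFS.
predicted = estimate_cost([("JOIN", 10e9), ("AGG", 4e9)], "hadoop", 10e9, 1e9)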
Musketeer conducts the operator performance calibration as a one-off profiling run on an idle deployed cluster. Nevertheless, these experiments provide a good indication of how suitable each back-end is for running a particular operator. Musketeer does not measure how operators are affected by co-location interference because I expect workflows to execute in clusters managed by co-location-aware schedulers such as the one I describe in Chapter 5. These schedulers place workflows in such a way that interference is reduced and workflows complete almost as fast as if they were placed on an idle cluster.

The processing rate parameters Musketeer obtains from calibration experiments enable generic cost estimates, but Musketeer can make more accurate predictions when it uses workflow-specific historical information (e.g., historical intermediate data size). No such information is available when a workflow first executes, and thus Musketeer only makes conservative performance predictions. For example, when it estimates intermediate operator output data sizes it assumes the worst case (e.g., SELECT operators output as much data as they read). Thus, more jobs may initially be generated, but on subsequent workflow executions, Musketeer tightens the bounds using historical information, which may unlock additional merge opportunities and back-end options.

3.5.2 DAG partitioning

There are many ways to divide an IR DAG into partitions (i.e., back-end jobs). The goal of the DAG partitioning algorithm is to find the partitioning with the smallest total cost (i.e., the smallest predicted workflow makespan). For each operator, Musketeer must decide whether to merge it with the operators that ingest its output. Thus, the number of possible ways to partition a DAG increases exponentially with the number of operators in the workflow, N.

If k, the optimal number of partitions, is known a priori, then IR DAG partitioning is an instance of the k-way graph partitioning problem [KL70]. However, the optimal number of partitions is unknown in practice. Musketeer could solve the k-way graph partitioning problem for each k value between one (i.e., all operators execute in a single job) and N (i.e., each operator in a separate job). Unfortunately, the k-way graph partitioning problem is itself already NP-hard [GJS76]: the best partitioning is guaranteed to be found only by exploring all k-way partitions.

For small workflows, Musketeer uses an exhaustive search to find the partitioning with the smallest total cost (§3.5.2.1). For more complex workflows, the exhaustive search is prohibitively expensive (see §4.6.2). In such situations, Musketeer switches to a dynamic programming heuristic that completes faster, but does not necessarily find the optimal partitioning (§3.5.2.2).

3.5.2.1 Exhaustive search

To explore all possible DAG partitionings and mappings to back-ends, the exhaustive search algorithm takes as input a given IR operator DAG, D, and transforms it into a linear ordering of operators.
For a given DAG with N operators, the ordering maps the operators to an N-tuple of operators (o1, o2, ..., oN).² Next, the algorithm uses a binary encoding of N-tuples to efficiently encode the state of mapped operators. The encoding allows it to store, in a vector C of length 2^N, the minimum cost of each possible state in which zero or more operators are mapped. For each i, C[i] stores the cost of the best mapping for the mapped operators encoded by i. The encoding maps each possible operator N-tuple (i.e., (o1, o2, ..., oN)) to a corresponding binary representation of the numbers from 0 to 2^N − 1. In this binary representation, an operator is considered to be mapped to a back-end if the representation contains a "1" at the operator's position in the linear ordering. For example, a 101 encoding means that the first and third operators are already mapped, and C[5] contains the minimum cost of mapping these operators to the best back-end combination. The cost of the best complete DAG partitioning is stored in C[2^N − 1].

²The exhaustive search can use any linear ordering, but I choose to topologically sort the operators to obtain an ordering that respects operator precedence – i.e., an operator does not appear in the linear ordering before any of its ancestors.

Informally, for a given binary mapping p, the algorithm finds the minimum cost partitioning by exploring all the options of splitting the operators into two partitions: (i) a partition pd for which the algorithm has previously computed the minimum cost, and (ii) a partition job in which all the operators are mapped to a job that would execute in back-end b. The combined pd and job partitions must map all mapped operators from the binary mapping p, but each partition must map distinct operators, i.e., p = pd ⊕ job.

In practice, the algorithm uses the cost function to recursively populate the vector and to compute C[2^N − 1]. The cost function is defined over 2^{(o1, o2, ..., oN)} × B, where B is the set of supported back-ends. It returns the predicted cost of running a set of operators (job ∈ 2^{(o1, o2, ..., oN)}) in a single job on a back-end. The algorithm uses the following recurrence:

    C[p] = min over b ∈ B and pd, job ∈ 2^{(o1, o2, ..., oN)} with p = pd ⊕ job of ( C[pd] + cost(job, b) )

The algorithm is guaranteed to find the optimal partitioning with respect to the cost function because it explores all the possible partitions and mappings. However, the algorithm is exponential in the number of workflow operators (N) because it computes the minimum cost mapping for each possible binary encoding between 0 and 2^N − 1. Its worst-case complexity is O(2^N · |B|). Thus, Musketeer can only use the algorithm on workflows with up to a few dozen operators, because it would otherwise dominate the actual execution time.
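This recurrence can be evaluated with a short bitmask-based search. The following Python sketch illustrates the encoding under the assumption of a cost(job, back_end) function as in §3.5.1; it is an illustration of the idea, not Musketeer's actual implementation.

def exhaustive_partition(n_ops, back_ends, cost):
    """Sketch of the exhaustive search over binary encodings; `job` is a
    bitmask of the operators placed together in one job on back-end b."""
    INF = float("inf")
    C = [INF] * (1 << n_ops)   # C[p]: best cost of mapping the operators in p
    C[0] = 0.0
    for p in range(1, 1 << n_ops):
        job = p
        while job:             # enumerate every non-empty subset of p
            pd = p ^ job       # operators already costed in an earlier entry
            best = min(cost(job, b) for b in back_ends)
            if C[pd] + best < C[p]:
                C[p] = C[pd] + best
            job = (job - 1) & p
    # (A full implementation would also record the chosen job and back-end
    #  for each p in order to reconstruct the partitioning.)
    return C[(1 << n_ops) - 1]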
3.5.2.2 Dynamic programming heuristic

Some industry workflows comprise large DAGs of up to hundreds of operators [CRP+10, §6.2]. Musketeer partitions such workflows with a dynamic programming heuristic I developed. The algorithm chooses good partitionings and mappings in practice, and its execution time scales linearly with the number of operators. It focuses on grouping consecutive operators within the linear ordering of operators into jobs, and thus explores only a subset of the possible partitionings.

In Figure 3.8, I illustrate the high-level steps of my algorithm on the IR operator DAG of the PageRank workflow. First, the algorithm topologically sorts the IR operator DAG. Second, it explores possibilities of merging neighbouring operators into single jobs using a dynamic programming algorithm. Unlike the exhaustive search, this algorithm only finds the optimal partitioning for the given ordering.

Figure 3.8: The dynamic partitioning heuristic takes an IR DAG, (1) transforms it to a linear order, and (2) computes job boundaries via dynamic programming. On the right, I show several possible partitions and system mappings.

Not all linear operator orderings are equally good inputs to the dynamic programming algorithm. In order for the algorithm to find good partitionings, the linear ordering must have the following properties: (i) it must respect operator precedence (i.e., an operator does not appear in the linear ordering before any of its ancestors), and (ii) it must place as many mergeable operators as possible in neighbouring positions. Topological sorting algorithms produce orderings that have the first property. To ensure that the second property holds as well, I implemented a topological sort algorithm that is based on depth-first search, because Musketeer runs more workflows that have greater IR DAG "height" (i.e., long chains of dependent operators) than "width" (i.e., many parallel operators). My algorithm produces orderings that provide more operator merging opportunities than other alternatives.

Next, the dynamic programming algorithm uses the cost function to determine the best partitioning for running the first nops operators using njobs jobs. It explores all the possibilities of running the first opexec operators of the ordering using njobs − 1 jobs and the remaining operators as a single job. The algorithm finds good mappings because it considers all partitionings of the linear ordering and because the cost function ensures that it merges as many operators as possible within each individual job.

In practice, the algorithm uses the cost function to compute, for each N-operator workflow, an N-by-N matrix C. For each 0 < njobs ≤ nops ≤ N, the matrix stores at C[nops][njobs] the minimum cost of running the first nops linearly ordered operators using njobs jobs on the best back-ends. The algorithm computes the matrix with the following recurrence:

    C[nops][njobs] = min over b ∈ B and opexec < nops of ( C[opexec][njobs − 1] + cost((o_{opexec+1}, ..., o_{nops}), b) )

    ...
    while num_jobs > 0:
        job_size = JOBS[n_ops][num_jobs]
        job_encoding = ((1 << job_size) - 1) << (n_ops - job_size)
        mappings.append((job_encoding, BE[n_ops][num_jobs]))
        n_ops = n_ops - JOBS[n_ops][num_jobs]
        num_jobs = num_jobs - 1
    return mappings

Listing 3.8: Dynamic programming heuristic for exploring partitionings of large workflows. It takes as input N, the number of workflow operators, and the cost function. It outputs a list of mappings of operator job encodings to back-ends.

The algorithm maintains auxiliary N-by-N matrices BE and JOBS besides the matrix C. It uses these matrices to reconstruct the best partitioning and back-end mappings upon completion. For each 0 < njobs ≤ nops ≤ N, BE[nops][njobs] stores the name of the back-end chosen for the last job, and JOBS[nops][njobs] stores the number of operators in the last job.
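The recurrence can be filled in with a straightforward loop over the linear ordering. The Python sketch below shows one possible fill of the C, BE and JOBS matrices, again assuming a cost(operators, back_end) function as in §3.5.1; it is an illustration rather than Musketeer's actual code.

def dp_partition(ordering, back_ends, cost):
    """Illustrative fill of C, BE and JOBS following the recurrence above;
    `ordering` is the topologically sorted list of IR operators."""
    N = len(ordering)
    INF = float("inf")
    # C[n_ops][n_jobs]: best cost of running the first n_ops operators
    # using n_jobs jobs; BE and JOBS remember the last job's choice.
    C = [[INF] * (N + 1) for _ in range(N + 1)]
    BE = [[None] * (N + 1) for _ in range(N + 1)]
    JOBS = [[0] * (N + 1) for _ in range(N + 1)]
    C[0][0] = 0.0
    for n_ops in range(1, N + 1):
        for n_jobs in range(1, n_ops + 1):
            for op_exec in range(n_jobs - 1, n_ops):
                last_job = ordering[op_exec:n_ops]  # operators in the last job
                for b in back_ends:
                    c = C[op_exec][n_jobs - 1] + cost(last_job, b)
                    if c < C[n_ops][n_jobs]:
                        C[n_ops][n_jobs] = c
                        BE[n_ops][n_jobs] = b               # last job's back-end
                        JOBS[n_ops][n_jobs] = n_ops - op_exec  # last job's size
    return C, BE, JOBS

The reconstruction loop of Listing 3.8 then walks JOBS and BE backwards from the cheapest C[N][n_jobs] entry to recover the job boundaries and their back-ends.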
In contrast to the exhaustive search, the dynamic programming heuristic scales to large workflows because it has a polynomial worst-case complexity of O(N^3 · |B|). However, the heuristic can miss out on opportunities to merge operators because the linear orderings it chooses may not contain the best operator adjacencies. I discuss this further and show an example in Section 3.6. Nonetheless, I found the dynamic programming heuristic to work well in a set of cluster experiments (§4.6.1).

3.6 Limitations and future work

Decoupling front-end frameworks from back-end execution engines increases flexibility, but it may obfuscate some end-to-end optimisation opportunities that expert developers might realise. Musketeer is best suited for developers who write analytics workflows for high-level front-end frameworks. Developers who are willing to painstakingly hand-optimise a particular workflow might be able to do better, but I expect them to be a minority. They can nevertheless still benefit from Musketeer by being able to rapidly explore different back-ends, or by using Musketeer to automate the first step of porting a workflow to new back-ends. In the following paragraphs, I highlight some of Musketeer's current limitations and discuss how they can be addressed in future work.

Optimising workflows. The optimisations I described in Subsection 3.3.1 re-order operators to reduce intermediate data sizes. However, these optimisations are not always beneficial: they can reduce operator merging opportunities or change the IR DAG such that Musketeer's dynamic programming heuristic does not find good partitionings.

Consider the example of the workflow shown in Figure 3.9. It consists of three operators: COUNT, SORT and INTERSECT. Musketeer's query rewriting rules would pull the selective INTERSECT operator above the SORT operator. But this may be unhelpful – while COUNT and SORT can be merged in MapReduce back-ends, COUNT and INTERSECT cannot. Applying this rewrite would have the side-effect of eliminating the possibility of merging the two operators in any MapReduce-based back-end. The workflow would incur extra job start-up costs and would necessitate three passes over the data rather than two.

Figure 3.9: Example of a DAG optimisation inadvertently decreasing operator merging opportunities.

Figure 3.10: The dynamic heuristic does not return the minimum-cost partitioning for this workflow: it misses the opportunity to merge JOIN with PROJECT.

To avoid this problem, Musketeer's query optimiser would have to return the entire set of re-ordered IR operator DAGs. Musketeer could then apply the dynamic programming heuristic algorithm on each DAG and pick the partitioning with the smallest cost across all DAGs. If the number of DAGs obtained by applying the query rewriting rules is too large, Musketeer might only partition the most promising DAGs.

DAG partitioning. My dynamic programming heuristic returns the optimal k-way partitioning for a given linear ordering of operators. However, it may miss fruitful merging opportunities because it only explores one linear ordering. In Figure 3.10, I show a workflow example on which the dynamic heuristic misses good merging opportunities.
In Hadoop MapReduce, it is best to run the top JOIN operator in the same job as the PROJECT operator, but the linear ordering based on depth-first exploration breaks this merge opportunity 3. A simple approach to mitigate this limitation is to generate multiple linear orderings, run my inexpensive dynamic partitioning heuristic for each one of them, and pick the best partitioning. 3.7 Summary In this chapter, I argued that data processing systems should adopt an architecture that decouples front-end frameworks from back-end execution engines. I first gave an overview of Musketeer (§3.1) and then I described how it decouples systems by translating workflows defined in one of the supported front-end frameworks (§3.2) to an intermediate representation consisting of a dynamic DAG of operators (§3.3). Following, I presented how Musketeer applies optimisa- tions and generates code for suitable back-end execution engines (§3.4). Next, I described how Musketeer automatically partitions the IR DAG and decides which combination of back-ends is 3This limitation does not affect general-purpose back-ends (e.g., Naiad and Spark), which are able to merge any sub-region of operators. 92 3.7. SUMMARY best for running a workflow (§3.5). Finally, I highlighted Musketeer’s limitations and discussed several ways in which these could be addressed (§3.6). With Musketeer, users benefit from the advantages the decoupled data processing architecture offers: (i) developers write workflows once, but can execute them on many alternative back- ends, (ii) different back-ends can be combined within a workflow at runtime, and (iii) existing workflows can seamlessly be ported to new execution engines. In Chapter 4, I show that Mus- keteer enables compelling performance gains and its generated code performs almost as well as unportable, hand-optimised baseline implementations. Chapter 4 Musketeer evaluation Musketeer is my prototype implementation of the decoupled data processing architecture. Mus- keteer maps workflows expressed in front-end frameworks to an intermediate representation, dynamically and automatically chooses combinations of back-ends to run workflows on, and generates back-end code. In this chapter, I evaluate if Musketeer offers the benefits of the decoupled data processing ar- chitecture: (i) developers write workflows once, but can execute them on many alternative back- ends without performance penalties compared to hand-written implementations, (ii) makespan reduction because different back-ends can be combined within a workflow, and (iii) good au- tomatic back-end execution mapping at runtime. I show that Musketeer offers these benefits in a range of cluster experiments with real-world workflows. In my experiments, I answer the following six questions: 1. How does Musketeer’s generated workflow code compare to hand-written optimised im- plementations? (§4.2) 2. Do Musketeer’s operator merging, data scan sharing and type inference optimisations reduce makespan? (§4.3) 3. Does Musketeer manage to speed-up legacy workflows by mapping them to a different back-end execution engine? (§4.4) 4. Does Musketeer reduce workflow makespan by flexibly combining back-end execution engines? (§4.5) 5. How costly is Musketeer’s automatic DAG partitioning and back-end execution engine mapping? (§4.6.2) 6. Does Musketeer’s automatic back-end execution engine mapping make good choices? (§4.6.1) 93 94 4.1. 
EXPERIMENTAL SETUP AND METRICS

In my experiments, I use seven real-world workflows: three batch workflows, three iterative workflows and a hybrid batch-graph workflow. The batch workflows run: (i) TPC-H query 17, an ad-hoc business-oriented query, (ii) top-shopper, which identifies top spenders in an online shop, and (iii) a Netflix movie recommendation algorithm based on collaborative filtering [BKV08]. The iterative workflows run: (i) five PageRank iterations on various graphs, (ii) single-source shortest path (SSSP), and (iii) k-means clustering. Finally, the hybrid workflow comprises a batch pre-processing step followed by five PageRank iterations.

4.1 Experimental setup and metrics

I execute all the experiments I describe in this chapter on either a heterogeneous but dedicated local cluster or a medium-sized Amazon Elastic Compute Cloud (EC2) cluster. The clusters differ both in hardware configuration and in scale:

The heterogeneous local cluster is a small seven-machine cluster comprising machines with different CPU clock frequencies and from different generations. The machines also have different memory architectures and RAM capacities. All the machines are connected via a two-switch 1G network (see Table 4.1a).

The medium-sized cloud cluster is a cloud computing cluster consisting of 100 m1.xlarge Amazon EC2 instances (see Table 4.1b). The instances have 15 GB of RAM and four virtual CPUs. The cluster represents a realistic scale-up environment in which the network and the machines are shared with other EC2 tenants.

All the experiments I conduct on the local cluster are executed in a fully controlled environment without any external network traffic or machine utilisation. I use this environment to measure, with low variance, the frameworks' performance and their suitability for executing workflows on heterogeneous clusters. In contrast, the medium-sized cloud cluster represents a realistic data centre environment in which workflows can be affected by interfering applications running in other tenants' instances.

Type    Machine      Architecture         Cores   Thr.   Clock      RAM
3× A    GW GR380     Intel Xeon E5520     4       8      2.26 GHz   12 GB PC3-8500
2× C    Dell R420    Intel Xeon E5-2420   12      24     1.9 GHz    64 GB PC3-10666
1× D    Dell R415    AMD Opteron 4234     12      12     3.1 GHz    64 GB PC3-12800
1× E    SM AS1042    AMD Opteron 6168     48      48     1.9 GHz    64 GB PC3-10666

(a) Heterogeneous seven-machine local cluster.

Instance type     Compute Units (ECU)   vCPUs   RAM
100× m1.xlarge    8                     4       15 GB

(b) Medium-sized, multi-tenant Amazon EC2 cluster.

Table 4.1: Specifications of the machines in the two evaluation clusters.

System     Modification
Hadoop     Tuned configuration to best practices.
Spark      Tuned configuration to best practices.
GraphChi   Added HDFS connector for I/O.
Naiad      Added support for parallel I/O and HDFS.

Table 4.2: Modifications made to the deployed back-end execution engines.

In both clusters, all machines run Ubuntu 14.04 (Trusty Tahr) with Linux kernel v3.14. I deploy all systems supported by Musketeer¹ on these clusters. I use a shared Hadoop Distributed File System (HDFS) as the storage layer because the file system is already supported by Hadoop MapReduce, Spark and PowerGraph. Moreover, in order to establish a level playing field for my experiments, I tune, modify, and add support for interaction with HDFS in all the other frameworks (see Table 4.2).

In my experiments, I measure makespan, resource efficiency, and the workflow overhead introduced by Musketeer-generated code.
I calculate overhead by normalising the makespan of Musketeer- generated workflows running in a single execution engine to the makespan of an optimised hand-written implementation for the same engine. 4.2 Overhead over hand-written optimised workflows If the overheads of automatically generated workflow code are unacceptably high (e.g.,> 30%) then they may outweigh makespan differences arising from back-end execution engines’ design choices. In the following, I show using a complex batch workflow and an iterative workflow that Musketeer’s generated code overhead over optimised hand-written baselines is usually around 5–20%, and never exceeds 30%. Musketeer combines techniques such as operator merging (§3.4.1), idiom recognition (§3.4.2), data scans sharing (§3.4.3) and type inference (§3.4.4) in order to optimise the code it generates for complex workflows. 4.2.1 Batch processing I first measure Musketeer’s generated code overheads using the batch Netflix movie recom- mendation workflow [BKV08]. The workflow implements collaborative filtering for movie recommendation (see Figure 4.1). It takes two inputs: a 100 million-row movie ratings table (2.5GB) and a 17,000-row movie list (0.5MB). From these inputs, the collaborative filtering 1Hadoop 2.0.0-mr1-chd4.5.0, Spark 0.9, PowerGraph 2.2, GraphChi 0.2, Naiad 0.2 and Metis commit e5b04e2. 96 4.2. OVERHEAD OVER HAND-WRITTEN OPTIMISED WORKFLOWS movies AGGREGATE PROJECT SELECT JOIN ratings PROJECT JOIN MUL JOIN AGGREGATE MUL MAX Figure 4.1: Netflix movie recommendation workflow. workflow computes movie recommendations for all users, and finally outputs the top recom- mended movie for each user. It is a challenging workflow because it contains many operators (13) and provides several, but non-trivial, operator merging opportunities. Moreover, the work- flow is data-intensive, with up to 700 GB of intermediate data generated. These data are large enough to exceed the RAM of a single machine, but small enough to fit onto my medium-sized EC2 cluster’s memory. In my experiment, I look at how Musketeer’s generated code compares to hand-written base- lines when it processes different amounts of data. I control the amount of processed data by varying the number of movies I use for computing recommendations. For example, the work- flow generates only 17 GB of intermediate data when there are approximately 700 movies in the input, but the intermediate data size increases to 240 GB when the algorithm uses 1,200 movies, and to 700 GB when it uses 3,500 movies. I hand-implemented workflow baselines in three general-purpose systems that are good can- didates for running this workflow (Hadoop MapReduce, Spark, and Lindi on Naiad). Hadoop MapReduce is a good candidate because it can quickly process large amounts of data. However, it has to write intermediate data to disks several times because it cannot run the entire workflow in a single job. By contrast, Spark only spills intermediate data to disk if the data does not fit in memory. Finally, Lindi on Naiad is a good candidate because it can execute the entire workflow CHAPTER 4. MUSKETEER EVALUATION 97 0 500 1000 1500 2000 2500 3000 3500 Number of movies selected 0 500 1000 1500 2000 2500 3000 M ak es pa n [s ec ] Musketeer (Hadoop) Hadoop Musketeer (Spark) Spark Musketeer (Naiad) Lindi on Naiad Figure 4.2: Makespan of Musketeer-generated code versus hand-written implementations for the Hadoop, Spark and Naiad back-ends on the Netflix movie recommendation work- flow. 
Error bars show ±σ over five runs on the 100 EC2 instances cluster. in a single job and it can stream input and intermediate operators’ output data to dependent operators as soon as the data are available. In Figure 4.2, I compare Musketeer-generated code for the Netflix workflow to the aforemen- tioned hand-optimised baselines. I extensively tuned each baseline to deliver good performance for the given system. For all three systems, the overhead added by Musketeer’s generated code is low: it is minimal for Naiad and under 30% for Spark and Hadoop even as the input and intermediate data grow. The remaining overhead for Spark is primarily due to my type-inference algorithm’s simplicity, which can cause the Musketeer-generated code to make an extra pass over the data. However, this algorithm could be improved with additional engineering effort. 4.2.2 Iterative processing Next, I measure the overheads of Musketeer-generated code on iterative workflows using the PageRank algorithm. Iterative workflows are challenging because Musketeer must generate code that efficiently uses each back-end execution engine’s mechanism for running iterative computations. For example, Musketeer must generate code for Spark’s driver program, generate code that uses Naiad’s timely dataflow model, or run its own driver program to execute iterative computations on Hadoop MapReduce. I hand-implemented baselines for the most popular distributed and single-machine back-end ex- 98 4.3. IMPACT OF MUSKETEER OPTIMISATIONS ON MAKESPAN Hadoop Spark Naiad PG GraphChi -20% 0% 20% 40% 60% 80% O ve rh ea d 100 nodes 16 nodes 1 node Figure 4.3: Musketeer-generated code overhead for PageRank on the Twitter graph. I use the 100-node EC2 cluster and show results for best setup (i.e., 16 EC2 nodes for Power- Graph, and a single EC2 node for GraphChi). Error bars show ±σ over five runs; negative overheads are due to variance on EC2. ecution engines supported by Musketeer that can run the PageRank workflow: Hadoop MapRe- duce, Spark, Naiad, PowerGraph, and GraphChi. The workflow executes five PageRank itera- tions on the Twitter graph with 43 million vertices and 1.4 billion edges (23 GB of input data). In Figure 4.3, I show the overheads of Musketeer-generated code over the baselines. The av- erage overhead is below 30% in all cases. The variability in overhead (and improvements over the baseline) are due to interference and performance variability on EC2. There are no fundamental reasons for which Musketeer necessarily must produce generated code with any overhead. Indeed, further optimisations of the code generation are possible. For example, Musketeer-generated code currently does several unnecessary string operations per input data row even when operators are merged. These operations could be implemented more efficiently or, in some cases, removed. In conclusion, my experiment shows that Musketeer generates code that never has more than 30% overhead over hand-written optimised baselines for both batch and iterative processing workflows, and thus it does not outweigh makespan reductions obtained from choosing well- suited back-ends. 4.3 Impact of Musketeer optimisations on makespan Musketeer uses three key techniques to improve the code it generates: (i) operator merging (§3.4.1), (ii) data scan sharing (§3.4.3), and (iii) look-ahead and type inference (§3.4.4). With operator merging, Musketeer executes several operators in a single job, and thus avoids job creation overheads where possible. 
By sharing data scans, Musketeer avoids superfluous reads and writes of intermediate data. Finally, by using look-ahead and type inference, Musketeer is able to transform each operator’s output to match input requirements of dependent merged operators. This technique reduces the amount of string processing workflows conduct. CHAPTER 4. MUSKETEER EVALUATION 99 1 SELECT purchase_id, amount, user_id 2 FROM shop_logs 3 WHERE country = ’USA’ 4 AS usa_shop_logs; 5 SELECT user_id, SUM(amount) AS total_amount 6 FROM usa_shop_logs 7 GROUP BY user_id 8 AS usa_spenders; 9 SELECT user_id, total_amount 10 FROM usa_shoppers 11 WHERE total_amount > 12000 12 AS usa_big_spenders; Listing 4.1: Hive code for the top-shopper workflow. In this section, I evaluate if these optimisations are really necessary. I measure the effect these optimisations have on workflow makespan using a simple micro-benchmark top-shopper batch workflow and a complex iterative workflow. 4.3.1 Batch processing In Listing 4.1, I include the Hive query code for the top-shopper batch workflow. The workflow finds the largest spenders in a certain geographic region based on an online store’s tabular logs. It first filters the purchases by region, then it aggregates their values by user ID, and finally, it selects all users that have spent more than a threshold. Top-shopper consists of three SELECT operators. The first and third operators have only simple WHERE clauses, but the second operator has a GROUP BY clause. Given that the workflow con- tains only one GROUP BY clause, all its operators can be merged and executed as a single job even in the least expressive supported back-end (i.e., Hadoop MapReduce). In Figure 4.4, I show top-shopper’s makespan with operator merging, data scans sharing and type inference turned off and on. I also vary the size of the input data from few KB up to 30 GB (100 million users) to show how much these optimisations help as I increase input data size. The results illustrate that the optimisations significantly reduce makespan: I observe a one-off reduction in makespan of≈25–50s as operator merging avoids per-operator job creation overheads and shares data scans, along with an additional 5–10% linear benefit per 10M users attributable to type inference which reduces the number of string operations conducted. 4.3.2 Iterative processing I now measure by how much the three optimisations reduce generated code overheads. I run a complex cross-community PageRank workflow on the local heterogeneous cluster. The work- flow computes the relative popularity of users present in two web communities. It involves a 100 4.3. IMPACT OF MUSKETEER OPTIMISATIONS ON MAKESPAN 0 20 40 60 80 100 Millions of users 0 50 100 150 200 M ak es pa n [s ec ] Hadoop Hadoop optimised Spark Spark optimised Figure 4.4: Operator merging and type inference eliminate per-operator job creation over- heads for the top-shopper workflow running on the 100-node EC2 cluster. Error bars show ±σ over ten runs. Ha do op Ha do op op tim ise d Ha do op ba sel ine Sp ark Sp ark op tim ise d Sp ark ba sel ine 0 400 800 1200 1600 2000 M ak es pa n [s ec ] Figure 4.5: Operator merging and type inference eliminate per-operator job creation over- heads and help bring cross-community PageRank generated code performance close to hand-written baselines. batch computation followed by an iterative computation. 
First, the workflow intersects the edge sets of two communities (e.g., all LiveJournal and WordPress users), and subsequently, it runs five PageRank iterations on the nodes and edges present in both communities. In contrast to the top-shopper workflow, cross-community PageRank cannot execute as a single job in the least expressive back-end, Hadoop MapReduce, and challenges Musketeer’s ability to automatically CHAPTER 4. MUSKETEER EVALUATION 101 decide which operators can be merged and what is the best way to merge them. I use the LiveJournal graph (4.8M nodes and 68M edges) and a synthetically generated web community graph (5.8M nodes and 82M edges) in my experiment. In Figure 4.5, I show that generated code for Hadoop MapReduce and Spark cross-community PageRank sees a benefit, and the generated code overheads are not unacceptably high – i.e., never exceed 30% over hand- written baselines. Each Hadoop MapReduce iterations completes 180 seconds faster and each Spark iteration completes 300 seconds faster because of operator merging, data scan sharing and type inference. To sum up, these three optimisations reduce workflow makespan by up to 60% for the simple top-shopper batch workflow and up to 80% for the complex iterative cross-community PageR- ank workflow; therefore, they are essential to achieving low-overhead in generated code. 4.4 Dynamic mapping to back-end execution engines I have previously shown that no data processing framework systematically outperforms all oth- ers at different scales (see §2.2.3). I now investigate Musketeer’s ability to dynamically map workflows to the most competitive back-end execution engine at different cluster scales. I con- sider a batch workflow and an iterative graph workflow to show that Musketeer can leverage back-end diversity for different types of workflows. 4.4.1 Batch processing I run query 17 from the TPC-H business decision benchmark using the HiveQL and Lindi front- ends to illustrate the flexibility offered by Musketeer at different scales. The TPC-H benchmark generates data to accurately model the day-to-day operations of an online wholesaler. Query 17 is a business intelligence query that computes how much yearly revenue would be lost if the wholesaler stopped accepting small orders. It first selects all products of a given brand and with a given container type. Following, it computes the average yearly order size for these products. Finally, the query calculates the average yearly loss in revenue if orders smaller than 20% of this average were no longer taken. I implemented the query in HiveQL using one SELECT with a GROUP BY clause, two JOINs and a SELECT with a simple WHERE clause. I used equivalent LINQ operators to implement the workflow in Lindi. In Figure 4.6, I show workflow makespan as I increase input data size from 7.5 GB (TPC-H scale factor 10) to 75 GB (TPC-H scale factor 100). Makespan ranges between 200–400s when I run the Hive version of the workflow using Hive’s native Hadoop MapReduce back-end. Musketeer, however, can map the Hive workflow specification to different back-ends. In this case, if Musketeer maps it to Naiad, it reduces the makespan by 2× compared to Hive. This is not surprising: Hive cannot run the workflow with fewer than three jobs because Hadoop is 102 4.4. 
DYNAMIC MAPPING TO BACK-END EXECUTION ENGINES 0 20 40 60 80 100 TPC-H scale factor 0 100 200 300 400 500 600 700 M ak es pa n [s ec ] Hive on Hadoop Lindi on Naiad Musketeer (Naiad) Figure 4.6: Musketeer reduces TPC-H query 17 makespan on a 100-node EC2 cluster compared to Hive and Lindi running jobs on their native back-ends. Less is better; error bars show min/max of three runs. based on the restrictive MapReduce dataflow model, while Naiad can run the entire workflow in a single job. A developer might have sufficient understanding of data processing systems to know that Naiad can run the workflow in a single job. She might therefore specify the workflow using the Lindi front-end and target Naiad directly. However, the Lindi query actually scales worse than the Hive query, despite the query running as a single job on Naiad. This is the case because Lindi’s high-level GROUP BY operator is non-associative. Thus, the Naiad job generated for the workflow’s Lindi version collects input data for the GROUP BY on a single machine before it applies the operator. This limitation does not affect Musketeer, which generates code for an improved associative GROUP BY operator implemented using Naiad’s low-level vertex API. The associative operator groups data locally on each machine before it sends the data to one machine. Consequently, Musketeer’s generated Naiad code scales far better than the Lindi version (up to 9× at scale 100). The Naiad developers may of course improve Lindi’s GROUP BY in the future, but this exam- ple illustrates that by decoupling and generating efficient code, Musketeer can improve perfor- mance even for a front-end’s native execution engine. 4.4.2 Iterative processing While batch workflows can be expressed using SQL-like front-end frameworks such as Hive and Lindi, iterative graph processing workflows are typically expressed in other front-ends (see §3.2.3). To evaluate Musketeer’s ability to dynamically map graph computations at different scales, I implemented PageRank using Musketeer’s GAS DSL front-end (Listing 3.4, §3.2.3). CHAPTER 4. MUSKETEER EVALUATION 103 0 200 400 600 800 Makespan [sec] Musketeer GraphLINQ Spark Hadoop Musketeer GraphLINQ PowerGraph Musketeer GraphChi 10 0 no de s 16 nd . 1 no de (a) Orkut (3.0M vertices, 117M edges). 0 1500 3000 4500 Makespan [sec] Musketeer GraphLINQ Spark Hadoop Musketeer GraphLINQ PowerGraph Musketeer GraphChi 10 0 no de s 16 nd . 1 no de (b) Twitter (43M vertices, 1.4B edges). Figure 4.7: Musketeer performs close to the best-in-class system for five iterations of PageRank on 1, 16 and 100 EC2 nodes. Error bars are ±σ over 5 runs. 0% 25% 50% Resource efficiency Musketeer GraphLINQ Spark Hadoop Musketeer PowerGraph GraphLINQ Musketeer GraphChi 100% 100% 10 0 no de s 16 nd . 1 nd . Figure 4.8: Musketeer’s resource efficiency on PageRank on the Twitter graph (more is better). I run this workflow on the two social network graphs (Orkut and Twitter) that I used earlier to compare the data processing systems (see §2.2.3). In Figure 4.7, I compare the makespan of five PageRank iterations executed using Musketeer- generated jobs to hand-written optimised baselines implemented in: (i) general-purpose systems (Hadoop, Spark), (ii) a specialised graph-processing front-end for Naiad (GraphLINQ), and (iii) special-purpose graph processing execution engines (PowerGraph, GraphChi). Different sys- tems achieve their best performance at different scales, and I only show the best result for each system. 
The only exception to this is GraphLINQ on Naiad, which is competitive at both 16 and 100 nodes. At each scale, Musketeer’s best mapping is almost as good as the best-in-class baseline. On one node, Musketeer does best when mapping to GraphChi, while mapping to Naiad (Orkut) or PowerGraph (Twitter) is best at 16 nodes, and mapping to Naiad is always best at 100 nodes. 104 4.5. COMBINING BACK-END EXECUTION ENGINES 4.4.2.1 Resource efficiency Workflow makespan is an important metric for many users, but in some cases it may be worth- while to trade off makespan for improved resource efficiency. In Figure 4.8, I show the re- source efficiency for the same configurations as I previously used to measure makespan, run- ning PageRank on the Twitter graph. Musketeer achieves resource efficiencies close to the best stand-alone implementations at all three scales. On one node Musketeer generates GraphChi code that is as efficient as the baseline. On sixteen machines, Musketeer is 14% less resource efficient than the most efficient baseline executed in PowerGraph. Finally, on 100 machines, Musketeer is 17% less resource efficient than the best alternative (GraphLINQ on Naiad). In conclusion, these experiments show that Musketeer’s dynamic mapping approach is flexible and can be used for both batch and iterative computations. 4.5 Combining back-end execution engines In addition to dynamically mapping entire workflows to specialised back-ends, Musketeer can also combine different back-ends by mapping parts of complex workflows to different back- ends. Therefore, Musketeer can leverage execution engine diversity and use systems only for the parts of workflows they have been specialised for. Prior workflow managers (e.g., Oozie, Pig) can only execute workflows that require both batch and iterative graph processing in a sin- gle back-end (e.g., Hadoop MapReduce). If worthwhile, Musketeer by contrast can execute the workflow’s batch computation in a general data processing back-end and its graph computation in a specialised graph processing back-end. In Figure 4.9, I compare makespan of cross-community PageRank workflow for different com- binations of back-ends, explored using Musketeer and the local heterogeneous cluster. Out of the three executions on single back-ends, the workflow completes fastest in Lindi at 153s. However, the makespan is comparable when Musketeer combines Hadoop MapReduce with a special-purpose graph processing back-end (e.g., PowerGraph), even though these systems use fewer machines: PowerGraph runs on two machines versus Lindi which uses a seven-machine Naiad deployment. This is the case because general-purpose systems (like Hadoop MapRe- duce) work well for the batch phase of the workflow, but cannot execute the iterative PageRank as fast as specialised graph processing system, or general systems that use the timely dataflow model (i.e., Naiad). Hence, combining systems like this can increase resource efficiency. However, a combination of Lindi and GraphLINQ, which both run jobs on Naiad, works best. This combination outperforms Lindi because it takes advantage of GraphLINQ’s graph specific optimisations, and it outperforms Hadoop and PowerGraph because it avoids the extra I/O that results from moving intermediate data across back-end boundaries. Musketeer currently does not fully automatically generate the low-level Naiad code to combine Lindi and GraphLINQ, but it could be extended to do so. CHAPTER 4. 
MUSKETEER EVALUATION 105 0 150 300 450 600 Makespan [sec] Lindi & GraphLINQ Hadoop & GraphChi Hadoop & PowerGraph Hadoop & Spark Lindi only Spark only Hadoop only Figure 4.9: A cross-community PageRank workflow is accelerated by combined back- ends. All jobs apart from the “Lindi & GraphLINQ” combination were generated by Mus- keteer. Error bars show the min/max of 3 runs. Musketeer’s ability to flexibly partition a workflow makes it easy to explore different combi- nations of systems. In this section, I have shown that this can in some cases reduce workflow makespan (e.g., when using Lindi and GraphLINQ) or in other cases improve resource effi- ciency (e.g., when using Hadoop MapReduce and PowerGraph). 4.6 Automatic back-end execution engine mapping Developers who use Musketeer can manually specify which back-end to use, but they can also let Musketeer automatically map their workflows to back-ends (§3.5). In order for the mech- anism for automatically mapping workflows to be useful, it must: (i) choose mappings that come close to the best options, or at least as good as the mappings developers would have cho- sen, and (ii) run the algorithmically complex DAG partitioning algorithms and find mappings fast enough that it does not significantly increase workflow makespan. In this section, I test if Musketeer’s automatic mapping mechanism satisfies these requirements. 4.6.1 Quality of automatic mapping decisions Developers can manually map workflows to back-end execution engines, but their job is much easier if Musketeer automatically chooses these mappings to back-ends. I focus on evaluating the quality of Musketeer’s automatic mapping decisions (§3.5.1). First, I investigate decision quality using the workflows I previously executed (TPC-H query 17, top-shopper, Netflix movie recommendation, PageRank, Join, Project), and then I test decision quality on two additional workflows (single-source shortest path and k-means). In the experiment, I use 33 different configurations by varying the input data sizes for the work- flows. For each decision, I compare: (i)Musketeer’s choice on the first run (with no workflow- specific history), (ii) its choice with incrementally acquired partial history, and (iii) the choice 106 4.6. AUTOMATIC BACK-END EXECUTION ENGINE MAPPING 0% 20% 40% 60% 80% 100% Fraction of workflows + full history + partial history First run Decision tree M us ke te er Best option within ≤ 10% within ≤ 30% > 30% Figure 4.10: Makespan overhead of Musketeer’s automatic mapping decision compared to the best option. Workflow history helps, and Musketeer’s cost function outperforms a simple decision tree. it makes when it has a full history of the per-operator intermediate data sizes. I also compare Musketeer’s choices to those that emerge from a decision tree that a colleague and I developed based on our knowledge of data processing systems. The decision tree considers different back- ends features and known characteristics, and makes mappings accordingly (e.g., workflows with small inputs are executed on single-machine back-ends, graph analysis workflows are executed on specialised graph processing back-ends). I consider a choice that achieves a makespan within 10% of the best option to be “good”, and one within 30% as “reasonable”. In Figure 4.10, I show the results. The decision tree yields many poor choices because it uses simple fixed thresholds based on input data size to choose on which back-end to run workflows. 
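For concreteness, a fixed-threshold mapper of the kind used as a baseline here can be sketched in a few lines; the thresholds and back-end names below are illustrative assumptions, not the exact rules my colleague and I encoded.

    from dataclasses import dataclass

    @dataclass
    class WorkflowInfo:
        # Hypothetical summary of a workflow; Musketeer's real metadata is richer.
        input_size_gb: float
        is_graph_workflow: bool

    def decision_tree_mapping(w: WorkflowInfo) -> str:
        """Illustrative fixed-threshold baseline; not Musketeer's cost function."""
        if w.input_size_gb < 1:
            # Small inputs: a single machine avoids distributed start-up overheads.
            return "GraphChi" if w.is_graph_workflow else "single-machine"
        if w.is_graph_workflow:
            # Graph analyses go to a specialised graph processing back-end.
            return "PowerGraph"
        # Everything else runs on a general-purpose batch engine.
        return "Spark" if w.input_size_gb < 100 else "Hadoop"

    print(decision_tree_mapping(WorkflowInfo(75.0, False)))  # "Spark"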
Moreover, it statically makes decisions ahead of workflow runtime, it does not adjust its decisions at runtime based on intermediate data sizes, and it is unable to accurately predict the benefits of operator merging and shared scans. By contrast, without any knowledge, Musketeer chooses good or optimal back-ends in about 50% of the cases. When partial workflow history is available, over 80% of its choices are good. Musketeer always makes good or optimal choices if each workflow is initially executed operator-by-operator for profiling. I also evaluate Musketeer’s automatic mapping decisions on two new workflows: single-source shortest path (SSSP) and k-means clustering. SSSP can be expressed in vertex-centric systems, while k-means cannot. In Figure 4.11, I show workflow makespan for different back-ends and Musketeer’s automatic choice. The SSSP workflow receives the Twitter graph extended with costs as input, and I use 100M random two dimensional points for k-means (100 clusters) 2. Despite using my simple proof-of-concept cost function and a small training set, Musketeer correctly identifies the appropriate back-end (Naiad) in both cases. 2My k-means uses the CROSS JOIN operator, which is inefficient. By replacing it, I could reduce the makespan and address Spark’s OOM condition. However, I am only interested in the automatic mapping here. CHAPTER 4. MUSKETEER EVALUATION 107 0 600 1200 1800 Makespan [sec] Hadoop Spark Naiad PowerGraph ♣ 10 0 no de s 16 no de s (a) Single-source shortest path (SSSP) 0 1500 3000 Makespan [sec] Hadoop Spark Naiad ♣ (Out of memory) 10 0 no de s (b) k-means Figure 4.11: Makespan of SSSP and k-means on the EC2 cluster (5 iterations). A club (♣) indicates Musketeer’s choice. 4.6.2 Runtime of Musketeer’s automatic mapping Next, I focus on Musketeer’s DAG partitioning algorithms (§3.5.2). I measure the time it takes the exhaustive search and dynamic heuristic algorithms to partition the IR operator DAG. Ide- ally, they should not noticeably affect workflow makespan. In the experiment, I measure partitioning runtime on workflows with an increasing number of operators. The workflows are subsets of an extended version of the Netflix workflow with a total of 18 operators. This workflows affords many operator merging opportunities, thus it makes a good and complex test case for the DAG partitioning algorithms. In Figure 4.12, I show the runtimes for the two algorithms as the number of operators in the workflow increases. The exhaustive search guarantees that the optimal partitioning subject to the cost function is found, but it has exponential complexity. It completes within a second for workflows with up to 13 operators, but beyond, its runtime grows to hundreds of seconds. While it guarantees to find the optimal partitioning subject to the cost function, the exhaustive search runtime can outweigh the makespan reduction resulting from executing the workflow with the optimal partitioning and mapping. By contrast, Musketeer’s dynamic programming heuristic may not always find the best partitioning, but it scales gracefully and runs in under 10ms even beyond 13 operators. In conclusion, Musketeer’s automatic execution engine mapping solution makes good choices quickly on the workflows I tested it on. However, I only tested it in well-controlled environ- ments in which only one workflow runs at a time. Musketeer’s scheduler and cost function might need further refinement in order to make good decisions in shared cluster environments. 
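The gap between the two partitioning algorithms can be illustrated on the simplified case of a linear chain of operators with a made-up cost function; Musketeer's actual partitioner works on general operator DAGs with per-back-end cost estimates, so its dynamic programming heuristic may be approximate, whereas on a plain chain the dynamic programme below is exact.

    from itertools import combinations

    def cost(i, j):
        # Hypothetical cost of running operators i..j (inclusive) as one merged job:
        # a fixed per-job overhead plus a term that grows with the merged span.
        return 10 + (j - i + 1) ** 1.5

    def exhaustive(n):
        """Score all 2^(n-1) contiguous partitionings of a chain of n operators."""
        best = float("inf")
        for k in range(n):  # number of cut points placed in the n-1 gaps
            for cuts in combinations(range(1, n), k):
                bounds = [0, *cuts, n]
                total = sum(cost(bounds[b], bounds[b + 1] - 1)
                            for b in range(len(bounds) - 1))
                best = min(best, total)
        return best

    def dynamic_programme(n):
        """O(n^2): best[j] is the cheapest way to execute operators 0..j-1."""
        best = [0.0] + [float("inf")] * n
        for j in range(1, n + 1):
            for i in range(j):
                best[j] = min(best[j], best[i] + cost(i, j - 1))
        return best[n]

    assert abs(exhaustive(12) - dynamic_programme(12)) < 1e-9

For a 12-operator chain, the exhaustive variant already scores 2,048 candidate partitionings, while the dynamic programme performs only 78 cost evaluations; this mirrors the exponential blow-up visible in Figure 4.12.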
I discuss these challenges and possible future extensions in Chapter 7. 108 4.7. SUMMARY 0 2 4 6 8 10 12 14 16 18 First x operators in NetFlix workflow 10−5 10−4 10−3 0.01 0.1 1 10 100 1000 A lg or ith m ru nt im e [s ec ;l og 10 -s ca le ] exhaustive (all systems) heuristic (all systems) exhaustive (Hadoop only) heuristic (Hadoop only) Figure 4.12: Runtime of Musketeer’s DAG partitioning algorithms when considering the first x operators of an extended version of the Netflix workflow (N.B.: log10-scale y-axis). 4.7 Summary In this chapter, I investigated Musketeer’s ability to efficiently run data processing computations using a range of real-world workflows. My experiments show that Musketeer: 1. Generates efficient code. Compared to time-consuming, hand-tuned implementations, Musketeer-generated code is at most 30% slower, while offering superior portability (§4.2, §4.3). 2. Speeds legacy workflows. Musketeer reduces legacy workflows’ makespan by up to 2× by mapping them to a different back-end execution engine (§4.4). 3. Flexibly combines back-ends. Musketeer finds combinations that outperform all sin- gle back-end alternatives by exploring combinations of multiple execution engines for a workflow (§4.5). 4. Makes good automatic system mappings. Musketeer’s automatic mapping mecha- nism makes good choices based on simple parameters that characterise execution engines (§4.6). Although these results are encouraging, they must not necessarily hold for all versions of the data processing back-ends I use. Many of the performance issues I have highlighted could be addressed in future releases (e.g., Lindi’s non-associative GROUP BY). However, new perfor- mance corner cases could also be introduced as system developers optimise for particular use CHAPTER 4. MUSKETEER EVALUATION 109 cases. Musketeer is well placed to automatically discover these performance corner cases and to benefit developers by decoupling front-end frameworks and back-execution engines, and ex- ecuting the workflows in the most appropriate combination of back-ends. Nonetheless, I believe my work represents only the first step in this promising direction. In Chapter 7, I discuss how Musketeer could be improved, and describe some future challenges. 110 4.7. SUMMARY Chapter 5 Firmament: a scalable, centralised scheduler Modern data-centre clusters comprise of heterogeneous hardware and execute diverse work- loads. These workloads consist of a range of tasks: from short-running interactive tasks that must complete within seconds to long-running service tasks that must meet service level ob- jectives (SLOs). This task diversity and hardware heterogeneity make it challenging for cluster schedulers to achieve high utilisation in increasingly large clusters, while keeping task comple- tion times within seconds for interactive tasks, and meeting SLOs for service tasks. As I explained in Chapter 2, to achieve these the cluster scheduler must: 1. take into account hardware heterogeneity (§2.3.1.1); 2. avoid co-locating interfering tasks (§2.3.1.2); 3. conduct multi-dimensional resource fitting (§2.3.1.3); 4. estimate task resource requirements and reclaim unused resources (§2.3.1.4); 5. obtain high data locality (§2.3.1.5); 6. support placement constraints (§2.3.1.6); and 7. choose task placements with low latency at scale (§2.3.1.7). State-of-the-art cluster schedulers strive to meet these requirements, but in practice they fall short. 
On the one hand, centralised schedulers use complex algorithms that take into account hardware heterogeneity and avoid co-locating interfering tasks. But they take seconds or min- utes to place tasks at scale, and thus fail to meet the placement latency requirements of short- running interactive tasks (see §2.3.2). On the other hand, distributed schedulers use simple algo- rithms to place tasks with low scheduling latency at scale [OWZ+13], but do not choose quality 111 112 5.1. FIRMAMENT OVERVIEW placements because they cannot take into account hardware heterogeneity, do not avoid task co-location interference and do not bin-pack tasks on different resource dimensions [RKK+16]. One of the main contributions of this dissertation is to show that centralised cluster schedulers can choose high-quality placements with low latency at scale. In this chapter, I extend Firma- ment [Sch16, §6], a min-cost flow-based scheduler, and show that with my extensions Firma- ment maintains the same high placement quality as state-of-the-art centralised schedulers and matches the placement latency of distributed schedulers in the common case. Moreover, with my extensions Firmament’s placement latency degrades gracefully even in extreme situations when jobs comprise of tens of thousands of tasks or when clusters are highly utilised. Firmament relies on the same min-cost max-flow optimisation approach as Quincy [IPC+09] does, but my work makes it scalable and generalises the approach to support new scheduling features. In Section 5.1, I introduce Firmament’s architecture. Next, in Section 5.2, I describe Flowlessly, a novel min-cost flow solver I developed for Firmament to offer low task place- ment latency at scale. Following, in Section 5.3, I extend the min-cost flow-based scheduling approach with complex placement constraints, gang scheduling, and I improve task co-location interference avoidance. In Section 5.4, I use the last extension to build a scheduling policy that avoids task interference on cluster networks. Finally, in Section 5.5, I discuss Firmament’s limitations and how these could be addressed in future work. 5.1 Firmament overview Firmament, like Quincy, models the scheduling problem as a min-cost flow optimisation over a flow network. I chose to extend the Firmament min-cost flow-based scheduler for three reasons. First, Firmament considers entire workloads, and thus can support rescheduling and task priority preemption. Second, Firmament achieves high placement quality and, consequently, reduces workflow makespan [Sch16, §8.2]. Third, Firmament amortises the solver runtime work well over many task placements, and hence achieves high task throughput – albeit at a high placement latency. With my extensions, the Firmament scheduler offers two key benefits over the state-of-the-art Quincy min-cost flow scheduler: 1. it chooses placements at low, sub-second latency at scale, sufficiently fast to support interactive and short-running tasks; and 2. it supports complex scheduling features that previously could not be expressed in min- cost flow-based schedulers (e.g., gang-scheduling) or were thought to be too expensive (e.g., resource hogging avoidance). In Figure 5.1, I give an overview of the Firmament scheduler architecture I use in my work. Each machine in the cluster runs a Firmament coordinator process. Coordinators schedule CHAPTER 5. 
FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 113 M as te rc oo rd in at or Task table Knowledge base Resource topology Sc he du le r Min-cost, flow solver Flow graph Scheduling policy M ac hi ne Coordiantor Task Task... ... Coordinator Task Task... ... Coordinator Task Task... Figure 5.1: Firmament’s scheduling policy modifies the flow network according to work- load, cluster, and monitoring data; the flow network is passed to the min-cost flow solver, whose computed optimal flow yields task placements. tasks, monitor tasks, and collect performance counter information and task resource utilisation statistics. All tasks are submitted to the “master coordinator” process which schedules and delegates them to worker coordinator processes that run on worker machines. The worker coordinators are simple task executors that use no-op schedulers to place delegated tasks (i.e., they abide to the placement decisions made by the master coordinator). Upon start-up, each worker coordinator extracts the micro-architectural topology of the machine on which it runs, and submits the topology to the master coordinator. The master combines each machine’s micro-architectural topology into a cluster-wide resource topology that, for example, can include information about the network topology and how machines are grouped into racks. Worker coordinators also automatically collect performance counter information and resource utilisation statistics of running tasks. They send these statistics to the master coordinator which aggregates and stores them in a knowledge base. Firmament also stores in the knowledge base task-specific information for periodic workflows1. It constructs profiles that contain information about tasks’ suitability to run on different types of hardware and about key characteristics (e.g., cache working set size, network bandwidth usage). The master coordinator’s scheduler can use the task profiles and the cluster resource topology in scheduling policies to avoid task co-location interference and to take into account hardware heterogeneity when choosing placements. Firmament generalises min-cost flow-based scheduling over the single, batch-oriented policy 1Many data-centre workflows run periodically [AKB+12b], and tasks executed by different instances of a peri- odic workflow have similar resource usage patterns. 114 5.2. FLOWLESSLY: A FAST MIN-COST FLOW SOLVER proposed by Quincy. Cluster administrators can use a policy API to configure Firmament’s scheduling policy, which for example, may incorporate multi-dimensional resources, fairness, and priority preemption [Sch16, Appendix C]. The scheduling policy defines a flow network that models the cluster using nodes to represent tasks and machine micro-architectural topolo- gies. The policy can also use task profiles to encode in the flow network tasks’ preferences for particular resources, and to give hints on where tasks should be placed to reduce interference. Whenever cluster events have changed the flow network, Firmament submits the flow network to a min-cost flow solver that finds an optimal (i.e., min-cost) flow. Finally, after the solver completes, Firmament extracts the implied task placements from the optimal flow and enacts the necessary changes in the cluster. Firmament continuously monitors cluster events (e.g., task completions, machine failures), and updates the flow network. However, Firmament cannot react to these events while the solver runs because min-cost solvers are not incremental. 
Yet, as soon as the solver completes, Firma- ment modifies the graph according to its scheduling policy in response to monitoring informa- tion and events that occurred while the solver was running. Following, Firmament reruns the solver to compute the new optimal flow. In a busy cluster (i.e., with many task and machine events), the solver runs almost continuously. The task placement latency of Firmament and other min-cost flow schedulers is dominated by the runtime of the min-cost flow optimisation algorithms they use. In following sections, I describe several techniques that I developed to reduce min-cost flow algorithm runtime and overall placement latency. 5.2 Flowlessly: a fast min-cost flow solver To help Firmament choose high-quality task placements requires to efficiently solve an algo- rithmically challenging min-cost flow problem at the lowest possible latency. To achieve this, I studied several min-cost flow optimisation algorithms and their performance (§5.2.1). The key insight I discovered is that min-cost flow algorithms can be fast for the cluster scheduling problem: (i) if they match the problem structure well, or (ii) if few changes to cluster state occur while the algorithm runs. Based on this insight I implemented Flowlessly, a new min-cost flow solver that supports four min-cost flow algorithms. Each algorithm has edge cases in which it fails to place tasks with low latency (§5.2.2), but I investigate three techniques to reduce Flowlessly’s runtime in such situ- ations. First, I consider if solver runtime reductions are achievable with approximate min-cost flow solutions, but find that they generate unacceptably poor and volatile placements (§5.2.3). Second, I incrementalise min-cost flow algorithms, and discover that re-optimising previous so- lutions decreases solver runtime (§5.2.4). Third, I discuss two heuristics that are specific to the flow networks min-cost flow schedulers generate, but reduce solver runtime (§5.2.5). I com- bine these techniques with several other improvements to further reduce Flowlessly runtime. CHAPTER 5. FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 115 Flowlessly runs two min-cost flow algorithms concurrently to avoid slowdown in edge cases (§5.2.6). Finally, I discuss two novel algorithms I developed to optimise the interaction between Flow- lessly and Firmament. These algorithms are not min-cost flow algorithms as such, but they efficiently update the flow network in response to cluster state changes and quickly extract task placements from the optimal flow (§5.2.7). 5.2.1 Min-cost flow algorithms Amin-cost flow algorithm takes a directed flow networkG= (N,A) as input. Each arc (i, j)∈ A has a cost ci j, a minimum flow requirement li j and a maximum flow capacity ui j (see Figure 5.2). Moreover, each flow network node i ∈ N has an associated supply b(i); nodes with positive supply are sources, those with negative supply are sinks. Informally, min-cost flow algorithms must optimally (i.e., with smallest cost) route the flow from all sources (i.e., b(i) > 0) to sinks (i.e., b(i) < 0) while meeting the minimum flow re- quirements and without exceeding the maximum flow capacity on any arc. For example, for networks generated using scheduling policies the flow must be routed from task nodes to the sink node, which has a flow demand equal to the total number of tasks. Formally, the goal of a min-cost flow algorithm is to find a flow f that minimises Eq. 5.1, while respecting the flow feasibility constraints of mass balance (Eq. 
5.2) and capacity (Eq. 5.3):

\[
\text{Minimise} \quad \sum_{(i,j) \in A} c_{ij} f_{ij} \quad \text{subject to} \tag{5.1}
\]
\[
\sum_{k:(j,k) \in A} f_{jk} - \sum_{i:(i,j) \in A} f_{ij} = b(j), \quad \forall j \in N \tag{5.2}
\]
\[
l_{ij} \leq f_{ij} \leq u_{ij}, \quad \forall (i,j) \in A \tag{5.3}
\]

A flow that satisfies the capacity constraints but not the mass balance constraints is called a pseudoflow. Some algorithms use an equivalent definition of the flow network called the residual network ($G'(f)$). In the residual network, each arc $(i,j) \in A$ with cost $c_{ij}$, flow requirement $l_{ij}$ and maximum capacity $u_{ij}$ is replaced by two arcs: $(i,j)$ and $(j,i)$. Arc $(i,j)$ has cost $c_{ij}$, a flow requirement of $l'_{ij} = \max(l_{ij} - f_{ij}, 0)$, and a residual capacity of $r_{ij} = u_{ij} - f_{ij}$. Arc $(j,i)$ has cost $-c_{ij}$, a zero flow requirement, and a residual capacity of $r_{ji} = f_{ij}$ (see Figure 5.3). The feasibility constraints also apply in the residual network.

The primal minimisation problem (Eq. 5.1) also has an associated dual problem, which some algorithms solve more efficiently. In the dual min-cost flow problem, each node $i \in N$ has an associated dual variable $\pi(i)$ called its potential (Figure 5.4). Moreover, each arc has a reduced cost with respect to the node potentials, defined as:

\[
c^{\pi}_{ij} = c_{ij} - \pi(i) + \pi(j), \quad \forall (i,j) \in A \tag{5.4}
\]

Figure 5.2: Each flow arc is directed and has an associated cost $c_{ij}$, a minimum flow requirement $l_{ij}$, and a maximum capacity $u_{ij}$.

Figure 5.3: In the residual network, each arc $(i,j)$ has a residual capacity $r_{ij} = u_{ij} - f_{ij}$, and a reverse arc $(j,i)$ with $r_{ji} = f_{ij}$ and $c_{ji} = -c_{ij}$.

Figure 5.4: The reduced cost of an arc is the sum of its cost and the difference of its node potentials.

The reduced cost does not change the cost of any directed cycle $W$ in the flow network, because the following equality holds:

\[
\sum_{(i,j) \in W} c^{\pi}_{ij} = \sum_{(i,j) \in W} (c_{ij} - \pi(i) + \pi(j)) = \sum_{(i,j) \in W} c_{ij} + \sum_{(i,j) \in W} (\pi(j) - \pi(i)) = \sum_{(i,j) \in W} c_{ij}
\]

In the equality, $\sum_{(i,j) \in W} (\pi(j) - \pi(i))$ reduces to zero because each cycle node's potential $\pi(j)$ occurs with a positive sign for arc $(i,j)$ and with a negative sign for the next arc in the cycle, $(j,k)$. Similarly, for each directed path $P$ from node $i$ to node $j$, the following equality holds:

\[
\sum_{(u,v) \in P} c^{\pi}_{uv} = \sum_{(u,v) \in P} (c_{uv} - \pi(u) + \pi(v)) = \sum_{(u,v) \in P} c_{uv} + \sum_{(u,v) \in P} (\pi(v) - \pi(u)) = \sum_{(u,v) \in P} c_{uv} - \pi(i) + \pi(j)
\]

With the exception of node $i$ and node $j$, all nodes in the path $P$ occur as both a source and a destination of an arc. Thus, the contribution their potentials make to the reduced cost is zero.

The two equalities above can be used to show that the dual min-cost flow problem is equivalent to the primal problem [AMO93, §2.4]. In contrast to the primal problem, which minimises $\sum_{(i,j) \in A} c_{ij} f_{ij}$, the goal of the dual problem is to:

\[
\text{Maximise} \quad \sum_{i \in N} b(i)\,\pi(i) + \sum_{(i,j) \in A} c^{\pi}_{ij} f_{ij} \tag{5.5}
\]

subject to the flow feasibility constraints of mass balance (Eq. 5.2) and capacity (Eq. 5.3). Regardless of whether a min-cost flow algorithm solves the primal or the dual problem, the algorithm completes when it finds an optimal feasible flow. A feasible flow is optimal if and only if it satisfies at least one of the following three optimality conditions [AMO93, §9.3].

Theorem 5.2.1 (Negative cycle optimality conditions).
A feasible flow $f$ is an optimal flow of minimum cost if and only if the residual network $G'(f)$ contains no directed negative-cost cycle.

Intuitively, if $f$ is a feasible flow and $G'(f)$ contains a directed negative-cost cycle, then $f$ cannot be optimal because a feasible flow with a smaller cost can be obtained by increasing the flow along the negative-cost cycle.

Theorem 5.2.2 (Reduced cost optimality conditions). A feasible flow $f$ is an optimal flow of minimum cost if and only if there exists a set of node potentials $\pi$ such that there are no arcs in the residual network $G'(f)$ with negative reduced cost.

Intuitively, if $f$ is a feasible flow and $G'(f)$ contains only non-negative reduced cost arcs, then the reduced cost of every directed cycle $W$ in $G(f)$ is $\sum_{(i,j) \in W} c^{\pi}_{ij} \geq 0$. The residual network $G'(f)$ therefore contains no negative-cost cycle, which entails that Theorem 5.2.1 holds and that $f$ is optimal.

Theorem 5.2.3 (Complementary slackness optimality conditions). A feasible flow $f$ is an optimal flow of minimum cost if and only if there is a set of node potentials $\pi$ such that the reduced arc costs and flows satisfy the following conditions for every arc $(i,j) \in A$:

\[
\text{If } c^{\pi}_{ij} > 0 \text{, then } f_{ij} = l_{ij} \tag{5.6a}
\]
\[
\text{If } l_{ij} < f_{ij} < u_{ij} \text{, then } c^{\pi}_{ij} = 0 \tag{5.6b}
\]
\[
\text{If } c^{\pi}_{ij} < 0 \text{, then } f_{ij} = u_{ij} \tag{5.6c}
\]

Informally, if any of the complementary slackness conditions is violated, then the other optimality conditions cannot hold either. Consider the following three cases, which cover all possible reduced cost values for any arc $(i,j) \in A$:

• Case 1: If $c^{\pi}_{ij} > 0$, then $c^{\pi}_{ji} < 0$ according to the definition of the residual network. But $(j,i) \notin G'(f)$, because otherwise it would not respect the reduced cost optimality conditions. Therefore, the arc's flow must be equal to its minimum flow requirement (i.e., $f_{ij} = l_{ij}$).

• Case 2: If $l_{ij} < f_{ij} < u_{ij}$, then both $(i,j)$ and $(j,i) \in G'(f)$. These inequalities and the reduced cost optimality conditions require that both $c^{\pi}_{ij} \geq 0$ and $c^{\pi}_{ji} \geq 0$. But zero is the only value that respects both conditions, because $c^{\pi}_{ij} = -c^{\pi}_{ji}$ in the residual network $G'(f)$.

• Case 3: If $c^{\pi}_{ij} < 0$, then arc $(i,j) \notin G'(f)$, because otherwise it would not respect the reduced cost optimality conditions. Therefore, the arc's flow must be equal to its capacity (i.e., $f_{ij} = u_{ij}$).

Substantial research effort has gone into min-cost flow algorithms. All the existing algorithms work as a series of iterations in which they either: (i) maintain flow feasibility and work towards optimality by finding a flow that respects one of the above-mentioned types of optimality conditions, or (ii) refine a flow that respects the optimality conditions until the flow is feasible. I now describe several competitive min-cost flow algorithms to lay the groundwork for the following sections, in which I explain why I implemented certain algorithms in Flowlessly, and how I optimised these algorithms.

Cycle canceling. First, cycle canceling uses a max-flow algorithm to find a feasible, but not necessarily optimal flow [Kle67]. Following, it performs a series of iterations during which it maintains flow feasibility and attempts to achieve optimality. In each iteration, cycle canceling augments flow along negative-cost directed cycles in the residual graph, which reduces the overall solution cost.
The algorithm finds a feasible optimal flow when no negative-cost cycles exist in the graph (i.e., the negative cycle optimality conditions are met). Minimum mean cycle canceling. Minimum mean cycle canceling, like cycle canceling, first uses a max-flow algorithm to find a feasible flow, and runs a series of iterations during which it maintains flow feasibility [GT89]. In contrast to cycle canceling, minimum mean cycle cancel- ing does not augment flow along any negative-cost cycle, but along the cycle with the smallest mean cost in the residual network. The algorithm completes when there are no cycles with a mean negative cost left in the flow network. Successive shortest path. Unlike previous algorithms, successive shortest path [AMO93, p. 320] does not maintain a feasible flow in each iteration. Instead, it maintains a pseudoflow (a flow that satisfies the capacity constraints, but does not satisfy the mass balance constraints) that satisfies the reduced cost optimality conditions. Successive shortest path repeatedly selects a source node, finds the path with the smallest cost in the residual network from the source to CHAPTER 5. FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 119 the sink, and augments flow along this path. The algorithm completes with an optimal flow when there are no source nodes left in the flow network (i.e., the flow satisfies the mass balance constraints). Primal-dual. Like successive shortest path algorithm, the primal-dual algorithm maintains a pseudoflow that satisfies the reduced cost optimality conditions [FF57]. The algorithm first transforms the network flow G= (N,A) to an equivalent single source and single sink network. It introduces a new source node src with a supply equal to the sum of supplies of prior source nodes (i.e., b(src) = ∑ b(i)>0 b(i)), and connects this node to prior source nodes with zero cost and b(i) capacity arcs. Similarly, the algorithm introduces a new sink node dst with a supply b(dst) = ∑ b(i)<0 b(i), and connects it to all prior sink nodes with zero cost and −b(i) capacity arcs. Unlike successive shortest path, which iteratively augments flow along the path with the small- est cost, primal-dual simultaneously augments flow along several reduced cost paths. In each iteration, the algorithm computes for each node i the smallest reduced cost d(i) to reach the node i from the source node src. Following, it subtracts d(i) from pi(i) for each node i, and thus creates at least a path of zero reduced cost arcs between the source node and the sink node. Primal-dual simultaneously augments flow along several paths by computing the maximum flow on the residual network of zero reduced cost, called the admissible network. Each primal-dual iteration reduces the supply at the source node src (i.e., improves feasibility) without breaking the reduced cost optimality conditions (i.e., pseudoflow optimality is maintained). Primal-dual completes with an optimal feasible flow after it pushed all the supply from the source node src to the sink node dst. Relaxation. The relaxation algorithm [BT88a; BT88b], like successive shortest path, main- tains a flow that satisfies the reduced cost optimality conditions, and augments flow from source nodes along the shortest path to the sink. However, unlike successive shortest path, relaxation optimises the dual problem by applying one of the following two changes when possible: 1. 
Keeping pi unchanged, the algorithm modifies the flow, f , to f ′ such that f ′ still respects the reduced cost optimality condition and the total supply decreases (i.e., feasibility im- proves). 2. It modifies pi to pi ′ and f to f ′ such that f ′ is still a reduced cost-optimal solution and the cost of that solution decreases (i.e., total cost decreases). This allows relaxation to decouple the improvements in feasibility from reductions in total cost. When relaxation has the possibility of reducing cost or improving feasibility, it chooses to reduce cost. 120 5.2. FLOWLESSLY: A FAST MIN-COST FLOW SOLVER Capacity scaling. Capacity scaling maintains a pseudoflow that satisfies the reduced cost optimality conditions. The algorithm converts this pseudoflow into an optimal feasible flow in a series of ∆ capacity scaling phases. In each ∆-scaling phase, the algorithm augments paths that can carry exactly ∆ units of flow. The algorithm halves ∆ and starts another scaling phase when there are no source nodes src with b(src) ≥ ∆, or no sink nodes dst with b(src) ≤ −∆ left. Capacity scaling completes when there are no nodes with flow supply or demand left in the network and ∆ = 1. The output flow is optimal because it satisfies the reduced cost optimality conditions and because there is no supply left in the network (i.e., flow satisfies mass balance constraints). The capacity scaling algorithm augments paths with sufficiently large flow; it augments at most M ∗ log(U) paths, where M is the number of arcs [EK72]. By contrast, the successive shortest path and the relaxation algorithms might augments paths with small flow amounts. In the worst case, successive shortest path augments up to N ∗U paths, where N is the number of nodes and U is the largest arc capacity. Cost scaling. Cost scaling [GT90; GK93; Gol97] iterates to reduce flow cost while maintain- ing feasibility. It uses a relaxed complementary slackness condition called ε-optimality. A flow is ε-optimal if the flow on arcs with cpii j > ε is zero and there are no arcs with residual capacity, and cpii j < −ε . Initially, the algorithm sets ε to the maximum arc cost. In each iteration, the algorithm relabels nodes (i.e., adjusts node potential to change reduced arc cost) and pushes flow along arcs with cpii j <−ε in order to achieve ε-optimality. Once the flow is ε-optimal, the algorithm divides ε by a constant factor and starts another iteration. Cost scaling completes when it achieves 1n -optimality, which guarantees that the complementary slackness optimality conditions are satisfied [Gol97]. Double scaling. The double scaling algorithm combines ideas from the capacity scaling and cost scaling algorithms [AGO+92]. Like cost scaling, double scaling refines ε in a series of iter- ations in which it achieves ε-optimality. However, instead of relabelling nodes and pushing flow from them, the double scaling algorithm uses ∆ capacity scaling phases to achieve ε-optimality. The algorithm completes when ε = 1n , ∆= 1 and there are no supply and sink nodes left in the network. Enhanced capacity scaling. The enhanced capacity scaling algorithm uses the same ap- proach as the capacity scaling algorithm [Orl93]. The algorithm conducts a series of ∆ ca- pacity scaling phases starting from a ∆ = max(ui j). 
Unlike capacity scaling, which refines $\Delta$ until $\Delta = 1$, the enhanced capacity scaling algorithm takes advantage of one key network property: any arc $(i,j)$ is guaranteed to have a positive flow in the remaining $\Delta$-scaling phases if $f_{ij} \geq 8N\Delta$ (proof in [AMO93, §10.7]). In each $\Delta$-scaling phase, the algorithm: (i) discovers subgraphs consisting only of such arcs, (ii) pushes flow such that there is at most one node $i$ with flow supply $b(i) \neq 0$ in each such subgraph, and (iii) augments flow from supply nodes to sink nodes along paths on which it can push $\Delta$ units of flow. The algorithm completes with an optimal flow when there are no supply or sink nodes left in the network (i.e., the mass balance constraints are satisfied).

Time complexities. In Table 5.1, I summarise the worst-case complexities of the algorithms I discussed.

    Algorithm                       Worst-case complexity
    Cycle canceling*                O(NM²CU)
    Minimum mean cycle canceling    O(N²M² log(NC))
    Successive shortest path*       O(N²U log(N))
    Primal-dual                     O(NU(M + N log(N) + NM log(N)))
    Relaxation*                     O(M³CU²)
    Capacity scaling                O((M² + MN log(N)) log(U))
    Cost scaling*                   O(N²M log(NC))
    Double scaling                  O(NM log(U) log(NC))
    Enhanced capacity scaling       O(M log(N)(M + log(N)))

Table 5.1: Worst-case time complexities for the min-cost flow algorithms I describe. $N$ is the number of nodes, $M$ the number of arcs, $C$ the largest arc cost and $U$ the largest arc capacity. In the flow networks generated by scheduling policies, $M > N > C > U$. The algorithms I implemented in Flowlessly are marked with an asterisk.

The complexities suggest that successive shortest path ought to offer competitive runtimes. Its worst-case complexity is better than the complexity of cycle canceling, minimum mean cycle canceling and primal-dual. Moreover, the algorithm should complete faster than capacity scaling and double scaling on the flow networks generated by scheduling policies, in which $M > N > C > U$ (i.e., they have more arcs than nodes, the number of nodes is greater than the maximum arc cost, which in turn is higher than the maximum arc capacity).

In the experiment, I subsample the Google trace and replay it for simulated clusters of different sizes (as in Figure 2.19) to measure min-cost flow algorithm runtime. I use the Quincy scheduling policy for batch jobs, and prioritise service jobs over batch jobs. In Figure 5.5, I plot the average runtime for each min-cost flow algorithm I consider. One might expect successive shortest path to do best because it has the lowest worst-case complexity, but it does not outperform all algorithms; even on a relatively small cluster of 1,250 machines, its runtime exceeds 100 seconds. However, the relaxation algorithm, which has the highest worst-case time complexity of the algorithms I consider, performs best in practice. It outperforms cost scaling, which Quincy's solver uses, by two orders of magnitude. On average, it makes placements in under 200ms even on a full-size cluster of 12,500 machines.

One key reason for this perhaps surprising performance is that the relaxation algorithm does minimal work when most scheduling choices are straightforward. This is the case when the destinations for tasks' flow are uncontested, i.e., when few new tasks have arcs to the same location and attempt to schedule there. In this situation, relaxation manages to route most of the flow supply in almost a single pass over the flow network.
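To make the flow networks discussed above concrete, the sketch below encodes a highly simplified scheduling network (three tasks, two machines, an unscheduled aggregator and the sink) and solves it with networkx's network-simplex solver; networkx merely stands in for Flowlessly here, and the costs, capacities and node names are illustrative.

    import networkx as nx

    G = nx.DiGraph()
    tasks = ["T0", "T1", "T2"]
    machines = ["M0", "M1"]

    for t in tasks:
        G.add_node(t, demand=-1)            # each task supplies one unit of flow
    G.add_node("S", demand=len(tasks))      # the sink drains one unit per task

    for t in tasks:
        G.add_edge(t, "U", weight=10, capacity=1)  # leaving a task unscheduled is costly
    for i, m in enumerate(machines):
        for t in tasks:
            # Illustrative per-machine placement costs (e.g. locality or interference).
            G.add_edge(t, m, weight=1 + i, capacity=1)
        G.add_edge(m, "S", weight=0, capacity=2)   # two free task slots per machine
    G.add_edge("U", "S", weight=0, capacity=len(tasks))

    flow = nx.min_cost_flow(G)              # the optimal flow implies the placements
    placements = {t: dst for t in tasks
                  for dst, f in flow[t].items() if f > 0}
    print(placements)                       # e.g. {'T0': 'M0', 'T1': 'M0', 'T2': 'M1'}

The high cost on the unscheduled aggregator's arcs plays the same role as in the Quincy-style policies: tasks only remain unscheduled when no cheaper placement exists.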
FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 123 91 92 93 94 95 96 97 98 99 100 Cluster slot utilization [%] 0 50 100 150 200 250 300 350 400 450 A lg or ith m ru nt im e [s ec ] Relaxation Cost scaling Figure 5.6: Close to full cluster slot utilisation, relaxation runtime increases dramatically, while cost scaling is unaffected: the x-axis shows the utilisation after scheduling jobs of increasing size to a 90%-utilised cluster. T0,0 T0,1 T0,2 T1,0 T1,1 X M0 M1 M2 M3 S U0 U1 5 5 5 7 7 0 0 0 1running: 0 Figure 5.7: Load-spreading policy with single cluster aggregator (X) and costs proportional to number of tasks per machine. 5.2.2 Edge cases for relaxation Relaxation is fast in the common-case setup I described above, but there are cluster setups when relaxation is slow. For example, it can perform poorly on flow networks of highly loaded or oversubscribed clusters, situations that happen in clusters that run batch processing tasks [BEL+14; RKK+16]. In Figure 5.6, I illustrate this: I push the simulated Google cluster closer to oversubscription. I take a snapshot of the cluster at 90% slot utilisation (i.e., 90% of task slots already run tasks), and I submit increasingly larger jobs until all cluster slots are utilised and some tasks have to queue to execute. Like in my previous experiments, I use the Quincy scheduling policy. The relaxation runtime increases rapidly, and at approximately 93% cluster slot utilisation, it exceeds that of cost scaling (the second best algorithm), growing to over 400s in the oversubscribed case. Additionally, some scheduling policies generate challenging flow networks that inherently cre- 124 5.2. FLOWLESSLY: A FAST MIN-COST FLOW SOLVER 0 1000 2000 3000 4000 5000 Tasks in arriving job 0 5 10 15 20 25 30 35 40 45 A lg or ith m ru nt im e [s ec ] Relaxation Cost scaling Figure 5.8: Artificial machine popularity slows down the relaxation algorithm: on cluster with a load-spreading scheduling policy, relaxation runtime exceeds that of cost scaling at just under 3,000 concurrently arriving tasks. ate few preferred destinations for most tasks’ flow supply, and thus challenge the relaxation algorithm. Consider, for example, a simple load-spreading policy that balances the number of tasks on each cluster machine. In Figure 5.7, I show an example flow network generated by this policy for a small cluster of four machines and five tasks. Task nodes only connect a cluster- wide aggregator node (X) and to unscheduled aggregator nodes (U j). The cluster aggregator X connects to each machine node with arcs that have cost proportional to the number of tasks running on each machine (e.g., one task on M3). Thus, the number of tasks on a machine only increases once all other machines have at least as many tasks. This policy makes nodes of machines with few occupied slots a popular destination for flow. I illustrate the effect the artificial machine popularity created by the load-spreading policy has on min-cost flow algorithms runtime with an experiment. I submit a single job with an increasing number of tasks to create more unscheduled tasks that favour executing on machines with few occupied slots. I measure how long it takes each algorithm to compute the optimal flow. This experiment simulates the rare-but-important arrival of very large jobs: for example, 1.2% of jobs in the Google trace have over 1,000 tasks, and some even over 20,000 tasks. Figure 5.8 shows that relaxation’s runtime increases linearly in the number of tasks. 
Relaxation exceeds cost scaling’s runtime when faced with over 3,000 new tasks. To make matters worse, a single overlong min-cost flow algorithm run can have a devastating effect on long-term task placement latency. While the algorithm executes, many tasks can finish and free slots, but these slots are not used until the next algorithm run completes because only the next run optimises over a flow network that encodes these slots as available. Thus, Firmament artificially creates flow networks with more task flow supply contention compared to if it were to quickly utilise recently freed slots. Moreover, Flowlessly may be again faced with many tasks when it runs again because more tasks are submitted during a long solver run. CHAPTER 5. FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 125 1 10 100 1000 10000 100000 Task events per time interval 0.0 0.2 0.4 0.6 0.8 1.0 500 ms 1000 ms 5000 ms 20000 ms Figure 5.9: CDF of the number of scheduling events per various time intervals in the Google trace. The faster a scheduler runs the fewer tasks it must place. I analyse the workload in the public Google cluster trace to discover how many tasks the sched- uler must handle depending on how fast it runs. I divide the 30-day trace into time windows of different sizes and count the number of task events (e.g., task submissions, failures, comple- tions etc.) that occur in each time window. In Figure 5.9, I show CDFs of event count per time window for window sizes between 0.5s and 20s. A scheduler that completes a min-cost flow optimisation every 0.5s must process fewer than ten events in over 60% of cases, and fewer than 100 events in 95% of cases. If the scheduler runs every 20s, however, it must process over 500 task events in the median, and about 5,000 events in the 95th percentile. This highlights a vicious cycle: the quicker the scheduler completes, the less work it has to do in each scheduler run. Few long min-cost flow algorithm runs may have a disastrous effect: many task events accumulate while the algorithm runs, next run may take even longer, and thus the scheduler may even fail to ever recover to low task placement latencies. I now describe several techniques I developed to address these rare-but-important situations. 5.2.3 Approximate min-cost flow solutions Min-cost flow algorithms return an optimal solution. For the cluster scheduling problem, how- ever, an approximate solution may well suffice. Some schedulers use approximate solutions, albeit for different scheduling approaches. TetriSched [TZP+16], which uses a mixed-integer linear programming solver, and Paragon [DK13], which use collaborative filtering, terminate their solution search after a bounded amount of time. In this section, I investigate how good the solutions are if I terminate cost scaling and relaxation early. My hypothesis is that the al- gorithms may spend a long time on minor refinements to the solution with little impact on the overall task placements. I run an experiment in which I use a highly utilised cluster (same setup as in Figure 5.6) to 126 5.2. FLOWLESSLY: A FAST MIN-COST FLOW SOLVER 0 20 40 60 80 100 120 140 Algorithm runtime [sec] 0 1000 2000 3000 4000 5000 6000 Ta sk m is pl ac em en ts Relaxation Cost scaling Figure 5.10: Approximate min-cost flow yields poor solutions, since many tasks are mis- placed until shortly before the algorithms reach the optimal solution. The simulated cluster is at about 96% slot utilisation. 
investigate relaxation and cost scaling, but the results generalise to other setups as well. In Figure 5.10, I show the number of “misplaced” tasks as a function of how early I terminate the algorithm. I treat any task as misplaced if (i) it is preempted in the approximate solution but keeps running in the optimal one; (ii) it is scheduled on a different machine to where it is scheduled in the optimal solution; or (iii) it is left unscheduled but placed on a machine in the optimal solution. Both relaxation and cost scaling have thousands of misplaced tasks when terminated early, even before the final iteration that completes at 45s (cost scaling) and 142s (relaxation). The others algorithms I implemented in Flowlessly would misplace even more tasks when terminated early. Cycle canceling spends most of its runtime executing a max-flow algorithm to compute a feasible flow. If the algorithm is terminated early then it outputs a pseudoflow which misplaces at least as many tasks as much flow supply it has not routed. Similarly, early termination of the successive shortest path algorithm produces many task misplacements because the algorithm gradually improves pseudoflow feasibility by routing flow from source nodes to sink nodes. I thus conclude that early termination appears not to be a viable scalability optimisation for flow-based schedulers. 5.2.4 Incremental min-cost flow algorithms One of my key insights is that cluster state does not change dramatically between subsequent scheduling runs even when scheduling tasks on large clusters. In a typical Google cluster, fewer than 100 task events happen in 95% of 500ms long time intervals (see Figure 5.9). Nonetheless, min-cost flow algorithms use a sledgehammer approach: they run from scratch over the entire flow network regardless of how many cluster events occur between runs. CHAPTER 5. FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 127 Min-cost flow algorithms might complete faster if they can reuse existing graph state and incre- mentally adjust the flow computed on the previous run. But for this to happen, both Firmament and Flowlessly must work incrementally. I adjust Firmament to collect scheduling events (e.g., task submissions, machine failures) while Flowlessly runs, and to submit only graph changes to Flowlessly. Moreover, I change Flowlessly’s algorithms to apply these graph changes on the latest flow solution, and to incrementally compute the new optimal flow. In this section, I describe the changes I made to incrementalise min-cost flow algorithms, and provide some intuition for which algorithms are suitable for incremental use. All cluster events ultimately reduce to three different types of graph changes in the flow net- work: 1. Supply changes at nodes when arcs or nodes that previously carried flow are removed (e.g., due to machine failure), or when nodes with supply are added (e.g., at task submis- sion). 2. Capacity changes on arcs (e.g., due to machines failing or joining the cluster). Note that arc additions and removals can also be modelled as capacity changes from and to zero capacity arcs. 3. Cost changes on arcs when the desirability of routing flow via some arcs changes; when these happen exactly depends on the scheduling policy (e.g., load on a resource has changed, a task has waited for a long time to run). 
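These three change types are the only inputs the incremental solver needs between runs; a minimal sketch of how such deltas might be represented and batched while the solver is busy is shown below (the class and method names are hypothetical, not Flowlessly's actual interface).

    from dataclasses import dataclass
    from typing import List, Union

    @dataclass
    class SupplyChange:        # e.g. task submitted or completed, node added/removed
        node: str
        delta: int             # change in b(node)

    @dataclass
    class CapacityChange:      # e.g. machine joined or failed; add/remove arc == to/from 0
        src: str
        dst: str
        new_capacity: int

    @dataclass
    class CostChange:          # policy-driven, e.g. resource load or task wait time changed
        src: str
        dst: str
        new_cost: int

    GraphChange = Union[SupplyChange, CapacityChange, CostChange]

    class DeltaBuffer:
        """Accumulates graph changes while the solver runs; drained before the next run."""
        def __init__(self) -> None:
            self._pending: List[GraphChange] = []

        def record(self, change: GraphChange) -> None:
            self._pending.append(change)

        def drain(self) -> List[GraphChange]:
            batch, self._pending = self._pending, []
            return batch

    # Sketch of the scheduling loop: apply only the deltas, then re-optimise incrementally.
    # while True:
    #     changes = buffer.drain()
    #     solver.apply(changes)
    #     placements = solver.reoptimise()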
Changes that adjust node supply, arc capacity or cost can invalidate flow feasibility and optimality, but many min-cost flow algorithms require the flow to be either optimal or feasible before each internal iteration. In Table 5.2, I summarise these requirements. Some algorithms (e.g., cycle canceling, cost scaling) require the flow to be feasible at each step and work towards achieving one of the equivalent types of optimality: negative cycle optimality, reduced cost optimality or complementary slackness optimality. Other algorithms (e.g., successive shortest path, relaxation) require the flow to be optimal, but not necessarily feasible, before each step. These algorithms push flow supply to the sink to obtain more nodes that satisfy the mass balance constraints, adjust flow on arcs such that more arcs satisfy the capacity constraints, and ultimately find a feasible flow. The incremental solution requires the flow to be both optimal and feasible, because an infeasible flow would evict tasks or leave tasks unscheduled, and a non-optimal solution would misplace tasks.

                                 Feasibility                          Optimality
    Algorithm                    Capacity cnstr.  Mass balance cnstr. Reduced cost  ε-optimality
    Cycle canceling*             ✓                ✓                   –             –
    Min mean cycle canceling     ✓                ✓                   –             –
    Successive shortest path*    ✓                –                   ✓             –
    Primal-dual                  ✓                –                   ✓             –
    Relaxation*                  ✓                –                   ✓             –
    Capacity scaling             ✓                –                   ✓             –
    Cost scaling*                ✓                (✓)†                –             ✓
    Double scaling               ✓                ✓                   –             ✓
    Enhanced capacity scaling    ✓                –                   ✓             –

Table 5.2: Algorithms – the ones implemented in Flowlessly are marked with an asterisk – have different preconditions for each internal iteration. Some algorithms expect both types of feasibility constraints to be satisfied, while others only require the reduced cost optimality conditions to be satisfied. Double scaling expects capacity constraints, mass balance constraints and ε-optimality, making it difficult to incrementalise. Cost scaling requires the flow to satisfy the mass balance constraints (†), but a modified version exists that drops this requirement at the price of a worse worst-case complexity; I incrementalised this version of cost scaling in Flowlessly.

Graph changes must be handled differently for each algorithm, depending on the preconditions the algorithm expects to hold before each internal iteration:

• For algorithms that require the mass balance constraints to be satisfied before each iteration, the flow supply resulting from supply changes (e.g., node addition, node removal) must be routed to sink nodes. If the resulting graph has many supply nodes, then the supply can be pushed with a max-flow algorithm; otherwise, the supply can be iteratively pushed along the shortest path route. Arc cost and capacity changes do not break the mass balance constraints, unless arcs are removed. Each arc removal can be treated as two supply changes that result from draining the arc's flow: (i) an increase in the arc's source node supply, and (ii) a decrease in the arc's destination node supply.

• For algorithms that require the flow to be feasible (i.e., to satisfy capacity and mass balance constraints) before each iteration, two types of arc changes can break capacity constraints: (i) changes that decrease arc capacities, and (ii) changes that increase minimum arc flow requirements – additions of arcs with minimum flow requirements can be treated as increases in minimum flow. The flow on the affected arcs can be adjusted to satisfy the capacity constraints, but as a result, the arcs' source node and destination node supplies change, which breaks the mass balance constraints.
However, a maximum-flow algo- rithm can be used to adjust the flow such that it satisfies the mass balance constraints. • Some algorithms (e.g., successive shortest path, primal-dual) require the flow to satisfy both capacity constraints and the reduced cost optimality conditions before each it- eration. For these algorithms, the capacity changes must be first applied and the flow adjusted such that it continues to satisfy the capacity constraints. Yet, reduced cost op- timality conditions may be broken by capacity changes and arc cost changes. These changes can be applied, but reduced cost optimality conditions must be re-established by pushing or draining flow on the affected arcs, which causes supply changes. However, supply changes do not break capacity constraints or reduced cost optimality conditions, and thus no additional actions must be taken. CHAPTER 5. FIRMAMENT: A SCALABLE, CENTRALISED SCHEDULER 129 Quincy Load-spreading Scheduling policy 0 10 20 30 40 50 60 A lg or ith m ru nt im e [s ec ] Cost scaling Incremental cost scaling Figure 5.11: Incremental cost scaling is 25% faster compared to from-scratch cost scaling for the Quincy policy and 50% faster for the load-spreading policy. • Some algorithms (e.g., double scaling) require the flow to satisfy mass balance con- straints, capacity constraints and ε-optimality. For these algorithms, graph changes can be first applied and flow adjusted such that all arcs satisfy the capacity constraints. Subsequently, a feasible flow can be obtained using a max-flow algorithm. Finally, the min-cost flow algorithm can be resumed from the ε-optimality of the updated feasible flow. I considered implementing an incremental version of each algorithm from Table 5.2. However, some algorithms are unsuitable to incrementally re-compute the optimal flow. For example, the min-cost flow algorithms that require the mass balance constraints to be satisfied before each iteration are unlikely to show significant runtime reductions because they have to use expensive max-flow algorithms to adjust flow such that it satisfies the mass balance constraints. Max-flow algorithms are expensive because they commonly have an average-case performance close to the worst-case performance – O(FM) worst-case complexity for the flow networks generated by schedulers, where F is the total flow supply. Common cluster events (e.g., task submission, completion or failure) cause flow supply, and thus would increase runtime of these incremental algorithms. Capacity scaling and enhanced capacity scaling are also unlikely to quickly re-compute the op- timal flow. They require the residual flow network to contain only nodes with supply greater than −∆ and smaller than ∆. Many common cluster events (e.g., task submission, task comple- tion, machine utilisation change) introduce large node flow supplies. When such events occur, the incremental min-cost algorithms based on capacity scaling cannot quickly re-compute the solution because they fallback to ∆ values almost as large as if the algorithms were to start from scratch. I implemented incremental versions of the cycle canceling, successive shortest path, cost scal- ing and relaxation algorithms. However, I only discuss the algorithms that have competitive 130 5.2. FLOWLESSLY: A FAST MIN-COST FLOW SOLVER Reduced cost on arc from i to j Change type cpii j < 0 c pi i j = 0 c pi i j > 0 Increasing arc cap. Decreasing arc cap. 
Change type | c^π_ij < 0 | c^π_ij = 0 | c^π_ij > 0
Increasing arc cap. | breaks optimality | no effect | no effect
Decreasing arc cap. | breaks feasibility | breaks feasibility if f_ij > u'_ij | no effect
Increasing arc cost | breaks optimality if c'^π_ij > 0 | breaks optimality if f_ij > 0 | no effect
Decreasing arc cost | no effect | breaks optimality | breaks optimality if c'^π_ij < 0

Table 5.3: Arc changes that require solution reoptimisation, depending on the reduced cost c^π_ij of the affected arc (i, j). Cells marked "no effect" leave the flow optimal and feasible; the other cells break feasibility or optimality, either unconditionally or when the condition in the cell holds. Only decreasing arc capacity can break feasibility; the other changes affect optimality only.

Figure 5.11: Incremental cost scaling is 25% faster compared to from-scratch cost scaling for the Quincy policy and 50% faster for the load-spreading policy.

Incremental cost scaling is up to 50% faster than running cost scaling from scratch (Figure 5.11). Cost scaling's possible gains from incremental execution are limited because cost scaling requires the flow to be feasible and ε-optimal before each intermediate iteration (Table 5.2). Graph changes can cause the flow to violate one or both requirements. Table 5.3 shows the effect of different types of changes with respect to cost scaling's flow feasibility and optimality requirements. For example, a change that modifies the cost of an arc (i, j) from c^π_ij < 0 to c'^π_ij > 0 breaks optimality, whereas additions or removals of task nodes (with supply) break feasibility. My cost scaling implementation does not have the best possible worst-case complexity, but it does not require the mass balance constraints to be satisfied before each internal iteration. Nevertheless, my incremental cost scaling does not run several times faster than the from-scratch version. Many changes affect optimality and require cost scaling to fall back to a higher ε-optimality to compensate. In order to bring ε back down, my cost scaling implementation must do a substantial part of the work it would do from scratch. However, the limited improvement still helps reduce the runtime of the second-best algorithm.
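The conditions in Table 5.3 translate into a cheap check that decides whether an arc change can be applied without invalidating the current solution. The sketch below is one way to express this check; the arc fields and the sign convention for reduced costs are assumptions rather than Flowlessly's actual interface:

    def needs_reoptimisation(change_type, arc, new_value, potentials):
        """Check whether an arc change invalidates the current flow, following
        the case analysis of Table 5.3. Reduced costs are assumed to follow the
        convention c^pi_ij = c_ij + pi(i) - pi(j)."""
        reduced_cost = arc.cost + potentials[arc.src] - potentials[arc.dst]
        if change_type == "increase_capacity":
            # Only arcs that must stay saturated (negative reduced cost) are affected.
            return reduced_cost < 0
        if change_type == "decrease_capacity":
            # Feasibility breaks if the current flow exceeds the new capacity u'_ij.
            return arc.flow > new_value
        if change_type == "increase_cost":
            new_reduced_cost = new_value + potentials[arc.src] - potentials[arc.dst]
            # Optimality breaks if an arc with positive reduced cost carries flow.
            return new_reduced_cost > 0 and arc.flow > 0
        if change_type == "decrease_cost":
            new_reduced_cost = new_value + potentials[arc.src] - potentials[arc.dst]
            # Optimality breaks if an arc with negative reduced cost is unsaturated.
            return new_reduced_cost < 0 and arc.flow < arc.capacity
        raise ValueError("unknown change type: " + change_type)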
Incremental relaxation ought to work much better than incremental cost scaling because the relaxation algorithm only needs the flow to satisfy the capacity constraints and the reduced cost optimality conditions. However, in practice it turns out not to work well. While the algorithm can be incrementalised with relative ease, it can – counter-intuitively – be slower than when running from scratch. Relaxation requires reduced cost optimality to hold at each step of the algorithm and improves feasibility (Table 5.2) by pushing flow on zero-reduced cost arcs from source nodes to sink nodes. The algorithm gradually constructs a tree of arcs with zero reduced costs for each source node in order to find zero-reduced cost paths to sink nodes. Incremental relaxation works with the existing, close-to-optimal state, which increases runtime because the closer to optimality the solution is, the larger the trees to be built are. In contrast, when running from scratch, only a small number of the source nodes (i.e., tasks) selected during the execution have large trees associated with them. In practice, I have found that incremental relaxation can perform well in cases when tasks are not connected to a zero-reduced cost tree. However, it can perform poorly, especially in challenging situations with many tasks, since these are likely to require many tree constructions.

Figure 5.12: Problem-specific heuristics reduce min-cost flow runtimes. (a) Arc prioritisation (AP) reduces relaxation's runtime by 45%. (b) Efficient task removal (TR) reduces incremental cost scaling's runtime by 10%.

5.2.5 Optimising min-cost flow algorithms

Relaxation has promising common-case performance at scale for typical workloads, but its edge-case behaviour makes it necessary to either (i) use other algorithms in these cases, or (ii) apply heuristics developed for these cases.

Flowlessly runs min-cost flow algorithms on flow networks with specific properties, which differ from the flow networks typically used to evaluate min-cost flow algorithms [KK12, §4]. For example, the flow networks Firmament generates have a single sink, are directed acyclic graphs, and flow must always traverse unscheduled aggregators or machine nodes. Problem-specific heuristics that leverage these flow network properties might help min-cost flow algorithms find solutions more quickly. I developed several such heuristics, and found two beneficial ones: (i) prioritisation of promising arcs, and (ii) efficient task node removal.

5.2.5.1 Arc prioritisation

The relaxation algorithm builds, for every supply node, a tree of zero-reduced cost arcs to locate zero-reduced cost paths to nodes with demand (i.e., paths that do not break reduced cost optimality). When the algorithm builds this tree, it can extend the tree with any zero-reduced cost arc that connects a node from the tree to an external node. Some arcs are better choices for tree extension than others because the quicker the algorithm can find paths to nodes with demand, the sooner it can route the supply. For example, some scheduling policies create flow networks that contain arcs that indicate task machine placement preferences. These arcs connect tasks to resource nodes which, in turn, are a few arcs away from, or directly connected to, the sink node. I adjust relaxation to prioritise such arcs that lead to nodes with demand when it extends the zero-reduced cost tree. To ensure these arcs are visited sooner, I add them to the front of the priority queue relaxation uses to build the zero-reduced cost tree.

In effect, this heuristic implements a hybrid graph traversal that biases towards depth-first exploration when demand nodes can be reached, but breadth-first exploration otherwise. In Figure 5.12a, I show that this heuristic reduces relaxation runtime by 45% when running over a flow network with contended nodes. In the experiment, I replay the Google trace at 90% cluster slot utilisation and use the load-spreading policy, which creates challenging flow networks for relaxation.
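A minimal sketch of this prioritised tree extension follows, using a simplified residual-graph representation; the is_priority_arc predicate stands in for the policy-specific test (e.g., a preference arc towards a machine node) and the graph interface is hypothetical rather than Flowlessly's:

    from collections import deque

    def extend_zero_reduced_cost_tree(root, graph, potentials, is_priority_arc):
        """Grow a tree of zero-reduced cost arcs from a supply node, visiting
        promising arcs first (depth-first bias), all others later
        (breadth-first bias)."""
        tree = {root}
        work_queue = deque([root])
        while work_queue:
            node = work_queue.popleft()
            for arc in graph.outgoing_arcs(node):
                reduced_cost = arc.cost + potentials[arc.src] - potentials[arc.dst]
                if reduced_cost != 0 or arc.dst in tree:
                    continue  # only new zero-reduced cost arcs extend the tree
                tree.add(arc.dst)
                if is_priority_arc(arc):
                    # Arcs leading towards demand nodes are explored next.
                    work_queue.appendleft(arc.dst)
                else:
                    work_queue.append(arc.dst)
        return tree

In the real solver the tree search stops as soon as a node with demand is reached and the supply is pushed along the discovered path; the sketch only illustrates the traversal order.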
5.2.5.2 Efficient task removal

My second heuristic reduces incremental min-cost flow algorithms' runtime. The key insight behind my heuristic is that removing a running task (e.g., due to its completion, preemption, or a machine failure) constitutes a common, but costly change. When a task node is removed, the arcs connected to the node must also be removed. At least one of these arcs carries flow in the previous optimal solution, and thus their removal breaks feasibility and creates demand at the nodes previously connected to the task node.

I improve task removal by adding an arc, called a running arc, from each node representing a running task to the node modelling the machine on which the task runs. I set the cost on these arcs to the cost of continuing to run the tasks on the machines they are running on. Min-cost flow algorithms route the flow supply of running tasks along these arcs unless better alternatives become available. Thus, I can easily reconstruct tasks' flow supply paths through the graph, and remove the flow all the way to the single sink node when the tasks complete or fail. By removing the flow, I only create demand at the sink node, which accelerates the incremental solution because fewer node potentials need to be adjusted. In Figure 5.12b, I show that this heuristic reduces incremental cost scaling runtime by about 10% when replaying the Google trace at 95% cluster slot utilisation using the Quincy scheduling policy2.

2 Relaxation does not benefit from this heuristic because the algorithm is not suitable for running incrementally (§5.2.4).

5.2.6 Algorithm choice

In Sections 5.2.1 and 5.2.2, I showed that the performance of min-cost flow algorithms varies and that no algorithm consistently outperforms all others. Relaxation often works best in practice, but scales poorly when resources are highly utilised. Cost scaling, by contrast, scales well and can be incrementalised (§5.2.4), but is usually substantially slower than relaxation. Not even the heuristics I previously described reduce runtime sufficiently for any algorithm to offer low task placement latency for all possible scheduling policies and cluster scenarios.

In order to always get the lowest task placement latency, I adjust Flowlessly to speculatively execute cost scaling and relaxation, and pick the solution offered by whichever algorithm finishes first. In the common case, this is relaxation; however, in challenging situations when relaxation is slow, cost scaling guarantees that Firmament's placement latency does not get unreasonably high.

Flowlessly runs both algorithms rather than using a heuristic to choose the best one for two reasons. First, it is cheap; the algorithms are single-threaded and do not parallelise well. They comprise many steps that make small changes to data that would have to be shared among threads; parallel implementations would have to use locks extensively. I parallelised the cost scaling algorithm, but I found my implementation to barely outperform the single-threaded algorithm in most cases, and even to take longer to complete on some flow networks. Second, predicting the best algorithm is hard; the heuristic would depend on both scheduling policy and cluster utilisation, which would make it brittle and complex.

In Figure 5.13, I show a high-level view of how Flowlessly runs two algorithms in parallel.

Figure 5.13: Schematic of Flowlessly's internals.
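A minimal sketch of this speculative execution, written with Python threads as a stand-in for Flowlessly's implementation; solve_relaxation and solve_incremental_cost_scaling are hypothetical solver entry points rather than its real API:

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def run_speculative_solvers(graph_changes, relaxation_state, cost_scaling_state):
        """Run relaxation and incremental cost scaling on the same changes and
        return the name and solution of whichever algorithm finishes first."""
        executor = ThreadPoolExecutor(max_workers=2)
        futures = {
            executor.submit(solve_relaxation, relaxation_state, graph_changes):
                "relaxation",
            executor.submit(solve_incremental_cost_scaling, cost_scaling_state,
                            graph_changes):
                "incremental cost scaling",
        }
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        winner = done.pop()
        # Do not wait for the slower solver; its result is discarded (or its
        # state reused) before the next scheduling round.
        executor.shutdown(wait=False)
        return futures[winner], winner.result()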
In order to efficiently transition state from the relaxation algorithm to incremental cost scaling, Flowlessly applies an optimisation that I describe now.

5.2.6.1 Efficient algorithm switching

Flowlessly runs relaxation and incremental cost scaling in parallel – the latter because it is substantially faster than non-incremental cost scaling (§5.2.4). In the common case, however, the (from-scratch) relaxation algorithm is the first to finish. Therefore, the next incremental cost scaling run must use the optimised solution returned by relaxation as a starting point. This can be unnecessarily slow because relaxation and cost scaling maintain different reduced cost graph representations. Specifically, relaxation can converge on node potentials that are poor choices for satisfying cost scaling's complementary slackness optimality requirements because relaxation adjusts node potentials and flow to satisfy the reduced cost optimality conditions. For example, relaxation can converge on greatly different potentials on two non-adjacent nodes i and j, which would introduce a high reduced arc cost in the network if arc (i, j) is added. Such graph changes are difficult to handle because incremental cost scaling typically requires the flow to satisfy mass balance constraints and to be ε-optimal (for the smallest possible ε). The incremental algorithm can either: (i) exhaust the residual capacity on arc (i, j), but break mass balance constraints, or (ii) leave the flow unchanged on arc (i, j), but deteriorate ε-optimality to a large ε.

In Flowlessly, I choose the latter approach and combine it with price refine, a heuristic that reduces differences between adjacent node potentials and improves ε-optimality [Gol97]. The heuristic was originally developed for use within cost scaling, but helps make the transition between relaxation and incremental cost scaling more efficient. Price refine adjusts node potentials to discover if an ε-optimal flow is ε/α-optimal as well. I apply price refine on the previous solution before I apply the latest graph changes. Because the flow returned by the previous algorithm run is 1/n-optimal, I can reset all node potentials to zero and invoke price refine to compute node potentials for the 1/n-optimal flow. Given flow optimality, price refine can find node potentials such that the flow satisfies the complementary slackness optimality conditions. Hence, incremental cost scaling only has to restart at an ε value equal to the maximum arc cost graph change.

Figure 5.14: Applying the price refine heuristic to a graph coming from relaxation reduces incremental cost scaling runtime by 4×.

In Figure 5.14, I illustrate that my price refine implementation reduces incremental cost scaling runtime by about 4× in 90% of the cases when I apply it to relaxation's graph output. In the experiment, I replay the Google trace at 90% cluster slot utilisation using the Quincy scheduling policy.
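The switch from relaxation's solution to the next incremental cost scaling run can be summarised by the following sketch; reset of potentials aside, price_refine and incremental_cost_scaling are hypothetical helpers standing in for Flowlessly's internals:

    def switch_to_incremental_cost_scaling(graph, flow, graph_changes, max_cost_change):
        """Reuse relaxation's optimal flow as the starting point for
        incremental cost scaling, as described above."""
        # The previous run left a 1/n-optimal flow, so node potentials can be
        # recomputed rather than reusing relaxation's potentials.
        potentials = {node: 0 for node in graph.nodes()}
        # Price refine finds potentials under which the current flow satisfies
        # the complementary slackness optimality conditions.
        potentials = price_refine(graph, flow, potentials)
        # Only now apply the new graph changes, which may raise epsilon again.
        graph.apply(graph_changes)
        # Restart cost scaling at an epsilon bounded by the largest cost change.
        return incremental_cost_scaling(graph, flow, potentials,
                                        epsilon=max_cost_change)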
5.2.7 Efficient solver interaction

So far, I have focused on reducing Flowlessly's algorithm runtime. However, I must do more to achieve low task placement latencies. First, Firmament must efficiently update flow network nodes, arcs, costs, and capacities before each min-cost flow optimisation run to reflect cluster state changes. Second, Firmament must not generate superfluous incremental flow network changes. Third, Firmament must quickly extract task placements out of the flow network optimisation's output. These steps are not included in the min-cost flow algorithm runtime, but must be efficient for placement latency to be low. I now explain how I improve these steps over the prior work on min-cost flow-based scheduling in Quincy.

Flow network updates. I optimise Firmament to run only two breadth-first traversals (BFS) over the flow network to update the network before each solver run. The first traversal updates task and resource statistics (e.g., current machine utilisation, resource requests). The traversal starts from nodes adjacent to the sink (usually machine nodes), and propagates statistics backwards up to task nodes along each node's incoming arcs. When the traversal completes, each node has up-to-date resource statistics. By contrast, the second traversal starts at task nodes. For each node, it invokes scheduling policy API methods that are implemented by each supported scheduling policy [Sch16, Appendix C]. These methods add and remove flow network nodes, add and remove arcs, and change arc costs and capacities using the statistics computed in the first traversal. For example, the methods can remove task preference arcs to machines if these machines are now highly utilised. Both traversals have a worst-case complexity of O(M + N), linear in the number of arcs and nodes. Their runtime is negligible compared to the min-cost flow algorithms' runtime, which have higher worst-case complexities (see Table 5.1).

Flow network changes creation. Typical min-cost flow solvers receive flow networks as input and output an optimal flow. By contrast, Flowlessly stores the flow network across runs, expects to receive only flow network changes, and incrementally adjusts the flow. I extended Firmament to submit flow network changes to Flowlessly before each solver run. Nonetheless, these changes can be expensive to process because they may require Flowlessly to resize its internal data structures, or may cause numerous cache misses. I reduce how many flow network changes Firmament generates by making sure it:

• does not generate duplicate changes;
• merges multiple changes to an arc into a single change that encompasses all;
• prunes changes to incoming and outgoing arcs of nodes that it later removes.

My optimisations reduce how long it takes Flowlessly to modify the flow network between consecutive incremental min-cost flow runs and, ultimately, improve task placement latency.

Task placement extraction. Flowlessly returns an optimal flow at the end of each run, out of which Firmament extracts task placements. The Quincy scheduler has a mechanism to extract task placements for its policy, but I had to generalise this mechanism because Firmament supports minimum flow requirements on arcs, arbitrary aggregators and multiple arcs between pairs of nodes. I devised a graph traversal algorithm (see Listing 5.1) to extract task assignments efficiently. The algorithm starts from machine nodes and traverses the graph backwards, propagating a list of machines to which each node sent flow. When the algorithm completes, a single-machine list has propagated to each scheduled task node.

    from collections import defaultdict

    def extract_task_mappings(machine_nodes):
        to_visit = list(machine_nodes)  # list of machine nodes
        # Auxiliary remember map: machines to which each node sent flow.
        node_flow_destinations = defaultdict(list)
        for machine in machine_nodes:
            # A machine node forwards to the sink as much flow as it receives;
            # remember one copy of the machine per unit of received flow.
            received_flow = sum(arc.flow for arc in machine.incoming_arcs())
            node_flow_destinations[machine] = [machine] * received_flow
        mappings = {}  # final task mappings
        while to_visit:
            node = to_visit.pop()
            if node.type is not TASK_NODE:
                # Visit the incoming arcs
                for arc in node.incoming_arcs():
                    moved_machines = 0
                    # Move as many machines to the incoming arc's
                    # source node as there is flow on the arc
                    while moved_machines < arc.flow:
                        node_flow_destinations[arc.source].append(
                            node_flow_destinations[node].pop())
                        moved_machines += 1
                    # (Re)visit the incoming arc's source node
                    if arc.source not in to_visit:
                        to_visit.append(arc.source)
            else:  # node.type is TASK_NODE
                mappings[node.task_id] = node_flow_destinations[node].pop()
        return mappings

Listing 5.1: Algorithm for extracting task placements from the flow returned by the solver.
In contrast to standard breadth-first and depth-first graph traversals, the algorithm may visit a node more than once if there are several flow paths of different length between the sink and the node. However, such situations are rare in the flow networks Firmament's scheduling policies generate. In the common case, the algorithm extracts the task placements in a single pass over the graph.

5.3 Extensions to min-cost flow-based scheduling

Prior work showed that min-cost flow-based schedulers choose quality task placements because they satisfy data locality requirements [IPC+09], avoid task co-location interference [Sch16, §7.3], leverage hardware heterogeneity to reduce power consumption [Sch16, §7.4], support soft and hard constraints [Sch16, §6.3.1], and can offer fair shares to jobs [IPC+09]. However, min-cost flow-based schedulers still have several limitations:

1. Linear task placement costs. The cost of placing a task on a resource is given by the sum of the arc costs along the flow path from the task node to the resource node. These arc costs are statically computed by the scheduling policy before each min-cost flow solver run. Thus, the cost of placing a task on a resource does not change regardless of how many other tasks are placed on the same resource in the same scheduling round. As a result, min-cost flow-based schedulers cannot account for the performance penalties that two or more tasks simultaneously placed on the same resource incur due to co-location (see Figure 5.15). This limitation causes task makespan to increase or task application-level performance to decrease. Quincy adopts a radical approach to solve this limitation: it executes only one task per machine. However, this approach leads to low cluster resource utilisation. Schwarzkopf suggests an alternative approach in which several tasks can be co-located, but which disallows schedulers from simultaneously placing more than one task on a resource in a scheduling round [Sch16, §7.3]. This approach does not significantly decrease resource utilisation, but it increases task placement latency because tasks may have to wait for several scheduling rounds to complete before they are placed.

2. Tasks cannot have complex constraints. Some cluster workloads have complex constraints that create mutual dependencies between tasks. Two popular types of such constraints are task affinity and task anti-affinity (§2.3.1.6). Task affinity constraints create dependencies between two or more tasks that must be placed on the same resource (e.g., a web server task must share a machine with a database task). By contrast, task anti-affinity constraints restrict tasks to be placed only on different resources (e.g., distribute database tasks across machines to improve fault tolerance). Despite their appeal, there is no way to specify such constraints in min-cost flow-based schedulers.

3. Inefficient gang scheduling. Some cluster workloads expect schedulers to place tasks on machines simultaneously. For example, graph processing workflows are iterative and run in systems that require all tasks to synchronise at the end of each iteration (§2.1). Tasks waste resources and workflow makespan increases if they are not placed simultaneously (i.e., gang scheduled). However, min-cost flow-based schedulers either do not support gang scheduling [IPC+09], or can only gang schedule one workflow at a time [Sch16, §6.3.3].
In this section, I introduce three new basic flow network concepts I developed: convex arc costs, the "xor" flow network construct, and the "and" flow network construct. I use these concepts as building blocks for complex scheduling policies that address the limitations I described above. In Subsection 5.3.1, I explain how convex arc costs can be used to avoid interfering task placements. I then describe how the basic flow network constructs can be grouped together to express complex constraints in flow networks, and to gang schedule many tasks at a time (§5.3.2).

Figure 5.15: Example of a min-cost flow scheduler's limitation in handling data skews. The scheduler places tasks T0,0, T0,1 and T1,0 on machine M1 without taking into account that the tasks may interfere.

5.3.1 Convex arc costs

Firmament can run scheduling policies that mitigate straggler tasks' effect on job completion time. These policies do not connect tasks to heavily loaded machines (i.e., they blacklist such machines); instead, they connect tasks to preferred machines. For example, Quincy uses preference arcs to connect task nodes to machine nodes on which task input data resides. Tasks placed by Quincy read a high fraction of input data locally, but they may become stragglers if: (i) input data are singly replicated, or (ii) a fraction of the data is used by many tasks [NEF+12]. Such data distributions are common: 18% of the data from one of Microsoft's Bing Dryad clusters is accessed by at least three unique jobs at a time, and the top 12% most popular data are accessed over ten times more than the bottom third of the data [AAK+11].

In such situations, the Quincy scheduling policy creates many preference arcs that connect to the several machines which store the popular data. In Figure 5.15, I show one such example. Machine M1 stores data that is accessed by tasks T0,0, T0,1 and T1,0. They each have a preference arc to machine M1. When min-cost flow algorithms optimise such flow networks, they push flow along the preference arcs, and thus co-locate tasks on the preferred machines (i.e., on M1 in the example network). As a result, the scheduler may place too many interfering tasks on a few machines in a scheduling round. This may happen because preference arc costs do not increase as more tasks are placed in a scheduling round. In other words, the scheduler is just as likely to place an additional task on machine M1 after it has already decided to place several tasks there in the same scheduling round.

Existing min-cost flow schedulers choose practical, but inefficient solutions to avoid this limitation. For example, Quincy does not run more than one task per machine, which does not keep clusters highly utilised. Schwarzkopf introduced an alternative approach based on admission control in Firmament [Sch16, §7.3.1]. Firmament ensures that no more than a given number of tasks can be placed on a resource in a scheduling round. It sets the capacity of the arcs that connect resource nodes to the sink node equal to the maximum number of allowed co-located tasks. Thus, Firmament can achieve high cluster utilisation, but it cannot offer low task placement latency for tasks that are admission controlled. These tasks may wait for several scheduling rounds to complete before they are placed. However, this trade-off is not necessary if scheduling policies use the flow network construct I introduce next.

Figure 5.16: Example showing how convex arc costs can be modelled in the flow network. (a) Linear arc cost: a single arc (i, j) with u_ij: 4, c_ij: 5. (b) Convex arc costs: four arcs from i to j, each with u_ij: 1 and costs 2, 4, 9, 15. An arc (i, j) is transformed into several arcs with different costs. The same amount of flow can be routed from node i to node j, but the cost increases linearly (5.16a) or non-linearly (5.16b).
Convex arc costs. In all the flow networks I have presented to this point, each arc (i, j) has an associated cost c_ij and a maximum flow capacity u_ij. In these flow networks, regardless of how many units of flow are already sent along arc (i, j), it costs c_ij to send an additional unit of flow (i.e., it costs x ∗ c_ij to send x units of flow). However, there are cases when it is desirable for the cost of sending flow to increase more than linearly. For example, if Firmament pushes two tasks' flow supply to a machine node along an arc, it may be desirable for the second task's flow supply to cost more to push, because that task may interfere with the first task.

Convex arc costs can be modelled by creating multiple arcs between pairs of nodes in the flow networks. These arcs can have different costs, minimum flow requirements and flow capacity constraints. For example, an arc (i, j) with a linear cost c_ij and a capacity u_ij can be transformed into several arcs that all connect node i and node j, but with different costs. The sum of the capacities of these arcs must be equal to u_ij; the same amount of flow can be routed along the multiple arcs that model arc (i, j), albeit at a cost that does not have to increase linearly. In Figure 5.16, I show how an arc with a flow capacity of 4 and a cost of 5 can be represented as several arcs that have an equal total capacity, but model a convex cost.

Scheduling policies that use convex arc costs generate flow networks with many arcs. In the worst case, if these policies strive to achieve high arc cost accuracy, they can introduce as many new arcs as the capacities of the prior linear-cost arcs. The runtime of existing min-cost flow solvers (e.g., cost scaling [Gol97]) grows on such large flow networks. However, Flowlessly's runtime does not significantly increase when flow networks contain more arcs (see §6.3.1).
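A minimal sketch of this transformation, assuming a convex marginal cost function supplied by the policy; the Arc tuple is a hypothetical stand-in for Firmament's arc descriptors:

    from collections import namedtuple

    Arc = namedtuple("Arc", ["src", "dst", "capacity", "cost"])

    def expand_convex_arc(src, dst, capacity, marginal_cost):
        """Replace a single arc of the given capacity with unit-capacity arcs
        whose costs follow the convex marginal_cost function."""
        return [Arc(src, dst, 1, marginal_cost(k)) for k in range(1, capacity + 1)]

    # For example, a quadratic marginal cost gives the flavour of Figure 5.16b:
    # each additional unit of flow is more expensive to route.
    arcs = expand_convex_arc("i", "j", 4, lambda k: k * k)
    # -> unit-capacity arcs with costs 1, 4, 9, 16

Because a min-cost flow solver always prefers the cheapest remaining parallel arc, routing k units of flow through these arcs incurs the convex cumulative cost.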
5.3.2 Complex placement constraints

In Section 2.3.1.6, I categorise placement constraints into three types: (i) soft constraints that indicate placement preferences that do not necessarily have to be satisfied, (ii) hard constraints that express placement requirements that must be satisfied by individual tasks, and (iii) complex constraints that can be hard or soft, and require several tasks and machines to simultaneously satisfy them. All types of placement constraints are common in practice: 50% of tasks from a Google cluster have soft or hard constraints related to machine properties, and 11% of tasks have complex constraints [SCH+11].

However, only a few cluster schedulers support complex constraints [TCG+12; TZP+16], and to my knowledge, no min-cost flow-based cluster scheduler does so. Prior work claims that complex constraints cannot be expressed in flow networks because solvers route each unit of task flow supply to sink nodes independently of other flow [Sch16]. However, this is not the case: complex constraints can be represented in flow networks using two flow constructs that I introduce now.

"xor" flow network construct. Scheduling policies that strive to avoid co-locating interfering tasks, and that spread tasks across machines and racks for better fault tolerance – i.e., goals expressed with task anti-affinity complex constraints – would benefit if they could model exclusive disjunctions in flow networks. In Figure 5.17a, I model an exclusive disjunction of two tasks' flow. I connect the tasks to an aggregator node (G), which I next connect to another aggregator node (O). I set a maximum capacity of one on arc (G, O) to ensure that only one unit of flow can be routed along this arc. As a result, at most one task (i.e., T0,0 or T0,1) can be placed on the machines to which node O connects.

Figure 5.17: Examples that create dependencies between tasks' flow supply. Figure 5.17a shows a construct that ensures at most one task is placed out of two; Figure 5.17b generalises the construct to n tasks and a maximum of p placed tasks. (a) Example of the "xor" flow network construct: at most one task out of T0,0 and T0,1 can be placed. (b) Example of the generalised "xor" flow network construct: no more than r tasks can be placed on a machine, and a maximum of p tasks out of n can be placed.

In Figure 5.17b, I extend the flow network to n tasks. As before, I connect the task nodes to an aggregator node G, which only connects to node O. I set the maximum capacity on this arc to the maximum number of tasks connected to node G that are allowed to be placed. I also set maximum capacities on the arcs connecting node O and the machine nodes. These capacities control how many tasks can be co-located on each machine.

The "xor" flow network construct can be used to express complex conditional preemption constraints such as: if task Tx is placed, then task Ty must be preempted as a result. Similarly, the generalised "xor" construct can be used to express conditional constraints, but also complex co-scheduling constraints (e.g., task anti-affinity constraints) that require several tasks not to share resources.
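A minimal sketch of how a policy might emit the generalised "xor" construct when it builds the flow network; FlowNetwork, add_node and add_arc are hypothetical helpers rather than Firmament's policy API:

    def add_xor_construct(network, task_nodes, machine_nodes,
                          max_placed_tasks, max_tasks_per_machine, arc_cost=0):
        """Connect task nodes through two aggregators so that at most
        max_placed_tasks of them are placed, and at most
        max_tasks_per_machine of them land on any one machine."""
        agg_g = network.add_node("G")
        agg_o = network.add_node("O")
        for task in task_nodes:
            # Each task contributes one unit of flow supply.
            network.add_arc(task, agg_g, capacity=1, cost=arc_cost)
        # The capacity on (G, O) caps how many of the tasks can be placed.
        network.add_arc(agg_g, agg_o, capacity=max_placed_tasks, cost=arc_cost)
        for machine in machine_nodes:
            # Per-machine capacities cap co-location on each machine.
            network.add_arc(agg_o, machine, capacity=max_tasks_per_machine,
                            cost=arc_cost)
        return agg_g, agg_o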
"and" flow network construct. Some tasks benefit if they are co-located with other tasks (e.g., tasks that exchange data regularly). Task affinity constraints – a type of complex co-location constraint – express requirements that must be satisfied by two or more tasks. It is impossible to represent such task requirements in min-cost flow networks, but it is possible to encode them in generalised min-cost flow networks.

The generalised min-cost flow problem finds min-cost flow solutions in flow networks in which arcs have positive flow multipliers γ_ij, called gain factors. In these networks, each unit of flow sent across any arc (i, j) is transformed into γ_ij units of flow, which are received at node j. I use γ factors to build an "and" flow network construct that encodes logical conjunction. The "and" network construct can be used to ensure that two or more tasks are placed only when they all respect a certain condition (e.g., all placed on the same machine).

Figure 5.18: "and" flow network construct. It models the tasks' requirement to be co-located on machine M0. Either both tasks are placed or none.

In Figure 5.18, I show a network with two tasks to exemplify the construct. In this example, tasks T0,0 and T0,1 must be either co-located or left unscheduled. I connect the task nodes to a new aggregator node G, which I then connect to node E and node A. I set a minimum flow requirement and a maximum capacity of one on arc (G, E), and I set a maximum capacity of one and a γ gain factor of two on arc (G, A). These arc requirements and capacities limit the possible task supply flow routes to two options:

1. One task's flow supply is routed via the unscheduled aggregator node U0 and the other task's flow supply via aggregator node G. The flow routed via G must next be routed to node E because arc (G, E) has a minimum flow requirement of one. Thus, both tasks remain unscheduled.

2. Both tasks' flow supply is routed to aggregator node G. Then, exactly one unit of flow is routed to node E because arc (G, E) has a minimum flow requirement and a maximum capacity of one. The remaining flow at aggregator node G is routed to node A, but node A receives two units of flow because arc (G, A) has a γ gain factor equal to two. Thus, two units of flow are routed to machine M0, and the tasks are co-located.

The min-cost flow algorithms I implemented in Flowlessly do not find optimal flow in generalised flow networks, but other polynomial combinatorial algorithms exist [Way99]. In the future, Flowlessly could be extended to support such generalised min-cost flow algorithms.

5.3.3 Gang scheduling

Some graph processing back-end execution engines (e.g., Pregel, Giraph) synchronise all tasks at the end of each iteration (§2.1). The workflows these back-ends execute benefit if all workflow tasks are placed simultaneously. Schedulers that do not gang schedule these tasks increase workflow makespan and waste cluster resources with tasks that wait to synchronise. Nevertheless, only the Tarcil [DSK15] and TetriSched [TZP+16] cluster schedulers can gang schedule tasks.

Figure 5.19: Rigid gang scheduling flow network construct. Both tasks are scheduled no matter whether resources are available or not.

Schwarzkopf discusses gang scheduling in the context of min-cost flow schedulers [Sch16, §6.3.3]. He proposes using minimum flow arc requirements to force tasks to schedule (see Figure 5.19 for an example with two tasks). In his approach, the scheduler adds per-job gang aggregator nodes to which it connects tasks that must be gang scheduled. Next, the scheduler connects each gang aggregator node to another corresponding per-job aggregator node. The scheduler sets the minimum flow requirement on these arcs to equal the number of job tasks that must be gang scheduled. Finally, the scheduler adds arcs from the job aggregator nodes to preferred resource nodes.

However, this approach is limited: minimum flow requirements force all tasks to be placed no matter how costly it is, and no matter how many other tasks (potentially higher-priority tasks) must be preempted. One way to cope with this limitation is to construct two flow networks: (i) a network that contains the gang scheduling construct, and (ii) a network without the construct. Firmament could run two Flowlessly instances in parallel and pick the lowest-cost solution. However, Firmament could gang schedule only a few jobs at a time this way because the number of solver instances that it would have to execute in parallel grows exponentially – i.e., 2^N solver instances to gang schedule N jobs.

Figure 5.20: Generalised flow network gang scheduling construct. The construct ensures that either all tasks get placed or none at all.

I suggest an alternative approach for gang scheduling that does not have this limitation. My approach uses a generalised version of the "and" flow network construct (see Figure 5.20). I connect all n tasks that must be gang scheduled to an aggregator node (G). As before, I next connect node G to node E and node A, but now I set arc (G, E)'s minimum flow requirement and maximum capacity to n−1, and arc (G, A)'s gain factor to n. As a result, either all tasks remain unscheduled, if the solver routes flow supply along the unscheduled aggregator node U0 and node E to the sink node, or all tasks are placed on machines, if the solver routes one unit of flow supply across arc (G, A). With such flow network constructs, many different workflows' tasks can be gang scheduled without having to run multiple solver instances in parallel.
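A minimal sketch of emitting this generalised "and" construct; because ordinary flow networks have no gain factors, the sketch records γ, the minimum flow requirement and the capacities as arc attributes that a generalised min-cost flow solver would have to interpret (the FlowNetwork helpers are hypothetical, as before):

    def add_gang_construct(network, task_nodes, machine_node, sink):
        """Ensure that the tasks either all place on machine_node or all stay
        unscheduled, using one aggregator and a gain-factor arc."""
        n = len(task_nodes)
        agg_g = network.add_node("G")
        node_e = network.add_node("E")
        node_a = network.add_node("A")
        for task in task_nodes:
            # Tasks keep their usual arcs to the unscheduled aggregator, which
            # the policy adds elsewhere.
            network.add_arc(task, agg_g, capacity=1, cost=0)
        # Exactly n-1 units of flow must always traverse (G, E)...
        network.add_arc(agg_g, node_e, capacity=n - 1, min_flow=n - 1, cost=0)
        network.add_arc(node_e, sink, capacity=n - 1, cost=0)
        # ...while the single remaining unit on (G, A) is multiplied by the
        # gain factor n, so the machine receives n units only if every task
        # routed its supply through G.
        network.add_arc(agg_g, node_a, capacity=1, cost=0, gain=n)
        network.add_arc(node_a, machine_node, capacity=n, cost=0)
        return agg_g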
5.4 Network-aware scheduling policy

Increasingly, data processing systems store data in memory and execute tasks that place high load on data centre networks [ZCD+12; MMI+13]. Similarly, machine learning systems train models on large data sets using worker tasks that utilise a high fraction of machine network bandwidth. If these tasks are co-located, they interfere on the end-host network links, and thus take longer to complete. Moreover, they can also cause degradations in service tasks' application-level performance (e.g., increase web serving latency) [GSG+15].

Figure 5.21: Example of a flow network for a network-aware policy with request aggregators (RA) and dynamic arcs to machines with spare network bandwidth.

I use convex arc costs – one of the min-cost flow scheduling extensions I discussed – in a scheduling policy to showcase their practical application for network-intensive workloads. In Figure 5.21, I illustrate the scheduling policy I developed, which avoids overcommitting end-host network bandwidth. Users specify task network bandwidth requests, which the policy uses to decide to which nodes to connect task nodes. The policy creates a request aggregator node (RAx) if there exists at least one task with an x MB/s network bandwidth request in the flow network. It connects all task nodes to the request aggregator node that corresponds to their network bandwidth request. Next, the policy adds arcs from each RAx aggregator node to the machines on which there is at least x MB/s of network bandwidth available.

The policy could set linear costs on each (RAx, Mm) arc – e.g., set the arc cost to x, the amount by which network utilisation increases when a task with an x MB/s network bandwidth request is placed on machine m. However, such costs would cause Firmament to place too many tasks simultaneously on machines with low network utilisation in a scheduling round, and thus oversubscribe end-host network links. Instead, my scheduling policy uses convex arc costs. Between each RA aggregator and machine pair, it creates as many arcs as tasks fit within the available machine network bandwidth. The policy sets the costs on these arcs to the sum of the available machine network bandwidth, the task bandwidth request, and the bandwidth requests of other tasks that could be placed on this machine in the next scheduling round. For example, in Figure 5.21, machine M2 has 650 MB/s available. The policy sets the cost on the first arc between RA150 and M2 to 650+150, the cost on the second arc to 650+150∗2, and the cost on the third arc to 650+150∗3. Thus, the scheduler would only place two tasks with a network request of 150 MB/s if there are no other machines with a network utilisation smaller than 750 MB/s.
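A minimal sketch of how the policy could generate these convex-cost arcs for one request aggregator and one machine; the function name and the simple cost formula follow the description above, but are hypothetical rather than Firmament's implementation:

    def request_aggregator_arcs(available_bandwidth, bandwidth_request):
        """Return (capacity, cost) pairs for the arcs between a request
        aggregator RA_x and a machine: one unit-capacity arc per task that
        still fits within the machine's available network bandwidth."""
        arcs = []
        fitting_tasks = available_bandwidth // bandwidth_request
        for k in range(1, fitting_tasks + 1):
            # The k-th co-located task is charged the available bandwidth plus
            # k times its request, so each extra task is more expensive.
            cost = available_bandwidth + k * bandwidth_request
            arcs.append((1, cost))
        return arcs

    # For machine M2 in Figure 5.21 (650 MB/s available) and 150 MB/s requests:
    # request_aggregator_arcs(650, 150) -> [(1, 800), (1, 950), (1, 1100), (1, 1250)]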
Firmament uses the network-aware scheduling policy to generate and dynamically adjust flow networks when it observes bandwidth utilisation changes. The convex arc costs that the policy uses ensure that the end-host network load is balanced across machines. In Section 6.3.2, I show that Firmament with my network-aware scheduling policy outperforms other state-of-the-art cluster schedulers on a network-intensive workload.

5.5 Limitations

Min-cost flow-based schedulers offer many desirable scheduling features (e.g., soft and hard constraints, high data locality, co-location interference avoidance), but they also have several limitations that I discuss now.

Multi-dimensional capacities. Scheduling policies restrict flow paths by setting minimum flow requirements and maximum capacity constraints on arcs. However, despite tasks having widely different requirements in multiple resource dimensions (see §2.1), min-cost flow solvers do not distinguish between different tasks' flow supplies. Solvers ensure that each unit of flow satisfies the same minimum flow requirements and capacity constraints as any other flow, regardless of the resource requirements of the tasks that generate the supply. To address this issue, min-cost flow schedulers admission-control tasks into the flow network (i.e., they add tasks only when enough resources are available). Multi-dimensional task resource requests could be expressed in more general flow networks which route flow for multiple commodities (i.e., tasks create flow supply in different dimensions and arcs have capacity constraints for each dimension). However, finding the minimum cost flow in a multi-commodity network is an NP-complete problem.

Multi-dimensional arc costs. Flow network arc costs represent how expensive it is to route task flow along arcs. For example, the cost of an arc that connects a task node to a machine node represents how costly it is to place the task on the machine. Moreover, a task placement cost must be represented as a sum of arc costs, regardless of how many different types of resources tasks request. Min-cost flow schedulers use weighted linear functions to flatten task resource requirements to integer arc costs. However, it is challenging for developers to define these functions such that they accurately predict how well-suited machines are. These weighted linear functions would not be required if min-cost flow schedulers could generate and efficiently optimise multi-commodity flow networks, but, as I previously noted, it is not possible to quickly find solutions in such flow networks.

5.6 Summary

In this chapter, I argue that centralised min-cost flow-based schedulers can be optimised to achieve low task placement latency while maintaining high placement quality.

I first introduce Firmament, a centralised min-cost flow scheduler I extend. I give an overview of how Firmament works, and how it interacts with min-cost flow solvers (§5.1). Next, I describe Flowlessly, the min-cost flow solver I developed to make Firmament choose placements with low latency at scale.
I compare different min-cost flow algorithms, discuss edge cases, and describe the techniques I developed for Flowlessly to provide low algorithm runtime in all cases (§5.2). Next, I describe several scheduling features that were previously thought to be expensive or incompatible with min-cost flow-based schedulers (§5.3). Subsequently, I use one of these features to build a new scheduling policy that reduces interference on end-host network links (§5.4). Finally, I discuss the limitations min-cost flow schedulers have (§5.5).

In Chapter 6, I show that with my extensions, Firmament places tasks in hundreds of milliseconds on 12,500-machine clusters. Moreover, I show using a mixed workload that Firmament chooses better placements than state-of-the-art distributed and centralised schedulers.

Chapter 6

Firmament evaluation

Firmament must choose high-quality placements with low task placement latency to efficiently utilise large clusters. In this chapter, I evaluate whether Firmament meets these goals by running a range of large cluster simulations and experiments on a 40-machine cluster using real-world workloads. I first focus on evaluating Firmament's scalability (§6.2). Then, I compare Firmament's placements with those made by several state-of-the-art centralised and distributed schedulers (§6.3). In my experiments, I answer the following questions:

1. Does Firmament scale better than Quincy on large clusters when applying the same scheduling policy? (§6.2.1)
2. How well does Firmament handle challenging scheduling situations (e.g., oversubscribed clusters)? (§6.2.2)
3. What are Firmament's scalability limits when faced with a worst-case workload? (§6.2.3)
4. Does Firmament place future workloads, which comprise more short-running tasks, with low placement latency? (§6.2.4)
5. Does Firmament's scalability also improve placement quality? (§6.3.1)
6. How do Firmament's placements compare to other schedulers' placements on a real-world mix of short-running interactive tasks and long-running batch and service tasks? (§6.3.2)

6.1 Experimental setup and metrics

My experiments combine scale-up simulations with experiments on a local homogeneous testbed cluster.

Machine | Architecture | Cores | Threads | Clock | RAM
40× Dell R320 | Intel Xeon E5-2430Lv2 | 6 | 12 | 2.4 GHz | 64 GB PC3-12800

Table 6.1: Specifications of the machines in the local homogeneous cluster.

Figure 6.1: The metrics I measure, shown with respect to a task's scheduling stages (task submitted, graph updated, Flowlessly started, Flowlessly finished, task scheduled, task completed): the algorithm runtime, the task placement latency and the task response time.

In simulation experiments, I replay a public production workload trace from a 12,500-machine Google cluster [RTG+12] in a simulator I implemented for the Firmament scheduler. My simulator has a similar design to Borg's "Fauxmaster" [VPK+15, §3.1] and to Kubernetes's "Kubemark" [KBM17]: it merely stubs out RPCs and task execution, but otherwise runs the Firmament code and scheduling logic against simulated machines. However, there are three methodological limitations to note. First, the Google trace contains multi-dimensional resource requests for each task. Firmament supports multi-dimensional feasibility checking (as in Borg [VPK+15, §3.2]), but in order to compare fairly to Quincy, I use slot-based placement.
Second, I do not enforce task constraints for the same reason, despite them helping Firmament's min-cost flow solver to choose placements. Third, the Google trace does not contain information about job types or input sizes. I use the same priority-based job type classification as in Omega [SKA+13], and estimate batch task input sizes as a function of the known runtime according to typical industry distributions [CAK12]. In all the experiments in which I compare with Quincy, I use my implementation of Quincy's scheduling policy on Firmament, and run the cost scaling min-cost flow algorithm only, as in Quincy.

In local cluster experiments, I use the homogeneous 40-machine cluster I describe in Table 6.1. The machines are distributed across four racks and are connected by a 10G network in a leaf-spine topology. The core interconnect offers 320 Gbit/s of bandwidth. All machines run Ubuntu 14.04 (Trusty Tahr) with Linux kernel v3.14 and are included in a Hadoop File System (HDFS) deployment. The local cluster models a small to medium cluster, albeit in a fully controlled environment without external network traffic or background machine utilisation. I use the local cluster to measure Firmament's task placement quality.

6.1.1 Metrics

In Figure 6.1, I show the task metrics I focus on in my evaluation, and highlight how they relate to Firmament's scheduling stages. Algorithm runtime is the time it takes Flowlessly to run the best min-cost flow algorithm, and represents how much time a task spends being actively scheduled.

Task placement latency represents the time between task submission and task placement. This metric includes task wait time, the time Firmament spends updating the flow network, Flowlessly's runtime and the time it takes to extract task placements from the optimal flow. It is important to include task wait time in the latency metric because min-cost flow-based schedulers reconsider the entire existing workload each time they run a min-cost flow algorithm (see Section 2.3.3). These schedulers cannot place any tasks while the min-cost flow solver runs. Thus, tasks may have to wait for up to a min-cost flow solver run until they are considered by a solver.

I measure task response time1 to quantify Firmament's task placement quality. Task response time is the total time between task submission and task completion. It is an end-to-end metric that captures how quickly the scheduler places tasks and how good these placements are. High placement quality increases cluster utilisation and avoids performance degradation due to overcommit. However, high-quality placements may not decrease task response time if the scheduler cannot choose them with low placement latency. Poor placement quality, by contrast, decreases application-level performance (for long-running services), or increases task response time (for batch and interactive tasks).
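The three metrics can be read off a task's event timestamps directly; the sketch below, with hypothetical field names, mirrors the stages shown in Figure 6.1:

    from dataclasses import dataclass

    @dataclass
    class TaskEvents:
        submitted: float
        solver_started: float
        solver_finished: float
        scheduled: float
        completed: float

    def algorithm_runtime(t: TaskEvents) -> float:
        # Time spent inside the min-cost flow solver.
        return t.solver_finished - t.solver_started

    def task_placement_latency(t: TaskEvents) -> float:
        # Includes wait time, flow network updates, the solver run and
        # placement extraction.
        return t.scheduled - t.submitted

    def task_response_time(t: TaskEvents) -> float:
        # End-to-end metric: submission to completion.
        return t.completed - t.submitted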
6.2 Scalability

In my first set of experiments, I evaluate Firmament's scalability to large clusters. I compare Firmament's task placement latency to Quincy's (§6.2.1), I study how fast Flowlessly finds the optimal min-cost flow in challenging cluster situations (§6.2.2), I measure Firmament's scalability when faced with a worst-case workload (§6.2.3), and I evaluate how well Firmament can handle future workloads that comprise more short-running tasks (§6.2.4).

6.2.1 Scalability vs. Quincy

In Figure 2.19, I illustrated that Quincy fails to scale to clusters of thousands of machines at low task placement latencies. I now repeat the same experiment using Firmament on the full simulated 12,500-machine Google cluster. In comparison to Figure 2.19, I increase cluster slot utilisation to 90% (i.e., I decrease the number of slots per machine, but I do not change the cluster workload trace) to make the setup more challenging for Firmament. I also tune Quincy's cost scaling-based min-cost flow solver for best performance2.

1 Task response time is primarily meaningful for batch and interactive data processing tasks; long-running service tasks' response times are conceptually infinite, and in practice are determined by failures and operational decisions.
2 Specifically, I found that an α-factor parameter value of 9, rather than the default of 2 used in Quincy, improves runtime by ≈30%.

Figure 6.2: Firmament achieves 20× lower task placement latency than Quincy on a simulated 12,500-machine cluster at 90% slot utilisation, replaying the Google trace. The scheduling quality is unaffected.

In Figure 6.2, I show the results as a CDF of task placement latency. Quincy takes between 25 and 60 seconds to place tasks, whereas Firmament typically places tasks in hundreds of milliseconds and only exceeds a one-second placement latency in the 97th percentile. This is more than a 20× improvement over Quincy's placement latency without any reduction in placement quality. Firmament finds the same optimal placements as Quincy does, and scales to large clusters.

6.2.2 Scalability in extreme situations

In the previous experiment, Firmament chooses placements fast because it uses the relaxation algorithm, which handles the Google cluster workload well. However, in Subsection 5.2.2 I described a cluster setup in which relaxation does not work well. In this cluster setup, Firmament's solver, Flowlessly, automatically falls back to the incremental cost scaling algorithm if it is faster (§5.2.6). I now evaluate whether running two algorithms reduces Flowlessly's runtime, and consequently task placement latency.

In my experiment, I replay the 12,500-machine Google cluster trace, but I reduce the number of slots for each cluster machine to obtain high cluster slot utilisation. In my simulation, the average cluster slot utilisation is 97%, but the cluster also experiences transient periods of oversubscription. In Figure 6.3, I compare Flowlessly's automatic use of the fastest algorithm against using only one algorithm, either relaxation or cost scaling. During oversubscription (highlighted in grey), relaxation's runtime spikes to hundreds of seconds per run, while cost scaling alone completes in ≈30 seconds independent of cluster load.

Figure 6.3: Relaxation's runtime increases when the cluster is oversubscribed (grey areas). Firmament handles oversubscription well: it runs 2× faster than Quincy's cost scaling. Moreover, it recovers 500s earlier from the task backlog.

Flowlessly correctly falls back to incremental cost scaling (the algorithm that finishes first) in this situation.
Moreover, Flowlessly recovers earlier from the cluster oversubscription starting at 2,200s: re- laxation runtime returns to sub-second levels at around 3,700s, whereas Flowlessly returns at around 3,100s. Relaxation on its own takes longer to recover because no new task placements are chosen while the algorithm runs. The slots freed by tasks that complete are re-used only after the following solver runs complete, even though new waiting tasks accumulate. Thus, the scheduler under- utilises the cluster when it uses relaxation only, even though there is work to do. To sum up, my experiment shows that Flowlessly’s combination of algorithms outperforms ei- ther algorithm running alone. Firmament obtains higher cluster utilisation and offers lower task placement latency because it uses incremental cost scaling when the cluster is oversubscribed, and quickly returns to relaxation when utilisation decreases. 6.2.3 Scalability to short-running tasks Task throughput and task durations in the Google trace workload are challenging for Quincy, but they are insufficient to stress Firmament. I now investigate Firmament’s min-cost flow- based approach scalability limits in the absence of oversubscription. In order to find Firmament breaking point, I subject it to a worst-case workload consisting entirely of a huge number of short-running tasks. This experiment is similar to Sparrow’s breaking-point experiment for the centralised Spark scheduler [OWZ+13, Fig. 12]. 152 6.2. SCALABILITY 5000 4000 3000 2000 1000 0 Task duration [ms] 0 1000 2000 3000 4000 5000 Jo b re sp on se tim e [m s] Ideal Firmament Spark (a) 100 machines 5000 4000 3000 2000 1000 0 Task duration [ms] 0 1000 2000 3000 4000 5000 Jo b re sp on se tim e [m s] Ideal Firmament (b) 1000 machines Figure 6.4: Firmament fails to keep up when tasks are running for less than ≈5ms at 100-machine scale, and ≈375ms at 1,000-machine scale, at 80% cluster slot utilisation. In the experiment, I simulate in turn three clusters of increasing size: 100, 1,000 and 10,000 machines. I submit jobs composed of 10 tasks at an interarrival time that keeps the cluster at a constant load of 80% if there is no scheduler overhead. I measure job response time, which is the maximum of the ten task response times for a job. If tasks schedule immediately and as a wave, the job response time is equal to the tasks’ runtime. I designed this experiment to only measure scheduler scalability, thus I do not increase response time for tasks that are placed on highly utilised machines (i.e., tasks do not interfere). In Figure 6.4, I plot job response time as a function of decreasing task duration. As I reduce task duration, I also reduce job interarrival time to keep the load constant, hence increasing the task throughput faced by the scheduler. With an ideal scheduler, job response time would be equal to task duration because the scheduler would take no time to choose placements. But in practice, the higher scheduler’s task placement latency is, the sooner job response time deviates from the diagonal. The breaking point occurs when the scheduler’s placement latency is at least as high as it takes to receive sufficient tasks to utilise the remainder available resources (i.e., 20% of the cluster). At that point, the scheduler accumulates an ever growing backlog of unscheduled tasks. The experiment stresses both scheduler task placement throughput and task placement latency. 
Schedulers that provide high task placement throughput by batching tasks and amortising work may not keep up with the workload because of the low job interarrival times. Many tasks complete while the schedulers run, but batching schedulers only reuse freed resources after the next scheduler runs complete. By contrast, task-by-task schedulers (e.g., Spark's scheduler, Sparrow) are at an advantage in this experiment because they can quickly make a decision for each submitted task. Indeed, the Sparrow distributed scheduler achieves job response times that are close to ideal on the 100-machine cluster [OWZ+13, Fig. 12]. However, Spark's centralised task scheduler in 2013 had its breaking point on 100 machines at a 1.35-second task duration [OWZ+13, §7.6]. By contrast, even though the centralised Firmament scheduler runs a min-cost flow optimisation over the entire workload every time, Figure 6.4 shows that it keeps up with the workload and achieves near-ideal job response time down to task durations as low as 5ms (100 machines) or 375ms (1,000 machines). This makes Firmament's response time competitive with distributed schedulers on medium-sized clusters that only run short-running interactive analytics tasks.

At 10,000 machines, Firmament keeps up with task durations ≥5s. However, such large clusters do not typically run only short-running tasks, but a mix of long-running and short-running tasks [BEL+14; DDK+15; KRC+15; VPK+15]. I therefore next investigate Firmament's performance on a realistic mixed cluster workload.

6.2.4 Scalability to future workloads

In this experiment, I simulate the workload of a 12,500-machine cluster using the Google trace. However, I accelerate the trace by dividing all task runtimes and interarrival times by an acceleration factor, and thus simulate a future workload that consists of shorter batch tasks [OPR+13], while service jobs continue to be long-running. For example, with a 300× acceleration, the median batch task takes 1.4 seconds, and the 90th and 99th percentile batch tasks take 12 and 61 seconds.

Figure 6.5: Firmament can schedule a 300× accelerated Google workload, while using relaxation only achieves far poorer placement latencies in the tail.

In Figure 6.5, I compare the placement latency Firmament offers when using Flowlessly to the latency it offers when using individual min-cost flow algorithms. I measure placement latency across all tasks, and plot distributions (1st, 25th, 50th, 75th, 99th percentile, and maximum). As before, a single min-cost flow algorithm does not scale: cost scaling's placement latency already exceeds 30 seconds even with a 50× acceleration, and relaxation sees tail placement latencies well above 10 seconds beyond a 150× acceleration. By contrast, even at an acceleration of 300×, Firmament keeps up and places 75% of the tasks with sub-second latency. Hybrid schedulers [DDK+15; KRC+15; DDD+16] can probably support these future workloads, but in contrast to Firmament, they sacrifice placement quality for short-running tasks and cause interference for long-running tasks.

6.3 Placement quality

I now evaluate Firmament's placement quality. I first compare Firmament to Quincy and show that my scheduler not only offers lower task placement latency, but also increases the fraction of input data that is read locally (§6.3.1).
6.3 Placement quality

I now evaluate Firmament's placement quality. I first compare Firmament to Quincy and show that my scheduler not only offers lower task placement latency, but also increases the fraction of input data that is read locally (§6.3.1). I then evaluate Firmament on a mixed cluster workload and show that it outperforms three state-of-the-art centralised schedulers and one distributed scheduler (§6.3.2).

6.3.1 Improving data locality

Due to the scalability improvements I made, Firmament can use more complex scheduling policies that generate large flow networks. In this experiment, I evaluate the effect that increasing the number of arcs in the flow network has on placement quality and placement latency. I use the Google cluster trace to simulate a 12,500-machine cluster, and, as an illustrative example, I vary the data locality threshold in the Quincy scheduling policy. This threshold decides what fraction of a task's input data must reside on a machine or within a rack in order for the task to receive a preference arc to that machine or rack. Quincy originally picked a threshold that results in a maximum of ten arcs per task. However, in Figure 6.6b I show that even a higher threshold of 14% local data, which corresponds to at most seven preference arcs, yields algorithm runtimes of 20–40 seconds for Quincy's cost scaling.

A low threshold allows the scheduler to exploit more fine-grained locality, but increases the number of arcs in the flow network. Consequently, in Figure 6.6a, I show that if I lower the threshold to 2% local data, the percentage of locally read input data increases from 56% to 71%, which saves 4 TB of network traffic per simulated hour. Firmament is required to achieve this benefit: Figure 6.6b illustrates that when I use a 2% locality threshold in Quincy, the algorithm runtime doubles, while Firmament still achieves almost the same algorithm runtime as with a 14% threshold.

Figure 6.6: Firmament achieves 27% higher task data locality than Quincy (6.6a). This comes without any placement latency increase, because Firmament's min-cost flow solver has a smaller runtime even when I lower the task data locality preference threshold to 2% (6.6b). Panels: (a) % of data read locally at the 14% (Quincy) and 2% (Firmament) preference thresholds; (b) CDF of algorithm runtime [sec] for Firmament and Quincy at both thresholds.

6.3.2 Network-aware scheduling

I now evaluate how good Firmament's placements are compared to those of other centralised and distributed schedulers. I deploy Firmament on the local 40-machine homogeneous cluster (Table 6.1) to evaluate its performance with real cluster workloads. I run a workload of short-running interactive data processing tasks that take 3.5–5 seconds to complete on an otherwise idle cluster. Each task reads an input of 4–8 GB from a cluster-wide HDFS installation, and in this experiment I use the network-aware scheduling policy I developed, which generates flow networks with convex arc costs. This policy reflects current network bandwidth reservations, considers actual bandwidth usage in the flow network's costs, and strives to place tasks on machines with lightly-loaded network connections.

In Figure 6.7a, I show CDFs of the task response times I obtained using different cluster managers' schedulers. I compare against a baseline that runs each task one by one in isolation on an otherwise idle cluster and network. Firmament achieves the task response times closest to the idle (isolation) baseline in the tail (80th percentile upwards) because it successfully avoids overcommitting machines' network bandwidth. Other schedulers make random placements (Sparrow), are effectively random because they do not take network bandwidth into account (Mesos, Kubernetes), or perform simple load-spreading based on the number of running tasks (Docker SwarmKit).
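To give an intuition for how the network-aware policy prefers lightly-loaded machines, the sketch below computes a convex cost for a task-to-machine arc from the machine's current link usage. The quadratic shape, the 10 Gbps link capacity, and the weight are assumptions chosen for illustration; they are not the policy's exact cost terms.

```python
# Illustrative convex arc cost: placing a network-intensive task on a busy link
# becomes disproportionately expensive. Link capacity and weight are assumed values.

def network_arc_cost(current_bw_gbps, task_demand_gbps,
                     link_capacity_gbps=10.0, weight=100):
    """Cost of the task -> machine arc: grows convexly with the link utilisation
    the placement would cause, so lightly-loaded machines are preferred and
    overcommitment is heavily penalised."""
    utilisation = min((current_bw_gbps + task_demand_gbps) / link_capacity_gbps, 1.0)
    return int(weight * utilisation ** 2)


# Example: an idle 10G link vs. one already carrying 8 Gbps of background traffic.
print(network_arc_cost(0.0, 2.0))   # low cost  -> preferred placement
print(network_arc_cost(8.0, 2.0))   # high cost -> avoided if alternatives exist
```

Because the cost grows super-linearly as a link approaches saturation, placements that would overcommit a machine's bandwidth become far more expensive than spreading the load across lightly-loaded machines.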
Real-world clusters, however, run a mix of short interactive data processing tasks, long-running services, and batch processing tasks (§2.1). I therefore extend the workload with long-running batch and service jobs to represent a similar mix. The long-running batch tasks are generated by fourteen iperf clients that communicate using UDP with seven iperf servers. Each iperf client generates 4 Gbps of sustained network traffic and simulates the network pattern of a machine learning (e.g., TensorFlow [ABC+16]) worker that communicates with a parameter server (i.e., one of the iperf servers), in a higher-priority network service class than the short batch tasks. Finally, I deploy three nginx web servers and seven HTTP clients as long-running service jobs. I run the cluster at ≈80% network utilisation, and again measure the task response time for the interactive data processing tasks.

Figure 6.7: On a local 40-node cluster, Firmament reduces the task response time of short batch tasks in the tail using a network-aware scheduling policy, both (a) without and (b) with background traffic; note the different x-axis scales. Panels show CDFs of task response time [sec] for Idle (isolation), Firmament, Docker SwarmKit, Kubernetes, Mesos, and Sparrow: (a) short interactive data processing tasks on an otherwise idle cluster and network, where the overhead over "idle" is due to network contention; (b) the same tasks with background traffic from long-running batch and service tasks.

In Figure 6.7b, I show that Firmament's network-aware scheduling policy substantially improves the tail of the task response time distribution of the interactive data processing tasks. For example, Firmament's 99th percentile response time is 3.4× shorter than Docker SwarmKit's and Kubernetes', and 6.2× shorter than Sparrow's. The tail matters, since the highest task response time often determines a batch job's overall response time (the "straggler" problem).

6.4 Summary

In this chapter, I investigated Firmament's placement quality and its scalability to large clusters. My experiments show that Firmament:

1. Scales to large clusters: Firmament maintains the same placement quality as Quincy and achieves sub-second task placement latency in the common case on a 12,500-machine cluster (§6.2.1).

2. Copes well with extreme cluster situations: Firmament's min-cost flow solver completes in at most 17 seconds; by contrast, the individual min-cost flow algorithms either take at least 25 seconds in any situation (cost scaling), or fail to keep up in challenging situations (e.g., relaxation in oversubscribed clusters) (§6.2.2).

3. Scales to workloads comprising only short-running tasks: Firmament has task placement latency comparable to the Sparrow distributed scheduler on a 1,000-machine cluster, and it keeps up with workloads of 375ms-long tasks. However, Firmament fails to keep up with task lengths below 5s on 10,000-machine clusters (§6.2.3).
4. Efficiently handles future workloads: Firmament places tasks with low latency even on a 300× accelerated Google workload from a production cluster with a 1.4-second median task duration (§6.2.4).

5. Chooses better placements than Quincy: Firmament can generate large flow networks that contain more preference arcs than the flow networks Quincy generates, because Flowlessly can quickly find solutions. Thus, Firmament increases the percentage of locally read input data from 56% (with Quincy) to 71% (§6.3.1).

6. Outperforms centralised and distributed schedulers: Firmament reduces the response time of short-running, network-intensive interactive tasks by up to 6× compared to other schedulers on a cluster running a real-world mixed task workload (§6.3.2).

To sum up, Firmament chooses placements as good as those of advanced centralised schedulers, but at the speed and scale typically associated with distributed schedulers. However, there are workloads and situations in which Firmament may not choose high-quality placements. In Chapter 7, I discuss these situations, Firmament's limitations, and how they can be addressed.

Chapter 7

Conclusions and future work

In this dissertation, I described a novel approach for decoupling data processing workflow specification from execution in large clusters, and I improved a centralised cluster scheduler to choose high-quality task placements with low placement latency at scale.

• In Chapter 3, I presented Musketeer, a workflow manager that decouples front-end frameworks from back-end execution engines. Musketeer translates workflows expressed in front-end frameworks into an intermediate representation based on relational algebra operators. It then applies optimisations to the intermediate representation and uses several techniques, such as operator merging and type inference, to generate efficient code. Musketeer can also automatically map workflows to back-ends using either an optimal but slow exhaustive search algorithm, or a fast heuristic that chooses good mappings.

• In Chapter 4, I showed that Musketeer automatically generates efficient code that is never more than 30% slower than hand-written, optimised workflow implementations. I also demonstrated that Musketeer can reduce workflows' makespan by mapping them to suitable back-ends or combinations of back-ends. Finally, I evaluated Musketeer's automatic mapping mechanism and showed that, for the workflows I test, it chooses mappings that are within 10% of the best mapping when it has access to the entire workflow history.

• In Chapter 5, I extended Firmament, a centralised min-cost flow scheduler, to choose high-quality placements, scale to large clusters, and choose placements with latency typically associated with distributed schedulers. To achieve this, I developed Flowlessly, a min-cost flow solver that combines multiple algorithms, incrementalises them, and uses a range of techniques to speed up min-cost flow optimisations for cluster scheduling.

• In Chapter 6, I showed, using a trace of a 12,500-machine cluster, that Firmament places tasks within hundreds of milliseconds in the common case. I also demonstrated, using a 250× accelerated cluster trace, that Firmament can place tasks with low scheduling latency even for future workloads that comprise many tasks that complete within seconds.
Finally, I showed that Firmament chooses higher-quality placements for a mixed workload than state-of-the-art centralised and distributed schedulers on a local 40-machine cluster.

Collectively, these chapters serve to prove the thesis I stated in Chapter 1. First, a new data processing architecture that decouples front-end frameworks from back-end execution engines is possible. With my Musketeer proof-of-concept, I have demonstrated that the decoupled data processing architecture can be implemented, and that: (i) complex workflows can flexibly execute on multiple back-ends without developer effort, (ii) workflows can be automatically ported to several back-ends, and (iii) workflows can execute more efficiently if multiple back-ends are combined. Second, my extensions to the Firmament scheduler prove two points: (i) centralised min-cost flow schedulers can be fast if the best min-cost flow algorithms are used and incrementalised, and (ii) min-cost flow schedulers can support desirable scheduling features (e.g., complex constraints, task co-location interference avoidance), which help them choose high-quality placements.

My work is only a first step towards improving data processing efficiency and flexibility in large data centres. In the remainder of this chapter, I discuss possible extensions to Musketeer and Firmament.

7.1 Extending Musketeer

With Musketeer, I demonstrated that it is possible to implement the decoupled data processing architecture. However, more work is required to make Musketeer an alternative to production-ready, state-of-the-art data processing systems. In the following paragraphs, I discuss two areas in which Musketeer could be improved.

Expressing workflows. Musketeer currently expects users to define graph processing workflows in either Lindi, BEER, or its GAS DSL. However, many vertex-centric graph processing systems do not constrain users to express workflows only as sequences of relational operators, but allow them to provide arbitrary per-vertex code implemented in high-level programming languages (e.g., Java for Giraph, C++ for GraphChi). Musketeer uses user-defined functions (UDFs) to run such workflows. This limits Musketeer's choice of back-ends that can execute the user-provided code, and it may cause Musketeer to generate code that uses inefficient foreign-function interfaces. In the future, Musketeer's support for user-defined functions could be improved: Musketeer could use source-to-source compilers to translate UDFs into each back-end's programming language of choice, or it could automatically translate UDFs into relational operators using constraint-based synthesis [IZ10; CSM13].

Back-end mapping cost function. The cost function that lies at the core of Musketeer's automatic DAG partitioning algorithms could be refined. The cost function mainly uses signals that are statically computed before runtime (e.g., operator processing parameters computed in one-off calibration experiments), or that are only updated upon job completion (e.g., intermediate data sizes, historical performance). The cost function could better predict makespans if it were to consider resource utilisation at the nodes on which the back-ends execute. Similarly, the cost function could reduce workflow makespan if it were to monitor back-ends' state and take into account current job queue length when scoring sub-DAGs. Since it is common for back-ends to have long queues of jobs waiting to execute [ZBS+10], choosing a nominally suboptimal back-end may sometimes reduce workflow makespan if that back-end is less busy.
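As a rough illustration of this direction, the sketch below scores a sub-DAG on a back-end by combining a statically calibrated runtime estimate with a queue-delay term. The field names, the linear form, and the weights are assumptions for illustration; they are not Musketeer's actual cost function.

```python
# Hypothetical queue-aware back-end scoring function; names and weights are
# illustrative assumptions, not Musketeer's cost model.
from dataclasses import dataclass


@dataclass
class BackendState:
    calibrated_runtime_s: float  # statically calibrated sub-DAG runtime estimate
    queued_jobs: int             # jobs currently queued on this back-end
    mean_queue_wait_s: float     # historical mean wait per queued job


def score_subdag(state: BackendState, queue_weight: float = 1.0) -> float:
    """Lower is better: predicted contribution to workflow makespan, including
    the expected queueing delay on this back-end."""
    expected_wait = state.queued_jobs * state.mean_queue_wait_s
    return state.calibrated_runtime_s + queue_weight * expected_wait


# Example: a nominally faster but busy back-end loses to a slower, idle one.
busy = BackendState(calibrated_runtime_s=120, queued_jobs=10, mean_queue_wait_s=60)
idle = BackendState(calibrated_runtime_s=200, queued_jobs=0, mean_queue_wait_s=60)
print(min((score_subdag(b), name) for b, name in [(busy, "busy"), (idle, "idle")]))
```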
7.2 Improving Firmament

Firmament is a centralised min-cost flow scheduler that chooses high-quality placements with low latency at scale. However, Firmament's placement quality could be further improved if: (i) the development of cost functions for setting flow network arc costs were improved and automated, and (ii) Flowlessly were extended to support generalised min-cost flow algorithms, which are required to model desirable scheduling features in flow networks.

Improving scheduling policies. Firmament's scheduling policies set minimum arc flow requirements and maximum arc capacities to restrict flow paths, and set arc costs to guide task flow supply via the nodes of the best-suited machines. Scheduling policies use weighted linear functions to obtain arc costs and to combine different goals (e.g., high task data locality and low task unscheduled time). The components and weights of the cost functions are defined by scheduling policy developers, who use the knowledge they gain from experimentation. However, it is unlikely that developers can run sufficient experiments to develop cost functions that accurately predict how well-suited machines are in all situations. An interesting next step would be to study whether these functions could be refined using auto-tuners. The auto-tuners could run benchmarks on clusters and adjust cost weights accordingly. Moreover, developers could provide contextual information in the form of a probabilistic model of the workload's behaviour (e.g., the likelihood of different types of tasks interfering) to reduce auto-tuning time [DSY17].

Extending Flowlessly. Desirable scheduling features such as complex constraints and gang scheduling are supported by only a few schedulers, which usually trade off task placement latency for these features. Min-cost flow schedulers can provide these features, but they have to solve the more complex generalised min-cost flow optimisation instead of the traditional min-cost flow optimisation. Polynomial combinatorial algorithms for the generalised min-cost flow optimisation exist [Way99], but, to my knowledge, no solver currently implements them. It would be interesting to implement these algorithms in Flowlessly, to adjust them to incrementally recompute the optimal solution, and to develop heuristics specific to the flow networks generated by scheduling policies.

7.3 Summary

As applications increasingly rely on large-scale data analytics workflows to provide high-quality services, the workloads executed by data processing systems and cluster managers are ever more diverse. In this dissertation, I have made the case that workflow specification ought to be decoupled from execution in order to flexibly adjust to changing workloads, and that high-quality placement decisions can be made with low placement latency in order to execute these workloads efficiently at scale. With Musketeer, I demonstrated that it is possible to decouple data processing systems, and with Firmament I showed that there is no need to trade off placement quality for low placement latency; Firmament provides both at scale.

Bibliography

[AAK+11] Ganesh Ananthanarayanan, Sameer Agarwal, Srikanth Kandula, Albert Greenberg, Ion Stoica, Duke Harlan, and Ed Harris. "Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters".
In: Proceedings of the 6th Euro- pean Conference on Computer Systems (EuroSys). Salzburg, Austria, Apr. 2011, pp. 287–300 (cited on pages 49, 138). [ABB+13] Tyler Akidau, Alex Balikov, Kaya Bekirog˘lu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, et al. “MillWheel: Fault-tolerant Stream Processing at Internet Scale”. In: Proceedings of the VLDB Endowment 6.11 (Aug. 2013), pp. 1033–1044 (cited on page 28). [ABC+15] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández- Moctezuma, Reuven Lax, Sam McVeety, et al. “The Dataflow Model: A Prac- tical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing”. In: Proceedings of the VLDB En- dowment 8.12 (Aug. 2015), pp. 1792–1803 (cited on page 66). [ABC+16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. “TensorFlow: A system for large-scale machine learning”. In: Proceedings of the 12th USENIX Symposium on Operating Sys- tems Design and Implementation (OSDI). Savannah, Georgia, USA, Nov. 2016 (cited on pages 50, 155). [ABE+14] Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Frey- tag, Fabian Hueske, Arvid Heise, Odej Kao, et al. “The Stratosphere platform for big data analytics”. In: Proceedings of the VLDB Endowment 23.6 (2014), pp. 939–964 (cited on page 32). [AGO+92] Ravindra K. Ahuja, Andrew V. Goldberg, James B. Orlin, and Robert E. Tarjan. “Finding minimum-cost flows by double scaling”. In: Mathematical program- ming 53.1-3 (1992), pp. 243–266 (cited on page 120). [AGS+11] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. “Disk- locality in Datacenter Computing Considered Irrelevant”. In: Proceedings of the 13th USENIX Workshop on Hot Topics in Operating Systems (HotOS). Napa, Cal- ifornia, USA, May 2011, pp. 12–17 (cited on page 49). 163 164 BIBLIOGRAPHY [AGS+13] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. “Effective Straggler Mitigation: Attack of the Clones”. In: Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI). Lom- bard, IL, USA, 2013, pp. 185–198 (cited on page 49). [AH00] Ron Avnur and Joseph M. Hellerstein. “Eddies: Continuously Adaptive Query Processing”. In: Proceedings of the 2000 ACM SIGMOD International Confer- ence on Management of Data (SIGMOD). Dallas, Texas, USA, 2000, pp. 261– 272 (cited on page 66). [AKB+12a] Sameer Agarwal, Srikanth Kandula, Nico Bruno, Ming-Chuan Wu, Ion Stoica, and Jingren Zhou. “Reoptimizing Data Parallel Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI). San Jose, California, USA, 2012, pp. 281–294 (cited on page 75). [AKB+12b] Sameer Agarwal, Srikanth Kandula, Nico Bruno, Ming-Chuan Wu, Ion Stoica, and Jingren Zhou. “Re-Optimizing data-parallel computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI). San Jose, California, USA, Apr. 2012, pp. 281–294 (cited on pages 84, 113). [ALI17] Alibaba cluster trace. http://github.com/alibaba/clusterdata/; accessed 15/12/2017. Alibaba Inc. (cited on page 47). [AMO93] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network flows: theory, algorithms, and applications. Prentice Hall, 1993 (cited on pages 117– 118, 120). [AU79] Alfred V. Aho and Jeffrey D. Ullman. “Universality of Data Retrieval Languages”. 
In: Proceedings of the 6th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages (POPL). San Antonio, Texas, 1979, pp. 110–119 (cited on page 74). [AXL+15] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, et al. “Spark SQL: Relational Data Processing in Spark”. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD). Melbourne, Victoria, Australia, 2015, pp. 1383– 1394 (cited on pages 15, 27, 36, 67). [Bac15] Emilian D. Bacila. “Planchet: Decoupling graph computations with user-defined functions”. Computer Science Tripos Part II Dissertation. University of Cam- bridge Computer Laboratory, May 2015 (cited on page 77). [BBD05] Shivnath Babu, Pedro Bizarro, and David DeWitt. “Proactive Re-optimization”. In: Proceedings of the 2005 ACM SIGMOD International Conference on Man- agement of Data (SIGMOD). Baltimore, Maryland, 2005, pp. 107–118 (cited on page 66). BIBLIOGRAPHY 165 [BBJ+14] Yingyi Bu, Vinayak Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. “Pregelix: Big(Ger) Graph Analytics on a Dataflow Engine”. In: Proceedings of the VLDB Endowment 8.2 (Oct. 2014), pp. 161–172 (cited on pages 37, 73, 75, 80). [BCF+13] Arka A. Bhattacharya, David Culler, Eric Friedman, Ali Ghodsi, Scott Shenker, and Ion Stoica. “Hierarchical Scheduling for Diverse Datacenter Workloads”. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC). Santa Clara, California, Oct. 2013, 4:1–4:15 (cited on page 52). [BCG+11] Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Ver- nica. “Hyracks: A flexible and extensible foundation for data-intensive comput- ing”. In: Proceedings of the 27th IEEE International Conference on Data Engi- neering (ICDE). Apr. 2011, pp. 1151–1162 (cited on pages 27, 30, 32, 37). [BCH13] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. “The Datacenter as a Com- puter: An Introduction to the Design of Warehouse-Scale Machines, Second edi- tion”. In: Synthesis Lectures on Computer Architecture 8.3 (July 2013), pp. 1–154 (cited on page 44). [BEL+14] Eric Boutin, Jaliya Ekanayake, Wei Lin, Bing Shi, Jingren Zhou, Zhengping Qian, Ming Wu, et al. “Apollo: Scalable and Coordinated Scheduling for Cloud- Scale Computing”. In: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Broomfield, Colorado, USA, Oct. 2014, pp. 285–300 (cited on pages 24, 51–52, 54–55, 57, 68, 123, 153). [BHB+10] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. “HaLoop: Efficient Iterative Data Processing on Large Clusters”. In: Proceedings of the VLDB Endowment 3.1-2 (Sept. 2010), pp. 285–296 (cited on pages 29–30). [BKV08] Robert M. Bell, Yehuda Koren, and Chris Volinsky. The BellKor solution to the Netflix prize. Technical report. AT&T Bell Labs, 2008 (cited on pages 94–95). [BT88a] Dimitri P. Bertsekas and Paul Tseng. “Relaxation Methods for Minimum Cost Ordinary and Generalized Network Flow Problems”. In: Operations Research 36.1 (Feb. 1988), pp. 93–114 (cited on page 119). [BT88b] Dimitri P. Bertsekas and Paul Tseng. “The Relax codes for linear minimum cost network flow problems”. In: Annals of Operations Research 13.1 (Dec. 1988), pp. 125–190 (cited on page 119). [CAB+12] Yanpei Chen, Sara Alspaugh, Dhruba Borthakur, and Randy Katz. “Energy Effi- ciency for Large-scaleMapReduceWorkloads with Significant Interactive Analy- sis”. In: Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys). 
Bern, Switzerland, Apr. 2012, pp. 43–56 (cited on pages 45, 47). 166 BIBLIOGRAPHY [CAK12] Yanpei Chen, Sara Alspaugh, and Randy Katz. “Interactive Analytical Processing in Big Data Systems: A Cross-industry Study of MapReduce Workloads”. In: Proceedings of the VLDB Endowment 5.12 (Aug. 2012), pp. 1802–1813 (cited on pages 38, 67, 74, 148). [CBB+13] Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Gri- jincu, Tom Jackson, Sandhya Kunnatur, et al. “Unicorn: A System for Searching the Social Graph”. In: Proceedings of the VLDB Endowment 6.11 (Aug. 2013), pp. 1150–1161 (cited on page 34). [CJL+08] Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Si- mon Weaver, and Jingren Zhou. “SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets”. In: Proceedings of the VLDB Endowment 1.2 (Aug. 2008), pp. 1265–1276 (cited on pages 27, 36, 68). [CLL+11] Biswapesh Chattopadhyay, Liang Lin,Weiran Liu, SagarMittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon, et al. “Tenzing: A SQL Implementation On The MapReduce Framework”. In: Proceedings of the 37th International Confer- ence on Very Large Data Bases (VLDB). Seattle, Washington, USA, Aug. 2011, pp. 1318–1327 (cited on pages 27, 36, 68). [Cod70] Edgar F. Codd. “A Relational Model of Data for Large Shared Data Banks”. In: Communications of the ACM 13.6 (June 1970), pp. 377–387 (cited on pages 36, 66, 74). [CRP+10] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. “FlumeJava: Easy, Efficient Data-parallel Pipelines”. In: Proceedings of the 2010 ACM SIGPLAN Confer- ence on Programming Language Design and Implementation (PLDI). Toronto, Ontario, Canada, June 2010, pp. 363–375 (cited on pages 27, 36, 67, 75, 87). [CSC+15] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. “PowerLyra: Differenti- ated Graph Computation and Partitioning on Skewed Graphs”. In: Proceedings of the 10th ACM European Conference on Computer Systems (EuroSys). Bordeaux, France, Apr. 2015 (cited on page 35). [CSM13] Alvin Cheung, Armando Solar-Lezama, and SamuelMadden. “Optimizing Database- backed Applications with Query Synthesis”. In: Proceedings of the 2013 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Seattle, Washington, USA, 2013, pp. 3–14 (cited on pages 68, 160). [CWI+16] Guoqiang Jerry Chen, Janet L. Wiener, Shridhar Iyer, Anshul Jaiswal, Ran Lei, Nikhil Simha, Wei Wang, et al. “Realtime Data Processing at Facebook”. In: Pro- ceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD). San Francisco, California, USA, 2016, pp. 1087–1098 (cited on page 28). BIBLIOGRAPHY 167 [DDD+16] Pamela Delgado, Diego Didona, Florin Dinu, andWilly Zwaenepoel. “Job-Aware Scheduling in Eagle: Divide and Stick to Your Probes”. In: Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC). Santa Clara, California, USA, Oct. 2016 (cited on pages 47, 52, 55–57, 154). [DDK+15] Pamela Delgado, Florin Dinu, Anne-Marie Kermarrec, and Willy Zwaenepoel. “Hawk: Hybrid Datacenter Scheduling”. In: Proceedings of the USENIX An- nual Technical Conference. Santa Clara, California, USA, July 2015, pp. 499– 510 (cited on pages 46, 52, 56–57, 153–154). [Den68] Jack B. Dennis. “Programming generality, parallelism and computer architec- ture”. In: 1968 (cited on page 29). [DG08] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”. In: Communications of the ACM 51.1 (Jan. 
2008), pp. 107– 113 (cited on pages 15, 26–27, 29, 33, 46, 67). [DK13] Christina Delimitrou and Christos Kozyrakis. “Paragon: QoS-aware Schedul- ing for Heterogeneous Datacenters”. In: Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operat- ing Systems (ASPLOS). Houston, Texas, USA, Mar. 2013, pp. 77–88 (cited on pages 16, 44–45, 52, 125). [DK14] Christina Delimitrou and Christos Kozyrakis. “Quasar: Resource-Efficient and QoS-Aware Cluster Management”. In: Proceedings of the 18th International Con- ference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, Utah, USA, Mar. 2014 (cited on pages 16, 45, 47, 51–52). [DSK15] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. “Tarcil: Recon- ciling Scheduling Speed and Quality in Large Shared Clusters”. In: Proceedings of the 6th ACM Symposium on Cloud Computing (SoCC). Kohala Coast, Hawaii, USA, Aug. 2015, pp. 97–110 (cited on pages 16, 52, 54, 57, 61, 142). [DSY17] Valentin Dalibard, Michael Schaarschmidt, and Eiko Yoneki. “BOAT: Building Auto-Tuners with Structured Bayesian Optimization”. In: Proceedings of the 26th International Conference on World Wide Web. Perth, Australia: International World Wide Web Conferences Steering Committee, 2017, pp. 479–488 (cited on page 161). [EK72] Jack Edmonds and Richard M. Karp. “Theoretical Improvements in Algorith- mic Efficiency for Network Flow Problems”. In: Journal of the ACM 19.2 (Apr. 1972), pp. 248–264 (cited on page 120). 168 BIBLIOGRAPHY [FBK+12] Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. “Jockey: Guaranteed Job Latency in Data Parallel Clusters”. In: Pro- ceedings of the 7th ACM European Conference on Computer Systems (EuroSys). Bern, Switzerland, Apr. 2012, pp. 99–112 (cited on pages 48, 52). [FF57] Lester R. Ford and Delber R. Fulkerson. “A primal-dual algorithm for the capac- itated Hitchcock problem”. In: Naval Research Logistics Quarterly 4.1 (1957), pp. 47–54 (cited on page 119). [FLN16] Apache Software Foundation. Apache Flink. http://flink.apache.org; accessed 10/11/2016 (cited on pages 27, 30, 32). [FM06] Antonio Frangioni and Antonio Manca. “A Computational Study of Cost Reop- timization for Min-Cost Flow Problems”. In: INFORMS Journal on Computing 18.1 (2006), pp. 61–70 (cited on page 121). [GGS+15] Ionel Gog, Jana Giceva, Malte Schwarzkopf, Kapil Vaswani, Dimitrios Vytinio- tis, Ganesan Ramalingam, Manuel Costa, et al. “Broom: sweeping out Garbage Collection from Big Data systems”. In: Proceedings of the 15th USENIX/SIGOPS Workshop on Hot Topics in Operating Systems (HotOS). Kartause Ittingen, Switzer- land, May 2015 (cited on page 20). [GIA17] Ionel Gog, Michael Isard, and Martín Abadi. Falkirk: Rollback Recovery for Dataflow Systems. In submission. 2017 (cited on page 20). [GIR16] Apache Software Foundation. Apache Giraph. http://giraph.apache. org; accessed 14/11/2016 (cited on pages 27, 30, 33–34). [GJS76] M. R. Garey, D. S. Johnson, and L. Stockmeyer. “Some simplified NP-complete graph problems”. In: Theoretical Computer Science 1.3 (1976), pp. 237–267 (cited on page 86). [GK93] Andrew V. Goldberg and Michael Kharitonov. “On Implementing Scaling Push- Relabel Algorithms for the Minimum-Cost Flow Problem”. In: Network Flows and Matching: First DIMACS Implementation Challenge. Edited by D.S. John- son and C.C. McGeoch. DIMACS series in discrete mathematics and theoretical computer science. 
American Mathematical Society, 1993 (cited on page 120). [GKR+16] Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. “GRAPHENE: Packing and Dependency-Aware Scheduling for Data- Parallel Clusters”. In: Proceedings of the 12th USENIX Symposium on Operat- ing Systems Design and Implementation (OSDI). Savannah, Georgia, USA, Nov. 2016, pp. 81–97 (cited on page 43). [Gle15] Adam Gleave. “Fast and accurate cluster scheduling using flow networks”. Com- puter Science Tripos Part II Dissertation. University of Cambridge Computer Laboratory, May 2015 (cited on page 18). BIBLIOGRAPHY 169 [GLG+12] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs”. In: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Hollywood, California, USA, Oct. 2012, pp. 17–30 (cited on pages 15, 26–27, 30, 33–35, 71). [GOF16] Jeff Dean. Software Engineering Advice from Building Large-Scale Distributed Systems. http://static.googleusercontent.com/media/resea rch.google.com/en//people/jeff/stanford-295-talk.pdf; accessed 13/11/2016 (cited on page 28). [Gol97] Andrew V. Goldberg. “An Efficient Implementation of a Scaling Minimum-Cost Flow Algorithm”. In: Journal of Algorithms 22.1 (1997), pp. 1–29 (cited on pages 61, 120, 134, 139). [GSC+15] Ionel Gog, Malte Schwarzkopf, Natacha Crooks, Matthew P. Grosvenor, Allen Clement, and Steven Hand. “Musketeer: all for one, one for all in data processing systems”. In: Proceedings of the 10th ACM European Conference on Computer Systems (EuroSys). Bordeaux, France, Apr. 2015 (cited on pages 18–19). [GSG+15] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, AndrewW. Moore, Steven Hand, and Jon Crowcroft. “Queues don’t matter when you can JUMP them!” In: Proceedings of the 12th USENIX Symposium on Net- worked Systems Design and Implementation (NSDI). Oakland, California, USA, May 2015 (cited on pages 20, 34, 49, 143). [GSG+16] Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, and Steven Hand. “Firmament: fast, centralized cluster scheduling at scale”. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI). Savannah, Georgia, USA, 2016, pp. 99–115 (cited on pages 18–19, 51). [GSW15] Andrey Goder, Alexey Spiridonov, and Yin Wang. “Bistro: Scheduling Data- Parallel Jobs Against Live Production Systems”. In: Proceedings of the USENIX Annual Technical Conference. Santa Clara, California, USA, July 2015, pp. 459– 471 (cited on pages 51–54). [GT89] Andrew V. Goldberg and Robert E. Tarjan. “Finding Minimum-cost Circula- tions by Canceling Negative Cycles”. In: Journal of the ACM 36.4 (Oct. 1989), pp. 873–886 (cited on page 118). [GT90] Andrew V. Goldberg and Robert E. Tarjan. “Finding Minimum-Cost Circulations by Successive Approximation”. In: Mathematics of Operations Research 15.3 (Aug. 1990), pp. 430–466 (cited on page 120). 170 BIBLIOGRAPHY [GXD+14] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. “GraphX: Graph Processing in a Distributed Dataflow Framework”. In: Proceedings of the 11th USENIX Symposium on Operating Sys- tems Design and Implementation (OSDI). Broomfield, Colorado, USA, Oct. 2014, pp. 599–613 (cited on pages 27, 37, 75, 80). [GZS+13] Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. “Choosy: max-min fair sharing for datacenter jobs with constraints”. 
In: Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys). Prague, Czech Republic, Apr. 2013, pp. 365–378 (cited on pages 51–52). [HAD16] Apache Software Foundation. Apache Hadoop. http://hadoop.apache. org/; accessed 13/11/2016 (cited on pages 29–30, 41). [HBB+12] Alexander Hall, Olaf Bachmann, Robert Büssow, Silviu Ga˘nceanu, and Marc Nunkesser. “Processing a Trillion Cells Per Mouse Click”. In: Proceedings of the VLDB Endowment 5.11 (July 2012), pp. 1436–1446 (cited on page 35). [HCS+12] Sungpack Hong, Hassan Chafi, Edic Sedlar, and Kunle Olukotun. “Green-Marl: A DSL for Easy and Efficient Graph Analysis”. In: Proceedings of the Seven- teenth International Conference on Architectural Support for Programming Lan- guages and Operating Systems (ASPLOS). London, England, United Kingdom, 2012, pp. 349–362 (cited on page 71). [HIV16] Ashish Thusoo. Hive - A Petabyte Scale Data Warehouse using Hadoop. htt ps : / / www . facebook . com / notes / facebook - engineering / hive-a-petabyte-scale-data-warehouse-using-hadoop/ 89508453919/; accessed 28/11/2016 (cited on page 68). [HKZ+11] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, et al. “Mesos: A platform for fine-grained resource sharing in the data center”. In: Proceedings of the 8th USENIX Confer- ence on Networked Systems Design and Implementation (NSDI). Boston, Mas- sachusetts, USA, Mar. 2011, pp. 295–308 (cited on pages 43, 47, 49, 52, 54). [IBY+07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. “Dryad: Distributed Data-parallel Programs from Sequential Building Blocks”. In: Pro- ceedings of the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys). Lisbon, Portugal, Mar. 2007, pp. 59–72 (cited on pages 27, 30, 32). [IPC+09] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. “Quincy: fair scheduling for distributed computing clusters”. In: Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP). Big Sky, Montana, USA, Oct. 2009, pp. 261–276 (cited on pages 17, 46, 49–54, 58–60, 112, 136–137). BIBLIOGRAPHY 171 [IZ10] Ming-Yee Iu and Willy Zwaenepoel. “HadoopToSQL: A MapReduce Query Op- timizer”. In: Proceedings of the 5th ACM European Conference on Computer Systems (EuroSys). Paris, France, 2010, pp. 251–264 (cited on pages 68, 160). [KAA+13] Zuhair Khayyat, Karim Awara, Amani Alonazi, Hani Jamjoom, Dan Williams, and Panos Kalnis. “Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing”. In: Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys). Prague, Czech Republic, 2013, pp. 169–182 (cited on page 74). [KBF+15] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, et al. “Twitter Heron: Stream Process- ing at Scale”. In: Proceedings of the 2015 ACM SIGMOD International Confer- ence on Management of Data (SIGMOD). Melbourne, Victoria, Australia, 2015, pp. 239–250 (cited on page 28). [KBG12] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. “GraphChi: Large-Scale Graph Computation on Just a PC”. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI). Hollywood, California, USA, 2012, pp. 31–46 (cited on pages 26–27, 30, 33, 35, 71). [KBM17] Cloud Native Computing Foundation. Kubernetes Kubemark. 
https://git hub.com/kubernetes/community/blob/master/contributo rs/design-proposals/kubemark.md; accessed 28/06/2017 (cited on page 148). [KD98] Navin Kabra and David J. DeWitt. “Efficient Mid-query Re-optimization of Sub- optimal Query Execution Plans”. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD). Seattle, Washing- ton, USA, 1998, pp. 106–117 (cited on page 66). [KIY13] Qifa Ke, Michael Isard, and Yuan Yu. “Optimus: A Dynamic Rewriting Frame- work for Data-parallel Execution Plans”. In: Proceedings of the 8th ACM Eu- ropean Conference on Computer Systems (EuroSys). Prague, Czech Republic, 2013, pp. 15–28 (cited on pages 67, 74–75). [KK12] Zoltán Király and P. Kovács. “Efficient implementations of minimum-cost flow algorithms”. In: Acta Universitatis Sapientiae 4.1 (2012), pp. 67–118 (cited on pages 121, 131). [KL70] Brian W. Kernighan and Shen Lin. “An efficient heuristic procedure for partition- ing graphs”. In: Bell System Technical Journal 49.2 (1970), pp. 291–307 (cited on page 86). [Kle67] Morton Klein. “A Primal Method for Minimal Cost Flows with Applications to the Assignment and Transportation Problems”. In: Management Science 14.3 (1967), pp. 205–220 (cited on page 118). 172 BIBLIOGRAPHY [KPX+11] Qifa Ke, Vijayan Prabhakaran, Yinglian Xie, Yuan Yu, Jingyue Wu, and Junfeng Yang. “Optimizing Data Partitioning for Data-Parallel Computing.” In: Proceed- ings of the 13th USENIX Workshop on Hot Topics in Operating Systems (HotOS). Napa, California, USA, May 2011 (cited on page 70). [KRC+15] Konstantinos Karanasos, Sriram Rao, Carlo Curino, Chris Douglas, Kishore Chali- parambil, Giovanni Matteo Fumarola, Solom Heddaya, et al. “Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters”. In: Proceed- ings of the USENIX Annual Technical Conference. Santa Clara, California, USA, July 2015, pp. 485–497 (cited on pages 46, 52, 56, 153–154). [KUB16] Cloud Native Computing Foundation. Kubernetes. http://k8s.io; accessed 14/09/2016 (cited on pages 43, 47). [LA04] Chris Lattner and Vikram Adve. “LLVM: a compilation framework for lifelong program analysis transformation”. In: International Symposium on Code Gen- eration and Optimization (CGO). Mar. 2004, pp. 75–86 (cited on pages 66, 73, 75). [LBG+12] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud”. In: Proceedings of the VLDB Endow- ment 5.8 (Apr. 2012), pp. 716–727 (cited on page 34). [LCG+15] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. “Heracles: Improving Resource Efficiency at Scale”. In: Pro- ceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, Oregon, USA, June 2015, pp. 450–462 (cited on pages 45–46). [LGZ+14] Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. “Tachy- on: Reliable, Memory Speed Storage for Cluster Computing Frameworks”. In: Proceedings of the 5th ACM Symposium on Cloud Computing (SoCC). Seattle, Washington, USA, 2014, 6:1–6:15 (cited on page 49). [Liu12] Huan Liu. Host Server CPU utilization in Amazon EC2 cloud. https://tiny url.com/hn5yh9d; accessed 14/11/2015. 2012 (cited on page 61). [LND16] Derrek G. Murray. Building new frameworks on Naiad. Apr. 2014 (cited on pages 27, 37, 71). [Löb96] Andreas Löbel. Solving Large-Scale Real-World Minimum-Cost Flow Problems by a Network Simplex Method. 
Technical report SC-96-07. Zentrum für Informa- tionstechnik Berlin (ZIB), Feb. 1996 (cited on page 121). BIBLIOGRAPHY 173 [MAA+14] Stefan C. Müller, Gustavo Alonso, Adam Amara, and André Csillaghy. “Pydron: Semi-Automatic Parallelization for Multi-Core and the Cloud”. In: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implemen- tation (OSDI). Broomfield, Colorado, USA, Oct. 2014, pp. 645–659 (cited on page 66). [MAB+10] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. “Pregel: A System for Large-scale Graph Processing”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD). Indianapolis, Indiana, USA, June 2010, pp. 135–146 (cited on pages 26–27, 30, 33–34, 71). [MAH16] Apache Software Foundation. Apache Mahout. http://mahout.apache. org/; accessed 14/11/2016 (cited on page 32). [McK08] McKinsey &Company. “Revolutionizing data center efficiency”. In: (2008) (cited on page 61). [McS14] Frank McSherry. GraphLINQ: A graph library for Naiad. Big Data at SVC blog, accessed 25/07/2016. May 2014. URL: http://bigdataatsvc.wordpres s.com/2014/05/08/graphlinq-a-graph-library-for-naiad/ (cited on pages 27, 37, 71). [MG12] Raghotham Murthy and Rajat Goel. “Peregrine: Low-latency Queries on Hive Warehouse Data”. In: XRDS: Crossroad, ACMMagazine for Students 19.1 (Sept. 2012), pp. 40–43 (cited on page 35). [MGL+10] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shiv- akumar, Matt Tolton, and Theo Vassilakis. “Dremel: Interactive Analysis of Web- scale Datasets”. In: Proceedings of the VLDB Endowment 3.1-2 (Sept. 2010), pp. 330–339 (cited on pages 26, 30). [MIM15] Frank McSherry, Michael Isard, and Derek G. Murray. “Scalability! But at what COST?” In: Proceedings of the 15th USENIX/SIGOPS Workshop on Hot Topics in Operating Systems (HotOS). Kartause Ittingen, Switzerland, May 2015 (cited on page 34). [MMI+13] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Bar- ham, and Martín Abadi. “Naiad: A Timely Dataflow System”. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP). Nemacolin Woodlands, Pennsylvania, USA, Nov. 2013, pp. 439–455 (cited on pages 15, 26– 27, 30, 67, 70, 143). [MMK10] Yandong Mao, Robert Morris, and M. Frans Kaashoek. Optimizing MapReduce for multicore architectures. Technical report MIT-CSAIL-TR-2010-020. Mas- sachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, 2010 (cited on pages 27, 30). 174 BIBLIOGRAPHY [MSS+11] Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. “CIEL: a universal execution engine for dis- tributed data-flow computing”. In: Proceedings of the 8th USENIX Symposium on Networked System Design and Implementation (NSDI). Boston, Massachusetts, USA, Mar. 2011, pp. 113–126 (cited on pages 27, 30, 32, 66). [MT13] Jason Mars and Lingjia Tang. “Whare-map: Heterogeneity in “Homogeneous” Warehouse-scale Computers”. In: Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA). Tel-Aviv, Israel, June 2013, pp. 619– 630 (cited on pages 44, 52). [MTH+11] Jason Mars, Lingjia Tang, Robert Hundt, Kevin Skadron, and Mary Lou Soffa. “Bubble-up: Increasing utilization in modern warehouse scale computers via sen- sible co-locations”. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Porto Allegre, Brazil, Dec. 
2011, pp. 248–259 (cited on page 45). [Mur11] Derek G. Murray. “A distributed execution engine supporting data-dependent control flow”. PhD thesis. University of Cambridge Computer Laboratory, July 2011 (cited on page 74). [NEF+12] Edmund B. Nightingale, Jeremy Elson, Jinliang Fan, Owen Hofmann, Jon How- ell, and Yutaka Suzue. “Flat Datacenter Storage”. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI). Hollywood, California, USA, Oct. 2012, pp. 1–15 (cited on page 138). [NIG07] Ripal Nathuji, Canturk Isci, and Eugene Gorbatov. “Exploiting platform hetero- geneity for power efficient data centers”. In: Proceedings of the 2007 Interna- tional Conference on Autonomic Computing (ICAC). Jacksonville, Florida, USA, 2007 (cited on page 44). [NRN+10] Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. “S4: Dis- tributed Stream Computing Platform”. In: Proceedings of the 2010 IEEE Inter- national Conference on Data Mining Workshops (ICDMW). Sydney, Australia, Dec. 2010, pp. 170–177 (cited on page 28). [OOZ16] Apache Software Foundation. Apache Oozie. http://oozie.apache.org /; accessed 14/11/2016 (cited on page 31). [OPR+13] Kay Ousterhout, Aurojit Panda, Joshua Rosen, Shivaram Venkataraman, Reynold Xin, Sylvia Ratnasamy, Scott Shenker, et al. “The case for tiny tasks in compute clusters”. In: Proceedings of the 14th USENIX Workshop on Hot Topics in Oper- ating Systems (HotOS). Santa Ana Pueblo, New Mexico, USA, May 2013 (cited on pages 41, 153). [Orl93] James B. Orlin. “A faster strongly polynomial minimum cost flow algorithm”. In: Operations research 41.2 (1993), pp. 338–350 (cited on page 120). BIBLIOGRAPHY 175 [ORS+08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and An- drew Tomkins. “Pig Latin: A Not-so-foreign Language for Data Processing”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Manage- ment of Data (SIGMOD). Vancouver, Canada, 2008, pp. 1099–1110 (cited on pages 27, 29, 36, 67–68, 70, 75). [ORS+11] Diego Ongaro, StephenM. Rumble, Ryan Stutsman, John Ousterhout, andMendel Rosenblum. “Fast Crash Recovery in RAMCloud”. In: Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP). Cascais, Portugal, Oct. 2011, pp. 29–41 (cited on page 49). [OWZ+13] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. “Sparrow: Dis- tributed, Low Latency Scheduling”. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP). Nemacolin Woodlands, Pennsylvania, USA, Nov. 2013, pp. 69–84 (cited on pages 16, 46–47, 50–55, 57, 111, 151–152). [OZN+12] Zhonghong Ou, Hao Zhuang, Jukka K. Nurminen, Antti Ylä-Jääski, and Pan Hui. “Exploiting Hardware Heterogeneity Within the Same Instance Type of Amazon EC2”. In: Proceedings of the 4th USENIXWorkshop on Hot Topics in Cloud Com- puting (HotCloud). Boston, Massachusetts, USA, June 2012 (cited on page 44). [PDG+05] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. “Interpreting the data: Parallel analysis with Sawzall”. In: Scientific Programming 13.4 (2005), pp. 277–298 (cited on pages 27, 36). [PE95] Bill Pottenger and Rudolf Eigenmann. “Idiom Recognition in the Polaris Paral- lelizing Compiler”. In: Proceedings of the 9th International Conference on Su- percomputing (ICS). Barcelona, Spain, 1995, pp. 444–448 (cited on page 80). [Pla13] David A Plaisted. “Source-to-Source Translation and Software Engineering”. In: Software Engineering and Applications 6.suppl 4A (2013), p. 
30 (cited on pages 71, 77). [PTS+17] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. “A Common Runtime for High Performance Data Analysis”. In: Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR). Chaminade, California, USA, Jan. 2017 (cited on pages 66, 82). [RKK+16] Jeff Rasley, Konstantinos Karanasos, Srikanth Kandula, Rodrigo Fonseca, Milan Vojnovic, and Sriram Rao. “Efficient Queue Management for Cluster Schedul- ing”. In: Proceedings of the 11th ACM European Conference on Computer Sys- tems (EuroSys). London, United Kingdom, 2016, 36:1–36:15 (cited on pages 16, 52, 55–57, 112, 123). 176 BIBLIOGRAPHY [RMZ13] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. “X-Stream: Edge-centric Graph Processing Using Streaming Partitions”. In: Proceedings of the Twenty- Fourth ACM Symposium on Operating Systems Principles (SOSP). Farminton, Pennsylvania, USA, 2013, pp. 472–488 (cited on pages 27, 30, 35). [RTG+12] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, andMichael A. Kozuch. “Heterogeneity and dynamicity of clouds at scale: Google trace anal- ysis”. In: Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC). San Jose, California, Oct. 2012, 7:1–7:13 (cited on pages 44, 47, 50, 61, 148). [SAM16] Apache Software Foundation. Apache Samza. http://samza.apache. org; accessed 13/11/2016 (cited on page 28). [SCH+11] Bikash Sharma, Victor Chudnovsky, Joseph L. Hellerstein, Rasekh Rifaat, and Chita R. Das. “Modeling and synthesizing task placement constraints in Google compute clusters”. In: Proceedings of the 2nd ACM Symposium on Cloud Com- puting (SoCC). Cascais, Portugal, Oct. 2011, 3:1–3:14 (cited on pages 50, 140). [Sch16] Malte Schwarzkopf. “Operating system support for warehouse-scale computing”. PhD thesis. University of Cambridge Computer Laboratory, Feb. 2016 (cited on pages 44–45, 112, 114, 135–138, 140, 142). [SKA+13] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wil- kes. “Omega: flexible, scalable schedulers for large compute clusters”. In: Pro- ceedings of the 8th ACM European Conference on Computer Systems (EuroSys). Prague, Czech Republic, Apr. 2013, pp. 351–364 (cited on pages 16, 22, 51–52, 55, 148). [SPW09] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. “DRAM Errors in the Wild: A Large-Scale Field Study”. In: Proceedings of the 11th ACM SIG- METRICS/PERFORMANCE Joint International Conference onMeasurement and Modeling of Computer Systems (SIGMETRICS). Seattle, WA, USA, 2009, pp. 193– 204 (cited on page 28). [TCG+12] Alexey Tumanov, James Cipar, Gregory R. Ganger, and Michael A. Kozuch. “Alsched: Algebraic Scheduling of Mixed Workloads in Heterogeneous Clou- ds”. In: Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC). San Jose, California, Oct. 2012, 25:1–25:7 (cited on pages 50, 52–53, 140). [TMV+11] Lingjia Tang, Jason Mars, Neil Vachharajani, Robert Hundt, and Mary Lou Soffa. “The impact of memory subsystem resource sharing on datacenter applications”. In: Proceedings of the 38th Annual International Symposium on Computer Ar- chitecture (ISCA). San Jose, California, USA, June 2011, pp. 283–294 (cited on page 44). BIBLIOGRAPHY 177 [TSJ+09] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, et al. “Hive: A Warehousing Solution over a Map- reduce Framework”. In: Proceedings of the VLDB Endowment 2.2 (Aug. 2009), pp. 1626–1629 (cited on pages 15, 27, 29, 36, 67–68, 75). 
[TTS+14] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, et al. “Storm @Twitter”. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD). Snowbird, Utah, USA, 2014, pp. 147–156 (cited on page 28). [TZP+16] Alexey Tumanov, Timothy Zhu, JunWoo Park, Michael A. Kozuch, Mor Harchol- Balter, and Gregory R. Ganger. “TetriSched: Global Rescheduling with Adaptive Plan-ahead in Dynamic Heterogeneous Clusters”. In: Proceedings of the 11th Eu- ropean Conference on Computer Systems (EuroSys). London, England, United Kingdom, 2016, 35:1–35:16 (cited on pages 50, 52–53, 125, 140, 142). [Val90] Leslie G. Valiant. “A Bridging Model for Parallel Computation”. In: Communi- cations of the ACM 33.8 (Aug. 1990), pp. 103–111 (cited on page 34). [VMD+13] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Ma- hadev Konar, Robert Evans, Thomas Graves, et al. “Apache Hadoop YARN: Yet Another Resource Negotiator”. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SoCC). Santa Clara, California, Oct. 2013, 5:1–5:16 (cited on pages 43, 46–47, 50, 52, 54). [VPA+14] Shivaram Venkataraman, Aurojit Panda, Ganesh Ananthanarayanan, Michael J. Franklin, and Ion Stoica. “The Power of Choice in Data-Aware Cluster Schedul- ing”. In: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Broomfield, Colorado, USA, Oct. 2014, pp. 301–316 (cited on page 52). [VPK+15] Abhishek Verma, Luis David Pedrosa, Madhukar Korupolu, David Oppenheimer, and JohnWilkes. “Large scale cluster management at Google”. In: Proceedings of the 10th ACM European Conference on Computer Systems (EuroSys). Bordeaux, France, Apr. 2015 (cited on pages 16, 22, 43, 47, 50–51, 53, 57, 148, 153). [Way99] Kevin D. Wayne. “A Polynomial Combinatorial Algorithm for Generalized Mini- mum Cost Flow”. In: Proceedings of the 31st Annual ACM Symposium on Theory of Computing. STOC ’99. Atlanta, Georgia, USA: ACM, 1999, pp. 11–18 (cited on pages 142, 161). [XRZ+13] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. “Shark: SQL and Rich Analytics at Scale”. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIG- MOD). New York, New York, USA, 2013, pp. 13–24 (cited on pages 36, 67, 75). 178 BIBLIOGRAPHY [YIF+08] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language”. In: Pro- ceedings of the 8th USENIX Symposium on Operating Systems Design and Im- plementation (OSDI). San Diego, California, USA, Dec. 2008 (cited on pages 27, 32, 37, 67, 70). [ZBS+10] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys). Paris, France, Apr. 2010, pp. 265– 278 (cited on pages 49, 52, 161). [ZCD+12] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, et al. “Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing”. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI). San Jose, California, USA, Apr. 2012, pp. 
15–28 (cited on pages 15, 26– 27, 30, 32, 41, 49, 143). [ZKJ+08] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. “Improving MapReduce Performance in Heterogeneous Environments”. In: Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI). San Diego, California, USA, Dec. 2008, pp. 29–42 (cited on pages 49, 52). [ZTH+13] Xiao Zhang, Eric Tune, Robert Hagmann, Rohit Jnagal, Vrigo Gokhale, and John Wilkes. “CPI2: CPU Performance Isolation for Shared Compute Clusters”. In: Proceedings of the 8th ACM European Conference on Computer Systems (Eu- roSys). Prague, Czech Republic, Apr. 2013, pp. 379–391 (cited on page 51). [ZWC+16] Mingxing Zhang, Yongwei Wu, Kang Chen, Xuehai Qian, Xue Li, and Weimin Zheng. “Exploring the Hidden Dimension in Graph Processing”. In: Proceed- ings of the 12th USENIX Symposium on Operating Systems Design and Imple- mentation (OSDI). Savannah, Georgia, USA, Nov. 2016, pp. 285–300 (cited on page 35).