Making State Explicit for Imperative Big Data Processing

Kalyvianaki, Evangelia; Castro Fernandez, Raul; Migliavacca, Matteo; Pietzuch, Peter

Published version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/294598

Repository DOI

https://doi.org/10.17863/CAM.41706

Files

Published version (404.61 KB)

Type

Conference Object

Authors

Kalyvianaki, Evangelia

Castro Fernandez, Raul

Migliavacca, Matteo

Pietzuch, Peter

Abstract

Data scientists often implement machine learning algorithms in imperative languages such as Java, Matlab and R. Yet such implementations fail to achieve the performance and scalability of specialised data-parallel processing frameworks. Our goal is to execute imperative Java programs in a data-parallel fashion with high throughput and low latency. This raises two challenges: how to support the arbitrary mutable state of Java programs without compromising scalability, and how to recover that state after failure with low overhead. Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG). By explicitly separating data from mutable state, SDGs have specific features to enable this translation: to ensure scalability, distributed state can be partitioned across nodes if computation can occur entirely in parallel; if this is not possible, partial state gives nodes local instances for independent computation, which are reconciled according to application semantics. For fault tolerance, large in-memory state is checkpointed asynchronously without global coordination. We show that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.
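The difference between partitioned and partial state described in the abstract can be illustrated with a small sketch. It is not the paper's SDG implementation and does not use any API from it: the function names (`partition_for`, `run_partitioned`, `run_partial`, `merge`), the toy hash, and the choice of summing counters as the reconciliation step are all assumptions made for illustration, and the sketch is written in Python for brevity although the paper targets Java programs.

```python
from collections import Counter


def partition_for(key, num_nodes):
    """Pick the single node that owns the state for `key`.

    Hypothetical toy hash (sum of UTF-8 bytes); any deterministic hash would do.
    """
    return sum(key.encode("utf-8")) % num_nodes


def run_partitioned(events, num_nodes):
    """Partitioned state: each key lives on exactly one node."""
    nodes = [Counter() for _ in range(num_nodes)]
    for key in events:
        nodes[partition_for(key, num_nodes)][key] += 1
    return nodes


def run_partial(events, num_nodes):
    """Partial state: every node keeps its own local instance of the state.

    Events are spread round-robin here, so the same key may be updated on
    several nodes independently.
    """
    nodes = [Counter() for _ in range(num_nodes)]
    for i, key in enumerate(events):
        nodes[i % num_nodes][key] += 1
    return nodes


def merge(instances):
    """Reconcile local instances; summing counts is an assumed semantics."""
    total = Counter()
    for instance in instances:
        total.update(instance)  # Counter.update adds counts per key
    return total


events = ["a", "a", "b", "a", "c", "b"]
partitioned = run_partitioned(events, 2)
partial = run_partial(events, 2)
global_view = merge(partial)
```

In the partitioned run every key appears on one node only, so no reconciliation is needed to read it. In the partial run the key `"a"` is counted on both nodes, and only the merged view gives the full count.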

Journal Title

Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference

Conference Name

USENIX ATC ’14: 2014 USENIX Annual Technical Conference

Publisher DOI

https://doi.org/10.5555/2643634.2643640

Rights

Publisher's own licence

Collections

Cambridge University Research Outputs