Making State Explicit for Imperative Big Data Processing

Published version
Peer-reviewed

Type

Conference Object

Authors

Kalyvianaki, Evangelia 
Castro Fernandez, Raul 
Migliavacca, Matteo 
Pietzuch, Peter 

Abstract

Data scientists often implement machine learning algorithms in imperative languages such as Java, Matlab and R. Yet such implementations fail to achieve the performance and scalability of specialised data-parallel processing frameworks. Our goal is to execute imperative Java programs in a data-parallel fashion with high throughput and low latency. This raises two challenges: how to support the arbitrary mutable state of Java programs without compromising scalability, and how to recover that state after failure with low overhead. Our idea is to infer the dataflow and the types of state accesses from a Java program and use this information to generate a stateful dataflow graph (SDG). By explicitly separating data from mutable state, SDGs have specific features to enable this translation: to ensure scalability, distributed state can be partitioned across nodes if computation can occur entirely in parallel; if this is not possible, partial state gives nodes local instances for independent computation, which are reconciled according to application semantics. For fault tolerance, large in-memory state is checkpointed asynchronously without global coordination. We show that the performance of SDGs for several imperative online applications matches that of existing data-parallel processing frameworks.

Journal Title

Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference

Conference Name

USENIX ATC ’14: 2014 USENIX Annual Technical Conference

Rights

Publisher's own licence