Kafka, Samza and the Unix Philosophy of Distributed Data

Kleppmann, M; Kreps, J

Accepted version

Peer-reviewed

Repository URI

https://www.repository.cam.ac.uk/handle/1810/286031

Repository DOI

https://doi.org/10.17863/CAM.33349

Files

Accepted version (222.3 KB)

Type

Article

Authors

Kleppmann, Martin

https://orcid.org/0000-0001-7252-6958

Kreps, J

Abstract

Apache Kafka is a scalable message broker, and Apache Samza is a stream processing framework built upon Kafka. They are widely used as infrastructure for implementing personalized online services and real-time predictive analytics. Besides providing high throughput and low latency, Kafka and Samza are designed with operational robustness and long-term maintenance of applications in mind. In this paper we explain the reasoning behind the design of Kafka and Samza, which allow complex applications to be built by composing a small number of simple primitives: replicated logs and stream operators. We draw parallels between the design of Kafka and Samza, batch processing pipelines, database architecture, and the design philosophy of Unix.

Journal Title

IEEE Data Engineering Bulletin

Volume

38

Publisher

IEEE

Publisher URL

http://sites.computer.org/debull/A15dec/issue1.htm

Rights

http://www.rioxx.net/licenses/all-rights-reserved

Collections

Cambridge University Research Outputs