Examining Raft's behaviour during partial network failures
Published version
Peer-reviewed
Repository URI
Repository DOI
Change log
Authors
Abstract
State machine replication protocols such as Raft are widely used to build highly-available strongly-consistent services, maintaining liveness even if a minority of servers crash. As these systems are implemented and optimised for production, they accumulate many divergences from the original specification. These divergences are poorly documented, resulting in operators having an incomplete model of the system's characteristics, especially during failures. In this paper, we look at one such Raft model used to explain the November Cloudflare outage and show that etcd's behaviour during the same failure differs. We continue to show the specific optimisations in etcd causing this difference and present a more complete model of the outage based on etcd's behaviour in an emulated deployment using reckon. Finally, we highlight the upcoming PreVote optimisation in etcd, which might have prevented the outage from happening in the first place.
Description
Keywords
Journal Title
Conference Name
Journal ISSN
Volume Title
Publisher
Publisher DOI
Sponsorship
Engineering and Physical Sciences Research Council (EP/R03351X/1)