Examining Raft's behaviour during partial network failures

Published version
Peer-reviewed

Type

Conference Object

Authors

Jensen, C 
Howard, H 

Abstract

State machine replication protocols such as Raft are widely used to build highly-available strongly-consistent services, maintaining liveness even if a minority of servers crash. As these systems are implemented and optimised for production, they accumulate many divergences from the original specification. These divergences are poorly documented, resulting in operators having an incomplete model of the system's characteristics, especially during failures. In this paper, we look at one such Raft model used to explain the November Cloudflare outage and show that etcd's behaviour during the same failure differs. We continue to show the specific optimisations in etcd causing this difference and present a more complete model of the outage based on etcd's behaviour in an emulated deployment using reckon. Finally, we highlight the upcoming PreVote optimisation in etcd, which might have prevented the outage from happening in the first place.

Keywords

4606 Distributed Computing and Systems Software, 46 Information and Computing Sciences

Journal Title

HAOC 2021 - Proceedings of the 2021 1st Workshop on High Availability and Observability of Cloud Systems

Conference Name

EuroSys '21: Sixteenth European Conference on Computer Systems

Publisher

ACM

Sponsorship

Engineering and Physical Sciences Research Council (EP/M02315X/1)
Engineering and Physical Sciences Research Council (EP/R03351X/1)