Towards a worldwide storage infrastructure Julien Quintard firstname.lastname@cl.cam.ac.uk September 2010 University of Cambridge Computer Laboratory Jesus College This dissertation is submitted for the degree of Doctor of Philosophy Declaration This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of sixty thousand words, including tables and footnotes. Towards a worldwide storage infrastructure Julien Quintard Abstract Peer-to-peer systems have recently gained a lot of attention in the academic com- munity especially through the design of KBR (Key-Based Routing) algorithms and DHT (Distributed Hash Table)s. On top of these constructs were built promising applications such as video streaming applications but also storage infrastructures benefiting from the availability and resilience of such scalable network protocols. Unfortunately, rare are the storage systems designed to be scalable and fault-tolerant to Byzantine behaviour, conditions required for such systems to be deployed in an environment such as the Internet. Furthermore, although some means of access control are often provided, such file systems fail to offer the end-users the flexibility required in order to easily manage the permissions granted to potentially hundreds or thousands of end-users. In addition, as for centralised file systems which rely on a special user, referred to as root on Unices, distributed file systems equally require some tasks to operate at the system level. The decentralised nature of these systems renders impossible the use of a single authoritative entity for performing such tasks since implicitly granting her superprivileges, unacceptable configuration for such decentralised systems. This thesis addresses both issues by providing the file system objects a completely decentralised access control and administration scheme enabling users to express ac- cess control rules in a flexible way but also to request administrative tasks without the need for a superuser. A prototype has been developed and evaluated, prov- ing feasible the deployment of such a decentralised file system in large-scale and untrustworthy environments. Acknowledgments I would like to thank Jean Bacon for her incredible patience, understanding and kindness. I am also indebted to Alastair Beresford for his advice and encourage- ment throughout the years. I would also like to thank the Opera group, especially Pedro Branda˜o, David Eyers, David Evans, Luis Vargas, Samuel Kounev, Jatinder Singh, Eiko Yoneki, Sriram Srinivasan, David Ingram, Salman Taherian, Scarlet Schwiderski and Ken Moody. During my PhD, I have been fortunate enough to cross Myoung Jin Nam’s path who I would like to thank sincerely for everything. Finally, I would like to thank my parents and my friends for bearing with me all these years and making me a better person. Contents 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Background 6 2.1 Overlay Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1.1 Unstructured Overlay Networks . . . . . . . . . . . . . . . . . 8 2.1.2 Hybrid Overlay Networks . . . . . . . . . . . . . . . . . . . . 10 2.1.3 Structured Overlay Networks . . . . . . . . . . . . . . . . . . 11 2.1.3.1 Chord . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.3.2 Kelips . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.1.4 Social Overlay Networks . . . . . . . . . . . . . . . . . . . . . 18 2.2 Distributed Hash Tables . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Peer-to-Peer File Systems . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Pangaea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.3.2 OceanStore . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.3 FARSITE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.4 CFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.5 Ivy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.6 Plutus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.3.7 Pastis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3 Environment 39 3.1 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.1 File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.2 Peer-to-Peer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3 Mission . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4 Design 53 4.1 Access Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.1.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.1.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.1.2.1 Policy . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.1.2.2 Pattern . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1.2.3 Class . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1.3 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.1.4 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.1.5 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.1.5.1 Physical Blocks . . . . . . . . . . . . . . . . . . . . . 62 4.1.5.2 Logical Blocks . . . . . . . . . . . . . . . . . . . . . 67 4.1.6 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.1.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2 Administration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2.1 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.2.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 4.2.2.1 System-wide . . . . . . . . . . . . . . . . . . . . . . 88 4.2.2.2 User-wide . . . . . . . . . . . . . . . . . . . . . . . . 89 4.2.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.2.4 Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.2.4.1 Community . . . . . . . . . . . . . . . . . . . . . . . 92 4.2.4.2 Ownership . . . . . . . . . . . . . . . . . . . . . . . 106 4.2.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 4.2.6 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5 Implementation 120 5.1 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 5.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.2.1 Elle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.2.2 Lune . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.2.3 PIG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 5.2.4 Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.2.5 Etoile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.2.6 Hole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6 Evaluation 145 6.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.1.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.1.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 6.1.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.2.1 Overlay Network . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.2.2 Distributed Hash Table . . . . . . . . . . . . . . . . . . . . . . 153 6.2.3 File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.2.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . 158 6.2.3.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . 163 7 Conclusion 169 Chapter 1 Introduction This chapter introduces the thesis by detailing the motivation driving this research before presenting the contributions this work brings to the research community. 1.1 Motivation Over the last decade, computers have become the universal tool for work, commu- nication and entertainment. Despite the incredible technology progress, computers still fail to provide the end-user a way to deal with data in an easy, reliable and secure way. Although people use computers daily for both personal and profes- sional tasks, users cannot rely on them when it comes to reliably storing documents, transparently sharing files with other users or synchronising data between multiple devices. The following further details these three functionalities—storage, sharing and synchronisation—and explains why existing products and services fail to provide end-users the features and properties they expect. Storage Computer networks are growing rapidly in importance as a medium for the storage and exchange of information. After years of encouraging people to amass a hoard of digital media as well as to store personal data on their local hard disk, users now expect computers to become as reliable as any other home devices such as televisions, Compact Disc players and so forth. Although computers will probably never be as reliable as televisions, most people feel like their local hard disk is a safe place for storing their sensitive files. The very few end-users concerned about losing their files tend to rely on manual backups. 1 CHAPTER 1. INTRODUCTION 2 Unfortunately, even for those users, files cannot be considered safe on external back- ups. Indeed, many plausible scenarios might lead to the complete loss of data including fire and theft amongst many others. The Internet made it possible to store files on company-run storage clusters such as Amazon’s [DHJ+07] and many others [Ope, Omn, Box, Dro]. Unfortunately, people willing to use such services must completely trust the company for reliably storing their files while not disclosing or using their personal information for any purpose. For the obvious above reasons, such systems may not be suitable for everyone. Peer-to-peer file sharing applications have gained great popularity over the last decade as a way for users to share their files with the rest of the world in a com- pletely decentralised manner. Although peer-to-peer applications are interesting for increasing privacy and anonymity in the sense that nobody has complete control over the system, such applications do not provide any guarantees in terms of persis- tence, availability and security. Therefore, such peer-to-peer applications cannot be used for reliably storing users’ files. Sharing Peer-to-peer file sharing applications completely changed the users’ day-to-day In- ternet experience. Indeed, people are now used to launching such an application whenever they want to listen to an unknown band’s music, download the last episode of their favourite TV series, watch or re-watch a famous movie and so forth. Unfortunately, the well-known eDonkey [HKLF+06], Gnutella [PSAS01], Bittorrent [Coh03] etc. still lack some fundamental functionalities. Indeed, although these ap- plications are generally very efficient at downloading popular content, no availability guarantees are provided for rare files, making it problematic for users to locate them. Additionally, these applications usually rely on a flat name space, making it com- plicated for users to look for a rare file whose name resembles another popular but completely different file. Finally, such applications basically aim at providing users a way to share their files with the rest of the world. However, one could be interested in controlling who has access to the shared files. Unfortunately, none of the well-known peer-to-peer file sharing applications provides any access control mechanism. Although some company-run storage systems [Omn, Dro, XDr, Box] provide such sharing capabilities, they cannot succeed in offering as much diverse content as peer- to-peer communities, not mentioning the cost in storage and bandwidth for such a company to provide this service. CHAPTER 1. INTRODUCTION 3 Finally, popular social websites such as Youtube, Flickr etc. also provide sharing capabilities but those services target a single medium such as video, sound or image, forcing the user to deal with multiple accounts and interfaces. Synchronisation With the advent of ubiquitous and mobile computing, people start getting their hands on multiple devices, all with amazing computing capabilities. With so many devices each with its own storage, users are forced to manually synchronise their data so that they can access a file from different devices and locations. Company-run products such as Apple’s MobileMe [App], Windows Live SkyDrive [Win] etc. make it easier for end-users to synchronise their devices through the use of an online storage space. Unfortunately, the online storage capacity is generally limited in many ways: number of files, file size, storage capacity and so forth. Besides, people might be concerned about privacy when relying on a private company for storing their personal and/or professional files. Finally, such applications often target the products of the same company only, making it difficult, if not impossible, for users to change or even use them on other systems. ? ? ? Interestingly, the three scenarios above have three points in common. Firstly, the devices involved are connected to the Internet being mobile phones, netbooks, office computers etc. Secondly, all these devices embed an unreliable storage capacity that can be used for storing, sharing and synchronising data. Thirdly, all these tasks— storing, sharing and synchronising—are related to the common abstraction known as the file. A unique system for storing, sharing and synchronising files, independently of their medium type, in a reliable, secure and transparent way would therefore make it easy for users to manage their data. 1.2 Contributions The thesis of this dissertation is that a file system abstraction on top of a peer-to- peer network is a viable platform and the most cost-effective one for ensuring the CHAPTER 1. INTRODUCTION 4 fundamental properties end-users expect when it comes to storing, sharing and syn- chronising their data. As detailed in Section 3.1, these properties include availability, integrity, durability, privacy and efficiency among others. The first contribution of this work is the definition of the properties that are required for the system to provide the end-users the expected guarantees. Several peer-to-peer file systems such as Ivy [MMGC02], CFS [DKK+01], Pastis [mBPS05] etc. were developed over the last decade. However, very few of them provided common file system features, such as an access control mechanism for in- stance, while administration in such systems was completely ignored by the research community. The second contribution of this dissertation is the design of an access control and administration scheme for decentralised untrustworthy environments making them suitable building blocks for peer-to-peer file systems. The third and final contribution is the implementation of a complete working peer- to-peer file system prototype along with an extensive evaluation proving feasible the deployment of such a system to a large community of users. Figure 1.1 illustrates a peer-to-peer network connecting nodes physically distributed throughout the world. The work presented in this document aims at building a storage infrastructure on top of such a network in order to ensure fundamental properties such as reliability, availability, privacy, anonymity and so forth. node link Figure 1.1: A worldwide storage infrastructure CHAPTER 1. INTRODUCTION 5 1.3 Outline The remainder of this dissertation is structured as follows. Chapter 2 discusses the relevant background from overlay networks to distributed hash tables. Special attention is given to the presentation of the extensive body of work on peer-to-peer file systems. Chapter 3 discusses the objectives of this work by precisely defining the required properties and deducing that the peer-to-peer file system model is suitable for achiev- ing them all. Chapter 4 discusses the semantic differences between centralised file systems and decentralised file systems and the impact on the user experience. Then, the chapter presents the design of the two building blocks peer-to-peer file systems require, namely, an access control and administration scheme. The prototype implementation is discussed in Chapter 5, detailing how the system has been broken into small independent units and how they relate to each other. Chapter 6 evaluates the performance of the prototype and validates the overall de- sign. The chapter also suggests some possible improvements in specific areas. Finally, Chapter 7 concludes and discusses directions for future investigation. Chapter 2 Background This chapter introduces peer-to-peer systems from overlay networks to routing al- gorithms capable of locating a node given its identifier in a decentralised manner to distributed hash tables which provide a storage abstraction to peer-to-peer file systems which enable the user to interact with the system following a standard file system interface. ? ? ? Peer-to-peer systems differ from common distributed systems in the sense that nodes composing the network can self-organise with very little information on the whole network. Such networks are designed with fault tolerance in mind because the num- ber of nodes populating such networks is generally so high that nodes disconnecting, crashing or acting maliciously are more probable than in other, more controlled, distributed systems. Such systems are often used to aggregate the resources of many heterogeneous com- puters across the world. Although those resources can be very diverse, this document focuses on the storage capacity such nodes provide. The lack of centralised servers makes such networks suitable to accommodate a very large number of nodes. However, these peer-to-peer networks also exhibit specific characteristics that need to be taken into account. 6 CHAPTER 2. BACKGROUND 7 Scalability The decentralised nature of peer-to-peer networks implies that the more nodes join the network, the more aggregated resources the system acquires, hence, the better the system. However, the network must cope with this potentially large number of nodes by relying on scalable protocols ensuring that the system keeps providing client nodes the expected service as nodes dynamically join and leave the network. Latency Unlike centralised topologies that require low-latency servers with a high bandwidth to supply all the clients, peer-to-peer networks rely mostly on personal computers. Such computers are generally connected to the network through a high-latency and low-bandwidth Internet connection. Systems built upon such networks, e.g. Internet, therefore cannot afford using the same protocols and algorithms as for centralised or partially-distributed systems. Churn The decentralised nature of peer-to-peer networks implies that every node con- tributes to the system by taking part in the basic tasks such as routing messages between nodes, managing the network state etc. Therefore, every node is considered an important component of the system. When- ever a node fails, other nodes must be informed and past operations involving this failing node may have to be re-performed. In addition, most peer-to-peer systems are open such that new nodes constantly join the network, in which case, the other nodes must be informed of their arrival. Unfortunately, studies showed that the churn rate of the studied peer-to-peer net- works was high [LSG+04, RGRK04]. Peer-to-peer systems must integrate this char- acteristic in the design of their algorithms such that, for instance, nodes are not assumed to be connected to the network at all times. Untrustworthiness Clients composing a peer-to-peer network run on computers under the full control of their respective user. The system therefore has no authority to force nodes to follow the system’s protocols. The network is thus assumed to be untrustworthy CHAPTER 2. BACKGROUND 8 since many of the nodes populating the system may be faulty. For example, a virus may have infected the whole client’s operating system or the user may have installed a modified version to take advantage of the system without contributing resources, nodes referred to as free riders. Peer-to-peer systems must be designed with this property in mind making sure that nothing relies on a single node, such a node being faulty could endanger the system in its entirety. 2.1 Overlay Networks The computers connected together and collaborating in the same peer-to-peer system form an overlay network [KS10] on top of a physical network e.g. the Internet. The topology of the overlay network, its degree of decentralisation as well as the communication protocol, vary from one peer-to-peer system to another. These char- acteristics are fundamental as they impact the scalability and performance of the network but also its capacity to self-organise and tolerate faults. Overlay networks can be classified in four categories according to the way nodes are connected to one another. Depending on the overlay network’s topology, it may be easier to join/leave the network but more difficult to locate a precise object in the network. The following discusses the different models of overlay networks along with the way objects are located in such networks. 2.1.1 Unstructured Overlay Networks The very first deployed peer-to-peer applications enabled users to contribute files to the system that any other user could download. These peer-to-peer file sharing applications allow users to search for files matching the keywords the user specified. The objective of such a system is to locate all the files whose name matches with those keywords. Then, the user, through the application, can download the files of interest to her. The overlay networks on which such applications were built had the property of lacking organisation in the way nodes were connected to each other. Besides, nodes connected to the network were all considered equal i.e. no node had more privileges than others. Such unstructured overlay networks [CFK03] are therefore sometimes referred to as being flat, forming a completely random graph. In such an environ- ment, a node wishing to join the network basically has to connect to an already connected node. CHAPTER 2. BACKGROUND 9 Since such networks have no overall structure, unstructured overlay networks are very easy to manage. Indeed, whenever a node leaves the network, only its neigh- bours must detect its departure and update their internal state. However, the other nodes of the network do not need to be notified of the change in the network’s topology, hence lowering the communication costs of maintenance. Figure 2.1 depicts such a unstructured overlay network in which nodes are connected without following any pre-defined structure. node link Figure 2.1: A flat unstructured overlay network Locating an object—e.g. a file—in such a network without any centralised entity maintaining a global state of the network requires every node to contribute. Indeed, one of the first routing algorithms designed for unstructured overlay networks con- sisted of flooding the network. The node issuing the search request starts by sending a message to all its neighbour nodes, asking them to locate files matching a list of keywords. Whenever a node receives such a request, it starts by checking if it does have such files among the files it contributed to the peer-to-peer system, and replies to the requesting node accordingly. Then, the message is forwarded to all the other neighbours until the message expires i.e. the TTL (Time To Live) reaches zero. Unfortunately, such an algorithm implies a high network overhead since messages CHAPTER 2. BACKGROUND 10 are sent to a large fraction of nodes which do not have the sought resource and are therefore not interested in the process. Such a routing algorithm and its variants [DHA03, YM02, YVGM04] are extremely simple to deploy and do not constrain the overlay network topology. However, the implied overhead makes these algorithms only suitable for small networks, though many projects are known to have used and still use them, most notably Gnutella [PSAS01] and Freenet [CSWH01] among many others [DFM01, DMS04]. 2.1.2 Hybrid Overlay Networks Although flat unstructured overlay networks are very good for handling churn, they do not perform well when it comes to locating a particular object or node. Hybrid overlay networks [SBA03, CRB+03], also known as multi-level unstructured overlay networks, address this problem by adding a level of highly-available supernodes, a.k.a. superpeers, forming a small inner overlay network. These supernodes are responsible for referencing the nodes connected to the network along with the objects they contribute to the system. node link supernode superlink Figure 2.2: A two-level hybrid overlay network CHAPTER 2. BACKGROUND 11 The main drawback of such a topology is the high load implied as well as the large state that must be kept by the supernodes. Figure 2.2 illustrates such a centralisation within the inner overlay of supernodes. If one of these supernodes fails, the impact on the overall network’s performance may be disastrous as the load the faulty supernode was handling must be balanced on the others. The routing algorithm in such overlay networks is however trivial. Indeed, whenever a user performs a search, the client node requests the supernode it is connected to, supplying some keywords. The supernode performs the matching process by comparing the keywords with the names of all the files in its records, and possibly contacts other supernodes if required. Although such routing algorithms [GEvS07] involve only a few nodes, they require supernodes to be extremely reliable, powerful and well-connected in order to handle all the client nodes’ requests. 2.1.3 Structured Overlay Networks Structured overlay networks were developed to overcome the limitations of unstruc- tured and hybrid overlay networks. Such networks are completely decentralised and organised such that nodes communicate with well-identified nodes according to the protocol in contrast with unstructured overlay networks in which nodes connect to other nodes in a unplanned way, hence forming a random graph. Figure 2.3 illustrates a structured overlay network in which every node is assigned an identifier following a ring-based identifier space [PRR97]: nodes are identified by a number such that every node follows the node with the preceeding number—i.e. highest number which is smaller than the current node’s—with the exception of the node with the smallest identifier which follows the node with the highest one, hence creating a loop within the identifier space. Although unstructured and hybrid overlay networks were primarily used for keyword- based lookups, other search criteria such as object identifiers, regular expressions and so on could have been used. In contrast, structured overlay networks organise nodes by assigning them an identifier while routing algorithms make use of this organ- isation to perform fast lookups. Therefore, structured overlay networks were not designed to perform attribute-based lookups as quickly as identifier-based lookups, though some decentralised data structures [RH04] were designed for specific types of queries. Besides, dissemination techniques used in unstructured overlay networks can also [CCR05] be used in structured overlay networks. Routing algorithms based on identifiers are sometimes referred to as KBR (Key-Based Routing) algorithms and provide an interface composed of a single routine, Lookup(ι), which returns the IP (Internet Protocol) address of the node in charge of the identifier ι. CHAPTER 2. BACKGROUND 12 Although attribute-based routing algorithms used in unstructured and hybrid over- lay networks enable rich searches within the set of objects, such algorithms do not scale well since they do not distribute the resource requirements evenly across the nodes. KBR algorithms, however, aim at locating the node responsible for an iden- tifier. Such algorithms were designed to scale so that locating an identifier involves a small number of nodes while each node maintains only a few links to other nodes. node link Figure 2.3: A ring-based structured overlay network Every node in a structured overlay network is assigned an identifier from a large identifier space. Identifiers are generated in a random fashion in order to provide network resource balancing and fault tolerance. Besides, nodes with close or even adjacent identifiers are, with high probablity, in different geographic locations, under distinct users’ control and with different computing and network resources. Objects, e.g. data blocks, files etc., are assigned identifiers from the same identifier space. Every object in the network is dynamically associated with a node, called the object’s home, or sometimes root. This node is responsible for storing the object and answering requests related to this object. Every node maintains a routing table containing the identifier and IP address of CHAPTER 2. BACKGROUND 13 some other nodes, depending on the topology. In most systems, nodes also maintain a set of neighbours containing the IP address of a few closest nodes. These two data structures are updated whenever a node is detected to have joined or left the network but also periodically in order to maintain the network in a consistent state. KBR algorithms are distinctive from other routing algorithms in the way that they determine the size of the routing tables as well as the length of the search paths, as detailed next. These metrics are important as they characterise the robustness and performance of the routing algorithm, hence of the whole network. Indeed, the more entries in a routing table, the more communication is required to maintain it in a consistent state. Likewise, the shorter the search path, the more efficient the lookup process. Several structured overlay networks and routing algorithms were designed over the last decade, from Chord [SMK+01] that is based on an oriented ring, to CAN (Content-Addressable Network) [RFSH01] with its multi-dimensional Cartesian co- ordinate space, to Pastry [RD01a] which is based on the Plaxton [PRR97] structure, to Tapestry [ZKJ01], Kademlia [MM02], Kelips [GBL+03], Viceroy [MNR02] and many more [ZKW05, MBRI03], all with different trade-offs between routing com- plexity, maintenance overhead and memory footprint. Although key-based routing algorithms are far more efficient than other previously described routing algorithms, the fact that they are based on collaboration implies several issues which are discussed next. Structured overlay networks have long been considered to tolerate churn. However, subsequent studies [LSG+04] showed that well-known DHT s suffered from churn. Research [RGRK04] therefore explored the critieria impacting churn tolerance such as periodic versus reactive recovery, the choice of nearby versus distant neighbours etc. Peer-to-peer networks have also been shown to implicitly suffer from attacks known as Sybil [Dou02] and Eclipse [SNDW06]. The Sybil attack consists of an attacker that generates enough virtual nodes to take over a large portion of the overlay network’s identifier space. Therefore, a malicious node could, for instance, control all the replicas of an object. On the other hand, the Eclipse attack consists of malicious nodes corrupting honest nodes’ routing table in order to increase the number of requests passing through such Byzantine nodes. Although these issues are very difficult to deal with, some routing algorithms were improved [CDG+02, DLLKA05, HKD07] to cope with such attacks. Routing algorithms in peer-to-peer systems rely on the collaboration of the nodes populating the network. Since peer-to-peer networks are, by nature, untrustworthy, a single node being unwilling to cooperate e.g. to contribute to the routing process, to store the object it has been given the responsibility for etc. suffices to harm CHAPTER 2. BACKGROUND 14 the system and its users. KBR algorithms tend to rely on iterative routing instead of recursive routing to minimise the impact and more easily detect such malicious nodes, though such a design makes the lookup process less efficient. Former peer-to-peer file sharing applications’ problems with free riders came from the lack of incentive for the users to contribute their files and/or bandwitdh. In the last decade, research started exploring a completely different but more promising way to cope with such behaviours by enforcing collaboration in peer-to-peer networks. Systems bringing incentive to peer-to-peer systems fall in two categories. The first class is composed of systems relying on resource bidding. These systems [CN03, CGM02, MT03b, BLV05] guarantee that, for instance, whenever a node wants to store a block of data on another node, it must offer this node some local storage space in return. The second class is composed of reputation systems. Those systems [WV03, SS02, ZH07, DMS03, MT03a] dynamically keep track of nodes’ behaviour in a completely decentralised way. Then, reputation is propagated through the system and correlation is made to detect Byzantine behaviours. Although both categories suffer some limitations, they represent the most promising solutions for enforcing collaboration in peer-to-peer networks. The remainder of this section focuses on detailing two very different structured overlay networks along with their key-based routing algorithm, giving the reader a good understanding of the trade-offs involved in the design of such systems: Chord achieves high scalability while Kelips focuses on ensuring constant time lookups. 2.1.3.1 Chord Chord is a KBR algorithm relying on a structured overlay network in which nodes are assigned random identifiers through the use of a hash function, for instance by applying SHA (Secure Hash Algorithm) on the node’s IP address. Identifiers are ordered in an identifier circle modulo 2m. Key k is assigned to the node whose identifier is equal to or follows k in the identifier space. This node is called the successor of key k, denoted successor(k). Note that the successor basically corresponds to the home or root node in other protocols i.e. the node responsible for the identifier. The idea of Chord is to provide efficient routing i.e. to locate the successor of a given key, by relying on a very small amount of local information. First, each node need only be aware of its successor node on the circle, ensuring that by passing the query around the circle, the key’s successor will eventually be reached. Although Chord nodes do maintain a link with their successor and therefore ensure that all lookups can be resolved correctly, this routing scheme is very inefficient i.e. O ( η ) . CHAPTER 2. BACKGROUND 15 3 2 10 18 21 22 24 26 28 29 31 9 11 13 14 16 interval successor 28 [27, 28[ 28 [28, 30[ [30, 2[ 31 [10, 26[ [2, 10[ 2 11 interval successor 24 [22, 23[ 22 [23, 25[ [25, 29[ 26 [5, 21[ [29, 5[ 29 9 interval successor 9 [4, 5[ 9 [5, 7[ [7, 11[ 9 [19, 3[ [11, 19[ 11 21 interval successor 11 [10, 11[ 11 [11, 13[ [13, 17[ 13 [25, 9[ [17, 25[ 18 26 node link 18 node identifier node’s routing table routing hop Figure 2.4: A Chord network of degree 5 with 17 nodes To accelerate the process, Chord maintains additional, but few compared to the network size, routing links. Each node maintains a routing table, known as the finger table, composed of m entries. Recall that the maximum number of nodes in the network has been set to 2m. Therefore, by keeping only m links, the finger table grows logarithmically with the size of the network. In the routing table of node n, the ith entry contains the identifier of the first node, s, that succeeds n by at least 2i−1 on the identifier circle: s = successor(n+ 2i−1) mod m, 1 < i < m The system calls s the ith finger of node n. A finger entry in Chord contains both CHAPTER 2. BACKGROUND 16 the identifier and IP address of the node. Note that the first entry—i.e. index zero—of the finger table points to what has been earlier called the node’s successor. This scheme has two important characteristics. First, each node stores information about only a small number of nodes, and knows more about close nodes than nodes on the other side of the circle. Second, often, a node’s finger table does not contain enough information to perform the resolution by itself. Therefore, a node wishing to locate a node it does not know about would have no choice but to take a node in its finger table, whose identifier is closer to the key k than its own, and ask it to carry on the lookup process. By repeating this operation, every node without the necessary information forwards the request so that every step brings the request closer to the target node and eventually reaches it. Figure 2.4 shows the organisation of a Chord ring along with the finger table of some nodes. In this illustrated network, node 3 issues a lookup on key 27 which is held by node 28. Since node 3 does not have the location of node 28 in its finger table, it forwards the request to the node 21, located in the farthest interval [19, 3[. Once node 21 receives the request, it inspects its finger table and notices that it cannot resolve the mapping either, hence forwards the message to the node 26 located in interval [25, 29[. One can easily notice that the interval is shrinking by half every time the request is forwarded. At this point, node 26 knows that node 28, located in interval [27, 28[, is responsible for the key 27 and therefore returns to the requesting node 3 the IP address of node 28, node 26’s successor. Chord provides a protocol for resolving an identifier into an IP address in a com- pletely decentralised manner. Assuming the network is composed of η nodes, Chord resolves lookups in O ( log(η) ) messages while nodes are required to maintain links to O ( log(η) ) other nodes. 2.1.3.2 Kelips As previously explained, malicious nodes involved in the routing process can interfere and harm the system by refusing to comply with the protocol. Since the longer the routing path, the higher the probability of a malicious node interfering, Kelips was designed to achieve O ( 1 ) routing complexity at the cost of increased storage overhead. Considering a network of η nodes, Kelips uses O (√ η ) space per node. This soft state suffices to resolve lookups with O ( 1 ) time and message complexity at the cost of more background communication. Kelips consists of κ virtual groups identified from 0 to κ − 1. Each node lies in a group determined by using a consistent hashing function such as SHA-1, applied on the node’s IP address for instance. The distribution property of hash functions ensures that, with high probability, the number of nodes in each group will be close CHAPTER 2. BACKGROUND 17 to η κ . Nodes’ soft state consists of two data structures. The first one, known as Contacts, contains the address of a small number of nodes lying in each of the other κ − 1 groups. The second data structure, known as Neighbours, contains the address of all the other nodes in the same group, hence the location of the home nodes of any key falling in this group. 0 1 2 3 9 11 16 14 13 18 22 21 24 26 28 29 31 group 2group 1group 0 group 3 group 4 group 5 11 0 1 3 24 26 29 group2(14) group3(22) group4(29) group5(31) group1(9) contactsneighbours neighbours contacts group2(16) group3(18) group4(24) group5(31) group0(3) neighbours contacts group0(1) group1(11) group2(14) group3(21) group4(28) neighbours contacts group0(3) group1(11) group2(16) group3(21) group5(31) group node’s routing table routing hop 22 node identifier node Figure 2.5: A Kelips network for 36 nodes The routing algorithm consists of the following steps. The node wishing to locate the root node of a given key starts by extracting the group identifier corresponding to that key, for instance by using the m most significant bits of the key. The node looks into its Contacts table, and, if not located in its own group, picks a node belonging to the destination group. It then sends a message to this node. When the CHAPTER 2. BACKGROUND 18 node receives the message, it simply looks in its Neighbours data structure to locate the root node of the given key. Figure 2.5 illustrates a Kelips network designed for η = 36 nodes. The network is composed of κ = √ 36 = 6 groups while every group can contain up to 6 nodes. The Contacts and Neighbours data structures are detailed for some nodes. Finally, an example of a routing process is depicted. Node 31 wants to find the node responsible for the key 25. Kelips follows the same rule as Chord, the node whose identifier is equal to or follows the key is considered its root. Node 31 starts by extracting the group number corresponding to the key 25: group 4. It then picks in its Contacts a node lying in the group 4, node 28, and sends it a message request. When node 28 receives the request, its looks in its Neighbours data structure and notices that node 26 is the root node of key 25. Therefore, node 28 directly returns node 31 the address of root node 25. Kelips ensures a O ( 1 ) routing complexity because a single hop is required to locate the home node: either directly within the node’s group or by contacting a node from the group in which lies the target node. Aside from the obvious performance benefits, this scheme allows the system to more easily detect malicious nodes since fewer intermediate nodes are asked to contribute to the routing process. However, Kelips does not scale well as the more nodes in the network, the more often the state changes, hence more communication is required to keep the state consistent. 2.1.4 Social Overlay Networks A social network is a social structure made of individuals connected through relation- ships such as friendship, kindship, belief, knowledge, collaboration or just interest. Such networks provide very interesting properties. Firstly, since routing in such networks consists in traversing nodes with some degree of trust, the routing process is less likely to be disturbed by malicious nodes than in other overlay networks. Secondly, many social networks exhibit the small-world phenomenon in which a generally short chain of acquaintances exists connecting one arbitrary node to any other node. Thus, the distance between two randomly chosen nodes grows pro- portionally to the logarithm of the number of nodes η in the network. Thirdly, in many applications, a node’s acquaintances share the same interests such that most objects requested by that node will already be held by its neighbours, hence, greatly improving data retrieval. Recently, research was conducted regarding the application of social behaviours to overlay networks in order to improve the performance and reliability of routing algo- rithms. Indeed, some existing networks, such as peer-to-peer file sharing communi- ties, have been shown [IRF04] to exhibit small-world patterns, while non-small-world CHAPTER 2. BACKGROUND 19 networks have been improved [SMZ03] through the addition of social links. As shown on Figure 2.6, nodes are connected to their friends, forming multiple loosely connected groups. In addition, every node could maintain a few links on a structured overlay in order to guarantee monotonic lookup progression. Indeed, if a node does not have any friend connection located closer to the target identifier, structured links can be used to move forward, hence guaranteeing liveness. link node Figure 2.6: A small-world-based social overlay network Some applications have been making use of social connections such as Turtle [PCT04], a peer-to-peer file sharing application relying on the friend relationship. Turtle’s routing protocol is similar to those of unstructured overlay networks, consisting of forwarding the request to the neighbour nodes, hence flooding the network. Other projects tried to enhance existing systems, such as SPROUT [MGGM04] which augments the Chord structured overlay network with social links in order to take advantage of the small-world network properties when possible and to rely on the structured nature of the underlying network otherwise. Finally, some social overlay networks were designed from the ground up to take advantage of the small-world CHAPTER 2. BACKGROUND 20 phenomenon. For instance, the SWOP (Small World Overlay Protocol) [And04] achieves improved object lookup performance over the existing routing protocols but also provides efficient replication especially regarding popular content. 2.2 Distributed Hash Tables A DHT (Distributed Hash Table) provides a hash table abstraction on top of a peer-to-peer overlay network. Such a service aggregates the network peers’ storage resources providing a distributed data structure. A DHT provides a way to store a block of data β given an address—a.k.a. storage key—α, usually through an interface [DZD+03] as simple as Put(α, β) and Get(α). In order for the service to be efficient but also scalable, DHT s make use of key-based routing algorithms. For instance, the distributed hash table PAST [RD01b] is built upon the Pastry KBR while DHash is based on the Chord overlay network. As discussed through the remainder of this section, redundancy is an absolute re- quirement for ensuring availability, durability and integrity. Distributed hash tables therefore abstract the process of replicating data and maintaining replication con- sistency [KWR06] as nodes fail and join the system. Indeed, considering a DHT in which every block is stored by a single node, avail- ability could not be ensured since the failure of this node would make all the blocks it was responsible for storing inaccessible. Besides, assuming that the node crashes permanently, the block would be lost forever. Redundancy is therefore an absolute requirement for ensuring both availability and durability. Furthermore, in a system lacking redundancy, nothing would prevent the home node from altering the data content and/or returning fake content to a client’s request. Although systems such as SUNDR [LKMS04] ensure integrity without relying on trusted storage servers, clients cannot retrieve the block’s latest valid content if the block’s only storage node does not want to cooperate and keeps acting maliciously. In order to provide the clients the assurance of valid data retrieval, the system must rely on redundancy so that a block is always stored by a set of nodes. There are basically two ways of achieving redundancy, either through replication or network coding schemes. Replication [SS05, JGH+98] consists of storing multiple identical instances of an object on different nodes, hence increasing availability and durability. Network coding schemes [OSV09] however rely on error-redundant codes such as Reed-Solomon, an erasure code [DGWR07] widely used in DVD (Digital Versatile Disc). Instead of plain object replication, erasure code schemes divide the object into m fragments and recode them into n segments, where n > m. The n segments are then stored in the DHT. The rate of encoding r = m n increases the CHAPTER 2. BACKGROUND 21 storage cost by a factor of 1 r . The key property of erasure codes is that the original object can be reconstructed from any m segments. For example, using a ratio r = 1 4 , a block is divided into m = 16 fragments and encoded into n = 64 segments, increasing the storage cost by a factor of four. Then, a client able to retrieve sixteen segments out of the sixty four present in the system would be able to reconstruct the original object. Noteworthy is that replication represents a subset of erasure codes where the number of segments n is one i.e. a single segment is enough to reconstitute the original object. Network coding schemes are very interesting because they require less storage space in order to achieve the same degree of availability and durability as through standard replication. As an example, assuming that ten percent of the ten million machines populating a network are down, replication ensures 99% availability by storing two replicas of each block. However, erasure codes can achieve over 99.9999998% yet consume the same amount of storage and bandwidth than their replication counter- part. Unfortunately, network coding schemes all suffer from the same problem. As nodes fail, the system loses the segments belonging to the network-coded objects the nodes were storing. In order to avoid losing them all, the system must, for every object involved, periodically refresh the missing segments. This refreshing process consists of reconstructing the object, re-computing all the segments and then re-storing the missing segments on other storage nodes. Unfortunately, this process is extremely costly, especially for large objects, though network coding schemes were designed to rarely require refreshing segments. Figure 2.7 illustrates DHash, a Chord -based distributed hash table in which blocks are replicated on the nodes following the home node, known as the neighbours. Since nodes with close identifiers are, with high probability, located in very different geographic places, storing replicas on such nodes ensures a low rate of correlated failures. Note that, whenever the home node fails, the Chord protocol takes over and assigns a new home node to the orphan objects. In addition, the DHT makes sure the replication ratio is maintained at all times by generating additional replicas if required. Although network coding has been studied [WK02] for decades and applied in nu- merous research systems as well as commercial products [Wua], replication [SS05] remains the most widely used technique to provide redundancy in peer-to-peer net- works. Given that redundancy is required for ensuring availability and durability, the sys- tem must guarantee consistency among the replicas. Unfortunately, like every In- ternet-based distributed system, DHT s are built on top of an asynchronous physical network, making it impossible [FLP85] to distinguish slow from faulty nodes. This CHAPTER 2. BACKGROUND 22 property impacts the consistency algorithms which must often take further steps should the nodes not be responding or acting maliciously. DHT s rely on consensus algorithms in order to cope with Byzantine behaviours in asynchronous networks. As summarised by Chockler et al. [CGKV08], such algo- rithms vary in several dimensions from the consistency guarantees, to the number of failures tolerated, to the performance achieved. Consensus algorithms can be classified in two categories: agreement and quorum protocols. neighbour node home node node routing link neighbour link Figure 2.7: The replication-based DHash distributed hash table Byzantine agreement protocols such as the BFT (Byzantine Fault Tolerant) [CL99] protocol and Paxos [Lam98, Lam01] achieve consensus through voting and can tol- erate up to γ Byzantine nodes by relying on ϕ ≥ 3γ + 1 servers. Such algorithms work as follows. A client willing to perform an operation starts by sending a re- quest to a server i.e. the leader. The server having received the client’s request then forwards it to the other servers. Every server receiving such a vote request responds to every other server, hence leading to a consensus. Finally, the leader CHAPTER 2. BACKGROUND 23 server transmits the servers’ decision back to the client. This multi-phase protocol is illustrated by Figure 2.8. Although such algorithms are extremely powerful for dealing with Byzantine behaviours, they unfortunately suffer from the several rounds of communication which generate a number of messages quadratically proportional to the number of servers involved. Agreement algorithms are thus often considered as being too expensive [DW01, Bus07] for many applications. Noteworthy is that many improvements and optimisations have been developed over the years. For in- stance, Borran et al. [BS10] proposed a leader-free Byzantine consensus algorithm while others [CSP07, LMZ09] presented optimisations of the Paxos algorithm re- garding specific configurations: multiple coordinators, reduced number of rounds in the absence of failures etc. server leader server server server client request propose accept reply Figure 2.8: The Paxos agreement protocol On the other hand, quorum-based algorithms [AJ92, MR97, GKLQ07, MAD02] consist of retrieving a subset of the replicas to make sure to identify the latest valid version of the object. Quorums rely on the property of intersection in order to minimise the number of storage nodes to contact but also to prevent conflicts. As for agreement protocols, quorum-based algorithms have been the subject of numerous research projects which have led to further improvements, especially in the field power management and mobile networks [BF08, GDZ+05, KLW11]. Although many quorum-based algorithms have been presented throughout the his- tory of distributed computing, Gifford et al.’s [Gif79] quorum-based protocol is de- tailed next because of its simplicity. CHAPTER 2. BACKGROUND 24 Considering a distributed system in which blocks are replicated on ϕ storage nodes, a client willing to perform an operation must acquire a quorum complying with the following rules, where ζr and ζw represent read and write quorums’ cardinality, respectively: 1. ζr + ζw > ϕ 2. ζw > ϕ 2 The first rule prevents read-write conflicts while the second rule prevents write-write conflicts, both contributing to maintain serialisability. (iii)(ii)(i) read quorum write quorum Figure 2.9: Three Gifford quorum configurations Figure 2.9 depicts a set of twelve storage nodes in three different quorum config- urations. In the first configuration, read and write quorums are composed of six and seven nodes, respectively, such that the quorums intersect on a single replica. Therefore, assuming that a client updates an object by contacting seven replicas out of the twelve present in the system, a subsequent read through a quorum of six replicas will inevitably provide the client the recently updated version. The second configuration does not comply with the Gifford quorum rules since ζw > ϕ 2 is not respected. Therefore, up to four clients could modify the object concurrently, lead- ing the system to an inconsistent state. Finally, the last configuration is generally referred to as ROWA (Read One, Write All). Indeed, in such a configuration, read- ing an object requires the client to contact a single node while all the nodes must respond positively whenever an object is updated. It is however extremely important to note that most of the configurations depicted in Figure 2.9 would not be suitable for Byzantine environments. Indeed, in case of arbitrary failures, such as in peer-to-peer networks, ϕ ≥ 3γ + 1 storage nodes are CHAPTER 2. BACKGROUND 25 required to tolerate up to γ Byzantine nodes while read and write quorums must contain, at least, 2γ + 1 replicas. To conclude, agreement protocols provide more expressivity than quorum protocols. Indeed, while agreement protocols can achieve consensus on virtually any kind of operation, quorum protocols are limited to reads and writes. However, quorum algorithms have proved to be well suited for peer-to-peer file systems which are built on top of distributed hash tables, such constructs providing functionalities as basic as Put(α, β) and Get(α). Besides, both agreement and quorum protocols are equally constrainable from the client’s perspective since at most 2γ + 1 instances of the block must be retrieved in order to cope with Byzantine behaviours. On the other hand, agreement protocols’ several voting phases imply a high number of message exchanges. Thus, agreement protocols are often considered as being very expensive [BBB+04, DW01, Bus07], especially in the context of peer-to-peer file systems though many projects have been making use of those [KBC+00, ABC+02]. 2.3 Peer-to-Peer File Systems The very first distributed systems targeted local-area networks, the most famous and still widely used being NFS (Network File System) [Osa88]. Such local-area distributed file systems are characterised by a low network latency as well as trust- worthy clients and servers, both evolving within a single administrative domain. Unlike local-area network file systems, AFS (Andrew File System) [HKM+88a] ad- dresses larger networks characterised by higher latencies and a larger number of computers. Such file systems rely on loose caching policies [SS96] in order to reduce the communication between the clients and the servers. Moreover, systems such as Coda [SKK+90] and Ficus [JGH+98] enable offline access through the use of opti- mistic replication [SS05], applying modifications once the computer re-connects to the network. Many other file systems were designed for small- and medium-sized networks, all with different objectives and constraints including Kosha [BJZH04] which equips NFS with redundancy over a scalable network, xFS [WA93, ADN+95], a wide-area file system relying on massive caching techniques, Plan9 [PPD+95], a distributed computing environment following the UNIX philosophy and LBFS (Low Bandwidth File System) [MCM01] which reduces communications by relying on indexes and applying Rabin fingerprints to the chunks of data. Unfortunately, these distributed file systems rely on trusted and often centralised servers making them impractical in more open environments. SFS (Self-Certifying File System) [MKKW99, KSMK03, Maz01, FKM02] addresses this issue by relying CHAPTER 2. BACKGROUND 26 on multiple self-certifying domains rather than a single global open network. Since the domains are independent, the domain management is assured by the local au- thority with its own rules and policies. The self-certifying property of SFS makes it impossible for an attacker to pretend to be or belong to another domain. SFS therefore achieves scalability through openness although the domains’ independence implies that the failure of a single domain renders all the users, groups, files and directories of this domain unavailable. Peer-to-peer networks have been shown [BDET00] to exhibit very interesting prop- erties for building highly available and reliable file systems. The remainder of this section discusses in detail some of those peer-to-peer file systems especially regarding their capacity to scale, cope with Byzantine behaviours but also provide common file system features such as access control. 2.3.1 Pangaea Pangaea [SKKM02] is a wide-area read/write file system which relies on an ad hoc decentralised storage infrastructure of trusted servers. Pangaea aims at providing clients efficient data access through the use of pervasive replication. In order to optimise the data placement, the nodes of the system are split into disjoint regions. A region is composed of nodes grouped according to their network latency. Every node maintains a global state of the whole system including the list of the nodes of the region, their network latency and free disk capacity, the location of the root directory’s replicas, the list of the regions etc. This information is propagated throughout the network periodically by means of an epidemic protocol. Pangaea maintains, for every file, a distributed and highly connected graph of the nodes storing replicas known as gold replicas. Such replicas are statically defined at file creation and are used to maintain a minimum replication ratio at all times. In addition, bronze replicas are also connected to the graph in a loose manner. Indeed, bronze replicas are created in a dynamic way i.e. every time a node accesses the file. This replication graph, composed of both gold replicas and bronze replicas, is used to propagate updates throughout the network in an efficient way. Figure 2.10 illustrates two directories along with both their gold and bronze replicas. Every directory replica instance contains the locations of the entries, being files or sub-directories. These locations are represented on the figure by the references. In addition, although not depicted by the figure, every replica maintains a backpointer to the parent directory which is used to update the parent directory should the gold replicas of a sub-entry be moved to another node. The data structure described above has the advantage of distinguishing the repli- cation of the directory from the replication of the objects, files and directories it CHAPTER 2. BACKGROUND 27 contains. Therefore, adding or removing bronze replicas only requires updating the graph related to this object while leaving both the parent directory and the potential sub-objects out of the process. Indeed, the only operation requiring updating of the parent directory is the modification of a gold replica’s location. The modifications applied to a replica are propagated throughout the graph following the edges connecting the replication nodes. Note that an operation description—e.g. create file ‘bar’ in directory ‘/foo/’—rather than the new object’s state is propagated to the replication nodes. Pangaea makes use of the last-write-wins consistency model by relying on global timestamps through the use of a NTP (Net- work Time Protocol) server. However, directory conflicts are resolved automatically if possible or left to the user otherwise. / /foo/ object gold replica bronze replica replication edge reference Figure 2.10: Pangaea file system representation Although Pangaea provides a powerful storage infrastructure through localised repli- cation, the assumption of a trustworthy network makes it impractical for most en- vironments. Indeed, since both authentication and access control is handled by the trusted servers, a single Byzantine node could easily harm the whole system. CHAPTER 2. BACKGROUND 28 2.3.2 OceanStore OceanStore [KBC+00] is a generic distributed storage infrastructure relying on the Tapestry [ZKJ01] DOLR (Decentralised Object Location and Routing). OceanStore aims at providing a wide range of consistency models in order for applications to bal- ance the trade-offs between performance and consistency. OceanStore’s architecture is partitioned into two levels or tiers. The first level, known as the primary tier, is composed of highly available nodes divided into multiple groups. Each group is responsible for a subset of the objects of the system. The nodes belonging to a group store the primary replicas of the objects the group is in charge of. The BFT [CL99] agreement algorithm is used by the group members for authorising, validating and applying operations on the replicas despite the potential presence of Byzantine nodes. Note however that such an algorithm is expensive as it requires three communication rounds between the servers to perform a single operation, generating O ( ψ2 ) messages, assuming every group is composed of ψ servers. Therefore, the nodes composing the primary tier must be very powerful, well connected and highly available to handle the high network load. The secondary tier is composed of more transient nodes with high latency and low bandwidth such as personal computers for instance. This level constitutes the mass storage capacity of the system in which secondary replicas are created in order to improve local accesses. As Figure 2.11 illustrates, whenever a client node modifies an object, being a file or a directory, a request is sent to the primary replication nodes as well as some randomly chosen secondary replication nodes. While primary servers serialise and verify the operation validity by running the BFT algorithm, the request is propa- gated to the other secondary replication nodes in an epidemic way. Finally, once approved, a confirmation is propagated throughout the network, sealing all the sec- ondary replicas. A certificate is attached to every object’s version, asserting the approval of the primary tier. Since primary servers may be malicious, the certificate cannot be generated by a single server. Therefore, OceanStore relies on threshold signature schemes [AMN01] so that a certificate is considered valid if composed of bψ−1 3 c + 1 legitimate partial signatures. OceanStore makes use of optimistic concurrency control for optimising operations’ response times. In order to detect conflicts, the system implements a semantic detection mechanism based on predicates. Therefore, depending on the type of object, whenever a conflict is detected, a pre-defined list of operations is applied to the object such as inserting, replacing or truncating the data. Note however that, although this mechanism is generic enough to automatically resolve conflicts, the CHAPTER 2. BACKGROUND 29 system often lacks semantic information since objects are usually encrypted. primary tier secondary tier (1) (1) (1) (2) (3) (3) (3) (3) (3) (3) (3) node message Figure 2.11: OceanStore’s organisation Security is provided through the use of ACL (Access Control List)s. An ACL con- tains the public keys of the users allowed to access and modify the object. Whenever an object is modified, it is signed by the user and submitted to the primary repli- cation servers. Then, every client and server retrieving the object can check the object’s validity by verifying the digital signature. Although OceanStore provides a powerful and flexible storage infrastructure, its hybrid architecture makes it more difficult to scale to large networks than other, completely decentralised, peer-to-peer file systems. 2.3.3 FARSITE FARSITE (Federated, Available, and Reliable Storage for an Incompletely Trusted Environment) [ABC+02, DAB+02] is a file system based on an ad hoc partially decentralised storage infrastructure. FARSITE aims at emulating the behaviour CHAPTER 2. BACKGROUND 30 of a centralised file system such as NTFS (NT File System) in a medium-scale environment and without introducing new semantics such as file versions, conflict resolutions etc. FARSITE has been designed to be deployed on the commodity hardware of medium- sized networks such as universities or companies. Such an environment is charac- terised by a high bandwidth network and transient nodes. Every node in FARSITE can play up to three roles: clients issue requests on behalf of end-users, servers store object replicas and managers, as members of a management group, take part in administrating the system’s metadata. Every management group is in charge of a subset of the file system’s namespace. The members of a management group act collectively through the use of the BFT agreement protocol. (1) (2) (3) (3) (3) client replication server management group node message Figure 2.12: The FARSITE architecture Every member of the management group maintains a copy of the metadata related to the namespace’s subset it is in charge of. Access and modification requests are sent to the management group which runs a vote to serialise the operation. The man- agement group keeps the locations of the replicas of every file lying in its namespace CHAPTER 2. BACKGROUND 31 as well as the hash of the file’s content. A client willing to access a file starts by contacting the management group in order to locate the file’s replicas. The client retrieves an instance of the file from one of these locations and checks the file’s integrity by re-computing the hash of the content. The client then caches the file locally for subsequent accesses and/or modifications. Whenever modified, a hash of the new file’s content is computed and sent to the management group which verifies that the user has the proper credentials for that operation. If the request is accepted, the servers storing the file’s replicas are told to retrieve the new file directly from the client. Figure 2.12 illustrates the described update protocol. FARSITE ensures strong consistency guarantees by relying on leases. Whenever a client reads a file, it is granted a lease by the management group, guaranteeing the freshness of the client’s local copy. FARSITE provides two types of lease: read leases ensure that the file will not be modified until the lease expires or is revoked while write leases guarantee an exclusive access to the file. Note that considering a client requesting a file, the system would immediately revoke the leases so that the eventual modifications are pushed back to the replication nodes, bringing the system back to a consistent state. Then, the requesting client could carry on its operation and retrieve the file in its latest form. FARSITE provides security through the use of convergent encryption. Every file is assigned an ACL containing the public key of the users authorised to modify the object. Whenever a client requests an operation to the management group, a secure communication channel is established in order to authenticate the user. The system guarantees that unauthorised users cannot access a file through the following protocol. For every new file, the client generates a random symmetric key and encrypts the file’s content with it. Then, the symmetric key is encrypted with the public key of every user having been granted permission to read the file. These encrypted symmetric keys are finally sent to the management group so that whenever a client requests a read operation, the management group returns the client its encrypted symmetric key. Then, the client can decrypt the symmetric key using the user’s private key. 2.3.4 CFS CFS (Chord File System) [DKK+01] is a completely decentralised file system relying on the DHash distributed hash table as a block storage abstraction. CFS aims at ensuring data integrity while balancing the storage load across the system’s nodes. The particularity of this system lies in the fact that a single user can update it, such that CFS is often considered a read-only file system. The entire CFS architecture relies on a block unit known as the CHB (Content CHAPTER 2. BACKGROUND 32 Hash Block). The special property of CHBs is that such blocks are self-certified, making the integrity verification process straightforward. Indeed, a CHB ’s address is computed by applying a one-way function, such as the SHA-1 hash function for instance, to the block’s content. That way, whenever a client retrieves such a block, its integrity can be verified by re-computing the block’s hash and checking if the fingerprint corresponds to the requested address. Unlike other file system objects however, the root directory is stored in a PKB (Public Key Block). PKBs are associated with a cryptographic key pair such that the block’s address is computed by applying a one-way function to the public key. In addition, a digital signature of the block’s content is embedded in the block for authenticity and integrity purposes. Noteworthy is that, unlike CHBs which are immutable, PKBs can be modified since their public key does not change over time. PKBs also embed a version number, which is increased whenever the object is updated, to differentiate the multiple instances of an object. α root Inode users music home README Inode README jmq ... Inode Welcome to CFS! PKB CHB CHB CHB CHB CHB Inode CHB Data Data Data mutable block immutable block relation Figure 2.13: The CFS hierarchical organisation Since the root directory block is mutable, the administrator can update the file sys- tem by re-signing the PKB with the private key he possesses. However, modifying the file /home/README, for instance, would imply, creating a new CHB for the file. Since the directory containing the file is a CHB as well, modifying a file also changes the directory content. Indeed, the file has a new content hence a new address and the directory content contains tuples of the form (name, address). Since the address CHAPTER 2. BACKGROUND 33 for the file /home/README has changed, the directory must be updated accordingly. Finally, since the directory /home/ just changed i.e. has a new address, the parent directory must be updated as well, and so on up to the root directory block. There- fore, modifying a single byte in a file implies updating the file system hierarchy up to the root block, which is in turn, re-signed by the administrator. Such a modifica- tion process is extremely expensive and inconceivable in a production environment. Figure 2.13 illustrates the CFS hierarchical organisation based on the UFS (UNIX File System) in which metadata are stored in objects known as inodes. The CFS ’s approach regarding file representation differs from many systems such as PAST [RD01b]. In PAST, every file constitutes a DHT block while CFS split files into chunks of regular size. Splitting files into chunks has the advantage of better balancing the storage load across nodes. However, since files are composed of multiple data blocks, the DHT routing process is requested more often, potentially leading to performance loss and increased security threats. 2.3.5 Ivy Ivy [MMGC02] is a multi-reader/multi-writer file system relying, like CFS, on the DHash distributed hash table. Ivy ’s architecture is based on per-participant logs describing the modifications the given participant has performed on the file system. Ivy implements a log in the form of a chain of records. Every record is stored in a CHB while the head of the chain is referenced in a PKB, modifiable by the participant. A view of the file system is composed of a set of such logs, as illustrated in Figure 2.14. The address of the log head blocks of the participants involved in the view are referenced in a view block. The view is stored in a CHB. The address of the view block identifies the file system and is therefore distributed to the users, enabling them to access and potentially modify it. Noteworthy is that since the view block is immutable, adding or removing a user to the view implies creating a new view. The records are sequentially numbered while the highest attributed number n is kept in the log head. Every record is also identified by a vector timestamp [ni] corresponding to the highest sequence number n of the various i logs composing the view at the time of the record creation. A user willing to modify the file system starts by reading the head block of every log composing the view in order to determine the vector timestamp to use for the new record. Then, the user adds a record to her log describing the operation performed such as the file path, offset, length and data for a write operation for instance. Consulting the file system however requires the user to explore all the view ’s logs from head to tail, looking for records related to the object and area of interest. CHAPTER 2. BACKGROUND 34 During this process, the client sorts the records according to their vector timestamps. For example, reading a file would require the client to locate all the write records related to the given file and intersecting with the area to read. CHB PKB CHB CHB View Head Record Record Recordα mutable block immutable block relation Figure 2.14: The Ivy log-based representation Although this process of serialisation is extremely expensive, as the performance depends on the number of writers involved, representing a file system by means of per-participant logs has two advantages. Firstly, assuming that a malicious user is detected, a participant can easily use another view which does not include the malicious user. Indeed, any subsequent operation would simply make use of the new set of logs. Secondly, in case of a partitioned network, every participant can continue accessing the file system, assuming that every partition contains at least one copy of the logs. Although modifying the file system in such a partitioned environment can potentially lead to more conflicts, Ivy already provides the necessary tools for dealing with such situations. Ivy makes use of optimistic concurrency control since operations are independently transcribed into records and serialised whenever a client needs to reconstitute the object. Note however that relying on independent per-participant logs does not pre- vent conflicts. Indeed, a scenario involving two participants modifying the same area of the same file based on the same file system state—i.e. vector timestamps—would obviously lead to a conflict. Ivy resolves such conflicts by ordering modifications CHAPTER 2. BACKGROUND 35 according to the user’s public key, hence guaranteeing the same consistent view for all the participants. Regarding security, Ivy does not provide access control mechanisms. Indeed, a user willing to restrict access to one of her files would have no choice but to manually encrypt it and distribute the encryption key to the authorised users making Ivy impractical for deployed environments. 2.3.6 Plutus Plutus [KRS+03] is a decentralised file system built upon an ad hoc overlay network. Plutus aims at detecting and preventing unauthorised accesses, differentiating be- tween read and write permissions and enabling the change of access rights. Plutus access control is based on two ideas. Firstly, the key distribution process is delegated to the client, leading to better server scalability while allowing the user to set arbitrary policies. Secondly, in order to reduce the number of keys users must keep, files are grouped into filegroups. The aggregation mechanism of filegroups prevents the number of keys the user has to manage to grow proportional to the number of files. An RSA (Rivest Shamir Adleman) key pair is associated with each filegroup. Files are grouped in filegroups according to their sharing attributes so that two files shared by the same users will have the same encryption key. Since users tend to use the same access control rules for their files, the number of filegroups a user’s files belong to can be expected to be very low. On the downside, using the same key for encrypting multiple files has the disadvan- tage that the same key encrypts more data, potentially increasing the vulnerability to known plaintext and ciphertext attacks. Plutus therefore uses unique encryp- tion keys for different files and stores those keys in a file-lockbox whose key is then distributed to the users of the same filegroup. Figure 2.15 illustrates the different keys involved in Plutus. Every file is split into data blocks, each of those blocks being encrypted with a unique symmetric file- block key. The lockbox contains all the file-block keys of the file and is encrypted with a symmetric file-lockbox key which is distributed to both readers and writers alike. Note that file-lockbox keys are the same for all the files belonging to the same filegroup. A hash of the file contents is computed for integrity purposes and signed with a file-sign private RSA key. The signature can subsequently be verified with a file-verify key i.e. the associated RSA public key. The file-sign key is handed to writers while the file-verify key is handed to readers so that the system can differentiate read from write access control. Thus, whenever a user modifies a file, she re-computes the hash and re-signs it. Readers however check the file’s integrity CHAPTER 2. BACKGROUND 36 by verifying the signature with the file-verify key and then make sure the hashes are valid according to the file’s contents. hash file−signsignature Data Data file−block Data file−block file−block file−block hash hash Inode Lock Box file−block file−lockbox file−block hash hash hash Inode hash Lock Box file−lockbox file−signsignature file−block file−block Data Data file−block file−block Reader/Writer file−verify Reader file−verify Writer file−lockbox file−lockbox file−lockbox file−sign file−sign File Group block relation Figure 2.15: The Plutus ’ keys, locks and groups Regarding access control management, Plutus makes use of lazy revocation. Indeed, re-encrypting the file’s contents whenever a user is revoked would incur a large performance overhead. Instead, the re-encryption is delayed until the file is modified. Plutus relies on key rotation [FKK06] to address the issues of lazy revocation in the context of file groups. Although Plutus aggregation of files according to their access control rules is ex- CHAPTER 2. BACKGROUND 37 tremely interesting, the overlay network has never been described. Besides, Plutus’ access control scheme may lack flexibility when it comes to managing hundreds, thousands or millions of users since the system does not provide any mechanism for aggregating users into groups, for instance, which would greatly ease access control management, especially in large-scale peer-to-peer file systems. 2.3.7 Pastis Pastis [mBPS05, Bus07] is a large-scale read/write peer-to-peer file system. Pastis relies on the PAST [RD01b] distributed hash table built upon the Pastry [RD01a] overlay network. α root Inode Data users music home README Inode Data README jmq ... Inode Data Welcome to Pastis! PKB CHB PKB CHB CHB PKB Inode PKB mutable block immutable block relation Figure 2.16: The Pastis organisation Pastis follows CFS ’s architecture but introduces mutable-block-based metadata in order to overcome CFS ’s major issues. Indeed, CFS is said to be a read-only file system because it can only be modified by the administrator. Besides, since its architecture is based on immutable blocks, whenever an object is modified, the hierarchy must be updated up to the root block, which is then re-signed by the administrator. Pastis introduces PKBs along the way so that modifying an object only implies creating new immutable data blocks as well as updating the mutable metadata block. Since the number of blocks modified is independent of the hierarchy depth, Pastis’s design is far more efficient than CFS ’s. In addition, since PKBs can CHAPTER 2. BACKGROUND 38 only be updated by a user possessing the private key required to re-sign the content, a Pastis file system can be modified by multiple users i.e. every object can be updated by its owner. Figure 2.16 illustrates the Pastis’ UFS -like hierarchical organisation composed of both metadata PKBs, known as inodes, and data CHBs. The figure depicts the use of PKBs which stops the update propagations. Since creating a file system object in Pastis implies creating a PKB, the user may end up keeping a lot of cryptographic key pairs. In order to avoid storing all those sensitive keys, PKBs embed both the public key of the block owner and a signature of that key done with the block’s generated private key. Therefore, the user does not need to keep any keys except her own key pair. Then, every operation performed on an object is signed with the user’s private key instead of the block’s private key. Pastis also provides a write-only certificate-based access control mechanism. Any user willing to grant the permission to modify one of her objects can generate a certificate and distribute it to the authorised users. Then, whenever such a user performs a modification on the given object, she attaches the certificate to the inode’s PKB, proving that this operation is legitimate. Thus, a client retrieving the inode can verify that both the block and the certificate are valid. A certificate embeds the inode block’s public key, the authorised user public key as well as an expiration date, the whole being signed by the object owner. Pastis therefore provides both the access control and consistency models required to build a usable file system in a peer-to-peer environment. However, the access control scheme still suffers from the number of certificates the user must keep. ? ? ? This section intended to detail the internals of the major peer-to-peer file systems developed over the last decades and to give an overview of the trade-offs involved in the design of such systems. Research was also conducted in order to improve peer-to-peer file systems in alternative ways: Chefs [Fu05] is an access-controlled content distribution network built upon SFSRO (SFS Read-Only) [FKM02], Total Recall [BTC+04] is a system predicting hosts availability in order to optimise replica placement, TFS (Transparent File System) [CCB07] is a transparent layer which makes use of the unused local storage until the local operating system claims it and overwrites the cached peer-to-peer data and Glacier [HMD05] is a storage system relying on erasure codes in order to increase availability and durability. Chapter 3 Environment This chapter starts by defining the properties end-users expect a modern storage system to provide before carefully defining a model capable of guaranteeing them all. Finally, the objectives and assumptions of this work are defined according to the background discussed in Chapter 2. 3.1 Properties This section discusses the properties end-users expect from a modern storage system. Although the properties below have been defined with the objective of designing an ideal storage system, some may seem more desirable than mandatory: anonymity, mobility, transparency, capacity or even cost for instance. Durability Durability ensures that once the system has agreed on storing some data blocks, those blocks will never be lost. A system lacking this property would be incapable of guaranteeing the user to even- tually retrieve her files. Commodity hardware such as hard disks but also external backups fail to ensure this property. It is actually impossible to ensure durability by relying on a single instance of the data because the hardware storing this instance could be destroyed, stolen etc. Therefore, reliable systems tend to rely on redundancy such as replication to guar- antee durability. Noteworthy is that peer-to-peer file sharing applications such as Bittorrent [Coh03] actually lack this property. Indeed, such systems make use of pervasive replication since every client retrieving a file implicitly creates a new replica which can be used 39 CHAPTER 3. ENVIRONMENT 40 to serve other client requests. Therefore, while popular content achieves a high replication ratio, rarely accessed files eventually get lost. This lack unfortunately makes the system difficult to use for users seeking unpopular content. Integrity A system providing integrity ensures that a client retrieving a block of data will end up with the exact content that has been previously inserted. Therefore, this property also guarantees that the data has not been altered in any way. The property is usually provided through the use of integrity codes such as crypto- graphic hash functions, authentication codes or digital signatures. Although most distributed systems perform the integrity verification process on the server handling the client’s request, the client itself should be able to verify the block’s integrity if given enough information such as the identity of the user emitting the authentication code for instance. Availability The availability property ensures that a data block stored by the system remains accessible at all times. This property coupled with the durability property makes the system reliable. As for durability, replication is a way of achieving availability by maintaining a replication ratio such that even if some replication nodes fail, enough replicas remain in the system for the clients to access the data. Note that applications relying on pervasive replication, such as Bittorrent [Coh03], can ensure neither the durability nor the availability property because the replication ratio depends on the content’s popularity. Privacy The privacy property ensures the user the possibility to keep her files completely private, both from other users and more powerful entities such as the potential organisation distributing the software, the user’s ISP (Internet Service Provider) or even governments etc. Privacy is usually provided by means of cryptography: every stored data block is first encrypted on the client side so that the servers never have access to the data in its plain form. CHAPTER 3. ENVIRONMENT 41 Sharing As described in Section 1.1 sharing has become increasingly important to Internet users. A viable system would therefore provide users means to share data with other specific users in an easy way. In order to prevent both the organisation running the system and unauthorised users from accessing the data, systems usually distribute the cryptographic key used for encrypting the data to the users granted access. Anonymity People are usually concerned about companies or govenment entities analysing users’ doings on the Internet. Indeed, people expect the same rights they are granted when it comes to their home privacy for instance. Anonymity should therefore be guaranteed by the system such that nobody, not even the government or the user’s ISP can know what data the user is storing, sharing or accessing. Although anonymity has been studied through various projects [CSWH01, DFM01], the mechanisms used for providing such a guarantee are often very expensive and therefore impact the user experience. Versioning Computers have become the ultimate tool for treating information. As such, every document evolves in its digital form from one version to the next. However, users may wish to undo a modification or rollback to a past version. Such a feature is considered fundamental in revision control systems such as CVS (Concur- rent Control System), Subversion and Git. Besides some storage systems [SFH+99] have started integrating such a functionality at the file system level. Therefore, any modern storage system should provide a way for users to track the modifications applied onto documents but also to naviguate through the versions and potentially restore a specific one. Mobility With the increasing diversity in mobile devices, people are trading their old sin- gle desktop computer for a variety of small nomad devices from mobile phones to netbooks to tablets and so forth. CHAPTER 3. ENVIRONMENT 42 A viable platform for storing, sharing and synchronising files should be accessible from all these mobile devices by coping with the characteristics of such resource- limited computers. Organisation Since the advent of personal computers, people have been used to manipulate in- formation through the abstraction known as the file. In addition, the hierarchical organisation consisting of a directory containing files and sub-directories has made its way to the general public as the traditional way for organising information. However, recent research along with some commercial products tried to introduce another way for retrieving and searching documents through tags. Although this scheme has not yet supplanted the hiearchical organisation on personal computers, it appears to be a serious alternative. No matter which scheme a storage system uses, the user requires a way for organising and managing her files. Transparency The files stored by the system should be accessible in a transparent manner such that the end-user does not have to differentiate accessing a locally stored file from accessing a file stored through a remote system. More precisely, the system should enable existing applications to manipulate the files stored through the given storage system as they did when stored on a local hard disk for instance. Efficiency The user experience is crucial, especially when it comes to accessing files that were supposedly stored locally and therefore quickly retrievable. The system should therefore focus on giving the end-user the impression that ac- cessing files residing on other computers through the Internet is actually “as fast as” accessing them locally. The user could well be aware of the fact that the network protocols impact the performance of the system especially regarding the network latency since nodes may be geographically far from each other. However, the networking aspect of the system should not make the user’s common operations a hassle, such as watching a movie, working on an office document, listening to music etc. CHAPTER 3. ENVIRONMENT 43 Various techniques could be used for achieving efficiency from optimised network protocols [Coh03] to caching algorithms for instance. Capacity The user should not be limited regarding the number of files or the size of the files that she can store and should have access to a storage capacity of the same order of magnitude as the capacity offered by her local hard disk. Indeed, given the evolution of the hard disk prices, user may lack incentive to move on to a reliable, secure and available storage infrastructure if one can get twice as much storage capacity by buying cheap external drives from a nearby hardware store. Cost Along with the capacity property, the cost of such a system should be low for the client wishing to use it. Besides, the costs for the organisation developing and maintaining the infrastructure should also be as low as possible because such costs would have to be passed on to the consumer, one way or the other. 3.2 Model This thesis claims that a file system abstraction on top of a peer-to-peer network is the most suitable model for achieving the fundamental properties defined above. 3.2.1 File System End-users have been accustomed to hierarchical organisations since the introduction of the file system paradigm. Providing the user a similar way to organise files is crucial. Although most storage services and products [Box, Dro, Win] provide such a hierarchical organisation, some still put the user in front of a flat name space. Peer-to-peer file sharing applications [Coh03, CSWH01, DFM01] for instance fail to offer users a hierarchical organisation making it difficult for people to organise the files they contribute to the system but also to browse other users’ contributions. Although the organisation property is fundamental, transparency is also extremely important. Much storage software [Box, Dro] forces the user to use a specific appli- cation. It may, at first, seem natural from the system designer’s perspective because CHAPTER 3. ENVIRONMENT 44 defining a specific interface gives the application the liberty to interact directly with the end-user but also to offer features specially designed for this system. However, such systems also suffer from it since breaking the compatibility with all the appli- cations relying on the file system interface automatically isolates the software from the rest of the world. Indeed, applications would not be able to use files stored through a system incompatible with the standard file system interface meaning that users would not be able to play their music files or watch their movies, they would have to first retrieve the file from the network, store it on the local disk before the application could proceed and open it. Thus, in order to respect the organisation and transparency properties, and accord- ing to end-users habits, such a storage infrastructure should be accessible through a standard file system interface. 3.2.2 Peer-to-Peer As studies [DB99, BDET00, BDET00, HAY+05] suggest, file systems can benefit from the peer-to-peer architecture in a number of ways. Peer-to-peer systems offer a way to aggregate and make use of the resources on com- puters across the network hence building a virtually infinite and highly adaptable system. Research showed [Vog99, DB99] that the usage of computers’ storage oscil- lates between 53% and 87% meaning that a large portion of the local storage space is, most of the time, unused. This storage characteristic indicates that peer-to-peer networks can ensure the durability, availability, versioning and capacity properties by making use of a user’s unused space for replicating the other users’ data. Fur- thermore, by relying on the clients for contributing the system in bandwidth and storage capacity, the costs for running such a system are kept extremely low, both for the organisation running the software and the end-user. As shown in Chapter 2, peer-to-peer overlay networks, more specifically structured overlay networks, have been designed to be highly scalable. However, although this characteristic implies that the load put on the nodes depends on the size of the network, it does not guarantee that mobile devices, for instance, will have enough resources to support such a load. Indeed, a user might want to contribute the peer-to- peer network from her home desktop computer only, while accessing it from multiple other devices, not mentioning that such resources-limited devices may not have the capacity to maintain the local network state. Fortunately, structured overlay networks have been designed to be highly tunable through several parameters. For instance, the Chord [SMK+01] overlay network’s base parameter can be chosen in order to achieve the desired trade-off between lookup performance and the size of the local state every node must maintain. CHAPTER 3. ENVIRONMENT 45 The large-scale nature of peer-to-peer networks also contributes to improving the overall system performance. Indeed, since data can be distributed throughout the network but also retrieved from multiple nodes at the same time, the bandwidth load is naturally balanced between the computers contributing to the system. The Bittorrent [Coh03] peer-to-peer file sharing application gained in popularity due to its efficient network protocol which makes use of this characteristic. The peer-to-peer model therefore appears as a natural network paradigm for ensuring most of the system’s properties such as durability, availability, versioning, mobility, efficiency, capacity and cost. 3.3 Mission Peer-to-peer networks have been shown to exhibit many interesting characteristics but also introduce many challenges. This thesis does not discuss the challenges related to overlay networks or even distributed hash tables because a substantial amount of work has already been achieved in these fields, as attested by Chap- ter 2. Topics ignored by this work therefore include, but are not limited to, re- dundancy algorithms [HAF10], consistency models, agreement protocols, fault tol- erance, atomicity, garbage collection [BCK+09], overlay network’s identifiers as- signment [FJG06], mutual exclusion algorithms [MCG05] and routing algorithms [HCW10, dALF10, HB11]. Unlike centralised facilities which are very expensive to build and maintain, peer- to-peer systems do not require any special administrative or financial arrangements. Such systems therefore became very popular for exchanging information freely, out- side any control. Research in anonymity arose as an additional step to freedom on the Internet, led by well-known projects such as FreeNet [CSWH01] and FreeHaven [DFM01]. Although anonymity may be considered by many as a fundamental re- quirement in today’s digital world, this research topic will not be discussed in this thesis and is left as future work [CLL07, ZSJ06, Mha11]. Likewise, the versioning [CRS05, JXY07] feature is not studied in this thesis. As such, the underlying distributed hash table is assumed to store the latest version of every block. Since overlay networks and distributed hash tables have been the focus of the re- search community for more than a decade, this work concentrates on providing file system functionalities in a decentralised, hence untrustworthy environment. The file system component, built on top of a block storage layer, provides the fol- lowing fundamental functionalities. First, file systems introduce the file abstraction: a block of arbitrary information. Then, files are associated with a human-readable CHAPTER 3. ENVIRONMENT 46 identifier, known as path in hierarchical systems, forming a tree-like organisation scheme. Third, the notion of user distinguishes the multiple entities interacting with the file system. Finally, the access control scheme enables users to control their files, directories etc. enabling files to be shared with or protected from other entities. File systems integrate another inherent but non-obvious functionality. Although most file systems’ operations relate to objects such as directories, files and links, a few operate on the whole system configuration. Centralised file systems tend to rely on a specific user, known as root on UNIX -like systems and Administrator on Win- dows, to perform such special operations. This user, being granted super-privileges, can perform system-wide actions such as creating users and groups, accessing or removing any file but also modifying the file system metadata such as its name, its capacity along with some specific parameters. Decentralised file systems however, cannot rely on such a special user because such systems were specifically designed to prevent a single entity from controlling the whole system. Therefore, the admin- istrator entity must be re-considered to fit such a decentralised environment. This thesis focuses on designing a flexible access control scheme for decentralised untrustworthy storage environments, providing peer-to-peer file systems’ users a way to control their files individually. In addition, an administration scheme is discussed which both prevents a single user from completely controlling the system and enables users to request an administrative operation such that, if it is beneficial to the system, it will be carried out. Although the community showed great interest in such distributed systems, rare [HAY+05] are the decentralised file systems to have been deployed. The final ob- jective of this thesis is to develop a viable prototype proving feasible the deployment of such a system to a large number of users in a production environment. 3.4 Assumptions The file system described in this document relies upon a distributed hash table, which in turn, is built on a peer-to-peer overlay network. Although the challenges related to overlay networks and distributed hash tables are not discussed throughout this thesis, several assumptions are made regarding the interface of the underlying storage layer but also the properties and guarantees of the network architecture. First, the peer-to-peer network is assumed to be untrustworthy. Indeed, since such networks are mostly populated by personal computers, no assumption can be made regarding the trustfulness of the contributing nodes. Furthermore, the decentralised nature of such networks coupled with the untrustworthy assumption implies that nodes must operate in a completely symmetric way. Indeed, since peer-to-peer nodes CHAPTER 3. ENVIRONMENT 47 are considered equally unprivileged, everything performed by one node could also be performed by another one. In addition to those fundamental characteristics— decentralisation, untrustworthiness and symmetry— the high dynamicity of such networks requires protocols to be scalable. Finally, the inherent network’s churn implies that no assumption should be made regarding nodes’ connectivity. Second, in order to ensure the durability and availability properties, redundancy has been shown to be an absolute requirement. Projects such as OceanStore [KBC+00] and FARSITE [ABC+02] have been using agreement algorithms such as the BFT [CL99] protocol and Paxos [Lam98] in order to ensure consistency among the repli- cas. Unfortunately, such algorithms are known to be expensive [DW01, Bus07] as detailed in Section 2.2. Other projects such as CFS [DKK+01], Ivy [MMGC02] and Pastis [mBPS05] chose to rely on quorums. Indeed, since peer-to-peer file systems only require to store and retrieve data, such algorithms achieve better performance than their agreement counterparts. According to the efficiency property, the under- lying distributed hash table will therefore be assumed to be making use of quorums for maintaining the replicas in a consistent state. As detailed in Section 2.2, quorum-based Byzantine systems replicate every object on ϕ ≥ 3γ + 1 storage nodes in order to tolerate up to γ malicious nodes while read and write quorums must contain contain, at least, 2γ + 1 replicas. Note however that every client must be able to distinguish the illegitimate replicas provided by Byzantine nodes. Therefore, and since the untrustworthy property implies that storage nodes cannot be trusted, every data item must be self-certified that is, every data block must include the necessary information in order to ensure that (i) the block corresponds to its supposed address (ii) the block’s integrity has been maintained and (iii) the block’s authenticity is guaranteed. The self-certification property along with the replication and symmetry ensures that any client can select the valid instance from a set of replicas. Third, the distributed hash table should provide an interface composed of, at least, the four fundamental routines below. ˆ Put(α, β) ˆ Get(α) −→ ⠈ Gather(α) −→ ⠈ Erase(α) The following details some of the essential distributed hash table’s protocols. Since these protocols reflect the particularities—untrustworthiness, symmetry and self- certification—of the given environment, understanding those protocols will help the reader comprehend the design decisions made in Chapter 4. CHAPTER 3. ENVIRONMENT 48 Put The Put(α, β) routine takes a unique address α along with the data block β to be stored. In order to ensure durability and availability, the block is replicated on a set of nodes Ω such that |Ω| = ϕ. The process of storing a block in the distributed hash table goes as follows. The client starts by sealing the block in order to ensure self-certification. Indeed, since the block is going to be stored on untrustworthy nodes, future clients retrieving the block must be able to detect blocks that have been illegally altered by malicious nodes. Integrity can be ensured through the use use of a cryptographic signature, a MAC (Message Authentication Code) [OM94] or equivalent. The client then computes the block’s address and invokes the Put() routine which locates the nodes Ω responsible for the given address α. The block β is sent to a write quorum of storage nodes after which the client waits for their responses. Once 2γ+1 acknowledgements have been received, the block is considered as being stored by the system. From the storage node perspective, the process consists in verifying the received block’s validity before storing it. The self-certification process is composed of several steps. Firstly, the node verifies that the address α corresponds to the block β. Note that this implies that the block’s address computation must be a function of the block itself. Secondly, the block’s integrity and authenticity is checked, by verifying the embedded cryptographic signature for instance. Note that the Ω storage nodes periodically synchronise with each other in order to maintain replica consistency. Besides, write operations performed on quorums are actually composed of two phases. The first phase consists in locking the quorum nodes in order to ensure mutual exclusion. In the second phase, the client sends the block’s content to the quorum nodes. However, for the sake of simplicity and clarity, the first phase is ignored and will therefore not be discussed throughout this thesis. Get The Get(α) routine is used to retrieve the block identified by α. More specifically, this routine returns to the client the first valid block instance located throughout the network. Therefore, depending on the implementation and the context, the block may be retrieved from one of the Ω nodes, from a node contributing to the routine process and having the block in cache, or even from the client’s cache. Once the client has received the block, it proceeds to the exact same verification process as the servers: (i) the received block β corresponds to the requested address α and (ii) the block’s integrity and authenticity is maintained. The reader should CHAPTER 3. ENVIRONMENT 49 notice the symmetry of the nodes’ behaviour. Indeed, the verifications performed by a node acting as a server will eventually be performed by a client. Therefore, the information necessary to enact a block’s validity must be available to every node, being a client or a server, hence the self-certification. Gather The Get(α) routine’s particularity lies in the fact that the first valid instance found is returned. Therefore, this routine is particularly interesting for retrieving immutable blocks. Indeed, since such blocks do not evolve over time, the client cannot end up with an incorrect version as a single version of this block will exist for ever. Therefore, retrieving an immutable block which passes the validity tests is sufficient to affirm that this block is the one the client is seeking. However, such a routine would not be satisfactory should an application make use of mutable blocks. By using the Get(α) method, a client could end up with a block which is valid but happens to represent an older version of the requested block. Indeed, since data dissemination in computer networks is by definition asynchronous, hence unreliable and non-atomic, multiple nodes may, at the same time, store or cache different versions of the same block. The Gather(α) routine addresses this issue by directly requesting the block from Ω through the formation of a read quorum composed of 2γ + 1 storage nodes. Once 2γ + 1 instances of the block have been received, the client starts by discarding any invalid block i.e. violating the integrity and/or authenticity for instance. Since the algorithm has been designed to tolerate up to γ Byzantine nodes, the number of such invalid blocks should not exceed γ. Finally, among the remaining instances, the client picks the one with the highest version number. Indeed, mutable blocks are expected to embed a version number in order to differentiate the multiple variations of a given block α. Besides, note that a storage node being requested to overwrite a mutable block would verify that the version number of the new block is strictly higher than the currently stored one. This additional verification step ensures mutable blocks evolve in a monotonic way. Figure 3.1 illustrates the protocol in a network tolerating up to γ = 2 Byzantine nodes. The system initial state is consistent since the Ω nodes store the exact same version of the block i.e. version 3, except for the Byzantine nodes whose behaviour cannot be predicted. Then, a client modifies the block by updating a write quorum containing both malicious nodes. Finally, another client requests a read quorum which contains the Byzantine nodes and the two nodes that were not included in the previous write quorum. However, since the algorithm ensures read and write quorums intersection, the client can discard the invalid blocks and pick the latest CHAPTER 3. ENVIRONMENT 50 version among the three remaining instances which happens to be version 4, as expected. 3 33 3 4 33 4 4 33 43 4 4 x x x x x xx x (i) (iii)(ii) write read non−Byzantine node quorum Byzantine node Figure 3.1: A three-step representation of a symmetric quorum-based system The following example, illustrated by Figure 3.2, shows that symmetry is absolutely crucial to decentralised untrustworthy environments. Let us consider a distributed hash table which, for some optimisation purposes, authorises clients to store data between 2 a.m. and 4 a.m. only. A client would start by sending its block to a write quorum of Ω. The storage nodes would verify that (i) the block is valid and (ii) the request has been made between 2 a.m. and 4 a.m., in other words, the current time lies in this interval. Then, let us consider the Byzantine node ω ∈ Ω which stores one of the block’s replicas. This node decides, although it is 5 p.m., to modify the block. However, instead of following the protocol, it just replaces the local replica with its new version and does not bother contacting the other Ω nodes. Considering that the block is valid and has the highest version number, a client wishing to retrieve the block could form a read quorum including ω. Therefore, the client would end up with, say, up to γ − 1 invalid instances of the block, γ + 1 valid instances of the block and one valid, though illegally forged, instance of the block embedding the highest version number. The client, following the protocol, would therefore discard the invalid instances and keep the latest version, version 5 in the example, which happens to be an illegal instance. CHAPTER 3. ENVIRONMENT 51 4 33 4 4 33 4 4 33 44 4 4 x x x x x x 55 (i) (iii)(ii) read Byzantine node non−Byzantine node quorum Figure 3.2: A three-step representation of an asymmetric quorum-based system This example illustrates the system’s lack of symmetry. Indeed, the validation pro- cess is not symmetric because it cannot be performed by clients since they do not have access to the time the block has been stored. Therefore, the clients have no choice but to trust the servers regarding this predicament. Since the environment is assumed to be untrustworthy, the assumption is violated and the system is flawed. Noteworthy is that the Get(α) and Gather(α) methods are distinguished in this document for the sake of clarity. However, most implementations, including the one discussed in Chapter 5, merge both functionalities into a single routine which, depending on the address α, operates in one or the other mode. Erase The Erase(α) routine takes the address of a block that must be removed from the distributed hash table. The client wishing to perform such a removal starts by sending a request to the Ω nodes storing a replica of the block. Storage nodes receiving such a request then challenge the user in order to verify the legitimacy of the operation. Once the client has been authentified as the creator, every replica is destroyed. Note however that the authentication process depends on the type of block, as described in Chapter 4. CHAPTER 3. ENVIRONMENT 52 ? ? ? To summarise, this thesis makes several assumptions among which are the dis- tributed hash table’s properties: decentralisation, untrustworthiness, symmetry, self- certification, scalability, replication through quorum protocols and non-connectivity. Considering such a distributed hash table, this thesis aims at designing a decen- tralised peer-to-peer file system by focusing on providing a flexible access control scheme along with a mechanism for users to request administration tasks in a sys- tem devoid of any authoritative entity. Chapter 4 Design This chapter details the design of the fundamental components of a peer-to-peer file system. The first section focuses on describing the access control scheme which introduces both the notion of user and the abstract representation of file system objects such as files, directories etc. This section starts by defining the objectives of the access control scheme given the environment described in Chapter 3. The model of the access control scheme is then discussed though this model will be refined later one. Finally, the concept behind the access control mechanism is introduced before dis- cussing in details the internal representation of the blocks composing the file system hierarchy. The second section discusses the file system organisation and the necessity for en- abling users to perform administrative operations. First, the semantics of centralised file systems are discussed in the context of peer-to-peer file systems. A model is then proposed for both administering the file system and transferring object ownership. The design of the proposed model is then detailed through the introduction of a new physical block. By the end of this chapter, the fundamental components will have been designed, leading the way to the implementation of a viable peer-to-peer file system prototype. 4.1 Access Control Although many decentralised peer-to-peer file systems have emerged in the last decade, none of them succeeded in providing users with a flexible access control system. The well-known file sharing applications such as Bittorrent [Coh03], Freenet [CSWH01], FreeHaven [DFM01] etc. actually provide a content distribution infrastructure more 53 CHAPTER 4. DESIGN 54 than a way of sharing files since users cannot decide which users are allowed to re- trieve the documents they contribute to the system. Although users may not be interested in controlling access to movies, they might be for more personal informa- tion such as family photos, work documents and so forth. While some distributed file system projects [DKK+01, MMGC02] lack an access control mechanism, others such as OceanStore [KBC+00] and FARSITE [ABC+02] do provide privacy control functionalities. Unfortunately, such systems suffer from fundamental flaws regarding our target environment such as the non-scalability of the network architecture or the use of expensive algorithms, as explained in Chapter 2. More recently, several research projects, including Chefs [Fu05], Plutus [KRS+03] and Pastis [mBPS05], focused on access control in untrustworthy environments. Chefs ’ single-writer/multi-reader design might well suit content distribution applica- tions such as Bittorrent [Coh03] but unfortunately lacks flexibility when it comes to large-scale file systems. On the contrary, Plutus provides a multi-writer/multi-reader scheme but, like Chefs, requires the users to be connected whenever an object’s owner wishes to grant them access. This connectivity requirement is unpractical for large- scale networks where the churn rate has been measured to be very high [LSG+04]. Plutus also puts some trust constraints on the storage nodes handling write op- erations, hence, violating the untrustworthiness predicate. Finally, Pastis’ access control scheme, very much like Plutus’, constrains the users in keeping a constantly growing number of certificates and cryptographic keys. Noteworthy is that the most recent work achieved through Chefs, Plutus and Pastis indicate that issues remain to be addressed. The remainder of this section therefore presents the design of a flexible access control scheme for the given environment which does not require users to keep any access information but their identity i.e. a single cryptographic key pair. 4.1.1 Objectives The following statements define the scope within which the access control scheme has been designed along with the characteristics such a mechanism should incorporate. Although one might disagree with these definitions, this set of rules has been defined in order to provide the access control mechanism functionalities common to most file systems while taking the environment’s particularities into consideration. ∇1 First, the environment characteristics and fundamental properties defined in Chapter 3 must be respected throughout the design process. These include decentralisation, scalability, untrustworthiness, symmetry and self-certification CHAPTER 4. DESIGN 55 but also non-connectivity due to churn, efficiency through quorum protocols and so on; ∇2 A user must be able to modify its object’s permissions. Furthermore, the effects of those modifications should be made effective immediately. Although atomicity is obviously unachievable given the asynchrony of the underlying physical network, this rule suggests that any operation that does not comply with the object’s last set of access control rules should be rejected; ∇3 Any user should be able to consult an object’s current permissions. Note that this rule conflicts with most common file systems which prevent users from collecting information on inaccessible objects. However, in the given context, the lack of a centralised entity makes it difficult to prevent a user from retrieving the block corresponding to the file’s metadata—assuming she knows the block’s address—hence accessing its access control information; ∇4 Section 4.1 showed that the most recent research regarding access control in decentralised environments suffered from the amount of access information users have to keep locally. Indeed, both Plutus [KRS+03] and Pastis [mBPS05] require clients to store a linearly increasing number of keys and certificates. The storage space required on clients for managing objects’ access control should therefore be ideally reduced to a single item; ∇5 The large-scale environment’s characteristic implies an extremely large and dy- namic number of users and files. While common centralised file systems such as ext2 (Second Extended File System) were designed with space consump- tion and simplicity in mind, decentralised file systems must provide flexible capabilities for users to manage the possibly thousands of users having been granted access to an object. The access control scheme should therefore en- able users to create hierarchical groups which, as the name suggests, can be composed of both users and/or sub-groups. This paradigm would enable users to organise their friends, acquaintances, family etc. hence easing the access control management; ∇6 According to the environment specifics and with regard to ∇5, the access control scheme should be as efficient as possible. Especially, the complexity of the process consisting in verifying that a user’s operation is legitimate should be logarithmic, if not constant time; and ∇7 Since data retrieval cannot be controlled, anyone is allowed to request a data block from the underlying distributed hash table. Therefore, accountability regarding users accessing data seems unachievable. However, users should not be capable of repudiation regarding object modification. CHAPTER 4. DESIGN 56 Note that every access control scheme candidate will be considered unsuitable if violating at least one of these objectives. Besides, whenever such an objective is mentioned through its ∇n symbolic name, the reader will be able to refer himself to the definitions summary located at the bottom of the page. 4.1.2 Model The following discusses the particularites of access control schemes in the given environment. 4.1.2.1 Policy An access control system is one which enables an authority to control the access to resources. In the context of file systems, access control systems enable a user to grant a set of other users access permissions onto an object, being a file, directory, link etc. Access control systems are often categorised as either discretionary or non-discretionary, the most widely recognised models being MAC (Mandatory Access Control) [Cla83], DAC (Discretionary Access Control) [Kar86] and RBAC (Role-Based Access Con- trol) [JB94, Vas08]. MAC is an access control policy determined by the system, through an authoritative entity. Historically, MAC has been designed and used by military organisations processing highly sensitive data. In such systems, subjects and objects are assigned a label so that a user can access a document only if her clearance level is equal or higher than the document’s sensitivity level. DAC is an access policy determined by the object’s owner. Therefore, the user de- cides who has access to the object and what operations they are allowed to perform. Note that unlike MAC, DAC models do not require any authoritative entity. Finally, RBAC is an alternative approach consisting in the definition of various roles matching the multiple organisation’s personnel functions. The permissions to perform certain operations are then assigned to roles. Finally, every member of the personnel is assigned a particular set of roles such that, through those assignments, the user acquires the permissions associated with the roles. The RBAC model simplifies the whole access control management since controlling the access policies consists in assigning roles to individuals. As discussed in Chapter 3, a consequence of the peer-to-peer environment is that no entity has complete control over the whole system. For this reason, both MAC and RBAC models, which require system-wide definitions, cannot be used in this context. Therefore, in order not to violate ∇1, the DAC model must be used. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 57 4.1.2.2 Pattern Systems such as operating systems, file systems, websites etc. are said to make use of active access control because permission is granted at the time an operation is requested or performed. Most systems follow this pattern because all requests are made to a manager which decides whether or not to grant access. Since access control information is centralised, the manager can easily take such a decision. On the other hand, distributed systems tend to dynamically build managers by re- lying on Byzantine agreement protocols. Unfortunately, and as discussed in Chapter 3, since they reduce concurrency [DW01, Bus07], such algorithms are impractical for many applications, especially large-scale distributed file systems. As a consequence, the given environment cannot make use of managers. Access control is therefore said to be passive. The idea behind passive access control is to store access control information along with the object so that any client retrieving an object can verify that the last modification has been performed by a legitimate writer i.e. the writer had the permission at the time, τ , the operation was carried out. If, as described through Section 3.4, the block happens to be illegitimate, it is discarded until a valid instance is found. Thus, users writing an object must attach an atemporal proof such that, at any later time, anyone can verify that the object has been properly constructed according to the permissions in place at τ . By doing so, the system’s symmetry is maintained and ∇1 is respected. Furthermore, and in order to prevent violating ∇7, the proof should enable users to identify the writer. Regarding read operations, since the storage nodes cannot be trusted, the objects’ content should always be encrypted. The access control scheme should therefore enable objects’ owners to distribute the key to authorised readers while respecting ∇2 and ∇4. 4.1.2.3 Class Access control schemes basically fall into one or both of the two following classes: token-based and record-based. This section takes both of the access control classes and shows that no scheme can achieve the required properties in the given environ- ment. An implication of ∇5 is that permissions must be flexibly manageable through hi- erachical groups, giving the user a tool for organising users very much as a tree-like file system view enables users to organise their files. As mentioned in the previous section, a passive access control model implies that users must attach to the object a proof showing evidence of the legitimacy of the ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 58 operation. In more practical terms and with consideration of group hierarchies, every proof involving one or more of those groups will be composed of sub-proofs, each providing evidence of group membership at τ , the whole forming a chain of proofs. Token-Based In token-based access control schemes, objects’ owners distribute unforgeable tokens to clients, granting them the permission to perform operations, while nothing is kept on the manager’s side except what is strictly necessary to verify tokens’ validity. Certification and Capabilities, for instance, fall into this category. In active access control models, clients pass their token to the manager. If the chain of tokens is valid, the requested operation is accepted. In a passive scheme, the user attaches a chain of tokens to the object to be modified so that nodes retrieving the object can verify that the writer provided a proof of her legitimate action. Note that since everybody must be able to verify the tokens’ validity, such tokens must be protected from public disclosure, for example by securely identifying its holder. Certification schemes, for instance, include the user’s identity in a digital signature for ensuring this property. Additionally, such a user identification complies with ∇7. A problem arises when it comes to verifying a proof. Indeed, to verify that a user had the permission—the tokens had neither expired nor been invalidated—at τ , the object must carry time-related information such as the time the object was updated. Unfortunately, even assuming that the system benefits from a globally synchronised clock, neither the storage servers nor the users can be trusted to provide a correct time. Indeed, malicious clients and servers could go back in time and claim a date that makes past tokens still valid. The solution would be to either rely on a centralised and trusted time server for digitally timestamping every update or to make use of consensus algorithms, both violating ∇1 and ∇6. Record-Based In record-based access control models, a subject’s access depends on whether her identity is located in the records associated with the object. In active models, the manager keeps the records and performs the verification for every received request while, in passive models, an attached proof must provide evidence that, at τ , the user’s identity could be located in the records of the groups she claimed to have been a member of. ACL (Access Control List)s, for instance, fall into this class of access control. Unlike token-based access control models, access information is recorded in blocks, ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 59 along with the other object’s metadata. ∇5 and ∇3 imply that independent entities such as groups also record their access information in blocks. Besides, objects and groups, in order to be accessible through the same address α at all times, must rely on mutable blocks because the address of such blocks remains the same while their content evolves. Proving that a subject had the permission to perform the operation at τ comes down to proving there existed a link between the object and the subject, perhaps indirectly through several groups’ memberships. Therefore, a client updating an object must attach a snapshot of the chain of groups, hence proving the existence, at τ , of a path from the object to the subject. Unfortunately, since groups evolve over time, a group’s block exists in different versions. Therefore, nothing could prevent a malicious user from using a past group’s snapshot, at a time when she was a valid member. Thus, as nodes could not be trusted to provide a valid timestamp in token-based models, servers and clients, once again, cannot be trusted to include, in the chain of proofs, the proper latest version of the groups’ snapshot at τ . A malicious user could therefore go back in time by providing past versions of groups’ snapshots, granting herself the permission to perform the operation. 4.1.3 Constraints The previous section showed that any passive access control scheme violates the fundamental symmetry property. However, one should notice that, by loosening constraints, it becomes possible to design such a scheme. For instance, Plutus [KRS+03] makes use of a token-based access control scheme where an object’s owner distributes a key to the writers and the complement key to the readers. Since accountability is not a requirement, users are free to re-distribute the keys to whoever they wish. Therefore, Plutus requires users’ connectivity for passing keys. Besides, since keys are freely distributed, nobody can consult the currently granted permissions. The requirements of this work are therefore extremely strong compared to Plutus’ and as a result, it has been shown impossible to achieve them all. There is therefore no choice but to loosen the constraints in order to provide access control to peer-to-peer applications. Although ∇3 and ∇7 might seem questionable in terms of usefulness, especially in large-scale networks, connectivity is the envi- ronment property that the author believes is usually misinterpreted for the reasons exposed below. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 60 ˆ Research regarding churn rates in peer-to-peer networks has been performed on peer-to-peer file sharing applications such as Bittorrent [Coh03] because these are the only deployed large-scale applications that can be used to gather such information. However, the application itself has an impact on the node churn since, for example, users tend to stop sharing a file that has been downloaded to avoid wasting their upload bandwidth. Therefore, the author claims that different well-integrated systems such as file systems, instead of file sharing applications, would decrease the churn rate especially if users have incentives in contributing to the system’s connectivity; ˆ Connectivity, very much like bandwidth, should increase as it has been the case since the advent of computer networks. On the other hand, new devices such as mobile phones, netbooks along with new user behaviours must be taken into consideration. One should note that although the number of such mobile devices is increasing extremely rapidly, most users possess multiple computers including a desktop computer at home and/or at work, a laptop, a mobile phone etc. Therefore, although the increasing mobility of computing devices implies nodes frequently joining and leaving the network, the user behind these devices is likely to be connected at all times through one or more of those devices; and ˆ Finally, although the probability of a specific user being connected might not be as high as expected because of the high churn rate measured in peer-to- peer networks, the probability of having at least one member in a set of users connected to the network should be higher, depending on the set cardinality. Therefore, loosening the non-connectivity requirement will enable users, as in Plutus, to retrieve information from other users. 4.1.4 Concept The following provides insights into the passive ACL (Access Control List)-based— i.e. DAC and record-based—access control scheme described throughout this sec- tion. Note that the record class has been chosen in order to prevent users from having to store access information such as certificates or keys as both Plutus [KRS+03] and Pastis [mBPS05] suffer from this characteristic. The idea behind the presented access control scheme is to distinguish users according to their access relation to the target object. First, the users who have been granted access to the object directly by the owner are referred to as the lords. These users ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 61 play a particularly important role in the access control scheme because their access permissions can only be modified by the object owner. On the other hand, users who have been granted access to the object indirectly through one or more group memberships are referred to as the vassals. The object owner has no direct control over these users since the access management has been delegated to the respective group managers. Therefore, vassals could join or leave groups that have been granted permissions on the object without the object owner even knowing. The access control scheme’s fundamental concept is to let lords access the object without additional constraints, assuming that they have been granted the appro- priate permissions. However, the vassals are never given the key for decrypting the data, neither can they update the object directly. Rather, the idea is to rely on lords to vouch for the vassals by verifying that the requesting vassal has the permission to perform the operation. Assuming that the vassal does have the proper rights, the lord generates a certificate stating that, at τ , the vassal had been indirectly granted the permission to update the object. Likewise, a vassal wishing to read an object would need to contact a lord which would verify the vassal’s permission before pass- ing her the key for decrypting the data. Quite obviously, the object’s owner as well as the group managers could also act as lords, hence vouch for vassals. Noteworthy is that, although the users’ connectivity is assumed to be higher than previously stated, only users accessing objects indirectly will need to contact other more privileged users. The lords’ connectivity is therefore absolutely crucial to the system. Thus, object owners should make sure to grant access to several lords in order to ensure that the number of connected lords is sufficient to enable legitimate vassals to operate on the object. Should the number of such lords be insufficient, the application could warn the user for instance. Finally, the access control scheme introduced hereby requires users to exchange information with one another. Therefore, as the overlay network enables nodes to route a message to the home node responsible for a given identifier, users now require the overlay network to provide a routine for locating a particular user. The rest of this document thus assumes that the underlying overlay network provides a Locate() method, which, given a set Ξ of user identities, returns the identity of a currently connected user. An easy way to provide such a functionality would be for the application to automatically set the user’s, potentially multiple, IP (Internet Protocol) address in its associated User logical block, described next, such that given an identifier, one can easily contact the user by sending a message to her node. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 62 4.1.5 Scheme The access control scheme detailed in this section enables users to protect their ob- jects against unauthorised read and write operations by granting permission directly to specific users and/or indirectly by delegating access control to third parties i.e. groups. The following introduces the data block representations necessary to the access con- trol mechanism, from physical blocks to logical blocks such as file system objects, users, groups and so forth. Every block representation is illustrated by a figure along with the three procedures below: ˆ The Setup(β) −→ α method is invoked whenever a block is built and returns the address of the freshly initialised block; ˆ The Seal(α, β) routine is called whenever the block has been modified and requires to be sealed, before being stored in the underlying distributed hash table for instance; and ˆ Finally, the Validate(α, β) procedure verifies that the given block is valid. This method is never explicitly invoked but rather used internally by the Get(α) −→ β and Gather(α) −→ β routines. 4.1.5.1 Physical Blocks As in CFS [DKK+01], Pastis [mBPS05], SFS [MKKW99], OceanStore [KBC+00] and many other projects, the physical blocks are distinguished according to their immutability. Content Hash Block CHB (Content Hash Block)s are immutable blocks whose address is computed by applying a one-way function on the data. Thus, assuming the block is modified, a new content is implicitly created, hence generating a new address. Figure 4.1 illustrates such a CHB. CHBs are extremely interesting in terms of performance, as mentioned in Section 3.4. Indeed, a client wishing to access a CHB that is present in cache would not need to initiate network communication as it would be formally ensured of the block’s validity. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 63 CHB [...] α Figure 4.1: The representation of a CHB Algorithms 1, 2 and 3 detail the set-up, seal and validation processes of CHBs, respectively. Note that the function h() denotes a one-way function such as SHA (Secure Hash Algorithm) for instance. 1. α← h(β) 2. return α Algorithm 1: Setup CHB (β) −→ α nothing to do as implicitly sealed Algorithm 2: Seal CHB (α, β) 1. if α 6= h(β) then 2. error “the address does not match the block” 3. end if Algorithm 3: Validate CHB (α, β) Public Key Block Unlike CHBs, PKB (Public Key Block)s are associated with a cryptographic key pair such that the address of such blocks is computed by applying a one-way function on the PKB ’s public key. Since this key does not change over time, PKBs are used as mutable blocks. In order to distinguish a block’s multiple versions, a version number is embedded. Besides, a cryptographic signature ensures integrity and authenticity, hence preventing anyone but the PKB ’s owner from updating the block. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 64 Figure 4.2 details the PKB internals. The reader can notice that the block’s pub- lic key is embedded, along with a version number and a signature which ensures integrity and authenticity while respecting ∇7. As shown through Algorithms 4, 5 and 6, everything necessary to the block’s verification process is included within the block, hence ensuring symmetry. #1 PKB signature version data [...] α Kblock Figure 4.2: The representation of a PKB The reader should carefully consider the notation used throughout this document. While the K and k symbols represent the public and private keys, respectively, δκ˜ is equivalent to δx mod n for κ = (x, n) such that: ( δK˜ )k˜ = ( δk˜ )K˜ = δ (4.1) Therefore, δK˜ designates an encryption or signature verification while δk˜ expresses a signature or decryption. In addition, the reader should notice the presence of grouped attributes designated by ]x with x a unique number on a given figure. This grouping functionality is used to simplify the algorithms presented throughout this document. For example, while every field is represented on Figure 4.2, Algorithm 6 uses the β.]1 notation which is equivalent to β.[...]|β.version|β.signature with | the concatenation operator. 1. (Kblock, kblock)← generate cryptographic key pair 2. β.Kblock ← Kblock 3. β.data.version← 0 4. α← h(β.Kblock) 5. return α Algorithm 4: Setup PKB (β) −→ α ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 65 Require: (Kblock, kblock), the block’s randomly generated key pair 1. β.data.signature← h(β.]1)k˜block Algorithm 5: Seal PKB (α, β) 1. if α 6= h(β.Kblock) then 2. error “the address does not match the block” 3. end if 4. if β.data.signature ˜β.Kblock 6= h(β.]1) then 5. error “the data signature is invalid” 6. end if Algorithm 6: Validate PKB (α, β) As detailed in Algorithm 6, the first step of the validation process verifies that the internal public key β.Kblock is related to the block by applying the one-way function on the public key and comparing the result with the address. Once the public key is known to be valid, the signature can be verified in the second step ensuring the block’s integrity and authenticity. From this point on, the block is known to be valid and can therefore safely be used. Owner Key Block As described previously, PKBs enable users to make use of mutable blocks. However, by relying on such blocks, users will end up keeping as many key pairs as they have created blocks. Unfortunately, ∇4 stipulates that the access control scheme should not require users to store an increasing amount of access information. OKB owner #1 data signature #2 version signature [...] α Kblock K owner Figure 4.3: The representation of an OKB ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 66 For this reason, the OKB (Owner Key Block) has been introduced, following the original idea from Pastis [mBPS05]. Indeed, assuming that every user possesses a single identity key pair, OKBs enable users to create blocks without having to keep any access information. There are two differences between OKBs and PKBs. Firstly, the owner’s public key is recorded in the block and signed with the block’s private key. Secondly, the data are no longer signed with the block’s private key but with the owner’s private key. Thus, since operations are now performed with the owner’s key pair, the block’s key pair is no longer necessary and can therefore be discarded. Figure 4.3 depicts the OKB internal organisation which shows the inclusion of the block’s public key, followed by the owner’s public key which is then signed with the block’s private key. In addition, Algorithms 7, 8 and 9 illustrate OKBs’ set-up, seal and validation processes, respectively. Require: (Kuser, kuser), the user’s personal key pair 1. (Kblock, kblock)← generate cryptographic key pair 2. β.Kblock ← Kblock 3. β.owner.Kowner ← Kuser 4. β.owner.signature← h(β.]1)k˜block 5. β.data.version← 0 6. α← h(β.Kblock) 7. return α Algorithm 7: Setup OKB (β) −→ α Require: (Kuser, kuser), the user’s personal key pair 1. β.data.signature← h(β.]2) ˜kuser Algorithm 8: Seal OKB (α, β) 1. if α 6= h(β.Kblock) then 2. error “the address does not match the block” 3. end if 4. if β.owner.signature ˜β.Kblock 6= h(β.]1) then 5. error “the owner signature is invalid” 6. end if 7. if β.data.signature ˜β.owner.Kowner 6= h(β.]2) then 8. error “the data signature is invalid” 9. end if Algorithm 9: Validate OKB (α, β) ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 67 One may have noticed that the data signature is applied on β.Kblock—i.e. β.]2 includes β.Kblock—though this was not the case for PKBs. The inclusion of β.Kblock in the data signature is necessary to prevent Injection Attacks. Indeed, considering two blocks β1 and β2 created by the same owner, a malicious user could copy the data section from β1 and inject it into β2. This operation would be viewed as a perfectly valid update1 performed by the owner. Therefore, in order to prevent this kind of attack, the block’s public key β.Kblock is included in the data signature, ensuring that the data section is linked to this block, hence cannot be injected in another OKB. 4.1.5.2 Logical Blocks The following presents the logical blocks which introduce concepts such as users, groups etc. built on top of the physical blocks detailed above. User Because OKBs require the notion of user but also because access permissions will eventually be associated with users, this section introduces the User block. A User block represents a user entity in the storage system and contains information such as the user name, her email address etc. Interestingly, although based on a PKB, the User block does not require a crypto- graphic key pair to be generated. Indeed, as mentioned earlier, every user is assumed to possess a unique key pair. Therefore, instead of generating a random key pair for the block to become mutable, the user’s personal key pair is used. name email PKB #1 User α signature version [...] K user Figure 4.4: The representation of a PKB -based User block 1... assuming β1.data.version > β2.data.version ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 68 This characteristic is particularly interesting because, as described before, the ad- dress of a PKB is computed by applying a one-way function on the block’s public key, which happens to be, in this very specific case, the user’s public key. Thus, any- one given the user’s public key can compute the address of the User logical block, which can then be retrieved in order to get additional information on the user. Figure 4.4 depicts the User logical block internals while Algorithms 10, 11 and 12 detail the set-up, seal and validation processes, respectively, though almost identical to OKB ’s. Require: (Kuser, kuser), the user’s personal key pair 1. β.Kuser ← Kuser 2. β.version← 0 3. α← h(β.Kuser) 4. return α Algorithm 10: Setup User PKB (β) −→ α Require: (Kuser, kuser), the user’s personal key pair 1. β.signature← h(β.]1) ˜kuser Algorithm 11: Seal User PKB (α, β) 1. if α 6= h(β.Kuser) then 2. error “the address does not match the block” 3. end if 4. if β.signature ˜β.Kuser 6= h(β.]1) then 5. error “the signature is invalid” 6. end if Algorithm 12: Validate User PKB (α, β) Group A group represents a collection of users and/or sub-groups, administered by a single user, the group’s owner, often referred to as the group manager. In order to fulfill the ∇6 requirement, the system isolates the group metadata from the members’ listing. While the group metadata are recorded in a mutable Group logical block, the actual ACL (Access Control List) of members is recorded in an immutable Members block. This separation has been introduced to minimise the size ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 69 of the Group block. Indeed, as detailed through Section 3.4, retrieving a mutable block such as a PKB or an OKB requires a quorum of storage nodes to be contacted. Since 2γ + 1 instances of such a mutable block will be transferred back to the requesting client, reducing the size of mutable blocks would drastically increase the system’s overall performance. On the other hand, since immutable blocks such as CHBs can benefit from caching techniques and, in the worst case scenario, are transferred only once to the client, these blocks can embed far more information without damaging the system’s performance. The Group logical block is therefore based on an OKB physical block, being mod- ifiable by the group manager, its creator, only. On the other hand, the Members logical block is based on a CHB. Thus, whenever the list of members is modified, a separate Members block is created requiring the Group block to be updated in order to reference the new Members block. OKB Group owner signature data description α members α version signature #1 #2 α group Members CHB permissions permissions K user K permissions user [...] owner K K name block Figure 4.5: The representation of an OKB -based Group block Figure 4.5 details the Group and Members logical blocks and their relation. The reader will notice that the Members block contains a list of either user or group entries. While users are identified by their personal public key, groups are referenced by the address of the associated Group logical block. Noteworthy is that permissions are associated to members such that, to be allowed to read an object for instance, a user must identify herself through a chain of group ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 70 memberships in which, every group in the chain must have been granted the read permission, including her own user membership. This enables fine-grained group memberships management since a group owner can easily include a sub-group with- out granting too much permission. The set-up, seal and validation processes are not detailed for the Group and Members logical blocks since identical to the OKB ’s and CHB ’s. Object An Object is a protective layer built above actual data blocks which enables users to control access both in reading and writing. Therefore, as its name indicates, it can be used, in the context of file systems, to represent file system objects such as files, directories etc. An object is linked with a set of permissions granted to specific users and/or groups. In order to optimise the most common case in which an object is only accessible to its owner, the owner’s permissions are directly recorded in the Object block while prospective users and groups are listed in a separate and optional block, the Access logical block. Note that, as for the Group block, this specific arrangement reduces the size of the Object block and therefore optimises the communication costs. Figure 4.6 illustrates this optimisation, especially in the β.meta section where the owner’s permissions and the address of the Access block are recorded. As mentioned in Section 4.1.4, access beneficiaries are classified as either lords or vassals. Since the fundamental idea behind the access control scheme is not to constrain lords, these more privileged users are given the key for decrypting the data, assuming they have been granted read permission. Such a key comes in the form of a token which consists in the key being encrypted with the lord’s public key. On the other hand, vassals are not given the key and it is their responsibility to contact a lord whenever they wish to read the data. Once the legitimacy of the operation has been verified by a lord, the key is handed to the vassal so it can decrypt the data. Regarding write operations, the user updating the Object must attach a proof show- ing evidence of the action’s legitimacy. Since lords have been directly granted the permission by the object’s owner, providing such a proof comes down to specifying the location, in the Access block, of their user entry. Therefore, anyone retrieving the block can verify that the signature has been issued by the private key associated with the public key recorded in the given user entry but also that the permissions associated with this user include the right to update the object. The process is however a bit more complicated when it comes to vassals. Indeed, these users must request a lord to validate their action. The lord then issues a vouching certificate which can be attached to the Object block, hence proving the rightfulness of the ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 71 modification. Figure 4.6 illustrates these particularities: β.author.lord contains the index of the user entry in the Access block while β.author.voucher represents the certificate issued by a lord vouching for the user, certificate which includes the public key of the vassal having updated the object. Note that, should the owner update the block, no proof would be provided i.e. β.author =⊥. Likewise, the proof provided by a lord does not include a voucher, hence β.author.voucher =⊥. According to ∇2, a user losing the read permission should no longer be able to read the data. Therefore, the data blocks should be re-encrypted with a new key which would be distributed to the readers. However, since this process is very expensive, most distributed systems delay the re-encryption until the data is modified, a process referred to as Lazy Revocation [Fu99]. Unfortunately, in the presented system, a group manager having been granted the read permission on the object could decide to evict a user. As such, users could lose the read permission at any time without the object’s owner being aware of such events. Since there is no way for the owner to know when the data must be re-encrypted, a new key is generated every time the object is updated. Note however that this does not mean that all the data blocks are re-encrypted. Indeed, only the modified and new data blocks are encrypted with the new key. In addition, a Contents block contains the list of all the data blocks along with the key used for encrypting every one of those blocks. Finally, the Contents block is encrypted with the key having been generated for the last writing, the one which is distributed to authorised lords. Since the owner’s connectivity cannot be guaranteed whenever the object is modified, the new key must be generated by the writer. The writer thus generates the tokens based on the new key and distributes them to the read lords. Note that a malicious writer could perform an attack by distributing different—valid or invalid—keys to different lords such that a user retrieving the key would not be able to know which one of the data or the key is incorrect. In order to reduce the risks of such an attack, the writer also attaches a fingerprint of the key. Therefore, readers retrieving a key mismatching the fingerprint would know that the writer is malevolent and complaints regarding this user could then be made to the object’s owner. This specificity is illustrated in Figure 4.6, especially in the β.data section. The fact that the writer, often referred to as the author, updates the lords’ tokens implies that she must be allowed to modify the Access block. However, this block also embeds the lords’ identities and permissions, information that only the owner should be authorised to modify. In order to guarantee that the author can re-generate the lords’ tokens, the address of the Access block is left under the control of the writer, hence is included in the signature β.data.signature. However, in order to prevent the writer from modifying permissions, the sensitive data are also included in the signature issued by the owner i.e. β.meta.signature. Therefore, a malicious user trying to illegally modify metadata would inevitably render the block invalid by ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 72 violating the owner’s meta signature. OKB owner signature #1 Object α author voucher signature fingerprint version signature meta owner permissions α access version signature data #2 #4 #3 lord vassalK token key blockα CHB α block key [...] CHB Access α permissions token K user permissions token #5 group [...] K K block owner Contents contentsα Figure 4.6: The representation of an OKB -based Object block Algorithms 13 and 16 detail the set-up and validation processes, respectively. Algo- rithm 14 describes the sealing steps for the Object ’s data section, while Algorithm 15 describes those for the meta section. CHAPTER 4. DESIGN 73 Require: (Kuser, kuser), the user’s personal key pair 1. (Kblock, kblock)← generate cryptographic key pair 2. β.Kblock ← Kblock 3. β.owner.Kowner ← Kuser 4. β.owner.signature← h(β.]1)k˜block 5. β.data.version← 0 6. β.meta.owner.permissions← { read, write } 7. β.meta.version← 0 8. α← h(β.Kblock) 9. return α Algorithm 13: SetupObject OKB (β) −→ α Require: (Kuser, kuser), the user’s personal key pair 1. β.data.signature← h(β.]2) ˜kuser Algorithm 14: SealObject[data] OKB (α, β) Require: (Kuser, kuser), the user’s personal key pair 1. if β.meta.αaccess =⊥ then 2. ~← h(β.]3) 3. else 4. χ← Get(β.meta.αaccess) 5. ~← h(β.]3|χ.]5) – the | operator designates concatenation 6. end if 7. β.meta.signature← ~ ˜kuser Algorithm 15: SealObject[meta] OKB (α, β) CHAPTER 4. DESIGN 74 1. if α 6= h(β.Kblock) then 2. error “the address does not match the block” 3. end if 4. if β.owner.signature ˜β.Kblock 6= h(β.]1) then 5. error “the owner signature is invalid” 6. end if 7. if β.meta.αaccess =⊥ then 8. ~← h(β.]3) 9. else 10. χ← Get(β.meta.αaccess) 11. ~← h(β.]3|χ.]5) 12. end if 13. if β.meta.signature ˜β.owner.Kowner 6= ~ then 14. error “the meta signature is invalid” 15. end if 16. if β.author =⊥ then 17. if write /∈ β.meta.owner.permissions then 18. error “the owner does not have the permission” 19. end if 20. if β.data.signature ˜β.owner.Kowner 6= h(β.]2) then 21. error “the data signature is invalid” 22. end if 23. else 24. χ← Get(β.meta.αaccess) 25. if write /∈ χ[β.author.lord].permissions then 26. error “the lord does not have the permission” 27. end if 28. Klord ← χ[β.author.lord].Kuser 29. if β.author.voucher =⊥ then 30. Kauthor ← Klord 31. else 32. if β.author.voucher.signatureK˜lord 6= h(β.]4) then 33. error “the voucher signature is invalid” 34. end if 35. Kauthor ← β.author.voucher.Kvassal 36. end if 37. if β.data.signatureK˜author 6= h(β.]2) then 38. error “the data signature is invalid” 39. end if 40. end if Algorithm 16: ValidateObject OKB (α, β) CHAPTER 4. DESIGN 75 The validation process detailed through Algorithm 16 starts, as for OKBs, by verify- ing that the block corresponds to its associated address. Then, the signature in the owner section is verified proving the owner’s public key valid. Thus, the meta sig- nature can be verified guaranteeing that the identities and permissions of the lords have not been altered. Then, depending on the nature of the author—being the owner, a lord or a vassal—, the data signature is verified, leading to the assurance that the tokens are valid along with the encrypted Contents block. Let us consider the following scenario: the owner decides to modify the access control records by removing the write permission from a lord. Unfortunately, this lord happens to be the user having performed the latest modification on the object. As such, the author field references his record in the Access block while the data section signature has been computed with his private key and his therefore verified with his public key. Assume that another client later retrieves the object and starts verifying its validity. The verification process would detect that the author does not have write permission and would inevitably conclude that the object was forged, probably by a malicious user. Although the object is now considered invalid by everyone, it was, at the time, legally constructed. In order to overcome this security issue, should the owner remove the write permission from the user who signed the object’s data section, the owner would generate a voucher stating that the user’s action was legitimate at the time, no matter what his current permissions in the Access block are. Finally, it is important to note that lords’ behaviour cannot be guaranteed. Indeed, a malicious lord could, for instance, distribute invalid keys or could even refuse to vouch for a valid vassal. In such a context, a vassal would have no choice but to contact another lord until an honest one is found. 4.1.6 Algorithms This section provides the reader with a detailed understanding of the algorithms related to three fundamental operations applied to a file system object built upon the blocks described above. ˆ Govern(α, ψ) ˆ Read(α, λ) −→ δ ˆ Write(α, λ, δ) CHAPTER 4. DESIGN 76 Govern The Govern(α, ψ) routine enables the object’s owner to apply a set of modifications ψ on the meta-data, including the access permissions. This routine starts by retrieving the Object block. Then, the meta-data modifica- tions are applied depending on the presence of the Access block which may need to be fetched. Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. if Kuser 6= β.owner.Kowner then 3. error “the owner is the only user authorised to govern the object” 4. end if 5. β.meta.version← β.meta.version+ 1 6. apply the set of modifications ψ 7. for all (, ζ) such that ζ is a new immutable block do 8. Seal CHB (, ζ) 9. Put(, ζ) 10. end for 11. SealObject[meta] OKB (α, β) 12. Put(α, β) Algorithm 17: Govern(α, ψ) Read The Read(α, λ) routine takes the address of an Object block α along with the location λ of the data to read from this object. The routine returns the data δ. The algorithm starts by retrieving the Object block by calling the Gather() method. Then, the key κ used for encrypting the data must be retrieved. This process depends on the nature of the user. Indeed, if the user happens to be the object’s owner for instance, the key can be extracted by decrypting the token β.meta.owner.token. A lord, on the other hand, would have to locate its entry in the Access block and proceed to the decryption of the token. Finally, a vassal would need to request a lord to validate the rightfulness of this action before providing the vassal the key. Once the key κ is retrieved, its fingerprint is verified against the one provided by the author i.e. β.data.fingerprint. Then, the Contents block can be fetched and decrypted with κ after which the multiple data blocks are accessible. From this point, the routine can read the data identified by λ. CHAPTER 4. DESIGN 77 Noteworthy is that an Access user entry lacking the read permission does not nec- essarily mean that the user cannot access the data. Indeed, the user could still be granted the read permission through one or more group memberships, in which case, through a lord’s approval, the user could retrieve the encryption key. Nonetheless and for the sake of clarity, this possibility is ignored in Algorithm 18. Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. if Kuser = β.owner.Kowner then 3. if read /∈ β.meta.owner.permissions then 4. error “the owner does not have the read permission” 5. end if 6. κ← β.meta.owner.token ˜kuser 7. else 8. χ← Get(β.meta.αaccess) 9. if ∃ι : Kuser = χ[ι].Kuser ∀ι ∈ { 1, ..., |χ| } then 10. if read /∈ χ[ι].permissions then 11. error “the lord does not have the read permission” 12. end if 13. κ← χ[ι].token ˜kuser 14. else 15. Ξ← { χ[ι].Kuser : read ∈ χ[ι].permissions ∀ι ∈ { 1, ..., |χ| }} 16. ν ← Locate(Ξ) 17. κ← request lord ν for the key by sending the message 〈read, α,Kuser〉 18. end if 19. end if 20. if h(κ) 6= β.data.fingerprint then 21. error “the key does not match the fingerprint” 22. end if 23. σ ← Get(β.data.αcontents) 24. ξ ← decrypt σ with the key κ 25. δ ← read data from ξ at location λ 26. return δ Algorithm 18: Read(α, λ) −→ δ Write The Write(α, λ, δ) routine takes the address of an object along with some data δ and the location λ of the region of the object’s data that should be overwritten. CHAPTER 4. DESIGN 78 The algorithm starts by retrieving the Object block. Then, depending on the user’s relation to the object, being the owner, a lord or a vassal, a proof is constructed and attached to the object i.e. β.author. For instance, a lord constructs a proof by specifying the index of its user entry in the Access block while a vassal must contact a lord willing to vouch for her. Once the proof is complete, the user can update the data at λ and encrypt it with a freshly generated key κ, which is then distributed to the read lords i.e. the lords with the read permission. Finally, the Object is sealed and stored in the distributed hash table along with any additional data block. Note that, although not depicted in Algorithm 19, an author without the read per- mission would have no choice but to overwrite the existing Contents block while one with the permission would be able to modify specific portions of the existing data. CHAPTER 4. DESIGN 79 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. χ← Get(β.meta.αaccess) 3. if Kuser = β.owner.Kowner then 4. if write /∈ β.meta.owner.permissions then 5. error “the owner does not have the write permission” 6. end if 7. β.author ←⊥ 8. else 9. if ∃ι : Kuser = χ[ι].Kuser ∀ι ∈ { 1, ..., |χ| } then 10. if write /∈ χ[ι].permissions then 11. error “the lord does not have the write permission” 12. end if 13. β.author.lord← ι 14. β.author.voucher ←⊥ 15. else 16. Ξ← { χ[ι].Kuser : write ∈ χ[ι].permissions ∀ι ∈ { 1, ..., |χ| }} 17. ν ← Locate(Ξ) 18. (ι, ϕ)← request lord ν for a voucher through 〈write, α,Kuser, δ〉 19. β.author.lord← ι 20. β.author.voucher ← ϕ 21. end if 22. end if 23. β.data.version← β.data.version+ 1 24. κ← generate cryptographic symmetric key 25. ξ ← write data δ at location λ 26. σ ← encrypt ξ with the key κ 27. β.data.αcontents ← Setup CHB (σ) 28. Seal CHB (β.data.αcontents, σ) 29. Put(β.data.αcontents, σ) 30. for all ι ∈ { 1, ..., |χ| } : read ∈ χ[ι].permissions do 31. χ[ι].token← κ ˜χ[ι].Kuser 32. end for 33. β.data.fingerprint← h(κ) 34. for all (, ζ) such that ζ is a new immutable block do 35. Seal CHB (, ζ) 36. Put(, ζ) 37. end for 38. SealObject[data] OKB (α, β) 39. Put(α, β) Algorithm 19: Write(α, λ, δ) CHAPTER 4. DESIGN 80 4.1.7 Analysis Section 4.1.2 claimed unfeasible any access control scheme complying with the de- fined objectives. As shown, any class of access control required some party to pro- vide temporal or contextual information. Unfortunately, since the environment is assumed to be untrustworthy, neither the clients nor the servers can be trusted to provide such a decisive element of proof. The object’s owner, being the only author- itative entity, cannot be relied upon either as no guarantee can be made regarding her connectivity to the system. As a result, the connectivity constraint had to be loosened in order to render the environment suitable for the design of a flexible access control scheme. As mentioned earlier, most centralised systems rely on a single manager through which every request goes. Such managers thus can control the legitimacy of every client action. Unfortunately, distributed systems cannot rely on a single centralised entity and therefore tend to build and maintain managers in a dynamic way through the use of consensus protocols. However, since known to be expensive, these proto- cols have been ignored as not considered suitable candidates for an access control system in the given large-scale environment. The presented access control scheme is innovative in the sense that it makes use of managers without achieving consensus. Indeed, the idea behind the described system is to rely on some specific users to act as managers. Since these users do not communicate, synchronise or achieve consensus, the access control scheme can fairly be claimed to be decentralised, as a client only requires to contact a single manager. Although neither storage nodes nor clients can be trusted, lords do benefit from a special status. Since lords have certain rights over the object, such as the permission to read and/or write its data, they can be considered as being partially trusted. It is equally fair to assume that lords also have the power to act maliciously, by erasing data for instance or by handing the decryption key to an unauthorised user. The concept lying behind the designed access control scheme is therefore to let lords act as managers, by validating requests related to the permissions they have been personally granted on the object. The most important aspect of this scheme is that it does not weaken the system’s security because a malicious write lord already had the power to abuse the system by writing data on behalf of another unauthorised user. Likewise, a malicious read lord could have distributed the key to anyone without the system or the object’s owner ever noticing. The access control scheme detailed throughout this section therefore enables users to protect their objects from ill-disposed users and Byzantine storage servers. Fur- thermore, the proposed model is flexible and expressive enough so that users can organise their friends, family, acquaintances and so forth, hence easing the access control management. Besides, the design complies with the very strong requirements CHAPTER 4. DESIGN 81 defined in Section 4.1.1. Thus, every user is required to hold a single personal crypto- graphic key pair in order to operate on the storage system; accountability is ensured regarding object updates; every access modification issued by the object’s owner is taking effect as soon as a write quorum of nodes has agreed on storing the updated object’s blocks, as detailed in Section 3.4, while every user is able to consult the current permissions associated with any object. Finally, the design complies with the specific environment’s properties defined in Chapter 3. Noteworthy is that since the access control scheme makes use of managers, it benefits from most of the advantages of active models. Indeed, as mentioned previously, a vassal wishing to, say, read an object would have to attach a proof showing evidence she had the read permission but also that every group of the chain she claimed to be a member of had been granted the read permission as well. This specificity provides objects’ owners and group managers with fine-grained access control functionalities. The scheme could also be embellished through the addition of black lists for instance. Such a functionality would provide objects’ owners and group managers additional flexibility by specifically listing users and/or groups that should be excluded from the set of granted subjects. Therefore, a group manager could, for example, grant read access to a sub-group but also specify that a subset of this sub-group’s members are not to be considered by this authorisation. Note however that these additional functionalities would increase the complexity of the process consisting for a lord to verify that a vassal’s claim is legitimate as the black list of every group included in the claim would have to be checked. Although the presented scheme exhibits many interesting properties, the design de- cisions especially regarding the connectivity requirement’s loosening imply a certain number of trade-offs. Firstly, the access control scheme requires many cryptographic signatures to be issued and verified, which in turn impacts the performance. Indeed, while a single signature needed to be verified in Plutus [KRS+03], our design requires objects to embed three cryptographic signatures. Secondly, clients wishing to op- erate on an object through group memberships must explore the group hierarchy until a chain of memberships is found granting the user the sought for permission. Although techniques such as caching can be used to improve this process, it nonethe- less represents an expensive task. Thirdly, the access control design largely relies on the assumption that, given a set of users Ξ, at least one non-Byzantine member of Ξ is connected at all times. Indeed, considering an object and its associated lords and assuming that, at a precise time, none of those lords are connected, vassals would become unable to perform legitimate operations. The following details the probability of such an event occurring. The probability PO of a vassal being able to perform a legitimate operation is equal to the probability of at least one non-Byzantine lord being connected among the ρ lords associated with the object; the probability of a lord being disconnected is given CHAPTER 4. DESIGN 82 by PD. Note however that the following assumes the lords to act honestly. Indeed, one should carefully distinguish the percentage of Byzantine nodes in a peer-to-peer network from the percentage of malicious users belonging to a small set of privileged users specifically chosen by the object’s owner. While Byzantine nodes misbehave in order to perform unauthorised operations such as making a document diseappear for instance, malicious lords have already been granted permission on the object, hence cannot gain anything more. Equation 4.2 provides a formula for computing the number ρ of lords required to achieve the desired probability PO depending on the average lord disconnectivity PD. ρ = logPD (1− PO) (4.2) As illustrated by Figure 4.7, a probability PO = 0.99 can be achieved by relying on five lords, assuming that users are connected to the network 60% of the time i.e. have a probability of being disconnected PD = 0.4. These figures indicate that the presented access control can realistically perform in a production environment. Indeed, as discussed in Section 4.1.3, while nodes’ connectivity can be assumed to be fairly low due to churn, this access control scheme relies on users’ connectivity which is inherently higher. 0 1 2 3 4 5 6 7 8 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 ρ PO PO vs ρ PD = 0.1PD = 0.2PD = 0.3PD = 0.4PD = 0.5 Figure 4.7: A graph showing the relations between PO and ρ In addition, a certain number of optimisations can be considered which are described next. Firstly, the connectivity of the group managers can be assumed to be higher CHAPTER 4. DESIGN 83 than normal users. Indeed, such users create groups because they want to contribute to the system in more ways than most users. Since some of these group managers also act as lords, access control can benefit from this characteristic. Secondly, the key used for encrypting the Contents block could be distributed to vassals in a proactive manner, very much like Plutus [KRS+03]. Likewise, lords could try, on a periodic basis, to distribute the key to authorised vassals which were disconnected during the previous attempts. These techniques would increase the probability of a vassal receiving the key, one way or another. Thirdly, vassals having retrieved the key could act as read manager for other vassals. Therefore, a vassal incapable of making contact with a lord could request the key from another vassal. Note that this optimisation is even more interesting when coupled with the proactive distribution while not weakening the overall system’s security. Although this optimisation would require users to store a potential large number of keys, such keys could be stored within the system, in a dedicated encrypted block. Nevertheless, vassals would still need to request a lord in order to update the object. However, while delaying read operations, because of lords’ unavailability, may degrade the user’s experience, delaying write operations is perfectly acceptable and often used to optimise the system’s cache, though such delays increase the probability of conflicting updates for objects with multiple writers. To conclude, the devised access control scheme complies with the very strong re- quirements imposed by the environment. Indeed, compared to Plutus [KRS+03] for instance, this scheme evolves in a completely decentralised and untrustworthy environment, does not make use of expensive algorithms, ensures accountability, guarantees that access modifications are made effective immediately, allows anyone to consult an object’s current permissions and requires every user to keep a single item: her personal cryptographic key pair. The major drawback of this system lies in the fact that every object must grant permission to a sufficient number of lords in order to ensure that, at any time, a manager is connected to validate vassals’ oper- ations. However, it has been shown that the required connectivity was realistically achievable with a small number of lords, especially through the use of optimisations. 4.2 Administration Peer-to-peer file systems have emerged in the last decade as a way to provide avail- able and durable storage capacity at low costs by relying on a large-scale set of untrustworthy computers. Some projects such as CFS [DKK+01], Ivy [MMGC02], Pastis [mBPS05], OceanStore [KBC+00], FARSITE [ABC+02], Plutus [KRS+03] and Chefs [Fu05] succeeded in offering the user a file system interface. Unfortu- nately, all of those projects omitted to address one of the fundamental requirement CHAPTER 4. DESIGN 84 of such systems, the ability to administer the file system. In common file systems, being centralised or distributed such as NFS [Osa88], SFS [MKKW99] etc, a user has special privileges allowing her to perform administrative tasks such as creating new users, managing the groups, removing files or directories that she judges inappropriate and so forth. This specific user takes the name of root on Unices and Administrator on Windows systems. Decentralised file systems such as CFS [DKK+01], Ivy [MMGC02], Pastis [mBPS05] etc. however, cannot rely on a special and privileged entity because such systems are designed to prevent a single user from taking over the entire system. The remainder describes the design of an administration scheme that prevents a single user from completely controlling the system while enabling users to request administrative operations. 4.2.1 Semantics This section discusses the semantics of administrative tasks associated with common file systems and their relevance to the context of decentralised file systems. Note that since the access control scheme presented in Section 4.1 is based on the DAC model, this section focuses on discretionary-access-control-based file systems. In such systems, an object being a file, directory or link is controlled by a single user, known as the owner, in a completely autonomous way. Thus, every operation directed at the object is said to be object-oriented. The reader should notice that such operations compose most of a file system’s operations. However, assuming that a file system exposes a unique hierarchical organisation to the users and/or that the data are stored on a hardware device whose access is controlled by the operating system, functionalities operating at the system level become necessary. Since normal users cannot be granted the permission to operate at this critical level, file systems tend to rely on a super-privileged user, named root on Unices and Administrator on Windows. Such a superuser is commonly allowed to perform the following tasks. ˆ The superuser can create, delete and update user accounts; ˆ Likewise, the superuser can create, delete and update group accounts including the list of members and their permissions; ˆ The superuser is granted all the permissions on the file system. As a conse- quence, the superuser can create file system objects anywhere in the hierar- chical namespace but also access, update and delete any file, directory etc; and CHAPTER 4. DESIGN 85 ˆ The superuser is also able to change the ownership of a file system object so that object management is transferred to another user. The following discusses the relevance of such administrative tasks in the context of decentralised file systems. Entity As detailed in Section 4.1, creating a user in the given peer-to-peer file system comes down to (i) generating a cryptographic key pair (ii) creating and storing a User block. Unlike common file systems in which the superuser manages the user accounts, creating such a block implies that only the owner can update or delete the user entity. Note however that an organisation wishing to prevent users from creating entities in an autonomous way would have to rely on a CA (Certification Authority) for signing the User, Group etc. blocks associated with the fundamental entities. Although such a CA would violate the decentralisation requirement, some research [ACMR02] has been carried out on decentralised certificate management protocols. It is therefore fair to assume that the creation, update and deletion of user entities can be performed in a completely decentralised way, through the sole management of the User block by its owner, without the need for a super-privileged user. As for users, creating a group in a peer-to-peer file system comes down to storing a Group block. Very much like user entities, groups are owned by a single user which is responsible for its management, including maintaing the list of its members along with their permissions. Therefore, as for users, the creation, update and deletion of group entities can be performed in a completely decentralised way. Noteworthy is that the notion of user and group often comes with a functionality enabling people to retrieve an entity’s identifier given a human-readable representa- tion, such as a name. For instance, Unices store the system’s users and groups in the files /etc/passwd and /etc/group, respectively. These files record the human- readable user or group name along with its associated system identifier, known as the UID (User Identifier) and GID (Group Identifier). Additionally, Unices provide commands such as id which takes a username as argument and returns information on the user account including her UID, the groups she belongs to etc. Peer-to-peer file systems should integrate similar functionalities, especially because such large-scale systems deal with potentially million of users. Although leaving users the responsibility to communicate their storage user identifier—i.e. the user’s public key—to their friends might seem reasonable to many systems, such a method would not comply with the transparency requirement defined in Chapter 3. CHAPTER 4. DESIGN 86 A very simple solution for mapping a username to its identifier would be to create a block whose address is the hash of the username. Therefore, someone looking for a specific user would compute the hash of the username and retrieve this block which would then contain the associated user’s public key, hence leading to the User block as explained in Section 4.1.5. Although such a scheme benefits from an extreme simplicity, it forces users to know the exact names of the entities sought. Thus, one would be unable to look up entities according to a pattern such as a regular expression. Therefore, although functional, this method is limited in terms of expressivity. Object As mentioned above, superusers are granted the privilege to access, rename, delete and modify any file system object, independently of its location and without anyone’s consent. The environment’s properties defined in Chapter 3 stipulate that no user should have control over the whole system. This super-privilege therefore seems to strongly con- flict with the fundamental properties of the environment. Fortunately, the presented decentralised file system has been designed to prevent any user from accessing an object without the owner’s authorisation. As a consequence, this access super-privilege turns out to be undesirable, unneces- sary and unachievable in the given decentralised and discretionary environment. Noteworthy is that, historically, the notion of a user with super-privileges has been introduced in UNIX to deal with the root directory from which the name of the superuser comes. This directory deserves special attention as it is the most critical object in the file system hierarchy. Indeed, common file systems consider that having the right to remove a directory entry is semantically equivalent to being allowed to delete the object the entry points to. Therefore, a user removing a directory entry pointing to a sub-directory indirectly deletes the directory object but also its data, including the sub-entries being files or directories, and so on down to the leaf objects. A file system in which everyone is allowed to modify the root directory would thus inevitably lead to chaos as anyone could destroy branches of the namespace. As a consequence, file systems tend to grant privileges on the root directory to the administrator alone. The specificity of the peer-to-peer environment implies that such special privileges cannot be granted to a single user. Indeed, and as mentioned earlier, no user should have the power to make file system objects disappear without the owner’s consent. However, in order for the file system to evolve, the root directory must be modifi- able as every other object in the namespace. Therefore, an administration system adapted to the given environment’s characteristics should be provided to overcome CHAPTER 4. DESIGN 87 this issue. Finally, superusers are also granted the right to modify the ownership of an object. Such an operation has probably been provided to the superuser to prevent users from repudiating the ownership by suddenly giving it away to another user without her authorisation or even her awareness. As detailed in Section 4.1.5, the presented peer-to-peer file system, being based on a discretionary access control scheme, makes it impossible for anyone but the owner to modify the object’s metadata. Indeed, every Object block is associated with a user, known as the object’s owner, through a digital signature. This signature is applied on the owner’s public key and sealed with the block’s private key. Therefore, since the block’s private key is known exclusively from the object owner, nobody can modify this object-user link but the owner itself. Transferring the object’s ownership to another user would consist in changing the owner’s public key and re-signing it with the block’s private key. Unfortunately, this method suffers from two major issues. Firstly, the block’s private key has been discarded in order to prevent users from keeping too much access information. Therefore, the signature of the owner’s public key could not be re-issued. Secondly, assuming that owners keep the blocks’ private key, the owner’s public key could indeed be re-signed. However, once the new owner is in place, nothing could prevent the original owner from overwriting the owner’s signature once again. Thus, the file system object representation described throughout Section 4.1.5 does not seem suitable for enabling the transmission of ownership. An advanced and more specific functionality involving both users to agree on the operation would be necessary. ? ? ? In conclusion, most of the privileges granted to a superuser turn out to be unneces- sary as such tasks can be performed in a discretionary manner. However, a subset of these operations do not comply with the given environment and therefore require a specific solution. Among those, user and group entities, once created, should be registered in an inventory of some kind. Additionally, the root directory should be accessible by everyone in reading while writing should be carefully supervised in order to prevent rash operations. Finally, an object’s owner should be given the possibility to transfer the ownership to another user, assuming the other user agrees to take over. CHAPTER 4. DESIGN 88 4.2.2 Model This section discusses the properties of the system-wide and user-wide organisation models. 4.2.2.1 System-wide The vast majority of file systems, being centralised or distributed, expose a single, often hierarchical, namespace to the users such that everyone experiences the same organisation of directories and files. Such file systems are therefore said to make use of a system-wide organisation model because the organisation is applied at the system level, independently of the users’ preferences. Note that such a model is being used by a variety of other systems, from the DNS (Domain Name System) [Com85, MD88] to Wikipedia. Considering such a model implies numerous obvious advantages but also unexpected issues, especially in the context of peer-to-peer networks. Indeed, as mentioned in Section 3.4, peer-to-peer file systems store data blocks in an underlying distributed hash table. The particularity of such file systems compared to centralised ones is that every object is independent of the others. As such, the link existing between a file and the directory that references it is more logical than physical. While a malicious user would be unable to render a centralised file system inconsistent, in a peer-to-peer file system, nothing can prevent a user from destroying a file without updating its parent directory for instance. Should such a scenario occur, users would be able to browse the directory normally, but trying to access the file by fetching the Object block would inevitably result in a system failure. Noteworthy is that an honest user deleting one of her files may end up in the same situation, because not authorised to update the parent directory. Indeed, since the access control model the presented file system relies upon is discretionary, a user must have the permission to destroy the file and its data blocks from the distributed hash table along with the right to update the file’s parent directory. Besides, coherency within an object cannot be guaranted either. Indeed, a malicious user legitimately updating a directory Object block may deliberately build the directory entries in a way which does not comply with the file system’s format. Thus, any user reading this directory would be incapable of understanding its content. Such an attack is obviously inconceivable in a centralised or distributed file system because every request is verified and applied by an authoritative entity. Therefore, although consistency is naturally expected from file systems, it turns out that such a property is unachievable in file systems relying on a system-wide organisation model and devoid of any authoritative entity for controlling that every update complies with the system’s format. Note however that assuming that most CHAPTER 4. DESIGN 89 users follow the protocols, consistency should be maintained most of the time and manually fixed when violated. Finally, since peer-to-peer file systems cannot rely on a superuser, such systems suffer from the issues discussed in Section 4.2.1 such as the transfer of ownership, the permissions on the root directory etc. 4.2.2.2 User-wide People have different cultures, backgrounds and tastes and therefore have different ways of naming and organising information. For instance, one may organise music according to the genre followed by the band name while another may ignore the genre classification. The user-wide organisation model decouples the data objects from the organisation objects by enabling users to create their own view. A view is composed of organ- isation objects referencing data objects. In the context of file systems, a view is represented by a hierarchy of directories, every directory being stored in an Object block. This way, a user can name and organise the file system content, i.e. the data objects, according to its preferences. Note that such a model implies that the organisation objects, i.e. the directories, are controlled by their respective owners very much like data objects. Indeed, since both directories and files are stored in Object blocks, access to such objects can be restricted or shared. Thus, one user can imagine sharing its view with another user. Besides, the application could enable users to use a view for some parts of the namespace and switch to another view for a specific subset of the hierarchy. For example, one might want to use an official view most of the time but switch to the view provided by Google when it comes to the directory /company/google/ as information is believed to be better organised in this specialised view. This feature is similar to the way stackable file systems [HP94, PPD+95, WDG+06] use union mounts in order to alter the namespace ex- posed to users according to the context. In such a user-wide model, views evolve independently from the data. Besides, file objects are no longer attached to a single hierarchy as they were in the system-wide model. Therefore, a user deleting an object would not be able to update the various directories referencing it as nobody can know the views involving this object. Note however that the user may update views she owns and/or has agreed to maintain, assuming she has the proper credentials. Therefore, a user-wide file system must be considered as residing in a perpetual inconsistent state, as nothing can guarantee otherwise. Nonetheless, it is interesting to notice that the WWW (World Wide Web) has been built on the same model in which pages evolved independently of the other pages ref- CHAPTER 4. DESIGN 90 erencing them through hyperlinks. Therefore, whenever a page is moved or deleted, none of the pages referencing it is updated. Instead the hyperlink becomes irrele- vant as pointing to an invalid location, the various references being fixed over time as webmasters notice and correct the problem. Although this model has performed extremely well for the WWW, providing users the liberty to express themselves ouside any control, the inconsistency drawback resulting from the many references pointing to invalid locations may be less suitable to file systems than it has been to the WWW. Indeed, by relying on such a model in a file system, users may often end up seeing their accesses failing because of an invalid address. Such a behaviour may irritate the user especially because using the local file system has accustomed end-users to reliability and efficiency. Noteworthy is that the issues mentioned in Section 4.2.1 are inherent to the system- wide model and must therefore be reconsidered for the user-wide model. For in- stance, system-wide models suffer from the fact that the root directory is critical as being the base of the hierarhical organisation. In user-wide models however, multiple root directories exist, one for every view. The access permissions of the view’s root directory are directly controlled by the view’s owner such that a user disagreeing with the way the view is managed can decide to use another view. Unfortunately, the issues related to the transfer of object ownership as well as the necessity for searching the database of users and groups remain. ? ? ? Considering both models, it is highly probable that a user-wide organisation model would perform better in terms of the acceptance and expansion of such a large- scale system very much like the constraint-free hyperlinks enabled users to express themselves through Web pages. However, as mentioned above, such a model may not be suitable for the file system context as users are expecting reliability and efficiency and would probably be irritated by the lack of consistency. Although no model seems to perfectly fit with the required properties defined in Chapter 3, it is interesting to note that the issues discussed throughout Section 4.2.1 are relevant to both models, with the exeption of the root directory. Further- more, one may notice that both the root directory and the users/groups inventory exhibit an identical flaw: the impossibility to prevent rash modifications should the permission be granted to everyone. CHAPTER 4. DESIGN 91 The remainder of this section therefore focuses on providing mechanisms for (i) preventing impulsive object modifications (ii) transferring object ownership. 4.2.3 Objectives The objectives regarding the design of an administration scheme are closely related to the access control scheme’s defined in Section 4.1.1. ∇1 First, the environment fundamental properties defined in Chapter 3 must be respected. These include decentralisation, scalability, untrustworthiness and symmetry but also efficiency through quorum protocols, non-connectivity due to churn though this property has been refined in Section 4.1.3; ∇2 The set of users allowed to moderate administrative tasks must be modifiable. Note however that such an operation also constitutes an administrative task as nobody should be allowed to suddenly reform this set without the consent of several other users; ∇3 As for the users and groups having been granted permission on an object, the set of moderators must be consultable. This is required should a user need to complain for instance; ∇4 The scheme should not require users to store an excessive amount of informa- tion related to administrative tasks; and ∇7 Finally, moderators should be made accountable for the administrative tasks they approve so that their position can be challenged should they fail to honour their duty for example. Any administration scheme candidate will be considered unsuitable if violating at least one of these objectives. As in Section 4.1, the reader will be able to refer himself to the definitions summary located at the bottom of every page. 4.2.4 Scheme This section describes the administration scheme which is composed of (i) a commu- nity mechanism enabling users to request administrative tasks and (ii) a ownership user-to-user protocol providing object owners the possibility to transfer their own- ership to another user. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 92 4.2.4.1 Community The community mechanism consists in the introduction of a physical block known as the TKB (Table Key Block). TKBs differ from OKBs or even PKBs in the sense that such blocks are neither owned nor administered by a single user, the block’s owner, but instead by a set of users referred to as the table of knights. The idea behind the presented mechanism is to require users wishing to update a block to acquire the approval from a majority of knights. The scheme has been designed in order to respect the various properties related to the given environment, defined in Section 4.2.3. For instance, the table of knights has been introduced for scalability purposes. Indeed, while the number of users populating peer-to-peer networks can be assumed to be high, many of those users may also be considered dead i.e. users which will no longer connect to the system. Therefore, designing an administration scheme requiring a user to acquire the ap- proval from the majority of all the users would be impractical but also extremely inefficient. In order to comply with ∇1, the scheme thus requires an extremely small set of users—i.e. compared to the total number of users—to contribute further to the system by moderating a specific object’s updates. Noteworthy is that, when coupled with the access control mechanism provided by the Object logical block, administrative requests can be limited to some specific users by relying on the per- missions field. The ability to precisely control which users are granted the permission to request an operation from the table of knights can help limit the administrative load put on the users acting as knights. According to the modifications introduced by the community mechanism, the per- missions must be refined in order to comply with the semantics of TKBs. Indeed, the permissions described so far have been used to grant a user or group either the right to read or write data. The table of knights introduces several subtleties that deserve special attention. Firstly, the write permission is irrelevant to such blocks as any operation involving the block’s modification is inherently prohibited; any update must be approved by the knights. Therefore, instead of the permission to write the object’s data, users are granted the permission to request an update to the table of knights. Secondly, an Object logical block is composed of multiple sections. While the data section could be updated by any user having been granted, directly or indirectly, the write permission, the meta section was administered by the block’s owner. However, TKBs are devoid of the notion of owner, concept which has been replaced by the table of knights. Similarly, the table of knights itself must be modifiable, as stipulated by ∇2. Indeed, some knights may wish to leave their position, others may be evicted by the community while users may volunteer to join. Updating the metadata and the table of knights both represent operations that, as any other modification, must be approved by the knights. Therefore, the ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 93 permissions field should reflect these extensions through the addition of the gov- ern—the administration of the object’s metadata—and elect—the administration of the table’s composition—permissions. Thirdly, the TKB physical block could be used by other logical blocks such as Groups for instance. Such a construct would enable a set of users to cooperatively manage a group, a characteristic which should be very much appreciated by communities. Unfortunately, no permission has been associated with such operations because the group owner was the only user allowed to administrate the Group block. Additional permissions should thus be introduced in order to provide group members the right to request the modification of the Group’s table of knights but also the list of members. Since group members could theoretically be indirectly granted the permission to request the modification of the metadata and table of knights associated with an Object, permissions to update the Object and the Group blocks should be made distinguishable. Therefore, a group member could be granted the manage permission to modify the Group’s table of knights and the edit permission to modify the Group’s data i.e. the members and their permissions. Note that the sub-groups recorded in the Members block could also be granted those permissions. In addition to these permissions, every group member can be indirectly granted the elect, govern, read and/or write permissions on an Object. Table 4.1 summarises the permissions which can be granted to users according to their role, i.e. lord or vassal, in the context of TKBs. These extensions demonstrate the flexibility and adaptability of the access control scheme designed in Section 4.1. Figure 4.8 depicts the internal organisation of a TKB physical block. One might notice that unlike most of the physical blocks presented in Section 4.1.5, TKBs embed a seed. While the address of mutable blocks such as PKBs and OKBs is computed by applying a one-way function on the public key of the block’s key pair, the address of a TKB is computed by applying a one-way function on a randomly generated integer: β.seed. Indeed, since TKBs are devoid of any owner, there is no need to generate a cryptographic key pair. Additionally, the table of knights is included in the block, hence complying with ∇3. Note that, as for PKBs and OKBs, the data section contains a version number as well as a signature issued by the author whose public key is also included in the block: β.Kauthor. Finally, and in order to comply with ∇7, every acquired knight’s vote is attached to the block so that anyone can verify the block’s validity: (i) every vote is unique and relates to this block (ii) a majority of votes has been reached. Noteworthy is that every vote contains an index to the related knight in β.table.board along with a signature which has been applied on (i) the block’s identity i.e. the seed (ii) the table (iii) the author’s public key and (iv) the data signature. Such inclusions prevent votes from being forged or re-used in other contexts i.e. for another block, another operation etc. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 94 K user K user [...] user K TKBα table seed #i board version #2 votes signature signature knight [...] K author data version signature [...] knight Figure 4.8: The representation of a TKB ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 95 Subject Permission Description Lord elect request a modification of the Object ’s table of knights govern request a modification of the Object ’s metadata read read the Object ’s data write request a modification of the Object ’s data Vassal elect request a modification of the Object ’s table of knights govern request a modification of the Object ’s metadata read read the Object ’s data write request a modification of the Object ’s data manage request a modification of the Group’s table of knights edit request a modification of the Group’s composition Table 4.1: A summary of the permissions in the file system Algorithms 20, 21 and 22 detail the set-up, seal and validation processes of TKBs, respectively. Require: (Kuser, kuser), the user’s personal key pair 1. β.seed← generate integer 2. β.table.board← { Kuser } 3. β.table.version← 0 4. β.data.version← 0 5. α← h(β.seed) 6. return α Algorithm 20: Setup TKB (β) −→ α Require: (Kuser, kuser), the user’s personal key pair 1. β.data.signature← h(β.]2) ˜kuser Algorithm 21: Seal TKB (α, β) ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 96 1. if α 6= h(β.seed) then 2. error “the address does not match the block” 3. end if 4. if |β.votes| < ⌈ |β.table.board| 2 ⌉ then 5. error “the table’s majority has not been acquired” 6. end if 7. for all ι ∈ { 1, ..., |β.votes| } do 8. Kknight ← β.table.board[β.votes[ι].knight] 9. if β.votes[ι].signatureK˜knight 6= h(β.]i) then 10. error “the vote signature is invalid” 11. end if 12. end for 13. if β.data.signature ˜β.Kauthor 6= h(β.]2) then 14. error “the data signature is invalid” 15. end if Algorithm 22: Validate[client] TKB (α, β) — client side One may have noticed that, unlike mutable blocks such as PKBs and OKBs, TKBs contain a section, the table, which is not protected through the use of a cryptographic signature. Indeed, unlike other mutable blocks which are administered by a single and static authority i.e. the block’s owner, TKBs’ authority is represented by the table of knights whose composition evolves over time. Since no static relation exists between the block’s address and its table of knights, the self-certification property is violated. As a consequence, the table’s integrity and authenticity cannot be guaranteed, implying that anyone can theoretically modify the table’s composition. Thus, a client could submit a version of the block in which the table’s composition has been replaced. The client would transfer the block to a write quorum of storage nodes which would verify the block’s validity. According to the table of knights embedded in the block, the votes prove the knights’ approval since (i) every vote is valid and (ii) a majority of votes has been provided. Storage nodes, considering the block as legitimate, would therefore agree on storing the data, hence overwriting the previous version. Let us consider a user updating an object from version νi to νi+1. In order for the modification to be considered valid, the user has to attach votes showing evidence that the knights at νi approved this modification. However, since the operation relates to the modification of the data, the table is not, in theory, being modified so that the tables at νi and νi+1 are identitical. Therefore, the votes issued by the knights at νi and attached to the block νi+1 can legitimately be checked against the table of knights at νi+1. Although the semantics regarding data modifications conform to the administration scheme’s objectives, the operation consisting in up- ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 97 dating the table’s composition violates ∇2, as illustrated by the next scenario. A user wanting to update the table of knights’ composition must provide evidence that the current table of knights at νi approves the future composition which will be attached to the instance νi+1. Unfortunately, the validation process, detailed in Algorithm 22, checks the votes against the embedded table of knights i.e. both at νi+1. The validation process can therefore be considered as fundamentally flawed when it comes to verifying a block whose table of knights has been modified. Besides, should the Validate[client] TKB (α, β) routine be revised to check every vote in νi+1 against the table of knights at νi, the symmetry property 2 would be violated. Indeed, since clients cannot have access to the current and past versions of a given block, this verification step could not be performed on the client-side. Thus, as detailed in Section 3.4, clients would have no choice but to trust the servers, hence inevitably violating the untrustworthiness property. Although the system has been designed around quorum algorithms because such algorithms exhibit better performances than their agreement counterpart, such algo- rithms appear impotent regarding this issue. Indeed, quorum algorithms have shown to perform well in the given environment because the validation process could detect any illegitimate block instance. As such, by acquiring a read quorum composed of 2γ+ 1 instances, the potential γ invalid blocks can be identified and thus discarded, leaving the client with γ + 1 valid blocks among which at least one instance is the latest version i.e. with the highest version number. However, it has been previously shown that, since the table of knights cannot be statically protected through self- certification, valid instances of TKBs cannot be distinguished from illegitimate ones. The storage algorithms must therefore be reconsidered in order to rely on consensus, the only paradigm ensuring the client to retrieve the latest valid non-self-certified instance in a Byzantine environment. The system’s protocols must be modified for the specific purpose of TKBs to either (i) rely on agreement algorithms such as BFT [CL99], Paxos [Lam98] etc. or (ii) rely on specialised quorum algorithms by breaking the symmetry property, as explained in Section 3.4. On the one hand, agreement protocols would provide the flexibility required to handle advanced functionalities in which case the set of storage nodes Ω would run a Byzantine protocol ensuring that every client’s request is processed by the servers until a consensus is reached. On the other hand, relying exclusively on quorum algorithms would avoid developers having to maintain the source code for both agreement and quorum algorithms. For the sake of coherency within the system but also because agreement protocols are known to be expensive, quorum algorithms will be specialised for the purpose of TKBs. Another argument in favour of quorum 2... peer-to-peer nodes are considered equally unprivileged so that everything performed by one node could also be performed by another one ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 98 algorithms is that symmetry has never been completely respected. Indeed, although it has never been discussed in detail, every storage node receiving a mutable block to store verifies the block’s validity but also checks that the embedded version numbers are increasing in a strict monotonic way. Obviously, this validation step can only be performed on the server-side since, unlike clients, servers can access both the current and future versions of a block. Note however that although this additional verification step violates the symmetry property, it is not fundamentally required for ensuring the system’s safety. Indeed, assuming that version numbers are not verified, block versions could increase in a non-monotonic way but this would not prevent clients from retrieving the version with the highest number. Similarly, clients could submit blocks with version numbers being lower than the current ones. Such scenarios are especially likely to occur whenever different clients concurrently update the same block. Therefore a client updating a block from version νi to νi+1 could see her operation rejected because the block has been concurrently updated, say thrice, to version νi+3 before receiving the client’s update. Without such a version verification, a client could believe that her update has been applied while, in fact, it does not have the highest version number. The submitted instance will therefore never be used leading to the loss of the modifications. Let us consider another scenario in which the storage nodes do break the symmetry property by verifying the version νi+1’s votes against the current table of knights i.e. which is located in version νi. Let us assume that the client wishes the addition of five knights to the block’s table for a total of ten knights. The client, by requesting the current knights, starts by acquiring a majority of three votes. The votes are then included in the new block which embeds the future table composed of ten knights. The storage nodes receiving this block verify that (i) a majority of votes has been reached according to the table of knights at νi—i.e. three votes out of five—and (ii) the embedded votes have been issued by the knights of this same table. Since, in this scenario, both conditions have been met, storage nodes consider the block as valid and therefore accept it. Let us recall that, as shown in Algorithm 22, the client’s verification procedure checks the attached votes against the embedded table of knights. Therefore, a client fetching and verifying the block νi+1 would reject it because the three votes attached to the block have been issued by the knights at νi and therefore do not match the embedded table which is composed of ten knights. The votes provided for modifying the table of knights should therefore be distinguished from the embedded votes. The protocol for updating TKBs must therefore be slightly improved as detailed next. Every client wanting to update a TKB, being the data or the table of knights, must acquire a majority of votes from the knights at νi along with a majority from the future knights i.e. at νi+1. Then, the client builds the new block by including the new table along with the votes issued by these future knights. Finally, the client ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 99 sends to the storage nodes both the votes issued by the current knights and the new block. Every storage node receiving the block starts by checking the additional votes against the table composition at νi—hence proving that the modification has been approved by the current knights—before verifying the block’s validity: (i) a majority of votes has been reached and ( ii) the attached votes have been issued by the knights referenced in the block’s table. Algorithm 23 illustrates the validation process from the server’s perspective. Note that the verifications regarding the monotonically increasing version numbers are ignored for the sake of simplicity as it has been throughout this chapter. This verification procedure is composed of two steps. First, the additional votes are checked against the table of knights referenced in the current version of the block, hence proving that the current knights approved the modification. Then, the given block is validated by following the client verification procedure: the votes are checked against the embedded table of knights and the data signature is finally verified. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 100 Require: ϑ, the current version of the α block ε, the set of votes provided by the client and issued by the knights in ϑ 1. if |ε| < ⌈ |ϑ.table.board| 2 ⌉ then 2. error “the table’s majority has not been acquired” 3. end if 4. for all ι ∈ { 1, ..., |ε| } do 5. Kknight ← ϑ.table.board[ε[ι].knight] 6. if ε[ι].signatureK˜knight 6= h(β.]i) then 7. error “the vote signature is invalid” 8. end if 9. end for 10. if α 6= h(β.seed) then 11. error “the address does not match the block” 12. end if 13. if |β.votes| < ⌈ |β.table.board| 2 ⌉ then 14. error “the table’s majority has not been acquired” 15. end if 16. for all ι ∈ { 1, ..., |β.votes| } do 17. Kknight ← β.table.board[β.votes[ι].knight] 18. if β.votes[ι].signatureK˜knight 6= h(β.]i) then 19. error “the vote signature is invalid” 20. end if 21. end for 22. if β.data.signature ˜β.Kauthor 6= h(β.]2) then 23. error “the data signature is invalid” 24. end if Algorithm 23: Validate[server] TKB (α, β) — server side Since TKBs do not comply with the self-certification property, the quorums must be adapted so as to behave in a consensus way. Indeed, rather than relying on the blocks’ self-certification property, the TKB -specific quorum algorithm relies on the fact that up to γ replicas can be illegitimate such that the valid and latest version can be identified by gathering at least γ + 1 identical instances. The quorums are thus redefined in order to reflect this paradigm. Firstly, a client wishing to retrieve the block would need to acquire a read quorum composed of 2γ + 1 so that it is guaranteed to identify γ + 1 identical instances. Secondly, in order to ensure that every storage node provides the block’s latest version, a modification requires the client to acquire a write quorum composed of |Ω| = 3γ + 1 nodes. Figure 4.9 depicts a scenario in which two Byzantine nodes collude in order to ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 101 mislead the clients. At first, the storage nodes are consistent, each one storing version 4 except for both Byzantine nodes which answer every request with a forged block embedding an illegal table of knights granting them full privileges. Then, after having received the authorisation from the table of knights, a client submits an update by acquiring a write quorum composed of the seven nodes, though the Byzantine nodes ignore the update. Finally, another client acquires a read quorum, receiving three version 5 and two illegal version 9. Following the protocol, the client isolates the γ + 1 identical instances which happen to be the block’s latest and legitimate version 5. 4 44 4 55 5 5 55 54 5 5 5 x x 9 9 99 9 9 (i) (ii) (iii) write read non−Byzantine node Byzantine node quorum Figure 4.9: A scenario illustrating the TKB -specific quorum algorithm Figure 4.10 illustrates the internal organisation of an Object logical block built upon a TKB. The Object benefits from the community features provided by the TKB such that users and groups can be granted the permission to read the data; request a data modification i.e. write; request a metadata modification i.e. govern or request the modification of the table of knights i.e. elect. Apart from these permission exten- sions, the Object block behaves as expected except that, obviously, every request requires the user to acquire the table of knights’ consent through the voting process. The internal organisation differences between OKB -based and TKB -based Object blocks are twofold. First, let us recall that, within the context of OKB -based Ob- jects, a vassal wishing to update the block needs to request a lord to certify that the vassal had been granted the proper permission. In addition, TKBs require users up- dating the block to request knights to certify that the modification conforms to the ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 102 community’s interests. Thus, TKB -based Object blocks theoretically require users to request both a lord and a majority of knights. Since the critical component of this process lies in users’ connectivity, requiring users to contact a lord would impact the performance since knights, as the ultimate authority, could perform both certifi- cations. The TKB -based Object ’s internal structure therefore no longer embeds an author section containing the index of the lord along with a potential voucher. In- stead, the user updating the data includes her public key in β.Kauthor while knights verify that every user requesting an operation has been granted, directly or indi- rectly, the proper permissions. Second, since TKBs are devoid of the notion of owner, the TKB -based Object ’s metadata can be updated by any user having been granted the govern permission. Thus, as for data modifications, a user wanting to update the metadata, including the access permissions, must acquire the authorisa- tion from the table of knights. In addition, assuming the table of knights approved the update, the user must record her public key in the field β.Kgovernor. Noteworthy is that the entries of the table of knights do not contain a permissions field. Therefore, in order to be able to perform operations as other users, users acting as knights must also be recorded in the Access block, directly as lords or indirectly through group memberships as vassals. Interestingly and quite ironically, knights may not be granted the read permission though they are requested to approve every data modification. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 103 key blockα CHB α block key [...] #i CHB Access α permissions token K user permissions token #4 group [...] K user K user K user [...] votes signature signature knight knight fingerprint version signature meta data #2 Kgovernor authorK [...] version board Object α table seed TKB α access version signature #3 Contents contentsα Figure 4.10: The representation of a TKB -based Object block ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 104 Algorithm 24 details the set-up process while Algorithm 27 describes the client-side validation processes. Note that the server-side validation process is not provided as a combination of Algorithm 23 and Algorithm 27. Also, Algorithm 25 lists the sealing steps for the TKB -based Object ’s data section while Algorithm 26 details the meta section’s. Require: (Kuser, kuser), the user’s personal key pair 1. β.seed← generate integer 2. β.table.board← { Kuser } 3. β.table.version← 0 4. β.data.version← 0 5. β.meta.version← 0 6. α← h(β.seed) 7. return α Algorithm 24: SetupObject TKB (β) −→ α Require: (Kuser, kuser), the user’s personal key pair 1. β.data.signature← h(β.]2) ˜kuser Algorithm 25: SealObject[data] TKB (α, β) Require: (Kuser, kuser), the user’s personal key pair 1. if β.meta.αaccess =⊥ then 2. ~← h(β.]3) 3. else 4. χ← Get(β.meta.αaccess) 5. ~← h(β.]3|χ.]4) 6. end if 7. β.meta.signature← ~ ˜kuser Algorithm 26: SealObject[meta] TKB (α, β) ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 105 1. if α 6= h(β.seed) then 2. error “the address does not match the block” 3. end if 4. if |β.votes| < ⌈ |β.table.board| 2 ⌉ then 5. error “the table’s majority has not been acquired” 6. end if 7. for all ι ∈ { 1, ..., |β.votes| } do 8. Kknight ← β.table.board[β.votes[ι].knight] 9. if β.votes[ι].signatureK˜knight 6= h(β.]i) then 10. error “the vote signature is invalid” 11. end if 12. end for 13. if β.data.signature ˜β.Kauthor 6= h(β.]2) then 14. error “the data signature is invalid” 15. end if 16. if β.meta.αaccess =⊥ then 17. ~← h(β.]3) 18. else 19. χ← Get(β.meta.αaccess) 20. ~← h(β.]3|χ.]4) 21. end if 22. if β.meta.signature ˜β.Kgovernor 6= ~ then 23. error “the meta signature is invalid” 24. end if Algorithm 27: Validate[client]Object TKB (α, β) — client side Finally, Figure 4.11 details the organisation of a TKB -based Group block demon- strating how TKBs can be adapted to a variety of logical blocks. The set-up, seal and validation processes are not provided as identical to the TKB ’s. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 106 K user K user [...] user K TKBα table seed #i board version votes signature signature knight [...] K author data knight Group name description α members version signature #2 α group Members CHB permissions permissions K user K permissions user [...] Figure 4.11: The representation of an TKB -based Group block 4.2.4.2 Ownership The other mechanism included in the administration scheme focuses on providing users the possibility to transfer their ownership to another user, should this user agree to take over this responsibility. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 107 Interestingly, the previously described TKB physical block seems to provide the ideal construct for this functionality. Indeed, every Object could be built upon a TKB such that the table of knights would be composed of a single user, the block’s owner. Unlike OKB -based Objects, such blocks would not be statically bound to their owner since the table of knights can evolve over time. This characteristic makes the transfer of ownership achievable through the sole modification of the table of knights’ composition. Although this model would provide the required functionality, every file system object whose ownership may eventually be transferred would have to be built on such a physical block. Unfortunately, such blocks may perform extremely poorly depending on the owner’s connectivity. Indeed, since every block modification must be approved by the table of knights, which in this case is composed of a single user, users wanting to update the block may fail to do so because of the owner’s unavailability to validate the operation. A solution could be to introduce another physical block benefiting from both OKBs’ and TKBs’ advantages. Such a physical block would embed a table of knights for moderating operations regarding the modifications of the block’s metadata including the composition of the table of knights. However, the data modifications would not require knights’ approval. Therefore, such a physical block would perform exactly as OKBs though relying on a table of knights. Besides, such a model, applied to Object logical blocks, could benefit from the lords’ connectivity, as discussed in Sections 4.1.3 & 4.1.7. Unfortunately, while this model implies the introduction of another physical block, hence increasing the system’s complexity, it also fails to fulfill the objective: the future owner must accept to take over the ownership. Since the future owner must be included in the process, a user-to-user protocol is absolutely necessary. The remainder details an extremely simple user-to-user protocol which assumes that every file system object is stored in an OKB physical block, except for community objects such as the root directory, the users inventory and so forth which would rely on TKBs. Since OKBs are bound to their respective owners, transferring the ownership implies creating a new OKB. However, although the content of the Object block must be cloned, the Contents block, the Access block as well as the data blocks can be re- used. As such, the Object block’s content is copied into another Object while the new owner’s public key is inserted and the metadata section’s signature is re-issued, by the new owner. This straightforward process is therefore very efficient: (i) a single key pair must be generated (ii) a single block must be cloned and (iii) two cryptographic signatures must be recomputed. Noteworthy is that this method implies several additional actions to be performed. ∇1: environment, ∇2: modification, ∇3: consultation, ∇4: space, ∇5: flexibility, ∇6: efficiency, ∇7: accountability CHAPTER 4. DESIGN 108 Firstly, the old Object block must be deleted though the blocks it references such as the Access, Contents and data blocks must not. Note however that a garbage collection mechanism would not require the original owner to do anything as the system would detect the block as no longer used and it would therefore be automat- ically swept out. Secondly, every directory referencing the old Object block must be updated with the address of the new Object. The protocol, detailed below, enables a user to transfer the ownership of the block identified by its address α to another user µ. 1. First, the Object ’s current owner sends the message 〈transfer, α, κK˜µ〉 to the potentially future owner µ. Note that κ, the data encryption key, is provided so that the future owner can access the data. However, the key is encrypted with the µ user’s public key in order to ensure security. 2. The user receiving such a message decides whether or not to accept the own- ership. If so, the block is cloned through the creation of another Object block and updated accordingly. Then, the message 〈accept, α, 〉 is sent back to the original owner. The message includes , the address of the cloned Object block. 3. Finally, the original owner, having received the accept message, updates every directory referencing the Object block with the new address . This protocol is particularly interesting because it can be applied to every physical block. Thus, a user could decide to transfer the ownership not to another user but instead to a community so that a TKB would be created to replace the OKB. Similarly, a community could decide that the object no longer needs several users to monitor the object but instead that a single user should be made owner, hence transforming a TKB into an OKB. 4.2.5 Algorithms This section provides the reader a complete set of algorithms for manipulating TKB - based Object and Group logical blocks. ˆ Elect(α, θ) ˆ Govern(α, ψ) ˆ Read(α, λ) −→ δ CHAPTER 4. DESIGN 109 ˆ Write(α, λ, δ) ˆ Manage(α, θ) ˆ Edit(α, ψ) ˆ Transfer(α,Kuser) Note that, as described earlier, every TKB -related Put() request must provide both the block to store and a set of votes issued by the current knights. This action is symbolised next by the | operator which concatenates the block with the additional votes so that every storage node can extract the necessary information and perform the server-specific verification process. Although every algorithm described below makes use of timeouts, no information is given regarding the mechanism to handle such timeouts. The reader should consider that clients are supposed to retry the operation later and eventually return an error after a certain number of successive failures. Elect The Elect(α, θ) routine takes the address α of an Object object along with a set of knights θ and requests that the table’s composition be changed. The elect permission authorising users to perform this operation can be granted to anyone, directly in the Access block associated with the Object or indirectly through a chain of group memberships, assuming that every group of the chain is granted the elect permission along with the user. Algorithm 28 illustrates the client’s process which consists in acquiring votes from the current knights in order to prove the storage nodes that the operation has been approved but also from the future knights in order to build a valid block. CHAPTER 4. DESIGN 110 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. for all φ ∈ β.table.board do 3. request a vote from knight φ by sending the message 〈elect, α,Kuser, θ〉 4. end for 5. ε← wait until ⌈ |β.table.board| 2 ⌉ votes have been received, or timeout 6. for all φ ∈ θ do 7. request a vote from knight φ by sending the message 〈elect, α,Kuser, θ〉 8. end for 9. υ ← wait until ⌈ |θ| 2 ⌉ votes have been received, or timeout 10. β.votes← υ 11. β.table.board← θ 12. β.table.version← β.table.version+ 1 13. SealObject TKB (α, β) 14. Put(α, β|ε) – the | operator designates concatenation Algorithm 28: Elect(α, θ) Govern The Govern(α, ψ) routine enables any user having been granted the govern permis- sion to update the Object ’s metadata, including the access control list in the Access block. The function takes the address of the Object block along with a set of modifications to apply on the metadata. CHAPTER 4. DESIGN 111 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. for all φ ∈ β.table.board do 3. request a vote from knight φ by sending the message 〈govern, α,Kuser, ψ〉 4. end for 5. υ ← wait until ⌈ |β.table.board| 2 ⌉ votes have been received, or timeout 6. β.votes← υ 7. β.meta.version← β.meta.version+ 1 8. apply the set of modifications ψ 9. for all (, ζ) such that ζ is a new immutable block do 10. Seal CHB (, ζ) 11. Put(, ζ) 12. end for 13. SealObject[meta] TKB (α, β) 14. Put(α, β|υ) – the | operator designates concatenation Algorithm 29: Govern(α, ψ) Read The Read(α, λ, δ) method is very similar to the OKB -based Object-specific Algorithm 18 except that there is no special case for the owner. Let us recall that vassals are no longer required to acquire the validation of their modification from a lord as it was with OKBs. Instead, in the context of TKBs, the knights perform this verification for every modification, no matter the role of the author: knight, lord or vassal. However, vassals must still contact a lord when it comes to reading the data in order to retrieve the key required to decrypt the Object ’s Contents block. CHAPTER 4. DESIGN 112 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. χ← Get(β.meta.αaccess) 3. if ∃ι : Kuser = χ[ι].Kuser ∀ι ∈ { 1, ..., |χ| } then 4. if read /∈ χ[ι].permissions then 5. error “the lord does not have the read permission” 6. end if 7. κ← χ[ι].token ˜kuser 8. else 9. Ξ← { χ[ι].Kuser : read ∈ χ[ι].permissions ∀ι ∈ { 1, ..., |χ| }} 10. ν ← Locate(Ξ) 11. κ← request user ν for the key by sending the message 〈read, α,Kuser〉 12. end if 13. if h(κ) 6= β.data.fingerprint then 14. error “the key does not match the fingerprint” 15. end if 16. σ ← Get(β.data.αcontents) 17. ξ ← decrypt σ with the key κ 18. δ ← read data from ξ at location λ 19. return δ Algorithm 30: Read(α, λ) −→ δ Write The Write(α, λ, δ) routine provides authorised users the possibility to request the modification of the Object ’s data. Unlike OKB -based Objects, both lords and vassals must acquire the approval from the table of knights. As for elect, govern and read, the write permission can be granted to anyone, in- cluding group members. CHAPTER 4. DESIGN 113 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. χ← Get(β.meta.αaccess) 3. for all φ ∈ β.table.board do 4. request a vote from knight φ by sending the message 〈write, α,Kuser, δ〉 5. end for 6. υ ← wait until ⌈ |β.table.board| 2 ⌉ votes have been received, or timeout 7. β.votes← υ 8. β.data.version← β.data.version+ 1 9. κ← generate cryptographic symmetric key 10. ξ ← write data δ at location λ 11. σ ← encrypt ξ with the key κ 12. β.data.αcontents ← Setup CHB (σ) 13. Seal CHB (β.data.αcontents, σ) 14. Put(β.data.αcontents, σ) 15. for all ι ∈ { 1, ..., |χ| } : read ∈ χ[ι].permissions do 16. χ[ι].token← κ ˜χ[ι].Kuser 17. end for 18. β.data.fingerprint← h(κ) 19. for all (, ζ) such that ζ is a new immutable block do 20. Seal CHB (, ζ) 21. Put(, ζ) 22. end for 23. SealObject[data] TKB (α, β) 24. Put(α, β|υ) – the | operator designates concatenation Algorithm 31: Write(α, λ, δ) Manage The Manage(α, θ) routine is equivalent to the Elect(α, θ) method in every aspect except that it operates on a Group block. Note however that this permission cannot be granted to a user or group in an Access block since the operation is not related to an Object. As previously mentioned, distinguishing the manage from the elect permission en- ables a group member to be given the right to modify the table of knights of both the user’s Group and an Object the group has been granted the permission to. CHAPTER 4. DESIGN 114 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. for all φ ∈ β.table.board do 3. request a vote from knight φ by sending the message 〈manage, α,Kuser, θ〉 4. end for 5. ε← wait until ⌈ |β.table.board| 2 ⌉ votes have been received, or timeout 6. for all φ ∈ θ do 7. request a vote from knight φ by sending the message 〈elect, α,Kuser, θ〉 8. end for 9. υ ← wait until ⌈ |θ| 2 ⌉ votes have been received, or timeout 10. β.votes← υ 11. β.table.board← θ 12. β.table.version← β.table.version+ 1 13. SealGroup TKB (α, β) 14. Put(α, β|ε) – the | operator designates concatenation Algorithm 32: Manage(α, θ) Edit The Edit(α, ψ) routine takes the address of a Group block along with a set of metadata modifications. As for manage, the edit permission can only be granted to group members, granting them the right to request some modifications, including the set of members and their permissions. CHAPTER 4. DESIGN 115 Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. for all φ ∈ β.table.board do 3. request a vote from knight φ by sending the message 〈edit, α,Kuser, ψ〉 4. end for 5. υ ← wait until ⌈ |β.table.board| 2 ⌉ votes have been received, or timeout 6. β.votes← υ 7. β.data.version← β.data.version+ 1 8. apply the set of modifications ψ 9. for all (, ζ) such that ζ is a new immutable block do 10. Seal CHB (, ζ) 11. Put(, ζ) 12. end for 13. SealGroup TKB (α, β) 14. Put(α, β|υ) – the | operator designates concatenation Algorithm 33: Edit(α, ψ) Transfer The Transfer(α, µ) routine enables a user owning an OKB -based Object to transfer her ownership to another user. The routine takes the address α of an Object block along with the identity µ of the user to whom the ownership is to be offered. Require: (Kuser, kuser), the user’s personal key pair 1. β ← Gather(α) 2. if Kuser 6= β.owner.Kowner then 3. error“the owner is the only user authorised to transfer the object’s ownership” 4. end if 5. κ← β.meta.owner.token ˜kuser 6. request user µ to take over by sending the message 〈transfer, α, κ〉 7. wait for message 〈accept, α, 〉, or timeout 8. update the referencing directories with the address  Algorithm 34: Transfer(α, µ) CHAPTER 4. DESIGN 116 4.2.6 Analysis The previous sections detailed the design of an administration scheme composed of a community and ownership mechanism. Although the community mechanism enables objects to be managed with great flexibility, it also introduces numerous inconveniences which are discussed below. The TKB -specialized quorums have been optimised in order to prioritise reads over writes. Firstly, readings have been assumed to be more common than writings. Al- though this assumption seems fair, some specific file systems may wish to optimise the file system the other way around. Fortunately, quorum algorithms are highly adaptable and could therefore be optimised for writings: write quorums would re- quire the acquisition of 2γ + 1 nodes while read quorums would be made more expensive with 3γ + 1 nodes. Secondly, TKB physical blocks require every mod- ification to be approved by acquiring knights’ vote, process which can drastically delay the operation from taking effect. On the other hand, read operations can be performed without such approvals though vassals may have to contact another user to retrieve the key. Therefore, while write operations are prone to large delays, readings can be performed in real-time, so to speak: in the worst case scenario a single user’s node must be contacted. The delay potentially implied by write quorums comes from the fact that the 3γ+ 1 storage nodes may not be operational at the time of the acquisition, hence making the writing impossible to complete. However, since the underlying network protocols have been designed with self-adaptability in mind, other nodes will be chosen to take over the non-responsive ones such that, within a few seconds, a write quorum can be expected to be acquired. Besides, let us recall that writings require users to request votes from the table of knights, a process which could take from a few minutes to several days. Unfortunately, this delay cannot be determined because independent from the system. Indeed, should a majority of the knights of a single table be absent for several days, every request made during this period would be delayed until the knights come back and process the requests manually. As a result, the delay inherent in write quorums can be ignored in regard to the expensive votes acquisition process. Finally, note that most implementations purposely delay the commit of write operations. This technique is often used to buffer successive writes but also to avoid network communications should a file be created and immediately destroyed, operations which are very common in the process of compiling source code files for instance. For all the reasons exposed above, the TKB -specialized quorum algorithm has been optimised for reading rather than writing. Note that the common quorum algorithm used for other mutable blocks such as OKBs and PKBs has not been been optimised for one or the other type of operation because both can be performed in real-time i.e. CHAPTER 4. DESIGN 117 without the approval of some authority. However, the file system could very much be configured for optimising reading for instance, in which case read quorums would require the acquisition of γ + 1 instances while write quorums would be composed of 3γ + 1 storage nodes. The TKB physical block has been designed for the specific purpose of administration and should not be used otherwise. For instance, it would be inappropriate to build an entire file system out of TKBs. Indeed, as most of the file system objects are administered in a discretionary manner [Vog99], Object blocks would embed a table of knights composed of a single user. Since approval is required to update such blocks, the connectivity of the knights would thus be absolutely crucial to the system. Unfortunately, Section 4.1.7 showed that the connectivity of a single user would not suffice to ensure users the possibility to interact normally with objects. Although TKBs suffer the exact same connectivity issue as OKBs, both blocks exhibit different characteristics. Firstly, OKBs require vassals to request a lord in order to operate on the object, both for reading and writing. TKBs, on the other hand, require every user wanting to update the block to contact knights while vassals have to request a lord for the encryption key when it comes to read operations. Secondly, while a single lord was enough to perform an OKB -related operation, at least half of the knights must be contacted in order to achieve consensus. Thirdly, the users acting as an OKB ’s lords were designated by the block’s owner without their accord. However, every user acting as a TKB ’s knight chooses this position knowing exactly what it implies. Therefore, knights can be expected to be extremely well-connected users, probably responding to every administrative request within a few hours. Besides, the table of knights has been designed to be self-moderated so that a knight failing to perform her duty could be evicted by the community itself. The client’s connectivity also plays an important role in the votes acquisition pro- cess. Assuming that some knights were disconnected at the time the client made its request, the client should periodically try to resend the request so that every knight eventually receives it. Besides, the client must be connected to the network in order to collect the votes from the multiple knights. To summarise, the TKB physical block is costly to manipulate because (i) the TKB - specific quorums are expensive (ii) writes may suffer from delay, from a few minutes to several days (iii) the block contains more information than OKBs, especially because of the embedded table of knights along with the attached votes and (iv) TKBs are more expensive to validate than OKBs because every vote includes a cryptographic signature. Unfortunately, TKB ’s disposition for inducing delays in write operations drastically increases concurrency issues. As for OKBs, multiple users may concurrently update a block from version νi to νi+1 implying that the first one to commit the modifications CHAPTER 4. DESIGN 118 would automatically render the other ones invalid3. Alas, while the delay between a block retrieval and the commit of its modifications may approximate several seconds for OKBs, it could approach several days for TKBs. The probability of such update collisions are therefore extremely high, especially since TKBs will probably be used for popular content such as the users inventory, object which should be modified on a regular basis. Noteworthy is that agreement protocols such as BFT [CL99], Paxos [Lam98] etc. do not suffer from such a limitation because such algorithms ensure serializability. To conclude, although the community mechanism has been designed to prevent chaos, especially regarding critical objects such as the root directory, the user in- ventory and so on, the design’s extreme flexibility provides any set of users the possibility to manage a block in a cooperative manner. Furthermore, it is inter- esting to notice that Section 4.1.2 discussed the various access control paradigms, from MAC and DAC to RBAC, discarding both MAC and RBAC models because they require system-wide definitions. Ironically, the community mechanism designed throughout this section shows how such system-wide definitions could actually be achieved through the use of a dedicated community deciding whether or not to cre- ate new roles or to include a user in a higher clearance level for instance. Note however that although such paradigms could be designed through this framework, they would not resolve the fundamental flaws that have been discussed throughout this chapter. ? ? ? This chapter focused on designing the key components required in order to build a peer-to-peer file system. The first section introduced the notion of user but also presented the design of the Object block abstraction upon which every file system object can be built. Also, a flexible access control scheme has been integrated, providing users the means to protect or share information with other users and/or groups. The second section discussed several organisation models leading to the conclusion that peer-to-peer file systems, as common centralised file systems, need to provide an administration scheme. The proposed scheme is composed of both a community and ownership mechanism enabling communities to manage blocks in a cooperative manner. 3Let us recall that the version numbers must increase in a strictly monotonic way. CHAPTER 4. DESIGN 119 The contributions of this design are threefold. Firstly, the access control scheme enables users to express access restrictions in a flexible way that is unprecedented. Secondly, this work seems to be the first, as far as the author is aware, to address the issues related to providing an administration mechanism in such a decentralised environment. Thirdly, an extremely simple yet efficient user-to-user protocol enables users to transfer their ownership to other users. However, the design does not come without trade-offs. First of all, the connectivity requirement has been loosened against the fundamental properties defined in Chap- ter 3, in order to achieve the required flexibility. Note however that this constraint is minor, especially compared to other projects [KRS+03], as the study carried out throughout Sections 4.1.7 & 4.2.6 showed that a set of five lords with a connectiv- ity ratio of 0.6 suffices to enable vassals to interact with the object. Finally, the major drawback lies in the community mechanism’s TKB physical block which, due to its tendency to large delays, drastically increases the probability of conflicting concurrent updates. Chapter 5 Implementation This chapter details the implementation of the Infinit peer-to-peer file system which follows the philosophy of WheelFS [Sa07] by focusing on providing the fundamen- tal functionalities such as access control and administration through a file system abstraction. The Infinit prototype is composed of 40, 000 lines of source code written in C++. The implementation relies on several libraries, mainly the STL (Standard Template Library) for common system features and the OpenSSL (Open Secure Socket Layer) for cryptographic capabilities. Besides, since the file system has been written in C++, the Infinit processes reside in userspace. Therefore, the implementation relies on FUSE (File System in User Space) [FUS] for forwarding the system calls from the operating system kernel to a specific Infinit process. Note that although Infinit provides a completely useable file system abstraction, many components have been implemented in their simplest possible form in order to meet the time constraints. 5.1 Representation The file system prototype relies, as defined in Chapter 3, on a distributed storage layer which ensures some fundamental network properties such as scalability, dura- bility, availability and so forth. It is important to note that, as for CFS [DKK+01], Plutus [KRS+03] or even Pastis [mBPS05], the data being from a file, directory or else are split into blocks in order to better balance the storage load between the nodes. Infinit follows the UFS (UNIX File System) organisation in which every file system object such as files, directories etc. are represented by an inode, though Infinit uses a slightly different terminology, detailed next. 120 CHAPTER 5. IMPLEMENTATION 121 File objects are represented through the logical block Object which contains, as described throughout Chapter 4, the metadata. Also, the Access block contains the access control rules enabling the object owner to restrict and/or share access to other users. However, as described in Section 4.1, the access control scheme provides far more expressivity than the UNIX permissions. Finally, the Contents block contains references to the set of encrypted Data blocks. The relations between these logical blocks, forming a so-called file, are illustrated by Figure 5.1. α OKB Object data meta CHB Contents Access CHB CHB CHB Data Data 1011011000100001 1110011101011111 1011001101110111 1010010000110101 1010000110110101 1011010110110100 1011101101110101 1011011011010110 1101101110110111 1011011000011101 1101101011110101 0110110111101000 1101111101011010 1101110 0110110101100100 Figure 5.1: A file representation Directory objects are similar to files in every aspect except regarding the data se- mantics. The directory entries are stored in Catalog logical blocks, as shown on Figure 5.2. Note that every directory entry is composed of the name of the ref- erenced object along with the address of the pointed to Object block, being a file, sub-directory etc. In addition, Infinit provides users with objects commonly referred to as links. How- ever, one should carefully note that these objects relate to UNIX symbolic rather than hard links. Indeed, since every Infinit file system object is administered in a discretionary manner, hard links would be impossible to implement, since a counter should be updated following every hard link creation and destruction, not mention- ing inconsistencies inherent to malevolent behaviours. CHAPTER 5. IMPLEMENTATION 122 α OKB Object data meta CHB Contents Access CHB CHB CHB firefox emacs Catalog Catalog wc zsh gcc less mplayer Figure 5.2: A directory representation α OKB Object data meta CHB Contents Access CHB CHB Reference /user/jmq/ Figure 5.3: A link representation The Reference block contains the path indicating the namespace location of the CHAPTER 5. IMPLEMENTATION 123 target object. Noteworthy is that, should the path be extremely long, it would actually be split over several Reference blocks chained and referenced by the object’s Contents block. Figure 5.3 illustrates these relations although, in this example, the path is short enough to fit in a single Reference block. The prototype makes use of a system-wide organisation model, meaning that a single hierarchical namespace is exposed to the users, independently of their preferences. As discussed in Section 4.2.2, the system-wide organisation model may suffer from inconsistencies, though they are less likely to occur than in the user-wide model. Besides, this model is far easier to implement than a user-wide scheme, which would require the development of an extended set of applications for manipulating views. 1 / 2 users 3 music/ 4 Tool/ 5 Camel/ 6 Magma/ 7 . . . 8 README Listing 5.1: An example of hierarchical namespace Figure 5.4 depicts the relations between the various logical blocks composing the hierarchical namespace given by Listing 5.1. This example illustrates the use of the community mechanism described in Section 4.2. Indeed, both the root directory and the /users file rely on the TKB construct. The /users special file is assumed, in this context, to act as a user inventory in which every entry is composed of a name and the address of the associated User logical block. Note that for the sake of clarity, the Contents and Access blocks have been omitted from Figure 5.4. CHAPTER 5. IMPLEMENTATION 124 CHB Catalog OKB Object README music users TKB Object TKB Object CHB PKB User OKB Object albert isaac α root Catalog CHB ... Camel Tool Data CHB Data Magma Welcome to Infinit! Feel free to store your files in a reliable way. ... Figure 5.4: The Infinit system-wide hierarchical representation 5.2 Architecture The architecture of Infinit has been broken down into small, coherent, and flexible components known as Agent, Etoile, PIG and Hole. In addition, every component relies on one or both of the Elle and Lune libraries. All these components are described below. Figure 5.5 details the components composing a node connecting to the Infinit peer- to-peer network. Whenever a user, through an application, performs a file system- related operation on an Infinit partition, the standard C Library issues a system call to the operating system kernel. The kernel, noticing that Infinit is a userspace file system, forwards the call to the PIG (POSIX/Infinit Gateway) component. PIG aims at transcoding POSIX file system calls by sending the corresponding request to Etoile. Note that whenever a cryptographic operation must be performed on behalf of the requesting user, Etoile sends a message to the Agent component whose CHAPTER 5. IMPLEMENTATION 125 purpose is to handle operations such as signing and decrypting data with the user’s personal key pair. Finally, should Etoile require to update or retrieve information from the distributed hash table, a request is sent to Hole which acts as the gate to the peer-to-peer network. fuse module libc etoile hole kernel VFS agent PIG user1 agent wc ls user1 user1 user2 application proc proc u user−specific process system process trusted channel authenticated channel user2 user1 Figure 5.5: The architecture of an Infinit node Noteworthy is that Infinit can be configured depending on the node’s purpose. For instance, one may want to set up Infinit on an enterprise server in order to contribute the server’s storage capacity to the company’s private Infinit network. However, since nobody is using the server as a personal computer, the components PIG, Agent and Etoile can be omitted. Similarly, one might be willing to configure Infinit, on a mobile phone, so that whenever Etoile requires Agent to perform a cryptographic operation, the request is sent to the user’s home desktop computer rather than to an Agent running on the mobile phone. The decomposition of Infinit into several units enables users to configure the software depending both on their preferences but CHAPTER 5. IMPLEMENTATION 126 also on the device’s specifics, hence complying with the mobility property defined in Chapter 3. The remainder of this section discusses the internals of every component, including the libraries. 5.2.1 Elle Elle is a general-purpose library used by all the other components of the Infinit file system. Elle provides several packages, each of which offers a specific set of functionalities. The most important of those packages are described next. core The core package contains a set of basic types on which everything else is built. These types were introduced in order to ease the process of data serialisation. Those includes Boolean, Byte, Integer8, Integer64, Natural32, Real among others. system The system package contains system-specific definitions such as the microprocessor’s endianness, the path to major locations including the user’s local home directory, the local root directory etc. standalone The standalone package is very specific as it provides fundamental functionali- ties which may require features provided by other Elle packages, leading to inter- dependencies. The classes provided by this package were thus implemented by some- times relying on code redundancy. The package offers functionalities such as Region for manipulating static and dy- namic buffers, Report for handling error messages or even Maid for automatic mem- ory deallocation etc. radix The radix package contains fundamental base classes. Firstly, the Meta class repre- sents the ultimate base class which is directly or indirectly inherited by any other class. Secondly, every class must inherit one of the following classes: Object or Entity. CHAPTER 5. IMPLEMENTATION 127 The Object class must be inherited by any class which describes an object that can be serialised, compared, cloned, recycled and so on. On the other hand, any other class must inherit the Entity class. archive The archive package contains the Archive class which provides methods for serial- ising and extracting the types defined in the core package. Since most objects are built upon such types, an additional mechanism is required to render these types serialisable. Every class inheriting the Object class implicitely derives Archivable. Such classes must implement the associated Serialize() and Extract() methods which specify the internal attributes to be included in the serialisation process. 1 Status PublicKey : : Serialize ( Archive& archive ) const 2 { 3 // s e r i a l i z e the i n t e r n a l numbers . 4 i f ( archive . Serialize (* th i s−>key−>pkey . rsa−>n , 5 * th i s−>key−>pkey . rsa−>e ) == StatusError ) 6 escape ( ’ unable to s e r i a l i z e the i n t e r n a l numbers ’ ) ; 8 re turn ( StatusOk ) ; 9 } Listing 5.2: The PublicKey::Serialize() method 1 Status KeyPair : : Extract ( Archive& archive ) 2 { 3 // ex t r a c t the i n t e r n a l keys . 4 i f ( archive . Extract ( th i s−>K , th i s−>k ) == StatusError ) 5 escape ( ’ unable to ex t r a c t the i n t e r n a l keys ’ ) ; 7 re turn ( StatusOk ) ; 8 } Listing 5.3: The KeyPair::Extract() method Listing 5.2 illustrates the implementation of the Serialize() method for the specific purpose of the PublicKey class. Similarly, Listing 5.3 depicts the opposite process through the Extract() implementation for the KeyPair class. The reader can notice that the terminology—K and k for the public and private keys, respectively—is equivalent to the notation introduced in Chapter 4. CHAPTER 5. IMPLEMENTATION 128 io The io package provides basic input/output classes and interfaces such as Fileable and Dumpable. The most notable Uniquable interface enables any deriving class to represent objects in the form of a unique literal string. Listing 5.4 illustrates the use of the Uniquable interface for the address of a block. Also, the original object can be reconstructed from a Unique representation making such strings extremely convenient as human-readable tokens. 1 RUxMAQkCDwAgAAAAAAAAADau4cGNxRZEfNZJzG3vk2iOmVVdbwbad1uvRa7VDMj8 Listing 5.4: The Base64 Unique representation of a block address factory The factory package provides classes for the dynamic reconstruction of serialised objects. By relying on the factory, one can receive a serialised item and reconstruct the associated object without knowledge of its original type. cryptography The cryptography package provides a set of classes and methods for performing cryp- tographic operations on any serialisable item. The functionalities provided range from hash functions via the OneWay class to symmetric and asymmetric cryptosys- tems through the SecretKey, KeyPair, PublicKey, PrivateKey classes, among oth- ers. concurrency The concurrency package embeds common mechanisms such as Mutex, Condition, Semaphore, Thread. Interestingly though, these functionalities are seldom used because Infinit has been designed to rely on the event [Maz01, LCG07] programming paradigm, as opposed to the multithreaded model. The multithreaded paradigm relies on multiple threads operating in parallel and synchronising whenever necessary. On the other hand, the event paradigm relies on components processing events as they occur, potentially generating other events which, in turn, would be processed whenever possible. Research [AHT+02] showed that, although both models suffer pitfalls, the best of both worlds is achievable through automatic stack management. The event programming paradigm is well-known for simplifying concurrency while reducing opportunities for race conditions and deadlocks. However, the model also CHAPTER 5. IMPLEMENTATION 129 implies the event process’ local statelessness. Indeed, should an event process depend on some other inputs, an event would be generated, leading to another computa- tion and so on. Therefore, although very powerful, event-driven programming can become cumbersome as one must manually store information in the global state until an event is received in order to carry on a previously started logical unit of computation. Adya et al. [AHT+02] showed that a programming style exists which benefits from the event-driven paradigm’s advantages without the burden of manual state management. The Elle library incorporates such a continuation [MI09] mechanism through the Fiber class. A fiber is a lightweight thread of execution as it solely consists of a stack and a set of registers. Although fibers implement the event paradigm, such objects can also block while maintaining a local state. Whenever a fiber stops, another waiting fiber is scheduled by switching back on its stack as well as restoring its set of registers. Once the event necessary for the blocked fiber to continue is received, it can be re-scheduled, hence restoring its local state. The notion of fiber is extremely powerful because developers do not have to change their habits regarding state management. Listing 5.5 provides an example of fibers requiring another resource in order to continue. Noteworthy is that both fibers, represented by the functions Fiber1() and Fiber2(), evolve within their own environment such that the value of the local variable i is maintained during the entire fiber’s lifespan. 1 Timer Timer1 ; 2 Timer Timer2 ; 4 Resource ResourceA ; 5 Resource ResourceB ; 7 Status Main ( const Natural32 argc , 8 const Character* argv [ ] ) 9 { 10 Callback<> fiber1(&Fiber1 ) ; 11 Callback<> fiber2(&Fiber2 ) ; 13 // c r e a t e and s t a r t the timer1 , launching the f i b e r 1 . 14 i f ( Timer1 . Create ( Timer : : ModeSingle , fiber1 ) == StatusError ) 15 escape ( ’ unable to c r e a t e the t imer ’ ) ; 17 Timer1 . Start ( 1 0 0 ) ; 19 // c r e a t e and s t a r t the timer2 , launching the f i b e r 2 . 20 i f ( Timer2 . Create ( Timer : : ModeSingle , fiber2 ) == StatusError ) 21 escape ( ’ unable to c r e a t e the t imer ’ ) ; 23 Timer2 . Start ( 1 000 ) ; CHAPTER 5. IMPLEMENTATION 130 25 re turn ( StatusOk ) ; 26 } 28 Status Fiber1 ( ) 29 { 30 Natural32 i = 42 ; 32 // wait f o r ResourceA . 33 i f ( Fiber : : Wait(&ResourceA ) == StatusError ) 34 escape ( ’ unable to wait f o r the r e sou r c e ’ ) ; 36 // awaken ResourceB . 37 i f ( Fiber : : Awaken(&ResourceB ) == StatusError ) 38 escape ( ’ unable to awaken the r e sou r c e ’ ) ; 40 re turn ( StatusOk ) ; 41 } 43 Status Fiber2 ( ) 44 { 45 Natural32 i = 21 ; 47 // awaken ResourceA . 48 i f ( Fiber : : Awaken(&ResourceA ) == StatusError ) 49 escape ( ’ unable to awaken the r e sou r c e ’ ) ; 51 // wait f o r ResourceB . 52 i f ( Fiber : : Wait(&ResourceB ) == StatusError ) 53 escape ( ’ unable to wait f o r the r e sou r c e ’ ) ; 55 re turn ( StatusOk ) ; 56 } Listing 5.5: An illustration of fibers network The network package contains high-level network functionalities from local syn- chronous communication mechanisms via the Door and Lane classes to remote asyn- chronous communication routines through Slot, Gate and Bridge. Noteworthy is that every network message is identified by a Tag which specifies the type of the message. The Tag is included in a Header which is followed by the actual data. The Header also includes an Event number which can be used to link a response to a previously sent request for instance. The package is especially interesting for developing networking components because CHAPTER 5. IMPLEMENTATION 131 of its template-based classes and methods. As such, a component willing to send and/or receive messages must declare the associated types. This process can easily be achieved through the macro-functions outward() and inward(). Listing 5.6 illustrates the Agent ’s message definitions, such definitions being usually located in a file referred to as the component’s manifest. 1 inward ( agent : : TagDecrypt , 2 parameters ( elle : : Code ) ) ; 3 outward ( agent : : TagDecrypted , 4 parameters ( elle : : Clear ) ) ; 6 inward ( agent : : TagSign , 7 parameters ( elle : : Plain ) ) ; 8 outward ( agent : : TagSigned , 9 parameters ( elle : : Signature ) ) ; Listing 5.6: The message definition process In addition, a component willing to receive a message must register the tag cor- responding to the message of interest and associate it with the callback to trigger should such a message be received. Listing 5.7 illustrates this straightforward pro- cess for the Agent ’s messages defined above. 1 elle : : Callback decrypt(&Agent : : Decrypt ) ; 2 elle : : Callback sign(&Agent : : Sign ) ; 4 // r e g i s t e r the decrypt message . 5 i f ( elle : : Network : : Register(decrypt ) == elle : : StatusError ) 6 escape ( ’ unable to r e g i s t e r the decrypt ca l l ba ck ’ ) ; 8 // r e g i s t e r the s i gn message . 9 i f ( elle : : Network : : Register(sign ) == elle : : StatusError ) 10 escape ( ’ unable to r e g i s t e r the s i gn ca l l ba ck ’ ) ; Listing 5.7: The message registration process Noteworthy is that whenever a message is received, a fiber is spawned in which the associated callback is triggered. Therefore, the message handler can behave according to the fiber’s specifics, such as waiting for a resource or spawning a sub- fiber for instance. util The util package provides miscellaneous functionalities including format classes such as Base64 and Hexadecimal but also time manipulation through Time, arguments parser via the Parser class etc. CHAPTER 5. IMPLEMENTATION 132 5.2.2 Lune Lune is a small library which provides functionalities for manipulating locally stored information such as the user’s key pairs, the addresses of the multiple Infinit net- works the user’s has been authorised to connect to etc. 5.2.3 PIG PIG (POSIX/Infinit Gateway) is an application which, through FUSE [FUS], re- ceives every system call related to an Infinit file system. The objective of PIG is to translate these POSIX system calls into requests complying with Etoile’s API (Application Programming Interface). Listing 5.8 illustrates PIG ’s internals through the rmdir() FUSE call. The function starts by loading both the parent and target directories, hence retrieving two context identifiers. The directory entry pointing to the target sub-directory is then deleted. Finally, the modifications on the parent directory are commited while the target sub-directory is destroyed. 1 i n t PIG : : Rmdir ( const char * path ) 2 { 3 etoile : : path : : Slice name ; 4 etoile : : path : : Way child ( path ) ; 5 etoile : : path : : Way parent ( child , name ) ; 6 etoile : : context : : Identifier directory ; 7 etoile : : context : : Identifier subdirectory ; 9 // load the d i r e c t o r y . 10 i f ( PIG : : Channel . Call ( 11 elle : : Inputs(parent ) , 12 elle : : Outputs(directory ) ) == 13 elle : : StatusError ) 14 error ( ENOENT ) ; 16 // load the subd i r e c to ry . 17 i f ( PIG : : Channel . Call ( 18 elle : : Inputs(child ) , 19 elle : : Outputs(subdirectory ) ) == 20 elle : : StatusError ) 21 error ( ENOENT ) ; 23 // remove the entry . 24 i f ( PIG : : Channel . Call ( 25 elle : : Inputs(directory , name ) , 26 elle : : Outputs()) == elle : : StatusError ) 27 error ( EACCES ) ; CHAPTER 5. IMPLEMENTATION 133 29 // s t o r e the d i r e c t o r y . 30 i f ( PIG : : Channel . Call ( 31 elle : : Inputs(directory ) , 32 elle : : Outputs()) == elle : : StatusError ) 33 error ( EINTR ) ; 35 // des t roy the subd i r e c to ry . 36 i f ( PIG : : Channel . Call ( 37 elle : : Inputs(subdirectory ) , 38 elle : : Outputs()) == elle : : StatusError ) 39 error ( EINTR ) ; 41 re turn ( 0 ) ; 42 } Listing 5.8: The PIG ’s rmdir() POSIX system call Note that although most FUSE calls are converted into Infinit requests, some are ignored, such as link(), while the semantics of others are slightly modified in order to comply with Infinit. For instance the getattr() FUSE call is invoked in order to retrieve information on the object referenced by the given path. Such information ranges from the object’s mode, to the owner’s UID and GID, to the data size and so on. Unfortunately, the system cannot translate an Infinit user/group identifier into a UNIX UID/GUID, respectively. The system has to be provided with a set of pre-defined mappings between Infinit and UNIX identifiers. The local file $HOME/.infinit/map.asct provides such a mapping. However, should the system fail to translate an Infinit identifier, PIG would rely on a special entity referred to as somebody. The somebody UNIX user and group are therefore attributed to any file system object whose owner identity has not been linked to a UNIX entity. 5.2.4 Agent The Agent component aims at performing cryptographic operations on behalf of a specific user. This component has been introduced in order to isolate the sensitve user’s key pair from other potentially more vulnerable components. Noteworthy is that this data isolation process is similarly used in SSH (Secure Shell) through its so-called SSH agent. Whenever Etoile receives a request that requires performing a cryptographic oper- ation, the component contacts the user’s Agent and requests the operation to be performed. That way, the user’s keys never leave its Agent whose source code, being extremely small, is easier to maintain and secure than Etoile or Hole for instance. CHAPTER 5. IMPLEMENTATION 134 The Agent component starts by connecting and identifying itself to the Etoile com- ponent. The Etoile component then challenges the Agent regarding the identity of the user by sending a phrase encrypted with the user’s public key. The Agent decrypts the phrase and sends back a hash of the phrase, hence proving that it legitimally operates on behalf of the claimed user. Besides, the phrase is stored in the local file $HOME/.infinit/$USER.phr so that the user’s applications, including PIG, can authenticate to Etoile by providing this phrase. As discussed in Section 5.2.1, the Agent exports an interface composed of function- alities for decrypting and signing data with the user’s private key. Listings 5.9 & 5.10 provide the source code for these operations. 1 elle : : Status Agent : : Decrypt ( const elle : : Code& code ) 2 { 3 elle : : Clear clear ; 5 // perform the cryptograph ic opera t i on . 6 i f ( Agent : : Pair . k . Decrypt (code , clear ) == elle : : StatusError ) 7 escape ( ’ unable to perform the decrypt ion ’ ) ; 9 // rep ly to the c a l l e r . 10 i f ( Agent : : Channel−>Reply ( 11 elle : : Inputs(clear ) ) == elle : : StatusError ) 12 escape ( ’ unable to r ep ly to the c a l l e r ’ ) ; 14 re turn ( StatusOk ) ; 15 } Listing 5.9: The Agent::Decrypt() method 1 elle : : Status Agent : : Sign ( const elle : : Plain& plain ) 2 { 3 elle : : Signature signature ; 5 // perform the cryptograph ic opera t i on . 6 i f ( Agent : : Pair . k . Sign ( plain , signature ) == elle : : StatusError ) 7 escape ( ’ unable to perform the s i gna tu r e ’ ) ; 9 // rep ly to the c a l l e r . 10 i f ( Agent : : Channel−>Reply ( 11 elle : : Inputs(signature ) ) == elle : : StatusError ) 12 escape ( ’ unable to r ep ly to the c a l l e r ’ ) ; 14 re turn ( StatusOk ) ; 15 } Listing 5.10: The Agent::Sign() method CHAPTER 5. IMPLEMENTATION 135 components Directory File Object User System Journal Community Link Group kernel ObjectPublicKeyBlock Access User Catalog Group Data ReferenceContentHashBlock wall depot context path user Etoile journal proc proc u system process user−specific process authenticated channel intra−Etoile calls PIG user application user user agent hole Figure 5.6: The internals of the Etoile component 5.2.5 Etoile The Etoile component implements the Infinit file system API. Following Infinit ’s architecture and as shown in Figure 5.6, Etoile is partitioned into logical units such as wall, path, context, kernel etc. which are discussed below. wall The wall unit exports Etoile’s interface such that every incoming message generates an event triggering a method located in this unit. Listing 5.11 provides some of CHAPTER 5. IMPLEMENTATION 136 Etoile’s message definitions while Listing 5.12 illustrates the related event handlers. The reader will notice that these definitions correspond to the messages sent by PIG in Listing 5.8. Whenever a request is received, wall retrieves the context associated with the given context identifier. The request is then passed to the components unit. Finally, a response is sent back to the requesting client. 1 // d i r e c t o r y 2 inward ( etoile : : TagDirectoryCreate , 3 parameters ( ) ) ; 4 inward ( etoile : : TagDirectoryLoad , 5 parameters ( etoile : : path : : Way ) ) ; 6 inward ( etoile : : TagDirectoryAdd , 7 parameters ( etoile : : context : : Identifier , 8 etoile : : path : : Slice , 9 etoile : : context : : Identifier ) ) ; 10 inward ( etoile : : TagDirectoryLookup , 11 parameters ( etoile : : context : : Identifier , 12 etoile : : path : : Slice ) ) ; 13 outward ( etoile : : TagDirectoryEntry , 14 parameters ( etoile : : kernel : : Entry ) ) ; 15 inward ( etoile : : TagDirectoryConsult , 16 parameters ( etoile : : context : : Identifier , 17 etoile : : kernel : : Index , 18 etoile : : kernel : : Size ) ) ; 19 outward ( etoile : : TagDirectoryRange , 20 parameters ( etoile : : kernel : : Range)) ; 21 inward ( etoile : : TagDirectoryRename , 22 parameters ( etoile : : context : : Identifier , 23 etoile : : path : : Slice , 24 etoile : : path : : Slice ) ) ; 25 inward ( etoile : : TagDirectoryRemove , 26 parameters ( etoile : : context : : Identifier , 27 etoile : : path : : Slice ) ) ; 28 inward ( etoile : : TagDirectoryDiscard , 29 parameters ( etoile : : context : : Identifier ) ) ; 30 inward ( etoile : : TagDirectoryStore , 31 parameters ( etoile : : context : : Identifier ) ) ; 32 inward ( etoile : : TagDirectoryDestroy , 33 parameters ( etoile : : context : : Identifier ) ) ; Listing 5.11: Etoile’s wall message definitions for directory objects 1 c l a s s Directory 2 { 3 pub l i c : 4 s t a t i c elle : : Status Create ( ) ; CHAPTER 5. IMPLEMENTATION 137 5 s t a t i c elle : : Status Load ( const path : : Way&); 6 s t a t i c elle : : Status Lock ( const context : : Identifier&); 7 s t a t i c elle : : Status Release ( const context : : Identifier&); 8 s t a t i c elle : : Status Add ( const context : : Identifier&, 9 const path : : Slice&, 10 const context : : Identifier&); 11 s t a t i c elle : : Status Lookup ( const context : : Identifier&, 12 const path : : Slice&); 13 s t a t i c elle : : Status Consult ( const context : : Identifier&, 14 const kernel : : Index&, 15 const kernel : : Size&); 16 s t a t i c elle : : Status Rename ( const context : : Identifier&, 17 const path : : Slice&, 18 const path : : Slice&); 19 s t a t i c elle : : Status Remove ( const context : : Identifier&, 20 const path : : Slice&); 21 s t a t i c elle : : Status Discard ( const context : : Identifier&); 22 s t a t i c elle : : Status Store ( const context : : Identifier&); 23 s t a t i c elle : : Status Destroy ( const context : : Identifier&); 24 } ; Listing 5.12: Etoile’s wall handler definitions for directory objects context Depending on the request, a context is either created, loaded, stored or discarded. For example, a call such as Directory::Load() creates a context while Direc- tory::Remove() retrieves an existing context according to the given Identifier. The notion of context has been introduced in order to optimise the system’s perfor- mance by delaying expensive operations. As detailed in Chapter 4, an Object logical block contains several signatures. These signatures are computed in order to seal the object before being stored in the peer-to-peer network. However, considering a file system devoid of the notion of context, signatures would be re-computed following every call to the write() POSIX function. One can easily imagine many scenarios such as copying files leading to a large number of signatures being unnecessarily computed. Contexts, on the other hand, enable an application to perform a set of operations on an object before commiting the modifications, in which case the object is sealed and finally stored in the distributed hash table. The context unit contains functionalities for manipulating such contexts. components The components unit provides a very similar interface to the wall ’s, carrying on operations on a given context. The objective of the components unit is to maintain CHAPTER 5. IMPLEMENTATION 138 consistency between the blocks composing an object. Indeed, while the kernel unit provides functionalities at the block level, the components unit ensures that, should the Access or Contents blocks be modified for instance, the Object is updated ac- cordingly. Listing 5.13 illustrates the Remove() method for a directory object. This method takes the context of the directory object along with the name of the entry to remove. First, the requesting client’s rights are determined and checked. Then, the contents is opened by potentially fetching additional blocks. Finally, the Contents and Catalog blocks are modified by requesting the kernel unit. 1 elle : : Status Directory : : Remove ( context : : Directory* context , 2 const path : : Slice& name ) 3 { 4 // compute the cur rent user ’ s r i g h t s on the context . 5 i f ( Rights : : Determine ( context ) == elle : : StatusError ) 6 escape ( ’ unable to determine the r i g h t s ’ ) ; 8 // check the user ’ s r i g h t s accord ing to the opera t i on . 9 i f ( ! ( context−>rights−>record . permissions & kernel : : PermissionWrite ) ) 10 escape ( ’ unable to perform the opera t i on without the permis s ion ’ ) ; 12 // open the contents . 13 i f ( Contents : : Open ( context ) == elle : : StatusError ) 14 escape ( ’ unable to open the contents ’ ) ; 16 // remove the entry from the d i r e c t o r y contents . 17 i f ( context−>contents−>content−>Remove ( name ) == elle : : StatusError ) 18 escape ( ’ unable to remove the d i r e c t o r y entry ’ ) ; 20 re turn ( StatusOk ) ; 21 } Listing 5.13: The components unit’s Directory::Remove() method path The path unit provides methods for manipulating paths such as resolving a path into an object’s address. user The user unit contains information regarding the clients connected to Etoile. A client is composed of an agent and a set of applications such as PIG. Whenever the connection with the client’s Agent is broken, all the applications are notified and their connection discarded. CHAPTER 5. IMPLEMENTATION 139 kernel The kernel unit contains the internal representations of the fundamental blocks. The unit includes classes such as Object, Access, Contents, Data, Reference, User, Group, ContentHashBlock among others. Listing 5.14 depicts the internal representation of an Object logical block through its Object class. The reader will notice that this structure closely resembles the OKB -based Object logical block depicted on Figure 4.6. 1 c l a s s Object : 2 pub l i c OwnerKeyBlock 3 { 4 Author author ; 6 s t r u c t 7 { 8 s t r u c t 9 { 10 Permissions permissions ; 11 Token token ; 12 } owner ; 14 Genre genre ; 15 elle : : Time stamp ; 17 Attributes attributes ; 19 hole : : Address access ; 21 Version version ; 22 elle : : Signature signature ; 23 } meta ; 25 s t r u c t 26 { 27 hole : : Address contents ; 29 Size size ; 30 elle : : Time stamp ; 32 elle : : Digest fingerprint ; 34 Version version ; 35 elle : : Signature signature ; 36 } data ; 37 } ; Listing 5.14: The Object class CHAPTER 5. IMPLEMENTATION 140 depot The depot unit provides an abstraction for storing and retrieving blocks, indepen- dently of their location. Indeed, although the depot may, most of the time, rely on the Hole component in order to retrieve data from the underlying distributed hash table, the block could also lie in one of Etoile’s caches, either in the depot ’s cache, referred to as the repository, or in the journal ’s. Note that a block located in the repository may actually be stored in either the cache i.e. in main memory or in the reserve i.e. on the local hard disk. Therefore, whenever a block is requested by the components unit, the depot is invoked. The unit then sequentially inspects the journal, the cache, the reserve and, at last, sends a request to the Hole component. journal Whenever the modifications applied to a context are requested to be commited, the context is pushed into the journal. As discussed throughout Chapter 4, updates are delayed in order to optimise the system’s performances since blocks could be subsequently modified or even destroyed. Then, on a periodic basis, the blocks are stored in the distributed hash table by requesting the depot ’s Put() method. 5.2.6 Hole The Hole component provides an abstraction for storing, retrieving and deleting a Block associated with an Address. The interface exported by the component is composed of the methods listed in Listing 5.15. Note that this interface closely resembles the one defined in Section 3.4. 1 c l a s s Hole 2 { 3 pub l i c : 4 s t a t i c elle : : Status Put ( const Address&, 5 const Block * ) ; 6 s t a t i c elle : : Status Get ( const Address&, 7 Block *&); 8 s t a t i c elle : : Status Erase ( const Address&); 9 } ; Listing 5.15: The Hole component’s interface Noteworthy is that an Address contains a header composed of (i) a family represent- ing the physical block (ii) a component representing the logical block and (iii) the hash of the creating user’s public key. This particular construct prevents different blocks’ addresses from conflicting. CHAPTER 5. IMPLEMENTATION 141 Since this interface is common [DZD+03] to most DHT s, the implementation can easily be changed, even at run-time. The Infinit prototype is currently composed of the Hole implementations discussed next. local The local implementation stores the data blocks locally. This implementation, though benefiting from extreme simplicity, obviously lacks fundamental properties such as availability, durability, scalability etc. remote The remote implementation stores the data blocks on a unique remote server. As for local, the remote implementation suffers from fundamental limitations. However both these implementations are useful regarding debugging. kool Finally, the kool implementation stores blocks in a distributed hash table, hence complying with the properties defined in Chapter 3. As the mobility property defined in Chapter 3 suggests, devices lacking the re- sources to contribute or even maintain the network’s state should be able to configure the distributed hash table accordingly. Structured overlay networks have been de- signed with scalability in mind. For instance, Chord [SMK+01] performs lookups in O ( log(η) ) hops while nodes are required to maintain a state composed of O ( log(η) ) entries, where η represents the number of nodes. Unfortunately, a node could find itself in a position where the network state it must maintain exceeds its storage capacity. kool follows the Kelips [GBL+03] design but extends it so that the DHT can be con- figured by specifying the degree of partitioning δ. This characteristic implies that, as δ increases, the routing path is lengthened while the network state to be maintained on every node is reduced. Therefore, while the routing complexity remains O ( 1 ) , the state complexity becomes O ( δ √ η ) . For instance, kool2 represents a Kelips network in which nodes are assigned to groups, kool3 relies on cube roots, hence dividing the space into a two-level hierarchy while kool4 goes one step further and partitions the identifier space into three layers. By relying on such a flexible DHT, Infinit can easily be set up on a variety of exotic devices. Mobile phones for instance would probably make use of kool4 in order to CHAPTER 5. IMPLEMENTATION 142 reduce the memory fingerprint while a desktop computer could afford to trade off memory consumption for faster lookups through kool2 or even kool1. Although such configurations could be applied to other overlay networks such as Chord [SMK+01], the Kelips [GBL+03] design has been chosen for its simplicity. Table 5.1 summarises the relations between the several configuration factors given a degree of partitioning δ, a network designed to support η nodes and a set of connected nodes ζ. The cardinality represents the size of every group, the number of groups in every level of the hierarchy as well as the hierarchy depth. The neighbourhood parameter specifies the number of connected nodes populating every group. This equation assumes the use of a consistent hash function for attributing nodes to groups in a uniform manner. Finally, the connectivity indicates the number of links a node must maintain (i) with every other node in its group i.e. the neighbourhood and (ii) with every other group belonging to the node’s hierarchical level plus every super-group in the hierarchy upper-level plus every super-super-group in the upper- upper-level and so on. Note that every link to groups, upper-groups, upper-upper- groups etc. are made redundant according to the ratio γd depending on the degree d. However, for the sake of simplicity the following assumes that γ = γd for every degree d. Cardinality κ = δ √ η Neighbourhood φ = ζ κδ−1 Connectivity λ = φ− 1 + δ−1∑ d=1 ( γd ( κ− 1)) Table 5.1: kool parameters Figure 5.7 depicts a network illustrating the relations between the parameters de- tailed above. This example shows how every group is populated by an average of φ = 2 nodes while every node maintains a total of λ = 5 links with the other groups and upper-groups. CHAPTER 5. IMPLEMENTATION 143 η = 27 δ = 3 ζ = 18 γ = 1 κ = 3 φ = 2 λ = 5 Parameters node second−level group first−level group Figure 5.7: An example of a kool3 network Table 5.2 provides formulas for computing the number of hops as well as the size of the network state nodes are expected to maintain. Note that the state is assumed to be composed of a table of 30-byte entries containing the target node’s IP (Internet Protocol) address. Hops δ − 1 State λ× 30B Table 5.2: kool formulas Finally, Table 5.3 compares three different kool configurations capable of supporting η = 1010 nodes. The network is assumed to be populated by ζ = 109 nodes while the connectivity redundancy ratio is set to γ = 3, hence improving resilience. This study shows that the state can be reduced from 8.86MB for kool2 to 384.72KB for kool3, representing a gain of 95.7% at the expense of a single additional hop. CHAPTER 5. IMPLEMENTATION 144 Network Configuration Parameters Hops State η = 1010 ζ = 109 γ = 3 kool2 δ = 2 κ = 105 φ = 104 λ = 309996 1 8.86MB kool3 δ = 3 κ = 2154 φ = 215 λ = 13132 2 384.72KB kool4 δ = 4 κ = 316 φ = 31 λ = 2865 3 83.95KB Table 5.3: Comparison of the kool configurations Although Kelips would incur a high background communication overhead to keep a 8.86MB state up-to-date, kool can be configured to trade off one additional hop against a state reduction. This flexibility appears extremely useful since resource- limited devices such as mobile phones should be able to maintain a 83.95KB state. Note that an overlay network such as Chord [SMK+01] could also benefit from such flexible configurations by modifying the base parameter for instance. ? ? ? The presented prototype follows the UFS representation consisting in expressing every file, directory and link through an inode. Given the decentralised nature of the Infinit peer-to-peer file system, inodes are not stored in the kernel or on a dedicated server but rather in the underlying distributed hash table; inodes are represented by the Object logical block which has been detailed throughout Chapter 4. Regarding the actual implementation of the Infinit prototype, the system has been broken down into independent units which communicate by sending messages, also referred to as IPC (Inter-Process Communication) [DMM08]. Finally, the Hole com- ponent which implements the storage layer interface offers several network imple- mentations which can be used according to the environment: topology, constraints etc. Chapter 6 Evaluation The administration scheme presented in Section 4.2 enables users to request special tasks. However, such operations may take several days to complete, depending on the availability and commitment of the users contributing to the system. Unfortu- nately, this particular aspect of the system cannot be evaluated in an unrealistic environment; a well-established production environment with hundreds of users is required to perform a long term analysis, especially regarding the impact of the users’ connectivity as well as the concurrency of administrative requests. Likewise, the access control scheme is particularly interesting because of its extreme flexibility. Unfortunately, this expressivity comes at the expense of an increased connectivity requirement. Although it has been shown in Section 4.1.7 that such a connectivity was realistically achievable, evaluating its impact would, once again, require a production environment populated by hundreds of users sharing files, cre- ating and managing groups and so forth. Since evaluating the qualitativeness of the file system through a simulation is incon- ceivable because reliance on human actions for validating administration requests, for example, and since setting up a production environment may take years to achieve, the evaluation of the Infinit design and implementation has been performed in sev- eral complementary environments. This chapter therefore discusses the details of this evaluation. Firstly, aspects of the underlying overlay network and distributed hash table are evaluated, such as the routing latency and the network scalability. Secondly, the peer-to-peer file system is evaluated both regarding its implementation and deployment especially considering the access control and administration mechanisms designed in Chapter 4. 145 CHAPTER 6. EVALUATION 146 6.1 Methodology This section discusses the environments, benchmarks and data sets this evaluation relies upon. 6.1.1 Environments The large-scale specificity of peer-to-peer file systems implies that experiments should be carried out in an appropriate environment i.e. a network populated by thousands of nodes and hundreds of users. Unfortunately, since setting up such a network is extremely complicated and time-consuming, researchers tend to rely on other tech- niques. This evaluation follows the common practice which consists in measuring the file system’s behaviour in several complementary environments, offering a trade-off between realism and scalability. The first environment, referred to as the simulated environment, makes use of the ns-3 [NSN] discrete-event network simulator. The simulator is run on a Linux -based computer with a dual-core 1.8Ghz microprocessor and 4GB of RAM (Random Ac- cess Memory). This environment is especially useful to simulate a large-scale net- work composed of several thousand nodes. Thus, both the evaluations of the overlay network and the distributed hash table take place in this environment. The network topology used by these benchmarks has been generated with the iNet Topology Gen- erator [iNe] in which the maximum latency between diametrically opposed nodes has been set to 300ms. The second environment, known as the realistic environment, is composed of 16 heterogeneous nodes located throughout Europe and the United States of America. This environment is used to evaluate the implementation of the Infinit file system, in a more realistic way, through the use of Andrew benchmarks. Finally, the third environment, referred to as the production environment, is com- posed of 2, 156 heterogeneous nodes populating a campus network split into two geographic sites. This network is used on a daily basis by five schools for approxi- mately 6, 000 users including students, teachers, staff etc. This environment is used to evaluate the behaviour of the Infinit file system in a production environment in order to validate the assumptions made regarding the connectivity requirements for both the access control and administration schemes. 6.1.2 Benchmarks This section presents the three types of benchmarks carried out on the system along with their processing methods and data sets. CHAPTER 6. EVALUATION 147 Firstly, the evaluation focuses on the overlay network and distributed hash table components. The experiments carried out on the kool implementation aim at eval- uating some characteristics of the underlying network such as the latency of the routing requests. In order to perform such experiments, the simulated environment is set up for hosting several thousand nodes. Then, a benchmark composed of a pre-generated set of requests is run, hence stressing the network’s behaviour. Secondly, the peer-to-peer file system implementation is evaluated and compared with NFS [Osa88] through the Andrew [HKM+88b] benchmark. This methodology makes Infinit comparable with many other file systems which have been evaluated in very similar conditions, including OceanStore [KBC+00], Ivy [MMGC02], Pangaea [SKKM02], FARSITE [ABC+02] and Pastis [mBPS05]. The Andrew [HKM+88b] benchmark aims at individually evaluating specific aspects of the file system. This benchmark is composed of several phases, each one dealing with a particular file system operation such as copying directories or creating files through a compilation process. The source code of the Andrew benchmark used throughout this evaluation is given in Listing 6.1. 1 #! /bin / sh 3 Prepare ( ) 4 { 5 echo ‘‘−−−[ Prepare ’ ’ 6 cd ‘ ‘ ${from }/ ’ ’ 7 directories=$ ( find . / −type d ) 8 time \ 9 ( f o r directory in ${directories} ; do 10 mkdir ‘ ‘ ${to}/${directory } ’ ’ 11 done ) >>andrew . log 12 } 14 Copy ( ) 15 { 16 echo ‘‘−−−[ Copy ’ ’ 17 cd ‘ ‘ ${from }/ ’ ’ 18 files=$ ( find . / −type f −or −type l ) 19 time \ 20 ( f o r file in ${files} ; do 21 cp −P ‘ ‘ ${file } ’ ’ ‘ ‘ ${to}/${file } ’ ’ 22 i f [ ${?} −ne 0 ] ; then e x i t 1 ; f i 23 done ) >>andrew . log 24 } 26 List ( ) 27 { 28 echo ‘‘−−−[ List ’ ’ CHAPTER 6. EVALUATION 148 29 time \ 30 (ls −Rla ‘ ‘ ${to } ’ ’ ) >>andrew . log 31 } 33 Search ( ) 34 { 35 echo ‘‘−−−[ Search ’ ’ 36 time \ 37 ( grep −R ‘ ‘ teton ’ ’ ‘ ‘ ${to } ’ ’ ) >>andrew . log 38 } 40 Compile ( ) 41 { 42 echo ‘‘−−−[ Compile ’ ’ 43 cd ‘ ‘ ${to }/ ’ ’ 44 time \ 45 ( . / configure && make ) >>andrew . log 46 } 48 i f [ ${#} −ne 2 ] ; then 49 echo ‘ ‘ [ usage ] andrew . sh {from} {to } ’ ’ 50 e x i t 0 51 f i 53 cd ‘ ‘ $ {1} ’ ’ 54 from=’ ’${PWD } ’ ’ 55 cd ‘ ‘ ${OLDPWD } ’ ’ 57 cd ‘ ‘ $ {2} ’ ’ 58 to=’ ’${PWD } ’ ’ 59 cd ‘ ‘ ${OLDPWD } ’ ’ 61 rm −f andrew . log 63 Prepare 64 Copy 65 List 66 Search 67 Compile Listing 6.1: The Andrew benchmark The first phase of the Andrew benchmark, referred to as Prepare in Listing 6.1, clones the hierarchy of directories of a given project. The second phase copies the files into the freshly created hierarchy. The List phase lists all the files and directories created so far, hence retrieving the attributes of every file system object. The fourth phase, known as Search, reads the content of every file by invoking the grep utility. CHAPTER 6. EVALUATION 149 Finally, the fifth phase launches the make command, hence compiling and linking the source files. This benchmark has been, and still is, widely used for evaluating file systems because every phase focuses on a specific aspect. For instance, the first and second phases illustrate the process of creating directories and files, respectively. Likewise, the third phase focuses on inspecting every object’s metadata information while the fourth retrieves the files’ content. The fifth phase however is more general as it includes all the previous operations: retrieving Make files’ attributes, reading source files and writing object files, among others. The Andrew benchmark therefore takes a project directory containing Make files and source files and measures the duration of every phase. In order to make this evalua- tion’s experiments as realistic as possible, the executions of the Andrew benchmark are based on the project directory of OpenSSL-1.0.0. This project is composed of approximately 900 source files and 200 header files while the compilation process generates around 700 object files. Thirdly, the Infinit file system is stressed in a production environment by replaying the operations recorded in a 3-week file system trace generated from a system in production. This benchmark is extremely interesting for measuring two things: the accessibility of the file system objects depending on the users’ connectivity and the conflicts generated by concurrent updates. In order to validate the assumptions defined in Chapter 4, the Infinit file system is evaluated by setting up an environment in which user and group entites have been created in order to best match the file system trace, especially when it comes to sharing and working cooperatively. The following gives some insights into the file system trace. Table 6.1 provides general information on the file system trace. This table indicates the number of files and user accounts being active, as opposed to the total number of users and files. Indeed, since the file system trace covers a 3-week period, only the users and files recorded in the trace are considered. Number of users 5, 932 Number of files 1, 339, 776, 000 Table 6.1: General information regarding the users and files Noteworthy is that the Infinit file system set up for this evaluation relies on the campus’ topology which implies that the nodes do not belong to the users as is common in open peer-to-peer networks. Instead, an instance of Hole, set up by the system administrators, runs on every computer so that every user must launch her CHAPTER 6. EVALUATION 150 own Infinit client in order to access her files. This specificity mainly implies two things regarding the design presented in Chapter 4. First, the nodes populating the network are highly available since always running; this is part of the campus policy. Therefore, the peer-to-peer network does not suffer from churn. Second, since the nodes do not belong to the users, the users’ Infinit instance is running as long as the user is logged in the campus network. This is very different from an open peer- to-peer network in which a user could have multiple Infinit instances running: one on her smartphone, another on her desktop computer and so on. This particularity implies that contacting a user—in order to request an operation or provide a key for instance—is far more complicated as the users’ connectivity is impacted. Average Node Availability 87.4% Average Daily User Uptime 5 hours, 48 minutes and 33 seconds Table 6.2: General information regarding the nodes and users connectivity Table 6.2 provides information regarding the average node availability and user uptime. The user uptime represents the period of time during which the user is logged in the campus network, daily. Note that the uptime is extremely high because only the users repreented in the file system trace are taken into account. This period directly impacts the availability of a user’s Infinit instance for responding to other users’ requests. The following discusses the interactions betweens user and group entities by analysing the sharing properties of files, directories etc., both in reading and writing. First, the sharing distribution is studied shedding light on how and with whom users share objects in reading. Figure 6.1 shows with how many users file system objects are being shared. This study confirms the fact that most objects, i.e. 79.9%, are kept private in which case, in an Infinit file system, no Access block would be referenced. This figure also shows that users tend to share with many users, up to twelve, with an unexpected peak for sixteen readers, though most shared objects, i.e. 99.8%, are shared with a single user. Note that the system files, which are accessible by any user, have been ignored from this evaluation. The second analysis focuses on file system objects that are writable by multiple users, implying some sort of a cooperative behaviour. Figure 6.2 shows that, as expected, the number of such “cooperative” objects is lower than the number of shared ones. It is however interesting to notice that these measures indicate that users tend not to cooperate, or at least not to rely on file systems access control mechanisms for doing so. Indeed, a further analysis showed that many users relied upon third-party applications such as Subversion, Git etc. for cooperative work, CHAPTER 6. EVALUATION 151 especially given the fact that the campus’ IT (Information Technology) department provides functionalities for setting up such tools. 1 10 100 1000 10000 100000 1e+06 1e+07 1e+08 1e+09 0 2 4 6 8 10 12 14 16 18 20 O b j e c t s Readers Sharing Figure 6.1: General information regarding the users’ sharing behaviours 1 10 100 1000 10000 0 1 2 3 4 5 6 7 8 9 10 O b j e c t s Writers Cooperation Figure 6.2: General information regarding the user’s cooperative behaviours CHAPTER 6. EVALUATION 152 This analysis of the file system trace provides enough information for setting up Infinit file system environments in which the evaluation of both the access control and administration mechanisms will take place. 6.1.3 Metrics The primary metric this evaluation is interested in relates to the overlay network, especially regarding the latency of the routing requests. Likewise, the distributed hash table’s benchmarks are intended to illustrate the complexity of the operations consisting of storing and retrieving blocks. As a second step, the Infinit file system implementation will be stressed, the evalu- ation focusing on the execution time of the various Andrew benchmark phases. Finally, the benchmarks taking place in the production environment concentrate on analysing the accessibility of the file system objects along with the concurrency of the updates. These metrics will be analysed through a two-phase process. First, an Infinit file system will be set up in order to study the access control mechanism. For this, all the file system’s Object blocks will be built upon OKBs. Then, another environment will be set up in order to examine the cooperative interactions through the use of TKB -based objects. Note that in order to ensure the results’ validity, every benchmark is run several times. 6.2 Results This section discusses various results, from the impact of the cryptosystems on the overall performance to the latency of the block retrieval process. Also, the Infinit file system’s behaviour is evaluated in more realistic environments, hence assessing the impact of the access control and administration schemes designed in Chapter 4. 6.2.1 Overlay Network The first set of experiments aims at validating the routing algorithm of the kool implementation, as presented in Section 5.2.6. This evaluation illustrates the per- formance of the Lookup(ι) routine in several kool configurations and depending on the number of nodes composing the peer-to-peer network. CHAPTER 6. EVALUATION 153 0 20 40 60 80 100 120 140 1 10 100 1000 10000 100000 L a t e n c y ( m s ) Nodes Lookup() kool2 kool3 kool4 Figure 6.3: The performance of the overlay network’s Lookup(ι) routine The results of Figure 6.3 confirm the design expectations of Section 5.2.6 i.e. the requests’ latency evolves according to the routing path’s length. Let us recall that the kool2 configuration performs lookups in a single hop while the kool3 and kool4 require two and three hops, respectively. Noteworthy is that the number of nodes has a small though noticeable impact on the routing complexity, probably because of the node failures which force the clients to re-send some messages. 6.2.2 Distributed Hash Table The multiple physical and logical blocks defined throughout Chapter 4 differ from one another depending on their function but also their mutability. The mutability property is especially important in regard to the performance of the underlying dis- tributed hash table. Indeed, as explained in Section 3.4, the Get(α) and Gather(α) routines exhibit radically different behaviours. While the Get(α) method can easily identify the block but also make extensive use of caching, the Gather(α) method is required to contact a quorum of nodes responsible for the given block. The following experiments concentrate on the distributed hash table, starting with the Get(α) routine. Note that for the sake of simplicity, the next experiments are carried out in a simulated network of 10, 000 nodes. CHAPTER 6. EVALUATION 154 0 100 200 300 400 500 kool2 kool3 kool4 L a t e n c y ( m s ) Configurations Get() Figure 6.4: The performance of the immutable-specific Get(α) routine Figure 6.4 provides the results of the experiments carried out on 4096-byte chunks of data. Let us recall that the Get(α) method can benefit from caching techniques because of the immutability property of the blocks associated with this routine. Note however that the cache has been disabled throughout these experiments. The results show that immutable blocks can be retrieved very efficiently from the dis- tributed hash table. This is especially interesting since most blocks composing the file system’s hierarchy are CHBs such as Access, Data, Contents, Catalog, Members etc. The next set of experiments focuses on the performance of the distributed hash table regarding mutable blocks. The benchmarks consist of a set of Gather(α) calls involving 4096-byte mutable blocks. These experiments illustrate the behaviour of the distributed hash table under several quorum configurations identified by their fault-tolerance factor γ i.e. the number of Byzantine nodes such a quorum can tolerate. For instance, γ = 3 refers to a quorum configuration in which every block is replicated on 10 storage nodes while every client willing to retrieve a block must assemble a read quorum composed of 7 of those nodes. Likewise, the γ = 9 quorum configuration consists of 28 storage nodes while clients must assemble a read quorum of 19 nodes. As expected and illustrated by Figure 6.5, the Gather(α) method is several times more expensive than its Get(α) counterpart since several block instances must be retrieved directly from the quorum of storage nodes. CHAPTER 6. EVALUATION 155 0 100 200 300 400 500 kool2 kool3 kool4 L a t e n c y ( m s ) Configurations Gather() γ = 2 γ = 3 γ = 5 γ = 9 Figure 6.5: The performance of the mutable-specific Gather(α) routine 0 100 200 300 400 500 kool2 kool3 kool4 L a t e n c y ( m s ) Configurations Put() γ = 2 γ = 3 γ = 5 γ = 9 Figure 6.6: The performance of the Put(α, β) routine Finally, the Put(α, β) routine is evaluated in a similar environment i.e. a network of 10, 000 nodes with 4096-byte blocks. Figure 6.6 shows the results related to this CHAPTER 6. EVALUATION 156 series of experiments. Note that the kool implementation does not send the block to every one of the write quorum’s nodes. Instead, a single instance is sent to one storage node while the others are sent a hash of the block along with the network address of the node with the complete block. This technique drastically improves the client’s performance though complicating the process on the server’s side. Note that this process could further be improved to behave in a way similar to Bittorrent [Coh03]’s dissemination protocol. These experiements show that both the overlay network and the distributed hash table behaves as expected i.e. in a scalable way. The following studies the perfor- mance of the distributed hash table when it comes to blocks composing an Infinit file system. Table 6.3 provides a summary of the blocks composing the Infinit peer-to-peer file system. For every logical block, the table provides (i) its median size and (ii) the la- tency of the process consisting in retrieving this particular block from the distributed hash table. Note that next to the latency are provided, between parentheses, the re- trieving and validation time. The retrieving time represents the sum of the network latencies involved in the fetching process. The validation time however is composed of the times spent verifying the integrity and authenticity of the block instances. Note that this experiment has been carried out in a kool2 network populated by 10, 000 nodes with a fault-tolerance factor γ = 3. Therefore, whenever a client as- sembles a read quorum, at least seven instances of the mutable block are transferred back to the client while the Validate(α, β) routine is invoked between four and seven times. It is also worth noting that this evaluation considers worst case scenarios. For exam- ple, the OKB -based Object logical block’s verification process depends on whether the author is the owner, a lord or a vassal. For the purpose of this evaluation, a vassal is assumed to be the author since representing the most expensive case i.e. the Access block must be fetched, the vouching lord’s permission must be checked and the voucher’s signature must be verified. Besides, every benchmark starts with empty caches although such optimisations would greatly improve the system’s per- formance. For instance, fetching the Access block accounts for nearly 64% of the verification process mentioned above. Note however that once the Access block has been fetched, the subsequent requests can benefit from the cache. Therefore, out of seven ValidateObject OKB (α, β), a single call will actually make a network request for the Object ’s associated Access block. The reader may also notice that TKB physical blocks are far larger than their OKB counterparts. As discussed throughout Section 4.2, TKBs embed both the public key of the knights composing the table as well as the votes of the knights having authorised the last modification. These cryptographic components are responsible CHAPTER 6. EVALUATION 157 for a large portion of the size of TKBs. Note that for the purpose of the evaluation, TKBs’ table is composed of 5 knights—the maximum number of writers measured from the file system trace, as illustrated by Figure 6.2—while embedding 3 votes i.e. a majority. Let us recall that the TKB -specific quorum algorithm differs from the one used for other mutable blocks. Indeed, while the size of the read quorum remains the same i.e. 2γ + 1, the client selects the proper block by identifying γ + 1 identical instances. The important aspect about this quorum algorithm is that a hash is applied onto every received instance in order to detect the identical elements. Then, once detected, the Validate TKB (α, β) routine is invoked in order to ensure the integrity and authenticity of the selected instance. Therefore, while the integrity and authenticity of every received instance of an OKB is validated, a TKB however is validated once. This insight is made clear in Table 6.3. While the validation process of OKB -based Groups takes 3.91 ms, it takes only 2.23 ms for the equivalent TKB -based Groups. Noteworthy is that this ratio is not respected for the Object logical block because the step consisting in fetching the Access block covers the rest of the validation process. Block Median Size Latency CHB Contents 4071 bytes 72.02 (72.010.01) ms Access 1045 bytes 56.08 (56.070.01) ms Data 3095 bytes 71.42 (71.410.01) ms Catalog 2278 bytes 63.24 (63.230.01) ms Reference 56 bytes 46.21 (46.200.01) ms Members 1181 bytes 56.23 (56.220.01) ms PKB User 367 bytes 102.63 (101.681.86) ms OKB Object 974 bytes 179.16 (119.8959.27) ms Group 517 bytes 129.04 (125.133.91) ms TKB Object 3551 bytes 340.43 (279.4261.01) ms Group 2177 bytes 198.42 (196.192.23) ms Table 6.3: An evaluation summary of the Infinit blocks CHAPTER 6. EVALUATION 158 6.2.3 File System Until now, the implementation has been evaluated through the Infinit network com- ponent only, i.e. Hole. This section evaluates the Infinit design and implementation by taking into account the whole system architecture composed of the several com- ponents presented in Chapter 5. 6.2.3.1 Implementation This section focuses on implementation aspects by discussing some design choices and analysing their impact on the performance. First, the cryptosystems must be studied in order to evaluate their impact on the system’s performance. Indeed, some cryptosystems perform some operations faster than others but at the expense of larger keys for instance. The Infinit file system prototype makes use of the cryptosystems which are listed in Table 6.4 along with their respective benchmarks. Note that every experiment has been performed on a randomly generated chunk of 4096 bytes of data. These algorithms have been chosen based on the study carried out by Busca [Bus07] which showed that these cryptosystems were the most beneficial to systems making use of mutable blocks such as PKBs and OKBs. One can notice that the generation process of the RSA (Rivest Shamir Adleman) asymmetric cryptosystem is several orders of magnitude more expensive than the other operations. Cryptosystem Operation Duration RSA1024 Generation 96.281 ms Encryption 0.315 ms Decryption 2.728 ms Signature 2.854 ms Verification 0.233 ms AES256 Generation 0.011 ms Encryption 0.072 ms Decryption 0.069 ms SHA256 Hash 0.005 ms Table 6.4: Performance of the Infinit ’s cryptosystems In addition, Table 6.5 provides the reader with the size of the principal cryptographic components whenever embedded in blocks such as the ones described throughout CHAPTER 6. EVALUATION 159 Chapter 4. These components are assumed to be issued from the cryptosystems described above i.e. RSA1024 and SHA256. Component Size Public Key 151 bytes Private Key 644 bytes Signature 128 bytes Hash 20 bytes Table 6.5: The size of the principal cryptographic components Given this information, one may wonder how cryptography impacts the file system, especially regarding the use of OKBs. Indeed, since Object blocks rely on the OKB physical block, whenever a file, directory or link is created, a RSA key pair is actually generated. This design decision is expected to drastically impact the system’s overall performance and must therefore be studied. 0 1000 2000 3000 4000 5000 Prepare Copy List Search Compile Total D u r a t i o n ( s ) Phases Initial Benchmark File System Logic Cryptography Network Communication Inter-Component Communication Figure 6.7: An initial benchmark with time phases Figure 6.7 shows the result of an initial Andrew benchmark carried out in the realistic environment. For each phase, the duration is split into four sections. The File System Logic represents the time spent processing the file system call within one CHAPTER 6. EVALUATION 160 of the Infinit components such as Etoile. The Inter-Component Communication represents the time spent sending messages between the Infinit components. The Network Communication and Cryptographic sections are self-explanatory. Although one may have thought that cryptography would have had an enormous im- pact on the file system performance, it appears that inter-component communication does much more harm. The Infinit implementation has therefore been re-worked in order to merge all the components into a single processing unit. In addition, the OKB physical construct has been replaced by another physical construct referred to as IB (Imprint Block) which is detailed next. IBα K owner timestamp salt data #2 version signature [...] #1 Figure 6.8: The representation of an IB Figure 6.8 depicts the IB internal organisation. This construct ensures that, for a given user, every block created has a unique address by applying a one-way func- tion on the tuple (Kowner, timestamp, salt). In addition, Algorithms 35, 36 and 37 illustrate an IB ’s set-up, seal and validation processes, respectively. Require: (Kuser, kuser), the user’s personal key pair 1. β.Kowner ← Kuser 2. β.timestamp← retrieve current timestamp 3. β.salt← generate random salt 4. β.data.version← 0 5. α← h(β.]1) 6. return α Algorithm 35: Setup IB (β) −→ α CHAPTER 6. EVALUATION 161 Require: (Kuser, kuser), the user’s personal key pair 1. β.data.signature← h(β.]2) ˜kuser Algorithm 36: Seal IB (α, β) 1. if α 6= h(β.]1) then 2. error “the address does not match the block” 3. end if 4. if β.data.signature ˜β.Kowner 6= h(β.]2) then 5. error “the data signature is invalid” 6. end if Algorithm 37: Validate IB (α, β) As the reader can notice, the IB construct is interesting because it removes the key pair generation but also simplifies the block verification process which now requires a single signature verification. 0 1000 2000 3000 4000 5000 Prepare Copy List Search Compile Total D u r a t i o n ( s ) Phases Refined Benchmark File System Logic Cryptography Network Communication Figure 6.9: A refined benchmark with time phases Figure 6.9 illustrates the performance gain resulting from the components merging along with the introduction of the Imprint Block. While the inter-components com- munication has completely been removed, the cryptography-specific optimisation by itself led to a 9.78% gain in overall performance. CHAPTER 6. EVALUATION 162 The following evaluates the Infinit file system through a set of experiments which aim at comparing Infinit, in several configurations, with NFS [Osa88]. Since similar systems have also performed such a comparative evaluation, the performance of Infinit can be indirectly compared with similar projects, especially Pastis [mBPS05]. Note that for this bencharmk, a couple of caching optimisations have been activated. The first one, known as “block cache”, is a common technique which consists in keeping a local copy of accessed blocks for some time in order to speed up future accesses. The second one, referred to as“path cache”, consists in caching the relations between logical paths such as /bin/ls and the address of the associated Object block. The impact of such an optimisation is quite important since the system no longer has to fetch the intermediate directory objects in order to resolve a recently accessed path. As shown in Figure 6.10, this experiment compares the performance of NFS against Infinit using three different Hole implementations, as detailed in Section 5.2.6. Note that the kool2 implementation relies on the realistic environment, as presented in Section 6.1.1, which is composed of 16 nodes located throughout the world. Also, the quorum algorithm has been configured to tolerate up to γ = 3 Byzantine nodes implying that every block is constantly replicated on 10 storage nodes. 0 500 1000 1500 2000 2500 3000 Prepare Copy List Search Compile Total D u r a t i o n ( s ) Phases Optimised Benchmark NFS Infinit [local] Infinit [remote] Infinit [kool2] Figure 6.10: Comparison between NFS and several Infinit configurations The results of this experiment show that Infinit generally performs well with execu- tion times approximately 1.8 times slower than NFS. By comparison, Ivy [MMGC02] CHAPTER 6. EVALUATION 163 is 2 to 3 times slower, depending on the configuration. Unfortunately, Pangaea [SKKM02], which exhibits better performance than NFS, cannot be considered for comparison because of only accessing local copies while not being designed to toler- ate Byzantine behaviours. Likewise, FARSITE [ABC+02] has been compared with NTFS (NT File System) from which it draws its features. Pastis [mBPS05] however did perform a complete analysis of its performance to which Infinit can be compared. Pastis performs slightly better than Infinit, between 1.4 and 1.9 slower than NFS, depending on the client’s consistency model. However, several configuration param- eters as well as system characteristics must be taken into account. Firstly, Pastis relies on Pastry [RD01a] which has a slightly longer routing path than kool2’s. This characteristic however favours Pastis regarding the experimental results. On the other hand, Infinit suffers from more communication due to a higher replication factor. Indeed, every block is replicated ten times in Infinit against four in Pastis. Also, Pastis relies on a single mutable block for representing file system objects while providing some means of access control. Infinit however stores access control information in a separate immutable block, the Access logical block. Thus, follow- ing every Object block retrieval, the Access block must be fetched as well, hence incurring additional network overhead. Finally, while Infinit ’s architecture has been realistically deployed through the use of FUSE [FUS], Pastis, although supporting FUSE as well, performed its evaluation by providing a specific Java [JAV] interface. These differences may account for Pastis’ slightly better performance results. 6.2.3.2 Design This section focuses on evaluating the Infinit file system in a production environment in order to analyse both the accessibility of the file system objects along with the concurrency of the updates. Accessibility In this first phase, the accessibility of the file system objects is analysed. More precisely, the aim of this evaluation is to study the file system access control mech- anism by measuring the rate of accesses successfully performed on files, directories etc. that have been shared in reading, as shown in Table 6.1. For that purpose, an Infinit file system environment is created based on the campus topology. Then, user entities are generated according to the file system trace. Un- fortunately, the group entities could not be generated that easily since, as for most centralised file systems, the trace’s groups are managed following the MAC access control policy, i.e. users are not granted the right to create or even manage groups in the system. CHAPTER 6. EVALUATION 164 Since Infinit complies to the opposite policy, i.e. DAC, the groups had to be gener- ated through another technique. For every file system object that has been shared in reading, a group is generated that includes all such users with read permission. Note that such a group generation process implies that, in the Infinit environment, every such file, directory etc. actually references groups only. In other words, such file system objects mainly reference vassals with only the object owner acting as lord. Although this decision represents the worst case scenario, given the Infinit design presented in Chapter 4, the system could not have decided automatically how many lords to create, or even which users to include as a lord. Noteworthy is that there is an exception when it comes to file system objects that are shared in reading with a single user other than the owner, in which case no group is created, i.e. the single user granted permission acts as a lord. In order to alleviate this extreme arrangement in which a group is created for every set of readers, several optimisations have been activated such as the proactive distri- bution of the key used for encrypting the Contents block. In addition, users acting as vassals can request the key from other vassals, should one of them be connected at that time. Figure 6.11 summarises the 156, 729 file system object accesses according to the entity requesting it. This figure shows that most accesses are actually performed by the owner along with lords. The large number of accesses performed by lords can probably be explained by the fact that 99.8% of the 267, 955, 200 shared file system objects are shared with a single user i.e. a lord. Entities Owner 62.0% Lord 33.4% Vassal 4.5% Figure 6.11: Summary of the accesses according to the entity CHAPTER 6. EVALUATION 165 Finally, Figure 6.12 focuses on the vassals’ accesses by analysing the file system’s state when such accesses are requested. As shown on the figure, it appears that 43.4% of the the accesses made by vassals could be performed because the vassal already had received the key or could contact the owner to retrieve it. This high number can probably be explained by social interactions where students, for example, actually seated next to each other, decide to share a file. Such scenarios might explain the availability of both parties during the object creation or the access. The figure also demonstrates the efficiency of the proactive distribution optimisation which enables 54.5% of the accesses to be performed by retrieving the key from another vassal which happened to be logged in when the object was created. Unfortunately however, 2% of the accesses could not be performed because of the unavailability of the owner and the impossibility to retrieve the key from another vassal. State Contacted Owner 32.0% Proactively Received Key 11.4% Contacted Other Vassal 54.5% Unable to Access 2.0% Figure 6.12: State of the accesses performed by the vassals Concurrency The second phase aims at analysing the update conflicts especially when it comes to the administration scheme described in Chapter 4. Let us recall that the administration mechanism requires file system objects to be associated with a set of users responsible for taking management decisions including the modification of the object’s content. As discussed in Section 4.2, although this design is extremely flexible, these special users, referred to as knights, may not be well-connected enough in order to handle the flow of requests generated by the object’s lords and vassals. CHAPTER 6. EVALUATION 166 For the purpose of this evaluation, an Infinit file system environment is created for which some file system objects will rely on the administration mechanism. Indeed, for every object being writable by multiple users, a TKB -based Object is created while all the writers are included in the table of knights. Note that such an organisa- tion represents the worst-case scenario. Indeed, in a deployed Infinit environment, users would likely elect another user to the grade of knight assuming this user is well-connected and involved in the system. This condition is crucial for the admin- istration requests to be processed as quickly as possible. The following benchmark analyses every one of the 632 modification requests made to file system objects which have been shared with other writers, as summarised in Table 6.2. Figure 6.13 summarises the results by considering several time frames. A time frame indicates how much time a knight needs to be connected in order to process a request i.e. issue a vote. For instance, the shortest time frame considered is 1 minute, which indicates that a user acting as a knight will take, on average, 1 minute to issue her vote. Such time frames emulate the fact that users may take some time to (i) notice the fact that a request has been made and (ii) consider the request and vote. The figure shows that as the time frame increases, consensus take less time to be reached. 0 50 100 150 200 250 300 350 400 450 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 A v e r a g e C o n s e n s u s P e r i o d ( m ) N u m b e r o f C o n f l i c t s Time Frame (m) Concurrency Consensus Conflicts Figure 6.13: Concurrency analysis depending on the knights’ reactivity In addition, the figure depicts the number of conflicts detected. A conflict can be identified whenever an object’s knights have not yet reached consensus on a request CHAPTER 6. EVALUATION 167 while a new request is issued. The reader can notice that the number of such conflicts is quite low: 8. It is fair to assume that users relying on the file system’s access control mechanism do not intend to update files in a concurrent manner. For that purpose, users tend to rely on specific tools such as Subversion, Git etc. as further analyses confirm. The number of such conflicts appears to decrease as consensus is reached more quickly. Note however that a conflict remains anyway. A detailed analysis showed that this conflict comes from users updating, a few seconds apart, the same directory by creating different files. Unfortunately, the knights do not have the time to vote and reach consensus within such an extremely short period of time. This conflict illustrates the core drawback of the given administration scheme. ? ? ? To conclude, although Infinit exhibits decent performance results, one must be re- minded that many components have been implemented in their simplest form pos- sible. In addition, a number of optimisations could be applied in order to decrease the impact of some of the design’s characteristics. Beside optimisations that could be applied both from the design and implementation perspectives, interesting con- cepts could also be borrowed from similar systems. For example, Plutus [KRS+03] introduced the idea of filegroups so that objects sharing similar access control rules could be optimised in several ways: (i) fewer keys are generated since shared among the objects of the same filegroup and, in turn, (ii) less space is used for metadata while the system can rely further on caching techniques. Therefore, although Infinit performs generally well, many additional design and implementation improvements could be made in order to bring its performance closer to those of NFS [Osa88]. The analysis of a file system trace proved that the Infinit design presented in Chapter 4 could actually work in a production environment. The access control mechanism proved itself efficient given extreme configurations including the worst-case scenario in which object owners grant access to users solely through the use of group en- tities. The benchmark illustrated that the extremely small number of lords along with their unpredictable connectivity could be overcome by applying several optimi- sations such as the proactive distribution of the key used for encrypting the object’s content. Although the analysis revealed that some accesses could not be performed due to the unavailability of users from which to retrieve the key, it is fair to assume that an Infinit production environment would behave differently. For instance, by CHAPTER 6. EVALUATION 168 relying on an open peer-to-peer network, users would be more likely to be connected through one of their computing devices which in turn would alleviate the discussed accessibility issues. Although the access control mechanism deployment was proven successful, the ad- ministration scheme suffered from the knights’ reactivity to treat the incoming re- quests. This trait became critical when considering concurrent updates since, as de- scribed in Section 4.2, administrative requests cannot be treated in parallel. Should such a conflict occur, the user would have to delay her request and re-submit it later, which obviously could represent an important hurdle for users. Note however that the evaluation considered the worst-case scenario in which every object being writable by multiple users was considered a cooperative object. Indeed, in practice, it is very likely that the number of such objects would remain extremely small, the administration scheme being used for critical objects only. Chapter 7 Conclusion The peer-to-peer model has shown itself to be a powerful paradigm for the design of large-scale, adaptative and highly-resilient systems. Over the last decades, many peer-to-peer systems have been conceived offering services as diverse as telephone, video streaming or even file sharing through the popular Bittorrent [Coh03] applica- tion. On the other hand, the file abstraction provided by the hierarchical file system interface has become the common way for organising, naming and accessing digital data. As mentioned in Section 3.2, these paradigms can be combined in order to build a file system benefiting from the properties inherent to the peer-to-peer model such as scalability, fault-tolerance, durability, availability and so forth. Throughout this thesis, the author has shown it possible to design a modern storage system based on these paradigms in order to ensure a certain number of fundamental properties, defined in Section 3.1. Therefore, an access control and an administration scheme have been designed, paving the way for the implementation of such a system. Noteworthy is that this work has been based on several assumptions which could very well be challenged by other authors. Note however that such decisions have been justified in Section 3.4. The first assumption relates to the peer-to-peer environment which implies a number of properties such as the untrustworthiness of the computers populating the network, the decentralised and symmetric behaviour of those nodes as well as the required scalability of the underlying network protocols such as locating a node responsible for a given identifier. Besides, the network’s untrustworthiness and symmetry implies that the data blocks associated with an identifier must be self-certified so that anyone can distinguish a valid from an illegitimately forged block. Note however that the administration scheme presented in Section 4.2 introduced blocks, referred to as TKBs, which do not conform to these principles, thus violating both the symmetry and self-certification properties. The nodes and especially their connectivity to the peer-to-peer network constitutes 169 CHAPTER 7. CONCLUSION 170 another assumption. Section 4.1.2 showed that the fundamental properties were too constraining for an access control scheme to be designed in such an environment. Therefore, the connectivity requirement was loosened as thought to be the more flexible parameter. Most research interested in such aspects of peer-to-peer networks has focused on the nodes’ churn ratio. The access control scheme presented in this document relies on the connectivity of users which, in such a modern system, can be connected to the system through multiple computing devices being computers, mobile phones, tablets, netbooks etc. The third and final assumption stipulates that the distributed hash table upon which the presented file system relies should ensure consistency among the replicas through a quorum-based protocol. Although similar projects such as CFS [DKK+01] or Pastis [mBPS05] have also made this assumption, most Byzantine-fault-tolerant distributed systems tend to make use of agreement algorithms such as BFT [CL99] or Paxos [Lam98] because these algorithms provide far more flexibility than their quorum counterparts. On the other hand, quorum algorithms are well suited for distributed file systems because they rely on the basic operations consisting of read- ing and writing data items. As such, file systems do not require the underlying storage layer to provide advanced functionalities. All in all, quorum algorithms imply self-certification which in turn requires more cryptographic operations while agreement protocols require storage nodes to exchange more messages in order to achieve consensus. However, and as shown in Chapter 6, the cryptographic opera- tions account for an extremely small portion of the retrieval and storing processes, hence confirming the initial assumption regarding the better performance of quorum algorithms. The contributions of this work are threefold. First, the functionalities such a modern storage system should provide to end-users have been defined though some have intentionally been left for future work. In addition, the system’s properties such as untrustworthiness, decentralisation, symmetry, self-certification etc. have been inferred from the peer-to-peer file system’s paradigms. Second, the fundamental file system components such as file, directory, user, group etc. have been defined through the design of an access control and administration scheme. Unlike previous projects such as Chefs [Fu05], Plutus [KRS+03] and Pastis [mBPS05], the access control scheme has been designed for large-scale decentralised and untrustworthy environments while providing users with the means to express access control rules in a very flexible way. In addition, the administration scheme allows users to request administrative tasks while preventing a single user from taking complete control over the system. Third, a prototype implementation has been developed proving feasible the deployment of such a system. This prototype has been developed so as to provide a modular architecture enabling the user to set up the system according to its device’s hardware characteristics as well as the user’s preferences. CHAPTER 7. CONCLUSION 171 The evaluation carried out in Chapter 6 shows that Infinit performs generally well, especially compared to similar systems such as CFS [DKK+01], Ivy [MMGC02] and Pastis [mBPS05]. Interestingly, the access control mechanism has been proven efficient and robust especially in extreme environments with low connectivity. How- ever, a long-term analysis would still have to be performed in a large-scale realistic environment in order to validate the qualitative aspects of the system especially re- garding the administration scheme which suffers from the knights’ reactivity when it comes to concurrent updates. Finally, although this thesis provides the fundamental components for the imple- mentation of a large-scale decentralised and Byzantine-fault-tolerant storage system, some properties have been left as future work, such as anonymity and versioning, while other aspects have been voluntarily ignored including garbage collection, con- sistency models, advanced quorum algorithms, concurrency conflicts resolution and many more. Although every one of these topics has been tackled through other re- search projects, Infinit could not be considered complete without taking such design factors into account. List of Figures 1.1 A worldwide storage infrastructure . . . . . . . . . . . . . . . . . . . 4 2.1 A flat unstructured overlay network . . . . . . . . . . . . . . . . . . . 9 2.2 A two-level hybrid overlay network . . . . . . . . . . . . . . . . . . . 10 2.3 A ring-based structured overlay network . . . . . . . . . . . . . . . . 12 2.4 A Chord network of degree 5 with 17 nodes . . . . . . . . . . . . . . 15 2.5 A Kelips network for 36 nodes . . . . . . . . . . . . . . . . . . . . . . 17 2.6 A small-world-based social overlay network . . . . . . . . . . . . . . . 19 2.7 The replication-based DHash distributed hash table . . . . . . . . . . 22 2.8 The Paxos agreement protocol . . . . . . . . . . . . . . . . . . . . . . 23 2.9 Three Gifford quorum configurations . . . . . . . . . . . . . . . . . . 24 2.10 Pangaea file system representation . . . . . . . . . . . . . . . . . . . 27 2.11 OceanStore’s organisation . . . . . . . . . . . . . . . . . . . . . . . . 29 2.12 The FARSITE architecture . . . . . . . . . . . . . . . . . . . . . . . 30 2.13 The CFS hierarchical organisation . . . . . . . . . . . . . . . . . . . 32 2.14 The Ivy log-based representation . . . . . . . . . . . . . . . . . . . . 34 2.15 The Plutus ’ keys, locks and groups . . . . . . . . . . . . . . . . . . . 36 2.16 The Pastis organisation . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.1 A three-step representation of a symmetric quorum-based system . . 50 3.2 A three-step representation of an asymmetric quorum-based system . 51 4.1 The representation of a CHB . . . . . . . . . . . . . . . . . . . . . . 63 4.2 The representation of a PKB . . . . . . . . . . . . . . . . . . . . . . 64 4.3 The representation of an OKB . . . . . . . . . . . . . . . . . . . . . . 65 172 LIST OF FIGURES 173 4.4 The representation of a PKB -based User block . . . . . . . . . . . . 67 4.5 The representation of an OKB -based Group block . . . . . . . . . . . 69 4.6 The representation of an OKB -based Object block . . . . . . . . . . . 72 4.7 A graph showing the relations between PO and ρ . . . . . . . . . . . . 82 4.8 The representation of a TKB . . . . . . . . . . . . . . . . . . . . . . 94 4.9 A scenario illustrating the TKB -specific quorum algorithm . . . . . . 101 4.10 The representation of a TKB -based Object block . . . . . . . . . . . 103 4.11 The representation of an TKB -based Group block . . . . . . . . . . . 106 5.1 A file representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 5.2 A directory representation . . . . . . . . . . . . . . . . . . . . . . . . 122 5.3 A link representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 5.4 The Infinit system-wide hierarchical representation . . . . . . . . . . 124 5.5 The architecture of an Infinit node . . . . . . . . . . . . . . . . . . . 125 5.6 The internals of the Etoile component . . . . . . . . . . . . . . . . . 135 5.7 An example of a kool3 network . . . . . . . . . . . . . . . . . . . . . 143 6.1 General information regarding the users’ sharing behaviours . . . . . 151 6.2 General information regarding the user’s cooperative behaviours . . . 151 6.3 The performance of the overlay network’s Lookup(ι) routine . . . . . 153 6.4 The performance of the immutable-specific Get(α) routine . . . . . . 154 6.5 The performance of the mutable-specific Gather(α) routine . . . . . . 155 6.6 The performance of the Put(α, β) routine . . . . . . . . . . . . . . . . 155 6.7 An initial benchmark with time phases . . . . . . . . . . . . . . . . . 159 6.8 The representation of an IB . . . . . . . . . . . . . . . . . . . . . . . 160 6.9 A refined benchmark with time phases . . . . . . . . . . . . . . . . . 161 6.10 Comparison between NFS and several Infinit configurations . . . . . 162 6.11 Summary of the accesses according to the entity . . . . . . . . . . . . 164 6.12 State of the accesses performed by the vassals . . . . . . . . . . . . . 165 6.13 Concurrency analysis depending on the knights’ reactivity . . . . . . 166 List of Tables 4.1 A summary of the permissions in the file system . . . . . . . . . . . . 95 5.1 kool parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 5.2 kool formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 5.3 Comparison of the kool configurations . . . . . . . . . . . . . . . . . . 144 6.1 General information regarding the users and files . . . . . . . . . . . 149 6.2 General information regarding the nodes and users connectivity . . . 150 6.3 An evaluation summary of the Infinit blocks . . . . . . . . . . . . . . 157 6.4 Performance of the Infinit ’s cryptosystems . . . . . . . . . . . . . . . 158 6.5 The size of the principal cryptographic components . . . . . . . . . . 159 174 Listings 5.1 An example of hierarchical namespace . . . . . . . . . . . . . . . . . 123 5.2 The PublicKey::Serialize() method . . . . . . . . . . . . . . . . . 127 5.3 The KeyPair::Extract() method . . . . . . . . . . . . . . . . . . . 127 5.4 The Base64 Unique representation of a block address . . . . . . . . . 128 5.5 An illustration of fibers . . . . . . . . . . . . . . . . . . . . . . . . . . 129 5.6 The message definition process . . . . . . . . . . . . . . . . . . . . . . 131 5.7 The message registration process . . . . . . . . . . . . . . . . . . . . 131 5.8 The PIG ’s rmdir() POSIX system call . . . . . . . . . . . . . . . . . 132 5.9 The Agent::Decrypt() method . . . . . . . . . . . . . . . . . . . . . 134 5.10 The Agent::Sign() method . . . . . . . . . . . . . . . . . . . . . . . 134 5.11 Etoile’s wall message definitions for directory objects . . . . . . . . . 136 5.12 Etoile’s wall handler definitions for directory objects . . . . . . . . . 136 5.13 The components unit’s Directory::Remove() method . . . . . . . . 138 5.14 The Object class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.15 The Hole component’s interface . . . . . . . . . . . . . . . . . . . . . 140 6.1 The Andrew benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 147 175 List of Algorithms 1 Setup CHB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2 Seal CHB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3 Validate CHB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4 Setup PKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 5 Seal PKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 6 Validate PKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 7 Setup OKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 8 Seal OKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 9 Validate OKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 10 Setup User PKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 11 Seal User PKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 12 Validate User PKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 13 SetupObject OKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 14 SealObject[data] OKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 15 SealObject[meta] OKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 16 ValidateObject OKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 17 Govern(α, ψ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 18 Read(α, λ) −→ δ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 19 Write(α, λ, δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 20 Setup TKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 21 Seal TKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 22 Validate[client] TKB (α, β) — client side . . . . . . . . . . . . . . . . 96 23 Validate[server] TKB (α, β) — server side . . . . . . . . . . . . . . . . 100 24 SetupObject TKB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 25 SealObject[data] TKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 26 SealObject[meta] TKB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 27 Validate[client]Object TKB (α, β) — client side . . . . . . . . . . . . . . . . 105 28 Elect(α, θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 29 Govern(α, ψ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 30 Read(α, λ) −→ δ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 31 Write(α, λ, δ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 32 Manage(α, θ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 176 LIST OF ALGORITHMS 177 33 Edit(α, ψ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 34 Transfer(α, µ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 35 Setup IB (β) −→ α . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 36 Seal IB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 37 Validate IB (α, β) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 Bibliography [ABC+02] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ron- nie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), pages 1–14, 2002. [ACMR02] Sameer Ajmani, Dwaine E. Clarke, Chuang-Hue Moh, and Steven Rich- man. ConChord: Cooperative SDSI certificate storage and name res- olution. In In First International Workshop on Peer-to-Peer Systems, pages 141–154, 2002. [ADN+95] Thomas Anderson, Michael Dahlin, Jeanna Neefe, David P. Terson, Drew Roselli, and Randolph Wang. Serverless network file systems. In In Proceedings of the 15th Symposium on Operating System Principles. ACM, pages 109–126, Copper Mountain Resort, Colorado, December 1995. [AHT+02] A. Adya, J. Howell, M. Theimer, W. Bolosky, and J. Douceur. Coop- erative task management without manual stack management, 2002. [AJ92] Gagan Agrawal and Pankaj Jalote. An efficient protocol for voting in distributed systems. In International Conference on Distributed Com- puting Systems, pages 640–647, 1992. [AMN01] Michel Abdalla, Sara K. Miner, and Chanathip Namprempre. Forward- secure threshold signature schemes. In Proceedings of the 2001 Con- ference on Topics in Cryptology: The Cryptographer’s Track at RSA, CT-RSA 2001, pages 441–456, London, UK, UK, 2001. Springer-Verlag. [And04] Ken H. And. Small world overlay P2P networks, 2004. [App] http://www.me.com. 178 BIBLIOGRAPHY 179 [BBB+04] Jean-Michel Busca, Marin Bertier, Fatima Belkouch, Pierre Sens, and Luciana Arantes. A performance evaluation of a quorum-based state- machine replication algorithm for computing grids. In Proceedings of the 16th Symposium on Computer Architecture and High Performance Computing, pages 116–123, Washington, DC, USA, 2004. IEEE Com- puter Society. [BCK+09] Gal Badishi, Germano Caronni, Idit Keidar, Raphael Rom, and Glenn Scott. Deleting files in the celeste peer-to-peer storage system. Journal of Parallel and Distributed Computing, 69(7):613–622, July 2009. [BDET00] William J. Bolosky, John R. Douceur, David Ely, and Marvin Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. SIGMETRICS Perform. Eval. Rev., 28(1):34–43, June 2000. [BF08] Joa˜o Barreto and Paulo Ferreira. The obscure nature of epidemic quo- rum systems. In Proceedings of the 9th workshop on Mobile computing systems and applications, HotMobile ’08, pages 69–73, New York, NY, USA, 2008. ACM. [BJZH04] Ali R. Butt, Troy A. Johnson, Yili Zheng, and Charlie Y. Hu. Kosha: A Peer-to-Peer enhancement for the network file system. In The Inter- national Conference for High Performance Computing and Communi- cations (SC2004), page 51, November 2004. [BLV05] A. Blanc, Yi-Kai Liu, and A. Vahdat. Designing incentives for peer-to- peer routing. volume 1, pages 374–385 vol. 1, March 2005. [Box] http://www.box.net. [BS10] Fatemeh Borran and Andre´ Schiper. A leader-free byzantine consen- sus algorithm. In Proceedings of the 11th international conference on Distributed computing and networking, ICDCN’10, pages 67–78, Berlin, Heidelberg, 2010. Springer-Verlag. [BTC+04] Ranjita Bhagwan, Kiran Tati, Yu C. Cheng, Stefan Savage, and Geof- frey M. Voelker. Total recall: system support for automated availability management. In Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1, page 25, Berkeley, CA, USA, 2004. USENIX Association. [Bus07] Jean-Michel Busca. Pastis : Un Syste`me Pair a` Pair de Gestion de Fichiers. PhD thesis, Universite´ Pierre et Marie Curie, 2007. BIBLIOGRAPHY 180 [CCB07] James Cipar, Mark D. Corner, and Emery D. Berger. TFS: a transpar- ent file system for contributory storage. In FAST’07: Proceedings of the 5th conference on USENIX Conference on File and Storage Tech- nologies, page 28, Berkeley, CA, USA, 2007. USENIX Association. [CCR05] M. Castro, M. Costa, and A. Rowstron. Debunking some myths about structured and unstructured overlays. In 2nd Symposium on Networked Systems Design and Implementation (NSDI’05), pages 85–98, 2005. [CDG+02] Miguel Castro, Peter Druschel, Ayalvadi Ganesh, Antony Rowstron, and Dan S. Wallach. Secure routing for structured peer-to-peer overlay networks. SIGOPS Oper. Syst. Rev., 36(SI):299–314, 2002. [CFK03] Edith Cohen, Amos Fiat, and Haim Kaplan. A case for associative peer to peer overlays. SIGCOMM Comput. Commun. Rev., 33:95–100, January 2003. [CGKV08] Gregory Chockler, Rachid Guerraoui, Idit Keidar, and Marko Vukolic. Reliable distributed storage. IEEE Computer, 2008. [CGM02] Brian F. Cooper and Hector Garcia-Molina. Bidding for storage space in a peer-to-peer data preservation system. In Proceedings of the 22 nd International Conference on Distributed Computing Systems (ICDCS’02), ICDCS ’02, pages 372–, Washington, DC, USA, 2002. IEEE Computer Society. [CL99] Castro and Liskov. Practical byzantine fault tolerance. In OSDI: Sym- posium on Operating Systems Design and Implementation, pages 173– 186. USENIX Association, Co-sponsored by IEEE TCOS and ACM SIGOPS, 1999. [Cla83] Anne-Marie G. Claybrook. Directions in computer security. In Proceed- ings of the 1983 annual conference on Computers : Extending the hu- man resource, ACM ’83, pages 42–, New York, NY, USA, 1983. ACM. [CLL07] Chin-Chen Chang, Chih-Yang Lin, and Keng-Chu Lin. Simple effi- cient mutual anonymity protocols for peer-to-peer network based on primitive roots. J. Netw. Comput. Appl., 30:662–676, April 2007. [CN03] L. Cox and B. Noble. Samsara: Honor among thieves in Peer-to- Peer storage. ACM SIGOPS Operating Systems Review, 37(5):120–132, 2003. [Coh03] Bram Cohen. Incentives build robustness in BitTorrent, 2003. BIBLIOGRAPHY 181 [Com85] Douglas E. Comer. Domain names (panel session, abstract only): hi- erarchy in need of organization. SIGCOMM Comput. Commun. Rev., 15:72–, September 1985. [CRB+03] Yatin Chawathe, Sylvia Ratnasamy, Lee Breslau, Nick Lanham, and Scott Shenker. Making gnutella-like p2p systems scalable. In Pro- ceedings of the 2003 conference on Applications, technologies, archi- tectures, and protocols for computer communications, SIGCOMM ’03, pages 407–418, New York, NY, USA, 2003. ACM. [CRS05] Germano Caronni, Raphael Rom, and Glenn Scott. Maintaining object ordering in a shared p2p storage environment. In Proceedings of the Third IEEE International Security in Storage Workshop, pages 52–62, Washington, DC, USA, 2005. IEEE Computer Society. [CSP07] La´saro Jonas Camargos, Rodrigo Malta Schmidt, and Fernando Pe- done. Multicoordinated paxos. In Proceedings of the twenty-sixth an- nual ACM symposium on Principles of distributed computing, PODC ’07, pages 316–317, New York, NY, USA, 2007. ACM. [CSWH01] Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: a distributed anonymous information storage and retrieval system. In International workshop on Designing privacy enhancing technologies, pages 46–66. Springer-Verlag New York, Inc., 2001. [DAB+02] John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In ICDCS ’02: Proceedings of the 22 nd Inter- national Conference on Distributed Computing Systems (ICDCS’02), pages 617–624, Washington, DC, USA, 2002. IEEE Computer Society. [dALF10] Francisco de Asis Lopez-Fuentes. A routing scheme for content local- ization in peer-to-peer networks. In Proceedings of the 2010 IEEE Elec- tronics, Robotics and Automotive Mechanics Conference, CERMA ’10, pages 249–254, Washington, DC, USA, 2010. IEEE Computer Society. [DB99] John R. Douceur and William J. Bolosky. A large-scale study of file- system contents. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 59–70, New York, NY, USA, 1999. ACM. [DFM01] Roger Dingledine, Michael J. Freedman, and David Molnar. The free haven project: distributed anonymous storage service. In International BIBLIOGRAPHY 182 workshop on Designing privacy enhancing technologies, pages 67–95, New York, NY, USA, 2001. Springer-Verlag New York, Inc. [DGWR07] A. G. Dimakis, P. B. Godfrey, M. J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications. IEEE, pages 2000–2008, May 2007. [DHA03] Anwitaman Datta, Manfred Hauswirth, and Karl Aberer. Updates in highly unreliable, replicated Peer-to-Peer systems. In 23rd Interna- tional Conference on Distributed Computing Systems (ICDCS 2003), pages 76–87, Providence, Rhode Island, USA, May 2003. [DHJ+07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubra- manian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. SIGOPS Oper. Syst. Rev., 41:205–220, Octo- ber 2007. [DKK+01] Frank Dabek, Frans M. Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), Chateau Lake Louise, Banff, Canada, October 2001. [DLLKA05] George Danezis, Chris Lesniewski-Laas, Frans M. Kaashoek, and Ross Anderson. Sybil-Resistant DHT routing. pages 305–318. 2005. [DMM08] John Day, Ibrahim Matta, and Karim Mattar. Networking is ipc: a guiding principle to a better internet. In Proceedings of the 2008 ACM CoNEXT Conference, CoNEXT ’08, pages 67:1–67:6, New York, NY, USA, 2008. ACM. [DMS03] Roger Dingledine, Nick Mathewson, and Paul Syverson. Reputation in P2P anonymity systems, 2003. [DMS04] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The Second-Generation onion router. In Proceedings of the 13th USENIX Security Symposium, pages 303–320, San Diego, CA, USA, August 2004. [Dou02] John R. Douceur. The sybil attack. In Revised Papers from the First International Workshop on Peer-to-Peer Systems, IPTPS ’01, pages 251–260, London, UK, 2002. Springer-Verlag. [Dro] http://www.getdropbox.com. BIBLIOGRAPHY 183 [DW01] John R. Douceur and Roger P. Wattenhofer. Optimizing file availability in a secure serverless distributed file system. 2001. [DZD+03] F. Dabek, B. Zhao, P. Druschel, J. Kubiatowicz, and Stoica. Towards a common API for structured Peer-to-Peer overlays. In International Workshop on Peer-to-Peer Systems, 2003. [FJG06] Ronaldo A. Ferreira, Suresh Jagannathan, and Ananth Grama. Local- ity in structured peer-to-peer networks. J. Parallel Distrib. Comput., 66:257–273, February 2006. [FKK06] Kevin Fu, Seny Kamaram, and Yoshi Kohno. Key regression: Enabling efficient key distribution for secure distributed storage. In Network and Distributed System Security Symposium (NDSS ’06), 2006. [FKM02] Kevin Fu, M. Frans Kaashoek, and David Mazie`res. Fast and secure distributed read-only file system. ACM Trans. Comput. Syst., 20(1):1– 24, February 2002. [FLP85] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Im- possibility of distributed consensus with one faulty process. J. ACM, 32:374–382, April 1985. [Fu99] Kevin Fu. Group sharing and random access in cryptographic storage file systems. Master’s thesis, Massachusetts Institute of Technology, May 1999. [Fu05] Kevin E. Fu. Integrity and access control in untrusted content distribu- tion networks. PhD thesis, Cambridge, MA, USA, 2005. AAI0808990. [FUS] http://fuse.sf.net. [GBL+03] Indranil Gupta, Ken Birman, Prakash Linga, Al Demers, and Rob- bert van Renesse. Kelips: Building an efficient and stable P2P DHT through increased memory and background overhead. In IPTPS ’03: 2nd International Workshop on Peer-to-Peer Systems, 2003. [GDZ+05] Lei Gao, Mike Dahlin, Jiandan Zheng, Lorenzo Alvisi, and Arun Iyen- gar. Dual-quorum replication for edge services. In Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware, Middleware ’05, pages 184–204, New York, NY, USA, 2005. Springer- Verlag New York, Inc. [GEvS07] P. Garbacki, D. H. J. Epema, and M. van Steen. Optimizing peer relationships in a Super-Peer network. In ICDCS ’07: Proceedings of BIBLIOGRAPHY 184 the 27th International Conference on Distributed Computing Systems, page 31, Washington, DC, USA, July 2007. IEEE Computer Society. [Gif79] David K. Gifford. Weighted voting for replicated data, 1979. [GKLQ07] Rachid Guerraoui, Dejan Kostic, Ron R. Levy, and Vivien Quema. A high throughput atomic storage algorithm. In ICDCS ’07: Proceedings of the 27th International Conference on Distributed Computing Sys- tems, page 19, Washington, DC, USA, 2007. IEEE Computer Society. [HAF10] Yaser Houri, Bernhard Amann, and Thomas Fuhrmann. A quanti- tative analysis of redundancy schemes for peer-to- peer storage sys- tems. In Proceedings of the 12th international conference on Stabiliza- tion, safety, and security of distributed systems, SSS’10, pages 519–530, Berlin, Heidelberg, 2010. Springer-Verlag. [HAY+05] Ragib Hasan, Zahid Anwar, William Yurcik, Larry Brumbaugh, and Roy Campbell. A survey of Peer-to-Peer storage techniques for dis- tributed file systems. In ITCC ’05: Proceedings of the Interna- tional Conference on Information Technology: Coding and Computing (ITCC’05) - Volume II, pages 205–213, Washington, DC, USA, 2005. IEEE Computer Society. [HB11] Cyrus Harvesf and Douglas M. Blough. Replica placement for route diversity in tree-based routing distributed hash tables. IEEE Trans. Dependable Secur. Comput., 8:419–433, May 2011. [HCW10] Guowei Huang, Jiangang Chen, and Lian Wei. Routeguard: A trust- based scheme for guarding routing in structured peer-to-peer overlays. In Proceedings of the 2010 International Conference on Communica- tions and Mobile Computing - Volume 01, CMC ’10, pages 330–334, Washington, DC, USA, 2010. IEEE Computer Society. [HKD07] Andreas Haeberlen, Petr Kouznetsov, and Peter Druschel. PeerReview: practical accountability for distributed systems. In SOSP ’07: Proceed- ings of twenty-first ACM SIGOPS symposium on Operating systems principles, pages 175–188, New York, NY, USA, 2007. ACM. [HKLF+06] S. B. Handurukande, A. M. Kermarrec, F. Le Fessant, L. Massoulie´, and S. Patarin. Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In EuroSys ’06: Proceedings of the 2006 EuroSys conference, pages 359–371, New York, NY, USA, 2006. ACM Press. BIBLIOGRAPHY 185 [HKM+88a] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1):51–81, February 1988. [HKM+88b] John H. Howard, Michael L. Kazar, Sherri G. Menees, David A. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and Michael J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1):51–81, February 1988. [HMD05] Andreas Haeberlen, Alan Mislove, and Peter Druschel. Glacier: Highly durable, decentralized storage despite massive correlated failures. In IN PROC. OF NSDI, 2005. [HP94] John S. Heidemann and Gerald J. Popek. File-system development with stackable layers. ACM Trans. Comput. Syst., 12(1):58–89, 1994. [iNe] http://topology.eecs.umich.edu/inet/. [IRF04] Adriana Iamnitchi, Matei Ripeanu, and Ian Foster. Small-World File- Sharing communities. In The 23rd Conference of the IEEE Communi- cations Society (InfoCom 2004), Hong Kong, # 2004. [JAV] http://www.java.com. [JB94] Marjan Jurecˇicˇ and Herbert Bunz. Exchange of patient records- prototype implementation of a security attributes service in x.500. In Proceedings of the 2nd ACM Conference on Computer and communi- cations security, CCS ’94, pages 30–38, New York, NY, USA, 1994. ACM. [JGH+98] Jr, R. Guy, J. Heidemann, D. Ratner, P. Reiher, A. Goel, G. Kuenning, and G. Popek. Perspectives on optimistically replicated, Peer-to-Peer filing. Software Practice and Experience, February 1998. [JXY07] Yi Jiang, Guangtao Xue, and Jinyuan You. Distributed hash table based peer-to-peer version control system for collaboration. In Pro- ceedings of the 10th international conference on Computer supported cooperative work in design III, CSCWD’06, pages 489–498, Berlin, Hei- delberg, 2007. Springer-Verlag. [Kar86] P A Karger. Authentication and discretionary access control in com- puter networks. Comput. Secur., 5:314–324, December 1986. BIBLIOGRAPHY 186 [KBC+00] John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao. OceanStore: An architecture for global-scale persistent storage. In In Proceedings of the 9th International Conference on Architectural Sup- port for Programming Languages and Operating Systems (ASPLOS, 2000. [KLW11] Andreas Klappenecker, Hyunyoung Lee, and Jennifer L. Welch. Quorum-based dynamic regular registers in systems with churn. In Proceedings of the 3rd International Workshop on Theoretical Aspects of Dynamic Distributed Systems, TADDS ’11, pages 3–7, New York, NY, USA, 2011. ACM. [KRS+03] Mahesh Kallahalla, Erik Riedel, Ram Swaminathan, Qian Wang, and Kevin Fu. Plutus: Scalable secure file sharing on untrusted storage. In FAST ’03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies, pages 29–42, Berkeley, CA, USA, 2003. USENIX Association. [KS10] Jinu Kurian and Kamil Sarac. A survey on the design, applications, and enhancements of application-layer overlay networks. ACM Comput. Surv., 43:5:1–5:34, December 2010. [KSMK03] Michael Kaminsky, George Savvides, David Mazieres, and Frans M. Kaashoek. Decentralized user authentication in a global file system. SIGOPS Oper. Syst. Rev., 37(5):60–73, December 2003. [KWR06] P. Knezevic, A. Wombacher, and T. Risse. Highly Available DHTs: Keeping Data Consistency After Updates. In 4th International Work- shop, AP2PC 2005, July 25, 2005, Revised Papers, Utrecht, Nether- lands, July 2006. [Lam98] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, May 1998. [Lam01] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4):18–25, December 2001. [LCG07] Haifeng Liu, Xianglan Chen, and Yuchang Gong. Babyos: a fresh start. In Proceedings of the 38th SIGCSE technical symposium on Computer science education, SIGCSE ’07, pages 566–570, New York, NY, USA, 2007. ACM. BIBLIOGRAPHY 187 [LKMS04] Jinyuan Li, Maxwell Krohn, David Mazie`res, and Dennis Shasha. Se- cure untrusted data repository (SUNDR). In OSDI’04: Proceedings of the 6th conference on Symposium on Opearting Systems Design & Im- plementation, page 9, Berkeley, CA, USA, 2004. USENIX Association. [LMZ09] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Vertical paxos and primary-backup replication. In Proceedings of the 28th ACM sympo- sium on Principles of distributed computing, PODC ’09, pages 312–313, New York, NY, USA, 2009. ACM. [LSG+04] Jinyang Li, Jeremy Stribling, Thomer M. Gil, Robert Morris, and M. Frans Kaashoek. Comparing the performance of distributed hash tables under churn. In In The 3th International Workshop on Peer-to- Peer Systems (IPTPS’04), 2004. [MAD02] Jean-Philippe Martin, Lorenzo Alvisi, and Michael Dahlin. Minimal byzantine storage. In DISC ’02: Proceedings of the 16th International Conference on Distributed Computing, pages 311–325, London, UK, 2002. Springer-Verlag. [Maz01] David Mazie`res. A toolkit for User-Level file systems. In Proceedings of the General Track: 2002 USENIX Annual Technical Conference, pages 261–274, Berkeley, CA, USA, 2001. USENIX Association. [mBPS05] Jean michel Busca, Fabio Picconi, and Pierre Sens. Pastis: A highly- scalable multi-user peer-to-peer file system. In in Euro-Par 2005, 2005. [MBRI03] Gurmeet S. Manku, Mayank Bawa, Prabhakar Raghavan, and Verity Inc. Symphony: Distributed hashing in a small world. In In Proceedings of the 4th USENIX Symposium on Internet Technologies and Systems, pages 127–140, 2003. [MCG05] M. Muhammad, A. S. Cheema, and I. Gupta. Efficient mutual ex- clusion in peer-to-peer systems. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing, GRID ’05, pages 296–299, Washington, DC, USA, 2005. IEEE Computer Society. [MCM01] Athicha Muthitacharoen, Benjie Chen, and David Mazières. A low-bandwidth network file system. In SOSP ’01: Proceedings of the eighteenth ACM symposium on Operating systems principles, pages 174–187, New York, NY, USA, 2001. ACM Press. [MD88] P. Mockapetris and K. J. Dunlap. Development of the domain name system. SIGCOMM Comput. Commun. Rev., 18:123–133, August 1988. BIBLIOGRAPHY 188 [MGGM04] Sergio Marti, Prasanna Ganesan, and Hector Garcia-Molina. Dht rout- ing using social links. In 3rd International Workshop on Peer-to-Peer Systems (IPTPS 2004), February 2004. [Mha11] Darshan Mhapasekar. Accomplishing anonymity in peer to peer net- work. In Proceedings of the 2011 International Conference on Commu- nication, Computing & Security, ICCCS ’11, pages 555–558, New York, NY, USA, 2011. ACM. [MI09] Ana Lu´cia De Moura and Roberto Ierusalimschy. Revisiting coroutines. ACM Trans. Program. Lang. Syst., 31:6:1–6:31, February 2009. [MKKW99] David Mazie`res, Michael Kaminsky, M. Frans Kaashoek, and Emmett Witchel. Separating key management from file system security. In Proceedings of the seventeenth ACM symposium on Operating systems principles, SOSP ’99, pages 124–139, New York, NY, USA, 1999. ACM. [MM02] P. Maymounkov and D. Mazieres. Kademlia: A peer-to-peer informa- tion system based on the XOR metric, 2002. [MMGC02] Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen. Ivy: A Read/Write Peer-to-Peer file system. In Proceedings of 5th Symposium on Operating Systems Design and Implementation, 2002. [MNR02] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. In PODC ’02: Proceedings of the twenty-first annual symposium on Principles of distributed comput- ing, pages 183–192, New York, NY, USA, 2002. ACM. [MR97] Dahlia Malkhi and Michael Reiter. Byzantine quorum systems. In STOC ’97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 569–578, New York, NY, USA, 1997. ACM Press. [MT03a] Tim Moreton and Andrew Twigg. Enforcing collaboration in peer-to- peer routing services. In Proceedings of the 1st international conference on Trust management, iTrust’03, pages 255–270, Berlin, Heidelberg, 2003. Springer-Verlag. [MT03b] Tim Moreton and Andrew Twigg. Trading in trust, tokens and stamps. In 1st Workshop on the Economics of P2P systems, 2003. [NSN] http://www.nsnam.org. BIBLIOGRAPHY 189 [OM94] Kazuo Ohta and Mitsuru Matsui. Differential attack on message au- thentication codes. In Proceedings of the 13th Annual International Cryptology Conference on Advances in Cryptology, CRYPTO ’93, pages 200–211, London, UK, 1994. Springer-Verlag. [Omn] http://www.omnidrive.com. [Ope] http://www.openomy.com. [Osa88] Alex Osadzinski. The network file system (nfs). Comput. Stand. Inter- faces, 8:45–48, July 1988. [OSV09] Christian Ortolf, Christian Schindelhauer, and Arne Vater. Classifying peer-to-peer network coding schemes. In Proceedings of the twenty- first annual symposium on Parallelism in algorithms and architectures, SPAA ’09, pages 310–318, New York, NY, USA, 2009. ACM. [PCT04] Bogdan C. Popescu, Bruno Crispo, and Andrew S. Tanenbaum. Safe and private data sharing with turtle: Friends Team-Up and beat the system. In In Proc. of the 12th Cambridge Intl. Workshop on Security Protocols, 2004. [PPD+95] Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thomp- son, Howard Trickey, and Phil Winterbottom. Plan 9 from bell labs. Computing Systems, 8(3):221–254, 1995. [PRR97] Greg C. Plaxton, Rajmohan Rajaraman, and Andrea W. Richa. Ac- cessing nearby copies of replicated objects in a distributed environment. In ACM Symposium on Parallel Algorithms and Architectures, pages 311–320, 1997. [PSAS01] Marius Portmann, Pipat Sookavatana, Se´bastien Ardon, and Aruna Seneviratne. The cost of peer discovery and searching in the gnutella peer-to-peer file sharing protocol. In Proceedings of the 9th IEEE Inter- national Conference on Networks, ICON ’01, pages 263–, Washington, DC, USA, 2001. IEEE Computer Society. [RD01a] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location, and routing for Large-Scale Peer-to-Peer systems. Lec- ture Notes in Computer Science, 2218:329–351, 2001. [RD01b] Antony Rowstron and Peter Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In Proc. of the 18th ACM Symposium on Operating System Principles, October 2001. BIBLIOGRAPHY 190 [RFSH01] Sylvia Ratnasamy, Paul Francis, Scott Shenker, and Mark Handley. A scalable Content-Addressable network. In In Proceedings of ACM SIGCOMM, pages 161–172, 2001. [RGRK04] Sean Rhea, Dennis Geels, Timothy Roscoe, and John Kubiatowicz. Handling churn in a DHT. In Proc. of the 2004 Usenix Annual Technical Conference, June 2004. [RH04] Sriram Ramabhadran and Joseph M. Hellerstein. Prefix hash tree: An indexing data structure over distributed hash tables, 2004. [Sa07] Jeremy Stribling and Emil Sit and. Don’t give up on distributed file systems. In Proc. of the 6th IPTPS, February 2007. [SBA03] Sbarc: A supernode based peer-to-peer file sharing system. In Proceed- ings of the Eighth IEEE International Symposium on Computers and Communications, ISCC ’03, pages 1053–, Washington, DC, USA, 2003. IEEE Computer Society. [SFH+99] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alis- tair C. Veitch, Ross W. Carton, and Jacob Ofir. Deciding when to forget in the elephant file system. SIGOPS Oper. Syst. Rev., 33(5):110–123, 1999. [SKK+90] M. Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, and David C. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transac- tions on Computers, 39(4):447–459, 1990. [SKKM02] Yasushi Saito, Christos Karamanolis, Magnus Karlsson, and Mallik Mahalingam. Taming aggressive replication in the pangaea wide-area file system. SIGOPS Oper. Syst. Rev., 36(SI):15–30, 2002. [SMK+01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM ’01: Proceedings of the 2001 con- ference on Applications, technologies, architectures, and protocols for computer communications, volume 31, pages 149–160, New York, NY, USA, October 2001. ACM. [SMZ03] K. Sripanidkulchai, B. Maggs, and H. Zhang. Efficient content location using interest-based locality in peer-to-peer systems. In IEEE Infocom, San Francisco, CA, March 2003. BIBLIOGRAPHY 191 [SNDW06] Atul Singh, Tsuen-Wan Ngan, Peter Druschel, and Dan S. Wallach. Eclipse attacks on overlay networks: Threats and defenses. In 25th Conference on Computer Communications (INFOCOM 2006). IEEE, 2006. [SS96] Mirjana Spasojevic and M. Satyanarayanan. An empirical study of a wide-area distributed file system. ACM Trans. Comput. Syst., 14(2):200–222, May 1996. [SS02] J. Sabater and C. Sierra. Regret: a reputation model for gregarious societies. In C. Castelfranchi and L. Johnson, editors, Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi- Agent Systems (AAMAS-2002), pages 475–482. ACM Press, 2002. [SS05] Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Comput. Surv., 37(1):42–81, March 2005. [Vas08] Nadejda Belbus Vasilyevna. An rbac design with discretionary and mandatory features. In Proceedings of the 2008 International Sympo- sium on Ubiquitous Multimedia Computing, pages 260–263, Washing- ton, DC, USA, 2008. IEEE Computer Society. [Vog99] Werner Vogels. File system usage in windows NT 4.0. In SOSP ’99: Proceedings of the seventeenth ACM symposium on Operating systems principles, pages 93–109, New York, NY, USA, 1999. ACM. [WA93] Randolph Y. Wang and Thomas E. Anderson. xFS: A wide area mass storage file system. In Workshop on Workstation Operating Systems, pages 71–78, 1993. [WDG+06] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P. Quigley, Erez Zadok, and Mohammad N. Zubair. Versatility and unix semantics in namespace unification. Trans. Storage, 2(1):74– 105, 2006. [Win] http://skydrive.live.com. [WK02] H. Weatherspoon and J. Kubiatowicz. Erasure coding vs. replication: A quantitative comparison. In Revised Papers from the First Interna- tional Workshop on Peer-to-Peer Systems, pages 328–338, 2002. [Wua] http://www.wuala.com. [WV03] Yao Wang and Julita Vassileva. Trust and reputation model in Peer- to-Peer networks. In Third International Conference on Peer-to-Peer Computing (P2P’03), 2003. BIBLIOGRAPHY 192 [XDr] http://www.xdrive.com. [YM02] Beverly Yang and Hector G. Molina. Improving search in Peer-to-Peer networks. In ICDCS ’02: Proceedings of the 22 nd International Con- ference on Distributed Computing Systems (ICDCS’02), Washington, DC, USA, 2002. IEEE Computer Society. [YVGM04] B. Yang, P. Vinograd, and H. Garcia-Molina. Evaluating GUESS and non-forwarding peer-to-peer search. pages 209–218, 2004. [ZH07] Runfang Zhou and Kai Hwang. PowerTrust: A robust and scalable reputation system for trusted Peer-to-Peer computing. Transactions on Parallel and Distributed Systems, 18(4):460–473, 2007. [ZKJ01] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infras- tructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, Computer Science Division, UC Berkeley, April 2001. [ZKW05] Chi Zhang, Arvind Krishnamurthy, and Olph Y. Wang. Brushwood: Distributed trees in peer-to-peer systems. In In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS’05, pages 47–57, 2005. [ZSJ06] Bo Zhu, Sanjeev Setia, and Sushil Jajodia. Providing witness anonymity in peer-to-peer systems. In Proceedings of the 13th ACM conference on Computer and communications security, CCS ’06, pages 6–16, New York, NY, USA, 2006. ACM.