Section 3: Large-Scale File Systems

One interesting recent development in systems is the growth of extremely large-scale file systems. These file systems provide interfaces to radically enormous amounts of stable-storage data. The data stored is so large that it can’t fit on any one machine, so it’s distributed over many machines.

Early widely-used distributed file systems (NFS, AFS) pretended to be disk file systems. They provided the same interface to user processes as disk file systems, and plugged in to the kernel’s VFS layer like disk file systems do. When this works, it works beautifully and transparently, but distributed file systems have failure modes that disk file systems don’t. NFS and AFS can cause strange process behavior, such as processes that cannot be killed.

Modern extremely large-scale file systems are built in a different way. They don’t pretend to be disk file systems; they have an entirely different interface. It’s fun to learn about these different interface choices and figure out why they were made. In this section, you’ll learn about an important extremely large-scale file system, the Google file system.

“The Google File System.” Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. In Proc. SOSP 2003. Link

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Preparation

As you read the paper, consider the following question, and post your answer by one hour before section; post your answer as a followup to the Piazza announcement of section.

GFS wants to support many concurrently-active clients. To enable high scalability for data accesses, GFS stores data across a large number of chunkservers; but it uses a single master to manage metadata. Was this design decision a reasonable one for Google to make? You can refer to the paper’s arguments, your own intuitions, or research into later distributed file systems, such as Google’s Colossus.

How to read research papers

There are as many ways to read research papers as there are people. Eddie often reads papers sequentially, starting at the beginning. However, particularly at the beginning of your career, sequential reading can be very difficult. You may want to skim for high points, then read again for more depth.

Take a look at S. Keshav’s “How to Read a Paper” (local copy). I especially recommend following his advice for the first pass, which aims to develop a mental framework for the paper as soon as possible. Once that framework is in place, reading for content and ideas becomes much easier.

You might also be interested in Michael Mitzenmacher’s paper-reading instructions.