Notes on Bigtable: A Distributed Storage System for Structured Data

The most influential systems publications of the 2000s may be the first two papers on Google’s internal cluster storage, GFS [1] and Bigtable [2]. GFS offers a file system-like interface, Bigtable a database-like interface; that is, GFS stores unstructured files (byte streams), while Bigtable stores structured data (rows and columns). But neither system uses a conventional interface: you read and write GFS files through a GFS API, and you read and write Bigtable through a Bigtable API, not SQL.

Bigtable in particular is a delicious smorgasbord of data storage techniques, with a lot to teach us about building storage systems. On the other hand, several aspects of its design are sensitive to its deployment at Google, on top of GFS. To explain the design, we’ll pretend to build it up from first principles.

Reliable storage: durability and replication

Most any storage system aims to store data reliably, so that if a computer fails, the data can be recovered. We worry about both temporary failures, where a computer goes offline for a while but will come back, and permanent failures, where a computer dies. Network partitions, power blips, and program crashes generally cause temporary failures; hardware failure, fires, and sabotage generally cause permanent failures. We assume (with good reason) that temporary failures are more common and unpredictable than permanent ones.

To guard against power blips and program crashes, a system must store data on durable media, such as disks and flash memory. Only data stored on durable media will survive reboot. (Reboot is a magic solution for many temporary failures.)

But durable media cannot guard against permanent failures. That requires replication, where the system keeps multiple copies of the data: backups, basically. If the data is stored several times, on several geographically distributed computers, then only a major catastrophe will cause data loss.

Most (but, interestingly, not all) distributed systems use both durability and replication to store data reliably. For instance, each data modification might be written to at least three disks. If one disk fails, the data is proactively copied onto a new disk, so that at least three copies are usually available. That way, data is lost only if three permanent failures happen at roughly the same time. (Non-durable replication is not considered sufficient on its own, since temporary failures, which are more common and therefore more likely to strike several machines at once, lose non-durable data.)
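To make the write path concrete, here is a minimal sketch (in Python) of the rule just described: every record is written and fsynced onto three separate logs before the write is acknowledged. The Replica and ReplicatedStore names, and the log paths, are our own illustration, not part of GFS or Bigtable.

    import os

    class Replica:
        """One replica: appends records to a local log file and fsyncs them."""
        def __init__(self, path):
            self.f = open(path, "ab")

        def write_durably(self, record: bytes) -> None:
            self.f.write(record + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # durable: survives power blips and crashes

    class ReplicatedStore:
        """Writes each record to every replica before reporting success."""
        def __init__(self, replicas):
            assert len(replicas) >= 3   # tolerate up to two permanent failures
            self.replicas = replicas

        def put(self, record: bytes) -> None:
            for r in self.replicas:     # a real system would write in parallel
                r.write_durably(record)

    # Hypothetical paths; a real deployment puts replicas on different machines.
    store = ReplicatedStore([Replica("/tmp/replica%d.log" % i) for i in range(3)])
    store.put(b"key=value")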

Most GFS files are replicated to three computers, which write them durably onto disks and flash.

Sequential storage

GFS was designed to store very large files that are generally accessed sequentially: starting from the first byte and proceeding in order. Sequential access is almost always the fastest way to access files on any storage system. Why?
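A small, self-contained experiment (our own, not from either paper) illustrates the claim: read the same file sequentially, then in random block order. On a spinning disk the random order pays a seek per block; note that the file must be bigger than available RAM (or the page cache dropped) for the full effect to show, since cached reads hide the difference. The file name and sizes below are arbitrary.

    import os, random, time

    PATH = "bigfile.bin"
    BLOCK = 64 * 1024
    NBLOCKS = 4096                       # 256 MiB total

    # Create a test file once.
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(os.urandom(BLOCK) * NBLOCKS)

    def read_blocks(order):
        """Read the given blocks in the given order and return elapsed seconds."""
        with open(PATH, "rb") as f:
            start = time.time()
            for i in order:
                f.seek(i * BLOCK)
                f.read(BLOCK)
            return time.time() - start

    sequential = read_blocks(range(NBLOCKS))
    shuffled = list(range(NBLOCKS))
    random.shuffle(shuffled)
    randomized = read_blocks(shuffled)
    print("sequential: %.2fs  random: %.2fs" % (sequential, randomized))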

Structured storage

Bigtable, however, stores structured data, including large items (like web pages) and small items (like the text of a link). A typical Bigtable transaction might involve only a couple of small data items, but many, many clients may access a Bigtable at a time. This poses both performance and correctness challenges. How can such a system scale?

Bigtable makes a couple of data model choices that are relevant to our understanding.

Building up Bigtable

We now describe roughly how Bigtable could have been designed, starting with the basics.

However, to make the issues clear, we’ll start with a data model even simpler than Bigtable’s. Specifically, we’ll pretend that Bigtable started as a hash table, or key/value store, that maps string keys to string values. Here, a key combines the real Bigtable’s row and column names. Think of a key as the concatenation of those names (like “rowname|columnname”). We’ll see later why rows and columns are important to differentiate at the system level. But notice how far we can get without explicit columns: it may surprise you!
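Here is a toy sketch of that simplified model: an ordinary in-memory dictionary whose keys concatenate a row name and a column name with a “|” separator. The class and method names are ours, not Bigtable’s API (which distinguishes rows and columns explicitly), and the example row and column names only loosely follow the paper’s web-table example.

    from typing import Optional

    class SimpleTable:
        """Key/value store whose keys are "rowname|columnname" strings."""
        def __init__(self):
            self.kv = {}                      # string key -> string value

        @staticmethod
        def _key(row: str, column: str) -> str:
            return row + "|" + column

        def put(self, row: str, column: str, value: str) -> None:
            self.kv[self._key(row, column)] = value

        def get(self, row: str, column: str) -> Optional[str]:
            return self.kv.get(self._key(row, column))

    t = SimpleTable()
    t.put("com.example.www/index.html", "contents", "<html>...</html>")
    t.put("com.example.www/index.html", "anchor:cnn.com", "Example link text")
    print(t.get("com.example.www/index.html", "contents"))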

Basic reads and writes

Scalability

Transaction support

Optimizations

We describe only a limited set of optimizations here; see the paper for more.

Bigtable as a whole

Here’s an overview of the whole Bigtable system as we’ve described it.

Cluster level

Tablet level

Row level

Client level

Comparison with conventional databases

Contributions

Bigtable not only introduced an interesting data model (rows, columns, column families, timestamps, atomic row updates), but also combined a large number of interesting and useful data representation techniques (mutable stacks of immutable SSTables, Bloom filters, compressed tablets), some of them new. The paper offers a deep set of systems techniques and obviously good engineering. The Chubby/master/tablet-server interactions (which we didn’t particularly focus on above) show that a single-master system can avoid bottlenecks and scale tremendously.
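To make the “mutable stack of immutable SSTables” idea concrete, here is a toy sketch in the spirit of an LSM-tree [3]: writes land in a mutable in-memory table, flushes freeze it into an immutable sorted table, and reads consult the newest tables first, with a per-table membership set standing in for a real Bloom filter. This is our illustration, not Bigtable’s implementation.

    class ImmutableTable:
        """A frozen, sorted table of key/value pairs, like a tiny SSTable."""
        def __init__(self, entries):
            self.entries = dict(sorted(entries.items()))
            self.filter = set(entries)        # crude stand-in for a Bloom filter

        def get(self, key):
            if key not in self.filter:        # cheap negative check: skip this table
                return None
            return self.entries.get(key)

    class TableStack:
        """A mutable in-memory table on top of a stack of immutable tables."""
        def __init__(self):
            self.memtable = {}                # mutable, holds the most recent writes
            self.sstables = []                # immutable tables, newest first

        def put(self, key, value):
            self.memtable[key] = value

        def flush(self):
            if self.memtable:
                self.sstables.insert(0, ImmutableTable(self.memtable))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for t in self.sstables:           # newest first, so the latest value wins
                v = t.get(key)
                if v is not None:
                    return v
            return None

    s = TableStack()
    s.put("row1|col", "v1")
    s.flush()
    s.put("row1|col", "v2")                   # newer write shadows the flushed one
    assert s.get("row1|col") == "v2"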


  1. “The Google file system,” Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, in Proc. 19th SOSP, 2003.

  2. “Bigtable: A distributed storage system for structured data,” Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, in Proc. 7th OSDI, Nov. 2006.

  3. “The log-structured merge-tree (LSM-tree),” Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil, Acta Informatica 33(4):351–385, 1996.

  4. “Performance tradeoffs in read-optimized databases,” Stavros Harizopoulos, Velen Liang, Daniel J. Abadi, and Samuel Madden, in Proc. VLDB ’06, pages 487–498, 2006.

  5. “Rose: Compressed, log-structured replication,” Russell Sears, Mark Callaghan, and Eric Brewer, in Proc. VLDB ’08, August 2008.