Notes on Bigtable: A Distributed Storage System for Structured Data

The most influential systems publications of the 2000s may be the first two papers on Google’s internal cluster storage, GFS [1] and Bigtable [2]. GFS offers a file system-like interface, Bigtable a database-like interface; that is, GFS stores unstructured files (byte streams), while Bigtable stores structured data (rows and columns). But neither system uses a conventional interface: you read and write GFS files through a GFS API, and you read and write Bigtable through a Bigtable API, not SQL.

Bigtable in particular is a delicious smorgasbord of data storage techniques, with a lot to teach us about building storage systems. On the other hand, several aspects of its design are sensitive to its deployment at Google, on top of GFS. To explain the design, we’ll pretend to build it up from first principles.

Reliable storage: durability and replication

Most any storage system aims to store data reliably, so that if a computer fails, the data can be recovered. We worry about both temporary failures, where a computer goes offline for a while but will come back, and permanent failures, where a computer dies. Network partitions, power blips, and program crashes generally cause temporary failures; hardware failure, fires, and sabotage generally cause permanent failures. We assume (with good reason) that temporary failures are more common and unpredictable than permanent ones.

To guard against power blips and program crashes, a system must store data on durable media, such as disks and flash memory. Only data stored on durable media will survive reboot. (Reboot is a magic solution for many temporary failures.)

But durable media cannot guard against permanent failures. That requires replication, where the system keeps multiple copies of the data: backups, basically. If the data is stored several times, on several geographically distributed computers, then only a major catastrophe will cause data loss.

Most (but, interestingly, not all) distributed systems use both durability and replication to store data reliably. For instance, each data modification might be written to at least three disks. If one disk fails, the data is proactively copied onto a new disk, so that at least three copies are usually available. That way, data is lost only if three permanent failures happen at roughly the same time. (Non-durable replication is not considered sufficient on its own, since temporary failures, which are more common and therefore more likely to strike several machines at once, lose non-durable data.)
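To make the write path concrete, here is a minimal sketch (in Python) of the rule just described: every record is written and fsynced onto three separate logs before the write is acknowledged. The Replica and ReplicatedStore names, and the log paths, are our own illustration, not part of GFS or Bigtable.

    import os

    class Replica:
        """One replica: appends records to a local log file and fsyncs them."""
        def __init__(self, path):
            self.f = open(path, "ab")

        def write_durably(self, record: bytes) -> None:
            self.f.write(record + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # durable: survives power blips and crashes

    class ReplicatedStore:
        """Writes each record to every replica before reporting success."""
        def __init__(self, replicas):
            assert len(replicas) >= 3   # tolerate up to two permanent failures
            self.replicas = replicas

        def put(self, record: bytes) -> None:
            for r in self.replicas:     # a real system would write in parallel
                r.write_durably(record)

    # Hypothetical paths; a real deployment puts replicas on different machines.
    store = ReplicatedStore([Replica("/tmp/replica%d.log" % i) for i in range(3)])
    store.put(b"key=value")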

Most GFS files are replicated to three computers, which write them durably onto disks and flash.

Sequential storage

GFS was designed to store very large files that are generally accessed sequentially: starting from the first byte and proceeding in order. Sequential access is almost always the fastest way to access files on any storage system. Why?
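A small, self-contained experiment (our own, not from either paper) illustrates the claim: read the same file sequentially, then in random block order. On a spinning disk the random order pays a seek per block; note that the file must be bigger than available RAM (or the page cache dropped) for the full effect to show, since cached reads hide the difference. The file name and sizes below are arbitrary.

    import os, random, time

    PATH = "bigfile.bin"
    BLOCK = 64 * 1024
    NBLOCKS = 4096                       # 256 MiB total

    # Create a test file once.
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(os.urandom(BLOCK) * NBLOCKS)

    def read_blocks(order):
        """Read the given blocks in the given order and return elapsed seconds."""
        with open(PATH, "rb") as f:
            start = time.time()
            for i in order:
                f.seek(i * BLOCK)
                f.read(BLOCK)
            return time.time() - start

    sequential = read_blocks(range(NBLOCKS))
    shuffled = list(range(NBLOCKS))
    random.shuffle(shuffled)
    randomized = read_blocks(shuffled)
    print("sequential: %.2fs  random: %.2fs" % (sequential, randomized))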

Structured storage

Bigtable, however, stores structured data, including large items (like web pages) and small items (like the text of a link). A typical Bigtable transaction might involve only a couple of small data items, but many, many clients may access a Bigtable at a time. This poses both performance and correctness challenges. How can such a system scale?

Bigtable makes a couple of data model choices that are relevant to our understanding.

Building up Bigtable

We now describe roughly how Bigtable could have been designed, starting with the basics.

However, to make the issues clear, we’ll start with a data model even simpler than Bigtable’s. Specifically, we’ll pretend that Bigtable started as a hash table, or key/value store, that maps string keys to string values. Here, a key combines the real Bigtable’s row and column names. Think of a key as the concatenation of those names (like “rowname|columnname”). We’ll see later why rows and columns are important to differentiate at the system level. But notice how far we can get without explicit columns: it may surprise you!
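Here is a toy sketch of that simplified model: an ordinary in-memory dictionary whose keys concatenate a row name and a column name with a “|” separator. The class and method names are ours, not Bigtable’s API (which distinguishes rows and columns explicitly), and the example row and column names only loosely follow the paper’s web-table example.

    from typing import Optional

    class SimpleTable:
        """Key/value store whose keys are "rowname|columnname" strings."""
        def __init__(self):
            self.kv = {}                      # string key -> string value

        @staticmethod
        def _key(row: str, column: str) -> str:
            return row + "|" + column

        def put(self, row: str, column: str, value: str) -> None:
            self.kv[self._key(row, column)] = value

        def get(self, row: str, column: str) -> Optional[str]:
            return self.kv.get(self._key(row, column))

    t = SimpleTable()
    t.put("com.example.www/index.html", "contents", "<html>...</html>")
    t.put("com.example.www/index.html", "anchor:cnn.com", "Example link text")
    print(t.get("com.example.www/index.html", "contents"))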

Basic reads and writes

Scalability

Transaction support

Optimizations

We describe only a limited set of optimizations here; see the paper for more.

Bigtable as a whole

Here’s an overview of the whole Bigtable system as we’ve described it.

Cluster level

Tablet level

Row level

Client level

Comparison with conventional databases

Contributions

Bigtable not only introduced an interesting data model (rows, columns, column families, timestamps, atomic row updates), but also combined a large number of interesting and useful data representation techniques (mutable stacks of immutable SSTables, Bloom filters, compressed tablets), some of them new. The paper offers a deep set of systems techniques and obviously good engineering. The Chubby/master/tablet-server interactions (which we didn’t particularly focus on above) show that a single-master system can avoid bottlenecks and scale tremendously.
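To make the “mutable stack of immutable SSTables” idea concrete, here is a toy sketch in the spirit of an LSM-tree [3]: writes land in a mutable in-memory table, flushes freeze it into an immutable sorted table, and reads consult the newest tables first, with a per-table membership set standing in for a real Bloom filter. This is our illustration, not Bigtable’s implementation.

    class ImmutableTable:
        """A frozen, sorted table of key/value pairs, like a tiny SSTable."""
        def __init__(self, entries):
            self.entries = dict(sorted(entries.items()))
            self.filter = set(entries)        # crude stand-in for a Bloom filter

        def get(self, key):
            if key not in self.filter:        # cheap negative check: skip this table
                return None
            return self.entries.get(key)

    class TableStack:
        """A mutable in-memory table on top of a stack of immutable tables."""
        def __init__(self):
            self.memtable = {}                # mutable, holds the most recent writes
            self.sstables = []                # immutable tables, newest first

        def put(self, key, value):
            self.memtable[key] = value

        def flush(self):
            if self.memtable:
                self.sstables.insert(0, ImmutableTable(self.memtable))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for t in self.sstables:           # newest first, so the latest value wins
                v = t.get(key)
                if v is not None:
                    return v
            return None

    s = TableStack()
    s.put("row1|col", "v1")
    s.flush()
    s.put("row1|col", "v2")                   # newer write shadows the flushed one
    assert s.get("row1|col") == "v2"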


  1. “The Google file system,” Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, in Proc. 19th SOSP, 2003.

  2. “Bigtable: A distributed storage system for structured data,” Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, in Proc. 7th OSDI, Nov. 2006.

  3. “The log-structured merge-tree (LSM-tree),” Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil, Acta Informatica 33(4):351–385, 1996.

  4. “Performance tradeoffs in read-optimized databases,” Stavros Harizopoulos, Velen Liang, Daniel J. Abadi, and Samuel Madden, in Proc. VLDB ’06, pages 487–498, 2006.

  5. “Rose: Compressed, log-structured replication,” Russell Sears, Mark Callaghan, and Eric Brewer, in Proc. VLDB ’08, August 2008.