Introduction

What is a distributed system?

A distributed system is two or more computers collaborating over a network. Distributed systems bring both pleasure and pain.

Why pleasure? Scale (more computers) means your system can store more data, perform more computations per second, and be more reliable than any individual computer.

Why pain? More computers means less control. A processor has a specification; the processor vendors are paid to ensure their processors work according to that specification; a given instruction behaves exactly as defined. The computer’s privileged software (the operating system kernel) has full control; it can observe the full state of every process. But a distributed system inherently involves local knowledge. No system component can gain a consistent view of the system’s state. Limitations imposed by local knowledge make distributed algorithms hard to design and distributed systems hard to debug.

Limitations on local knowledge can be exaggerated. We can build distributed systems, and we do all the time. But the limitations are real too, and learning about those limitations, and the coping techniques systems and algorithms use, is part of what the class is about.

Example: Nines

Reliability is often described using “nines.” A system that’s available 99% of the time has “two nines” of reliability. A typical server computer offers around two nines: down 3.65 days a year.

But assume you can build a distributed system of N servers where the system as a whole is available if at least one of the component servers is available. How many nines does the system as a whole have? Well, with two servers, the system is unavailable if neither component is available. Each server is dead with probability $$p = 0.01$$, so both servers are dead with probability $$p^2 = 0.0001$$, and the system as a whole has four nines of availability. Generalizing, N servers have 2N nines, and system with 5 computers will be down 3 milliseconds a year. Amazon S3 claims more than 11 nines of data durability in 2026.

The flip side of a system whose reliability comes from replication is that the system will often be in a state of partial failure! In a system with 5 servers (N = 5) that fail with probability $$p = 0.01$$, at least one server will be down $$1 - (1-p)^5 = 4.9\%$$ of the time. And as the system grows, this gets exponentially worse: $$1 - (1-p)^{100} = 63.4\%$$.