Breaking consensus in simulation

In your second problem set, you get an event-driven simulator for distributed systems code, and an implementation of the Chandra-Toueg consensus algorithm that achieves distributed consensus using failure detectors. You will extend the network simulator to simulate interesting network behaviors and failures, and show that your tests can detect subtle errors in this implementation of consensus.

We’ll discuss your partial results in class on Monday February 23. The turnin date will be later that week.

Background

What makes distributed ecosystem programming hard? It’s the “limitations of local knowledge”; it’s the difficulty of managing administrative boundaries. But fundamentally, it’s that distributed systems problems are practically unreproducible. Distributed systems face a hard-to-conceive range of issues. Messages get lost or delayed in unpredictable ways; systems fail that shouldn’t; one message arrives a microsecond before another, a butterfly’s wings flap, and the whole ecosystem starts behaving in an unexpected way. Debugging depends on reproducible test cases, but in distributed systems and ecosystems those are hard to come by.

The result is pain. Here’s a quote from a student in the University of Washington’s distributed systems class: “Just 3 days before the deadline of the project, my partner and I discovered that our Paxos failed 1 of 100,000 tests. … We needed to rewrite fifty percent of the whole project but we did not give up. Finally, after 30 hours of work in 2 days, we fixed the design flaw and eliminated the bug.” link

Whole branches of computer science have been developed to tackle irreproducibility in distributed systems, including model checking and specification and verification languages. These techniques are super powerful and worth reading about; they are in active use in the world’s most important systems companies. But in our problem sets, you will aim to make failures reproducible by building deterministic tests.

Deterministic test infrastructure has several advantages. You can write tests that work on actual systems code, not a model of systems code. (In a future pset, you will write code that can run either over a real network or in a single-threaded, reproducible test.) You will learn how to think about test harnesses—a critical software engineering skill. And you will be in good company. One of the best papers about deploying distributed systems, “Paxos Made Live”, describes how Google engineers used this technique:

“One of our tests verifies the fault-tolerant log. It simulates a distributed system consisting of a random number of replicas and takes our fault-tolerant log through a random sequence of network outages, message delays, timeouts, process crashes and recoveries, file corruptions, schedule interleavings, etc. We wanted this test to be repeatable to aid in debugging. To this end, we use a random number generator to determine the schedule of failures. The seed for the random number generator is given at the beginning of the test run. We ensure that two test runs with the same random number seed are identical by running the test in a single thread to remove unwanted non-determinism from multi-threading. This is possible because [our code] can run in a single-threaded environment (even though it normally runs in a multi-threaded environment). Each test execution reports success or failure. If a test fails, we rerun that test with the failing random number seed…. This is possible because these tests are repeatable.”

Techniques like model checking and verification have their own advantages. A verifier can prove that an algorithm is correct; a tester can never prove correctness, only demonstrate incorrectness. Testing is very powerful, but can take time: in “Paxos Made Live,” Chandra, Griesemer, and Redstone report that although simple testing eliminated many bugs, some “took weeks of simulated execution time (at extremely high failure rates) to find.” It can still expose very subtle bugs, as I hope you will see.

(Tushar Chandra, co-author of “Paxos Made Live,” is Tushar Deepak Chandra, co-author of “Unreliable failure detectors for reliable distributed systems” and Chandra-Toueg consensus!)

Setting

Our handout code is in the pset2 directory. I’ve tried to put the code you least need to understand under detail/, but everything is more or less commented. The source files are:

Build with cmake -B build && cmake --build build, then run the ping program with build/ping. You should see exactly this:

2021-10-12 20:21:09.000000: server 0 sends initial ping
2021-10-12 20:21:09.020001: server 1 received 0, sends 1
2021-10-12 20:21:09.040002: server 0 received 1, sends 2
2021-10-12 20:21:09.060003: server 1 received 2, sends 3
2021-10-12 20:21:09.080004: server 0 received 3, sends 4
2021-10-12 20:21:09.100005: server 1 received 4, sends 5
2021-10-12 20:21:09.120006: server 0 received 5, sends 6
2021-10-12 20:21:09.140007: server 1 received 6, sends 7
2021-10-12 20:21:09.160008: server 0 received 7, sends 8
2021-10-12 20:21:09.180009: server 1 received 8, sends 9
2021-10-12 20:21:09.200010: server 0 received 9, sends 10

If you run build/ping -V, you can see the messages being sent and received too—but again, you should get the same output every time: the simulation is totally deterministic.

Look at ping.cc with reference to the Cotamer manual until you’re comfortable with how it works.

The ctconsensus program is not totally deterministic, because the initial inputs for the $$N$$ servers are chosen randomly. But the network behavior is totally deterministic. Compare three runs of build/ctconsensus -V. The colors are different, but message types, timings, sources, and destinations are all exactly the same.

If you want a specific initial condition, build/ctconsensus -S NUM (where NUM is a 64-bit number) will perform exactly the same every time. For instance, you should be able to exactly replicate the three runs above with -S 2620, -S 2621, and -S 2622.

Goal, turnin

Your goals in this problem set are:

  1. Extend the network simulator to support failures and network delays.
  2. Use your extended simulator to find bugs in broken CT consensus algorithms.
  3. But your extended simulator should not break the correct CT consensus algorithm!

You will modify netsim.hh and create one or more copies of ctconsensus.cc that have errors in their CT consensus algorithm. (We recommend storing these copies using Git branches or tags; see below.) Your writeup in TURNIN.md will describe how you made our network model more realistic and how you broke CT consensus. You must also give specific random seeds and network sizes (-S and -n arguments) that demonstrate the errors in your broken CT consensus programs.

(As you add functionality to netsim.hh, you may also add code to ctconsensus.cc to configure that functionality; but make sure that ctconsensus.cc continues to implement a correct CT algorithm.)

I intend for your work on this problem set to carry over to future problem sets.

Example

Our handout code has an example incorrect CT consensus implementation in ctstubborn.cc. The differences from correct CT consensus are pretty serious: with probability 0.001, a server refuses to acknowledge any color other than its own.

This sounds bad, but if you run build/ctstubborn, almost certainly consensus will be achieved anyway! Only one in a thousand servers is stubborn; if one server is stubborn, it will likely convince the others to go along; and if two servers are stubborn, they might have the same color, or the third will eventually just pick one of them. But there are bad seeds. Try build/ctstubborn -V -S 15732322361894304040.

The most efficient way to search for bad seeds is to run a ctconsensus-type program with the -R COUNT -q options, which try COUNT random seeds and exit on the first failure. With these options, my laptop can test roughly 100,000 3-server consensus instances per second, and it takes around 40 seconds to find a failing seed for build/ctstubborn.

Notes

Network and computation model

Chandra-Toueg consensus works in the asynchronous reliable communication model: messages can be delayed or reordered, and servers can take indefinitely long to execute a computation step, but messages are never dropped.

Super-long message delays look like drops, making consensus impossible (and complicating tests). Your implementation may reorder messages arbitrarily, but we suggest that you never delay a message by more than a minute. That should be enough to demonstrate any interesting failures.

Network simulator extensions

Here’s a list of network simulator changes we recommend.

  1. Jitter.

    Delay messages by random times, rather than a fixed time. You can compute that delay in many ways. You could start with a base and then add uniformly-chosen extra delay. Alternatively, you could choose delays from a Gaussian, or you could start with a base and add an exponentially-distributed delay (more realistic in the network setting). You could make one server farther away than the others by setting its channel delays to higher values.

    The network<T> object offers uniform(), normal(), and exponential() functions to generate random numbers or durations from all of these distributions; for instance, net_.exponential(200us) generates a duration from an exponential distribution whose mean is 200µs.

  2. Failure.

    In the handout code, servers never fail, which is pretty weird given that CT consensus is designed to tolerate failure. So add support for server failure. You can do this by co_returning early from consensus(). Alternatively, you could implement failure in the channel type, since failing a server is equivalent to dropping all messages originating at that server. A coroutine like this fails server i after some delay:

    cot::task<> fail_server_after(ctconsensus::network_type& net,
                                  int i /* server to fail */,
                                  int N /* total number of servers */,
                                  cot::clock::duration delay) {
        assert(i >= 0 && i < N);
        co_await cot::after(delay);
        for (int j = -1 /* Nancy */; j != N; ++j) {
            net.link(i, j).fail();       // fail() is for you to implement
        }
    }
    
  3. Computation delay.

    In addition to delaying message delivery, you could delay message receipt, thereby modeling computation slowdowns. Add a random delay to the port<T>::receive function.

But the failures you introduce must not violate the requirements for CT consensus! For example, in a network with $$N$$ servers, CT consensus supports at most $$\lfloor (N-1)/2 \rfloor$$ server failures, so you must ensure that at least $$\lceil (N+1)/2 \rceil$$ servers keep working. The working CT consensus code in ctconsensus.cc should still work, even with your networking changes. (Unless there’s a bug in ctconsensus.cc, in which case great job finding it!)

Designing erroneous implementations

The error we added to ctstubborn.cc is rather dramatic. (Since the handout network simulator is reliable and stable—it never even varies link delay—and since servers never fail in the handout code, only aggressive incorrectness can demonstrate a consensus failure.) In your code, you should aim for subtler, more realistic mistakes in your broken consensus programs. Here are some examples.

  1. Disable or weaken your failure detector.

    CT consensus requires that, among working servers’ failure detectors, there is a time after which (1) every failed server is always flagged by every failure detector (without backsliding), and (2) some correct server is never flagged by any failure detector. (The CT consensus paper calls these properties “strong completeness” and “eventual weak accuracy.”)

    The handout code’s failure detector is very simple: a 1.5s timeout on leader messages. That meets these requirements when messages are delivered relatively quickly, which they are in the handout code. But you could easily break this failure detector. For instance, 10% of the time, the failure detector could implement a 15-minute timeout—too long for Nancy to wait.

  2. Drop certain retransmission messages.

    CT consensus is very chatty; a network with $$N$$ servers sends $$N^2$$ DECIDE messages. Maybe you could avoid some of these? Well, try it and see what happens.

  3. Skip internal state updates.

    CT consensus tracks both a current color and a current “color round,” the time a color was most recently updated. The color round is updated in two places, once by the leader based on PREPARE message contents and once when a server receives a PROPOSE message. But do we really need to update the color round in both places? Or can we just always use the current round? Delete one or both updates and see.

Be creative! I will be very excited to see a realistic failure that was hard to reproduce. In addition, this form of active counterexample search can be very helpful for understanding why a protocol is the way it is.

Pitfalls

Deterministic network simulation is both powerful and fragile. Changing a single constant, or drawing from the random number generator one additional time, can totally change the behavior produced by a given random seed. So make sure you take a snapshot of your code every time you find an interesting seed. git tag and git branch make this relatively easy. When you find something, check that the seed is reproducible, document it in NOTEBOOK.md or TURNIN.md, commit your code (make sure you’ve added any necessary files), and tag the result.

Part of the art here lies in selecting good ranges of failure probability and network variability. You don’t want to set them too high, or a seed search will never test the normal behavior; and you don’t want to set them too low, or interesting behavior won’t manifest. For context, the 0.1% probability of stubbornness in ctstubborn.cc is too low to trigger errors quickly.

Consider adding more command line options so you can turn features on and off. For instance, rather than copying the whole ctconsensus.cc file, you could add command line options that turn on specific errors. If you do this, make sure your turnin documents all the options necessary to replicate a failure.