Project 1: Simulating Distributed Infrastructures

Our survey of many distributed computing infrastructures has touched on many systems that differ on many systems implementation dimensions:

What scale of deployment is supported?
How much data is transferred per message?
Which entities perform scheduling decisions?
How often must one job wait for another?
What kind of consistency is required?
How is fault tolerance ensured?
How much memory is required?
How much stable storage is required?
How fast is network communication?
How is memory and storage reclaimed?

How can we compare these systems or understand these systems in a concrete way without deploying them? How can we play with their parameters for problems and explore possible combinations?

Simulators

One common approach to questions like these is to build a simulator. A simulator aims to estimate the behavior of a system on a specific problem.

There are many kinds of simulator. You may be familiar with the “cycle-accurate” simulators common in the architecture world. These simulators aim to exactly model a system like a processor; they can be pretty precise, but are far slower than the systems they model. For this problem we instead want a discrete event simulator that models high-level behavior one event at a time. Such a simulator might, for example, model a network as an all-to-all communication fabrid that can support transfers of up to a specific bandwidth. When a simulated node begins a transfer of size N bytes, the simulator would guess that the transfer ended N/bandwidth seconds later. Discrete event simulators are not precise, but they can be very useful for comparing deployment ideas to within an order of magnitude.

An example paper that uses an event-driven simulator to explore TensorFlow partitioning and scheduling

Phase 1: Design

In Phase 1 of this project, please meet in groups to produce a tentative high-level design for a distributed computing infrastructure simulator.

That’s an extremely open-ended question, of course. We want to keep the question open ended, but here are some goals for your simulator.

Your simulator should support the specification of different communication patterns.
1. Your simulator should support the modeling of MapReduce computation patterns.
2. If given parameters that represent the Grep and Sort tasks from the MapReduce paper (§5.2–5.3), your simulator should also find that the Grep task completes much faster than the Sort task (~10x).
3. Your simulator should support at least one other kind of computation pattern—for example, the Spark PageRank computation, or a Dask blocked-array computation.
Your simulator should have enough detail to model some of the features of clusters that we have read about, such as:
- Stragglers.
- Failures and failure recovery.
- Memory limitations.
- Heterogeneous hardware.
- Network bandwidth limitations (i.e., a connection, or the network as a whole, has limited capacity).
- Host bandwidth limitations (i.e., a host has limited network capacity).
- Collocated storage.

It would be extremely easy to overbuild a simulator. Simulators don’t have to be complicated or precise to be useful. We want these simulators to help us understand distributed computation patterns in a more concrete way, and to help us play.

We’ll discuss our designs on Wednesday, February 16, then start implementing them!