Our survey of many distributed computing infrastructures has touched on many systems that differ on many systems implementation dimensions:
- What scale of deployment is supported?
- How much data is transferred per message?
- Which entities perform scheduling decisions?
- How often must one job wait for another?
- What kind of consistency is required?
- How is fault tolerance ensured?
- How much memory is required?
- How much stable storage is required?
- How fast is network communication?
- How is memory and storage reclaimed?
How can we compare these systems or understand these systems in a concrete way without deploying them? How can we play with their parameters for problems and explore possible combinations?
Simulators
One common approach to questions like these is to build a simulator. A simulator aims to estimate the behavior of a system on a specific problem.
There are many kinds of simulator. You may be familiar with the
“cycle-accurate” simulators common in the architecture world. These simulators
aim to exactly model a system like a processor; they can be pretty precise,
but are far slower than the systems they model. For this problem we instead
want a discrete event simulator that models high-level behavior one
event at a time. Such a simulator might, for example, model a network as
an all-to-all communication fabrid that can support transfers of up to a
specific bandwidth. When a simulated node begins a transfer of size N
bytes,
the simulator would guess that the transfer ended N/bandwidth
seconds later.
Discrete event simulators are not precise, but they can be very useful for
comparing deployment ideas to within an order of magnitude.
Phase 1: Design
In Phase 1 of this project, please meet in groups to produce a tentative high-level design for a distributed computing infrastructure simulator.
That’s an extremely open-ended question, of course. We want to keep the question open ended, but here are some goals for your simulator.
-
Your simulator should support the specification of different communication patterns.
-
Your simulator should support the modeling of MapReduce computation patterns.
-
If given parameters that represent the Grep and Sort tasks from the MapReduce paper (§5.2–5.3), your simulator should also find that the Grep task completes much faster than the Sort task (~10x).
-
Your simulator should support at least one other kind of computation pattern—for example, the Spark PageRank computation, or a Dask blocked-array computation.
-
-
Your simulator should have enough detail to model some of the features of clusters that we have read about, such as:
- Stragglers.
- Failures and failure recovery.
- Memory limitations.
- Heterogeneous hardware.
- Network bandwidth limitations (i.e., a connection, or the network as a whole, has limited capacity).
- Host bandwidth limitations (i.e., a host has limited network capacity).
- Collocated storage.
It would be extremely easy to overbuild a simulator. Simulators don’t have to be complicated or precise to be useful. We want these simulators to help us understand distributed computation patterns in a more concrete way, and to help us play.
We’ll discuss our designs on Wednesday, February 16, then start implementing them!