Systems seen so far
- MPI
- p4
- MapReduce
- Spark
- DryadLINQ
- Naiad (timely dataflow)
What “is” each system we’ve discussed?
- “MPI is a…standard message passing interface originally designed for writing applications and libraries for distributed memory environments.”
- “p4 is a portable library of C and Fortran subroutines for programming parallel computers.”
- “MapReduce is a programming model and an associated implementation for processing and generating large data sets.”
- “Resilient Distributed Datasets (RDDs) [are] a distributed memory
abstraction that lets programmers perform in-memory computations on large
clusters in a fault-tolerant manner.”
- “…motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.”
- “Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters.”
- “DryadLINQ is a system and a set of language extensions that enable a
new programming model for large scale distributed computing.”
- “It generalizes previous execution environments…by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets”
- “We therefore believe that the language issues addressed in this paper are currently among the most pressing research areas for data-intensive computing”
- “Timely dataflow is a computational model that attaches virtual
timestamps to events in structured cyclic dataflow graphs.”
- “[It] supports stateful iterative and incremental computations [and] enables both low-latency stream processing and high-throughput batch processing.”
What are the central contributions of these systems?
- MPI: Message passing library, portable typed messages
- p4: Same plus monitors (locking) and abstractions (p4_askfor = task queues)
- MapReduce: Map/reduce programming model
- Spark: persisted in-memory data sets with lineage
- DryadLINQ: optimization framework
- Timely dataflow: ???
- Timely dataflow: partially-ordered timestamps for fast incremental computation
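To make “partially-ordered timestamps” concrete, here is a minimal sketch, assuming a timestamp is an input epoch plus one counter per enclosing loop, and that one timestamp precedes another only when both the epoch and the counter sequence agree. The names `Timestamp`, `less_equal`, and `comparable` are hypothetical and not taken from any of the systems above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Timestamp:
    """Illustrative timestamp: an input epoch plus one counter per enclosing loop."""
    epoch: int
    counters: Tuple[int, ...] = ()

    def less_equal(self, other: "Timestamp") -> bool:
        # Assumed rule: the epoch and the loop counters must both be <=.
        # Python compares tuples lexicographically, which covers the counters.
        return self.epoch <= other.epoch and self.counters <= other.counters


def comparable(a: Timestamp, b: Timestamp) -> bool:
    """True when one timestamp is ordered before (or equal to) the other."""
    return a.less_equal(b) or b.less_equal(a)


# Same epoch, later loop iteration: ordered.
early = Timestamp(epoch=1, counters=(0,))
later = Timestamp(epoch=1, counters=(3,))

# Later epoch but earlier iteration: neither precedes the other.
crossed_a = Timestamp(epoch=1, counters=(2,))
crossed_b = Timestamp(epoch=2, counters=(1,))
```

Under these assumptions `early` precedes `later`, while `crossed_a` and `crossed_b` are incomparable, so work at one need not wait for work at the other. That independence is plausibly where the room for fast incremental computation comes from.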
What are the secondary contributions of these systems?
- MPI: application topologies, collective communication, reducers
- p4: clock routines, error messages, tracing, arguably p4_askfor (“The most useful monitor of all”)
- MapReduce: locality, ‘battle hardened’ mechanisms for debuggability and resilience (stragglers, record skipping, local execution, status information, counters)
- Spark: slick Scala interface, interactive code distribution
- DryadLINQ: dynamic code generation for e.g. serialization code, annotations for communicating programmer intent, Apply
- Timely dataflow: ???
- Timely dataflow: global progress tracker, optimizations to accumulate updates?
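Two of the MapReduce mechanisms listed above, record skipping and counters, can be sketched together. This is a single-process simplification and not the real implementation: it assumes the “master” is just a dictionary of failures per record, and that a record is skipped once it has caused more than one failed attempt. The names `run_map_task`, `failures`, and `counters` are hypothetical.

```python
from collections import Counter


def run_map_task(records, map_fn, max_attempts=5):
    """Re-run a map task, skipping records that have failed more than once."""
    failures = Counter()  # stand-in for master state: failures per record index
    counters = Counter()  # stand-in for user-visible job counters

    for attempt in range(max_attempts):
        # Records seen to fail more than once are skipped on this attempt.
        skip = {i for i, n in failures.items() if n > 1}
        output = []
        current = None
        try:
            for i, record in enumerate(records):
                if i in skip:
                    continue
                current = i  # remember which record is being processed
                output.extend(map_fn(record))
        except Exception:
            # Report the offending record, then re-execute the whole task.
            failures[current] += 1
            continue
        counters["skipped_records"] = len(skip)
        counters["attempts"] = attempt + 1
        return output, counters

    raise RuntimeError("map task failed too many times")


def split_words(line):
    # Raises AttributeError on the corrupt (None) record.
    return [(word, 1) for word in line.split()]


pairs, stats = run_map_task(["a b", None, "c"], split_words)
```

Here the corrupt second record fails twice, is skipped on the third attempt, and shows up in the counters as one skipped record.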
More questions
- What aspects of these systems have been overtaken by events?
- Which papers were easiest to read and understand?
- Which papers were most exciting?
Expressing one system in another system’s terms
- MapReduce:
- MapReduce in DryadLINQ:
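As a rough illustration of expressing MapReduce in LINQ-style terms, the sketch below composes three operators: flatten the mapper output, group by key, then apply the reducer to each group. It is plain Python, not DryadLINQ code, and the helpers `select_many`, `group_by`, and `map_reduce` are hypothetical stand-ins for the corresponding query operators.

```python
def select_many(source, f):
    """Apply f to each item and flatten the results (the 'map' phase)."""
    for item in source:
        yield from f(item)


def group_by(source, key):
    """Collect items that share a key (the 'shuffle' phase)."""
    groups = {}
    for item in source:
        groups.setdefault(key(item), []).append(item)
    return groups.items()


def map_reduce(source, mapper, key_selector, reducer):
    """MapReduce as a composition of three dataset operators."""
    mapped = select_many(source, mapper)
    groups = group_by(mapped, key_selector)
    return [reducer(k, values) for k, values in groups]


# Word count: map each line to words, group by word, count each group.
lines = ["a b a", "b c"]
counts = map_reduce(
    lines,
    mapper=lambda line: line.split(),
    key_selector=lambda word: word,
    reducer=lambda word, words: (word, len(words)),
)
```

Written this way, MapReduce is one fixed three-operator query, which makes it easier to ask which of a more general system's optimizations correspond to things MapReduce hard-codes.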
Implementation
- Which parts of MapReduce implement these DryadLINQ optimizations?