DryadLINQ and Naiad

Systems seen so far

MPI
p4
MapReduce
Spark
DryadLINQ
Naiad (timely dataflow)

What “is” each system we’ve discussed?

“MPI is a…standard message passing interface originally designed for writing applications and libraries for distributed memory environments.”
“p4 is a portable library of C and Fortran subroutines for programming parallel computers.”
“MapReduce is a programming model and an associated implementation for processing and generating large data sets.”
“Resilient Distributed Datasets (RDDs) [are] a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.”
- “…motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.”
- “Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters.”
“DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing.”
- “It generalizes previous execution environments…by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets”
- “We therefore believe that the language issues addressed in this paper are currently among the most pressing research areas for data-intensive computing”
“Timely dataflow is a computational model that attaches virtual timestamps to events in structured cyclic dataflow graphs.”
- “[It] supports stateful iterative and incremental computations [and] enables both low-latency stream processing and high-throughput batch processing.”

What are the central contributions of these systems?

MPI: Message passing library, portable typed messages
p4: Same as MPI, plus monitors (locking) and abstractions (p4_askfor = task queues)
MapReduce: Map/reduce programming model
Spark: Persisted in-memory data sets with lineage
DryadLINQ: Optimization framework
Timely dataflow: partially-ordered timestamps for fast incremental computation
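To make "partially-ordered timestamps" concrete, here is a minimal sketch of one way such a timestamp could be modeled. It assumes a timestamp is an input epoch plus a tuple of loop counters, and that one timestamp precedes another only when both the epoch and the counters (compared lexicographically) are no larger. The class and method names are hypothetical illustrations, not Naiad's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical timely-dataflow timestamp: (epoch, loop counters)."""
    epoch: int
    counters: Tuple[int, ...] = ()

    def less_equal(self, other: "Timestamp") -> bool:
        # Only timestamps at the same loop nesting depth are compared here.
        if len(self.counters) != len(other.counters):
            raise ValueError("timestamps belong to different loop contexts")
        # Both components must agree; Python compares tuples lexicographically.
        return self.epoch <= other.epoch and self.counters <= other.counters


def comparable(a: Timestamp, b: Timestamp) -> bool:
    """True when the two timestamps are ordered one way or the other."""
    return a.less_equal(b) or b.less_equal(a)


early_epoch_late_iter = Timestamp(epoch=1, counters=(3,))
late_epoch_early_iter = Timestamp(epoch=2, counters=(1,))
same_epoch_next_iter = Timestamp(epoch=1, counters=(4,))
```

Under these assumptions, `early_epoch_late_iter` and `late_epoch_early_iter` are incomparable: neither has to wait for the other, which is what lets work from a later epoch proceed while an earlier epoch is still iterating.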

What are the secondary contributions of these systems?

MPI: application topologies, collective communication, reducers
p4: clock routines, error messages, tracing, arguably p4_askfor (“The most useful monitor of all”)
MapReduce: locality, ‘battle hardened’ mechanisms for debuggability and resilience (stragglers, record skipping, local execution, status information, counters)
Spark: slick Scala interface, interactive code distribution
DryadLINQ: dynamic code generation for e.g. serialization code, annotations for communicating programmer intent, Apply
Timely dataflow: global progress tracker, optimizations to accumulate updates?

Expressing one system in another system’s terms

MapReduce:
MapReduce in DryadLINQ:
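As a rough illustration of the idea, MapReduce can be written as a composition of three LINQ-style operators: a flat map (SelectMany), a grouping (GroupBy), and a per-group projection (Select). The sketch below is a single-machine Python analog, not DryadLINQ code. The helper names `select_many`, `group_by`, `select`, and `word_count` are assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import chain


def select_many(source, fn):
    """Flat map: apply fn to each item and concatenate the results."""
    return chain.from_iterable(fn(item) for item in source)


def group_by(source, key_fn):
    """Collect items into (key, [items]) pairs, like the shuffle phase."""
    groups = defaultdict(list)
    for item in source:
        groups[key_fn(item)].append(item)
    return groups.items()


def select(source, fn):
    """Project each item through fn."""
    return [fn(item) for item in source]


def map_reduce(source, mapper, key_selector, reducer):
    """MapReduce expressed as SelectMany -> GroupBy -> Select."""
    mapped = select_many(source, mapper)
    groups = group_by(mapped, key_selector)
    return select(groups, reducer)


def word_count(lines):
    """Word count written against the map_reduce helper above."""
    return map_reduce(
        lines,
        mapper=lambda line: line.split(),
        key_selector=lambda word: word,
        reducer=lambda group: (group[0], len(group[1])),
    )
```

Writing the computation this way exposes each stage as a separate operator, which is what gives a query optimizer room to rewrite the plan (for example, by aggregating partially before the grouping step).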

Implementation

Which parts of MapReduce implement these DryadLINQ optimizations?

DryadLINQ execution plan for MapReduce
