Cluster schedulers

Quincy, Mesos, & Borg

What’d you think?

What “is” each work?

“[Quincy is] a powerful and flexible new framework for scheduling concurrent distributed jobs with fine-grain resource sharing.”
“Mesos [is] a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI.”
“Borg is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.”

What is the general focus of each paper?

Quincy: Graph model for general fine-grained scheduling
Mesos: Common coarse-grained API for accessing cluster resources
Borg: Scale? Proving that Borg’s co-tenancy choices are good ones?
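To make Quincy's graph model concrete, here is a minimal sketch of scheduling as min-cost flow. It is heavily simplified and is not the paper's actual graph: it leaves out rack and cluster aggregator nodes, gives every machine a single slot, and adds a super-source. The names `MinCostFlow`, `schedule`, `placement_cost`, and `unscheduled_cost` are hypothetical, and the costs are made-up numbers standing in for data-transfer cost.

```python
from collections import defaultdict


class MinCostFlow:
    """Tiny successive-shortest-path min-cost flow solver (illustrative only)."""

    def __init__(self):
        self.adj = defaultdict(list)  # node -> indices into self.edges
        self.edges = []               # each entry: [head, residual capacity, cost]

    def add_edge(self, u, v, cap, cost):
        """Add edge u->v plus its zero-capacity reverse edge; return the forward index."""
        self.adj[u].append(len(self.edges))
        self.edges.append([v, cap, cost])
        self.adj[v].append(len(self.edges))
        self.edges.append([u, 0, -cost])
        return len(self.edges) - 2

    def solve(self, source, sink):
        """Push as much flow as possible from source to sink; return the total cost."""
        total = 0
        while True:
            dist, prev = {source: 0}, {}
            changed = True
            while changed:  # Bellman-Ford relaxation over residual edges
                changed = False
                for u in list(dist):
                    for ei in self.adj[u]:
                        v, cap, cost = self.edges[ei]
                        if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                            dist[v], prev[v], changed = dist[u] + cost, ei, True
            if sink not in dist:
                return total
            # Find the bottleneck along the cheapest path, then augment along it.
            push, v = float("inf"), sink
            while v != source:
                push = min(push, self.edges[prev[v]][1])
                v = self.edges[prev[v] ^ 1][0]
            v = sink
            while v != source:
                ei = prev[v]
                self.edges[ei][1] -= push
                self.edges[ei ^ 1][1] += push
                total += push * self.edges[ei][2]
                v = self.edges[ei ^ 1][0]


def schedule(placement_cost, unscheduled_cost):
    """placement_cost: {task: {machine: cost}} (hypothetical input format)."""
    mcf = MinCostFlow()
    machines = {m for costs in placement_cost.values() for m in costs}
    choices = []  # (task, target, forward edge index)
    for task, costs in placement_cost.items():
        mcf.add_edge("source", task, 1, 0)
        for machine, cost in costs.items():
            choices.append((task, machine, mcf.add_edge(task, machine, 1, cost)))
        # Every task may also stay unscheduled, at a penalty.
        edge = mcf.add_edge(task, "unscheduled", 1, unscheduled_cost)
        choices.append((task, "unscheduled", edge))
    for machine in machines:
        mcf.add_edge(machine, "sink", 1, 0)  # assumption: one slot per machine
    mcf.add_edge("unscheduled", "sink", len(placement_cost), 0)
    total = mcf.solve("source", "sink")
    # A saturated task->target edge means the task's unit of flow went that way.
    assignment = {t: tgt for t, tgt, ei in choices if mcf.edges[ei][1] == 0}
    return assignment, total


costs = {
    "t1": {"m1": 1, "m2": 5},
    "t2": {"m1": 2, "m2": 4},
    "t3": {"m1": 3, "m2": 6},
}
assignment, total = schedule(costs, unscheduled_cost=10)
print(assignment, total)
```

With three tasks and two single-slot machines, one task has to stay unscheduled. The solver chooses which one by minimizing global cost rather than greedily serving a queue, and that is the contrast with queue-based schedulers that the paper draws.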

What did the papers compare against?

Quincy: Their own “good faith” implementations of queue-based scheduling frameworks
Mesos: Static partitioning
Borg: Other deployments of the hardware with less shared tenancy
What do you think of these comparison points?

Connections to other papers

“a distinguishing feature of the data-intensive clusters we are interested in is that the computers in the cluster have large disks directly attached to them. … high-performance computing clusters traditionally do not have a large quantity of direct-attached storage” [Quincy]

Connections to other papers

The return of MPI!

Connections to other papers

The “sticky slot” problem [Quincy, and a Mesos predecessor]

Connections to other papers

“Quincy [25] is a fair scheduler for Dryad that uses a centralized scheduling algorithm for Dryad’s DAG-based programming model. In contrast, Mesos provides the lower-level abstraction of resource offers to support multiple cluster computing frameworks.” [Mesos]
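The "resource offers" abstraction in the Mesos quote can be illustrated with a toy two-level scheduling round: a master offers free resources, and each framework decides for itself how much to accept. This is a rough sketch under stated assumptions, not the Mesos API. `Offer`, `Framework`, `consider`, and `offer_round` are hypothetical names, only CPUs are modeled, and frameworks are offered resources in a fixed order rather than by any fairness policy.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    node: str
    cpus: int


@dataclass
class Framework:
    name: str
    cpus_per_task: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def consider(self, offer):
        """Framework-level decision: return the CPUs accepted (0 means decline)."""
        n = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.pending_tasks -= n
        self.launched += [offer.node] * n
        return n * self.cpus_per_task


def offer_round(free_cpus, frameworks):
    """Master-level decision: offer each node's free CPUs to frameworks in turn."""
    for node in free_cpus:
        for fw in frameworks:
            if free_cpus[node] == 0:
                break
            free_cpus[node] -= fw.consider(Offer(node, free_cpus[node]))
    return free_cpus


hadoop = Framework("hadoop", cpus_per_task=1, pending_tasks=3)
mpi = Framework("mpi", cpus_per_task=2, pending_tasks=2)
remaining = offer_round({"n1": 4, "n2": 2}, [hadoop, mpi])
print(remaining, hadoop.launched, mpi.launched)
```

The master never learns what a task is or why a framework declined. It sees only accepted and unaccepted resources, which is the sense in which the abstraction is lower-level than Quincy's centralized scheduler.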

Which works could you replicate and why or why not?

Which work has the most interesting core idea?

Which work has the most valuable implementation?

Which work has the most valuable secondary ideas?

Is the Borg scheduler queue-based?

Can you model Mesos as a min-cost flow problem?

Can you model Borg as a min-cost flow problem?

What do these works teach you about simulating cluster behavior?