Lecture 23: Virtual machines

This presentation was influenced by HSSV: Hardware and Software Support for Virtualization (Synthesis Lectures on Computer Architecture), Edouard Bugnion, Jason Nieh, and Dan Tsafrir, Morgan & Claypool, 2017.

Virtualization

What is a virtual machine? Virtualization is in some ways a general concept, a kind of abstraction or enforced modularity. We might loosely describe a process as “a virtual computer”: the operating system provides an interface to the process that abstracts all important features of the machine (CPU, memory, hardware devices via system calls).

But it's useful to distinguish virtualization from general forms of abstraction and layering. Virtualization involves adding a layer of enforced modularity in which the exposed higher-level interface is exactly the same as the lower-level interface. The modularity is enforced, meaning the higher-level software cannot get around it.

Popek-Goldberg

In earlier days systems researchers proved theorems more than they do now! Theorems are good for precision, for enhancing understanding, and for getting your name in the history books. They aren't always good for advancing the field. Here’s roughly the theorem Popek and Goldberg proved (quoted/summarized from their CACM article).

“Formal requirements for virtualizable third generation architectures.” Gerald J. Popek and Robert P. Goldberg. Communications of the ACM 17(7), July 1974.

We distinguish several classes of instruction.

A privileged instruction can only be executed when the machine is in privileged mode (e.g., x86-64 CPL 0—kernel mode). When executed in user mode, a privileged instruction must trap (transfer control to the control program [the kernel]).
A sensitive instruction is an instruction that observes or modifies privileged machine state, which is any state that can be used to change the current processor’s privilege level.
- A user-sensitive instruction is sensitive when executed in user/unprivileged mode.
A safe, or innocuous, instruction is not sensitive.

A virtual machine monitor is a control program that can control the execution of a guest program satisfying three requirements:

The efficiency property. All safe guest instructions are executed by the hardware directly.
The resource control property. The guest cannot affect control program resources.
The equivalence property. Guests cannot distinguish whether they are running directly on the hardware or atop the control program.

Theorem 1. A virtual machine monitor may be constructed for an architecture in which every sensitive instruction is privileged.

The proof involves constructing a VMM for the hypothetical architecture, in which the control program runs the guest in unprivileged mode. The guest then traps to the control program whenever it's about to execute a sensitive instruction. The VMM can then interpret that instruction!
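The trap-and-emulate idea behind the proof can be sketched in a few lines of Python. This is a toy model, not anything from the Popek-Goldberg paper: the two-instruction ISA (`add`, `set_if`), the `Trap` exception, and the `VirtualCPU` fields are all made up for illustration. Safe instructions run on the "hardware" directly; the privileged one traps, and the VMM interprets it against the guest's virtual state.

```python
from dataclasses import dataclass

# Hypothetical toy ISA: "add" is safe; "set_if" (set the interrupt flag)
# is sensitive AND privileged, so it traps when run in unprivileged mode.
PRIVILEGED = {"set_if"}


class Trap(Exception):
    """Raised by the toy hardware when unprivileged code runs a privileged op."""

    def __init__(self, op, arg):
        super().__init__(op)
        self.op, self.arg = op, arg


@dataclass
class VirtualCPU:
    acc: int = 0                 # guest-visible register
    interrupts_on: bool = False  # the guest's *virtual* interrupt flag
    trap_count: int = 0          # how often the VMM had to step in


def hardware_execute(vcpu, op, arg):
    """Toy 'hardware': the guest always runs in unprivileged mode."""
    if op in PRIVILEGED:
        raise Trap(op, arg)
    if op == "add":
        vcpu.acc += arg
    else:
        raise ValueError(f"unknown instruction {op!r}")


def vmm_emulate(vcpu, op, arg):
    """The VMM interprets the trapped instruction on virtual state only."""
    vcpu.trap_count += 1
    if op == "set_if":
        vcpu.interrupts_on = bool(arg)


def run_guest(program):
    vcpu = VirtualCPU()
    for op, arg in program:
        try:
            hardware_execute(vcpu, op, arg)   # efficiency: safe ops run directly
        except Trap as t:
            vmm_emulate(vcpu, t.op, t.arg)    # resource control: VMM decides
    return vcpu


vcpu = run_guest([("add", 2), ("set_if", 1), ("add", 3)])
print(vcpu.acc, vcpu.interrupts_on, vcpu.trap_count)
```

The guest sees its interrupt flag change, but the real flag is never touched. If `set_if` were sensitive but did not trap, the VMM would never get the chance to intervene, which is exactly the failure the theorem's condition rules out.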

A hybrid virtual machine monitor is like a VMM, except that the efficiency property is relaxed. A Popek-Goldberg VMM must execute all safe instructions on the hardware; a hybrid VMM may instead interpret safe instructions while the guest is running in its (virtual) privileged mode.

Theorem 3. A hybrid VMM may be constructed for an architecture in which every user-sensitive instruction is privileged.

Again, the proof is constructive.
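Both theorems boil down to set-inclusion conditions, which a small sketch can make concrete. The instruction names below are a hypothetical toy ISA, not a real architecture: `return_to_user` changes the privilege level when run in privileged mode but is a harmless no-op in user mode, so it is sensitive without being user-sensitive.

```python
def virtualizable(sensitive, privileged):
    """Theorem 1 condition: every sensitive instruction is privileged."""
    return set(sensitive) <= set(privileged)


def hybrid_virtualizable(user_sensitive, privileged):
    """Theorem 3 condition: every user-sensitive instruction is privileged."""
    return set(user_sensitive) <= set(privileged)


def offenders(sensitive, privileged):
    """Sensitive instructions that would run without trapping."""
    return sorted(set(sensitive) - set(privileged))


# Hypothetical toy ISA (all names are assumptions for illustration).
privileged = {"load_page_table"}
sensitive = {"load_page_table", "return_to_user"}
user_sensitive = {"load_page_table"}

print(virtualizable(sensitive, privileged))              # False
print(hybrid_virtualizable(user_sensitive, privileged))  # True
print(offenders(sensitive, privileged))                  # ['return_to_user']
```

On this toy ISA a Popek-Goldberg VMM is not guaranteed, but a hybrid VMM is: the one offending instruction only misbehaves in the guest's privileged mode, where a hybrid VMM may interpret instead of executing directly.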

Sad trombone

For decades, no widely-deployed architecture was Popek-Goldberg virtualizable. x86-32, for example, wasn't. Here are the 17 unprivileged instructions that violate Popek-Goldberg:

pushf, popf, and iret offer access to the interrupt flag.
lar, verr, verw, and lsl offer visibility into segment descriptors.
pop [seg], push [seg], and mov [seg] manipulate segment descriptors.
sgdt, sldt, sidt, and smsw offer read-only access to privileged state.
far call, long jmp, far ret, str, and int N are protected control transfer instructions that are also sometimes safe.

“Indeed, before the introduction of VMware, engineers from Intel Corporation were convinced their processors could not be virtualized in any practical sense” [HSSV p25].

VMware and dynamic translation

Popek–Goldberg seemed for decades like a straitjacket, and people just stopped working on VMMs. But the theorem gives a sufficient condition, not a necessary one. And it defines “efficiency” and “equivalence” in perhaps overly strict ways. The virtualization revolution kicked off when researchers noticed that these definitions could be relaxed.

Efficiency: Hybrid VMMs were considered too slow for practical use because in modern systems, kernels execute a lot of instructions; interpreting all those instructions is wicked slow. So VMware used dynamic binary translation to translate guest kernels into code that could run directly and safely on the hardware.
Equivalence: VMware simply dropped support for irritating parts of the x86 architecture. “Any unsupported requests, e.g., attempting to execute code at [CPL 1 or 2], which never happens on any supported guest operating system, would simply abort execution” [HSSV]. VMware also ignored uninteresting violations of equivalence. x86-32 is not even hybrid-virtualizable according to Popek–Goldberg, because some instructions, specifically sgdt, sldt, sidt, and smsw, are user-sensitive but unprivileged. “Fortunately, even Intel’s manual describes [these instructions] as available but not useful to applications” [HSSV]. So VMware just exposes those instructions! Why not.
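Here is a very rough sketch of the dynamic translation idea. It is not VMware's translator; the instruction names, the `Translator` class, and the block addresses are all hypothetical. It shows the two ingredients that matter: sensitive instructions are rewritten into calls that update virtual state, and translated blocks are cached so the translation cost is paid once per block rather than once per execution.

```python
# Toy stand-ins for instructions that touch the interrupt flag.
SENSITIVE = {"cli", "sti"}


class Translator:
    def __init__(self):
        self.cache = {}        # guest block address -> translated block
        self.translations = 0  # how many blocks we actually translated

    def translate(self, block):
        out = []
        for op in block:
            if op in SENSITIVE:
                out.append(("call_vmm", op))  # rewritten to emulate safely
            else:
                out.append(("native", op))    # copied through unchanged
        return out

    def lookup(self, addr, guest_code):
        if addr not in self.cache:            # translate only on first use
            self.cache[addr] = self.translate(guest_code[addr])
            self.translations += 1
        return self.cache[addr]


def run(translator, guest_code, trace):
    """Execute the blocks named in `trace`; return (virtual IF, native op count)."""
    virtual_if = True
    native_ops = 0
    for addr in trace:
        for kind, op in translator.lookup(addr, guest_code):
            if kind == "call_vmm":
                virtual_if = (op == "sti")    # update virtual state only
            else:
                native_ops += 1               # would run directly on hardware
    return virtual_if, native_ops


guest_code = {0x1000: ["mov", "cli", "add"], 0x2000: ["sub", "sti"]}
translator = Translator()
virtual_if, native_ops = run(translator, guest_code, [0x1000, 0x2000, 0x1000])
print(virtual_if, native_ops, translator.translations)
```

Block `0x1000` runs twice but is translated once. That is the difference from the interpreting hybrid VMM: most guest kernel instructions end up executing at native speed.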

After dynamic translation

Fueled by its fast dynamic translation, VMware sold a ton of VMMs, mostly (as far as I know) to facilitate server consolidation. (IT security policies often required separate services, such as email serving and web serving, to run on different hardware; VMMs let them share one physical machine while still running as if on different hardware, reducing hardware costs.) Intel and other chip manufacturers took notice and introduced new virtualization features in their chips. Intel’s version is called VT-x.

How’d they do it? You might think they’d fix the problematic instructions that break Popek-Goldberg virtualization. But that would break backward compatibility. So instead they just introduced a whole new kind of privilege that fits underneath all the existing machinery! This new kind of privilege is managed by instructions including vmxon and vmxoff and a “virtual machine control structure” (VMCS) stored in VMM-managed memory. (You can read about these extensions in Intel’s manuals, Volume 3, chapters 23–33 [December 2017 version].)

When the VM extensions were new, they actually had worse performance than the best dynamic translation versions! The biggest issue was memory virtualization. For safety, all guest-OS page table manipulations must be validated by the VMM. In the initial iterations of the hardware extensions, though, this validation required incredibly expensive traps into and out of VMM mode.

Virtualization techniques: “A Comparison of Software and Hardware Techniques for x86 Virtualization.” Keith Adams and Ole Agesen. In Proc. ASPLOS 2006.

Until recently, the x86 architecture has not permitted classical trap-and-emulate virtualization. Virtual Machine Monitors for x86, such as VMware Workstation and Virtual PC, have instead used binary translation of the guest kernel code. However, both Intel and AMD have now introduced architectural extensions to support classical virtualization.

We compare an existing software VMM with a new VMM designed for the emerging hardware support. Surprisingly, the hardware VMM often suffers lower performance than the pure software VMM. To determine why, we study architecture-level events such as page table updates, context switches and I/O, and find their costs vastly different among native, software VMM and hardware VMM execution.

We find that the hardware support fails to provide an unambiguous performance advantage for two primary reasons: first, it offers no support for MMU virtualization; second, it fails to co-exist with existing software techniques for MMU virtualization. We look ahead to emerging techniques for addressing this MMU virtualization problem in the context of hardware-assisted virtualization.

Several things have happened since then. First, VMM transitions have gotten much, much faster! This table, summarized from HSSV, lists the cost of a VM transition due to a page fault (“vmexit/#PF”) across several generations of Intel’s microarchitecture:

Architecture	Cost (likely in cycles)
Prescott (2005)	1926
Merom (2006)	1156
Penryn (2008)	858
Westmere (2010)	569
Sandy Bridge (2011)	507
Ivy Bridge (2012)	466
Haswell (2013)	512
Broadwell (2014)	531

Second, the hardware vendors introduced MMU virtualization. A VMM can install its own page tables, which virtualize the “physical” memory addresses visible to guests! That is, a guest application uses virtual addresses; the guest OS defines a translation from virtual to “guest physical” addresses using hardware-interpreted page tables; and the VMM defines a translation from those “guest physical” to true, host physical addresses using another set of hardware-interpreted page tables. This feature in general is called Second Level Address Translation. AMD introduced an implementation relatively early on; Intel’s implementation is called Extended Page Tables.
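The two translation stages can be sketched as follows. To keep it short, each page table is modeled as a flat Python dict from page number to page number; real hardware uses multi-level tables, and every name here (`guest_pt`, `ept`, the addresses) is an assumption for illustration.

```python
PAGE_SIZE = 4096


def translate(table, addr, what):
    """Translate one address through a flat page-number -> page-number map."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise LookupError(f"{what} fault at {addr:#x}")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(guest_pt, ept, gva):
    gpa = translate(guest_pt, gva, "guest page")  # stage 1: guest OS's tables
    hpa = translate(ept, gpa, "EPT")              # stage 2: VMM's tables
    return gpa, hpa


guest_pt = {0x10: 0x2}  # guest virtual page 0x10 -> guest physical page 0x2
ept = {0x2: 0x7F}       # guest physical page 0x2 -> host physical page 0x7f

gpa, hpa = guest_virtual_to_host_physical(guest_pt, ept, 0x10ABC)
print(hex(gpa), hex(hpa))  # 0x2abc 0x7fabc
```

The guest OS controls only `guest_pt`, so whatever it writes there, it can reach only the host pages the VMM has listed in `ept`. A missing `ept` entry corresponds to a fault delivered to the VMM rather than to the guest.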

It’s amazing that this works. The hardware must do a lot more work to translate addresses now: each intermediate page table in a 4-level lookup in the guest’s page table requires another 4-level lookup in the VMM’s EPT, for a quadratic number of lookups overall! But it does work, and so well that newer versions of the best VMM solutions have dropped most support for software-only virtualization.
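To put a number on "quadratic," here is a small counting sketch. It assumes the worst case with no TLB hits and no paging-structure caches, and it counts one memory reference per page table entry read; the function names are made up.

```python
def native_walk_refs(levels):
    """Memory references for a page walk with no virtualization."""
    return levels


def nested_walk_refs(guest_levels, ept_levels):
    """Memory references for a guest page walk under nested paging."""
    refs = 0
    for _ in range(guest_levels):
        refs += ept_levels  # find where this guest table lives in host memory
        refs += 1           # read the guest page table entry itself
    refs += ept_levels      # translate the final guest physical data address
    return refs


print(native_walk_refs(4))     # 4
print(nested_walk_refs(4, 4))  # 24
```

In closed form, with n guest levels and m EPT levels the count is (n + 1)(m + 1) − 1, so 4-level tables on both sides cost up to 24 references instead of 4. Real processors cache intermediate results aggressively, which is a large part of why this is fast in practice.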