What is virtualization?
- “Virtual machine” is an overloaded term (see the “Java virtual machine”)
- Abstraction that does not significantly change the underlying interface
- In operating systems, a virtual machine is a software abstraction that
emulates a processor and associated device hardware
- Usually emulates the same processor model as the underlying hardware
Why virtualization?
- Testing and debugging
- Efficiency
- Security
- IT security policies would require that multiple services, such as email and web serving, run on different hardware
- Limits damage if one server’s operating system is compromised
- Requires a lot of hardware!
- Virtual machines let multiple virtual computers run on the same hardware (cheaper)
Virtualization example: Bochs
- Bochs is an x86 emulator that implements the x86-64 instruction set
- Instructions are interpreted in software
class BOCHSAPI BX_CPU_C : public logfunctions {
public: // for now...
  unsigned bx_cpuid;
  ...
  // General register set
  // rax: accumulator
  // rbx: base
  // rcx: count
  // rdx: data
  // rbp: base pointer
  // rsi: source index
  // rdi: destination index
  // rsp: stack pointer
  // r8..r15 x86-64 extended registers
  // rip: instruction pointer
  // ssp: shadow stack pointer
  // tmp: temp register
  // nil: null register
  bx_gen_reg_t gen_reg[BX_GENERAL_REGISTERS+4];
  ...
  BX_SMF void ADD_GqEqR(bxInstruction_c *) BX_CPP_AttrRegparmN(1);
  ...
};
void BX_CPP_AttrRegparmN(1) BX_CPU_C::ADD_GqEqR(bxInstruction_c *i)
{
  Bit64u op1_64, op2_64, sum_64;
  op1_64 = this->gen_reg[i->dst()].rrx;
  op2_64 = this->gen_reg[i->src()].rrx;
  sum_64 = op1_64 + op2_64;
  this->gen_reg[i->dst()].rrx = sum_64;
  SET_FLAGS_OSZAPC_ADD_64(op1_64, op2_64, sum_64);
  this->prev_rip = this->gen_reg[BX_64BIT_REG_RIP].rrx;
  BX_INSTR_AFTER_EXECUTION(BX_CPU_ID, i);
  this->icount++;
  if (this->async_event) return;
  ++i;
  BX_INSTR_BEFORE_EXECUTION(BX_CPU_ID, i);
  this->gen_reg[BX_64BIT_REG_RIP].rrx += i->ilen();
  return (this->*(i->execute1)) (i);
}
- Can this handle any CPU?
- Is this fast?
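The interpretation pattern in the Bochs excerpt can be reduced to a short sketch. The toy instruction set and the names `Op`, `Insn`, `ToyCpu`, and `run` are hypothetical and are not Bochs's. The point is that guest registers live in host memory, and every guest instruction costs a fetch, a dispatch, and several host memory operations. That makes pure interpretation portable across host CPUs but slow.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy ISA, invented for illustration (not Bochs's real decoder).
enum class Op { Add, LoadImm, Halt };

struct Insn {
    Op op;
    unsigned dst;      // destination register index
    unsigned src;      // source register index (Add only)
    std::uint64_t imm; // immediate value (LoadImm only)
};

struct ToyCpu {
    std::uint64_t reg[4] = {0, 0, 0, 0}; // guest registers live in host memory
    std::uint64_t pc = 0;                // guest instruction pointer
    std::uint64_t icount = 0;            // instructions executed so far
};

// Interpret until Halt or end of program: fetch, dispatch, update guest state.
inline void run(ToyCpu& cpu, const std::vector<Insn>& program) {
    while (cpu.pc < program.size()) {
        const Insn& i = program[cpu.pc++];
        assert(i.dst < 4 && i.src < 4);  // toy machine has four registers
        ++cpu.icount;
        switch (i.op) {
        case Op::Add:     cpu.reg[i.dst] += cpu.reg[i.src]; break;
        case Op::LoadImm: cpu.reg[i.dst] = i.imm;           break;
        case Op::Halt:    return;
        }
    }
}
```

Running a four-instruction program that loads 40 and 2, adds them, and halts leaves 42 in `reg[0]`. A single guest `Add` turns into a bounds check, a counter update, a switch dispatch, and two array accesses on the host.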
History of virtualization
- “[B]efore the introduction of VMware, engineers from Intel Corporation were convinced their processors could not be virtualized in any practical sense” [HSSV p25].
- Why not?
Popek–Goldberg virtualization
Theorem 1. A virtual machine monitor may be constructed for an architecture in which every sensitive instruction is privileged.
Theorem 3. A hybrid VMM may be constructed for an architecture in which every user-sensitive instruction is privileged.
“Formal requirements for virtualizable third generation architectures.” Gerald J. Popek and Robert P. Goldberg. Communications of the ACM 17(7), July 1974.
- Intel x86 is not such an architecture!
- Has user-sensitive instructions that aren’t privileged
- People thought this meant Intel could not be practically virtualized
Virtual machine monitor
- The equivalent of a kernel for a virtual machine
- On a computer supporting virtualization, the VMM, not the kernel, has
fully privileged access to machine resources
- Kernels can access machine resources only as allowed by the VMM
- In security terms, VMM is to kernel as kernel is to process
- VMM allows safe sharing of underlying machine among distinct operating systems, according to policy
Sensitive and privileged instructions
- Privileged state is any processor state that represents the current
processor privilege level
- Example: in x86-64, the %cs register
- A privileged instruction can only be executed when the machine is in
privileged mode (e.g., x86-64 CPL 0—kernel mode)
- When executed in user mode, a privileged instruction traps (transfers control to the kernel)
- A sensitive instruction is an instruction that observes or modifies privileged machine state
- A user-sensitive instruction is sensitive when executed in user/unprivileged mode
- An innocuous instruction is not sensitive
How do instruction types relate to the VMM?
- Fundamentally, a VMM executes the kernel in unprivileged mode
- Goal is to exactly emulate the hardware
- This means the kernel must not be able to detect it is running on a VM
- A Popek–Goldberg VMM is required to satisfy:
- The efficiency property: All innocuous instructions are executed directly on the hardware (without software emulation)
- The resource control property: Guests cannot control hardware resources
- The equivalence property: Guests cannot distinguish whether they are running directly on hardware or on a VMM
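The three properties are often summarized as "trap and emulate". The sketch below is a hypothetical model, not code from any real VMM: the guest kernel runs deprivileged, innocuous instructions run directly (modeled here by a counter), and privileged instructions trap to the VMM, which applies their effect to per-guest virtual state only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of trap-and-emulate; none of these names come from a
// real VMM.
enum class GuestInsn { AddRegs, DisableInterrupts, EnableInterrupts };

struct VirtualCpu {
    bool virtual_if = true;     // the guest's view of the interrupt flag
    std::uint64_t direct = 0;   // instructions run directly on hardware
    std::uint64_t emulated = 0; // instructions that trapped to the VMM
};

// True if the instruction is privileged and therefore traps when the
// guest kernel runs in unprivileged mode.
inline bool traps_in_user_mode(GuestInsn i) {
    return i == GuestInsn::DisableInterrupts ||
           i == GuestInsn::EnableInterrupts;
}

// The VMM's trap handler: apply the instruction's effect to the virtual
// CPU state only, never to the real interrupt flag (resource control).
inline void vmm_emulate(VirtualCpu& vcpu, GuestInsn i) {
    assert(traps_in_user_mode(i)); // the VMM only sees trapped instructions
    if (i == GuestInsn::DisableInterrupts) vcpu.virtual_if = false;
    if (i == GuestInsn::EnableInterrupts)  vcpu.virtual_if = true;
    ++vcpu.emulated;
}

// One step of guest execution under the VMM.
inline void step(VirtualCpu& vcpu, GuestInsn i) {
    if (traps_in_user_mode(i)) {
        vmm_emulate(vcpu, i);   // sensitive and privileged: trap, then emulate
    } else {
        ++vcpu.direct;          // innocuous: runs at native speed (efficiency)
    }
}
```

Equivalence holds in this model because the guest only ever reads `virtual_if`, which behaves like the real flag would on bare hardware.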
Virtualization theorems
Theorem 1. A virtual machine monitor may be constructed for an architecture in which every sensitive instruction is privileged.
Theorem 3. A hybrid VMM may be constructed for an architecture in which every user-sensitive instruction is privileged.
[Hybrid VMMs relax the efficiency property; the VMM may emulate, rather than execute, innocuous instructions, but only if the guest is in kernel mode.]
- Unfortunately, in the Intel x86 architecture, user-sensitive instructions
are not privileged!
- Some instructions executable in unprivileged mode can observe or modify privileged state
- If you accept Popek–Goldberg’s definitions, this makes x86 VMMs impossible
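The hypotheses of the two theorems are simple predicates over an instruction set, which a short sketch makes concrete. The `InsnClass` record and both function names are hypothetical. The flags follow the definitions of privileged, sensitive, and user-sensitive given in these notes.

```cpp
#include <string>
#include <vector>

// Hypothetical per-instruction description used only for this sketch.
struct InsnClass {
    std::string name;
    bool privileged;     // traps when executed in user mode
    bool sensitive;      // observes or modifies privileged state
    bool user_sensitive; // sensitive when executed in user mode
};

// Theorem 1 hypothesis: every sensitive instruction is privileged.
inline bool satisfies_theorem1(const std::vector<InsnClass>& isa) {
    for (const InsnClass& i : isa)
        if (i.sensitive && !i.privileged) return false;
    return true;
}

// Theorem 3 hypothesis: every user-sensitive instruction is privileged.
inline bool satisfies_theorem3(const std::vector<InsnClass>& isa) {
    for (const InsnClass& i : isa)
        if (i.user_sensitive && !i.privileged) return false;
    return true;
}
```

With illustrative flag assignments, a table holding only `add` (innocuous) and `lgdt` (privileged and sensitive) passes both checks. Adding `sgdt`, which reads privileged state from user mode without trapping, makes both fail. That is the x86 situation described above.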
The evil 17 instructions
- pushf, popf, and iret offer access to the interrupt flag
- lar, verr, verw, and lsl offer visibility into segment descriptors
- pop [seg], push [seg], and mov [seg] manipulate segment descriptors
- sgdt, sldt, sidt, and smsw offer read-only access to privileged state
- far call, long jmp, far ret, str, and int N are protected control transfer instructions that are also sometimes safe
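popf shows why these instructions defeat trap-and-emulate. The sketch below is a simplified model of popf in protected mode, covering only the carry and interrupt flags. The names `Flags` and `popf` are hypothetical. When the current privilege level is numerically greater than IOPL, the interrupt flag in the popped value is ignored and no trap is raised. A deprivileged guest kernel that tries to disable interrupts therefore fails silently, and the VMM never finds out.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t FLAG_CF = 1u << 0; // carry flag
constexpr std::uint64_t FLAG_IF = 1u << 9; // interrupt-enable flag

struct Flags {
    std::uint64_t rflags = FLAG_IF; // interrupts enabled initially
};

// Simplified model of popf in protected mode: arithmetic flags always
// load, but IF is loaded only when CPL <= IOPL. Otherwise the IF bit of
// the popped value is ignored and no trap occurs.
inline void popf(Flags& f, std::uint64_t value, int cpl, int iopl) {
    assert(cpl >= 0 && cpl <= 3 && iopl >= 0 && iopl <= 3);
    std::uint64_t writable = FLAG_CF;
    if (cpl <= iopl) writable |= FLAG_IF;
    f.rflags = (f.rflags & ~writable) | (value & writable);
}
```

At CPL 0, popping a value with IF clear disables interrupts. At CPL 3 with IOPL 0, the same instruction leaves IF set and still updates the carry flag. Software running there can also tell that its privilege level is not what it expects.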
Virtualization in practice, not theory
- P–G: Instructions execute either in emulation (an interpreter) or directly, on the hardware
- VMware: Instructions are compiled
- Dynamic binary translation translates guest kernels into code that can run directly and safely on hardware
- P–G: The VMM must be indistinguishable from the hardware
- VMware: Meh
- Irritating parts of the x86 architecture are simply not supported
- “Unsupported requests [that never happen on any supported guest] simply abort execution” [HSSV]
- Uninteresting aspects of privileged state are simply exposed
- “Fortunately, even Intel’s manual describes [sgdt, etc.] as available but not useful to applications”
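The core idea of dynamic binary translation can be sketched over textual mnemonics. This is a heavily simplified, hypothetical model: a real translator works on machine code, handles operands and control flow, and caches translated blocks. The `vmm_emulate_` helper names are invented for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical list of mnemonics the translator must not run directly.
inline bool is_sensitive(const std::string& mnemonic) {
    static const std::vector<std::string> sensitive = {"popf", "pushf", "sgdt"};
    for (const std::string& s : sensitive)
        if (mnemonic == s) return true;
    return false;
}

// Translate one guest basic block. Innocuous instructions are copied
// unchanged; sensitive ones become calls to a hypothetical VMM helper.
inline std::vector<std::string>
translate_block(const std::vector<std::string>& guest) {
    std::vector<std::string> out;
    for (const std::string& insn : guest) {
        if (is_sensitive(insn)) out.push_back("call vmm_emulate_" + insn);
        else                    out.push_back(insn);
    }
    assert(out.size() == guest.size()); // one output per guest instruction
    return out;
}
```

Translating the block `add`, `popf`, `ret` yields `add`, `call vmm_emulate_popf`, `ret`. The output runs directly on hardware, and the one instruction that would have misbehaved now reaches the VMM without needing a trap.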
Dynamic translation for everyone
- VMware initiated a VM revolution
- VMM can implement fascinating system optimizations
- Example: Memory compression
- All system memory pages that contain all 0s are shared!
- New OS interface: Communication between VMM and guest OS
- Paravirtualization
- New processor interface
- Dynamic binary translation is difficult and dangerous
- Can hardware vendors help?
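The zero-page sharing mentioned in the list above can be sketched as VMM-side bookkeeping. Everything here is a hypothetical model (`HostMemory`, `map_guest_page`, and the pool layout are assumptions): each guest page maps to an index in a pool of host pages, and every all-zero guest page maps to one shared host page.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPageSize = 4096;
using Page = std::array<std::uint8_t, kPageSize>;

// Hypothetical bookkeeping: pool[0] is a single shared zero page.
struct HostMemory {
    std::vector<Page> pool;             // host pages backing guest memory
    std::vector<std::size_t> guest_map; // guest page number -> pool index
    HostMemory() : pool(1, Page{}) {}   // Page{} is zero-filled
};

inline bool is_all_zero(const Page& p) {
    for (std::uint8_t b : p)
        if (b != 0) return false;
    return true;
}

// Back a guest page: all-zero contents share pool[0]; anything else gets
// its own host page. Returns the pool index used.
inline std::size_t map_guest_page(HostMemory& m, const Page& contents) {
    assert(!m.pool.empty());            // the shared zero page must exist
    std::size_t index = 0;
    if (!is_all_zero(contents)) {
        m.pool.push_back(contents);
        index = m.pool.size() - 1;
    }
    m.guest_map.push_back(index);
    return index;
}
```

Mapping two zero pages and one non-zero page uses two host pages instead of three. The sketch omits the copy-on-write step that a real VMM would need when a guest writes to a shared page.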
Intel VT-x
- VT-x introduces a new privilege mode for VMMs
- Root machine privilege
- Non-root privilege is used for any guest OS
- Root machine privilege is orthogonal to CPL
- Kernel can run with or without root privilege
- User-level process can run with or without root privilege
- New registers store root-privilege information, including saved register state for the different privilege modes
- New root-privilege instructions, e.g. vmxon, vmxoff, vmlaunch, vmresume, manage guests
- New instruction usable without root privilege, vmcall, triggers a VM exit (#vmexit), allowing guests to communicate with the VMM
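The two-axis privilege model can be written down directly. This is a hypothetical sketch, not a hardware interface: `CpuMode`, `guest_call`, and `resume_guest` are invented names. It models a guest-initiated exit and the VMM resuming a guest.

```cpp
#include <cassert>

// Hypothetical model of VT-x privilege: root/non-root is a separate axis
// from the current privilege level (CPL).
struct CpuMode {
    bool root; // true: VMM ("root") operation; false: guest ("non-root")
    int cpl;   // 0 = kernel mode, 3 = user mode
};

enum class ExitReason { GuestCall, None };

// Model of a guest requesting VMM service: the processor leaves non-root
// operation and the VMM resumes in root mode at CPL 0.
inline ExitReason guest_call(CpuMode& mode) {
    if (mode.root) return ExitReason::None; // only modeled for guests
    mode = CpuMode{true, 0};
    return ExitReason::GuestCall;
}

// Model of the VMM resuming a guest: switch to non-root operation at the
// guest's saved CPL.
inline void resume_guest(CpuMode& mode, int guest_cpl) {
    assert(guest_cpl >= 0 && guest_cpl <= 3);
    mode = CpuMode{false, guest_cpl};
}
```

All four combinations of `root` and `cpl` in {0, 3} are meaningful states. A guest kernel runs at CPL 0 without root privilege, so privileged instructions that would be unsafe can be made to exit to the VMM instead.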
Cost of #vmexit
Architecture | Cost (cycles) |
---|---|
Prescott (2005) | 1926 |
Merom (2006) | 1156 |
Penryn (2008) | 858 |
Westmere (2010) | 569 |
Sandy Bridge (2011) | 507 |
Ivy Bridge (2012) | 466 |
Haswell (2013) | 512 |
Broadwell (2014) | 531 |
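To read these numbers as overhead, multiply the per-exit cost by an exit rate and divide by the clock rate. The helper below is a hypothetical back-of-the-envelope calculation. The exit rate and clock frequency are assumed inputs, not measurements from the source.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of CPU time consumed by exits, given the per-exit cost from the
// table. exits_per_second and clock_hz are assumed workload and machine
// parameters.
inline double exit_overhead_fraction(std::uint64_t cost_cycles,
                                     std::uint64_t exits_per_second,
                                     std::uint64_t clock_hz) {
    assert(clock_hz > 0);
    return static_cast<double>(cost_cycles) *
           static_cast<double>(exits_per_second) /
           static_cast<double>(clock_hz);
}
```

Assuming 100,000 exits per second on a 3 GHz core, the Prescott cost of 1926 cycles consumes about 6.4% of the CPU, while the Broadwell cost of 531 cycles consumes about 1.8%.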