3. Which versions represent the biggest phase changes, and what happened?
This is subjective, but the biggest phase changes (simultaneous changes to the performance of many operations) seem to be:
- v3.8 → v3.9: Many operations get ~20% faster. Hypothesis: “Missing CPU idle states” (§4.3.3). The kernel learns how to utilize lighter idle states.

  Older CPUs were either on or off, nothing in between. But CPUs are very power-hungry devices, and saving power is important for many reasons, including battery life in consumer devices, energy cost, the environment, and heat dissipation. Modern CPUs offer a continuous tradeoff in power use, aiming to both perform well when necessary and save energy when possible. Depending on power state, a Xeon D-1700 processor is reported to use 25–85 watts, a variance of more than 3x. In the lowest-power states, the processor cannot really do anything, and spinning up to a more capable state may take a while, harming performance; in intermediate-power states, the processor can spin up more quickly. System software requests specific x86-64 processor power states with `mwait` instructions, and with other system registers (see Vol. 3, Ch. 14 in the Intel manuals). So if software only ever requests the worst-performing state, performance will be worse! (A sketch of `mwait`-based idling appears after this list.)
- v3.10 … v3.15: Between these versions, many operations are more than 100% slower. Hypothesis: The “forced context tracking” misconfiguration (§4.3.1). Every kernel crossing gets 400–600ns slower.

  The problem here is unnecessary bookkeeping, related to CPU usage tracking and to garbage collection, that was performed on every core, on every kernel crossing (transition between user and kernel mode). In normal configurations, Linux performs certain bookkeeping tasks once per timer interrupt. This design, which batches the bookkeeping tasks, can reduce overhead; when a lot of the same kind of work is performed sequentially, that work goes faster (often because of cache effects), and less time is spent repeatedly checking for work. However, timer interrupts also have a cost. When a CPU is idle or running only one CPU-heavy task, reducing or eliminating timer interrupts is a tempting way to save power. Linux’s reduced scheduling-clock ticks (RSCT) feature avoids timer interrupts; but this necessarily requires that bookkeeping tasks be performed elsewhere, such as on every kernel crossing. This might or might not be a good tradeoff. The paper shows that, in the context of LEBench, RSCT is generally a bad tradeoff. (A cartoon of this tradeoff appears after this list.)
- v4.13 → v4.14: Many operations get 30% or more slower. Hypothesis: Spectre and Meltdown mitigation (§4.1.1–2).

  Spectre and Meltdown are serious vulnerabilities in modern hardware that caused a huge splash in 2018 when they were discovered. A Spectre or Meltdown attacker can use processor side channels to extract information in violation of process isolation. For example, a Meltdown attack can read information in kernel memory.

  A side channel exists when a clever observer can extract secrets not directly, or through exploiting a bug, but by observing side effects of some computation. (Like when a detective deduces who was in a room from their lingering perfume.) In Meltdown, a processor speculatively executes an attacker instruction. This instruction fails, but leaves behind observable effects in the form of processor cache contents. For instance, an unprivileged attacker tries to load kernel memory. Eventually, this fails with a page fault; but before noticing the failure, the processor speculatively executes the following attacker instructions, which can load addresses that depend on the kernel data. After the page fault, the attacker can determine the kernel data by checking which addresses are fast to access. (A schematic of the attack appears after this list.) Meltdown is an extremely dangerous attack; it can extract up to 580 KB/s of data from a Linux kernel! [ref] Spectre is a more general family of attacks that target other kinds of speculative execution, such as branch prediction; Spectre was also exploitable from user processes, but less effectively (“We are able to leak 1809 B/s with 1.7% of bytes wrong/unreadable” [ref]).

  Meltdown and Spectre were so serious that their disclosure was delayed for 6 months while software vendors worked on patches to protect against them. The Linux kernel implemented several mitigations. The Meltdown-related patch, kernel page-table isolation (KPTI), removes almost all kernel mappings from user page tables; this prevents Meltdown, but requires that almost every kernel crossing install a new page table, which is slow. The Spectre-related patch, so-called “retpolines,” replaces some instructions that are high-value speculation targets (indirect jumps) with “gadgets” that confuse the processor’s speculative execution unit. But confusing the speculative executor also slows things down!
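To make the idle-state mechanism concrete, here is a minimal sketch of an OS idle loop built on `monitor`/`mwait`. It is illustrative, not Linux’s actual code: the `wakeup_flag` and hint value are my assumptions, the exact idle-state encodings are processor-specific, and these instructions are privileged, so this fragment only makes sense inside a kernel.

```c
/* Hypothetical idle-loop fragment (not Linux code).  The hint in EAX
 * selects the target idle state; deeper states save more power but
 * take longer to wake from (see Intel SDM Vol. 3 for encodings). */
static inline void idle_until_woken(volatile int *wakeup_flag,
                                    unsigned int cstate_hint)
{
    /* MONITOR arms a hardware watch on a memory address; a write to
     * that address will wake the CPU out of MWAIT. */
    asm volatile("monitor" : : "a"(wakeup_flag), "c"(0), "d"(0));
    if (*wakeup_flag)
        return;            /* work arrived while arming the monitor */
    /* MWAIT idles in the hinted state until the monitored cache line
     * is written or an interrupt arrives. */
    asm volatile("mwait" : : "a"(cstate_hint), "c"(0));
}
```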
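Next, a cartoon of the forced-context-tracking tradeoff. Every name here is invented for illustration; the real kernel code paths are far more involved.

```c
/* Cartoon, not kernel code: where per-crossing bookkeeping bites. */
#include <stdbool.h>

static bool forced_context_tracking = true; /* the v3.10-v3.15 misconfiguration */

static void account_user_cpu_time(void)  { /* bookkeeping... */ }
static void notify_rcu_of_crossing(void) { /* ...more bookkeeping */ }
static void dispatch_syscall(void)       { /* the actual work */ }

/* With reduced scheduling-clock ticks, work that a normal configuration
 * batches into the timer tick runs on every user-to-kernel transition: */
void syscall_entry(void)
{
    if (forced_context_tracking) {
        account_user_cpu_time();    /* part of the 400-600 ns tax... */
        notify_rcu_of_crossing();   /* ...paid on every crossing, every core */
    }
    dispatch_syscall();
}
```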
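Finally, a schematic of one Meltdown step, simplified from the published proof-of-concept. The names are mine, and the snippet is deliberately incomplete: a real exploit also needs a way to survive the fault (a signal handler or TSX) plus timing code for the probe array.

```c
#include <stddef.h>

/* One page per possible byte value; which page becomes cached encodes
 * the secret. */
static unsigned char probe[256 * 4096];

void meltdown_step(const unsigned char *kernel_ptr)
{
    /* This load is illegal from user mode and will fault... */
    unsigned char secret = *kernel_ptr;
    /* ...but the processor may speculatively execute this dependent
     * load first, pulling exactly one probe page into the cache: */
    volatile unsigned char sink = probe[(size_t)secret * 4096];
    (void)sink;
    /* After recovering from the fault, time accesses to each
     * probe[i * 4096]; the fast one reveals the secret byte i. */
}
```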
1. What are the evolutionary pressures affecting Linux?
According to the paper, security enhancements and new features are the primary root causes for performance fluctuations. (Configuration changes come next.)
Linux developers are clearly pressured to respond to disclosures of new security vulnerabilities, such as Spectre and Meltdown, and to proactively harden the kernel against unknown vulnerabilities. “Overall, Linux users are paying a hefty performance tax for security enhancements.” (§3) The Meltdown-related KPTI patch solves a potentially devastating vulnerability, and one that would affect any kernel, regardless of the kernel’s correctness. To leave Linux vulnerable to Meltdown would be crazy. Some other changes, such as SLAB freelist randomization and hardened usercopy, are different in scope. While Meltdown exploits what’s arguably a processor bug, SLAB randomization attempts to make kernel bugs less easily exploitable. This process is called “hardening,” and it’s useful, but does not address the root problem, which is that kernel vulnerabilities are disastrous.
“New features” can be an evolutionary pressure—people like to build new stuff—but it’s awfully general. So let’s look at the new features specifically described in this paper. Of those, two are focused on virtualization and containers.
- Control groups memory controller (§4.2.2): Control groups, or cgroups, are an essential part of Linux and Docker containers. Containers are a collection of technologies, introduced gradually into Linux over the years, that provide a new kind of isolation. Although process isolation is an extremely powerful primitive, long experience has shown its limitations; for instance, a “fork bomb” can easily consume all of a system’s resources, and cannot be controlled within the limits of the most basic UNIX API. Control groups allow Linux administrators to constrain groups of processes, accounting for and limiting their resource usage as a unit. (A sketch of the cgroup interface appears after this list.)
- Userspace page fault handling (§4.2.3): Virtual machine monitors, common in cloud computing, can benefit from tighter kernel integration. This feature streamlines the process by which one user process can handle page faults for another. (A sketch of the corresponding userfaultfd API also appears after this list.)
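Here is a minimal sketch of the cgroup interface, using the cgroup v2 filesystem layout; it assumes a cgroup2 mount at /sys/fs/cgroup and sufficient privileges, and omits all error handling.

```c
/* Minimal sketch: cap a process group at 100 MiB of memory using the
 * cgroup v2 interface (assumes /sys/fs/cgroup is a cgroup2 mount and
 * that we may write there; all error checks omitted). */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    mkdir("/sys/fs/cgroup/demo", 0755);

    /* The memory controller enforces this limit across the whole group. */
    FILE *f = fopen("/sys/fs/cgroup/demo/memory.max", "w");
    fprintf(f, "%d\n", 100 * 1024 * 1024);
    fclose(f);

    /* Writing "0" to cgroup.procs moves the calling process in;
     * children (fork bombs included) inherit the group. */
    f = fopen("/sys/fs/cgroup/demo/cgroup.procs", "w");
    fprintf(f, "0\n");
    fclose(f);
    return 0;
}
```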
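And a condensed sketch of userspace page fault handling via the userfaultfd API, following the outline in userfaultfd(2). A real user (a VM live-migration tool, say) would run the monitor loop, here only described in a comment, in a separate thread or process; error handling is again omitted.

```c
/* Condensed userfaultfd sketch: register a region so that its page
 * faults are delivered to user space instead of being handled by the
 * kernel's usual fault path. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    size_t len = 4096;
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* A monitor thread would now read(uffd, ...) to receive
     * struct uffd_msg fault events and resolve each one with
     * ioctl(uffd, UFFDIO_COPY, ...). */
    return 0;
}
```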
It thus seems like kernel developers face evolutionary pressure to improve support for containers and VMs, and, likely, cloud computing more generally.
The other listed “new features” aren’t really new features at all.
- Fault around (§4.2.1): Not a new feature. This optimization aims to reduce page fault overhead. When a file is mapped into process memory (via `mmap`), not all mappings are installed at once; instead, the kernel maps pages on demand, as they are accessed. Fault around reduces the number of these demand faults: when one page faults, the kernel also installs mappings for nearby pages that are already in the page cache.
- Transparent huge pages (§4.2.3): Arguably a configuration change. Many processors, including the x86-64 processors used in this study, support multiple sizes of memory page. Smaller memory pages are generally the default, because small pages cause less fragmentation. Larger pages, however, can perform better, since they reduce pressure on the processor’s cache of page mappings. (The processor needs 512 minimum-size mappings to cover 2 MiB of data, or just one “huge” mapping!) Linux supports huge pages, but the implementation has been in flux for years. The “transparent” part means that Linux will automatically detect opportunities to install huge pages. This detection uses a bunch of heuristics that can go wrong; in Ubuntu mantic, fully transparent huge pages are still off by default. (A sketch of the per-region opt-in appears after this list.)
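As a small illustration of the configuration surface, here is how a process can opt one region in to transparent huge pages with madvise(2). The hint is advisory: whether a huge page is actually installed stays up to the kernel’s heuristics and the system-wide setting.

```c
/* Sketch: request transparent huge pages for one region.  With
 * /sys/kernel/mm/transparent_hugepage/enabled set to "madvise",
 * only regions marked like this are eligible. */
#include <sys/mman.h>

#define LEN (512UL * 4096)   /* 2 MiB = 512 base pages = 1 huge page */

int main(void)
{
    char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(p, LEN, MADV_HUGEPAGE);
    p[0] = 1;   /* first touch: the kernel may install one 2 MiB mapping
                   instead of 512 4 KiB mappings */
    return 0;
}
```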
2. Which changes required new hardware support?
- A later version of the KPTI patch addressing Meltdown relied on a feature in recent x86-64 hardware, the process-context identifier (PCID). With PCID, TLB entries are tagged with a context identifier, so kernel and user page tables can essentially partition the TLB: KPTI’s constant page-table switches no longer force full TLB flushes, and a user process cannot reuse the kernel’s TLB entries. The paper reports that KPTI is still expensive because changing the PCID register is expensive, but it’s easy to imagine Intel speeding that up.
- Intermediate power-saving states were not exactly new when Linux 3.9 came out, but they were new to Linux!
4. How to improve kernel development?
This is a very subjective question! Lots of people complain about the kernel development process, which can be harsh and angry. It is also fast and in public.
It’s interesting to look at some of the discussions around changes mentioned in this paper. A Linus Torvalds rant about some early-stage Intel patches for Spectre made the rounds: “Linus Torvalds declares Intel fix for Meltdown/Spectre ‘COMPLETE AND UTTER GARBAGE’”. In this article on hardened usercopy, security-oriented kernel developer Kees Cook is quoted thus:
There's a long history of misunderstanding and miscommunication (intentional or otherwise) by everyone on these topics. I'd love it if we can just side-step all of it, and try to stick as closely to the technical discussions as possible. Everyone involved in these discussions wants better security, even if we go about it in different ways. If anyone finds themselves feeling insulted, just try to let it go, and focus on the places where we can find productive common ground, remembering that any fighting just distracts from the more important issues at hand.
So Linux kernel development is a harsh environment. But maybe Intel’s proposed fix for Meltdown and Spectre was not the best idea.
The paper suggests several times that Linux should have more “proactive” optimization.
“Forced context tracking (§4.3.1) was only disabled after plaguing five versions for more than 11 months… control group memory controller (§4.2.2) remained unoptimized for 6.5 years… more frequent and thorough testing, as well as more proactive performance optimizations, would have avoided these impacts”
Fair enough? And yet, as the paper points out, performance-tuning Linux is an incredibly expensive task; major kernel and cloud vendors optimize their kernels for up to 18 months before releasing them! Who will do that work? Will it take new research to do the work more effectively? Should the kernel developers do that work themselves—and if so, what should they give up?
As an open-source project, Linux development happens in public. Many kernel developers are paid for their work by their employers, but not all are. Even paid developers are often underpaid (open-source development is vulnerable to underpayment), or expected to do their open-source work on the side, in effectively spare time. As a result, kernel developers maintain the sense that they are volunteers, doing volunteer work. They hate to have their time “wasted”. If you don’t like what kernel developers are doing, you’re expected to do the work yourself. So though this paper’s results would ideally have some impact on the kernel development process, it’s not clear how much.
5. How do microbenchmarks compare with macrobenchmarks?
Macrobenchmarks did follow the same performance trends as microbenchmarks, but the absolute size of the effects was smaller. No application slowed down by more than 50% or so. The figure containing the relevant data (Fig. 7) is harder to read than the microbenchmark figures, and not directly comparable. The figure also adds nuance to the claim that “All kernel operations are slower than they were four years ago” (§1), and to the impression the paper creates that Linux performance naturally degrades over time. Yes, application performance has suffered, but the reason is clear: Spectre and Meltdown mitigations landed in 4.14 and dramatically hamstrung application performance. Considering the versions before that time, Linux performance appears to have improved for most applications: Redis is faster in versions 4.0–4.13 than before, and the same holds for Nginx (a web server). Intriguingly, Apache performance has slowly degraded since 3.0; perhaps that is because Apache has an older-style architecture than the comparatively modern Nginx?
Another way to look at this: System calls are inherently expensive, and reducing system calls (or making smarter system calls) is one of the most basic mechanisms available for improving application performance. A microbenchmark suite that focuses on the performance of individual system calls, and that includes some measurements of quite-inexpensive system calls (e.g., `send`ing 1 byte over a TCP connection; small-`read`ing a 4 KiB file), will inherently exaggerate the impact of operating system changes. (A toy example follows.)
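To see why, consider this toy example (mine, not the paper’s): an application that batches its I/O pays kernel-crossing costs thousands of times less often, so kernel slowdowns barely register.

```c
/* Toy example: the same 4096 bytes written with 4096 system calls
 * versus one.  Each write(2) is a kernel crossing; per-crossing
 * overhead is exactly what LEBench-style microbenchmarks isolate. */
#include <unistd.h>

void byte_at_a_time(int fd, const char *buf)
{
    for (int i = 0; i < 4096; i++)
        write(fd, buf + i, 1);   /* 4096 kernel crossings */
}

void one_shot(int fd, const char *buf)
{
    write(fd, buf, 4096);        /* 1 kernel crossing */
}
```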
(Many operating system research projects have focused on single-system-call performance, arguably missing the larger point that high-performance applications should make fewer system calls. It’s still worth looking at these projects, some of which have been influential and mind-expanding; for example, the Synthesis kernel, by Alexia Massalin et al., uses runtime code generation to synthesize individual system calls on demand, and contains what was at the time the world’s fastest “read one byte from a specific file descriptor” system call!)
The microbenchmark results are reported exclusively in relative terms: one benchmark improves by 25% relative to 4.0. This hides the absolute speed of each benchmark. For example, if small-`write` slows down by 50%, but that 50% represents only 500 nanoseconds, do we really care that much?
Some of the changes that affect LEBench results were not observed by their authors to cause problems for applications. For instance, the LWN.net article on hardened usercopy describes the patch authors measuring performance: “Cook said that he ’couldn't detect a measurable performance change with these features enabled’, when running tests like kernel builds and hackbench. … Linus Torvalds said that a `stat()`-heavy workload (e.g. something like `git diff`) would be one way to test it, but indicated that he thought the checks would not be all that onerous.” On the other hand, we could question whether “kernel builds,” hackbench (“a benchmark and a stress test for the Linux kernel scheduler” that stresses sockets and pipes), and `git diff` represent an adequate or rigorous test suite; and point tests on individual changes will always miss the kind of gradual degradation this paper warns us about.
Other questions
- How can a “misconfiguration” (i.e., a configuration mistake) impact kernel performance for everyone? How do they know the misconfiguration was a mistake?

  At first I wasn’t sure that the paper would make a good case for misconfiguration as opposed to intentionally different configuration, but boy howdy §4.3.1 makes that case well for forced context tracking. The TLB layout change described in §4.3.2 is arguably less of a configuration change and more of an optimization for newer hardware.
- Are “core kernel operations” a fair way to measure kernel performance?

  I’d argue no, but on the other hand, what should “fair” actually mean here? Core kernel operations are a simple, easily interpretable, and universal way to measure kernel performance, and sometimes a simple measurement is better than a so-called “fairer” measurement that’s much messier and harder to capture.
- Is Linux especially bad?

  It’s not at all clear that Linux’s performance evolution is worse than that of other kinds of software. For instance, Dan Luu once “carried around a high speed camera and measured the response latency of devices I’ve run into in the past few months”; he found that the Apple IIe, a 1983 machine with a 1 MHz clock, had less than one third the response latency of a 2.6 GHz MacBook Pro 2014. (That measurement of course involves hardware as well as software, but it’s still fun.)