Section 2 Notes

3. Which versions represent the biggest phase changes, and what happened?

This is subjective, but the biggest phase changes (simultaneous changes to the performance of many operations) seem to be:

1. What are the evolutionary pressures affecting Linux?

According to the paper, security enhancements and new features are the primary root causes of performance fluctuations. (Configuration changes are the next most common.)

Linux developers are clearly pressured to respond to disclosures of new security vulnerabilities, such as Spectre and Meltdown, and to proactively harden the kernel against unknown vulnerabilities. “Overall, Linux users are paying a hefty performance tax for security enhancements.” (§3) The Meltdown-related KPTI patch solves a potentially devastating vulnerability, and one that would affect any kernel, regardless of the kernel’s correctness. To leave Linux vulnerable to Meltdown would be crazy. Some other changes, such as SLAB freelist randomization and hardened usercopy, are different in scope. While Meltdown exploits what’s arguably a processor bug, SLAB randomization attempts to make kernel bugs less easily exploitable. This process is called “hardening,” and it’s useful, but does not address the root problem, which is that kernel vulnerabilities are disastrous.
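The flavor of SLAB freelist randomization can be sketched in a few lines. This is a conceptual illustration, not the kernel's implementation: instead of handing out slab slots in address order, the allocator shuffles the free-slot order once per slab, so an attacker exploiting a heap overflow can no longer predict which object sits next to which.

```python
# Conceptual sketch of SLAB freelist randomization (illustrative only,
# not kernel code).
import random

def build_freelist(nr_slots, randomize=True, seed=None):
    """Return the order in which the slots of a new slab will be handed out."""
    order = list(range(nr_slots))
    if randomize:
        # Shuffle the allocation order so adjacent allocations are not
        # adjacent in memory; the kernel does this once per new slab.
        random.Random(seed).shuffle(order)
    return order

sequential = build_freelist(8, randomize=False)
shuffled = build_freelist(8, seed=42)
print(sequential)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(shuffled)    # a permutation of the same slots
```

The cost of this hardening is exactly the kind of diffuse overhead the paper measures: the allocator does a little more work on every slab, whether or not a bug is ever exploited.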

“New features” can be an evolutionary pressure—people like to build new stuff—but it’s awfully general. So let’s look at the new features specifically described in this paper. Of those, two are focused on virtualization and containers.

It thus seems like kernel developers face evolutionary pressure to improve support for containers and VMs, and, likely, cloud computing more generally.

The other listed “new features” aren’t really new features at all.

2. Which changes required new hardware support?

4. How to improve kernel development?

This is a very subjective question! Lots of people complain about the kernel development process, which can be harsh and angry. It is also fast and conducted in public.

It’s interesting to look at some of the discussions around changes mentioned in this paper. A Linus Torvalds rant about some early-stage Intel patches for Spectre made the rounds: “Linus Torvalds declares Intel fix for Meltdown/Spectre ‘COMPLETE AND UTTER GARBAGE’”. In this article on hardened usercopy, security-oriented kernel developer Kees Cook is quoted thus:

There's a long history of misunderstanding and miscommunication (intentional or otherwise) by everyone on these topics. I'd love it if we can just side-step all of it, and try to stick as closely to the technical discussions as possible. Everyone involved in these discussions wants better security, even if we go about it in different ways. If anyone finds themselves feeling insulted, just try to let it go, and focus on the places where we can find productive common ground, remembering that any fighting just distracts from the more important issues at hand.

So Linux kernel development is a harsh environment. But maybe Intel’s proposed fix for Meltdown and Spectre was not the best idea.

The paper suggests several times that Linux should have more “proactive” optimization.

“Forced context tracking (§4.3.1) was only disabled after plaguing five versions for more than 11 months… control group memory controller (§4.2.2) remained unoptimized for 6.5 years… more frequent and thorough testing, as well as more proactive performance optimizations, would have avoided these impacts”

Fair enough? And yet, as the paper points out, performance-tuning Linux is an incredibly expensive task; major kernel and cloud vendors optimize their kernels for up to 18 months before releasing them! Who will do that work? Will it take new research to do the work more effectively? Should the kernel developers do that work themselves—and if so, what should they give up?

As an open-source project, Linux development happens in public. Many kernel developers are paid for their work by their employers, but not all are. Even paid developers are often underpaid (open-source development is vulnerable to underpayment), or expected to do their open-source work on the side, effectively in their spare time. As a result, kernel developers maintain the sense that they are volunteers, doing volunteer work. They hate to have their time “wasted”. If you don’t like what kernel developers are doing, you’re expected to do the work yourself. So though this paper’s results would ideally have some impact on the kernel development process, it’s not clear how much.

5. How do microbenchmarks compare with macrobenchmarks?

Macrobenchmarks did follow the same performance trends as microbenchmarks, but the absolute size of the effects was smaller: no application slowed down by more than 50% or so. The figure containing the relevant data (Fig. 7) is harder to read than the microbenchmark figures, and not directly comparable. The figure also adds nuance to the claim that “All kernel operations are slower than they were four years ago” (§1), or the feeling the paper creates that Linux performance naturally degrades over time. Yes, application performance has suffered, but the reason is clear: Spectre and Meltdown mitigations landed in 4.14 and dramatically hamstrung application performance. Considering the versions before that time, Linux performance appears to have improved for most applications: Redis is faster in versions 4.0–4.13 than before, and the same goes for Nginx (a web server). Intriguingly, Apache performance has slowly degraded since 3.0; perhaps that is because Apache has an older-style architecture than the comparatively more modern Nginx?

Another way to look at this: System calls are inherently expensive, and reducing system calls (or making smarter system calls) is one of the most basic mechanisms available for improving application performance. A microbenchmark suite that focuses on the performance of individual system calls, and that includes some measurements of quite-inexpensive system calls (e.g., sending 1 byte over a TCP connection; small-reading a 4 KiB file), will inherently exaggerate the impact of operating system changes.
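To make the shape of such a microbenchmark concrete, here is a hypothetical sketch (not the paper's actual LEBench harness): issue one cheap system call many times and report the mean cost per call in nanoseconds. The file path and iteration count are made up for illustration.

```python
# Sketch of a single-system-call microbenchmark: time many iterations of a
# small (4 KiB) read and report the mean nanoseconds per call.
import os
import time

def time_small_read(path, size=4096, iters=100_000):
    """Mean nanoseconds per pread() of `size` bytes at offset 0."""
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter_ns()
        for _ in range(iters):
            os.pread(fd, size, 0)   # the syscall under test, repeated
        elapsed = time.perf_counter_ns() - start
    finally:
        os.close(fd)
    return elapsed / iters

if __name__ == "__main__":
    # Build a 4 KiB test file, then measure.
    with open("/tmp/lebench_smallread", "wb") as f:
        f.write(b"\0" * 4096)
    print(f"small read: {time_small_read('/tmp/lebench_smallread'):.0f} ns/call")
```

A harness like this measures nothing but syscall entry, the kernel's fast path, and syscall exit, which is precisely why it is so sensitive to changes (like KPTI) that add fixed overhead at the user/kernel boundary.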

(Many operating system research projects have focused on single-system-call performance, arguably missing the larger point that high-performance applications should make fewer system calls. It’s still worth looking at these projects, some of which have been influential and mind-expanding; for example, the Synthesis kernel, by Alexia Massalin et al., uses runtime code generation to synthesize individual system calls on demand, and contains what was at the time the world’s fastest “read one byte from a specific file descriptor” system call!)

The microbenchmark results are reported exclusively in relative terms: one benchmark improves by 25% relative to 4.0. This hides the absolute speed of each benchmark. For example, if small-write slows down by 50%, but that 50% represents only 500 nanoseconds, do we really care that much?
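The point is easy to make with a back-of-the-envelope calculation (the numbers below are made up for illustration, not taken from the paper):

```python
# Illustrative numbers only: a 50% relative slowdown on a cheap operation
# can be a small absolute cost.
baseline_ns = 1_000                  # hypothetical small-write cost, version 4.0
slowed_ns = baseline_ns * 1.5        # after a 50% slowdown
penalty_ns = slowed_ns - baseline_ns
print(penalty_ns)                    # 500.0 ns of added cost per call

# Whether 500 ns matters depends entirely on call frequency:
calls_per_sec = 10_000
fraction = penalty_ns * calls_per_sec / 1e9
print(f"{fraction:.1%} of each second lost")   # 0.5% of each second lost
```

At 10,000 calls per second the application loses half a percent of its time; at 10 calls per second the slowdown is invisible. Relative-only reporting cannot distinguish these cases.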

Some of the changes that affect LEBench results were not observed by their authors to cause problems for applications. For instance, the article on hardened usercopy describes the patch authors measuring performance: “Cook said that he ‘couldn't detect a measurable performance change with these features enabled’, when running tests like kernel builds and hackbench. … Linus Torvalds said that a stat()-heavy workload (e.g. something like git diff) would be one way to test it, but indicated that he thought the checks would not be all that onerous.” On the other hand, we could question whether “kernel builds,” hackbench (“a benchmark and a stress test for the Linux kernel scheduler” that stresses sockets and pipes), and git diff represent an adequate or rigorous test suite; and point tests on individual changes will always miss the kind of gradual degradation this paper warns us about.

Other questions

  1. How can a “misconfiguration” (i.e., a configuration mistake) impact kernel performance for everyone? How do they know the misconfiguration was a mistake?

    At first I wasn’t sure that the paper would make a good case for misconfiguration as opposed to intentionally different configuration, but boy howdy §4.3.1 makes that case well for forced context tracking. The TLB layout change described in §4.3.2 is arguably less of a configuration change and more of an optimization for newer hardware.

  2. Are “core kernel operations” a fair way to measure kernel performance?

    I’d argue no, but on the other hand, what should “fair” actually mean here? Core kernel operations are a simple, easily interpretable, and universal way to measure kernel performance, and sometimes a simple measurement is better than a so-called “fairer” measurement that’s much messier and harder to capture.

  3. Is Linux especially bad?

    It’s not at all clear that Linux’s performance evolution is worse than that of other kinds of software. For instance, Dan Luu once “carried around a high speed camera and measured the response latency of devices I’ve run into in the past few months”; he found that the Apple IIe, a 1983 machine with a 1 MHz clock, had less than one third the response latency of a 2014 MacBook Pro with a 2.6 GHz clock. (That measurement of course involves hardware as well as software, but it’s still fun.)