System call costs experiment
The following program is a simple system call benchmarker. It calls zero or more system calls 1 million times in a tight loop; which system call(s) it makes depends on the option argument. Paste the code into `syscall.cc` (or download it), then compile with `c++ -O2 syscall.cc -o syscall` (or `c++ -std=gnu++11 -O2 syscall.cc -o syscall`) and run with, for example, `./syscall -0` or `./syscall -o`.
```cpp
#include <unistd.h>
#include <time.h>
#include <stdio.h>
#include <inttypes.h>
#include <fcntl.h>
#include <stdlib.h>

static void* initial_brk;

unsigned f_return0() {
    return 0;
}

unsigned f_getpid() {
    return getpid();
}

unsigned f_getppid() {
    return getppid();
}

unsigned f_time() {
    return time(nullptr);
}

unsigned f_clock_gettime() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_nsec;
}

unsigned f_clock_gettime_coarse() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return ts.tv_nsec;
}

unsigned f_sbrk() {
    return reinterpret_cast<uintptr_t>(sbrk(0));
}

unsigned f_brk() {
    return brk(initial_brk);
}

unsigned f_open_close() {
    int fd = open("/dev/null", O_RDONLY);
    close(fd);
    return fd;
}

unsigned f_close() {
    return close(0);
}

void usage() {
    fprintf(stderr, "Usage: ./syscall [-0pPtcCsbox]\n");
    exit(1);
}

int main(int argc, char** argv) {
    // `volatile` keeps the compiler from inlining the chosen function
    // and optimizing the benchmark loop away.
    unsigned (*volatile f)(void) = f_getpid;
    initial_brk = sbrk(0);

    int opt;
    while ((opt = getopt(argc, argv, "0pPtcCsbox")) != -1) {
        switch (opt) {
        case '0':
            f = f_return0;
            break;
        case 'p':
            f = f_getpid;
            break;
        case 'P':
            f = f_getppid;
            break;
        case 't':
            f = f_time;
            break;
        case 'c':
            f = f_clock_gettime;
            break;
        case 'C':
            f = f_clock_gettime_coarse;
            break;
        case 's':
            f = f_sbrk;
            break;
        case 'b':
            f = f_brk;
            break;
        case 'o':
            f = f_open_close;
            break;
        case 'x':
            f = f_close;
            break;
        default:
            usage();
        }
    }
    if (optind != argc) {
        usage();
    }

    // Time 1,000,000 calls of the chosen function.
    timespec ts0;
    clock_gettime(CLOCK_REALTIME, &ts0);
    unsigned long n = 0;
    for (unsigned i = 0; i != 1000000; ++i) {
        n += f();
    }
    timespec ts1;
    clock_gettime(CLOCK_REALTIME, &ts1);

    double t0 = ts0.tv_sec + ts0.tv_nsec / 1e9;
    double t1 = ts1.tv_sec + ts1.tv_nsec / 1e9;
    printf("result: %lu in %.06fs\n", n, t1 - t0);
}
```
Run this program with the different arguments.

- Which system calls are most expensive and which least expensive? Do system call costs divide into classes?
- Try to explain why some system calls are so much more expensive than others. For instance, use programs like `gdb` and/or `strace` and/or `/usr/bin/time -v` and/or `perf` to trace the operation of the program.
Here’s one set of observations:

| Argument | System call | Time | Slowdown |
|---|---|---|---|
| `-0` | None | 0.001770s | 1x |
| `-t` | `time` | 0.003583s | 2.02x |
| `-s` | `sbrk` | 0.004928s | 2.78x |
| `-C` | `clock_gettime(CLOCK_REALTIME_COARSE)` | 0.007307s | 4.13x |
| `-c` | `clock_gettime(CLOCK_REALTIME)` | 0.028216s | 15.9x |
| `-P` | `getppid` | 0.234567s | 133x |
| `-p` | `getpid` | 0.234863s | 133x |
| `-b` | `brk` | 0.238244s | 135x |
| `-x` | `close` | 0.240716s | 136x |
| `-o` | `open`+`close` | 1.335160s | 754x |

We see roughly three performance classes. Fastest are `time`, `sbrk`, and `clock_gettime`, which aren’t much slower than a null function call. Next fastest are `getppid`, `getpid`, `brk`, and `close`. Then, in its own class, `open`+`close`, which is more than 2x slower than `close` alone.

What’s going on? First, let’s take a look at `strace` output, to check that system calls are actually being made. We see some surprises:
| Argument | System call | # strace lines |
|---|---|---|
| `-0` | None | 30 |
| `-t` | `time` | 30 |
| `-s` | `sbrk` | 30 |
| `-C` | `clock_gettime(CLOCK_REALTIME_COARSE)` | 30 |
| `-c` | `clock_gettime(CLOCK_REALTIME)` | 30 |
| `-P` | `getppid` | 1000030 |
| `-p` | `getpid` | 1000030 |
| `-b` | `brk` | 1000030 |
| `-x` | `close` | 1000030 |
| `-o` | `open`+`close` | 2000030 |

It looks like the first class of system calls…aren’t system calls at all! The C library has somehow cached the results of the system calls, or used another mechanism to avoid system calls altogether.
We can figure this out by looking at the C library source code or by stepping through the code with GDB or a disassembler.
The `sbrk` “system call” is actually a wrapper function around a more fundamental system call, `brk`. And that wrapper has a special case for when the argument is 0: it just returns a cached value! (Code)

The `time` and `clock_gettime` system calls are implemented on Linux by a special shared-memory mechanism called the vDSO. This is a special shared library created by the kernel and linked in to every process automatically. The shared-library page is updated by the kernel with the current time, so processes can read the current time without doing a system call! Here’s the code for the x86-64 versions of the vDSO time functions. This code also explains the (slight) performance differences between the variants. `time` is the cheapest because it does the least work (it just reads an int). `clock_gettime` is more expensive because it loops to ensure that it reads a consistent second/nanosecond pair. The `REALTIME` version is more expensive than the `REALTIME_COARSE` version because the `REALTIME` version additionally uses expensive instructions, such as `rdtsc` (Read TimeStamp Counter), to get a precise nanosecond count.

How important is this optimization? Many programs, particularly server programs, frequently check the current time (to, for example, time out dead connections). People notice when `clock_gettime` gets slow, which is possible in virtualized environments such as Nick’s laptop. See “Two frequently used system calls are ~77% slower on AWS EC2”! Pretty cool what you can do when you can control the kernel.
Some students also saw `getpid` and/or `getppid` running quickly. These students were observing a PID caching optimization that the GNU C library eventually rejected as too error-prone and complex! Read about this history on the `getpid` manual page.

Why might `open`+`close` be so much slower than `close` alone? Because `close` alone does not modify kernel state after the first call. The first call to `close` closes the standard input; the other 999,999 calls simply return an error. `open`, on the other hand, must search the file descriptor table for an empty slot, look up a pathname, allocate memory for a file structure, modify the file descriptor table, etc.: expensive stuff.
The C10K problem
Much OS research in the late 1990s and early 2000s was concerned with addressing performance problems with a specific set of system calls, namely network connection system calls. Wide-area network connections have high latency, so a server (such as a web or FTP server) that services wide-area clients will often be waiting for those clients. A single server in the late 1990s could theoretically service the active workload provided by 10,000 simultaneous network clients, but in practice many servers broke down well before that level, because certain critical system calls had design flaws that prevented them from scaling. This “10,000 connection” problem became known as the “C10K problem.”
This paper traced the C10K problem to two specific kernel functions, `select` (a system call) and `ufalloc` (a kernel function for allocating the numerically smallest unused file descriptor, invoked by `open`, `socket`, `accept`, etc.):

“Scalable Kernel Performance for Internet Servers Under Realistic Loads.” Gaurav Banga and Jeffrey C. Mogul. In Proc. USENIX ATC 1998. Link
Read the manual page for `select` and/or `poll` and try to figure out, from these system calls’ specifications, why they perform badly on servers with 10,000 or more open, but mostly idle, connections. Develop your own hypothesis and then check it out by talking with other groups or reading the paper. Then sketch a solution for this problem: system calls that serve the same need as `select`, but that could scale to large numbers of idle connections.
“A Scalable and Explicit Event Delivery Mechanism for UNIX.” Gaurav Banga, Jeffrey C. Mogul, and Peter Druschel. In Proc. USENIX ATC 1999. Link
Consider a server with N mostly-idle connections. This server is interested in events (such as data arrivals) on N file descriptors, but since these FDs are mostly idle, only a couple of them (say, O(1)) actually have events available at any given moment. How would `select` or `poll` handle this situation? Well, the server would typically run like this:

1. The server process processes all the events that it knows about.
2. The server process prepares to block until there is more work to do. It calls `select` (or `poll`), passing down the full set of N file descriptors to the kernel (which takes O(N) work).
3. The kernel does O(N) work to check all N file descriptors for events.
4. If no events are available, the kernel does O(N) work to block on all N file descriptors (so if an event arrives on any of the file descriptors, the process’s kernel task will wake up).
5. Eventually, a few events arrive. The kernel reports those events to the server process.
6. Goto step 1.
Thus, it takes O(N) work to process O(1) events!
This problem is baked in to these system calls’ designs. Although `poll` addressed some problems with `select` (namely, `select` requires the user process to construct new event bitmasks every time, and in some older implementations `select` didn’t handle large file descriptor values), the consensus is that `poll` is worse in most realistic scenarios (because `poll`’s data structures are much larger than `select`’s), and `poll` doesn’t fix the O(N) problem.

To fix the O(N) problem, we must move state into the kernel, to avoid the O(N) state transfer in step 2 above. This is what the `epoll` and `kqueue` system calls do. Here’s how they work:
1. The server process sets up an “event handle,” a special kind of file descriptor that knows how to listen for events on other file descriptors.
2. The server process adds the N file descriptors to the handle. This takes O(N) work, but needs to be done just once. Then:
3. The server process processes all the events it knows about.
4. The server process calls `epoll` or `kqueue`, passing down just the event handle.
5. Since the kernel has already set up internal structures corresponding to the event handle, it can wait for events with just O(1) work!
6. Goto step 3.

As connections come and go, the server process can add and remove event triggers to and from the handle.
As you might imagine, event handles are subtle to implement, and no mechanism is perfect. It’s fun to read people getting mad online about them. Bryan Cantrill, a not infrequently obnoxious former Sun employee, pooped on `epoll` for years. (Former Sun employees are way too attached to Solaris, which was at best fine.) My favorite set of complaints about event handles comes from Marc Lehmann’s libev documentation; search for “special problem” or jump to “portability notes.” Some quotes:

The epoll mechanism deserves honorable mention as the most misdesigned of the more advanced event mechanisms: mere annoyances include silently dropping file descriptors, requiring a system call per change per file descriptor (and unnecessary guessing of parameters), problems with dup, returning before the timeout value, resulting in additional iterations (and only giving 5ms accuracy while select on the same platform gives 0.1ms) and so on. … Epoll is also notoriously buggy - embedding epoll fds should work, but of course doesn't, and epoll just loves to report events for totally different file descriptors (even already closed ones, so one cannot even remove them from the set) than registered in the set (especially on SMP systems). … Epoll is truly the train wreck among event poll mechanisms, a frankenpoll, cobbled together in a hurry, no thought to design or interaction with others. Oh, the pain, will it ever stop...
Kqueue deserves special mention, as at the time of this writing, it was broken on all BSDs except NetBSD (usually it doesn't work reliably with anything but sockets and pipes, except on Darwin, where of course it's completely useless). Unlike epoll, however, whose brokenness is by design, these kqueue bugs can (and eventually will) be fixed….
OS/X AND DARWIN BUGS: The whole thing is a bug if you ask me - basically any system interface you touch is broken, whether it is locales, poll, kqueue or even the OpenGL drivers. … The kqueue syscall is broken in all known versions…. Instead of fixing kqueue, Apple replaced their (working) poll implementation by something calling kqueue internally around the 10.5.6 release, so now kqueue and poll are broken. … All that's left is select, and of course Apple found a way to fuck this one up as well.
The scalable event interface for Solaris is called "event ports". Unfortunately, this mechanism is very buggy in all major releases. If you run into high CPU usage, your program freezes or you get a large number of spurious wakeups, make sure you have all the relevant and latest kernel patches applied. No, I don't know which ones…
As with everything on Solaris, [the Solaris 10 event port mechanism is] really slow, but it still scales very well (O(active_fds)). … On the positive side, this backend actually performed fully to specification in all tests and is fully embeddable, which is a rare feat among the OS-specific backends (I vastly prefer correctness over speed hacks). On the negative side, the interface is bizarre - so bizarre that even sun itself gets it wrong in their code examples … Fortunately libev seems to be able to work around these idiocies.