System call costs experiment
The following program is a simple system call benchmarker. It calls zero or more system calls 1 million times in a tight loop; which system call(s) it makes depends on the option argument. Paste the code into `syscall.cc` (or download it), then compile with `c++ -O2 syscall.cc -o syscall` (or `c++ -std=gnu++11 -O2 syscall.cc -o syscall`) and run with, for example, `./syscall -0` or `./syscall -o`.
```cpp
#include <unistd.h>
#include <time.h>
#include <stdio.h>
#include <inttypes.h>
#include <fcntl.h>
#include <stdlib.h>

static void* initial_brk;

unsigned f_return0() {
    return 0;
}

unsigned f_getpid() {
    return getpid();
}

unsigned f_getppid() {
    return getppid();
}

unsigned f_time() {
    return time(nullptr);
}

unsigned f_clock_gettime() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_nsec;
}

unsigned f_clock_gettime_coarse() {
    timespec ts;
    clock_gettime(CLOCK_REALTIME_COARSE, &ts);
    return ts.tv_nsec;
}

unsigned f_sbrk() {
    return reinterpret_cast<uintptr_t>(sbrk(0));
}

unsigned f_brk() {
    return brk(initial_brk);
}

unsigned f_open_close() {
    int fd = open("/dev/null", O_RDONLY);
    close(fd);
    return fd;
}

unsigned f_close() {
    return close(0);
}

void usage() {
    fprintf(stderr, "Usage: ./syscall [-0pPtcCsbox]\n");
    exit(1);
}

int main(int argc, char** argv) {
    // `volatile` keeps the compiler from inlining the chosen function
    // and optimizing the benchmark loop away.
    unsigned (*volatile f)(void) = f_getpid;
    initial_brk = sbrk(0);

    int opt;
    while ((opt = getopt(argc, argv, "0pPtcCsbox")) != -1) {
        switch (opt) {
        case '0':
            f = f_return0;
            break;
        case 'p':
            f = f_getpid;
            break;
        case 'P':
            f = f_getppid;
            break;
        case 't':
            f = f_time;
            break;
        case 'c':
            f = f_clock_gettime;
            break;
        case 'C':
            f = f_clock_gettime_coarse;
            break;
        case 's':
            f = f_sbrk;
            break;
        case 'b':
            f = f_brk;
            break;
        case 'o':
            f = f_open_close;
            break;
        case 'x':
            f = f_close;
            break;
        default:
            usage();
        }
    }
    if (optind != argc) {
        usage();
    }

    // Time 1,000,000 calls of the chosen function.
    timespec ts0;
    clock_gettime(CLOCK_REALTIME, &ts0);
    unsigned long n = 0;
    for (unsigned i = 0; i != 1000000; ++i) {
        n += f();
    }
    timespec ts1;
    clock_gettime(CLOCK_REALTIME, &ts1);

    double t0 = ts0.tv_sec + ts0.tv_nsec / 1e9;
    double t1 = ts1.tv_sec + ts1.tv_nsec / 1e9;
    printf("result: %lu in %.06fs\n", n, t1 - t0);
}
```
Run this program with the different arguments.

- Which system calls are most expensive and which least expensive? Do system call costs divide into classes?
- Try to explain why some system calls are so much more expensive than others. For instance, use programs like `gdb` and/or `strace` and/or `/usr/bin/time -v` and/or `perf` to trace the operation of the program.
Here’s one set of observations:

| Argument | System call | Time | Slowdown |
|---|---|---|---|
| `-0` | None | 0.001770s | 1x |
| `-t` | `time` | 0.003583s | 2.02x |
| `-s` | `sbrk` | 0.004928s | 2.78x |
| `-C` | `clock_gettime(CLOCK_REALTIME_COARSE)` | 0.007307s | 4.13x |
| `-c` | `clock_gettime(CLOCK_REALTIME)` | 0.028216s | 15.9x |
| `-P` | `getppid` | 0.234567s | 133x |
| `-p` | `getpid` | 0.234863s | 133x |
| `-b` | `brk` | 0.238244s | 135x |
| `-x` | `close` | 0.240716s | 136x |
| `-o` | `open`+`close` | 1.335160s | 754x |

We see roughly three performance classes. Fastest are `time`, `sbrk`, and `clock_gettime`, which aren’t much slower than a null function call. Next fastest are `getppid`, `getpid`, `brk`, and `close`. Then, in its own class, `open`+`close`, which is more than 2x slower than `close` alone.

What’s going on? First, let’s take a look at `strace` output, to check that system calls are actually being made. We see some surprises:
| Argument | System call | # strace lines |
|---|---|---|
| `-0` | None | 30 |
| `-t` | `time` | 30 |
| `-s` | `sbrk` | 30 |
| `-C` | `clock_gettime(CLOCK_REALTIME_COARSE)` | 30 |
| `-c` | `clock_gettime(CLOCK_REALTIME)` | 30 |
| `-P` | `getppid` | 1000030 |
| `-p` | `getpid` | 1000030 |
| `-b` | `brk` | 1000030 |
| `-x` | `close` | 1000030 |
| `-o` | `open`+`close` | 2000030 |

It looks like the first class of system calls…aren’t system calls at all! The C library has somehow cached the results of the system calls, or used another mechanism to avoid system calls altogether.
We can figure this out by looking at the C library source code or by stepping through the code with GDB or a disassembler.
The `sbrk` “system call” is actually a wrapper function around a more fundamental system call, `brk`. And that wrapper has a special case for when the argument is 0: it just returns a cached value! (Code)

The `time` and `clock_gettime` system calls are implemented on Linux by a special shared-memory mechanism called the vDSO. This is a special shared library created by the kernel and linked in to every process automatically. The shared-library page is updated by the kernel with the current time, so processes can read the current time without doing a system call! Here’s the code for the x86-64 versions of the vDSO time functions. This code also explains the (slight) performance differences between the variants. `time` is the cheapest because it does the least work (it just reads an int). `clock_gettime` is more expensive because it loops to ensure that it reads a consistent second/nanosecond pair. The `REALTIME` version is more expensive than the `REALTIME_COARSE` version because the `REALTIME` version additionally uses expensive instructions, such as `rdtsc` (Read TimeStamp Counter), to get a precise nanosecond count.

How important is this optimization? Many programs, particularly server programs, frequently check the current time (to, for example, time out dead connections). People notice when `clock_gettime` gets slow, which is possible in virtualized environments such as Nick’s laptop. See “Two frequently used system calls are ~77% slower on AWS EC2”! Pretty cool what you can do when you can control the kernel.
Some students also saw `getpid` and/or `getppid` running quickly. These students were observing a PID caching optimization that the GNU C library eventually rejected as too error-prone and complex! Read about this history on the `getpid` manual page.

Why might `open`+`close` be so much slower than `close` alone? Because `close` alone does not modify kernel state after the first call. The first call to `close` closes the standard input; the other 999,999 calls simply return an error. `open`, on the other hand, must search the file descriptor table for an empty slot, look up a pathname, allocate memory for a file structure, modify the file descriptor table, etc.: expensive stuff.
The C10K problem
Much OS research in the late 1990s and early 2000s was concerned with addressing performance problems with a specific set of system calls, namely network connection system calls. Wide-area network connections have high latency, so a server (such as a web or FTP server) that services wide-area clients will often be waiting for those clients. A single server in the late 1990s could theoretically service the active workload provided by 10,000 simultaneous network clients, but in practice many servers broke down well before that level, because certain critical system calls had design flaws that prevented them from scaling. This “10,000 connection” problem became known as the “C10K problem.”
This paper traced the C10K problem to two specific kernel functions, `select` (a system call) and `ufalloc` (a kernel function for allocating the numerically smallest unused file descriptor, invoked by `open`, `socket`, `accept`, etc.):

“Scalable Kernel Performance for Internet Servers Under Realistic Loads.” Gaurav Banga and Jeffrey C. Mogul. In Proc. USENIX ATC 1998. Link
Read the manual page for `select` and/or `poll` and try to figure out, from these system calls’ specifications, why they perform badly on servers with 10,000 or more open, but mostly idle, connections. Develop your own hypothesis and then check it out by talking with other groups or reading the paper. Then sketch a solution for this problem: system calls that serve the same need as `select`, but that could scale to large numbers of idle connections.
“A Scalable and Explicit Event Delivery Mechanism for UNIX.” Gaurav Banga, Jeffrey C. Mogul, and Peter Druschel. In Proc. USENIX ATC 1999. Link
Consider a server with N mostly-idle connections. This server is interested in events (such as data arrivals) on N file descriptors, but since these FDs are mostly idle, only a couple of them (say, O(1)) actually have events available at any given moment. How would `select` or `poll` handle this situation? Well, the server would typically run like this:

1. The server process processes all the events that it knows about.
2. The server process prepares to block until there is more work to do. It calls `select` (or `poll`), passing down the full set of N file descriptors to the kernel (which takes O(N) work).
3. The kernel does O(N) work to check all N file descriptors for events.
4. If no events are available, the kernel does O(N) work to block on all N file descriptors (so if an event arrives on any of the file descriptors, the process’s kernel task will wake up).
5. Eventually, a few events arrive. The kernel reports those events to the server process.
6. Goto step 1.
Thus, it takes O(N) work to process O(1) events!
This problem is baked in to these system calls’ designs. Although `poll` addressed some problems with `select` (namely, `select` requires the user process to construct new event bitmasks every time, and in some older implementations `select` didn’t handle large file descriptor values), the consensus is that `poll` is worse in most realistic scenarios (because `poll`’s data structures are much larger than `select`’s), and `poll` doesn’t fix the O(N) problem.

To fix the O(N) problem, we must move state into the kernel, to avoid the O(N) state transfer in step 2 above. This is what the `epoll` and `kqueue` system calls do. Here’s how they work:
1. The server process sets up an “event handle,” a special kind of file descriptor that knows how to listen for events on other file descriptors.
2. The server process adds the N file descriptors to the handle. This takes O(N) work, but needs to be done just once. Then:
3. The server process processes all the events it knows about.
4. The server process calls `epoll` or `kqueue`, passing down just the event handle.
5. Since the kernel has already set up internal structures corresponding to the event handle, it can wait for events with just O(1) work!
6. Goto step 3.

As connections come and go, the server process can add and remove event triggers to and from the handle.
As you might imagine, event handles are subtle to implement, and no mechanism is perfect. It’s fun to read people getting mad online about them. Bryan Cantrill, a not infrequently obnoxious former Sun employee, pooped on `epoll` for years. (Former Sun employees are way too attached to Solaris, which was at best fine.) My favorite set of complaints about event handles comes from Marc Lehmann’s libev documentation; search for “special problem” or jump to “portability notes.” Some quotes:

The epoll mechanism deserves honorable mention as the most misdesigned of the more advanced event mechanisms: mere annoyances include silently dropping file descriptors, requiring a system call per change per file descriptor (and unnecessary guessing of parameters), problems with dup, returning before the timeout value, resulting in additional iterations (and only giving 5ms accuracy while select on the same platform gives 0.1ms) and so on. … Epoll is also notoriously buggy - embedding epoll fds should work, but of course doesn't, and epoll just loves to report events for totally different file descriptors (even already closed ones, so one cannot even remove them from the set) than registered in the set (especially on SMP systems). … Epoll is truly the train wreck among event poll mechanisms, a frankenpoll, cobbled together in a hurry, no thought to design or interaction with others. Oh, the pain, will it ever stop...
Kqueue deserves special mention, as at the time of this writing, it was broken on all BSDs except NetBSD (usually it doesn't work reliably with anything but sockets and pipes, except on Darwin, where of course it's completely useless). Unlike epoll, however, whose brokenness is by design, these kqueue bugs can (and eventually will) be fixed….
OS/X AND DARWIN BUGS: The whole thing is a bug if you ask me - basically any system interface you touch is broken, whether it is locales, poll, kqueue or even the OpenGL drivers. … The kqueue syscall is broken in all known versions…. Instead of fixing kqueue, Apple replaced their (working) poll implementation by something calling kqueue internally around the 10.5.6 release, so now kqueue and poll are broken. … All that's left is select, and of course Apple found a way to fuck this one up as well.
The scalable event interface for Solaris is called "event ports". Unfortunately, this mechanism is very buggy in all major releases. If you run into high CPU usage, your program freezes or you get a large number of spurious wakeups, make sure you have all the relevant and latest kernel patches applied. No, I don't know which ones…
As with everything on Solaris, [the Solaris 10 event port mechanism is] really slow, but it still scales very well (O(active_fds)). … On the positive side, this backend actually performed fully to specification in all tests and is fully embeddable, which is a rare feat among the OS-specific backends (I vastly prefer correctness over speed hacks). On the negative side, the interface is bizarre - so bizarre that even sun itself gets it wrong in their code examples … Fortunately libev seems to be able to work around these idiocies.