Notes by Abby Lyons
Here is a drawing:
CPU --- Cache Hierarchy --- Bus (PCI)
                           /    |    \
              Primary memory   disk   network device
Let's examine the data movement on the lowest level of this machine:
- 4-32 KB moves from disk to memory
- 64-128 B moves between primary memory and cache
- 64-1500 B moves from the network device to primary memory

And the interrupts:
- Disk interrupts: one for read complete, one for write complete. One interrupt every 32 KB.
- Network interrupts: one for packet arrival, one for completing a transmission. One interrupt per packet = 100-1000 times more interrupts for the same amount of data. Networks can run at 100 gigabits/second, so with minimum-size packets this works out to on the order of 200 million interrupts per second. This is bad.
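Sanity-checking that interrupt rate with a back-of-the-envelope calculation, assuming one interrupt per minimum-size (64 B) packet and ignoring framing overhead like the preamble and inter-frame gap:

```python
# Interrupt rate for a 100 Gbit/s link with one interrupt per packet,
# assuming every packet is a minimum-size 64-byte frame.
link_bits_per_sec = 100e9
packet_bits = 64 * 8  # 64 bytes = 512 bits

interrupts_per_sec = link_bits_per_sec / packet_bits
print(f"{interrupts_per_sec:.2e}")  # 1.95e+08, i.e. ~200 million/s
```

Larger packets reduce the rate proportionally, but even 1500 B packets still mean millions of interrupts per second.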
Livelock: throughput drops to zero because the system is swamped with new work. For networks, this means the input packet rate keeps increasing while the output packet rate approaches zero. This happens because new packets are handled immediately, even if there are old packets still waiting. As a result, if packets arrive quickly enough, the OS keeps getting interrupted before any packet can be fully processed.
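A toy simulation of that failure mode (the numbers are entirely hypothetical: each tick has one unit of CPU time, interrupt handling for new arrivals is always served first, and leftover time does the real work):

```python
def throughput(arrival_rate, irq_cost=0.2, work_cost=0.5, ticks=1000):
    """Toy livelock model: interrupts for new packets preempt the
    processing of already-queued packets."""
    queue = 0
    delivered = 0
    for _ in range(ticks):
        budget = 1.0
        # New packets always interrupt first (the livelock-prone policy).
        budget -= arrival_rate * irq_cost
        queue += arrival_rate
        # Whatever CPU time is left goes to finishing queued packets.
        while budget >= work_cost and queue >= 1:
            budget -= work_cost
            queue -= 1
            delivered += 1
    return delivered / ticks

for rate in (1, 2, 3, 4):
    print(rate, throughput(rate))  # 1.0, 1.0, 0.0, 0.0: output collapses
```

Past the point where interrupt handling alone eats the whole CPU budget, delivered throughput falls off a cliff to zero rather than plateauing.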
Sidebar: What is TCP?
- Establish a connection
  - Client sends SYN packet.
  - Server sends SYN-ACK in response.
  - Client sends ACK. Connection is now established.
- Talk to the server
  - Client sends data; each segment carries an acknowledgment number meaning "I have heard all previous pieces in this communication".
  - Server sends ACKs so the client knows its data is being received. (Not every packet is ACKed, but O(number of packets) ACKs are sent.) This results in a lot of small (64 B) packets.
  - Rinse and repeat.
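The cumulative-acknowledgment idea ("I have heard all previous pieces") can be sketched like this. This is a toy model, not a real TCP implementation; the segment sizes and buffering policy are simplifying assumptions:

```python
class Receiver:
    """Toy cumulative-ACK receiver: the ACK number it returns means
    'I have received every byte before this one.'"""
    def __init__(self):
        self.expected = 0          # next in-order byte we want
        self.out_of_order = {}     # seq -> length, buffered segments

    def on_segment(self, seq, length):
        if seq == self.expected:
            self.expected += length
            # Drain any buffered segments that are now in order.
            while self.expected in self.out_of_order:
                self.expected += self.out_of_order.pop(self.expected)
        elif seq > self.expected:
            self.out_of_order[seq] = length
        return self.expected       # the cumulative ACK number

r = Receiver()
print(r.on_segment(0, 100))    # 100
print(r.on_segment(200, 100))  # still 100: bytes 100..199 are missing
print(r.on_segment(100, 100))  # 300: the gap is filled
```

Note how a single ACK number summarizes everything received so far, which is why the receiver doesn't strictly need to ACK every packet.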
Solutions to livelock
Anyway, back to the livelock badness. Let's come up with some solutions:
- Polling. Every timer interrupt, process every network packet that's available. This still results in livelock eventually, because we are still privileging new work over existing work.
- Polling, but with a limit on the number of packets processed (the example given was 5 per timer interrupt). This works.
- Batched (coalesced) interrupts: the device raises one interrupt for a whole batch of packets instead of one per packet. With this, we don't need to do polling anymore.
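The limited-polling idea above is essentially what Linux's NAPI does with its per-poll "budget". A minimal sketch, where the names and the budget value are illustrative rather than real kernel API:

```python
from collections import deque

def poll_loop(rx_queue, handle, budget=5):
    """NAPI-style poll: process at most `budget` packets per invocation,
    so already-accepted work is never starved by new arrivals."""
    done = 0
    while done < budget and rx_queue:
        handle(rx_queue.popleft())
        done += 1
    # If the queue drained, a real driver would re-enable the NIC's
    # interrupt here; otherwise it stays in polling mode.
    return done

q = deque(range(12))    # 12 pending packets
out = []
while q:
    poll_loop(q, out.append, budget=5)
print(out)  # all 12 packets, in arrival order, 5 + 5 + 2 per poll
```

The budget is what breaks the "new work always wins" pathology: no matter how fast packets arrive, each poll returns after a bounded amount of work.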
This paper (Mogul and Ramakrishnan, "Eliminating Receive Livelock in an Interrupt-Driven Kernel") described the problem precisely and showed how to solve it. From the abstract:
Most operating systems use interface interrupts to schedule network tasks. Interrupt-driven systems can provide low overhead and good latency at low offered load, but degrade significantly at higher arrival rates unless care is taken to prevent several pathologies. These are various forms of receive livelock, in which the system spends all its time processing interrupts, to the exclusion of other necessary tasks. Under extreme conditions, no packets are delivered to the user application or the output of the system.
To avoid livelock and related problems, an operating system must schedule network interrupt handling as carefully as it schedules process execution. We modified an interrupt-driven networking implementation to do so; this eliminates receive livelock without degrading other aspects of system performance. We present measurements demonstrating the success of our approach.
Direct packet delivery
Fast forward to 2008-ish. We have more CPUs and faster network cards, which bring some new bottlenecks, namely the network card's single communication channel, which all the cores have to synchronize on. What to do?
Dedicate a core to networking. This is really bad for the cache: packet data lands in that core's cache and then has to migrate to whichever core actually runs the application.
Network devices with multiple transmit and receive queues. Packets are split roughly evenly among receive queues. Similarly, one transmit queue per core means no synchronization is necessary among cores.
A whole set of interrelated network device features enables this, including Receive Side Scaling (RSS); see the Linux kernel's documentation on scalable networking.
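A sketch of how RSS-style queue selection works. Real NICs use a Toeplitz hash over the flow 4-tuple with a configurable key; SHA-1 here is just a stand-in for "some deterministic hash":

```python
import hashlib

def rss_queue(src_ip, src_port, dst_ip, dst_port, n_queues):
    """Toy RSS: hash the flow 4-tuple to pick a receive queue, so every
    packet of one flow lands on the same queue (and thus the same core)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % n_queues

# Every packet of the same flow maps to the same queue:
q1 = rss_queue("10.0.0.1", 12345, "10.0.0.2", 80, 8)
q2 = rss_queue("10.0.0.1", 12345, "10.0.0.2", 80, 8)
print(q1 == q2)  # True
```

Hashing per flow (rather than round-robin per packet) matters: it keeps each TCP connection's packets in order and on one core's cache.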
Kind of unrelated but https://en.wikipedia.org/wiki/IPv4_address_exhaustion
One more bottleneck for some network servers is the number of copies that need to be made. The receive copy is the copy from the network device into kernel memory. The second copy happens when the kernel processes the TCP headers and copies the payload into a socket buffer (a structure kind of like a pipe buffer). The third copy happens when copying from the socket buffer to userspace.
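Rough arithmetic on what those copies cost in memory bandwidth. This is a crude model: each copy reads and writes the payload once, and it ignores that the first copy is DMA rather than CPU work:

```python
def copy_bandwidth(net_rate_gbps, copies=3):
    """Memory traffic implied by a copying receive path: each copy
    reads and writes the payload once, so bytes moved = 2 * copies * data."""
    data_gb_per_sec = net_rate_gbps / 8    # gigabits -> gigabytes per second
    return 2 * copies * data_gb_per_sec    # GB/s of memory traffic

print(copy_bandwidth(100))  # 75.0 GB/s of memory traffic for a 100 Gb/s link
```

That's a substantial fraction of a typical memory controller's bandwidth spent just shuffling the same bytes around, which is the motivation for cutting copies out.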
User-level networking can be implemented with zero copies! How? Just give the user process direct access to the device! There are several systems that do this.
Remote direct memory access
It can also be done with -1 copies: RDMA can put data directly into the cache hierarchy, or read/write another computer's memory without involving its CPU. Crazy, right?