# High-Performance RPC
Your first problem set is a warmup: Speed up a simple client–server distributed system by speeding up its RPC subsystem.
The goals of this problem set are to give you hands-on experience with important tools and ideas for low-level distributed systems performance, as well as hands-on experience with an important, heavily-engineered distributed system library.
Deadline: We’ll discuss your progress in class on Wednesday February 11.
## Client–server operation
The handout code contains a simple client and a simple server. Here’s how they work.
The client:
- Reads an input file containing K string/integer pairs.
- Connects to the server.
- N times (where N is large, and might be larger than K):
  - Selects the next pair from the input (wrapping around if it runs out of pairs).
  - Creates a “TryRequest” message containing that pair and a serial number. This message is formatted using Protocol Buffers via the gRPC framework.
  - Sends the message to the server.
  - Waits for a “TryResponse” message, which contains an integer.
- Finally, after all TryResponses have been received:
  - Sends a “DoneRequest” message to the server.
  - Waits for a “DoneResponse” message, which contains two checksums: one of the pairs the server received and one of its responses.
  - Computes its own versions of these checksums.
  - Prints whether the checksums match and the RPC processing rate.
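The messages above are declared in `rpcgame.proto`. The handout's actual definitions may differ; as a rough illustration, the schema might look something like this (all field names and numbers here are guesses, not the handout's):

```proto
syntax = "proto3";

package rpcgame;

service RpcGame {
  rpc Try(TryRequest) returns (TryResponse);
  rpc Done(DoneRequest) returns (DoneResponse);
}

message TryRequest {
  uint64 serial = 1;  // serial number assigned by the client
  string key = 2;     // the string half of the pair
  int64 value = 3;    // the integer half of the pair
}

message TryResponse {
  uint64 serial = 1;
  int64 result = 2;   // the integer the client folds into its checksum
}

message DoneRequest {
}

message DoneResponse {
  uint64 request_checksum = 1;   // checksum of the pairs the server received
  uint64 response_checksum = 2;  // checksum of the responses it sent
}
```

`protoc` plus the gRPC plugin compiles a file like this into the C++ message and stub classes that the client and server drivers use.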
The server:
- Listens for connections.
- On receiving a connection, repeatedly:
  - Waits for a message.
  - If the message is “TryRequest,” computes a “TryResponse” response and returns it.
  - Otherwise, the message is “DoneRequest.” The server computes checksums based on the pairs it received and the responses it sent (sorted in serial order), sends a “DoneResponse” response with those checksums, and shuts down.
(Both the server and the client compute their checksums incrementally, but the results are the same as computing them once at the end. Also, the gRPC framework might automatically create multiple server threads to handle client requests in parallel, so the server uses synchronization objects to ensure it obeys the Fundamental Law of Synchronization and processes requests in serial order.)
## Handout code
Fetching the handout code uses a process familiar from CS 61.
- Accept our GitHub Classroom assignment to get a private repository.
- Clone your repository locally (e.g., `git clone git@github.com:readablesystems/cs2620-s26-psets-YOURNAME`).
- Add a remote for our handout code: `git remote add handout https://github.com/readablesystems/cs2620-s26-psets`
- Merge our handout code: `git pull handout main` (do not use “rebase and merge”), then `git push`.
Here’s what you’ll see in the handout.
- The `rpcgame.hh` file contains helper code and function declarations used by both client and server.
- The `rpcgame.proto` file defines the message formats; the `protoc` protocol buffer compiler, plus gRPC plugins, compiles this file into helper classes.
- In addition to these common files, the handout client lives in `rpcg-client.cc` and `clientstub.cc`. The client driver `rpcg-client.cc` implements the overall logic, while `clientstub.cc` interfaces between the client driver and our chosen communication libraries.
- Similarly, the handout server lives in `rpcg-server.cc` and `serverstub.hh`.
## Build
Our code depends on open-source libraries including Google Protocol Buffers, gRPC, and xxHash (for computing checksums). These build instructions work on macOS with Homebrew; we’ll update the pset with Linux instructions.
- Install required libraries and the CMake build system: `brew install xxhash grpc cmake`
- Change into the `pset1` subdirectory.
- Configure a `build` directory: `cmake -B build`
- Build the code: `(cd build; cmake --build .)` (In some configurations, we see a lot of warnings from within the gRPC framework. They appear safe to ignore.)
- Run it: `(killall rpcg-server; build/rpcg-server&; sleep 0.5; build/rpcg-client)`

This command line works as follows.

- `killall rpcg-server`: Kills any lingering server processes. The server will shut itself down if told to do so by the client, but if you quit the client early, the server will persist.
- `build/rpcg-server&`: Runs the server process in the background.
- `sleep 0.5`: Gives the server a chance to start up.
- `build/rpcg-client`: Runs the client process in the foreground. By default, this sends 100,000 RPCs, printing a progress message every 10,000 RPCs. It then reports the checksums and whether they matched, plus the overall rate. You can supply `-n N` to run for N RPCs rather than 100,000.
## Goal, tools, techniques
This problem set asks you to improve the performance of the client–server system as much as possible without cheating. The client must send functionally identical messages to the server, must use the server’s responses to compute its server checksum (rather than copying the server’s checksum computation locally), and the high-level client and server driver logic must remain the same as in the handout code. When given the same input, your client and server should compute the same checksums as the handout code. Nevertheless, the system should get faster.
gRPC and Protocol Buffers are both efficient and very widely used. But they are also very heavily engineered, and contain many features, such as automatic retry or “telemetry” (system-level profiling), that may be unnecessary in this context or actively harmful for performance. Can you streamline the setup we’ve handed out?
Here are some general performance tools you can apply:
- Windowing. The handout client code waits for each message to be acknowledged before sending the next. Is there a faster way?
- Copy avoidance. The handout client code ends up slinging complex objects and frequently copies memory, for instance when managing `std::string` objects. Can you reduce some of the copies?
- Profiling. Can you apply a smart profiler, such as Linux perf, to find hot spots in the client and/or server code?
- Library choice. Maybe gRPC is slow for this application. Would a different framework be better?
## Phases
Your work on the pset involves at least two phases. Use Git branches to distinguish your code for each phase.
In phase 1, you should change the client and server stubs (`clientstub.cc` and `serverstub.cc`), but not the drivers (`rpcg-client.cc` and `rpcg-server.cc`), and not the wire format used for messages (i.e., still Protocol Buffers).
During this phase, you must at least change the client to send messages asynchronously (i.e., windowing: the client may send a new RPC before previous RPCs have been acknowledged). The C++ coding for this won’t be pleasant, but you will learn some useful patterns, such as callbacks and objects that collect asynchronous notifications.
In phase 2, you may change the client and server stubs and the wire format used for messages. You still may not change the drivers (`rpcg-client.cc` and `rpcg-server.cc`).
During this phase, you must rewrite the client and server stubs to try at least
one other RPC library. (There are tons; for example, in no particular order,
msgpack-rpc,
smf, rpclib, Apache
Thrift.) This will require serious code surgery and
experimenting with variously well-documented APIs. I would recommend making a
copy of the handout code (say, in YOURPSETS/pset1/msgpack-rpc) and then asking
a coding assistant for help in rewriting clientstub.cc and serverstub.cc to
use your chosen framework. Check its work, of course! You should understand all
the code you turn in, and your writeup should document what you learn.
As you work, track the performance of your evolving solution in a “lab
notebook” file, such as NOTEBOOK.md, or by including performance numbers in
your Git commits. Write down what you did and what difference, if any, it made.
You will be tempted to skip this; you might try something, observe it makes no
difference to performance, and then undo it without recording your attempt.
Avoid this temptation! If you document the performance of your attempts, you’ll
have a record you can go back to, and you’ll be less likely to retry an idea
that you forgot didn’t work.
In both phases, your code must work for any input file and for any N.
## Going further (optional)
Hungry for more? Try testing the server and client on different machines.
Maybe the protocol features needed for hyper-speed communication on extremely
fast networks, like localhost, differ from those needed on slower networks!
## Our handout performance & solution performance
Our handout code prints this when run on my desktop:
```
$ (killall rpcg-server; build/rpcg-server&; sleep 0.5; build/rpcg-client; sleep 0.1)
No matching processes belonging to you were found
Server listening on localhost:29381
sent 10000 RPCs, recently 4504 RPCs/sec...
sent 20000 RPCs, recently 4512 RPCs/sec...
sent 30000 RPCs, recently 4516 RPCs/sec...
sent 40000 RPCs, recently 4455 RPCs/sec...
sent 50000 RPCs, recently 4376 RPCs/sec...
sent 60000 RPCs, recently 4403 RPCs/sec...
sent 70000 RPCs, recently 4185 RPCs/sec...
sent 80000 RPCs, recently 4298 RPCs/sec...
sent 90000 RPCs, recently 4426 RPCs/sec...
sent 100000 RPCs, recently 4310 RPCs/sec...
client checksums: e221211901eeed66/e221211901eeed66
server checksums: 83dceb304bf7f399/83dceb304bf7f399
match: true
sent 100000 RPCs in 22.748712197 sec
sent 4396 RPCs per sec
Server exiting
```
Our phase 1 solutions, which implement windowing plus some other tricks, are a little faster:
```
No matching processes belonging to you were found
Server listening on localhost:29381
sent 10000 RPCs, recently 21559 RPCs/sec...
sent 20000 RPCs, recently 35159 RPCs/sec...
sent 30000 RPCs, recently 36416 RPCs/sec...
sent 40000 RPCs, recently 36735 RPCs/sec...
sent 50000 RPCs, recently 36288 RPCs/sec...
sent 60000 RPCs, recently 34525 RPCs/sec...
sent 70000 RPCs, recently 30900 RPCs/sec...
sent 80000 RPCs, recently 36795 RPCs/sec...
sent 90000 RPCs, recently 35104 RPCs/sec...
sent 100000 RPCs, recently 36298 RPCs/sec...
client checksums: e221211901eeed66/e221211901eeed66
server checksums: 83dceb304bf7f399/83dceb304bf7f399
match: true
sent 100000 RPCs in 3.016785653 sec
sent 33148 RPCs per sec
Server exiting
```
And other frameworks, or a framework you built yourself, could perform better still!
## Turnin
You’ll share your code with others in class, and we plan to post a leaderboard. The most important turnin file will be your lab notebook: a write-up of your experience. What changes were most important to performance? What changes did not affect performance? What did you find most difficult or interesting about the assignment?
Note that your code on this pset is not a heavy part of your grade. Neither is absolute performance: any substantive work (say, 6–8 hours) will be enough. Leaderboard glory is just for fun.