A high-performance, minimalist cache server.
- Build an alternative to the Redis cache from scratch. Requirements are limited to simple distributed key-value storage for string data.
- Focus on performance, not readability, 'patterns', or idioms. Prefer plain C-like code when necessary.
- Avoid excessive defensive programming and argument checking deep inside the server code. Validate input from the network, but do not validate anything after that. For example, there is no point in validating the pointer in the kvs.get(ptr) method, because we should never have passed an invalid pointer there in the first place. If one was passed, we should fix the 'outer' layer, not the memory storage.
- Avoid external libraries (e.g. Boost) unless necessary. Exception: unit tests and non-core functionality (e.g. Prometheus metrics).
- The solution should support only Linux, preferably Alpine or a similar distribution. No Windows or macOS support... ever.
- The solution should be container and (hopefully) orchestrator friendly
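The validate-at-the-boundary principle above can be illustrated with a short hypothetical Python sketch (not the server's real code; the command names and 0x1F terminator match the protocol examples later in this document, but the response strings here are made up): the network layer parses and validates the raw request exactly once, and the storage layer trusts its arguments.

```python
class KeyValueStore:
    """Inner layer: trusts its callers, performs no argument checking by design."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value  # no validation here, on purpose

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def handle_request(kvs, raw: bytes) -> bytes:
    """Outer layer: the ONLY place where input from the network is validated.

    Response strings below are hypothetical placeholders.
    """
    parts = raw.rstrip(b"\x1f").split(b" ", 2)
    if not parts or parts[0] not in (b"GET", b"SET", b"DEL"):
        return b"ERROR unknown command"
    if parts[0] == b"SET":
        if len(parts) != 3:
            return b"ERROR malformed SET"
        kvs.set(parts[1], parts[2])
        return b"OK"
    if len(parts) != 2:
        return b"ERROR malformed request"
    if parts[0] == b"GET":
        value = kvs.get(parts[1])
        return value if value is not None else b"NOT_FOUND"
    kvs.delete(parts[1])
    return b"OK"
```

Everything below `handle_request` can then skip re-checking its inputs; a bad pointer (or here, a malformed key) is a bug in the outer layer, not something the store should defend against.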
A local Python script leverages multiprocessing to send requests to the running server and await the responses.
Example:
export TEST_POOL_SIZE=96 && python3 ./tcp_server_test.py -p -b 100
A few testing scenarios are supported right now:
- Multiple GET requests
- Multiple SET requests
- Multiple DEL requests
- (SET key, GET key, GET non_existent_key) workflow
- Single request per single connection test (not recommended)
Functional RPS is calculated as N / (Tclient + Tserver), where:
- Tclient - time spent by the client to send all the requests, plus time to receive and verify the responses
- Tserver - time spent by the server to process and respond to all the requests
- N - total number of requests
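Assuming RPS means total requests divided by the combined client and server time, the metric can be computed as:

```python
def functional_rps(t_client: float, t_server: float, n_requests: int) -> float:
    """Functional RPS: total number of requests divided by the combined
    client-side and server-side time, both in seconds."""
    return n_requests / (t_client + t_server)


# Example: 10 million requests, 4 s on the client side and 1 s on the
# server side yield 2 million requests per second.
print(functional_rps(4.0, 1.0, 10_000_000))  # → 2000000.0
```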
Linux kernel settings, applied as much as possible for both the local and Docker setups, can be found in local_server_setup.bash
- Ubuntu 24.04, kernel 6.14.0-27-generic (with a high-end processor and half-gigabit internet).
- Docker on Ubuntu or Windows (with a high-end processor and half-gigabit internet)
Free GitHub-hosted runner hardware
Local setup. 10 million requests per test suite, 96 test client processes forked
more than 100 000 RPS.
- more than 1 500 000 RPS (GET/DEL)
- more than 1 000 000 RPS (SET)
- around 3 000 000 RPS (SET key, GET key, GET non_existent_key) workflow
around 90 000 RPS
TBD
- TBD
1 million requests total (4 processes and 250000 chunks per process)
around 22 500 RPS
- more than 200 000 RPS (SET/GET/DEL)
- more than 400 000 RPS for the (SET key, GET key, GET non_existent_key) workflow
The next step is 10M+ functional RPS on Ubuntu (with or without pipelining)
Below are the results I got from Redis.
Installed via https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/apt/
python3 ./tcp_server_test.py -p -b 100 --redis
around 120 000 RPS for GET / SET / DEL tests
around 650 000 RPS for SET tests, around 900 000 RPS for GET / DEL tests
around 1 250 000 RPS for (SET key, GET key, GET non_existent_key) workflow tests
redis-benchmark -t set -r 1000000 -n 1000000 -d 12 -P 100
around 120 000 RPS for GET / SET tests
around 1 000 000 RPS for GET / SET tests
Run redis in docker
docker compose -f docker-compose-local.yaml --profile redis build
docker compose -f docker-compose-local.yaml --profile redis up
Run official redis-benchmark tool
docker exec 2d279699e307 redis-benchmark -t set -r 1000000 -n 1000000 -d 12 -P 100
around 110 000 RPS for GET / SET tests
TBD
TODO: server performance (client-agnostic) should be measured on the server itself. A PERF command could be introduced into the protocol.
Local Ubuntu, 10M records:
- KeyValueStoreTest.LargeJSONFiles (1167 ms)
- KeyValueStoreTest.AddAndRetrieveElements (22795 ms)
- KeyValueStoreTest.OverwriteElements (45561 ms)
Tip
Use the NUM_SHARDS environment variable (for the local setup it's defined in common-compose-config.yaml) to control the number of shards for the server.
A high number of shards adds a small memory overhead; however, it may boost server performance and help avoid collisions for large storage.
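As an illustration of the sharding idea (hypothetical code; the real server's hash function and shard layout may differ): each key is mapped to one of NUM_SHARDS independent stores, so a higher shard count means fewer keys, and therefore fewer collisions and less contention, per shard.

```python
import os
import zlib

# Hypothetical sketch of sharded key-value storage. NUM_SHARDS mirrors the
# environment variable described above; the default of 4 is arbitrary.
NUM_SHARDS = int(os.environ.get("NUM_SHARDS", "4"))
shards = [{} for _ in range(NUM_SHARDS)]


def shard_for(key: bytes) -> dict:
    """Pick a shard by hashing the key. A stable hash (crc32 here) keeps the
    key-to-shard mapping consistent across restarts and processes."""
    return shards[zlib.crc32(key) % NUM_SHARDS]


def kvs_set(key: bytes, value: bytes) -> None:
    shard_for(key)[key] = value


def kvs_get(key: bytes):
    return shard_for(key).get(key)


kvs_set(b"key1", b"value1")
print(kvs_get(b"key1"))  # → b'value1'
```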
Run cache and tests
docker compose -f docker-compose-local.yaml --profile main --profile tests build
docker compose -f docker-compose-local.yaml --profile main up --detach
Wait a few moments for the server to start. Check the server container logs for the message TCP server is ready to process incoming connections.
After the server has started, run the test script:
docker compose -f docker-compose-local.yaml --profile tests up
You can check Prometheus metrics while tests are running by opening http://localhost:8080/metrics
Don't forget to shut the detached container down by issuing:
docker compose -f docker-compose-local.yaml --profile main down
docker build -f Dockerfile.utests . -t cache-tests:latest
docker run -it cache-tests:latest
Open terminal in repository root and apply system configuration via
sudo bash ./local_server_setup.bash
Open second terminal somewhere on your hard drive and install required dependencies
sudo apt update && sudo apt upgrade -y
sudo apt install -y git cmake build-essential libgtest-dev zlib1g-dev gcc-14 g++-14
git clone https://github.com/jupp0r/prometheus-cpp.git && cd prometheus-cpp && \
git submodule init && git submodule update && \
mkdir _build && cd _build && \
cmake .. -DBUILD_SHARED_LIBS=ON -DENABLE_PUSH=OFF -DENABLE_COMPRESSION=OFF && \
cmake --build . --parallel $(nproc) && \
ctest -V && \
sudo cmake --install .
Set env variables, for example:
source .env
Run unit tests:
./run-all-tests.bash
The repository is scanned with GitHub CodeQL for C++, Python, and GitHub Actions sources. CodeQL analyses for Python and GitHub Actions run in build-mode: none, so no manual build steps are required for those languages. The C++ analysis path uses build-mode: manual to compile the project with GCC 14 and a locally installed copy of prometheus-cpp. To reproduce the same environment locally, use the following commands (they require sudo privileges):
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install -y gcc-14 g++-14 cmake ninja-build pkg-config zlib1g-dev libgtest-dev
if [ ! -d prometheus-cpp ]; then
git clone https://github.com/jupp0r/prometheus-cpp.git prometheus-cpp
fi
git -C prometheus-cpp submodule update --init --recursive
cmake -S prometheus-cpp -B prometheus-cpp/_build -G Ninja \
-DBUILD_SHARED_LIBS=ON \
-DENABLE_TESTING=OFF \
-DENABLE_PUSH=OFF \
-DENABLE_COMPRESSION=OFF \
-DENABLE_LOGGING=OFF
cmake --build prometheus-cpp/_build --parallel
cmake --install prometheus-cpp/_build --prefix "$(pwd)/prometheus-cpp/_install"
CMAKE_PREFIX_PATH="$(pwd)/prometheus-cpp/_install" cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel
Running the steps above ensures the CodeQL build matches the CI configuration and that the necessary dependencies are present before launching a local CodeQL analysis.
Tip
Use Ctrl+C to send SIGTERM
Build the app using the CMake extension for VS Code, then run the command below (or click the build button in the VS Code CMake extension).
./out/build/Release/src/poor-man-s-cache
Set up a Python virtual environment and run the Python tests from the tests folder. For example:
cd tests && \
virtualenv .venv && \
source .venv/bin/activate && \
pip install -r requirements.txt && \
export TEST_POOL_SIZE=96 && python3 ./tcp_server_test.py -p -b 100
Tip
You can change the number of request sequences in tests via export TEST_ITERATIONS=100000
Run (e.g. for Release build)
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./out/build/Release/src/poor-man-s-cache
Run python tests, e.g. from tests folder:
python3 ./tcp_server_test.py -p -b 100
Profiling setup is similar for all Valgrind tools, below is an example for callgrind. For callgrind, Debug build is recommended, but not required.
Run (e.g. for Debug build)
valgrind --tool=callgrind --simulate-cache=yes ./out/build/Debug/src/poor-man-s-cache
Find the PID, for example via ps aux. Callgrind will create a file called callgrind.out.<server process id>.
Run python tests, e.g. from tests folder:
python3 ./tcp_server_test.py
After the tests have finished, send SIGTERM to the cache server and check the Callgrind output in the callgrind.out.<server process id> file. Open the Callgrind output file with kcachegrind.
Otherwise (if you are a samurai), you can try to figure things out with callgrind_annotate:
callgrind_annotate --tree=both --inclusive=yes --auto=yes --show-percs=yes callgrind.out.<server process id>
Tip
You can replace the namespace and release name. You can also supply a different deployment configuration by editing values.yaml or providing arguments to helm (see the Helm docs). Use helm command-line arguments or edit image.tag in values.yaml to deploy a different tag.
helm upgrade --install poor-man-s-cache ./helm/poor-man-s-cache -n poor-man-s-cache --create-namespace
Warning
Debug mode is very slow! Performance can be 20 to 30 times slower. Valgrind (profiler) settings can be changed in the docker-compose file under the 'cache-valgrind' service configuration.
Tip
Modify the BUILD_TYPE build argument to switch between Debug (contains extra output to stdout and starts the server much faster) and Release (optimized) builds.
Tip
Replace the Valgrind tool with Helgrind (--tool=helgrind) to debug multithreading issues.
To run cache and tests:
docker compose -f docker-compose-local.yaml --profile valgrind --profile tests-valgrind build
docker compose -f docker-compose-local.yaml --profile valgrind up --detach
Wait a few moments for the server to start (Valgrind mode does not require a long initialization time). Check the server container logs for the message TCP server is ready to process incoming connections.
After the server has started, run the test script:
docker compose -f docker-compose-local.yaml --profile tests-valgrind up
Check the Valgrind output in the container stdout during and after execution.
To run cache and tests:
docker compose -f docker-compose-local.yaml --profile callgrind --profile tests-callgrind build
docker compose -f docker-compose-local.yaml --profile callgrind up --detach
Wait a few minutes for the server to start (Callgrind slows down startup). Check the server container logs for the message TCP server is ready to process incoming connections.
After the server has started, run the test script:
docker compose -f docker-compose-local.yaml --profile tests-callgrind up
Tip
ps aux can help to find the server process ID inside the cache-callgrind container.
- Exec into the cache-callgrind container shell.
- Send a termination signal to the cache server process (kill -s SIGTERM <server process id>, e.g. kill -s SIGTERM 7).
- Check Callgrind output in container stdout after execution.
- Execute callgrind_annotate --tree=both --inclusive=yes --auto=yes --show-percs=yes callgrind.out.<server process id> (e.g. callgrind_annotate --tree=both --inclusive=yes --auto=yes --show-percs=yes callgrind.out.7).
Tip
Results of the callgrind_annotate command are hard to read without a GUI. This repository does not provide any GUI example, but it's recommended to use kcachegrind.
The callgrind.out file can be found in the /callgrind directory inside the Docker container.
sysctl -a - Check that all required sysctl options were overwritten successfully in Docker.
netstat -an | grep 'TIME_WAIT' | wc -l or netstat -an | grep -E 'ESTABLISHED|CONNECTED' | wc -l - Check what's going on with sockets; useful during execution of the Python test script (example in sockmon.bash).
echo -ne "SET key1 value1\x1F" | nc localhost 9001 - Send a single SET request to cache server (nice for quick testing)
echo -ne "GET key1\x1F" | nc localhost 9001 - Send a single GET request to cache server (nice for quick testing)
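The same single-request exchange can be scripted, e.g. in Python. The 0x1F terminator and the localhost:9001 endpoint come from the nc examples above; the single-recv response handling is an assumption about the protocol, not a documented guarantee.

```python
import socket

REQUEST_TERMINATOR = b"\x1f"  # unit separator, as in the nc examples above


def encode_request(command: str) -> bytes:
    """Frame a command the same way as the nc examples: 0x1F-terminated."""
    return command.encode() + REQUEST_TERMINATOR


def send_command(command: str, host: str = "localhost", port: int = 9001) -> bytes:
    """Send one command to the cache server and return the raw response.

    NOTE: reading the reply with a single recv() is an assumption; adjust
    to the real response framing if needed.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(encode_request(command))
        return sock.recv(4096)


# Usage (requires a running server):
#   send_command("SET key1 value1")
#   send_command("GET key1")
```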
Netstat in a loop:
while :
do
netstat -pan
sleep 1
done
- Try some super fast hashtable (like the one from Google or Boost); if it can increase performance by 20%, use it, else just continue with the existing one and iterate on improvements.
- Test edge case scenarios
- Integrate valgrind checks into CI
- More coroutines + refactor coroutine code to templates & other fancy things (if that won't hurt performance)
- Replace server metrics with wide observability events. Improve integration between the main server and the metrics server.
- Support key expiration, support more operations.
- Check if we can reduce memory usage during decompression as well.
- Continue improving collision resolution (endless task, tbh...).
- Check if there are better ways of avoiding double hash calculation in the server and KVS (right now we just provide extra public methods in kvs.cpp which accept the hash as an argument).
- There is an opportunity to try out Robot Framework for testing & writing test cases (I've never used that tool). Or just use Cucumber for Golang aka Godog tests, which I know.
- Write more documentation and describe the communication protocol.
- Try out a scaled multi-instance setup (this may require writing a custom load balancer or reverse proxy or using existing solutions like Nginx/Envoy/etc.).
- Check why Valgrind always shows a tiny memory leak from the prometheus-cpp lib (116 bytes in 1 block are still reachable in loss record 1 of 1).
- Read http://www.kegel.com/c10k.html
- Continue reading https://www.chiark.greenend.org.uk/~sgtatham/quasiblog/coroutines-c++20/
- https://beej.us/guide/bgnet/html/#close-and-shutdownget-outta-my-face
- https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/
- https://copyconstruct.medium.com/the-method-to-epolls-madness-d9d2d6378642
- https://eklitzke.org/blocking-io-nonblocking-io-and-epoll
- [Simon Tatham, 2023 - Writing custom C++20 coroutine systems](https://www.chiark.greenend.org.uk/~sgtatham/quasiblog/coroutines-c++20/)