[NSDI 2026] Checkmate: Zero-Overhead Model Checkpointing via Network Gradient Replication

hipersys-team/checkmate

Install packages

  1. Clone the repository
git clone git@github.com:hipersys-team/checkmate.git
cd checkmate
git submodule update --init --recursive
  2. Set up the environment: run build.sh WITHOUT an active conda environment
conda deactivate
chmod a+x script/*.sh
script/build.sh

How to run

Add Hugepages for DPDK

cd script/
mkdir -p /tmp/mnt/huge
sudo ./dpdk-hugepages.py --mount --directory /tmp/mnt/huge --user `id -u` --group `id -g` --setup 8G
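After the setup command above, it can be useful to confirm that the kernel actually reserved hugepages. The following is a minimal sketch, assuming a Linux host; `hugepage_total_kb` is a hypothetical helper, not part of this repository. It reads the `HugePages_Total` and `Hugepagesize` fields of `/proc/meminfo`, which only cover the default hugepage size, so treat a nonzero result as a sanity check rather than an exact accounting.

```shell
# Hypothetical helper: print total default-size hugepage memory in kB.
# Reads a meminfo-style file (defaults to /proc/meminfo).
hugepage_total_kb() {
    awk '/^HugePages_Total:/ {n = $2}
         /^Hugepagesize:/    {s = $2}
         END                 {print n * s}' "${1:-/proc/meminfo}"
}

# Only query the live system when the file is readable (Linux).
if [ -r /proc/meminfo ]; then
    echo "Reserved hugepage memory (kB): $(hugepage_total_kb)"
fi
```

A result of 0 suggests the reservation did not take effect and the setup command should be re-run.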

Run Training

Information can be found in training.

Note

To specify the number of storage nodes, set the environment variable NUM_STORAGE to the desired number.

Run Storage Server(s)

Information can be found in storage.

Note

To specify the number of training nodes, set the environment variable NUM_TRAINING to the desired number.
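The two notes above can be sketched as follows. The values 2 and 4 are illustrative assumptions, and `check_count` is a hypothetical helper that rejects values that are not positive integers. The actual launch commands are described in the training and storage directories and are not reproduced here.

```shell
# Illustrative values only; adjust to your cluster.
export NUM_STORAGE=2    # number of storage nodes, set before running training
export NUM_TRAINING=4   # number of training nodes, set before running a storage server

# Hypothetical helper: succeed only if the value is a positive integer.
check_count() {
    case "$2" in
        ''|*[!0-9]*|0)
            echo "$1 must be a positive integer, got '$2'" >&2
            return 1 ;;
    esac
}

check_count NUM_STORAGE "$NUM_STORAGE"
check_count NUM_TRAINING "$NUM_TRAINING"
echo "NUM_STORAGE=$NUM_STORAGE NUM_TRAINING=$NUM_TRAINING"
```

Exporting the variables in the same shell that launches the training or storage process ensures the process inherits them.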

Manual Build

Most steps are automated in the build.sh script. However, if you want to build manually, follow the steps below.

Compile libtpa

cd third_party/libtpa
make -j
sudo -E make install

Important

For newer NICs such as ConnectX-7, select DPDK v22.11 by running export DPDK_VERSION=v22.11 before compiling.
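Combined with the libtpa build commands, the note above might look like the sketch below. It assumes you are in the repository root. The directory guard is an addition for illustration, so the snippet does nothing when the submodule is absent.

```shell
# Select the DPDK version for newer NICs such as ConnectX-7.
export DPDK_VERSION=v22.11

# Build libtpa only if the submodule has been checked out.
if [ -d third_party/libtpa ]; then
    (
        cd third_party/libtpa &&
        make -j &&
        sudo -E make install   # -E asks sudo to keep the caller's environment
    )
fi
```

The `-E` flag matters here because the install step runs under sudo, and without it the exported `DPDK_VERSION` may not be visible to that step.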

Compile NCCL

cd third_party/nccl
sudo make -j src.install

Compile NCCL Plugin

cd third_party/nccl-plugin/cc
make clean; make -j
