This code accompanies the paper Hardware and Software Platform Inference. Our method identifies the underlying GPU architecture and software stack of a (black-box) machine learning model solely based on its input–output behavior. We leverage the inherent differences of various GPU architectures and compilers to distinguish between different GPU types and software stacks. We evaluate HSPI against models served on real hardware and achieve between 83.9% and 100% accuracy in a white-box setting and up to 3× higher than random-guess accuracy in a black-box setting.
Just run: `conda env create -f env.yaml`
- Conda env: `env.yaml`
- Pip requirements: `requirements.txt`
If you want to differentiate between specific CUDA versions, set the desired version in `env.yaml`. When generating border images across GPUs, make sure compatible NCCL versions are installed on all machines (ideally the same version).
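Before a cross-GPU run, it can help to confirm what each machine's PyTorch build actually ships. This quick check uses only standard PyTorch calls and is not part of this repo:

```python
import torch

# Print the CUDA / NCCL versions bundled with this PyTorch build.
# Run this on every machine involved and compare the outputs.
print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", torch.cuda.nccl.version())
print("GPU  :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```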
Use this script to distinguish different quantization schemes on the same GPU by generating "border images" via projected gradient descent (PGD). It supports both single-image and batched operation modes, configurable learning-rate schedules, and optional transferability checks against additional models.
When transferability checks are enabled (via the `--other-models` family of flags), the script evaluates how each crafted border image generalizes to the specified additional models. It computes a per-image transferability ratio and outputs detailed metadata plus a summary file (`transferability.yaml`) showing how often predicted quantization tags match the true tags across those models.
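For intuition, here is a minimal sketch of what 1-vs-1 PGD border-image crafting looks like. All names (`craft_border_image`, `model_a`, `model_b`) and the exact loss are illustrative assumptions, not the script's internals:

```python
import torch

def craft_border_image(model_a, model_b, x, num_iters=400,
                       start_lr=1e-3, end_lr=1e-4, noise_scale=0.01):
    """Sketch: perturb a batch x until two quantized variants of the same
    network (e.g. fp32 vs. bf16) disagree on the predicted class."""
    x = (x + noise_scale * torch.randn_like(x)).clone().requires_grad_(True)
    lrs = torch.linspace(start_lr, end_lr, num_iters)  # linear LR schedule
    for i in range(num_iters):
        logits_a, logits_b = model_a(x), model_b(x)
        pred_a = logits_a.argmax(-1)
        if (pred_a != logits_b.argmax(-1)).all():      # every image is on the border
            return x.detach()
        # One plausible objective (assumed, not the script's actual loss):
        # raise variant A's score and lower variant B's score for A's current
        # top-1 class, so that B eventually flips its prediction.
        margin = (logits_a.gather(-1, pred_a[:, None])
                  - logits_b.gather(-1, pred_a[:, None])).sum()
        (grad,) = torch.autograd.grad(margin, x)
        with torch.no_grad():
            x += lrs[i] * grad.sign()                  # signed-gradient PGD step
            x.clamp_(0, 1)                             # project to valid pixel range
    return x.detach()
```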
```bash
python img_border_batched_quantization.py <model_name> <dataset> <save_dir> [options]
```
- `<model_name>`: one of `vgg16`, `resnet18`, `resnet50`, `efficientnet_b0`, `densenet121`, `mobilenet_v2`, `mobilenet_v3_small`, `mobilenet_v3_large`
- `<dataset>`: one of `cifar10`, `cifar100`, `imagenet`
- `<save_dir>`: directory where checkpoints, border images, and logs will be saved
- `--method`: engineering method; only `1-vs-1-pgd` is implemented (default: `1-vs-1-pgd`).
- **Fine-tuning**
  - `--fine-tune-batch-size`: batch size for initial FP32 fine-tuning (default: `128`).
- **PGD Engineering**
  - `--engineer-batch-size`: batch size for border-image crafting; use `1` to emulate single-image mode (default: `32`).
  - `--start-lr`, `--end-lr`: start and end values of the linear learning-rate schedule (defaults: `1e-3`, `1e-4`).
  - `--num-iters`: number of PGD iterations per image (default: `400`).
  - `--check-every`: integer or ratio controlling how often to check for border images during PGD (default: `None`, which skips intermediate checks).
  - `--noise-scale`: scale of the initial random noise added to source images (default: `0.01`).
- **Quantization Tags**
  - `--q-tags`: list of quantization schemes to evaluate (e.g. `fp32 bf16 fp16 mxint8 fp8-e3m4 fp8-e4m3 int8-dynamic`).
- **Transferability (optional)**
  - `--other-models`: list of other model names to test border-image transferability.
  - `--other-model-ckpts`: corresponding checkpoint paths for the other models.
  - `--other-model-tags`: tags for each of the other models (defaults to `--q-tags`).
- **Miscellaneous**
  - `--device`: computation device (e.g. `cuda`, `cpu`).
  - `--seed`: random seed (default: `42`).
  - `--overwrite` / `-ow`: overwrite existing `save_dir` if it exists.
  - `--skip-q-test`: skip quantized-model accuracy evaluation.
  - `--create-model-ckpt-only`: exit after saving the FP32 model checkpoint, without engineering border images.
```bash
python img_border_batched_quantization.py resnet18 cifar10 results/BI_quant_resnet18 \
    --method 1-vs-1-pgd \
    --fine-tune-batch-size 128 \
    --engineer-batch-size 32 \
    --num-iters 200 --check-every 20 \
    --q-tags fp32 bf16 fp16 \
    --device cuda
```
If a transferability check against another model is required:
```bash
python img_border_batched_quantization.py resnet18 cifar10 results/BI_quant_resnet18_transfer_resnet50 \
    --method 1-vs-1-pgd \
    --fine-tune-batch-size 128 \
    --engineer-batch-size 32 \
    --num-iters 200 --check-every 20 \
    --q-tags fp32 bf16 fp16 \
    --device cuda \
    --other-models resnet50
```
Before you run the main border-image script in distributed (NCCL) mode, you can verify your NCCL connectivity:
```bash
# On your "master" node:
python nccl_simple_test.py \
    --rank 0 \
    --world_size 2 \
    --master_addr 111.111.11.11 \
    --master_port 22222

# On the second node:
python nccl_simple_test.py \
    --rank 1 \
    --world_size 2 \
    --master_addr 111.111.11.11 \
    --master_port 22222
```
If both ranks report the correct all_reduce sums, your network and NCCL setup is good. If the connection does not work, the cause may be an NCCL version incompatibility or firewall rules; check firewall rules with `sudo ufw status`.
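For reference, the core of such a connectivity check is the standard `torch.distributed` all_reduce pattern. The sketch below mirrors the flags used above but is not necessarily `nccl_simple_test.py` verbatim:

```python
import argparse, os
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)
parser.add_argument("--world_size", type=int, required=True)
parser.add_argument("--master_addr", required=True)
parser.add_argument("--master_port", required=True)
args = parser.parse_args()

os.environ["MASTER_ADDR"] = args.master_addr
os.environ["MASTER_PORT"] = str(args.master_port)
dist.init_process_group("nccl", rank=args.rank, world_size=args.world_size)

# Each rank contributes its own rank id; after all_reduce (SUM), every rank
# should hold sum(range(world_size)) in every element.
t = torch.full((4,), float(args.rank), device="cuda")
dist.all_reduce(t)
print(f"rank {args.rank}: {t.tolist()} (expected {float(sum(range(args.world_size)))})")
dist.destroy_process_group()
```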
Once NCCL is working, run the border-image script on both machines, pointing at the same master IP & port. You'll also need to set `NCCL_SOCKET_IFNAME` if your machines have multiple NICs:
```bash
# On the "master" server (rank 0):
NCCL_SOCKET_IFNAME=eno1 \
python img_border_batched_GPUs.py \
    --rank 0 \
    --world_size 2 \
    --master_addr 111.111.11.11 \
    --master_port 22222 \
    --model_name resnet18 \
    --dataset cifar10 \
    --save_dir results/border_images_batchsize \
    [other options…]

# On the second server (rank 1):
NCCL_SOCKET_IFNAME=eth0 \
python img_border_batched_GPUs.py \
    --rank 1 \
    --world_size 2 \
    --master_addr 111.111.11.11 \
    --master_port 22222 \
    --model_name resnet18 \
    --dataset cifar10 \
    --save_dir results/border_images_batchsize \
    [same other options…]
```
Be sure that:
- `--master_addr` & `--master_port` match on both calls
- `rank` is `0` on the master and `1` on the second node
- `world_size` is `2` on both nodes
- you use the correct NIC name for `NCCL_SOCKET_IFNAME` on each machine
Other options are similar to the quantization version.
A fast, gradient-free way to tell quantization schemes apart by training an SVM on the bit patterns of a model's logits. Collects logits, converts them to 32-bit IEEE-754 bit patterns, groups them into blocks, then trains and evaluates a One-vs-One `LinearSVC` within a single model.
Usage
```bash
python img_logits_svm.py \
    --model-ckpt checkpoints/resnet50-cifar10.pt \
    --save-dir results/img_logits_svm \
    --q-tags fp32 bf16 fp16 int8-dynamic \
    --num-samples 5000 \
    --no-logits 10 \
    --eval-batch-size 128 \
    --device cuda
```
- `--model-ckpt`: path to your model checkpoint (fine-tuned on CIFAR-10).
- `--save-dir`: where to store (and/or load) `uniform_logits_….pth` and `uniform_y_labels_….pth`.
- `--q-tags`: quantization schemes to compare.
- `--num-samples`: number of images per quantization tag.
- `--no-logits`: how many logits make up one SVM example.
- `--eval-batch-size`, `--device`, `--num-workers`: standard DataLoader & compute settings.
The script will:
- Generate or load `uniform_logits_<…>.pth` & `uniform_y_labels_<…>.pth`.
- Bit-encode and block-group the logits.
- Split 80/20 and train a One-vs-One `LinearSVC`.
- Print within-model SVM accuracy & a full classification report.
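As a rough illustration of the bit-encoding and SVM steps (the file names, tensor shapes, and the `logits_to_bits` helper below are assumptions; the script's actual I/O may differ):

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

def logits_to_bits(logits: np.ndarray, no_logits: int) -> np.ndarray:
    """Reinterpret fp32 logits as raw IEEE-754 bits and group `no_logits`
    consecutive logits into one (no_logits * 32)-dim 0/1 feature vector."""
    bits = np.unpackbits(np.ascontiguousarray(logits, dtype=np.float32).view(np.uint8))
    return bits.reshape(-1, no_logits * 32)

# Hypothetical file names; the script derives them from its CLI arguments.
logits = torch.load("uniform_logits_example.pth").numpy()    # flat fp32 logits
labels = torch.load("uniform_y_labels_example.pth").numpy()  # one q-tag label per block

X = logits_to_bits(logits, no_logits=10)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=42)
clf = OneVsOneClassifier(LinearSVC()).fit(X_tr, y_tr)
print("within-model SVM accuracy:", clf.score(X_te, y_te))
```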
Trains on one model’s logits, then tests on another model’s logits to measure cross-model robustness of the logit fingerprint.
Usage
```bash
python img_logits_svm_transferability.py \
    --model_ckpt1 checkpoints/mobilenet-v2-cifar10.pt \
    --model_ckpt2 checkpoints/efficientnet-b0-cifar10.pt \
    --save-dir results/img_logits_svm_transfer \
    --q-tags fp32 bf16 fp16 \
    --num-samples 5000 \
    --no-logits 10 \
    --eval-batch-size 64 \
    --device cuda
```
- `--model_ckpt1` / `--model_ckpt2`: checkpoints for the train and test models.
- `--logits_file1` / `--logits_file2` (optional): pre-computed `uniform_logits_…` files to reuse.
- All other flags match `img_logits_svm.py`.
This script will:
- Load or generate logits for both models.
- Bit-encode & block-group each.
- Train the SVM on model 1’s blocks.
- Evaluate that classifier on model 2’s blocks.
- Print transfer accuracy & classification report.
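Conceptually this reduces to "fit on model 1's blocks, score on model 2's blocks". A compact sketch under the same assumptions as the previous one (hypothetical file names and loader):

```python
import numpy as np
import torch
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

def bits(path, no_logits=10):
    # Hypothetical loader: reinterpret saved fp32 logits as IEEE-754 bit blocks.
    x = np.ascontiguousarray(torch.load(path).numpy(), dtype=np.float32)
    return np.unpackbits(x.view(np.uint8)).reshape(-1, no_logits * 32)

# Hypothetical file names from two earlier img_logits_svm.py runs.
X1, y1 = bits("uniform_logits_model1.pth"), torch.load("uniform_y_labels_model1.pth").numpy()
X2, y2 = bits("uniform_logits_model2.pth"), torch.load("uniform_y_labels_model2.pth").numpy()

clf = OneVsOneClassifier(LinearSVC()).fit(X1, y1)   # train on model 1's blocks
print("transfer accuracy on model 2:", clf.score(X2, y2))
```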
Extends the transferability test to distinguish not only between quantization schemes and model architectures, but also between different GPUs. You supply two model checkpoints (optionally different, if cross-model transferability should also be checked), train an SVM on the bit patterns of one GPU's logits, and test on the other GPU's logits. The logits from the two GPUs can be prepared in advance by running `img_logits_svm.py` on each GPU, sharing the produced logits files, and then pointing this script at those files. If the exact same model should be used to compute the logits on both GPUs, transfer the checkpoint between servers beforehand.
Usage
```bash
python img_logits_svm_transferability_gpus.py \
    --model_ckpt1 checkpoints/mobilenet-v2-cifar10.pt \
    --model_ckpt2 checkpoints/mobilenet-v2-cifar10.pt \
    --logits_file1 checkpoints/RTX8000uniform_logits_{args_num_samples}_{args.no_logits}_{model_ckpt_base}_{q_tags_str}.pth \
    --logits_file2 checkpoints/A100uniform_logits_{args_num_samples}_{args.no_logits}_{model_ckpt_base}_{q_tags_str}.pth \
    --save_dir results/img_logits_svm_transfer_gpus \
    --q-tags fp32 bf16 fp16 \
    --num-samples 5000 \
    --no-logits 10 \
    --eval-batch-size 128 \
    --device cuda:0
```
Scripts for the HSPI-LD experiments on SGL can be found under `sgl-hspi-ld`. These require launching the SGL server before collecting logits.
- Please refer to SGL's docs to install SGL properly.
- Launch the SGL server first. You may use SGL CLI args like `--tp 2`, `--dp 2`, `--attention-backend flashinfer`, etc. to enable tensor parallelism and data parallelism, specify the kernel backend, and so on.

  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
  ```
- Run the collect script to generate & save logits:

  ```bash
  cd sgl-hspi-ld
  # check usage
  python collect.py collect --help
  # pass a config file instead of specifying args
  python collect.py collect --config ./config --save_dir ./path/to/save/logits
  ```
- Run the classifier to train an SVM that predicts the SW/HW stack:

  ```bash
  # check CLI actions and usage
  python classify.py --help
  # train svm
  python classify.py train-svm ./path/to/saved-logits
  ```
You can find example logits collected by `collect.py` at Cheng98/HSPI-SGL.
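For orientation, logprobs can be pulled from a running SGL server through its OpenAI-compatible endpoint. This is a sketch assuming the default port `30000`; check `collect.py` for the request format it actually uses:

```python
import requests

# Query a locally running SGL server for per-token logprobs via the
# OpenAI-compatible completions API (default port 30000 assumed).
resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "prompt": "The quick brown fox",
        "max_tokens": 8,
        "temperature": 0.0,
        "logprobs": 5,  # top-5 logprobs per generated token
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["logprobs"])
```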