Description
Discussed in https://github.com/orgs/open-quantum-safe/discussions/2076
Originally posted by lakshya-chopra February 11, 2025
In the current version of liboqs, the speed_kem.c benchmark for ML-KEM reports CPU cycle counts even when the GPU-based cuPQC backend is in use (on platforms with a GPU and where OQS_USE_CUPQC=ON). To verify this, I added debug statements in the following file to check which function gets called. To my surprise, the speed test always invoked cuPQC's function, yet the reported benchmark results were still based on CPU cycle counts.
Build command:
cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=OFF -DCMAKE_BUILD_TYPE=Release -DOQS_DIST_BUILD=ON \
-DOQS_USE_CUPQC=ON -DCMAKE_PREFIX_PATH=/home/master/cupqc/cupqc-pkg-0.2.0/cmake \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc -DCMAKE_CUDA_ARCHITECTURES=86 \
-DOQS_ENABLE_KEM_ml_kem_768_cuda=ON ..
Speed comparisons
To further confirm this, I compared the speed_kem results for Kyber768 and ML-KEM-768 (which should be similar) and got the following:
$ ./speed_kem Kyber768
Configuration info
==================
Target platform: x86_64-Linux-5.15.0-131-generic
Compiler: gcc (11.4.0)
Compile options: [-Wa,--noexecstack;-O3;-fomit-frame-pointer;-fdata-sections;-ffunction-sections;-Wl,--gc-sections;-Wbad-function-cast]
OQS version: 0.12.1-dev (major: 0, minor: 12, patch: 1, pre-release: -dev)
Git commit: 5afca642057faa54878cf6937b46fe6f00b45646
OpenSSL enabled: No
AES: NI
SHA-2: C
SHA-3: C
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:37:02
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
Kyber768 | | | | | |
keygen | 376913 | 3.000 | 7.959 | 0.736 | 19219 | 1532
encaps | 295155 | 3.000 | 10.164 | 0.486 | 24552 | 923
decaps | 377094 | 3.000 | 7.956 | 0.527 | 19211 | 891
For ML-KEM-768:
OQS build flags: BUILD_SHARED_LIBS OQS_DIST_BUILD OQS_LIBJADE_BUILD OQS_OPT_TARGET=generic CMAKE_BUILD_TYPE=Release
CPU exts active: ADX AES AVX AVX2 BMI1 BMI2 PCLMULQDQ POPCNT SSE SSE2 SSE3
Speed test
==========
Started at 2025-02-12 18:36:45
Operation | Iterations | Total time (s) | Time (us): mean | pop. stdev | CPU cycles: mean | pop. stdev
------------------------------------ | ----------:| --------------:| ---------------:| ----------:| -------------------------:| ----------:
ML-KEM-768 | | | | | |
keygen | 18847 | 3.000 | 159.178 | 539.811 | 385029 | 1305897
encaps | 19025 | 3.000 | 157.695 | 5.361 | 381451 | 12921
decaps | 18271 | 3.000 | 164.196 | 5.137 | 397182 | 12384
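Clearly, the two sets of results are far apart: the ML-KEM-768 mean times are more than an order of magnitude higher than the Kyber768 ones, so the CPU-based numbers do not give an accurate picture of GPU performance.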
Feature Request
It would be beneficial if the speed test could accurately measure GPU performance when cuPQC is used.
As an example,
If this is an actual issue, I’d be happy to help :)