Starter code for running quantized YOLO models on Warboy, Furiosa AI's 1st generation 14nm 64 TOPS NPU.
Compared to the baseline, this code is optimized for throughput-oriented tasks, with batching and postprocessing improvements. We also provide a simple dashboard for monitoring NPU utilization.
Joint work done with Githarold.
We recommend checking the SDK documentation before running the code. The SDK documentation can be found at https://furiosa-ai.github.io/docs/latest/en/.
You should install:
- Driver, Device, and Runtime
- Python SDK
- Command Line Tools
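As a quick sanity check after installation, the Python SDK packages used below should be importable. This is a minimal check, assuming the Python SDK was installed (e.g. via pip as the furiosa-sdk package):

```python
# Minimal sanity check, assuming the Python SDK (furiosa-sdk) is installed.
# Both packages are used by run.py: furiosa.quantizer for quantization and
# furiosa.runtime for inference on the NPU.
import furiosa.quantizer
import furiosa.runtime

print("Furiosa Python SDK imported successfully")
```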
We use a custom decoder for running quantized YOLO models on Warboy. The decoder can be built using the following commands:
chmod +x ./scripts/build_decoder.sh
cd decoder
make
Create model directories for storing the quantized models:
chmod +x ./scripts/create_model_dirs.sh
./scripts/create_model_dirs.sh
We provide baseline code for running quantized YOLO models on Warboy. The script can be run using the following commands:
python run.py
--model [MODEL_NAME]
--num_workers [NUM_WORKERS] # number of worker threads
--batch_size [BATCH_SIZE] # batch size
--save_step [SAVE_STEP] # save step for quantized ONNX (for example, if set to 10, a model will be saved every 10 calibration steps)
--num_calib_imgs [NUM_CALIB_IMGS] # number of calibration images
--calib_method [CALIB_METHOD] # calibration method for quantization (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.quantizer.html#module-furiosa.quantizer)
--calib_p [CALIB_P] # calibration percentile
--device [DEVICE] # device configuration for NPUs (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#device-specification)
--input_type [INPUT_TYPE] # type of input data (float32 or uint8)
--output_type [OUTPUT_TYPE] # type of output data (float32, int8, or uint8)
--fuse_conv_bn [FUSE_CONV_BN] # whether to fuse conv and bn layers
--simplify_onnx [SIMPLIFY_ONNX] # whether to use onnx-simplify
--optimize_onnx [OPTIMIZE_ONNX] # whether to use onnxoptimizer
--do_trace [DO_TRACE] # whether to trace the model (can view with snakeviz)
--do_profile [DO_PROFILE] # whether to profile the model (can view with traceprocessor, from Perfetto)
--scheduling [SCHEDULING] # scheduling method (round_robin or queue, see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#runner-api)
--input_queue_size [INPUT_QUEUE_SIZE] # input queue size, for queue scheduling
--output_queue_size [OUTPUT_QUEUE_SIZE] # output queue size, for queue scheduling
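For reference, the calibration-related flags (`--calib_method`, `--calib_p`, `--num_calib_imgs`) map roughly onto the furiosa.quantizer API as in the sketch below. This is a simplified illustration rather than the code in run.py; the model path and the random calibration inputs are placeholders.

```python
# Rough sketch of post-training quantization with furiosa.quantizer.
# The model path and random inputs are placeholders; see the quantizer docs
# linked above for the full API.
import numpy as np
import onnx
from furiosa.quantizer import CalibrationMethod, Calibrator, quantize

model = onnx.load("models/yolov8s.onnx")                        # placeholder: exported FP32 ONNX model
calibrator = Calibrator(model, CalibrationMethod.MIN_MAX_ASYM)  # --calib_method
# (percentile-based methods additionally take a percentage argument, corresponding to --calib_p)

for _ in range(10):                                             # --num_calib_imgs
    img = np.random.rand(1, 3, 640, 640).astype(np.float32)     # stand-in for a preprocessed calibration image
    calibrator.collect_data([[img]])

ranges = calibrator.compute_range()   # per-tensor calibration ranges
quantized = quantize(model, ranges)   # quantized model, ready for the Warboy runtime
```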
Example Commands
python run.py \
--model "yolov8s" \
--num_workers 4 \
--batch_size 16 \
--num_calib_imgs 10 \
--calib_method "MIN_MAX_ASYM" \
--device "npu0pe0-1" \
--fuse_conv_bn \
--simplify_onnx \
--optimize_onnx \
--scheduling "queue"
The following are the most impactful improvements, with respect to the baseline code:
- Batching
  - The baseline code is for latency-oriented tasks, with a batch size of 1.
  - We enable batching for throughput-oriented tasks, with batch sizes that are powers of 2.
  - We replace the baseline dataloader with torch's DataLoader (torch.utils.data.DataLoader).
- Postprocessing
  - We optimize non-maximum suppression (NMS) and rounding, the two main bottlenecks in our profiling results (a generic NMS sketch is shown below).
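To illustrate the postprocessing hot spot, here is a generic single-class NMS in NumPy. This is only a sketch of the algorithm being optimized, not this repo's implementation (which is further vectorized and tuned):

```python
# Generic single-class NMS in NumPy, shown only to illustrate the postprocessing
# hot spot; the repo's implementation is vectorized and tuned beyond this sketch.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thres: float = 0.45) -> list[int]:
    """boxes: (N, 4) in x1, y1, x2, y2 format; scores: (N,). Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the current box against all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thres]  # drop heavily overlapping boxes
    return keep
```

Usage would look like `keep = nms(boxes, scores, iou_thres=0.45)` after filtering detections by confidence.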
We run preliminary experiments on 8 different YOLO models. Baseline stats are from the refactor_jw branch of furiosa-ai/warboy-vision-models.
Warboy, the target NPU, has 2 processing elements (PEs) per chip. Inference can run either on a single PE or on both PEs together via PE fusion (fused mode; see the paper for details). For example, --device "npu0pe0" targets a single PE, while "npu0pe0-1" fuses both PEs of npu0.
We provide a simple dashboard for monitoring NPU utilization. The dashboard can be run using the following commands:
python my_utils/npu_util_viewer.py
| Baseline - Round Robin (batch_size=1) | Round Robin (batch_size=16) | Queue (batch_size=16) |
|---|---|---|
| ![]() | ![]() | ![]() |