Starter code for running quantized YOLO models on Warboy, Furiosa AI's 1st generation 14nm 64 TOPS NPU.
Compared to the baseline, this code is optimized for throughput-oriented tasks, with batching and postprocessing improvements. We also provide a simple dashboard for monitoring NPU utilization.
Joint work done with Githarold.
We recommend checking the SDK documentation before running the code. The SDK documentation can be found at https://furiosa-ai.github.io/docs/latest/en/.
You should install:
- Driver, Device, and Runtime
- Python SDK
- Command Line Tools
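As a quick sanity check after installation, the Python SDK packages used below should be importable. This is a minimal check, assuming the Python SDK was installed (e.g. via pip as the furiosa-sdk package):

```python
# Minimal sanity check, assuming the Python SDK (furiosa-sdk) is installed.
# Both packages are used by run.py: furiosa.quantizer for quantization and
# furiosa.runtime for inference on the NPU.
import furiosa.quantizer
import furiosa.runtime

print("Furiosa Python SDK imported successfully")
```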
We use a custom decoder for running quantized YOLO models on Warboy. The decoder can be built using the following commands:
chmod +x ./scripts/build_decoder.sh
cd decoder
make
Create model directories for storing the quantized models:
chmod +x ./scripts/create_model_dirs.sh
./scripts/create_model_dirs.sh
We provide baseline code for running quantized YOLO models on Warboy. The script can be run using the following commands:
python run.py
--model [MODEL_NAME]
--num_workers [NUM_WORKERS] # number of worker threads
--batch_size [BATCH_SIZE] # batch size
--save_step [SAVE_STEP] # save step for quantized ONNX (for example, if set to 10, a model will be saved every 10 calibration steps)
--num_calib_imgs [NUM_CALIB_IMGS] # number of calibration images
--calib_method [CALIB_METHOD] # calibration method for quantization (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.quantizer.html#module-furiosa.quantizer)
--calib_p [CALIB_P] # calibration percentile
--device [DEVICE] # device configuration for NPUs (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#device-specification)
--input_type [INPUT_TYPE] # type of input data (float32 or uint8)
--output_type [OUTPUT_TYPE] # type of output data (float32, int8, or uint8)
--fuse_conv_bn [FUSE_CONV_BN] # whether to fuse conv and bn layers
--simplify_onnx [SIMPLIFY_ONNX] # whether to use onnx-simplify
--optimize_onnx [OPTIMIZE_ONNX] # whether to use onnxoptimizer
--do_trace [DO_TRACE] # whether to trace the model (can view with snakeviz)
--do_profile [DO_PROFILE] # whether to profile the model (can view with traceprocessor, from Perfetto)
--scheduling [SCHEDULING] # scheduling method (round_robin or queue, see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#runner-api)
--input_queue_size [INPUT_QUEUE_SIZE] # input queue size, for queue scheduling
--output_queue_size [OUTPUT_QUEUE_SIZE] # output queue size, for queue scheduling
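For reference, the calibration-related flags (`--calib_method`, `--calib_p`, `--num_calib_imgs`) map roughly onto the furiosa.quantizer API as in the sketch below. This is a simplified illustration rather than the code in run.py; the model path and the random calibration inputs are placeholders.

```python
# Rough sketch of post-training quantization with furiosa.quantizer.
# The model path and random inputs are placeholders; see the quantizer docs
# linked above for the full API.
import numpy as np
import onnx
from furiosa.quantizer import CalibrationMethod, Calibrator, quantize

model = onnx.load("models/yolov8s.onnx")                        # placeholder: exported FP32 ONNX model
calibrator = Calibrator(model, CalibrationMethod.MIN_MAX_ASYM)  # --calib_method
# (percentile-based methods additionally take a percentage argument, corresponding to --calib_p)

for _ in range(10):                                             # --num_calib_imgs
    img = np.random.rand(1, 3, 640, 640).astype(np.float32)     # stand-in for a preprocessed calibration image
    calibrator.collect_data([[img]])

ranges = calibrator.compute_range()   # per-tensor calibration ranges
quantized = quantize(model, ranges)   # quantized model, ready for the Warboy runtime
```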
Example Commands
python run.py \
--model "yolov8s" \
--num_workers 4 \
--batch_size 16 \
--num_calib_imgs 10 \
--calib_method "MIN_MAX_ASYM" \
--device "npu0pe0-1" \
--fuse_conv_bn \
--simplify_onnx \
--optimize_onnx \
--scheduling "queue"
The following are the most impactful improvements, with respect to the baseline code:
- Batching
  - The baseline code is for latency-oriented tasks, with a batch size of 1.
  - We enable batching for throughput-oriented tasks, with batch sizes that are powers of 2.
  - We replace the baseline dataloader with torch's DataLoader (torch.utils.data.DataLoader).
- Postprocessing
  - We optimize non-maximum suppression (NMS) and rounding, the two main bottlenecks in our profiling results (a generic NMS sketch is shown below).
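To illustrate the postprocessing hot spot, here is a generic single-class NMS in NumPy. This is only a sketch of the algorithm being optimized, not this repo's implementation (which is further vectorized and tuned):

```python
# Generic single-class NMS in NumPy, shown only to illustrate the postprocessing
# hot spot; the repo's implementation is vectorized and tuned beyond this sketch.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thres: float = 0.45) -> list[int]:
    """boxes: (N, 4) in x1, y1, x2, y2 format; scores: (N,). Returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the current box against all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thres]  # drop heavily overlapping boxes
    return keep
```

Usage would look like `keep = nms(boxes, scores, iou_thres=0.45)` after filtering detections by confidence.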
We run preliminary experiments on 8 different YOLO models. Baseline stats are from the refactor_jw branch of furiosa-ai/warboy-vision-models.
Warboy, the target NPU, has 2 processing elements (PEs) per chip. Inference can run either on a single PE or on both PEs together via PE fusion (fused mode; see the paper for details). For example, --device "npu0pe0" targets a single PE, while "npu0pe0-1" fuses both PEs of npu0.
We provide a simple dashboard for monitoring NPU utilization. The dashboard can be run using the following commands:
python my_utils/npu_util_viewer.py
| Baseline - Round Robin (batch_size=1) | Round Robin (batch_size=16) | Queue (batch_size=16) |
|---|---|---|
| ![]() | ![]() | ![]() |