
furiosa-warboy-inference

Starter code for running quantized YOLO models on Warboy, Furiosa AI's 1st generation 14nm 64 TOPS NPU.

Compared to the baseline, we optimize the code for throughput-oriented workloads, with batching and postprocessing improvements. We also provide a simple dashboard for monitoring NPU utilization.

Joint work with Githarold.

Installation

Furiosa AI - Warboy SDK Installation

We recommend reading the SDK documentation (https://furiosa-ai.github.io/docs/latest/en/) before running the code.

You should install:

- Driver, Device, and Runtime
- Python SDK
- Command Line Tools
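
As a quick sanity check after installation, the Python SDK can report its version string. A minimal sketch, assuming the SDK was installed per the docs above (e.g. via pip) and that the version attribute matches the documented one:

# Minimal post-install check; assumes `pip install furiosa-sdk` (or equivalent) succeeded.
from furiosa import runtime

# Prints SDK/runtime version info if the installation is healthy.
print(runtime.__full_version__)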

Building the Decoder

We use a custom decoder for running quantized YOLO models on Warboy. Build it with the provided script:

chmod +x ./scripts/build_decoder.sh
./scripts/build_decoder.sh

or build it manually:

cd decoder
make

Create Directory Structure

Create model directories for storing the quantized models:

chmod +x ./scripts/create_model_dirs.sh
./scripts/create_model_dirs.sh

Running

We provide baseline code for running quantized YOLO models on Warboy. The script can be run with the following flags:

python run.py 
    --model [MODEL_NAME]
    --num_workers [NUM_WORKERS] # number of worker threads
    --batch_size [BATCH_SIZE] # batch size
    --save_step [SAVE_STEP] # save step for quantized ONNX (for example, if set to 10, a model will be saved every 10 calibration steps)
    --num_calib_imgs [NUM_CALIB_IMGS] # number of calibration images
    --calib_method [CALIB_METHOD] # calibration method for quantization (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.quantizer.html#module-furiosa.quantizer)
    --calib_p [CALIB_P] # calibration percentile
    --device [DEVICE] # device configuration for NPUs (see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#device-specification)
    --input_type [INPUT_TYPE] # type of input data (float32 or uint8)
    --output_type [OUTPUT_TYPE] # type of output data (float32, int8, or uint8)
    --fuse_conv_bn [FUSE_CONV_BN] # whether to fuse conv and bn layers
    --simplify_onnx [SIMPLIFY_ONNX] # whether to use onnx-simplify
    --optimize_onnx [OPTIMIZE_ONNX] # whether to use onnxoptimizer
    --do_trace [DO_TRACE] # whether to trace the model (can view with snakeviz)
    --do_profile [DO_PROFILE] # whether to profile the model (can view with traceprocessor, from Perfetto)
    --scheduling [SCHEDULING] # scheduling method (round_robin or queue, see docs - https://furiosa-ai.github.io/docs/latest/en/api/python/furiosa.runtime.html#runner-api)
    --input_queue_size [INPUT_QUEUE_SIZE] # input queue size, for queue scheduling
    --output_queue_size [OUTPUT_QUEUE_SIZE] # output queue size, for queue scheduling
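
For reference, the calibration-related flags (--calib_method, --calib_p, --num_calib_imgs) map onto the furiosa.quantizer API linked above. A minimal sketch of that flow; the model path, input shape, and stand-in calibration data are illustrative assumptions:

import numpy as np
import onnx
from furiosa.quantizer import CalibrationMethod, Calibrator, quantize

model = onnx.load("models/yolov8s.onnx")  # hypothetical path to the exported ONNX model

# Stand-in calibration inputs; real code feeds preprocessed calibration images.
calib_batches = [np.zeros((1, 3, 640, 640), dtype=np.float32) for _ in range(10)]

calibrator = Calibrator(model, CalibrationMethod.MIN_MAX_ASYM)
for batch in calib_batches:
    calibrator.collect_data([[batch]])  # one sample per call, one input tensor per sample

ranges = calibrator.compute_range()  # activation ranges gathered during calibration
quantized = quantize(model, ranges)  # quantized model, ready for the Warboy compiler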

Example Commands

python run.py \
    --model "yolov8s" \
    --num_workers 4 \
    --batch_size 16 \
    --num_calib_imgs 10 \
    --calib_method "MIN_MAX_ASYM" \
    --device "npu0pe0-1" \
    --fuse_conv_bn \
    --simplify_onnx \
    --optimize_onnx \
    --scheduling "queue"
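
The runtime-side flags (--device, --num_workers, --scheduling, and the queue sizes) correspond to the Runner/Queue APIs linked above. A minimal Runner sketch; the model path, input shape, and parameter names are assumptions based on the linked furiosa.runtime docs:

import numpy as np
from furiosa.runtime.sync import create_runner

# One Runner serving several workers (--num_workers) on a fused PE pair (--device "npu0pe0-1").
with create_runner("models/yolov8s_i8.onnx", device="npu0pe0-1", worker_num=4) as runner:
    batch = np.zeros((16, 3, 640, 640), dtype=np.uint8)  # stand-in preprocessed batch
    outputs = runner.run([batch])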

Improvements

The following are the most impactful improvements relative to the baseline code:

  1. Batching
    • The baseline code targets latency-oriented tasks, with a fixed batch size of 1.
    • We enable batching for throughput-oriented tasks, with batch sizes that are powers of 2.
    • Unlike the baseline dataloader, we adopt the DataLoader from torch (see the sketch after this list).
  2. Postprocessing
    • We optimize non-maximum suppression (NMS) and rounding, the two main bottlenecks in our profiling results.
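
A minimal sketch of the batched input pipeline from item 1; the stand-in dataset and tensor shapes are illustrative assumptions, and `runner` is the one created in the Runner sketch above:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; real code yields preprocessed images.
dataset = TensorDataset(torch.zeros(256, 3, 640, 640, dtype=torch.uint8))

loader = DataLoader(
    dataset,
    batch_size=16,    # powers of 2, matching --batch_size
    num_workers=4,    # CPU-side loader workers
    pin_memory=True,
    drop_last=True,   # keep every batch at the fixed size the model was compiled for
)

for (batch,) in loader:
    outputs = runner.run([batch.numpy()])  # runner as in the Runner sketch above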

Results

We run preliminary experiments on 8 different YOLO models. Baseline stats are from the refactor_jw branch of furiosa-ai/warboy-vision-models.

Warboy, the target NPU, has 2 processing elements (PEs) per chip. Inference can run either on a single PE or on both PEs fused together (fused mode; see the paper for details).
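
In the runtime's device-specification syntax (see the docs linked in the flag list above), the two modes look like this:

SINGLE_PE = "npu0pe0"    # one PE of NPU 0; "npu0pe1" addresses the other PE
FUSED_PES = "npu0pe0-1"  # both PEs of NPU 0 fused into one larger logical PE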

Speed Tests

[Figure: speed_test]

Quantization Tests

[Figure: q_test]

NPU Utilization Dashboard

We provide a simple dashboard for monitoring NPU utilization. The dashboard can be run with the following command:

python my_utils/npu_util_viewer.py

Example - YOLOv8-pose

| Baseline - Round Robin (batch_size=1) | Round Robin (batch_size=16) | Queue (batch_size=16) |
| --- | --- | --- |
| GIF4 | GIF5 | GIF6 |
