Best way to showcase GPU_HW_MATMUL? #13806
Replies: 9 comments 4 replies
-
Also, somehow the iGPU is more performant? Is that normal? The work is based on this notebook, FYI.
-
Also, I checked: I got resizable BAR working with the Intel Arc A380 (or maybe it's not actually enabled?). I'm using an eGPU, not sure if that also matters.
-
HW_MATMUL supports int8 and fp16.
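As a rough, OpenVINO-agnostic illustration of what those two precisions cost in accuracy (plain numpy on hypothetical weights, not anything the plugin actually does internally): fp16 is close to a drop-in cast, while int8 needs a calibration scale and loses noticeably more resolution, which is why int8 models need quantization/calibration first.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)  # stand-in weight tensor

# fp16: direct cast, ~10 mantissa bits of precision
err_fp16 = np.max(np.abs(w - w.astype(np.float16).astype(np.float32)))

# int8: symmetric per-tensor quantization; the scale is what calibration picks
scale = np.max(np.abs(w)) / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
err_int8 = np.max(np.abs(w - q.astype(np.float32) * scale))

print(err_fp16, err_int8)  # int8 error is an order of magnitude larger here
```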
-
Tried the yolov5... :\ Is this normal?
-
Oohh, I get 49 FPS now if I use this use_device_mem flag... OK! I think it makes sense now.
-
And it's lots better if I use batch size 16 and so on... :D Oh dear, I think it's working!
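The batching gain has a simple back-of-envelope explanation: each inference carries a roughly fixed per-batch cost (kernel launches, host/device transfers) plus a per-image compute cost, so FPS = batch_size / latency rises as the fixed cost is amortized. A toy model with made-up overhead numbers, just for intuition:

```python
# Toy latency model with hypothetical costs (not measured on any device)
OVERHEAD_MS = 15.0    # fixed cost paid once per batch (launch, transfers)
PER_IMAGE_MS = 5.0    # compute cost per image

def fps(batch_size):
    latency_ms = OVERHEAD_MS + PER_IMAGE_MS * batch_size
    return batch_size / latency_ms * 1000.0

for b in (1, 4, 16):
    print(b, round(fps(b), 1))
# throughput climbs with batch size as the fixed overhead is amortized
```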
-
vs my gen 12 CPU :)
```
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[ INFO ] Input command: C:\Users\raymo\Documents\openvino\bin\intel64\Release\benchmark_app.exe -m .\yolo\yolov5m.xml -d CPU -t 30
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO: OpenVINO Runtime version ......... 2022.3.0
[ INFO ] Build ........... 2022.3.0-8523-87f61cf8227
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] openvino_intel_cpu_plugin version ......... 2022.3.0
[ INFO ] Build ........... 2022.3.0-8523-87f61cf8227
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading network files
[ INFO ] Loading network files
[ INFO ] Read network took 45.42 ms
[ INFO ] Original network I/O parameters:
Network inputs:
images (node: images) : f32 / [...] / {1,3,640,640}
Network outputs:
output (node: output) : f32 / [...] / {1,25200,85}
462 (node: 462) : f32 / [...] / {1,3,80,80,85}
520 (node: 520) : f32 / [...] / {1,3,40,40,85}
578 (node: 578) : f32 / [...] / {1,3,20,20,85}
[Step 5/11] Resizing network to match image sizes and given batch
[ WARNING ] images: layout is not set explicitly, so it is defaulted to NCHW. It is STRONGLY recommended to set layout manually to avoid further issues.
[Step 6/11] Configuring input of the model
[ INFO ] Network batch size: 1
Network inputs:
images (node: images) : u8 / [N,C,H,W] / {1,3,640,640}
Network outputs:
output (node: output) : f32 / [...] / {1,25200,85}
462 (node: 462) : f32 / [...] / {1,3,80,80,85}
520 (node: 520) : f32 / [...] / {1,3,40,40,85}
578 (node: 578) : f32 / [...] / {1,3,20,20,85}
[Step 7/11] Loading the model to the device
[ INFO ] Load network took 467.94 ms
[Step 8/11] Setting optimal runtime parameters
[ INFO ] Device: CPU
[ INFO ] { NETWORK_NAME , torch-jit-export }
[ INFO ] { OPTIMAL_NUMBER_OF_INFER_REQUESTS , 5 }
[ INFO ] { NUM_STREAMS , 5 }
[ INFO ] { AFFINITY , HYBRID_AWARE }
[ INFO ] { INFERENCE_NUM_THREADS , 0 }
[ INFO ] { PERF_COUNT , NO }
[ INFO ] { INFERENCE_PRECISION_HINT , f32 }
[ INFO ] { PERFORMANCE_HINT , THROUGHPUT }
[ INFO ] { PERFORMANCE_HINT_NUM_REQUESTS , 0 }
[Step 9/11] Creating infer requests and preparing input blobs with data
[ WARNING ] No input files were given: all inputs will be filled with random values!
[ INFO ] Test Config 0
[ INFO ] images ([N,C,H,W], u8, {1, 3, 640, 640}, static): random (image is expected)
[Step 10/11] Measuring performance (Start inference asynchronously, 5 inference requests, limits: 30000 ms duration)
[ INFO ] BENCHMARK IS IN INFERENCE ONLY MODE.
[ INFO ] Input blobs will be filled once before performance measurements.
[ INFO ] First inference took 208.49 ms
[Step 11/11] Dumping statistics report
[ INFO ] Count: 390 iterations
[ INFO ] Duration: 30528.04 ms
[ INFO ] Latency:
[ INFO ] Median: 404.54 ms
[ INFO ] Average: 390.30 ms
[ INFO ] Min: 260.55 ms
[ INFO ] Max: 518.52 ms
[ INFO ] Throughput: 12.78 FPS
```
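The reported numbers are internally consistent, which is a quick sanity check on any benchmark_app run: throughput is iterations over wall time, and with 5 asynchronous requests in flight it should also roughly equal the number of in-flight requests divided by average latency.

```python
# Numbers taken from the benchmark_app report above
count, duration_ms = 390, 30528.04
throughput = count / (duration_ms / 1000.0)
print(round(throughput, 2))  # 12.78, matching the reported FPS

# Cross-check: 5 in-flight requests / 390.30 ms average latency
approx = 5 / (390.30 / 1000.0)
print(round(approx, 2))  # 12.81, consistent with the measured throughput
```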
-
openvino/src/inference/include/openvino/runtime/intel_gpu/properties.hpp (lines 136 to 140 in 1ad4a99) Does
-
What's the best way to see if GPU_HW_MATMUL is being utilized? Somehow I'm getting lower performance with INT8 on the GPU. Is that normal? Is HW_MATMUL using FP16 or INT8?
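One direct way to see whether the device advertises hardware matmul (XMX) at all, as opposed to whether a given model actually uses it, is to query the device's optimization capabilities: on supporting GPUs the returned list includes GPU_HW_MATMUL. A sketch: the real get_property call needs OpenVINO and a visible GPU, so it is shown in a comment, and the example capability list below is assumed for illustration, not measured.

```python
# With OpenVINO installed and a GPU present, the real list would come from:
#   from openvino.runtime import Core
#   caps = Core().get_property("GPU", "OPTIMIZATION_CAPABILITIES")
# Hypothetical value for a device with XMX units:
caps = ["FP32", "BIN", "FP16", "INT8", "GPU_HW_MATMUL", "EXPORT_IMPORT"]

def has_hw_matmul(capabilities):
    """True if the device advertises hardware matmul (XMX) support."""
    return "GPU_HW_MATMUL" in capabilities

print(has_hw_matmul(caps))  # True for the assumed list above
```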