
Bugfix - nvbandwidth benchmark needs to handle N/A values #675

Open · wants to merge 14 commits into main
Conversation

@polarG (Contributor) commented on Dec 2, 2024

Description

  1. Fixed the bug where the nvbandwidth benchmark failed to handle 'N/A' values in the nvbandwidth command output.
  2. Changed the test case input format to a list.
  3. Added an nvbandwidth configuration example to the default config files.
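
For context, here is a minimal sketch of the kind of N/A-tolerant matrix parsing this fix implies. The function name, metric naming scheme, and sample output are illustrative assumptions, not the PR's actual code:

```python
def parse_bandwidth_matrix(lines, metric_prefix):
    """Parse an nvbandwidth-style bandwidth matrix, skipping 'N/A' cells.

    Returns a dict mapping '<prefix>_<row>_<col>_bw' to bandwidth in GB/s.
    """
    results = {}
    for line in lines:
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # banner, blank, or label line
        if all(t.isdigit() for t in tokens):
            continue  # column-index header row
        row = tokens[0]
        for col, value in enumerate(tokens[1:]):
            if value == 'N/A':
                continue  # unmeasured cell, e.g. the diagonal of a device-to-device matrix
            results[f'{metric_prefix}_{row}_{col}_bw'] = float(value)
    return results


# Trimmed, nvbandwidth-like sample:
sample = [
    'memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)',
    '           0          1',
    ' 0       N/A     242.51',
]
print(parse_bandwidth_matrix(sample, 'host_to_device'))  # {'host_to_device_0_1_bw': 242.51}
```

Skipping the cell, rather than coercing it to 0 or NaN, keeps the reported metrics limited to pairs that were actually measured.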

@polarG added the "bug" (Something isn't working) and "configuration" (Benchmark configurations) labels on Dec 2, 2024
@polarG requested a review from a team as a code owner on December 2, 2024 06:21
codecov bot commented on Dec 2, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 85.45%. Comparing base (249e21c) to head (bd6aab2).
Report is 1 commit behind head on main.

Files with missing lines | Patch % | Lines
...erbench/benchmarks/micro_benchmarks/nvbandwidth.py | 70.00% | 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #675      +/-   ##
==========================================
- Coverage   85.61%   85.45%   -0.16%     
==========================================
  Files          99       99              
  Lines        7165     7210      +45     
==========================================
+ Hits         6134     6161      +27     
- Misses       1031     1049      +18     
Flag | Coverage Δ
cpu-python3.10-unit-test | 71.14% <70.00%> (-0.70%) ⬇️
cpu-python3.7-unit-test | 71.11% <69.49%> (-0.70%) ⬇️
cpu-python3.8-unit-test | 71.15% <70.00%> (-0.68%) ⬇️
cuda-unit-test | 83.27% <70.00%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.


@dpower4 (Contributor) commented on Dec 5, 2024

@polarG, do we also handle the case where we run tests that are not valid for the underlying system? Such tests produce a "Waived" output.
For example, running device_to_device_memcpy_read_sm on a single-GPU machine results in:

```
nvidia@localhost:/home/nvidia/nvbandwidth$ ./nvbandwidth -t 18
nvbandwidth Version: v0.5
Built from Git version:

NOTE: This tool reports current measured bandwidth on your system.
Additional system-specific tuning may be required to achieve maximal peak bandwidth.

CUDA Runtime Version: 12040
CUDA Driver Version: 12040
Driver Version: 550.54.15

Device 0: NVIDIA GH200 480GB

Waived:
```

@abuccts (Member) left a comment

Please add test cases for the "N/A" and "Waived" cases.
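
A rough, self-contained sketch of the kind of test being requested; `_parse` below is a tiny stand-in for the benchmark's real parser, not the implementation in nvbandwidth.py:

```python
import unittest


def _parse(raw_output):
    """Tiny stand-in parser (illustrative only): collect bandwidths, skip 'N/A', detect 'Waived:'."""
    metrics, waived = {}, False
    for line in raw_output.splitlines():
        if line.startswith('Waived:'):
            waived = True
            continue
        tokens = line.split()
        if not tokens or not tokens[0].isdigit() or all(t.isdigit() for t in tokens):
            continue  # skip banners, blanks, and column-index header rows
        for col, value in enumerate(tokens[1:]):
            if value != 'N/A':
                metrics[f'bw_{tokens[0]}_{col}'] = float(value)
    return metrics, waived


class NvBandwidthOutputTest(unittest.TestCase):
    def test_na_cells_are_skipped(self):
        metrics, waived = _parse(' 0       N/A     242.51\n')
        self.assertEqual(metrics, {'bw_0_1': 242.51})
        self.assertFalse(waived)

    def test_waived_run_is_detected(self):
        metrics, waived = _parse('Device 0: NVIDIA GH200 480GB\n\nWaived:\n')
        self.assertEqual(metrics, {})
        self.assertTrue(waived)


if __name__ == '__main__':
    unittest.main()
```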

Review comments (outdated, resolved) on superbench/benchmarks/micro_benchmarks/nvbandwidth.py and tests/benchmarks/micro_benchmarks/test_nvbandwidth.py.
@polarG (Contributor, Author) commented on Dec 5, 2024

> @polarG, do we also handle the case where we run tests that are not valid for the underlying system? Such tests produce a "Waived" output. [...]

Good point! I will try to catch this in the code.
For the waived test cases, shall we show a negative value in the report, or just add a log containing the test name/index? @abuccts @dpower4

@dpower4 (Contributor) commented on Dec 6, 2024

> Good point! I will try to catch this in the code. For the waived test cases, shall we show a negative value in the report, or just add a log containing the test name/index?

It's better to show the waived tests in the report, in line with how other failed benchmarks are treated.
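
A minimal sketch of one possible shape for that; the helper name and the -1.0 sentinel are illustrative assumptions, not the merged code:

```python
WAIVED_SENTINEL = -1.0  # assumed placeholder so a waived test still appears in the report


def flag_waived(test_name, raw_output, metrics):
    """If nvbandwidth waived the run, record sentinel metrics instead of dropping the test silently."""
    if 'Waived:' in raw_output:
        metrics[f'{test_name}_bw'] = WAIVED_SENTINEL
        metrics[f'{test_name}_waived'] = 1
        return True
    return False
```

A negative sentinel keeps the waived test visible and easy to filter downstream, in line with how other failed benchmarks surface, while the extra `*_waived` metric makes the reason explicit.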

@guoshzhao (Contributor) commented
Looks good to me. One more question about docs: do any documents need to be updated to align with this change?
