Add punet benchmarking to the regression suite #19088
Conversation
```
@@ -196,13 +242,19 @@ def job_summary_process(ret_value, output):
def test_sdxl_rocm_benchmark(
```
O_O is this single `def test_` function now 350 lines of code, with a few big branches for `if rocm_chip == ...`?
Can this be split into multiple test functions? This looks particularly difficult to run and edit as it gets more complex. We should also start thinking about generalizing to more than SDXL, at which point we'll definitely need the modularity.
For the `job_summary.md` step, each test case could write some output via pytest for a test report or file aggregation step to summarize after the test run. See
- https://docs.pytest.org/en/stable/how-to/output.html#record-property
- https://docs.pytest.org/en/stable/reference/reference.html#record-property
- https://docs.pytest.org/en/stable/reference/reference.html#pytest.Item.user_properties
In the IREE ONNX tests I used those properties here:

- Write properties in each test case: https://github.com/iree-org/iree-test-suites/blob/e5429e42d0b6b3835e8564751cb7d0c29bcdd9b6/onnx_ops/conftest.py#L253-L257
- Run pytest with `--report-log`: https://github.com/iree-org/iree-test-suites/blob/e5429e42d0b6b3835e8564751cb7d0c29bcdd9b6/.github/workflows/test_onnx_ops.yml#L67
- Read properties from test log file: https://github.com/iree-org/iree-test-suites/blob/e5429e42d0b6b3835e8564751cb7d0c29bcdd9b6/onnx_ops/update_config_xfails.py#L63-L104
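A minimal sketch of the test-side piece, recording benchmark stats via `record_property` (the helper function and the recorded keys below are placeholders, not code from this PR):

```python
def run_unet_benchmark():
    """Placeholder for the real iree-benchmark-module invocation."""
    return {"mean_time_ms": 42.0, "dispatch_count": 1234}


def test_sdxl_rocm_unet_benchmark(record_property):
    stats = run_unet_benchmark()
    # record_property attaches key/value pairs to this test's user_properties,
    # which end up in JUnit XML and in --report-log output.
    record_property("model", "sdxl-unet-fp16")
    record_property("mean_time_ms", stats["mean_time_ms"])
    record_property("dispatch_count", stats["dispatch_count"])
```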
Here's how I think this could be structured:
```python
def test_sdxl_rocm_unet():
    ...

def test_sdxl_rocm_clip():
    ...

def test_sdxl_rocm_vae():
    ...

def test_sdxl_rocm_e2e():
    ...

@pytest.mark.gpu_gfx942
def test_sdxl_rocm_punet_int8_fp16():
    ...

@pytest.mark.gpu_gfx942
def test_sdxl_rocm_punet_int8_fp8():
    ...
```
Where each test case runs just that benchmark and then writes some results to the log. Then, a script parses the log (or maybe in `pytest_report_to_serializable`: https://docs.pytest.org/en/stable/reference/reference.html#pytest.hookspec.pytest_report_to_serializable) and generates the markdown table.
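For the aggregation step, something along these lines could work. This is only a sketch: it assumes the JSON-lines format that pytest-reportlog writes (one JSON object per line) and reuses the placeholder `mean_time_ms` property from the sketch above.

```python
import json


def summarize(report_log_path: str) -> str:
    """Build a markdown table from user_properties in a --report-log file."""
    rows = []
    with open(report_log_path) as f:
        for line in f:
            entry = json.loads(line)
            # Only look at the "call" phase of each test's reports.
            if entry.get("$report_type") != "TestReport" or entry.get("when") != "call":
                continue
            props = dict(entry.get("user_properties", []))
            if "mean_time_ms" in props:
                rows.append((entry["nodeid"], props["mean_time_ms"]))
    lines = ["| Benchmark | Mean time (ms) |", "| --- | --- |"]
    lines += [f"| {name} | {time_ms} |" for name, time_ms in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(summarize("benchmark_report.json"))
```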
Ah, I was thinking the same as I was coding it up, but figured this probably won't grow any further, so I didn't think too much of it. But makes sense, I can split it up into multiple tests. It will definitely be nicer for the eye and for debugging 🙃. Also, benchmarking will only be this involved for halo models with this setup. I only see us having benchmarks/sdxl, benchmarks/llama, and benchmarks/flux down the line. I don't think this is the best way to build out a general benchmarking test suite; that should live elsewhere. We should probably stick with a custom markdown for now because of the specific stats that we care about, but we can definitely leverage pytest reporting features for the general benchmark suite.
https://en.wikipedia.org/wiki/Boiling_frog 😉
If we have a significant number of developers depending on a piece of infrastructure, it is worth the time to support that infrastructure at a level we are confident in (e.g. move the code out of `experimental/`, encourage more developers to run it locally as part of their regular development, etc.).
What you have here could be merged as-is IMO, but I'd like to see a ~3 month time scale plan for "halo model" testing and benchmarking. I want us to be able to scale up from roughly 3 models on 2 devices and 2 backends to 20 models on 10 devices and 5 backends, then 100 models on 20 devices, etc. The way to add a new model to the test suite can't be "fork this 600 line `experimental/benchmark_sdxl_on_mi250_with_rocm.py` file and ask the codegen team for a 300 line `attention_and_matmul_spec_{model_name}.mlir`".
Haha, agreed. Sure, let me write something up. Things become a lot easier to scale when we stick to basic configurations and keep things as simple as possible, which should be the case for both our general model testing and benchmarking suites.
Ok, I raised this issue to lay out a plan for what's needed going forward: #19115. Feel free to edit and add to it directly. I think we decided that we will land this, but make that issue's plan a priority going forward and move the code out of `experimental/`.
Thanks! Lots of good design details to go through on that issue.
This commit adds support for benchmarking punet fp16/fp8 performance at TOM. This concludes adding all the necessary testing for the SDXL model. It also switches the compilation of punet to use the spec file, as it is necessary for tuning at the current state of the project. I've also updated the artifacts in Azure, this time using the date as part of the Azure link so everyone knows when the artifacts were generated. Nithin is working on implementing the spec file optimizations in the compiler itself, so we can remove the use of such files in the future.

All tests have timeouts now, and I updated the existing ones because the CLI flag seems to apply a timeout per test (not to the whole pytest command). Now we can avoid hangs such as https://github.com/iree-org/iree/actions/runs/11748746984/job/32734141414
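For reference, with the pytest-timeout plugin the `--timeout` flag applies per test, and an individual test can also carry its own explicit marker; the value below is illustrative, not the one used in this PR:

```python
import pytest


@pytest.mark.timeout(600)  # seconds; illustrative value, overrides the CLI default for this test
def test_sdxl_rocm_punet_int8_fp16_benchmark():
    ...
```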