-
Notifications
You must be signed in to change notification settings - Fork 612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
e2e matmul test improvements #19016
base: main
Are you sure you want to change the base?
e2e matmul test improvements #19016
Conversation
Signed-off-by: Benoit Jacob <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by: I'd still really like to delete all the in-tree code for this test suite and migrate to https://github.com/iree-org/iree-test-suites/tree/main/linalg_ops. Different sets of improvements have been made to both places at this point.
The tests hit a complexity level that I'm not comfortable supporting in-tree, and we should continue transitioning to package-based testing (decoupled from the core project CMake build, runnable from dev/nightly/stable package builds on a wide range of systems) if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ScottTodd Let's chat next week. I really want to help with the test changes you're making and I would like to offer directly taking care of tihs to take this off your plate. But there's something important here -- these tests aren't just "regression" tests that would matter for QA but maybe not so much for day to day development.
These tests are are the center of day to day development, so changes to these tests need to be made as frequently as we add codegen support for a new compilation path, such as when we implement GPU data tiling on another GPU target or supporting another data type.
I think you caught this particular PR because it touches the files under tools/testing/
while the more frequent case with my day-to-day PRs is that they only touch the code under tests/e2e/matmul/
. The latter is really changing at a high frequency (see git history). The former, a bit less high frequency, but still, just today I noticed I needed to make changes there for f64, #19093, for instance (which I need to test F64 MFMA on CDNA3, which I need for completeness because if the hw supports f64, it's easier to just support it than to refrain from it even if no immediate need).
That doesn't preclude moving these tests out of tree but that does mean that:
- The out-of-tree move needs to be atomic --- if it means I can't land test changes concurrently, that's literally blocking my work. If other test changes/improvements are planned together with the move, it's really important to keep them separate from the move itself, so the atomic op here can have low latency.
- The out-of-tree move will incur a permanent penalty on my development velocity. I think I'm OK with that because I really care about test infrastructure health and trust your judgement here, but that's part of why I'd like to be involved (the other part being that I really want to help with that!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still thinking through some of the design questions here, I don't have a final state in mind that I'm totally happy with. Here are some responses to your individual points.
These tests are are the center of day to day development, so changes to these tests need to be made as frequently as we add codegen support for a new compilation path, such as when we implement GPU data tiling on another GPU target or supporting another data type.
It's my (perhaps naive) hope that we could somewhat exhaustively enumerate the data types we could support in the test suite, independent of what we currently do support on any given target. I'd like to get out of the habit of only testing what we already know works. For that I want to support XFAIL tests, but I had trouble building that into the in-tree tests. To be fair... the out of tree tests don't yet have XFAIL either, and I'm not sure if CMake/ctest is the right framework for that level of configuration. I've been much happier with pytest for handling test filtering and expected outcomes.
As for things like feature flag flips and choosing compilation pipelines, I like that putting an air gap between the compiler+runtime and the tests keeps us honest about default compiler behavior and user-friendly optimization settings. The air gap / github repository boundary does introduce friction for daily development though. I can think of a few ways we can mitigate that friction, but any of them would require some changes to developer workflows:
- Include iree-test-suites as a subproject using CMake FetchContent (not a git submodule, so random users cloning the repo don't pay the cost for test suites)
- Add scripts that go through common steps like checking out the test suite repo, building the tools, running tests, getting test results, collecting benchmarks, etc.
- Having the tests in their own project gives us more freedom here to deviate from IREE's core CMake setup
RE: atomic changes, I started https://github.com/iree-org/iree-test-suites/tree/main/linalg_ops as a relatively simple fork at first and then started making deeper changes to take advantage of the new decoupling:
- Checking in the test files so you can see what they include without needing to run the generator script: https://github.com/iree-org/iree-test-suites/blob/main/linalg_ops/matmul/generated/f16_into_f32/matmul_f16_into_f32_large.mlir
- Restructuring the CMake test files, though they currently lack full coverage of the configurations we want to test (e.g. CPU features, GPU flags): https://github.com/iree-org/iree-test-suites/blob/main/linalg_ops/matmul/CMakeLists.txt
- Adding a new top level CMake project, decoupled from things like the
iree-test-deps
target and theIREE_BUILD_COMPILER
option (including cross-compilation concerns): https://github.com/iree-org/iree-test-suites/blob/main/linalg_ops/CMakeLists.txt
Leaving things half migrated is unfortunate, but my time was pulled elsewhere. @erman-gurses has also been helping get convolution and attention tests moved into the test suite repo recently. If we come to a decision we are all happy with then we can schedule some time on the roadmap to finish the migration work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the details! Let's find time this week for a video call. For now just recording a couple data points. One is:
~/iree-build$ find . -name '*e2e_matmul*mlir' -printf "%s\n" | awk '{sum+=$1} END {print sum}'
26158552
Another data point: pace of introduction of new element types in CDNA architectures:
- CDNA1: f32, f16, i8
- CDNA2: f64, bf16
- CDNA3: xf32, f8E5M2FNUZ, f8E4M3FNUZ
- Near future: f8E5M2 (non-FNUZ), f8E4M3 (non-FNUZ), FP6, FP4, INT6, INT4, presumably all what's being standardized in Microscaling 1.0.
- Meanwhile, NVIDIA architectures support some of these types already in production, as well as altogether different types such as TF32.
Working on #18980 let me spend quality time with e2e matmul tests and suggested some changes.
The main change is to simplify the printing of numerical values to always use high precision, meaning print all significant digits of floating point values.
Since our tests generate small integral values and the intent is generally to be testing mostly the exact arithmetic that happens on small integral values, in most cases this doesn't make any difference.
But I found that RDNA3 float arithmetic produces non-exact results even on those values. As a result, I got values like 1+epsilon where 1 was expected, causing a test to fail (since we didn't know we needed to opt out from requiring exact results) and the test output cryptically printed both values as "1".
The other change is to more consistently print the same number of rows and columns regardless of whether we are at the start or in the middle of a dimension, and to have that number be what we call "context" (before, it was "2 * context").
Also a seasonal emoji change.