
Begin testing models from the ONNX Model Zoo. #23

Merged
ScottTodd merged 26 commits into iree-org:main from the onnx-models branch on Sep 19, 2024

Conversation

@ScottTodd (Member) commented on Sep 6, 2024:

Progress on #6.

A sample test report HTML file is available here: https://scotttodd.github.io/iree-test-suites/onnx_models/report_2024_09_17.html

These new tests

  • Download models from https://github.com/onnx/models
  • Extract metadata from the models to determine which functions to call with random data
  • Run the models through ONNX Runtime as a reference implementation
  • Import the models using iree-import-onnx (until we have a better API: [onnx] Build real onnx frontend cli/API iree#18289)
  • Compile the models using iree-compile (currently just for llvm-cpu but this could be parameterized later)
  • Run the models using iree-run-module, checking outputs using --expected_output and the reference data (a rough sketch of this pipeline follows the list)
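
For orientation, here is a minimal sketch of that per-model flow, assuming the IREE tools are on PATH. The file names and helper name are illustrative, not the actual code in conftest.py:

import subprocess

def import_compile_and_run(onnx_path, input_flags, expected_output_flags):
    # Import the ONNX model into MLIR that IREE can compile.
    subprocess.run(["iree-import-onnx", onnx_path, "-o", "model.mlir"], check=True)
    # Compile for the llvm-cpu backend (could be parameterized later).
    subprocess.run(
        [
            "iree-compile",
            "model.mlir",
            "--iree-hal-target-backends=llvm-cpu",
            "-o",
            "model.vmfb",
        ],
        check=True,
    )
    # Run the compiled module, comparing against the ONNX Runtime reference
    # outputs passed in via --expected_output flags.
    subprocess.run(
        ["iree-run-module", "--module=model.vmfb", "--device=local-task"]
        + list(input_flags)
        + list(expected_output_flags),
        check=True,
    )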

Tests are written in Python using a set of pytest helper functions. As the tests run, they can log details about what commands they are running. When run locally, the artifacts/ directory will contain all the relevant files. More can be done in follow-up PRs to improve the ergonomics there (like generating flagfiles).

Each test case can use XFAIL like @pytest.mark.xfail(raises=IreeRunException). As we test across multiple backends or want to configure the test suite from another repo (e.g. iree-org/iree), we can explore more expressive marks.
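
For illustration, a test case might look roughly like this. The fixture name matches the helper mentioned later in this thread, while the import path and model path are assumptions, not the exact code in this PR:

import pytest

from conftest import IreeRunException  # assumed location of the exception class

# Illustrative path into the onnx/models repository tree.
MODEL_PATH = "validated/vision/classification/mobilenet/model/mobilenetv2-12.onnx"

@pytest.mark.xfail(raises=IreeRunException)
def test_mobilenet(compare_between_iree_and_onnxruntime_fn):
    compare_between_iree_and_onnxruntime_fn(MODEL_PATH)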

Note that unlike the ONNX operator tests, these tests use onnxruntime and iree-import-onnx at test time. The operator tests handle that as an infrequently run offline step. We could do something similar here, but the test inputs and outputs can be rather large for real models, and that gets into Git LFS or cloud storage territory.

If this test authoring model works well enough, we can do something similar for other ML frameworks like TFLite (#5).

ScottTodd marked this pull request as ready for review on September 6, 2024 at 23:29
@ScottTodd (Member, Author) left a comment:

Some self-review comments from weekend thinking. Busy for a bit, then I will make those changes.

onnx_models/conftest.py (resolved review thread)
onnx_models/vision_models_test.py (resolved review thread)
# https://github.com/onnx/models/tree/main/validated/vision/classification/mobilenet
ScottTodd (Member, Author):

Can write a script to get these directories / files from the onnx/models repo and translate those into test cases
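
One possible sketch of such a script (the helper name is hypothetical, and the GitHub contents API is rate-limited without a token):

import json
import urllib.request

def list_model_dirs(subtree="validated/vision/classification"):
    # List the model directories under a subtree of onnx/models so they can be
    # translated into parametrized test cases.
    url = f"https://api.github.com/repos/onnx/models/contents/{subtree}"
    with urllib.request.urlopen(url) as response:
        entries = json.load(response)
    return sorted(entry["name"] for entry in entries if entry["type"] == "dir")

if __name__ == "__main__":
    for name in list_model_dirs():
        print(name)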

onnx_models/vision_models_test.py (resolved review thread)
onnx_models/conftest.py (resolved review thread)
@ScottTodd (Member, Author) commented:

Ping @zjgarvey ? Any high level feedback? I should have some time this week to address my own comments, but I'd like to prefetch any other feedback/discussion.

@zjgarvey (Collaborator) commented:

> Ping @zjgarvey ? Any high level feedback? I should have some time this week to address my own comments, but I'd like to prefetch any other feedback/discussion.

I'll take a look today!

@zjgarvey (Collaborator) left a comment:

My biggest concern right now is the handling of dynamic dims and getting the input/output signatures to work in those cases.

Some suggestions pertaining to this:

  1. Have a test-specific dictionary of dim_param assignments to handle models with dynamic dims.
  2. Since you are already generating an onnxruntime session, it might be easier to pass the session itself, together with a dim_param dictionary, to generate the input signature.
  3. Again, since you are already using the onnxruntime session, the most reliable way to get a correct output signature is from the reference outputs of that session run. It would likely be prohibitively complicated to infer output shapes from a dynamic ONNX model: even with shape inference and dim_params included, I'm not sure it could determine correct output shapes without actually fixing the dim_params in the ONNX model itself (there is tooling for that: https://github.com/microsoft/onnxruntime/blob/291a5352b27ded5714e5748b381f2efb88f28fb9/tools/python/util/onnx_model_utils.py#L177).

If I don't revisit this soon, please feel free to ping again.
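
A minimal sketch of suggestion 2 above, assuming a per-test dim_param dictionary; the function name and the default-to-1 fallback are illustrative, not this PR's code:

import onnxruntime

def resolve_input_shapes(session: onnxruntime.InferenceSession, dim_params: dict):
    # Replace dynamic dims (named dim_params or None) in the session's input
    # signature with concrete sizes from the test-specific dictionary,
    # defaulting to 1 when no assignment is given.
    resolved = {}
    for inp in session.get_inputs():
        shape = []
        for dim in inp.shape:
            if isinstance(dim, int):
                shape.append(dim)  # static dim
            else:
                shape.append(dim_params.get(dim, 1))  # dim_param name or None
        resolved[inp.name] = shape
    return resolved

# Example: an input shaped ["batch_size", 3, 224, 224] resolves to
# [4, 3, 224, 224] when dim_params == {"batch_size": 4}.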

onnx_models/conftest.py (resolved review thread)
# A) List all metadata explicitly
# B) Get metadata on demand from the .onnx protobuf using 'onnx'
# C) Get metadata on demand from the InferenceSession using 'onnxruntime'
# This is option B.
zjgarvey (Collaborator):

One of the issues with option B is that, if you try to get the output signature for a graph that hasn't yet performed shape inference, you will likely get dim_params rather than actual dims for the output shape.

I'm still trying to figure out a good way to do this, since currently I've been getting the input/output signature through option C (which comes with its own baggage).
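
For reference, a minimal sketch of option B with shape inference run first (illustrative only; as noted above, dynamic dims can still come back as named dim_params rather than concrete values):

import onnx
from onnx import shape_inference

model = onnx.load("model.onnx")  # assumed local path
model = shape_inference.infer_shapes(model)
for output in model.graph.output:
    # Each dim is either a concrete dim_value or a symbolic dim_param name.
    dims = [
        d.dim_value if d.HasField("dim_value") else d.dim_param
        for d in output.type.tensor_type.shape.dim
    ]
    print(output.name, dims)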

zjgarvey (Collaborator):

Ah, it looks like you use 1 as the size for dims without a dim_value. I'll try to keep an eye on where that might cause problems.

ScottTodd (Member, Author):

Another option I'm considering is having test case import be an offline, tool-assisted step, instead of an online step that always runs as part of the tests. That way, we can have developers importing model tests decide explicitly what functions and signatures they want to test. It would be nice if there was only one obvious function and signature for each model that can be safely inferred though :)

ScottTodd (Member, Author):

Made some functional updates, but I'll continue to iterate on the code style.

This is now using option C - getting the input and output signatures from onnxruntime.InferenceSession. The output signatures are now exactly what comes from onnxruntime, while the input signatures still turn dynamic dims into 1.
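
A sketch of what that can look like for the inputs; the element-type table is abbreviated and the function name is illustrative. IREE type strings such as "1x3x224x224xf32" feed the --input flags in the snippet below:

import onnxruntime

ELEMENT_TYPE_MAP = {"tensor(float)": "f32", "tensor(int64)": "i64"}  # abbreviated

def iree_type_string(node_arg, dynamic_dim_default=1):
    # Build an IREE-style type string, turning any dynamic dim (a named
    # dim_param or None) into dynamic_dim_default.
    dims = [d if isinstance(d, int) else dynamic_dim_default for d in node_arg.shape]
    element_type = ELEMENT_TYPE_MAP[node_arg.type]
    return "x".join(str(d) for d in dims) + "x" + element_type

session = onnxruntime.InferenceSession(
    "model.onnx", providers=["CPUExecutionProvider"]  # assumed local path
)
print([iree_type_string(arg) for arg in session.get_inputs()])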

for input in onnx_model_metadata["inputs"]:
    input_type = input["iree_type"]
    input_data_path = input["input_data_path"]
    run_module_args.append(f"--input={input_type}=@{input_data_path}")
zjgarvey (Collaborator):

Ah, yeah. I'm not sure how this will work if the dynamic input dims aren't actually 1 (referring to the previous comment about setting dims without a dim_value equal to 1).

zjgarvey (Collaborator):

Same with the outputs below, except there is a much higher chance that dynamic output dims are beyond our control.

ScottTodd (Member, Author):

We have control over the inputs. As long as there aren't restrictions (like must be a multiple of 4, or two dynamic inputs must be equal), setting to 1 seems safe enough to start. We could also override per test case or add a flag that lets you override for the entire test suite.

I at least made it so the output shape comes from the inference session, since that value is out of our control.

zjgarvey (Collaborator):

Yeah, I know many LLMs have dim_params like past_seq_len and then elsewhere past_seq_len + 1, so there are definitely restrictions for some models.

After looking more carefully at the setup (which is really nice, btw), I think it would be pretty feasible to just add an optional dim_param_dict as an input for compare_between_iree_and_onnxruntime_fn.

zjgarvey (Collaborator):

In short, I think that could be added later as necessary.

ScottTodd (Member, Author):

Yep. I just started with vision classification models but as more are added this test code will grow more features. Are there particular ONNX models that you know need that treatment? I can look at importing those next.

onnx_models/utils.py (resolved review thread)
@ScottTodd (Member, Author) commented:

Okay, I made a bunch of updates today and I'm happy enough with where this is now to merge it and start running it nightly.

Major changes:

PTAL?

onnx_models/conftest.py (resolved review thread)
@zjgarvey (Collaborator) commented:

@ScottTodd , I don't have any pressing comments. The printout you shared was really easy to navigate and the code here feels like it will be painless to iterate on with updates. Please let me know when you feel like this is in a state to merge, and I'll take a final look.

@ScottTodd (Member, Author) commented:

PTAL. My plan is to keep this running nightly here in this repo, then also consider running a small number of tests in the IREE repo to watch for regressions. That sort of cross-repo testing will need either a way to set XFAIL from CLI flags (like the ONNX operator tests do with config files; a rough sketch follows), or we could only test already-passing models and grow the set that is tested over time. Right now the small number of tests only takes 2-3 minutes to run, even including time spent downloading files.
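
A rough sketch of the config-file approach, assuming a JSON file containing a list of test names; the hook names are standard pytest, everything else is illustrative:

import json
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--expected-failures",
        default=None,
        help="Path to a JSON file listing test names expected to fail",
    )

def pytest_collection_modifyitems(config, items):
    path = config.getoption("--expected-failures")
    if not path:
        return
    with open(path) as f:
        expected_failures = set(json.load(f))
    for item in items:
        # Mark listed tests as expected failures without editing the test files.
        if item.name in expected_failures:
            item.add_marker(
                pytest.mark.xfail(reason="Listed in the expected failures config file")
            )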

ScottTodd merged commit 7b8bdf7 into iree-org:main on Sep 19, 2024
2 checks passed
ScottTodd deleted the onnx-models branch on September 19, 2024 at 20:25