Fix onnx Attention and torch SDPA quantization handling #3751
base: develop
Conversation
(force-pushed 734ba64 to 791f962)
Hm. Do you have any preferences on whether I should mark this test as

```python
@pytest.mark.skipif(
    version.parse(onnx.__version__) < version.parse("1.18.0"),
    reason="Opset 23 was added in onnx 1.18.0",
)
```

or bump the minimum onnx version, or something else?
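For clarity, here is a minimal, self-contained sketch of the version gate being discussed. The helper name below is hypothetical (not part of NNCF or this PR); it only assumes the `packaging` library is installed:

```python
# Hypothetical helper illustrating the proposed skip condition.
from packaging import version

def opset23_unsupported(installed_onnx_version: str, minimum: str = "1.18.0") -> bool:
    # Opset 23 (which introduces the Attention operator) first shipped
    # with onnx 1.18.0, so older installs should skip the test.
    return version.parse(installed_onnx_version) < version.parse(minimum)

print(opset23_unsupported("1.17.0"))  # True  -> test would be skipped
print(opset23_unsupported("1.18.0"))  # False -> test would run
```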
Hi @ruro, thanks for your contribution. We currently support multiple versions of ONNX, and the […]
(force-pushed 791f962 to 9af3bfb)
```python
input_port_ids = [input_edge.input_port_id] + input_edge.parallel_input_port_ids
node_name = nncf_node.node_name
for input_port_id in input_port_ids:
    allowed_pre_hook_insertion_points.append(PreHookInsertionPoint(node_name, input_port_id))
```
@ruro Could you please briefly explain why these changes are necessary?
@daniil-lyakhov Please take a look
I've outlined my reasoning in the last two comments in #3750. The short version is that parallel edges aren't directly represented in the PTNNCFGraph (because it isn't a multigraph and doesn't allow repeated edges between the same pair of nodes); instead, they are stored in the parallel_input_port_ids property of the edge.
In this case, unbind has 3 outputs that are passed as the q, k and v inputs of the sdpa node. Each of these 3 edges should be considered separately for the purposes of quantizer insertion/propagation, but the previous logic only added insertion points for "real" edges, ignoring any extra parallel edges.
Let me know if anything is unclear.
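To make the effect of the change concrete, here is a self-contained sketch of the corrected insertion-point collection. The `Edge` and `PreHookInsertionPoint` classes below are simplified stand-ins for NNCF's internal types, not the real API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreHookInsertionPoint:
    # Simplified stand-in: which node/port a quantizer may be inserted at.
    target_node_name: str
    input_port_id: int

@dataclass
class Edge:
    # Simplified stand-in for an NNCF graph edge: one "real" port id
    # plus any parallel edges collapsed into parallel_input_port_ids.
    input_port_id: int
    parallel_input_port_ids: list

def collect_pre_hook_points(node_name, input_edge):
    # The fix: iterate over the real port id *and* every parallel port
    # id, so each parallel edge gets its own insertion point.
    points = []
    for port_id in [input_edge.input_port_id] + input_edge.parallel_input_port_ids:
        points.append(PreHookInsertionPoint(node_name, port_id))
    return points

# unbind -> sdpa: one "real" edge on port 0, parallel edges on ports 1 and 2
edge = Edge(input_port_id=0, parallel_input_port_ids=[1, 2])
points = collect_pre_hook_points("sdpa", edge)
print([p.input_port_id for p in points])  # [0, 1, 2]: one point per q/k/v port
```

With the old logic, only port 0 would have received an insertion point, so the k and v inputs could not be quantized.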
Great contribution! Could you please share a Netron/NNCF graph visualization of the newly supported subgraph? (NNCF graph visualization API: https://github.com/openvinotoolkit/nncf/blob/develop/src/nncf/common/graph/graph.py#L611-L613)
This part of the code is core NNCF logic, so we need to understand all possible side effects of this change.
I am not sure what you mean by "netron/nncf graph". The second image in the PR body shows the expected graph for unbind+sdpa after applying quantization. Does that work?


Changes
- Added `ONNXAttentionMetatype` for the opset 23 `Attention` ONNX node
- Fixed `scaled_dot_product_attention` quantization in `torch2` for the case when `Q`, `K` and `V` are parallel edges coming from the same input node

Reason for changes
See #3750
Related tickets
Fixes #3750
Tests
- `tests/onnx/quantization/test_graphs.py::test_synthetic_models_graph[AttentionModel]`
- `tests/torch2/function_hook/quantization/test_quantized_graphs.py::test_quantized_graphs[unbind_scaled_dot_product_attention_model]`