Revert "[Flow] Convert from tensor.cast to flow.tensor.reshape early …(#18256)" #18331

Merged
merged 1 commit into from
Aug 22, 2024

Conversation

nirvedhmeshram
Contributor

@nirvedhmeshram nirvedhmeshram commented Aug 22, 2024

This reverts commit 1c0c5a6.

This is causing an issue with https://github.com/iree-org/iree/actions/runs/10505447157/job/29119827242#step:5:2269
iree/tests/e2e/regression/check_regression_llvm-cpu_layernorm.mlir

Triaging the bug, I see that one of the dispatches is slightly different. It should produce the same numerics, but it does not.
Here is the problematic dispatch as it was originally:

```mlir
func.func @old_dispatch2(%11 : tensor<128x384xf32>, %12 : tensor<128xf32>) -> tensor<128x384xf32> {
  %cst = arith.constant 5.000000e+00 : f32
  %cst_0 = arith.constant 0.000000e+00 : f32
  %13 = tensor.empty() : tensor<128x384xf32>
  %14 = tensor.empty() : tensor<128xf32>
  %15 = linalg.fill ins(%cst_0 : f32) outs(%14 : tensor<128xf32>) -> tensor<128xf32>
  %16 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%11 : tensor<128x384xf32>) outs(%15 : tensor<128xf32>) {
  ^bb0(%in: f32, %out: f32):
    %18 = arith.addf %in, %out : f32
    linalg.yield %18 : f32
  } -> tensor<128xf32>
  %17 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11, %16, %12 : tensor<128x384xf32>, tensor<128xf32>, tensor<128xf32>) outs(%13 : tensor<128x384xf32>) {
  ^bb0(%og_in : f32, %in: f32, %in_1: f32, %out: f32):
    %18 = arith.mulf %in, %in_1 : f32
    %19 = arith.subf %og_in, %18 : f32
    linalg.yield %19 : f32
  } -> tensor<128x384xf32>
  return %17 : tensor<128x384xf32>
}
```

With the PR being reverted (#18256), this dispatch becomes:

```mlir
func.func @new_dispatch2(%11 : tensor<128x384xf32>, %12 : tensor<128xf32>) -> tensor<128x384xf32> {
  %cst = arith.constant 5.000000e+00 : f32
  %cst_0 = arith.constant 0.000000e+00 : f32
  %13 = tensor.empty() : tensor<128x384xf32>
  %14 = tensor.empty() : tensor<128xf32>
  %15 = linalg.fill ins(%cst_0 : f32) outs(%14 : tensor<128xf32>) -> tensor<128xf32>
  %16 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%11 : tensor<128x384xf32>) outs(%15 : tensor<128xf32>) {
  ^bb0(%in: f32, %out: f32):
    %18 = arith.addf %in, %out : f32
    linalg.yield %18 : f32
  } -> tensor<128xf32>
  %17 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%16, %12 : tensor<128xf32>, tensor<128xf32>) outs(%13 : tensor<128x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %18 = arith.mulf %in, %in_1 : f32
    %19 = arith.subf %cst, %18 : f32
    linalg.yield %19 : f32
  } -> tensor<128x384xf32>
  return %17 : tensor<128x384xf32>
}
```

The difference is at `%19 = arith.subf %cst, %18 : f32`.
Note that in both cases `%11 : tensor<128x384xf32>` is 5.0 from the model input, so the output should be the same. However, on aarch64 it is not, while on x86 it is. The IR at the MLIR LLVM dialect level is identical between x86 and aarch64, so it is not some easy-to-spot intrinsics / fast-math kind of bug AFAICS.
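A minimal numeric sketch of the equivalence claim, in plain Python (the 2x3 shapes and the per-row weights are hypothetical stand-ins for the 128x384 test): when every element of the input equals the constant 5.0, subtracting the original element (`%og_in`, as in `old_dispatch2`) and subtracting the constant (`%cst`, as in `new_dispatch2`) must give bit-identical results.

```python
# Sketch: the two formulations of the second linalg.generic differ only in
# whether they subtract the loaded input element or the constant 5.0.
# With a uniformly-5.0 input they are mathematically identical.

CST = 5.0
ROWS, COLS = 2, 3  # hypothetical small shapes standing in for 128x384

inp = [[CST] * COLS for _ in range(ROWS)]  # %11: the whole input is 5.0
w = [1.5, -0.25]                           # %12: arbitrary per-row values

# First generic: row-wise reduction over the "reduction" dimension.
row_sum = [sum(row) for row in inp]        # %16

# old_dispatch2: %19 = arith.subf %og_in, (%16 * %12)
old = [[inp[i][j] - row_sum[i] * w[i] for j in range(COLS)]
       for i in range(ROWS)]

# new_dispatch2: %19 = arith.subf %cst, (%16 * %12)
new = [[CST - row_sum[i] * w[i] for j in range(COLS)]
       for i in range(ROWS)]

assert old == new  # identical whenever every input element equals %cst
```

So any observed divergence on aarch64 cannot come from this rewrite alone; the mismatch has to enter somewhere below this level of the IR.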

@nirvedhmeshram nirvedhmeshram enabled auto-merge (squash) August 22, 2024 23:36
@nirvedhmeshram nirvedhmeshram merged commit 8da4564 into main Aug 22, 2024
41 checks passed
@nirvedhmeshram nirvedhmeshram deleted the users/nirvedhmeshram/arm_issue branch August 22, 2024 23:49
@MaheshRavishankar
Contributor

I don't see how the PR that you have would have resulted in the wrong case. There is nothing in that PR that should result in the use of %cst instead of the input, right?

@nirvedhmeshram
Contributor Author

> I don't see how the PR that you have would have resulted in the wrong case. There is nothing in that PR that should result in the use of %cst instead of the input, right?

I think the tensor.cast ops that would otherwise end up in the dispatch regions probably prevent some analysis that can now happen with the flow.tensor.reshape ops.

Also, I am unsure that this test represents a real-world dynamic scenario. The analysis is able to infer the sizes from the flow.dynamic.constant input types and, in the case above, inline the whole input as a constant. I don't think this will happen for a real model.

@MaheshRavishankar
Contributor

> > I don't see how the PR that you have would have resulted in the wrong case. There is nothing in that PR that should result in the use of %cst instead of the input, right?
>
> I think the tensor.cast ops that would otherwise end up in the dispatch regions probably prevent some analysis that can now happen with the flow.tensor.reshape ops.
>
> Also, I am unsure that this test represents a real-world dynamic scenario. The analysis is able to infer the sizes from the flow.dynamic.constant input types and, in the case above, inline the whole input as a constant. I don't think this will happen for a real model.

Ok, I need more context here. This might be an ARM64 bug, in which case we might have to mark the test as xfail, narrow down an LLVM reproducer, and reland this.
