
Vectorize optimized_portable_ops versions of portable ops? #9241

Open
swolchok opened this issue Mar 13, 2025 · 16 comments

@swolchok
Contributor

swolchok commented Mar 13, 2025

🚀 The feature, motivation and pitch

Similarly to #8932, we should be able to conditionally compile portable ops to do some vectorization. I imagine that this would look like either passing a second lambda to our util functions, or perhaps passing template lambdas that we could then use for both some scalar T and also Vectorized<T>. The second option would require us to get an std-workalike interface to Vectorized operations so that things like exp would work seamlessly, which would probably have a similar solution to pytorch/pytorch#144495.

RFC

As a concrete example, op_add currently calls a util workhorse function with a lambda:

    utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
        [val_alpha](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b) {
          return val_a + val_alpha * val_b;
        },

We could imagine instead making the call look like this, with a template lambda, so that we could seamlessly use the lambda with Vectorized:

    utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
        [val_alpha](const auto val_a, const auto val_b) {
          return val_a + val_alpha * val_b;
        },
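For illustration, a minimal sketch (hypothetical; the real elementwise_util machinery also handles broadcasting and mixed dtypes) of the kind of dispatch a template lambda enables -- the same callable drives a Vectorized main loop and a scalar tail. Note that captured scalars like val_alpha would need to broadcast cleanly (e.g. via Vectorized's scalar constructor) for the Vectorized instantiation to compile:

    #include <cstdint>
    #include <ATen/cpu/vec/vec.h>

    // Hypothetical helper: apply a binary op to contiguous, same-dtype
    // buffers. Op must compile both as op(Vectorized<T>, Vectorized<T>)
    // and as op(T, T) -- which is what the template lambda provides.
    template <typename T, typename Op>
    void apply_contiguous(const Op& op, const T* a, const T* b, T* out, int64_t n) {
      using Vec = at::vec::Vectorized<T>;
      int64_t i = 0;
      // Main loop: the lambda is instantiated with Vectorized<T>.
      for (; i + Vec::size() <= n; i += Vec::size()) {
        op(Vec::loadu(a + i), Vec::loadu(b + i)).store(out + i);
      }
      // Tail: the same lambda, instantiated with scalar T.
      for (; i < n; ++i) {
        out[i] = op(a[i], b[i]);
      }
    }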

A second, harder example is op_exp:

    Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
      return internal::unary_ufunc_realhbbf16_to_floathbf16(std::exp, ctx, in, out);
    }

I think ideally we would find a solution to the above-mentioned PyTorch issue and then write this as

    Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
      return internal::unary_ufunc_realhbbf16_to_floathbf16_v2(
          [](auto x) { return c10::math::exp(x); }, ctx, in, out);
    }

using a template lambda that could be instantiated with either a scalar or Vectorized, as outlined above.

cc @larryliu0820 @manuelcandales

@swolchok swolchok added actionable Items in the backlog waiting for an appropriate impl/fix module: kernels Issues related to kernel libraries and utilities, and code under kernels/ labels Mar 13, 2025
@swolchok swolchok self-assigned this Mar 13, 2025
@swolchok
Contributor Author

swolchok commented Mar 18, 2025

Sketch of a more detailed plan:

  • Make sure we have committed code that uses ATen Vectorized (optimized/op_gelu does this already)
  • Broaden "optimized portable ops" detection beyond just ET_USE_THREADPOOL -- something needs to tell us it's OK to use Vectorized. (ET_USE_VECTORIZED? What would define it? ET_BUILDING_OPTIMIZED_PORTABLE_OPS? Probably we define a set of specific macros when we build the optimized_portable_ops target.) (Make PyTorch headers available in optimized_portable_kernels, define ET_USE_PYTORCH_HEADERS #9384; see the sketch after this list.)
  • Implement unary_ufunc functions using elementwise_util #9386
  • Specialize elementwise_util ops for the non-mixed-dtype case. If we don't, vectorization doesn't make any sense. (RFC: Specialize for non-mixed-dtype in elementwise_util #9388)
  • Proof-of-concept vectorization for elementwise_util ops using some sort of metaprogramming/SFINAE on their lambda (need to maintain backward compatibility with existing code)
  • Roll out specialized, vectorized elementwise_util ops
  • Establish c10::math namespace in upstream PyTorch
  • Sync c10::math header into ET core so that we can use c10::math namespace in ExecuTorch portable ops, which are part of core
  • Roll out vectorized unary_ufunc ops using c10::math namespace
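As a rough sketch of the build-flag bullet above (ET_USE_PYTORCH_HEADERS is the macro from #9384; the loop itself is hypothetical), conditional compilation could look like:

    #include <cstdint>
    #ifdef ET_USE_PYTORCH_HEADERS
    #include <ATen/cpu/vec/vec.h>
    #endif

    // Hypothetical unary loop: vectorize only when the build defines
    // ET_USE_PYTORCH_HEADERS, i.e. when at::vec::Vectorized is available.
    template <typename T, typename Op>
    void unary_loop(const Op& op, const T* in, T* out, int64_t n) {
      int64_t i = 0;
    #ifdef ET_USE_PYTORCH_HEADERS
      using Vec = at::vec::Vectorized<T>;
      for (; i + Vec::size() <= n; i += Vec::size()) {
        op(Vec::loadu(in + i)).store(out + i);
      }
    #endif
      // Scalar remainder (the whole loop when vectorization is off).
      for (; i < n; ++i) {
        out[i] = op(in[i]);
      }
    }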

@swolchok
Contributor Author

> vectorization for elementwise ops

This is currently trickier than I remembered. We outline our loads in elementwise_util.h in the name of build time and code size; getting to a point where vectorization would be the next item on the list would take some time. Accordingly, I am going to skip elementwise_util.h for now and move on to unary_ufunc; I will come back to elementwise afterwards.

@swolchok
Contributor Author

> skip elementwise_util.h for now and move on to unary_ufunc

This isn't right. The unary_ufunc_* ops in pattern.h seem to be somewhat redundant with elementwise_util.h.

@swolchok
Contributor Author

Plan above updated to reflect the resolution of my confusion about whether to start with elementwise_util or unary_ufunc_*. (In short, unary_ufunc_* will call through to elementwise_util.)

@kimishpatel
Contributor

> so that we could seamlessly use the lambda with Vectorized:

Do you have an example of this, say for the add refactor that you described in the summary of the RFC?

@kimishpatel
Contributor

> template lambda that could be instantiated with Vectorized

Do you have an example of this

@kimishpatel
Contributor

> something needs to tell us it's OK to use Vectorized.

Can we not have an implementation of Vectorized that is just scalar? This is already the case in PyTorch core and ET, no?

@swolchok
Contributor Author

> template lambda that could be instantiated with Vectorized
>
> Do you have an example of this

[](auto x) { return c10::math::exp(x); }, where c10::math contains using std::exp; and also template <typename T> auto exp(at::Vectorized<T> v) { return v.exp(); }
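Spelled out as a header, that would be roughly (a sketch of the proposed shape; c10::math does not exist yet):

    #include <cmath>
    #include <ATen/cpu/vec/vec.h>

    namespace c10 {
    namespace math {

    // Scalar arguments resolve to the std overloads...
    using std::exp;

    // ...and Vectorized arguments resolve to the member function, so
    // [](auto x) { return c10::math::exp(x); } compiles for both.
    template <typename T>
    at::vec::Vectorized<T> exp(at::vec::Vectorized<T> v) {
      return v.exp();
    }

    } // namespace math
    } // namespace c10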

@swolchok
Contributor Author

> something needs to tell us it's OK to use Vectorized.
>
> Can we not have an implementation of Vectorized that is just scalar? This is already the case in PyTorch core and ET, no?

I don't particularly want to sign up to ensure that the code we generate for that is just as good as writing scalar code directly.

@swolchok
Contributor Author

I'm running into trouble with my intended implementation here. Apparently, SFINAE can't be used together with generic lambdas to detect whether they will actually compile when passed an argument of a particular type. See #9432; I expect to resolve this tomorrow after sleeping on the problem.
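For reference, a minimal illustration of the pitfall (hypothetical example; see #9432 for the actual details):

    #include <type_traits>

    // A generic lambda whose body only compiles for types with an .exp()
    // member (e.g. at::vec::Vectorized<T>):
    auto vec_only = [](auto x) { return x.exp(); };

    // One might hope to detect invocability with float and take a scalar
    // path otherwise. But evaluating the trait forces return-type
    // deduction, which instantiates the body, and the failure of x.exp()
    // is outside the immediate context -- a hard compile error rather
    // than a clean false:
    //
    //   static_assert(!std::is_invocable_v<decltype(vec_only), float>);
    //
    // Moving the failing expression into the declaration makes the lambda
    // SFINAE-friendly, because substitution failure there is in the
    // immediate context:
    auto vec_only_sfinae = [](auto x) -> decltype(x.exp()) { return x.exp(); };
    static_assert(!std::is_invocable_v<decltype(vec_only_sfinae), float>);

    int main() {}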

@swolchok
Contributor Author

Updated #9432 with documentation about the SFINAE + generic lambda issue. I suspect I only ran into this because of the way I sequenced my stack; if I move the following steps in my plan before vectorizing elementwise_util, I expect better results:

  • Establish c10::math namespace in upstream PyTorch
  • Sync c10::math header into ET core so that we can use c10::math namespace in ExecuTorch portable ops, which are part of core
  • Update unary_ufunc ops to use c10::math ops instead of std so that they can cleanly handle Vectorized

@kimishpatel
Contributor

> template lambda that could be instantiated with Vectorized
>
> Do you have an example of this
>
> [](auto x) { return c10::math::exp(x); }, where c10::math contains using std::exp; and also template <typename T> auto exp(at::Vectorized<T> v) { return v.exp(); }

OK, I get that, but I don't get whether there will be other functions that call into this lambda with Vectorized vs. pure scalar arguments (like in Loops.h in PyTorch). Maybe I can just wait for you to have a working version.

@swolchok
Contributor Author

> implementation of Vectorized that is just scalar

In particular, it's important to note that if you wanted at::vec::Vectorized to work nicely in this mode, you would presumably have a default Vectorized::size() of 1, so that Vectorized code was just scalar code when a specialized implementation was not available. That is not how it actually works.
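For context, a simplified sketch of the shape of ATen's unspecialized fallback (not its actual code; the width constant varies by build): the generic Vectorized<T> is an N-wide array of scalars whose operations are element-by-element loops, which the compiler may or may not reduce to code as good as a hand-written scalar loop.

    #include <cmath>
    #include <cstddef>

    // Simplified sketch of the fallback's shape (hypothetical constants):
    template <typename T>
    struct Vectorized {
      static constexpr std::size_t kWidthBytes = 32;  // varies by build in ATen
      T values[kWidthBytes / sizeof(T)];              // e.g. 8 floats, not 1

      static constexpr int size() { return kWidthBytes / sizeof(T); }

      Vectorized<T> exp() const {
        Vectorized<T> out;
        for (int i = 0; i != size(); ++i) {
          out.values[i] = std::exp(values[i]);
        }
        return out;
      }
    };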

@kimishpatel
Contributor

> implementation of Vectorized that is just scalar
>
> In particular, it's important to note that if you wanted at::vec::Vectorized to work nicely in this mode, you would presumably have a default Vectorized::size() of 1, so that Vectorized code was just scalar code when a specialized implementation was not available. That is not how it actually works.

Yeah, that's a bit unfortunate.

OK, so separately, can it still be called a portable fallback, though? I didn't look through the header to verify that it doesn't make platform-specific assumptions, but I would have guessed it could be considered portable, in the sense that it can compile on any platform that has a C/C++ compiler with C++17 or later support.

@digantdesai
Contributor

digantdesai commented Mar 31, 2025

Just curious: did you consider copying a portable op into an optimized op and SIMD-ifying it there? The distinction would be that instead of a vector path with a scalar fallback inside the portable_op.cpp file, it would live in optimized.yml with a portable.yml fallback during selective build.

@swolchok
Contributor Author

> did you consider copying a portable op into an optimized op and SIMD-ifying it there

That would be worse, because we would then have copy/pasted code.

swolchok added a commit that referenced this issue Apr 2, 2025
…ET_USE_PYTORCH_HEADERS (#9384)

Enables following diff. First real step of #9241.
kirklandsign pushed a commit that referenced this issue Apr 11, 2025
…ET_USE_PYTORCH_HEADERS (#9384)

Enables following diff. First real step of #9241.