
Parallelize portable ops if threadpool is available, with fallback to parallel_for-as-for-loop #8932

Open
swolchok opened this issue Mar 4, 2025 · 14 comments

Comments

@swolchok
Contributor

swolchok commented Mar 4, 2025

🚀 The feature, motivation and pitch

It seems suboptimal to me that we have to create separate optimized ops just to get basic stuff like parallelization (and vectorization, but let's start with parallelization). Here's what I'd like to do: (The timeline here is "ASAP", but I'm opening an issue because this got too long for chat and so that I can point to this issue on the PRs.)

  1. Set up a proper CMake build for extension/parallel; right now it's free-riding on Buck and getting automatically duplicated into 3 different targets per the generated executorch_srcs.cmake. (done; Add proper CMake build for extension_parallel #8938)
  2. Make extension_threadpool itself export the -DET_USE_THREADPOOL macro we already use and define somewhat ad-hoc. (done; Properly export ET_USE_THREADPOOL from the threadpool extension #8947)
  3. Move extension/parallel/thread_parallel.h to core. (@larryliu0820 suggests runtime/kernel/thread_parallel.h) (Yes, I will leave a stub header behind for backward compatibility.) Move thread_parallel.cpp to threadpool, since there will be no reason not to provide it when threads are available. Provide a default implementation of parallel_for, used when threadpool is not built (i.e., when ET_USE_THREADPOOL is not defined), that is just an inlinable for loop; a minimal sketch of this fallback appears right after this list. (Split & remove extension_parallel #8983)
  4. use parallel_for in at least one portable op, either directly or via the workhorse "util" functions. (Add basic parallel_for support to reduce_util #8986)
  5. Verify that, because the optimized library is built with threadpool, it gets parallelization. Adjust build configuration for optimized ops lib if necessary. (Build optimized_portable_kernels if threadpool is enabled #8987)
  6. Roll out parallel_for across portable ops and workhorse "util" functions.
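
For concreteness, here is a minimal sketch of the step-3 fallback, assuming an ATen-style signature in which the callable receives a [begin, end) sub-range; the threadpool-side helper name is hypothetical:

#include <cstdint>

template <typename Func>
bool parallel_for(
    const int64_t begin,
    const int64_t end,
    const int64_t grain_size,
    const Func& f) {
#ifdef ET_USE_THREADPOOL
  // Threadpool build: chunk [begin, end) by grain_size and dispatch the
  // chunks to worker threads (hypothetical helper standing in for the real
  // implementation in the threadpool extension).
  return internal::threadpool_parallel_for(begin, end, grain_size, f);
#else
  // No threadpool: invoke the callable once over the whole range on the
  // calling thread, so parallel_for degenerates to a plain for loop.
  f(begin, end);
  return true;
#endif
}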

Thoughts? Blockers?

Alternatives

status quo -- slow portable ops

Additional context

No response

RFC (Optional)

No response

cc @larryliu0820 @manuelcandales

@swolchok swolchok added the actionable (items in the backlog waiting for an appropriate impl/fix), module: kernels (issues related to kernel libraries and utilities, and code under kernels/), and triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) labels Mar 4, 2025
@swolchok swolchok self-assigned this Mar 4, 2025
@kimishpatel
Contributor

One of the goals of portable ops was readability as well, so it would be good to at least touch on that.

move extension/parallel/thread_parallel.h to core

It would probably be better to find a different location, because it will start pulling in pthreadpool and other dependencies when you actually want the threadpool. Keeping it in extension would mean portable ops depend on something outside of the core, but putting it within the core will complicate the build system?

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

Why would the header pull in threadpool? I didn't say I would move the .cpp file.

@digantdesai
Contributor

digantdesai commented Mar 4, 2025

I am curious: since we have a dedicated CPU delegate for cases where we need CPU perf, why is enhancing that to deal with perf issues not an option, as opposed to pushing perf through portable or even optimized ops? We could also explore the custom op interface if upstreaming to XNNPACK and the subsequent XNNPACK rebase are too much effort.

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

readability

parallel_for isn't much worse than a for loop

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

since we have a dedicated CPU delegate for cases where we need CPU perf, why is enhancing that to deal with perf issues not an option, as opposed to pushing perf through portable or even optimized ops

It is much better if our ops are reasonably fast by default, rather than only performing well for specific cases we have looked at and done work on.

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

XNNPACK

forgot to mention that it is also better if we don't have to write ops multiple times (once for portable and once for optimized and/or XNNPACK).

swolchok added a commit that referenced this issue Mar 4, 2025
Previously it was copied in several places per executorch_srcs.cmake.

Needed for #8932

Test Plan: Compare cmake-out/executorch_srcs.cmake before/after for my usual testing CMake config with "all the CPU stuff" on; found that thread_parallel.cpp is now duplicated in only one place instead of several (it's in llama_runner, which needs a general fixup because it duplicates several extensions).

ghstack-source-id: 333d1c9787f197b7c9163521af19cee4753426df
ghstack-comment-id: 2698715235
Pull Request resolved: #8938
@mergennachin
Contributor

mergennachin commented Mar 4, 2025

Do you have a prototype of what a portable op would look like with this? Maybe anchor with an example

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

Do you have a prototype of what a portable op would look like with this?

Sure, I expect

for (const auto out_ix : c10::irange(out.numel())) {

to be replaced with

parallel_for(0, out.numel(), internal::GRAIN_SIZE, [&](const auto begin, const auto end) {

Relatedly, I expect the two loops of the form

for (size_t i = start; i <= end; i++) {

to read

parallel_for(start, end, internal::GRAIN_SIZE, [&](const auto begin, const auto end) {
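
For concreteness, a fuller sketch of the first rewrite (out_data and compute_one are hypothetical stand-ins for the op's output pointer and per-element computation; the lambda receives the [begin, end) sub-range it is responsible for):

// Before: serial loop over every output element.
for (const auto out_ix : c10::irange(out.numel())) {
  out_data[out_ix] = compute_one(out_ix);
}

// After: the same body, moved into a range lambda.
parallel_for(
    0,
    out.numel(),
    internal::GRAIN_SIZE,
    [&](const auto begin, const auto end) {
      for (const auto out_ix : c10::irange(begin, end)) {
        out_data[out_ix] = compute_one(out_ix);
      }
    });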

@mergennachin
Contributor

That looks ugly... A simple for loop now has three new concepts, namely (1) a new parallel_for util function, (2) grain_size, and (3) lambda stuff.

Can we do this completely inline, using macros?

for (size_t i = start; i < end; i ++) {
   ...
}

could be replaced by directly

FOR(a, start, end) {
   ...
} END_FOR

and define FOR and END_FOR macros that switch based on a flag, hiding the grain size and lambdas etc. underneath? (One possible spelling is sketched below.)
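
For reference, one hypothetical spelling of these macros (illustrative only; as the next reply notes, this approach was not adopted):

#ifdef ET_USE_THREADPOOL
#define FOR(i, start, end)                           \
  parallel_for((start), (end), internal::GRAIN_SIZE, \
      [&](const auto begin_, const auto end_) {      \
        for (auto i = begin_; i < end_; ++i) {
#define END_FOR \
  }             \
  });
#else
// Without a threadpool, the macros collapse to an ordinary loop.
#define FOR(i, start, end) for (auto i = (start); i < (end); ++i) {
#define END_FOR }
#endif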

@swolchok
Contributor Author

swolchok commented Mar 4, 2025

Can we do this completely inline, using macros?

That's not better. Now you have (1) macros, which are inherently a problem, (2) a new FOR concept, and (3) a lambda that you can't see (which overlaps with point (1)). The one thing we can easily get rid of is the grain size, by simply adding an overload of parallel_for that defaults it; a sketch follows.
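
For illustration, such a defaulting overload might look like this (using internal::GRAIN_SIZE as the default is an assumption):

template <typename Func>
bool parallel_for(const int64_t begin, const int64_t end, const Func& f) {
  // Forward to the three-argument overload with a library-chosen grain size.
  return parallel_for(begin, end, internal::GRAIN_SIZE, f);
}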

swolchok added a commit that referenced this issue Mar 6, 2025
As per plan in #8932, we want to be able to include thread_parallel.h to build
libraries that are *capable* of parallelization, but don't *require* it. So, we
move the header to ExecuTorch core and add a fallback implementation (with
tests!) of `parallel_for` that just does a regular `for` loop. Then, targets
that link `extension_threadpool` will get parallelization automagically.

This PR doesn't add any optionally-parallelized code; that will be in the next
PR.

ghstack-source-id: 4f473748293c0b7b1fb7092117db0a89d541db63
ghstack-comment-id: 2702414287
Pull Request resolved: #8983
swolchok added a commit that referenced this issue Mar 6, 2025
Initial parallel_for integration in a portable op. Needed for #8932. Feel free to hold review until the rest of the stack is ready and we observe successful parallelization.

ghstack-source-id: 3d510f0abf35069c3c3939605ff9c5639f8f845d
ghstack-comment-id: 2702502530
Pull Request resolved: #8986
zonglinpeng pushed a commit that referenced this issue Mar 6, 2025
swolchok added a commit that referenced this issue Mar 7, 2025
We had a bunch of targets that would define this macro ad-hoc. It's supposed to indicate that the threadpool extension is available, so just make sure that we have it as a PUBLIC target_compile_definition in CMake and an exported_preprocessor_flags entry in Buck for extension_threadpool.

Test Plan: CI on following NOCOMMIT PR, which fails builds if ET_USE_THREADPOOL is not defined in affected places.

Needed for #8932.
swolchok added a commit that referenced this issue Mar 12, 2025
This is step (5) of #8932.

At this exact moment, this rebuild is inefficient because it rebuilds
the whole portable op library, but ops don't support optional
parallelization just yet. This will become less true when we roll out
parallel_for support across portable ops immediately following this PR.
@swolchok
Contributor Author

#9197 is the top of a stack of rollouts for reductions. After that, need to roll out across other util functions (most notably elementwise_util) before closing.

@swolchok
Contributor Author

Looking through the other util functions to finish rolling out. Questions:

  • should we parallelize apply_kernel_2d_reduce_then_map_fn? @manuelcandales
  • ditto padding_util

Notes:

  • parallelizing memcpy seems unlikely to be very effective on mobile/embedded devices, so I'm skipping repeat_util.cpp, select_copy_util.cpp, slice_util.cpp, and transpose_util.h

swolchok added a commit that referenced this issue Mar 18, 2025
The other ones are reductions.

More #8932 rollout.

ghstack-source-id: 3b888b8629dc9c0120d1dfb1011146e228b71ec3
ghstack-comment-id: 2731383174
Pull Request resolved: #9348
@digantdesai
Contributor

it is also better if we don't have to write ops multiple times (once for portable and once for optimized and/or XNNPACK).

Portable ops with threads may be necessary (so the out-of-box default is not too bad), but not really sufficient if we care about performance. So, while we do this, it doesn't practically save us from implementing arm- and/or x86-specific (which we mainly care about) SIMD + multi-threaded ops in either optimized or XNNPACK.

That said, I agree that it may reduce the pressure on us to provide an optimized impl, and avoid "guaranteed bad" perf when using portable. Do we have perf uplift data from this? Curious to see how far this + autovec + out-of-order CPUs can get.

oscarandersson8218 pushed a commit to oscarandersson8218/executorch that referenced this issue Mar 21, 2025
@swolchok
Contributor Author

swolchok commented Mar 25, 2025

SIMD + Multi-threaded

we are also going to vectorize, at least for elementwise ops. #9241

perf uplift data

nothing particularly concrete, but I can vouch that it goes faster.

DannyYuyang-quic pushed a commit to CodeLinaro/executorch that referenced this issue Apr 2, 2025