Parallelize portable ops if threadpool is available, with fallback to parallel_for-as-for-loop #8932
Comments
One of the goals of the portable ops was readability as well, so it would be good to at least touch on that.
It would probably be better to find a different location, because it will start pulling in pthreadpool and other dependencies when you actually want the threadpool. Keeping it in the extension would mean portable ops depend on something outside of the core, but moving it within the core will complicate the build system?
Why would the header pull in threadpool? I didn't say I would move the .cpp file.
I am curious: since we have a dedicated CPU delegate for cases when we need CPU perf, why isn't enhancing that to deal with perf issues an option, as opposed to pushing perf through portable or even optimized ops? We can also explore the custom op interface if upstreaming to XNNPACK and rebasing XNNPACK is too much effort.
parallel_for isn't much worse than a for loop.
It is much better if our ops are reasonably fast by default, rather than only performing well for the specific cases we have looked at and done work on.
Forgot to mention that it is also better if we don't have to write ops multiple times (once for portable and once for optimized and/or XNNPACK).
Previously it was copied in several places per executorch_srcs.cmake. Needed for #8932 Test Plan: Compare cmake-out/executorch_srcs.cmake before/after for my usual testing cmake config with "all the CPU stuff" on; found that thread_parallel.cpp is now duplicated only in one place instead of multiple (it's in llama_runner, which needs a general fixup because it's duplicating several extensions). ghstack-source-id: 333d1c9787f197b7c9163521af19cee4753426df ghstack-comment-id: 2698715235 Pull Request resolved: #8938
Do you have a prototype of what a portable op would look like with this? Maybe anchor it with an example.
Sure, I expect
`parallel_for(0, out.numel(), internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) { ... });`
Relatedly, I expect
`parallel_for(start, end, internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) { ... });`
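For concreteness, here is a minimal sketch of what an optionally-parallel portable op body could look like under that shape. The stand-in `parallel_for` definition, `internal::GRAIN_SIZE`, and `mul2_out` are illustrative assumptions, not the actual ExecuTorch API:

```cpp
#include <cstdint>
#include <functional>

namespace internal {
constexpr int64_t GRAIN_SIZE = 32768;  // illustrative default chunk size
}

// Serial stand-in so this sketch compiles on its own; with
// extension_threadpool linked, the real parallel_for would hand disjoint
// [begin, end) chunks to worker threads instead.
inline bool parallel_for(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f) {
  (void)grain_size;  // unused in the serial stand-in
  f(begin, end);
  return true;
}

// A unary elementwise op body: each worker writes only its own slice of
// the output, so no synchronization is needed.
void mul2_out(const float* in, float* out, int64_t numel) {
  parallel_for(0, numel, internal::GRAIN_SIZE, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      out[i] = in[i] * 2.0f;  // stand-in for the real per-element math
    }
  });
}
```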
That looks ugly... A simple for loop now has three new concepts: (1) a new parallel_for util function, (2) grain_size, (3) lambda stuff. Can we do it completely inline, using macros? The loop could be replaced directly by a macro form: define FOR and END_FOR macros, switch based on a flag, and hide the grain size, lambdas, etc. underneath?
That's not better. Now you have (1) macros, which are inherently a problem, (2) a new FOR concept, and (3) a lambda that you can't see (overlaps with point (1)). The one thing we can easily get rid of is the grain size, by simply adding an overload of parallel_for that defaults it.
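A sketch of that grain-size-defaulting overload, assuming the three-argument `parallel_for` discussed above (names and the default constant are illustrative):

```cpp
#include <cstdint>
#include <functional>

namespace internal {
constexpr int64_t GRAIN_SIZE = 32768;  // illustrative default
}

// Assumed three-argument form (threaded or fallback, as discussed above).
bool parallel_for(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    const std::function<void(int64_t, int64_t)>& f);

// Hypothetical convenience overload: call sites shrink to
// parallel_for(begin, end, lambda) and the library picks the grain size.
inline bool parallel_for(
    int64_t begin,
    int64_t end,
    const std::function<void(int64_t, int64_t)>& f) {
  return parallel_for(begin, end, internal::GRAIN_SIZE, f);
}
```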
As per plan in #8932, we want to be able to include thread_parallel.h to build libraries that are *capable* of parallelization, but don't *require* it. So, we move the header to ExecuTorch core and add a fallback implementation (with tests!) of `parallel_for` that just does a regular `for` loop. Then, targets that link `extension_threadpool` will get parallelization automagically. This PR doesn't add any optionally-parallelized code; that will be in the next PR. ghstack-source-id: 4f473748293c0b7b1fb7092117db0a89d541db63 ghstack-comment-id: 2702414287 Pull Request resolved: #8983
We had a bunch of targets that would define this macro ad-hoc. It's supposed to indicate that the threadpool extension is available, so just make sure that we have it as a PUBLIC target_compile_definition in CMake and an exported_preprocessor_flags entry in Buck for extension_threadpool. Test Plan: CI on following NOCOMMIT PR, which fails builds if ET_USE_THREADPOOL is not defined in affected places. Needed for #8932.
This is step (5) of #8932. At this exact moment, this rebuild is inefficient because it rebuilds the whole portable op library, but ops don't support optional parallelization just yet. This will become less true when we roll out parallel_for support across portable ops immediately following this PR.
#9197 is the top of a stack of rollouts for reductions. After that, we need to roll out across other util functions (most notably elementwise_util) before closing.
Looking through the other util functions to finish rolling out.
The other ones are reductions. More #8932 rollout.
Portable ops with threads may be necessary (so the default out-of-box perf is not too bad), but they're not really sufficient if we care about performance. So while we do this, it doesn't practically save us from implementing arm- and/or x86-specific (which we mainly care about) SIMD + multi-threaded ops in either optimized or XNNPACK. That said, I agree that it may reduce the pressure on us to provide an optimized impl, and avoid "guaranteed bad" perf when using portable. Do we have perf uplift data from this? Curious to see how far this + autovec + out-of-order CPUs can get.
We are also going to vectorize, at least for elementwise ops. #9241
Nothing particularly concrete, but I can vouch that it goes faster.
🚀 The feature, motivation and pitch
It seems suboptimal to me that we have to create separate optimized ops just to get basic stuff like parallelization (and vectorization, but let's start with parallelization). Here's what I'd like to do: (The timeline here is "ASAP", but I'm opening an issue because this got too long for chat and so that I can point to this issue on the PRs.)

1. Properly export the `-DET_USE_THREADPOOL` macro we already use and define somewhat ad-hoc. (done; Properly export ET_USE_THREADPOOL from the threadpool extension #8947)
2. Move the thread_parallel.h header into ExecuTorch core (`runtime/kernel/thread_parallel.h`). (Yes, I will leave a stub header behind for backward compatibility.) Move thread_parallel.cpp to threadpool, since there will be no reason not to provide it when threads are available. Provide a default implementation of `parallel_for` if threadpool is not built (gated behind `ET_USE_THREADPOOL`) that is just an inlinable `for` loop. (Split & remove extension_parallel #8983)
3. Use `parallel_for` in at least one portable op, either directly or via the workhorse "util" functions. (Add basic parallel_for support to reduce_util #8986)
4. Roll out `parallel_for` across portable ops and workhorse "util" functions. (A before/after sketch of this kind of change follows below.)

Thoughts? Blockers?
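To make the rollout concrete, here is a hedged before/after sketch of the kind of change step 4 implies for a simple sum reduction. The partial-sum-plus-lock strategy, the serial stand-in `parallel_for`, and all names are illustrative assumptions, not necessarily how reduce_util does it:

```cpp
#include <cstdint>
#include <functional>
#include <mutex>

// Serial stand-in so the sketch compiles on its own; see the fallback above.
inline bool parallel_for(
    int64_t begin,
    int64_t end,
    int64_t /*grain_size*/,
    const std::function<void(int64_t, int64_t)>& f) {
  f(begin, end);
  return true;
}

// Before: a straightforward serial sum reduction.
float sum_serial(const float* in, int64_t numel) {
  float acc = 0.0f;
  for (int64_t i = 0; i < numel; ++i) {
    acc += in[i];
  }
  return acc;
}

// After: each worker reduces its own [begin, end) slice into a private
// partial sum, then folds it into the shared accumulator under a lock.
// The lock is taken once per chunk, not once per element, so contention
// stays negligible.
float sum_parallel(const float* in, int64_t numel) {
  float acc = 0.0f;
  std::mutex acc_mutex;
  parallel_for(0, numel, /*grain_size=*/32768, [&](int64_t begin, int64_t end) {
    float partial = 0.0f;
    for (int64_t i = begin; i < end; ++i) {
      partial += in[i];
    }
    std::lock_guard<std::mutex> guard(acc_mutex);
    acc += partial;
  });
  return acc;
}
```

One caveat worth noting: chunked accumulation can change floating-point rounding relative to the serial loop, since additions happen in a different order.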
Alternatives
status quo -- slow portable ops
Additional context
No response
RFC (Optional)
No response
cc @larryliu0820 @manuelcandales