[SYCL][NFCI] Refactor device code split implementation once again #8833
…ric-module-splitter
wow, doing it this way is so much simpler and easier to understand. there's so much less nonsense now. thanks for doing this!
@sarnex, @asudarsa, sorry for the delay. I've rebased the PR on top of #8763 and it is now ready for review. Changes since last update:
I would like you to take another look at the PR before I merge it, to review the recent changes.
looks great to me, only nits, thanks a lot! looking forward to making use of this soon!
@sarnex, @asudarsa, hopefully this is now the final iteration and the patch will be ready to merge once CI passes. I finally figured out the root cause of the pre-commit failures. It turned out that #8763 (inadvertently, I presume) doesn't emit
llvm/llvm/tools/sycl-post-link/ModuleSplitter.h
Lines 65 to 75 in 7bdbd59
As you can see, we return
llvm/llvm/tools/sycl-post-link/ModuleSplitter.h
Lines 60 to 61 in 7bdbd59
In my PR, I "compute" the property after all splitting and merging is done, based on the actual content of the module, so the property gets set for modules containing
There were two changes since last update:
My plan is the following:
Please let me know if there are questions or concerns. @asudarsa, it would be especially good to hear feedback from you, because the PR touches the work you recently did on propagating compilation options to backends.
    ::sycl::kernel_props::ATTR_LARGE_GRF, "large-grf");
Categorizer.registerListOfIntegersInMetadataSortedRule("sycl_used_aspects");
Categorizer.registerListOfIntegersInMetadataRule("reqd_work_group_size");
Categorizer.registerSimpleStringAttributeRule(
Looks good. Thanks
Overall looks good to me. Thanks
new changes lgtm also, thanks.
in my experience invoke_simd is very sensitive to the environment, so I'm not surprised changing the optlevel causes an issue. dropping the flag and making a bug for the gpu people makes sense, I'll email you who to assign it to
@AlexeySachkov Thanks again for doing this! I'm going to use this for some work I'm doing immediately!
Apologies in advance for a not-so-small PR (or rather PR description?).
The PR is marked as NFCI because no functional changes are intended, but I'm not 100% sure whether there are corner cases where behavior changes.
Intro
This is a refactoring of how we perform device code split in sycl-post-link, which is intended to solve several existing issues with the current implementation:
A bit more context about the issues above:
(1) Increased peak RAM consumption is caused by the fact that we currently keep all splits in memory, even though we could process them one by one and discard each one as soon as it has been stored to disk. This was implemented as a memory consumption optimization in #5021, but it got accidentally reverted in #7302 in an attempt to work around (2).
(2) is pretty much summarized in our source code:
llvm/llvm/tools/sycl-post-link/sycl-post-link.cpp
Lines 806 to 811 in afebb25
(3) is caused by a bad implementation decision made in #7302: because every split is now identified by a hash, every time you add a new split "dimension" (a new feature to account for), it results in different hashes for existing tests. Just look at how many unrelated tests had to be updated in #7512, #8056 and #8167.
Now to the PR itself:
It introduces a new infrastructure for categorizing/grouping kernel functions: instead of using hashes, we now build a string description for each kernel function and then group kernels with the same description string together.
The string description is built by a new entity: it accepts a set of rules, where each rule is a simple function which returns a string for the passed llvm::Function. The results of all rules are concatenated together, and rules are invoked in the stable order of their registration.
There is a simple API for building those rules. It provides some predefined rules for the most popular use cases, like turning a function attribute or a piece of metadata into a string descriptor for the function. It is also possible to pass a custom callback to implement more complicated logic.
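The rule-based grouping described above can be sketched roughly as follows. This is only an illustration and not the code from the PR: KernelStub is a hypothetical stand-in for llvm::Function, CategorizerSketch, describe and groupKernels are made-up names, and the signature given here to registerSimpleStringAttributeRule is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for llvm::Function: a name plus string attributes.
struct KernelStub {
  std::string Name;
  std::map<std::string, std::string> Attrs;
};

// Illustrative categorizer: each rule maps a kernel to a string, and the
// description is the concatenation of all rule results in registration order.
class CategorizerSketch {
public:
  using Rule = std::function<std::string(const KernelStub &)>;

  // Register an arbitrary callback for more complicated logic.
  void registerRule(Rule R) {
    assert(R && "rule must be callable");
    Rules.push_back(std::move(R));
  }

  // Predefined rule (assumed signature): use the value of a string
  // attribute, or an empty string if the attribute is absent.
  void registerSimpleStringAttributeRule(const std::string &AttrName) {
    registerRule([AttrName](const KernelStub &K) {
      auto It = K.Attrs.find(AttrName);
      return It == K.Attrs.end() ? std::string() : It->second;
    });
  }

  std::string describe(const KernelStub &K) const {
    std::string Result;
    for (const Rule &R : Rules)
      Result += "[" + R(K) + "]"; // brackets keep adjacent fields distinct
    return Result;
  }

private:
  std::vector<Rule> Rules; // a vector preserves registration order
};

// Group kernel names by description. std::map keeps its keys sorted, so the
// order of groups does not depend on the order in which kernels are visited.
std::map<std::string, std::vector<std::string>>
groupKernels(const CategorizerSketch &C, const std::vector<KernelStub> &Ks) {
  std::map<std::string, std::vector<std::string>> Groups;
  for (const KernelStub &K : Ks)
    Groups[C.describe(K)].push_back(K.Name);
  return Groups;
}
```

In this sketch, adding a new split dimension amounts to one more register call, and existing descriptions only gain one extra bracketed field instead of changing unpredictably the way a hash would.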
How does this PR help with issues above?
(1) and (2) are fixed in conjunction: sycl-post-link was refactored to avoid storing more than one split module at a time, which is possible because the PR unifies the per-scope and optional-kernel-features splitters into a single generic splitter. The new API for kernel categorization seems flexible enough to provide that infrastructure, so the merged splitters still look OK code-wise.
(3) is fixed by using string identifiers instead of hashes, as well as by using a data structure which sorts identifiers.
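A rough sketch of the "one split at a time" idea behind (1) and (2), assuming the grouping of kernels is cheap to keep around while materialized split modules are not. SplitStub, LazySplitterSketch and processAll are hypothetical names used only for illustration, and the hasMoreSplits/nextSplit interface is an assumption about the shape of such a splitter rather than a quote from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a split module: just the kernel names it holds.
struct SplitStub {
  std::string Description;
  std::vector<std::string> Kernels;
};

// Illustrative splitter: it stores only the lightweight grouping and
// materializes one split per nextSplit() call.
class LazySplitterSketch {
public:
  explicit LazySplitterSketch(
      std::map<std::string, std::vector<std::string>> G)
      : Groups(std::move(G)), Next(Groups.begin()) {}

  // Copying would leave Next pointing into the source object's map.
  LazySplitterSketch(const LazySplitterSketch &) = delete;
  LazySplitterSketch &operator=(const LazySplitterSketch &) = delete;

  bool hasMoreSplits() const { return Next != Groups.end(); }

  SplitStub nextSplit() {
    assert(hasMoreSplits() && "no splits left");
    SplitStub S{Next->first, Next->second};
    ++Next;
    return S;
  }

private:
  std::map<std::string, std::vector<std::string>> Groups;
  std::map<std::string, std::vector<std::string>>::const_iterator Next;
};

// Process splits one by one and return how many were processed. Each split
// is destroyed at the end of its loop iteration, before the next one is made.
template <typename SaveFn>
std::size_t processAll(LazySplitterSketch &Splitter, SaveFn Save) {
  std::size_t Count = 0;
  while (Splitter.hasMoreSplits()) {
    SplitStub S = Splitter.nextSplit(); // only this split is materialized
    Save(S);                            // e.g. write it to disk
    ++Count;
  }
  return Count;
}
```

Because the groups live in a sorted map, the splits in this sketch also come out in a deterministic order, which is the property that (3) relies on.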
Any other benefits from this PR?
About 50 lines of code less to support :)
Extending device code split for more optional features would be even easier than it is now: instead of making several changes in various places around the UsedOptionalFeatures structure, it will be just a matter of adding 1-3 lines of code. Please also note that UsedOptionalFeatures contains tons of inconsistencies in its implementation, which will all be gone with this PR: in operator== we don't use the hash and instead compare certain fields directly (and we do miss some of them); the generateModuleName method skips some of the optional features and ignores them.
Cross-module device_global usage checks should now work at all split dimensions (except for ESIMD).
Any potential downsides?
With the current UsedOptionalFeatures it is possible to embed various information (used aspects, the large-grf flag, etc.) directly during device code split to avoid re-gathering that information later when we generate properties. With the suggested approach, that would be harder to do, because it doesn't seem to fit naturally into the proposed infrastructure: see the changes I made around large-grf in this PR.
However, we have never actually implemented this, and re-querying some metadata from a function doesn't seem like a bottleneck, so it should really be a very minor and only theoretical downside.