
Add Bitfield Insert and Extract Intrinsics Proposal #316

Closed
natevm wants to merge 5 commits

Conversation


@natevm (Author) commented Sep 11, 2024

This PR adds a tentative proposal to add bitfield insertion and extraction intrinsics to HLSL, enabling certain low-level optimizations involving quantized structures in high-performance code. This proposal would address the issue here: microsoft/DirectXShaderCompiler#6902

In theory, adding these intrinsics should be straightforward enough that I could submit a small PR for it myself. DXIL already has instructions for these intrinsics; they're just currently inaccessible from HLSL, so I'd be adding some functions to allow HLSL users to tap into them.

@damyanp suggested I first submit a spec proposal here to get some feedback from a language perspective before I dive into a potential PR to DXC.

Sketching up spec proposal draft for bitfield insertion / extraction intrinsics.
@damyanp (Member) left a comment


Thank you for this proposal! I realize I should have pointed you to this document as well: https://github.com/microsoft/hlsl-specs/blob/main/docs/Process.md

This review has a couple of comments. I'll work on getting some clearer feedback about what we'd need to do before completing this PR and progressing to getting the detailed design started.

Review comments on proposals/NNNN-bitfield-insert-extract.md (outdated, resolved).
@llvm-beanz (Collaborator) left a comment


My immediate gut reaction here is that HLSL has bitfields, so why don't we generate these instructions for bitfields instead of adding user-exposed APIs.

That seems like the better language approach to me.

@natevm (Author) commented Sep 11, 2024

My immediate gut reaction here is that HLSL has bitfields, so why don't we generate these instructions for bitfields instead of adding user-exposed APIs.

That seems like the better language approach to me.

Struct bitfields would definitely benefit from these instructions. I would have thought those were already mapped to these intrinsics.

In short, we need these bitfield intrinsics to efficiently manipulate the bits of an integer, namely where compile-time C++-style bitfields are too constraining.

Struct bitfields are “locked in” at compilation time, whereas these intrinsics allow for “bits” and “offset” parameters that might only be known at runtime. There's also the sign-extension functionality in signed bitfield extract, which we use to quickly generate masks of subsets of bits for stack-entry manipulation (e.g., masking out processed items within a stack entry found using firstbithigh). I'll provide some example code in the proposal text to illustrate this.
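To illustrate the sign-extension point, here is a minimal sketch assuming the proposed signed bitfieldExtract with GLSL-style semantics (the function name, stackEntry, and slot are illustrative, not part of the proposal): a signed extract of a 1-bit field sign-extends to either zero or an all-ones word, which can then be used directly as a mask.

// Hedged sketch, assuming the proposed signed bitfieldExtract with GLSL-style
// sign extension; names are illustrative.
uint maskIfProcessed(uint stackEntry, uint slot)
{
    int flag = bitfieldExtract(int(stackEntry), 24 + slot, 1); // sign-extends to 0 or 0xFFFFFFFF
    return stackEntry & uint(flag);                            // keep bits only when the flag is set
}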

For my use case in Slang, we’re using these intrinsics for fast dynamic “reinterpretation” of structs of differing types. (Similar to C-style casts, or perhaps C-unions when all fields have well-defined sizes.) This works well in our SPIR-V and CUDA backends thanks to the bitfield intrinsics there, but currently I just have a slower software emulated fallback for our DXC path.

@natevm (Author) commented Sep 11, 2024

Ok, I've made some changes to the proposal. The text should now clarify that the focus here is on enabling the developer to explicitly leverage these intrinsics in their code, particularly for cases where "offset" and "bits" are not known at compilation time.

To give some intuition, I've provided a small example demonstrating how bitfield instructions can be used to manipulate a stack of items encoded within a single integer. The paper by Ylitie et al. does something similar for its 8-wide BVH traversal algorithm.

@llvm-beanz (Collaborator) left a comment


I'm not sure I understand the motivation for this outside of Slang using HLSL as a code generation target. Making HLSL a better generation target isn't really a goal of our language evolution, so I'm struggling to see how this feature aligns with the direction of HLSL.

@llvm-beanz (Collaborator) left a comment


I guess I'm also not clear why we need to surface this in the language at all. Can't we just implement an instcombine pattern in DXC to translate masks and shifts to the corresponding instructions?

@natevm (Author) commented Sep 12, 2024

I guess I'm also not clear why we need to surface this in the language at all. Can't we just implement an instcombine pattern in DXC to translate masks and shifts to the corresponding instructions?

Sorry, I'm not sure what this instcombine pattern concept is yet. (Reading up on LLVM, it looks like it's a transformation pass there?)

Perhaps this concrete example is a bit more useful to talk over. I'm not sure if you saw it in the updated proposal, so just reposting here.

uint popHexFromQueue(inout uint hexadecimalQueue) {
    // Bits 0..23 hold six 4-bit hex digits; bits 24..29 are a per-slot occupancy mask.
    uint slot = firstbitlow(bitfieldExtract(hexadecimalQueue, 24, 6));      // first occupied slot
    uint hex = bitfieldExtract(hexadecimalQueue, 4 * slot, 4);              // read that digit
    hexadecimalQueue = bitfieldInsert(hexadecimalQueue, 0xF, 4 * slot, 4);  // overwrite the consumed digit with 0xF
    hexadecimalQueue = bitfieldInsert(hexadecimalQueue, 0, 24 + slot, 1);   // clear its occupancy bit
    return hex;
}

Without the intrinsics, this is what emulating the behavior of these intrinsics would look like in HLSL:

uint popHexFromQueue(inout uint hexadecimalQueue) {
    uint slot = firstbitlow((hexadecimalQueue >> 24) & ((1u << 6) - 1u));
    uint hex = (hexadecimalQueue >> (4 * slot)) & ((1u << 4) - 1u);
    hexadecimalQueue = (hexadecimalQueue & ~(((1u << 4) - 1u) << (4 * slot))) | ((0xFu & ((1u << 4) - 1u)) << (4 * slot));
    hexadecimalQueue = (hexadecimalQueue & ~(((1u << 1) - 1u) << (24 + slot))) | ((0u & ((1u << 1) - 1u)) << (24 + slot));
    return hex;
}

Without explicit bitfield intrinsics, how can I implement the above code in HLSL and ensure that it correctly translates to Ibfe / Ubfe / Bfi?

I suppose it just seems like a tall order to ask the compiler to correctly identify these operations as bitfield ops, especially as intermediate instructions are transformed and rearranged.

@natevm (Author) commented Sep 12, 2024

I'm not sure I understand the motivation for this outside of Slang using HLSL as a code generation target. Making HLSL a better generation target isn't really a goal of our language evolution, so I'm struggling to see how this feature aligns with the direction of HLSL.

I can't speak for your vision, but personally I'd love to see HLSL continue growing as a powerful, device-agnostic, high-performance compute language. I suspect you feel the same way. Over the years, HLSL has adopted a lot of features that have increased feature parity with CUDA by quite a bit, but for truly high-performance code, I need direct access to certain low-level intrinsics.

For example, the firstbithigh intrinsic in HLSL is a critical piece in extracting "common prefixes" of space-filling-curve codes, which is foundational in modern GPU-accelerated BVH construction. Because HLSL now supports this instruction, I've been able to port my BVH construction implementations from CUDA to HLSL.

As I tried to state in the proposal, these intrinsics are very important for manipulating compressed wide stack entries when performing depth-first traversals and priority queues. Struct-style bitfields do not allow for runtime offsets or bit lengths as in the example I shared. But more broadly, bitfield intrinsics are becoming more and more important in implementing efficient data compression and decompression to reduce memory bandwidth usage and register pressure.
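As an illustration of the runtime-width point, here is a minimal sketch; the bitfieldExtract signature is the proposed one, and unpackQuantized, packedWord, slot, and bitsPerValue are hypothetical names, not part of any existing API.

// Hedged sketch: unpack the slot-th value from a word holding bitsPerValue-bit
// quantized values, where both slot and bitsPerValue are only known at runtime.
// Struct bitfields cannot express this; the proposed intrinsic can.
uint unpackQuantized(uint packedWord, uint slot, uint bitsPerValue)
{
    return bitfieldExtract(packedWord, slot * bitsPerValue, bitsPerValue);
}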

The Slang use case was just an example. I didn't mean to get caught up with that one specific case.

@llvm-beanz (Collaborator) commented

I suppose it just seems like a tall order to ask the compiler to correctly identify these operations as bitfield ops, especially as intermediate instructions are transformed and rearranged.

This is exactly the kind of problem LLVM's Instruction Combining pass handles. Intermediate instructions and ordering don't matter because it is a pattern match on the instruction DAG.

@natevm (Author) commented Sep 12, 2024

I suppose it just seems like a tall order to ask the compiler to correctly identify these operations as bitfield ops, especially as intermediate instructions are transformed and rearranged.

This is exactly the kind of problem LLVM's Instruction Combining pass handles. Intermediate instructions and ordering don't matter because it is a pattern match on the instruction DAG.

For context, I originally opened this issue here, asking how to get exactly this sort of transformation to work:
microsoft/DirectXShaderCompiler#6902

My issue was closed with the following response:

Unfortunately, there is no built-in available to directly generate these DXIL opcodes.

We could consider rephrasing this as a feature request to add an optimization to DXC to try and detect this case, but given our current priorities we're unlikely to address this ourselves, although we'd consider contributions adding it. We'd also need to understand if this is something that the driver backend might be better placed to detect and optimize.

Alternatively, maybe this is a request to add some new hlsl intrinsics that would generate these opcodes? This has a similar funding issue, but we'd also accept contributions for this.

So the logic there was that the barrier is too high to implement the transformations required to detect and emit these bitfield intrinsics automatically, and so @damyanp's suggestion was that adding explicit HLSL intrinsics would be the better route. (Personally I'm not sure how much work such a transformation would actually be to implement; I just know that it currently doesn't work, hence this issue.)

@llvm-beanz (Collaborator) commented

For context, I originally opened this issue here, asking how to get exactly this sort of transformation to work: microsoft/DirectXShaderCompiler#6902

My issue was closed with the following response:

Unfortunately, there is no built-in available to directly generate these DXIL opcodes.
We could consider rephrasing this as a feature request to add an optimization to DXC to try and detect this case, but given our current priorities we're unlikely to address this ourselves, although we'd consider contributions adding it. We'd also need to understand if this is something that the driver backend might be better placed to detect and optimize.
Alternatively, maybe this is a request to add some new hlsl intrinsics that would generate these opcodes? This has a similar funding issue, but we'd also accept contributions for this.

The conversation in that issue quickly moved away from "why is this compiler missing an optimization?" to "let's add an intrinsic". I'm not sure that took the correct turn.

Updating DXC to match an instruction pattern and generate different but still valid output is a non-language-altering change. As long as the updated code generation is an improvement by some measure, it seems like a pretty obvious win.

Adding a new intrinsic becomes a long-term support burden not just for DXC but also for any future HLSL compiler; it may break existing code if someone has defined a similarly named utility, and it has other considerations that need to be factored in.

HLSL added bitfields to the language to avoid the need to write shifting and masking code manually. There are some bugs (which we should fix!), but that was the intentional language direction. Nothing in your proposal persuades me that these intrinsics are the way people want to write HLSL rather than using native bitfields, which are simpler syntax than both the bitwise operations and the proposed intrinsics.
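For reference, HLSL's native bitfields use C-style declaration syntax with compile-time widths; a minimal sketch (field names are illustrative):

// Native HLSL struct bitfields: widths and offsets are fixed at compile time,
// and the compiler generates the masking and shifting for you.
struct PackedEntry {
    uint payload  : 24;  // 24-bit value
    uint occupied : 6;   // 6-bit occupancy mask
    uint unused   : 2;
};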

@natevm (Author) commented Sep 12, 2024

HLSL added bitfields to the language to avoid the need to write shifting and masking code manually. There are some bugs (which we should fix!), but that was the intentional language direction. Nothing in your proposal persuades me that these intrinsics are the way people want to write HLSL rather than using native bitfields, which are simpler syntax than both the bitwise operations and the proposed intrinsics.

Struct bitfields in HLSL do not allow for runtime bitfield offset or bit width. The problem is that I cannot use native struct bitfields to implement a runtime bitfield extraction / insertion with runtime offset/bitwidth parameters. It's literally impossible to do so.

If I have a shader like so:

RWStructuredBuffer<uint> result0;
RWStructuredBuffer<int>  result1;

[shader("compute")]
[numthreads(1, 1, 1)]
void main()
{
    uint X = 0xABCD;
    uint offset = result0[0];                     // offset is only known at runtime
    result1[0] = bitfieldExtract(X, offset, 4);   // extract 4 bits starting at that offset
}

this cannot be implemented with structure bitfields, as the offset is only known at runtime.

Yes, it can be done with shifts and masks, but at least today, this does not generate bitfield ops, and my issue was closed on the possibility of getting it to work.

Perhaps you could alternatively re-open that DXC issue if adding these intrinsics isn't something you're a fan of? Ultimately, the ideal solution would be for DXC to just start using the intrinsics via a transformation, since then all code would benefit from them, rather than only the few codebases explicitly using the intrinsics.

@llvm-beanz (Collaborator) commented

Struct bitfields in HLSL do not allow for runtime bitfield offset or bit width. The problem is that I cannot use native struct bitfields to implement a runtime bitfield extraction / insertion with runtime offset/bitwidth parameters. It's literally impossible to do so.

It is unclear to me that this problem is common enough or even that intrinsics are the right solution to it.

Yes, it can be done with shifts and masks, but at least today, this does not generate bitfield ops, and my issue was closed on the possibility of getting it to work.

Perhaps you could alternatively re-open that DXC issue if adding these intrinsics isn't something you're a fan of? Ultimately, the ideal solution would be for DXC to just start using the intrinsics via a transformation, since then all code would benefit from them, rather than only the few codebases explicitly using the intrinsics.

I believe that DXC not pattern-matching to generate the bitfield extract/insert instructions is a bug. We should not have closed an issue for that bug, but it is also likely not high enough priority for us to invest in fixing it at this time. Patches welcome.

@llvm-beanz (Collaborator) commented

Worth noting: bitfield extract and insert optimization is implemented in instruction selection in LLVM. DXC doesn't use instruction selection because DXIL cuts off in the middle of the compiler flow. It would generally be up to the driver compiler to provide this kind of target-specific operation fusion, and compilers are generally very good at this kind of instruction graph matching.

DXC, being nine years out of date from upstream LLVM, unfortunately lacks a great deal of modern compiler capabilities.

I'm not opposed to adding new intrinsics on principle, but I think that if we need to add a new language construct to handle something we should have a compelling use case, and I'm not sure I see a compelling use case here that can't be solved by a helper function in the user's code and fixing the missing optimization.
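To make the "helper function" suggestion concrete, here is a minimal sketch of what such a user-side utility might look like (the helper name is illustrative, and whether DXC or the driver would fuse it into a single Ubfe depends on the missing optimization discussed above):

// Hedged sketch of a user-written extract helper; not an agreed-upon API.
uint extractBits(uint value, uint offset, uint bits)
{
    // Guard the full-width case, since shifting by 32 is not well defined.
    uint mask = (bits >= 32u) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (value >> offset) & mask;
}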

@natevm (Author) commented Sep 12, 2024

At least on NVIDIA hardware, our observations with Nsight show that the driver does not successfully identify or transform these patterns into bitfield operations either. I'd guess that driver engineers are unaware that HLSL is missing this feature.

At the moment I'm mostly a bit puzzled by the (unintentional) lack of consistency here. If our vision for HLSL is to avoid exposing intrinsics in favor of language syntax, then why did we add firstbithigh, countbits, mad/fma, and the other low-level intrinsics in the first place? Just legacy reasons?

If legacy was the reason, then why was countbits added relatively recently, in SM5? Couldn't all of these have also simply been user-written utility functions? And why is the logic the reverse here with Bfi / Ubfe / Ibfe?

Ultimately, what constitutes a use case as “compelling” enough to warrant directly exposing intrinsics? Isn’t that inherently subjective? It seems like anything could be labeled as “not compelling enough”.

Beyond this, the LLVM transformation argument cuts both ways. Yes, this LLVM transformation should be supported. But that doesn't really explain why we shouldn't also expose bfe/bfi as intrinsics, or why these two things should even be in the same proposal.

As great as the thought of HLSL catching up to the latest LLVM in the near future sounds, I just don't buy it as a sound position for the HLSL community to place its bets on.

(I’d be happy to be wrong haha, but I don’t think I’m going to be, that seems like a large undertaking)

What I can realistically do is add these intrinsics to what we have today. This would follow established HLSL precedent, and it does not assume anything about the progress of updating LLVM. The technical debt is low; bitfield ops aren't going away any time soon. It improves HLSL's ability to support more low-level HPC code, and it improves interoperability with other GPU programming languages.

Perhaps if “compelling” were more rigorously defined, I could better pitch the intrinsic. However, my sense is that your mind is already made up, in which case we should probably just move onto closing the PR.

@natevm closed this Sep 12, 2024
@llvm-beanz (Collaborator) commented

First and foremost I want to say this is not the kind of discourse we should be having in this community. Getting angry and flaming on the internet is not a helpful way to conduct a professional technical discussion. Especially given that you and I both work for companies that have long histories of collaboration.

I hope that I've made clear my point is:

  1. There is a compiler bug we should track and would accept a patch for (missing optimization).
  2. If the bug is fixed, I don't see a compelling reason for the addition of these intrinsics to the language.

I apologize if my feedback has upset you.

I'm not going to respond point by point to your message, but I do want to note that SM 5 was released in 2014. The HLSL team has grown and changed a lot since then. Decisions made a decade ago based on the FXC compiler don't hold today for DXC and they don't hold in the future for Clang.

We will continue to reassess and reevaluate the decisions of the past as we evolve HLSL, and we're not going to be beholden to "how things used to be done" because we're all smarter and more experienced today than we were yesterday.

Perhaps if “compelling” were more rigorously defined, I could better pitch the intrinsic. However, my sense is that your mind is already made up, in which case we should probably just move onto closing the PR.

It is your decision to close this PR. I am one voice on a team. I provided feedback that I didn't think the proposal was compelling, but that we should address the underlying problem differently. If your response to getting feedback from one person is to withdraw the proposal, that's your decision.

@natevm (Author) commented Sep 20, 2024

With the announcement that DirectX is moving towards SPIR-V, the constraints have changed a lot on my end.

Now, I think the better move is to just write an inline SPIR-V function for specialized uses of intrinsics like these.
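As a sketch of that inline SPIR-V route, assuming DXC's [[vk::ext_instruction]] inline SPIR-V support and the core OpBitFieldUExtract opcode (203); the wrapper name is illustrative, and this applies only to the SPIR-V backend, not DXIL output:

// Hedged sketch: declare a wrapper that lowers directly to SPIR-V OpBitFieldUExtract
// when compiling with DXC's SPIR-V backend (-spirv).
[[vk::ext_instruction(/* OpBitFieldUExtract */ 203)]]
uint spirvBitfieldUExtract(uint base, uint offset, uint count);

// Usage (names illustrative): uint hex = spirvBitfieldUExtract(queue, 4 * slot, 4);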

Sorry, it wasn’t my intention to flame. If I had known DirectX was moving to SPIR-V, then I wouldn’t have opened this PR.

For 1), my issue on this bug was closed by the HLSL team. I don’t have the resources to lead that bug hunt, but hopefully as the LLVM support improves, that will naturally be on the roadmap.

@llvm-beanz (Collaborator) commented

I think the initial issue you filed was probably not correctly triaged. The issue came in with a user-support tag, which generally indicates a user question or usability issue, not a bug.

The conversation seems to have made it look like a feature request for new builtins rather than a bug report for a missing optimization.

If you'd like we can reopen the issue and track it as a request for fixing the missing optimization. Similar pattern matching of bitwise operations is implemented in this optimization pass:

https://github.com/microsoft/DirectXShaderCompiler/blob/main/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp

It shouldn't be too difficult to add the appropriate patterns (following the similar patterns already in that file); this just isn't likely to be high on our team's priority list since we're focused on implementing HLSL support in Clang.
