
reinterpret with structs containing uint64_t or uint8_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error #4817

Open
natevm opened this issue Aug 12, 2024 · 23 comments · May be fixed by #6059
Labels: goal:client support (Feature or fix needed for a current slang user), Needs reporter feedback (Bugs awaiting more information from the reporter)

@natevm (Contributor) commented Aug 12, 2024

I’m finding that with a recent build of Slang, I can no longer use reinterpret. Even for very simple cases, I get this error: “cannot be packed into AnyValue”.

I seem to be able to repro with something like:

struct A {
    uint64_t data;
};

struct B {
    uint64_t data;
};

A a;
B b = reinterpret<B>(a);

@natevm commented Aug 12, 2024

I’ve also tried having both of these types extend a common interface but I get the same error.

This is with the slang->SPV backend. Perhaps it has something to do with me using the scalar struct layout flag? 🤔

@natevm commented Aug 12, 2024

Looks like it has something to do with structs containing "uint64_t". If I use "uint", then I don't get this compilation error.

@natevm natevm changed the title reinterpret incorrectly throws “cannot be packed into AnyValue” error reinterpret with structs containing uint64_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error Aug 12, 2024
@ArielG-NV ArielG-NV added kind:bug something doesn't work like it should goal:client support Feature or fix needed for a current slang user. and removed kind:bug something doesn't work like it should labels Aug 12, 2024
@natevm commented Aug 18, 2024

Looks like I also get this issue with reinterprets involving uint8_t.

@natevm natevm changed the title reinterpret with structs containing uint64_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error reinterpret with structs containing uint64_t or uint8_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error Aug 18, 2024
@bmillsNV bmillsNV added this to the Q3 2024 (Summer) milestone Aug 19, 2024
@natevm commented Aug 24, 2024

Looking into this a bit, it appears that the logic in the marshalling code supports 16-bit and 32-bit types, but not 8-bit or 64-bit.

https://github.com/shader-slang/slang/blob/master/source/slang/slang-ir-any-value-marshalling.cpp#L328-L338

https://github.com/shader-slang/slang/blob/master/source/slang/slang-ir-any-value-marshalling.cpp#L514-L520

    case kIROp_UInt64Type:
    case kIROp_Int64Type:
    case kIROp_DoubleType:
        return -1;
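The fix in the marshalling pass comes down to splitting a 64-bit value across two 32-bit AnyValue slots instead of returning -1. A standalone sketch of that split in plain C++ (not the actual Slang IR-building code, which emits these operations as instructions):

```cpp
#include <cstdint>

// Pack a 64-bit value into two 32-bit AnyValue-style slots, low word first,
// mirroring the shift/mask sequence the marshalling pass would emit as IR.
inline void pack_u64(uint64_t v, uint32_t slots[2])
{
    slots[0] = static_cast<uint32_t>(v & 0xFFFFFFFFu); // low 32 bits
    slots[1] = static_cast<uint32_t>(v >> 32);         // high 32 bits
}

// Reassemble the 64-bit value from the two slots.
inline uint64_t unpack_u64(const uint32_t slots[2])
{
    return static_cast<uint64_t>(slots[0]) |
           (static_cast<uint64_t>(slots[1]) << 32);
}
```

The same idea covers `int64_t` and `double` via a bit-cast to `uint64_t` first, analogous to the existing `half`-to-`uint16` bit-cast in the pass.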

@natevm commented Aug 24, 2024

If it's any help, here's the current reproducer I'm using:

RWStructuredBuffer<uint64_t> result1;
RWStructuredBuffer<uint8_t> result2;
[shader("compute")]
[numthreads(1, 1, 1)]
void main()
{
    uint8_t[8] bytes = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint64_t i64 = reinterpret<uint64_t>(bytes);
    result1[0] = i64;

    uint8_t[8] backToBytes = reinterpret<uint8_t[8]>(i64);
    result2[0] = backToBytes[0];
    result2[1] = backToBytes[1];
    result2[2] = backToBytes[2];
    result2[3] = backToBytes[3];
    result2[4] = backToBytes[4];
    result2[5] = backToBytes[5];
    result2[6] = backToBytes[6];
    result2[7] = backToBytes[7];
}

@natevm commented Aug 27, 2024

@csyonghe Fwiw, the wide BVH traversal code I’m writing jumps from 19FPS here:

BVH8Node fetchNode(RWByteAddressBuffer BVH8N, int node_index) {
    return BVH8N.Load<BVH8Node>(node_index * sizeof(BVH8Node));
};

to about 180FPS by moving to this:

BVH8NodeWide fetchNode(RWByteAddressBuffer BVH8N, int node_index) {
    //return BVH8N.Load<BVH8Node>(node_index * sizeof(BVH8Node));

    float4 n[5];
    n[0] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 0));
    n[1] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 1));
    n[2] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 2));
    n[3] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 3));
    n[4] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 4));
    BVH8NodeWide wnode = reinterpret<BVH8NodeWide>(n);
    return wnode;
};

The main difference being, BVH8NodeWide was made to contain only 32-bit types, whereas BVH8Node contains a mix of 32-bit and 8-bit types. With BVH8NodeWide, I’m doing shifts and masks to extract the bytes, versus the original BVH8Node format which exposes the quantized bytes more directly via uint8.
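The manual byte extraction described above can be sketched in plain C++ (the Slang version would operate on the loaded `uint` fields the same way):

```cpp
#include <cstdint>

// Extract the i-th byte (0 = lowest) of a 32-bit word by shifting and
// masking -- the manual equivalent of reading a uint8_t field out of a
// packed 32-bit-only node layout like BVH8NodeWide.
inline uint32_t extract_byte(uint32_t word, int i)
{
    return (word >> (8 * i)) & 0xFFu;
}
```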

I’m a bit shocked by how large of a perf delta that is. Is that expected?

@csyonghe (Collaborator):
Yes, LoadAligned is known to have big performance benefits.

@natevm commented Aug 31, 2024

@expipiplus1 perhaps since I’ve got a good reproducer here, I should be assigned this issue? It seems like something I can probably fix.

@csyonghe (Collaborator) commented Sep 2, 2024

@natevm we definitely appreciate your contribution!

@csyonghe csyonghe assigned natevm and unassigned expipiplus1 Sep 2, 2024
@natevm commented Sep 2, 2024

@csyonghe Would it be possible to tap into bitfieldInsert and bitfieldExtract intrinsics from within Slang's IRBuilder class?

I see some implementations here in glsl.meta.slang:

__generic<let N : int>
[__readNone]
[ForceInline]
[require(cpp_cuda_glsl_hlsl_spirv, GLSL_400)]
public vector<uint, N> bitfieldInsert(vector<uint, N> base, vector<uint, N> insert, int offset, int bits)
{
    __target_switch
    {
    case glsl: __intrinsic_asm "bitfieldInsert";
    case spirv: return spirv_asm {
            result:$$vector<uint, N> = OpBitFieldInsert $base $insert $offset $bits
        };
    default:
        vector<uint, N> result;
        [ForceUnroll]
        for (int i = 0; i < N; ++i)
        {
            result[i] = bitfieldInsert(base[i], insert[i], offset, bits);
        }
        return result;
    }
}

[__readNone]
[ForceInline]
[require(cpp_cuda_glsl_hlsl_spirv, GLSL_400)]
public uint bitfieldExtract(uint value, int offset, int bits)
{
    __target_switch
    {
    case glsl: __intrinsic_asm "bitfieldExtract";
    case spirv: return spirv_asm {
            result:$$uint = OpBitFieldUExtract $value $offset $bits
        };
    default:
        return (value >> offset) & ((1u << bits) - 1);
    }
}

I've found these to be much simpler and more efficient than the shifts and masks that are otherwise required for general bitfield manipulation.
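For reference, the semantics of the two intrinsics' fallback paths can be modeled in plain C++ (assuming 0 < bits < 32, since shifting a 32-bit value by 32 is undefined in C++; the GLSL/SPIR-V intrinsics handle that edge differently):

```cpp
#include <cstdint>

// Scalar fallback matching the default case of the glsl.meta.slang
// bitfieldExtract above: pull `bits` bits starting at `offset`.
inline uint32_t bitfieldExtract(uint32_t value, int offset, int bits)
{
    return (value >> offset) & ((1u << bits) - 1u);
}

// Scalar fallback for bitfieldInsert: overwrite the `bits`-wide field at
// `offset` in `base` with the low bits of `insert`, leaving the rest intact.
inline uint32_t bitfieldInsert(uint32_t base, uint32_t insert, int offset, int bits)
{
    uint32_t mask = ((1u << bits) - 1u) << offset;
    return (base & ~mask) | ((insert << offset) & mask);
}
```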

@natevm commented Sep 2, 2024

The fallback there is effectively what Slang's doing here when packing uint16_t:

if (fieldOffset < static_cast<uint32_t>(anyValInfo->fieldKeys.getCount()))
{
    auto srcVal = builder->emitLoad(concreteVar);
    if (dataType->getOp() == kIROp_HalfType)
    {
        srcVal = builder->emitBitCast(builder->getType(kIROp_UInt16Type), srcVal);
    }
    srcVal = builder->emitCast(builder->getType(kIROp_UIntType), srcVal);
    auto dstAddr = builder->emitFieldAddress(
        uintPtrType,
        anyValueVar,
        anyValInfo->fieldKeys[fieldOffset]);
    auto dstVal = builder->emitLoad(dstAddr);
    if (intraFieldOffset == 0)
    {
        dstVal = builder->emitBitAnd(
            dstVal->getFullType(), dstVal,
            builder->getIntValue(builder->getUIntType(), 0xFFFF0000));
    }
    else
    {
        srcVal = builder->emitShl(
            srcVal->getFullType(), srcVal,
            builder->getIntValue(builder->getUIntType(), 16));
        dstVal = builder->emitBitAnd(
            dstVal->getFullType(), dstVal,
            builder->getIntValue(builder->getUIntType(), 0xFFFF));
    }
    dstVal = builder->emitBitOr(dstVal->getFullType(), dstVal, srcVal);
    builder->emitStore(dstAddr, dstVal);
}
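The effect of the IR sequence above, modeled as a plain C++ function rather than emitted instructions: a 16-bit value is merged into either half of a 32-bit AnyValue field while the other half is preserved.

```cpp
#include <cstdint>

// Plain-C++ model of the IR sequence above: store a 16-bit value into the
// low half (intraFieldOffset == 0) or high half (otherwise) of a 32-bit
// AnyValue field, preserving the other half.
inline uint32_t store_u16_half(uint32_t dstVal, uint16_t srcVal, int intraFieldOffset)
{
    uint32_t src = srcVal;
    if (intraFieldOffset == 0)
        return (dstVal & 0xFFFF0000u) | src;         // keep high, write low
    else
        return (dstVal & 0x0000FFFFu) | (src << 16); // keep low, write high
}
```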

@csyonghe (Collaborator) commented Sep 2, 2024

I think we just need to extend the cases in slang-ir-any-value-marshalling.cpp and make sure uint8 is handled properly. It should be simple and clean additions, no need to involve other files or passes.

@natevm commented Sep 2, 2024

To handle uint8 types efficiently for high performance code, the proper solution is to use bitfield insertion and extraction, not shifting and masking.

@natevm commented Sep 2, 2024

For the moment, I'm thinking to fix this issue I'll match what's currently done. It will be very slow, but it will be correct.

From there, we should really consider refactoring this code to use more modern bitfield intrinsics. But that can probably go into a separate "performance feature" PR.

@csyonghe (Collaborator) commented Sep 2, 2024

If you want to implement that, you will need to change glsl.meta.slang to map bitFieldExtract/Insert to its own Slang IR instruction, and implement emit logic to map that instruction to all targets. This is definitely doable, but it is more work. I am not sure it "will be very slow", given that this is just a simple peephole optimization and instruction selection if the hardware has dedicated support for bit field extracts.

@csyonghe (Collaborator) commented Sep 2, 2024

I highly doubt there will be any noticeable performance difference, because shifting and masking is the default approach when people write HLSL and GLSL code, and it should be handled very well by LLVM and any downstream compiler infrastructure.

@natevm commented Sep 2, 2024

Profiling this traversal code written in Slang, using SPIR-V shifts and masks vs direct bitfield insertion and extraction, I can demonstrate substantial performance improvements. In CUDA we do this very frequently, but I'm suspecting higher level shader languages aren't as used to "ninja optimizations" which really do matter for large scale workloads.

Actually, let me get some numbers for you to really prove that point.

@csyonghe (Collaborator) commented Sep 2, 2024

OK, then we need to:

  1. Add an IR opcode for bitFieldExtract/Insert in slang-ir-inst-defs.h
  2. Add related struct definitions in slang-ir-inst.h
  3. Add IRBuilder::emitBitFieldExtract().
  4. In mightHaveSideEffects, return false for these opcodes.
  5. Modify the existing glsl.meta.slang intrinsics to map to these insts with __intrinsic_op()
  6. Implement the logic in CLikeCodeEmitter for all textual targets.
  7. Implement the logic in slang-emit-spirv to emit the corresponding SPIR-V inst.
  8. Modify the anyvalue marshalling pass to use these opcodes instead of doing masking and shifting.

@natevm commented Sep 2, 2024

Just to be a bit more data driven, in this case, I appear to go from ~160FPS to 205FPS by switching from shifting / masking towards a bitfield extraction op.

The primary difference is in commenting out these lines of code enabling the intrinsic, and instead always reverting back to default behavior:

[ForceInline] 
uint bfe(uint value, int offset, int bits)
{
    __target_switch
    {
    // case glsl: __intrinsic_asm "bitfieldExtract";
    // case spirv: return spirv_asm {
    //     result:$$uint = OpBitFieldUExtract $value $offset $bits
    // };
    default:
        return (value >> offset) & ((1u << bits) - 1);
    }
}

[ForceInline] 
uint3 bfe(uint3 value, int offset, int bits)
{
    __target_switch
    {
    // case glsl: __intrinsic_asm "bitfieldExtract";
    // case spirv: return spirv_asm {
    //     result:$$uint3 = OpBitFieldUExtract $value $offset $bits
    // };
    default:
        uint3 result;
        [ForceUnroll]
        for (int i = 0; i < 3; ++i) result[i] = bfe(value[i], offset, bits);
        return result;
    }
}

Then, here's a top-down breakdown of the pieces of this "intersect node" routine which is using the bitfield extraction, as well as the instruction mix (using bitfields to extract bytes from a 32-bit field of quantized child KDOP planes):
[screenshots: top-down breakdown and instruction mix]

So in the summary above, I'm primarily bottlenecked by a MAD instruction (scaling the quantized planes by inverse ray dir / multinode bounds, then translating by ray origin.). But then from there, the second dominant instruction type come from bit manipulation from the shifting and masking.

Now, here's after I switch over to using the bitfield extraction intrinsics (now getting 205FPS):
[screenshot: top-down breakdown with bitfield extraction]

And then the updated instruction mix:
[screenshot: updated instruction mix]

@natevm commented Sep 2, 2024

I find it a bit weird that certain function calls do or do not show up in Nsight Graphics. This might also be a special case where these ninja-level optimizations matter more. There might also be something going awry in the compilation that I'm not able to see.

But beyond performance, I also think the more official bitfield insertion / extraction would clean the code up a lot which would reduce tech debt, and also be useful more generally.

So, in short, I think it's worth considering looking into.

> OK, then we need to:
>
>   1. Add an IR opcode for bitFieldExtract/Insert in slang-ir-inst-defs.h
>   2. Add related struct definitions in slang-ir-inst.h
>   3. Add IRBuilder::emitBitFieldExtract().
>   4. In mightHaveSideEffects, return false for these opcodes.
>   5. Modify the existing glsl.meta.slang intrinsics to map to these insts with __intrinsic_op()
>   6. Implement the logic in CLikeCodeEmitter for all textual targets.
>   7. Implement the logic in slang-emit-spirv to emit the corresponding SPIR-V inst.
>   8. Modify the anyvalue marshalling pass to use these opcodes instead of doing masking and shifting.

@natevm commented Sep 3, 2024

Ok, I have 0->6 done. Working through step 7. Seems straightforward enough.

@natevm commented Sep 3, 2024

@csyonghe could I get some clarifications for 6 and 7?

If I follow the compiler logic right, __intrinsic_op() is an attribute that can only be assigned to a declaration, and not a function definition like those currently in glsl.meta.slang

When emitting SPIR-V, slang-emit-spirv.cpp catches the KIROp_BitfieldExtract, and then emits the corresponding SPIR-V instruction. I have this working.

But we need to ensure this has a proper fallback path for non-SPIRV targets. Since the current bitfieldExtract intrinsic is only available when users supply a flag enabling glsl intrinsics, putting the intrinsic op in glsl.meta.slang causes a regression for other HLSL codes where bit field insertion/extraction are emitted during a reinterpret.

Therefore, I’ve moved the intrinsic op for bit fields into core.meta.slang. This seems to have the effect that any C-like codes have access to the intrinsic. But from there I’m a bit confused about what the logic for 6 should be.

If I have CLikeSourceEmitter emit a one-to-one “bitfieldExtract(op0, op1, op2)”, then SPIRV correctly picks this up as an intrinsic to convert to SPIRV. But when targeting HLSL, bitfieldExtract has no definition, and I understandably cannot add one because intrinsic_op() prevents me from doing so. So I hit this catch 22 situation.

Alternatively I could bring over the function definitions currently in glsl.meta.slang, then rename the function marked as intrinsic_op() to __bitfieldExtract(op0, op1, op2). Then theoretically CLikeSourceEmitter could generate a bitfieldExtract wrapper function call, which contains the fallback logic for targets which are missing the intrinsic (mainly HLSL).

But then i’m a bit confused about the interaction between the INST line in slang-ir-inst-defs.h, CLikeSourceEmitter and intrinsic_op(). Does INST map to CLikeSourceEmitter first for SPIR-V targets? Or should I treat CLikeSourceEmitter as the fallback logic when an intrinsic isn’t handled by a more specialized emitter like slang-emit-spirv?

I’m assuming I shouldn’t be emitting DXIL in the same way that we emit SPIR-V. But then I’m not sure how to add a fallback intrinsic op to HLSL. In some cases, like the custom wave intrinsics that Slang adds on top of HLSL, there appears to be a full pass over the code to handle those. But that seems like a heavy-handed solution. I could do that too, but I don’t want to if there’s a more trivial solution.

@csyonghe (Collaborator) commented Sep 5, 2024

__intrinsic_op maps a Slang function onto a Slang IR opcode that you defined with the INST macro.

All Slang IR opcodes need to be implemented by the emitter to define how they map to target code. For SPIR-V and GLSL, this trivially maps to the SPIR-V op or GLSL intrinsic.

For HLSL, because there isn't a corresponding intrinsic in HLSL, we have to implement the emit logic by emitting bit shifts and masking. You can directly implement it in CLikeSourceEmitter for the kIROp_BitfieldExtract opcode using shifting and masking, and that will be the fallback case if a derived SourceEmitter doesn't provide an override for that inst's emit logic.

For GLSL, you want to override the logic to emit the GLSL intrinsic.
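The textual fallback described above amounts to printing a shift-and-mask expression. A hypothetical sketch (the function name and string-based interface are illustrative, not the actual Slang emitter API, which writes through a source-writer object):

```cpp
#include <string>

// Hypothetical sketch: render the HLSL fallback for kIROp_BitfieldExtract
// as plain shift-and-mask source text, given the already-emitted operand
// expressions. Parenthesized defensively so it composes in any context.
std::string emitBitfieldExtractFallback(const std::string& value,
                                        const std::string& offset,
                                        const std::string& bits)
{
    return "((" + value + " >> " + offset + ") & ((1u << " + bits + ") - 1))";
}
```

A derived emitter (e.g. for GLSL or SPIR-V) would skip this path and emit the native intrinsic instead.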
