
reinterpret with structs containing uint64_t or uint8_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error #4817

Open
natevm opened this issue Aug 12, 2024 · 23 comments · May be fixed by #6059
Labels: goal:client support (Feature or fix needed for a current slang user), Needs reporter feedback (Bugs awaiting more information from the reporter)

@natevm (Contributor) commented Aug 12, 2024

I’m finding that with a recent build of Slang, I can no longer use reinterpret. Even for very simple cases, I get this error: “cannot be packed into AnyValue”.

I seem to be able to repro with something like:

struct A {
    uint64_t data;
};

struct B {
    uint64_t data;
};

A a;
B b = reinterpret<B>(a);

@natevm commented Aug 12, 2024

I’ve also tried having both of these types extend a common interface but I get the same error.

This is with the slang->SPV backend. Perhaps it has something to do with me using the scalar struct layout flag? 🤔

@natevm commented Aug 12, 2024

Looks like it has something to do with structs containing "uint64_t". If I use "uint", then I don't get this compilation error.

@natevm natevm changed the title reinterpret incorrectly throws “cannot be packed into AnyValue” error reinterpret with structs containing uint64_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error Aug 12, 2024
@ArielG-NV ArielG-NV added kind:bug something doesn't work like it should goal:client support Feature or fix needed for a current slang user. and removed kind:bug something doesn't work like it should labels Aug 12, 2024
@natevm commented Aug 18, 2024

Looks like I also get this issue with reinterprets involving uint8_t.

@natevm natevm changed the title reinterpret with structs containing uint64_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error reinterpret with structs containing uint64_t or uint8_t incorrectly throws “T contains fields that cannot be packed into AnyValue” error Aug 18, 2024
@bmillsNV bmillsNV added this to the Q3 2024 (Summer) milestone Aug 19, 2024
@natevm commented Aug 24, 2024

Looking into this a bit, it appears that the logic in the marshalling code supports 16-bit and 32-bit types, but not 8-bit or 64-bit.

https://github.com/shader-slang/slang/blob/master/source/slang/slang-ir-any-value-marshalling.cpp#L328-L338

https://github.com/shader-slang/slang/blob/master/source/slang/slang-ir-any-value-marshalling.cpp#L514-L520

    case kIROp_UInt64Type:
    case kIROp_Int64Type:
    case kIROp_DoubleType:
        return -1;
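The fix in the marshalling pass comes down to splitting a 64-bit value across two 32-bit AnyValue slots instead of returning -1. A standalone sketch of that split in plain C++ (not the actual Slang IR-building code, which emits these operations as instructions):

```cpp
#include <cstdint>

// Pack a 64-bit value into two 32-bit AnyValue-style slots, low word first,
// mirroring the shift/mask sequence the marshalling pass would emit as IR.
inline void pack_u64(uint64_t v, uint32_t slots[2])
{
    slots[0] = static_cast<uint32_t>(v & 0xFFFFFFFFu); // low 32 bits
    slots[1] = static_cast<uint32_t>(v >> 32);         // high 32 bits
}

// Reassemble the 64-bit value from the two slots.
inline uint64_t unpack_u64(const uint32_t slots[2])
{
    return static_cast<uint64_t>(slots[0]) |
           (static_cast<uint64_t>(slots[1]) << 32);
}
```

The same idea covers `int64_t` and `double` via a bit-cast to `uint64_t` first, analogous to the existing `half`-to-`uint16` bit-cast in the pass.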

@natevm commented Aug 24, 2024

If it's any help, here's the current reproducer I'm using:

RWStructuredBuffer<uint64_t> result1;
RWStructuredBuffer<uint8_t> result2;
[shader("compute")]
[numthreads(1, 1, 1)]
void main()
{
    uint8_t[8] bytes = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint64_t i64 = reinterpret<uint64_t>(bytes);
    result1[0] = i64;

    uint8_t[8] backToBytes = reinterpret<uint8_t[8]>(i64);
    result2[0] = backToBytes[0];
    result2[1] = backToBytes[1];
    result2[2] = backToBytes[2];
    result2[3] = backToBytes[3];
    result2[4] = backToBytes[4];
    result2[5] = backToBytes[5];
    result2[6] = backToBytes[6];
    result2[7] = backToBytes[7];
}

@natevm commented Aug 27, 2024

@csyonghe Fwiw, the wide BVH traversal code I’m writing jumps from 19FPS here:

BVH8Node fetchNode(RWByteAddressBuffer BVH8N, int node_index) {
    return BVH8N.Load<BVH8Node>(node_index * sizeof(BVH8Node));
};

to about 180FPS by moving to this:

BVH8NodeWide fetchNode(RWByteAddressBuffer BVH8N, int node_index) {
    //return BVH8N.Load<BVH8Node>(node_index * sizeof(BVH8Node));

    float4 n[5];
    n[0] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 0));
    n[1] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 1));
    n[2] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 2));
    n[3] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 3));
    n[4] = BVH8N.LoadAligned<float4>(sizeof(float4) * (node_index * 5 + 4));
    BVH8NodeWide wnode = reinterpret<BVH8NodeWide>(n);
    return wnode;
};

The main difference being, BVH8NodeWide was made to contain only 32-bit types, whereas BVH8Node contains a mix of 32-bit and 8-bit types. With BVH8NodeWide, I’m doing shifts and masks to extract the bytes, versus the original BVH8Node format which exposes the quantized bytes more directly via uint8.
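The manual byte extraction described above can be sketched in plain C++ (the Slang version would operate on the loaded `uint` fields the same way):

```cpp
#include <cstdint>

// Extract the i-th byte (0 = lowest) of a 32-bit word by shifting and
// masking -- the manual equivalent of reading a uint8_t field out of a
// packed 32-bit-only node layout like BVH8NodeWide.
inline uint32_t extract_byte(uint32_t word, int i)
{
    return (word >> (8 * i)) & 0xFFu;
}
```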

I’m a bit shocked by how large of a perf delta that is. Is that expected?

@csyonghe (Collaborator):
Yes, LoadAligned is known to have big performance benefits.

@natevm commented Aug 31, 2024

@expipiplus1 perhaps since I’ve got a good reproducer here, I should be assigned this issue? It seems like something I can probably fix.

@csyonghe (Collaborator) commented Sep 2, 2024

@natevm we definitely appreciate your contribution!

@csyonghe csyonghe assigned natevm and unassigned expipiplus1 Sep 2, 2024
@natevm commented Sep 2, 2024

@csyonghe Would it be possible to tap into bitfieldInsert and bitfieldExtract intrinsics from within Slang's IRBuilder class?

I see some implementations here in glsl.meta.slang:

__generic<let N : int>
[__readNone]
[ForceInline]
[require(cpp_cuda_glsl_hlsl_spirv, GLSL_400)]
public vector<uint, N> bitfieldInsert(vector<uint, N> base, vector<uint, N> insert, int offset, int bits)
{
    __target_switch
    {
    case glsl: __intrinsic_asm "bitfieldInsert";
    case spirv: return spirv_asm {
            result:$$vector<uint, N> = OpBitFieldInsert $base $insert $offset $bits
        };
    default:
        vector<uint, N> result;
        [ForceUnroll]
        for (int i = 0; i < N; ++i)
        {
            result[i] = bitfieldInsert(base[i], insert[i], offset, bits);
        }
        return result;
    }
}

[__readNone]
[ForceInline]
[require(cpp_cuda_glsl_hlsl_spirv, GLSL_400)]
public uint bitfieldExtract(uint value, int offset, int bits)
{
    __target_switch
    {
    case glsl: __intrinsic_asm "bitfieldExtract";
    case spirv: return spirv_asm {
            result:$$uint = OpBitFieldUExtract $value $offset $bits
        };
    default:
        return (value >> offset) & ((1u << bits) - 1);
    }
}

I've found these to be much simpler and more efficient than the shifts and masks that are otherwise required for general bitfield manipulation.
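For reference, the semantics of the two intrinsics' fallback paths can be modeled in plain C++ (assuming 0 < bits < 32, since shifting a 32-bit value by 32 is undefined in C++; the GLSL/SPIR-V intrinsics handle that edge differently):

```cpp
#include <cstdint>

// Scalar fallback matching the default case of the glsl.meta.slang
// bitfieldExtract above: pull `bits` bits starting at `offset`.
inline uint32_t bitfieldExtract(uint32_t value, int offset, int bits)
{
    return (value >> offset) & ((1u << bits) - 1u);
}

// Scalar fallback for bitfieldInsert: overwrite the `bits`-wide field at
// `offset` in `base` with the low bits of `insert`, leaving the rest intact.
inline uint32_t bitfieldInsert(uint32_t base, uint32_t insert, int offset, int bits)
{
    uint32_t mask = ((1u << bits) - 1u) << offset;
    return (base & ~mask) | ((insert << offset) & mask);
}
```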

@natevm commented Sep 2, 2024

The fallback there is effectively what Slang's doing here when packing uint16_t:

if (fieldOffset < static_cast<uint32_t>(anyValInfo->fieldKeys.getCount()))
{
    auto srcVal = builder->emitLoad(concreteVar);
    if (dataType->getOp() == kIROp_HalfType)
    {
        srcVal = builder->emitBitCast(builder->getType(kIROp_UInt16Type), srcVal);
    }
    srcVal = builder->emitCast(builder->getType(kIROp_UIntType), srcVal);
    auto dstAddr = builder->emitFieldAddress(
        uintPtrType,
        anyValueVar,
        anyValInfo->fieldKeys[fieldOffset]);
    auto dstVal = builder->emitLoad(dstAddr);
    if (intraFieldOffset == 0)
    {
        dstVal = builder->emitBitAnd(
            dstVal->getFullType(), dstVal,
            builder->getIntValue(builder->getUIntType(), 0xFFFF0000));
    }
    else
    {
        srcVal = builder->emitShl(
            srcVal->getFullType(), srcVal,
            builder->getIntValue(builder->getUIntType(), 16));
        dstVal = builder->emitBitAnd(
            dstVal->getFullType(), dstVal,
            builder->getIntValue(builder->getUIntType(), 0xFFFF));
    }
    dstVal = builder->emitBitOr(dstVal->getFullType(), dstVal, srcVal);
    builder->emitStore(dstAddr, dstVal);
}
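The effect of the IR sequence above, modeled as a plain C++ function rather than emitted instructions: a 16-bit value is merged into either half of a 32-bit AnyValue field while the other half is preserved.

```cpp
#include <cstdint>

// Plain-C++ model of the IR sequence above: store a 16-bit value into the
// low half (intraFieldOffset == 0) or high half (otherwise) of a 32-bit
// AnyValue field, preserving the other half.
inline uint32_t store_u16_half(uint32_t dstVal, uint16_t srcVal, int intraFieldOffset)
{
    uint32_t src = srcVal;
    if (intraFieldOffset == 0)
        return (dstVal & 0xFFFF0000u) | src;         // keep high, write low
    else
        return (dstVal & 0x0000FFFFu) | (src << 16); // keep low, write high
}
```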

@csyonghe (Collaborator) commented Sep 2, 2024

I think we just need to extend the cases in slang-ir-any-value-marshalling.cpp and make sure uint8 is handled properly. It should be simple and clean additions, no need to involve other files or passes.

@natevm commented Sep 2, 2024

To handle uint8 types efficiently for high performance code, the proper solution is to use bitfield insertion and extraction, not shifting and masking.

@natevm commented Sep 2, 2024

For the moment, I'm thinking to fix this issue I'll match what's currently done. It will be very slow, but it will be correct.

From there, we should really consider refactoring this code to use more modern bitfield intrinsics. But that can probably go into a separate "performance feature" PR.

@csyonghe (Collaborator) commented Sep 2, 2024

If you want to implement that, you will need to change glsl.meta.slang to map bitFieldExtract/Insert to its own Slang IR instruction, and implement emit logic to map that instruction to all targets. This is definitely doable, but it is more work. I am not sure it "will be very slow", given that this is just a simple peephole optimization and instruction selection if the hardware has dedicated support for bit field extracts.

@csyonghe (Collaborator) commented Sep 2, 2024

I highly doubt there will be any noticeable performance difference, because shifting and masking is the default approach when people write HLSL and GLSL code, and it should be handled very well by LLVM and any downstream compiler infrastructure.

@natevm commented Sep 2, 2024

Profiling this traversal code written in Slang, using SPIR-V shifts and masks vs direct bitfield insertion and extraction, I can demonstrate substantial performance improvements. In CUDA we do this very frequently, but I'm suspecting higher level shader languages aren't as used to "ninja optimizations" which really do matter for large scale workloads.

Actually, let me get some numbers for you to really prove that point.

@csyonghe (Collaborator) commented Sep 2, 2024

OK, then we need to:

  1. Add an IR opcode for bitFieldExtract/Insert in slang-ir-inst-defs.h
  2. Add related struct definitions in slang-ir-inst.h
  3. Add IRBuilder::emitBitFieldExtract().
  4. In mightHaveSideEffects, return false for these opcodes.
  5. Modify the existing glsl.meta.slang intrinsics to map to these insts with __intrinsic_op()
  6. Implement the logic in CLikeCodeEmitter for all textual targets.
  7. Implement the logic in slang-emit-spirv to emit the corresponding SPIR-V inst.
  8. Modify the anyvalue marshalling pass to use these opcodes instead of doing masking and shifting.

@natevm commented Sep 2, 2024

Just to be a bit more data driven, in this case, I appear to go from ~160FPS to 205FPS by switching from shifting / masking towards a bitfield extraction op.

The primary difference is in commenting out these lines of code enabling the intrinsic, and instead always reverting back to default behavior:

[ForceInline] 
uint bfe(uint value, int offset, int bits)
{
    __target_switch
    {
    // case glsl: __intrinsic_asm "bitfieldExtract";
    // case spirv: return spirv_asm {
    //     result:$$uint = OpBitFieldUExtract $value $offset $bits
    // };
    default:
        return (value >> offset) & ((1u << bits) - 1);
    }
}

[ForceInline] 
uint3 bfe(uint3 value, int offset, int bits)
{
    __target_switch
    {
    // case glsl: __intrinsic_asm "bitfieldExtract";
    // case spirv: return spirv_asm {
    //     result:$$uint3 = OpBitFieldUExtract $value $offset $bits
    // };
    default:
        uint3 result;
        [ForceUnroll]
        for (int i = 0; i < 3; ++i) result[i] = bfe(value[i], offset, bits);
        return result;
    }
}

Then, here's a top-down breakdown of the pieces of this "intersect node" routine which is using the bitfield extraction, as well as the instruction mix (using bitfields to extract bytes from a 32-bit field of quantized child KDOP planes):
[screenshots: top-down breakdown and instruction mix]

So in the summary above, I'm primarily bottlenecked by a MAD instruction (scaling the quantized planes by inverse ray dir / multinode bounds, then translating by ray origin.). But then from there, the second dominant instruction type come from bit manipulation from the shifting and masking.

Now, here's after I switch over to using the bitfield extraction intrinsics (now getting 205FPS):
[screenshot: top-down breakdown with bitfield extraction]

And then the updated instruction mix:
[screenshot: updated instruction mix]

@natevm commented Sep 2, 2024

I find it a bit weird that certain function calls do or do not show up in Nsight Graphics. This might also be a special case where these ninja-level optimizations matter more. There might also be something going awry in the compilation that I'm not able to see.

But beyond performance, I also think the more official bitfield insertion / extraction would clean the code up a lot which would reduce tech debt, and also be useful more generally.

So, in short, I think it's worth considering looking into.

> OK, then we need to:
>
>   1. Add an IR opcode for bitFieldExtract/Insert in slang-ir-inst-defs.h
>   2. Add related struct definitions in slang-ir-inst.h
>   3. Add IRBuilder::emitBitFieldExtract().
>   4. In mightHaveSideEffects, return false for these opcodes.
>   5. Modify the existing glsl.meta.slang intrinsics to map to these insts with __intrinsic_op()
>   6. Implement the logic in CLikeCodeEmitter for all textual targets.
>   7. Implement the logic in slang-emit-spirv to emit the corresponding SPIR-V inst.
>   8. Modify the anyvalue marshalling pass to use these opcodes instead of doing masking and shifting.

@natevm commented Sep 3, 2024

Ok, I have 0->6 done. Working through step 7. Seems straightforward enough.

@natevm commented Sep 3, 2024

@csyonghe could I get some clarifications for 6 and 7?

If I follow the compiler logic right, __intrinsic_op() is an attribute that can only be assigned to a declaration, and not a function definition like those currently in glsl.meta.slang

When emitting SPIR-V, slang-emit-spirv.cpp catches the KIROp_BitfieldExtract, and then emits the corresponding SPIR-V instruction. I have this working.

But we need to ensure this has a proper fallback path for non-SPIRV targets. Since the current bitfieldExtract intrinsic is only available when users supply a flag enabling glsl intrinsics, putting the intrinsic op in glsl.meta.slang causes a regression for other HLSL codes where bit field insertion/extraction are emitted during a reinterpret.

Therefore, I’ve moved the intrinsic op for bit fields into core.meta.slang. This seems to have the effect that any C-like codes have access to the intrinsic. But from there I’m a bit confused about what the logic for 6 should be.

If I have CLikeSourceEmitter emit a one-to-one “bitfieldExtract(op0, op1, op2)”, then SPIRV correctly picks this up as an intrinsic to convert to SPIRV. But when targeting HLSL, bitfieldExtract has no definition, and I understandably cannot add one because intrinsic_op() prevents me from doing so. So I hit this catch 22 situation.

Alternatively I could bring over the function definitions currently in glsl.meta.slang, then rename the function marked as intrinsic_op() to __bitfieldExtract(op0, op1, op2). Then theoretically CLikeSourceEmitter could generate a bitfieldExtract wrapper function call, which contains the fallback logic for targets which are missing the intrinsic (mainly HLSL).

But then i’m a bit confused about the interaction between the INST line in slang-ir-inst-defs.h, CLikeSourceEmitter and intrinsic_op(). Does INST map to CLikeSourceEmitter first for SPIR-V targets? Or should I treat CLikeSourceEmitter as the fallback logic when an intrinsic isn’t handled by a more specialized emitter like slang-emit-spirv?

I’m assuming I shouldn’t be emitting DXIL in the same way that we emit SPIR-V. But then I’m not sure how to add a fallback intrinsic op to HLSL. In some cases, like the custom wave intrinsics that Slang adds on top of HLSL, there appears to be a full pass over the code to handle those. But that seems like a heavy-handed solution. I could do that too, but I don’t want to if there’s a more trivial solution.

@csyonghe (Collaborator) commented Sep 5, 2024

__intrinsic_op maps a Slang function onto a Slang IR opcode that you defined with the INST macro.

All Slang IR opcodes need to be implemented by the emitter to define how they map to target code. For SPIR-V and GLSL, this trivially maps to the SPIR-V op or GLSL intrinsic.

For HLSL, because there isn't a corresponding intrinsic in HLSL, we have to implement the emit logic by emitting bit shifts and masking. You can directly implement it in CLikeSourceEmitter for the kIROp_BitfieldExtract opcode using shifting and masking, and that will be the fallback case if a derived SourceEmitter doesn't provide an override for that inst's emit logic.

For GLSL, you want to override the logic to emit the GLSL intrinsic.
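The textual fallback described above amounts to printing a shift-and-mask expression. A hypothetical sketch (the function name and string-based interface are illustrative, not the actual Slang emitter API, which writes through a source-writer object):

```cpp
#include <string>

// Hypothetical sketch: render the HLSL fallback for kIROp_BitfieldExtract
// as plain shift-and-mask source text, given the already-emitted operand
// expressions. Parenthesized defensively so it composes in any context.
std::string emitBitfieldExtractFallback(const std::string& value,
                                        const std::string& offset,
                                        const std::string& bits)
{
    return "((" + value + " >> " + offset + ") & ((1u << " + bits + ") - 1))";
}
```

A derived emitter (e.g. for GLSL or SPIR-V) would skip this path and emit the native intrinsic instead.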
