In some situations, such as an int8 dot product, we want to accumulate into an accumulator with a higher bit width than the inputs. How do we go about supporting this in a sane and logical way? Currently the system is very simple: the output type is always the same as the input type (T in == T out). If we want to start accumulating into, say, a u32, this becomes considerably harder...
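One possible direction, sketched here only as an illustration: let each element type name its own accumulator type through an associated type, so that T in == T out becomes the special case where the accumulator is T itself. The trait `Accumulate`, its `widen` method, the free function `dot`, and the specific pairings (i8 to i32, u8 to u32, f32 to f32) are all hypothetical names and choices made for this sketch, not part of the existing system.

```rust
use core::ops::{Add, Mul};

/// Hypothetical trait: maps an element type to the type used to accumulate it.
pub trait Accumulate: Copy {
    /// Accumulator type; may be wider than `Self`.
    type Acc: Copy + Default + Add<Output = Self::Acc> + Mul<Output = Self::Acc>;

    /// Losslessly convert one element into the accumulator type.
    fn widen(self) -> Self::Acc;
}

// Assumed pairings for illustration: 8-bit integers accumulate in 32 bits.
impl Accumulate for i8 {
    type Acc = i32;
    fn widen(self) -> i32 {
        i32::from(self)
    }
}

impl Accumulate for u8 {
    type Acc = u32;
    fn widen(self) -> u32 {
        u32::from(self)
    }
}

// Types that keep today's behaviour simply accumulate in themselves.
impl Accumulate for f32 {
    type Acc = f32;
    fn widen(self) -> f32 {
        self
    }
}

/// Dot product whose return type is chosen by the element type.
pub fn dot<T: Accumulate>(a: &[T], b: &[T]) -> T::Acc {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter()
        .zip(b)
        // Widen before multiplying so the product cannot overflow the input type.
        .fold(<T::Acc as Default>::default(), |acc, (&x, &y)| {
            acc + x.widen() * y.widen()
        })
}
```

With this shape, `dot(&[127i8; 4], &[127i8; 4])` returns an `i32`, a value that would not fit in an `i8`, while `dot` over `f32` slices still returns `f32`. The trade-off is that the accumulator is fixed per input type; letting the caller pick it would need a second type parameter instead.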