The following program multiplies two BF16 matrices on an H100 and reaches roughly 600 TFLOP/s, a throughput that is impossible without TensorCores (BF16 matrix multiplication is TensorCore-eligible on the H100). Nevertheless, the GPU kernel stats page of the JAX profiler states that the op is not TensorCore-eligible and that no TensorCores are used, and the framework op stats page likewise marks the op as not TensorCore-eligible.
Again, such high throughput is impossible without TensorCores. The 600 TFLOP/s figure is one I measured manually, and it matches the value shown in the profiler's graph viewer.
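A minimal sketch of how such a throughput measurement can be made (the shapes, iteration count, and timing loop here are illustrative, not the exact values from the program above):

```python
import time

import jax
import jax.numpy as jnp

# Illustrative problem size; the actual program may use different shapes.
n = 8192
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(key_a, (n, n), dtype=jnp.bfloat16)
b = jax.random.normal(key_b, (n, n), dtype=jnp.bfloat16)

matmul = jax.jit(lambda x, y: x @ y)
matmul(a, b).block_until_ready()  # warm up so compilation is not timed

iters = 100
start = time.perf_counter()
for _ in range(iters):
    out = matmul(a, b)
out.block_until_ready()  # wait for async dispatch to finish before stopping the clock
elapsed = time.perf_counter() - start

# A dense (n, n) x (n, n) matmul performs 2 * n^3 FLOPs.
print(f"{2 * n**3 * iters / elapsed / 1e12:.1f} TFLOP/s")
```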
On a related note, for the FLOP/s utilization it is unclear to me whether the peak FLOP/s used in the calculation is correct; the peak differs between BF16 and TF32.
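To make the concern concrete, here is the arithmetic, assuming the H100 SXM datasheet peaks (dense, without sparsity; the PCIe variant is lower):

```python
# Dense (non-sparse) H100 SXM Tensor Core peaks from NVIDIA's datasheet.
PEAK_BF16 = 989.0  # TFLOP/s
PEAK_TF32 = 494.5  # TFLOP/s

achieved = 600.0  # measured TFLOP/s

print(f"utilization vs BF16 peak: {achieved / PEAK_BF16:.0%}")  # ~61%, plausible
print(f"utilization vs TF32 peak: {achieved / PEAK_TF32:.0%}")  # ~121%, impossible
```

So if the profiler divides by the wrong peak, the reported utilization would be off by roughly a factor of two.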