Wrong metrics for FLOP utilization and TensorCore usage when using bfloat16 #1712

@emergenz

Description

The following program performs a matrix multiplication of two BF16 matrices on an H100 and achieves roughly 600 TFLOP/s, which is impossible without using TensorCores; BF16 matrix multiplication is TensorCore-eligible on the H100. Nevertheless, the GPU kernel stats page of the JAX profiler incorrectly states that the op is not TensorCore-eligible and that no TensorCores are used, and the framework op stats page likewise incorrectly identifies the op as not TensorCore-eligible.
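The original program is not preserved in this capture. A minimal sketch of an equivalent reproduction might look like the following; the matrix size, seed, and trace directory are assumptions, not taken from the issue:

```python
# Hypothetical reproduction sketch: profile a BF16 matmul with the JAX profiler.
# The 2048x2048 size and /tmp trace path are illustrative assumptions.
import jax
import jax.numpy as jnp

@jax.jit
def bf16_matmul(a, b):
    # Plain matmul; on an H100, XLA lowers BF16 matmuls to TensorCore kernels.
    return a @ b

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (2048, 2048), dtype=jnp.bfloat16)
b = jax.random.normal(key, (2048, 2048), dtype=jnp.bfloat16)

# Capture a trace viewable in TensorBoard's profiler plugin.
with jax.profiler.trace("/tmp/jax-trace"):
    c = bf16_matmul(a, b)
    c.block_until_ready()

print(c.dtype, c.shape)
```

Inspecting the resulting trace in the profiler's GPU kernel stats page is where the incorrect "not TensorCore-eligible" classification appears.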

Again, such high throughput is impossible without TensorCores. (The throughput quoted above was measured manually by me and matches the number listed in the graph viewer.)

On a slightly different note, it is unclear to me whether the peak FLOP/s used to compute FLOP utilization is correct; the peak should differ between BF16 and TF32.
