[Q][GPU][BF16] torch.mul is lowered to HLO as an f32 multiply #8545
Comments
I tried the same thing using autocast, and it seems to work as you expect: with autocast enabled, the multiply in the resulting HLO stays in bf16. Below is the code to replicate. Can you try replicating and confirm?
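The commenter's exact script isn't shown above; the following is a minimal sketch of that replication, assuming `torch.autocast` with device_type `"xla"` (as in the torch_xla AMP docs) and the internal `torch_xla._XLAC._get_xla_tensors_hlo` helper to dump the lowered HLO:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Same setup as the original report: both operands are bf16.
a = torch.randn(4, 4, dtype=torch.bfloat16, device=device)
b = torch.randn(4, 4, dtype=torch.bfloat16, device=device)

# Wrap the op in autocast targeting the XLA device.
with torch.autocast("xla", dtype=torch.bfloat16):
    c = torch.mul(a, b)

# Dump the lowered HLO and check the dtype of the multiply.
print(torch_xla._XLAC._get_xla_tensors_hlo([c]))
```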
This is actually expected behavior. In fact, PyTorch CUDA also does the same thing. In summary, PyTorch converts each operand (stored as bf16) to f32, performs the multiplication in f32, and rounds the result back to bf16.
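A quick way to see that "compute in f32, store in bf16" pattern from eager PyTorch (a sketch for illustration; the internal f32 promotion is an implementation detail, so treat the equivalence check as indicative rather than guaranteed):

```python
import torch

a = torch.randn(1024, dtype=torch.bfloat16)
b = torch.randn(1024, dtype=torch.bfloat16)

eager = torch.mul(a, b)                              # bf16 elementwise mul
manual = (a.float() * b.float()).to(torch.bfloat16)  # explicit f32 multiply, rounded back to bf16

print(torch.equal(eager, manual))  # expected: True
```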
❓ Questions and Help
- torch 2.5.1
- torch_xla 2.5.1
- CUDA 12.4
- GPU: NVIDIA L4
The following example uses `torch.mul` where both operands are bf16, but in the HLO graph I see an f32 multiply operation.

HLO: module_0000.SyncTensorsGraph.16.before_optimizations.txt
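The reporter's script itself isn't included above; a minimal sketch matching the description (bf16 operands, HLO dumped via the internal `torch_xla._XLAC._get_xla_tensors_hlo` helper) might look like:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm

device = xm.xla_device()

a = torch.randn(4, 4, dtype=torch.bfloat16, device=device)
b = torch.randn(4, 4, dtype=torch.bfloat16, device=device)

c = torch.mul(a, b)  # both operands bf16

# Per the report, the dumped HLO shows convert-to-f32, an f32
# multiply, then convert back to bf16.
print(torch_xla._XLAC._get_xla_tensors_hlo([c]))
```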
I was able to get a bf16 multiplication by setting `export XLA_USE_BF16=1`, but I received a deprecation warning. I'm not sure how to enable bf16 multiplication in the HLO (High-Level Optimizer) the correct way, without using the deprecated flag.
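For reference, a sketch of that deprecated approach (the environment variable is read when torch_xla initializes, so it is typically set before the import; the exact warning text isn't reproduced here):

```python
import os

# Deprecated: maps f32 XLA tensors to bf16 globally.
os.environ["XLA_USE_BF16"] = "1"

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
a = torch.randn(4, 4, device=device)  # f32 to the user, bf16 on the XLA device
b = torch.randn(4, 4, device=device)
c = torch.mul(a, b)  # lowered as a bf16 multiply under this flag
```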