
[Error] measure_vram.py _scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True #90

Open
@INF800

Description


Tried to run nanoVLM/measure_vram.py as-is on a Kaggle T4 GPU, but it failed with:

_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

Full error:
--- VRAM Measurement ---

Testing Batch Size: 1
W0530 05:01:15.467000 19 torch/_inductor/utils.py:1137] [0/0_1] Not enough SMs to use max_autotune_gemm mode
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
An unexpected runtime error occurred for batch size 1: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(1, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


Testing Batch Size: 2
An unexpected runtime error occurred for batch size 2: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


Testing Batch Size: 4
An unexpected runtime error occurred for batch size 4: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


--- Summary of VRAM Usage ---
Batch Size 1: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(1, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Batch Size 2: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Batch Size 4: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To reproduce, run this Kaggle notebook: https://www.kaggle.com/code/asapannarakesh/vram-usage?scriptVersionId=242665513
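The SDPA constraint itself can also be triggered outside the notebook with a few standalone lines (a minimal sketch; the tensor shapes mirror the FakeTensors in the trace above):

    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 9, 128, 64)
    k = torch.randn(1, 9, 128, 64)
    v = torch.randn(1, 9, 128, 64)
    mask = torch.ones(1, 1, 1, 128, dtype=torch.bool)

    # Expected to raise the same RuntimeError as in the trace:
    # "Explicit attn_mask should not be set when is_causal=True"
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=True)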


I am planning to use the same measure_vram function for PaliGemma (see ariG23498/gemma3-object-detection#9 (comment)).
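A possible workaround (not verified; apart from attention_mask, q, k, v and y, the names are assumptions, since only the traceback is visible here) would be to pass either the explicit mask or is_causal to SDPA, never both, e.g. by folding the causal constraint into the padding mask:

    # Hypothetical patch around the SDPA call in the attention forward.
    # Assumes attention_mask is a (B, 1, 1, T) padding mask with 1 for real tokens,
    # and that q and k share the same sequence length T (the prefill case in the trace).
    if attention_mask is not None:
        T = q.size(-2)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
        combined = attention_mask.bool() & causal  # broadcasts to (B, 1, T, T)
        y = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=combined, dropout_p=0.0, is_causal=False
        )
    else:
        y = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, dropout_p=0.0, is_causal=True
        )

With the KV cache in play, the causal part may only matter during prefill, so the proper fix probably belongs in the nanoVLM attention block rather than in my notebook copy.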
