
[Error] measure_vram.py _scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True #90

Open
@INF800

Description


Tried to run nanoVLM/measure_vram.py as-is on a Kaggle T4 GPU, but it failed with:

_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

Full error:
--- VRAM Measurement ---

Testing Batch Size: 1
W0530 05:01:15.467000 19 torch/_inductor/utils.py:1137] [0/0_1] Not enough SMs to use max_autotune_gemm mode
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torch/_inductor/compile_fx.py:1948: UserWarning: Tesla T4 does not support bfloat16 compilation natively, skipping
  warnings.warn(
An unexpected runtime error occurred for batch size 1: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(1, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


Testing Batch Size: 2
An unexpected runtime error occurred for batch size 2: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


Testing Batch Size: 4
An unexpected runtime error occurred for batch size 4: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


--- Summary of VRAM Usage ---
Batch Size 1: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(1, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(1, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Batch Size 2: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Batch Size 4: Error: Failed running call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64),
           grad_fn=<ViewBackward0>), FakeTensor(..., device='cuda:0', size=(s0, 9, 128, 64), dtype=torch.bfloat16,
           grad_fn=<ViewBackward0>)), **{'attn_mask': FakeTensor(..., device='cuda:0', size=(s5, 1, 1, 128)), 'dropout_p': 0.0, 'is_causal': True}):
_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True

from user code:
   File "/tmp/ipykernel_19/3657429462.py", line 596, in torch_dynamo_resume_in_forward_at_589
    x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])
  File "/tmp/ipykernel_19/3657429462.py", line 540, in forward
    x, block_kv_cache = self.attn(x, cos, sin, attention_mask, block_kv_cache)
  File "/tmp/ipykernel_19/3657429462.py", line 481, in forward
    y = torch.nn.functional.scaled_dot_product_attention(

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To reproduce, run this Kaggle notebook: https://www.kaggle.com/code/asapannarakesh/vram-usage?scriptVersionId=242665513
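The SDPA constraint itself can also be triggered outside the notebook with a few standalone lines (a minimal sketch; the tensor shapes mirror the FakeTensors in the trace above):

    import torch
    import torch.nn.functional as F

    q = torch.randn(1, 9, 128, 64)
    k = torch.randn(1, 9, 128, 64)
    v = torch.randn(1, 9, 128, 64)
    mask = torch.ones(1, 1, 1, 128, dtype=torch.bool)

    # Expected to raise the same RuntimeError as in the trace:
    # "Explicit attn_mask should not be set when is_causal=True"
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=True)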


I am planning to use the same measure_vram function for PaliGemma (see ariG23498/gemma3-object-detection#9 (comment)).
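A possible workaround (not verified; apart from attention_mask, q, k, v and y, the names are assumptions, since only the traceback is visible here) would be to pass either the explicit mask or is_causal to SDPA, never both, e.g. by folding the causal constraint into the padding mask:

    # Hypothetical patch around the SDPA call in the attention forward.
    # Assumes attention_mask is a (B, 1, 1, T) padding mask with 1 for real tokens,
    # and that q and k share the same sequence length T (the prefill case in the trace).
    if attention_mask is not None:
        T = q.size(-2)
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
        combined = attention_mask.bool() & causal  # broadcasts to (B, 1, T, T)
        y = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=combined, dropout_p=0.0, is_causal=False
        )
    else:
        y = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, dropout_p=0.0, is_causal=True
        )

With the KV cache in play, the causal part may only matter during prefill, so the proper fix probably belongs in the nanoVLM attention block rather than in my notebook copy.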
