In #1415 networkx seems to fail to complete min cut #1567

Open

crcrpar (Collaborator) opened this issue Dec 18, 2024 · 0 comments
#1415

🐛 Bug

networkx seems to fail to complete the minimum cut for an MLP with two torchao.float8 linears and a GELU activation in bf16.
The script below works when the dtype is float32.
If the activation is ReLU, I see a different error instead.

Traceback (most recent call last):
  File "/opt/pytorch/lightning-thunder/thunder/core/rematerialization.py", line 378, in find_cut
    _, (reachable, non_reachable) = nx.minimum_cut(g, "source", "sink")
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 4", line 3, in argmap_minimum_cut_1
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/maxflow.py", line 454, in minimum_cut
    R = flow_func(flowG, _s, _t, capacity=capacity, value_only=True, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<class 'networkx.utils.decorators.argmap'> compilation 8", line 3, in argmap_preflow_push_5
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 422, in preflow_push
    R = preflow_push_impl(G, s, t, capacity, residual, global_relabel_freq, value_only)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/preflowpush.py", line 41, in preflow_push_impl
    detect_unboundedness(R, s, t)
  File "<class 'networkx.utils.decorators.argmap'> compilation 16", line 3, in argmap_detect_unboundedness_13
  File "/usr/local/lib/python3.12/dist-packages/networkx/utils/backends.py", line 967, in __call__
    return self.orig_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/networkx/algorithms/flow/utils.py", line 173, in detect_unboundedness
    raise nx.NetworkXUnbounded(
networkx.exception.NetworkXUnbounded: Infinite capacity path, flow unbounded above.
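
For what it's worth, NetworkXUnbounded is what networkx raises when the graph handed to minimum_cut contains a source-to-sink path whose edges all have unbounded capacity (missing or infinite "capacity" attributes). A minimal standalone sketch, independent of thunder's rematerialization graph, that triggers the same exception class (node names chosen only to mirror find_cut):

import networkx as nx

# Two edges with infinite capacity form an unbounded source -> sink path,
# so minimum_cut raises NetworkXUnbounded before computing any cut.
g = nx.DiGraph()
g.add_edge("source", "a", capacity=float("inf"))
g.add_edge("a", "sink", capacity=float("inf"))

try:
    nx.minimum_cut(g, "source", "sink")
except nx.NetworkXUnbounded as e:
    print(type(e).__name__, e)  # "Infinite capacity path, flow unbounded above."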

To Reproduce

Steps to reproduce the behavior:

Code sample

import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training
import thunder
from thunder.tests.make_tensor import make_tensor


def main():
    batch_size, in_features, out_features = 16, 32, 64

    device = torch.device("cuda")
    dtype = torch.bfloat16
    bias = True

    model = nn.Sequential(
        nn.Linear(in_features, out_features, bias=bias),
        nn.GELU(approximate="tanh"),
        nn.Linear(out_features, out_features, bias=bias),
    ).to(device=device, dtype=dtype)
    fp8_model = convert_to_float8_training(model)  # swap nn.Linear modules for torchao float8 training linears
    x = make_tensor((batch_size, in_features), device=device, dtype=dtype)

    # jit with both the torch and nvfuser executors
    jitted = thunder.jit(fp8_model, executors=[thunder.get_executor("torch"), thunder.get_executor("nvfuser")])
    actual = jitted(x)


if __name__ == "__main__":
    main()
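
The description above mentions two variations: with float32 the script completes, and with ReLU a different error appears. A sketch, for illustration only (the run helper and its parameters are not part of the original report), that parameterizes the same script so those variations can be toggled:

import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training
import thunder
from thunder.tests.make_tensor import make_tensor


def run(dtype=torch.bfloat16, activation=None):
    # Same model as the script above, with dtype and activation exposed so the
    # variations mentioned in the description can be tried directly.
    batch_size, in_features, out_features = 16, 32, 64
    device = torch.device("cuda")
    activation = activation if activation is not None else nn.GELU(approximate="tanh")

    model = nn.Sequential(
        nn.Linear(in_features, out_features, bias=True),
        activation,
        nn.Linear(out_features, out_features, bias=True),
    ).to(device=device, dtype=dtype)
    fp8_model = convert_to_float8_training(model)
    x = make_tensor((batch_size, in_features), device=device, dtype=dtype)

    jitted = thunder.jit(fp8_model, executors=[thunder.get_executor("torch"), thunder.get_executor("nvfuser")])
    return jitted(x)


# run()                         # bf16 + GELU: NetworkXUnbounded traceback above
# run(dtype=torch.float32)      # float32: reported to work
# run(activation=nn.ReLU())     # ReLU: reported to raise a different error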

Error with ReLU --

Expected behavior

Environment

  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

For this MLP with the nvfuser executor, I hit either NVIDIA/Fuser#3498 or this issue, depending on whether or not I apply the DCE implemented in 232328c.

crcrpar self-assigned this Dec 18, 2024