PyTorch Profiler produces large trace files (~1GB) causing TensorBoard to crash #720

Open
@jxmmy7777

Description

When using the PyTorch Profiler with TensorBoard, the generated trace files are excessively large (roughly 1–2 GB for just 10 steps), causing TensorBoard to crash or hang.

To reproduce

Steps to reproduce the behavior:

  1. Set up a PyTorch Lightning Trainer with the following profiler configuration:
import torch
from pytorch_lightning.profilers import PyTorchProfiler

profiler = PyTorchProfiler(
    on_trace_ready=torch.profiler.tensorboard_trace_handler("<path_to_logs>"),
    schedule=torch.profiler.schedule(skip_first=2, wait=1, warmup=0, active=5),
    profile_memory=True,
)

  2. Run the training for a few steps (a minimal Trainer sketch follows this list).
  3. Note that the produced trace file becomes excessively large.
  4. Attempt to open the logs with TensorBoard.
  5. TensorBoard crashes or becomes unresponsive when viewing the Trace or Memory tab.
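
For completeness, step 2 amounts to something like the following minimal sketch; `model` stands in for any LightningModule and is not specific to this report:

from pytorch_lightning import Trainer

# `model` is any LightningModule; only a handful of steps are needed
# to reproduce the oversized trace with the profiler configured above.
trainer = Trainer(profiler=profiler, max_steps=10)
trainer.fit(model)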

Expected behavior
The trace file should be of manageable size, or there should be a method to limit or chunk the file size to prevent such issues. Additionally, TensorBoard should be able to handle large trace files more gracefully.

Environment:
PyTorch Lightning Version: 1.9.0
Python version: 3.9.18

I have tried (1) disabling profile_memory and (2) reducing the number of active steps in the profiler schedule. However, the trace file is still always larger than 1 GB, which I cannot view in TensorBoard. Can someone suggest alternatives for profiling?
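
For reference, a sketch of the stripped-down configuration I have been experimenting with; as far as I can tell the extra keyword arguments are forwarded to torch.profiler.profile, but that part is an assumption:

import torch
from pytorch_lightning.profilers import PyTorchProfiler

# Stripped-down configuration: fewer active steps, and the options that add
# the most events to the trace (memory, shapes, stacks) turned off.
profiler = PyTorchProfiler(
    on_trace_ready=torch.profiler.tensorboard_trace_handler("<path_to_logs>"),
    schedule=torch.profiler.schedule(skip_first=2, wait=1, warmup=1, active=2),
    profile_memory=False,  # memory events add many entries to the trace
    record_shapes=False,
    with_stack=False,
)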

Given the challenges with the current profiler, I am looking for alternative methods or tools to profile my PyTorch Lightning training. Suggestions or recommendations would be highly appreciated.
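
One direction I'm considering, sketched below, is falling back to Lightning's built-in SimpleProfiler (text-only report, no trace file), and separately opening the already-generated trace JSON in Perfetto rather than the TensorBoard plugin (based on my understanding that it handles large traces better; I have not verified this here):

from pytorch_lightning import Trainer

# Text-only alternative: SimpleProfiler reports per-hook durations and
# writes no trace file, so there is nothing for TensorBoard to choke on.
trainer = Trainer(profiler="simple", max_steps=10)

# For kernel-level detail, the *.pt.trace.json files already produced by
# tensorboard_trace_handler can be loaded at https://ui.perfetto.dev
# instead of the TensorBoard trace tab.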
