
Title: RuntimeError: Timed out initializing process group in store based barrier #54

Open
@hugocool

Description

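    from transformers import TrainingArguments
    import torch

    # NOTE: `model` is defined elsewhere; its creation is not shown in this report.

    # get the number of GPUs visible to this process
    num_gpus = torch.cuda.device_count()
    if num_gpus > 1:
        from parallelformers import parallelize

        # shard the model across all visible GPUs in half precision
        parallelize(model, num_gpus=num_gpus, fp16=True, verbose="detail")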

gives

RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00)
WARNING No nodes ran. Repeat the previous command to attempt a new run. runner.py:213
[10/15/23 12:57:26] ERROR Node 'sort_using_baal: preprocess_and_sort([baal.reed_textkernel_labeled,params:reed.pretrained_model_name,reed.aimwel_labeled.finetuned_pre_trained_isco_classifier]) -> [reed.textkernel_labeled.sorted_jobs,baal.reed_textkernel_labeled_parquet]' failed with error: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=9, timeout=0:30:00) node.py:356
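
One detail in the message: worker_count=9 is larger than world_size=8, so more workers registered on the barrier key than the process group expected. A possible cause (not confirmed here) is a leftover process from an earlier run still holding the rendezvous port. Below is a rough sketch for checking whether that port is free before calling parallelize. It uses only the standard library; the `port_is_free` helper is hypothetical, and both the `MASTER_PORT` variable and the 29500 default are assumptions about which port the run actually uses.

```python
import os
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is bound to (host, port) right now.

    Hypothetical diagnostic helper, not part of parallelformers or PyTorch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Another process (possibly a stale worker) already holds the port.
            return False
    return True


if __name__ == "__main__":
    # Assumption: the rendezvous port is taken from MASTER_PORT, else 29500.
    port = int(os.environ.get("MASTER_PORT", "29500"))
    state = "free" if port_is_free(port) else "in use"
    print(f"port {port} is {state}")
```

If the port is reported as in use before any run has started, looking for and stopping old Python processes from a previous attempt would be the first thing to try.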

Environment

python 3.10.1
parallelformers latest
os: ubuntu


Labels: bug
