Description
What happened?
I'm seeing a major drop in evaluation accuracy when using distributed accelerate launch
with multiple GPUs vs single-GPU evaluation, with no changes to the config file.
Here is the config file I used (oumi/qwen7b/eval_config_base.yml):
model:
  model_name: "Qwen/Qwen2.5-7B-Instruct"
  trust_remote_code: True
  shard_for_eval: True

tasks:
  - evaluation_backend: lm_harness
    task_name: mmlu_pro
    num_samples: 5

inference_engine: NATIVE

output_dir: "eval_results/qwen7b_base"
When I run the evaluation on a single GPU, I get reasonable accuracy:
CUDA_VISIBLE_DEVICES=0 oumi evaluate -c oumi/qwen7b/eval_config_base.yml
# Accuracy: 35%
However, when I try using multiple GPUs with distributed accelerate launch, the accuracy drops drastically:
CUDA_VISIBLE_DEVICES=0,1 oumi distributed accelerate launch -m oumi evaluate -c oumi/qwen7b/eval_config_base.yml
# Accuracy: 5%
I was not able to find any error message or log line that would explain why this is happening.
Nothing was changed in the config or environment besides switching the command to enable multi-GPU. Could this be an issue with how the dataset is sharded across ranks, or with how the results are aggregated in distributed mode?
Let me know what further details I can provide to help debug.
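To make the suspicion concrete (this is only an illustration of the arithmetic I have in mind, not Oumi's or lm-eval's actual code): if each rank scores only its own shard but the final metric is gathered incorrectly, the reported accuracy can collapse even though every rank's local predictions are fine. A minimal sketch:

# Hypothetical sketch of a sharding/aggregation bug -- not Oumi's or lm-eval's code.
full_dataset = list(range(1000))      # stand-in for the MMLU-Pro documents
world_size = 2                        # two GPUs / two ranks

def shard(dataset, rank, world_size):
    """Round-robin shard: each rank sees every world_size-th document."""
    return dataset[rank::world_size]

# Assume every rank answers ~35% of its own shard correctly.
per_rank_correct = [round(0.35 * len(shard(full_dataset, r, world_size)))
                    for r in range(world_size)]

# Correct aggregation: sum per-rank correct counts over the full dataset size.
print(sum(per_rank_correct) / len(full_dataset))   # ~0.35

# Broken aggregation: only one rank's results make it into the final report,
# while the denominator still covers the whole dataset.
print(per_rank_correct[0] / len(full_dataset))     # ~0.175

Something along these lines (or documents being dropped entirely during the gather) would match the symptom of a big accuracy drop with no error in the logs.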
Steps to reproduce the bug
Create a file eval.yml with the following contents:
model:
  model_name: "Qwen/Qwen2.5-7B-Instruct"
  trust_remote_code: True
  shard_for_eval: True

tasks:
  - evaluation_backend: lm_harness
    task_name: mmlu_pro
    num_samples: 5

inference_engine: NATIVE

output_dir: "eval_results/qwen7b_base"
conda create -n test-oumi python=3.11
conda activate test-oumi
pip install "oumi[gpu]"
CUDA_VISIBLE_DEVICES=0,1 oumi distributed accelerate launch -m oumi evaluate -c eval.yml
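For comparison, the single-GPU run in the same environment with the same config produces the expected accuracy:

CUDA_VISIBLE_DEVICES=0 oumi evaluate -c eval.yml
# Accuracy: 35%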
System Info
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Oumi environment information:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌────────────────┬───────────────────────────────────────────────┐
│ Oumi version │ 0.1.12 │
│ Python version │ 3.11.11 │
│ Platform │ Linux-5.15.0-84-generic-x86_64-with-glibc2.31 │
└────────────────┴───────────────────────────────────────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Installed dependencies:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ PACKAGE ┃ VERSION ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ accelerate │ 1.2.1 │
│ aiohttp │ 3.11.18 │
│ bitsandbytes │ 0.45.5 │
│ datasets │ 3.2.0 │
│ diffusers │ <not installed> │
│ einops │ 0.8.1 │
│ jsonlines │ 4.0.0 │
│ liger-kernel │ 0.5.9 │
│ llama-cpp-python │ <not installed> │
│ lm-eval │ 0.4.8 │
│ mlflow │ 2.21.3 │
│ numpy │ 1.26.4 │
│ nvidia-ml-py │ 12.560.30 │
│ omegaconf │ 2.4.0.dev3 │
│ open_clip_torch │ <not installed> │
│ pandas │ 2.2.3 │
│ peft │ 0.14.0 │
│ pexpect │ 4.8.0 │
│ pillow │ 11.1.0 │
│ pydantic │ 2.9.2 │
│ responses │ 0.25.7 │
│ sglang │ <not installed> │
│ skypilot │ 0.7.0 │
│ tensorboard │ 2.18.0 │
│ timm │ <not installed> │
│ torch │ 2.5.1 │
│ torchdata │ 0.9.0 │
│ torchvision │ 0.20.1 │
│ tqdm │ 4.67.1 │
│ transformers │ 4.51.3 │
│ trl │ 0.16.1 │
│ typer │ 0.15.3 │
│ vllm │ 0.7.3 │
│ wandb │ 0.19.11 │
└──────────────────┴─────────────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Environment variables:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ VARIABLE ┃ VALUE ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ ACCELERATE_DYNAMO_BACKEND │ <not set> │
│ ACCELERATE_DYNAMO_MODE │ <not set> │
│ ACCELERATE_DYNAMO_USE_DYNAMIC │ <not set> │
│ ACCELERATE_DYNAMO_USE_FULLGRAPH │ <not set> │
│ ACCELERATE_USE_FSDP │ <not set> │
│ CUDA_VISIBLE_DEVICES │ 0,1 │
│ LOCAL_RANK │ <not set> │
│ LOCAL_WORLD_SIZE │ <not set> │
│ OUMI_EXTRA_DEPS_FILE │ <not set> │
│ OUMI_FORCE_EDITABLE_INSTALL │ <not set> │
│ OUMI_SLURM_CONNECTIONS │ <not set> │
│ OUMI_USE_SPOT_VM │ <not set> │
│ RANK │ <not set> │
│ WORLD_SIZE │ <not set> │
└─────────────────────────────────┴───────────┘
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PyTorch information:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌────────────────┬───────────────────────┐
│ CUDA available │ True │
│ CUDA version │ 12.4 │
│ cuDNN version │ 90.1.0 │
│ Number of GPUs │ 2 │
│ GPU type │ NVIDIA A100 80GB PCIe │
│ GPU memory │ 79.2GB │
└────────────────┴───────────────────────┘