OOM when running LLaMA-30B (FP16) on 2×24 GB GPUs under offload mode w_gpu 83, w_cpu 17 #28

@JiuChen0

Description

Running LLaMA-30B (huggyllama/llama-30b, FP16) on two RTX 4090s (24 GB each) hosted on different machines in different regions (a cross-Internet setup)
causes an out-of-memory error during weight loading at batch size 4, while batch sizes 1–2 work.
Both GPUs climb to ~24 GB of VRAM and the servers crash before inference starts.

| Component | Details |
| --- | --- |
| Model | LLaMA-30B (huggyllama/llama-30b) |
| GPUs | 2 × RTX 4090 (24 GB each) |
| Deployment | Cross-machine setup over the public Internet (different data centers / regions) |
| Offload mode | w_gpu 83%, w_cpu 17%, kv_gpu 100%, kv_cpu 0% |
| Batch sizes tested | ✅ 1–2 OK, 💥 4 OOM |
| Precision | FP16 |
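
A quick back-of-envelope check of the weight split suggests this configuration is already at the edge of 24 GB before the KV cache is counted. The sketch below is not BloomBee code; it just multiplies out the numbers from the table, assuming roughly 32.5B parameters for huggyllama/llama-30b, FP16 weights (2 bytes per parameter), an even 30/30 block split between the two servers, and the w_gpu fraction applied to each server's share (embeddings and the LM head, which the client typically holds, are ignored):

```python
# Rough per-server weight memory estimate; all numbers are assumptions, not measurements.
params_total = 32.5e9            # approx. parameter count of huggyllama/llama-30b
bytes_per_param = 2              # FP16
weights_total_gb = params_total * bytes_per_param / 1e9   # ~65 GB of weights overall
weights_per_server_gb = weights_total_gb / 2              # each server hosts 30 of 60 blocks
w_gpu = 0.83                     # fraction of weights kept in VRAM (offload mode from the table)
gpu_weights_gb = weights_per_server_gb * w_gpu            # ~27 GB vs. 24 GB of VRAM
print(f"Estimated weight footprint in VRAM per server: {gpu_weights_gb:.1f} GB")
```

If these assumptions are roughly right, the weights alone land at or above the 24 GB limit per GPU. Since batch sizes 1–2 do run, the effective on-GPU share must be somewhat smaller than this naive estimate, but there is clearly almost no headroom left for the KV cache (kv_gpu 100%) and activations, so a larger batch can tip it over.
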
# DHT bootstrap node on machine 1 (announced at 12.150.85.67:31340); servers and client join via --initial_peers
python -m bloombee.cli.run_dht \
  --host_maddrs /ip4/0.0.0.0/tcp/31340 \
  --announce_maddrs /ip4/12.150.85.67/tcp/31340 \
  --identity_path bootstrapp1.id

# Server 1 (machine 1, RTX 4090): hosts blocks 0:30 in FP16 (no quantization), batch_size 4
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
  --initial_peers $BBSERVER --block_indices 0:30 \
  --identity_path server40901.id --quant_type none \
  --public_name node40901 \
  --host_maddrs /ip4/0.0.0.0/tcp/31341 \
  --announce_maddrs /ip4/12.150.85.67/tcp/31341 \
  --batch_size 4

# Server 2 (machine 2, RTX 4090): hosts blocks 30:60 in FP16 (no quantization), batch_size 4
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
  --initial_peers $BBSERVER --block_indices 30:60 \
  --identity_path server40903.id --quant_type none \
  --public_name node40903 \
  --host_maddrs /ip4/0.0.0.0/tcp/50123 \
  --announce_maddrs /ip4/142.214.185.167/tcp/50123 \
  --batch_size 4


# Inference benchmark client (note: --torch_dtype float32 here, while the servers hold FP16 weights)
python BloomBee/benchmarks/benchmark_inference.py \
  --model huggyllama/llama-30b --initial_peers "$BBSERVER" \
  --torch_dtype float32 --prompt_len 52 --seq_len 482 --batch_size 4
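
For reference, here is a rough estimate of how the KV cache grows with batch size for this benchmark (52 prompt tokens + 482 generated tokens = 534 positions), assuming the standard huggyllama/llama-30b configuration (60 layers, hidden size 6656), an FP16 cache kept entirely on GPU (kv_gpu 100%), and 30 layers per server. None of this is measured; it is just arithmetic under those assumptions:

```python
# Hypothetical KV-cache sizing; the config values are assumptions about huggyllama/llama-30b.
layers_per_server = 30       # each server hosts half of the 60 transformer layers
hidden_size = 6656           # assumed LLaMA-30B hidden size
bytes_fp16 = 2
seq_len = 52 + 482           # prompt_len + seq_len from the benchmark command above

for batch in (1, 2, 4):
    # K and V each store hidden_size values per layer per position
    kv_gb = 2 * layers_per_server * hidden_size * bytes_fp16 * seq_len * batch / 1e9
    print(f"batch={batch}: ~{kv_gb:.2f} GB of KV cache per server")
```

Under these assumptions the cache grows from roughly 0.4 GB at batch 1 to about 1.7 GB at batch 4 per server, which is small on its own but enough to matter when the weights already leave almost no VRAM headroom.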
