OOM when running LLaMA-30B (FP16) on 2×24 GB GPUs under offload mode w_gpu 83, w_cpu 17 #28

@JiuChen0

Description

Running LLaMA-30B (huggyllama/llama-30b, FP16) on two RTX 4090s (24 GB each) hosted on different machines in different regions (a cross-Internet setup)
causes an out-of-memory error during weight loading at batch size 4, while batch sizes 1–2 work.
Both GPUs climb to ~24 GB of VRAM and the servers crash before inference starts.

| Component | Details |
| --- | --- |
| Model | LLaMA-30B (huggyllama/llama-30b) |
| GPUs | 2 × RTX 4090 (24 GB each) |
| Deployment | Cross-machine setup over the public Internet (different data centers / regions) |
| Offload mode | w_gpu 83%, w_cpu 17%, kv_gpu 100%, kv_cpu 0% |
| Batch sizes tested | ✅ 1–2 OK, 💥 4 OOM |
| Precision | FP16 |
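
A quick back-of-envelope check of the weight split suggests this configuration is already at the edge of 24 GB before the KV cache is counted. The sketch below is not BloomBee code; it just multiplies out the numbers from the table, assuming roughly 32.5B parameters for huggyllama/llama-30b, FP16 weights (2 bytes per parameter), an even 30/30 block split between the two servers, and the w_gpu fraction applied to each server's share (embeddings and the LM head, which the client typically holds, are ignored):

```python
# Rough per-server weight memory estimate; all numbers are assumptions, not measurements.
params_total = 32.5e9            # approx. parameter count of huggyllama/llama-30b
bytes_per_param = 2              # FP16
weights_total_gb = params_total * bytes_per_param / 1e9   # ~65 GB of weights overall
weights_per_server_gb = weights_total_gb / 2              # each server hosts 30 of 60 blocks
w_gpu = 0.83                     # fraction of weights kept in VRAM (offload mode from the table)
gpu_weights_gb = weights_per_server_gb * w_gpu            # ~27 GB vs. 24 GB of VRAM
print(f"Estimated weight footprint in VRAM per server: {gpu_weights_gb:.1f} GB")
```

If these assumptions are roughly right, the weights alone land at or above the 24 GB limit per GPU. Since batch sizes 1–2 do run, the effective on-GPU share must be somewhat smaller than this naive estimate, but there is clearly almost no headroom left for the KV cache (kv_gpu 100%) and activations, so a larger batch can tip it over.
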
# DHT bootstrap node on machine 1 (announced at 12.150.85.67:31340); servers and client join via --initial_peers
python -m bloombee.cli.run_dht \
  --host_maddrs /ip4/0.0.0.0/tcp/31340 \
  --announce_maddrs /ip4/12.150.85.67/tcp/31340 \
  --identity_path bootstrapp1.id

# Server 1 (machine 1, RTX 4090): hosts blocks 0:30 in FP16 (no quantization), batch_size 4
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
  --initial_peers $BBSERVER --block_indices 0:30 \
  --identity_path server40901.id --quant_type none \
  --public_name node40901 \
  --host_maddrs /ip4/0.0.0.0/tcp/31341 \
  --announce_maddrs /ip4/12.150.85.67/tcp/31341 \
  --batch_size 4

# Server 2 (machine 2, RTX 4090): hosts blocks 30:60 in FP16 (no quantization), batch_size 4
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
  --initial_peers $BBSERVER --block_indices 30:60 \
  --identity_path server40903.id --quant_type none \
  --public_name node40903 \
  --host_maddrs /ip4/0.0.0.0/tcp/50123 \
  --announce_maddrs /ip4/142.214.185.167/tcp/50123 \
  --batch_size 4


# Inference benchmark client (note: --torch_dtype float32 here, while the servers hold FP16 weights)
python BloomBee/benchmarks/benchmark_inference.py \
  --model huggyllama/llama-30b --initial_peers "$BBSERVER" \
  --torch_dtype float32 --prompt_len 52 --seq_len 482 --batch_size 4
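
For reference, here is a rough estimate of how the KV cache grows with batch size for this benchmark (52 prompt tokens + 482 generated tokens = 534 positions), assuming the standard huggyllama/llama-30b configuration (60 layers, hidden size 6656), an FP16 cache kept entirely on GPU (kv_gpu 100%), and 30 layers per server. None of this is measured; it is just arithmetic under those assumptions:

```python
# Hypothetical KV-cache sizing; the config values are assumptions about huggyllama/llama-30b.
layers_per_server = 30       # each server hosts half of the 60 transformer layers
hidden_size = 6656           # assumed LLaMA-30B hidden size
bytes_fp16 = 2
seq_len = 52 + 482           # prompt_len + seq_len from the benchmark command above

for batch in (1, 2, 4):
    # K and V each store hidden_size values per layer per position
    kv_gb = 2 * layers_per_server * hidden_size * bytes_fp16 * seq_len * batch / 1e9
    print(f"batch={batch}: ~{kv_gb:.2f} GB of KV cache per server")
```

Under these assumptions the cache grows from roughly 0.4 GB at batch 1 to about 1.7 GB at batch 4 per server, which is small on its own but enough to matter when the weights already leave almost no VRAM headroom.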
