Running LLaMA-2-30B (FP16) on two RTX 4090s (24 GB each) across different machines/regions (a cross-Internet setup) causes OOM during weight loading at batch size 4, while batch sizes 1–2 run fine.
Both GPUs climb to ~24 GB VRAM and crash before inference starts.
| Component | Details |
|---|---|
| Model | LLaMA-2-30B |
| GPUs | 2 × RTX 4090 (24 GB each) |
| Deployment | Cross-machine setup over the public Internet (different data centers / regions) |
| Offload Mode | w_gpu 83%, w_cpu 17%, kv_gpu 100%, kv_cpu 0% |
| Batch Sizes Tested | 1–2 ✅ OK; 4 💥 OOM |
| Precision | FP16 |
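
For context, a rough back-of-the-envelope check of the weight footprint (a sketch only; it assumes ~32.5B parameters for llama-30b, 2 bytes/param in FP16, an even 30/60-block split per server, and that the 83% w_gpu figure applies to the whole per-server shard, ignoring embeddings and activations):

```bash
# Hypothetical estimate: ~32.5B params * 2 bytes (FP16) ≈ 65 GB of weights total.
# Each server holds 30 of 60 blocks ≈ 32.5 GB; with w_gpu = 83%, ~27 GB would land
# in VRAM, which is already close to (or above) the 24 GB of a single 4090 before
# any KV cache is allocated.
python3 -c "total=32.5e9*2/1e9; per_server=total/2; print(f'{per_server*0.83:.1f} GB on GPU per server (estimate)')"
```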
```bash
# Bootstrap DHT node (announces on 12.150.85.67)
python -m bloombee.cli.run_dht \
    --host_maddrs /ip4/0.0.0.0/tcp/31340 \
    --announce_maddrs /ip4/12.150.85.67/tcp/31340 \
    --identity_path bootstrapp1.id
```
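
The server and benchmark commands below reference `$BBSERVER`; presumably it is set to the full multiaddr (including the peer ID) that `run_dht` prints on startup, along these lines:

```bash
# Placeholder only: <PEER_ID> stands for the peer ID printed by run_dht, not a real value.
export BBSERVER=/ip4/12.150.85.67/tcp/31340/p2p/<PEER_ID>
```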
```bash
# Server 1: blocks 0:30 on 12.150.85.67
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
    --initial_peers $BBSERVER --block_indices 0:30 \
    --identity_path server40901.id --quant_type none \
    --public_name node40901 \
    --host_maddrs /ip4/0.0.0.0/tcp/31341 \
    --announce_maddrs /ip4/12.150.85.67/tcp/31341 \
    --batch_size 4
```
```bash
# Server 2: blocks 30:60 on 142.214.185.167
CUDA_VISIBLE_DEVICES=0 python -m bloombee.cli.run_server huggyllama/llama-30b \
    --initial_peers $BBSERVER --block_indices 30:60 \
    --identity_path server40903.id --quant_type none \
    --public_name node40903 \
    --host_maddrs /ip4/0.0.0.0/tcp/50123 \
    --announce_maddrs /ip4/142.214.185.167/tcp/50123 \
    --batch_size 4
```
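
To narrow down where loading tips over 24 GB, VRAM can be watched on each server machine while the blocks load (plain nvidia-smi polling, nothing BloomBee-specific):

```bash
# Poll GPU memory once per second during weight loading on each server machine.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```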
```bash
# Inference benchmark (client side)
python BloomBee/benchmarks/benchmark_inference.py \
    --model huggyllama/llama-30b --initial_peers "$BBSERVER" \
    --torch_dtype float32 --prompt_len 52 --seq_len 482 --batch_size 4
```
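
For the batch-size dependence, a rough KV-cache estimate (a sketch; it assumes the standard llama-30b dimensions of 60 layers and hidden size 6656, an FP16 cache, and prompt_len + seq_len = 534 tokens):

```bash
# Hypothetical estimate: KV cache = batch * tokens * layers * 2 (K and V) * hidden * 2 bytes (FP16).
# With kv_gpu at 100%, each server (30 of the 60 layers) keeps half of this in VRAM,
# so going from batch 2 to batch 4 adds roughly another ~0.9 GB per GPU.
python3 -c "b=4; t=52+482; layers=60; hidden=6656; gb=b*t*layers*2*hidden*2/1e9; print(f'{gb:.1f} GB total KV cache, {gb/2:.1f} GB per server')"
```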