Why are the memory requirements higher when offloading to NVMe compared to offloading to the CPU? #4059
-
Here is my config:
{
"stage": 3,
"overlap_comm": True,
"contiguous_gradients": True,
"offload_param": {
"device": "nvme",
"nvme_path": nvme_path,
"pin_memory": True,
"buffer_count": 60,
"buffer_size": 2.6e8,
"max_in_cpu": 0,
},
"offload_optimizer": {
"device": "nvme",
"nvme_path": nvme_path,
"pin_memory": True,
"buffer_count": 4,
"fast_init": False
},
"load_from_fp32_weights": False,
"stage3_param_persistence_threshold": 0,
"stage3_max_live_parameters": 0,
"stage3_prefetch_bucket_size": 0,
"sub_group_size" : 1e8,
"memory_efficient_linear": True,
"round_robin_gradients": False,
}

I am testing the pretrained BLOOM-560M model. When I use CPU offloading, it only requires 14 GB of memory, which matches the result from the estimate_zero3_model_states_mem_needs_all_live function. However, when I use NVMe offloading, it requires almost 45 GB of CPU memory and 260 GB of NVMe space. I suspect this is due to the buffer size, but if I reduce the buffer size or buffer count, I hit an assertion error during the optimizer step stating that there are no more free buffers. Is there any configuration that keeps the memory requirements of NVMe offloading from exceeding those of CPU offloading?
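For reference, a minimal sketch of how that CPU-offload estimate can be reproduced (assuming the checkpoint is bigscience/bloom-560m from the Hugging Face Hub; the import path is the one used in recent DeepSpeed releases and may differ in older versions):

```python
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU and print ZeRO-3 memory estimates for the
# different offload options (CPU offload case included).
model = AutoModel.from_pretrained("bigscience/bloom-560m")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```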
Replies: 1 comment 1 reply
-
@DandinPower, NVMe offloading consumes extra CPU memory because of the page-locked intermediate buffers that are required for transferring data to/from NVMe.
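A rough back-of-the-envelope calculation (mine, not taken from the DeepSpeed source) suggests the parameter-offload swap buffers in the config above account for most of the gap, assuming each buffer holds fp16 parameter elements at 2 bytes apiece:

```python
# Hypothetical estimate of the pinned (page-locked) host memory consumed by the
# NVMe parameter-offload swap buffers with the settings shown in the question.
buffer_count = 60        # offload_param.buffer_count
buffer_size = 2.6e8      # offload_param.buffer_size, in elements
bytes_per_element = 2    # assumed fp16 elements

pinned_gb = buffer_count * buffer_size * bytes_per_element / 1e9
print(f"pinned parameter swap buffers ~= {pinned_gb:.1f} GB")  # ~31.2 GB
```

That is roughly 31 GB of pinned CPU memory before the optimizer swap buffers and ordinary process memory are counted, which is consistent with the ~45 GB observed. Shrinking buffer_count or buffer_size reduces this footprint, but the buffers still need to be large enough for the tensors being swapped, which is presumably why overly small values trigger the "no free buffers" assertion.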