Why are the memory requirements higher when offloading to NVMe compared to offloading to the CPU? #4059
-
Here is my config:
{
"stage": 3,
"overlap_comm": True,
"contiguous_gradients": True,
"offload_param": {
"device": "nvme",
"nvme_path": nvme_path,
"pin_memory": True,
"buffer_count": 60,
"buffer_size": 2.6e8,
"max_in_cpu": 0,
},
"offload_optimizer": {
"device": "nvme",
"nvme_path": nvme_path,
"pin_memory": True,
"buffer_count": 4,
"fast_init": False
},
"load_from_fp32_weights": False,
"stage3_param_persistence_threshold": 0,
"stage3_max_live_parameters": 0,
"stage3_prefetch_bucket_size": 0,
"sub_group_size" : 1e8,
"memory_efficient_linear": True,
"round_robin_gradients": False,
}

I am testing the pretrained BLOOM-560M model. When I use CPU offloading, it only requires 14 GB of memory, which matches the result from the estimate_zero3_model_states_mem_needs_all_live function. However, when I use NVMe offloading, it requires almost 45 GB of CPU memory and 260 GB of NVMe space. I suspect this is due to the buffer size, but if I reduce the buffer size or buffer count, I hit an assertion error during the optimizer step stating that there are no more free buffers. Is there any configuration that keeps the memory requirements of NVMe offloading from exceeding those of CPU offloading?
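For reference, a minimal sketch of how that CPU-offload estimate can be reproduced (assuming the checkpoint is bigscience/bloom-560m from the Hugging Face Hub; the import path is the one used in recent DeepSpeed releases and may differ in older versions):

```python
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Load the model on CPU and print ZeRO-3 memory estimates for the
# different offload options (CPU offload case included).
model = AutoModel.from_pretrained("bigscience/bloom-560m")
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```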
Replies: 1 comment 1 reply
-
@DandinPower, NVMe offloading consumes extra CPU memory because of the page-locked intermediate buffers that are required for transferring data to/from NVMe.
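A rough back-of-the-envelope calculation (mine, not taken from the DeepSpeed source) suggests the parameter-offload swap buffers in the config above account for most of the gap, assuming each buffer holds fp16 parameter elements at 2 bytes apiece:

```python
# Hypothetical estimate of the pinned (page-locked) host memory consumed by the
# NVMe parameter-offload swap buffers with the settings shown in the question.
buffer_count = 60        # offload_param.buffer_count
buffer_size = 2.6e8      # offload_param.buffer_size, in elements
bytes_per_element = 2    # assumed fp16 elements

pinned_gb = buffer_count * buffer_size * bytes_per_element / 1e9
print(f"pinned parameter swap buffers ~= {pinned_gb:.1f} GB")  # ~31.2 GB
```

That is roughly 31 GB of pinned CPU memory before the optimizer swap buffers and ordinary process memory are counted, which is consistent with the ~45 GB observed. Shrinking buffer_count or buffer_size reduces this footprint, but the buffers still need to be large enough for the tensors being swapped, which is presumably why overly small values trigger the "no free buffers" assertion.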