
Inference is very slow because some parameters are offloaded to CPU after fine-tuning Nemotron-70B #73

@pulkitmehtaworkmetacube

Description

We did the following:

  1. Took the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF base model and fine-tuned it on our custom dataset for a classification task. Training completed in about 6 hours and produced the adapter weights.

  2. Tried to run inference on our test set by first loading the base model and then the adapter weights with PEFT (a rough sketch of this loading step follows the list).
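
For concreteness, this is roughly what the inference-side loading looks like. It is a minimal sketch, not our exact script: the adapter path is a placeholder and the dtype/arguments are assumptions.

```python
# Rough sketch of our inference-side loading (adapter path is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
adapter_dir = "path/to/our/adapter"  # output of the fine-tuning run in step 1

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards the 70B weights across the two GPUs
)
model = PeftModel.from_pretrained(base, adapter_dir)
model.eval()
```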

We have two A100 80 GB GPUs. After step 1, about 67 GB of memory is in use on each GPU; after loading the adapter, one of the GPUs hits the 80 GB mark and we get the message "Some parameters are on the meta device because they were offloaded to the cpu."
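
For reference, this is how we check which modules ended up offloaded (continuing the sketch above; it assumes the base model was loaded with device_map="auto"):

```python
# Continuing the sketch above: entries mapped to "cpu" (or "disk") are the
# offloaded modules that the warning about the meta device refers to.
from collections import Counter

print(Counter(base.hf_device_map.values()))
# e.g. Counter({0: 40, 1: 37, 'cpu': 3})  -- illustrative output, not ours
```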

We also tried loading the base model in 8-bit, but then we get the error:

TypeError: Input tensors need to be on the same GPU, but found the following tensor and device combinations:
[(torch.Size([170, 8192]), device(type='cuda', index=0)), (torch.Size([8192, 8192]), device(type='cuda', index=1)), (torch.Size([170, 8192]), device(type='cuda', index=0))]
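
For completeness, the 8-bit attempt looks roughly like this (again a sketch; the quantization arguments and paths are an approximation of our actual script):

```python
# Rough sketch of the 8-bit loading attempt (paths/arguments approximate).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # still sharded across both GPUs
)
model = PeftModel.from_pretrained(base, "path/to/our/adapter")  # placeholder
# The TypeError above lists matmul inputs sitting on cuda:0 and cuda:1,
# i.e. an activation and a weight ending up on different GPUs.
```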

Any suggestions or leads will be highly appreciated.
