Replies: 1 comment 1 reply
-
@jisngprk I am also new to DeepSpeed, so I may be wrong, but this is what works for many people:
As far as checkpoints go, I think one file is the bare checkpoint, and the other is the full package with the model state dict + optimizer state + other stuff. At least this is how it works with plain torch. For example:
Save:
Load:
And if you want to use just the CPU, you usually specify it explicitly, e.g. torch.load(path, map_location="cpu"). (Note that model.cuda() moves the model to the GPU, not the CPU; model.cpu() is the CPU counterpart.) Hope this is helpful.
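The Save:/Load: examples above did not survive in this copy of the thread; a minimal sketch of the usual torch.save / torch.load pattern being described might look like this (the checkpoint filename and key names are my own, not from the thread):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Save: bundle the model weights and optimizer state into one checkpoint file.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pt",
)

# Load: map_location="cpu" places all tensors on the CPU, so no GPU or
# distributed environment is needed to restore the checkpoint.
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
```

This only covers the plain-torch case; a DeepSpeed engine checkpoint has its own layout and loading path.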
-
I have some questions.
I am using two GPUs in one node.
I want to load the model on the CPU.
How can I load the model checkpoints with a plain torch.load, or with the DeepSpeed engine, without setting up a distributed GPU environment?
If there is an example in the DeepSpeedExamples repo, please let me know.
When I save the checkpoint, it is saved in two separate directories whose names include the loss. Is it normal that the checkpoint is saved in separate pieces? I am guessing so, because of the loss in the directory name.
Thank you!