Checkpointing can also be done while running a distributed training program. With PyTorch Lightning, no extra code is required other than using the checkpoint callback as described above. If you are using DistributedDataParallel or Horovod, however, checkpointing should be done by only one process (one of the ranks) of your program, since all ranks have the same state at the end of each iteration. The following example uses the first process (rank 0) to create a checkpoint:
 if global_rank == 0:
     torch.save(ddp_model.state_dict(), "./checkpoint_path")
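
If the checkpoint is later reloaded to resume training, the other ranks should wait until rank 0 has finished writing the file before reading it. The snippet below is a minimal sketch of that pattern, not part of the example above; it assumes the process group has already been initialized with <code>torch.distributed.init_process_group</code>, and that <code>ddp_model</code> and <code>local_rank</code> are defined elsewhere in the training script.

 import torch
 import torch.distributed as dist
 
 checkpoint_path = "./checkpoint_path"
 
 if dist.get_rank() == 0:
     # Only rank 0 writes the checkpoint; every rank holds identical weights.
     torch.save(ddp_model.state_dict(), checkpoint_path)
 
 # Block all ranks until rank 0 has finished writing the file.
 dist.barrier()
 
 # Each rank loads the checkpoint onto its own device.
 map_location = {"cuda:0": f"cuda:{local_rank}"}
 ddp_model.load_state_dict(torch.load(checkpoint_path, map_location=map_location))

The call to <code>dist.barrier()</code> prevents a rank from attempting to read a checkpoint file that rank 0 has not yet finished writing, and <code>map_location</code> remaps tensors saved from rank 0's GPU onto the local GPU of each process.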