Checkpointing can also be done while running a distributed training program. With PyTorch Lightning, no extra code is required other than using the checkpoint callback as described above. If you are using DistributedDataParallel or Horovod, however, checkpointing should be done by only one process (one of the ranks) of your program, since all ranks hold the same state at the end of each iteration. The following example uses the first process (rank 0) to create a checkpoint:

  if global_rank == 0:
      torch.save(ddp_model.state_dict(), "./checkpoint_path")
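If the checkpoint also needs to be restored later in the same run, the sketch below extends this pattern. It is a minimal sketch, not the only way to do it: it assumes the process group is already initialized, that ddp_model and global_rank are defined as above, that each process drives one GPU whose index matches its rank (a single-node setup), and that all ranks share the same filesystem.

  import torch
  import torch.distributed as dist
  # Rank 0 writes the checkpoint; all ranks hold identical weights, so one copy is enough.
  if global_rank == 0:
      torch.save(ddp_model.state_dict(), "./checkpoint_path")
  # Wait until rank 0 has finished writing before any rank tries to read the file.
  dist.barrier()
  # When resuming, remap tensors saved from rank 0's GPU onto this rank's GPU
  # (assumes one GPU per process and that global_rank matches the local device index).
  map_location = {"cuda:0": "cuda:%d" % global_rank}
  ddp_model.load_state_dict(torch.load("./checkpoint_path", map_location=map_location))

The barrier prevents other ranks from loading a partially written file, and map_location avoids every rank trying to place the restored tensors on rank 0's device.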