PyTorch: Difference between revisions

1 byte removed , 4 months ago

no edit summary

cc_staff

@@ Line 2,074: / Line 2,074: @@
 Take care when loading a checkpoint created in this manner. If a process tries to load a checkpoint that another process has not finished saving, you may see errors or get wrong results. To avoid this, add a barrier to your code so that the process creating the checkpoint finishes writing it to disk before the other processes attempt to load it. Also note that, by default, <code>torch.load</code> attempts to load tensors onto the GPU that originally saved them (<code>cuda:0</code> in this case). To avoid issues, pass <code>map_location</code> to <code>torch.load</code> so that each rank loads tensors onto its own GPU.
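 As a minimal sketch of the save, barrier, and load sequence described above: the helper name <code>save_and_reload</code> and its <code>model</code> and <code>path</code> arguments are hypothetical, the process group is assumed to be initialized already, and the CPU fallback is an assumption added so the sketch also runs on a machine without a GPU.

```python
import torch
import torch.distributed as dist


def save_and_reload(model, path, local_rank):
    """Rank 0 writes the checkpoint; every rank then loads it onto its own device.

    Hypothetical helper: assumes dist.init_process_group() was already called.
    """
    if dist.get_rank() == 0:
        # Only one process writes the file.
        torch.save(model.state_dict(), path)

    # No rank continues past this point until every rank has reached it,
    # so rank 0 has returned from torch.save() before anyone calls torch.load().
    dist.barrier()

    # Remap tensors onto this rank's GPU instead of the device that saved them.
    # Falling back to "cpu" is an assumption for machines without CUDA.
    device = f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu"
    state_dict = torch.load(path, map_location=device)
    model.load_state_dict(state_dict)
    return model
```

 Each rank would call the helper with its own <code>local_rank</code>. The barrier only orders the processes; it does not check that the file is complete, so the write must finish before rank 0 reaches the barrier, which is the case here because <code>torch.save</code> is called synchronously.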
- <!--T:313-->
+<!--T:313-->
   torch.distributed.barrier()
   map_location = f"cuda:{local_rank}"