PyTorch
You must be careful when loading a checkpoint created in this manner. If a process tries to load a checkpoint before another process has finished saving it, you may see errors or get incorrect results. To avoid this, add a barrier to your code to make sure the process that creates the checkpoint finishes writing it to disk before the other processes attempt to load it. Also note that, by default, <code>torch.load</code> will attempt to load tensors onto the GPU on which they were originally saved (<code>cuda:0</code> in this case). To avoid issues, pass <code>map_location</code> to <code>torch.load</code> to load tensors onto the correct GPU for each rank.


<!--T:313-->
  torch.distributed.barrier()            # wait until the saving process has finished writing the checkpoint
  map_location = f"cuda:{local_rank}"    # remap tensors saved from cuda:0 onto this rank's GPU
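
Putting these pieces together, the following is a minimal sketch of the pattern described above. It assumes the process group is already initialized, that <code>model</code> is a model wrapped in <code>DistributedDataParallel</code>, that the <code>LOCAL_RANK</code> environment variable is set (as it is under <code>torchrun</code>), and that <code>checkpoint.pt</code> is an illustrative path on a filesystem shared by all ranks.

  import os
  import torch
  import torch.distributed as dist
  local_rank = int(os.environ["LOCAL_RANK"])
  # Only one process writes the checkpoint; the others wait at the barrier.
  if dist.get_rank() == 0:
      torch.save(model.state_dict(), "checkpoint.pt")
  # Make every rank wait until the file has been fully written to disk.
  torch.distributed.barrier()
  # Remap tensors saved from cuda:0 onto this rank's own GPU.
  map_location = f"cuda:{local_rank}"
  model.load_state_dict(torch.load("checkpoint.pt", map_location=map_location))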