cc_staff
46
edits
No edit summary |
No edit summary |
||
Line 2,074: | Line 2,074: | ||
You must be careful when loading a checkpoint created in this manner. If a process tries to load a checkpoint that has not yet been saved by another, you may see errors or get wrong results. To avoid this, you can add a barrier to your code to make sure the process that will create the checkpoint finishes writing it to disk before other processes attempt to load it. Also note that <code>torch.load</code> will attempt to load tensors to the GPU that saved them originally (<code>cuda:0</code> in this case) by default. To avoid issues, pass <code>map_location</code> to <code>torch.load</code> to load tensors on the correct GPU for each rank. | You must be careful when loading a checkpoint created in this manner. If a process tries to load a checkpoint that has not yet been saved by another, you may see errors or get wrong results. To avoid this, you can add a barrier to your code to make sure the process that will create the checkpoint finishes writing it to disk before other processes attempt to load it. Also note that <code>torch.load</code> will attempt to load tensors to the GPU that saved them originally (<code>cuda:0</code> in this case) by default. To avoid issues, pass <code>map_location</code> to <code>torch.load</code> to load tensors on the correct GPU for each rank. | ||
<!--T:313--> | |||
torch.distributed.barrier() | torch.distributed.barrier() | ||
map_location = f"cuda:{local_rank}" | map_location = f"cuda:{local_rank}" |