PyTorch



<!--T:546-->
'''Note:''' There are known issues with PyTorch 1.10 on our clusters (except for Narval). If you encounter problems while using distributed training, or if you get an error containing <code>c10::Error</code>, we recommend installing PyTorch 1.9.1 using <code>pip install --no-index torch==1.9.1</code>.


====Extra==== <!--T:21-->


<!--T:548-->
In version 1.7.0, PyTorch introduced support for [https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/ Nvidia's TensorFloat-32 (TF32) Mode], which is available only on Ampere and later Nvidia GPU architectures. This mode of executing tensor operations has been shown to yield up to 20x speed-ups compared to equivalent single precision (FP32) operations and is enabled by default in PyTorch versions 1.7.x up to 1.11.x. However, such gains in performance come at the cost of potentially decreased accuracy in the results of operations, which may become problematic in cases such as when dealing with ill-conditioned matrices, or when performing long sequences of tensor operations as is common in deep learning models. Following calls from its user community, starting with PyTorch version 1.12.0, TF32 is now '''disabled by default for matrix multiplications''', but still '''enabled by default for convolutions'''.
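The TF32 behaviour described above can be controlled explicitly from Python. A minimal sketch using PyTorch's documented backend flags (actual TF32 execution still requires an Ampere or newer GPU; on other hardware these flags simply have no effect):

```python
import torch

# TF32 for matrix multiplications: disabled by default since PyTorch 1.12.0.
# Set to True to trade some accuracy for speed on Ampere and newer GPUs.
torch.backends.cuda.matmul.allow_tf32 = True

# TF32 for cuDNN convolutions: still enabled by default.
# Set to False to force full FP32 precision in convolutions.
torch.backends.cudnn.allow_tf32 = True

print(torch.backends.cuda.matmul.allow_tf32, torch.backends.cudnn.allow_tf32)
```

Setting both flags to <code>False</code> recovers the pre-1.7 FP32 behaviour everywhere, at the cost of the TF32 speed-up.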


<!--T:549-->


<!--T:167-->
With small-scale models, we strongly recommend using '''multiple CPUs instead of a GPU'''. While training will almost certainly run faster on a GPU (except in cases where the model is very small), if your model and your dataset are not large enough, the speed-up relative to CPU will likely not be very significant and your job will end up using only a small portion of the GPU's compute capabilities. This might not be an issue on your own workstation, but in a shared environment like our HPC clusters, it means you are unnecessarily blocking a resource that another user may need to run actual large-scale computations. Furthermore, you would be unnecessarily using up your group's allocation and affecting the priority of your colleagues' jobs.
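When training on CPUs, PyTorch can parallelize tensor operations across multiple threads. A minimal sketch, assuming a Slurm job where the number of allocated cores is exposed through the standard <code>SLURM_CPUS_PER_TASK</code> environment variable:

```python
import os
import torch

# Match PyTorch's intra-op thread pool to the CPU cores allocated to the job;
# fall back to 1 if the variable is not set (e.g. outside a Slurm job).
num_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
torch.set_num_threads(num_cpus)

print(torch.get_num_threads())
```

Without this, PyTorch may spawn as many threads as the node has physical cores, oversubscribing the cores actually allocated to your job.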


<!--T:168-->


<!--T:251-->
In cases where a model is fairly small, such that it does not take up a large portion of GPU memory and cannot use a reasonable amount of the GPU's compute capacity, it is '''not advisable to use a GPU'''. Use [[PyTorch#PyTorch_with_Multiple_CPUs|one or more CPUs]] instead. However, if you have such a model but a very large dataset and wish to perform training with a small batch size, taking advantage of data parallelism on a GPU becomes a viable option.


<!--T:252-->
Data parallelism, in this context, refers to methods of training multiple replicas of a model in parallel, where each replica receives a different chunk of training data at each iteration. Gradients are then aggregated at the end of an iteration and the parameters of all replicas are updated in a synchronous or asynchronous fashion, depending on the method. Using this approach may provide a significant speed-up by iterating through all examples in a large dataset approximately ''N'' times faster, where ''N'' is the number of model replicas. An '''important caveat''' of this approach is that, in order to get a trained model that is equivalent to the same model trained without data parallelism, the user must scale either the learning rate or the desired batch size as a function of the number of replicas. See [https://discuss.pytorch.org/t/should-we-split-batch-size-according-to-ngpu-per-node-when-distributeddataparallel/72769/13 this discussion] for more information.
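The scaling caveat above can be made concrete. A minimal sketch of the linear-scaling heuristic, which is one common choice rather than the only valid one (the function name is illustrative, not a PyTorch API): with ''N'' replicas, the effective batch size grows by a factor of ''N'' per iteration, and the base learning rate is often scaled by the same factor.

```python
# Linear learning-rate scaling for data parallelism: with N model replicas,
# the effective batch size is N times the per-replica batch size, so one
# common heuristic multiplies the base learning rate by the same factor N.
def scale_for_data_parallelism(base_lr, per_replica_batch, num_replicas):
    effective_batch = per_replica_batch * num_replicas
    scaled_lr = base_lr * num_replicas
    return effective_batch, scaled_lr

# Example: 4 replicas, per-replica batch of 32, base learning rate 0.01
print(scale_for_data_parallelism(0.01, 32, 4))  # → (128, 0.04)
```

The alternative, as noted above, is to keep the original learning rate and shrink the per-replica batch size so that the effective batch size matches the single-replica setup.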


<!--T:253-->