Translations:PyTorch/64/en
Data parallelism with multiple GPUs
Data Parallelism, in this context, refers to methods to perform training over multiple replicas of a model in parallel, where each replica receives a different chunk of training data at each iteration. Gradients are then aggregated at the end of an iteration and the parameters of all replicas are updated in a synchronous or asynchronous fashion, depending on the method. Using this approach may provide a significant speed-up by iterating through all examples in a large dataset approximately N times faster, where N is the number of model replicas. An important caveat of this approach, is that in order to get a trained model that is equivalent to the same model trained without Data Parallelism, the user must scale either the learning rate or the desired batch size in function of the number of replicas. See this discussion for more information. In the multiple-GPU case, each GPU hosts a replica of your model. Consequently, the model must be small enough to fit inside the memory of a single GPU. Refer to the Model Parallelism section for options to train very large models that do not fit inside a single GPU.