Translations:PyTorch/357/en

From Alliance Doc
Jump to navigation Jump to search

Combining model and data parallelism

In cases where a model is too large to fit inside a Single GPU and, additionally, the goal is to train such a model using a very large training set, combining model parallelism with data parallelism becomes a viable option to achieve high performance. The idea is straightforward: you will split a large model into smaller parts, give each part its own GPU, perform pipeline parallelism on the inputs, then, additionally, you will create replicas of this whole process, which will be trained in parallel over separate subsets of the training set. As in the example from the previous section, gradients are computed independently within each replica, then an aggregation of these gradients is used to update all replicas synchronously or asynchronously, depending on the method used. The main difference here is that each model replica lives in more than one GPU.