Translations:PyTorch/359/en

From Alliance Doc
Jump to navigation Jump to search

The following example is a reprise of the ones from previous sections. Here we combine Torch RPC and DistributedDataParallel to split a model in two parts, then train four replicas of the model distributed over two nodes in parallel. In other words, we will have 2 model replicas spanning 2 GPUs on each node. An important caveat of using Torch RPC is that currently it only supports splitting models inside a single node. For very large models that do not fit inside the combined memory space of all GPUs of a single compute node, see the next section on DeepSpeed.