DeepSpeed
DeepSpeed is a deep learning training optimization library, providing the means to train massive billion-parameter models at scale. Fully compatible with PyTorch, DeepSpeed features implementations of novel memory-efficient distributed training methods based on the Zero Redundancy Optimizer (ZeRO) concept. Through the use of ZeRO, DeepSpeed enables distributed storage and computation of the different elements of a training task - such as optimizer states, model weights, model gradients and model activations - across multiple devices, including GPU, CPU, local hard disk, and/or combinations of these devices. This "pooling" of resources, notably for storage, allows models with massive numbers of parameters to be trained efficiently across multiple nodes, without explicitly handling Model, Pipeline or Data Parallelism in your code.
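As a rough illustration of the ZeRO stages described above (a minimal sketch only, separate from the full configuration used in the tutorial below), the behaviour is selected through the zero_optimization block of a DeepSpeed config: stage 1 partitions the optimizer states across devices, stage 2 additionally partitions the gradients, stage 3 additionally partitions the model parameters themselves, and cpu_offload pushes the partitioned optimizer states into CPU memory. Expressed as a Python dict (the tutorial below uses the equivalent JSON form in ds_config.json):

# Illustrative sketch only; values here are examples, not the settings used later on this page.
zero_example = {
    "zero_optimization": {
        "stage": 2,             # 0: ZeRO disabled, 1: optimizer states, 2: + gradients, 3: + parameters
        "cpu_offload": True,    # keep the partitioned optimizer states (and their update) in CPU memory
        "overlap_comm": True,   # overlap communication with computation
        "contiguous_gradients": True
    }
}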
Installing DeepSpeed
Our recommendation is to install it using our provided Python wheel, as follows:

- 1. Load a Python module with module load python
- 2. Create and activate a virtual environment.
- 3. Install both PyTorch and DeepSpeed in the virtual environment with pip install.

(venv) [name@server ~] pip install --no-index torch deepspeed
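Once installed, a quick sanity check from inside the activated virtual environment (a minimal sketch, not part of the original instructions) is to import both packages and print their versions:

# Confirm that PyTorch and DeepSpeed are importable and report their versions.
# Note: CUDA will typically only be available inside a GPU job, not on a login node.
import torch
import deepspeed

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available())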
Multi-GPU and multi-node jobs with DeepSpeed
In the example that follows, we use DeepSpeed to reproduce our PyTorch tutorial on how to train a model with multiple GPUs distributed over multiple nodes. Notable differences are:

- 1. Here we define and configure several common elements of the training task (such as the optimizer, the learning rate scheduler, the batch size and more) in a config file, rather than in code in the main Python script.
- 2. We also define DeepSpeed-specific configurations, such as which ZeRO stage to use, in a config file.

The job submission script is:
#!/bin/bash
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=16000M
#SBATCH --time=0-00:10
#SBATCH --output=%N-%j.out
## Create a virtualenv and install DeepSpeed + its dependencies on all nodes ##
srun -N $SLURM_NNODES -n $SLURM_NNODES config_env.sh
export HEAD_NODE=$(hostname) # store head node's address
srun launch_training_deepspeed.sh
Where the script config_env.sh is:
#!/bin/bash
module load python
virtualenv --no-download $SLURM_TMPDIR/ENV
source $SLURM_TMPDIR/ENV/bin/activate
pip install --upgrade pip --no-index
pip install --no-index torchvision deepspeed
echo "Done installing virtualenv!"
The script launch_training_deepspeed.sh is:
#!/bin/bash
source $SLURM_TMPDIR/ENV/bin/activate
export NCCL_BLOCKING_WAIT=1
#export NCCL_ASYNC_ERROR_HANDLING=1
#export NCCL_DEBUG=INFO
echo "r$SLURM_NODEID master: $HEAD_NODE"
echo "r$SLURM_NODEID Launching python script"
# Note: --nproc_per_node must match the number of GPUs requested on each node.
time torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node 2 \
    --rdzv_backend c10d \
    --rdzv_endpoint "$HEAD_NODE" \
    pytorch-deepspeed.py --batch_size=256 --num_workers=2 --deepspeed_config="./ds_config.json"
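For context on why torchrun is used here: it starts one Python process per GPU and exports the rendezvous information as environment variables, which deepspeed.init_distributed() in the training script then reads to join the process group. A minimal sketch of what each launched process sees (illustration only):

# Each process launched by torchrun receives (among others) these variables;
# deepspeed.init_distributed() uses them to set up torch.distributed.
import os

for var in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var))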
The file ds_config.json is:
{
    "train_batch_size": 16,
    "steps_per_print": 2000,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [
                0.8,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 1000
        }
    },
    "gradient_clipping": 1.0,
    "prescale_gradients": false,
    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    },
    "wall_clock_breakdown": false,
    "zero_optimization": {
        "stage": 0,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": false
    }
}
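A note on train_batch_size, based on DeepSpeed's general batch-size rule (not stated explicitly on this page): it must equal the per-GPU micro-batch size times the number of gradient accumulation steps times the number of GPUs. Since only train_batch_size is given here, DeepSpeed derives the per-GPU micro-batch size itself. A quick sketch of the arithmetic for this job (2 nodes x 2 GPUs per node):

# train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size
world_size = 2 * 2               # 2 nodes x 2 GPUs per node, as requested in the job script
train_batch_size = 16            # value from ds_config.json
gradient_accumulation_steps = 1  # DeepSpeed's default when not specified
micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * world_size)
print(micro_batch_per_gpu)       # -> 4 samples per GPU per step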
And finally, pytorch-deepspeed.py is:
import os
import time
import datetime
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import deepspeed
import torch.distributed as dist
import torch.utils.data.distributed
import argparse
parser = argparse.ArgumentParser(description='cifar10 classification models, distributed data parallel test')
parser.add_argument('--batch_size', type=int, default=64, help='')
parser.add_argument('--num_workers', type=int, default=0, help='')
parser.add_argument('--max_epochs',type=int,default=2,help='number of epochs')
parser = deepspeed.add_config_arguments(parser)  # adds DeepSpeed command-line arguments such as --deepspeed_config
def main():

    args = parser.parse_args()

    deepspeed.init_distributed()  # initialize the default process group from the torchrun environment variables

    class Net(nn.Module):

        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    net = Net()

    parameters = filter(lambda p: p.requires_grad, net.parameters())

    transform_train = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    dataset_train = CIFAR10(root='./data', train=True, download=False, transform=transform_train)

    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset_train)
    train_loader = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=(train_sampler is None), num_workers=args.num_workers, sampler=train_sampler)

    # DeepSpeed wraps the model in an "engine" that manages the optimizer,
    # the learning rate scheduler and fp16 loss scaling according to ds_config.json.
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(args=args, model=net, model_parameters=parameters, training_data=dataset_train)

    fp16 = model_engine.fp16_enabled()

    criterion = nn.CrossEntropyLoss().cuda()

    for epoch in range(args.max_epochs):

        train_sampler.set_epoch(epoch)

        train(epoch, fp16, model_engine, criterion, train_loader)


def train(epoch, fp16, model_engine, criterion, train_loader):

    train_loss = 0
    correct = 0
    total = 0

    epoch_start = time.time()

    for batch_idx, (inputs, targets) in enumerate(train_loader):

        inputs = inputs.to(model_engine.local_rank)
        if fp16:
            inputs = inputs.half()
        targets = targets.to(model_engine.local_rank)

        outputs = model_engine(inputs)
        loss = criterion(outputs, targets)

        # The engine replaces the usual loss.backward() and optimizer.step() calls.
        model_engine.backward(loss)
        model_engine.step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    elapse_time = datetime.timedelta(seconds=time.time() - epoch_start)
    print("Epoch {} done in {}, loss: {:.3f}, accuracy: {:.2f}%".format(epoch, elapse_time, train_loss / (batch_idx + 1), 100. * correct / total))


if __name__=='__main__':
    main()