Hugging Face

Hugging Face is an organization that builds and maintains several popular open-source software packages widely used in Artificial Intelligence research. In this article, you will find information and tutorials on how to use packages from the Hugging Face ecosystem on our clusters.

Transformers

Transformers is a Python package that provides APIs and tools to easily download and train state-of-the-art models pre-trained on a variety of tasks in multiple domains.

Installing Transformers

Our recommendation is to install it using our provided Python wheel as follows:

1. Load a Python module with module load python.
2. Create and start a virtual environment.
3. Install Transformers in the virtual environment with pip install.
(venv) [name@server ~] pip install --no-index transformers

Downloading pre-trained models

To download a pre-trained model from the Hugging Face model hub, choose one of the options below and follow the instructions on a login node of the cluster you are working on. Models must be downloaded on a login node so that compute resources are not left idle while waiting for the download to complete.

Using git lfs

Pre-trained models are usually made up of fairly large binary files. Hugging Face makes these files available for download via Git Large File Storage (LFS). To download a model, load the git-lfs module and clone your chosen model repository from the model hub:

module load git-lfs/3.3.0
git clone https://huggingface.co/bert-base-uncased

Now that you have a copy of the pre-trained model saved locally in the cluster's filesystem, you can load it from a Python script inside a job, using the local_files_only option to avoid any attempt to download it from the web:

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("/path/to/where/you/cloned/the/model", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("/path/to/where/you/cloned/the/model", local_files_only=True)

Using python

It is also possible to download pre-trained models using Python instead of Git. The following must be executed on a login node as an internet connection is required to download the model files:

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

This will store the pre-trained model files in a cache directory, which defaults to $HOME/.cache/huggingface/hub. You can change the cache directory by setting the environment variable TRANSFORMERS_CACHE before you import anything from the transformers package in your Python script. For example, the following will store model files in the current working directory:

import os
os.environ['TRANSFORMERS_CACHE'] = "./"
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Whether you change the default cache directory location or not, you can load the pre-trained model from disk in a job by using the local_files_only option:

from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("/path/to/where/model/is/saved", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("/path/to/where/model/is/saved", local_files_only=True)

Using a pipeline

Another frequently used way of loading a pre-trained model is via a pipeline. On a login node you can simply pass a model name or a type of task as an argument to pipeline. This will download and store the model at the default cache location:

from transformers import pipeline
pipe = pipeline("text-classification")

In an environment without an internet connection, however, such as inside a job, you must specify the location of the model as well as its tokenizer when calling pipeline:

from transformers import pipeline, AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("/path/to/where/model/is/saved", local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained("/path/to/where/model/is/saved", local_files_only=True)
pipe = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

Failing to do so will cause pipeline to attempt to download models from the internet, which will result in a connection timeout error inside a job.

Datasets

Datasets is a Python package for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks.

Installing Datasets

Our recommendation is to install it using our provided Python wheel as follows:

1. Load a Python module with module load python.
2. Create and start a virtual environment.
3. Load the Arrow module. This will make the pyarrow package (a dependency of Datasets) available inside your virtualenv.
4. Install Datasets in the virtual environment with pip install.


(venv) [name@server ~] module load gcc/9.3.0 arrow/11.0.0
(venv) [name@server ~] pip install --no-index datasets

Note: you will need to load the Arrow module every time you intend to import the Datasets package in your Python script.

Downloading Datasets

The exact method to download and use a dataset from the Hugging Face hub depends on a number of factors such as format and the type of task for which the data will be used. Regardless of the exact method used, any download must be performed on a login node. See the package's official documentation for details on how to download different types of dataset.
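
For example, a minimal sketch of downloading a dataset on a login node looks like the following; the dataset name "imdb" is only an illustration, not a recommendation:

from datasets import load_dataset
# Run this on a login node: it downloads the dataset from the Hugging Face hub
# and stores it in the local cache directory.
dataset = load_dataset("imdb")  # replace "imdb" with the dataset you need
print(dataset)  # shows the available splits once the download completes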

Once the dataset has been downloaded, it will be stored locally in a cache directory, which defaults to $HOME/.cache/huggingface/datasets. It is possible to change the default cache location by setting the environment variable HF_DATASETS_CACHE before you import anything from the Datasets package in your Python script.
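
For example, the following sketch (analogous to the TRANSFORMERS_CACHE example above) would store downloaded datasets in the current working directory; again, the dataset name is only an illustration:

import os
os.environ['HF_DATASETS_CACHE'] = "./"  # must be set before importing datasets
from datasets import load_dataset
dataset = load_dataset("imdb")  # example dataset name; run this on a login node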

To load a dataset in a job where there is no internet connection, set the environment variable HF_DATASETS_OFFLINE=1 and specify the location of the cache directory where the dataset is stored when calling load_dataset():

import os
os.environ['HF_DATASETS_OFFLINE'] = '1'
from datasets import load_dataset
dataset = load_dataset("/path/to/loading_script/of/the/dataset")

Accelerate

Accelerate is a package that enables any PyTorch code to be run across any distributed configuration by adding just four lines of code. This makes training and inference at scale simple, efficient and adaptable.
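
As a minimal sketch (with a toy model and random data purely for illustration), the typical changes to an existing PyTorch training loop are: create an Accelerator, wrap the model, optimizer and dataloader with prepare(), and replace loss.backward() with accelerator.backward(loss):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # 1. create the Accelerator

# toy model, optimizer and data, only to make the sketch self-contained
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=8)

# 2. let Accelerate place the model, optimizer and dataloader on the right device(s)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, targets in loader:
    loss = nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # 3. replace loss.backward() with accelerator.backward()
    optimizer.step()
    optimizer.zero_grad()

A complete multi-GPU, multi-node example follows below.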

Installing Accelerate

Our recommendation is to install it using our provided Python wheel as follows:

1. Load a Python module with module load python.
2. Create and start a virtual environment.
3. Install Accelerate in the virtual environment with pip install.
(venv) [name@server ~] pip install --no-index accelerate

Multi-GPU & multi-node jobs with Accelerate

In the example that follows, we use accelerate to reproduce our PyTorch tutorial on how to train a model with multiple GPUs distributed over multiple nodes. Notable differences are:

1. Here we ask for only one task per node and we let accelerate handle starting the appropriate number of processes (one per GPU) on each node.
2. We pass the number of nodes in the job and the individual node ids to accelerate via the num_machines and machine_rank arguments respectively. Accelerate handles setting global and local ranks internally.


File : accelerate-example.sh

#!/bin/bash
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-task=2
#SBATCH --cpus-per-task=4 
#SBATCH --mem=16000M       
#SBATCH --time=0-00:10
#SBATCH --output=%N-%j.out

## Create a virtualenv and install accelerate + its dependencies on all nodes ##
srun -N $SLURM_NNODES -n $SLURM_NNODES config_env.sh

export HEAD_NODE=$(hostname) # store head node's address
export HEAD_NODE_PORT=34567 # choose a port on the main node to start accelerate's main process

srun launch_training_accelerate.sh


Where the script config_env.sh is:


File : config_env.sh

#!/bin/bash

module load python

virtualenv --no-download $SLURM_TMPDIR/ENV

source $SLURM_TMPDIR/ENV/bin/activate

pip install --upgrade pip --no-index

pip install --no-index torchvision accelerate

echo "Done installing virtualenv!"


The script launch_training_accelerate.sh is:


File : launch_training_accelerate.sh

#!/bin/bash

source $SLURM_TMPDIR/ENV/bin/activate
export NCCL_ASYNC_ERROR_HANDLING=1

echo "Node $SLURM_NODEID says: main node at $HEAD_NODE"
echo "Node $SLURM_NODEID says: Launching python script with accelerate..."

# --num_processes is the total number of GPUs across all nodes (2 nodes x 2 GPUs per node here)
accelerate launch \
--multi_gpu \
--gpu_ids="all" \
--num_machines=$SLURM_NNODES \
--machine_rank=$SLURM_NODEID \
--num_processes=4 \
--main_process_ip="$HEAD_NODE" \
--main_process_port=$HEAD_NODE_PORT \
pytorch-accelerate.py --batch_size 256 --num_workers=2


And finally, pytorch-accelerate.py is:


File : pytorch-accelerate.py

import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

import torch.utils.data.distributed

from accelerate import Accelerator

import argparse

parser = argparse.ArgumentParser(description='cifar10 classification models, distributed data parallel test')
parser.add_argument('--lr', type=float, default=0.1, help='learning rate')
parser.add_argument('--batch_size', type=int, default=64, help='batch size per process')
parser.add_argument('--num_workers', type=int, default=0, help='number of DataLoader workers')

def main():
    print("Starting...")

    args = parser.parse_args()

    accelerator = Accelerator()

    device = accelerator.device

    class Net(nn.Module):

       def __init__(self):
          super(Net, self).__init__()

          self.conv1 = nn.Conv2d(3, 6, 5)
          self.pool = nn.MaxPool2d(2, 2)
          self.conv2 = nn.Conv2d(6, 16, 5)
          self.fc1 = nn.Linear(16 * 5 * 5, 120)
          self.fc2 = nn.Linear(120, 84)
          self.fc3 = nn.Linear(84, 10)

       def forward(self, x):
          x = self.pool(F.relu(self.conv1(x)))
          x = self.pool(F.relu(self.conv2(x)))
          x = x.view(-1, 16 * 5 * 5)
          x = F.relu(self.fc1(x))
          x = F.relu(self.fc2(x))
          x = self.fc3(x)
          return x

    net = Net()

    net.to(device)

    transform_train = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    # CIFAR10 must already be available under ./data: download it beforehand on a login node
    dataset_train = CIFAR10(root='./data', train=True, download=False, transform=transform_train)

    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset_train)
    train_loader = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=(train_sampler is None), num_workers=args.num_workers, sampler=train_sampler)

    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=1e-4)

    net, optimizer, train_loader = accelerator.prepare(net, optimizer, train_loader)

    for batch in train_loader:

       inputs,targets = batch
       outputs = net(inputs)
       loss = criterion(outputs, targets)

       accelerator.backward(loss)
       optimizer.step()

       print("Done!")

if __name__=='__main__':
   main()