Tutoriel Apprentissage machine/en

<languages />


This page is a beginner's manual concerning how to port a machine learning job to one of our clusters.


== Step 1: Remove all graphical display ==


Edit your program so that it doesn't use a graphical display. All graphical output will have to be written to disk and visualized on your personal computer once the job is finished. For example, if you display plots with matplotlib, you need to [https://stackoverflow.com/questions/4706451/how-to-save-a-figure-remotely-with-pylab write the plots to image files instead of showing them on screen].
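With matplotlib specifically, one low-effort way to enforce this is to select a non-interactive backend from the shell before launching your program; <tt>train.py</tt> below is just a placeholder name:

```shell
# Force matplotlib to use the non-interactive Agg backend, so that
# plt.show() never tries to open a window on the cluster.
export MPLBACKEND=Agg

# Your program can then call fig.savefig("result.png") and you
# download the image afterwards ("train.py" is an example name).
python train.py
```

The equivalent inside your Python code is to call <tt>matplotlib.use("Agg")</tt> before importing <tt>pyplot</tt>, and <tt>savefig()</tt> instead of <tt>show()</tt>.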


== Step 2: Archiving a data set ==
 
Shared storage on our clusters is not designed to handle large numbers of small files; it is optimized for very large files. Make sure that the data set you need for your training is in an archive format such as <code>tar</code>, which you can then transfer to your job's compute node when the job starts. '''If you do not respect these rules, you risk causing enormous numbers of I/O operations on the shared filesystem, leading to performance issues on the cluster for all of its users.''' If you want to learn more about how to handle large collections of files, we recommend that you spend some time reading [[Handling_large_collections_of_files|this page]].
 
Assuming that the files which you need are in the directory <tt>mydataset</tt>:


 $ tar cf mydataset.tar mydataset/*


The above command does not compress the data. If compression is appropriate for your data, you can use <tt>tar czf</tt> instead.
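Before transferring the archive to the cluster, it can be worth a quick sanity check that it actually contains your files:

```shell
# List the archive's contents without extracting anything.
tar tf mydataset.tar | head      # first few entries
tar tf mydataset.tar | wc -l     # total number of entries archived
```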


== Step 3: Preparing your virtual environment ==


[[Python#Creating_and_using_a_virtual_environment|Create a virtual environment]] in your home space.

For details on installation and usage of machine learning frameworks, refer to our documentation:
* [[PyTorch]]
* [[TensorFlow]]
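As a minimal sketch, this can look like the following on a login node; the environment name <tt>my_env</tt>, the Python module version, and the package names are examples — see the pages above for what to install for your framework:

```shell
# On a login node: load a Python module (version shown is an example),
# then create and activate a virtual environment in your home space.
module load python/3.10
virtualenv --no-download ~/my_env
source ~/my_env/bin/activate

# Install your framework from the cluster's local wheelhouse;
# --no-index avoids reaching out to PyPI (compute nodes have no internet).
pip install --no-index torch      # or: pip install --no-index tensorflow
```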


== Step 4: Interactive job (salloc) ==


We recommend that you try running your job in an [[Running_jobs#Interactive_jobs|interactive job]] before submitting it using a script (discussed in the following section); an interactive job lets you diagnose problems more quickly. An example of the command for submitting such a job is:
$ salloc --account=def-someuser --gres=gpu:1 --cpus-per-task=3 --mem=32000M --time=1:00:00
Once the job has started:


* Activate your virtual environment.
* Try to run your program.
* Install any missing packages if necessary. Since the compute nodes don't have internet access, you will have to install them from a login node. Please refer to our documentation on [[Python#Creating_and_using_a_virtual_environment|virtual environments]].
* Note the steps that you took to make your program work.
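Inside the interactive job, that loop might look like this; the paths and the environment name are examples:

```shell
# Activate the environment created earlier ("my_env" is an example name)
# and smoke-test your program before committing to a long batch job.
source ~/my_env/bin/activate
python ~/ml-test/train.py --help    # placeholder script; any quick check will do

# If an import fails, note the missing package and, from a *login* node:
#   pip install --no-index <the_missing_package>
```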


'''Now is a good time to verify that your job reads and writes as much as possible on the compute node's local storage (<tt>$SLURM_TMPDIR</tt>) and as little as possible on the [[Storage_and_file_management|shared filesystems (home, scratch and project)]].'''
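The pattern, sketched below with example paths and a placeholder <tt>train.py</tt>, is to stage inputs onto the node-local disk at the start of the job and copy final results back to shared storage once, at the end:

```shell
# Stage input data from shared storage onto the node-local disk.
mkdir -p $SLURM_TMPDIR/data
tar xf ~/projects/def-someuser/mydataset.tar -C $SLURM_TMPDIR/data

# Train while reading and writing only on $SLURM_TMPDIR
# ("train.py" and its options are placeholders).
mkdir -p $SLURM_TMPDIR/output
python train.py --data $SLURM_TMPDIR/data --output $SLURM_TMPDIR/output

# Copy the results back to shared storage in a single operation.
cp -r $SLURM_TMPDIR/output ~/scratch/mydataset-results
```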

== Step 5: Scripted job (sbatch) ==

You must [[Running_jobs#Use_sbatch_to_submit_jobs|submit your jobs]] using a script in conjunction with the <tt>sbatch</tt> command, so that they can be entirely automated as a batch process. Interactive jobs are only for preparing and debugging jobs, so that you can then execute them fully and/or at scale using <tt>sbatch</tt>.

=== Important elements of an <tt>sbatch</tt> script ===

# Account that will be "billed" for the resources used
# Resources required:
## Number of CPUs, suggestion: 6
## Number of GPUs, suggestion: 1 ('''Use a single GPU unless you are certain that your program can use several. By default, TensorFlow and PyTorch use just one GPU.''')
## Amount of memory, suggestion: <tt>32000M</tt>
## Duration (maximum: 7 days on Béluga, 28 days on Graham and Cedar)
# ''Bash'' commands:
## Preparing your environment (modules, virtualenv)
## Transferring data to the compute node
## Starting the executable

=== Example script ===

{{File
|name=ml-test.sh
|lang="sh"
|contents=
#!/bin/bash
#SBATCH --gres=gpu:1      # Request GPU "generic resources"
#SBATCH --cpus-per-task=3 # Refer to cluster's documentation for the right CPU/GPU ratio
#SBATCH --mem=32000M      # Memory proportional to GPUs: 32000 Cedar, 47000 Béluga, 64000 Graham.
#SBATCH --time=0-03:00    # DD-HH:MM:SS

module load python/3.6 cuda cudnn

SOURCEDIR=~/ml-test


# Prepare virtualenv
source ~/my_env/bin/activate
# You could also create your environment here, on the local storage ($SLURM_TMPDIR), for better performance. See our docs on virtual environments.


# Prepare data
mkdir $SLURM_TMPDIR/data
tar xf ~/projects/def-xxxx/data.tar -C $SLURM_TMPDIR/data


# Start training
python $SOURCEDIR/train.py $SLURM_TMPDIR/data
}}


=== Checkpointing a long-running job ===

We recommend that you checkpoint your jobs in 24-hour units. Submitting jobs with short durations ensures they are more likely to start sooner. By creating a daisy chain of jobs, it is also possible to get past the seven-day limit on Béluga.
# Modify your job submission script (or your program) so that your job can be interrupted and resumed. Your program should be able to access the most recent checkpoint file (see the example script below).
# Verify how many epochs (or iterations) can be carried out within a 24-hour unit.
# Calculate how many of these 24-hour units you will need: <tt>n_units = n_epochs_total / n_epochs_per_24h</tt>
# Use the argument <tt>--array 1-<n_units>%1</tt> to ask for a chain of <tt>n_units</tt> jobs.
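With hypothetical numbers, the arithmetic above works out like this:

```shell
# Hypothetical numbers: 120 epochs in total, about 30 epochs fit in 24 hours.
n_epochs_total=120
n_epochs_per_24h=30

# Round up with integer arithmetic: (a + b - 1) / b
n_units=$(( (n_epochs_total + n_epochs_per_24h - 1) / n_epochs_per_24h ))
echo "$n_units"   # 4

# You would then submit the chain with: sbatch --array=1-4%1 ml-test-chain.sh
```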


The job submission script will look like this:




{{File
|name=ml-test-chain.sh
|lang="sh"
|contents=
#!/bin/bash
#SBATCH --array=1-10%1   # 10 is the number of jobs in the chain
#SBATCH ...

module load python/3.6 cuda cudnn

# Prepare virtualenv
...

# Prepare data
...


# Get most recent checkpoint
CHECKPOINT_EXT='*.h5'  # Replace by *.pt for PyTorch checkpoints
CHECKPOINTS=~/scratch/checkpoints/ml-test
LAST_CHECKPOINT=$(find $CHECKPOINTS -maxdepth 1 -name "$CHECKPOINT_EXT" -print0 {{!}} xargs -r -0 ls -1 -t {{!}} head -1)


# Start training
if [ -z "$LAST_CHECKPOINT" ]; then
     # $LAST_CHECKPOINT is null; start from scratch
     python $SOURCEDIR/train.py --write-checkpoints-to $CHECKPOINTS ...
else
     python $SOURCEDIR/train.py --load-checkpoint $LAST_CHECKPOINT --write-checkpoints-to $CHECKPOINTS ...
fi
}}

Latest revision as of 19:08, 3 April 2023
