Once the connection is established, go to [http://localhost:6006 http://localhost:6006].
==Using multiple GPUs==
===TensorFlow 1.x===
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the two most common methods.
*In this section, [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks] code will be used as an example to explain the different methods. Users can refer to the TensorFlow Benchmarks code when implementing their own models.
====Parameter Server====
Variables are stored on a parameter server that holds the master copy of the variables. In distributed training, the parameter servers are separate processes on the different devices. For each step, each tower gets a copy of the variables from the parameter server and sends its gradients to the parameter server.
Parameters can be stored on a CPU:
</pre>
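For example (a sketch only; the flag names <code>--variable_update</code> and <code>--local_parameter_device</code> come from the <code>tf_cnn_benchmarks.py</code> script in the TensorFlow Benchmarks repository and may differ between versions), the parameter server method can be selected with the master copy of the variables kept on the CPU or on a GPU:
<pre>
# Master copy of the variables on the CPU
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --variable_update=parameter_server --local_parameter_device=cpu

# Master copy of the variables on a GPU
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --variable_update=parameter_server --local_parameter_device=gpu
</pre>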
====Replicated====
Each GPU has its own copy of the variables. Gradients are copied across all towers by aggregating the contents of the devices or by an ''all reduce'' algorithm (depending on the value of the <code>all_reduce_spec</code> parameter).
With the default ''all reduce'' method:
The methods behave differently depending on the model; we strongly recommend that you test your models with all of these methods on the different types of GPU nodes.
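Such a comparison is easy to script. The sketch below again assumes the <code>tf_cnn_benchmarks.py</code> flags (<code>--variable_update</code>, <code>--all_reduce_spec</code>) and a 4-GPU node; adapt the model and batch size to your own case:
<pre>
# Sketch: run the same model with each variable-management method and compare the reported images/sec
for METHOD in parameter_server replicated; do
    python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --batch_size=32 --variable_update=$METHOD
done

# Replicated mode with an explicit all-reduce algorithm
python tf_cnn_benchmarks.py --num_gpus=4 --model=resnet50 --batch_size=32 --variable_update=replicated --all_reduce_spec=xring
</pre>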
====Benchmarks====
The results were obtained with TensorFlow v1.5 (CUDA 9 and cuDNN 7) on Graham and Cedar, with a single GPU and with multiple GPUs using the different variable management methods; see [https://github.com/tensorflow/benchmarks TensorFlow Benchmarks].
| Cedar, ''GPU Large'' || 205.71 || 4 || 673.47 || 721.98 || '''754.35''' || 574.91 || 664.72 || 692.25
|}
*VGG-16
| Cedar, ''GPU Large'' || 137.16 || 4 || 175.20 || 379.80 || 336.72 || 417.46 || 225.37 || '''490.52'''
|}
===TensorFlow 2.x===
Much like TensorFlow 1.x, TensorFlow 2.x offers a number of different strategies to make use of multiple GPUs through the high-level API <code>tf.distribute</code>. In the following sections, we provide code examples of each strategy using Keras for simplicity. For more details, please refer to the official [https://www.tensorflow.org/api_docs/python/tf/distribute TensorFlow documentation].
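Whichever strategy is used, the pattern is the same: create a strategy object, then build and compile the Keras model inside its scope. The following minimal sketch (independent of the job scripts below) illustrates this pattern and shows how to check how many replicas, i.e. GPUs, the strategy will use:
<pre>
import tensorflow as tf

# One replica is created per GPU visible to the job.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the replicas.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(loss="mse", optimizer="sgd")
</pre>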
====Mirrored Strategy====
=====Single Node=====
{{File
|name=tensorflow-singleworker.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --gres=gpu:4
#SBATCH --mem=8G
#SBATCH --time=0-00:30
#SBATCH --output=%N-%j.out

module load python/3
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index tensorflow

export NCCL_BLOCKING_WAIT=1 # Set this environment variable if you wish to use the NCCL backend for inter-GPU communication.

srun python tensorflow-singleworker.py
}} | |||
The Python script <code>tensorflow-singleworker.py</code> has the form:
{{File
|name=tensorflow-singleworker.py
|lang="python"
|contents=
import tensorflow as tf
import numpy as np
import argparse

parser = argparse.ArgumentParser(description='cifar10 classification models, tensorflow MirroredStrategy test')
parser.add_argument('--lr', default=0.1, type=float, help='')
parser.add_argument('--batch_size', type=int, default=256, help='')

args = parser.parse_args()

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():

    model = tf.keras.Sequential()

    model.add(tf.keras.layers.Conv2D(32, (3, 3), padding='same', input_shape=(32,32,3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(32, (3, 3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.25))

    model.add(tf.keras.layers.Conv2D(64, (3, 3), padding='same'))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, (3, 3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.25))

    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(512))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(10))

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.SGD(learning_rate=args.lr), metrics=['accuracy'])

### This next line will attempt to download the CIFAR10 dataset from the internet if you don't already have it stored in ~/.keras/datasets.
### Run this line on a login node prior to submitting your job, or manually download the data from
### https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, rename it to "cifar-10-batches-py.tar.gz" and place it under ~/.keras/datasets

(x_train, y_train),_ = tf.keras.datasets.cifar10.load_data()

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(args.batch_size)

model.fit(dataset, epochs=2)
}} | |||
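Note that when using <code>MirroredStrategy</code>, the batch size used to build the dataset (256 in this example) is the ''global'' batch size: it is split evenly across the GPUs of the job, so with <code>--gres=gpu:4</code> each GPU processes 64 images per step. The job can then be submitted with <code>sbatch tensorflow-singleworker.sh</code>.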
=====Multiple Nodes=====
The syntax to use multiple GPUs distributed across multiple nodes is very similar to the single node case, the most notable difference being the use of <code>MultiWorkerMirroredStrategy()</code>. Here, we use <code>SlurmClusterResolver()</code> to tell TensorFlow to acquire all the necessary job information from SLURM, instead of manually assigning master and worker nodes, for example. We also need to add <code>CommunicationImplementation.NCCL</code> to the distribution strategy to specify that we want to use Nvidia's NCCL backend for inter-GPU communications. This was not necessary in the single-node case, as NCCL is the default backend with <code>MirroredStrategy()</code>.
{{File
|name=tensorflow-multiworker.sh
|lang="bash"
|contents=
#!/bin/bash
#SBATCH --nodes 2              # Request 2 nodes so all resources are in two nodes.
#SBATCH --gres=gpu:2           # Request 2 GPU "generic resources". You will get 2 per node.
#SBATCH --tasks-per-node=2     # Request 1 process per GPU. You will get 1 CPU per process by default. Request more CPUs with the "cpus-per-task" parameter if your input pipeline can handle parallel data-loading/data-transforms.
#SBATCH --mem=8G
#SBATCH --time=0-00:30
#SBATCH --output=%N-%j.out

module load python/3
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index tensorflow

export NCCL_BLOCKING_WAIT=1 # Set this environment variable if you wish to use the NCCL backend for inter-GPU communication.

srun python tensorflow-multiworker.py
}} | |||
The Python script <code>tensorflow-multiworker.py</code> has the form:
{{File
|name=tensorflow-multiworker.py
|lang="python"
|contents=
import tensorflow as tf
import numpy as np
import argparse

parser = argparse.ArgumentParser(description='cifar10 classification models, tensorflow MultiWorkerMirrored test')
parser.add_argument('--lr', default=0.1, type=float, help='')
parser.add_argument('--batch_size', type=int, default=256, help='')

args = parser.parse_args()

cluster_config = tf.distribute.cluster_resolver.SlurmClusterResolver()
comm_options = tf.distribute.experimental.CommunicationOptions(implementation=tf.distribute.experimental.CommunicationImplementation.NCCL)

strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=cluster_config, communication_options=comm_options)

with strategy.scope():

    model = tf.keras.Sequential()

    model.add(tf.keras.layers.Conv2D(32, (3, 3), padding='same', input_shape=(32,32,3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(32, (3, 3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.25))

    model.add(tf.keras.layers.Conv2D(64, (3, 3), padding='same'))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Conv2D(64, (3, 3)))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(tf.keras.layers.Dropout(0.25))

    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(512))
    model.add(tf.keras.layers.Activation('relu'))
    model.add(tf.keras.layers.Dropout(0.5))
    model.add(tf.keras.layers.Dense(10))

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.SGD(learning_rate=args.lr), metrics=['accuracy'])

### This next line will attempt to download the CIFAR10 dataset from the internet if you don't already have it stored in ~/.keras/datasets.
### Run this line on a login node prior to submitting your job, or manually download the data from
### https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, rename it to "cifar-10-batches-py.tar.gz" and place it under ~/.keras/datasets

(x_train, y_train),_ = tf.keras.datasets.cifar10.load_data()

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(args.batch_size)

model.fit(dataset, epochs=2)
}} | |||
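As in the single-node case, the job is submitted with <code>sbatch tensorflow-multiworker.sh</code>. Each of the four tasks started by <code>srun</code> (two per node) runs its own copy of the script, and <code>SlurmClusterResolver()</code> reads the job's Slurm environment variables to assign each task its role in the cluster.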
==Custom operators==