TensorFlow

TensorFlow is "an open-source software library for Machine Intelligence".

Installing TensorFlow

These instructions install TensorFlow into your home directory using Compute Canada's pre-built Python wheels. Custom Python wheels are stored in /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/. To install a TensorFlow wheel we will use the pip command and install it into a Python virtual environment. The instructions below are for Python 3.5.2, but you can also install other Python versions by loading a different Python module.
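
If you want to see which TensorFlow wheels are available before installing, one option is to search the wheelhouse directory mentioned above. This is only a convenience check and the exact layout of the wheelhouse may vary between clusters:

[name@server ~]$ find /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/ -iname "tensorflow*.whl"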

Load modules required by TensorFlow:

[name@server ~]$ module load python/3.5.2

Create a new Python virtual environment:

[name@server ~]$ virtualenv tensorflow

Activate your newly created Python virtual environment:

[name@server ~]$ source tensorflow/bin/activate

Install TensorFlow into your newly created virtual environment using the command from one of the two following subsections.

CPU-only

(tensorflow) [name@server $] pip install tensorflow-cpu

GPU

(tensorflow) [name@server $] pip install tensorflow-gpu
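
Whichever wheel you install, a quick way to verify the installation is to import TensorFlow and print its version from inside the activated virtual environment. Note that this only checks the import; for the GPU wheel you may first need to load the cuda and cudnn modules, as in the job script below, and GPU detection itself can only be verified on a GPU node:

(tensorflow) [name@server $] python -c "import tensorflow as tf; print(tf.__version__)"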

Submitting a TensorFlow job with a GPU

Once you have completed the above setup, you can submit a TensorFlow job with

[name@server ~]$ sbatch tensorflow-test.sh

The job submission script has the content

File : tensorflow-test.sh

#!/bin/bash
#SBATCH --gres=gpu:1        # request GPU "generic resource"
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=32000M        # memory per node
#SBATCH --time=0-03:00      # time (DD-HH:MM)
#SBATCH --output=%N-%j.out  # %N for node name, %j for jobID

module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py


while the Python script has the form,

File : tensorflow-test.py

import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()
print(sess.run([node1, node2]))


Once the above job has completed (it should take less than a minute), you should see an output file with a name like cdr116-122907.out and contents similar to the following example,

File : cdr116-122907.out

2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla P100-PCIE-12GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
[3.0, 4.0]


TensorFlow can run on all GPU node types. Cedar's GPU large node type, which is equipped with 4 x P100-PCIE-16GB cards with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning or machine learning research. See Using GPUs with SLURM for more information.
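
If you want to confirm that TensorFlow actually sees the GPU allocated to your job, one option is to add a one-line device check to the job script, after the source tensorflow/bin/activate line. This is a minimal sketch using the TensorFlow 1.x API shown above; device_lib is part of the TensorFlow Python client and lists the devices visible to the process, and a GPU entry will only appear when the command runs on a node with a GPU allocated to the job.

python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"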


Monitoring

It is possible to connect to the node running a job and execute processes there. This can be used to monitor the resources used by your TensorFlow job and to visualize the progression of the training. The following command launches a watch of the nvidia-smi output, refreshed every 30 seconds, on the node assigned to the given job ID.

[name@server $] srun --jobid 123456 --pty watch -n 30 nvidia-smi

It is possible to launch multiple monitoring commands using tmux. The following command launches htop and nvidia-smi in separate panes of the same shell to monitor the activity on the node assigned to the given job ID.

[name@server $] srun --jobid 123456 --pty tmux new-session -d 'htop' \; split-window -h 'watch nvidia-smi' \; attach

Processes launched with srun share the resources of the specified job. You should therefore be careful not to launch processes that would use a significant portion of the resources allocated to the job, as this could jeopardize its normal execution.

TensorFlow with Multi-GPUs

TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the two most common methods.

  • In this section, the TensorFlow Benchmarks code is used as an example to explain the different methods. Users can refer to the TensorFlow Benchmarks code to implement their own.

Parameter Server

Variables are stored on a parameter server that holds the master copy of each variable. In distributed training, the parameter servers are separate processes on the different devices. At each step, each tower gets a copy of the variables from the parameter server and sends its gradients to the parameter server.

Parameters can be stored in CPU:

--variable_update=parameter_server --local_parameter_device=cpu

or GPU:

--variable_update=parameter_server --local_parameter_device=gpu
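
For example, a complete single-node run of the benchmark with parameter-server variable management on the CPU could look like the command below. The model, batch size, and GPU count are illustrative values, and the command assumes you are in the scripts/tf_cnn_benchmarks directory of the TensorFlow Benchmarks repository.

(tensorflow) [name@server $] python tf_cnn_benchmarks.py --num_gpus=2 --model=resnet50 --batch_size=32 --variable_update=parameter_server --local_parameter_device=cpu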

Replicated

With this method, each GPU has its own copy of the variables. To apply gradients, either an all-reduce algorithm or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on the all_reduce_spec parameter).

The all-reduce method can be left as the default:

--variable_update=replicated

Xring --- use one global ring reduction for all tensors:

--variable_update=replicated --all_reduce_spec=xring

Pscpu --- use CPU at worker 0 to reduce all tensors:

--variable_update=replicated --all_reduce_spec=pscpu

NCCL --- use NCCL to locally reduce all tensors:

--variable_update=replicated --all_reduce_spec=nccl
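
As with the parameter-server example above, a replicated run that reduces gradients with NCCL could be launched as follows; the model, batch size, and GPU count are again illustrative, and the command assumes the scripts/tf_cnn_benchmarks directory of the benchmarks repository.

(tensorflow) [name@server $] python tf_cnn_benchmarks.py --num_gpus=2 --model=resnet50 --batch_size=32 --variable_update=replicated --all_reduce_spec=nccl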

Different variable management methods perform differently with different models. Users are strongly encouraged to test their own models with all methods on the different types of GPU nodes.

Benchmarks

This section gives ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar, with single and multiple GPUs, using different methods of managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) was used. The benchmark code can be found on GitHub: TensorFlow Benchmarks.
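
As a rough sketch of how such a run could be submitted, the job script below follows the single-GPU example earlier on this page, extended to two GPUs and the benchmark command shown above. The file name, resource values, module versions, and paths are assumptions for illustration, not the exact configuration used to produce the numbers below; in particular, it assumes the TensorFlow Benchmarks repository has been cloned into your home directory.

File : tf-benchmark-test.sh

#!/bin/bash
#SBATCH --gres=gpu:2        # request two GPUs on one node
#SBATCH --cpus-per-task=12  # scale CPU cores with the number of GPUs requested
#SBATCH --mem=64000M        # memory per node
#SBATCH --time=0-01:00      # time (DD-HH:MM)
#SBATCH --output=%N-%j.out  # %N for node name, %j for jobID

module load cuda cudnn python/3.5.2
source tensorflow/bin/activate

# Assumes https://github.com/tensorflow/benchmarks has been cloned into $HOME.
cd ~/benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=2 --model=resnet50 --batch_size=32 \
       --variable_update=replicated --all_reduce_spec=nccl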

  • ResNet-50

Batch size is 32 per GPU and data parallelism is used. (Results are in images per second; higher is better.)

Node type        | Single GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl
Graham GPU node  | 171.23              | 2              | 93.31   | 324.04  | 318.33     | 316.01            | 109.82            | 315.99
Cedar GPU Base   | 172.99              | 4              | 662.65  | 595.43  | 616.02     | 490.03            | 645.04            | 608.95
Cedar GPU Large  | 205.71              | 4              | 673.47  | 721.98  | 754.35     | 574.91            | 664.72            | 692.25
  • VGG-16

Batch size is 32 per GPU and data parallelism is used. (Results are in images per second; higher is better.)

Node type        | Single GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl
Graham GPU node  | 115.89              | 2              | 91.29   | 194.46  | 194.43     | 203.83            | 132.19            | 219.72
Cedar GPU Base   | 114.77              | 4              | 232.85  | 280.69  | 274.41     | 341.29            | 330.04            | 388.53
Cedar GPU Large  | 137.16              | 4              | 175.20  | 379.80  | 336.72     | 417.46            | 225.37            | 490.52