TensorFlow
TensorFlow is "an open-source software library for Machine Intelligence".
Installing TensorFlow
These instructions install TensorFlow into your home directory using Compute Canada's pre-built Python wheels. Custom Python wheels are stored in /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/. To install a TensorFlow wheel we will use the pip command and install it into a Python virtual environment. The instructions below are for Python 3.5.2, but you can also install other Python versions by loading a different Python module.
Load modules required by TensorFlow:
[name@server ~]$ module load python/3.5.2
Create a new Python virtual environment:
[name@server ~]$ virtualenv tensorflow
Activate your newly created Python virtual environment:
[name@server ~]$ source tensorflow/bin/activate
Install TensorFlow into your newly created virtual environment using the command from either one of the two following subsections.
CPU-only
(tensorflow)_[name@server ~]$ pip install tensorflow-cpu
GPU
(tensorflow)_[name@server ~]$ pip install tensorflow-gpu
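To verify the installation, you can run a quick check (a sketch, not part of the official instructions) inside the activated virtual environment; on a GPU node with the cuda and cudnn modules loaded, the second line should print True:

import tensorflow as tf

print(tf.__version__)              # the wheel's TensorFlow version
print(tf.test.is_gpu_available())  # True only where a GPU is visible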
Submitting a TensorFlow job with a GPU
Once you have completed the setup above, you can submit a TensorFlow job with
[name@server ~]$ sbatch tensorflow-test.sh
The job submission script has the following content:
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=32000M # memory per node
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
while the Python script has the following form:
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()
print(sess.run([node1, node2]))
Once the above job has completed (it should take less than a minute), you should see an output file called something like cdr116-122907.out with contents similar to the following example,
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla P100-PCIE-12GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
[3.0, 4.0]
TensorFlow can run on all GPU node types. Cedar's GPU large node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning or machine learning research. See Using GPUs with SLURM for more information.
Monitoring
It is possible to connect to the node running a job and execute processes. This can be used to monitor resources used by TensorFlow and to visualize the progress of the training. See Attaching to a running job for examples.
TensorBoard
TensorFlow comes with a suite of visualization tools called TensorBoard. TensorBoard operates by reading TensorFlow event and model files. To learn how to create these files, read the TensorBoard tutorial on summaries. The event files are created in a user-specified directory referred to as the logdir.
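As an illustration, here is a minimal sketch (with a hypothetical logdir logs/demo and a toy loss) of writing event files with the TensorFlow 1.x summary API; TensorBoard reads whatever directory you later pass to --logdir:

import tensorflow as tf

x = tf.constant(3.0)
loss = tf.square(x)
tf.summary.scalar('loss', loss)    # record a scalar value at each step
merged = tf.summary.merge_all()    # a single op that evaluates all summaries

with tf.Session() as sess:
    # 'logs/demo' is a hypothetical logdir; pass the same path to TensorBoard
    writer = tf.summary.FileWriter('logs/demo', sess.graph)
    for step in range(10):
        writer.add_summary(sess.run(merged), step)
    writer.close()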
The following command will launch TensorBoard:
[name@server ~]$ tensorboard --logdir=path/to/logdir --host localhost
Note, however, that TensorBoard requires too much processing power to be run on a login node. Users are strongly encouraged to execute it in parallel with their TensorFlow job. The following submission script gives an example. The source code of mnist_with_summaries.py is available here.
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=32000M # memory per node
#SBATCH --time=0-01:00 # time (DD-HH:MM)
source tensorflow/bin/activate
tensorboard --logdir=/tmp/tensorflow/mnist/logs/mnist_with_summaries --host localhost &
python mnist_with_summaries.py
Once the job is running, to access TensorBoard with a web browser, you need to create a connection between your computer and the compute node running TensorFlow and TensorBoard. To create that connection, use the following command.
[name@my_computer ~]$ ssh -J userid@cluster.computecanada.ca -N -f -L localhost:6006:localhost:6006 userid@compute_node
Replace userid by your Compute Canada username, cluster by the cluster hostname (e.g. Cedar, Graham), and compute_node by the compute node hostname. To retrieve the compute node hostname associated with your JOBID, use the following command:
[name@server ~]$ squeue --job JOBID -o %N
Once the connection is created, go to http://localhost:6006.
TensorFlow with Multi-GPUs
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the two most common methods. In this section, the TensorFlow Benchmarks code will be used as an example to explain the different methods. Users can refer to the TensorFlow Benchmarks code to implement their own.
Parameter Server
Variables are stored on a parameter server, which holds the master copy of each variable. In distributed training, the parameter servers are separate processes on different devices. At each step, each tower gets a copy of the variables from the parameter server and sends its gradients back to the parameter server. A code sketch follows the flag examples below.
Parameters can be stored in CPU:
--variable_update=parameter_server --local_parameter_device=cpu
or GPU:
--variable_update=parameter_server --local_parameter_device=gpu
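For users implementing multi-GPU training themselves, the following is a minimal sketch (synthetic data, hypothetical shapes, not the benchmark code) of parameter-server-style placement in TensorFlow 1.x: the master copy of the variable lives on the CPU and each GPU tower computes gradients against it:

import tensorflow as tf

NUM_GPUS = 2  # assumption: match the number of GPUs allocated to the job

with tf.device('/cpu:0'):
    # Master copy of the variable lives on the CPU (the "parameter server").
    w = tf.get_variable('w', shape=[10, 1], initializer=tf.zeros_initializer())

tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        x = tf.random_normal([32, 10])  # stand-in for a per-tower batch
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_grads.append(tf.gradients(loss, [w])[0])

with tf.device('/cpu:0'):
    # Gradients are aggregated where the variable lives, then applied once.
    train_op = tf.train.GradientDescentOptimizer(0.01).apply_gradients(
        [(tf.add_n(tower_grads) / NUM_GPUS, w)])

# allow_soft_placement lets the sketch fall back to CPU if fewer GPUs exist.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)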
Replicated
With this method, each GPU has its own copy of the variables. To apply gradients, either an all-reduce algorithm or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on the all_reduce_spec parameter setting); a code sketch follows the list of options below.
The all-reduce method can be the default:
--variable_update=replicated
Xring --- use one global ring reduction for all tensors:
--variable_update=replicated --all_reduce_spec=xring
Pscpu --- use CPU at worker 0 to reduce all tensors:
--variable_update=replicated --all_reduce_spec=pscpu
NCCL --- use NCCL to locally reduce all tensors:
--variable_update=replicated --all_reduce_spec=nccl
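For comparison with the sketch above, here is a minimal sketch (again synthetic data, using plain cross-device aggregation rather than NCCL) of the replicated pattern, in which each GPU holds its own copy of the variable and the same combined gradient is applied to every copy:

import tensorflow as tf

NUM_GPUS = 2  # assumption: match the number of GPUs allocated to the job

tower_vars, tower_grads = [], []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.variable_scope('tower_%d' % i):
        # Each tower keeps its own copy of the variable.
        w = tf.get_variable('w', shape=[10, 1],
                            initializer=tf.zeros_initializer())
        x = tf.random_normal([32, 10])  # stand-in for a per-tower batch
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        tower_vars.append(w)
        tower_grads.append(tf.gradients(loss, [w])[0])

# Regular cross-device aggregation: combine the per-tower gradients, then
# apply the identical combined gradient to each tower's copy.
combined = tf.add_n(tower_grads) / NUM_GPUS
opt = tf.train.GradientDescentOptimizer(0.01)
train_ops = [opt.apply_gradients([(combined, w)]) for w in tower_vars]

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.group(*train_ops))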
Different variable-management methods perform differently with different models. Users are strongly encouraged to test their own models with all methods on different types of GPU nodes.
Benchmarks
This section gives ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs, using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on GitHub: TensorFlow Benchmarks.
- ResNet-50
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")
Node type | Single GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl |
---|---|---|---|---|---|---|---|---|
Graham GPU node | 171.23 | 2 | 93.31 | 324.04 | 318.33 | 316.01 | 109.82 | 315.99 |
Cedar GPU Base | 172.99 | 4 | 662.65 | 595.43 | 616.02 | 490.03 | 645.04 | 608.95 |
Cedar GPU Large | 205.71 | 4 | 673.47 | 721.98 | 754.35 | 574.91 | 664.72 | 692.25 |
- VGG-16
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")
Node type | Single GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl |
---|---|---|---|---|---|---|---|---|
Graham GPU node | 115.89 | 2 | 91.29 | 194.46 | 194.43 | 203.83 | 132.19 | 219.72 |
Cedar GPU Base | 114.77 | 4 | 232.85 | 280.69 | 274.41 | 341.29 | 330.04 | 388.53 |
Cedar GPU Large | 137.16 | 4 | 175.20 | 379.80 | 336.72 | 417.46 | 225.37 | 490.52 |