TensorFlow
TensorFlow is "an open-source software library for Machine Intelligence".
Installing TensorFlow
These instructions install TensorFlow into your home directory using Compute Canada's pre-built Python wheels. Custom Python wheels are stored in /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/. To install the TensorFlow wheel we will use the pip command and install it into a Python virtual environment. The instructions below install for Python 3.5.2, but you can also install for another Python 3.5.Y or 2.7.X release by loading a different Python module.
Load modules required by TensorFlow:
[name@server ~]$ module load python/3.5.2
Create a new Python virtual environment:
[name@server ~]$ virtualenv tensorflow
Activate your newly created Python virtual environment:
[name@server ~]$ source tensorflow/bin/activate
Install TensorFlow into your newly created virtual environment using the command from one of the two following subsections.
CPU-only
(tensorflow) [name@server ~]$ pip install tensorflow-cpu
GPU
(tensorflow) [name@server ~]$ pip install tensorflow-gpu
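To confirm that the installation worked, you can run a quick check from inside the virtual environment. This is only a minimal sketch; on a login node without a GPU, the GPU build may print warnings about missing CUDA devices but should still report its version.
(tensorflow) [name@server ~]$ python -c "import tensorflow as tf; print(tf.__version__)"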
Submitting a TensorFlow job with a GPU
Once you have completed the above setup, you can submit a TensorFlow job as
[name@server ~]$ sbatch tensorflow-test.sh
The job submission script has the content
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=32000M # memory per node
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
while the Python script has the form,
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()
print(sess.run([node1, node2]))
Once the above job has completed (it should take less than a minute), you should see an output file named something like cdr116-122907.out with contents similar to the following example,
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla P100-PCIE-12GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
[3.0, 4.0]
TensorFlow can run on all GPU node types. Cedar's GPU large node type, which is equipped with 4 x P100-PCIE-16GB GPUs with GPUDirect P2P enabled between each pair, is highly recommended for large-scale deep learning or machine learning research. See Using GPUs with SLURM for more information.
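As an illustration only, a job that uses all four GPUs of a single Cedar node could adapt the submission script shown above along the following lines. The resource values here are assumptions for the sketch rather than recommendations from this page; the options that select a specific node type are described in Using GPUs with SLURM.
#!/bin/bash
#SBATCH --gres=gpu:4           # request 4 GPUs on one node (illustrative value)
#SBATCH --cpus-per-task=24     # 6 CPU cores per GPU on Cedar, as noted in the script above
#SBATCH --mem=32000M           # memory per node
#SBATCH --time=0-03:00         # time (DD-HH:MM)
#SBATCH --output=%N-%j.out     # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py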
Benchmarks
This section presents ResNet-50 and VGG-16 benchmarking results (throughput in images per second) on both Graham and Cedar, with single and multiple GPUs, using different methods of managing variables. The benchmark code can be found on GitHub: TensorFlow Benchmarks (https://github.com/tensorflow/benchmarks).
Methods of managing variables:
ps, cpu: --variable_update=parameter_server --local_parameter_device=cpu
ps, gpu: --variable_update=parameter_server --local_parameter_device=gpu
replicated: --variable_update=replicated
replicated, xring: --variable_update=replicated --all_reduce_spec=xring
replicated, pscpu: --variable_update=replicated --all_reduce_spec=pscpu
replicated, nccl: --variable_update=replicated --all_reduce_spec=nccl
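These flags are passed to the tf_cnn_benchmarks.py script from the repository above. For example, a two-GPU ResNet-50 run with the replicated/NCCL method could be launched roughly as follows (a sketch only; the clone location, model, and GPU count are illustrative choices, not taken from this page):
(tensorflow) [name@server ~]$ git clone https://github.com/tensorflow/benchmarks.git
(tensorflow) [name@server ~]$ python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=2 --variable_update=replicated --all_reduce_spec=nccl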
ResNet-50
Batch size is 32 per GPU. Data parallelism is used.
Node type       | Single-GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl
----------------|---------------------|----------------|---------|---------|------------|-------------------|-------------------|-----------------
Graham GPU node | 171.23              | 2              | 93.31   | 324.04  | 318.33     | 316.01            | 109.82            | 315.99
Cedar GPU Base  | 1                   | 4              | 1       | 1       | 1          | 1                 | 1                 | 1
Cedar GPU Large | 205.71              | 4              | 673.47  | 721.98  | 754.35     | 574.91            | 664.72            | 692.25
VGG-16
Batch size is 32 per GPU. Data parallelism is used.
Node type       | Single-GPU baseline | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl
----------------|---------------------|----------------|---------|---------|------------|-------------------|-------------------|-----------------
Graham GPU node | 115.89              | 2              | 91.29   | 194.46  | 194.43     | 203.83            | 132.19            | 219.72
Cedar GPU Base  | 114.77              | 4              | 232.85  | 280.69  | 274.41     | 341.29            | 330.04            | 388.53
Cedar GPU Large | 137.16              | 4              | 175.20  | 379.80  | 336.72     | 417.46            | 225.37            | 490.52
For the VGG-16 model, the replicated method with NCCL performs best on all node types.