TensorFlow: Difference between revisions
Revision as of 15:17, 6 January 2020
TensorFlow is "an open-source software library for Machine Intelligence".
If you are porting a TensorFlow program to a Compute Canada cluster, you should follow our tutorial on the subject.
Installing TensorFlow
These instructions install TensorFlow in your home directory using Compute Canada's pre-built Python wheels. Custom Python wheels are stored in /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/. To install a TensorFlow wheel we will use the pip command and install it into a Python virtual environment. The instructions below are for Python 3.6, but you can also use other Python versions by loading a different Python module.
Load modules required by TensorFlow.
[name@server ~]$ module load python/3.6
Create a new Python virtual environment.
[name@server ~]$ virtualenv --no-download tensorflow
Activate your newly created Python virtual environment.
[name@server ~]$ source tensorflow/bin/activate
Install TensorFlow in your newly created virtual environment using the command from either one of the two following subsections.
Do not install the tensorflow package (without the _cpu or _gpu suffixes) as it has compatibility issues with other libraries.
CPU-only
(tensorflow) [name@server ~]$ pip install --no-index tensorflow_cpu
GPU
(tensorflow) [name@server ~]$ pip install --no-index tensorflow_gpu
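After installing, it can be useful to confirm which of the wheels actually ended up in the virtual environment. The following is a minimal sketch, not part of the official instructions. It assumes Python 3.8 or newer, where importlib.metadata is in the standard library (on Python 3.6 the importlib_metadata backport would be needed), and the helper name installed_tensorflow_wheels is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip commands above; "tensorflow" is the
# unsuffixed package that the instructions advise against installing.
CANDIDATES = ("tensorflow_cpu", "tensorflow_gpu", "tensorflow")


def installed_tensorflow_wheels(candidates=CANDIDATES):
    """Return {distribution name: version} for each candidate that is installed."""
    found = {}
    for name in candidates:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this distribution is not present in the active environment
    return found


if __name__ == "__main__":
    wheels = installed_tensorflow_wheels()
    if not wheels:
        print("No TensorFlow wheel found in this environment.")
    for name, version in wheels.items():
        print(name, version)
```

Run it with the virtual environment activated; seeing the unsuffixed tensorflow entry would suggest the wrong package was installed.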
R package
To use TensorFlow in R, you will need to first follow the preceding instructions on creating a virtual environment and installing TensorFlow in it. Once this is done, follow these instructions.
Load the required modules.
[name@server ~]$ module load gcc r
Activate your Python virtual environment.
[name@server ~]$ source tensorflow/bin/activate
Launch R.
(tensorflow) [name@server ~]$ R
In R, install package devtools, then tensorflow:
install.packages('devtools', repos='https://cloud.r-project.org')
devtools::install_github('rstudio/tensorflow')
You are then good to go. Do not call install_tensorflow() in R, as TensorFlow has already been installed in your virtual environment with pip. To use the TensorFlow installed in your virtual environment, enter the following commands in R after the environment has been activated.
library(tensorflow)
use_virtualenv(Sys.getenv('VIRTUAL_ENV'))
Submitting a TensorFlow job with a GPU
Once you have the above setup completed you can submit a TensorFlow job.
[name@server ~]$ sbatch tensorflow-test.sh
The job submission script contains
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --cpus-per-task=6 # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham.
#SBATCH --mem=32000M # memory per node
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn
source tensorflow/bin/activate
python ./tensorflow-test.py
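while the Python script has one of the following forms, depending on the TensorFlow version. With TensorFlow 2.x, which executes operations eagerly:
import tensorflow as tf
node1 = tf.constant(3.0)
node2 = tf.constant(4.0)
print(node1, node2)
print(node1 + node2)
With TensorFlow 1.x, which needs a session to evaluate the graph:
import tensorflow as tf
node1 = tf.constant(3.0)
node2 = tf.constant(4.0)
print(node1, node2)
sess = tf.Session()
print(sess.run(node1 + node2))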
Once the job has completed (this should take less than a minute), you should see an output file named something like cdr116-122907.out with contents similar to the following. The logged messages from TensorFlow are only examples; expect different and more numerous messages.
With TensorFlow 2.x:
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
tf.Tensor(3.0, shape=(), dtype=float32) tf.Tensor(4.0, shape=(), dtype=float32)
tf.Tensor(7.0, shape=(), dtype=float32)
With TensorFlow 1.x:
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
7.0
TensorFlow can run on all GPU node types. Cedar's GPU large node type, which is equipped with 4 x P100-PCIE-16GB with GPUDirect P2P enabled between each pair, is highly recommended for large scale deep learning or machine learning research. See Using GPUs with SLURM for more information.
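Besides reading the log messages, a job can check programmatically whether TensorFlow sees the GPU it was allocated. The sketch below is an illustration only. It assumes TensorFlow 1.14 or newer, where the device-listing call exists (under tf.config.experimental before 2.1), and the function name visible_gpus is hypothetical.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # tf.config.list_physical_devices is available from TensorFlow 2.1;
    # earlier releases (1.14 onward) expose it under tf.config.experimental.
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    return [device.name for device in lister("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment.")
    elif not gpus:
        print("TensorFlow is installed but sees no GPU.")
    else:
        print("Visible GPUs:", ", ".join(gpus))
```

An empty list inside a job that requested --gres=gpu:1 usually points to a CPU-only wheel or to missing cuda and cudnn modules.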
Contrib compatibility matrix
Compute Canada compiles TensorFlow wheels for maximum performance and adds features that are not available in Google's TensorFlow releases. Some of these features are part of the TensorFlow contrib folder and therefore not officially supported by TensorFlow developers, nor by Compute Canada staff. We try to accommodate as many users as possible by activating these features, but we are unable to provide support.
We present here a compatibility matrix of the contrib features we have compiled for each TensorFlow version and whether the feature is compiled, functional or tested.
TensorFlow Version | GDR | VERBS | MPI |
---|---|---|---|
1.4.0 | compiled, functional | compiled, untested | compiled, functional |
1.5.0 | compiled, not functional | compiled, untested | compiled, not functional |
1.6.0 | compiled, not functional | compiled, untested | compiled, not functional |
1.7.0 | compiled, functional | compiled, functional | compiled, not functional |
1.8.0 | compiled, untested | compiled, untested | compiled, untested |
If a contrib feature is missing in the version you use and you would like Compute Canada staff to try to integrate it, contact Technical support. We will do our best to recompile TensorFlow with that feature activated.
Monitoring
It is possible to connect to the node running a job and execute processes. This can be used to monitor resources used by TensorFlow and to visualize the progress of the training. See Attaching to a running job for examples.
TensorBoard
TensorFlow comes with a suite of visualization tools called TensorBoard. TensorBoard operates by reading TensorFlow events and model files. To learn how to create these files, read the TensorBoard tutorial on summaries.
TensorBoard requires too much processing power to be run on a login node. Users are strongly encouraged to execute it in the same job as the TensorFlow process. To do so, launch TensorBoard in the background by calling it before your Python script and appending an ampersand (&) to the call:
# Your SBATCH arguments here
tensorboard --logdir=/tmp/your_log_dir --host 0.0.0.0 &
python train.py # example
Once the job is running, you can access TensorBoard with a web browser by creating a connection between your computer and the compute node running TensorFlow and TensorBoard. To do this, you first need the hostname of the compute node running the TensorBoard server, which can be retrieved as follows:
[name@server ~]$ squeue --job JOBID -o %N
To create that connection, use the following command on your local computer:
[name@my_computer ~]$ ssh -N -f -L localhost:6006:computenode:6006 userid@cluster.computecanada.ca
Replace computenode with the node hostname you retrieved in the preceding step, userid with your Compute Canada username, and cluster with the cluster hostname (e.g. Cedar, Graham).
Once the connection is created, go to http://localhost:6006.
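The two steps above (read the node name from squeue, then build the ssh tunnel) can be scripted. The following sketch only parses text and assembles the command; it does not run anything. The function names are hypothetical, and it assumes a single-node job so that squeue prints one plain hostname.

```python
def parse_nodelist(squeue_output):
    """Extract the node hostname from the output of `squeue --job JOBID -o %N`."""
    lines = [line.strip() for line in squeue_output.splitlines() if line.strip()]
    # Without --noheader, squeue prints a NODELIST header line first.
    if lines and lines[0] == "NODELIST":
        lines = lines[1:]
    if not lines:
        raise ValueError("no node listed; the job may not be running yet")
    return lines[0]


def tunnel_command(node, userid, cluster, port=6006):
    """Build the ssh port-forwarding command shown above as an argument list."""
    forward = f"localhost:{port}:{node}:{port}"
    target = f"{userid}@{cluster}.computecanada.ca"
    return ["ssh", "-N", "-f", "-L", forward, target]


if __name__ == "__main__":
    # Sample squeue output; the node name is the one used in the example above.
    node = parse_nodelist("NODELIST\ncdr116\n")
    print(" ".join(tunnel_command(node, "userid", "cedar")))
```

The returned list can be passed to subprocess.run on your local computer, or printed and pasted into a terminal.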
TensorFlow with Multi-GPUs
TensorFlow provides different methods of managing variables when training models on multiple GPUs. "Parameter Server" and "Replicated" are the two most common methods.
In this section, the TensorFlow Benchmarks code is used as an example to explain the different methods. Users can refer to the TensorFlow Benchmarks code when implementing their own.
Parameter Server
Variables are stored on a parameter server that holds the master copy of each variable. In distributed training, the parameter servers are separate processes on different devices. At each step, each tower gets a copy of the variables from the parameter server and sends its gradients back to the parameter server.
Parameters can be stored on the CPU:
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=cpu
or GPU:
python tf_cnn_benchmarks.py --variable_update=parameter_server --local_parameter_device=gpu
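To make the data flow concrete, here is a toy, pure-Python sketch of the parameter server pattern described above. It is not how tf_cnn_benchmarks is implemented. The class, the quadratic toy loss and the choice to average the gradients are all assumptions made for illustration.

```python
class ParameterServer:
    """Holds the master copy of the variables (toy illustration)."""

    def __init__(self, initial, learning_rate):
        self.variables = list(initial)
        self.learning_rate = learning_rate

    def pull(self):
        # Each tower receives its own copy of the current variables.
        return list(self.variables)

    def push(self, tower_gradients):
        # Combine the gradients sent by all towers (averaged here, an
        # assumption of this sketch) and update the master copy.
        count = len(tower_gradients)
        for i in range(len(self.variables)):
            mean_grad = sum(grad[i] for grad in tower_gradients) / count
            self.variables[i] -= self.learning_rate * mean_grad


def toy_gradient(variables, target):
    # Gradient of the toy loss 0.5 * (w - target)**2 for each variable w.
    return [w - target for w in variables]


server = ParameterServer([0.0, 0.0], learning_rate=0.5)
tower_targets = [1.0, 3.0]  # each tower sees different data

for step in range(20):
    gradients = [toy_gradient(server.pull(), target) for target in tower_targets]
    server.push(gradients)

print(server.variables)  # approaches the mean of the targets
```

Only the server updates variables; towers just compute gradients, which is why the device chosen to hold the parameters (CPU or GPU) affects throughput.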
Replicated
With this method, each GPU has its own copy of the variables. To apply gradients, an all_reduce algorithm or regular cross-device aggregation is used to replicate the combined gradients to all towers (depending on the all_reduce_spec parameter's setting).
The all-reduce method can be left at its default:
python tf_cnn_benchmarks.py --variable_update=replicated
Xring --- use one global ring reduction for all tensors:
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=xring
Pscpu --- use CPU at worker 0 to reduce all tensors:
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=pscpu
NCCL --- use NCCL to locally reduce all tensors:
python tf_cnn_benchmarks.py --variable_update=replicated --all_reduce_spec=nccl
Different variable management methods perform differently with different models. Users are strongly encouraged to test their own models with all methods on different types of GPU nodes.
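For comparison with the parameter server approach, the replicated method can be sketched in pure Python as follows. This is a toy illustration under stated assumptions: all_reduce_mean is a hypothetical helper that stands in for whichever reduction the all_reduce_spec setting selects, and the quadratic toy loss is invented for the example.

```python
def all_reduce_mean(per_tower_gradients):
    """Average gradients across towers and give every tower the same result."""
    count = len(per_tower_gradients)
    length = len(per_tower_gradients[0])
    mean = [sum(grad[i] for grad in per_tower_gradients) / count for i in range(length)]
    return [list(mean) for _ in range(count)]


# Every GPU (tower) keeps its own copy of the variables.
replicas = [[0.0, 0.0] for _ in range(2)]
tower_targets = [1.0, 3.0]  # each tower sees different data
learning_rate = 0.5

for step in range(20):
    # Each tower computes gradients of 0.5 * (w - target)**2 on its own copy.
    local = [[w - target for w in replica]
             for replica, target in zip(replicas, tower_targets)]
    # The combined gradients are replicated to all towers.
    reduced = all_reduce_mean(local)
    # Each tower applies the same update, so the copies stay in sync.
    for replica, gradient in zip(replicas, reduced):
        for i, g in enumerate(gradient):
            replica[i] -= learning_rate * g

print(replicas)
```

Because every replica applies an identical update, no master copy is needed; the cost moves into the reduction step, which is what the xring, pscpu and nccl settings change.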
Benchmarks
This section gives ResNet-50 and VGG-16 benchmarking results on both Graham and Cedar with single and multiple GPUs, using different methods for managing variables. TensorFlow v1.5 (built with CUDA 9 and cuDNN 7) is used. The benchmark can be found on GitHub at TensorFlow Benchmarks.
- ResNet-50
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")
Node type | 1 GPU | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl |
---|---|---|---|---|---|---|---|---|
Graham GPU node | 171.23 | 2 | 93.31 | 324.04 | 318.33 | 316.01 | 109.82 | 315.99 |
Cedar GPU Base | 172.99 | 4 | 662.65 | 595.43 | 616.02 | 490.03 | 645.04 | 608.95 |
Cedar GPU Large | 205.71 | 4 | 673.47 | 721.98 | 754.35 | 574.91 | 664.72 | 692.25 |
- VGG-16
Batch size is 32 per GPU. Data parallelism is used. (Results in "images per second")
Node type | 1 GPU | Number of GPUs | ps, cpu | ps, gpu | replicated | replicated, xring | replicated, pscpu | replicated, nccl |
---|---|---|---|---|---|---|---|---|
Graham GPU node | 115.89 | 2 | 91.29 | 194.46 | 194.43 | 203.83 | 132.19 | 219.72 |
Cedar GPU Base | 114.77 | 4 | 232.85 | 280.69 | 274.41 | 341.29 | 330.04 | 388.53 |
Cedar GPU Large | 137.16 | 4 | 175.20 | 379.80 | 336.72 | 417.46 | 225.37 | 490.52 |
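One way to read these tables is to compare the multi-GPU throughput with the ideal of N times the single-GPU throughput. The sketch below computes that ratio for the "replicated, nccl" column of the VGG-16 table. The scaling_efficiency helper is hypothetical, and the metric is an interpretation added here, not part of the original benchmark.

```python
def scaling_efficiency(multi_gpu_rate, single_gpu_rate, num_gpus):
    """Fraction of ideal linear speed-up achieved (1.0 means perfect scaling)."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)


# (node type, 1-GPU images/s, number of GPUs, "replicated, nccl" images/s),
# copied from the VGG-16 table above.
vgg16_nccl = [
    ("Graham GPU node", 115.89, 2, 219.72),
    ("Cedar GPU Base", 114.77, 4, 388.53),
    ("Cedar GPU Large", 137.16, 4, 490.52),
]

efficiencies = {
    node: scaling_efficiency(multi, single, gpus)
    for node, single, gpus, multi in vgg16_nccl
}

for node, efficiency in efficiencies.items():
    print(f"{node}: {efficiency:.1%} of linear scaling")
```

The same function can be applied to any other column to compare variable management methods on equal terms.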
Troubleshooting
scikit-image
If you are using the scikit-image library, you may get the following error:
OMP: Error #15: Initializing libiomp5.so, but found libiomp5.so already initialized.
This happens because the TensorFlow library tries to load a bundled version of OMP that conflicts with the system version. The workaround is as follows:
(tf_skimage_venv) name@server $ cd tf_skimage_venv
(tf_skimage_venv) name@server $ export LIBIOMP_PATH=$(strace python -c 'from skimage.transform import AffineTransform' 2>&1 | grep -v ENOENT | grep -ohP -e '(?<=")[^"]+libiomp5.so(?=")' | xargs realpath)
(tf_skimage_venv) name@server $ find -path '*_solib_local*' -name libiomp5.so -exec ln -sf $LIBIOMP_PATH {} \;
This will patch the TensorFlow library installation to use the system-wide libiomp5.so.
libcupti.so
Some tracing features of TensorFlow require libcupti.so to be available, and might give the following error if it is not:
I tensorflow/stream_executor/dso_loader.cc:142] Couldn't open CUDA library libcupti.so.9.0. LD_LIBRARY_PATH: /usr/local/cuda-9.0/lib64
The solution is to run the following before executing your script:
[name@server ~]$ module load cuda/9.0.xxx
[name@server ~]$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/extras/CUPTI/lib64/
Here, xxx is the appropriate CUDA version, which can be found using module av cuda.
libiomp5.so invalid ELF header
Sometimes the libiomp5.so shared object file will be erroneously installed as a text file. This might result in errors like the following:
/home/username/venv/lib/python3.6/site-packages/tensorflow/python/../../_solib_local/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib/libiomp5.so: invalid ELF header
The workaround for such errors is to access the directory mentioned in the error (i.e. [...]/_U@mkl_Ulinux_S_S_Cmkl_Ulibs_Ulinux___Uexternal_Smkl_Ulinux_Slib) and execute the following command:
[name@server ...Ulinux_Slib] $ ln -sf $(cat libiomp5.so) libiomp5.so
This will replace the text file with the correct symbolic link.
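To find affected files before TensorFlow fails at import time, one can look for libiomp5.so files that do not start with the ELF magic bytes. This is a minimal sketch with hypothetical function names. It only reports suspicious files and leaves the fix to the ln command above.

```python
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF binary


def is_elf(path):
    """Return True if the file starts with the ELF magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == ELF_MAGIC


def find_broken_libiomp5(root):
    """List libiomp5.so files under root that are not ELF binaries."""
    return [
        candidate
        for candidate in Path(root).rglob("libiomp5.so")
        if candidate.is_file() and not is_elf(candidate)
    ]


if __name__ == "__main__":
    import sys

    # Pass the virtual environment directory as the first argument.
    search_root = sys.argv[1] if len(sys.argv) > 1 else "."
    for broken in find_broken_libiomp5(search_root):
        print("not an ELF file:", broken)
```

A correct symbolic link passes the check because the file is opened through the link, so only text-file stand-ins are reported.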