TensorFlow
Installing TensorFlow
These instructions install TensorFlow into your home directory using Compute Canada's pre-built Python wheels. Custom Python wheels are stored in /cvmfs/soft.computecanada.ca/custom/python/wheelhouse/. To install the TensorFlow wheel we will use the pip command and install it into a Python virtual environment. The instructions below install for Python 3.5.2, but you can also install for another Python 3.5.Y or 2.7.X version by loading a different Python module.
Load the module required by TensorFlow:
[name@server ~]$ module load python/3.5.2
Create a new Python virtual environment:
[name@server ~]$ virtualenv tensorflow
Activate your newly created Python virtual environment:
[name@server ~]$ source tensorflow/bin/activate
Install TensorFlow into your newly created virtual environment using the command from one of the two following subsections.
CPU-only
(tensorflow) [name@server ~]$ pip install tensorflow-cpu
GPU
(tensorflow) [name@server ~]$ pip install tensorflow-gpu
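To check that the GPU build can actually see a device, you can list the devices visible to TensorFlow. Below is a minimal sketch using the TF 1.x device_lib helper; the file name check_gpu.py is illustrative, and the script must be run on a GPU node (for example, inside the test job described in the next section), since login nodes have no GPUs.

# check_gpu.py: list the devices visible to TensorFlow (TF 1.x API)
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    # device_type is "CPU" or "GPU"; a GPU entry confirms the GPU build works
    print(device.name, device.device_type)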
Submitting a TensorFlow job with a GPU
Once you have completed the above setup, you can submit a TensorFlow job with
[name@server ~]$ sbatch tensorflow-test.sh
The job submission script has the contents
#!/bin/bash
#SBATCH --gres=gpu:1 # request GPU "generic resource"
#SBATCH --cpus-per-task=6   # maximum CPU cores per GPU request: 6 on Cedar, 16 on Graham
#SBATCH --mem=32000M # memory per node
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
while the Python script, tensorflow-test.py, has the form:
import tensorflow as tf
node1 = tf.constant(3.0, dtype=tf.float32)
node2 = tf.constant(4.0) # also tf.float32 implicitly
print(node1, node2)
sess = tf.Session()
print(sess.run([node1, node2]))
Once the above job has completed (this should take less than a minute), you should see an output file named something like cdr116-122907.out, with contents similar to the following example:
2017-07-10 12:35:19.489458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla P100-PCIE-12GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 11.91GiB
Free memory: 11.63GiB
2017-07-10 12:35:19.491097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-07-10 12:35:19.491156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-07-10 12:35:19.520737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0)
Tensor("Const:0", shape=(), dtype=float32) Tensor("Const_1:0", shape=(), dtype=float32)
[3.0, 4.0]
Using Cedar's large GPU nodes
TensorFlow can run on all GPU node types on Cedar and Graham. Cedar's large GPU node type, which is equipped with 4 x P100-PCIE-16GB cards with GPUDirect P2P enabled between each pair, is highly recommended for large-scale Deep Learning or Machine Learning research.
Large GPU nodes on Cedar accept both whole-node jobs and single-GPU jobs, but single-GPU requests can only run for up to 24 hours. The job submission script for a single-GPU job should have the contents
#!/bin/bash
#SBATCH --nodes=1               # request one node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=6       # the node has 24 CPU cores in total, so each GPU should use up to 6
#SBATCH --gres=gpu:lgpu:1       # lgpu is required for using large GPU nodes
#SBATCH --mem=60G               # total node memory is around 250GB, so each GPU job can request up to 60G
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
The job submission script for a whole node job (i.e., one that uses all four GPUs) should have the contents
#!/bin/bash
#SBATCH --nodes=1               # request one whole node
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24      # use all 24 CPU cores on the node
#SBATCH --gres=gpu:lgpu:4       # lgpu is required for using large GPU nodes; request all 4 GPUs
#SBATCH --mem=250G              # memory per node
#SBATCH --time=0-03:00 # time (DD-HH:MM)
#SBATCH --output=%N-%j.out # %N for node name, %j for jobID
module load cuda cudnn python/3.5.2
source tensorflow/bin/activate
python ./tensorflow-test.py
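Note that tensorflow-test.py itself uses only one GPU, so a whole-node job only pays off if your program can use all four GPUs. As an illustration, here is a minimal sketch of TF 1.x data parallelism that pins one copy of a small computation to each GPU with tf.device; the graph is purely illustrative and is not part of the test script above.

import tensorflow as tf

NUM_GPUS = 4  # one tower per GPU on a Cedar large GPU node

towers = []
for i in range(NUM_GPUS):
    # pin this tower's operations to GPU i
    with tf.device('/gpu:%d' % i):
        a = tf.constant([1.0, 2.0], dtype=tf.float32)
        b = tf.constant([3.0, 4.0], dtype=tf.float32)
        towers.append(a * b)

# allow_soft_placement lets ops without a GPU kernel fall back to the CPU;
# log_device_placement prints where each op actually ran
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(towers))

In a real model, each tower would typically compute gradients on its own slice of the input batch, with the gradients averaged before the shared variables are updated.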
Packing single-GPU jobs within one SLURM job
Cedar's large GPU nodes are highly recommended for running Deep Learning models which can be accelerated by multiple GPUs. If a user needs to run 4 single-GPU codes or 2 two-GPU codes on a node for longer than 24 hours, GNU Parallel is recommended. A simple example is given below:
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'
The GPU id is computed as the GNU Parallel slot id {%} minus 1, and {#} is the job sequence number, starting from 1.
The params.input file should list the input parameters, one per line, like:
code1.py
code2.py
code3.py
code4.py
...
With this method, users can run multiple codes in a single submission. In this example, GNU Parallel runs at most 4 jobs at a time, launching the next job as soon as a running one finishes. CUDA_VISIBLE_DEVICES ensures that each code uses only 1 GPU.
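Putting the pieces together, a whole-node submission script for this packing approach could look like the following. This is a sketch only: the 28-hour time limit is an illustrative value, and the script assumes the tensorflow virtual environment created above.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=24      # whole node: all 24 CPU cores
#SBATCH --gres=gpu:lgpu:4       # whole-node jobs are not limited to 24 hours
#SBATCH --mem=250G              # memory per node
#SBATCH --time=1-04:00          # time (DD-HH:MM); illustrative 28-hour value
#SBATCH --output=%N-%j.out      # %N for node name, %j for jobID

module load cuda cudnn python/3.5.2
source tensorflow/bin/activate

# run up to 4 single-GPU codes at a time, one per GPU
cat params.input | parallel -j4 'CUDA_VISIBLE_DEVICES=$(({%} - 1)) python {} &> {#}.out'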