Hyper-Q / MPS

From Alliance Doc
{{Draft}}
<languages/>
<translate>


==Overview== <!--T:1-->


<!--T:2-->
Hyper-Q (or MPS) is a feature of NVIDIA GPUs.
It is available in GPUs with CUDA compute capability 3.5 and higher,<ref>For a table relating NVIDIA GPU model names, architecture names, and CUDA compute capabilities, see [https://en.wikipedia.org/wiki/Nvidia_Tesla https://en.wikipedia.org/wiki/Nvidia_Tesla]</ref>
which is all GPUs currently deployed on Alliance general-purpose clusters (Béluga, Cedar, Graham, and Narval).


<!--T:3-->
[https://docs.nvidia.com/deploy/mps/index.html According to NVIDIA],
::<i>The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler and later) GPUs. Hyper-Q allows CUDA kernels to be processed concurrently on the same GPU; this can benefit performance when the GPU compute capacity is underutilized by a single application process.</i>




<!--T:4-->
In our tests, MPS may increase the total GPU flop rate even when the GPU is being shared by unrelated CPU processes. This makes MPS well suited to CUDA applications with relatively small problem sizes, which on their own cannot efficiently saturate modern GPUs with their thousands of cores.


<!--T:5-->
MPS is not enabled by default, but it is straightforward to enable.  Execute the following commands before running your CUDA application:


<!--T:6-->
{{Commands|export CUDA_MPS_PIPE_DIRECTORY{{=}}/tmp/nvidia-mps
|export CUDA_MPS_LOG_DIRECTORY{{=}}/tmp/nvidia-log
|nvidia-cuda-mps-control -d}}
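The pipe and log locations are taken from these environment variables, so they should point somewhere your user can write to. As a sketch (the per-user paths below are illustrative, not prescribed by this page), you can create per-user directories to avoid collisions when a node is shared:

```shell
#!/bin/bash
# Sketch: per-user MPS pipe/log directories (illustrative paths).
# The CUDA_MPS_* variables are the ones used by the MPS control daemon.
export CUDA_MPS_PIPE_DIRECTORY="/tmp/nvidia-mps-$USER"
export CUDA_MPS_LOG_DIRECTORY="/tmp/nvidia-log-$USER"
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
echo "pipe: $CUDA_MPS_PIPE_DIRECTORY"
echo "log:  $CUDA_MPS_LOG_DIRECTORY"
# Then start the daemon as above: nvidia-cuda-mps-control -d
```
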


<!--T:7-->
Then you can use the MPS feature if you have more than one CPU thread accessing the GPU. This will happen if you run a hybrid MPI/CUDA application, a hybrid OpenMP/CUDA application, or multiple instances of a serial CUDA application (<i>GPU farming</i>).


<!--T:8-->
Additional details on MPS can be found here: [https://docs.nvidia.com/deploy/mps/index.html CUDA Multi Process Service (MPS) - NVIDIA Documentation].
 
==GPU farming== <!--T:9-->
 
<!--T:10-->
One situation in which the MPS feature can be very useful is when you need to run multiple instances of a CUDA application, but the application is too small to saturate a modern GPU.  MPS allows you to run multiple instances of the application sharing a single GPU, as long as there is enough GPU memory for all of the instances.  In many cases this should result in significantly increased throughput from all of your GPU processes.
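As a back-of-the-envelope check of the memory constraint (all numbers below are illustrative, not from this page), the number of instances that fit is simply total GPU memory divided by the footprint of one instance:

```shell
#!/bin/bash
# Illustrative sizing check: how many instances fit in GPU memory?
gpu_mem_mb=16384       # total GPU memory, e.g. a 16 GB V100
per_instance_mb=1500   # measured memory footprint of one instance
max_instances=$(( gpu_mem_mb / per_instance_mb ))
echo "at most $max_instances instances fit"
```

In practice you would measure the per-instance footprint by running a single instance and watching the memory column of <code>nvidia-smi</code>.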
 
<!--T:11-->
Here is an example of a job script to set up GPU farming:
 
<!--T:12-->
{{File|name=script.sh
|contents=
#!/bin/bash
#SBATCH --gpus-per-node=v100:1
#SBATCH --time=0-10:00
#SBATCH --mem-per-cpu=8G
#SBATCH --cpus-per-task=8
mkdir -p $HOME/tmp
export CUDA_MPS_LOG_DIRECTORY=$HOME/tmp
nvidia-cuda-mps-control -d
for ((i=0; i<SLURM_CPUS_PER_TASK; i++))
do
echo $i
./my_code $i  &
done
wait
}}
 
<!--T:13-->
In the above example, we share a single V100 GPU between 8 instances of <code>my_code</code> (which takes a single argument, the loop index <code>$i</code>). We request 8 CPU cores (<code>#SBATCH --cpus-per-task=8</code>) so that there is one CPU core per application instance. The two important elements are
* <code>&</code> on the code execution line, which sends the code processes to the background, and
* the <code>wait</code> command at the end of the script, which ensures that the job runs until all background processes end.
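This background-and-wait pattern can be tried without a GPU or a scheduler; a minimal sketch (with <code>sleep</code> standing in for the CUDA application) is:

```shell
#!/bin/bash
# GPU-free illustration of the farming pattern used above:
# launch instances in the background with '&', then 'wait' for all of them.
for ((i=0; i<4; i++))
do
  ( sleep 0.2; echo "instance $i finished" ) &
done
wait   # blocks until every background instance has exited
echo "all instances done"
```

Without the final <code>wait</code>, the script (and hence the job) would exit immediately, killing the background processes.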
 
<!--T:14-->
[[Category:Software]]
 
</translate>

Latest revision as of 17:50, 14 December 2023
