Multi-Instance GPU

<languages />
<translate>
<!--T:1-->
Many programs are unable to fully use modern GPUs such as NVIDIA [https://www.nvidia.com/en-us/data-center/a100/ A100s] and [https://www.nvidia.com/en-us/data-center/h100/ H100s].
[https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ Multi-Instance GPU (MIG)] is a technology that allows a single GPU to be partitioned into multiple [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#terminology instances], making each one a completely independent virtual GPU.
Each GPU instance gets a portion of the original GPU's computational resources and memory, detached from the other instances by on-chip protections.


<!--T:2-->
Using GPU instances is less wasteful, and usage is billed accordingly. Jobs submitted on such instances use less of your allocated priority than jobs on a full GPU; you will therefore be able to run more jobs and have shorter wait times.

= Choosing between a full GPU and a GPU instance = <!--T:3-->
Jobs that use less than half of the computing power of a full GPU and less than half of the available GPU memory should be evaluated and tested on an instance. In most cases, these jobs will run just as fast on an instance and consume less than half of the computing resources.

<!--T:20-->
See section [[Multi-Instance GPU#Finding_which_of_your_jobs_should_use_an_instance|Finding which of your jobs should use an instance]] for more details.

= Limitations = <!--T:4-->
[https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#app-considerations The MIG technology does not support] [https://developer.nvidia.com/docs/drive/drive-os/6.0.8.1/public/drive-os-linux-sdk/common/topics/nvsci_nvsciipc/Inter-ProcessCommunication1.html CUDA Inter-Process Communication (IPC)], which optimizes data transfers between GPUs over NVLink and NVSwitch.
This limitation also reduces communication efficiency between instances.
Consequently, launching an executable on more than one instance at a time <b>does not</b> improve performance and should be avoided.

<!--T:5-->
GPU jobs requiring many CPU cores may also require a full GPU instead of an instance. The maximum number of CPU cores per instance depends on [[Allocations_and_compute_scheduling#Ratios_in_bundles|the number of cores per full GPU]] and on the configured [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#a100-profiles MIG profiles]. Both vary between clusters and between GPU nodes in a cluster.

= Available configurations = <!--T:6-->
As of August 23, 2024, 20% of the Narval A100 nodes offer GPU instances.
While there are [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#a100-profiles many possible MIG configurations and profiles], only the following two profiles have been implemented:
* <code>3g.20gb</code>
* <code>4g.20gb</code>


<!--T:7-->
The profile name describes the size of the instance.
For example, a <code>3g.20gb</code> instance has 20 GB of GPU memory and offers 3/8 of the computing performance of a full A100-40gb GPU. Using less powerful profiles will have a lower impact on your allocation and priority.

<!--T:8-->
On Narval, the recommended maximum number of CPU cores and amount of system memory per instance are:
* <code>3g.20gb</code>: maximum 6 cores and 62GB
* <code>4g.20gb</code>: maximum 6 cores and 62GB

<!--T:9-->
To request an instance of a certain profile, your job submission must include the <code>--gres</code> parameter:
* <code>3g.20gb</code>: <code>--gres=gpu:a100_3g.20gb:1</code>
* <code>4g.20gb</code>: <code>--gres=gpu:a100_4g.20gb:1</code>  


<!--T:21-->
Note: For the job scheduler on Narval, the prefix <code>a100_</code> is required at the beginning of the profile name.
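
To check which GRES names are actually configured on the nodes of a given cluster, you can query the scheduler directly. The command below is a minimal sketch using standard Slurm <code>sinfo</code> format options; the node names and the exact GRES strings in the output will vary from cluster to cluster:

 sinfo -o "%30N %60G"

Nodes that offer MIG instances list GRES entries such as <code>gpu:a100_3g.20gb</code> and <code>gpu:a100_4g.20gb</code>.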


= Job examples = <!--T:10-->

<!--T:11-->
* Requesting an instance of power 3/8 and size 20GB for a 1-hour interactive job:
</translate>


{{Command2
|salloc --account{{=}}def-someuser --gres{{=}}gpu:a100_3g.20gb:1 --cpus-per-task{{=}}2 --mem{{=}}40gb --time{{=}}1:0:0
}}
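
Once the interactive allocation starts, a quick way to confirm that the job sees a MIG instance rather than a full GPU is to list the visible devices with <code>nvidia-smi</code>. This is an optional check, not part of the example itself; the UUIDs and device ordering in the output will differ:

 nvidia-smi -L

On a node configured for MIG, the output typically shows the parent A100 GPU together with the single <code>MIG 3g.20gb</code> device assigned to the job.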


<translate>
<!--T:12-->
* Requesting an instance of power 4/8 and size 20GB for a 24-hour batch job using the maximum recommended number of cores and system memory:
</translate>


{{File
  |name=a100_4g.20gb_mig_job.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --gres=gpu:a100_4g.20gb:1
#SBATCH --cpus-per-task=6    # Maximum recommended number of CPU cores per 3g.20gb or 4g.20gb instance on Narval.
#SBATCH --mem=62gb           # Maximum recommended amount of system memory per 3g.20gb or 4g.20gb instance on Narval.
#SBATCH --time=24:00:00

hostname
nvidia-smi
}}
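
The script can then be submitted and monitored like any other batch job. This is ordinary Slurm usage rather than anything specific to MIG; the job ID reported by <code>sbatch</code> will of course differ:

 sbatch a100_4g.20gb_mig_job.sh
 squeue -u $USER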


<translate>
= Finding which of your jobs should use an instance = <!--T:13-->

<!--T:14-->
There are currently two ways to monitor the resource usage of a GPU job. First, you can find information on current and past jobs on the [[Portail|Narval usage portal]], under the <code>Job stats</code> tab.


<!--T:15-->
Power consumption is a good indicator of the total computing power requested from the GPU. For example, the following job requested a full A100 GPU with a maximum TDP of 400W, but used only 100W on average, which is only 50W more than the idle power consumption:
   
   
[[File:ExampleGPUPower.png|400px|frame|left|Example GPU Power usage of a job on a full A100 GPU]]
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. -->


<!--T:16-->
GPU functionality utilization may also provide insight into the usage of the GPU in cases where power consumption alone is not conclusive. For this example job, the GPU utilization graph supports the conclusion of the GPU power consumption graph, in that the job uses less than 25% of the available computing power of a full A100 GPU:
   
   
[[File:ExampleGPUUtilisation.png|400px|frame|left|Example GPU Utilization of a job on a full A100 GPU]]
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. -->


<!--T:17-->
The final metrics to consider are the maximum amount of GPU memory and the average number of CPU cores required to run the job. In this example, the job uses a maximum of 3GB of GPU memory out of the 40GB available on a full A100 GPU.
   
   
[[File:ExampleGPUMemory.png|400px|frame|left|Example GPU memory usage of a job on a full A100 GPU]]
<br clear=all> <!-- This is to prevent the next section from filling to the right of the image. -->


<!--T:18-->
The job was also launched using a single CPU core. Taking these three metrics into account, we can confirm that it should easily run on a <code>3g.20gb</code> or <code>4g.20gb</code> GPU instance with power and memory to spare.


<!--T:19-->
The second way to monitor the usage of a running job is to [[Running jobs#Attaching_to_a_running_job|attach to the node]] where the job is currently running and then use <code>nvidia-smi</code> to read the GPU metrics in real time.
This will not provide maximum and average values for memory and power usage over the entire job, but it may be helpful to identify and troubleshoot underperforming jobs.
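
For example, assuming a hypothetical job ID of 123456, the following command attaches to the job's allocation and refreshes the GPU metrics every 30 seconds using standard Slurm and NVIDIA tools (on some Slurm versions, the <code>--overlap</code> option must be added to share the job's resources):

 srun --jobid=123456 --pty watch -n 30 nvidia-smi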
</translate>
