(more specific) |
(updates for 2025 RAC and multi-instance GPUs) |
||
Line 36: | Line 36: | ||
It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources. | It is even possible that you could end a month or even a year having run more work than your allocation would seem to allow, although this is unlikely given the demand on our resources. | ||
=Reference GPU Units= <!--T:45--> | =Reference GPU Units (RGUs)= <!--T:45--> | ||
<!--T:46--> | <!--T:46--> | ||
GPU performance has increased dramatically in recent years and continues to do so. Until RAC 2023 we treated all GPUs as equivalent to each other for allocation purposes. This caused problems both in the allocation process and while running jobs, so in the 2024 RAC year we introduced the <i>reference GPU unit</i>, or <b>RGU</b>, to rank all GPU models in production and alleviate these problems. In the 2025 RAC year we will also have to deal with new complexity involving [[Multi-Instance GPU]] technology. | |||
<!--T:47--> | <!--T:47--> | ||
Because roughly half of our users primarily use single-precision floating-point operations ([https://en.wikipedia.org/wiki/Single-precision_floating-point_format FP32]), the other half use half-precision floating-point operations ([https://en.wikipedia.org/wiki/Half-precision_floating-point_format FP16], dense matrices), and a significant portion of all users are constrained by the amount of memory on the GPU, we chose the following evaluation criteria and corresponding weights to rank the different GPU models: | |||
<!--T:48--> | <!--T:48--> | ||
{| class="wikitable" style="margin: auto;" | {| class="wikitable" style="margin: auto;" | ||
|- | |- | ||
! scope="col"| Evaluation | ! scope="col"| Evaluation Criterion | ||
! scope="col"| Weight | ! scope="col"| Weight | ||
|- | |- | ||
! scope="row"| FP32 score | ! scope="row"| FP32 score | ||
| 40% | | 40% | ||
|- | |- | ||
! scope="row"| FP16 score | ! scope="row"| FP16 score | ||
| 40% | | 40% | ||
|- | |- | ||
! scope="row"| GPU memory score | ! scope="row"| GPU memory score | ||
| 20% | | 20% | ||
|} | |} | ||
<!--T:49--> | <!--T:49--> | ||
We currently use the NVIDIA <b>A100-40gb</b> GPU as the reference model and assign it an RGU value of 4.0 for historical reasons. We define its FP16 performance, FP32 performance, and memory size each as 1.0. Multiplying the percentages in the above table by 4.0 yields the following coefficients and RGU values for other models: | |||
<!--T:50--> | <!--T:50--> | ||
{| class="wikitable" style="margin: auto; text-align: center;" | {| class="wikitable" style="margin: auto; text-align: center;" | ||
|+ RGU scores for whole GPU models | |||
|- | |- | ||
| | | | ||
Line 71: | Line 71: | ||
! scope="col"| FP16 score | ! scope="col"| FP16 score | ||
! scope="col"| Memory score | ! scope="col"| Memory score | ||
! scope="col"| | ! scope="col"| Combined score | ||
! colspan="2" scope="col"| Available | |||
! scope="col"| Allocatable | |||
|- | |- | ||
! scope="col"| | ! scope="col"| Coefficient: | ||
! scope="col"| 1.6 | ! scope="col"| 1.6 | ||
! scope="col"| 1.6 | ! scope="col"| 1.6 | ||
! scope="col"| 0.8 | ! scope="col"| 0.8 | ||
| (RGU) | ! scope="col"| (RGU) | ||
! scope="col"| Now | |||
! scope="col"| 2025 | |||
! scope="col"| RAC 2025 | |||
|- | |- | ||
! scope="row" | ! scope="row" | H100-80gb | ||
| 3.44 || 3.17 || 2.0 || 12.2 || No || Yes || Yes | |||
|- | |- | ||
! scope="row"| | ! scope="row"| A100-80gb | ||
| 0. | | 1.00 || 1.00 || 2.0 || 4.8 || No || ? || No | ||
| | |- | ||
| 0. | ! scope="row"| A100-40gb | ||
! | | <b>1.00</b> || <b>1.00</b> || <b>1.0</b> || <b>4.0</b> || Yes || Yes || Yes | ||
|- | |||
! scope="row"| V100-32gb | |||
| 0.81 || 0.40 || 0.8 || 2.6 || Yes || ? || No | |||
|- | |- | ||
! scope="row"| | ! scope="row"| V100-16gb | ||
| 0. | | 0.81 || 0.40 || 0.4 || 2.2 || Yes || ? || No | ||
| 0. | |||
| 0.4 | |||
|- | |- | ||
! scope="row"| T4-16gb | ! scope="row"| T4-16gb | ||
| 0.42 | | 0.42 || 0.21 || 0.4 || 1.3 || Yes || ? || No | ||
| 0.21 | |- | ||
| 0.4 | ! scope="row"| P100-16gb | ||
! 1. | | 0.48 || 0.03 || 0.4 || 1.1 || Yes || No || No | ||
|- | |- | ||
! scope="row"| | ! scope="row"| P100-12gb | ||
| 0. | | 0.48 || 0.03 || 0.3 || 1.0 || Yes || No || No | ||
| 0. | |} | ||
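The combined score in the table above is a weighted sum of the three relative scores. The following Python sketch is an illustration only, not an official tool: the function name <code>rgu_score</code> and the dictionary layout are assumptions made for this example, and the scores are copied from the table. Because the table shows rounded scores, recomputing from them can differ from the published RGU value by about 0.1 for some models.

```python
# Coefficients: the 40% / 40% / 20% weights multiplied by the reference RGU of 4.0.
COEFFICIENTS = {"fp32": 1.6, "fp16": 1.6, "memory": 0.8}


def rgu_score(fp32: float, fp16: float, memory: float) -> float:
    """Weighted sum of scores, each expressed relative to the A100-40gb (= 1.0)."""
    return (
        COEFFICIENTS["fp32"] * fp32
        + COEFFICIENTS["fp16"] * fp16
        + COEFFICIENTS["memory"] * memory
    )


# Relative scores (FP32, FP16, memory) copied from the table above.
SCORES = {
    "H100-80gb": (3.44, 3.17, 2.0),
    "A100-80gb": (1.00, 1.00, 2.0),
    "A100-40gb": (1.00, 1.00, 1.0),
    "T4-16gb": (0.42, 0.21, 0.4),
}

for model, scores in SCORES.items():
    print(f"{model}: {rgu_score(*scores):.1f} RGU")
```

For the reference model every score is 1.0, so the sum of the coefficients gives back the reference value of 4.0.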
| | |||
With the 2025 [[infrastructure renewal]] it will become possible to schedule a fraction of a GPU using [[multi-instance GPU]] technology. Different jobs, potentially belonging to different users, can run on the same GPU at the same time. Following [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#terminology NVIDIA's terminology], a fraction of a GPU allocated to a single job is called a "GPU instance", also sometimes called a "MIG instance". | |||
The following table lists the GPU models and instances that can be selected in the CCDB form for RAC 2025. RGU values for GPU instances have been estimated from whole-GPU performance numbers and the fraction of the GPU which comprises the instance. | |||
{| class="wikitable" style="margin: auto; text-align: center;" | |||
|+ GPU models and instances available for RAC 2025 | |||
|- | |- | ||
! | ! Model or instance !! Fraction of GPU !! RGU | ||
! | |||
|- | |- | ||
! scope="row"| A100-40gb | ! scope="row"| A100-40gb | ||
| | | Whole GPU ⇒ 100% || 4.0 | ||
| | |- | ||
| | ! scope="row"| A100-3g.20gb | ||
! | | max(3g/7g, 20GB/40GB) ⇒ 50% || 2.0 | ||
|- | |||
! scope="row"| A100-4g.20gb | |||
| max(4g/7g, 20GB/40GB) ⇒ 57% || 2.3 | |||
|- | |||
! scope="row"| H100-80gb | |||
| Whole GPU ⇒ 100% || 12.2 | |||
|- | |||
! scope="row"| H100-1g.10gb | |||
| max(1g/7g, 10GB/80GB) ⇒ 14% || 1.7 | |||
|- | |||
! scope="row"| H100-2g.20gb | |||
| max(2g/7g, 20GB/80GB) ⇒ 28% || 3.5 | |||
|- | |||
! scope="row"| H100-3g.40gb | |||
| max(3g/7g, 40GB/80GB) ⇒ 50% || 6.1 | |||
|- | |- | ||
! scope="row"| | ! scope="row"| H100-4g.40gb | ||
| | | max(4g/7g, 40GB/80GB) ⇒ 57% || 7.0 | ||
| | |||
| | |||
|} | |} | ||
Note: a GPU instance of profile <b>1g</b> is worth 1/7 of an A100 or H100 GPU. For the <b>3g</b> profiles, the memory fraction (50%) is larger than the compute fraction (3/7), so the memory fraction determines the RGU value. | |||
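The estimate described above takes the larger of the compute fraction and the memory fraction and scales the whole-GPU RGU by it. The sketch below illustrates that rule in Python; the function <code>instance_rgu</code> and its parameter names are hypothetical, and the memory sizes are read from the profile names (for example, 10 GB for <b>1g.10gb</b>).

```python
def instance_rgu(
    compute_slices: int,
    instance_mem_gb: float,
    gpu_mem_gb: float,
    whole_gpu_rgu: float,
    total_slices: int = 7,
) -> float:
    """Estimate the RGU of a GPU instance from the whole-GPU RGU.

    The instance is charged for the larger of its compute share
    (slices out of 7) and its memory share.
    """
    fraction = max(compute_slices / total_slices, instance_mem_gb / gpu_mem_gb)
    return fraction * whole_gpu_rgu


# (compute slices, instance memory in GB, whole-GPU memory in GB, whole-GPU RGU)
PROFILES = {
    "A100-3g.20gb": (3, 20, 40, 4.0),
    "A100-4g.20gb": (4, 20, 40, 4.0),
    "H100-1g.10gb": (1, 10, 80, 12.2),
    "H100-3g.40gb": (3, 40, 80, 12.2),
}

for name, args in PROFILES.items():
    print(f"{name}: {instance_rgu(*args):.1f} RGU")
```

For the 3g profiles the memory share (one half) exceeds the compute share (3/7), which is why they are valued at half of the whole GPU.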
==Choosing GPU models for your project== <!--T:52--> | ==Choosing GPU models for your project== <!--T:52--> | ||
Line 139: | Line 154: | ||
* If your applications (typically AI-related) primarily perform FP16 operations (including mixed-precision operations or other [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format floating-point formats]), using an A100-40gb is evaluated as using 4x the resources of a P100-12gb, but it can perform ~30x the calculations in the same amount of time, which would allow you to complete ~7.5x the computations for the same RGU charge. | * If your applications (typically AI-related) primarily perform FP16 operations (including mixed-precision operations or other [https://en.wikipedia.org/wiki/Bfloat16_floating-point_format floating-point formats]), using an A100-40gb is evaluated as using 4x the resources of a P100-12gb, but it can perform ~30x the calculations in the same amount of time, which would allow you to complete ~7.5x the computations for the same RGU charge. | ||
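The trade-off in the FP16 example is simple arithmetic: divide the speed-up by the ratio of RGU charges. A minimal Python sketch, in which the function name <code>work_per_rgu_gain</code> is made up for illustration and the ~30x speed-up is the approximate figure quoted above:

```python
def work_per_rgu_gain(speedup: float, rgu_new: float, rgu_old: float) -> float:
    """How much more work the same RGU budget completes on the faster GPU."""
    cost_ratio = rgu_new / rgu_old  # how much more the faster GPU is charged
    return speedup / cost_ratio


# A100-40gb (4.0 RGU) versus P100-12gb (1.0 RGU) for an FP16-bound workload.
print(work_per_rgu_gain(speedup=30.0, rgu_new=4.0, rgu_old=1.0))  # prints 7.5
```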
== | ==RAC awards hold RGU values constant== <!--T:55--> | ||
<!--T:56--> | <!--T:56--> | ||
* During the Resource Allocation Competition | * During the Resource Allocation Competition (RAC), any proposal asking for GPUs must specify the preferred GPU model for the project. Then, in the CCDB form, the number of reference GPU units (RGUs) is calculated automatically from the requested number of gpu-years per project year. | ||
** For example, if you select the <i>narval-gpu</i> resource and request 13 gpu-years of the model A100-40gb, the corresponding amount of RGUs would be 13 * 4.0 = 52. The RAC committee would then allocate up to 52 RGUs, depending on the proposal score. | ** For example, if you select the <i>narval-gpu</i> resource and request 13 gpu-years of the model A100-40gb, the corresponding amount of RGUs would be 13 * 4.0 = 52. The RAC committee would then allocate up to 52 RGUs, depending on the proposal score. If your allocation must be moved to a different site, the committee will allocate gpu-years at that site so as to keep the amount of RGUs the same. | ||
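The conversion between gpu-years and RGUs, in both directions, can be sketched as follows. This is an illustration under stated assumptions, not the CCDB implementation: the function names are hypothetical, the RGU values come from the tables above, and the move to H100-80gb GPUs is an invented scenario.

```python
# RGU value per whole GPU, taken from the tables above.
RGU_PER_GPU = {"A100-40gb": 4.0, "H100-80gb": 12.2}


def requested_rgus(gpu_years: float, model: str) -> float:
    """RGUs corresponding to a request expressed in gpu-years of one model."""
    return gpu_years * RGU_PER_GPU[model]


def equivalent_gpu_years(rgus: float, model: str) -> float:
    """gpu-years of another model that keep the number of RGUs the same."""
    return rgus / RGU_PER_GPU[model]


rgus = requested_rgus(13, "A100-40gb")
print(rgus)  # prints 52.0

# Hypothetical move to a site with H100-80gb GPUs: same RGUs, fewer gpu-years.
print(f"{equivalent_gpu_years(rgus, 'H100-80gb'):.2f} gpu-years")
```

Because the RGU total is held constant, a move to a faster GPU model yields fewer gpu-years, and a move to a slower model yields more.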
=Detailed effect of resource usage on priority= <!--T:10--> | =Detailed effect of resource usage on priority= <!--T:10--> |