<!--T:64-->
With the 2025 [[infrastructure renewal]], it will become possible to schedule a fraction of a GPU using [[multi-instance GPU]] technology. Different jobs, potentially belonging to different users, can run on the same GPU at the same time. Following [https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#terminology NVIDIA's terminology], a fraction of a GPU allocated to a single job is called a <i>GPU instance</i>, also sometimes called a <i>MIG instance</i>.
<!--T:65-->
<!--T:67-->
Note: a GPU instance of profile <b>1g</b> is worth 1/7 of an A100 or H100 GPU. The <b>3g</b> profile is weighted to take into account its extra amount of memory per <b>g</b>.
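One plausible way to express this kind of accounting is to charge the larger of the compute-slice share and the memory share, so that a profile with disproportionately more memory costs more than its <b>g</b> count alone suggests. This is only a sketch under that assumption, not the scheduler's actual formula; the function name and factors are illustrative:

```python
def mig_fraction(g_slices, mem_gb, total_mem_gb, total_slices=7):
    """Fraction of a full GPU charged for a MIG instance, taken as the
    larger of its compute-slice share and its memory share.
    Illustrative accounting only -- the real weights are site-defined."""
    return max(g_slices / total_slices, mem_gb / total_mem_gb)

# 1g.5gb on a 40GB A100: the compute share (1/7) dominates.
print(mig_fraction(1, 5, 40))   # → ~0.143, i.e. 1/7 of the GPU
# 3g.20gb on a 40GB A100: the memory share (1/2) exceeds 3/7,
# so the instance is charged more than its 3 slices alone would imply.
print(mig_fraction(3, 20, 40))  # → 0.5
```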
==Choosing GPU models for your project== <!--T:52-->
<!--T:20-->
Research groups are charged for the maximum number of core equivalents they take from the resources. Assuming a core equivalent of 1 core and 4GB of memory:
* [[File:Two_core_equivalents.png|frame|Figure 2 - Two core equivalents.]] Research groups using more cores than memory (above the 1 core/4GB memory ratio) will be charged by cores. For example, consider a research group requesting two cores with 2GB per core, for a total of 4GB of memory. The request requires two core equivalents' worth of cores but only one bundle for memory. This job request will therefore be counted as 2 core equivalents when priority is calculated. See Figure 2. <br clear=all>
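The bundle arithmetic above amounts to charging by whichever resource needs more bundles. A minimal sketch, using the 1 core/4GB ratio given in the text (the function name is illustrative):

```python
def core_equivalents(cores, mem_gb, mem_per_core_gb=4.0):
    """Core equivalents charged for a request: the larger of the core
    count and the memory expressed in 4GB-per-core bundles."""
    return max(cores, mem_gb / mem_per_core_gb)

# The example from the text: two cores with 2GB each (4GB total).
# The cores need 2 bundles; the memory needs only 1, so the charge is 2.
print(core_equivalents(2, 4))  # → 2

# The reverse case: 1 core with 8GB of memory is charged as 2 core
# equivalents, because the memory spans two 4GB bundles.
print(core_equivalents(1, 8))  # → 2.0
```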
<!--T:21-->
<b>Note:</b> While the scheduler will compute the priority based on the usage calculated with the above bundles, users requesting multiple GPUs per node also have to take into account the physical ratios.

=Viewing resource usage in the portal= <!--T:68-->
<!--T:69-->
[[File:Slurm portal land edit.png|thumb|alt=usage portal landing view|Usage portal landing view. (Click on the image for a larger version.)]]
[https://portal.alliancecan.ca/slurm portal.alliancecan.ca/slurm] provides an interface for exploring time-series data about jobs on our national clusters. The page contains a figure that can display several usage metrics. When you first log in to the site, the figure will display your CPU days on the Cedar cluster across all project accounts that you have access to. If you have no usage on Cedar, the figure will contain the text <i>No Data or usage too small to have a meaningful plot</i>. The data appearing in the figure can be modified by the control panels along the left margin of the page. There are three panels:
* Select system and dates
* Parameters
<br clear=all>
==Displaying a specified account== <!--T:70-->
[[File:Slurm portal account usage edit.png|thumb|alt=usage display of a specified account|Usage display of a specified account]]
If you have access to more than one [[Running_jobs#Accounts_and_projects|Slurm account]], the <i>Select user’s account</i> pull-down menu of the <i>SLURM account</i> panel lets you select which project account will be displayed in the figure window. If <i>Select user’s account</i> is left empty, the figure will display all of your usage across accounts on the specified cluster during the selected time period. The <i>Select user’s account</i> pull-down menu is populated by a list of all the accounts that have job records on the selected cluster during the selected time interval. Other accounts that you have access to, but that have no usage on the selected cluster during the selected time interval, will also appear in the pull-down menu but will be grayed out and not selectable, as they would not generate a figure. When you select a single project account, the figure is updated and the summary panel titled <i>Allocation Information</i> is populated with details of the project account. The height of each bar in the histogram figure corresponds to the metric for that day (e.g. CPU-equivalent days) across all users in the account on the system. The top eight users are displayed in unique colors, stacked on top of the summed metric for all other users in gray. You can navigate the figure using the [https://plotly.com/graphing-libraries/ Plotly] tools (zoom, pan, etc.) whose icons appear at the top right when you hover your mouse over the figure window. You can also use the legend on the right-hand side to manipulate the figure. Single-clicking an item will toggle its presence in the figure, and double-clicking it will toggle all the other items off or on.
<br clear=all>

==Displaying the allocation target and queued resources== <!--T:71-->
[[File:Slurm portal account usage queued edit.png|thumb|alt=Allocation target and queued resources displayed on usage figure|Allocation target and queued resources displayed on usage figure]]
When a single account has been selected for display, the <i>Allocation target</i> is shown as a horizontal red line. It can be turned off or on with the <i>Display allocation target by default</i> item in the <i>Parameters</i> panel, or by clicking on <i>Allocation target</i> in the legend to the right of the figure.
<!--T:72-->
You can toggle the display of the <i>Queued jobs</i> metric, which presents a sum of all resources in pending jobs at each time point, by clicking on the words <i>Queued jobs</i> in the legend to the right of the figure.
<br clear=all>
==Selecting a specific cluster and time interval== <!--T:73-->
[[File:Slurm portal select sys date.png|thumb|alt=Select a specific cluster and time interval|Select a specific cluster and time interval]]
The figure shows your usage for a single cluster over a specified time interval. The <i>System</i> pull-down menu contains entries for each of the currently active national clusters that use Slurm as a scheduler. You can use the <i>Start date (incl.)</i> and <i>End date (incl.)</i> fields in the <i>Select system and dates</i> panel to change the time interval displayed in the figure. It will include all jobs on the specified cluster that were in a running (R) or pending (PD) state during the time interval, including both the start and end date. Selecting an end date in the future will display the <i>projection</i> of currently running and pending jobs for their requested duration into the future.
<br clear=all>

==Displaying usage over an extended time period into the future== <!--T:74-->
[[File:Slurm portal account use duration edit.png|thumb|alt=Displaying usage over an extended period into the future|Displaying usage over an extended period into the future]]
If you select an end time after the present time, the figure will have a transparent red area overlaid on the future time labelled <i>Projection</i>. In this projection period, each job is assumed to run to the time limit requested for it. For queued resources, the projection supposes that each pending job starts at the beginning of the projected time (that is, right now) and runs until its requested time limit. This is not intended to be a forecast of actual future events!
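The projection rule described above can be sketched as follows: a running job contributes its remaining requested time, and a pending job is assumed to start now and run its full requested limit. Field names and the hour-based clock here are illustrative, not the portal's actual data model:

```python
def projected_hours(job, now):
    """Hours of projected future usage for one job under the rule above.
    `job` is a dict with a Slurm-style state code, a start time, and a
    requested time limit, all in hours on an arbitrary clock."""
    if job["state"] == "R":
        # Running job: assumed to run from `now` to its requested limit.
        return max(0.0, job["start"] + job["limit"] - now)
    elif job["state"] == "PD":
        # Pending job: assumed to start right now and run its full limit.
        return job["limit"]
    return 0.0

now = 100.0  # current time, in hours on the same arbitrary clock
jobs = [
    {"state": "R",  "start": 90.0, "limit": 24.0},  # 14 h remaining
    {"state": "PD", "start": None, "limit": 12.0},  # assumed to start now
]
print(sum(projected_hours(j, now) for j in jobs))  # → 26.0
```

As the text warns, this is a bookkeeping convention, not a forecast: real jobs usually end before their limit, and pending jobs rarely all start at once.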
<br clear=all>

==Metrics, summation, and running jobs== <!--T:75-->
[[File:Slurm portal parameter panel.png|thumb|alt=Parameters of the usage series histogram|Parameters of the usage series histogram]]
Use the <i>Metric</i> pull-down control in the <i>Parameters</i> panel to select from the following metrics: CPU, CPU-equivalent, RGU, RGU-equivalent, Memory, Billing, gpu, and all specific GPU models available on the selected cluster.
<!--T:76-->
The <i>Summation</i> pull-down allows you to switch between the daily <i>Total</i> and <i>Running total</i>. If you select <i>Total</i>, each bar of the histogram represents the total usage in that one day. If you select <i>Running total</i>, each bar represents the sum of that day's usage and all previous days back to the beginning of the time interval. If the <i>Allocation Target</i> is displayed, it is similarly adjusted to show the running total of the target usage. See the next section for more.
<!--T:77-->
If you set <i>Include Running jobs</i> to <i>No</i>, the figure shows only data from records of completed jobs. If you set it to <i>Yes</i>, it includes data from running jobs too.

<!--T:78-->
<br clear=all>
==Display of the running total of account usage== <!--T:79-->
[[File:Slurm portal account use cumulative edit.png|thumb|alt=Display of the running total of account usage|Display of the running total of account usage]]
When displaying the running total of the usage for a single account along with the <i>Allocation target</i>, the usage histogram shows how an account deviates from its target share over the period displayed. The values in this view are the cumulative sums across days of the <i>Total</i> summation view, for both the usage and the allocation target. When an account submits jobs that request more than the account's target share, the cumulative usage is expected to oscillate above and below the cumulative target share if the scheduler is managing fair share properly. Because the scheduler uses a decay period for the impact of past usage, a good interval for inspecting the scheduler's performance in maintaining the account's fair share is the past 30 days.
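The relationship between the daily totals and the running-total view is a plain cumulative sum, and healthy fair-share behaviour then shows up as the usage curve crossing back and forth over the target curve. A minimal sketch with invented numbers:

```python
from itertools import accumulate

# Daily usage for one account (CPU-equivalent days), and a flat
# allocation target of 10 per day; both series are invented examples.
daily_usage  = [10, 0, 25, 5, 0, 30, 0]
daily_target = [10] * 7

# The "Running total" view is the cumulative sum of the "Total" view,
# applied to both the usage and the allocation target.
usage_running  = list(accumulate(daily_usage))
target_running = list(accumulate(daily_target))

# With fair share working, usage oscillates around the target:
for day, (u, t) in enumerate(zip(usage_running, target_running), start=1):
    side = "over" if u > t else "under/at"
    print(f"day {day}: usage {u:3d} vs target {t:3d} ({side})")
```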
<br clear=all>

=Viewing resource usage in CCDB= <!--T:35-->

<!--T:36-->
==GPU usage and Reference GPU Units (RGUs)== <!--T:41-->
[[File:Rgu en.png|thumb|alt=GPU usage|GPU usage summary with Reference GPU Unit (RGU) breakdown table.]]
For resource allocation projects that have GPU usage, the table is broken down into usage on various GPU models and measured in RGUs.
<br clear=all>