abaqus licensing lmstat -c $LM_LICENSE_FILE -a | grep "Users of" | egrep "cae|standard|explicit"
</source>
<!--T:20858-->
When the output of query I) above indicates that a job for a particular username is queued, it means the job has entered the "R"unning state from the perspective of <code>squeue -j jobid</code> or <code>sacct -j jobid</code> and is therefore sitting idle on a compute node waiting for a license. This has the same impact on your account priority as if the job were performing computations and consuming CPU time. Eventually, when sufficient licenses become available, the queued job will start.
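For instance, you can cross-check what the scheduler reports against the lmstat output above; the job ID below is a placeholder, substitute your own:

<source lang="bash">
# Placeholder job ID; the state shows RUNNING even while the solver waits for a license
squeue -j 12345678 -o "%i %T %M %N"
sacct -j 12345678 --format=JobID,State,Elapsed,NodeList
</source>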
==== Example ==== <!--T:20661-->
<!--T:20662-->
The following shows a situation where a user submitted two 6-core jobs (each requiring 10 tokens) in quick succession. The Slurm scheduler then started each job on a different node, in the order they were submitted. Since the user had 10 Abaqus compute tokens, the first job (27527287) was able to acquire exactly enough (10) tokens for the solver to begin running. The second job (27527297), not having access to any more tokens, entered an idle "queued" state (as can be seen from the lmstat output) until the first job completed, wasting the allocated resources and lowering the user's fairshare level in the process:
<source lang="console">
[roberpj@gra-login1:~] sq
   JOBID     USER      ACCOUNT           NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
27530366  roberpj cc-debug_cpu  scriptsp2.txt   R    9:56:13     1    6        N/A      8G gra107 (None)
27530407  roberpj cc-debug_cpu  scriptsp2.txt   R    9:59:37     1    6        N/A      8G gra292 (None)

[roberpj@gra-login1:~] abaqus licensing lmstat -c $LM_LICENSE_FILE -a | egrep "Users|start|queued"
Users of abaqus:  (Total of 78 licenses issued;  Total of 53 licenses in use)
    roberpj gra107 /dev/tty (v62.6) (license3.sharcnet.ca/27050 1042), start Mon 11/25 17:15, 10 licenses
    roberpj gra292 /dev/tty (v62.6) (license3.sharcnet.ca/27050 125) queued for 10 licenses
</source>
<!--T:20663-->
To avoid license shortage problems when submitting multiple jobs that consume expensive Abaqus tokens, either use a [https://docs.alliancecan.ca/wiki/Running_jobs#Cancellation_of_jobs_with_dependency_conditions_which_cannot_be_met job dependency] or a [https://docs.alliancecan.ca/wiki/Job_arrays job array], or at the very least set up a Slurm [https://docs.alliancecan.ca/wiki/Running_jobs#Email_notification email notification] so you know when your job completes before manually submitting another one.
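For example, a minimal sketch of the dependency approach, assuming a hypothetical submission script named <code>abaqus-job.sh</code>, is to capture the first job ID and chain the second job to it:

<source lang="bash">
# Submit the first job and capture its job ID (--parsable prints only the ID)
jobid1=$(sbatch --parsable abaqus-job.sh)
# Start the second job only after the first has finished (any exit state),
# so the two jobs never compete for the same tokens at once
sbatch --dependency=afterany:${jobid1} abaqus-job.sh
</source>

Alternatively, email notification only requires two extra directives in the submission script (the address below is a placeholder):

<source lang="bash">
#SBATCH --mail-user=username@example.com
#SBATCH --mail-type=END
</source>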
<translate>
=== Specify job resources === <!--T:20859-->
To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your Slurm script. The values can be determined by submitting a few short test jobs to the queue and then checking their utilization. For <b>completed</b> jobs, use <code>seff JobNumber</code> to show the total <i>Memory Utilized</i> and <i>Memory Efficiency</i>. If the <i>Memory Efficiency</i> is less than ~90%, decrease the value of the <code>#SBATCH --mem=</code> setting in your Slurm script accordingly. Note that the <code>seff JobNumber</code> command also shows the total <i>CPU (time) Utilized</i> and <i>CPU Efficiency</i>; if the <i>CPU Efficiency</i> is less than ~90%, perform scaling tests to determine the optimal number of CPUs and then update the value of <code>#SBATCH --cpus-per-task=</code> in your Slurm script. For <b>running</b> jobs, use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES of each Abaqus parent process on the compute node. The %CPU and %MEM columns display the percent usage relative to the total available on the node, while the RES column shows the per-process resident memory size (in human-readable format for values over 1GB). Further information on how to [[Running jobs#Monitoring_jobs|monitor jobs]] is available on our documentation wiki.
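As an illustration, the tuning loop might look like the following; the job ID is taken from a short test run and the final resource values are placeholders to be replaced with your measured usage:

<source lang="bash">
# Inspect a completed test job (reports Memory/CPU Utilized and Efficiency)
seff 29821580

# Or watch a running test job directly on its compute node
srun --jobid=29821580 --pty top -d 5 -u $USER

# Then tighten the requests in the Slurm script to match what was actually used, e.g.
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
</source>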