abaqus licensing lmstat -c $LM_LICENSE_FILE -a | grep "Users of" | egrep "cae|standard|explicit"
</source>
<translate>

<!--T:20858-->
When the output of query I) above indicates that a job for a particular username is queued, this means the job has entered the Running ("R") state from the perspective of <code>squeue -j jobid</code> or <code>sacct -j jobid</code> and is therefore sitting idle on a compute node waiting for a license.  This has the same impact on your account priority as if the job were performing computations and consuming CPU time.  Eventually, when sufficient licenses become available, the queued job will start.
</translate>
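For instance, the following commands (a minimal sketch; <code>jobid</code> is a placeholder) compare what Slurm reports with what the license server reports for such a job:

<source lang="bash">
# Slurm reports the job as RUNNING even while it sits idle waiting for licenses:
squeue -j jobid -o "%.18i %.9T %.10M %R"

# The license server shows whether the job's tokens are checked out or queued:
abaqus licensing lmstat -c $LM_LICENSE_FILE -a | grep "queued"
</source>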
 
==== Example ==== <!--T:20661-->
 
<!--T:20662-->
The following shows the situation where a user submitted two 6-core jobs (each requiring 10 tokens) in quick succession.  The Slurm scheduler then started each job on a different node, in the order they were submitted.  Since the user had 10 Abaqus compute tokens, the first job (27530366) was able to acquire exactly enough (10) tokens for the solver to begin running.  The second job (27530407), not having access to any more tokens, entered an idle "queued" state (as can be seen from the lmstat output) until the first job completed, wasting the available resources and lowering the user's fair-share priority in the process.


 [roberpj@gra-login1:~] sq
      JOBID    USER      ACCOUNT          NAME  ST  TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
   27530366 roberpj cc-debug_cpu scriptsp2.txt   R    9:56:13     1    6        N/A      8G gra107 (None)
   27530407 roberpj cc-debug_cpu scriptsp2.txt   R    9:59:37     1    6        N/A      8G gra292 (None)

 [roberpj@gra-login1:~] abaqus licensing lmstat -c $LM_LICENSE_FILE -a | egrep "Users|start|queued"
 Users of abaqus:  (Total of 78 licenses issued;  Total of 53 licenses in use)
     roberpj gra107 /dev/tty (v62.6) (license3.sharcnet.ca/27050 1042), start Mon 11/25 17:15, 10 licenses
     roberpj gra292 /dev/tty (v62.6) (license3.sharcnet.ca/27050 125) queued for 10 licenses
 
<!--T:20663-->
To avoid license shortages when submitting multiple jobs that consume expensive Abaqus tokens, either use a [https://docs.alliancecan.ca/wiki/Running_jobs#Cancellation_of_jobs_with_dependency_conditions_which_cannot_be_met job dependency] or a [https://docs.alliancecan.ca/wiki/Job_arrays job array], or at the very least set up a Slurm [https://docs.alliancecan.ca/wiki/Running_jobs#Email_notification email notification] so you know when a job completes before manually submitting another one.
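As a minimal sketch of these approaches (the script name is taken from the example above and the email address is a placeholder):

<source lang="bash">
# Submit the first job and capture its job ID.
jobid=$(sbatch --parsable scriptsp2.txt)

# Submit the second job so it only starts after the first one finishes,
# ensuring the two jobs never compete for the same Abaqus tokens.
sbatch --dependency=afterany:${jobid} scriptsp2.txt

# Or submit both as a job array that runs at most one task at a time (%1):
sbatch --array=1-2%1 scriptsp2.txt

# Or add these directives to your job script to be emailed when a job ends,
# then submit the next job manually:
#SBATCH --mail-user=your.email@example.com
#SBATCH --mail-type=END
</source>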


<translate>
=== Specify job resources === <!--T:20859-->
To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your Slurm script.  The values can be determined by submitting a few short test jobs to the queue and then checking their utilization.  For <b>completed</b> jobs, use <code>seff JobNumber</code> to show the total <i>Memory Utilized</i> and <i>Memory Efficiency</i>.  If the <i>Memory Efficiency</i> is less than ~90%, decrease the value of the <code>#SBATCH --mem=</code> setting in your Slurm script accordingly.  Notice that the <code>seff JobNumber</code> command also shows the total <i>CPU (time) Utilized</i> and <i>CPU Efficiency</i>.  If the <i>CPU Efficiency</i> is less than ~90%, perform scaling tests to determine the number of CPUs that gives the best performance, then update the value of <code>#SBATCH --cpus-per-task=</code> in your Slurm script.  For <b>running</b> jobs, use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES for each Abaqus parent process on the compute node.  The %CPU and %MEM columns display the percentage usage relative to the total available on the node, while the RES column shows the per-process resident memory size (in human-readable format for values over 1GB).  Further information regarding how to [[Running jobs#Monitoring_jobs|monitor jobs]] is available on our documentation wiki.
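A minimal sketch of this workflow is shown below; the job ID, memory, and CPU values are placeholders only and should be replaced with your own:

<source lang="bash">
# Request the resources you believe the job needs in your Slurm script:
#SBATCH --cpus-per-task=6
#SBATCH --mem=8G

# After the test job completes, check how much of the request was actually used:
seff 29821580        # reports Memory Efficiency and CPU Efficiency

# While the job is still running, watch its processes on the compute node:
srun --jobid=29821580 --pty top -d 5 -u $USER
</source>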