(Updating to match new version of source page) |
(Updating to match new version of source page) |
||
Line 18: | Line 18: | ||
= Cluster job submission = | = Cluster job submission = | ||
Below are | Below are prototype slurm scripts for submitting thread- and MPI-based parallel simulations to single or multiple compute nodes. Most users will find it sufficient to use one of the <i>project directory scripts</i> provided in the Single Node Computing section. The optional "memory=" argument found in the last line of the scripts is intended for larger-memory or problematic jobs where the 3072MB offset value may require tuning. A listing of all abaqus command line arguments can be obtained by loading an abaqus module and running: <code>abaqus -help | less</code>. Single node jobs that run for less than one day should find the <i>project directory script</i> located in the first tab sufficient. Single node jobs that run for more than a day, however, should use one of the restart scripts. Jobs that create large restart files will benefit from writing to local disk through the SLURM_TMPDIR environment variable used in the <i>temporary directory scripts</i> provided in the two rightmost tabs of the Single Node standard and explicit analysis sections. The restart scripts shown here will continue jobs that have been terminated early. Such failures can occur if a job reaches its maximum requested runtime before completing and is killed by the queue, or if the compute node the job was running on crashes due to an unexpected hardware failure. Other restart types are possible by further tailoring the input file (not shown here) to continue a job with additional steps or change the analysis (see the documentation for version-specific details). Jobs that require more memory or compute resources than a single compute node can provide should use the MPI scripts in the Multiple Node sections below to distribute computing over arbitrary node ranges determined automatically by the scheduler. 
Before running any long jobs, run short scaling tests to measure wall clock times (and memory requirements) as a function of the number of cores (2, 4, 8, etc.) and determine the optimal number. | ||
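One way to compare scaling test results is to convert the measured wall clock times into a speedup and a parallel efficiency relative to the smallest run. The sketch below is only an illustration: the helper name <code>scaling_efficiency</code> is hypothetical, and the timings passed to it are placeholders, not measurements from any real Abaqus job.

```shell
#!/bin/bash
# Hypothetical helper (not part of Abaqus or Slurm): print the speedup and
# parallel efficiency of one scaling test relative to a baseline run.
# Arguments: baseline cores, baseline seconds, test cores, test seconds.
scaling_efficiency() {
    awk -v c0="$1" -v t0="$2" -v c="$3" -v t="$4" 'BEGIN {
        speedup = t0 / t      # how much faster than the baseline run
        ideal   = c / c0      # speedup expected from perfect scaling
        printf "cores=%d speedup=%.2f efficiency=%.0f%%\n", c, speedup, 100 * speedup / ideal
    }'
}

# Placeholder timings (assumed values for illustration only).
scaling_efficiency 2 3600 4 2000
scaling_efficiency 2 3600 8 1500
```

With these placeholder numbers the efficiency drops noticeably between 4 and 8 cores, which is the kind of result that would argue for requesting the smaller core count.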
== Standard Analysis == | == Standard Analysis == | ||
Line 536: | Line 536: | ||
<b>o Specify job resources</b> | <b>o Specify job resources</b> | ||
To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your slurm script. The values can be determined by submitting a few short test jobs to the queue and then checking their utilization. For <b>completed</b> jobs use <code>seff JobNumber</code> to show the total "Memory Utilized" and "Memory Efficiency"; if the "Memory Efficiency" is less than ~90%, decrease the value of the "#SBATCH --mem=" setting in your slurm script accordingly. Notice that the <code>seff JobNumber</code> command also shows the total "CPU (time) Utilized" and "CPU Efficiency"; if the "CPU Efficiency" is less than ~90%, perform scaling tests to determine the optimal number of CPUs and then update the value of "#SBATCH --cpus-per-task=" in your slurm script. For <b>running</b> jobs use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES for each abaqus parent process on the compute node; the %CPU and %MEM columns display the percent usage relative to the total available on the node, while the RES column shows the per-process resident memory size (in human-readable format for values over 1 GB). Further information regarding | To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your slurm script. The values can be determined by submitting a few short test jobs to the queue and then checking their utilization. For <b>completed</b> jobs use <code>seff JobNumber</code> to show the total "Memory Utilized" and "Memory Efficiency"; if the "Memory Efficiency" is less than ~90%, decrease the value of the "#SBATCH --mem=" setting in your slurm script accordingly. 
Notice that the <code>seff JobNumber</code> command also shows the total "CPU (time) Utilized" and "CPU Efficiency"; if the "CPU Efficiency" is less than ~90%, perform scaling tests to determine the optimal number of CPUs and then update the value of "#SBATCH --cpus-per-task=" in your slurm script. For <b>running</b> jobs use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES for each abaqus parent process on the compute node; the %CPU and %MEM columns display the percent usage relative to the total available on the node, while the RES column shows the per-process resident memory size (in human-readable format for values over 1 GB). Further information regarding how to [https://docs.computecanada.ca/wiki/Running_jobs#Monitoring_jobs Monitor Jobs] is available in our documentation wiki. | ||
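The ~90% check described above can be applied to <code>seff</code> output with a small filter. This is a minimal sketch rather than a supported tool: the function name <code>check_seff</code> is hypothetical, and it assumes the report contains lines of the form "CPU Efficiency: NN.NN% of ..." and "Memory Efficiency: NN.NN% of ...", which may differ between seff versions. The sample input is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read a seff report on stdin and compare the CPU and
# memory efficiency against a threshold (default 90, as suggested above).
# Assumes lines like "CPU Efficiency: 83.33% of 00:12:00 core-walltime".
check_seff() {
    awk -v threshold="${1:-90}" '
        /^(CPU|Memory) Efficiency:/ {
            pct = $3
            sub(/%/, "", pct)   # strip the trailing percent sign
            status = (pct + 0 < threshold) ? "below" : "at or above"
            printf "%s efficiency %s%% is %s the %s%% threshold\n", $1, pct, status, threshold
        }'
}

# Made-up sample report lines; on the cluster one would instead run:
#   seff JobNumber | check_seff
printf 'CPU Efficiency: 83.33%% of 00:12:00 core-walltime\nMemory Efficiency: 95.00%% of 4.00 GB\n' | check_seff
```

A value reported as below the threshold is a prompt to revisit "#SBATCH --mem=" or "#SBATCH --cpus-per-task=" as described in the paragraph above.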
<b>o Core token mapping</b> | <b>o Core token mapping</b> |