38,757
edits
(Updating to match new version of source page) |
(Updating to match new version of source page) |
||
Line 5: | Line 5: | ||
= Using your own license = | = Using your own license = | ||
Abaqus software modules are available on our clusters however you must provide your own license. | Abaqus software modules are available on our clusters; however, you must provide your own license. To configure your account on a cluster, log in and create a file named <tt>$HOME/.licenses/abaqus.lic</tt> containing the following two lines which support versions 202X and 6.14.1 respectively. Next, replace <code>port@server</code> with the flexlm port number and server IP address (or fully qualified hostname) of your Abaqus license server. | ||
{{File | {{File | ||
Line 14: | Line 14: | ||
}} | }} | ||
If your license has not been set up for | If your license has not been set up for use on an Alliance cluster, some additional configuration changes by the Alliance system administrator and your local system administrator will need to be done. Such changes are necessary to ensure the flexlm and vendor TCP ports of your Abaqus server are reachable from all cluster compute nodes when jobs are run via the queue. So we may help you get this done, open a ticket with [[Technical support|technical support]]. Please be sure to include the following three items: flexlm port number, static vendor port number, IP address of your Abaqus license server. You will then be sent a list of cluster IP addresses so that your administrator can open the local server firewall to allow connections from the cluster on both ports. Please note that a special license agreement must generally be negotiated and signed by SIMULIA and your institution before a local license may be used remotely on Alliance hardware. | ||
= Cluster job submission = | = Cluster job submission = | ||
Below are prototype | Below are prototype Slurm scripts for submitting thread and mpi based parallel simulations to single or multiple compute nodes. Most users will find it sufficient to use one of the <i>project directory script's</i> provided in the Single Node Computing sections. The optional "memory=" argument found in the last line of the scripts is intended for larger memory or problematic jobs where 3072MB offset value may require tuning. A listing of all Abaqus command line arguments can be obtained by loading an Abaqus module and running: <code>abaqus -help | less</code>. Single Node jobs that run less than one day should find the <i>project directory script</i> located in the first tab sufficient. Single node jobs that run for more than a day however should use one of the restart scripts. Jobs that create large restart files will benefit by writing to local disk through the use of the SLURM_TMPDIR environment variable utilized in the <i>temporary directory scripts</i> provided in the two rightmost tabs of the single node standard and explicit analysis sections. The restart scripts shown here will continue jobs that have been terminated early for some reason. Such job failures can occur if a job reaches its maximum requested runtime before completing and is killed by the queue or if the compute node the job was running on crashed due to an unexpected hardware failure. Other restart types are possible by further tailoring of the input file (not shown here) to continue a job with additional steps or change the analysis (see the documentation for version specific details). Jobs that require large memory or larger compute resources (beyond that which a single compute node can provide) should use the mpi scripts in the Multiple Node sections below to distribute computing over arbitrary node ranges determined automatically by the scheduler. Short scaling test jobs should be run to determine wall clock times (and memory requirements) as a function of the number of cores (2, 4, 8, etc.) to determine the optimal number before running any long jobs. | ||
== Standard Analysis == | == Standard Analysis == | ||
Line 61: | Line 61: | ||
To check the completed restart information do: | To check the completed restart information do: | ||
cat testsp1.msg | grep "STARTS\|COMPLETED\|WRITTEN" | cat testsp1.msg | grep "STARTS\|COMPLETED\|WRITTEN" | ||
Some simulations may benefit by adding the following to the | Some simulations may benefit by adding the following to the Abaqus command at the bottom of the script: | ||
order_parallel=OFF | order_parallel=OFF | ||
</tab> | </tab> | ||
Line 387: | Line 387: | ||
== Node memory == | == Node memory == | ||
An estimate for the total slurm node memory (--mem=) required for a simulation to run fully in ram (without being virtualized to scratch disk) can be obtained by examining the | An estimate for the total slurm node memory (--mem=) required for a simulation to run fully in ram (without being virtualized to scratch disk) can be obtained by examining the Abaqus output <code>test.dat</code> file. For example, a simulation that requires a fairly large amount of memory might show: | ||
<source lang="bash"> | <source lang="bash"> | ||
Line 412: | Line 412: | ||
</source> | </source> | ||
To completely satisfy the recommended "MEMORY TO OPERATIONS REQUIRED MINIMIZE I/O" (MRMIO) value at least the same amount of non-swapped physical memory (RES) must be available to | To completely satisfy the recommended "MEMORY TO OPERATIONS REQUIRED MINIMIZE I/O" (MRMIO) value at least the same amount of non-swapped physical memory (RES) must be available to Abaqus. Since the RES will in general be less than the virtual memory (VIRT) by some relatively constant amount for a given simulation, it is necessary to slightly over allocate the requested Slurm node memory <code>-mem=</code>. In the above sample slurm script this over-allocation has been hardcoded to a conservative value of 3072MB based on initial testing of the standard Abaqus solver. To avoid long queue wait times associated with large values of MRMIO, it may be worth investigating the simulation performance impact associated with reducing the RES memory that is made available to Abaqus significantly below the MRMIO. This can be done by lowering the <code>-mem=</code> value which in turn will set an artificially low value of <code>memory=</code> in the Abaqus command (found in the last line of the slurm script). In doing this one should be careful the RES does not dip below the "MINIMUM MEMORY REQUIRED" (MMR) otherwise Abaqus will exit due to "Out Of Memory" (OOM). As an example, if your MRMIO is 96GB try running a series of short test jobs with <code>#SBATCH --mem=8G, 16G, 32G, 64G</code> until an acceptable minimal performance impact is found, noting that smaller values will result in increasingly larger scratch space used by temporary files. | ||
= Graphical use = | = Graphical use = | ||
Line 491: | Line 491: | ||
</source> | </source> | ||
If your | If your Abaqus jobs fail with error message [*** ABAQUS/eliT_CheckLicense rank 0 terminated by signal 11 (Segmentation fault)] in the slurm output file verify your <code>abaqus.lic</code> file contains ABAQUSLM_LICENSE_FILE to use abaqus/2020. If your Abaqus jobs fail with error message starting [License server machine is down or not responding etc.] in the output file verify your <code>abaqus.lic</code> file contains LM_LICENSE_FILE to use abaqus/6.14.1 as shown. The <code>abaqus.lic</code> file shown contains both so you should not see this problem. | ||
<b>o Query license server</b> | <b>o Query license server</b> | ||
Line 536: | Line 536: | ||
<b>o Specify job resources</b> | <b>o Specify job resources</b> | ||
To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your slurm script. The values can be determined by submitting a few short test jobs to the queue then checking their utilization. For <b>completed</b> jobs use <code>seff JobNumber</code> to show the total "Memory Utilized" and "Memory Efficiency"; If the "Memory Efficiency" is less than ~90% decrease the value of "#SBATCH --mem=" setting in your slurm script accordingly. Notice that the <code>seff JobNumber</code> command also shows the total "CPU (time) Utilized" and "CPU Efficiency"; If the "CPU Efficiency" is less than ~90% perform scaling tests to determine the optimal number of cpu's for optimal performance and then update the value of then update the value of "#SBATCH --cpus-per-task=" in your slurm script. For <b>running</b> jobs use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES for each | To ensure optimal usage of both your Abaqus tokens and our resources, it's important to carefully specify the required memory and ncpus in your slurm script. The values can be determined by submitting a few short test jobs to the queue then checking their utilization. For <b>completed</b> jobs use <code>seff JobNumber</code> to show the total "Memory Utilized" and "Memory Efficiency"; If the "Memory Efficiency" is less than ~90% decrease the value of "#SBATCH --mem=" setting in your slurm script accordingly. Notice that the <code>seff JobNumber</code> command also shows the total "CPU (time) Utilized" and "CPU Efficiency"; If the "CPU Efficiency" is less than ~90% perform scaling tests to determine the optimal number of cpu's for optimal performance and then update the value of then update the value of "#SBATCH --cpus-per-task=" in your slurm script. For <b>running</b> jobs use the <code>srun --jobid=29821580 --pty top -d 5 -u $USER</code> command to watch the %CPU, %MEM and RES for each Abaqus parent process on the compute node; The %CPU and %MEM columns display the percent usage relative to the total available on the node while the RES column shows the per process resident memory size (in human readable format for values over 1gb). Further information regarding how to [https://docs.computecanada.ca/wiki/Running_jobs#Monitoring_jobs Monitor Jobs] is available in our documentation wiki. | ||
<b>o Core token mapping</b> | <b>o Core token mapping</b> | ||
Line 548: | Line 548: | ||
== Western license == | == Western license == | ||
The Western site license may only be used by Western researchers on hardware located at Western's campus. Currently Dusky cluster is the only system that satisfies these conditions. Graham and gra-vdi are excluded since they are located on Waterloo's campus. Contact the Western | The Western site license may only be used by Western researchers on hardware located at Western's campus. Currently Dusky cluster is the only system that satisfies these conditions. Graham and gra-vdi are excluded since they are located on Waterloo's campus. Contact the Western Abaqus license server administrator <jmilner@robarts.ca> to inquire about using the Western Abaqus license. You will need to provide your username and possibly make arrangements to purchase tokens. If you are granted access then you may proceed to configure your <code>abaqus.lic</code> file to point to the Western license server as follows: | ||
<b>o Configure license file</b> | <b>o Configure license file</b> | ||
Line 560: | Line 560: | ||
</source> | </source> | ||
Once configured, submit your job as described in the <tt>Cluster job submission</tt> section above. If there are any problems submit a problem ticket to [[Technical support|technical support]]. Specify that you are using the | Once configured, submit your job as described in the <tt>Cluster job submission</tt> section above. If there are any problems submit a problem ticket to [[Technical support|technical support]]. Specify that you are using the Abaqus Western license on dusky and provide the failed job number along with a paste of any error messages as applicable. | ||
= Online Documentation = | = Online Documentation = | ||
The full | The full Abaqus documentation (latest version) can be accessed on gra-vdi as shown in the following steps. | ||
Account Preparation: | Account Preparation: |