(Updating to match new version of source page) |
||
Line 5: | Line 5: | ||
= Using your own license = | = Using your own license = | ||
Abaqus software modules are available on our clusters; however, you must provide your own license. To configure your account on a cluster, log in and create a file named < | Abaqus software modules are available on our clusters; however, you must provide your own license. To configure your account on a cluster, log in and create a file named <code>$HOME/.licenses/abaqus.lic</code> containing the following two lines which support versions 202X and 6.14.1 respectively. Next, replace <code>port@server</code> with the flexlm port number and server IP address (or fully qualified hostname) of your Abaqus license server. | ||
{{File | {{File | ||
Line 18: | Line 18: | ||
= Cluster job submission = | = Cluster job submission = | ||
Below are prototype Slurm scripts for submitting thread and mpi based parallel simulations to single or multiple compute nodes. Most users will find it sufficient to use one of the <i>project directory script's</i> provided in the Single Node Computing sections. The optional "memory=" argument found in the last line of the scripts is intended for larger memory or problematic jobs where 3072MB offset value may require tuning. A listing of all Abaqus command line arguments can be obtained by loading an Abaqus module and running: <code>abaqus -help | less</code>. | Below are prototype Slurm scripts for submitting thread and mpi-based parallel simulations to single or multiple compute nodes. Most users will find it sufficient to use one of the <i>project directory script's</i> provided in the Single Node Computing sections. The optional "memory=" argument found in the last line of the scripts is intended for larger memory or problematic jobs where 3072MB offset value may require tuning. A listing of all Abaqus command line arguments can be obtained by loading an Abaqus module and running: <code>abaqus -help | less</code>. | ||
Single Node jobs that run less than one day should find the <i>project directory script</i> located in the first tab sufficient. | Single Node jobs that run less than one day should find the <i>project directory script</i> located in the first tab sufficient. However, single node jobs that run for more than a day should use one of the restart scripts. Jobs that create large restart files will benefit by writing to local disk through the use of the SLURM_TMPDIR environment variable utilized in the <i>temporary directory scripts</i> provided in the two rightmost tabs of the single node standard and explicit analysis sections. The restart scripts shown here will continue jobs that have been terminated early for some reason. Such job failures can occur if a job reaches its maximum requested runtime before completing and is killed by the queue or if the compute node the job was running on crashed due to an unexpected hardware failure. Other restart types are possible by further tailoring of the input file (not shown here) to continue a job with additional steps or change the analysis (see the documentation for version specific details). | ||
Jobs that require large memory or larger compute resources (beyond that which a single compute node can provide) should use the mpi scripts in the Multiple Node sections below to distribute computing over arbitrary node ranges determined automatically by the scheduler. Short scaling test jobs should be run to determine wall clock times (and memory requirements) as a function of the number of cores (2, 4, 8, etc.) to determine the optimal number before running any long jobs. | Jobs that require large memory or larger compute resources (beyond that which a single compute node can provide) should use the mpi scripts in the Multiple Node sections below to distribute computing over arbitrary node ranges determined automatically by the scheduler. Short scaling test jobs should be run to determine wall clock times (and memory requirements) as a function of the number of cores (2, 4, 8, etc.) to determine the optimal number before running any long jobs. | ||
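The scaling test described above can be sketched as a small dry-run helper. This is only an illustration: the function name <code>scaling_commands</code> and the script name <code>myjob.sh</code> are hypothetical, and it assumes the submission script takes its core count from Slurm rather than hardcoding it in the Abaqus command. Options given on the <code>sbatch</code> command line override the matching <code>#SBATCH</code> directives inside the script.

```shell
#!/bin/bash
# Sketch only: prints the sbatch commands for a scaling test instead of
# submitting them. Function and script names are hypothetical.

# scaling_commands SCRIPT CORES...
# Emit one sbatch command line per requested core count.
scaling_commands() {
    local script=$1
    shift
    local n
    for n in "$@"; do
        echo "sbatch --cpus-per-task=${n} --job-name=scale_${n} ${script}"
    done
}

# Example: short test jobs on 2, 4 and 8 cores.
scaling_commands myjob.sh 2 4 8
```

Once the test jobs finish, the wall clock time and peak memory of each can be compared with <code>sacct -j jobid --format=JobID,NCPUS,Elapsed,MaxRSS</code>.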
Line 176: | Line 176: | ||
=== Multiple Node Computing === | === Multiple Node Computing === | ||
Users with large memory or compute needs (and correspondingly large licenses) can use the following script to perform mpi-based computing over | Users with large memory or compute needs (and correspondingly large licenses) can use the following script to perform mpi-based computing over an arbitrary range of nodes ideally left to the scheduler to automatically determine. A companion template script to perform restart multi-node jobs is not currently provided due to additional limitations when they can be used. | ||
{{File | {{File | ||
Line 416: | Line 416: | ||
</source> | </source> | ||
To completely satisfy the recommended "MEMORY TO OPERATIONS REQUIRED MINIMIZE I/O" (MRMIO) value at least the same amount of non-swapped physical memory (RES) must be available to Abaqus. Since the RES will in general be less than the virtual memory (VIRT) by some relatively constant amount for a given simulation, it is necessary to slightly over allocate the requested Slurm node memory <code>-mem=</code>. In the above sample slurm script this over-allocation has been hardcoded to a conservative value of 3072MB based on initial testing of the standard Abaqus solver. To avoid long queue wait times associated with large values of MRMIO, it may be worth investigating the simulation performance impact associated with reducing the RES memory that is made available to Abaqus significantly below the MRMIO. This can be done by lowering the <code>-mem=</code> value which in turn will set an artificially low value of <code>memory=</code> in the Abaqus command (found in the last line of the slurm script). In doing this one should be careful the RES does not dip below the "MINIMUM MEMORY REQUIRED" (MMR) otherwise Abaqus will exit due to "Out Of Memory" (OOM). As an example, if your MRMIO is 96GB try running a series of short test jobs with <code>#SBATCH --mem=8G, 16G, 32G, 64G</code> until an acceptable minimal performance impact is found, noting that smaller values will result in increasingly larger scratch space used by temporary files. | To completely satisfy the recommended "MEMORY TO OPERATIONS REQUIRED MINIMIZE I/O" (MRMIO) value at least the same amount of non-swapped physical memory (RES) must be available to Abaqus. Since the RES will in general be less than the virtual memory (VIRT) by some relatively constant amount for a given simulation, it is necessary to slightly over allocate the requested Slurm node memory <code>-mem=</code>. 
In the above sample slurm script, this over-allocation has been hardcoded to a conservative value of 3072MB based on initial testing of the standard Abaqus solver. To avoid long queue wait times associated with large values of MRMIO, it may be worth investigating the simulation performance impact associated with reducing the RES memory that is made available to Abaqus significantly below the MRMIO. This can be done by lowering the <code>-mem=</code> value which in turn will set an artificially low value of <code>memory=</code> in the Abaqus command (found in the last line of the slurm script). In doing this one should be careful the RES does not dip below the "MINIMUM MEMORY REQUIRED" (MMR) otherwise Abaqus will exit due to "Out Of Memory" (OOM). As an example, if your MRMIO is 96GB try running a series of short test jobs with <code>#SBATCH --mem=8G, 16G, 32G, 64G</code> until an acceptable minimal performance impact is found, noting that smaller values will result in increasingly larger scratch space used by temporary files. | ||
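As a rough sketch of the test series described above, the following dry run prints one submission command per trial <code>--mem</code> value together with the <code>memory=</code> value Abaqus would then receive. It assumes the slurm script derives <code>memory=</code> by subtracting the 3072MB offset from the node memory that Slurm reports in MB. The function name <code>abaqus_memory_arg</code> and the script name <code>myjob.sh</code> are hypothetical.

```shell
#!/bin/bash
# Sketch only: the 3072 MB offset is the value quoted in the text above;
# the function and script names are hypothetical.

OFFSET_MB=3072

# abaqus_memory_arg GB
# Convert a Slurm --mem value in whole GB to the Abaqus memory= argument,
# assuming the script subtracts OFFSET_MB from the Slurm memory in MB.
abaqus_memory_arg() {
    local mem_gb=$1
    local mem_mb=$(( mem_gb * 1024 ))
    echo "$(( mem_mb - OFFSET_MB ))MB"
}

# Dry run: print each trial submission rather than submitting it.
for gb in 8 16 32 64; do
    echo "sbatch --mem=${gb}G myjob.sh   # Abaqus memory=$(abaqus_memory_arg "$gb")"
done
```

Before submitting, compare each resulting <code>memory=</code> value against the MMR reported for the model, since values below the MMR will cause Abaqus to exit.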
= Graphical use = | = Graphical use = | ||
Line 465: | Line 465: | ||
== SHARCNET license == | == SHARCNET license == | ||
SHARCNET provides a small but free license consisting of 2 cae and 35 execute tokens where usage limits are imposed 10 tokens/user and 15 tokens/group. For groups that have purchased dedicated tokens the free token usage limits are added to their reservation. The free tokens are available on a first come first serve basis and mainly intended for testing and light usage before deciding whether or not to purchase dedicated tokens. The costs for dedicated tokens in cdn are approximately 110 per compute token and 400 per gui token, submit a ticket to request an official quote. The license can be used by any Alliance researcher, but only on SHARCNET hardware. Groups that purchase dedicated tokens to run on the SHARCNET license server may likewise only use them on SHARCNET hardware including gra-vdi (for running abaqus in full graphical mode) and graham or dusky clusters (for submitting compute batch jobs to the queue). Before you can use the license you must contact our [[Technical support]] and request access. In your email 1) mention that it is for use on SHARCNET systems and 2) include a copy/paste of the following < | SHARCNET provides a small but free license consisting of 2 cae and 35 execute tokens where usage limits are imposed 10 tokens/user and 15 tokens/group. For groups that have purchased dedicated tokens, the free token usage limits are added to their reservation. The free tokens are available on a first come first serve basis and mainly intended for testing and light usage before deciding whether or not to purchase dedicated tokens. The costs for dedicated tokens in cdn are approximately 110 per compute token and 400 per gui token, submit a ticket to request an official quote. The license can be used by any Alliance researcher, but only on SHARCNET hardware. 
Groups that purchase dedicated tokens to run on the SHARCNET license server may likewise only use them on SHARCNET hardware including gra-vdi (for running abaqus in full graphical mode) and graham or dusky clusters (for submitting compute batch jobs to the queue). Before you can use the license you must contact our [[Technical support]] and request access. In your email 1) mention that it is for use on SHARCNET systems and 2) include a copy/paste of the following <code>License Agreement</code> statement with your full name and username entered in the indicated locations. Please note that every user must do this ie) cannot be done one time only for a group (including PIs who have purchased their own dedicated tokens). | ||
<b>o License agreement</b> | <b>o License agreement</b> | ||
Line 477: | Line 477: | ||
1) on SHARCNET hardware where the software is already installed | 1) on SHARCNET hardware where the software is already installed | ||
2) in affiliation with a | 2) in affiliation with a Canadian degree-granting academic institution | ||
3) for education, institutional or instruction purposes and not for any commercial | 3) for education, institutional or instruction purposes and not for any commercial | ||
or contract related purposes where results are not publishable | or contract-related purposes where results are not publishable | ||
4) for experimental, theoretical and/or digital research work, undertaken primarily | 4) for experimental, theoretical and/or digital research work, undertaken primarily | ||
to acquire new knowledge of the underlying foundations of phenomena and observable | to acquire new knowledge of the underlying foundations of phenomena and observable | ||
Line 495: | Line 495: | ||
</source> | </source> | ||
If your Abaqus jobs fail with error message [*** ABAQUS/eliT_CheckLicense rank 0 terminated by signal 11 (Segmentation fault)] in the slurm output file verify your <code>abaqus.lic</code> file contains ABAQUSLM_LICENSE_FILE to use abaqus/2020. If your Abaqus jobs fail with error message starting [License server machine is down or not responding etc.] in the output file verify your <code>abaqus.lic</code> file contains LM_LICENSE_FILE to use abaqus/6.14.1 as shown. The <code>abaqus.lic</code> file shown contains both so you should not see this problem. | If your Abaqus jobs fail with the error message [*** ABAQUS/eliT_CheckLicense rank 0 terminated by signal 11 (Segmentation fault)] in the slurm output file, verify if your <code>abaqus.lic</code> file contains ABAQUSLM_LICENSE_FILE to use abaqus/2020. If your Abaqus jobs fail with an error message starting [License server machine is down or not responding, etc.] in the output file verify your <code>abaqus.lic</code> file contains LM_LICENSE_FILE to use abaqus/6.14.1 as shown. The <code>abaqus.lic</code> file shown contains both so you should not see this problem. | ||
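A quick way to run the check described above is sketched below. The helper name <code>check_abaqus_lic</code> is hypothetical, and the check only confirms that each variable name appears somewhere in the file. It does not validate the <code>port@server</code> value or contact the license server.

```shell
#!/bin/bash
# Sketch only: check_abaqus_lic is a hypothetical helper name.

# check_abaqus_lic FILE
# Report whether FILE mentions each license variable; return 1 if any is missing.
check_abaqus_lic() {
    local lic_file=$1
    local var
    local status=0
    for var in ABAQUSLM_LICENSE_FILE LM_LICENSE_FILE; do
        # -w matches whole words only, so LM_LICENSE_FILE is not satisfied
        # by the longer name ABAQUSLM_LICENSE_FILE.
        if grep -qw -- "$var" "$lic_file" 2>/dev/null; then
            echo "found:   $var"
        else
            echo "missing: $var"
            status=1
        fi
    done
    return $status
}
```

For example, run <code>check_abaqus_lic "$HOME/.licenses/abaqus.lic"</code> on the cluster. A "missing" line for ABAQUSLM_LICENSE_FILE corresponds to the abaqus/2020 failure and one for LM_LICENSE_FILE to the abaqus/6.14.1 failure.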
<b>o Query license server</b> | <b>o Query license server</b> | ||
Line 526: | Line 526: | ||
</source> | </source> | ||
When the output of query I) above indicates that a job for a particular username is "queued" this means the job has entered the "R"unning state from the perspective of <code>squeue -j jobid</code> or <code>sacct -j jobid</code> and is therefore idle on a compute node waiting for a license. This will have the same impact on your account priority as if the job were performing computations and consuming | When the output of query I) above indicates that a job for a particular username is "queued" this means the job has entered the "R"unning state from the perspective of <code>squeue -j jobid</code> or <code>sacct -j jobid</code> and is therefore idle on a compute node waiting for a license. This will have the same impact on your account priority as if the job were performing computations and consuming cpu time. Eventually when sufficient licenses come available the "queued" job will "start". To demonstrate, the following shows the license server and queue output for the situation where a user submits two jobs, but only the first job acquires enough licenses to start: | ||
[roberpj@dus241:~] sq | [roberpj@dus241:~] sq | ||
Line 564: | Line 564: | ||
</source> | </source> | ||
Once configured, submit your job as described in the < | Once configured, submit your job as described in the <code>Cluster job submission</code> section above. If there are any problems submit a problem ticket to [[Technical support|technical support]]. Specify that you are using the Abaqus Western license on dusky and provide the failed job number along with a paste of any error messages as applicable. | ||
= Online Documentation = | = Online Documentation = |