ARM software

= Introduction =


[https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/ddt ARM DDT], formerly known as Allinea DDT, is a powerful commercial parallel debugger with a graphical user interface. It can be used to debug serial, MPI, multi-threaded, and CUDA programs, or any combination of these, written in C, C++, and FORTRAN. [https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/map MAP], an efficient parallel profiler, is another very useful tool from ARM (formerly Allinea).


This software is available on Graham as two separate modules:
* allinea-cpu, for CPU debugging and profiling;
* allinea-gpu, for GPU or mixed CPU/GPU debugging.
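To check which versions of these modules are currently installed, you can query the module system; this is a sketch using the standard Lmod search command, and the versions it reports will depend on the cluster's software stack.

 module spider allinea
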
As this is a GUI application, you should log in using <code>ssh -Y</code>, or use an SSH client like MobaXterm or XQuartz to ensure proper X11 tunnelling. See [[SSH]] for further guidance.
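
For example, from a Linux or macOS terminal you could connect with X11 forwarding enabled as follows; this is only a sketch, where <code>username</code> should be replaced with your own account name and the Graham login address is given purely as an example:

 ssh -Y username@graham.computecanada.ca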


Both DDT and MAP are normally used interactively through their GUI, which is typically accomplished using the <code>salloc</code> command (see below for details). MAP can also be used non-interactively, in which case it can be submitted to the scheduler with the <code>sbatch</code> command.
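
For non-interactive profiling, a job script along the following lines could be submitted with <code>sbatch</code>. This is only a sketch: <code>path/to/code</code> stands for a hypothetical MPI executable, and the module and MPI launch lines may need to be adapted as described in the Usage section below.

 #!/bin/bash
 #SBATCH --time=0-1:00
 #SBATCH --ntasks=4
 #SBATCH --mem-per-cpu=4G
 # Load the profiler module (reload an older OpenMPI first if the module system asks for it).
 module load allinea-cpu
 # --profile runs MAP without its GUI and writes a .map profile file
 # that can be opened later in the MAP GUI.
 map --profile mpirun -np 4 path/to/code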
 


The current license limits the use of DDT/MAP to a maximum of 512 CPU cores across all users at any given time. DDT-GPU is likewise limited to 8 GPUs.


= Usage =
== CPU-only code, no GPUs ==


Allocate the node or nodes on which to do the debugging or profiling with <code>salloc</code>, e.g.:


 salloc --x11 --time=0-1:00 --mem-per-cpu=4G --ntasks=4


This will open a shell session on the allocated node. Then load the appropriate module:


 module load allinea-cpu


This may fail with a suggestion to load an older version of OpenMPI first. If that happens, reload the OpenMPI module with the suggested command, and then reload the allinea-cpu module:


 module load openmpi/2.0.2
 module load allinea-cpu

You can then run the <code>ddt</code> or <code>map</code> command as:

 ddt path/to/code
 map path/to/code

Make sure the MPI implementation is the default "OpenMPI" in the Allinea application window before pressing the Run button. If this is not the case, press the Change button next to the "Implementation:" string and pick the correct option from the drop-down menu.
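
Note that DDT can only show source-level information if the program was compiled with debugging symbols. A minimal sketch, assuming a hypothetical MPI source file <code>mycode.c</code> (the <code>-O0</code> flag is optional, but disabling optimization makes the code easier to step through):

 # compile with debugging symbols and without optimization
 mpicc -g -O0 mycode.c -o mycode
 ddt ./mycode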


When done, exit the shell. This will terminate the allocation.


== CUDA code ==


Allocate the node or nodes on which to do the debugging or profiling with <code>salloc</code>, e.g.:


 salloc --x11 --time=0-1:00 --mem-per-cpu=4G --ntasks=1 --gres=gpu:1


This will open a shell session on the allocated node. Then load the appropriate module:


 module load allinea-gpu


This may fail with a suggestion to load an older version of OpenMPI first. If that happens, reload the OpenMPI module with the suggested command, and then reload the allinea-gpu module:


 module load openmpi/2.0.2
 module load allinea-gpu


Ensure a cuda module is loaded:


 module load cuda

You can then run the <code>ddt</code> command as:

 ddt path/to/code
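
As with CPU code, source-level debugging requires debugging symbols. For CUDA programs this typically means passing <code>-g</code> (for host code) and <code>-G</code> (for device code) to <code>nvcc</code>; this is a minimal sketch assuming a hypothetical source file <code>mykernel.cu</code>:

 # compile host and device code with debugging symbols
 nvcc -g -G mykernel.cu -o mykernel
 ddt ./mykernel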


When done, exit the shell. This will terminate the allocation.


= Known issues =

== MPI DDT ==

* For some reason, the debugger doesn't show queued MPI messages (e.g. when paused in an MPI deadlock).

== OpenMP DDT ==

* The memory debugging module (which is off by default) doesn't work.

== CUDA DDT ==

* The memory debugging module (which is off by default) doesn't work.

== MAP ==

* MAP currently doesn't work correctly on Graham. We are working on resolving this issue. For now, the workaround is to request a SHARCNET account from your Compute Canada account (ccdb), and then run MAP on the development nodes of SHARCNET's legacy cluster orca, using these instructions.