Debugging and profiling
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
The Compute Canada national clusters offer a variety of debugging and profiling tools. Some are command-line tools; others have a graphical user interface and require an X11 connection. Debugging sessions should be conducted in an interactive job, not on a login node.
GNU Debugger (gdb)
Please see the GDB page.
PGI Debugger (pgdb)
ARM Debugger (ddt)
Please see the ARM software page.
GNU Profiler (gprof)
Valgrind
Valgrind is a powerful debugging tool to detect bad memory usage. It can detect memory leaks, but also access to unallocated or deallocated memory, multiple deallocation or other bad memory usage. If your program ends with a segmentation fault, broken pipe or bus error, you most likely have such a problem in your code.
Valgrind is installed on most of the Calcul Québec clusters and is available through a module. To know the exact name of the module on the server you are using, run the following command:
[name@server ~]$ module avail 2>&1 | grep valgrind
Preparing your application
To get useful information from Valgrind, you first need to compile your code with debugging information enabled. With the GNU and Intel compilers, you do so by adding the "-g" option when compiling. For other compilers, check their documentation.
Some aggressive optimizations may yield false errors in Valgrind if they result in operations that Valgrind does not support. This is the case, for example, with some operations implemented in the MKL library. Since you want to diagnose errors in your own code rather than in those libraries, you should compile and link your code against non-optimized versions of the libraries (such as the Netlib implementation of BLAS/LAPACK), which do not perform those operations. This is of course only for diagnosing issues. When the time comes to run real simulations, you should link against the optimized libraries.
Using Valgrind
Once your code is compiled with the proper options, you execute it within Valgrind with the following command:
[name@server ~]$ valgrind --tool=memcheck --leak-check=yes --show-reachable=yes ./your_program
For more information about Valgrind, we recommend this page.
Words of wisdom
- When you run your code in Valgrind, your application is executed within a virtual machine that validates every memory access. It will therefore run much slower than usual. Choose the size of the problem to test with caution, much smaller than what you would usually run.
- You do not need to run the exact same problem that results in a segmentation fault to detect memory issues in your code. Very frequently, memory access problems, such as reading data outside the bounds of an array, go undetected for small problem sizes but cause a segmentation fault for large ones. Valgrind will detect even the slightest access outside the bounds of an array.
Some typical error messages
Here are some problems that Valgrind will help you detect, and the error messages that it will produce.
Memory leak
The error message for a memory leak is given at the end of the program execution, and looks like this:
==2116== 100 bytes in 1 blocks are definitely lost in loss record 1 of 1
==2116==    at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==2116==    by 0x804840F: main (in /home/cprogram/example1)
Invalid pointer access/out of bound errors
If you attempt to read or write to an unallocated pointer or outside of the allocated memory, the error message will look like this:
==9814== Invalid write of size 1
==9814==    at 0x804841E: main (example2.c:6)
==9814==  Address 0x1BA3607A is 0 bytes after a block of size 10 alloc'd
==9814==    at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==9814==    by 0x804840F: main (example2.c:5)
Usage of uninitialized variables
If your program's behaviour depends on an uninitialized variable, for example in a condition, you will get an error message such as:
==17943== Conditional jump or move depends on uninitialised value(s)
==17943==    at 0x804840A: main (example3.c:6)