Debugging and profiling
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
The Compute Canada national clusters offer a variety of debugging and profiling tools, both command-line tools and tools with a graphical user interface; the latter require an X11 connection. Debugging sessions should be run in an interactive job, not on a login node.
GNU Debugger (gdb)
Please see the GDB page.
PGI Debugger (pgdb)
ARM Debugger (ddt)
Please see the ARM software page.
GNU Profiler (gprof)
What is Gprof?
Gprof is a profiler that collects timing information and statistics about your code. When profiling is enabled at compile time, instrumentation is inserted into each function and subroutine of your program. Running the instrumented program then produces a raw data file, which Gprof interprets and turns into profiling statistics.
Gprof is distributed alongside the GNU compilers (such as gcc and gfortran) and is installed on most of the Compute Canada machines.
Preparing your application
Switch to GNU compiler
Load the appropriate GNU compiler. For example, for GCC:
[name@server ~]$ module load gcc/5.4.0
Compile your code
To get useful information from Gprof, you first need to compile your code with profiling enabled. With the GNU compilers, you do so by adding the "-pg" option when compiling and again when linking. This option tells the compiler to generate extra code that writes profile information suitable for analysis. Without it, no call-graph data is gathered, and running gprof in the hope of getting a profile may produce the following error:
gprof: gmon.out file is missing call-graph data
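As a minimal sketch of the compile step, the following script writes a small C program and builds it with "-pg" in two stages, to show that the option is passed both when compiling and when linking. The file names demo.c, demo.o and demo and the function harmonic are hypothetical, and the script assumes that gcc is already on your PATH (for example after loading a GNU compiler module).

```shell
# Hypothetical file names: demo.c, demo.o, demo.
workdir=$(mktemp -d)

# A tiny C program with one function worth profiling.
cat > "$workdir/demo.c" <<'EOF'
#include <stdio.h>

static double harmonic(long n)
{
    double sum = 0.0;
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

int main(void)
{
    printf("%f\n", harmonic(10000000L));
    return 0;
}
EOF

if command -v gcc >/dev/null 2>&1; then
    # -pg is passed when compiling the source file and again when linking.
    gcc -pg -c "$workdir/demo.c" -o "$workdir/demo.o" &&
        gcc -pg "$workdir/demo.o" -o "$workdir/demo" ||
        echo "build with -pg failed on this system" >&2
else
    echo "gcc not found: load a GNU compiler module first" >&2
fi
```

When compiling and linking happen in a single gcc command, "-pg" only needs to appear once on that command line.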
Execute your code
Once your code is compiled with the proper options, you execute it:
[name@server ~]$ /path/to/your/executable arg1 arg2
You execute your code the same way as you would without Gprof profiling; the command line does not change. When the program exits normally, a new file named gmon.out is written to the current working directory. Note that if your code changes its working directory, gmon.out is created in the new working directory. Your program also needs write permission in that directory for gmon.out to be generated.
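Because every profiled run writes to the same file name, a later run replaces the gmon.out left by an earlier one. The helper below is one possible way to keep the data from several runs apart; the function name keep_profile and the naming scheme gmon.LABEL.out are illustrative choices, not Gprof conventions.

```shell
# keep_profile LABEL [DIR]: rename DIR/gmon.out (default: current directory)
# to gmon.LABEL.out so that the next run does not overwrite it.
keep_profile() {
    label="$1"
    dir="${2:-.}"
    if [ ! -f "$dir/gmon.out" ]; then
        echo "no gmon.out in $dir: was the program built with -pg, and did it exit normally?" >&2
        return 1
    fi
    mv "$dir/gmon.out" "$dir/gmon.$label.out"
}
```

A typical use would be to call keep_profile small_input right after running the executable, and later pass gmon.small_input.out to gprof in place of gmon.out.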
Get the profiling data
In this step, the gprof tool is run with the name of the binary and the ‘gmon.out’ file from the previous step as arguments. This creates an analysis file with the profiling information.
[name@server ~]$ gprof /path/to/your/executable gmon.out > analysis.txt
The new file analysis.txt contains the report: by default, a flat profile followed by a call graph.
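The report can be long, so it is often convenient to look only at the most expensive functions first. The sketch below assumes the default flat-profile layout printed by gprof (a column header row beginning with "time seconds", one row per function, and a blank line at the end of the table). The function name top_functions is hypothetical.

```shell
# top_functions FILE [N]: print "percent% name" for the first N rows
# (default 5) of the flat profile in a gprof report.
top_functions() {
    awk -v limit="${2:-5}" '
        # The column header row starts with "time seconds".
        $1 == "time" && $2 == "seconds" { in_table = 1; next }
        # A blank line ends the flat profile.
        in_table && NF == 0 { exit }
        in_table {
            print $1 "%", $NF
            if (++shown >= limit) exit
        }
    ' "$1"
}
```

For example, top_functions analysis.txt 10 lists the ten functions with the largest share of the run time. Only the last whitespace-separated field is printed as the name, so names that contain spaces (such as C++ signatures) are truncated.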
Valgrind
Valgrind is a powerful debugging tool for detecting incorrect memory usage. It can detect memory leaks, as well as accesses to unallocated or deallocated memory, multiple deallocations, and other memory errors. If your program ends with a segmentation fault, broken pipe or bus error, you most likely have such a problem in your code.
Valgrind is installed on most of the Calcul Québec clusters and is available through a module. To find the exact name of the module on the server you are using, run the following command:
[name@server ~]$ module avail 2>&1 | grep valgrind
Preparing your application
To get useful information from Valgrind, you first need to compile your code with debugging information enabled. With the GNU and Intel compilers, you do so by adding the "-g" option on compilation. For other compilers, check their documentation.
Some aggressive optimizations may yield false errors in Valgrind if they result in unsupported operations. This is the case, for example, with some operations implemented in the MKL library. Since you want to diagnose errors in your own code rather than in those libraries, you should compile and link your code against non-optimized versions of the libraries (such as the Netlib implementation of BLAS/LAPACK), which do not perform those operations. This applies only while diagnosing issues. When the time comes to run real simulations, you should link against optimized libraries.
Using Valgrind
Once your code is compiled with the proper options, you execute it within Valgrind with the following command:
[name@server ~]$ valgrind --tool=memcheck --leak-check=yes --show-reachable=yes ./your_program
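On a cluster it is often easier to read the report from a file than from the terminal. The wrapper below is a sketch that adds Valgrind's --log-file option to the memcheck options used here; the function name run_memcheck and the log file name valgrind.%p.log are illustrative choices.

```shell
# run_memcheck PROGRAM [ARGS...]: run PROGRAM under Valgrind's memcheck
# tool and write the report to valgrind.<pid>.log in the current directory.
run_memcheck() {
    if ! command -v valgrind >/dev/null 2>&1; then
        echo "valgrind not found: load the valgrind module first" >&2
        return 127
    fi
    # %p in --log-file is replaced by the process ID of the program.
    valgrind --tool=memcheck --leak-check=yes --show-reachable=yes \
        --log-file="valgrind.%p.log" "$@"
}
```

The program's own output still goes to the terminal; only Valgrind's messages are redirected to the log file.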
For more information about valgrind, we recommend this page.
Words of wisdom
- When you run your code in Valgrind, your application is executed within a virtual machine that validates every memory access. It will therefore run much slower than usual. Choose the size of the test problem with care; it should be much smaller than what you would usually run.
- You do not need to run the exact same problem that results in a segmentation fault to detect memory issues in your code. Very frequently, memory access problems, such as reading data outside the bounds of an array, go undetected for small problems but cause a segmentation fault for large ones. Valgrind will detect even the slightest access outside the bounds of an array.
Some typical error messages
Here are some problems that Valgrind will help you detect, and the error messages that it will produce.
Memory leak
The error message for a memory leak is given at the end of the program execution, and looks like this:
==2116== 100 bytes in 1 blocks are definitely lost in loss record 1 of 1
==2116== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==2116== by 0x804840F: main (in /home/cprogram/example1)
Invalid pointer access/out of bound errors
If you attempt to read or write to an unallocated pointer or outside of the allocated memory, the error message will look like this:
==9814== Invalid write of size 1
==9814== at 0x804841E: main (example2.c:6)
==9814== Address 0x1BA3607A is 0 bytes after a block of size 10 alloc'd
==9814== at 0x1B900DD0: malloc (vg_replace_malloc.c:131)
==9814== by 0x804840F: main (example2.c:5)
Usage of uninitialized variables
If you use an uninitialized variable, you will get an error message such as:
==17943== Conditional jump or move depends on uninitialised value(s)
==17943== at 0x804840A: main (example3.c:6)
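When Valgrind's output has been saved to a file, a quick count of these three kinds of messages can tell you where to start reading. The following is a rough sketch that matches the message wording shown in the examples above; the function name summarize_valgrind_log is hypothetical, and other memcheck messages are not counted.

```shell
# summarize_valgrind_log FILE: count three kinds of memcheck messages.
summarize_valgrind_log() {
    log="$1"
    # grep -c exits non-zero when the count is 0, hence "|| true".
    leaks=$(grep -c 'definitely lost in loss record' "$log" || true)
    invalid=$(grep -c -E 'Invalid (read|write) of size' "$log" || true)
    uninit=$(grep -c 'uninitialised value' "$log" || true)
    printf 'leaks=%s invalid=%s uninit=%s\n' "$leaks" "$invalid" "$uninit"
}
```

Each count is a number of matching lines, so a leak reported in several loss records is counted once per record.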