PGPROF
PGPROF is a powerful yet simple tool for analyzing the performance of parallel programs written with OpenMP, MPI, OpenACC, or CUDA. There are two profiling modes: command-line and graphical.
Quickstart guide
Using PGPROF usually consists of two steps:
- Data collection: Run the application with profiling enabled.
- Analysis: Visualize the data produced in the first step.
Both steps can be accomplished in either command-line mode or graphical mode.
Environment modules
Before you start profiling with PGPROF, the appropriate module needs to be loaded.
PGPROF is part of the PGI compiler package, so run module avail pgi to see which versions are available with the compiler, MPI, and CUDA modules you currently have loaded. For a comprehensive list of PGI modules, run module -r spider '.*pgi.*'.
At the time of writing, these were:
- pgi/13.10
- pgi/17.3
Use module load pgi/version to choose a version. For example, to load the PGI compiler version 17.3, run:
[name@server ~]$ module load pgi/17.3
Compile your code
To get useful information from PGPROF, you first need to compile your code with one of the PGI compilers (pgcc for C, pgc++ for C++, pgfortran for Fortran). A Fortran source may need to be compiled with the -g flag.
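For example, an OpenACC Fortran source might be compiled like this; the file name swim.f and the -O2, -acc, and -Minfo=accel options are only illustrative, so use whatever options suit your code:
[name@server ~]$ pgfortran -g -O2 -acc -Minfo=accel -o a.out swim.f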
Working in command-line mode
In command-line mode, two distinct commands are used to collect timing data and to analyze it.
First, use PGPROF to run the application and save the performance data to a file. In this example, the application is a.out and we save the data in a.prof.
[name@server ~]$ pgprof -o a.prof ./a.out
You can later load this data file into graphical mode (see below) using File | Import.
To visualize the performance data in command-line mode:
[name@server ~]$ pgprof -i a.prof
The results are usually divided into several categories:
- GPU kernel execution profile
- CUDA API execution profile
- OpenACC execution profile
- CPU execution profile
======== Profiling result:
Time(%) Time Calls Avg Min Max Name
38.14% 1.41393s 20 70.696ms 70.666ms 70.731ms calc2_198_gpu
31.11% 1.15312s 18 64.062ms 64.039ms 64.083ms calc3_273_gpu
23.35% 865.68ms 20 43.284ms 43.244ms 43.325ms calc1_142_gpu
5.17% 191.78ms 141 1.3602ms 1.3120us 1.6409ms [CUDA memcpy HtoD]
...
======== API calls:
Time(%) Time Calls Avg Min Max Name
92.65% 3.49314s 62 56.341ms 1.8850us 70.771ms cuStreamSynchronize
3.78% 142.36ms 1 142.36ms 142.36ms 142.36ms cuDevicePrimaryCtxRetain
...
======== OpenACC (excl):
Time(%) Time Calls Avg Min Max Name
36.27% 1.41470s 20 70.735ms 70.704ms 70.773ms acc_wait@swim-acc-data.f:223
29.60% 1.15449s 18 64.138ms 64.114ms 64.159ms acc_wait@swim-acc-data.f:302
======== CPU profiling result (bottom up):
Time(%) Time Name
59.09% 8.55785s cudbgGetAPIVersion
59.09% 8.55785s start_thread
59.09% 8.55785s clone
25.75% 3.73007s cuStreamSynchronize
25.75% 3.73007s __pgi_uacc_cuda_wait
25.75% 3.73007s __pgi_uacc_computedone
10.38% 1.50269s swim_mod_calc2_
The output can be restricted to a single category; for example, the option --cpu-profiling shows only the CPU results.
The option --cpu-profiling-mode top-down makes PGPROF display the main routine at the top, with the functions it calls below:
[name@server ~]$ pgprof --cpu-profiling-mode top-down -i a.prof
======== CPU profiling result (top down):
Time(%) Time Name
97.36% 35.2596s main
97.36% 35.2596s MAIN_
32.02% 11.5976s swim_mod_calc3_
29.98% 10.8578s swim_mod_calc2_
25.93% 9.38965s swim_mod_calc1_
6.82% 2.46976s swim_mod_inital_
1.76% 637.36ms __fvd_sin_vex_256
To find out which part of your application takes the longest to run, use the option --cpu-profiling-mode bottom-up, which inverts the call tree so that each function is followed by the functions that call it, working back up to main.
[name@server ~]$ pgprof --cpu-profiling-mode bottom-up -i a.prof
======== CPU profiling result (bottom up):
Time(%) Time Name
32.02% 11.5976s swim_mod_calc3_
32.02% 11.5976s MAIN_
32.02% 11.5976s main
29.98% 10.8578s swim_mod_calc2_
29.98% 10.8578s MAIN_
29.98% 10.8578s main
25.93% 9.38965s swim_mod_calc1_
25.93% 9.38965s MAIN_
25.93% 9.38965s main
3.43% 1.24057s swim_mod_inital_
Working in graphical mode
In graphical mode, both data collection and analysis can be accomplished in the same session. Several steps are needed to collect and visualize performance data in this mode:
- Launch the PGI profiler.
- Since the PGPROF GUI is based on Java, it should be run on a compute node in an interactive session rather than on a login node, as login nodes do not have enough memory (see Java for more details). An interactive session can be started with salloc --x11 ... to enable X11 forwarding (see Interactive jobs for more details); a sample command sequence is given after this list.
- To start a new session, open the File menu and click New Session.
- Select the executable file you want to profile and then add any arguments appropriate for your profiling.
- Click Next, then Finish.
- In the CPU Details tab, click the Show the top-down (callers first) call tree view button.
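As a rough sketch, starting an interactive session with X11 forwarding and launching the profiler could look like the following; the time, memory, and account values are placeholders to adapt to your own job:
[name@server ~]$ salloc --x11 --time=1:00:00 --mem=8000M --account=def-someuser
[name@server ~]$ module load pgi/17.3
[name@server ~]$ pgprof &
Running pgprof with no arguments starts the graphical interface.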
Take note of these four panes in the graphical interface:
- The Timeline: shows all the events ordered by the time they executed
- GPU details: shows performance details for the GPU kernels
- CPU details: shows performance details for the CPU functions
- The Property tab: shows all the details for a selected function in the timeline window
References
PGPROF is a product of PGI, which is a subsidiary of NVIDIA Corporation.