NVTOP

From Alliance Doc

NVTOP stands for Neat Videocard TOP, an (h)top-like task monitor for GPUs and accelerators. It can handle multiple GPUs and displays information about them in a way that will feel familiar to htop users.

Because a picture is worth a thousand words, here is a screenshot of NVTOP:

[Image: NVTOP.png]


Monitor GPU usage

NVTOP can monitor one or several GPUs, showing each device's utilization and memory usage. You can also select specific devices from the menu (F2 -> GPU Select).

NVTOP is useful for monitoring your job and verifying that it uses the GPU as efficiently as possible.

Monitor batch job

Use this approach if you have submitted a non-interactive (batch) job and want to see its current GPU usage.

1. From a login node, find the ID of the job you want to monitor:

[name@server ~]$ sq

2. Attach to the running job:

[name@server ~]$ srun --pty --jobid JOBID nvtop
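If you usually want to watch your most recent running job, the two steps above can be combined into a small helper. This is only a sketch, not an official tool: it assumes Slurm's `squeue` and `srun` are on your `PATH` (as they are on the login nodes), it calls `squeue` directly instead of `sq`, and the function names `latest_running_job` and `nvtop_latest` are hypothetical. It treats the highest job ID as the most recently submitted job, which is usually but not always true; array job IDs (such as `12345_7`) are not handled specially.

```shell
#!/usr/bin/env bash
# Sketch: attach nvtop to your most recently submitted running job.
# Assumes squeue and srun are on PATH; the function names are hypothetical.

# Print the highest job ID among your RUNNING jobs (empty if there are none).
latest_running_job() {
  local user="${USER:-$(id -un)}"
  # -h: no header, -t RUNNING: running jobs only, -o '%i': print only the job ID
  squeue -u "$user" -t RUNNING -h -o '%i' | sort -n | tail -n 1
}

# Attach nvtop to the job reported by latest_running_job.
nvtop_latest() {
  local jobid
  jobid="$(latest_running_job)"
  if [[ -z "$jobid" ]]; then
    echo "No running job found for ${USER:-$(id -un)}" >&2
    return 1
  fi
  srun --pty --jobid "$jobid" nvtop
}
```

After sourcing the file (for example with `source nvtop_helpers.sh`, a hypothetical file name), run `nvtop_latest` from a login node.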

Monitor interactive job

1. Start your interactive job with minimal resources.

2. In a second terminal, connect to the login node and find the job ID:

[name@server ~]$ sq

3. Attach to the running job:

[name@server ~]$ srun --pty --jobid JOBID nvtop

You will then be able to see the GPU usage in real time as you run your commands in the first terminal.

Monitor a GPU on a specific node

When running multi-node jobs, it can be useful to verify that the GPUs on one node, or on all nodes, are actually being used.

1. From a login node, find the job ID and identify the names of the nodes the job runs on:

[name@server ~]$ sq
[name@server ~]$ srun --jobid JOBID -n1 -c1 scontrol show hostname


2. Attach to the running job on the specific node:

[name@server ~]$ srun --pty --jobid JOBID --nodelist NODENAME nvtop
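For jobs spanning many nodes, the two steps above can be wrapped in a helper that lists the nodes and attaches NVTOP to one of them by position. This is a sketch under stated assumptions: it expands the job's node list with `squeue -j JOBID -h -o '%N'` and `scontrol show hostnames` instead of the `srun ... scontrol` step shown above, it requires bash 4 or later (for `mapfile`), and the function names `job_nodes` and `nvtop_on_node` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list the nodes of a running job and attach nvtop to one of them.
# Assumes squeue, scontrol and srun are on PATH; function names are hypothetical.

# Print one node name per line for the given job ID.
job_nodes() {
  local jobid="$1" nodelist
  # %N prints the compressed node list, e.g. "gpu[001-003]"
  nodelist="$(squeue -j "$jobid" -h -o '%N')"
  [[ -n "$nodelist" ]] || return 1
  # Expand the compressed list into individual host names
  scontrol show hostnames "$nodelist"
}

# Attach nvtop to the Nth node (0-based, default 0) of the given job.
nvtop_on_node() {
  local jobid="$1" index="${2:-0}"
  local -a nodes
  mapfile -t nodes < <(job_nodes "$jobid")
  if (( index < 0 || index >= ${#nodes[@]} )); then
    echo "Job $jobid has ${#nodes[@]} node(s); index $index is out of range" >&2
    return 1
  fi
  srun --pty --jobid "$jobid" --nodelist "${nodes[$index]}" nvtop
}
```

For example, `job_nodes JOBID` prints the node names, and `nvtop_on_node JOBID 1` attaches to the second node. Exiting NVTOP with q returns you to the login node, after which you can check the next node.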