Clusterstats: Difference between revisions

Revision as of 14:49, 30 March 2021

This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.

Cluster Information

Clusterstats displays information on partitions, nodes, jobs, your account, your group(s) and priority.

To run clusterstats, just type the command on a cluster.

[name@server ~]$ clusterstats

Clusterstats may use a cached version of the cluster information, or it may take a few minutes to fetch fresh data.

[✔] Loading node information (success, loaded cached version that is 2 min old)
[✔] Loading job information (success, loaded cached version that is 2 min old)
[✔] Loading share information (success, loaded cached version that is 1 min old)

The clusterstats main menu will appear, asking whether you would like information on your user, your group or the state of the cluster. Use the arrow keys to move through the options and press Enter to make a selection. You can move back a level by selecting back and quit the program by selecting quit.

Information on? (Use arrow keys, press Enter to select)
‣ User
  Group
  Cluster
  Quit

Information on the Cluster

You will be asked a number of questions about which part of the cluster you wish to see and what type of information you wish to display. Once you have answered them, you will see a large table listing all the resources by node type and by the walltime of the job. Notice how the number of available resources/nodes drops as your jobs get longer.

Information on? Cluster
Please select on which part of the cluster would you like more information? CPU, (highmem or large) more than 12 GB of RAM per Core
Information on ? Jobs/Partitions/Nodes for whole node jobs
Please select the information you would like to display? Nodes

          This table shows all available resources in the partition. 
          A resource that is available to run 0-24 hour jobs 
          will show up in the (0-3),(3-12) and (12-24) columns.

┌───────────────────────┬─────────────┬────────┬─────────┬──────────┬─────────┬─────────┬──────────┐
│ cpularge_bynode       │ interactive │ 0-3 hr │ 3-12 hr │ 12-24 hr │ 1-3 day │ 3-7 day │ 7-28 day │
├───────────────────────┼─────────────┼────────┼─────────┼──────────┼─────────┼─────────┼──────────┤
│ Total (Nodes)         │ 2           │ 50     │ 50      │ 50       │ 35      │ 17      │ 7        │
│   cpu=32, Mem=3095000 │ 0           │ 4      │ 4       │ 4        │ 4       │ 1       │ 1        │
│   cpu=32, Mem=1547000 │ 0           │ 24     │ 24      │ 24       │ 16      │ 8       │ 3        │
│   cpu=32, Mem=515000  │ 2           │ 22     │ 22      │ 22       │ 15      │ 8       │ 3        │
│ Idle (Nodes)          │ 2           │ 0      │ 0       │ 0        │ 0       │ 0       │ 0        │
│   cpu=32, Mem=515000  │ 2           │ 0      │ 0       │ 0        │ 0       │ 0       │ 0        │
│ Running (Nodes)       │ 0           │ 46     │ 46      │ 46       │ 34      │ 17      │ 7        │
│   cpu=32, Mem=3095000 │ 0           │ 3      │ 3       │ 3        │ 3       │ 1       │ 1        │
│   cpu=32, Mem=1547000 │ 0           │ 21     │ 21      │ 21       │ 16      │ 8       │ 3        │
│   cpu=32, Mem=515000  │ 0           │ 22     │ 22      │ 22       │ 15      │ 8       │ 3        │
│ Down (Nodes)          │ 0           │ 4      │ 4       │ 4        │ 1       │ 0       │ 0        │
│   cpu=32, Mem=3095000 │ 0           │ 1      │ 1       │ 1        │ 1       │ 0       │ 0        │
│   cpu=32, Mem=1547000 │ 0           │ 3      │ 3       │ 3        │ 0       │ 0       │ 0        │
└───────────────────────┴─────────────┴────────┴─────────┴──────────┴─────────┴─────────┴──────────┘
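The note above the table says that a resource able to run 0-24 hour jobs is counted in the 0-3, 3-12 and 12-24 columns. A minimal Python sketch of that rule follows. It assumes a node can be described by the longest job it accepts, in hours. The names `COLUMNS` and `columns_for` are hypothetical and not part of clusterstats, and the interactive column is left out.

```python
# Upper bound, in hours, of each walltime column shown in the table.
# The labels are copied from the table; the structure itself is an assumption.
COLUMNS = [
    ("0-3 hr", 3),
    ("3-12 hr", 12),
    ("12-24 hr", 24),
    ("1-3 day", 3 * 24),
    ("3-7 day", 7 * 24),
    ("7-28 day", 28 * 24),
]


def columns_for(max_walltime_hours):
    """Return the columns in which a node would be counted.

    A node is counted in every column whose upper bound does not exceed
    the longest job the node accepts.
    """
    return [label for label, upper in COLUMNS if upper <= max_walltime_hours]


print(columns_for(24))
```

This is why the counts in the table can only stay the same or shrink from left to right: every node that accepts long jobs is also counted in all the shorter columns.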

Information on your Group(s)

You will be asked to select one of the accounting groups that you belong to. You will then see a table listing all the users in your accounting group, each member's share of the group and share of the group's usage, as well as the group's share of the cluster and its usage. The group's LevelFS is the group's share of the cluster divided by the group's usage. Fairshare is the main component of the priority of any job.
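The LevelFS ratio described above can be sketched in Python as follows. The function name `level_fs` is hypothetical. Returning infinity for a group with no usage is an assumption made for this illustration, and clusterstats may display such groups differently.

```python
import math


def level_fs(group_share, group_used):
    """Group's share of the cluster divided by the group's usage.

    Both arguments are expressed in the same unit, for example percent
    of the cluster. A value above 1 means the group has used less than
    its share; a value below 1 means it has used more.
    """
    if group_used == 0:
        # Assumption for this sketch: no usage gives the highest possible value.
        return math.inf
    return group_share / group_used


print(level_fs(2.0, 1.0))  # share is twice the usage
print(level_fs(1.0, 4.0))  # usage is four times the share
```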

Information on? Group
Information on Job ? def-kamil-ab_cpu
┌──────────────────┬──────────┬───────────┬───────────┬──────────┬─────────┬─────────┬───────────────┐
│ Account          │ User     │ Group     │ Group     │ Group    │ Users's │ Users's │ Users's       │
│                  │          │ Share     │ Used      │ LevelFS  │ Share   │ Used    │ Fairshare     │
│                  │          │ % Cluster │ % Cluster │          │ % Group │ % Group │ Using Account │
├──────────────────┼──────────┼───────────┼───────────┼──────────┼─────────┼─────────┼───────────────┤
│ def-kamil-ab_cpu │ kamil    │ SLEEPING  │ 0.0       │ SLEEPING │ 50.0    │ 100.0   │ SLEEPING      │
│ def-kamil-ab_cpu │ tmcguire │ SLEEPING  │ 0.0       │ SLEEPING │ 50.0    │ 0.0     │ SLEEPING      │
└──────────────────┴──────────┴───────────┴───────────┴──────────┴─────────┴─────────┴───────────────┘

In this example, user kamil has 50% of the group's share but accounts for 100% of the resources used by the group. The default group def-kamil-ab_cpu has used almost no resources and is currently inactive. The share of each active default group is set to (unallocated resources / number of active default groups). Inactive default groups get no share; when a group member submits a job, the group is soon reclassified as active.
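The rule for default group shares can be illustrated with a short Python sketch. The function name and the numbers in the usage line are hypothetical and chosen only to show the arithmetic. They do not describe any real cluster.

```python
def default_group_share(unallocated, active_default_groups, is_active):
    """Share given to one default group.

    unallocated: resources not covered by allocations, in any consistent unit.
    active_default_groups: number of default groups currently active.
    is_active: whether this particular group is active.
    """
    if not is_active or active_default_groups == 0:
        # Inactive default groups get no share.
        return 0.0
    return unallocated / active_default_groups


# Hypothetical figures: 40% of the cluster unallocated, 200 active default groups.
print(default_group_share(40.0, 200, is_active=True))
print(default_group_share(40.0, 200, is_active=False))
```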

Information on the User

You will be asked to select Account or Jobs. Select Account and you will get a table with information on all the groups you are a member of, just like in the Groups section, but you will not see the other group members.

Select Jobs and you will then select the particular job you want more information on. Select Basic and you will see the job's state, its partition, its priority and its rank.

Information on ? Basic
Job:46460857 state: pending partition: cpubase_bycore_b4 priority: 1618298
    This job is ranked 1517 of 7825 in terms of priority

The ranking has the following meaning: the nodes in the job's queue or partition that can potentially run the job can also run 7825 other jobs. When all these jobs are ranked by priority, this job is 1517th.
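The ranking can be sketched in Python as counting how many competing jobs have a higher priority. The function name, the job identifiers and the priority values below are hypothetical. How clusterstats breaks ties is not stated, so this sketch gives tied jobs the same rank.

```python
def priority_rank(job_id, priorities):
    """Rank of a job among all jobs that compete for the same nodes.

    priorities: mapping of job id to priority for every job in the pool,
    including the job of interest. Rank 1 is the highest priority.
    """
    mine = priorities[job_id]
    # One plus the number of jobs with a strictly higher priority.
    return 1 + sum(1 for p in priorities.values() if p > mine)


pool = {"job-a": 300, "job-b": 100, "job-c": 200}  # hypothetical jobs
print(priority_rank("job-c", pool))
```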

Select Report to also show the nodes that your job can run on and their state.

Select the Output of the scontrol command to get "scontrol show job <jobid>" output
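The same output can be fetched outside clusterstats by running the command directly. The Python sketch below wraps "scontrol show job <jobid>" with the standard `subprocess` module. The helper names are hypothetical, and `show_job` only works on a machine where the Slurm `scontrol` command is installed.

```python
import subprocess


def scontrol_show_job_cmd(job_id):
    """Build the argument list for 'scontrol show job <jobid>'."""
    return ["scontrol", "show", "job", str(job_id)]


def show_job(job_id):
    """Run the command and return its text output.

    Raises subprocess.CalledProcessError if scontrol exits with an error,
    for example when the job id is unknown.
    """
    result = subprocess.run(
        scontrol_show_job_cmd(job_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


print(" ".join(scontrol_show_job_cmd(46460857)))
```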
