GNU Parallel

Revision as of 17:56, 12 July 2016 by Stubbsda (Added a section on using Gnu Parallel across multiple nodes.)

Introduction

GNU Parallel is a tool for running many sequential tasks at the same time on one or more nodes. It is especially useful when the tasks are numerous and short or of variable duration, and for parameter sweeps. Only the basic options are covered here; for more advanced usage, please see the official documentation.

By default, parallel runs as many tasks simultaneously as there are cores on the node, which maximizes resource usage. You can change this behaviour with the --jobs option, followed by the number of simultaneous tasks that GNU Parallel should run. When a task finishes, parallel automatically starts the next one.
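As a minimal sketch of the --jobs option, the following limits parallel to two simultaneous tasks. The sleep-and-echo task is only a placeholder chosen for illustration, and --keep-order is added so that the output appears in input order rather than in completion order.

```shell
# Run four placeholder tasks, at most two at a time.
# --keep-order prints the results in input order, not completion order.
result=$(parallel --jobs 2 --keep-order 'sleep 1; echo "task {} done"' ::: 1 2 3 4)
echo "$result"
```

With four one-second tasks and two job slots, the whole run should take roughly two seconds instead of four.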

Basic Usage

Parallel uses curly brackets {} as a placeholder for the argument passed to the command being run. For example, to run zip on every text file in a directory, you can execute

 
[name@server ~]$ ls *.txt | parallel zip {}.zip {}
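To check how {} will be expanded before anything is actually run, the generated commands can be printed instead of executed. This is a sketch that relies on the --dry-run option of GNU Parallel; the file names a.txt and b.txt are hypothetical and are fed through a pipe, one per line.

```shell
# Print the commands that parallel would run, without executing them.
# a.txt and b.txt are hypothetical file names used only for illustration.
preview=$(printf '%s\n' a.txt b.txt | parallel --dry-run --keep-order zip {}.zip {})
echo "$preview"
```

Each printed line is one complete command, with every occurrence of {} replaced by the same input value.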

Arguments can also be listed directly on the command line after :::, as in this example:

 
[name@server ~]$ parallel echo {} ::: $(seq 1 3)
1
2
3

Multiple Arguments

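You can also supply several argument lists, each introduced by its own :::, and refer to them by position with {1}, {2}, and so on. Parallel then runs the command once for every combination of values, for example: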

 
[name@server ~]$ parallel echo {1} {2} ::: $(seq 1 3) ::: $(seq 2 3)
1 2
1 3
2 2
2 3
3 2
3 3
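The same mechanism can drive a small parameter sweep. In this sketch, echo stands in for a real program, and the parameter names alpha and beta and their values are hypothetical.

```shell
# Hypothetical two-parameter sweep: one task per (alpha, beta) combination.
# echo is a stand-in for the program that would receive the parameters.
sweep=$(parallel --keep-order echo "alpha={1} beta={2}" ::: 0.1 0.2 ::: 10 20)
echo "$sweep"
```

Two values for each parameter give four tasks in total, with the last list varying fastest.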

File Content as Argument List

The syntax :::: reads the list of argument values from a file, one value per line. For example, if you have a list of parameter values in the file mylist.txt, you can display its content with:

 
[name@server ~]$ parallel echo {1} :::: mylist.txt
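For a self-contained illustration, the sketch below first creates mylist.txt with three hypothetical values and then passes each line to a task. The values and the "value:" label are assumptions made for the example.

```shell
# Create a small parameter file with one hypothetical value per line.
printf '%s\n' 0.5 1.0 1.5 > mylist.txt

# Each line of mylist.txt becomes the argument {1} of one task.
values=$(parallel --keep-order echo "value: {1}" :::: mylist.txt)
echo "$values"
```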

Running on Multiple Nodes

You can also use GNU Parallel to distribute a workload across multiple nodes in a cluster, for instance within a job on a Compute Canada server. For example:

 
[name@server ~]$ parallel --jobs 12 --sshloginfile "$PBS_NODEFILE" --env MY_VARIABLE --workdir "$PWD" ./my_program

In this case, we assume that each node has 12 CPU cores, so --jobs 12 lets parallel run up to 12 tasks on each node. The PBS_NODEFILE file, created automatically by the job scheduler, tells GNU Parallel which nodes to use for distributing the tasks. The --env option transfers the named environment variable to all the nodes, while the --workdir option ensures that the tasks start in the same directory on every node as on the main node. Because the command has no ::: or :::: input source, parallel reads the task arguments from standard input, one per line.
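Put together, a job script built around this command might look like the following sketch. The resource request, the value of MY_VARIABLE, and the parameter file mylist.txt are assumptions made for illustration; they are not prescribed by the scheduler or by GNU Parallel. The guard only keeps the script harmless when it is run outside a PBS job.

```shell
#!/bin/bash
#PBS -l nodes=2:ppn=12
# The resource request above (2 nodes, 12 cores each) is an assumed example.

# Hypothetical setting that my_program is assumed to read from its environment.
export MY_VARIABLE="example-value"

if [ -n "${PBS_NODEFILE:-}" ]; then
    # Inside a PBS job: start from the submission directory, then pass one
    # line of mylist.txt (an assumed parameter file) to each task.
    cd "$PBS_O_WORKDIR" || exit 1
    parallel --jobs 12 --sshloginfile "$PBS_NODEFILE" \
             --env MY_VARIABLE --workdir "$PWD" \
             ./my_program {} :::: mylist.txt
else
    echo "PBS_NODEFILE is not set; submit this script with qsub instead."
fi
```

Giving the input source explicitly with :::: avoids relying on standard input, which is usually not connected to anything useful inside a batch job.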