GNU Parallel

Revision as of 16:16, 2 August 2019 by Diane27
Introduction

GNU Parallel is a tool for running many sequential tasks at the same time on one or more nodes. It is useful for running a large number of sequential tasks, especially if they are short or of variable duration, as well as for parameter sweeps. We only cover the basic options here; for more advanced usage, please see the official documentation.

By default, parallel runs as many tasks at once as there are cores allocated by the scheduler, thereby maximizing resource usage. You can change this behaviour with the option --jobs followed by the number of simultaneous tasks that GNU Parallel should run. When one task finishes, parallel automatically starts a new one in its place, so the maximum number of tasks is always running.

Basic Usage

GNU Parallel uses curly brackets {} as a placeholder for the argument passed to the command to be run. For example, to run gzip on all the text files in a directory, you can execute

 
[name@server ~]$ ls *.txt | parallel gzip {}

Alternatively, the arguments can be given on the command line after :::, as in this example:

 
[name@server ~]$ parallel echo {} ::: $(seq 1 3)
1
2
3

Note that GNU Parallel refers to each executed command as a job. This can be confusing because on many Compute Canada systems a job is a batch script run by a scheduler or resource manager, and GNU Parallel would be used inside that job. From that perspective, GNU Parallel's jobs are sub-jobs.

Multiple Arguments

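You can also supply several argument lists, each introduced by :::, and refer to them by position as {1}, {2}, and so on. The command is run for every combination of values, for example: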

 
[name@server ~]$ parallel echo {1} {2} ::: $(seq 1 3) ::: $(seq 2 3)
1 2
1 3
2 2
2 3
3 2
3 3
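
The order of the combinations above matches what nested loops would produce, with the last list varying fastest. As a rough mental model only (this sequential Bash sketch does not use GNU Parallel and runs nothing concurrently; the function name list_combinations is ours), the same pairs can be enumerated as:

```shell
# Sequential equivalent of: parallel echo {1} {2} ::: 1 2 3 ::: 2 3
# The outer loop plays the role of {1}, the inner loop the role of {2}.
list_combinations() {
    local a b
    for a in 1 2 3; do
        for b in 2 3; do
            echo "$a $b"
        done
    done
}

list_combinations
```

With GNU Parallel the tasks run concurrently, so the output lines may appear in a different order unless an option such as --keep-order is used.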

File Content as Argument List

The syntax :::: reads the argument values from a file, one value per line. For example, if you have a list of parameter values in the file mylist.txt, you may display its content with:

 
[name@server ~]$ parallel echo {1} :::: mylist.txt
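
As a small illustration, the following sketch first writes a hypothetical mylist.txt with one value per line and then passes it to GNU Parallel. The three values are placeholders, and the guard around parallel is only there so that the snippet does nothing on a machine where GNU Parallel is not installed.

```shell
# Create a hypothetical parameter file, one value per line.
printf '%s\n' 0.1 0.5 1.0 > mylist.txt

# Each line of mylist.txt becomes one value of {1}.
# --keep-order prints the results in the same order as the input lines.
if command -v parallel >/dev/null 2>&1; then
    parallel --keep-order echo "value: {1}" :::: mylist.txt
fi
```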

File Content as Command List

GNU Parallel can also interpret the lines of a file as the actual sub-jobs to be run in parallel, by using redirection. For example, if you have a list of sub-jobs in the file mycommands.txt (one per line), you may run them in parallel as follows:

 
[name@server ~]$ parallel < mycommands.txt

Note that no command is given as an argument to parallel. This mode is particularly useful if the sub-jobs contain symbols that are special to GNU Parallel, or if each sub-job consists of several commands (e.g. cd dir1 && ./executable).
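
A command file like this is often generated by a short script. The sketch below is one way to do it, assuming hypothetical directories dir1 to dir3 that each contain the ./executable from the example above:

```shell
# Write one complete sub-job per line; each line is later run as-is.
: > mycommands.txt                      # start from an empty file
for i in 1 2 3; do
    echo "cd dir$i && ./executable" >> mycommands.txt
done

# Inspect the list before running it with: parallel < mycommands.txt
cat mycommands.txt
```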

Running on Multiple Nodes

You can also use GNU Parallel to distribute a workload across multiple nodes in a cluster, for instance within a job on a Compute Canada server. For example:

 
[name@server ~]$ scontrol show hostname ${SLURM_JOB_NODELIST} > ./node_list_${SLURM_JOB_ID}
 
[name@server ~]$ parallel --jobs 32 --sshloginfile ./node_list_${SLURM_JOB_ID} --env MY_VARIABLE --workdir $PWD ./my_program

In this case, we assume that each node has 32 CPU cores. We create a file containing the list of nodes from $SLURM_JOB_NODELIST (which is set automatically by the job scheduler), and we use this file to tell GNU Parallel which nodes to use for distributing the tasks. The --env option transfers a named environment variable to all the nodes, while the --workdir option ensures that the GNU Parallel tasks start in the same directory as on the main node.

Keeping Track of Completed and Failed Commands, and Restart Capabilities

You can tell GNU Parallel to keep track of which commands have completed by using the --joblog JOBLOGFILE argument. The file JOBLOGFILE will contain the list of completed commands, their start times, durations, hosts, and exit values. For example,

 
[name@server ~]$ ls *.txt | parallel --joblog gzip.log gzip {}

The job log functionality opens the door to a number of restart options. If the parallel command was interrupted (e.g. your job ran longer than its requested walltime), you can make it pick up where it left off using the --resume option, for instance

 
[name@server ~]$ ls *.txt | parallel --resume --joblog gzip.log gzip {}

The new jobs will be appended to the old log file.

If some of the sub-jobs failed (i.e. they returned a non-zero exit code) and you think you have eliminated the source of the error, you can re-run the failed ones using the --resume-failed option, e.g.

 
[name@server ~]$ ls *.txt | parallel --resume-failed --joblog gzip.log gzip {}

Note that this will also start any sub-jobs that had not yet been run.
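
Before using --resume-failed, it can help to see which commands failed. The sketch below reads a job log and prints the commands with a non-zero exit value. It assumes the log is tab-separated with a header line naming the columns, including Exitval and Command; check the first line of your own log to confirm. The function name failed_commands is ours.

```shell
# Print the Command column of every job log entry whose Exitval is not 0.
# Column positions are looked up from the header line rather than hard-coded.
failed_commands() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["Exitval"]) != 0 { print $(col["Command"]) }
    ' "$1"
}

# Usage with the log from the examples above:
#   failed_commands gzip.log
```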

Handling Large Files

Suppose we want to count the characters in a big FASTA file (database.fa) in parallel, in a task with 8 cores. We have to use the GNU Parallel --pipepart and --block arguments to handle chunks of the file efficiently. Using the following command:

 
[name@server ~]$ parallel --jobs $SLURM_CPUS_PER_TASK --keep-order --block -1 --recstart '>' --pipepart wc :::: database.fa

and varying the block size, we get:

Row  Cores in task  Ref. database size  Block read size  GNU Parallel jobs  Cores used  Time counting chars
1    8              827MB               10MB             83                 8           0m2.633s
2    8              827MB               100MB            9                  8           0m2.042s
3    8              827MB               827MB            1                  1           0m10.877s
4    8              827MB               -1               8                  8           0m1.734s

This table shows that the block size has a real impact on the efficiency and on the number of cores actually used. In the first row, the block size is too small, resulting in many small jobs dispatched over the available cores. The second row uses a better block size, since the number of jobs is close to the number of available cores. In the third row, the block size is too big: only 1 core out of 8 is used, so the file is processed inefficiently. Finally, the last row shows that letting GNU Parallel choose the block size itself is often the fastest option.
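
The job counts in the first three rows of the table follow from dividing the file size by the block size and rounding up. A minimal arithmetic sketch, assuming sizes in whole megabytes and ignoring the small adjustments made so that each block starts at a record boundary (the helper name blocks_for is ours):

```shell
# Approximate number of blocks (and hence jobs) for a file read with --pipepart.
# Usage: blocks_for FILE_SIZE_MB BLOCK_SIZE_MB
blocks_for() {
    local file_mb=$1 block_mb=$2
    echo $(( (file_mb + block_mb - 1) / block_mb ))   # integer ceiling division
}

blocks_for 827 10     # prints 83
blocks_for 827 100    # prints 9
blocks_for 827 827    # prints 1
```

Comparing the result with the number of cores in the task gives a quick indication of whether a fixed block size would leave cores idle or create many more jobs than needed.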
