GNU Parallel: Difference between revisions

Revision as of 01:37, 1 August 2019

2,057 bytes added, 5 years ago

Added section for handling large files

Coulombc

@@ Line 71: / Line 71: @@
 {{Command|ls *.txt {{!}} parallel --resume-failed --joblog gzip.log gzip {{(}}{{)}} }}
 Note that this will also start subjobs that were not considered before.
+==Handling large files==
+Let's say we want to count the characters in a large [https://en.wikipedia.org/wiki/FASTA_format FASTA] file (<tt>database.fa</tt>) in parallel, in a task with 8 cores. To handle the file efficiently in chunks, we use the GNU Parallel <tt>--pipepart</tt> and <tt>--block</tt> options: <tt>--pipepart</tt> splits the file given after <tt>::::</tt> into blocks and pipes each block to its own job, and <tt>--block</tt> sets the block size. Using the following command:
+{{Command|parallel --jobs $SLURM_CPUS_PER_TASK --keep-order --block -1 --recstart '>' --pipepart wc :::: database.fa}}
+and varying the <tt>--block</tt> size, we get:
+{| class="wikitable"
+! Row
+! # Cores in task
+! Ref. database size
+! Block read size
+! # GNU Parallel jobs
+! # Cores used
+! Time counting chars
+|-
+| 1
+| style="text-align: right;" | 8
+| style="text-align: right;" | 827MB
+| style="text-align: right;" | 10MB
+| style="text-align: right;" | 83
+| style="text-align: right;" | 8
+| 0m2.633s
+|-
+| 2
+| style="text-align: right;" | 8
+| style="text-align: right;" | 827MB
+| style="text-align: right;" | 100MB
+| style="text-align: right;" | 9
+| style="text-align: right;" | 8
+| 0m2.042s
+|-
+| 3
+| style="text-align: right;" | 8
+| style="text-align: right;" | 827MB
+| style="text-align: right;" | 827MB
+| style="text-align: right;" | 1
+| style="text-align: right;" | 1
+| 0m10.877s
+|-
+| 4
+| style="text-align: right;" | 8
+| style="text-align: right;" | 827MB
+| style="text-align: right;" | -1
+| style="text-align: right;" | 8
+| style="text-align: right;" | 8
+| 0m1.734s
+|}
+The table above shows that the block size has a real impact on both the run time and the number of cores actually used.
+In the first row, the block size is too small: the file is split into 83 jobs, far more than the 8 available cores.
+The second row uses a better block size, since the resulting number of jobs (9) is close to the number of available cores.
+In the third row, the block size is too big: the whole file goes to a single job, so only 1 of the 8 cores is used and the run is the slowest of the four.
+Finally, the last row shows that letting GNU Parallel choose the block size itself (<tt>--block -1</tt>, which produced 8 jobs for 8 cores here) is often the fastest option.
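+The job counts in the table follow from a ceiling division of the file size by the block size. Because <tt>wc</tt> runs once per block, the command above also prints one line of counts per job instead of a single total. The following is a minimal sketch of both points. The helper names <tt>estimate_jobs</tt> and <tt>sum_wc</tt> are hypothetical and are not part of GNU Parallel. The estimate is only approximate, since <tt>--recstart</tt> moves block boundaries to the start of a record.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of GNU Parallel.

# Estimate how many blocks (jobs) a fixed --block size produces:
# ceil(file_size / block_size), with both sizes in the same unit (e.g. MB).
estimate_jobs() {
    local file_size=$1 block_size=$2
    echo $(( (file_size + block_size - 1) / block_size ))
}

# Sum the per-block "lines words bytes" lines printed by wc into one total.
# Reads from standard input; the "+ 0" forces numeric output on empty input.
sum_wc() {
    awk '{ lines += $1; words += $2; bytes += $3 }
         END { print lines + 0, words + 0, bytes + 0 }'
}

estimate_jobs 827 10     # prints 83 (row 1 of the table)
estimate_jobs 827 100    # prints 9  (row 2 of the table)
estimate_jobs 827 827    # prints 1  (row 3 of the table)
```

+Assuming these functions are defined in the current shell, the output of the <tt>parallel ... --pipepart wc :::: database.fa</tt> command can be piped into <tt>sum_wc</tt> to obtain a single line with the total number of lines, words and bytes.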
 ==Related topics== <!--T:14-->