Handling large collections of files

* [[Béluga/en | Béluga]] offers roughly 370GB of local disk for the CPU nodes, while the GPU nodes have a 1.6TB NVMe disk (to help with AI image datasets, which often contain millions of small files).
* [[Niagara]] does not have local storage on the compute nodes.
* For other clusters, you can assume the available disk size to be at least 190GB.


You can access this local disk inside a job through the environment variable <tt>$SLURM_TMPDIR</tt>. One approach, therefore, is to keep your dataset archived as a single <tt>tar</tt> file in the project space, copy it to the local disk at the beginning of your job, extract it, and use the dataset during the job. If any changes were made, you can archive the contents into a <tt>tar</tt> file again at the end of the job and copy it back to the project space. Here is an example of a submission script that allocates an entire node:
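The script below is only a sketch: the archive path <tt>~/projects/def-someuser/dataset.tar</tt> is a placeholder for your own project space and file, and the requested core count should be adjusted to the node type of the cluster you are running on.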
{{File
|name=job_script.sh
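|lang="bash"
|contents=
#!/bin/bash
# Request a whole node; the core count per node varies between clusters,
# so adjust --ntasks-per-node accordingly.  --mem=0 requests all of the
# memory on the node.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40
#SBATCH --mem=0
#SBATCH --time=0-03:00

# Sketch of the approach described above: the project path and archive
# names are placeholders, replace them with your own.
cd $SLURM_TMPDIR
mkdir work
cd work
tar -xf ~/projects/def-someuser/dataset.tar

# ... run your computation here, reading and writing files under $SLURM_TMPDIR/work ...

# If the dataset was modified, archive it again and copy it back
# to the project space before the job ends.
cd $SLURM_TMPDIR
tar -cf ~/projects/def-someuser/results.tar work
}}
The job can then be submitted as usual with <tt>sbatch job_script.sh</tt>.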