Handling large collections of files


This article is a draft

This is not a complete article: it is a draft, a work in progress intended to be published as an article, and it may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.




In certain domains, notably AI and Machine Learning, it is common to have to manage very large collections of files, meaning hundreds of thousands of files or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes each. In these cases a problem arises because filesystem quotas on Compute Canada clusters limit the number of filesystem objects. So how can a user or group of users store these necessary datasets on the cluster? This page presents a variety of solutions, each with its own pros and cons, so that you may judge for yourself which one is appropriate for your situation.

Finding folders with lots of files

As always with optimization, it is best to start by finding out where cleanup is worth the effort. The following code recursively counts all files in the subdirectories of the current directory:

# Count the files in each subdirectory of the current directory
for FOLDER in $(find . -maxdepth 1 -type d | tail -n +2); do
  echo -ne "$FOLDER:\t"
  find "$FOLDER" -type f | wc -l
done

Using the local disk

One option is to use the local disk attached to the compute node, which offers roughly 190 GB of disk space, is not subject to any quota, and in general performs considerably better than the project or scratch filesystems. You can access this local disk inside a job using the environment variable $SLURM_TMPDIR. One approach, therefore, is to keep your dataset archived as a single tar file in the project space and then, at the beginning of your job, extract it to the local disk and use the dataset there for the duration of the job. If any changes were made, at the end of the job you can archive the results into a tar file and copy it back to the project space.

File: job_script.sh

#!/bin/bash
#SBATCH --time=1-00:00          # one day (D-HH:MM)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=0                 # request all of the memory of the node

# Extract the archived dataset onto the node's local disk
cd $SLURM_TMPDIR
mkdir work
cd work
tar -xf ~/projects/def-foo/johndoe/my_data.tar
# Now do my computations here on the local disk using the contents of the extracted archive...

# The computations are done, so clean up the data set...
cd $SLURM_TMPDIR
tar -cf ~/projects/def-foo/johndoe/results.tar work
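Assuming the script above has been saved as job_script.sh, it can be submitted in the usual way (the account name def-foo is a placeholder, as elsewhere on this page):

sbatch --account=def-foo job_script.sh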


Archiving tools

DAR

A disk archive utility. See Dar.
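As a minimal sketch (the paths and archive names here are hypothetical; see the Dar page for recommended options), a directory of many small files can be packed into a single DAR archive and extracted again roughly as follows:

# Pack the directory my_data into an archive named my_data.1.dar
# (-R selects the directory to archive, -c the archive base name)
dar -R ~/projects/def-foo/johndoe/my_data -c my_data

# Later, extract the archive into the current directory
dar -x my_data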

HDF5

A hierarchical binary file format designed to store large numbers of datasets, together with their metadata, inside a single file.

SQLite
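SQLite is an embedded relational database stored in a single ordinary file, so a large collection of small files or records can be consolidated into one database file. The sketch below, in which the database and file names are hypothetical, uses the readfile() and writefile() helpers of the sqlite3 command-line shell to store and retrieve small files as blobs:

# Create a single database file and store small files in it as blobs
sqlite3 dataset.db "CREATE TABLE IF NOT EXISTS files(name TEXT PRIMARY KEY, data BLOB);"
for F in sample_*.txt; do
  sqlite3 dataset.db "INSERT INTO files VALUES('$F', readfile('$F'));"
done

# Extract one of the stored files back to disk
sqlite3 dataset.db "SELECT writefile('restored.txt', data) FROM files WHERE name = 'sample_1.txt';"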

SquashFS
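SquashFS is a compressed, read-only filesystem image: an entire directory tree can be packed into a single file and then mounted and browsed like a normal directory. A minimal sketch, assuming the mksquashfs and squashfuse tools are available in your environment (the directory and image names are hypothetical):

# Pack the directory my_data into a single compressed, read-only image
mksquashfs my_data my_data.sqsh

# Mount the image without root privileges and browse it like a directory
mkdir -p my_data_mounted
squashfuse my_data.sqsh my_data_mounted
ls my_data_mounted

# Unmount when finished
fusermount -u my_data_mounted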

Random Access Read-Only Tar Mount (Ratarmount)
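The ratarmount tool mounts a tar archive as a read-only FUSE filesystem, giving random access to the individual files inside it without extracting them. A minimal sketch, assuming ratarmount is installed (for example with pip install --user ratarmount) and using hypothetical names:

# Mount the archive as a read-only directory; the first mount builds an
# index so that subsequent accesses to individual files are fast
ratarmount my_data.tar my_data_mounted
ls my_data_mounted

# Unmount when finished
ratarmount -u my_data_mounted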

Cleaning up hidden files

git

When working with Git, over time the number of files in the hidden .git repository subdirectory can grow significantly. Using git repack will pack many of the files together into a few large database files and greatly speed up Git's operations.
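For example, from inside a repository you can count the files under .git, repack, and count again (the repack options shown are standard Git options, not specific to our clusters):

# Count the files currently stored under .git
find .git -type f | wc -l

# Pack loose objects into a few pack files and delete the redundant loose ones
git repack -a -d

# Count again to see the reduction
find .git -type f | wc -l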