Handling large collections of files: Difference between revisions

<translate>
<translate>


<!--T:1-->
In certain domains, notably [[AI and Machine Learning]], it is common to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, a problem arises due to [[Storage_and_file_management#Filesystem_quotas_and_policies|filesystem quotas]] on Compute Canada clusters that limit the number of filesystem objects. So how can a user or group of users store these necessary data sets on the cluster? This page presents several solutions, each with its own pros and cons, so that you can judge which one suits your needs.


=Finding folders with lots of files= <!--T:2-->


<!--T:3-->
As with any optimization, start by finding out where cleanup is worthwhile. The following code recursively counts the files in each folder of the current directory:


<!--T:4-->
<pre>for FOLDER in $(find . -maxdepth 1 -type d | tail -n +2); do
   echo -ne "$FOLDER:\t"
   [...]
done</pre>
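The same per-folder count can be sketched as a small shell function. This is a minimal sketch and not the page's original script: the function name <code>count_files_per_dir</code> is hypothetical, and it assumes bash and file names without embedded newlines. Unlike a plain <code>for</code> loop over the output of <code>find</code>, it also copes with folder names that contain spaces.

```bash
#!/bin/bash
# Hypothetical helper: print "<folder>:<TAB><number of files>" for every
# immediate subfolder of the given directory (default: current directory).
count_files_per_dir() {
    local top="${1:-.}"
    # -print0 / read -d '' keeps folder names with spaces intact.
    find "$top" -mindepth 1 -maxdepth 1 -type d -print0 |
    while IFS= read -r -d '' dir; do
        # Count regular files recursively; tr strips the padding some
        # wc implementations add.
        printf '%s:\t%s\n' "$dir" "$(find "$dir" -type f | wc -l | tr -d ' ')"
    done
}
```

To list the largest folders first, the output can be piped through <code>sort -t$'\t' -k2 -n -r</code>.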


=Using the local disk= <!--T:5-->
One option is to use the local disk attached to the compute node, which offers roughly 190 GB of space with no quotas of any sort and generally performs considerably better than the project or scratch filesystems. Inside a job, this local disk is accessible through the environment variable <tt>$SLURM_TMPDIR</tt>. One approach is therefore to keep your dataset archived as a single TAR file in the project space, copy it to the local disk at the beginning of your job, extract it, and use the dataset during the job. If any changes were made, you can archive the contents to a TAR file again at the end of the job and copy it back to the project space.
{{File
[...]
#SBATCH --mem=0

<!--T:6-->
cd $SLURM_TMPDIR
mkdir work
[...]
# Now do my computations here on the local disk using the contents of the extracted archive...

<!--T:7-->
# The computations are done, so clean up the data set...
cd $SLURM_TMPDIR
[...]
}}
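The archive has to exist in the project space before the first job runs. A minimal sketch of the two steps, creating the single TAR file and extracting it again, might look as follows. The function names <code>pack_dataset</code> and <code>unpack_dataset</code> are hypothetical, and the sketch assumes a <code>tar</code> that supports <code>-C</code>, as GNU tar does.

```bash
#!/bin/bash
# Hypothetical helper: archive the directory $1 into the single TAR file $2.
pack_dataset() {
    local src_dir="$1" archive="$2"
    # -C changes into the parent folder so the archive stores relative paths.
    tar -cf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# Hypothetical helper: extract the TAR file $1 into the directory $2.
unpack_dataset() {
    local archive="$1" dest="$2"
    mkdir -p "$dest"
    tar -xf "$archive" -C "$dest"
}
```

Inside a job, the second function could be called as <code>unpack_dataset /path/to/data.tar "$SLURM_TMPDIR/work"</code>, where <code>/path/to/data.tar</code> is a placeholder for the archive's location in your project space.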


=Archiving tools= <!--T:8-->


==DAR== <!--T:9-->
A disk archive utility, conceived of as a significant modernization of the venerable [[Tar|tar]] tool. For more information on its use, see [[Dar]].


==HDF5== <!--T:10-->
This is a high-performance binary file format that can store many different kinds of data, including extended objects such as matrices as well as image data. Tools for manipulating HDF5 files exist in several common programming languages, including Python (e.g. [https://www.h5py.org/ h5py]). For more information on its use, see [[HDF5]].


==SQLite== <!--T:11-->


<!--T:12-->
The [https://www.sqlite.org SQLite software] provides a relational database that resides entirely in a single file stored on disk, with no need for a database server. The data in the file can be accessed using standard [https://en.wikipedia.org/wiki/SQL SQL] (Structured Query Language) commands such as <tt>SELECT</tt>, and APIs exist for several common programming languages, so you can interact with your SQLite database from a program written in C/C++, Python, R, Java or Perl. Modern relational databases provide data types for storing ''binary blobs'', such as the contents of an image file, so storing a collection of 5 or 10 million small PNG or JPEG images inside a single SQLite file may be much more practical than storing them as individual files. This approach carries the overhead of creating the SQLite database and assumes that you are familiar with SQL and know how to design a simple relational database with a small number of tables. Note as well that the performance of SQLite can start to degrade for very large database files, several gigabytes or more, in which case you may need to consider a more traditional [[Database servers | database server]] using [https://www.mysql.com MySQL] or [https://www.postgresql.org PostgreSQL].
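As an illustration of the blob approach, the following is a minimal sketch that uses the <code>sqlite3</code> command-line shell and its built-in <code>readfile()</code> and <code>writefile()</code> functions. The function names, the table name <code>files</code> and its two columns are assumptions made for this example; file names containing newlines are not handled.

```bash
#!/bin/bash
# Hypothetical helper: store every file below directory $2 as a blob in
# the SQLite database file $1, inside a single transaction.
store_files() {
    local db="$1" dir="$2"
    {
        echo "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, data BLOB);"
        echo "BEGIN;"
        # Double any single quote in the path, then wrap it in an INSERT.
        find "$dir" -type f |
            sed "s/'/''/g; s/.*/INSERT OR REPLACE INTO files VALUES ('&', readfile('&'));/"
        echo "COMMIT;"
    } | sqlite3 "$db"
}

# Hypothetical helper: write the blob stored under name $2 to the file $3.
extract_file() {
    local db="$1" name="$2" out="$3"
    # Escape single quotes for use inside SQL string literals.
    local q_name=${name//\'/\'\'}
    local q_out=${out//\'/\'\'}
    sqlite3 "$db" "SELECT writefile('$q_out', data) FROM files WHERE name = '$q_name';" >/dev/null
}
```

For example, <code>store_files images.db dataset</code> would pack the contents of a folder named <code>dataset</code> into the single file <code>images.db</code>; both names are placeholders. Sending all inserts through one <code>sqlite3</code> process in one transaction avoids starting a new process for each file.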


=Cleaning up hidden files= <!--T:13-->


==git== <!--T:14-->
When working with Git, the number of files in the hidden <code>.git</code> subdirectory of a repository can grow significantly over time. Running <code>git repack</code> packs many of these files together into a few large pack files and can greatly speed up Git's operations.
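To see the effect on the file count, the number of loose objects can be compared before and after repacking. The following is a minimal sketch, with a hypothetical function name; the options <code>-a</code> (pack everything into a single pack) and <code>-d</code> (delete the loose objects and packs made redundant) are one common choice, not the only one.

```bash
#!/bin/bash
# Hypothetical helper: report the number of loose objects in the
# repository $1 (default: current directory) before and after repacking.
repack_and_report() {
    local repo="${1:-.}"
    echo "before: $(git -C "$repo" count-objects -v | grep '^count:')"
    # -a: pack all reachable objects into one pack
    # -d: remove loose objects and old packs that became redundant
    git -C "$repo" repack -a -d -q
    echo "after:  $(git -C "$repo" count-objects -v | grep '^count:')"
}
```

Objects that are no longer reachable from any branch or tag stay loose after this step, so the count reported afterwards is not necessarily zero in a long-lived repository.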


</translate>
</translate>