Handling large collections of files
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
In certain domains, notably AI and Machine Learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, a problem arises due to filesystem quotas on Compute Canada clusters that limit the number of filesystem objects. So how can a user or group of users store these necessary data sets on the cluster? In this page we present a variety of solutions, each with its own pros and cons, so that you can judge for yourself which one is appropriate for you.
Finding folders with lots of files
As always in optimization, you should start by finding where cleanup is worth the effort. Consider the following code, which counts all files, recursively, in each folder in the current directory:
for F in $(find . -maxdepth 1 -type d | tail -n +2); do
  echo -ne "$F:\t"
  find $F -type f | wc -l
done
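Note that the loop above relies on shell word-splitting, so it will misbehave on directory names containing spaces. A more robust variant (a sketch, not part of the original page) delegates each directory to find's -exec:

# count files per immediate subdirectory; safe for names with spaces
find . -mindepth 1 -maxdepth 1 -type d -exec sh -c '
  printf "%s:\t" "$1"
  find "$1" -type f | wc -l
' sh {} \;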
Archiving tools
DAR
A disk archive utility. See Dar.
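As a minimal sketch (the archive and directory names here are only examples; see the Dar page for the full option set), creating, listing, and extracting an archive looks like this:

# pack the directory results/ into an archive named results.1.dar
dar -c results -R results/
# list the contents of the archive
dar -l results
# extract the archive into the current directory
dar -x results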
HDF5
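HDF5 is a hierarchical data format that stores any number of datasets and groups inside a single file, so thousands of small arrays can be kept in one filesystem object; it is commonly read and written from Python via the h5py package.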
SQLite
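One possible approach, shown here as a hypothetical sketch using the sqlite3 command-line shell (the database name files.db, the table layout, and the data/ directory are all assumptions), is to store each small file as a BLOB row; the readfile() and writefile() helpers are built into the sqlite3 shell:

# create a single database holding one row per small file
sqlite3 files.db "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, content BLOB);"
find data/ -type f | while read -r F; do
  sqlite3 files.db "INSERT INTO files VALUES ('$F', readfile('$F'));"
done
# write one file back out to disk
sqlite3 files.db "SELECT writefile('restored.txt', content) FROM files WHERE path = 'data/example.txt';"

(Paths containing single quotes would break the quoting above; a real script should escape them.)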
SquashFS
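A typical workflow, sketched below under the assumption that mksquashfs and squashfuse are available, packs a directory into a single compressed, read-only image which can then be mounted without root privileges:

# pack the directory data/ into a single compressed read-only image
mksquashfs data/ data.sqfs
# mount the image through FUSE, no root required
mkdir -p mnt
squashfuse data.sqfs mnt/
# ... read files under mnt/ as usual ...
fusermount -u mnt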
Random Access Read-Only Tar Mount (Ratarmount)
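The idea, as a brief sketch (the file and mount point names are examples): bundle the files into an ordinary tar archive, then mount it read-only; ratarmount builds an index on first mount so later random accesses stay fast:

# bundle the directory into a single tar archive
tar -cf data.tar data/
# mount the archive as a read-only directory tree
mkdir -p mnt
ratarmount data.tar mnt/
# unmount when done
fusermount -u mnt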
Cleaning up hidden files
git
When working with Git, over time the number of files in the hidden .git repository subdirectory can grow significantly. Using git repack will pack many of the files together into a few large database files and greatly speed up Git's operations.
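For example (a minimal sketch; see the git-repack documentation for the full set of options):

# show loose object and pack counts before repacking
git count-objects -v
# pack all objects into a single pack and drop the now-redundant loose objects
git repack -a -d
git count-objects -v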