* [[SpaCy]]
* [[XGBoost]]
== Datasets containing lots of small files (e.g. image datasets) ==
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands of files or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes each. At this scale, two problems arise:
* Filesystem quotas on Compute Canada clusters limit the number of filesystem objects you can store;
* Your software can be significantly slowed down by streaming lots of small files from <tt>/project</tt> (or <tt>/scratch</tt>) to a compute node.
On a distributed filesystem, such data should be stored in large single-file archives. On this subject, please refer to
[[Handling large collections of files]].