AI and Machine Learning

* [[SpaCy]]
* [[XGBoost]]
== Datasets containing lots of small files (e.g. image datasets) ==
In machine learning, it is common to have to manage very large collections of files: hundreds of thousands or more. The individual files may be fairly small, often under a few hundred kilobytes. In these cases, two problems arise:
* Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
* Your software can be significantly slowed down by streaming many small files from <tt>/project</tt> (or <tt>/scratch</tt>) to a compute node.
On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to
[[Handling large collections of files]].
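As a minimal sketch of this workflow: pack the small files into a single <tt>tar</tt> archive once, on the shared filesystem, then unpack it to the compute node's local disk at the start of a job. The dataset layout, file names, and the use of a temporary directory below are illustrative assumptions; on a real cluster you would typically extract into the node-local scratch space (e.g. <tt>$SLURM_TMPDIR</tt> on Slurm clusters) rather than a <tt>mktemp</tt> directory.

```shell
set -e

# Illustrative stand-in for an image dataset: many small files
mkdir -p dataset
for i in $(seq 1 100); do
    echo "sample $i" > "dataset/img_$i.txt"
done

# Pack once in /project or /scratch: one large archive instead of
# hundreds of thousands of filesystem objects
tar -cf dataset.tar dataset

# At job start, unpack to fast node-local storage before training.
# $workdir stands in for node-local scratch such as $SLURM_TMPDIR.
workdir=$(mktemp -d)
tar -xf dataset.tar -C "$workdir"

# Training code then reads files locally from "$workdir/dataset"
ls "$workdir/dataset" | wc -l
```

Reading the archive once from the distributed filesystem and extracting it locally replaces many small metadata-heavy reads with a single large sequential read, which is what distributed filesystems handle well.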