* [[SpaCy]]
* [[XGBoost]]
== Datasets containing lots of small files (e.g. image datasets) ==
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands of files or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes each. At this scale, two problems arise:
* Filesystem quotas on Compute Canada clusters limit the number of filesystem objects you can store;
* Your software can be significantly slowed down by streaming lots of small files from <tt>/project</tt> (or <tt>/scratch</tt>) to a compute node.
On a distributed filesystem, such data should be stored in large single-file archives. On this subject, please refer to
[[Handling large collections of files]].