AI and Machine Learning: Difference between revisions
Revision as of 18:50, 16 July 2019
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
Python
Python is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to our documentation about Python to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc.
Useful information about software packages
Please refer to the page of your machine learning package of choice for installation instructions, common pitfalls, etc.:
Datasets containing lots of small files (e.g. image datasets)
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise:
- Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
- Your software could be significantly slowed down by streaming lots of small files from /project (or /scratch) to a compute node.
On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to Handling large collections of files.
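As a rough illustration of the single-file archive approach, the following minimal sketch uses Python's standard tarfile module to bundle a directory of small files into one uncompressed archive and then read the members back sequentially without unpacking them to disk. The function names pack_directory and iter_archive and the example paths are hypothetical, chosen for illustration only; they are not part of any cluster tooling, and the page linked above remains the reference for recommended practice.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Bundle every regular file under src_dir into one uncompressed tar archive.

    Returns the number of files added. The archive should be written
    outside src_dir so that it does not try to include itself.
    """
    src_dir = Path(src_dir)
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src_dir).as_posix())
                count += 1
    return count


def iter_archive(archive_path):
    """Yield (member_name, raw_bytes) pairs, reading the archive front to back."""
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            if member.isfile():
                handle = tar.extractfile(member)
                yield member.name, handle.read()


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    n = pack_directory("my_dataset", "my_dataset.tar")
    print(f"packed {n} files")
    for name, data in iter_archive("my_dataset.tar"):
        print(name, len(data))
```

Stored this way, the whole dataset counts as a single filesystem object, and it is read as one large sequential stream rather than as many small independent files.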