AI and Machine Learning: Difference between revisions
(Add section about large collections of files) |
No edit summary |
||
Line 1: | Line 1: | ||
{{Draft}} | {{Draft}} | ||
= Python = | |||
[[Python]] is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to [[Python|our documentation about Python]] to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc. | [[Python]] is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to [[Python|our documentation about Python]] to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc. | ||
= Useful information about software packages = | |||
Please refer to the page of your machine learning package of choice for useful information about how to install, common pitfalls, etc.: | Please refer to the page of your machine learning package of choice for useful information about how to install, common pitfalls, etc.: | ||
Line 16: | Line 16: | ||
* [[XGBoost]] | * [[XGBoost]] | ||
= Datasets containing lots of small files (e.g. image datasets) = | |||
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise: | In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise: |
Revision as of 18:51, 16 July 2019
This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.
Python
Python is very popular in the field of machine learning. If you (plan to) use it on our clusters, please refer to our documentation about Python to get important information about Python versions, virtual environments on login or on compute nodes, multiprocessing, Anaconda, Jupyter, etc.
Useful information about software packages
Please refer to the page of your machine learning package of choice for useful information about how to install, common pitfalls, etc.:
Datasets containing lots of small files (e.g. image datasets)
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise:
- Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
- Your software could become be significantly slowed down from streaming lots of small files from /project (or /scratch) to a compute node.
On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to Handling large collections of files.