AI and Machine Learning: Difference between revisions

Revision as of 18:50, 16 July 2019

This article is a draft

This is not a complete article: This is a draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.

Python

Python is very popular in the field of machine learning. If you use it, or plan to use it, on our clusters, please refer to our documentation about Python for important information on Python versions, virtual environments on login or compute nodes, multiprocessing, Anaconda, Jupyter, etc.

Useful information about software packages

Please refer to the page of your machine learning package of choice for installation instructions, common pitfalls, etc.:

Datasets containing lots of small files (e.g. image datasets)

In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands of files or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, two problems arise:

Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
Your software could be significantly slowed down by streaming lots of small files from /project (or /scratch) to a compute node.

On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to Handling large collections of files.
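
As an illustration of the single-file archive approach, here is a minimal sketch using Python's standard tarfile module. The function names pack_directory and iter_archive are hypothetical, and any paths you pass in are your own; this is one possible way to do it, not the method prescribed by Handling large collections of files. It assumes the archive is written outside the directory being packed.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Pack every regular file under src_dir into one uncompressed tar archive.

    Hypothetical helper: returns the number of files added.
    Assumes archive_path is located outside src_dir.
    """
    src = Path(src_dir)
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count


def iter_archive(archive_path):
    """Yield (member name, file contents as bytes) by reading the archive in order.

    Hypothetical helper: nothing is extracted to disk.
    """
    with tarfile.open(archive_path, "r") as tar:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as fh:
                    yield member.name, fh.read()
```

With this layout, the dataset counts as a single filesystem object against the quota, and a job reads one large file sequentially instead of opening each small file separately.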

@@ Line 15: / Line 15: @@
 * [[SpaCy]]
 * [[XGBoost]]
+== Datasets containing lots of small files (e.g. image datasets) ==
+In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise:
+* Filesystem quotas on Compute Canada clusters limit the number of filesystem objects;
+* Your software could be significantly slowed down by streaming lots of small files from <tt>/project</tt> (or <tt>/scratch</tt>) to a compute node.
+On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to
+[[Handling large collections of files]].
