(Introduction) |
(Section about datasets) |
Line 1: | Line 1: | ||
{{Draft}} | {{Draft}} | ||
To get the most out of our clusters for machine learning applications, special care must be taken. A cluster is a complicated beast that is very different from the local machine you use for prototyping. Notably, a cluster uses a distributed filesystem that seamlessly links many storage devices. Accessing a file on <tt>/project</tt> ''feels the same'' as accessing one from the current node, but under the hood these two I/O operations have very different performance implications. In short, you need to choose wisely where to put your data. | To get the most out of our clusters for machine learning applications, special care must be taken. A cluster is a complicated beast that is very different from the local machine you use for prototyping. Notably, a cluster uses a distributed filesystem that seamlessly links many storage devices. Accessing a file on <tt>/project</tt> ''feels the same'' as accessing one from the current node, but under the hood these two I/O operations have very different performance implications. In short, you need to [[#Managing_your_datasets|choose wisely where to put your data]]. | ||
The sections below are a starting point for machine learning practitioners looking for solutions, or just getting started working with our clusters. | The sections below are a starting point for machine learning practitioners looking for solutions, or just getting started working with our clusters. | ||
Line 21: | Line 21: | ||
* [[XGBoost]] | * [[XGBoost]] | ||
= Datasets containing lots of small files (e.g. image datasets) = | = Managing your datasets = | ||
== Storage and file management == | |||
Compute Canada provides a wide range of storage options to cover the needs of our very diverse users. These storage solutions range from high-speed temporary local storage to different kinds of long-term storage, so you can choose the storage medium that best corresponds to your needs and usage patterns. Please refer to our documentation on [[Storage and file management]]. | |||
== Datasets containing lots of small files (e.g. image datasets) == | |||
In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise: | In machine learning, it is common to have to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, problems arise: |