Revision as of 16:30, 2 July 2019

This article is a draft

This is not a complete article. It is a draft, a work in progress intended to be published as an article, and it may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.

In certain domains, notably machine learning, it is common to manage very large collections of files, meaning hundreds of thousands or more. The individual files may be fairly small, e.g. less than a few hundred kilobytes. In these cases, a problem arises because filesystem quotas on Compute Canada clusters limit the number of filesystem objects. So how can a user or group of users store these necessary data sets on the cluster? This page presents a variety of solutions, each with its own pros and cons, so you can judge which one is appropriate for you.

DAR

HDF5

SQLite

SquashFS

ratarmount

git

When working with Git, the number of files in the repository's hidden .git subdirectory can grow significantly over time. Running git repack packs many of these files together into a few large pack files, which lowers the file count and can greatly speed up Git's operations.
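As a rough illustration, the following Python sketch counts the filesystem entries under a repository's .git directory and then repacks it. The function names count_git_entries and repack are hypothetical helpers introduced here, and the choice of the -a and -d options (pack everything into a single pack and delete the redundant packs and loose objects) is an assumption about what suits a typical repository, not a recommendation from this page.

```python
import subprocess
from pathlib import Path


def count_git_entries(repo: Path) -> int:
    """Count every file and directory under repo/.git (hypothetical helper)."""
    git_dir = repo / ".git"
    # rglob("*") walks the tree recursively and includes hidden entries.
    return sum(1 for _ in git_dir.rglob("*"))


def repack(repo: Path) -> None:
    """Run `git repack -a -d` in the given repository (hypothetical helper).

    -a packs all referenced objects into a single pack;
    -d removes packs and loose objects made redundant by the new pack.
    """
    subprocess.run(["git", "-C", str(repo), "repack", "-a", "-d"], check=True)
```

Calling count_git_entries before and after repack on one of your own repositories shows how many filesystem objects the repack saved; the result depends on the repository's history and on how recently it was last packed.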


Handling large collections of files: Difference between revisions
