Handling large collections of files

Local disks attached to compute nodes are SATA SSD or better and, in general, will have considerably better performance than the project or scratch filesystems. Note that the local disk is shared by all jobs running on that node, without being allocated by the scheduler. The actual amount of local disk space varies from one cluster to another (and may also vary within a given cluster). For example,

<!--T:19-->
* [[Béluga/en | Béluga]] offers roughly 370GB of local disk for the CPU nodes; the GPU nodes have a 1.6TB NVMe disk (to help with AI image datasets and their millions of small files).
* [[Niagara]] does not have local storage on the compute nodes.
* For other clusters you can assume the available disk size to be at least 190GB.

<!--T:20-->
You can access this local disk inside a job using the environment variable <tt>$SLURM_TMPDIR</tt>. One approach, therefore, is to keep your dataset archived as a single <tt>tar</tt> file in the project space, copy it to the local disk at the beginning of your job, extract it, and use the dataset during the job. If any changes were made, at the end of the job you can archive the contents into a <tt>tar</tt> file again and copy it back to the project space.

<!--T:21-->
Here is an example of a submission script that allocates an entire node:
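The script below is a sketch: the account name <code>def-someuser</code>, the archive <code>dataset.tar</code> and the resource requests are placeholders to adapt to your own project and workload.
{{File
  |name=job_script.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32   # adjust to the core count of the node
#SBATCH --mem=0                # request all of the memory of the node
#SBATCH --time=02:00:00

# Extract the archived dataset onto the fast node-local disk.
cd $SLURM_TMPDIR
mkdir work
cd work
tar -xf ~/projects/def-someuser/dataset.tar

# ... run the computations on the local copy of the dataset ...

# Archive the results and copy them back to the project space,
# since $SLURM_TMPDIR is cleaned up when the job ends.
cd $SLURM_TMPDIR
tar -cf ~/projects/def-someuser/results.tar work/
}}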

== RAM disk == <!--T:22-->
The <code>/tmp</code> file system can be used as a RAM disk on the compute nodes. It is implemented using [https://en.wikipedia.org/wiki/Tmpfs tmpfs]. Here is more information:
* <code>/tmp</code> is <code>tmpfs</code> on all clusters
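For example, here is a sketch of a job that stages a dataset into the RAM disk; note that files written to <code>tmpfs</code> consume memory, so the <code>--mem</code> request must cover both the application and the data placed in <code>/tmp</code> (the paths and sizes are illustrative only):
{{File
  |name=ramdisk_job.sh
  |lang="bash"
  |contents=
#!/bin/bash
#SBATCH --mem=16G       # must cover the application AND the files kept in /tmp
#SBATCH --time=01:00:00

# Copy the dataset into the RAM disk; these files consume job memory.
mkdir -p /tmp/$SLURM_JOB_ID
tar -xf ~/projects/def-someuser/dataset.tar -C /tmp/$SLURM_JOB_ID

# ... run the analysis against the copy in /tmp/$SLURM_JOB_ID ...

# Remove the files so that the memory is freed for later jobs on the node.
rm -rf /tmp/$SLURM_JOB_ID
}}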
== SQLite ==
The [https://www.sqlite.org SQLite software] allows for the use of a relational database that resides entirely in a single file stored on disk, without the need for a database server. The data in the file can be accessed using standard [https://en.wikipedia.org/wiki/SQL SQL] (Structured Query Language) commands such as <tt>SELECT</tt>, and there are APIs for several common programming languages. Using these APIs, you can interact with your SQLite database from a program written in C/C++, Python, R, Java or Perl. Modern relational databases include datatypes for storing ''binary blobs'', such as the contents of an image file, so storing a collection of 5 or 10 million small PNG or JPEG images inside a single SQLite file may be much more practical than storing them as individual files. There is some overhead in creating the SQLite database, and this approach assumes that you are familiar with SQL and with designing a simple relational database with a small number of tables. Note as well that the performance of SQLite can start to degrade for very large database files (several gigabytes or more), in which case you may need to contemplate the use of a more traditional [[Database servers | database server]] using [https://www.mysql.com MySQL] or [https://www.postgresql.org PostgreSQL].

<!--T:23-->
The SQLite executable is called <code>sqlite3</code>. It is available via the <code>nixpkgs</code> [[Utiliser_des_modules/en|module]], which is loaded by default on Compute Canada systems.
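As an illustration, the following sketch loads a directory of PNG files into a hypothetical database <code>images.db</code> with the <code>sqlite3</code> shell, whose built-in <code>readfile()</code> and <code>writefile()</code> functions read and write binary blobs; the table layout and file names are examples only.
{{File
  |name=load_images.sh
  |lang="bash"
  |contents=
#!/bin/bash
# Build one SQL script that creates the table and inserts every PNG
# file in the current directory as a binary blob.
echo "CREATE TABLE IF NOT EXISTS images (name TEXT PRIMARY KEY, data BLOB);" > load.sql
for f in *.png; do
  echo "INSERT INTO images (name, data) VALUES ('$f', readfile('$f'));" >> load.sql
done

# Run all of the statements in a single sqlite3 session.
sqlite3 images.db < load.sql

# Extract one image back out of the database by name.
sqlite3 images.db "SELECT writefile('copy.png', data) FROM images WHERE name = 'sample.png';"
}}
Generating the <tt>INSERT</tt> statements and running them in a single session avoids the cost of opening the database millions of times; for more control you can do the same thing through one of the language APIs mentioned above.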
