Using nearline storage: Difference between revisions

Marked this version for translation
(major revision following Sep 28 S2S seminar)
Nearline is a tape-based file system intended for *inactive data*.  Data sets which you do not expect to access for months are good candidates to be stored in nearline.  


= Best practices, and restrictions = <!--T:33-->


==== Size of files ==== <!--T:34-->


<!--T:35-->
Retrieving small files from tape is inefficient, while extremely large files pose other problems.  Please observe these guidelines about the size of files to store in nearline:


*Files larger than 300GB should be split into chunks of 100GB using the [[A_tutorial_on_'tar'#split|split]] command or a similar tool.
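
As a rough illustration of the splitting guideline above, the following sketch runs GNU <code>split</code> on a small dummy file and then reassembles it. The file names and the tiny chunk size are placeholders chosen so the example runs quickly; for real data you would use something like <code>-b 100G</code>. The <code>-d</code> option (numeric suffixes) assumes GNU coreutils.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1 MB of random data stands in for a file larger than 300GB.
head -c 1000000 /dev/urandom > bigfile.dat

# Split into 400000-byte chunks with numeric suffixes: part_00, part_01, ...
# For real data on nearline, use -b 100G instead.
split -b 400000 -d bigfile.dat bigfile.dat.part_
ls -l bigfile.dat.part_*

# Reassemble the chunks in order and check the result against the original.
cat bigfile.dat.part_* > restored.dat
cmp bigfile.dat restored.dat && echo "restored file matches the original"
```

Because the shell expands the glob in sorted order, <code>cat</code> concatenates the chunks in the same order in which <code>split</code> wrote them.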


==== Using tar or dar ==== <!--T:36-->


<!--T:37-->
Use [[A tutorial on 'tar'|tar]] or [[dar]] to create an archive file directly on nearline.  There is no advantage to creating the archive on a different filesystem and then copying it to nearline once complete.
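
A minimal sketch of writing an archive straight to its final location might look like the following. <code>NEARLINE_DIR</code> is a hypothetical variable standing in for your own nearline directory (it falls back to a temporary directory so the sketch can be run anywhere), and the <code>results</code> directory is sample data created only for the demonstration.

```shell
# NEARLINE_DIR is a hypothetical stand-in for your nearline directory;
# it defaults to a temporary directory so the sketch runs anywhere.
NEARLINE_DIR=${NEARLINE_DIR:-$(mktemp -d)}

# Create a small sample directory to archive.
srcdir=$(mktemp -d)
mkdir -p "$srcdir/results"
echo "sample output" > "$srcdir/results/run1.txt"

# Write the archive directly to its destination; no intermediate copy is made.
tar -cf "$NEARLINE_DIR/results.tar" -C "$srcdir" results

# List the contents to confirm the archive is readable.
tar -tf "$NEARLINE_DIR/results.tar"
```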


<!--T:38-->
If you have hundreds of gigabytes of data, the <code>tar</code> options <code>-M (--multi-volume)</code> and <code>-L (--tape-length)</code> can be used to produce archive files of suitable size.
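
The multi-volume options can be sketched as follows, assuming GNU <code>tar</code>. The volume names and the very small volume size (100 KiB) are placeholders so that the example runs in seconds; with a recent GNU <code>tar</code>, a size suffix such as <code>-L 100G</code> should give volumes suitable for nearline.

```shell
# Scratch area with about 250 kB of sample data.
workdir=$(mktemp -d)
mkdir -p "$workdir/data" "$workdir/restore"
head -c 250000 /dev/urandom > "$workdir/data/sample.bin"

# -M enables multi-volume mode; -L gives the volume size in units of 1024 bytes.
# Each -f names one volume; tar moves on to the next name when a volume is full.
# More names are listed than needed, because tar prompts once it runs out.
tar -c -M -L 100 \
    -f "$workdir/vol1.tar" -f "$workdir/vol2.tar" \
    -f "$workdir/vol3.tar" -f "$workdir/vol4.tar" \
    -C "$workdir" data

ls -l "$workdir"/vol*.tar

# Extraction needs the volumes in the same order in which they were written.
tar -x -M \
    -f "$workdir/vol1.tar" -f "$workdir/vol2.tar" \
    -f "$workdir/vol3.tar" -f "$workdir/vol4.tar" \
    -C "$workdir/restore"

cmp "$workdir/data/sample.bin" "$workdir/restore/data/sample.bin" \
    && echo "extracted file matches the original"
```

Volumes of a multi-volume archive are not independent archives: a file may continue from one volume into the next, so keep all volumes of a set together.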


<!--T:39-->
If you are using <code>dar</code>, you can similarly use the <code>-s (--slice)</code> option.


==== No access from compute nodes ==== <!--T:40-->


<!--T:41-->
Because data retrieval from nearline may take an uncertain amount of time (see "How it works" below), we do not permit reading from nearline in a job context.  Nearline is not mounted on compute nodes.


==== Use a data-transfer node if available ==== <!--T:42-->


<!--T:32-->
Creating a tar or dar file for a large volume of data can be resource-intensive.  Please do this on a data-transfer node (DTN) instead of a login node if login to a DTN is supported at the cluster you are using.


= Why nearline? = <!--T:43-->


<!--T:44-->
Tape as a storage medium has these advantages over disk and solid-state ("SSD") media.
# Cost per unit of data stored is lower.
# Energy consumption per unit of data stored is effectively zero.


<!--T:45-->
Consequently we can offer much greater volumes of storage on nearline than we can on project.  Also, keeping inactive data ''off'' of project reduces the load and improves its performance.


= How it works = <!--T:46-->


<!--T:22-->
You can determine whether a given file has been moved to tape or is still on disk using the <code>lfs hsm_state</code> command.  The "hsm" stands for "hierarchical storage manager".


<!--T:47-->
<source lang="bash">
#  Here, <FILE> has not been copied to tape.