Using nearline storage
Nearline is a tape-based filesystem intended for inactive data. Datasets which you do not expect to access for months are good candidates to be stored in /nearline.
Restrictions and best practices
Size of files
Retrieving small files from tape is inefficient, while extremely large files pose other problems. Please observe these guidelines about the size of files to store in /nearline:
- Files smaller than ~200MB should be combined into archive files (tarballs) using tar or a similar tool.
- Files larger than 300GB should be split into chunks of 100GB using the split command or a similar tool.
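For example, a minimal sketch (the directory and file names here are placeholders):

# Combine many small files into a single archive before moving it to /nearline:
$ tar -cf ~/nearline/PROJECT/small_inputs.tar small_inputs/
# Split a file larger than 300GB into 100GB chunks:
$ split -b 100G huge_file.dat huge_file.dat.part_
# The chunks can later be reassembled with:
$ cat huge_file.dat.part_* > huge_file.dat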
Using tar or dar
Use tar or dar to create an archive file directly on /nearline. There is no advantage to creating the archive on a different filesystem and then copying it to /nearline once complete.
If you have hundreds of gigabytes of data, the tar options -M (--multi-volume) and -L (--tape-length) can be used to produce archive files of suitable size.
If you are using dar, you can similarly use the -s (--slice) option.
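As a sketch, the following commands would produce volumes of roughly 100GB; the archive and directory names are placeholders, and note that -L takes its size in units of 1024 bytes:

# tar: write a multi-volume archive; tar prompts for more volumes if these are not enough.
$ tar -c -M -L 104857600 -f dataset.tar.0 -f dataset.tar.1 dataset/
# dar: write the archive in slices of about 100GB each (dataset_archive.1.dar, .2.dar, ...):
$ dar -c dataset_archive -s 100G -R dataset/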
No access from compute nodes
Because data retrieval from /nearline may take an unpredictable amount of time (see "How it works" below), we do not permit reading from /nearline in a job context; /nearline is not mounted on compute nodes.
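If a job needs data that lives in /nearline, stage it first on a login node or DTN; a minimal sketch, assuming a tarball already stored in /nearline and the usual $SCRATCH environment variable:

# On a login node or DTN, before submitting the job.
# The cp may block while the file is recalled from tape:
$ cp ~/nearline/PROJECT/dataset.tar $SCRATCH/
$ tar -xf $SCRATCH/dataset.tar -C $SCRATCH/
# The job can then read the data from $SCRATCH, which is mounted on compute nodes.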
Use a data-transfer node if available
Creating a tar or dar file for a large volume of data can be resource-intensive. Please do this on a data-transfer node (DTN) instead of a login node if the cluster you are using supports DTN logins.
Why /nearline?
Tape as a storage medium has these advantages over disk and solid-state ("SSD") media.
- Cost per unit of data stored is lower.
- The volume of data stored can be easily expanded by buying more tapes.
- Energy consumption per unit of data stored is effectively zero.
Consequently we can offer much greater volumes of storage on /nearline than we can on /project. Also, keeping inactive data off /project reduces the load on it and improves its performance.
How it works
- When a file is first copied to (or created on) /nearline, the file exists only on disk, not tape.
- After a period (on the order of a day), and if the file meets certain criteria, the system will copy the file to tape. At this stage, the file will be on both disk and tape.
- After a further period the disk copy may be deleted, and the file will only be on tape.
- When such a file is recalled, it is copied from tape back to disk, returning it to the second state.
When a file has been moved entirely to tape (that is, when it has been virtualized), it will still appear in the directory listing. If the virtual file is read, it will take some time for the tape to be fetched from the library and the file copied back to disk; the process trying to read the file will block while this happens. This may take from less than a minute to over an hour, depending on the size of the file and the demand on the tape system.
You can determine whether a given file is still on disk or has been moved to tape using the lfs hsm_state command, where "hsm" stands for "hierarchical storage manager":
# Here, <FILE> has not been copied to tape; it exists only on disk.
$ lfs hsm_state <FILE>
<FILE>: (0x00000000)

# Here, <FILE> has been copied to tape and is also still on disk.
$ lfs hsm_state <FILE>
<FILE>: [...]: exists archived, [...]

# Here, <FILE> exists only on tape; there will be a lag when opening it.
$ lfs hsm_state <FILE>
<FILE>: [...]: released archived, [...]
You can explicitly force a file to be recalled from tape, without actually reading it, with the command lfs hsm_restore <FILE>.
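For example, you can trigger the recall ahead of time and check on it later. Both commands below are part of the standard Lustre HSM interface, though their availability may vary by cluster:

# Start the recall; this returns without waiting for the file.
$ lfs hsm_restore <FILE>
# Check whether the recall is still in progress:
$ lfs hsm_action <FILE>
# When hsm_state reports "exists archived", the file is back on disk:
$ lfs hsm_state <FILE>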
Note that as of October 2020, the output of the command diskusage_report, also known as quota, does not report on /nearline space consumption.
Cluster-specific information
/nearline is only accessible as a directory on login nodes and on DTNs (Data Transfer Nodes).
To use /nearline, just put files into your ~/nearline/PROJECT directory. After a period of time (24 hours as of February 2019), they will be copied onto tape. If a file then remains unchanged for a further period (another 24 hours as of February 2019), the copy on disk will be removed, leaving the file virtualized on tape.
If you accidentally (or deliberately) delete a file from ~/nearline, the tape copy will be retained for up to 60 days. To restore such a file, contact technical support with the full path of the file(s) and the desired version (by date), just as you would to restore a backup. Since you will need the full path, it is important to retain a copy of the complete directory structure of your /nearline space. For example, you can run the command ls -R > ~/nearline_contents.txt from the ~/nearline/PROJECT directory so that you have a record of the location of all your files.
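For example (PROJECT and the file names below are placeholders):

# Copy data into /nearline; after roughly 48 hours it may exist only on tape:
$ cp $SCRATCH/results.tar ~/nearline/PROJECT/
# Record the directory tree so you can give full paths to technical support later:
$ cd ~/nearline/PROJECT && ls -R > ~/nearline_contents.txt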
On some other clusters, the /nearline service is similar to that on Graham.
HPSS is the /nearline service on Niagara.
There are three methods to access the service:
1. By submitting the HPSS-specific commands htar or hsi to the Slurm scheduler as a job in one of the archive partitions; see the HPSS documentation for detailed examples, and the sample script after this list. Using job scripts offers the benefit of automating /nearline transfers and is the best method if you use HPSS regularly. Your HPSS files can be found in the $ARCHIVE directory, which is like $PROJECT but with /project replaced by /archive.
2. To manage a small number of files in HPSS, you can use the VFS (Virtual File System) node, which is accessed with the command salloc --time=1:00:00 -pvfsshort. Your HPSS files can again be found in the $ARCHIVE directory.
3. By using Globus for transfers to and from HPSS using the endpoint computecanada#hpss. This is useful for occasional usage and for transfers to and from other sites.
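As a sketch of the first method, a minimal job script might look like the following; the partition name, time limit, and file names are assumptions to be checked against the HPSS documentation:

#!/bin/bash
#SBATCH --time=72:00:00          # assumed time limit
#SBATCH --partition=archivelong  # assumed archive partition name
#SBATCH --nodes=1
#SBATCH --job-name=htar-create

# Bundle a directory into an archive stored in HPSS under $ARCHIVE.
htar -cpf $ARCHIVE/dataset.tar dataset/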