Tuning Lustre

From Alliance Doc
Revision as of 18:53, 14 February 2017 by Stubbsda

Lustre Filesystem

Lustre is a high-performance distributed filesystem which allows users of Compute Canada to reach high bandwidth for input/output operations. There are, however, some caveats to keep in mind in order to reach the best bandwidth.

Stripe Count and Stripe Size

For each file or directory, it is possible to change the stripe size and stripe count parameters. Stripe size is the size of the smallest block of data that is allocated on the filesystem. Stripe count is the number of disks (object storage targets, or OSTs, in Lustre terminology) across which the data are spread.

It is possible to get the value of those parameters for a given file or directory using the command

[name@server ~]$ lfs getstripe /path/to/file

It is also possible to change those parameters for a given directory using the command

[name@server ~]$ lfs setstripe -c count -S size /path/to/dir

For example, if count=8 and size=4m, the files will be spread across 8 disks and will grow in steps of 4 MB each time new space is required.
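To make the example values concrete, here is a minimal sketch. It assumes the usual round-robin layout, in which consecutive stripe-size blocks of a file go to successive disks. The function names `apply_example_striping` and `stripe_index` and the directory argument are hypothetical, and the long options `--stripe-count` and `--stripe-size` are used in place of the short flags.

```shell
# Apply the example values (count=8, size=4 MB) to a directory and display
# the resulting default layout. The directory argument is a placeholder and
# must live on a Lustre filesystem.
apply_example_striping() {
    lfs setstripe --stripe-count 8 --stripe-size 4m "$1" &&
        lfs getstripe -d "$1"
}

# Assumption: blocks of STRIPE_SIZE bytes are placed round-robin across
# the stripes. Print which stripe (0 .. COUNT-1) holds a given byte offset.
# usage: stripe_index OFFSET_BYTES STRIPE_SIZE_BYTES COUNT
stripe_index() {
    echo $(( ($1 / $2) % $3 ))
}

# With count=8 and size=4 MB, the block that starts at 36 MB is the tenth
# block of the file, so it wraps around to the second stripe (index 1).
stripe_index $((36 * 1024 * 1024)) $((4 * 1024 * 1024)) 8
```

Only `stripe_index` runs when the snippet is executed; `apply_example_striping` is defined but has to be called explicitly with a directory on the Lustre filesystem.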

It is not possible to change the stripe count or the stripe size of an existing file. To change those parameters, the file must be copied (not moved) to a directory with different parameters. To create an empty file with a given value of those parameters without changing the parameters of the directory, you may run lfs setstripe on the name of the file to be created. The file will be created as an empty file with the given parameters.
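The copy-based procedure described above could be wrapped in a small helper along these lines. This is only a sketch: the function name `restripe_copy` and its argument order are hypothetical, and it changes only the stripe count, leaving the stripe size at its default.

```shell
# Copy SRC into a new file DST that is created with the requested stripe count.
# usage: restripe_copy SRC DST STRIPE_COUNT
restripe_copy() {
    # lfs setstripe creates the file itself, so DST must not exist yet.
    if [ -e "$2" ]; then
        echo "restripe_copy: $2 already exists" >&2
        return 1
    fi
    # Create DST as an empty file with the new layout.
    lfs setstripe --stripe-count "$3" "$2" || return 1
    # cp writes into the existing empty file, so the layout set above is kept.
    cp "$1" "$2"
}

# Example call (placeholder paths):
#   restripe_copy /path/to/old_file /path/to/new_file 4
```

Once the copy has been checked, the original file can be deleted by hand; the helper deliberately leaves it in place.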

Increasing the stripe count may improve performance, but it also makes the file more susceptible to hardware failures, since a problem on any one of the disks holding a stripe affects the file.

When a parallel program needs to read a small file (< 1 MB), a configuration file for example, it is best to put this file on one disk (stripe count=1), to read it with the master rank, and to send its content to the other ranks using MPI_Bcast or MPI_Scatter.

When working with large files, it is usually best to use a stripe count as large as the number of MPI ranks. The stripe size should match the size of the buffer that each rank reads or writes at a time. For example, if each rank reads 1 MB of data at a time, the ideal stripe size will likely be 1 MB. If you don't know what size to use, your best bet is to keep the default value, which has been optimized for large files. Note that you must never use a stripe size that is not a multiple of 1 MB.
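As a rough illustration of these rules of thumb, the sketch below derives a stripe count and stripe size from the number of MPI ranks and the per-rank buffer size, rounding the size up to a whole number of MB. The function names `stripe_size_mb` and `suggest_striping` are hypothetical, and the sketch only prints the command rather than running it. It does not check how many disks the filesystem actually has, which limits the usable stripe count.

```shell
# Round a buffer size in bytes up to a whole number of MB (1 MB = 1048576 bytes).
# usage: stripe_size_mb BUFFER_BYTES
stripe_size_mb() {
    echo $(( ($1 + 1048575) / 1048576 ))
}

# Print (without running it) an lfs setstripe command for a directory holding
# large files accessed by RANKS MPI ranks, each reading or writing
# BUFFER_BYTES at a time.
# usage: suggest_striping RANKS BUFFER_BYTES DIR
suggest_striping() {
    echo "lfs setstripe --stripe-count $1 --stripe-size $(stripe_size_mb "$2")m $3"
}

# 16 ranks, each reading 1 MB at a time:
suggest_striping 16 1048576 /path/to/dir
```

Printing the command first makes it easy to review the values before applying them to a real directory.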

In general, you want to reduce the number of open/close operations on the filesystem. It is therefore best to concatenate all data within a single file rather than writing many small files. It is also best to open the file once at the beginning and close it once at the end of the program, rather than opening and closing it each time you want to add new data.
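The same advice applies to job scripts. The sketch below contrasts two ways of writing a loop's output; `compute_step` is a hypothetical stand-in for whatever produces the data, and the function names are illustrative only.

```shell
# Hypothetical producer of one result line per step.
compute_step() {
    echo "step $1 done"
}

# Reopens and closes the output file on every iteration.
write_reopening() {
    : > "$1"
    for i in 1 2 3 4 5; do
        compute_step "$i" >> "$1"
    done
}

# Opens the output file once for the whole loop and closes it at the end.
write_once() {
    for i in 1 2 3 4 5; do
        compute_step "$i"
    done > "$1"
}
```

Both functions produce the same file content, but `write_once` issues a single open/close pair on the filesystem, however many iterations the loop runs.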

See also