Tuning Lustre: Difference between revisions

From Alliance Doc
 
(7 intermediate revisions by 3 users not shown)
Line 3: Line 3:
= Lustre Filesystem = <!--T:1-->
= Lustre Filesystem = <!--T:1-->


<!--T:2-->
[http://lustre.org/ Lustre] is a high performance distributed filesystem which allows users to reach high bandwidth for input/output operations. There are however some caveats to consider if one wants to achieve the best performance.  
[http://lustre.org/ Lustre] is a high performance distributed filesystem which allows users of Compute Canada to reach high bandwidth for input/output operations. There are however some caveats to consider if one wants to achieve the best performance.  
Note that the advice offered here is for advanced users and should be used with caution. Be sure to carry out tests both to verify the scientific validity of your output and to ensure the changes lead to real performance improvements.
 
== Stripe Count and Stripe Size == <!--T:3-->
== Stripe Count and Stripe Size == <!--T:2-->  


<!--T:4-->
<!--T:4-->
Line 13: Line 13:
<!--T:5-->
<!--T:5-->
It is possible to get the value of those parameters for a given file or directory using the command
It is possible to get the value of those parameters for a given file or directory using the command
{{Command|lfs getstripe ''path/to/file''}}
{{Command|lfs getstripe /path/to/file}}


<!--T:6-->
<!--T:6-->
It is also possible to change those parameters for a given directory using the command
It is also possible to change those parameters for a given directory using the command
{{Command|lfs setstripe -c ''count'' -s ''size'' ''/path/to/dir''}}
{{Command|lfs setstripe -c count /path/to/dir}}


<!--T:7-->
<!--T:7-->
For example, if ''count''=8 and ''size''=4m, then the files will be spread on 8 disks and will grow by steps of 4 MB each time that new space is required.
For example, if ''count''=8 , then the files will be spread on 8 targets (RAIDs), each MB will be written in a round-robin fashion on up to 8 different servers.
 
<!--T:15-->
{{Command|lfs setstripe -c 8 /home/user/newdir}}


<!--T:8-->
<!--T:8-->
It is not possible to change the stripe count or the stripe size of an existing file. To change those parameters, the file must be '''copied''' (not moved) to a directory with different parameters. To create an empty file with a given value of those parameters without changing the parameters of the directory, you may run ''lfs setstripe'' on the name of the file to be created. The file will be created as an empty file with the given parameters.  
Changing the stripe count will not modify a existing file. To change those parameters, the file must be '''copied''' (not moved) to a directory with different parameters or the file needs to be migrated. To create an empty file with a given value of those parameters without changing the parameters of the directory, you may run ''lfs setstripe'' on the name of the file to be created. The file will be created as an empty file with the given parameters.  
 
<!--T:16-->
Example of a non-striped directory with a file called "example_file" (lmm_stripe_count is 1 and there is only 1 object for the file)
$ lfs getstripe striping_example/
striping_example/
stripe_count:  1 stripe_size:  1048576 pattern:      raid0 stripe_offset: -1
striping_example//example_file
lmm_stripe_count:  1
lmm_stripe_size:  1048576
lmm_pattern:      raid0
lmm_layout_gen:    0
lmm_stripe_offset: 2
obdidx objid objid group
    2       3714477     0x38adad   0x300000400
 
<!--T:17-->
We can change the striping of this directory to use a stripe count of 2 and create a new file.
 
<!--T:18-->
$ lfs setstripe -c 2 striping_example
$ dd if=/dev/urandom of=striping_example/new_file bs=1M count=10
$ lfs getstripe striping_example/
striping_example/
stripe_count:  2 stripe_size:  1048576 pattern:      raid0 stripe_offset: -1
striping_example//example_file
lmm_stripe_count:  1
lmm_stripe_size:  1048576
lmm_pattern:      raid0
lmm_layout_gen:    0
lmm_stripe_offset: 2
obdidx objid objid group
    2       3714477     0x38adad   0x300000400
striping_example//new_file
lmm_stripe_count:  2
lmm_stripe_size:  1048576
lmm_pattern:      raid0
lmm_layout_gen:    0
lmm_stripe_offset: 3
obdidx objid objid group
    3       3714601     0x38ae29   0x400000400
    0       3714618     0x38ae3a   0x2c0000400
 
<!--T:19-->
Only the new_file is using the new default of count=2 (lmm_stripe_count) and 2 objects are allocated.
 
<!--T:20-->
We can restripe the old file using ''lfs migrate''
$ lfs migrate -c 2 striping_example/example_file
$ lfs getstripe striping_example/example_file
striping_example/example_file
lmm_stripe_count:  2
lmm_stripe_size:  1048576
lmm_pattern:      raid0
lmm_layout_gen:    2
lmm_stripe_offset: 10
obdidx objid objid group
    10       3685344     0x383be0   0x500000400
    11       3685328     0x383bd0   0x540000400
 
<!--T:21-->
The file now has a lmm_stripe_count of 2 and 2 objects are allocated


<!--T:9-->
<!--T:9-->
Line 32: Line 96:


<!--T:11-->
<!--T:11-->
When treating large files, it is usually best to use a stripe count as large as the number of MPI ranks. For the stripe size, you will want it to be the same size as the buffer size for the data that is being read or written, by each rank. For example, if each rank reads 1 MB of data at a time, the ideal stripe size will likely be 1 MB. If you don't know what size to use, your best bet is to keep the default value, which has been optimized for large files. '''Note that you must never use a stripe size that is not a multiple of 1 MB'''.
When treating large files, it is usually best to use a stripe count as large as the number of MPI ranks. For the stripe size, you will want it to be the same size as the buffer size for the data that is being read or written, by each rank. For example, if each rank reads 1 MB of data at a time, the ideal stripe size will likely be 1 MB. If you don't know what size to use, your best bet is to keep the default value, which has been optimized for large files. '''Note that the stripe size must always be an integer multiple of 1MB'''.


<!--T:12-->
<!--T:12-->

Latest revision as of 13:23, 28 April 2022


Lustre Filesystem

Lustre is a high-performance distributed filesystem which allows users to reach high bandwidth for input/output operations. There are, however, some caveats to consider if one wants to achieve the best performance. Note that the advice offered here is for advanced users and should be used with caution. Be sure to carry out tests both to verify the scientific validity of your output and to ensure the changes lead to real performance improvements.

Stripe Count and Stripe Size

For each file or directory, it is possible to change the stripe size and stripe count parameters. Stripe size is the size of the smallest block of data that is allocated on the filesystem. Stripe count is the number of storage targets (disks or RAIDs) across which the data are spread.

It is possible to get the value of those parameters for a given file or directory using the command

[name@server ~]$ lfs getstripe /path/to/file

It is also possible to change those parameters for a given directory using the command

[name@server ~]$ lfs setstripe -c count /path/to/dir

For example, if count=8, then the files will be spread over 8 targets (RAIDs): with a stripe size of 1 MB, each successive MB is written in a round-robin fashion to up to 8 different servers.

[name@server ~]$ lfs setstripe -c 8 /home/user/newdir
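The round-robin placement described above can be sketched with plain shell arithmetic. This is only an illustration under simplifying assumptions, not Lustre's actual implementation: the function name stripe_index is hypothetical, a plain raid0 layout is assumed, and the result is the position of the object within the file's own list of objects, not the index of the physical target.

```shell
# Hypothetical helper: which of a file's stripe objects holds a given byte offset,
# assuming a plain round-robin (raid0) layout.
# Usage: stripe_index OFFSET_BYTES STRIPE_SIZE_BYTES STRIPE_COUNT
stripe_index() {
    echo $(( ($1 / $2) % $3 ))
}

# With a 1 MB stripe size (1048576 bytes) and a stripe count of 8:
stripe_index 0 1048576 8          # first MB lands on object 0
stripe_index 1048576 1048576 8    # second MB lands on object 1
stripe_index 8388608 1048576 8    # ninth MB wraps around to object 0
```

Consecutive 1 MB chunks go to consecutive objects, so a large sequential read or write keeps all 8 targets busy at once.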

Changing the stripe count of a directory does not modify existing files. To change the parameters of an existing file, the file must be copied (not moved) to a directory with different parameters, or it must be migrated. To create an empty file with given parameters without changing the parameters of the directory, you may run lfs setstripe on the name of the file to be created. The file will be created as an empty file with the given parameters.
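The copy-based approach can be sketched as a small shell function that combines the two ideas in the paragraph above: create an empty file with the desired stripe count, then copy the data into it. The function name restripe_by_copy and the temporary file suffix are hypothetical choices for this example. The sketch assumes that copying into the pre-created file keeps that file's layout, so verify the result with lfs getstripe before relying on it.

```shell
# Hypothetical helper: rewrite FILE with a new stripe count by copying it.
# Usage: restripe_by_copy FILE STRIPE_COUNT
restripe_by_copy() {
    src=$1
    count=$2
    tmp="$src.restripe.$$"    # temporary name in the same directory (assumption)

    # Pre-create the empty destination with the requested stripe count.
    # Where the lfs command is not installed, this step is skipped and
    # the function degrades to a plain copy.
    if command -v lfs >/dev/null 2>&1; then
        lfs setstripe -c "$count" "$tmp" || return 1
    fi

    # Copy the data (a move would keep the old layout), then replace the original.
    # Note: plain cp does not preserve timestamps or ownership.
    if cp "$src" "$tmp"; then
        mv "$tmp" "$src"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

A call such as restripe_by_copy striping_example/example_file 2 temporarily needs room for a second copy of the file.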

The following example shows a non-striped directory containing a file called "example_file" (lmm_stripe_count is 1 and there is only 1 object for the file):

$ lfs getstripe striping_example/
striping_example/
stripe_count:  1 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1
striping_example//example_file
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 2
	obdidx		 objid		 objid		 group
	     2	       3714477	     0x38adad	   0x300000400

We can change the striping of this directory to use a stripe count of 2 and create a new file.

$ lfs setstripe -c 2 striping_example
$ dd if=/dev/urandom of=striping_example/new_file bs=1M count=10
$ lfs getstripe striping_example/
striping_example/
stripe_count:  2 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1
striping_example//example_file
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 2
	obdidx		 objid		 objid		 group
	     2	       3714477	     0x38adad	   0x300000400
striping_example//new_file
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 3
	obdidx		 objid		 objid		 group
	     3	       3714601	     0x38ae29	   0x400000400
	     0	       3714618	     0x38ae3a	   0x2c0000400

Only new_file uses the new default of count=2 (lmm_stripe_count), with 2 objects allocated; example_file keeps its original layout.

We can restripe the old file using lfs migrate:

$ lfs migrate -c 2 striping_example/example_file
$ lfs getstripe striping_example/example_file
striping_example/example_file
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    2
lmm_stripe_offset: 10
	obdidx		 objid		 objid		 group
	    10	       3685344	     0x383be0	   0x500000400
	    11	       3685328	     0x383bd0	   0x540000400

The file now has an lmm_stripe_count of 2, and 2 objects are allocated.

Increasing the stripe count may improve performance, but it also makes the file more susceptible to hardware failures, because its data then depend on more targets.

When a parallel program needs to read a small file (< 1 MB), a configuration file for example, it is best to put this file on one disk (stripe count=1), to read it with the master rank, and to send its content to the other ranks using MPI_Bcast or MPI_Scatter.

When working with large files, it is usually best to use a stripe count as large as the number of MPI ranks. The stripe size should match the size of the buffer that each rank reads or writes at a time. For example, if each rank reads 1 MB of data at a time, the ideal stripe size will likely be 1 MB. If you don't know what size to use, your best bet is to keep the default value, which has been optimized for large files. Note that the stripe size must always be an integer multiple of 1 MB.
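As a small worked example of the 1 MB rule, the sketch below rounds a per-rank buffer size up to the next multiple of 1 MB. The function name round_up_to_mb is hypothetical, and 1 MB is taken here as 1048576 bytes, the stripe_size value reported by lfs getstripe in the listings above.

```shell
# Hypothetical helper: round a buffer size in bytes up to a whole number of MB.
# 1 MB is taken as 1048576 bytes (assumption based on the lfs getstripe output).
# Usage: round_up_to_mb BUFFER_SIZE_BYTES
round_up_to_mb() {
    mb=1048576
    echo $(( ( ($1 + mb - 1) / mb ) * mb ))
}

round_up_to_mb 1048576    # already a multiple of 1 MB: stays 1048576
round_up_to_mb 1500000    # rounds up to 2097152 (2 MB)
```

Whether rounding up or falling back to the default gives better results depends on the access pattern, so it is worth measuring both.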

In general, you want to reduce the number of open/close operations on the filesystem. It is therefore best to concatenate all data within a single file rather than writing many small files. It is also best to open the file once at the beginning and close it once at the end of the program, rather than opening and closing it each time you want to add new data.
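The open-once, close-once pattern can be illustrated even in a shell script. In the sketch below, the function name write_records and the record format are hypothetical. The output file is opened a single time on file descriptor 3, every record is written through that descriptor, and the file is closed once at the end. Appending with >> inside the loop would instead open and close the file on every iteration.

```shell
# Hypothetical helper: write N records to FILE through one open file descriptor.
# Usage: write_records FILE N
write_records() {
    out=$1
    n=$2

    exec 3> "$out"              # open once, at the start
    i=1
    while [ "$i" -le "$n" ]; do
        echo "record $i" >&3    # reuses descriptor 3: no open/close per record
        i=$((i + 1))
    done
    exec 3>&-                   # close once, at the end
}

# Example call (the path is a placeholder):
# write_records /path/to/combined_output.dat 1000
```

The same idea applies in compiled or MPI programs: keep the file handle open for the lifetime of the run and write all records through it.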

See also