Parallel I/O introductory tutorial
<translate>


<!--T:1-->
{{Draft}}


<!--T:2-->
This self-study tutorial discusses the issues involved in handling large amounts of data on HPC systems and presents a variety of strategies for doing large-scale input/output (I/O) in parallel jobs. In particular, we will focus on MPI-IO and then introduce higher-level parallel I/O libraries such as NetCDF, HDF5 and ADIOS.


=HPC I/O Issues & Goal= <!--T:3-->


<!--T:4-->
Many of today's problems are computationally expensive, requiring large parallel runs on big distributed-memory machines (clusters). There are basically three big I/O activities in such jobs. First, the application has to read an initial dataset or initial conditions from a file. Second, mostly at the end of a calculation, data need to be stored on disk for follow-up runs or post-processing; parallel applications commonly need to write distributed arrays to disk. Third, the application state needs to be written to a file periodically so that the application can be restarted after a system failure. The figure below is a simple sketch of the I/O bottleneck that appears when a parallel job uses many CPUs or nodes. As Amdahl's law states, the speedup of a parallel program is limited by the time needed for its sequential fraction, so if the I/O part of the application runs sequentially, the code will not scale as desired.


<!--T:5-->
* Reading initial conditions or datasets for processing
* Writing numerical data from simulations for later analysis




<!--T:6-->
[[File:hpc_IO.png|400px|center]]


<!--T:7-->
''Efficient I/O without stressing out the HPC system is challenging''


<!--T:8-->
We will go over the physical limitations of handling data in memory or on disk, but the short version is that a load/store operation from memory or disk takes much more time than a multiply operation in the CPU. The total execution time is commonly the sum of the computation time on the CPUs, the communication time over the interconnect or network, and the I/O time. Efficient I/O handling is therefore a key factor in getting the best performance in high-performance computing.


<!--T:9-->
* Load and store operations are more time-consuming than multiply operations
* '''Total Execution Time = Computation Time + Communication Time + I/O Time'''
* Optimize all the components of the equation above to get the best performance!


==Disk access rates over time== <!--T:10-->


<!--T:11-->
In an HPC system, the I/O-related subsystems are typically slow compared to its other parts. The figure below shows how the internal drive access rate has improved over time. From 1960 to 2014 the speed of the top supercomputers increased by 11 orders of magnitude. Over the same period, however, the capacity of a single hard disk drive grew by only 6 orders of magnitude, and the average internal drive access rate, i.e. the rate at which we can actually store data, grew by only 3-4 orders of magnitude. This discrepancy means we can produce far more data than we can store at a proportional rate, so we need to pay special attention to how we store it.


<!--T:12-->
[[File:Diskaccess.png|400px|center]]


==Memory/Storage latency== <!--T:13-->
Memory/storage latency refers to the delay in transferring data between the CPU and the memory or storage medium. Most CPUs operate on a time scale of about one nanosecond. As the figure shows, writing to the L2 cache, for example, takes roughly 10 times longer than a CPU operation. Access to memory and storage is thus physically limited, and this also affects I/O operations.


<!--T:14-->
[[File:Memory.png|300px|center]]


==How to calculate I/O speed== <!--T:15-->


<!--T:16-->
Before we proceed, let us define two performance measurements. The first is IOPs, the number of I/O operations (read, write, open, close, seek, ...) per second; it is essentially the inverse of latency (think of latency as the period and IOPs as the frequency). The second is I/O bandwidth, the quantity of data you read or write per unit of time, a term you probably know from your Internet connection at home or at the office. The chart below lists figures for several I/O devices. Top-of-the-line SSDs on a PCI Express bus can push this to roughly a million IOPs, but such devices are still very expensive, so they are not the right fit for supercomputing systems with several hundred terabytes of storage.


<!--T:17-->
One thing to emphasize is that parallel filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes; this does not by itself result in “supercomputing” I/O performance.


<!--T:18-->
*'''IOPs''' = Input/Output operations per second (read/write/open/close/seek); essentially an inverse of latency
*'''I/O Bandwidth''' = quantity you read/write per unit of time
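As a rough illustration (hypothetical numbers): a device with an average operation latency of 5 ms can sustain about 1/0.005 = 200 IOPs, while one with a 0.1 ms latency can reach roughly 10,000 IOPs.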


<!--T:19-->
Parallel (distributed) filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes; they do not by themselves result in “supercomputing” performance, because every access involves


<!--T:20-->
* disk-access time + communication over the network (limited bandwidth, many users)


==I/O Software + Hardware stack== <!--T:21-->


<!--T:22-->
* I/O Hardware --> Parallel filesystem --> I/O Middleware --> High-end I/O library --> Application


<!--T:23-->
When it comes to organizing parallel I/O, there are several layers of abstraction to keep in mind. Starting from the bottom, there is the I/O hardware, the physical array of hard disks attached to the cluster. On top of that runs the parallel filesystem.


<!--T:24-->
On most of the national systems we run Lustre, an open-source parallel filesystem. Its purpose is to maintain the logical partitions and to provide efficient access to the data. On top of the parallel filesystem sits the I/O middleware, which organizes access from many processes by optimizing two-phase I/O, disk I/O and data flow over the network, and which provides data sieving by converting many small non-contiguous I/O requests into fewer, bigger requests. Above that there may be a high-end I/O library such as HDF5 or NetCDF. Such a library maps application abstractions, i.e. the data structures of your code, onto storage abstractions: data are stored on disk by calling the library, and these libraries are implemented to do this very efficiently. It is generally better to use one of these libraries, and we support both HDF5 and NetCDF. Alternatively, you can call the I/O middleware, MPI-IO, directly. This tutorial focuses mostly on MPI-IO, which is part of the MPI-2 standard, but it also discusses the pros and cons of the different approaches. Finally, at the top sits the application, i.e. your program, which decides whether to use a high-end I/O library or the I/O middleware.


=Parallel filesystem= <!--T:25-->


<!--T:26-->
The national systems have parallel filesystems designed to scale efficiently to tens of thousands of compute nodes. For better performance, files can be striped across multiple drives: a file does not reside on a single hard drive but on several, so that while one drive is busy with a read operation another drive can already be sending data back to the program.


<!--T:27-->
To prevent conflicts when two or more processes access the same file, parallel filesystems use locks to manage concurrent file access. Files are divided into 'lock' units that are scattered across multiple hard drives, and the client (compute) nodes obtain locks on the units they need before any I/O occurs.


<!--T:28-->
[[File:locks.png|400px|center]]


<!--T:29-->
* Files can be striped across multiple drives for better performance
* '''Locks''' are used to manage concurrent file access in most parallel filesystems
**Locks are reclaimed from clients when others desire access


<!--T:30-->
The most important point is that the parallel filesystem is optimized for storing large shared files that may be accessed from many compute nodes; it performs very poorly when storing many small files. As you may have heard at our new-user seminar, we strongly recommend that users not generate millions of small files.


<!--T:31-->
Also, how you read and write, your file format, the number of files in a directory, and how often you run commands such as ls affect every user! We quite often receive tickets from users who cannot even run 'ls' in their /work directory; in most cases this is caused by someone performing very intense I/O in that directory, which slows the whole system down.

The filesystem is shared over the cluster's network: hammering the filesystem can hurt interprocess (mostly MPI) communication, which affects other users as well.


<!--T:32-->
Please note that the filesystems are not infinite: bandwidth, IOPs, the number of files, and space are all limited.


<!--T:33-->
*Optimized for large shared files
*Poor performance under many small reads/writes (high IOPs): do not store millions of small files
*Filesystems are LIMITED: bandwidth, IOPs, number of files, space, etc.


==Best Practices for I/O== <!--T:34-->


<!--T:35-->
What are the best practices for I/O?


<!--T:36-->
First of all, it is always recommended to make a plan for your data needs: how much will be generated, how much needs to be saved, and where will it be kept?


<!--T:37-->
On the national systems, different filesystems (home, project, scratch) have different quotas, and scratch data is also subject to expiry; for more details see [https://alliancecan.ca/en/services/advanced-research-computing/national-services/storage this document]. Take these limits into account before submitting a job.


<!--T:38-->
Please minimize the use of commands such as 'ls' or 'du', especially in directories containing many files.


<!--T:39-->
Regularly check your disk usage with the quota command. Treat it as a warning sign that should prompt careful consideration if you have more than 100,000 files in your space, or if your average data file is smaller than 100MB while you are writing lots of data.


<!--T:40-->
Please do 'housekeeping' regularly to keep the number of files and your quota usage reasonable. The tar and gzip commands are commonly used to group many files into a single archive and compress them, which reduces the file count.


<!--T:41-->
*Make a plan for your data needs: How much will you generate? How much do you need to save? Where will you keep it?
*Monitor and control usage: minimize the use of filesystem commands like ‘ls’ and ‘du’ in large directories
*Do ‘housekeeping’ (gzip, tar, delete) regularly


=Data Formats= <!--T:42-->


===ASCII=== <!--T:43-->
First, there is the ASCII (or 'text') format. It is human readable but not efficient, so it is fine for a small input or parameter file but not for large datasets. ASCII takes much more storage than other formats and is correspondingly more expensive to read and write. To see whether your code uses it, look for 'fprintf' in C code or an open statement with the 'formatted' option in Fortran code.


<!--T:44-->
'''ASCII''' = '''A'''merican '''S'''tandard '''C'''ode for '''I'''nformation '''I'''nterchange
*pros: human readable, portable (architecture independent)
*cons: inefficient storage (13 bytes per single precision float, 22 bytes per double precision, plus delimiters), expensive for read/write


  <!--T:45-->
  fprintf() in C
  open(6,file='test',form='formatted'); write(6,*) in F90


===Binary=== <!--T:46-->


<!--T:47-->
The binary format is computationally much 'cheaper' than ASCII, which needs 13 bytes per single-precision value and 22 bytes per double-precision value. The table below shows an experiment writing 128M doubles to different locations (/scratch and /tmp on the GPC system at SciNet); binary writes clearly take far less time than ASCII writes. A short sketch contrasting the two formats in C follows at the end of this subsection.


<!--T:48-->
{| border="1" cellpadding="2" cellspacing="0" style="margin: auto"
! style="background:#8AA8E5;" | Format
|}


<!--T:49-->
*pros: efficient storage (4 bytes per single precision float, 8 bytes per double precision, no delimiters), efficient read/write
*cons: have to know the format to read, portability (endians)


  <!--T:50-->
  fwrite() in C
  open(6,file='test',form='unformatted'); write(6) in F90
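To make the difference concrete, here is a minimal sketch in C (not from the original tutorial; the file names are illustrative and error checking is omitted) that writes the same array of doubles once in ASCII and once in binary:
<pre>
#include <stdio.h>

int main(void) {
    double x[1000];
    for (int i = 0; i < 1000; i++)
        x[i] = i * 0.001;

    /* ASCII: roughly 22 bytes per value plus a delimiter */
    FILE *fa = fopen("data.txt", "w");
    for (int i = 0; i < 1000; i++)
        fprintf(fa, "%.15e\n", x[i]);
    fclose(fa);

    /* binary: exactly 8 bytes per value, no delimiters */
    FILE *fb = fopen("data.bin", "wb");
    fwrite(x, sizeof(double), 1000, fb);
    fclose(fb);

    return 0;
}
</pre>
The resulting data.txt is roughly three times larger than data.bin and takes correspondingly longer to write and parse.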


===MetaData (XML)=== <!--T:51-->


<!--T:52-->
While the binary format is efficient, you sometimes need to store additional information, such as the number of variables, the dimensions and sizes of the arrays, and so on. Metadata is data that describes the binary data. If you pass binary files to someone else or to another program, it is very helpful to include this information by using a metadata format. This can also be achieved by using the high-end libraries such as HDF5 and NetCDF.


<!--T:53-->
*Encodes data about data: number and names of variables, their dimensions and sizes, endians, owner, date, links, comments, etc.


===Database=== <!--T:54-->


<!--T:55-->
The database format is good for many small records. With a database, data organization and analysis can be greatly simplified. CHARENTE supports three different database packages. It is not very common in numerical simulation, though.


<!--T:56-->
*very powerful and flexible storage approach
*data organization and analysis can be greatly simplified
*open-source software: SQLite (serverless), PostgreSQL, MySQL


===Standard scientific dataset libraries=== <!--T:57-->


<!--T:58-->
There are standard scientific dataset libraries. As mentioned above, these libraries not only store large arrays efficiently, they also include the data descriptions that a metadata format provides. Moreover, they offer data portability across platforms and languages, meaning that a file generated on one machine can be read on another without problems. They can also store data with optional compression, which can be extremely useful: if a large-scale simulation produces a large dataset with many repeating values (such as zeros), the libraries compress those values efficiently and can reduce the required storage dramatically.


<!--T:59-->
*HDF5 = Hierarchical Data Format
*NetCDF = Network Common Data Format
*Optionally provide parallel I/O


=Serial and Parallel I/O= <!--T:60-->


<!--T:61-->
In large parallel calculations your dataset is distributed across many processors/nodes. As the figure shows, the computational domain is decomposed into several pieces and each node is allocated one of them. Each node computes on its part of the domain and then has to store its data to disk. Unfortunately, simply having a parallel filesystem is not sufficient in this case: you must organize the parallel I/O yourself, as discussed shortly. For the file format there are a couple of options, such as raw binary without metadata, or one of the high-end libraries (HDF5/NetCDF).


<!--T:62-->
*In large parallel calculations your dataset is distributed across many processors/nodes
*In this case using a parallel filesystem isn’t enough – you must organize parallel I/O yourself
*Data can be written as raw binary, HDF5 or NetCDF


==Serial I/O (Single CPU)== <!--T:63-->


<!--T:64-->
When you want to write data from the memory of multiple compute nodes to a single file on disk, there are several possible approaches. The simplest is to designate a 'spokesperson' process that collects all the data from the other members of the communicator and, once everything has been gathered, writes it to a file with regular serial I/O (a minimal sketch of this pattern is shown after the figure below). This solution is simple and easy to implement, but it has several problems. First, the write bandwidth is limited by the rate of a single client, and the same is true of the memory. Second, the operation time grows linearly with the amount of data (the problem size), and it also grows with the number of processes, because collecting all the data onto a single node or CPU takes longer. This approach therefore cannot scale.


  <!--T:65-->
  '''Pros:'''
  trivially simple for small I/O
  some I/O libraries not parallel
  '''Cons:'''
  won’t scale (built-in bottleneck)


<!--T:66-->
[[File:serial_IO.png|400px|center]]
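As a hedged illustration of this 'spokesperson' pattern (not code from the original tutorial; the file name all.bin is made up and error checking is omitted), each rank sends its 10 doubles to rank 0, which then writes everything with ordinary serial I/O:
<pre>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double local[10], *global = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int i = 0; i < 10; i++) local[i] = (double)rank;

    if (rank == 0)                       /* rank 0 needs room for everything */
        global = malloc(10 * nprocs * sizeof(double));

    /* gather all data onto rank 0: this is the memory and bandwidth bottleneck */
    MPI_Gather(local, 10, MPI_DOUBLE, global, 10, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {                     /* ordinary serial I/O */
        FILE *f = fopen("all.bin", "wb");
        fwrite(global, sizeof(double), 10 * nprocs, f);
        fclose(f);
        free(global);
    }

    MPI_Finalize();
    return 0;
}
</pre>
Note how rank 0 must hold the entire dataset in memory and perform all of the writing, which is exactly the bottleneck described above.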


==Serial I/O (N processors)== <!--T:67-->


<!--T:68-->
What you can do instead is have each participating process do its own serial I/O; in other words, all processes write to individual files (a minimal sketch is shown after the figure below). This is somewhat more efficient than the previous model, but only up to a certain limit.


<!--T:69-->
First, when you have a lot of data you end up with many files, one per process. If you run a large calculation with many iterations and many variables, even a single simulation run can generate over a thousand output files. As discussed before, the parallel filesystem performs poorly in this case; recall from the I/O best practices that creating hundreds of thousands of files is strongly discouraged.


<!--T:70-->
Second, the output data often has to be post-processed into a single file, which is an additional and rather inefficient step. Furthermore, when all processes try to access the disk at about the same time, the uncoordinated I/O may swamp the filesystem (file locks!).


  <!--T:71-->
  '''Pros:'''
  no interprocess communication or coordination necessary
  possibly better scaling than single sequential I/O
  possibly better scaling than single sequential I/O  
Line 205: Line 260:
  uncoordinated I/O may swamp the filesystem (file locks!)
  uncoordinated I/O may swamp the filesystem (file locks!)


<!--T:72-->
[[File:serial_IO2.png|400px|center]]
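A minimal sketch of the one-file-per-process approach (again an illustration, not the original code; the out.#### naming is arbitrary):
<pre>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double local[10];
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 10; i++) local[i] = (double)rank;

    snprintf(fname, sizeof(fname), "out.%04d", rank);   /* one file per rank */
    FILE *f = fopen(fname, "wb");
    fwrite(local, sizeof(double), 10, f);
    fclose(f);

    MPI_Finalize();
    return 0;
}
</pre>
With thousands of ranks and many output steps this quickly produces the huge numbers of small files that the best practices above warn against.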


==Parallel I/O (N processes to/from 1 file)== <!--T:73-->


<!--T:74-->
The best approach is to do proper parallel I/O, where all participating processes write simultaneously into a single shared file. The one thing to be aware of is that this should be done in a coordinated (collective) fashion; otherwise it can still swamp the filesystem.


  <!--T:75-->
  '''Pros:'''
  only one file (good for visualization, data management, storage)
  data can be stored canonically
  '''Cons:'''
  requires more design and thought


<!--T:76-->
[[File:parallel_IO.png|400px|center]]


==Parallel I/O should be collective!== <!--T:77-->


<!--T:78-->
Parallel middleware such as MPI-IO offers both coordinated and uncoordinated write operations. When a coordinated (collective) write is called, the middleware knows which processes and which disks are involved and can optimize the operation in the lower software layers for better efficiency.


<!--T:79-->
*'''Independent I/O''' operations specify only what a single process will do
*'''Collective I/O''' is coordinated access to storage by a group of processes
*Allows filesystem to know more about access as a whole, more optimization in lower software layers, better performance


<!--T:80-->
[[File:parallel_coll.png|400px|center]]
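In MPI-IO, for example, the independent call MPI_File_write used in the examples below can be swapped for its collective counterpart MPI_File_write_all, which takes the same arguments, so that the library can coordinate access across all processes in the communicator.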


=Parallel I/O techniques= <!--T:81-->


<!--T:82-->
MPI-IO is part of the MPI-2 standard and is well suited to writing raw binary files. The high-end libraries such as HDF5, NetCDF and ADIOS are all built on top of MPI-IO, so you need MPI-IO in any case. Note that ADIOS is not part of the official software stack on our systems, simply because there has not been much demand for it in our community.


<!--T:83-->
*MPI-IO: parallel I/O part of the MPI-2 standard (1996)
*HDF5 (Hierarchical Data Format), built on top of MPI-IO
**can work with HDF/NetCDF


==MPI-IO== <!--T:84-->


<!--T:85-->
MPI-IO is available on our systems through the default MPI module, OpenMPI. MPI-IO exploits analogies with MPI: if you have some experience with MPI, writing to and reading from a file is very similar to sending and receiving messages. For example, file access is grouped via a communicator, the same construct used to group processes for message passing in MPI, and user-defined MPI datatypes are also available.


<!--T:86-->
*Part of the MPI-2 standard
*ROMIO is the implementation of MPI-IO in OpenMPI (default on our systems), MPICH2
**all functionality through function calls


===Basic MPI-IO operations in C=== <!--T:87-->
<pre>
  int MPI_File_open ( MPI_Comm comm, char* filename, int amode, MPI_Info info, MPI_File* fh)
</pre>
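The other basic routines mentioned below follow the same pattern. For reference, their standard MPI-2 prototypes are approximately as follows (consult the MPI documentation or man pages for the authoritative signatures):
<pre>
  int MPI_File_seek ( MPI_File fh, MPI_Offset offset, int whence )
  int MPI_File_set_view ( MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char* datarep, MPI_Info info )
  int MPI_File_read ( MPI_File fh, void* buf, int count, MPI_Datatype datatype, MPI_Status* status )
  int MPI_File_write ( MPI_File fh, void* buf, int count, MPI_Datatype datatype, MPI_Status* status )
  int MPI_File_close ( MPI_File* fh )
</pre>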


<!--T:88-->
Here is a simple skeleton of MPI-IO operations in C. As in any MPI code, MPI_File_open and MPI_File_close appear at the beginning and at the end. In between there are MPI_File_write and MPI_File_read, as well as MPI_File_seek, which is used to update the individual file pointer; these are discussed in detail shortly.


<!--T:89-->
MPI_File_set_view assigns regions of the file to separate processes. File views are specified by a triplet (displacement, etype, filetype) that is passed to MPI_File_set_view (a sketch using a derived filetype follows the list below):


<!--T:90-->
*displacement = number of bytes to skip from the start of the file
*etype = unit of data access (can be any basic or derived datatype)
*filetype = specifies which portion of the file is visible to the process
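As a hedged sketch of what a derived filetype makes possible (not an example from the original tutorial; the file name strided.out is illustrative and error checking is omitted), the following program gives each process an interleaved, round-robin view of one shared file and writes collectively:
<pre>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs, i, buf[10];
    MPI_File fh;
    MPI_Datatype filetype;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (i = 0; i < 10; i++) buf[i] = rank;

    /* 10 blocks of 1 int each, separated by a stride of nprocs ints */
    MPI_Type_vector(10, 1, nprocs, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "strided.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* displacement skips 'rank' ints; etype = MPI_INT; filetype = strided view */
    MPI_File_set_view(fh, (MPI_Offset)(rank * sizeof(int)), MPI_INT, filetype,
                      "native", MPI_INFO_NULL);

    /* collective write: all processes participate */
    MPI_File_write_all(fh, buf, 10, MPI_INT, &status);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
</pre>
MPI_Type_vector describes 10 blocks of one int separated by a stride of nprocs ints; together with a displacement of rank ints, each process sees only its own slots of the shared file.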



===Basic MPI-IO operations in F90=== <!--T:91-->


<!--T:92-->
<pre>
MPI_FILE_OPEN (integer comm, character[] filename, integer amode, integer info, integer fh, integer ierr)
</pre>


===Opening a file requires a ...=== <!--T:93-->


<!--T:94-->
Opening a file requires a communicator, a file name, and a file handle for all future references to the file. It also requires a file access mode, 'amode'. There are several modes, e.g. MPI_MODE_WRONLY means write only, and they can be combined with the bitwise OR operator "|" in C or by addition "+" in Fortran.


<!--T:95-->
*Communicator
*File name
*File handle for all future reference to the file
*File access mode ‘amode’, made up of combinations of:


<!--T:96-->
<pre>
MPI_MODE_RDONLY                        Read only
MPI_MODE_RDWR                          Read and write
MPI_MODE_WRONLY                        Write only
MPI_MODE_CREATE                        Create the file if it does not exist
MPI_MODE_EXCL                          Error if creating a file that already exists
MPI_MODE_DELETE_ON_CLOSE               Delete the file on close
MPI_MODE_UNIQUE_OPEN                   File will not be concurrently opened elsewhere
MPI_MODE_SEQUENTIAL                    File will only be accessed sequentially
MPI_MODE_APPEND                        Set initial position of all file pointers to end of file
</pre>


<!--T:97-->
*Combine it using bitwise or “|” in C or addition “+” in FORTRAN
*Info argument usually set to ‘MPI_INFO_NULL’




===C example=== <!--T:98-->
<pre>
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "test.dat", MPI_MODE_RDONLY,
              MPI_INFO_NULL, &fh);
</pre>


===F90 example=== <!--T:99-->
<pre>
integer :: fh, ierr
call MPI_FILE_OPEN(MPI_COMM_WORLD, "test.dat", MPI_MODE_RDONLY, MPI_INFO_NULL, fh, ierr)
</pre>


===Read/Write contiguous data=== <!--T:100-->
Let us consider writing one file from four different processes. As shown in the figure, each process writes its data into a designated portion of the same file, and writing proceeds in a contiguous fashion from process 0 to process 3.


<!--T:101-->
[[File:contiguous.png|400px|center]]


===example in C=== <!--T:102-->
We first initialize MPI and a few variables. With MPI_Comm_rank, each process obtains its own rank (process ID). The for loop then fills the 10-element character array a with the character corresponding to the rank; for example, process 3 creates an array of ten '3' characters.




  <!--T:103-->
  MPI_File_open (MPI_COMM_WORLD, "data.out", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);


<!--T:104-->
Here we specify the communicator and the file name 'data.out', and for the access mode we combine 'write only' with 'create the file if it does not exist'. Next we define the offset at which each process starts to write: process 0 starts at the beginning of the file, process 1 comes next, and so on in a contiguous fashion.


  <!--T:105-->
  MPI_Offset displace = rank*n*sizeof(char);


<!--T:106-->
The offset is thus calculated as rank * array size * sizeof(char). We are now ready to assign a writing region to each process with MPI_File_set_view: the displacement is set as above, etype and filetype are both MPI_CHAR, and "native" means the data are stored in the file exactly as they are in memory. Finally, we issue the write with MPI_File_write.


<!--T:107-->
<pre>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {

  int rank, i;
  char a[10];
  MPI_Offset n = 10;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < 10; i++)
    a[i] = (char)('0' + rank);   // e.g. on processor 3 this creates a[0:9]='3333333333'

  MPI_File_open(MPI_COMM_WORLD, "data.out", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_Offset displace = rank*n*sizeof(char);   // start of the view for each processor

  // note that etype and filetype are the same
  MPI_File_set_view(fh, displace, MPI_CHAR, MPI_CHAR, "native", MPI_INFO_NULL);

  MPI_File_write(fh, a, n, MPI_CHAR, &status);

  MPI_File_close(&fh);

  MPI_Finalize();

  return 0;
}
</pre>
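Compiled with an MPI wrapper such as mpicc and run with four processes, this program should produce a 40-byte file data.out containing ten '0' characters followed by ten '1's, ten '2's and ten '3's.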


===Summary: MPI-IO=== <!--T:117-->


<!--T:118-->
As you can see, the implementation is quite straightforward. There is much more advanced material on MPI-IO, but it is beyond the scope of this tutorial. In summary, MPI-IO is part of the standard MPI-2 library and is very widely installed; almost any HPC system with a modern MPI has it. On all of our clusters we install OpenMPI, which supports MPI-IO. MPI-IO does not require any additional libraries, but it writes raw data to the file, so the result is not portable across platforms, it is hard to append new variables, and it does not include a data description.


==NetCDF== <!--T:119-->


<!--T:120-->
'''Net'''work '''C'''ommon '''D'''ata '''F'''ormat




<!--T:121-->
NetCDF is one of the most popular packages for storing data. It covers essentially everything that raw MPI-IO does not: it uses MPI-IO under the hood, but instead of specifying offsets yourself you simply call NetCDF and tell it which arrays you want to store, and it handles the layout, storing the data in a contiguous fashion where possible. Data are stored in binary, the format is self-describing (the metadata lives in the header), it is portable across different architectures, and compression is optional. An advantage over HDF5 is that NetCDF is supported by a wide variety of visualization packages such as ParaView. We have both serial and parallel NetCDF on our systems.


<!--T:122-->
*Format for storing large arrays, uses MPI-IO under the hood
*Libraries for C/C++, Fortran 77/90/95/2003, Python, Java, R, Ruby, etc.
*Parallel NetCDF


===example in C=== <!--T:123-->
With NetCDF you do not compute byte offsets yourself as in the MPI-IO example above. Instead, you create the file, define dimensions and variables, and then write each process's portion of the array by giving a start index and a count along the dimension; the library works out the layout in the file. A minimal sketch using the parallel NetCDF C interface is shown below.


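As a hedged illustration (not the original example), the following minimal sketch repeats the pattern of the MPI-IO program above, with each rank writing its 10 values into one shared NetCDF-4 file. It assumes a NetCDF library built with parallel (MPI) support; the names data.nc, x and a are made up, and error checking is omitted:
<pre>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimid, varid;
    double a[10];
    size_t start[1], count[1] = {10};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int i = 0; i < 10; i++) a[i] = (double)rank;

    /* create the file collectively */
    nc_create_par("data.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD,
                  MPI_INFO_NULL, &ncid);

    /* define one global dimension and one variable, then leave define mode */
    nc_def_dim(ncid, "x", (size_t)(10 * nprocs), &dimid);
    nc_def_var(ncid, "a", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);

    /* collective access; each rank writes its own contiguous slab */
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);
    start[0] = (size_t)rank * 10;
    nc_put_vara_double(ncid, varid, start, count, a);

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}
</pre>
Instead of byte offsets, each rank specifies a start index and a count along the dimension, and the library (through MPI-IO) lays the data out in the file; the result can afterwards be inspected with standard tools such as ncdump.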


==HDF5== <!--T:129-->


<!--T:130-->
'''H'''ierarchical '''D'''ata '''F'''ormat


<!--T:131-->
HDF5 is also a very popular tool for storing data. It supports most of the NetCDF features: it is a self-describing file format for large datasets and it also uses MPI-IO under the hood. HDF5 is more general than NetCDF, with an object-oriented description of datasets, groups, attributes, types, dataspaces and property lists. We have both serial and parallel HDF5 on our systems; a minimal sketch of a collective parallel write is shown after the list below.


<!--T:132-->
*Self-describing file format for large datasets, uses MPI-IO under the hood
*Libraries for C/C++, Fortran 90, Java, Python, R
*We provide both serial and parallel HDF5
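For comparison, here is a hedged minimal sketch (not from the original tutorial) of the same pattern with the HDF5 C API: each rank writes its 10 doubles into one dataset of a shared file with a collective transfer. It assumes an HDF5 build with parallel (MPI) support; the names data.h5 and a are illustrative, and error checking is omitted:
<pre>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t n = 10;                 /* elements per process */
    double a[10];
    for (hsize_t i = 0; i < n; i++) a[i] = (double)rank;

    /* open the file collectively with the MPI-IO driver */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    /* one global 1-D dataset of nprocs*n doubles */
    hsize_t gdims[1] = { n * (hsize_t)nprocs };
    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate2(file, "a", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* each rank selects its contiguous slab of the file */
    hsize_t start[1] = { (hsize_t)rank * n }, count[1] = { n };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* collective write */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, a);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file);
    MPI_Finalize();
    return 0;
}
</pre>
The hyperslab selection plays the role that the file view plays in MPI-IO: it tells the library which part of the global dataset belongs to each rank.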


==ADIOS== <!--T:133-->


<!--T:134-->
'''A'''daptable '''I/O''' '''S'''ystem


<!--T:135-->
*A high-performance library for scientific I/O, also based on MPI-IO; libraries for C/C++ and Fortran
*A data file and a separate external XML file describing the data layout