

=== SPMD Programming ===
Parallel programs written using MPI make use of an execution model called Single Program, Multiple Data, or SPMD. The SPMD model involves running a number of ''copies'' of a single program. In MPI, each copy or "process" is assigned a unique number, referred to as the ''rank'' of the process, and each process can obtain its rank when it runs. When a process should behave differently from the others, we usually use an "if" statement based on its rank to execute the appropriate set of instructions.
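
For example, a process can use its rank to select which branch of the program to execute, as in the following sketch (<code>my_rank</code> is assumed to hold the process's rank, obtained as shown later in this tutorial, and the two functions are hypothetical placeholders):
<source lang="c">
if (my_rank == 0) {
    coordinate_work();   /* hypothetical task performed only by the rank-0 process */
} else {
    do_worker_tasks();   /* hypothetical task performed by every other process     */
}
</source>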


[[Image:SPMD_model.png|frame|center|'''Figure 3''': ''SPMD model illustrating conditional branching to control divergent behaviour'']]


=== Framework ===
Each MPI program must include the relevant header file (<tt>mpi.h</tt> for C/C++, <tt>mpif.h</tt> for Fortran), and be compiled and linked against the desired MPI implementation. Most MPI implementations provide a handy script, often called a ''compiler wrapper'', that handles all set-up issues with respect to <code>include</code> and <code>lib</code> directories, linking flags, ''etc.'' Our examples will all use these compiler wrappers:
* C language wrapper: <tt>mpicc</tt>
* Fortran language wrapper: <tt>mpif90</tt>
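
For example, compiling a C or a Fortran source file with these wrappers looks like the following (the file names here are only illustrative):
  [orc-login2 ~]$ mpicc -Wall hello.c -o hello
  [orc-login2 ~]$ mpif90 hello.f90 -o hello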
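
For reference, the C prototype of the initialization routine, as defined by the MPI standard, is shown below; the corresponding Fortran subroutine is <tt>MPI_INIT(IERR)</tt>, where <tt>IERR</tt> is an integer error code.
<source lang="c">
int MPI_Init(int *argc, char **argv[]);   /* declared in mpi.h */
</source>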


The arguments to the C <code>MPI_Init</code> are pointers to the <code>argc</code> and <code>argv</code> variables that represent the command-line arguments to the program. Like all C MPI functions, the return value represents the error status of the function. Fortran MPI subroutines return the error status in an additional argument, <code>IERR</code>.


Similarly, we must call a function <code>MPI_Finalize</code> to do any clean-up that might be required before our program exits. The prototype for this function appears below:
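
In C (the Fortran subroutine is <tt>MPI_FINALIZE(IERR)</tt>):
<source lang="c">
int MPI_Finalize(void);   /* takes no arguments; returns the error status */
</source>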


As a rule of thumb, it is a good idea to call <code>MPI_Init</code> as the first statement of our program, and <code>MPI_Finalize</code> as its last statement. Let's now modify our "Hello, world!" program accordingly.


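A minimal C version might look like the following sketch; the Fortran version is analogous, calling <tt>MPI_INIT(IERR)</tt> and <tt>MPI_FINALIZE(IERR)</tt>.
<source lang="c">
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);     /* set up MPI before making any other MPI call */

    printf("Hello, world!\n");

    MPI_Finalize();             /* clean up MPI state before the program exits */
    return 0;
}
</source>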


=== Rank and Size ===
We could now run this program under control of MPI, but each process would only output the original string, which isn't very interesting. Let's instead have each process output its rank and how many processes are running in total. This information is obtained at run time using the following functions:


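Per the MPI standard, the C prototypes are as follows (the argument names are chosen to match the discussion below); the Fortran forms are <tt>MPI_COMM_SIZE(COMM, NPROC, IERR)</tt> and <tt>MPI_COMM_RANK(COMM, MYRANK, IERR)</tt>.
<source lang="c">
int MPI_Comm_size(MPI_Comm comm, int *nproc);    /* number of processes in the communicator */
int MPI_Comm_rank(MPI_Comm comm, int *myrank);   /* rank of the calling process             */
</source>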


<tt>MPI_Comm_size</tt> reports the number of processes running as part of this job by assigning it to the result parameter <tt>nproc</tt>. Similarly, <tt>MPI_Comm_rank</tt> reports the rank of the calling process in the result parameter <tt>myrank</tt>. Ranks in MPI start from 0 rather than 1, so given N processes we expect the ranks to be 0..(N-1). The <tt>comm</tt> argument is a ''communicator'', which is a set of processes capable of sending messages to one another. For the purpose of this tutorial we will always pass in the predefined value <tt>MPI_COMM_WORLD</tt>, which is simply all the MPI processes started with the job. It is possible to define and use your own communicators, but that is beyond the scope of this tutorial and the reader is referred to the provided references for additional detail.


Let us incorporate these functions into our program, and have each process output its rank and size information. Note that since all processes are still performing identical operations, there are no conditional blocks required in the code.
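
A sketch of this program in C, saved as <tt>phello1.c</tt> to match the compile commands below (the exact output string is illustrative):
<source lang="c">
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello, world! from process %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
</source>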


Compile and run this program using 2, 4 and 8 processes. Note that each running process produces output based on the values of its local variables. The stdout of all running processes is simply concatenated together. As you run the program using more processes, you may see that the output from the different processes does not appear in order of rank: you should make no assumptions about the order of output from different processes.
  [orc-login2 ~]$ vi phello1.c
  [orc-login2 ~]$ mpicc -Wall phello1.c -o phello1
While we now have a parallel version of our "Hello, World!" program, it isn't very interesting as there is no communication between the processes. Let's fix this by having the processes send messages to one another.


We'll have each process send the string "hello" to the one with the next higher rank number. Rank <tt>i</tt> will send its message to rank <tt>i+1</tt>, and we'll have the last process, rank <tt>N-1</tt>, send its message back to process <tt>0</tt> (a nice communication ring!). A short way to express this is ''process <tt>i</tt> sends to process <tt>(i+1)%N</tt>'', where there are <tt>N</tt> processes and % is the modulus operator.


MPI provides a large number of functions for sending and receiving data of almost any composition in a variety of communication patterns (one-to-one, one-to-many, many-to-one, and many-to-many). But the simplest functions to understand are the ones that send a sequence of one or more instances of an atomic data type from one process to another, <tt>MPI_Send</tt> and <tt>MPI_Recv</tt>.


A process sends data by calling the <tt>MPI_Send</tt> function. Referring to the following function prototypes, <tt>MPI_Send</tt> can be summarized as sending <tt>count</tt> contiguous instances of <tt>datatype</tt> to the process with the specified <tt>rank</tt>, where the data is in the buffer pointed to by <tt>message</tt>. <tt>Tag</tt> is a programmer-specified identifier that becomes associated with the message, and can be used to organize the communication (for example, to distinguish two distinct streams of interleaved data). Our examples do not require this, so we will pass in the value 0 for the <tt>tag</tt>. <tt>Comm</tt> is the communicator described above, and we will continue to use <tt>MPI_COMM_WORLD</tt>.


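The C prototype, with argument names matching the description above, is shown here; the Fortran form is <tt>MPI_SEND(MESSAGE, COUNT, DATATYPE, RANK, TAG, COMM, IERR)</tt>.
<source lang="c">
int MPI_Send(void *message,           /* data to send                           */
             int count,               /* number of items of type 'datatype'     */
             MPI_Datatype datatype,   /* e.g. MPI_CHAR, MPI_INT, ...            */
             int rank,                /* rank of the destination process        */
             int tag,                 /* programmer-defined message identifier  */
             MPI_Comm comm);          /* communicator, e.g. MPI_COMM_WORLD      */
</source>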


Note that the <tt>datatype</tt> argument, specifying the type of data contained in the <tt>message</tt> buffer, is a variable. This is intended to provide a layer of compatibility between processes that could be running on architectures for which the native format for these types differs. It is possible to register new data types, but for this tutorial we will only use the predefined types provided by MPI. In fact, there is a predefined MPI type for each atomic data type in the source language (for C: <tt>MPI_CHAR</tt>, <tt>MPI_FLOAT</tt>, <tt>MPI_SHORT</tt>, <tt>MPI_INT</tt>, etc., and for Fortran: <tt>MPI_CHARACTER</tt>, <tt>MPI_INTEGER</tt>, <tt>MPI_REAL</tt>, etc.). You can find a full list of these types in the references provided below.


<tt>MPI_Recv</tt> works in much the same way as <tt>MPI_Send</tt>. Referring to the function prototypes below, <tt>message</tt> is now a pointer to an allocated buffer of sufficient size to store <tt>count</tt> instances of <tt>datatype</tt>, to be received from process <tt>rank</tt>. <tt>MPI_Recv</tt> takes one additional argument, <tt>status</tt>, which should, in C, be a reference to an allocated <tt>MPI_Status</tt> structure, and, in Fortran, be an array of <tt>MPI_STATUS_SIZE</tt> integers. Upon return it will contain some information about the received message. Although we will not make use of it in this tutorial, the argument must be present.


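The corresponding C prototype is shown below; the Fortran form is <tt>MPI_RECV(MESSAGE, COUNT, DATATYPE, RANK, TAG, COMM, STATUS, IERR)</tt>, where <tt>STATUS</tt> is an integer array of size <tt>MPI_STATUS_SIZE</tt>.
<source lang="c">
int MPI_Recv(void *message,           /* buffer in which to place the received data */
             int count,               /* capacity of the buffer, in items           */
             MPI_Datatype datatype,   /* type of the items expected                 */
             int rank,                /* rank of the sending process                */
             int tag,                 /* message identifier to match                */
             MPI_Comm comm,           /* communicator, e.g. MPI_COMM_WORLD          */
             MPI_Status *status);     /* filled in with details about the message   */
</source>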


With this simple use of <tt>MPI_Send</tt> and <tt>MPI_Recv</tt>, the sending process must know the rank of the receiving process, and the receiving process must know the rank of the sending process. In our example the following arithmetic is useful:
* <tt>(rank + 1) % size</tt> is the process to send to, and
* <tt>(rank + size - 1) % size</tt> is the process to receive from.
We can now make the required modifications to our parallel "Hello, world!" program as shown below.


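A sketch of this version in C, saved as <tt>phello2.c</tt> to match the compile command below (the buffer size and output format are illustrative):
<source lang="c">
#include <stdio.h>
#include <mpi.h>

#define BUFMAX 81

int main(int argc, char *argv[])
{
    char outbuf[BUFMAX], inbuf[BUFMAX];
    int rank, size;
    int sendto, recvfrom;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    snprintf(outbuf, BUFMAX, "Hello, world! from process %d of %d", rank, size);

    sendto   = (rank + 1) % size;          /* right-hand neighbour in the ring */
    recvfrom = (rank + size - 1) % size;   /* left-hand neighbour in the ring  */

    /* every process sends first and receives second:
       this relies on buffering in the MPI library (see the discussion below) */
    MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
    MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);

    printf("[P_%d] process %d said: \"%s\"\n", rank, recvfrom, inbuf);

    MPI_Finalize();
    return 0;
}
</source>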


Compile and run this program using 2, 4 and 8 processes. While it certainly seems to be working as intended, there is a hidden problem here. The MPI standard does not ''guarantee'' that <tt>MPI_Send</tt> returns before the message has been delivered. Most implementations ''buffer'' the data from <tt>MPI_Send</tt> and return without waiting for it to be delivered. But if it were not buffered, the code we've written would deadlock: each process would call <tt>MPI_Send</tt> and then wait for its neighbour process to call <tt>MPI_Recv</tt>. Since the neighbour would also be waiting at the <tt>MPI_Send</tt> stage, they would all wait forever. Clearly there ''is'' buffering in the libraries on our systems, since the code did not deadlock, but it is poor design to rely on this. The code could fail if used on a system in which there is no buffering provided by the library. Even where buffering is provided, the call might still block if the buffer fills up.


  [orc-login2 ~]$ mpicc -Wall phello2.c -o phello2

The MPI standard defines <tt>MPI_Send</tt> and <tt>MPI_Recv</tt> to be '''blocking calls'''. The correct way to interpret this is that <tt>MPI_Send</tt> will not return until it is safe for the calling module to modify the contents of the provided message buffer.  Similarly, <tt>MPI_Recv</tt> will not return until the entire contents of the message are available in the provided message buffer.


It should be obvious that the availability of buffering in the MPI library is irrelevant to receive operations. As soon as the data is received, it will be made available and <tt>MPI_Recv</tt> will return; until then it will be blocked and there is nothing to buffer. <tt>MPI_Send</tt> on the other hand need not block if there is buffering present in the library. Once the message is copied out of the buffer for delivery, it is safe for the user to modify the original data, so the call can return. This is why our parallel "Hello, world!" program doesn't deadlock as we have implemented it, even though all processes call <tt>MPI_Send</tt> first. Since the buffering is not required by the MPI standard and the correctness of our program relies on it, we refer to such a program as '''''unsafe''''' MPI.


A '''''safe''''' MPI program is one that does not rely on a buffered implementation in order to function correctly. The following pseudo-code fragments illustrate this concept:
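
For instance, in the following pseudo-code both processes post their receive first, so neither ever reaches its send:
<source>
if (rank == 0)
{
    MPI_Recv(from rank 1);
    MPI_Send(to rank 1);
}
else if (rank == 1)
{
    MPI_Recv(from rank 0);
    MPI_Send(to rank 0);
}
</source>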


Receives are executed on both processes before the matching send; regardless of buffering, the processes in this MPI application will block on the receive calls and deadlock.


==== Unsafe ====
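In pseudo-code, the unsafe pattern has both processes send first and receive afterwards:
<source>
if (rank == 0)
{
    MPI_Send(to rank 1);
    MPI_Recv(from rank 1);
}
else if (rank == 1)
{
    MPI_Send(to rank 0);
    MPI_Recv(from rank 0);
}
</source>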


This is essentially what our parallel "Hello, world!" program was doing, and it ''may'' work if buffering is provided by the library. If the library is unbuffered, or if messages are simply large enough to fill the buffer, this code will block on the sends, and deadlock. To fix that we will need to do the following instead:


==== Safe ====
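In pseudo-code, a safe ordering pairs each send with a matching receive that does not itself wait on another send:
<source>
if (rank == 0)
{
    MPI_Send(to rank 1);
    MPI_Recv(from rank 1);
}
else if (rank == 1)
{
    MPI_Recv(from rank 0);
    MPI_Send(to rank 0);
}
</source>
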
Even in the absence of buffering, the send here is paired with a corresponding receive between processes. While a process may block for a time until the corresponding call is made, it cannot deadlock.


How do we rewrite our "Hello, World!" program to make it safe? A common solution to this kind of problem is to adopt an odd-even pairing and perform the communication in two steps. Since in our example communication is a rotation of data one rank to the right, we should end up with a safe program if all even ranked processes execute a send followed by a receive, while all odd ranked processes execute a receive followed by a send. The reader can easily verify that the sends and receives are properly paired, avoiding any possibility of deadlock.


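A sketch of this safe version in C, saved as <tt>phello3.c</tt> to match the compile command below (again, the buffer size and output format are illustrative):
<source lang="c">
#include <stdio.h>
#include <mpi.h>

#define BUFMAX 81

int main(int argc, char *argv[])
{
    char outbuf[BUFMAX], inbuf[BUFMAX];
    int rank, size;
    int sendto, recvfrom;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    snprintf(outbuf, BUFMAX, "Hello, world! from process %d of %d", rank, size);

    sendto   = (rank + 1) % size;
    recvfrom = (rank + size - 1) % size;

    if (rank % 2 == 0)
    {
        /* even ranks send first, then receive */
        MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
        MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);
    }
    else
    {
        /* odd ranks receive first, then send */
        MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);
        MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
    }

    printf("[P_%d] process %d said: \"%s\"\n", rank, recvfrom, inbuf);

    MPI_Finalize();
    return 0;
}
</source>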


Is there still a problem here if the number of processes is odd? It might seem so at first, as process 0 (which is even) will be sending while process N-1 (also even) is trying to send to 0. But process 0 is originating a send that is correctly paired with a receive at process 1. Since process 1 (odd) begins with a receive, that transaction is guaranteed to complete. When it does, process 0 will proceed to receive the message from process N-1. There may be a (very small!) delay, but there is no chance of a deadlock.


  [orc-login2 ~]$ mpicc -Wall phello3.c -o phello3


== Comments and Further Reading ==
This tutorial presented some of the key syntax, semantics and design concepts associated with MPI programming. There is still a wealth of material to be considered in designing any serious parallel program, including but not limited to:
* <tt>MPI_Send</tt>/<tt>MPI_Recv</tt> variants (buffered, non-blocking, synchronous, etc.)
* collective communication/computation operations (reduction, broadcast, barrier, scatter, gather, etc.)