MPI: Difference between revisions

18,543 bytes added ,  8 years ago
Finished adding Sharcnet MPI content.
(Started adding MPI content from the Sharcnet regional guide.)
 
(Finished adding Sharcnet MPI content.)
Line 204: Line 204:


Compile and run this program on 2, 4 and 8 processors. Note that each running process produces output based on the values of its local variables, as one should expect. As you run the program on more processors, you will start to see that the order of the output from the different processes is not regular. The stdout of all running processes is simply concatenated together; you should make no assumptions about the order of output from different processes.
Compile and run this program on 2, 4 and 8 processors. Note that each running process produces output based on the values of its local variables, as one should expect. As you run the program on more processors, you will start to see that the order of the output from the different processes is not regular. The stdout of all running processes is simply concatenated together; you should make no assumptions about the order of output from different processes.
[orc-login2 ~]$ vi phello1.c
[orc-login2 ~]$ mpicc -Wall phello1.c -o phello1
[orc-login2 ~]$ mpirun -np 4 ./phello1
Hello, world! from process 0 of 4
Hello, world! from process 1 of 4
Hello, world! from process 2 of 4
Hello, world! from process 3 of 4
=== Communication ===
While we now have a parallel version of our "Hello, World!" program, it isn't a very interesting one as there is no communication between the processes. Let's explore this issue by having the processes say hello to one another, rather than just sending the output to <tt>stdout</tt>.
The specific functionality we seek is to have each process send its hello message to the "next" one in the communicator. We will define this as a rotation operation, so process rank i is to send its message to process rank i+1, with process N-1 wrapping around and sending to process 0; That is to say ''process i sends to process '''(i+1)%N''''', where there are N processes and % is the modulus operator.
First we need to learn how to send and receive data using MPI. While MPI provides means of sending and receiving data of any composition, any consideration of such issues is too advanced for this tutorial and we'll refer again to the provided references for additional information. The most basic way of organizing data to be sent is to send a sequence of one or more instances of an atomic data type. This is supported natively in languages with contiguous allocation of arrays.
Data is sent using the <tt>MPI_Send</tt> function. Referring to the following function prototypes, <tt>MPI_Send</tt> can be summarized as sending '''count''' contiguous instances of '''datatype''' to process with the specified '''rank''', and the data is in the buffer pointed to by '''message'''.  '''tag''' is a programmer-specified identifier that becomes associated with the message, and can be used to organize the communication process (for example, to distinguish two distinct streams of data interleaved piecemeal). None of our examples require we use this functionality, so we will always pass in the value 0 for this parameter. '''comm''' is again the communicator in which the send occurs, for which we will continue to use the pre-defined <tt>MPI_COMM_WORLD</tt>.
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#8AA8E5;" | C API
|-valign="top"
|<source lang="c">
int MPI_Send
(
    void *message,          /* reference to data to be sent */
    int count,              /* number of items in message */
    MPI_Datatype datatype,  /* type of item in message */
    int dest,                /* rank of process to receive message */
    int tag,                /* programmer specified identifier */
    MPI_Comm comm            /* communicator */
);
</source>
|}
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#ECCF98;" | FORTRAN API
|-valign="top"
|<source lang="fortran">
MPI_SEND(MESSAGE, COUNT, DATATYPE, DEST, TAG, COMM, IERR)
<type> MESSAGE(*)
INTEGER :: COUNT, DATATYPE, DEST, TAG, COMM, IERR
</source>
|}
Note that the '''datatype''' argument, specifying the type of data contained in the '''message''' buffer, is a variable. This is intended to provide a layer of compatibility between processes that could be running on architectures for which the native format for these types differs. For more complex situations it is possible to register new data types, however for the purpose of this tutorial we will restrict ourselves to the pre-defined types provided by MPI. There is an MPI type pre-defined for all atomic data types in the source language (for C: <tt>MPI_CHAR</tt>, <tt>MPI_FLOAT</tt>, <tt>MPI_SHORT</tt>, <tt>MPI_INT</tt>, etc. and for Fortran: <tt>MPI_CHARACTER</tt>, <tt>MPI_INTEGER</tt>, <tt>MPI_REAL</tt>, etc.). Please refer to the provided references for a full list of these types if interested.
We can summarize the <tt>MPI_Recv</tt> call relatively quickly now as the function works in much the same way. Referring to the function prototypes below, '''message''' is now a pointer to an allocated buffer of sufficient size to receive '''count''' contiguous instances of '''datatype''', which is received from process '''rank'''. <tt>MPI_Recv</tt> takes one additional argument, '''status''', which is a reference to an allocated <tt>MPI_Status</tt> structure in C and an array of <tt>MPI_STATUS_SIZE</tt> integers in Fortran. It will be filled in with information related to the received message upon return. We will not make use of this argument in this tutorial, however it must be present.
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#8AA8E5;" | C API
|-valign="top"
|<source lang="c">
int MPI_Recv
(
    void *message,          /* reference to buffer for received data */
    int count,              /* number of items to be received */
    MPI_Datatype datatype,  /* type of item to be received */
    int source,              /* rank of process from which to receive */
    int tag,                /* programmer specified identifier */
    MPI_Comm comm            /* communicator */
    MPI_Status *status      /* stores info. about received message */
);
</source>
|}
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#ECCF98;" | FORTRAN API
|-valign="top"
|<source lang="fortran">
MPI_RECV(MESSAGE, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERR)
<type> :: MESSAGE(*)
INTEGER :: COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERR
</source>
|}
Keep in mind that both sends and receives are explicit calls. The sending process must know the rank of the process to which it is sending, and the receiving process must know the rank of the process from which to receive. in this case, since all our processes are performing the same action, we need only derive the appropriate arithmetic so that given its rank, the process knows both where to send its data, [(rank + 1) % size], and from which to receive [(rank + size) - 1) % size]. We'll now make the required modifications to our parallel "Hello, world!" program. As a first cut, we'll simply have each process first send its message, then receive what is being sent to it.
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#8AA8E5;" | ''C CODE'': <tt>phello2.c</tt>
|-valign="top"
|<source lang="c">
#include <stdio.h>
#include <mpi.h>
#define BUFMAX 81
int main(int argc, char *argv[])
{
    char outbuf[BUFMAX], inbuf[BUFMAX];
    int rank, size;
    int sendto, recvfrom;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sprintf(outbuf, "Hello, world! from process %d of %d", rank, size);
    sendto = (rank + 1) % size;
    recvfrom = ((rank + size) - 1) % size;
    MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
    MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);
    printf("[P_%d] process %d said: \"%s\"]\n", rank, recvfrom, inbuf);
    MPI_Finalize();
    return(0);
}
</source>
|}
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#ECCF98;" | ''FORTRAN CODE'': <tt>phello2.f</tt>
|-valign="top"
|<source lang="fortran">
program phello2
    implicit none
    include 'mpif.h'
    integer, parameter :: BUFMAX=81
    character(len=BUFMAX) :: outbuf, inbuf, tmp
    integer :: rank, num_procs, ierr
    integer :: sendto, recvfrom
    integer :: status(MPI_STATUS_SIZE)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)
    outbuf = 'Hello, world! from process '
    write(tmp,'(i2)') rank
    outbuf = outbuf(1:len_trim(outbuf)) // tmp(1:len_trim(tmp))
    write(tmp,'(i2)') num_procs
    outbuf = outbuf(1:len_trim(outbuf)) // ' of ' // tmp(1:len_trim(tmp))
    sendto = mod((rank + 1), num_procs)
    recvfrom = mod(((rank + num_procs) - 1), num_procs)
    call MPI_SEND(outbuf, BUFMAX, MPI_CHARACTER, sendto, 0, MPI_COMM_WORLD, ierr)
    call MPI_RECV(inbuf, BUFMAX, MPI_CHARACTER, recvfrom, 0, MPI_COMM_WORLD, status, ierr)
    print *, 'Process', rank, ': Process', recvfrom, ' said:', inbuf
    call MPI_FINALIZE(ierr)
end program phello2
</source>
|}
Compile and run this program on 2, 4 and 8 processors. While it certainly seems to be working as intended, there is something we are overlooking. The MPI standard says nothing about whether sends are buffered---that is to say, we should not assume that <tt>MPI_Send</tt> returns immediately buffering our message. If the send isn't buffered, the code as written would deadlock with all processes performing its send and blocking as there are no receives to consume the message until after the send returns. Clearly there is buffering in the libraries on our systems as the code did not deadlock, however it is poor design to rely on this. You may find your code fails if used on a system in which there is no buffering provided by the library, and even where buffering is provided, the call will still block if the buffer fills up.
[orc-login2 ~]$ mpicc -Wall phello2.c -o phello2
[orc-login2 ~]$ mpirun -np 4 ./phello2
[P_0] process 3 said: "Hello, world! from process 3 of 4"]
[P_1] process 0 said: "Hello, world! from process 0 of 4"]
[P_2] process 1 said: "Hello, world! from process 1 of 4"]
[P_3] process 2 said: "Hello, world! from process 2 of 4"]
=== Safe MPI ===
The MPI standard defines <tt>MPI_Send</tt> and <tt>MPI_Recv</tt> to be '''blocking calls'''. The correct way to interpret this is that <tt>MPI_Send</tt> will not return until it is safe for the calling module to modify the contents of the provided message buffer.  Similarly, <tt>MPI_Recv</tt> will not return until the entire contents of the message are available in the provided message buffer.
It should be obvious that the availability of buffering in the MPI library is irrelevant to receive operations. It is meaningless to speak of buffering something that hasn't yet been received. As such we assume that <tt>MPI_Recv</tt> will always block until the message is fully received. <tt>MPI_Send</tt> on the other hand need not block if there is buffering present in the library. Once the message is copied out of the buffer for delivery, it is safe for the user to modify the buffer, so the call can return. This is why our parallel "Hello, world!" program doesn't deadlock as we have implemented it, even though all processes call <tt>MPI_Send</tt> first. This relies on there being buffering in the MPI library on our systems, which is not required by the MPI standard, and thus we refer to such a program as '''''unsafe''''' MPI.
A '''''safe''''' MPI program is one that does not rely on a buffered underlying implementation in order to function correctly. The following pseudo-code fragments illustrate this concept clearly:
==== Deadlock ====
<source lang="c">
...
    if (rank == 0)
    {
        MPI_Recv(from 1);
        MPI_Send(to 1);
    }
    else if (rank == 1)
    {
        MPI_Recv(from 0);
        MPI_Send(to 0);
    }
...
</source>
Receives are executed on both processes before the matching send; regardless of buffering the processes in this MPI application will block on the receive calls and deadlock.
==== Unsafe ====
<source lang="c">
...
    if (rank == 0)
    {
        MPI_Send(to 1);
        MPI_Recv(from 1);
    }
    else if (rank == 1)
    {
        MPI_Send(to 0);
        MPI_Recv(from 0);
    }
...
</source>
This is essentially what our parallel "Hello, world!" program was doing, and in general this ''may'' work if buffering is provided by the library. If the library is unbuffered, or even if messages are simply sufficiently large to fill the buffer, this code '''will''' block on the sends, and deadlock.
==== Safe ====
<source lang="c">
...
    if (rank == 0)
    {
        MPI_Send(to 1);
        MPI_Recv(from 1);
    }
    else if (rank == 1)
    {
        MPI_Recv(from 0);
        MPI_Send(to 0);
    }
...
</source>
Even in the absence of buffering, the send here is paired with a corresponding receive between processes. While a process may block for a short time until the corresponding call is made, it is not possible for deadlock to occur in this code regardless of buffering.
The last thing we will consider then is how to recast our "Hello, World!" program so that it is ''safe''. A common solution to this kind of data-exchange issue, where all processes are exchanging data with a subset of other processes, is to adopt an odd-even pairing and perform the communication in two steps. In this case communication is a rotation of data one rank to the right, so we end up with a safe program if all even ranked processes execute a send followed by a receive, while all odd ranked processes execute a receive followed by a send. The reader can verify that the sends and receives are properly paired avoiding any possibility of deadlock.
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#8AA8E5;" | ''C CODE'': <tt>phello3.c</tt>
|-valign="top"
|<source lang="c">
#include <stdio.h>
#include <mpi.h>
#define BUFMAX 81
int main(int argc, char *argv[])
{
    char outbuf[BUFMAX], inbuf[BUFMAX];
    int rank, size;
    int sendto, recvfrom;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sprintf(outbuf, "Hello, world! from process %d of %d", rank, size);
    sendto = (rank + 1) % size;
    recvfrom = ((rank + size) - 1) % size;
    if (!(rank % 2))
    {
        MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
        MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);
    }
    else
    {
        MPI_Recv(inbuf, BUFMAX, MPI_CHAR, recvfrom, 0, MPI_COMM_WORLD, &status);
        MPI_Send(outbuf, BUFMAX, MPI_CHAR, sendto, 0, MPI_COMM_WORLD);
    }
    printf("[P_%d] process %d said: \"%s\"]\n", rank, recvfrom, inbuf);
    MPI_Finalize();
    return(0);
}
</source>
|}
{| border="0" cellpadding="5" cellspacing="0" align="center"
! style="background:#ECCF98;" | ''FORTRAN CODE'': <tt>phello3.f</tt>
|-valign="top"
|<source lang="fortran">
program phello3
    implicit none
    include 'mpif.h'
    integer, parameter :: BUFMAX=81
    character(len=BUFMAX) :: outbuf, inbuf, tmp
    integer :: rank, num_procs, ierr
    integer :: sendto, recvfrom
    integer :: status(MPI_STATUS_SIZE)
    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)
    outbuf = 'Hello, world! from process '
    write(tmp,'(i2)') rank
    outbuf = outbuf(1:len_trim(outbuf)) // tmp(1:len_trim(tmp))
    write(tmp,'(i2)') num_procs
    outbuf = outbuf(1:len_trim(outbuf)) // ' of ' // tmp(1:len_trim(tmp))
    sendto = mod((rank + 1), num_procs)
    recvfrom = mod(((rank + num_procs) - 1), num_procs)
    if (MOD(rank,2) == 0) then
        call MPI_SEND(outbuf, BUFMAX, MPI_CHARACTER, sendto, 0, MPI_COMM_WORLD, ierr)
        call MPI_RECV(inbuf, BUFMAX, MPI_CHARACTER, recvfrom, 0, MPI_COMM_WORLD, status, ierr)
    else
        call MPI_RECV(inbuf, BUFMAX, MPI_CHARACTER, recvfrom, 0, MPI_COMM_WORLD, status, ierr)
        call MPI_SEND(outbuf, BUFMAX, MPI_CHARACTER, sendto, 0, MPI_COMM_WORLD, ierr)
    endif
    print *, 'Process', rank, ': Process', recvfrom, ' said:', inbuf
    call MPI_FINALIZE(ierr)
end program phello3
</source>
|}
In the interest of completeness, it should be noted that in this implementation, there is the possibility of a blocked <tt>MPI_Send</tt> where the number of processors is odd---one processor will be sending while another is trying to send to it. Verify for yourself that there can only be one processor affected in this way. This is still safe however since the receiving process was involved in a send that was correctly paired with a receive, so the blocked send will complete as soon as the <tt>MPI_Recv</tt> call is executed on the receiving process.
[orc-login2 ~]$ vi phello3.c
[orc-login2 ~]$ mpicc -Wall phello3.c -o phello3
[orc-login2 ~]$ mpirun -np 16 ./phello3
[P_1] process 0 said: "Hello, world! from process 0 of 16"]
[P_2] process 1 said: "Hello, world! from process 1 of 16"]
[P_5] process 4 said: "Hello, world! from process 4 of 16"]
[P_3] process 2 said: "Hello, world! from process 2 of 16"]
[P_9] process 8 said: "Hello, world! from process 8 of 16"]
[P_0] process 15 said: "Hello, world! from process 15 of 16"]
[P_12] process 11 said: "Hello, world! from process 11 of 16"]
[P_6] process 5 said: "Hello, world! from process 5 of 16"]
[P_13] process 12 said: "Hello, world! from process 12 of 16"]
[P_8] process 7 said: "Hello, world! from process 7 of 16"]
[P_7] process 6 said: "Hello, world! from process 6 of 16"]
[P_14] process 13 said: "Hello, world! from process 13 of 16"]
[P_10] process 9 said: "Hello, world! from process 9 of 16"]
[P_4] process 3 said: "Hello, world! from process 3 of 16"]
[P_15] process 14 said: "Hello, world! from process 14 of 16"]
[P_11] process 10 said: "Hello, world! from process 10 of 16"]
== Comments and Further Reading ==
This tutorial has presented a brief overview of some of the key syntax, semantics and design concepts associated with MPI programming. There is still a wealth of material to be considered in architecting any serious parallel architecture, including but not limited to:
* <tt>MPI_Send</tt>/<tt>MPI_Recv</tt> variants (buffered, non-blocking, synchronous, etc.)
* collective communication/computation operations (reduction, broadcast, barrier, scatter, gather, etc.)
* defining derived data types
* communicators and topologies
* one-sided communication (and MPI-2 in general)
* efficiency issues
* parallel debugging
The following are recommended books and online references for those interested in more detail on the concepts we've discussed in this tutorial, and to continue learning about the more advanced features available to you through the Message Passing Interface.
* '''William Gropp, Ewing Lusk and Anthony Skjellum. ''Using MPI: Portable Parallel Programming with the Message-Passing Interface (2e)''. MIT Press, 1999.'''
** Comprehensive reference covering Fortran, C and C++ bindings
* '''Peter S. Pacheco. ''Parallel Programming with MPI''. Morgan Kaufmann, 1997.'''
** Easy to follow tutorial-style approach in C
Bureaucrats, cc_docs_admin, cc_staff
2,306

edits