MPI - Blocking point-to-point communication

The simplest method to communicate with MPI is point-to-point communication between two specific processes, a sender and a receiver.

Objectives

  • Blocking point-to-point communication

  • Different send modes in MPI: MPI_Send, MPI_Ssend, MPI_Bsend, MPI_Rsend

  • Explore point-to-point communication with two MPI processes playing ping pong

Instructor note

  • 45 min teaching

  • 75 min exercises

The simplest method to communicate with MPI is point-to-point communication between two specific processes, a sender and a receiver. Both processes actively participate in this form of communication where the sender must execute some send function while the receiver executes some receive function. Furthermore, both processes must be in the same communicator and need information about the communication partner (source or destination rank) as well as a tag that helps to identify the message. MPI is equipped with two flavors of point-to-point communication: blocking and nonblocking.

Blocking point-to-point communication

With blocking communication, a call may or may not have to wait until the communication partner is ready to engage in the communication. A blocking send or receive suspends the calling process until the send buffer can safely be reused or changed, or until the receive buffer has actually been filled. A blocking send returns only once the data to be sent has been copied out of the send buffer; this does not mean that the data has been received by the destination process. Completion of a blocking receive implies that the data transfer has happened and the data has been copied into the receive buffer, so it is safe to use.

Communication Modes

For blocking point-to-point communication, the MPI standard defines four modes of communication with subtle differences in their semantics:

SENDING

Mode          Function     Note
Standard      MPI_Send     recommended for production runs
Synchronous   MPI_Ssend    recommended for debugging
Buffered      MPI_Bsend    nonblocking communication is recommended instead
Ready         MPI_Rsend    dangerous, for experts only

RECEIVING

Mode          Function     Note
Standard      MPI_Recv     only one mode needed (fits all sends)

Standard Mode

In standard mode, the MPI library uses either a synchronous or an asynchronous protocol and decides which one to use depending on the message size; the asynchronous protocol is handled transparently. When the synchronous protocol is used, there is a risk of deadlocks and serialization. Standard mode is recommended for production runs.

Synchronous Send

Synchronous send is the most stringent communication mode: the send completes only once the receiving process has started a matching receive, which is similar to accepting a handshake. This means that the receiving process has to declare its readiness to receive the message. Ideally, every MPI program still works correctly when standard send is replaced with synchronous send. Used incorrectly, however, synchronous send can lead to deadlocks and serialization. The main use case for this mode is debugging.

Buffered Send

Buffered send copies the data from the send buffer to a buffer that has to be managed by the programmer and then returns. Once a matching receive has been posted, the data is transmitted over the network from the user-managed buffer. Naturally, this requires an additional buffer and an extra copy between the buffers. However, this communication mode is local: its completion does not depend on the occurrence of a matching receive. It also requires the programmer to attach and detach the user-managed buffer, where the detach call blocks until all data in the buffer has been transmitted. We are not going to show this here, as nonblocking communication can accomplish the same goal in a more elegant way.

Ready Send

Ready send works only under the assumption that the matching receive has already been posted, so the send call can complete immediately. If this is not the case, the behavior is undefined and the program might give wrong results. This communication mode has the potential to be the fastest, but it should be handled with the utmost care and used only when the control flow of the parallel program guarantees that the receive is posted first. It is a rather advanced mode of communication.

Hands-on labs

Explore point-to-point communication with two MPI processes playing ping pong:

Step by step (according to the pictures below):

  1. ping - rank 0 sends a message (ping) to rank 1 and rank 1 receives it

  2. pingpong - after receiving the ping, rank 1 sends a message (pong) back to rank 0 and rank 0 receives it

  3. timing - repeat the ping pong in a loop and add timing calls before and after the loop

  4. warmup - don’t forget to warm up: do one ping pong before starting the timed loop

  5. finish - who wins the race?

02_pingpong

 

  • Let’s write a ping pong benchmark (to measure the latency) step by step

  • Only two MPI processes (rank 0 and rank 1) will be needed (mpirun -np 2 …)

  • For the ping pong exercise, we’ll adopt the MPMD (multiple program multiple data) approach

  • The ping pong ball will be 1 float (but the value is not of interest)

  • Be careful: in the end, the two MPI processes should play ping pong with only one ball
    (not ping-ping pong-pong with two balls)

Note

MPI_Send(&buf, count, datatype, dest, tag, comm)

  • blocking send procedure (other send modes have the same syntax)

    • source rank sends the message defined by (buf, count, datatype) to the dest(ination) rank
       

  • IN       buf             initial address of send buffer (choice)

  • IN       count         number of elements in send buffer (non-negative integer)

  • IN       datatype   datatype of each send buffer element (handle)

  • IN       dest           rank of destination (integer)

  • IN       tag             message tag (integer)

  • IN       comm        communicator (handle)
     

  • C binding
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

    • Usage:   MPI_Send(&buffer, 1, MPI_FLOAT, 1, 17, MPI_COMM_WORLD);

  • Note:

    • MPI_Send (standard send) is recommended for production runs (best speed)
      –> let the MPI library decide how to best transfer the message (same risks as MPI_Ssend)

    • MPI_Ssend (synchronous send) is recommended for debugging (helps to detect deadlocks)
      –> completes only when the receive has started –> risk of deadlocks and serializations

    • MPI_Bsend (buffered send) –> not recommended since unnecessarily complicated
      –> it’s recommended to use MPI_Send or nonblocking communication instead

    • MPI_Rsend (ready send) –> not recommended because it’s highly dangerous to get it wrong
      –> it may be started only after the matching receive is already posted (needs additional guarantees)

Note

MPI_Recv(&buf, count, datatype, source, tag, comm, &status)

  • blocking receive procedure

    • dest(ination) rank receives a message from the source rank and stores it at (buf, count, datatype)
       

  • OUT   buf             initial address of receive buffer (choice)

  • IN       count         number of elements in receive buffer (non-negative integer)

  • IN       datatype   datatype of each receive buffer element (handle)

  • IN       source       rank of source or MPI_ANY_SOURCE (integer)

  • IN       tag             message tag or MPI_ANY_TAG (integer)

  • IN       comm        communicator (handle)

  • OUT   status        status object (status)
     

  • C binding
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

    • Usage:   MPI_Recv(&buffer, 1, MPI_FLOAT, 0, 17, MPI_COMM_WORLD, &status);
       

MPI 5.0 Table 3.2 - Predefined MPI datatypes corresponding to C datatypes

  • Note:

    • MPI_Recv completes when the message has arrived
      –> only one receive mode is needed that works together with all 4 send modes

 

1. ping

Exercise

The very first ping:

Modify the code below such that:

  • rank 0 sends a message (ping) to rank 1

  • rank 1 receives the message (ping) from rank 0

  • the message (ping pong ball) shall be 1 float and please use tag=17 for the ping

What happens if you do NOT modify the code below? Try it out!

You can compile and execute part 1 without modifying the code below.
Give it a try before you actually modify.
What happens here? Why is this possible at all?
Of course, before you can proceed to the next step (2), you have to modify.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int i, rank;
  float buffer[1];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      printf("I am %i before send ping \n", rank);

      printf("I am %i after  recv ping \n", rank);

  MPI_Finalize();
}

Compile:

mpicc ping.c -o ping

Run:

mpirun -np 2 ./ping

Expected output:

I am 0 before send ping 
I am 1 after  recv ping 

Unexpected output - but still correct - do you remember why this might happen?

I am 1 after  recv ping 
I am 0 before send ping 

Tip

Seeing more than 2 output lines?
If you are seeing more than 2 output lines, please modify / correct the code above.
If you have not yet modified it you will see 4 (2 x number of MPI processes) lines of output, i.e., each MPI process runs the whole code which has 2 print statements.

Solution (please try to solve the exercise by yourself before looking at the solution)

 

2. pingpong

Exercise

Sending back the pong:

Modify the code below such that:

  • after receiving the ping, rank 1 sends a message (pong) back to rank 0

  • rank 0 receives the message (pong) from rank 1

  • the message (ping pong ball) shall be 1 float and please use tag=23 for the pong

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int i, rank;
  float buffer[1];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
      printf("I am %i before send ping \n", rank);
      MPI_Send(buffer, 1, MPI_FLOAT, 1, 17, MPI_COMM_WORLD);
      printf("I WILL BE / am %i after  recv pong \n", rank);
    }
    else if (rank == 1)
    {
      MPI_Recv(buffer, 1, MPI_FLOAT, 0, 17, MPI_COMM_WORLD, &status);
      printf("I am %i after  recv ping \n", rank);
      printf("I WILL BE / am %i before send pong \n", rank);
    }

  MPI_Finalize();
}

Compile:

mpicc pingpong.c -o pingpong

Run:

mpirun -np 2 ./pingpong

Expected output:

I am 0 before send ping 
I am 1 after  recv ping 
I am 1 before send pong 
I am 0 after  recv pong

Unexpected output - but still correct - do you remember why this might happen?

I am 0 before send ping
I am 0 after  recv pong
I am 1 after  recv ping 
I am 1 before send pong 

Solution (please try to solve the exercise by yourself before looking at the solution)

 

3. timing

Note

MPI_Wtime()

  • timing

    • returns a floating-point number of seconds, representing elapsed wallclock time since some time in the past
       

  • C binding
    double MPI_Wtime(void)

    • Usage:   time = MPI_Wtime();

Exercise

Repeat this in a loop and add timing calls:

Modify the code below:

  • repeat this ping pong with a loop of length 50

  • add timing calls before and after the loop

  • only rank 0 shall print out the transfer time of one message in micro seconds, i.e., delta_time / (2*50) * 1e6

Uncomment the 3 commented-out lines (marked with // or #, depending on the language) and add all other pieces needed in the code.

#include <stdio.h>
#include <mpi.h>

#define number_of_messages 50

int main(int argc, char *argv[])
{
  int i, rank;
  float buffer[1];
  // ??? start, finish, msg_transfer_time;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
      MPI_Send(buffer, 1, MPI_FLOAT, 1, 17, MPI_COMM_WORLD);
      MPI_Recv(buffer, 1, MPI_FLOAT, 1, 23, MPI_COMM_WORLD, &status);
    }
    else if (rank == 1)
    {
      MPI_Recv(buffer, 1, MPI_FLOAT, 0, 17, MPI_COMM_WORLD, &status);
      MPI_Send(buffer, 1, MPI_FLOAT, 0, 23, MPI_COMM_WORLD);
    }

  if (rank == 0)
  {
    // msg_transfer_time = ((finish - start) / (2 * number_of_messages)) * 1e6 ; // in microsec
    // printf("Time for one message: %f micro seconds.\n", msg_transfer_time);
  }

  MPI_Finalize();
}

Compile:

mpicc pingpong-bench.c -o pingpong-bench

Run:

mpirun -np 2 ./pingpong-bench

Expected output - What did you measure? Run it a couple of times to see run to run variations!

Time for one message: 0.440590 micro seconds.

Solution (please try to solve the exercise by yourself before looking at the solution)

4. warmup

Exercise

Don’t forget to warm up and do one ping pong before starting the timed loop:
Modify the code below accordingly:

#include <stdio.h>
#include <mpi.h>

#define number_of_messages 50

int main(int argc, char *argv[])
{
  int i, rank;
  float buffer[1];
  double start, finish, msg_transfer_time;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  start = MPI_Wtime();
  for (i = 1; i <= number_of_messages; i++)
  {
    if (rank == 0)
    {
      MPI_Send(buffer, 1, MPI_FLOAT, 1, 17, MPI_COMM_WORLD);
      MPI_Recv(buffer, 1, MPI_FLOAT, 1, 23, MPI_COMM_WORLD, &status);
    }
    else if (rank == 1)
    {
      MPI_Recv(buffer, 1, MPI_FLOAT, 0, 17, MPI_COMM_WORLD, &status);
      MPI_Send(buffer, 1, MPI_FLOAT, 0, 23, MPI_COMM_WORLD);
    }
  }
  finish = MPI_Wtime();

  if (rank == 0)
  {
    msg_transfer_time = ((finish - start) / (2 * number_of_messages)) * 1e6 ; // in microsec
    printf("Time for one message: %f micro seconds.\n", msg_transfer_time);
  }

  MPI_Finalize();
}

Compile:

mpicc pingpong-bench1.c -o pingpong-bench1

Run:

mpirun -np 2 ./pingpong-bench1

Expected output - What did you measure? Run it a couple of times to see run to run variations!

Time for one message: 0.134900 micro seconds.

Solution (please try to solve the exercise by yourself before looking at the solution)

5. finish - who wins the race?

Please do a couple of time measurements - run a couple of times each and note down your fastest result for:

  • MPI_Send - including the first ping pong in the time measurement (result of 3. timing)

  • MPI_Send - excluding the first ping pong from the time measurements (result of 4. warmup)
     

  • MPI_Ssend - including the first ping pong in the time measurement (you’ll have to edit/copy from above)

  • MPI_Ssend - excluding the first ping pong from the time measurements (you’ll have to edit/copy from above)

You can do these measurements on different systems and in different environments, e.g.:

  • VSC JupyterHub using VSC-5 or VSC-4

  • Submitting jobs to VSC-5 or VSC-4 and playing around with pinning (see previous 01_hello.ipynb)

    • put both processes on the same NUMA domain

    • put the two processes on different NUMA domains but still on the same CPU/socket

    • put the two processes on different CPUs/sockets on the same node

    • put them on different nodes and both on CPU/socket 0

    • put them on different nodes and both on CPU/socket 1

    • put them on different nodes and one on CPU/socket 0 and the other on CPU/socket 1

  • When submitting jobs, you can also switch to another MPI library (e.g. Intel-MPI) and do the same.

  • Run the ping pong benchmark on your own laptop and/or on another HPC system you have access to.

Record your results below. We would like to see who wins the race!
(Copy the cell below to record all your measurements on different systems and in different environments.)

First name:     ________
Measurement on: ________
Programming language: ________
time for 1 ping in micro seconds with     MPI_Send     MPI_Ssend
including first ping pong  in  timing     ________     ________
excluding first ping pong from timing     ________     ________

Keypoints

  • Blocking point-to-point communication

  • Different send modes in MPI: MPI_Send, MPI_Ssend, MPI_Bsend, MPI_Rsend

  • Explore point-to-point communication with two MPI processes playing ping pong

See also

  • MPI 5.0 Section 9.6 - Timers and Synchronization

  • pages 467-468: MPI_Wtime, MPI_Wtick