
A (very) Brief Introduction to MPI: Fortran example

We'll start with the single processor code:


      program monte_carlo
      USE IFPORT
      implicit none

      integer*8 npts
      parameter (npts = 1e10)

      integer*8 i

      real*8 f,sum
      real*8 xmin,xmax,x

      sum = 0.0d0
      xmin = 0.0d0
      xmax = 1.0d0

      do i=1,npts
         x = (xmax-xmin)*rand(0) + xmin
         sum = sum + 4.0d0/(1.0d0 + x**2)
      enddo
      f = sum/npts
      
      write(*,*)'PI calculated with ',npts,' points = ',f

      stop
      end
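
For reference, the loop is a simple Monte Carlo estimate of an integral whose exact value is pi:

    \pi = \int_0^1 \frac{4}{1+x^2} \, dx
        \approx \frac{1}{N} \sum_{i=1}^{N} \frac{4}{1+x_i^2}

where the points x_i are drawn uniformly from [0,1] and N is npts, so f approaches pi as the number of points grows.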

We can compile this code with the command:
[mtobias@login001 Fortran]$ ifort monte_carlo.f

and run the resulting executable. To get consistent timings with the MPI runs, I'm going to run everything on a compute node by requesting an interactive job:
[mtobias@login001 Fortran]$ qsub -I -lnodes=1:ppn=8,walltime=8:00:00
[mtobias@node164 Fortran]$ time ./a.out
PI calculated with 10000000000 points = 3.14159466262554

real 1m41.133s
user 1m41.114s
sys 0m0.001s

The simplest way to parallelize this program is to divide up the sampled points among all the processors. After each processor has computed its partial sum, we'll add the results together. It is important to remember that each processor will run the same piece of code.
I'll now present an MPI version of the code in its entirety, and then go through some of the differences between it and the single-processor code:


      program monte_carlo_mpi
      USE IFPORT
      implicit none

      integer*8 npts,mynpts
      parameter (npts = 1e10)

      include "mpif.h"

      integer ierr, myid, nprocs

      integer*8 i

      real*8 f,sum,mysum
      real*8 xmin,xmax,x

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      if (myid .eq. 0) then
         mynpts = npts - (nprocs-1)*(npts/nprocs)
      else
         mynpts = npts/nprocs
      endif

      mysum = 0.0d0
      xmin = 0.0d0
      xmax = 1.0d0

      x = (xmax-xmin)*rand(myid+1) + xmin
      do i=1,mynpts
         x = (xmax-xmin)*rand(0) + xmin
         mysum = mysum + 4.0d0/(1.0d0 + x**2)
      enddo


      call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
     &     0,MPI_COMM_WORLD,ierr)
      
      if (myid .eq. 0) then
         f = sum/npts
         write(*,*)'PI calculated with ',npts,' points = ',f
      endif

      call MPI_FINALIZE(ierr)

      stop
      end


The first thing we need to do is include the MPI Fortran header file:

      include "mpif.h"
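
Most modern MPI implementations, including OpenMPI, also provide the same interface as a Fortran module (use mpi), but we'll stick with the include file here.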

We define some integers:

      integer ierr, myid, nprocs

where ierr is an error flag returned by the MPI routines, myid is a unique identifier (the rank) telling each process which member of the set it is, and nprocs is the total number of processes.

 
      integer*8 npts,mynpts

We'll use mynpts to keep track of the number of points to be sampled by a particular process.

 
      real*8 f,sum,mysum

mysum is each process's contribution to the integral.

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

These three MPI calls are found in almost all MPI programs. The first routine initializes MPI, the second allows each process to determine its unique ID (its rank, an integer from 0 to nprocs-1), and the third allows each process to know the total number of processes being used in the run (determined at run time).

      if (myid .eq. 0) then
         mynpts = npts - (nprocs-1)*(npts/nprocs)
      else
         mynpts = npts/nprocs
      endif

Here is where we divvy up the work (also called data decomposition). Since the total number of points may not be evenly divisible by the number of processors, we'll give the extra work to processor 0.
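
For example, with npts = 10000000000 and nprocs = 3, the integer division npts/nprocs gives 3333333333, so ranks 1 and 2 each sample 3333333333 points while rank 0 samples 10000000000 - 2*3333333333 = 3333333334 points.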

      x = (xmax-xmin)*rand(myid+1) + xmin

This is a crucial step: we need to seed the random number generator on each process with a unique value. We also need to be careful, since a seed of 0 is equivalent to a seed of 1, which is why the argument is myid+1 rather than myid.


      do i=1,mynpts
         x = (xmax-xmin)*rand(0) + xmin
         mysum = mysum + 4.0d0/(1.0d0 + x**2)
      enddo

Whereas before we did the sum over all points, now each process does a partial sum over its own points.
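
As an aside, if you would rather not depend on the vendor-specific rand() from IFPORT, the same per-rank seeding and partial sum can be sketched with the standard random_seed and random_number intrinsics. This is only a sketch, and the base value 12345 is an arbitrary choice:

      integer nseed
      integer, allocatable :: seed(:)

c     give every rank its own seed (12345 is an arbitrary base value)
      call random_seed(size=nseed)
      allocate(seed(nseed))
      seed = 12345 + myid
      call random_seed(put=seed)

      do i=1,mynpts
         call random_number(x)            ! x is uniform on [0,1)
         x = (xmax-xmin)*x + xmin
         mysum = mysum + 4.0d0/(1.0d0 + x**2)
      enddo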

      call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
     &     0,MPI_COMM_WORLD,ierr)

This MPI routine takes every process's value of mysum (one MPI_DOUBLE_PRECISION value each), combines them with the MPI_SUM operation, and places the result in sum on the root process, which we have specified here to be process 0.

      if (myid .eq. 0) then
         f = sum/npts
         write(*,*)'PI calculated with ',npts,' points = ',f
      endif

The final calculation is done by process 0, since it is the only process that knows the value of sum and we want only one copy of the result printed out.

      call MPI_FINALIZE(ierr)

Lastly, we allow MPI to clean up after itself.

That's it! We're now ready to compile the code:
[mtobias@node164 Fortran]$ ifort monte_carlo_mpi.f -I/export/openmpi-1.4.3/include/ -L/export/openmpi-1.4.3/lib -lmpi_f77
/export/intel/Compiler/12.0/lib/intel64//libimf.so: warning: warning: feupdateenv is not implemented and will always fail

Notice that we need to link against the MPI Fortran library (mpi_f77) as well as point the compiler where to find the libraries and include files. Feel free to ignore the warning.
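
As an alternative, OpenMPI (like most MPI implementations) ships wrapper compilers such as mpif77 and mpif90 that supply these include and library flags automatically, so a command along the lines of mpif77 monte_carlo_mpi.f should also work if the wrappers are on your PATH.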

At this point I should point out that we have many different implementations of MPI installed on our systems. I'm using the default implementation, OpenMPI.

All MPI executables must be launched with the mpirun command, where we specify the number of processes via the -np flag:
[mtobias@node164 Fortran]$ time mpirun -np 1 ./a.out
PI calculated with 10000000000 points = 3.14159466253464

real 1m39.127s
user 1m38.967s
sys 0m0.032s

Not surprisingly, the single-process MPI run takes about as long as the serial code.
[mtobias@node164 Fortran]$ time mpirun -np 2 ./a.out
PI calculated with 10000000000 points = 3.14159641144649

real 0m50.100s
user 1m37.129s
sys 0m0.901s
[mtobias@node164 Fortran]$ time mpirun -np 4 ./a.out
PI calculated with 10000000000 points = 3.14159362924826

real 0m26.340s
user 1m36.637s
sys 0m0.093s

[mtobias@node164 Fortran]$ time mpirun -np 8 ./a.out
PI calculated with 10000000000 points = 3.14159378677509

real 0m13.745s
user 1m39.962s
sys 0m0.504s

Every time we double the number of processes, the speed of the code almost doubles, leading to near-perfect scaling. The reason for this is that the Monte Carlo algorithm is ideally suited to this type of parallelization, or to put it another way, it is "trivially parallelizable".
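
As a rough check, the 8-process run finishes in about 13.7 seconds compared with 99.1 seconds for a single process: a speedup of roughly 7.2 on 8 processes, or about 90% parallel efficiency.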