# A (very) Brief Introduction to MPI: Fortran example

We'll start with the single processor code:

```
      program monte_carlo
      USE IFPORT
      implicit none
      integer*8 npts
      parameter (npts = 1e10)
      integer*8 i
      real*8 f,sum
      real*8 xmin,xmax,x

      sum = 0.0d0
      xmin = 0.0d0
      xmax = 1.0d0
      do i=1,npts
         x = (xmax-xmin)*rand(0) + xmin
         sum = sum + 4.0d0/(1.0d0 + x**2)
      enddo
      f = sum/npts
      write(*,*)'PI calculated with ',npts,' points = ',f
      stop
      end
```
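The program is a Monte Carlo estimate of an integral whose exact value is known:

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx = 4\arctan(1),$$

approximated by averaging the integrand over uniformly sampled points:

$$\pi \approx \frac{1}{N}\sum_{i=1}^{N} \frac{4}{1+x_i^2}, \qquad x_i \sim U(0,1).$$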

We can compile this code with the command:

```
[mtobias@login001 Fortran]$ ifort monte_carlo.f
```

and run the resulting executable. To get consistent timings with the MPI runs, I'm going to run everything on a compute node by requesting an interactive job:

```
[mtobias@login001 Fortran]$ qsub -I -lnodes=1:ppn=8,walltime=8:00:00
```

```
[mtobias@node164 Fortran]$ time ./a.out
PI calculated with 10000000000 points = 3.14159466262554

real    1m41.133s
user    1m41.114s
sys     0m0.001s
```

The simplest way to parallelize this program is to divide up the sampled points among all the processors. After each processor has computed its partial sum, we'll add the results together. It is important to remember that each processor will run the same piece of code.
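This decomposition works because addition is associative: if process $p$ samples $N_p$ points with $\sum_p N_p = N$, the total sum splits cleanly into per-process partial sums,

$$\sum_{i=1}^{N} \frac{4}{1+x_i^2} \;=\; \sum_{p=0}^{P-1}\;\underbrace{\sum_{i \in p} \frac{4}{1+x_i^2}}_{\text{partial sum on process } p},$$

so each process can accumulate its own contribution independently and the pieces are combined once at the end.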

I'll now present an MPI version of the code in its entirety, and then go through some of the differences between it and the single processor code:

```
      program monte_carlo_mpi
      USE IFPORT
      implicit none
      integer*8 npts,mynpts
      parameter (npts = 1e10)
      include "mpif.h"
      integer ierr, myid, nprocs
      integer*8 i
      real*8 f,sum,mysum
      real*8 xmin,xmax,x

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      if (myid .eq. 0) then
         mynpts = npts - (nprocs-1)*(npts/nprocs)
      else
         mynpts = npts/nprocs
      endif

      mysum = 0.0d0
      xmin = 0.0d0
      xmax = 1.0d0
      x = (xmax-xmin)*rand(myid+1) + xmin
      do i=1,mynpts
         x = (xmax-xmin)*rand(0) + xmin
         mysum = mysum + 4.0d0/(1.0d0 + x**2)
      enddo

      call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
     &                0,MPI_COMM_WORLD,ierr)

      if (myid .eq. 0) then
         f = sum/npts
         write(*,*)'PI calculated with ',npts,' points = ',f
      endif

      call MPI_FINALIZE(ierr)
      stop
      end
```

The first thing we need to do is include the MPI Fortran header file:

```
      include "mpif.h"
```

We define some integers:

```
      integer ierr, myid, nprocs
```

where ierr is an error flag used by MPI, myid is a unique identifier (the rank) telling each process which member of the set it is, and nprocs is the total number of processes.

```
      integer*8 npts,mynpts
```

We'll use mynpts to keep track of the number of points to be sampled by a particular process.

```
      real*8 f,sum,mysum
```

mysum is each process's contribution to the integral.

```
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
```

These three MPI calls are found in almost all MPI programs. The first routine initializes MPI, the second allows each process to determine its unique id, and the third allows each process to know the total number of processes being used in the run (determined at run time).

```
      if (myid .eq. 0) then
         mynpts = npts - (nprocs-1)*(npts/nprocs)
      else
         mynpts = npts/nprocs
      endif
```

Here is where we divvy up the work (also called data decomposition). Since the total number of points may not be evenly divisible by the number of processes, integer division leaves a remainder; we give the leftover points to process 0.

```
      x = (xmax-xmin)*rand(myid+1) + xmin
```

This is a crucial step: we need to seed the random number generator on each process with a unique value, otherwise every process would sample exactly the same points. We also need to be careful, as a seed of 0 is equivalent to a seed of 1, which is why we pass myid+1 rather than myid.

```
      do i=1,mynpts
         x = (xmax-xmin)*rand(0) + xmin
         mysum = mysum + 4.0d0/(1.0d0 + x**2)
      enddo
```

Whereas before we did the sum over all points, now each process does a partial sum over its own points.

```
      call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
     &                0,MPI_COMM_WORLD,ierr)
```

This MPI routine takes each process's value of mysum, combines them with the MPI_SUM operation, and places the total in sum on process 0. The arguments are, in order: the send buffer, receive buffer, element count, datatype, reduction operation, root rank, communicator, and error flag.

```
      if (myid .eq. 0) then
         f = sum/npts
         write(*,*)'PI calculated with ',npts,' points = ',f
      endif
```

The final calculation is done by process 0 since it is the only process that knows the value of sum and we only want one copy of the result printed out.

```
      call MPI_FINALIZE(ierr)
```

Lastly, we allow MPI to clean up after itself.

That's it! We're now ready to compile the code:

```
[mtobias@node164 Fortran]$ ifort monte_carlo_mpi.f -I/export/openmpi-1.4.3/include/ -L/export/openmpi-1.4.3/lib -lmpi_f77
/export/intel/Compiler/12.0/lib/intel64//libimf.so: warning: warning: feupdateenv is not implemented and will always fail
```

Notice that we need to link against the MPI Fortran library (mpi_f77) and tell the compiler where to find the libraries and include files. Feel free to ignore the warning.

At this time I should point out that we have many different implementations of MPI installed on our systems. I'm using the default implementation, Open MPI.

All MPI executables must be run with the command mpirun, where we specify the number of processes via the -np flag:

```
[mtobias@node164 Fortran]$ time mpirun -np 1 ./a.out
PI calculated with 10000000000 points = 3.14159466253464

real    1m39.127s
user    1m38.967s
sys     0m0.032s
```

Not surprisingly, the single-process MPI run takes about as long as the serial code.

```
[mtobias@node164 Fortran]$ time mpirun -np 2 ./a.out
PI calculated with 10000000000 points = 3.14159641144649

real    0m50.100s
user    1m37.129s
sys     0m0.901s
```

```
[mtobias@node164 Fortran]$ time mpirun -np 4 ./a.out
PI calculated with 10000000000 points = 3.14159362924826

real    0m26.340s
user    1m36.637s
sys     0m0.093s
```

```
[mtobias@node164 Fortran]$ time mpirun -np 8 ./a.out
PI calculated with 10000000000 points = 3.14159378677509

real    0m13.745s
user    1m39.962s
sys     0m0.504s
```

Every time we double the number of processes, the runtime is almost halved, giving near-perfect scaling. The reason is that this Monte Carlo algorithm is ideally suited to this type of parallelization; to put it another way, it is "trivially parallelizable" (also commonly called "embarrassingly parallel"), since the processes need to communicate only once, at the final reduction.
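To make the scaling concrete, here are the wall-clock times above expressed as speedups relative to the single-process MPI run (speedup = 1m39.1s divided by the measured real time):

| Processes | real time | Speedup |
|-----------|-----------|---------|
| 1         | 1m39.127s | 1.00    |
| 2         | 0m50.100s | 1.98    |
| 4         | 0m26.340s | 3.76    |
| 8         | 0m13.745s | 7.21    |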