A (very) Brief Introduction to MPI: Fortran example
We'll start with the single processor code:
program monte_carlo
USE IFPORT
implicit none
integer*8 npts
parameter (npts = 1e10)
integer*8 i
real*8 f,sum
real*8 xmin,xmax,x
xmin = 0.0d0
xmax = 1.0d0
do i=1,npts
x = (xmax-xmin)*rand(0) + xmin
sum = sum + 4.0d0/(1.0d0 + x**2)
enddo
f = sum/npts
write(*,*)'PI calculated with ',npts,' points = ',f
stop
end
We can compile this code with the command:
[mtobias@login001 Fortran]$ ifort monte_carlo.f
and run the resulting executable. To get consistent timings with the MPI runs, I'm going to run everything on a compute node by requesting an interactive job:
[mtobias@login001 Fortran]$ qsub -I -lnodes=1:ppn=8,walltime=8:00:00
[mtobias@node164 Fortran]$ time ./a.out
PI calculated with 10000000000 points = 3.14159466262554
real 1m41.133s
user 1m41.114s
sys 0m0.001s
The simplest way to parallelize this program is to divide up the sampled points among all the processors. After each processor has computed its partial sum, we'll add the results together. It is important to remember that each processor will run the same piece of code.
I'll now present a MPI version of the code in its entirety, and then go through some of the differences between it and the single processor code:
program monte_carlo_mpi
USE IFPORT
implicit none
integer*8 npts,mynpts
parameter (npts = 1e10)
include "mpif.h"
integer ierr, myid, nprocs
integer*8 i
real*8 f,sum,mysum
real*8 xmin,xmax,x
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
if (myid .eq. 0) then
mynpts = npts - (nprocs-1)*(npts/nprocs)
else
mynpts = npts/nprocs
endif
mysum = 0.0d0
xmin = 0.0d0
xmax = 1.0d0
x = (xmax-xmin)*rand(myid+1) + xmin
do i=1,mynpts
x = (xmax-xmin)*rand(0) + xmin
mysum = mysum + 4.0d0/(1.0d0 + x**2)
enddo
call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
& 0,MPI_COMM_WORLD,ierr)
if (myid .eq. 0) then
f = sum/npts
write(*,*)'PI calculated with ',npts,' points = ',f
endif
call MPI_FINALIZE(ierr)
stop
end
The first thing we need to do is include the MPI Fortran header file:
include "mpif.h"
We define some integers:
integer ierr, myid, nprocs
where ierr is an error flag used by MPI, myid is a unique identifier denoting which process we are in the set, and nprocs is the total number of processes.
integer*8 npts,mynpts
We'll use mynpts to keep track of the number of points to be sampled by a particular process.
real*8 f,sum,mysum
mysum is each process's contribution to the integral.
call MPI_INIT(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
These three MPI calls are found in almost all MPI programs. The first routine initializes MPI, the second allows each process to determine its unique id, and the third allows each process to know the total number of processes being used in the run (determined at run time).
if (myid .eq. 0) then
mynpts = npts - (nprocs-1)*(npts/nprocs)
else
mynpts = npts/nprocs
endif
Here is where we divvy up the work (also called data decomposition). Since the total number of points may not be evenly divisible by the number of processors, we'll give the extra work to processor 0.
x = (xmax-xmin)*rand(myid+1) + xmin
This is a crucial step; We need to seed the random number generator for each process with a unique seed. We also need to be careful as a seed of 0 is equivalent to a seed of 1.
do i=1,mynpts
x = (xmax-xmin)*rand(0) + xmin
mysum = mysum + 4.0d0/(1.0d0 + x**2)
enddo
Whereas before we did the sum over all points, now each process does a partial sum over its own points.
call MPI_REDUCE(mysum,sum,1,MPI_DOUBLE_PRECISION,MPI_SUM,
& 0,MPI_COMM_WORLD,ierr)
This MPI routine takes all the process's values of mysum and adds them up into sum which it sends to processor 0.
if (myid .eq. 0) then
f = sum/npts
write(*,*)'PI calculated with ',npts,' points = ',f
endif
The final calculation is done by process 0 since it is the only process that knows the value of sum and we only want one copy of the result printed out.
call MPI_FINALIZE(ierr)
Lastly, we allow MPI to clean up after itself.
That's it! We're now ready to compile the code:
[mtobias@node164 Fortran]$ ifort monte_carlo_mpi.f -I/export/openmpi-1.4.3/include/ -L/export/openmpi-1.4.3/lib -lmpi_f77
/export/intel/Compiler/12.0/lib/intel64//libimf.so: warning: warning: feupdateenv is not implemented and will always fail
Notice that we need to link against the MPI Fortran library (mpi_f77) as well as point the compiler where to find the libraries and include files. Feel free to ignore the warning.
At this time I should point out we have many different implementations of MPI installed on our systems. I'm using the default implementation OpenMPI.
All MPI executables must be run with the command mpirun where we specify the number of processors via the np flag:
[mtobias@node164 Fortran]$ time mpirun -np 1 ./a.out
PI calculated with 10000000000 points = 3.14159466253464
real 1m39.127s
user 1m38.967s
sys 0m0.032s
Not surprisingly, the single-process MPI run takes about as long as the serial code.
[mtobias@node164 Fortran]$ time mpirun -np 2 ./a.out
PI calculated with 10000000000 points = 3.14159641144649
real 0m50.100s
user 1m37.129s
sys 0m0.901s
[mtobias@node164 Fortran]$ time mpirun -np 4 ./a.out
PI calculated with 10000000000 points = 3.14159362924826
real 0m26.340s
user 1m36.637s
sys 0m0.093s
[mtobias@node164 Fortran]$ time mpirun -np 8 ./a.out
PI calculated with 10000000000 points = 3.14159378677509
real 0m13.745s
user 1m39.962s
sys 0m0.504s
Every time we double the number of processes, we see the speed of the code almost doubles leading to near-perfect scaling. The reason for this is that the Monte-Carlo algorithm is ideally suited for parallelization of this type, or to put it another way it is "trivially parallelizable".


