A useful teaching example for Parallel Processing courses is given in Rebecca's Blog: "Example Performance Evaluation"
I added the following comment:
"I liked the example and I think this kind of demonstrations really help explaining were bottlenecks may occur in the parallel code. I would like to add two comments:
1. Solving the matrix and obtaining 0 for the startup time, after already neglecting the bandwidth, means that the problem is totally Embarrassingly Parallel and this is indeed the case for MC calculations. It would be interesting to repeat the exercise for a communication intensive application, e.g. Laplace equation.
2. I think it worth mentioning Collective Communication commands, i.e. Bcast and Reduce in this context. In particular Reduce+Summation can be more efficient than the loop over the Recv because it can do summation out of order and this may hide some of the holding/communication time. Another alternative is to use MPI ANY_SOURCE in the Recv command.