Parallel calculation of π

An approximation to the value of π can be calculated from the following expression

where the answer becomes more accurate with increasing N. As each term is independent, the summation over i can be parallelized nearly trivially.

Starting from the serial code pi.cpp (or [pi.F90](pi.F90) for Fortran), make a version that performs the calculation in parallel.

1. Divide the range of i among the ntasks processes, so that rank 0 does i = 1, 2, ..., N / ntasks, rank 1 does i = N / ntasks + 1, N / ntasks + 2, ..., 2N / ntasks, and so on. You may assume that N is evenly divisible by the number of processes.
2. Each task calculates its own partial sum.
3. Once finished with the calculation, all ranks except rank 0 send their partial sum to rank 0, which then calculates the final result and prints it out.
4. Run the code with different numbers of processes. Do you get exactly the same result? If not, can you explain why?
5. Make a version where rank 0 receives the partial sums with MPI_ANY_SOURCE. When running multiple times with the same number of processes, do you always get exactly the same result? If not, can you explain why?
6. (Bonus) Make a version that works with an arbitrary number of processes. Now, if N is not evenly divisible by ntasks, some processes calculate more terms than others.
