test case for very large runs #1963
JamiePringle
The attached code runs 30 days of particle releases in a global coastal ocean current field from the Mercator GLORYS V12R1 ocean circulation model from Copernicus Oceans. The problem is relatively large, with 1,549,325 particles released every day for 30 days for a total release of 46,479,750 particles.
The purpose of this code is to provide regression testing for large Parcels runs, and to illustrate how to configure Parcels for large and efficient runs. The timings given below may help folks understand what performance can be expected from Parcels.
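The attached code is not reproduced here, but the sketch below shows roughly what this kind of daily-repeat release looks like in Parcels. The file paths, variable names, and starting-position files are hypothetical placeholders, not the actual inputs of the attached script.

```python
from datetime import timedelta
import numpy as np
import parcels

# Hydrodynamic input: daily GLORYS-style surface currents (hypothetical paths/names)
filenames = {"U": "mercatorDataFullGrid/glorys_*.nc",
             "V": "mercatorDataFullGrid/glorys_*.nc"}
variables = {"U": "uo", "V": "vo"}
dimensions = {"lon": "longitude", "lat": "latitude", "time": "time"}
fieldset = parcels.FieldSet.from_netcdf(filenames, variables, dimensions)

# Starting positions of the ~1.5 million coastal particles (hypothetical files)
lon0 = np.load("startingPositions_lon.npy")
lat0 = np.load("startingPositions_lat.npy")

# repeatdt re-releases the same set of particles every day
pset = parcels.ParticleSet(fieldset, pclass=parcels.JITParticle,
                           lon=lon0, lat=lat0, repeatdt=timedelta(days=1))

output_file = pset.ParticleFile(name="trajectories.zarr",
                                outputdt=timedelta(days=1))
pset.execute(parcels.AdvectionRK4, runtime=timedelta(days=30),
             dt=timedelta(hours=1), output_file=output_file)
```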
The files for the run can be found here and are:
The directories which need to be created (and mercatorDataFullGrid populated) are
The timings given below are the time it takes to run on a dual 16-core EPYC 7343 system with a relatively fast, cached ZFS file system. Most of these runs were made on a system that had recently accessed all of the data, so the data was mostly already in the large SSD-based cache.
The timings are repeated for three different ways to partition the particles among MPI processes. It is clear that for large problems, choosing a good partitioning method can pay substantial dividends. Partitioning should ideally satisfy two constraints: 1) give the same amount of work to each MPI job, and 2) keep the particles assigned to a given MPI job close together in space, so that the job reads the same chunks of the circulation data. (It is important to get the chunking of the circulation-data inputs and the float-trajectory outputs configured appropriately before worrying about optimal partitioning! See #1473 for discussion of output chunking and the parallel tutorial for discussion of input chunking; a minimal sketch of both is given below.)
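Continuing the sketch above, the two chunking knobs look roughly like the following. The chunk sizes and dimension names are illustrative assumptions, not tuned values.

```python
from datetime import timedelta
import parcels

# Input chunking: have Parcels/dask read the hydrodynamic fields in lat/lon
# tiles rather than whole global slices (filenames, variables, dimensions as
# in the sketch above).
chunksize = {"time": ("time", 1),
             "lat": ("latitude", 512),
             "lon": ("longitude", 512)}
fieldset = parcels.FieldSet.from_netcdf(filenames, variables, dimensions,
                                        chunksize=chunksize)

# Output chunking: size the zarr chunks of the trajectory file so that each
# write covers many trajectories and observations (see #1473); assumes the
# chunks argument of ParticleFile.
output_file = pset.ParticleFile(name="trajectories.zarr",
                                outputdt=timedelta(days=1),
                                chunks=(50_000, 10))
```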
Here we examine three partitioning methods; see the figures at the end of this post and the sketch after this list.
- "Kmeans" is the default KMeans-based method of partitioning particles for MPI when scikit-learn is installed.
- "Uniform" simply allocates the same number of particles to each core, in the order in which they appear in the list of starting particle positions. No effort is made to keep the particles close to each other in space, though they mostly are in this list of starting positions.
- "Constrained-kMeans" uses a KMeans algorithm that attempts to keep the same number of particles in each cluster, to within +/- 5% in this code. The underlying routine can be found here.
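Partition functions are passed to Parcels through the partition_function argument of ParticleSet, as described in the Parcels MPI documentation. The sketch below shows what "uniform" and "constrained-kMeans" partition functions could look like; it is an illustration rather than the attached code, and the use of the k-means-constrained package is an assumption about which library provides the underlying constrained-clustering routine.

```python
import numpy as np

def uniform_partition(coords, mpi_size=1):
    """'Uniform': equal numbers of particles per rank, in the order the
    particles appear in the list of starting positions."""
    return np.linspace(0, mpi_size, coords.shape[0], endpoint=False, dtype=np.int32)

def constrained_kmeans_partition(coords, mpi_size=1):
    """'Constrained-kMeans': clusters that are compact in lon/lat space and
    whose sizes are equal to within roughly +/- 5%.  Assumes the
    k-means-constrained package provides the constrained clustering."""
    from k_means_constrained import KMeansConstrained
    target = coords.shape[0] / mpi_size
    kmeans = KMeansConstrained(n_clusters=mpi_size,
                               size_min=int(0.95 * target),
                               size_max=int(np.ceil(1.05 * target)))
    return kmeans.fit_predict(coords)

# The partition function is selected when the ParticleSet is created, e.g.
# pset = parcels.ParticleSet(fieldset, pclass=parcels.JITParticle,
#                            lon=lon0, lat=lat0, repeatdt=timedelta(days=1),
#                            partition_function=constrained_kmeans_partition)
```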
Model run times
Serial run time: 417.5 minutes.
All times are in minutes.
In the following figure, you can see the speed of the Parcels run (runs per second) as a function of the number of MPI processes. In a perfect world this would scale linearly: twice the number of MPI processes would run the job twice as fast. Constrained-kMeans scales best, then uniform, and Kmeans does the worst.

The consistent and reliable performance of the uniform initial partitioning of particles between MPI processes suggests that it might be a better default choice for partitioning than the current Kmeans scheme.
The reason that the default "Kmeans" partitioning usually under-performs "uniform" and "constrained-kMeans" is that it usually puts different numbers of particles into each MPI process. The figures below show the partitioning and the number of particles in each partition; the allocation is most uneven with the default Kmeans, where the smallest partition has 4.7 times fewer particles than the partition with the most particles. This uneven allocation of particles leads to an uneven allocation of work, and thus slower run times. I think, but have not proven, that constrained-kMeans out-performs the "uniform" partitioning because it ensures that the particles in each cluster are closer together in latitude/longitude space, which allows each MPI job to do less IO because each job needs to read fewer chunks of the circulation data.
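A quick diagnostic (not part of the attached code) for the load imbalance produced by a partition function:

```python
import numpy as np

def imbalance(labels):
    """Ratio of the largest to the smallest partition size; 1.0 is perfectly balanced."""
    counts = np.bincount(labels)
    return counts.max() / counts.min()

# e.g. print(imbalance(constrained_kmeans_partition(coords, mpi_size=32)))
# The default Kmeans gave an imbalance of about 4.7 in the runs above.
```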


