@erikvansebille commented Jul 21, 2025

This PR adds a benchmark test for the CopernicusMarineService data (specifically the SMOC product). The aim is to explore the scaling of the interpolation / field-evaluation loop.

The script in this PR was run with either Parcels-main (for the v3 benchmarking) or with Parcels-code/Parcels#2098 (for the v4 benchmarking).

The good news is that for 1 particle, v4 is faster than v3. However, the (more important) not-so-good news is that the scaling for v4 is much worse than for v3; see the figure below.
[Figure: Screenshot 2025-07-21 at 17:05:16 — v3 vs v4 runtime versus number of particles]

  1. For these small particle counts (≤50), the v3 timing is constant at approximately 2 minutes, independent of JIT (blue) or Scipy (red) mode; this suggests that (almost) all compute time goes into reading the FieldSet.

For the current v4, however, the runtime depends strongly on the interpolation method.

  1. Using the PureXarrayInterp method defined in this PR (simply `return field.data.interp(time=t, lon=x, lat=y).values[0]`) gives the slowest runtime for npart=50 (more than 30 minutes; yellow line).

  2. Using the explicit BiRectiLinear method defined in this PR approximately halves the runtime (green line).

  3. However, using the NoFieldAccess method defined in this PR (simply `return 0`, without any xarray access to the field data but still with index searching) is blazingly fast: only 2 seconds for npart=50 (the hardly visible orange line at the bottom of the figure).
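For reference, an explicit bi-rectilinear interpolation on an in-memory array can be sketched in pure numpy as below. This is a minimal illustration of the technique, not the actual BiRectiLinear code in this PR; the function name and signature are hypothetical:

```python
import numpy as np

def birectilinear_interp(field, lons, lats, x, y):
    """Bilinear interpolation on a rectilinear grid (toy stand-in,
    not the BiRectiLinear method from this PR).

    field: 2D array of shape (len(lats), len(lons))
    lons, lats: 1D monotonically increasing coordinate arrays
    x, y: query longitude / latitude
    """
    # Locate the grid cell containing (x, y), clamped to valid cells
    i = int(np.clip(np.searchsorted(lons, x, side="right") - 1, 0, lons.size - 2))
    j = int(np.clip(np.searchsorted(lats, y, side="right") - 1, 0, lats.size - 2))
    # Fractional position inside the cell
    xsi = (x - lons[i]) / (lons[i + 1] - lons[i])
    eta = (y - lats[j]) / (lats[j + 1] - lats[j])
    # Weighted sum of the four surrounding corner values
    return ((1 - xsi) * (1 - eta) * field[j, i]
            + xsi * (1 - eta) * field[j, i + 1]
            + xsi * eta * field[j + 1, i + 1]
            + (1 - xsi) * eta * field[j + 1, i])
```

Because this avoids building an xarray indexing/interpolation call per particle, it is plausible that this is where the factor-of-two speedup over the pure-xarray variant comes from.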

So, this suggests that the bottleneck is the access to `field.data` in the Interpolators. Our current access pattern (one particle at a time) is likely just not efficient. We'll need to consider how to make this access pattern smarter...
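A toy numpy comparison illustrates why per-particle access is costly even for plain in-memory arrays (with xarray/dask-backed data the per-call overhead is far larger still, since each access also pays indexing and graph-evaluation costs):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
field = rng.random((500, 500))
jj = rng.integers(0, 500, size=50_000)  # per-particle grid indices
ii = rng.integers(0, 500, size=50_000)

# One particle at a time: 50k separate indexing calls
t0 = time.perf_counter()
per_particle = np.array([field[j, i] for j, i in zip(jj, ii)])
t_loop = time.perf_counter() - t0

# All particles at once: a single fancy-indexing call
t0 = time.perf_counter()
vectorized = field[jj, ii]
t_vec = time.perf_counter() - t0

assert np.array_equal(per_particle, vectorized)
print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.4f}s")
```

The two give identical results, but the vectorized call amortizes all per-call overhead over the whole particle set.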

@erikvansebille erikvansebille marked this pull request as draft July 22, 2025 07:38
@erikvansebille commented Jul 22, 2025

Here's some further benchmarking for the CopernicusMarine test, which shows really good results. All of these are with the 'pragmatic `computeTimeChunk`' that I implemented in Parcels-code/Parcels#2098.
[Figure: Screenshot 2025-07-22 at 17:58:52 — runtime versus npart for the computeTimeChunk variants]

The v3-JIT (blue line) is still independent of npart, but v3-Scipy (red) can now be seen to increase with npart.

I could not run the original v4-dev for these large values of npart; extrapolating the figure above, this would have taken hours to days...

However, by implementing a very simple `computeTimeChunk` method (i.e. calling `field.data.compute()` on the two time slices needed; see Parcels-code/Parcels#2098 for the implementation), we do get reasonable scaling (yellow).
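The idea can be sketched as below: keep only the two time slices bracketing the current time fully loaded in memory, and reload them only when the simulation crosses into the next time interval. This is a toy stand-in with plain numpy arrays; the class and method names are hypothetical, and Parcels-code/Parcels#2098 has the actual implementation:

```python
import numpy as np

class TimeChunkCache:
    """Toy sketch of a 'computeTimeChunk' strategy (hypothetical class)."""

    def __init__(self, data, times):
        self.data = data    # lazily-loaded array, shape (ntime, ny, nx)
        self.times = times  # 1D array of time values
        self.ti = None      # time index of the currently cached chunk
        self.chunk = None   # the two in-memory time slices

    def get_chunk(self, t):
        # Find the time index whose interval [times[ti], times[ti+1]) contains t
        ti = int(np.searchsorted(self.times, t, side="right")) - 1
        if ti != self.ti:
            # With a dask/xarray backend this is where one would call
            # field.data.isel(time=slice(ti, ti + 2)).compute()
            self.chunk = np.asarray(self.data[ti:ti + 2])
            self.ti = ti
        return self.chunk

# Minimal usage
times = np.array([0.0, 1.0, 2.0, 3.0])
data = np.arange(24.0).reshape(4, 2, 3)
cache = TimeChunkCache(data, times)
chunk = cache.get_chunk(t=0.5)  # loads time slices 0 and 1 into memory
```

All particle evaluations within one time interval then hit the cached in-memory slices instead of triggering lazy loads.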

Interestingly, the performance improves considerably (and, coincidentally(?), exactly matches v3-Scipy) when we also include the 'time-as-float' change (green; Parcels-code/Parcels#2090).
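The gist of the time-as-float idea can be sketched as follows (a hedged illustration only; see Parcels-code/Parcels#2090 for the actual change): convert the datetime64 time axis to plain float seconds once, up front, so that the per-particle time weight inside the kernel is cheap float arithmetic rather than repeated datetime64 handling:

```python
import numpy as np

# Time axis as datetime64, as stored in the dataset
times = np.array(["2025-01-01T00:00", "2025-01-01T03:00"], dtype="datetime64[s]")

# Convert once to float seconds since the first timestep...
times_f = (times - times[0]) / np.timedelta64(1, "s")

# ...so the per-particle time interpolation weight is plain float
# arithmetic, with no datetime64 conversions inside the kernel loop
t = 1.5 * 3600.0  # particle time in seconds since the first timestep
w = (t - times_f[0]) / (times_f[1] - times_f[0])
print(w)  # 0.5
```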

For reference, I also plotted the performance with no compute at all (so no calls to the field data): the dashed orange line.

@erikvansebille

Here's an update on the benchmarking for the Copernicus Marine (A-grid) dataset. It turns out that the behaviour of v4 is quite different for particle sets larger than the 10k particles in the plot above. Note that this new plot has logarithmic axes, to better show the large differences.

[Figure: Screenshot 2025-08-06 at 11:56:28 — log-log runtime scaling for v3 and v4 variants up to large npart]

Both v4-dev (yellow line above) and v4-timechunk (Parcels-code/Parcels#2098; orange line above) become excessively slow for larger particle counts; even slower than v3-Scipy (grey line above).

The good news is that up to ~10k particles, the new v4 vectorized kernel (PR Parcels-code/Parcels#2122; green line above) is very fast at ~40 seconds; only slightly slower than v3-JIT (black line above) at ~15 seconds, and always much faster than v3-Scipy!

But beyond approximately 50k particles, this implementation also becomes very slow, with much poorer scaling than v3-JIT. What is going on there remains to be explored, but it could be xarray/dask internals (this benchmark uses dask for the fieldset).

@erikvansebille

Below is an analysis of the peak memory use of the different PRs (as requested by @fluidnumerics-joe).

[Figure: Screenshot 2025-08-06 at 17:20:54 — peak memory use versus npart for the different PRs]

As expected, the vectorized kernel (Parcels-code/Parcels#2122; green line) has a much smaller memory footprint for low numbers of particles than v3, because both JIT (black line) and Scipy (grey line) need to load two time steps of the FieldSet into memory.

For larger particle sets, the memory of the vectorized kernel does grow faster, because many temporary variables are now arrays (instead of floats). But even for 1M particles, the vectorized kernel has a peak memory load of only 2 GB; so memory does not seem to be the bottleneck for the deteriorating performance (see #2 (comment) above) beyond 10k particles.
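For anyone wanting to reproduce such a measurement, the peak memory of a single vectorized kernel step can be traced with the stdlib `tracemalloc` (a minimal sketch; the benchmark itself may have used a different profiler, and `kernel_step` here is a hypothetical stand-in that just shows how array temporaries scale with npart):

```python
import tracemalloc
import numpy as np

def kernel_step(u, v, lon, lat, dt):
    # Vectorized Euler step: each temporary is a full array of size npart
    return lon + u * dt, lat + v * dt

npart = 100_000
lon = np.zeros(npart)
lat = np.zeros(npart)
u = np.ones(npart)
v = np.ones(npart)

tracemalloc.start()
lon, lat = kernel_step(u, v, lon, lat, dt=60.0)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak during kernel step: {peak / 1e6:.1f} MB")
```

Each float64 temporary is 8 bytes per particle, so the array temporaries alone account for a peak that grows linearly with npart, consistent with the figure above.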
