Adding benchmark test for CopernicusMarine with different Interpolators #2
Conversation
Here's some further benchmarking for the CopernicusMarine test, which shows really good results. All of these runs are with the 'pragmatic …'. The v3-JIT runtime (blue line) is still independent of `npart`. I could not run the original v4-dev for these large values of `npart`. However, with a very simple `computeTimeChunk` method (i.e. calling `field.data.compute()` on the two time slices needed; see Parcels-code/Parcels#2098 for the implementation), we do get reasonable scaling (yellow line). Interestingly, the performance improves considerably (and coincidentally(?) is exactly the same as v3-Scipy) when we also include the 'time-as-float' change (green line, Parcels-code/Parcels#2090). For reference, I also plotted the performance with no compute at all (so no calls to the field data); that is the dashed orange line.
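For readers who don't want to dig into Parcels-code/Parcels#2098, here is a minimal sketch of the *idea* behind the `computeTimeChunk` approach (the helper name and signature below are hypothetical, not the actual code in that PR): eagerly materialise only the two time slices that bracket the current time, so the interpolators work on an in-memory array instead of a lazy dask graph.

```python
import xarray as xr


def compute_time_chunk(field, ti: int) -> xr.DataArray:
    """Hypothetical sketch of a computeTimeChunk-style eager load.

    Assumes field.data is a dask-backed xarray.DataArray with a leading
    'time' dimension; ti is the index of the time slice at or before the
    current model time. Only slices ti and ti+1 are computed into memory.
    """
    return field.data.isel(time=slice(ti, ti + 2)).compute()
```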
Here's an update on the benchmarking for the Copernicus Marine (A-grid) dataset. It turns out that the behaviour of v4 is quite different for particle sets larger than the 10k particles in the plot above. Note that this new plot has logarithmic axes, to better show the large differences.
Both v4-dev (yellow line above) and v4-timechunk (Parcels-code/Parcels#2098, orange line above) become excessively slow for larger particle sets; even slower than v3-Scipy (grey line above). The good news is that up to ~10k particles, the new v4 vectorized kernel (PR Parcels-code/Parcels#2122, green line above) is very fast at ~40 seconds; only slightly slower than v3-JIT (black line above) at ~15 seconds, and always much faster than v3-Scipy! But beyond approximately 50k particles, this implementation also becomes very slow, with much poorer scaling than v3-JIT. What's going on there still needs to be explored, but it could be xarray/dask internals (this benchmark uses dask for the fieldset).
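One quick way to test whether the poor scaling beyond ~50k particles really comes from the dask task graph would be to rerun the same benchmark against a fully loaded dataset. The file name and chunking below are placeholders, not part of this PR's script:

```python
import xarray as xr

# Lazy, dask-backed dataset, as presumably used in the benchmark above
ds_lazy = xr.open_dataset("SMOC_subset.nc", chunks={"time": 1})

# Fully loaded into memory: no dask graph involved in the interpolation
ds_eager = xr.open_dataset("SMOC_subset.nc").load()
```

If the slowdown disappears with `ds_eager`, that would point at xarray/dask overhead rather than the vectorized kernel itself.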
Below is an analysis of the peak memory use of the different PRs (as requested by @fluidnumerics-joe).
As expected, the vectorized kernel (Parcels-code/Parcels#2122, green line) has a much smaller memory footprint for low numbers of particles than v3, because both JIT (black line) and Scipy (grey line) need to load two time steps of the FieldSet into memory. For larger particle sets, the memory of the vectorized kernel does grow faster, because many temporary variables are now arrays (instead of floats). But even for 1M particles, the vectorized kernel has a peak memory load of only 2GB; so memory doesn't seem to be the bottleneck for the decreasing performance for >10k particles (see #2 (comment) above).
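The comment doesn't say how peak memory was measured; one simple way to get the peak resident set size of a benchmark run (on Linux) is via the `resource` module, as in this sketch:

```python
import resource


def peak_memory_mb() -> float:
    """Peak resident set size of this process in MB.

    On Linux, ru_maxrss is reported in kilobytes (on macOS it is bytes).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# Hypothetical usage after a benchmark run:
# run_benchmark(npart=1_000_000)
# print(f"peak memory: {peak_memory_mb():.0f} MB")
```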



This PR adds a benchmark test for the CopernicusMarineService data (specifically the SMOC product). The aim is to explore the scaling of the interpolation / field-evaluation loop.
The script in this PR was run with either Parcels `main` (for the v3 benchmarking) or with Parcels-code/Parcels#2098 (for the v4 benchmarking). The good news is that for 1 particle, v4 is faster than v3. However, the (more important) not-so-good news is that the scaling for v4 is much worse than for v3. See the figure below.
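The timing loop of such a benchmark script boils down to something like the sketch below; `run_parcels` and the particle counts are placeholders, not the actual script in this PR:

```python
import time


def benchmark(run_parcels, nparts=(1, 5, 10, 50)):
    """Time a Parcels run for increasing particle counts.

    run_parcels is a placeholder for a function that builds the FieldSet,
    ParticleSet and kernel for the given npart and calls pset.execute().
    """
    timings = {}
    for npart in nparts:
        t0 = time.perf_counter()
        run_parcels(npart)
        timings[npart] = time.perf_counter() - t0
        print(f"npart={npart:>8d}: {timings[npart]:.1f} s")
    return timings
```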

For the current v4, however, the runtime depends strongly on the interpolation method used.
- Using the `PureXarrayInterp` method defined in this PR (simply `return field.data.interp(time=t, lon=x, lat=y).values[0]`) gives the slowest runtime for `npart=50` (more than 30 minutes; yellow line)
- Using the explicit `BiRectiLinear` method defined in this PR approximately halves the runtime (green line)
- However, using the `NoFieldAccess` method defined in this PR (simply `return 0`, without any xarray access to the field data but with index_searching) is blazingly fast (only 2 seconds for `npart=50`; the hardly-visible orange line at the bottom of the figure)

So, this suggests that the bottleneck is the access to `field.data` in the Interpolators. It's likely that our access pattern (one particle at a time) is just not efficient now. We'll need to consider how to make this access pattern smarter...
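For reference, here are rough sketches of the three interpolators compared above. Only the `PureXarrayInterp` one-liner and the `return 0` of `NoFieldAccess` are quoted from the PR description; the function signatures and the explicit bilinear formula in `BiRectiLinear` are assumptions (index bounds checking and time handling are omitted for brevity):

```python
import numpy as np


def PureXarrayInterp(field, t, z, y, x):
    """Let xarray do the interpolation directly (slowest; yellow line)."""
    return field.data.interp(time=t, lon=x, lat=y).values[0]


def NoFieldAccess(field, t, z, y, x):
    """Skip the field access entirely (fast orange line); isolates the cost
    of touching field.data from the rest of the evaluation loop."""
    return 0.0


def BiRectiLinear(field, t, z, y, x):
    """Sketch of an explicit bilinear interpolation on a rectilinear A-grid
    (not the actual PR code). Assumes 1D 'lon' and 'lat' coordinates on
    field.data; the time dimension is simplified to a single slice."""
    lons = field.data.lon.values
    lats = field.data.lat.values
    xi = np.searchsorted(lons, x) - 1
    yi = np.searchsorted(lats, y) - 1
    xsi = (x - lons[xi]) / (lons[xi + 1] - lons[xi])
    eta = (y - lats[yi]) / (lats[yi + 1] - lats[yi])
    d = field.data.isel(time=0, lat=slice(yi, yi + 2), lon=slice(xi, xi + 2)).values
    return ((1 - xsi) * (1 - eta) * d[0, 0] + xsi * (1 - eta) * d[0, 1]
            + (1 - xsi) * eta * d[1, 0] + xsi * eta * d[1, 1])
```

Even in this sketch it is clear that every call touches `field.data` for a single particle, which is exactly the per-particle access pattern identified above as the likely bottleneck.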