
Conversation

@erikvansebille
Member

One way to improve performance in Parcels is to 'vectorize' the kernels: i.e. not to make kernels loop over individual particles, but to have them act on all particles at once. This PR explores the performance of that approach.
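For illustration, a minimal sketch of the idea (the function names and signatures here are hypothetical, not the actual Parcels API):

```python
import numpy as np

# Per-particle kernel (v3 style): one Python-level call per particle
def advect_particle(lon, lat, u, v, dt):
    return lon + u * dt, lat + v * dt

# Vectorized kernel (v4 style): one call acts on arrays of all particles
def advect_all(lons, lats, us, vs, dt):
    # lons, lats, us, vs are 1-D arrays of length n_particles, so the
    # arithmetic runs in compiled numpy code instead of a Python loop
    return lons + us * dt, lats + vs * dt

n = 1_000_000
lons, lats = np.random.rand(n), np.random.rand(n)
us, vs = np.ones(n), np.zeros(n)

# v3 style would be a million Python-level calls:
# for i in range(n):
#     lons[i], lats[i] = advect_particle(lons[i], lats[i], us[i], vs[i], dt=60.0)

# v4 style: a single vectorized call
lons, lats = advect_all(lons, lats, us, vs, dt=60.0)
```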

@erikvansebille
Member Author

So a very quick first assessment of performance is below. JIT in v3 takes 8 seconds for 1 million particles in this simulation, and the implementation of vectorized kernels in Parcels-code/Parcels#2122 takes approximately 10 times as long. That's not bad, I'd say!

[Screenshot 2025-07-29 at 13 25 42]

Note that the 'custom kernel' (red line, code below) is already much faster than the Parcels implementation, showing that there may be room for improvement!
https://github.com/OceanParcels/parcels-benchmarks/blob/44c28b1114368dce06586b0b5dcfad2a84573b37/benchmark_vectorized_kernels.py#L36-L73
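The linked file has the actual custom kernel; purely as a hypothetical sketch of why a hand-rolled kernel can win, something like the following bypasses the generic field-evaluation machinery with raw numpy index arithmetic (none of these names are from the benchmark):

```python
import numpy as np

def custom_euler_step(lons, lats, U, V, dx, dy, dt):
    # Nearest-neighbour field lookup via plain integer index arithmetic,
    # where a generic implementation would go through interpolation and
    # indexing layers for every evaluation
    j = np.clip((lats / dy).astype(np.intp), 0, U.shape[0] - 1)
    i = np.clip((lons / dx).astype(np.intp), 0, U.shape[1] - 1)
    # fancy indexing evaluates the field at all particle positions at once
    return lons + U[j, i] * dt, lats + V[j, i] * dt
```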

@erikvansebille
Member Author

erikvansebille commented Aug 6, 2025

Here's an update of the scaling for the simple flow with vectorized kernels; to be seen in conjunction with https://github.com//pull/2#issuecomment-3158958884
[Screenshot 2025-08-06 at 17 26 36]

The v4 vectorized kernel (green line, Parcels-code/Parcels#2122) is quite a bit slower than v3-JIT (black line), but much faster than v3-Scipy (grey dashed line). We can get a bit of speedup by using direct numpy indexing instead of xarray.isel() (cyan line), but that doesn't work for dask (as in #2), so it wouldn't be a general solution.
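To make the trade-off concrete, a small sketch (illustrative data, not the benchmark setup): xarray's pointwise indexing works unchanged on dask-backed arrays, while the direct numpy route needs the data fully in memory:

```python
import numpy as np
import xarray as xr

U = xr.DataArray(np.random.rand(200, 300), dims=("lat", "lon"))
j = np.random.randint(0, 200, size=100_000)
i = np.random.randint(0, 300, size=100_000)

# xarray route: pointwise indexing, also works if U wraps a lazy dask
# array, but carries per-call overhead for alignment and index handling
u_xr = U.isel(lat=xr.DataArray(j), lon=xr.DataArray(i)).values

# direct numpy route: much cheaper, but U.values materializes the whole
# array in memory, which defeats the purpose of a chunked dask backend
u_np = U.values[j, i]

assert np.allclose(u_xr, u_np)
```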

@erikvansebille
Member Author

As per #2 (comment), below is also the peak memory use for the idealised flow field.

[Screenshot 2025-08-06 at 17 26 36]

Since this flow field is a simple, stationary 2D flow, the memory footprint per particle is very small in all cases. The memory footprint for the vectorized kernel (green line) is almost five times as large as for v3-JIT (black line), but even for 2M particles it's only ~1 GB.
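For reference, one generic way to get such peak-memory numbers (the benchmarks presumably use their own instrumentation; the workload below is purely a stand-in): Python's tracemalloc tracks the peak of traced allocations, and recent numpy versions report array buffers to it:

```python
import tracemalloc
import numpy as np

tracemalloc.start()

# Stand-in workload: a few float64 arrays for 2M particles (~16 MB each)
n = 2_000_000
lons, lats = np.random.rand(n), np.random.rand(n)
for _ in range(10):
    lons = lons + 0.01 * np.random.rand(n)  # temporaries briefly raise the peak

current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1e9:.2f} GB")
tracemalloc.stop()
```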

@erikvansebille
Member Author

I've looked a bit deeper into the memory use of the vectorized kernels, and found something quite interesting (and good news!)
[Screenshot 2025-08-07 at 13 59 37]

The diagram above shows, for a 100k-particle run in v3-JIT and v4 vectorized kernels (Parcels-code/Parcels#2122), the runtime in red and memory consumption in blue. As expected (and also shown in the posts above), vectorized kernels are both a bit slower and have a larger memory footprint.

But the memory footprint does not increase much when more complicated kernels are used(!). The difference between the built-in AdvectionEE and AdvectionRK4 kernels is only 10 MB in this case (and, as expected, AdvectionRK4 is four times slower because it does four times more field evaluations).
More surprisingly(!), a custom 'thin' AdvectionRK4 kernel, in which temporary variables are reused, does not have a huge impact on peak memory either. I guess the Python garbage collector is really smart?
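For concreteness, the kind of rewrite meant by a 'thin' kernel (a hypothetical sketch with a stand-in field evaluation, not the benchmark code):

```python
import numpy as np

def eval_uv(lons, lats, t):
    """Hypothetical stand-in for a field evaluation returning u, v arrays."""
    return np.cos(lats), np.sin(lons)

def rk4_standard(lons, lats, t, dt):
    # keeps all four stage velocities alive at once: 8 temporary arrays
    u1, v1 = eval_uv(lons, lats, t)
    u2, v2 = eval_uv(lons + 0.5 * u1 * dt, lats + 0.5 * v1 * dt, t + 0.5 * dt)
    u3, v3 = eval_uv(lons + 0.5 * u2 * dt, lats + 0.5 * v2 * dt, t + 0.5 * dt)
    u4, v4 = eval_uv(lons + u3 * dt, lats + v3 * dt, t + dt)
    return (lons + (u1 + 2 * u2 + 2 * u3 + u4) / 6 * dt,
            lats + (v1 + 2 * v2 + 2 * v3 + v4) / 6 * dt)

def rk4_thin(lons, lats, t, dt):
    # accumulates the weighted sum in place, so only one stage velocity
    # pair (u, v) plus the accumulators (du, dv) are alive at any time
    u, v = eval_uv(lons, lats, t)
    du, dv = u.copy(), v.copy()
    u, v = eval_uv(lons + 0.5 * u * dt, lats + 0.5 * v * dt, t + 0.5 * dt)
    du += 2 * u; dv += 2 * v
    u, v = eval_uv(lons + 0.5 * u * dt, lats + 0.5 * v * dt, t + 0.5 * dt)
    du += 2 * u; dv += 2 * v
    u, v = eval_uv(lons + u * dt, lats + v * dt, t + dt)
    du += u; dv += v
    return lons + du / 6 * dt, lats + dv / 6 * dt
```

One plausible factor: CPython reclaims arrays by reference counting as soon as the last name pointing at them is rebound, so the expression temporaries in the standard version never pile up; the two versions only differ in how many stage velocities are simultaneously alive, which is a modest fraction of the peak.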

Curious to hear what you think about this, @fluidnumerics-joe and @VeckoTheGecko!

@fluidnumerics-joe
Contributor

I would've thought the garbage collector cleaned up local variables when they went out of scope (e.g. outside the AdvectionRK4 kernel). This is indeed quite interesting. Is the runtime here the total simulation runtime or the accumulated runtime inside the advection kernel?
