Conversation

@erikvansebille (Member) commented Jul 17, 2025

A first, very simple script that can be used for benchmarking the Kernel loop itself. Since the Kernel does nothing, there is no field access or call to interpolation methods.

With the current commit of v4-dev (488e3fb), the scaling is pretty poor.

On my machine, the pset.execute() for npart=1, 10, 100, 1000 and 2500 particles takes 0:01, 0:10, 1:36, 15:48 and 37:57 (mm:ss), respectively. That's nicely linear scaling, but of course not at all efficient.

For main (i.e. Parcels v3), the pset.execute() takes 0 seconds for all values of npart, irrespective of the number of particles.

I'll start digging into this poor scaling of the Kernel/pset.execute() loop with particle number, and hopefully find an improvement soon.
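
For reference, a rough sketch of what such a benchmark script can look like (the actual script is the one added in this PR; the API names below follow the v3-style Parcels interface and may differ in v4):

    import time
    from datetime import timedelta

    import numpy as np
    import parcels


    def DoNothing(particle, fieldset, time):
        pass  # empty Kernel: no field access, no interpolation


    # Trivial flow field, only needed to be able to construct a ParticleSet
    fieldset = parcels.FieldSet.from_data(
        {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
        {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
    )

    for npart in [1, 10, 100, 1000, 2500]:
        pset = parcels.ParticleSet(
            fieldset=fieldset,
            pclass=parcels.ScipyParticle,
            lon=np.zeros(npart),
            lat=np.zeros(npart),
        )
        tic = time.perf_counter()
        pset.execute(DoNothing, runtime=timedelta(days=1), dt=timedelta(minutes=5))
        print(f"npart={npart}: {time.perf_counter() - tic:.1f} s")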

@erikvansebille (Member, Author)

OK, so a first morning of benchmarking this very simple script resulted in the graph below:
[Screenshot: pset.execute() timing vs npart for the different branches]

Parcels v3 is extremely fast, and the current v4-dev branch is very slow (yellow line in the graph above).

I first tried changing back to using float for time and dt in the kernel (Parcels-code/Parcels#2090), and that improved performance by ~40%.

Some further digging led to the realisation that the main bottleneck is actually updating the value of a particle attribute, which in v4-dev is:

def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        # Internal bookkeeping attributes live on the object itself
        object.__setattr__(self, name, value)
    else:
        # Every other attribute is written into the particle data store
        self._data[name][self._index] = value

That last line is very slow, and simply changing it to (Parcels-code/Parcels#2092)

-        self._data[name][self._index] = value
+        self._data[name].data[self._index] = value

improved the performance by 85% compared to v4-dev (orange line), because setting a value on an xarray DataArray is much slower than on its underlying numpy array.
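
To see this difference in isolation, here is a minimal micro-benchmark sketch (not code from the PR) comparing an item assignment through an xarray DataArray with one on its underlying numpy array:

    import timeit

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros(1000), dims="trajectory")
    arr = da.data  # the underlying numpy array, not a copy

    # Assignment through the DataArray goes through xarray's indexing machinery
    t_xr = timeit.timeit("da[5] = 1.0", globals={"da": da}, number=100_000)
    # Assignment on the numpy array is a plain item write
    t_np = timeit.timeit("arr[5] = 1.0", globals={"arr": arr}, number=100_000)
    print(f"DataArray: {t_xr:.3f} s, numpy: {t_np:.3f} s")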

Note that the time-as-float change does not seem to further improve performance on top of the setattr change, probably because setting an np.datetime64 or np.timedelta64 value was even slower than setting a value in a float array.

Now, to further explore what kind of performance boost we can expect, I also wrote a PR that does not set particle values at all (Parcels-code/Parcels#2093). Here, the __setattr__ function is simply

def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        pass  # silently drop all particle attribute writes (benchmark-only)

This leads to a speed increase of 95% compared to v4-dev (cyan line). But this PR is of course not a suitable solution, as it doesn't allow changing particle properties (so it can't be used to move particles around). It only works in this particular benchmark because we use the DoNothing() Kernel.

So in summary: a major bottleneck seems to be the __setattr__ in the new xarray-based particle class. In v3 we use a dictionary of arrays, so I'll now spend some time to see what the performance gain is if we change the v4 code from xarray to a dictionary; a sketch of that approach is below.
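
As a sketch, the dict-of-arrays variant of the __setattr__ above could look like this (a simplification for illustration, not the actual code of the PR):

    import numpy as np


    class Particle:
        def __init__(self, data, index):
            # Bypass our own __setattr__ for the two bookkeeping attributes
            object.__setattr__(self, "_data", data)
            object.__setattr__(self, "_index", index)

        def __getattr__(self, name):
            return self._data[name][self._index]

        def __setattr__(self, name, value):
            if name in ["_data", "_index"]:
                object.__setattr__(self, name, value)
            else:
                # _data is a plain dict of numpy arrays, so this is a direct
                # numpy item write with no xarray machinery involved
                self._data[name][self._index] = value


    data = {"lon": np.zeros(100), "lat": np.zeros(100)}
    p = Particle(data, index=3)
    p.lon = 1.5  # writes straight into data["lon"][3]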

@erikvansebille (Member, Author) commented Jul 18, 2025

Another update: I changed the data structure for ParticleSet._data from an xarray.Dataset to a dictionary of numpy arrays in Parcels-code/Parcels#2094. This again gives a big performance increase; see the graph below (darker blue line), now on a logarithmic y-scale.

[Screenshot: timing vs npart for the different branches, logarithmic y-scale]

In summary, we are now down to:

| branch | time [s] for npart=500 |
| --- | --- |
| v3-JIT | 0 |
| v3-Scipy | 2 |
| v4-dev | 178 |
| time as float (Parcels-code/Parcels#2090) | 112 |
| xarray.data setattr (Parcels-code/Parcels#2092) | 30 |
| no setattr (Parcels-code/Parcels#2093) | 9 |
| using dict (Parcels-code/Parcels#2094) | 5 |

@erikvansebille (Member, Author)

I added a few smaller fixes in Parcels-code/Parcels#2096, which increase the kernel-loop performance a bit more.

For reference, in combination with Parcels-code/Parcels#2094, the time for npart=500 drops from 5 seconds to 2 seconds, getting us to a similar speed as v3-Scipy!

@VeckoTheGecko (this comment was marked as outdated)

@erikvansebille (Member, Author)

> Ran some profiling for the branches

Is this with the latest versions of the PRs, where Parcels-code/Parcels#2096 has been merged in?

@VeckoTheGecko (Contributor)

No, I'll re-run

@erikvansebille (Member, Author) commented Jul 18, 2025

Another update on the timings, now with the kernel-loop optimisation changes (Parcels-code/Parcels#2096) included, and extended to npart=1500:
[Screenshot: timing vs npart for the different branches, extended to npart=1500]

| branch | time [s] for npart=500 |
| --- | --- |
| v3-JIT | 0 |
| v3-Scipy | 2 |
| v4-dev | 149 |
| time as float (Parcels-code/Parcels#2090) | 112 |
| xarray.data setattr (Parcels-code/Parcels#2092) | 20 |
| no setattr (Parcels-code/Parcels#2093) | 0 |
| using dict (Parcels-code/Parcels#2094) | 2 |

So we are now slightly faster than v3-Scipy with Parcels-code/Parcels#2094!

And the fact that Parcels-code/Parcels#2093 takes effectively 0 seconds means that we have no serious overhead except the ParticleSet __setattr__.

@VeckoTheGecko (Contributor) commented Jul 18, 2025

Ran some (updated) profiling for the branches:

time as float (Parcels-code/Parcels#2090) (NOT updated with Parcels-code/Parcels#2096)

[Profiling call graph: log-particle_time_as_float.prof]

xarray.data setattr (Parcels-code/Parcels#2092)

[Profiling call graph: log-pset_performance.prof]

no setattr (Parcels-code/Parcels#2093)

[Profiling call graph: log-no_particle_attrsetting.prof]

@erikvansebille (Member, Author)

OK, and do you also have the profile of Parcels-code/Parcels#2094?

@VeckoTheGecko (Contributor) commented Jul 18, 2025

> OK, and do you also have the profile of OceanParcels/Parcels#2094?

Ok, just made it:
[Profiling call graph: log-particledata_as_dict.prof]

@VeckoTheGecko (Contributor)

Some of the graphs got really messy, so I did some tuning of the percentage threshold for which nodes to show (in some graphs I trimmed off anything below 1%).
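
For reference, call graphs like these can be generated from a cProfile dump (the exact tooling isn't stated in this thread, and `run_benchmark` below is a hypothetical stand-in for the benchmark script):

    import cProfile
    import pstats

    # Profile one benchmark run and dump the stats to disk
    cProfile.run("run_benchmark(npart=500)", "log-pset_performance.prof")

    # Print the hottest calls; renderers such as gprof2dot can turn the same
    # .prof file into a call-graph image, trimming nodes below a chosen threshold
    stats = pstats.Stats("log-pset_performance.prof")
    stats.sort_stats("cumulative").print_stats(20)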

@erikvansebille (Member, Author)

Thanks for these profiling graphs! Would you agree that they clearly show that calling Particle.__setattr__ when the particle data is a dictionary of numpy arrays (Parcels-code/Parcels#2094) is much faster than the same operation when the particle data is an xarray Dataset, even when we directly set DataArray.data[index] (Parcels-code/Parcels#2092)?

@VeckoTheGecko (Contributor) commented Jul 18, 2025

Yes, I don't think there is really a way to get around this in xarray-world (for completeness, over the coming week or two I might put together a post to send to Pangeo asking about this use case, along with a minimal example).

@VeckoTheGecko (Contributor)

One second actually: just looking at the xarray.data setattr run, there are a lot of _construct_dataarray calls coming from this code:

def __getattr__(self, name):
    # self._data[name] on an xarray Dataset builds a fresh DataArray on
    # every attribute read, which is where _construct_dataarray gets called
    return self._data[name].data[self._index]

This means that a new DataArray is constructed on every iteration (plus a bunch of other xarray machinery is invoked). Isn't there a way that we can get the best of both worlds?

  • use xarray for particledata
  • transfer that to dict of arrays (arrays still being the same object in memory as in xarray)
  • make these changes on the dict of arrays which doesn't invoke the rest of the machinery

or is that needlessly complex? (all good if it's the latter)

@erikvansebille (Member, Author)

> Isn't there a way that we can get the best of both worlds?
>
>   • use xarray for particledata
>   • transfer that to dict of arrays (arrays still being the same object in memory as in xarray)
>   • make these changes on the dict of arrays which doesn't invoke the rest of the machinery
>
> or is that needlessly complex? (all good if it's the latter)

It's a really cool and original idea! I just implemented it in Parcels-code/Parcels#2097; is that what you had in mind?
As expected, the performance in this new branch matches the fast dict-of-numpy-arrays of Parcels-code/Parcels#2094, but with the added benefit that we also keep access to the Dataset.

Long live copy-by-reference ;-)
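
For illustration, a minimal standalone demonstration of why this works (variable names hypothetical, not the actual PR code):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"lon": ("trajectory", np.zeros(4))})

    # .data hands back the underlying numpy array itself, not a copy,
    # so the dict entries and the Dataset share the same memory
    arrays = {name: ds[name].data for name in ds.data_vars}

    arrays["lon"][2] = 1.5             # fast plain-numpy write
    assert ds["lon"].values[2] == 1.5  # the Dataset sees the same update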

@erikvansebille (Member, Author)

OK, so it turns out that the time-as-float change does have a (small) effect, but only at much larger npart than tested above...

The plot below shows scaling up to npart=100_000
[Screenshot: timing vs npart for the different branches, up to npart=100_000]

JIT in v3 is still blazingly fast (blue line), and the current v4-dev branch (yellow line) is as fast as v3-Scipy (red line); but v4-dev with time treated as float is a factor of 2 faster than the original v4-dev. So perhaps there is some merit to using float instead of np.datetime64 inside the kernel after all...
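
For illustration, a sketch of what treating time as float can mean inside the kernel loop (not the actual implementation of Parcels-code/Parcels#2090): convert to float seconds at the boundary, do the per-iteration arithmetic on plain floats, and convert back only afterwards.

    import numpy as np

    origin = np.datetime64("2025-07-17T00:00:00", "ns")
    t = np.datetime64("2025-07-17T06:00:00", "ns")

    # Float seconds relative to the origin; cheap to increment in the loop
    time_f = (t - origin) / np.timedelta64(1, "s")  # 21600.0, a plain float
    time_f += 300.0                                 # dt = 300 s

    # Convert back to datetime64 only when leaving the kernel loop
    t_new = origin + np.timedelta64(int(time_f * 1e9), "ns")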
