Conversation

@erikvansebille (Member) commented Jul 17, 2025

A first, very simple script that can be used for benchmarking the Kernel loop itself. Since the Kernel does nothing, there is no field access or call to interpolation methods.

With the current commit of v4-dev (488e3fb), the scaling is pretty poor.

On my machine, the pset.execute() for npart=1, 10, 100, 1000 and 2500 particles takes 0:01, 0:10, 1:36, 15:48 and 37:57 (mm:ss), respectively. That's nicely linear scaling, but of course not at all efficient.

For main (i.e. Parcels v3), the pset.execute() takes 0 seconds for all values of npart, irrespective of the number of particles.

I'll start digging into this poor scaling of the Kernel/pset.execute() loop with particle number, and hopefully find an improvement soon.
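
For reference, a rough sketch of what such a benchmark script can look like (the actual script is the one added in this PR; the API names below follow the v3-style Parcels interface and may differ in v4):

    import time
    from datetime import timedelta

    import numpy as np
    import parcels


    def DoNothing(particle, fieldset, time):
        pass  # empty Kernel: no field access, no interpolation


    # Trivial flow field, only needed to be able to construct a ParticleSet
    fieldset = parcels.FieldSet.from_data(
        {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
        {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
    )

    for npart in [1, 10, 100, 1000, 2500]:
        pset = parcels.ParticleSet(
            fieldset=fieldset,
            pclass=parcels.ScipyParticle,
            lon=np.zeros(npart),
            lat=np.zeros(npart),
        )
        tic = time.perf_counter()
        pset.execute(DoNothing, runtime=timedelta(days=1), dt=timedelta(minutes=5))
        print(f"npart={npart}: {time.perf_counter() - tic:.1f} s")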

@erikvansebille (Member, Author)

OK, so a first morning of benchmarking this very simple script resulted in the graph below:
[Screenshot: pset.execute() timing vs npart for the different branches]

Parcels v3 is extremely fast, and the current v4-dev branch is very slow (yellow line in the graph above).

I first tried changing back to using float for time and dt in the kernel (Parcels-code/Parcels#2090), and that improved performance by ~40%.

Some further digging led to the realisation that the main bottleneck is actually updating the value of a particle attribute, which in v4-dev is:

def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        # Internal bookkeeping attributes live on the object itself
        object.__setattr__(self, name, value)
    else:
        # Every other attribute is written into the particle data store
        self._data[name][self._index] = value

That last line is very slow, and simply changing it to (Parcels-code/Parcels#2092)

-        self._data[name][self._index] = value
+        self._data[name].data[self._index] = value

improved the performance by 85% compared to v4-dev (orange line), because setting a value on an xarray DataArray is much slower than on its underlying numpy array.
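
To see this difference in isolation, here is a minimal micro-benchmark sketch (not code from the PR) comparing an item assignment through an xarray DataArray with one on its underlying numpy array:

    import timeit

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros(1000), dims="trajectory")
    arr = da.data  # the underlying numpy array, not a copy

    # Assignment through the DataArray goes through xarray's indexing machinery
    t_xr = timeit.timeit("da[5] = 1.0", globals={"da": da}, number=100_000)
    # Assignment on the numpy array is a plain item write
    t_np = timeit.timeit("arr[5] = 1.0", globals={"arr": arr}, number=100_000)
    print(f"DataArray: {t_xr:.3f} s, numpy: {t_np:.3f} s")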

Note that the time-as-float change does not seem to further improve performance on top of the setattr change, probably because setting an np.datetime64 or np.timedelta64 value was even slower than setting a value in a float array.

Now, to further explore what kind of performance boost we can expect, I also wrote a PR that does not set particle values at all (Parcels-code/Parcels#2093). Here, the __setattr__ function is simply

def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        pass  # silently drop all particle attribute writes (benchmark-only)

This leads to a speed increase of 95% compared to v4-dev (cyan line). But this PR is of course not a suitable solution, as it doesn't allow changing particle properties (so it can't be used to move particles around). It only works in this particular benchmark because we use the DoNothing() Kernel.

So in summary: a major bottleneck seems to be the __setattr__ in the new xarray-based particle class. In v3 we use a dictionary of arrays, so I'll now spend some time to see what the performance gain is if we change the v4 code from xarray to a dictionary; a sketch of that approach is below.
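
As a sketch, the dict-of-arrays variant of the __setattr__ above could look like this (a simplification for illustration, not the actual code of the PR):

    import numpy as np


    class Particle:
        def __init__(self, data, index):
            # Bypass our own __setattr__ for the two bookkeeping attributes
            object.__setattr__(self, "_data", data)
            object.__setattr__(self, "_index", index)

        def __getattr__(self, name):
            return self._data[name][self._index]

        def __setattr__(self, name, value):
            if name in ["_data", "_index"]:
                object.__setattr__(self, name, value)
            else:
                # _data is a plain dict of numpy arrays, so this is a direct
                # numpy item write with no xarray machinery involved
                self._data[name][self._index] = value


    data = {"lon": np.zeros(100), "lat": np.zeros(100)}
    p = Particle(data, index=3)
    p.lon = 1.5  # writes straight into data["lon"][3]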

@erikvansebille (Member, Author) commented Jul 18, 2025

Another update: I changed the data structure for ParticleSet._data from an xarray.Dataset to a dictionary of numpy arrays in Parcels-code/Parcels#2094. This again gives a big performance increase; see the graph below (darker blue line), now on a logarithmic y-scale.

[Screenshot: timing vs npart for the different branches, logarithmic y-scale]

In summary, we are now down to:

| branch | time [s] for npart=500 |
| --- | --- |
| v3-JIT | 0 |
| v3-Scipy | 2 |
| v4-dev | 178 |
| time as float (Parcels-code/Parcels#2090) | 112 |
| xarray.data setattr (Parcels-code/Parcels#2092) | 30 |
| no setattr (Parcels-code/Parcels#2093) | 9 |
| using dict (Parcels-code/Parcels#2094) | 5 |

@erikvansebille (Member, Author)

I added a few smaller fixes in Parcels-code/Parcels#2096, which increase the kernel-loop performance a bit more.

For reference, in combination with Parcels-code/Parcels#2094, the time for npart=500 drops from 5 seconds to 2 seconds, getting us to a similar speed as v3-Scipy!

@VeckoTheGecko (this comment was marked as outdated)

@erikvansebille (Member, Author)

> Ran some profiling for the branches

Is this with the latest versions of the PRs, where Parcels-code/Parcels#2096 has been merged in?

@VeckoTheGecko (Contributor)

No, I'll re-run

@erikvansebille (Member, Author) commented Jul 18, 2025

Another update on the timings, now with the kernel-loop optimisation changes (Parcels-code/Parcels#2096) included, and extended to npart=1500:
[Screenshot: timing vs npart for the different branches, extended to npart=1500]

| branch | time [s] for npart=500 |
| --- | --- |
| v3-JIT | 0 |
| v3-Scipy | 2 |
| v4-dev | 149 |
| time as float (Parcels-code/Parcels#2090) | 112 |
| xarray.data setattr (Parcels-code/Parcels#2092) | 20 |
| no setattr (Parcels-code/Parcels#2093) | 0 |
| using dict (Parcels-code/Parcels#2094) | 2 |

So we are now slightly faster than v3-Scipy with Parcels-code/Parcels#2094!

And the fact that Parcels-code/Parcels#2093 takes effectively 0 seconds means that we have no serious overhead except the ParticleSet __setattr__.

@VeckoTheGecko (Contributor) commented Jul 18, 2025

Ran some (updated) profiling for the branches:

time as float (Parcels-code/Parcels#2090) (NOT updated with Parcels-code/Parcels#2096)

[Profiling call graph: log-particle_time_as_float.prof]

xarray.data setattr (Parcels-code/Parcels#2092)

[Profiling call graph: log-pset_performance.prof]

no setattr (Parcels-code/Parcels#2093)

[Profiling call graph: log-no_particle_attrsetting.prof]

@erikvansebille (Member, Author)

OK, and do you also have the profile of Parcels-code/Parcels#2094?

@VeckoTheGecko (Contributor) commented Jul 18, 2025

> OK, and do you also have the profile of OceanParcels/Parcels#2094?

Ok, just made it:
[Profiling call graph: log-particledata_as_dict.prof]

@VeckoTheGecko (Contributor)

Some of the graphs got really messy, so I did some tuning of the percentage threshold for which nodes to show (in some graphs I trimmed off anything below 1%).
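
For reference, call graphs like these can be generated from a cProfile dump (the exact tooling isn't stated in this thread, and `run_benchmark` below is a hypothetical stand-in for the benchmark script):

    import cProfile
    import pstats

    # Profile one benchmark run and dump the stats to disk
    cProfile.run("run_benchmark(npart=500)", "log-pset_performance.prof")

    # Print the hottest calls; renderers such as gprof2dot can turn the same
    # .prof file into a call-graph image, trimming nodes below a chosen threshold
    stats = pstats.Stats("log-pset_performance.prof")
    stats.sort_stats("cumulative").print_stats(20)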

@erikvansebille (Member, Author)

Thanks for these profiling graphs! Would you agree that they clearly show that calling Particle.__setattr__ when the particle data is a dictionary of numpy arrays (Parcels-code/Parcels#2094) is much faster than the same operation when the particle data is an xarray Dataset, even when we directly set DataArray.data[index] (Parcels-code/Parcels#2092)?

@VeckoTheGecko (Contributor) commented Jul 18, 2025

Yes, I don't think there is really a way to get around this in xarray-world (for completeness, over the coming week or two I might put together a post to send to Pangeo asking about this use case, along with a minimal example).

@VeckoTheGecko (Contributor)

One second actually: just looking at the xarray.data setattr run, there are a lot of _construct_dataarray calls coming from this code:

def __getattr__(self, name):
    # self._data[name] on an xarray Dataset builds a fresh DataArray on
    # every attribute read, which is where _construct_dataarray gets called
    return self._data[name].data[self._index]

This means that a new DataArray is constructed on every iteration (plus a bunch of other xarray machinery is invoked). Isn't there a way that we can get the best of both worlds?

  • use xarray for particledata
  • transfer that to dict of arrays (arrays still being the same object in memory as in xarray)
  • make these changes on the dict of arrays which doesn't invoke the rest of the machinery

or is that needlessly complex? (all good if it's the latter)

@erikvansebille (Member, Author)

> Isn't there a way that we can get the best of both worlds?
>
>   • use xarray for particledata
>   • transfer that to dict of arrays (arrays still being the same object in memory as in xarray)
>   • make these changes on the dict of arrays which doesn't invoke the rest of the machinery
>
> or is that needlessly complex? (all good if it's the latter)

It's a really cool and original idea! I just implemented it in Parcels-code/Parcels#2097; is that what you had in mind?
As expected, the performance in this new branch matches the fast dict-of-numpy-arrays of Parcels-code/Parcels#2094, but with the added benefit that we also keep access to the Dataset.

Long live copy-by-reference ;-)
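
For illustration, a minimal standalone demonstration of why this works (variable names hypothetical, not the actual PR code):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({"lon": ("trajectory", np.zeros(4))})

    # .data hands back the underlying numpy array itself, not a copy,
    # so the dict entries and the Dataset share the same memory
    arrays = {name: ds[name].data for name in ds.data_vars}

    arrays["lon"][2] = 1.5             # fast plain-numpy write
    assert ds["lon"].values[2] == 1.5  # the Dataset sees the same update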

@erikvansebille (Member, Author)

OK, so it turns out that the time-as-float change does have a (small) effect, but only at much larger npart than tested above...

The plot below shows scaling up to npart=100_000
[Screenshot: timing vs npart for the different branches, up to npart=100_000]

JIT in v3 is still blazingly fast (blue line), and the current v4-dev branch (yellow line) is as fast as v3-Scipy (red line); but v4-dev with time treated as float is a factor of 2 faster than the original v4-dev. So perhaps there is some merit to using float instead of np.datetime64 inside the kernel after all...
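
For illustration, a sketch of what treating time as float can mean inside the kernel loop (not the actual implementation of Parcels-code/Parcels#2090): convert to float seconds at the boundary, do the per-iteration arithmetic on plain floats, and convert back only afterwards.

    import numpy as np

    origin = np.datetime64("2025-07-17T00:00:00", "ns")
    t = np.datetime64("2025-07-17T06:00:00", "ns")

    # Float seconds relative to the origin; cheap to increment in the loop
    time_f = (t - origin) / np.timedelta64(1, "s")  # 21600.0, a plain float
    time_f += 300.0                                 # dt = 300 s

    # Convert back to datetime64 only when leaving the kernel loop
    t_new = origin + np.timedelta64(int(time_f * 1e9), "ns")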
