Basic benchmark script for a very simple kernel loop #1
Conversation
OK, so a first morning of benchmarking on this very simple script resulted in the graph below. Parcels v3 is extremely fast, and the current v4-dev is very slow.

I first tried changing back to using time as a float (Parcels-code/Parcels#2090). Some further digging led to the realisation that the main bottleneck is actually the updating of the value of a particle attribute, which happens in `__setattr__`:

```python
def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        self._data[name][self._index] = value
```

That last line is very slow, and simply changing it to write into the underlying numpy array (Parcels-code/Parcels#2092)

```diff
- self._data[name][self._index] = value
+ self._data[name].data[self._index] = value
```

improved the performance by 85% compared to the current v4-dev. Note that the time-as-float change does not seem to further improve performance on top of this.

Now, to further explore what kind of performance boost we can expect, I also wrote a PR that does not set the value of particles at all (Parcels-code/Parcels#2093). Here, the `__setattr__` function is simply

```python
def __setattr__(self, name, value):
    if name in ["_data", "_index"]:
        object.__setattr__(self, name, value)
    else:
        pass
```

This leads to a speed increase of 95% compared to v4-dev.

So in summary: a major bottleneck seems to be the `__setattr__` in the new xarray particle class. In v3 we use a dictionary of arrays, so I'll spend some time now to see what the performance gain is if we change the v4 code from xarray to a dictionary.
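To make the difference concrete, here is a minimal stand-alone microbenchmark of the two assignment paths. This is an illustrative sketch, not code from the PRs, and it assumes the particle data lives in a numpy-backed xarray Dataset with a `trajectory` dimension:

```python
import timeit

import numpy as np
import xarray as xr

npart = 1000
ds = xr.Dataset({"lon": ("trajectory", np.zeros(npart))})

def set_via_xarray():
    ds["lon"][0] = 1.0  # goes through xarray's indexing machinery

def set_via_numpy():
    ds["lon"].data[0] = 1.0  # writes straight into the underlying numpy buffer

print("via xarray:", timeit.timeit(set_via_xarray, number=10_000))
print("via .data :", timeit.timeit(set_via_numpy, number=10_000))
```

The `.data` path avoids constructing new DataArray and indexer objects on every write, which is consistent with the 85% improvement reported above.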
Another update: I changed the data structure for the particle data from an xarray Dataset to a dictionary of numpy arrays, as suggested above.

In summary, we are now down to a small fraction of the original v4-dev runtime.
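For illustration, a minimal sketch of the dictionary-of-arrays approach (hypothetical names, not the actual v4 code): each attribute is a plain numpy array, so setting a particle attribute reduces to a single numpy item write.

```python
import numpy as np

class Particle:
    """Lightweight view onto one row of a dictionary of per-attribute arrays."""

    def __init__(self, data, index):
        object.__setattr__(self, "_data", data)
        object.__setattr__(self, "_index", index)

    def __getattr__(self, name):
        return self._data[name][self._index]

    def __setattr__(self, name, value):
        if name in ["_data", "_index"]:
            object.__setattr__(self, name, value)
        else:
            self._data[name][self._index] = value  # plain numpy item write

# The particle set holds one numpy array per attribute:
data = {"lon": np.zeros(1000), "lat": np.zeros(1000)}
p = Particle(data, index=42)
p.lon = 3.14
assert data["lon"][42] == 3.14
```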
I added a few smaller fixes in Parcels-code/Parcels#2096, which increase the kernel loop performance a bit more. For reference, in combination with Parcels-code/Parcels#2094, the time for `pset.execute()` drops further.
Is this with the latest versions of the PRs, where Parcels-code/Parcels#2096 has been merged in?
No, I'll re-run.
Another update on the timings, now with the changes from the kernel loop optimisation (Parcels-code/Parcels#2096) included (and extended to larger numbers of particles).

So we are now slightly faster than v3-Scipy for Parcels-code/Parcels#2094! And the fact that Parcels-code/Parcels#2093 takes effectively 0 seconds means that we have no serious overhead except the ParticleSet `__setattr__`.
Ran some (updated) profiling for the branches:

- time as float (Parcels-code/Parcels#2090) (NOT updated with Parcels-code/Parcels#2096)
- xarray `.data` setattr (Parcels-code/Parcels#2092)
- no setattr (Parcels-code/Parcels#2093)
OK, and do you also have the profile of Parcels-code/Parcels#2094?
Some of the graphs got really messy, so I tuned the percentage threshold for what to show (in some graphs I trimmed off anything below 1%).
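The exact profiling tooling isn't stated in the thread; one common way to produce call graphs with a percentage threshold like this is `cProfile` plus `gprof2dot`, sketched here on a stand-in workload:

```python
import cProfile
import pstats

def kernel_loop():
    # Stand-in for the real benchmark, e.g. pset.execute(...) from the script above.
    total = 0.0
    for i in range(100_000):
        total += i * 0.5
    return total

cProfile.run("kernel_loop()", "kernel_loop.prof")
pstats.Stats("kernel_loop.prof").sort_stats("cumulative").print_stats(20)

# Render a call graph, trimming nodes below 1% (analogous to the trimming above):
#   gprof2dot -f pstats --node-thres=1 kernel_loop.prof | dot -Tpng -o profile.png
```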
Thanks for these profiling graphs! Would you agree that they clearly show that calling `__setattr__` on the xarray-backed particle data is the main bottleneck?
Yes - I don't think that there is really a way to get around this in xarray-world. (For completeness, over the coming week or two I might put together a post to send to Pangeo asking about this use case, along with a minimal example.)
1 second actually, just looking at the timings. This means that on every iteration a new data array is constructed (plus a bunch of other xarray machinery is invoked). Isn't there a way that we can get the best of both worlds? For instance, keeping the xarray Dataset but writing directly into its underlying numpy buffers?

Or is that needlessly complex? (All good if it's the latter.)
> Isn't there a way that we can get the best of both worlds?
It's a really cool and original idea! I just implemented it in Parcels-code/Parcels#2097; is that what you had in mind? Long live copy-by-reference ;-)
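For readers following along, a minimal sketch of the copy-by-reference idea (an illustration, not the actual Parcels-code/Parcels#2097 code): for a numpy-backed Dataset, `DataArray.data` exposes the underlying buffer, so a dictionary of those buffers can be written through cheaply while the Dataset still sees every update.

```python
import numpy as np
import xarray as xr

npart = 1000
ds = xr.Dataset(
    {
        "lon": ("trajectory", np.zeros(npart)),
        "lat": ("trajectory", np.zeros(npart)),
    }
)

# Grab views onto the Dataset's numpy buffers once, up front.
views = {name: ds[name].data for name in ds.data_vars}

# Fast path: the kernel loop writes through the numpy views...
views["lon"][42] = 3.14

# ...and the Dataset sees the update, because no copy was made.
assert float(ds["lon"][42]) == 3.14
```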




A first, very simple script that can be used for benchmarking the Kernel-loop itself. Since the Kernel is doing nothing, there is no field access or call to interpolation methods.
With the current commit of v4-dev (488e3fb), the scaling is pretty poor.
The timing on my machine for `pset.execute()` for `npart = 1, 10, 100, 1000 and 2500` particles is 0:01, 0:10, 1:36, 15:48 and 37:57 minutes, respectively. That's a nicely linear scaling, but of course not at all efficient.

For `main` (i.e. Parcels v3), `pset.execute()` takes 0 seconds for all values of `npart`, irrespective of the number of particles.

I'll start digging into this poor scaling of the Kernel/`pset.execute()` loop with particle number, and hopefully find an improvement soon.
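The script itself isn't reproduced in this thread; below is a hedged sketch of what such a benchmark can look like against the v3-style API (`FieldSet.from_data`, `ScipyParticle`, and the do-nothing kernel are assumptions here, not the actual script):

```python
from datetime import timedelta
from time import perf_counter

import numpy as np
import parcels

# Tiny zero-velocity field, so the kernel loop itself dominates the cost.
fieldset = parcels.FieldSet.from_data(
    {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
    {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
)

def DoNothing(particle, fieldset, time):
    pass

for npart in [1, 10, 100, 1000, 2500]:
    pset = parcels.ParticleSet(
        fieldset,
        pclass=parcels.ScipyParticle,
        lon=np.zeros(npart),
        lat=np.zeros(npart),
    )
    tic = perf_counter()
    pset.execute(DoNothing, runtime=timedelta(days=1), dt=timedelta(minutes=5))
    print(f"npart={npart}: {perf_counter() - tic:.1f} s")
```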