Conversation

@erikvansebille
Member

Meant to compare the spatial hashing (Parcels-code/Parcels#2132) against the old (vectorized) search

@fluidnumerics-joe
Contributor

Just posting a baseline bulk runtime estimate on my system for this benchmark with parcels v4-dev at Parcels-code/Parcels@add8158.

Runtime (41s)

(parcels) [joe@claymation parcels-benchmarks]$ python MOi-Curvilinear/benchmark_index_search.py 
Running 1 particles with parcels v4
Execution time: 41 seconds

System specs

lstopo

(parcels) [joe@claymation parcels-benchmarks]$ lstopo
Machine (15GB total)
  Package L#0
    NUMANode L#0 (P#0 15GB)
    L3 L#0 (12MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (2048KB)
        L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2 + PU L#4 (P#4)
        L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3 + PU L#5 (P#5)
        L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4 + PU L#6 (P#6)
        L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5 + PU L#7 (P#7)
      L2 L#3 (2048KB)
        L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6 + PU L#8 (P#8)
        L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7 + PU L#9 (P#9)
        L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#10 (P#10)
        L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#11 (P#11)
  HostBridge
    PCI 00:02.0 (VGA)
    PCIBridge
      PCI 01:00.0 (NVMExp)
        Block(Disk) "nvme0n1"
    PCI 00:14.3 (Network)
      Net "wlp0s20f3"
    PCIBridge
      PCI 2e:00.0 (NVMExp)
        Block(Disk) "nvme1n1"

OS details

$ cat /etc/os-release 
NAME="Rocky Linux"
VERSION="9.6 (Blue Onyx)"
ID="rocky"

tuned profile

(parcels) [joe@claymation parcels-benchmarks]$ tuned-adm active
Current active profile: balanced

@fluidnumerics-joe
Contributor

fluidnumerics-joe commented Aug 15, 2025

By doubling the hash grid cell size (computed in parcels.spatialhash.SpatialHash._hash_cell_size), I can cut the runtime in half:

(parcels) [joe@claymation parcels-benchmarks]$ python MOi-Curvilinear/benchmark_index_search.py
Running 1 particles with parcels v4
Execution time: 21 seconds

Nearly all of the runtime savings here comes from the initialization. This result makes sense: increasing the hash cell size reduces the inner-loop size in the hash table initialization, giving fewer .append calls on the list-of-lists hash table.
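
For intuition, here is a minimal, hypothetical sketch of that construction pattern (build_hash_table, quad_bboxes and cell_size are illustrative names, not the Parcels API): each grid quad is appended to every hash cell its bounding box overlaps, so larger hash cells mean fewer (cell, quad) pairs and fewer append calls.

import numpy as np

def build_hash_table(quad_bboxes, cell_size):
    """quad_bboxes: (n_quads, 4) array of [xmin, xmax, ymin, ymax] per grid quad."""
    table = {}  # (i, j) hash cell -> list of quad indices
    for q, (xmin, xmax, ymin, ymax) in enumerate(quad_bboxes):
        i0, i1 = int(xmin // cell_size), int(xmax // cell_size)
        j0, j1 = int(ymin // cell_size), int(ymax // cell_size)
        # These inner loops shrink as cell_size grows: fewer appends per quad.
        for i in range(i0, i1 + 1):
            for j in range(j0, j1 + 1):
                table.setdefault((i, j), []).append(q)
    return table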

@fluidnumerics-joe
Contributor

fluidnumerics-joe commented Aug 15, 2025

Here are a few results from playing with the hash grid cell size factor (the parameter that multiplies the median quad area) for the single-particle MOi benchmark. I've also split apart the timing for the hash grid initialization and the actual search; the timing pattern is sketched after the numbers below.

Running 1 particles with parcels v4
  Spatial hash factor: 0.5
Spatial hash construction time: 40.19 seconds
Search time: 0.00 seconds

  Spatial hash factor: 1.0
Spatial hash construction time: 22.69 seconds
Search time: 0.00 seconds

  Spatial hash factor: 1.5
Spatial hash construction time: 16.79 seconds
Search time: 0.00 seconds

  Spatial hash factor: 2.0
Spatial hash construction time: 13.68 seconds
Search time: 0.00 seconds

  Spatial hash factor: 2.5
Spatial hash construction time: 12.34 seconds
Search time: 0.00 seconds

  Spatial hash factor: 3.0
Spatial hash construction time: 11.79 seconds
Search time: 0.00 seconds

  Spatial hash factor: 3.5
Spatial hash construction time: 11.43 seconds
Search time: 0.00 seconds

  Spatial hash factor: 4.0
Spatial hash construction time: 11.60 seconds
Search time: 0.01 seconds
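
The split above can be reproduced with a simple pattern along these lines (a hypothetical sketch; timed is not part of the benchmark script, and the two stand-in callables take the place of the grid/spatial-hash construction and the fieldset.U.grid.search call):

import time

def timed(label, fn, *args, **kwargs):
    """Run fn once and print its wall-clock time in the format used above."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - t0:.2f} seconds")
    return result

# Stand-ins for the real calls: grid construction and grid.search(...).
grid = timed("Spatial hash construction time", lambda: object())
_ = timed("Search time", lambda: None)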

@erikvansebille - tbh, I'm not sure how you're getting runtimes for multiple particles; it must not be on v4-dev. I'm crashing during the search as soon as I go to more than 1 particle in this benchmark

Traceback (most recent call last):
  File "/home/joe/Projects/Geomar-Utrecht/parcels-benchmarks/MOi-Curvilinear/benchmark_index_search.py", line 123, in <module>
    main()
  File "/home/joe/Projects/Geomar-Utrecht/parcels-benchmarks/MOi-Curvilinear/benchmark_index_search.py", line 119, in main
    run_benchmark(args.memory)
  File "/home/joe/Projects/Geomar-Utrecht/parcels-benchmarks/MOi-Curvilinear/benchmark_index_search.py", line 95, in run_benchmark
    fieldset.U.grid.search(0, lat, lon)
  File "/home/joe/miniconda3/envs/parcels/lib/python3.12/site-packages/parcels/xgrid.py", line 302, in search
    yi, eta, xi, xsi = _search_indices_curvilinear_2d(self, y, x, yi, xi)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/joe/miniconda3/envs/parcels/lib/python3.12/site-packages/parcels/_index_search.py", line 316, in _search_indices_curvilinear_2d
    if abs(aa) < 1e-12:  # Rectilinear cell, or quasi

Edit: my current workaround is to use the _search_indices_curvilinear_2d from your vectorized branch.

@erikvansebille
Member Author

Thanks for this analysis, Joe!

@erikvansebille - tbh, I'm not sure how you're getting runtimes for multiple particles; it must not be on v4-dev. I'm crashing during the search as soon as I go to more than 1 particle in this benchmark

Indeed, I do all the simulations on the vectorized kernel branch from Parcels-code/Parcels#2122; I hope to merge that branch into v4-dev any day now, once @VeckoTheGecko has done a final review. So it is indeed best to already do all performance benchmarking on that branch.

On Friday, I started work on the VectorField interpolation (Parcels-code/Parcels#2152) and found that I got significant speedups by loading the curvilinear lon and lat arrays (Parcels-code/Parcels@2da436a) for the NemoCurvilinear_data test (Parcels-code/Parcels@1bd7e0c). Because curvilinear lon and lat are not xarray dimensions, they seem not to be loaded by default, which triggered a separate load on each call to xgrid.lon and xgrid.lat.
Is this something that would help speed up the curvilinear search in general?
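
In a nutshell, the idea amounts to something like this sketch (illustrative only; the file and variable names are placeholders, not the actual commit or the Parcels API): the 2D curvilinear lon/lat are coordinate variables rather than dimensions, so xarray keeps them lazy, and loading them once avoids a fresh read on every access.

import xarray as xr

ds = xr.open_dataset("curvilinear_grid.nc")      # placeholder file name
lon2d = ds["nav_lon"].load().values              # one eager read ...
lat2d = ds["nav_lat"].load().values              # ... then reuse the NumPy arrays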

@fluidnumerics-joe
Contributor

[Charts: Search Time vs Hash Grid Scaling Factor; Construction Time vs Hash Grid Scaling Factor]

I've done a small sweep to see how changing the hash grid cell size influences construction and search time. In general, increasing the hash grid cell size (increasing the "hash grid factor") reduces the initialization cost, but it can increase search times. For the search times, I still want to dig in to see how much time is spent in the spatial hash query versus the rest of the search.
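
To make the trade-off concrete, continuing the hypothetical list-of-lists sketch from earlier: a query only has to look up one hash cell, but the candidate list it returns grows with the hash cell size, so each search has more point-in-quad tests to do.

def candidate_quads(table, x, y, cell_size):
    """Quad indices that still need a point-in-quad test for the point (x, y)."""
    cell = (int(x // cell_size), int(y // cell_size))
    return table.get(cell, [])   # this candidate list grows as cell_size grows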

@fluidnumerics-joe
Contributor

On Friday, I started work on the VectorField interpolation (Parcels-code/Parcels#2152) and found that I got significant speedups by loading the curvilinear lon and lat arrays (Parcels-code/Parcels@2da436a) for the NemoCurvilinear_data test (Parcels-code/Parcels@1bd7e0c). Because curvilinear lon and lat are not xarray dimensions, they seem not to be loaded by default, which triggered a separate load on each call to xgrid.lon and xgrid.lat.
Is this something that would help speed up the curvilinear search in general?

Potentially. I'll pull profiles this morning :)

@fluidnumerics-joe
Contributor

I looked at another potential implementation of the hash table initialization that reduces the inner-loop operations and eliminates the append operations, by moving to a compressed sparse row (CSR) storage format for the hash table. Initial results are not too good compared to what we originally had. I'm still giving this some thought.
[Chart: initial results for the CSR variant compared with the original implementation]
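
For concreteness, a CSR-style hash table keeps one flat array of quad indices plus an offsets array. The sketch below (illustrative only, not the code behind the chart, and simplified to one hash cell per quad) shows an append-free construction and the query slice.

import numpy as np

def build_csr_hash(cell_ids, n_cells):
    """cell_ids[q] is the hash cell of quad q (one cell per quad, for simplicity)."""
    counts = np.bincount(cell_ids, minlength=n_cells)   # occupancy per hash cell
    offsets = np.concatenate(([0], np.cumsum(counts)))
    entries = np.argsort(cell_ids, kind="stable")       # quad indices grouped by cell
    return offsets, entries

# Quads in hash cell k, with no Python lists involved:
# entries[offsets[k]:offsets[k+1]]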

@fluidnumerics-joe
Contributor

I also found a bug in how we treat wrap-arounds at the antimeridian. It creates a few rows in the hash table that are quite long, which could well be associated with the long construction and search times. I'm working on a fix now.

@fluidnumerics-joe
Contributor

While resolving the issues related to crossing the antimeridian, I've tracked down a few more problematic cells in the MOi dataset; there are some nearly collinear elements near the north pole that give us some additional headaches. The bounding box and hash cell indices are given below.

[4803 4803] [10080 10080]
x : -16.45905303955078, 163.24029541015625
y : 89.95170593261719, 89.9555892944336

To avoid these kinds of issues, grids with the spherical type are going to have to be converted to Cartesian coordinates (on a unit sphere).
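
Concretely, that conversion is the standard lon/lat to 3D unit-sphere transform, sketched below (illustrative only, not the eventual Parcels code); in 3D Cartesian coordinates there is no antimeridian jump and near-polar quads are no longer degenerate.

import numpy as np

def lonlat_to_unit_xyz(lon_deg, lat_deg):
    """Map longitude/latitude in degrees to (x, y, z) on the unit sphere."""
    lon = np.deg2rad(np.asarray(lon_deg))
    lat = np.deg2rad(np.asarray(lat_deg))
    return np.stack(
        [np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)],
        axis=-1,
    )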

@erikvansebille
Member Author

Thanks for digging deeper, Joe!

To avoid these kinds of issues, grids with the spherical type are going to have to be converted to Cartesian coordinates (on a unit sphere).

What do you mean by that? I don't think I understand. Anything I can help with?

@erikvansebille
Member Author

I've drafted a PR with a (failing) unit test at Parcels-code/Parcels#2153 that you could use to debug the spatial hash function. It uses the NEMO_Curvilinear dataset, which is a bit smaller than the MOi one used here, but has the same issue that the antimeridian lies in the domain; see the plot below.

[Image: plot of the NEMO_Curvilinear domain, showing the antimeridian in the domain]

Does this help you with debugging?

@erikvansebille
Member Author

I just did the full benchmarking for the Morton encoding in Parcels-code/Parcels#2158. It really is a very impressive speedup compared to the spatial hashing that you introduced in Parcels-code/Parcels#2132, @fluidnumerics-joe!

See the runtime below:
[Screenshot 2025-08-25: runtime vs number of particles]

On my MacBook, the runtime is < 1 second up to 500k particles, and the initialisation is only 1 second too. For 1M particles, the runtime is 2 seconds, which is a factor of 300(!) faster than the 604 seconds for the spatial hash.
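
As a refresher on what the encoding does (a generic illustration, not the code in Parcels-code/Parcels#2158): a 2D Morton/Z-order key interleaves the bits of the two cell indices, so sorting by the key keeps spatially nearby cells nearby in memory.

def _part1by1(v: int) -> int:
    """Spread the lower 16 bits of v so a zero bit sits between each original bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton2d(i: int, j: int) -> int:
    """Interleave the bits of cell indices (i, j) into a single Z-order key."""
    return _part1by1(i) | (_part1by1(j) << 1)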

The performance worsens for the 5M-particle case (which takes 668 seconds), probably because of memory issues. See the peak memory use below. The memory for 5M particles is >6 GB, which may have led to some swapping to disk on my laptop.

[Screenshot 2025-08-25: peak memory use vs number of particles]

Note also that the memory load for the Morton encoding is in general 5 times higher than for the spatial hashing. Any initial idea why the memory footprint is so much larger, @fluidnumerics-joe? Reducing it is perhaps something to explore further down the line.

@erikvansebille
Member Author

Here is an update of the search_indices timings with the new Morton implementation in Parcels-code/Parcels#2177 (cyan lines):

[Screenshot 2025-09-09: search_indices timings, including the new Morton implementation (cyan lines)]

The initialisation is a bit slower (6 seconds on my MacBook), but the factor 10 speed-up for 5M particles is very nice (from 11 minutes to 50 seconds)

The peak memory use is a lot higher again, though...

[Screenshot 2025-09-09: peak memory use]

@fluidnumerics-joe
Contributor

I'll see if there's room to trim back the memory usage. There are a few spots where uint32 would suffice but uint64 is being used. This seems to be the price we pay for vectorization, though: getting rid of those for loops required masks and quick lookup tables. But perhaps there may be room for reuse? I'll take another pass.
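
As a rough illustration of the dtype savings being considered (the array name and size are just an example, using the 5M-particle case from the benchmark above): downcasting an index array from uint64 to uint32 halves its footprint, as long as every index stays below 2**32.

import numpy as np

n = 5_000_000                              # 5M particles, as in the benchmark above
keys64 = np.zeros(n, dtype=np.uint64)
keys32 = keys64.astype(np.uint32)          # safe while every index is < 2**32
print(keys64.nbytes / 1e6, "MB ->", keys32.nbytes / 1e6, "MB")   # 40.0 MB -> 20.0 MB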
