Cell Sorting, main branch (2026.02.20.)#1264

Open
krasznaa wants to merge 7 commits into acts-project:main from krasznaa:CellSorting-main-20260219

@krasznaa
Member

After earlier discussions about how fast we can be with sorting cells as part of the throughput measurements, I spent some time putting together some code for this.

I introduced "cell sorting algorithms" for all backends, in pretty much the same way in which the measurement sorting algorithms are implemented.

Then I taught traccc::io::read_cells(...) how to randomize the order of the cells on request. I did it like this because the CSV reading code is fundamentally set up such that it would output a sorted vector of cells. Instead of completely re-thinking the logic of the I/O code, it was easier to add a shuffling step at the end. (When the user asks for it.)
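The "shuffle at the end" idea can be sketched roughly as below. This is only an illustration under stated assumptions: the `cell` struct, the `shuffle_cells` function and the fixed seed are hypothetical stand-ins, not the traccc EDM or the actual `traccc::io::read_cells(...)` interface.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical stand-in for a cell record; NOT the traccc EDM type.
struct cell {
    unsigned int module;  // index of the module the cell belongs to
    unsigned int row;     // channel index along the first axis
    unsigned int column;  // channel index along the second axis
};

// Randomize the order of an (already sorted) cell vector in place.
// The seed parameter is an assumption made here so that the shuffled
// order is reproducible between runs.
void shuffle_cells(std::vector<cell>& cells, unsigned int seed = 42u) {
    std::mt19937 engine{seed};
    std::shuffle(cells.begin(), cells.end(), engine);
}
```

A reader function could call such a helper as its very last step, and only when the caller asks for it, which leaves the rest of the reading logic untouched.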

Finally I updated the throughput measurement applications to:

  • shuffle the cells that they read into host memory;
  • make use of the appropriate cell sorting algorithm as part of their data processing.

Unfortunately the result is slightly worse than what I was hoping for. 😦 With the current main branch I see the following (reference) throughput on our trusty ol' A5000:

[bash][pcadp04]:traccc > ./build_current/bin/traccc_throughput_mt_cuda --input-directory /data/Acts/odd-simulations-20240509/geant4_ttbar_mu200/ --input-events=20 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150 --finding-run-mbf-smoother=false --processed-events=500 --deterministic --cpu-threads=8
...
Warm-up processing [==================================================] 100% [00m:00s]                                            
Event processing   [==================================================] 100% [00m:00s]                                            
04:34:29 PM ThroughputExample             INFO      Reconstructed track parameters: 2622220
04:34:29 PM ThroughputExample             INFO      Time totals:                   File reading  1249 ms
04:34:29 PM ThroughputExample             INFO                  Warm-up processing  153 ms
04:34:29 PM ThroughputExample             INFO                    Event processing  5087 ms
04:34:29 PM ThroughputExample             INFO      Throughput:            Warm-up processing  15.3186 ms/event, 65.2802 events/s
04:34:29 PM ThroughputExample             INFO                    Event processing  10.1754 ms/event, 98.2765 events/s
[bash][pcadp04]:traccc >

With the extra sorting step added, I get:

[bash][pcadp04]:traccc > ./build_new/bin/traccc_throughput_mt_cuda --input-directory /data/Acts/odd-simulations-20240509/geant4_ttbar_mu200/ --input-events=20 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150 --finding-run-mbf-smoother=false --processed-events=500 --deterministic --cpu-threads=8
...
Warm-up processing [==================================================] 100% [00m:00s]                                            
Event processing   [==================================================] 100% [00m:00s]                                            
04:35:53 PM ThroughputExample             INFO      Reconstructed track parameters: 2622229
04:35:53 PM ThroughputExample             INFO      Time totals:                   File reading  975 ms
04:35:53 PM ThroughputExample             INFO                  Warm-up processing  161 ms
04:35:53 PM ThroughputExample             INFO                    Event processing  5510 ms
04:35:53 PM ThroughputExample             INFO      Throughput:            Warm-up processing  16.1115 ms/event, 62.0676 events/s
04:35:53 PM ThroughputExample             INFO                    Event processing  11.0217 ms/event, 90.7302 events/s
[bash][pcadp04]:traccc >

So the cell sorting adds almost an entire millisecond to the event processing. 😦 Way more than I was expecting...

I didn't do any deeper profiling on the sorting code. It's not impossible that it could still be improved. It's also worth remembering that the random shuffling of the cells that the code does is a much worse scenario than what we would ever get from real data, even under the least ideal circumstances.

Still, I was hoping for a quicker sorting, even with all this taken into account. 🤔

Pinging @flg, @paradajzblond.

@krasznaa krasznaa requested a review from stephenswat February 20, 2026 15:40
@krasznaa krasznaa added the labels cuda (Changes related to CUDA), sycl (Changes related to SYCL), cpu (Changes related to CPU code) and alpaka (Changes related to Alpaka) Feb 20, 2026
if (rhs >= cells.size()) {
    return true;
}
return cells.at(lhs) < cells.at(rhs);
Member Author

@stephenswat, "how sorted" do the cells actually need to be? 🤔 The I/O code was "fully" sorting them so far, so I went for the same in these algorithms. But is this necessary? Would it maybe be enough to just do the same that we do for the measurements? (That cells belonging to the same module would be side-by-side. But not necessarily in the correct order.)

Contributor

Stephen confirmed recently that they need to be grouped (contiguous) by module and then sorted by row and column indices.

Member Author

@krasznaa krasznaa Feb 20, 2026

Yeah, this is what I remembered. Still, was hoping that I misremembered...

In the end this is exactly what the EDM defines currently.

https://github.com/acts-project/traccc/blob/main/core/include/traccc/edm/impl/silicon_cell_collection.ipp#L52-L64
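As a rough illustration of the ordering requirement discussed in this thread (cells contiguous by module, then ordered by row and column), a minimal sketch could look like this. It is not the traccc EDM code: the `cell` struct and its field names are hypothetical, and the relative precedence of row and column is an assumption that should be checked against the linked implementation.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a cell record; NOT the traccc EDM type.
struct cell {
    unsigned int module;  // index of the module the cell belongs to
    unsigned int row;     // channel index along the first axis
    unsigned int column;  // channel index along the second axis
};

// Strict weak ordering: module first, then row, then column.
// (Row-before-column is an assumption made for this sketch.)
bool cell_less(const cell& lhs, const cell& rhs) {
    return std::tie(lhs.module, lhs.row, lhs.column) <
           std::tie(rhs.module, rhs.row, rhs.column);
}

// "Full" sort: afterwards the cells of each module are contiguous
// and ordered within the module.
void sort_cells(std::vector<cell>& cells) {
    std::sort(cells.begin(), cells.end(), cell_less);
}
```

Sorting on the module index alone would give the weaker "side-by-side but not ordered" layout mentioned above for the measurements.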

@flg
Contributor

flg commented Feb 20, 2026

This is quite interesting, thank you for providing this. So 0.85 ms to sort the completely randomized cells of a PU200 event.

First, just to make sure that our numbers are comparable: how many cells is this? For ITk we have on average 1.1e6 cells per ttbar, pu200 event. I expect it to be the same.

If we want to compare with the current scenario, we want to know how long it takes to sort all cells provided that they are already grouped by module. This, I expect, can make a significant difference on your side. To be more cost efficient than the CPU equivalent in this scenario, the GPU needs to do it in 0.54 ms or less.

Made it possible to randomize the order of the cells read from an
input CSV. In order to exercise the newly added cell sorting
algorithms.
@krasznaa krasznaa force-pushed the CellSorting-main-20260219 branch from 7a1f05b to 388eff3 Compare February 20, 2026 16:05
@krasznaa
Member Author

All good/relevant points.

The ODD μ=200 sample contains O(500k) cells per event. So about half of the ITk. 🤔

I'll do a test with just shuffling the cells per module. Let's see how much of a change that will bring. ( 🤞 that a lot...)
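A per-module shuffle of this kind could be sketched as follows. This is a minimal illustration, assuming the cells of each module are already adjacent in the input; the `cell` struct, the `shuffle_within_modules` name and the seed are hypothetical and not taken from the PR.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical stand-in for a cell record; NOT the traccc EDM type.
struct cell {
    unsigned int module;  // index of the module the cell belongs to
    unsigned int row;     // channel index along the first axis
    unsigned int column;  // channel index along the second axis
};

// Shuffle cells only inside each module's contiguous range.
// Precondition (assumed): cells of the same module are adjacent.
void shuffle_within_modules(std::vector<cell>& cells,
                            unsigned int seed = 42u) {
    std::mt19937 engine{seed};
    auto first = cells.begin();
    while (first != cells.end()) {
        // Find the end of the current module's range.
        const unsigned int module = first->module;
        auto last = std::find_if(
            first, cells.end(),
            [module](const cell& c) { return c.module != module; });
        // Randomize the order inside this range only.
        std::shuffle(first, last, engine);
        first = last;
    }
}
```

The module boundaries stay where they were, so a subsequent sort only has to repair the order inside each module.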

@flg
Contributor

flg commented Feb 20, 2026

The ODD μ=200 sample contains O(500k) cells per event. So about half of the ITk. 🤔

This is so odd (pun intended) that it requires double-checking and further investigation.

@sonarqubecloud

Quality Gate failed

Failed conditions
2 Security Hotspots

See analysis details on SonarQube Cloud

@krasznaa
Member Author

This latest version of the code, which only shuffles cells within the same module, runs like this:

05:24:07 PM ThroughputExample             INFO      Reconstructed track parameters: 2622230
05:24:07 PM ThroughputExample             INFO      Time totals:                   File reading  1034 ms
05:24:07 PM ThroughputExample             INFO                  Warm-up processing  158 ms
05:24:07 PM ThroughputExample             INFO                    Event processing  5237 ms
05:24:07 PM ThroughputExample             INFO      Throughput:            Warm-up processing  15.8187 ms/event, 63.2164 events/s
05:24:07 PM ThroughputExample             INFO                    Event processing  10.4753 ms/event, 95.4628 events/s

So Thrust's sorting, as expected, is quite a bit quicker in this case.

@flg
Contributor

flg commented Feb 20, 2026

Awesome. I will now test this with ITk.
