ref: More realistic toy detector benchmarks #885
Conversation
Force-pushed 03e102d to 28a6721
```cpp
if (do_sort) {
    // Sort by theta angle
    const auto traj_comp = [](const auto &lhs, const auto &rhs) {
        return getter::theta(lhs.dir()) < getter::theta(rhs.dir());
```
Are you sorting to get better performance? If so, I would do it with
That's actually a good suggestion!
But the volumes and surfaces in
It might be worth measuring the performance for both cases.
I think memory locality only becomes beneficial if all threads in a warp access the same surface at the same time. Accessing adjacent surfaces won't help much, given that our memory layout is full-SoA.
In the ODD full chain benchmark, I observed that
Ok, correction, there was a bug in the PR. There does seem to be a slight difference:
Sorting on theta sector
Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.74 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 512 KiB (x24)
L3 Unified 32768 KiB (x4)
Load Average: 1.11, 0.98, 1.35
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8 1810918 ns 1805887 ns 346 TracksPropagated=35.4396k/s
CUDA unsync propagation/16 2196059 ns 2190089 ns 321 TracksPropagated=116.89k/s
CUDA unsync propagation/32 2255072 ns 2248088 ns 311 TracksPropagated=455.498k/s
CUDA unsync propagation/64 2835461 ns 2826132 ns 248 TracksPropagated=1.44933M/s
CUDA unsync propagation/128 3910987 ns 3901016 ns 192 TracksPropagated=4.19993M/s
CUDA unsync propagation/256 12241997 ns 12215791 ns 58 TracksPropagated=5.36486M/s
CUDA sync propagation/8 1834123 ns 1829156 ns 353 TracksPropagated=34.9888k/s
CUDA sync propagation/16 2210750 ns 2204672 ns 320 TracksPropagated=116.117k/s
CUDA sync propagation/32 2264046 ns 2257410 ns 310 TracksPropagated=453.617k/s
CUDA sync propagation/64 2891798 ns 2883179 ns 243 TracksPropagated=1.42065M/s
CUDA sync propagation/128 3994973 ns 3984976 ns 186 TracksPropagated=4.11144M/s
CUDA sync propagation/256 12808931 ns 12781180 ns 55 TracksPropagated=5.12754M/s
Sorting on theta directly
Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.6 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 512 KiB (x24)
L3 Unified 32768 KiB (x4)
Load Average: 1.56, 0.98, 1.40
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8 1718087 ns 1713267 ns 367 TracksPropagated=37.3555k/s
CUDA unsync propagation/16 2233579 ns 2227494 ns 315 TracksPropagated=114.927k/s
CUDA unsync propagation/32 2298005 ns 2291165 ns 306 TracksPropagated=446.934k/s
CUDA unsync propagation/64 2697623 ns 2689106 ns 260 TracksPropagated=1.52318M/s
CUDA unsync propagation/128 4019020 ns 4008956 ns 187 TracksPropagated=4.08685M/s
CUDA unsync propagation/256 12336555 ns 12309757 ns 57 TracksPropagated=5.32391M/s
CUDA sync propagation/8 1717021 ns 1712189 ns 372 TracksPropagated=37.379k/s
CUDA sync propagation/16 2233491 ns 2227255 ns 317 TracksPropagated=114.94k/s
CUDA sync propagation/32 2304860 ns 2297486 ns 305 TracksPropagated=445.705k/s
CUDA sync propagation/64 2737803 ns 2728854 ns 257 TracksPropagated=1.501M/s
CUDA sync propagation/128 4113359 ns 4102710 ns 182 TracksPropagated=3.99346M/s
CUDA sync propagation/256 12920203 ns 12892396 ns 54 TracksPropagated=5.08331M/s
General question, does this allow us to preserve the old behaviour?
What do you mean by old behaviour? Should I add an option to also use the uniform track generator?
Force-pushed bd51ba0 to 3e48743
Also added charge randomization now, which mainly harms the low-track-multiplicity benchmark cases.
Force-pushed 3e48743 to 5b367b8
stephenswat left a comment:
Given the detray meeting today, let's go ahead and approve this.
Use random, but sorted, tracks for the benchmarks and copy the detector to device. I also changed the eta range to [-4, 4] and the momentum range to 1 GeV - 100 GeV.