
Conversation

@niermann999
Collaborator

@niermann999 niermann999 commented Nov 14, 2024

Use random, but sorted, tracks for the benchmarks and copy the detector to device. I also changed the eta range to [-4, 4] and the momentum range to 1 GeV - 100 GeV.
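
As a rough illustration of what such a track sample looks like (a self-contained sketch, not detray's actual generator API; the track struct, field names and seed are only illustrative):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Illustrative stand-in for a free track parametrization.
struct track {
    double eta;  // pseudorapidity
    double phi;  // azimuthal angle
    double p;    // momentum magnitude [GeV]
    double theta() const { return 2. * std::atan(std::exp(-eta)); }
};

// Generate n random tracks with eta in [-4, 4] and p in [1 GeV, 100 GeV],
// then sort them by polar angle theta so that neighbouring tracks traverse
// similar regions of the detector.
std::vector<track> random_sorted_tracks(std::size_t n, unsigned seed = 42u) {
    std::mt19937 rng{seed};
    std::uniform_real_distribution<double> eta_dist{-4., 4.};
    std::uniform_real_distribution<double> phi_dist{-M_PI, M_PI};
    std::uniform_real_distribution<double> p_dist{1., 100.};

    std::vector<track> tracks(n);
    for (auto &trk : tracks) {
        trk = {eta_dist(rng), phi_dist(rng), p_dist(rng)};
    }
    std::sort(tracks.begin(), tracks.end(),
              [](const track &lhs, const track &rhs) {
                  return lhs.theta() < rhs.theta();
              });
    return tracks;
}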

@niermann999 niermann999 added refactor refactoring the current codes priority: low Low priority labels Nov 14, 2024
if (do_sort) {
    // Sort by theta angle
    const auto traj_comp = [](const auto &lhs, const auto &rhs) {
        return getter::theta(lhs.dir()) < getter::theta(rhs.dir());
Collaborator


Are you sorting to get better performance? If so, I would do it with $|\pi/2 - \theta|$, given that the toy geometry is symmetric around z=0.

Contributor


That's actually a good suggestion!

Member


But the volumes and surfaces in $z > 0$ are - even though they are geometrically the same as the ones in $z < 0$ - still different entries in memory, right? So perhaps sorting on $\theta$ might still give us better memory locality?

Collaborator

@beomki-yeo beomki-yeo Nov 15, 2024


It might be worth measuring the performance for both cases.
I think memory locality only becomes beneficial if all threads in a warp access the same surface at the same time. Accessing adjacent surfaces won't be very helpful, given that our memory layout is full-SoA.
In the ODD full chain benchmark, I observed that $|\pi/2 - \theta|$ shows better performance than $\theta$.
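
For reference, a minimal sketch of such a comparator (reusing getter::theta from the excerpt above and std::fabs from <cmath>; this is not the PR's actual code):

// Sort by distance to the mid-rapidity plane (|pi/2 - theta|) instead of by
// theta itself, so tracks hitting the mirrored halves (z > 0 vs. z < 0) of the
// toy geometry end up adjacent in the launch order.
const auto traj_comp = [](const auto &lhs, const auto &rhs) {
    constexpr double half_pi{0.5 * M_PI};  // assumes <cmath>
    return std::fabs(half_pi - getter::theta(lhs.dir())) <
           std::fabs(half_pi - getter::theta(rhs.dir()));
};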

This comment was marked as outdated.

Collaborator Author

@niermann999 niermann999 Nov 15, 2024


Ok, correction: there was a bug in the PR. There does seem to be a slight difference:

Sorting on the theta sector ($|\pi/2 - \theta|$)

Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.74 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 1.11, 0.98, 1.35
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8      1810918 ns      1805887 ns          346 TracksPropagated=35.4396k/s
CUDA unsync propagation/16     2196059 ns      2190089 ns          321 TracksPropagated=116.89k/s
CUDA unsync propagation/32     2255072 ns      2248088 ns          311 TracksPropagated=455.498k/s
CUDA unsync propagation/64     2835461 ns      2826132 ns          248 TracksPropagated=1.44933M/s
CUDA unsync propagation/128    3910987 ns      3901016 ns          192 TracksPropagated=4.19993M/s
CUDA unsync propagation/256   12241997 ns     12215791 ns           58 TracksPropagated=5.36486M/s
CUDA sync propagation/8        1834123 ns      1829156 ns          353 TracksPropagated=34.9888k/s
CUDA sync propagation/16       2210750 ns      2204672 ns          320 TracksPropagated=116.117k/s
CUDA sync propagation/32       2264046 ns      2257410 ns          310 TracksPropagated=453.617k/s
CUDA sync propagation/64       2891798 ns      2883179 ns          243 TracksPropagated=1.42065M/s
CUDA sync propagation/128      3994973 ns      3984976 ns          186 TracksPropagated=4.11144M/s
CUDA sync propagation/256     12808931 ns     12781180 ns           55 TracksPropagated=5.12754M/s

Sorting on theta directly

Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.6 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 1.56, 0.98, 1.40
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8      1718087 ns      1713267 ns          367 TracksPropagated=37.3555k/s
CUDA unsync propagation/16     2233579 ns      2227494 ns          315 TracksPropagated=114.927k/s
CUDA unsync propagation/32     2298005 ns      2291165 ns          306 TracksPropagated=446.934k/s
CUDA unsync propagation/64     2697623 ns      2689106 ns          260 TracksPropagated=1.52318M/s
CUDA unsync propagation/128    4019020 ns      4008956 ns          187 TracksPropagated=4.08685M/s
CUDA unsync propagation/256   12336555 ns     12309757 ns           57 TracksPropagated=5.32391M/s
CUDA sync propagation/8        1717021 ns      1712189 ns          372 TracksPropagated=37.379k/s
CUDA sync propagation/16       2233491 ns      2227255 ns          317 TracksPropagated=114.94k/s
CUDA sync propagation/32       2304860 ns      2297486 ns          305 TracksPropagated=445.705k/s
CUDA sync propagation/64       2737803 ns      2728854 ns          257 TracksPropagated=1.501M/s
CUDA sync propagation/128      4113359 ns      4102710 ns          182 TracksPropagated=3.99346M/s
CUDA sync propagation/256     12920203 ns     12892396 ns           54 TracksPropagated=5.08331M/s

@stephenswat
Member

General question, does this allow us to preserve the old behaviour?

@niermann999
Collaborator Author

> General question, does this allow us to preserve the old behaviour?

What do you mean by old behaviour? Should I add an option to also use the uniform track generator?

@niermann999 niermann999 force-pushed the feat-random-tracks branch 4 times, most recently from bd51ba0 to 3e48743 Compare November 15, 2024 16:25
@niermann999
Collaborator Author

Also added charge randomization now, which mainly hurts the low-track-multiplicity benchmark cases:

Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.24 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x24)
  L1 Instruction 32 KiB (x24)
  L2 Unified 512 KiB (x24)
  L3 Unified 32768 KiB (x4)
Load Average: 1.01, 0.42, 0.83
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8      2173104 ns      2166468 ns          294 TracksPropagated=29.5412k/s
CUDA unsync propagation/16     2271953 ns      2264979 ns          309 TracksPropagated=113.025k/s
CUDA unsync propagation/32     2250060 ns      2242717 ns          312 TracksPropagated=456.589k/s
CUDA unsync propagation/64     2725427 ns      2716794 ns          258 TracksPropagated=1.50766M/s
CUDA unsync propagation/128    3951212 ns      3941215 ns          190 TracksPropagated=4.15709M/s
CUDA unsync propagation/256   12060627 ns     12034753 ns           58 TracksPropagated=5.44556M/s
CUDA sync propagation/8        2187177 ns      2181196 ns          311 TracksPropagated=29.3417k/s
CUDA sync propagation/16       2285902 ns      2279226 ns          307 TracksPropagated=112.319k/s
CUDA sync propagation/32       2264908 ns      2258104 ns          310 TracksPropagated=453.478k/s
CUDA sync propagation/64       2782645 ns      2774048 ns          253 TracksPropagated=1.47654M/s
CUDA sync propagation/128      4038953 ns      4028872 ns          185 TracksPropagated=4.06665M/s
CUDA sync propagation/256     12662450 ns     12634151 ns           55 TracksPropagated=5.18721M/s
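
A minimal sketch of such a charge randomization (the track type and its charge member are only illustrative, not detray's actual interface):

#include <random>
#include <vector>

// Flip a fair coin per track to assign the particle charge (+1 or -1 in units
// of the elementary charge).
template <typename track_t, typename rng_t>
void randomize_charge(std::vector<track_t> &tracks, rng_t &rng) {
    std::bernoulli_distribution coin{0.5};
    for (auto &trk : tracks) {
        trk.charge = coin(rng) ? 1. : -1.;
    }
}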

@sonarqubecloud

sonarqubecloud bot commented Dec 2, 2024

Quality Gate failed

Failed conditions
81.0% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Member

@stephenswat stephenswat left a comment


Given the detray meeting today, let's go ahead and approve this.

@stephenswat stephenswat enabled auto-merge December 2, 2024 16:54
@stephenswat stephenswat merged commit 7275508 into acts-project:main Dec 2, 2024
17 of 19 checks passed