ref: More realistic toy detector benchmarks #885
Conversation
Force-pushed 03e102d to 28a6721
```cpp
if (do_sort) {
    // Sort by theta angle
    const auto traj_comp = [](const auto &lhs, const auto &rhs) {
        return getter::theta(lhs.dir()) < getter::theta(rhs.dir());
```
Are you sorting to get better performance? If so, I would do it with
That's actually a good suggestion!
But the volumes and surfaces in
It might be worth measuring the performance for both cases.
I think memory locality only becomes beneficial if all threads in a warp access the same surface at the same time. Accessing adjacent surfaces won't help much, given that our memory layout is full-SoA.
In the ODD full chain benchmark, I observed that
Ok, correction, there was a bug in the PR. There does seem to be a slight difference:
Sorting on theta sector
Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.74 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 512 KiB (x24)
L3 Unified 32768 KiB (x4)
Load Average: 1.11, 0.98, 1.35
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8 1810918 ns 1805887 ns 346 TracksPropagated=35.4396k/s
CUDA unsync propagation/16 2196059 ns 2190089 ns 321 TracksPropagated=116.89k/s
CUDA unsync propagation/32 2255072 ns 2248088 ns 311 TracksPropagated=455.498k/s
CUDA unsync propagation/64 2835461 ns 2826132 ns 248 TracksPropagated=1.44933M/s
CUDA unsync propagation/128 3910987 ns 3901016 ns 192 TracksPropagated=4.19993M/s
CUDA unsync propagation/256 12241997 ns 12215791 ns 58 TracksPropagated=5.36486M/s
CUDA sync propagation/8 1834123 ns 1829156 ns 353 TracksPropagated=34.9888k/s
CUDA sync propagation/16 2210750 ns 2204672 ns 320 TracksPropagated=116.117k/s
CUDA sync propagation/32 2264046 ns 2257410 ns 310 TracksPropagated=453.617k/s
CUDA sync propagation/64 2891798 ns 2883179 ns 243 TracksPropagated=1.42065M/s
CUDA sync propagation/128 3994973 ns 3984976 ns 186 TracksPropagated=4.11144M/s
CUDA sync propagation/256 12808931 ns 12781180 ns 55 TracksPropagated=5.12754M/s
Sorting on theta directly
Running ./bin/detray_benchmark_cuda_array
Run on (48 X 1797.6 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x24)
L1 Instruction 32 KiB (x24)
L2 Unified 512 KiB (x24)
L3 Unified 32768 KiB (x4)
Load Average: 1.56, 0.98, 1.40
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
CUDA unsync propagation/8 1718087 ns 1713267 ns 367 TracksPropagated=37.3555k/s
CUDA unsync propagation/16 2233579 ns 2227494 ns 315 TracksPropagated=114.927k/s
CUDA unsync propagation/32 2298005 ns 2291165 ns 306 TracksPropagated=446.934k/s
CUDA unsync propagation/64 2697623 ns 2689106 ns 260 TracksPropagated=1.52318M/s
CUDA unsync propagation/128 4019020 ns 4008956 ns 187 TracksPropagated=4.08685M/s
CUDA unsync propagation/256 12336555 ns 12309757 ns 57 TracksPropagated=5.32391M/s
CUDA sync propagation/8 1717021 ns 1712189 ns 372 TracksPropagated=37.379k/s
CUDA sync propagation/16 2233491 ns 2227255 ns 317 TracksPropagated=114.94k/s
CUDA sync propagation/32 2304860 ns 2297486 ns 305 TracksPropagated=445.705k/s
CUDA sync propagation/64 2737803 ns 2728854 ns 257 TracksPropagated=1.501M/s
CUDA sync propagation/128 4113359 ns 4102710 ns 182 TracksPropagated=3.99346M/s
CUDA sync propagation/256 12920203 ns 12892396 ns 54 TracksPropagated=5.08331M/s
General question, does this allow us to preserve the old behaviour?
What do you mean by old behaviour? Should I add an option to also use the uniform track generator?
Force-pushed bd51ba0 to 3e48743
Also added charge randomization now, which mainly harms the low-track-multiplicity benchmark cases.
Force-pushed 3e48743 to 5b367b8
stephenswat left a comment:
Given the detray meeting today, let's go ahead and approve this.
Use random, but sorted, tracks for the benchmarks and copy the detector to device. I also changed the eta range to [-4, 4] and the momentum range to 1 GeV - 100 GeV.