Improve handling of retrieved steps from the GPU in async mode #350

SeverinDiederichs · 2025-02-21T15:07:55Z

This PR improves the handling of the retrieved steps from the GPU in the async mode.

Previously, the HitProcessingThread would always copy the retrieved steps into a vector and push a unique_ptr to that vector onto the HitQueue. When the G4 workers popped the item from the queue, the data would be released.
Unfortunately, copying gigabytes of data could take significant time in certain settings and block the buffer for longer than the GPU needed to run out of HitSlots. This scenario mostly appeared in testEm3 when shooting electrons, since every step of the full shower was recorded, leading to a vast amount of data to be copied.
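
For contrast with the new scheme, the previous copy-and-queue pattern described above can be sketched roughly as follows. This is only an illustration, not the AdePT code: `GPUStep`, `HitQueueSketch`, `PushCopy` and `Pop` are hypothetical names, and the real queue and step types differ.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical stand-in for a step record retrieved from the GPU.
struct GPUStep {
  int threadId;
  double edep;
};

// Hypothetical queue illustrating the old pattern: every batch is copied
// into a freshly allocated vector, and ownership moves through the queue.
class HitQueueSketch {
public:
  // Producer side (the hit processing thread in the description above):
  // copies the whole batch, which is the expensive part for large batches.
  void PushCopy(const GPUStep *begin, const GPUStep *end) {
    assert(begin <= end);
    auto copy = std::make_unique<std::vector<GPUStep>>(begin, end);
    std::lock_guard<std::mutex> lock(fMutex);
    fQueue.push(std::move(copy));
  }

  // Consumer side (a G4 worker): takes ownership of the oldest batch.
  // Returns nullptr if the queue is empty. The memory is released when
  // the returned unique_ptr goes out of scope.
  std::unique_ptr<std::vector<GPUStep>> Pop() {
    std::lock_guard<std::mutex> lock(fMutex);
    if (fQueue.empty()) return nullptr;
    auto item = std::move(fQueue.front());
    fQueue.pop();
    return item;
  }

private:
  std::mutex fMutex;
  std::queue<std::unique_ptr<std::vector<GPUStep>>> fQueue;
};
```

In this pattern the staging buffer stays occupied for the whole duration of the copy inside `PushCopy`, which is the blocking behaviour this PR is meant to avoid.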

This PR now changes the handling in the following way:

Now, there are still two buffers on the device, but only one large circular buffer on the CPU. Whenever the device buffers swap, the retrieved steps are copied into the circular host buffer. Then, the G4 workers are woken up to score directly inside the circular buffer. When they have finished scoring, the memory is released, i.e., the GPU can copy into it again. If the circular host buffer is more than 50% full (which indicates that the G4 workers are not scoring their hits, e.g., because they are doing hadronics), the hits are copied out first. As a result, the copy is avoided when the G4 workers are available, and the hits can still be copied out when the G4 workers are busy (e.g., as needed for CMS ttbar events).
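
The bookkeeping described above can be illustrated with a minimal, single-threaded sketch. It is not the implementation in this PR: `CircularStepBufferSketch`, `GPUStep` and all method names are hypothetical, segments are released strictly in FIFO order for simplicity, and the locking and worker wake-up that the real code needs are left out. Only the 50% threshold is taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a step record retrieved from the GPU.
struct GPUStep {
  int threadId;
  double edep;
};

// Hypothetical ring buffer: one segment per device-buffer swap.
class CircularStepBufferSketch {
public:
  explicit CircularStepBufferSketch(std::size_t capacity) : fSlots(capacity) {
    assert(capacity > 0);
  }

  // Copy one batch of steps into the ring (wrapping around if needed).
  // Returns false without copying if there is not enough free space.
  bool Push(const std::vector<GPUStep> &batch) {
    if (batch.size() > fSlots.size() - fUsed) return false;
    for (const GPUStep &step : batch) {
      fSlots[(fHead + fUsed) % fSlots.size()] = step;
      ++fUsed;
    }
    fSegments.push_back(batch.size());
    return true;
  }

  // Called once the oldest segment has been scored in place:
  // its slots become available for new data again.
  void ReleaseOldest() {
    assert(!fSegments.empty());
    const std::size_t n = fSegments.front();
    fSegments.pop_front();
    fHead = (fHead + n) % fSlots.size();
    fUsed -= n;
  }

  // Fallback when the consumers are busy: copy the oldest segment into
  // separately owned memory and free its slots in the ring.
  std::vector<GPUStep> CopyOutOldest() {
    assert(!fSegments.empty());
    const std::size_t n = fSegments.front();
    std::vector<GPUStep> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
      out.push_back(fSlots[(fHead + i) % fSlots.size()]);
    }
    ReleaseOldest();
    return out;
  }

  double FillFraction() const {
    return static_cast<double>(fUsed) / static_cast<double>(fSlots.size());
  }

  // Copy-out is triggered only when the ring is more than 50% full.
  bool ShouldCopyOut() const { return FillFraction() > 0.5; }

private:
  std::vector<GPUStep> fSlots;
  std::deque<std::size_t> fSegments; // segment sizes, oldest first
  std::size_t fHead = 0;             // index of the oldest used slot
  std::size_t fUsed = 0;             // number of occupied slots
};
```

A producer would call `Push` after each device buffer swap and fall back to `CopyOutOldest` only while `ShouldCopyOut()` returns true. Otherwise the consumers score in place and call `ReleaseOldest`.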

This greatly improves the performance of the async mode for applications where EM strongly dominates, such as testEm3.

The table shows the run time for T threads and E events. The first seven rows are testEm3 with 100 primary 10 GeV electrons per event; the last two rows are ttbar events in CMS (without magnetic field).

| Type | Sync | Async master | Async this PR |
| --- | --- | --- | --- |
| 1T 2E | 1.9 | 1.27 | 1.2 |
| 2T 4E | 2.0 | 1.9 | 1.2 |
| 4T 16E | 4.16 | 5.0 | 2.6 |
| 8T 32E | 4.6 | 10.1 | 2.8 |
| 16T 64E | 6.6 | out of slots | 4.5 |
| 16T 256E | 24.1 | - | 17.5 |
| 16T 1000E | 94.7 | - | 70.2 |
| ttbar 8T 16E all sensitive | 79.6 | 72.4 | 70.3 |
| ttbar 16T 64E | 134.126 | 118.461 | 115.5 |

This PR also adds the async mode to the CI, so both modes are now tested.
In addition, this finally enabled physics validation in testEm3 with the async AdePT, which yields the same results as the sync AdePT.

phsft-bot · 2025-02-21T15:11:01Z

Can one of the admins verify this patch?

If the UserSteppingAction is required, every GPU step has to be copied back to the G4 workers. This required changing the kernels, as we need to be able to record every step independent of edep or sensitive detector.

Since copying back every step can lead to a very large number of steps, this would quickly fill the buffer. Then, if the buffer gets too full before the Geant4 workers take care of their hits, the GPUStep Management Thread would start copying out the GPUSteps, as implemented in #350. However, this copying is too slow if every step is recorded, and it leads to the GPU running out of HitSlots.

Previously, the Geant4 workers would only take care of the GPU steps after their transport had finished. However, this may be too late: the buffer may already be too full, leading to copying. Therefore, the Geant4 workers must be able to process some of the steps earlier. This is now done in the `AdePTTrackingManager`: before a new track is processed, the `GPUStepProcessing` is called. This way, the GPU step buffer can be kept under control. To enable this, the processing of the GPUSteps is now encapsulated in a single function that can be called from the `AdePTTrackingManager`.

In the same manner, the PostUserTrackingAction is called. For this, the RecordHit also writes whether it is the LastStep of a track. Note that it is straightforward to also call the PreUserTrackingAction. This requires a StepCounter, which is available in the B field update branch, so I will add it *after* the B field branch is merged.

Both can be enabled via:

```
/adept/CallUserSteppingAction true
/adept/CallPostUserTrackingAction true
```

Since this PR touches the kernels, the physics validation at high statistics is shown below; it is as good as it should be:

<img width="586" alt="Screenshot 2025-03-09 at 07 30 15" src="https://github.com/user-attachments/assets/8027a386-2680-4b22-9c4f-8da91a693ea3" />
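
The idea of draining pending GPU steps before each new track, with optional user callbacks, can be sketched as below. This is an assumption-laden illustration rather than the AdePT code: `GPUStep`, `StepCallbacks`, `ProcessPendingGPUSteps` and `HandleTrack` are hypothetical names, and the two boolean flags merely mirror the `/adept/CallUserSteppingAction` and `/adept/CallPostUserTrackingAction` commands.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical step record; lastStepOfTrack mirrors the "LastStep" flag
// that the description says is written when a hit is recorded.
struct GPUStep {
  int trackId;
  double edep;
  bool lastStepOfTrack;
};

// Hypothetical container for the user hooks and their enable flags.
struct StepCallbacks {
  bool callUserSteppingAction = false;
  bool callPostUserTrackingAction = false;
  std::function<void(const GPUStep &)> steppingAction;
  std::function<void(int)> postTrackingAction;
};

// Single entry point that processes all pending steps of one worker.
// Returns the number of steps processed.
std::size_t ProcessPendingGPUSteps(std::deque<GPUStep> &pending,
                                   const StepCallbacks &cb) {
  std::size_t processed = 0;
  while (!pending.empty()) {
    const GPUStep step = pending.front();
    pending.pop_front();
    if (cb.callUserSteppingAction && cb.steppingAction) {
      cb.steppingAction(step);
    }
    // The post-tracking hook fires only on the last step of a track.
    if (step.lastStepOfTrack && cb.callPostUserTrackingAction &&
        cb.postTrackingAction) {
      cb.postTrackingAction(step.trackId);
    }
    ++processed;
  }
  return processed;
}

// Hypothetical per-track hook: drain the pending steps first, then
// transport the new track, so the step buffer cannot pile up while the
// worker is busy with a long sequence of tracks.
void HandleTrack(std::deque<GPUStep> &pending, const StepCallbacks &cb,
                 const std::function<void()> &transportTrack) {
  assert(static_cast<bool>(transportTrack));
  ProcessPendingGPUSteps(pending, cb);
  transportTrack();
}
```

Keeping the step processing in one function is what allows it to be called from more than one place, here both before each track and, in principle, after transport has finished.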

SeverinDiederichs added 25 commits February 21, 2025 15:45

draft with header only problems

2e65550

score directly in the buffer

078c07a

fix deadlock

5143a8c

non working draft version

25eeb1d

working version but slow due to big transfer

ff1fa66

copy only what is needed into compact buffer via multiple copies

5a49686

not yet reproducible

4bf174a

working version

799f059

loop inversion

b83ceb7

use struct for queue and no refcount

f9ceb63

add safeguard to copy out

c1fdf8f

improved naming

738c90f

some cleaning

8ca1ae4

direct scoring with one buffer, removing of segments still has problems

fedd6a0

fixing remove segment and cleaning

98ed63e

wake workers up again after copy

7033b8f

cleaning

13cd962

remove hostBufferSubmitted variable and use state instead

baaf783

copy out depending on buffer fill quota, only wake G4workers from HitProcessingThread

5085d98

Clang formatting

681a7a6

cleaning

fbaedc8

improve parameters in CI test

0979e5f

disabling advanced debugging

6e11d8f

remove debugging from CMakeLists

0bb89cf

using both sync and async mode in CI

9e63a85

SeverinDiederichs added 4 commits February 21, 2025 16:29

fix typo

caa1b36

clang format

345e916

adjust tolerance for numerical rounding

e083e16

adjust timeout

844d454

SeverinDiederichs requested a review from JuanGonzalezCaminero February 24, 2025 14:58

JuanGonzalezCaminero approved these changes Feb 24, 2025

JuanGonzalezCaminero merged commit 4b09036 into apt-sim:master Feb 24, 2025
3 checks passed

SeverinDiederichs mentioned this pull request Mar 8, 2025

enable calling of UserSteppingAction and PostUserTrackingAction #356

Merged
