Improve handling of retrieved steps from the GPU in async mode #350

SeverinDiederichs (Collaborator) commented Feb 21, 2025

This PR improves the handling of the retrieved steps from the GPU in the async mode.

Previously, the HitProcessingThread always copied the retrieved steps into a vector and pushed a unique_ptr to it onto the HitQueue. The data was released once the G4 workers popped the item from the queue.
Unfortunately, copying gigabytes of data could take significant time in certain settings and block the buffer for longer than the GPU needs to run out of HitSlots. This scenario mostly appeared in testEm3 when shooting electrons, since every step of the full shower is recorded, leading to a vast amount of data to be copied.

This PR changes the handling as follows:

There are still two buffers on the device, but only one large circular buffer on the CPU. Whenever the device buffers swap, the retrieved steps are copied into the circular host buffer. Then, the G4 workers are woken up to score directly inside the circular buffer. Once they have finished scoring, the memory is released, i.e., the GPU can copy into it again. If the circular host buffer is filled to more than 50% (which indicates that the G4 workers are not scoring their hits, e.g., because they are doing hadronics), the hits are copied out of the circular buffer before scoring. Thus, if the G4 workers are available, the copy is avoided, and if the G4 workers are busy, the hits are copied out (e.g., as needed for CMS ttbar events).
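The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the AdePT implementation: the class `CircularHostBuffer`, the struct `Segment`, and all method names are hypothetical, and the sketch only counts slots (it assumes segments are released in the order they were reserved and does not move any data).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one batch of steps inside the host buffer.
struct Segment {
  std::size_t begin; // index of the first slot (modulo the capacity)
  std::size_t count; // number of step slots in this batch
};

// Hypothetical slot accounting for a circular host buffer.
class CircularHostBuffer {
public:
  explicit CircularHostBuffer(std::size_t capacity) : fCapacity(capacity)
  {
    assert(capacity > 0);
  }

  // Reserve space for a batch arriving from the device.
  // Returns false if there are not enough free slots.
  bool Reserve(std::size_t count, Segment &out)
  {
    if (count > fCapacity - fUsed) return false;
    out   = Segment{fHead, count};
    fHead = (fHead + count) % fCapacity;
    fUsed += count;
    return true;
  }

  // Called once a batch has been scored (or copied out): free its slots.
  void Release(const Segment &seg)
  {
    assert(seg.count <= fUsed);
    fUsed -= seg.count;
  }

  double FillFraction() const
  {
    return static_cast<double>(fUsed) / static_cast<double>(fCapacity);
  }

  // Policy from the description: above 50% fill, copy the hits out
  // instead of waiting for the workers to score in place.
  bool ShouldCopyOut(double threshold = 0.5) const
  {
    return FillFraction() > threshold;
  }

private:
  std::size_t fCapacity;
  std::size_t fHead = 0; // next slot to hand out
  std::size_t fUsed = 0; // slots currently reserved
};
```

With a capacity of 100 slots, reserving 40 and then 20 slots brings the fill level to 60%, at which point `ShouldCopyOut()` returns true; releasing the first batch drops it back to 20%.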

This greatly improves the performance of the async mode for applications dominated by EM physics, such as testEm3.

The table shows the run time for T threads and E events. The first seven rows are testEm3 with 100 primary 10 GeV electrons per event; the last two rows are ttbar events in CMS (without magnetic field).

| Type | Sync | Async master | Async this PR |
|---|---|---|---|
| 1T 2E | 1.9 | 1.27 | 1.2 |
| 2T 4E | 2.0 | 1.9 | 1.2 |
| 4T 16E | 4.16 | 5.0 | 2.6 |
| 8T 32E | 4.6 | 10.1 | 2.8 |
| 16T 64E | 6.6 | out of slots | 4.5 |
| 16T 256E | 24.1 | - | 17.5 |
| 16T 1000E | 94.7 | - | 70.2 |
| ttbar 8T 16E all sensitive | 79.6 | 72.4 | 70.3 |
| ttbar 16T 64E | 134.126 | 118.461 | 115.5 |

This PR adds the async mode to the CI, so both modes are now tested.
It also finally enables physics validation in testEm3 with async AdePT, which yields the same results as sync AdePT:
Screenshot 2025-02-22 at 17 06 47

@phsft-bot
Can one of the admins verify this patch?

@JuanGonzalezCaminero JuanGonzalezCaminero merged commit 4b09036 into apt-sim:master Feb 24, 2025
3 checks passed
SeverinDiederichs added a commit that referenced this pull request Mar 10, 2025
If the UserSteppingAction is required, we need to copy every GPU
step back to the G4 workers.

This required changing the kernels, as we need to be able to record
every step independent of edep or sensitive detector.
Since copying back every step can lead to a very large number of steps,
this would quickly fill the buffer. Then, if the buffer gets too full
before the Geant4 workers take care of their hits, the GPUStep
Management Thread would start copying out the GPUSteps, as implemented
in #350. However, this copying is too slow if every step is recorded,
and it leads to the GPU running out of HitSlots.
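
The kernel-side change (recording every step regardless of edep or
sensitive detector) can be pictured with the small predicate below. It
is only a sketch: the function `ShouldRecordStep` is hypothetical, and
the default condition (a non-zero energy deposit inside a sensitive
volume) is an assumption about the previous behavior, not taken from
the AdePT kernels.

```cpp
#include <cassert>

// Hypothetical predicate deciding whether a GPU step is sent back to the host.
// Assumption: without the UserSteppingAction, only steps that deposit energy
// in a sensitive volume are recorded.
inline bool ShouldRecordStep(double edep, bool inSensitiveVolume, bool callUserSteppingAction)
{
  // With the UserSteppingAction enabled, every step is needed on the host.
  if (callUserSteppingAction) return true;
  return inSensitiveVolume && edep > 0.0;
}
```

Under this assumption, enabling the stepping action turns every step
into a recorded step, which is why the buffer fills so much faster.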

Previously, the Geant4 workers would only take care of the GPU steps
after their transport has finished. However, this may be too late and
the buffer may be too full, leading to copying. Therefore, the Geant4
workers must be able to process some of the steps earlier. This
is now done in the `AdePTTrackingManager`: before a new track is
processed, the `GPUStepProcessing` is called. This way, the GPU step
buffer can be kept under control. To enable this, the processing of the
GPUSteps is now encapsulated in a single function that can be called
from the `AdePTTrackingManager`.
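
The control flow described above can be sketched as follows. This is
an illustration under assumptions, not the actual code: `WorkerState`,
`ProcessGPUSteps`, and `HandleTrack` are hypothetical stand-ins for the
per-worker state, the single processing function, and the per-track
entry point of the tracking manager.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical, reduced representation of a step returned from the GPU.
struct GPUStep {
  int trackId;
  double edep;
};

// Hypothetical per-worker state: steps waiting to be scored by this worker.
struct WorkerState {
  std::queue<GPUStep> pendingSteps;
  double scoredEdep          = 0.0;
  std::size_t processedSteps = 0;
};

// Single entry point that drains all pending GPU steps of a worker.
// Returns the number of steps processed in this call.
std::size_t ProcessGPUSteps(WorkerState &worker)
{
  std::size_t n = 0;
  while (!worker.pendingSteps.empty()) {
    worker.scoredEdep += worker.pendingSteps.front().edep; // stand-in for scoring
    worker.pendingSteps.pop();
    ++n;
  }
  worker.processedSteps += n;
  return n;
}

// Stand-in for the per-track handling of the tracking manager:
// pending GPU steps are processed before the new track is handled.
void HandleTrack(WorkerState &worker, int /*trackId*/)
{
  ProcessGPUSteps(worker);
  // ... handling of the new track would follow here ...
}
```

Because the same function is called both before each new track and
after transport has finished, the steps are drained in small portions
instead of accumulating until the end.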

In the same manner, the PostUserTrackingAction is called. For this,
RecordHit also records whether a step is the LastStep of a track.
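
A minimal sketch of how a last-step flag can drive the two user
actions on the host side is given below. The types `RecordedStep` and
`ActionCounts` and the function `ReplaySteps` are hypothetical; the
counters merely stand in for the actual calls to the user actions.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced step record: the flag marks the final step of a track.
struct RecordedStep {
  int trackId;
  bool lastStep;
};

// Counters standing in for the invocations of the user actions.
struct ActionCounts {
  int steppingCalls     = 0;
  int postTrackingCalls = 0;
};

// Replay the returned steps on the host: the stepping action is called for
// every step, the post-tracking action only for a track's last step.
void ReplaySteps(const std::vector<RecordedStep> &steps, bool callSteppingAction,
                 bool callPostTrackingAction, ActionCounts &counts)
{
  for (const RecordedStep &step : steps) {
    if (callSteppingAction) ++counts.steppingCalls;
    if (callPostTrackingAction && step.lastStep) ++counts.postTrackingCalls;
  }
}
```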

Note that it is straightforward to also call the PreUserTrackingAction.
This requires a StepCounter, which is available in the B field update
branch, so I will add it *after* the B field branch is merged.

The UserSteppingAction and PostUserTrackingAction calls can be enabled via:
```
/adept/CallUserSteppingAction true
/adept/CallPostUserTrackingAction true
```

Since this PR touches the kernels, the physics validation at high
statistics is shown below; it is as good as it should be:

<img width="586" alt="Screenshot 2025-03-09 at 07 30 15"
src="https://github.com/user-attachments/assets/8027a386-2680-4b22-9c4f-8da91a693ea3"
/>
SeverinDiederichs added a commit to SeverinDiederichs/AdePT that referenced this pull request Mar 17, 2025
…sim#356)
