Improve handling of retrieved steps from the GPU in async mode #350
Merged
JuanGonzalezCaminero
merged 29 commits into
apt-sim:master
from
SeverinDiederichs:direct_scoring_by_workers_with_1_buffer
Feb 24, 2025
Can one of the admins verify this patch?
JuanGonzalezCaminero
approved these changes
Feb 24, 2025
SeverinDiederichs
added a commit
that referenced
this pull request
Mar 10, 2025
If the UserSteppingAction is required, every GPU step must be copied back to the G4 workers. This required changing the kernels, as we need to be able to record every step, independent of edep or sensitive detector.

Copying back every step can produce a very large number of steps, which quickly fills the buffer. If the buffer gets too full before the Geant4 workers take care of their hits, the GPUStep Management Thread starts copying out the GPUSteps, as implemented in #350. However, this copying is too slow if every step is recorded, and leads to the GPU running out of HitSlots.

Previously, the Geant4 workers would only take care of the GPU steps after their transport had finished. However, this may be too late: the buffer may already be too full, leading to copying. Therefore, the Geant4 workers must be able to process some of the steps earlier. This is now done in the `AdePTTrackingManager`: before a new track is processed, the `GPUStepProcessing` is called. This way, the GPU step buffer can be kept under control. To enable this, the processing of the GPUSteps is now encapsulated in a single function that can be called from the `AdePTTrackingManager`.

In the same manner, the PostUserTrackingAction is called. For this, RecordHit also writes whether a step is the LastStep of a track. Note that it is straightforward to also call the PreUserTrackingAction. This requires a StepCounter, which is available in the B field update branch, so I will add it *after* the B field branch is merged.

Both can be enabled via:
```
/adept/CallUserSteppingAction true
/adept/CallPostUserTrackingAction true
```
Since this PR touches the kernels, the physics validation at high statistics is shown below; it is as good as it should be:
<img width="586" alt="Screenshot 2025-03-09 at 07 30 15" src="https://github.com/user-attachments/assets/8027a386-2680-4b22-9c4f-8da91a693ea3" />
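The early processing described in this commit message could be sketched roughly as follows. This is only an illustration, not the AdePT code: `GPUStep`, `WorkerSketch`, `ReceiveFromGPU`, `ProcessGPUSteps` and `HandleNewTrack` are hypothetical names, and the two boolean flags merely mirror the two macro commands.

```cpp
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical, simplified step record; a real GPU step carries far more data.
struct GPUStep {
  int trackId;
  bool lastStepOfTrack; // assumed to be written by the GPU side when a track ends
};

// Hypothetical stand-in for the per-worker logic (NOT the real AdePTTrackingManager).
class WorkerSketch {
public:
  // Flags mirroring /adept/CallUserSteppingAction and /adept/CallPostUserTrackingAction.
  bool callUserSteppingAction     = false;
  bool callPostUserTrackingAction = false;

  // User hooks, modelled here as plain callbacks.
  std::function<void(const GPUStep &)> userSteppingAction;
  std::function<void(int)> postUserTrackingAction;

  // A step returned from the GPU becomes pending for this worker.
  void ReceiveFromGPU(const GPUStep &step) { fPending.push_back(step); }

  // Single entry point that drains all pending steps; returns how many were processed.
  std::size_t ProcessGPUSteps()
  {
    std::size_t processed = 0;
    while (!fPending.empty()) {
      const GPUStep step = fPending.front();
      fPending.pop_front();
      if (callUserSteppingAction && userSteppingAction) userSteppingAction(step);
      if (callPostUserTrackingAction && step.lastStepOfTrack && postUserTrackingAction)
        postUserTrackingAction(step.trackId);
      ++processed;
    }
    return processed;
  }

  // Before a new track is handled, pending GPU steps are processed first,
  // so the step buffer is drained during transport and not only at the end.
  void HandleNewTrack(int /*trackId*/)
  {
    ProcessGPUSteps();
    // ... the track would be handed over to the GPU here ...
  }

private:
  std::deque<GPUStep> fPending;
};
```

Draining in `HandleNewTrack` means the backlog is bounded by what arrives between two tracks, which is the point of calling the processing before each new track.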
This PR improves the handling of the retrieved steps from the GPU in the async mode.
Previously, the HitProcessingThread would always copy the retrieved steps out into a vector, and a unique_ptr to that vector would be pushed to the HitQueue. When the G4 workers popped the item from the queue and finished with it, the data would be released.
Unfortunately, copying gigabytes of data could take significant time in certain settings and block the buffer for longer than the GPU needs to run out of HitSlots. This scenario mostly appeared in testEm3 when shooting electrons, since every step of the full shower is recorded, leading to a vast amount of data to be copied.
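For illustration, the previous hand-off can be pictured roughly as in the sketch below. The names (`Step`, `HitQueueSketch`, `PushCopy`, `Pop`) are hypothetical and the real classes differ; the point is only that every retrieved batch is copied in full before it is queued.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical, simplified step record.
struct Step {
  double edep;
};

// Hypothetical queue of owned copies (NOT the actual AdePT HitQueue).
class HitQueueSketch {
public:
  // Copies `count` steps out of the retrieval buffer; the cost grows with the batch size.
  void PushCopy(const Step *buffer, std::size_t count)
  {
    auto copy = std::make_unique<std::vector<Step>>(buffer, buffer + count);
    std::lock_guard<std::mutex> lock(fMutex);
    fQueue.push(std::move(copy));
  }

  // A worker takes ownership of the oldest batch; the memory is freed when
  // the returned unique_ptr goes out of scope. Returns nullptr if empty.
  std::unique_ptr<std::vector<Step>> Pop()
  {
    std::lock_guard<std::mutex> lock(fMutex);
    if (fQueue.empty()) return nullptr;
    auto front = std::move(fQueue.front());
    fQueue.pop();
    return front;
  }

private:
  std::mutex fMutex;
  std::queue<std::unique_ptr<std::vector<Step>>> fQueue;
};
```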
This PR changes the handling as follows:
There are still two buffers on the device, but only one large circular buffer on the CPU. Whenever the device buffers swap, the retrieved steps are copied into the circular host buffer. The G4 workers are then woken up to score directly inside the circular buffer. Once they have finished scoring, the memory is released, i.e., the GPU can copy into it again. If the circular host buffer is more than 50% full (which indicates that the G4 workers are not scoring their hits, e.g., because they are busy with hadronics), the hits are copied out before the workers get to them. Thus, when the G4 workers are available the extra copy is avoided, and when they are busy the hits can still be copied out (e.g., as needed for CMS ttbar events).
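A minimal single-threaded sketch of the circular host buffer logic, under stated assumptions: `Step`, `HostStepRing` and its methods are hypothetical names, synchronisation between the transfer thread and the G4 workers is omitted, and only the 50% threshold is taken from the description above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified step record.
struct Step {
  int threadId;
  double edep;
};

// Hypothetical fixed-capacity ring over one host allocation (NOT the AdePT class).
class HostStepRing {
public:
  explicit HostStepRing(std::size_t capacity) : fData(capacity) {}

  // Copy a batch retrieved from the device into the ring.
  // Returns false (and copies nothing) if there is not enough free space.
  bool Push(const std::vector<Step> &batch)
  {
    if (batch.size() > fData.size() - fUsed) return false;
    for (const Step &s : batch) {
      fData[(fHead + fUsed) % fData.size()] = s;
      ++fUsed;
    }
    return true;
  }

  // Workers have scored the oldest n steps in place: release that memory.
  void Release(std::size_t n)
  {
    if (n > fUsed) n = fUsed;
    fHead = (fHead + n) % fData.size();
    fUsed -= n;
  }

  // Fallback when workers are busy: copy pending steps out and free the ring.
  std::vector<Step> CopyOut()
  {
    std::vector<Step> out;
    out.reserve(fUsed);
    for (std::size_t i = 0; i < fUsed; ++i)
      out.push_back(fData[(fHead + i) % fData.size()]);
    Release(fUsed);
    return out;
  }

  // True once the ring is more than 50% full.
  bool NeedsCopyOut() const { return 2 * fUsed > fData.size(); }

  std::size_t Used() const { return fUsed; }

private:
  std::vector<Step> fData;
  std::size_t fHead = 0; // index of the oldest pending step
  std::size_t fUsed = 0; // number of pending steps
};
```

In the common case the workers call something like `Release` after scoring in place and no extra copy happens; `CopyOut` is only taken when `NeedsCopyOut` reports that the workers are falling behind.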
This greatly improves the performance of the async mode for applications that are largely dominated by EM physics, such as testEm3.
The table shows the run time for T threads and E events. The first rows are testEm3 with 100 primary 10 GeV electrons per event; the last two rows are ttbar events in CMS (without magnetic field).
This PR also adds the async mode to the CI, so both modes are now tested.

Also, this finally enabled physics validation in testEm3 with the async AdePT, and it yields the same results as the sync AdePT: