Reduce thread divergence in covariance transport #997
base: main
const transform3_type& trf3, const mask_t& mask,
const bound_parameters_vector<algebra_type>& bound_vec) {
template <typename frame_t>
requires std::is_object_v<typename frame_t::loc_point>
I think this is a piece of old code. Would you mind checking whether this can be changed to
- requires std::is_object_v<typename frame_t::loc_point>
+ requires concepts::point<typename frame_t::loc_point>
?
const bound_parameters_vector<algebra_type>& bound_vec) {
template <typename frame_t>
requires std::is_object_v<typename frame_t::loc_point>
DETRAY_HOST_DEVICE static inline void bound_to_free_jacobian_step_1(
I am guessing that not all of the sub-steps here are supposed to be part of the public API of the class? If not, can we move them to the private section or otherwise give them more instructive names?
Unfortunately, these functions do need to be in the public API, as the parameter transporter must be able to use them separately. But yes, I'll name them better.
sf.transform(gctx), bound_to_free_jacobian, vol_mat_ptr,
propagation._stepping);
auto free_to_bound_jacobian =
sf.template visit_mask<get_free_to_bound_jacobian_kernel>(
- sf.template visit_mask<get_free_to_bound_jacobian_kernel>(
+ sf.free_to_bound_jacobian(gctx, stepping());
Oh interesting.
Unfortunately this performs the whole computation, but it needs to be split up into these smaller parts, so this will not work.
Ah, I see now! This does not actually retrieve the free_to_bound_jacobian? I might have to double down on finding better names, please 😂
😅
core/include/detray/propagator/actors/parameter_transporter.hpp
Currently, the covariance transport uses the Jacobian engine, which is templated on the frame type. While this makes the code easier to read and write, it requires the compiler to duplicate a lot of code for each of the frame types, and this duplicated code counts as multiple branches for the purposes of GPU execution. Thus, this templating increases the amount of thread divergence.
This commit refactors the Jacobian engine into smaller parts, some of which are templated on the frame type and some of which are not. Client code can then take a more fine-grained approach to branching and reduce divergence.
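The refactoring idea described above can be sketched in plain host-side C++. This is a toy illustration under stated assumptions, not the detray Jacobian engine: `frame_id`, `cartesian_frame`, `polar_frame`, `to_xy`, `common_step`, `transport_monolithic` and `transport_split` are all hypothetical names, and the "computation" is a stand-in for the real Jacobian algebra.

```cpp
#include <array>
#include <cassert>
#include <cmath>

namespace sketch {

using vec2 = std::array<double, 2>;

enum class frame_id { cartesian, polar };

// Hypothetical frames: only the local-to-Cartesian conversion differs.
struct cartesian_frame {
    static vec2 to_xy(const vec2& p) { return p; }
};
struct polar_frame {
    static vec2 to_xy(const vec2& p) {  // p = (r, phi)
        return {p[0] * std::cos(p[1]), p[0] * std::sin(p[1])};
    }
};

// Frame-independent part: not templated, so it exists exactly once.
inline double common_step(const vec2& xy, double scale) {
    return scale * (xy[0] * xy[0] + xy[1] * xy[1]);
}

// Monolithic variant: the whole computation is templated on the frame,
// so the frame-independent arithmetic is instantiated once per frame.
template <typename frame_t>
double full_computation(const vec2& p, double scale) {
    const vec2 xy = frame_t::to_xy(p);
    return scale * (xy[0] * xy[0] + xy[1] * xy[1]);
}

// Threads with different frames stay in different branches for the
// entire computation.
inline double transport_monolithic(frame_id id, const vec2& p, double scale) {
    switch (id) {
        case frame_id::cartesian:
            return full_computation<cartesian_frame>(p, scale);
        case frame_id::polar:
            return full_computation<polar_frame>(p, scale);
    }
    return 0.0;
}

// Split variant: only the frame-dependent step is branched on; all
// callers then run the same shared code.
inline double transport_split(frame_id id, const vec2& p, double scale) {
    vec2 xy{};
    switch (id) {  // short frame-dependent region
        case frame_id::cartesian: xy = cartesian_frame::to_xy(p); break;
        case frame_id::polar: xy = polar_frame::to_xy(p); break;
    }
    return common_step(xy, scale);  // shared region, identical for all frames
}

// Both variants compute the same value.
inline void check_equivalence() {
    const vec2 p{2.0, 0.5};
    assert(std::abs(transport_monolithic(frame_id::polar, p, 3.0) -
                    transport_split(frame_id::polar, p, 3.0)) < 1e-12);
}

}  // namespace sketch
```

In the split variant the branch covers only the small frame-specific conversion, so on a GPU the threads of a warp would be expected to reconverge before the shared arithmetic; whether a given compiler actually emits less duplicated code depends on inlining decisions.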
6ae62f1 to 4193e1c
Results from integrating this into traccc:
Before this PR
Number of PTX lines per kernel
Kernel performance
After this PR
Kernel performance
Comparison
According to Nsight Compute, for the fit_forward kernel, the fraction of cycles on which a warp is eligible increases from 7.27% to 8.05%. The average number of active threads per warp (a measure of divergence) increases from 5.22 to 5.74. Because the code shrinks by ~10% and because this happens in the hottest parts, the number of stalls due to instruction cache misses decreases significantly.
I'll probably finish this PR when I get home from holiday, but the results are promising. Makes me wonder where else we can improve performance in this way. 😄
I tried to reformulate the intersectors a while back to be called per local coordinate system instead of per shape, but the cylinders need access to their mask... I think it is possible to work around that, but I have not made up my mind yet how best to do that.