Skip to content

Reduce thread divergence in covariance transport #997

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

stephenswat
Copy link
Member

Currently, the covariance transport uses the Jacobian engine which is templated on the frame type. While this makes the code easier to read and write, it requires the compiler to duplicate a lot of code for each of the frame types, and this duplicated code counts as multiple branches for the sake of GPU execution. Thus, this templating increases the amount of thread divergence.

This commit refactors the Jacobian engine into smaller parts, some of which are templated on the frame type and some of which are not. Client code and then take a more fine-grained approach to branching and improve divergence.

@stephenswat stephenswat requested a review from niermann999 June 4, 2025 15:33
@stephenswat stephenswat added the refactor refactoring the current codes label Jun 4, 2025
const transform3_type& trf3, const mask_t& mask,
const bound_parameters_vector<algebra_type>& bound_vec) {
template <typename frame_t>
requires std::is_object_v<typename frame_t::loc_point>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, this is a piece of old code. Would you mind checking, if this can be changed to

Suggested change
requires std::is_object_v<typename frame_t::loc_point>
requires concepts::point<typename frame_t::loc_point>

?

const bound_parameters_vector<algebra_type>& bound_vec) {
template <typename frame_t>
requires std::is_object_v<typename frame_t::loc_point>
DETRAY_HOST_DEVICE static inline void bound_to_free_jacobian_step_1(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing that not all of the the sub-steps here are supposed to be part of the public API of the class? If not can we move them either to the private section or otherwise give them more instructive names?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately these function do need to be in the public API, as the parameter transporter must be able to use them separately. But yes, I'll name them better.

sf.transform(gctx), bound_to_free_jacobian, vol_mat_ptr,
propagation._stepping);
auto free_to_bound_jacobian =
sf.template visit_mask<get_free_to_bound_jacobian_kernel>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sf.template visit_mask<get_free_to_bound_jacobian_kernel>(
sf.free_to_bound_jacobian(gctx, stepping());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this performs the whole computation, but it needs to be split up into these smaller parts, so this will not work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now! This does not actually retrieve the free_to_bound_jacobian? I might have to double down on finding better names, please 😂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅

@niermann999 niermann999 added the performance Improvements to compute performance label Jun 4, 2025
Currently, the covariance transport uses the Jacobian engine which is
templated on the frame type. While this makes the code easier to read
and write, it requires the compiler to duplicate a lot of code for each
of the frame types, and this duplicated code counts as multiple branches
for the sake of GPU execution. Thus, this templating increases the
amount of thread divergence.

This commit refactors the Jacobian engine into smaller parts, some of
which are templated on the frame type and some of which are not. Client
code and then take a more fine-grained approach to branching and improve
divergence.
@stephenswat stephenswat force-pushed the refactor/non-branching-transporter branch from 6ae62f1 to 4193e1c Compare June 5, 2025 12:05
Copy link

sonarqubecloud bot commented Jun 5, 2025

@stephenswat
Copy link
Member Author

Results from integrating this into traccc:

Before this PR

Number of PTX lines per kernel

  252383 build_rel/fit_forward_default_detector.ptx
  238522 build_rel/fit_backward_default_detector.ptx
  490905 total

Kernel performance

image

After this PR

  234333 build_rel/fit_forward_default_detector.ptx
  220478 build_rel/fit_backward_default_detector.ptx
  454811 total

Kernel performance

image

Comparison

According to NSight Compute, for the fit_forward kernel, the number of cycles on which a warp is eligible increases from 7.27% to 8.05%. The average number of threads per warp (measuring divergence) increases from 5.22 to 5.74. Because the code shrinks by ~10% and because this happens in the hottest parts, the number of stalls due to instruction cache misses decreases significantly:

image

@stephenswat
Copy link
Member Author

I'll probably finish this PR when I get home from holiday, but the results are promising. Makes me wonder where else we can improve performance in this way. 😄

@niermann999
Copy link
Contributor

I'll probably finish this PR when I get home from holiday, but the results are promising. Makes me wonder where else we can improve performance in this way. 😄

I tried to reformulate the intersectors a while back to be called per local coordinate system instead of per shape, but the cylinders need access to their mask... I think it is possible to work around that, but I have not made up my mind yet how best to do that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Improvements to compute performance refactor refactoring the current codes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants