Refactor vector type to reduce build times #3641

cgmillette · 2026-01-23T17:20:00Z

Proposed changes

Build times are affected by many factors, and much of the cost comes from how we write and use the code. Two critical phases of a build are frontend parsing and backend codegen and compilation.

Frontend Parsing

The length of the code, the include header tree, and macro expansions all affect frontend parsing time.
This PR reduces the parsing time of the vector_type class in dtype_vector.hpp by generalizing redundant code.

Partial specializations of vector_type for native and non-native datatypes have been generalized into a single class, consolidating all of the data initialization and AsType casting requirements in one place.
The nnvb_data_t_selector class (i.e., non-native vector base dataT selector) has been removed and replaced with scalar_type instantiations, as the two serve the same purpose: scalar_type already maps generalized datatypes to native types compatible with ext_vector_t.

Backend Codegen

Template instantiation behavior can also affect build times. Recursive instantiations are much slower than concrete ones. The compiler must make multiple passes to expand template instantiations, so we need to be careful about how they are used.

Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>.

template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};

Upon further inspection, StaticallyIndexedArray is built on top of a recursive Tuple concatenation.

template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};
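
To make the cost of this pattern concrete, here is a minimal sketch of the same halving recursion. It is an illustration only: std::tuple stands in for the library's Tuple, and the names tuple_concat_sketch and RecursiveArrayImplSketch are hypothetical. Every distinct N pulls in further instantiations for N / 2 and N - N / 2, plus one concatenation per level.

```cpp
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for tuple_concat: joins two std::tuple types.
template <typename A, typename B>
struct tuple_concat_sketch;

template <typename... As, typename... Bs>
struct tuple_concat_sketch<std::tuple<As...>, std::tuple<Bs...>>
{
    using type = std::tuple<As..., Bs...>;
};

// Recursive halving: each N instantiates the two halves and one concatenation.
template <typename T, int N>
struct RecursiveArrayImplSketch
{
    using type = typename tuple_concat_sketch<
        typename RecursiveArrayImplSketch<T, N / 2>::type,
        typename RecursiveArrayImplSketch<T, N - N / 2>::type>::type;
};

// Base cases terminate the recursion.
template <typename T>
struct RecursiveArrayImplSketch<T, 0>
{
    using type = std::tuple<>;
};

template <typename T>
struct RecursiveArrayImplSketch<T, 1>
{
    using type = std::tuple<T>;
};

// N = 3 expands through N = 1 and N = 2 before the final type exists.
static_assert(std::is_same<RecursiveArrayImplSketch<int, 3>::type,
                           std::tuple<int, int, int>>::value,
              "three elements after two levels of concatenation");
```

A flat member such as `T data_[N]` needs none of these intermediate types, which is the motivation given for moving away from the recursive form.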

This union storage has been removed from the vector_type storage class.

Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays.

template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
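
As a rough illustration of what a concrete, array-backed container looks like, the sketch below uses a hypothetical StaticArraySketch type. It does not reproduce the elided members of StaticallyIndexedArray_v2; the At accessor and Size function are assumptions chosen for the example.

```cpp
#include <cstddef>

// Hypothetical array-backed container: one instantiation per (T, N), no recursion.
template <typename T, std::size_t N>
struct StaticArraySketch
{
    // Compile-time indexed read access with a bounds check.
    template <std::size_t I>
    constexpr const T& At() const
    {
        static_assert(I < N, "index out of range");
        return data_[I];
    }

    static constexpr std::size_t Size() { return N; }

    T data_[N];
};

// Aggregate initialization fills the C-style array directly.
constexpr StaticArraySketch<int, 4> sketch_values{{1, 2, 3, 4}};

static_assert(sketch_values.At<2>() == 3, "element 2 holds the third value");
static_assert(sizeof(StaticArraySketch<float, 4>) == 4 * sizeof(float),
              "storage is exactly N elements");
```

Because the storage is a plain array, the layout is easy to reason about, which matters for the bool case described under Fixes.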

Fixes

Using the bool datatype with vector_type was previously error-prone. Bool, as a native datatype, would be stored as bool ext_vector_type(N), which is a packed datatype. For example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. A union of these datatypes slices the data incorrectly: the bit locations of the packed bool do not match those of the StaticallyIndexedArray member. As such, vector_type now uses C-style array storage for bool instead of ext_vector_type.

template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0

sizeof(StaticallyIndexedArray<NativeVectorT<bool, 1>, 4>) == 4  (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element2 = bit 0 of byte 2
element3 = bit 0 of byte 3

union{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray<NativeVectorT<bool, 1>, 4> d4x1;
};

// union size == 4, but the packed member occupies only 1 byte: invalid slicing!
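
The mismatch can be modeled in portable C++ without relying on ext_vector_type. The sketch below is an illustration under stated assumptions: PackedBool4Sketch and ByteBool4Sketch are hypothetical types that mimic the two layouts described above, and alias_packed_as_bytes models what the array member would observe if the remaining three bytes happened to be zero.

```cpp
#include <cstdint>

// Packed layout: one bit per element, as described for bool ext_vector_type(4).
struct PackedBool4Sketch
{
    std::uint8_t bits;
    constexpr bool get(int i) const { return ((bits >> i) & 1u) != 0; }
};

// Array layout: one byte per element.
struct ByteBool4Sketch
{
    std::uint8_t bytes[4];
    constexpr bool get(int i) const { return bytes[i] != 0; }
};

static_assert(sizeof(PackedBool4Sketch) == 1, "4 bools packed into 1 byte");
static_assert(sizeof(ByteBool4Sketch) == 4, "1 byte per bool");

// Models reading the packed byte through the array layout.
// Assumption: the other three bytes of the union are zero.
constexpr ByteBool4Sketch alias_packed_as_bytes(PackedBool4Sketch p)
{
    return ByteBool4Sketch{{p.bits, 0, 0, 0}};
}

// Only element 2 is set in the packed view (bit 2 of byte 0).
constexpr PackedBool4Sketch only_element2{0x04};

static_assert(only_element2.get(2), "packed view: element 2 is true");
static_assert(alias_packed_as_bytes(only_element2).get(0),
              "array view: element 0 appears true instead");
static_assert(!alias_packed_as_bytes(only_element2).get(2),
              "array view: element 2 appears false");
```

The two views disagree about which element is set, which is the slicing problem that the C-style array storage for bool avoids.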

next_power_of_two now handles the previously invalid cases of X < 2
Removed the redundant next_pow2 implementation
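
For illustration, one way to guard the X < 2 cases is sketched below. This is not the PR's implementation: next_power_of_two_sketch is a hypothetical name, the choice to return 1 for inputs below 2 is an assumption (the description does not state the returned value), and a loop replaces any count-leading-zeros builtin to keep the example portable.

```cpp
#include <cstdint>

// Smallest power of two that is >= x.
// Assumption: inputs below 2 map to 1 instead of hitting undefined behavior.
// Valid for x <= 0x40000000; larger inputs would overflow int32_t.
constexpr std::int32_t next_power_of_two_sketch(std::int32_t x)
{
    if(x <= 1)
    {
        return 1;
    }
    std::int32_t y = 1;
    while(y < x)
    {
        y <<= 1;
    }
    return y;
}

static_assert(next_power_of_two_sketch(0) == 1, "guarded edge case");
static_assert(next_power_of_two_sketch(1) == 1, "guarded edge case");
static_assert(next_power_of_two_sketch(5) == 8, "rounds up");
static_assert(next_power_of_two_sketch(64) == 64, "powers of two are unchanged");
```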

Additions

integer_log2_floor to math.hpp
is_power_of_two_integer to math.hpp

Build Time Analysis

Machine: banff-cyxtera-s78-2
Target: gfx942

Build Target	Threads	Frontend Parse Time (s)	Backend Codegen Time (s)	Total Time (s)	Commit ID
device_grouped_conv3d_fwd_bias_bnorm_clamp_instance	1				`2e08a7e`
device_grouped_conv3d_fwd_bias_bnorm_clamp_instance	1				`d9b5883`

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Copilot

Pull request overview

This pull request refactors the vector_type implementation to reduce build times by consolidating specialized template implementations into a generalized design. The changes aim to improve frontend parsing times by reducing redundant code and backend codegen times by replacing recursive template instantiations with concrete implementations.

Changes:

Generalizes vector_type partial specializations into a single class with helper structs (vector_type_storage, non_native_vector_base)
Replaces recursive StaticallyIndexedArray with concrete StaticallyIndexedArray_v2
Fixes bool datatype handling with vector_type to avoid data slicing issues
Adds new math utilities: integer_log2_floor and is_power_of_two_integer
Updates next_power_of_two to handle edge cases (X <= 1)
Introduces default scalar_type template specialization with unsigned _BitInt fallback

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
include/ck/utility/statically_indexed_array.hpp	Adds blank line (cosmetic change)
include/ck/utility/math.hpp	Fixes next_power_of_two for edge cases, adds integer_log2_floor and is_power_of_two_integer functions
include/ck/utility/data_type.hpp	Refactors scalar_type with default template and typename keywords, removes redundant next_pow2
include/ck/utility/dtype_vector.hpp	Major refactor: introduces NativeVectorT alias, vector_type_storage helper, generalized vector_type class with is_as_type_cast_valid validation, consolidates non_native_vector_base specializations



Copilot · 2026-01-23T17:26:07Z

include/ck/utility/data_type.hpp

+template <typename T>
+struct scalar_type
+{
+    // Basic data type mapping to unsigned _BitInt of appropriate size
+    using type                           = unsigned _BitInt(8 * sizeof(T));
+    static constexpr index_t vector_size = 1;
+};


The introduction of a default template specialization for scalar_type that uses unsigned _BitInt(8 * sizeof(T)) is a significant change. While this provides a sensible fallback for types without explicit specializations, it may cause issues with certain types. For example, pointer types, function types, or complex class types would get this BitInt mapping which may not be the intended behavior. The previous design likely intentionally required explicit specializations to ensure only valid types were used. Consider either documenting this behavior more explicitly or adding static assertions to catch problematic type usages at compile time.
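
A compile-time guard of the kind suggested here could look like the sketch below. It is an assumption-laden illustration: scalar_type_sketch and uint_of_size_sketch are hypothetical names, fixed-width unsigned integers stand in for unsigned _BitInt so the example compiles without that extension, and the particular constraints are one possible choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: unsigned integer with the given byte width.
// Stand-in for unsigned _BitInt(8 * sizeof(T)).
template <std::size_t Bytes>
struct uint_of_size_sketch;
template <>
struct uint_of_size_sketch<1> { using type = std::uint8_t; };
template <>
struct uint_of_size_sketch<2> { using type = std::uint16_t; };
template <>
struct uint_of_size_sketch<4> { using type = std::uint32_t; };
template <>
struct uint_of_size_sketch<8> { using type = std::uint64_t; };

// Primary template with guards that reject types the fallback was not meant for.
template <typename T>
struct scalar_type_sketch
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "fallback mapping expects a trivially copyable type");
    static_assert(!std::is_pointer<T>::value,
                  "pointer types should not use the fallback mapping");

    using type                       = typename uint_of_size_sketch<sizeof(T)>::type;
    static constexpr int vector_size = 1;
};

// Hypothetical 16-bit custom type that relies on the fallback.
struct half_like_sketch
{
    std::uint16_t bits;
};

static_assert(std::is_same<scalar_type_sketch<half_like_sketch>::type,
                           std::uint16_t>::value,
              "16-bit custom type maps to a 16-bit unsigned integer");
```

With guards like these, instantiating the fallback with a pointer type such as `int*` fails at compile time with a readable message rather than silently producing an integer mapping.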


Copilot · 2026-01-23T17:26:07Z

include/ck/utility/math.hpp

+    // TODO: x need to be 1 ~ 0x7fffffff
+    // __builtin_clz will produce unexpected result if x is 0;
+    return 31 - __builtin_clz(x);
+}
+
+__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
+{
+    // TODO: x need to be 1 ~ 0x7fffffff
+    return x == (1 << integer_log2_floor(x));


The is_power_of_two_integer function has a potential issue when x equals 1. When x is 1, integer_log2_floor(1) returns 0 (since 31 - __builtin_clz(1) = 31 - 31 = 0), and then (1 << 0) = 1, so the function correctly returns true. However, this edge case behavior should be verified. Additionally, the function can return incorrect results for x <= 0 per the TODO on line 244, but there's no runtime check to prevent such inputs. Consider adding a static_assert or runtime check if this function is called with compile-time constant expressions.

Suggested change

-    // TODO: x need to be 1 ~ 0x7fffffff
-    // __builtin_clz will produce unexpected result if x is 0;
-    return 31 - __builtin_clz(x);
-}
-
-__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
-{
-    // TODO: x need to be 1 ~ 0x7fffffff
-    return x == (1 << integer_log2_floor(x));
+    // x needs to be in the range [1, 0x7fffffff].
+    // __builtin_clz is undefined for x <= 0; guard against that here.
+    return (x > 0) ? (31 - __builtin_clz(static_cast<uint32_t>(x))) : int32_t{-1};
+}
+
+__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
+{
+    // x needs to be in the range [1, 0x7fffffff]. For x <= 0, this is not a power of two.
+    return (x > 0) && (x == (int32_t(1) << integer_log2_floor(x)));
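
The edge cases raised in this comment can be checked at compile time. The sketch below is a host-only approximation, not the code from math.hpp: the _sketch names are hypothetical, the __host__ __device__ qualifiers are dropped, and a shift loop stands in for the `31 - __builtin_clz(x)` expression so the example does not depend on compiler builtins.

```cpp
#include <cstdint>

// Floor of log2(x) for x > 0; returns -1 for x <= 0, mirroring the guarded suggestion.
constexpr std::int32_t integer_log2_floor_sketch(std::int32_t x)
{
    if(x <= 0)
    {
        return -1;
    }
    std::int32_t result = 0;
    while(x > 1)
    {
        x >>= 1;
        ++result;
    }
    return result;
}

// The (x > 0) check short-circuits, so the shift never sees a negative count.
constexpr bool is_power_of_two_integer_sketch(std::int32_t x)
{
    return (x > 0) && (x == (std::int32_t{1} << integer_log2_floor_sketch(x)));
}

static_assert(integer_log2_floor_sketch(1) == 0, "log2(1) is 0");
static_assert(is_power_of_two_integer_sketch(1), "1 == 2^0");
static_assert(!is_power_of_two_integer_sketch(0), "0 is rejected by the guard");
static_assert(!is_power_of_two_integer_sketch(-4), "negatives are rejected");
static_assert(is_power_of_two_integer_sketch(0x40000000), "2^30");
static_assert(!is_power_of_two_integer_sketch(0x7fffffff), "2^31 - 1 is not a power of two");
```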


cgmillette added 2 commits January 23, 2026 16:23

Refactor vector_type to reduce build time

91b85f3

clang formatting fixes

b36eb81

cgmillette requested a review from Copilot January 23, 2026 17:20

Copilot started reviewing on behalf of cgmillette January 23, 2026 17:20

Copilot AI reviewed Jan 23, 2026


cgmillette and others added 3 commits January 23, 2026 11:03

Apply suggestions from code review

640863d

Co-authored-by: Copilot <[email protected]>

Fixes clang formatting and adjust storage class type check

d9b5883

Re-home NativeVectorT and ensure partial specialization with scalar_type is visible where needed

401f3ea
