Collaborator

@cgmillette cgmillette commented Jan 23, 2026

Proposed changes

Build times are affected by many factors and depend heavily on how we write and use the code. Two critical stages of a build are frontend parsing and backend codegen and compilation.

Frontend Parsing

The length of the code, the size of the include header tree, and macro expansions all affect frontend parsing time.
This PR aims to reduce the parsing time of the vector_type class in dtype_vector.hpp by generalizing redundant code.

  • Partial specializations of vector_type for native and non-native datatypes have been generalized into a single class, consolidating all of the data initialization and AsType casting requirements in one place.
  • The nnvb_data_t_selector class (i.e., the non-native vector base dataT selector) has been removed and replaced with scalar_type instantiations, since the two serve the same purpose: scalar_type already maps generalized datatypes to native types compatible with ext_vector_t.
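The consolidation described above can be pictured with a minimal sketch. Everything here is a hypothetical stand-in rather than the code in dtype_vector.hpp: scalar_type_sketch, half_like, and vector_type_sketch are illustrative names, and the validity rule inside is_as_type_cast_valid (the requested type must tile the storage exactly) is an assumption about what such a check might enforce.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for scalar_type: maps a datatype to a native storage
// type. Native types map to themselves.
template <typename T>
struct scalar_type_sketch
{
    using type = T;
};

// Hypothetical non-native 16-bit type, stored as its raw bit pattern.
struct half_like
{
    std::uint16_t bits;
};

template <>
struct scalar_type_sketch<half_like>
{
    using type = std::uint16_t;
};

// One primary template covers every T and N, so no per-size or per-category
// partial specializations have to be parsed.
template <typename T, std::size_t N>
struct vector_type_sketch
{
    using data_t = typename scalar_type_sketch<T>::type;

    // Assumed rule: an AsType-style view is valid only if X tiles the storage.
    template <typename X>
    static constexpr bool is_as_type_cast_valid()
    {
        return sizeof(X) <= sizeof(data_t) * N && (sizeof(data_t) * N) % sizeof(X) == 0;
    }

    data_t data_[N];
};

static_assert(std::is_same<vector_type_sketch<half_like, 8>::data_t, std::uint16_t>::value,
              "non-native types are stored through their scalar mapping");
static_assert(vector_type_sketch<std::uint32_t, 4>::is_as_type_cast_valid<std::uint64_t>(),
              "16 bytes can be viewed as two 8-byte elements");
```

The point of the sketch is only that a single definition replaces a family of near-identical specializations, which is where the parsing savings would come from.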

Backend Codegen

Template instantiation behavior also affects build times. Recursive instantiations are much slower than concrete instantiations: the compiler must make multiple passes to expand them, so we need to be careful about where they are used.

  • Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>.
template <typename T>
struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>>
{
    using d1_t = T;
    typedef T d2_t __attribute__((ext_vector_type(2)));
    typedef T d4_t __attribute__((ext_vector_type(4)));

    using type = d4_t;

    union
    {
        d4_t d4_;
        StaticallyIndexedArray<d1_t, 4> d1x4_;
        StaticallyIndexedArray<d2_t, 2> d2x2_;
        StaticallyIndexedArray<d4_t, 1> d4x1_;
    } data_;
   ...
};
  • Upon further inspection, StaticallyIndexedArray is built on top of a recursive Tuple concatenation.
template <typename T, index_t N>
struct StaticallyIndexedArrayImpl
{
    using type =
        typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type,
                              typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type;
};

This union storage has been removed from the vector_type storage class.

  • Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays.
template <typename T, index_t N>
struct StaticallyIndexedArray_v2
{
    ...

    T data_[N];
};
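To make the contrast concrete, here is a small sketch of the two styles. It uses std::tuple and std::tuple_cat as stand-ins for the library's Tuple and tuple_concat, and the names recursive_array_impl and concrete_array are hypothetical. The recursive builder has to instantiate a helper type for every distinct size it passes through, plus the concatenation machinery, while the array-backed struct is a single instantiation.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Recursive style: the type for N elements is assembled from the types for
// N / 2 and N - N / 2 elements, so each N triggers a chain of instantiations.
template <typename T, std::size_t N>
struct recursive_array_impl
{
    using type = decltype(std::tuple_cat(
        std::declval<typename recursive_array_impl<T, N / 2>::type>(),
        std::declval<typename recursive_array_impl<T, N - N / 2>::type>()));
};

// Base cases that terminate the recursion.
template <typename T>
struct recursive_array_impl<T, 0>
{
    using type = std::tuple<>;
};

template <typename T>
struct recursive_array_impl<T, 1>
{
    using type = std::tuple<T>;
};

// Concrete style: one instantiation per (T, N), backed by a C-style array.
template <typename T, std::size_t N>
struct concrete_array
{
    T data_[N];

    constexpr const T& operator[](std::size_t i) const { return data_[i]; }
};

// Both describe four ints; only the way the compiler gets there differs.
static_assert(std::is_same<recursive_array_impl<int, 4>::type,
                           std::tuple<int, int, int, int>>::value,
              "recursive builder yields a 4-element tuple");
static_assert(sizeof(concrete_array<int, 4>) == 4 * sizeof(int),
              "array-backed storage has no overhead");

constexpr concrete_array<int, 4> example{{1, 2, 3, 4}};
static_assert(example[2] == 3, "elements are directly indexable");
```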

Fixes

  • Using the bool datatype with vector_type was previously error prone. As a native datatype, bool was stored as bool ext_vector_type(N), which is a packed datatype. For example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. A union of these datatypes slices the data incorrectly: the bit locations of the packed bools do not match those of the StaticallyIndexedArray member. As such, vector_type now uses C-style array storage for bool instead of ext_vector_type.
template <typename T, index_t Rank>
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));

sizeof(NativeVectorT<bool, 4>) == 1  (1 byte per 4 bool - packed)
element0 = bit 0 of byte 0
element1 = bit 1 of byte 0
element2 = bit 2 of byte 0
element3 = bit 3 of byte 0

sizeof(StaticallyIndexedArray<NativeVectorT<bool, 1>, 4>) == 4  (1 byte per bool)
element0 = bit 0 of byte 0
element1 = bit 0 of byte 1
element2 = bit 0 of byte 2
element3 = bit 0 of byte 3

union{
    NativeVectorT<bool, 4> d1_t;
    ...
    StaticallyIndexedArray<NativeVectorT<bool,1>, 4> d4x1;
};

// union size == 4 which means invalid slicing!
  • Math utilities such as next_power_of_two now handle the previously invalid cases of X < 2
  • Remove redundant implementation of next_pow2
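The bool slicing problem from the first fix can be modeled in portable C++ without relying on ext_vector_type. This is a sketch under stated assumptions: packed_bool4_model stands in for the packed bool ext_vector_type(4), array_bool4_model stands in for the one-byte-per-element member, and view_first_byte_as_packed models what a read through the packed union member would see in byte 0, assuming true is stored as the byte 0x01. All three names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Model of the packed layout: element i lives in bit i of a single byte.
struct packed_bool4_model
{
    std::uint8_t bits;

    constexpr bool get(std::size_t i) const { return ((bits >> i) & 1u) != 0; }
};

// Model of the unpacked layout: element i lives in byte i.
struct array_bool4_model
{
    bool data_[4];

    constexpr bool get(std::size_t i) const { return data_[i]; }
};

// Models reading the unpacked storage through the packed view: only byte 0 is
// visible, and (by assumption) a true element contributes the value 0x01.
constexpr packed_bool4_model view_first_byte_as_packed(const array_bool4_model& a)
{
    return packed_bool4_model{static_cast<std::uint8_t>(a.data_[0] ? 1 : 0)};
}

constexpr array_bool4_model unpacked{{true, true, false, true}};
constexpr packed_bool4_model aliased = view_first_byte_as_packed(unpacked);

static_assert(sizeof(packed_bool4_model) == 1, "four elements share one byte");
static_assert(sizeof(array_bool4_model) == 4 * sizeof(bool), "one element per bool");
static_assert(aliased.get(0) == unpacked.get(0), "element 0 happens to agree");
static_assert(unpacked.get(1) && !aliased.get(1), "element 1 is lost through the packed view");
```

Only element 0 survives the round trip in this model, which is the mismatch the union exposed and the reason a C-style array is the safer storage for bool.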

Additions

  • integer_log2_floor to math.hpp
  • is_power_of_two_integer to math.hpp
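For readers who want to see the intended semantics of these utilities, the following is a portable sketch (C++14 or later) rather than the code added to math.hpp. The _sketch names are hypothetical, the loop and the x & (x - 1) test are swapped-in techniques that avoid compiler builtins, and the choices to return -1 for non-positive inputs and to round everything below 2 up to 1 are assumptions about reasonable edge-case handling.

```cpp
#include <cstdint>

// floor(log2(x)) for x >= 1. Assumed sentinel: -1 for x <= 0.
constexpr std::int32_t integer_log2_floor_sketch(std::int32_t x)
{
    std::int32_t log = -1;
    while(x > 0)
    {
        x >>= 1;
        ++log;
    }
    return log;
}

// A positive integer is a power of two exactly when it has a single set bit,
// which is what clearing the lowest set bit with x & (x - 1) tests.
constexpr bool is_power_of_two_integer_sketch(std::int32_t x)
{
    return x > 0 && (x & (x - 1)) == 0;
}

// Smallest power of two that is >= x.
// Assumed edge-case handling: every x below 2 yields 1.
// Assumed precondition: x <= 2^30, so the result fits in int32_t.
constexpr std::int32_t next_power_of_two_sketch(std::int32_t x)
{
    if(x <= 1)
    {
        return 1;
    }
    return std::int32_t{1} << (integer_log2_floor_sketch(x - 1) + 1);
}

static_assert(integer_log2_floor_sketch(1) == 0, "log2(1) is 0");
static_assert(integer_log2_floor_sketch(1023) == 9, "floor(log2(1023)) is 9");
static_assert(is_power_of_two_integer_sketch(64), "64 is a power of two");
static_assert(!is_power_of_two_integer_sketch(0), "0 is not a power of two");
static_assert(next_power_of_two_sketch(0) == 1, "inputs below 2 round up to 1");
static_assert(next_power_of_two_sketch(5) == 8, "5 rounds up to 8");
```

Because every function is constexpr, invalid uses that would overflow show up as compile errors when the arguments are compile-time constants.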

Build Time Analysis

Machine: banff-cyxtera-s78-2
Target: gfx942

| Build Target | Threads | Frontend Parse Time (s) | Backend Codegen Time (s) | TotalTime (s) | commitId |
|---|---|---|---|---|---|
| device_grouped_conv3d_fwd_bias_bnorm_clamp_instance | 1 | | | | 2e08a7e |
| device_grouped_conv3d_fwd_bias_bnorm_clamp_instance | 1 | | | | d9b5883 |

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Contributor

Copilot AI left a comment

Pull request overview

This pull request refactors the vector_type implementation to reduce build times by consolidating specialized template implementations into a generalized design. The changes aim to improve frontend parsing times by reducing redundant code and backend codegen times by replacing recursive template instantiations with concrete implementations.

Changes:

  • Generalizes vector_type partial specializations into a single class with helper structs (vector_type_storage, non_native_vector_base)
  • Replaces recursive StaticallyIndexedArray with concrete StaticallyIndexedArray_v2
  • Fixes bool datatype handling with vector_type to avoid data slicing issues
  • Adds new math utilities: integer_log2_floor and is_power_of_two_integer
  • Updates next_power_of_two to handle edge cases (X <= 1)
  • Introduces default scalar_type template specialization with unsigned _BitInt fallback

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
|---|---|
| include/ck/utility/statically_indexed_array.hpp | Adds blank line (cosmetic change) |
| include/ck/utility/math.hpp | Fixes next_power_of_two for edge cases, adds integer_log2_floor and is_power_of_two_integer functions |
| include/ck/utility/data_type.hpp | Refactors scalar_type with default template and typename keywords, removes redundant next_pow2 |
| include/ck/utility/dtype_vector.hpp | Major refactor: introduces NativeVectorT alias, vector_type_storage helper, generalized vector_type class with is_as_type_cast_valid validation, consolidates non_native_vector_base specializations |

Comment on lines +41 to +47
template <typename T>
struct scalar_type
{
    // Basic data type mapping to unsigned _BitInt of appropriate size
    using type = unsigned _BitInt(8 * sizeof(T));
    static constexpr index_t vector_size = 1;
};

Copilot AI Jan 23, 2026

The introduction of a default (primary) template definition for scalar_type that uses unsigned _BitInt(8 * sizeof(T)) is a significant change. While this provides a sensible fallback for types without explicit specializations, it may cause issues with certain types. For example, pointer types or complex class types would silently receive this _BitInt mapping, which may not be the intended behavior. The previous design likely required explicit specializations on purpose, to ensure only valid types were used. Consider either documenting this behavior more explicitly or adding static assertions to catch problematic type usages at compile time.
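One way the suggested static assertions could look is sketched below. This is not the PR's code: scalar_type_sketch and uint_of_size are hypothetical names, the fixed-width integer mapping replaces unsigned _BitInt so the example compiles without that extension, and the two constraints shown (trivially copyable, not a pointer) are only an assumption about which types should be rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: picks an unsigned integer with the requested byte size.
// Left undefined for unsupported sizes so misuse fails at compile time.
template <std::size_t Bytes>
struct uint_of_size;

template <> struct uint_of_size<1> { using type = std::uint8_t; };
template <> struct uint_of_size<2> { using type = std::uint16_t; };
template <> struct uint_of_size<4> { using type = std::uint32_t; };
template <> struct uint_of_size<8> { using type = std::uint64_t; };

// Default mapping with guards: unexpected types are rejected with a readable
// message instead of silently receiving an integer mapping.
template <typename T>
struct scalar_type_sketch
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "scalar_type fallback expects a trivially copyable type");
    static_assert(!std::is_pointer<T>::value,
                  "scalar_type fallback is not meant for pointer types");

    using type = typename uint_of_size<sizeof(T)>::type;
    static constexpr int vector_size = 1;
};

// Hypothetical 8-bit float-like type that relies on the default mapping.
struct fp8_like
{
    std::uint8_t bits;
};

static_assert(std::is_same<scalar_type_sketch<fp8_like>::type, std::uint8_t>::value,
              "a 1-byte type maps to an 8-bit unsigned integer");
```

With guards like these, instantiating scalar_type_sketch<int*> stops compilation at the assertion rather than producing a mapping nobody asked for.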

Copilot uses AI. Check for mistakes.
Comment on lines +237 to +245
    // TODO: x need to be 1 ~ 0x7fffffff
    // __builtin_clz will produce unexpected result if x is 0;
    return 31 - __builtin_clz(x);
}

__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
{
    // TODO: x need to be 1 ~ 0x7fffffff
    return x == (1 << integer_log2_floor(x));

Copilot AI Jan 23, 2026

The is_power_of_two_integer function handles x == 1 correctly: integer_log2_floor(1) returns 0 (since 31 - __builtin_clz(1) = 31 - 31 = 0), and (1 << 0) == 1, so the function returns true. However, both functions can return incorrect results for x <= 0, as the TODO on line 244 notes, and nothing prevents such inputs. Consider guarding against x <= 0 inside the functions, or adding a static_assert where they are called with compile-time constant expressions.

Suggested change
-     // TODO: x need to be 1 ~ 0x7fffffff
-     // __builtin_clz will produce unexpected result if x is 0;
-     return 31 - __builtin_clz(x);
- }
- __host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
- {
-     // TODO: x need to be 1 ~ 0x7fffffff
-     return x == (1 << integer_log2_floor(x));
+     // x needs to be in the range [1, 0x7fffffff].
+     // __builtin_clz is undefined for x <= 0; guard against that here.
+     return (x > 0) ? (31 - __builtin_clz(static_cast<uint32_t>(x))) : int32_t{-1};
+ }
+ __host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
+ {
+     // x needs to be in the range [1, 0x7fffffff]. For x <= 0, this is not a power of two.
+     return (x > 0) && (x == (int32_t(1) << integer_log2_floor(x)));
