Skip to content

Conversation

@schnellerhase
Copy link
Contributor

@schnellerhase schnellerhase commented Apr 27, 2025

In performance critical parts some block sizes are optimized for by compiling explicit versions with the block size being provided as a compile time constant. At the same time general runtime block sizes are supported through an argument to these functions.

This causes

  1. Code duplication: one path for the runtime and one for the compile time definitions of the block sizes, and
  2. duplicate input of the block sizes: once as template argument once as argument (matching of both is only asserted does not raise in Release)

Introduces a BlockSize concept that either holds a runtime int or a compile time std::integral_constant<int, bs> which allows to generate code paths explicitly for certain sizes, while maintaining a shared code path in both cases.

This is based on a more general concept of an optionally compile time valued ConstexprType<T, V>. It stores a value of type T in the container type V. If runtime valued, then T = V. For compile time, usually T = std::integral_constant<T, ...>.

Future applications:

  • form packing optimizes for block sizes 1,2,3 - vector assembly for 1,3: is this miss match intentional?
  • matrix operation routines (in particular, custom CSR assembler routines for arbitrary block sizes, ref
    "assemble_matrix",
    [](dolfinx::la::MatrixCSR<T>& A, const dolfinx::fem::Form<T, U>& a,
    nb::ndarray<const T, nb::ndim<1>, nb::c_contig> constants,
    const std::map<std::pair<dolfinx::fem::IntegralType, int>,
    nb::ndarray<const T, nb::ndim<2>, nb::c_contig>>&
    coefficients,
    const std::vector<const dolfinx::fem::DirichletBC<T, U>*>& bcs)
    {
    std::vector<
    std::reference_wrapper<const dolfinx::fem::DirichletBC<T, U>>>
    _bcs;
    for (auto bc : bcs)
    {
    assert(bc);
    _bcs.push_back(*bc);
    }
    // Get index map block size. Note that mixed-topology meshes
    // will have multiple DOF maps, but the block sizes are the same.
    const std::array<int, 2> data_bs
    = {a.function_spaces().at(0)->dofmaps(0)->index_map_bs(),
    a.function_spaces().at(1)->dofmaps(0)->index_map_bs()};
    if (data_bs[0] != data_bs[1])
    {
    throw std::runtime_error(
    "Non-square blocksize unsupported in Python");
    }
    if (data_bs[0] == 1)
    {
    dolfinx::fem::assemble_matrix(
    A.mat_add_values(), a,
    std::span<const T>(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 2)
    {
    auto mat_add = A.template mat_add_values<2, 2>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 3)
    {
    auto mat_add = A.template mat_add_values<3, 3>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 4)
    {
    auto mat_add = A.template mat_add_values<4, 4>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 5)
    {
    auto mat_add = A.template mat_add_values<5, 5>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 6)
    {
    auto mat_add = A.template mat_add_values<6, 6>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 7)
    {
    auto mat_add = A.template mat_add_values<7, 7>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 8)
    {
    auto mat_add = A.template mat_add_values<8, 8>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else if (data_bs[0] == 9)
    {
    auto mat_add = A.template mat_add_values<9, 9>();
    dolfinx::fem::assemble_matrix(
    mat_add, a, std::span(constants.data(), constants.size()),
    dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
    }
    else
    throw std::runtime_error("Block size not supported in Python");
    . )

@jhale
Copy link
Member

jhale commented Apr 27, 2025

Looks very nice. Could we review the basic approach before you spend lots more time on it?

@schnellerhase
Copy link
Contributor Author

Sure thing. Should be good to go as is and can be extended further when approved. One neat byproduct, that these changes would allow for, are non compile time sized operations on the MatrixCSR which we are currently missing.

@schnellerhase schnellerhase marked this pull request as ready for review April 27, 2025 18:50
@chrisrichardson chrisrichardson self-requested a review April 28, 2025 15:26
Copy link
Contributor

@chrisrichardson chrisrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good.

@garth-wells
Copy link
Member

Looks really neat.

  • Should the name be more generic, it's basically a runtime or templated integer. I can think of applications outside of block size, e.g. geometric dimension, where it could be useful.
  • Should it support different integer types?
  • Could tests be added to check that when it's a compile time integer that it really is a compiler time integer?

@schnellerhase
Copy link
Contributor Author

For points 1 and 2 that should be no problem - how about: ConstexprType as name for the general concept?

Regarding 3: the interface to retrieve the value (here block_size) needs to be able to produce both a runtime value and a compile time value. Therefore it can not be marked constexpr. Testing for in lining of the compile time variant is also not straight forward as this remains in all cases a compiler decision. Best way to check for its effect, I assume, would be with a benchmark of those cases.

@garth-wells
Copy link
Member

Regarding 3: the interface to retrieve the value (here block_size) needs to be able to produce both a runtime value and a compile time value. Therefore it can not be marked constexpr. Testing for in lining of the compile time variant is also not straight forward as this remains in all cases a compiler decision. Best way to check for its effect, I assume, would be with a benchmark of those cases.

I don't like relying on the compiler to inline things that we know are known at compile time. We have avoided this in the past and preferred being explicit over relying on the compiler and then not knowing what the compiler does.

@schnellerhase
Copy link
Contributor Author

It would be best if the block_size/value function would be constexpr for the compile time case. I will try if I can recover that behaviour.

@schnellerhase
Copy link
Contributor Author

schnellerhase commented Apr 30, 2025

It think I have a fix: value(ConxtexprType<T, V>) is now constexpr for is_compile_v<T, V> == True and otherwise not. The test case showcases that we can assert during compile time now. (Block size is not yet adapted).

@schnellerhase schnellerhase added the enhancement New feature or request label Dec 8, 2025
@jhale
Copy link
Member

jhale commented Dec 15, 2025

I made a further tweaks to the concept.

@garth-wells
Copy link
Member

I'm not convinced by this approach. The block size in the assemblers is performance critical and what's happening should be explicit, with no room relying on what a compiler might do. For me this PR is not explicit and relies too much on what the compiler might do.

@schnellerhase
Copy link
Contributor Author

We extract the block sizes with the block_size as

auto bs = block_size(_bs)

inferring the type from the returned type. block_size has signatures

  • int block_size(V bs) for the runtime, and
  • consteval int block_size(V bs) in the compile time valued case.

The compile time valued version thus guarantees evaluation to a constant. Making this code identical to the previous one. The only possible risk is for the runtime version, where the inlining of

int (int x) {return x;}

needs to happen for an identical path to before.

Could we run some benchmarks to confirm this happens?

@schnellerhase
Copy link
Contributor Author

schnellerhase commented Dec 16, 2025

Quick testing at https://godbolt.org/z/dze1aq9br yielded we should definitely mark the runtime version inline, then -O1 and above yields inlined version. Adding __attribute__((always_inline)) yields inlined version at -O0.

@jhale
Copy link
Member

jhale commented Dec 16, 2025

@garth-wells I've reviewed the code quite carefully and I don't think that e.g. BS<1> can lead to any other behaviour than the constant being compile time evaluated, ultimately due to the use of consteval. Nonetheless I take your point that the current way is more explicit, in the sense that it is 'easy' to see that it leads to the behaviour we desire. That said, the proposed method really does cut down on code duplication and templating.

What could @schnellerhase do in terms of performance tests or additional unit tests that might persuade you that it does work?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants