Introduce `BlockSize` #3716

schnellerhase · 2025-04-27T15:32:22Z

In performance-critical code paths, some block sizes are optimized for by compiling explicit versions in which the block size is provided as a compile-time constant. At the same time, general runtime block sizes are supported through an argument to these functions.

This causes:

- code duplication: one path for the runtime and one for the compile-time definition of the block size, and
- duplicate input of the block size: once as a template argument and once as a function argument (that the two match is only asserted, so a mismatch does not raise in Release builds).
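A minimal sketch of the pattern being criticised, assuming a hypothetical kernel `sum_first_block` (not DOLFINx code). The block size arrives twice, and the two loops differ only in where the bound comes from:

```cpp
#include <cassert>
#include <vector>

// Hypothetical kernel, not taken from DOLFINx. BS > 0 selects a
// compile-time block size; the default BS = -1 means "use the runtime bs".
template <int BS = -1>
int sum_first_block(const std::vector<int>& x, int bs)
{
  if constexpr (BS > 0)
  {
    // The block size is passed a second time as an argument. Agreement is
    // only checked by assert, which is compiled out when NDEBUG is defined.
    assert(bs == BS);
    int s = 0;
    for (int k = 0; k < BS; ++k) // trip count known at compile time
      s += x[k];
    return s;
  }
  else
  {
    // Duplicated loop body, this time with a runtime bound.
    int s = 0;
    for (int k = 0; k < bs; ++k)
      s += x[k];
    return s;
  }
}
```

A call such as `sum_first_block<2>(x, 2)` takes the compile-time path, while `sum_first_block(x, 3)` takes the runtime path.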

This PR introduces a `BlockSize` concept that holds either a runtime `int` or a compile-time `std::integral_constant<int, bs>`. This allows code paths to be generated explicitly for certain sizes while maintaining a shared code path for both cases.

This is based on a more general concept of an optionally compile-time-valued `ConstexprType<T, V>`, which stores a value of type `T` in the container type `V`. If runtime valued, `V = T`. If compile-time valued, usually `V = std::integral_constant<T, ...>`.
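The idea can be sketched as follows. This is not the PR's actual definition: the trait `is_compile_time_v` and the kernel `block_sum` are hypothetical names, and a C++17 trait stands in for the C++20 concept so that the sketch stays short.

```cpp
#include <type_traits>

// Hypothetical trait: true when the container type V is
// std::integral_constant<T, v> for some value v.
template <typename T, typename V>
struct is_compile_time : std::false_type {};

template <typename T, T v>
struct is_compile_time<T, std::integral_constant<T, v>> : std::true_type {};

template <typename T, typename V>
inline constexpr bool is_compile_time_v = is_compile_time<T, V>::value;

// Hypothetical kernel with one body for both cases. V is either int
// (runtime block size) or std::integral_constant<int, N> (compile time).
template <typename V>
int block_sum(const int* x, V bs)
{
  static_assert(std::is_same_v<V, int> || is_compile_time_v<int, V>,
                "V must be int or std::integral_constant<int, N>");
  // std::integral_constant converts implicitly to its value, so the same
  // initialisation works for both containers.
  const int n = bs;
  int s = 0;
  for (int k = 0; k < n; ++k)
    s += x[k];
  return s;
}
```

Callers choose the variant through the argument type alone: `block_sum(x, 3)` for a runtime size, `block_sum(x, std::integral_constant<int, 2>{})` for a compile-time one.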

Future applications:

- form packing optimizes for block sizes 1, 2, 3, while vector assembly optimizes for 1, 3: is this mismatch intentional?

- matrix operation routines (in particular, custom CSR assembler routines for arbitrary block sizes; see dolfinx/python/dolfinx/wrappers/assemble.cpp, lines 348 to 441 in f1daede, quoted below)

      "assemble_matrix",
      [](dolfinx::la::MatrixCSR<T>& A, const dolfinx::fem::Form<T, U>& a,
         nb::ndarray<const T, nb::ndim<1>, nb::c_contig> constants,
         const std::map<std::pair<dolfinx::fem::IntegralType, int>,
                        nb::ndarray<const T, nb::ndim<2>, nb::c_contig>>&
             coefficients,
         const std::vector<const dolfinx::fem::DirichletBC<T, U>*>& bcs)
      {
        std::vector<
            std::reference_wrapper<const dolfinx::fem::DirichletBC<T, U>>>
            _bcs;
        for (auto bc : bcs)
        {
          assert(bc);
          _bcs.push_back(*bc);
        }

        // Get index map block size. Note that mixed-topology meshes
        // will have multiple DOF maps, but the block sizes are the same.
        const std::array<int, 2> data_bs
            = {a.function_spaces().at(0)->dofmaps(0)->index_map_bs(),
               a.function_spaces().at(1)->dofmaps(0)->index_map_bs()};

        if (data_bs[0] != data_bs[1])
        {
          throw std::runtime_error(
              "Non-square blocksize unsupported in Python");
        }

        if (data_bs[0] == 1)
        {
          dolfinx::fem::assemble_matrix(
              A.mat_add_values(), a,
              std::span<const T>(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 2)
        {
          auto mat_add = A.template mat_add_values<2, 2>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 3)
        {
          auto mat_add = A.template mat_add_values<3, 3>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 4)
        {
          auto mat_add = A.template mat_add_values<4, 4>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 5)
        {
          auto mat_add = A.template mat_add_values<5, 5>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 6)
        {
          auto mat_add = A.template mat_add_values<6, 6>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 7)
        {
          auto mat_add = A.template mat_add_values<7, 7>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 8)
        {
          auto mat_add = A.template mat_add_values<8, 8>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else if (data_bs[0] == 9)
        {
          auto mat_add = A.template mat_add_values<9, 9>();
          dolfinx::fem::assemble_matrix(
              mat_add, a, std::span(constants.data(), constants.size()),
              dolfinx_wrappers::py_to_cpp_coeffs(coefficients), _bcs);
        }
        else
          throw std::runtime_error("Block size not supported in Python");
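One way such a chain of `else if` branches could be collapsed is a small dispatcher that maps a runtime block size onto a compile-time one. The sketch below is an illustration only: `dispatch_block_size` and `dispatch_impl` are hypothetical helpers, not part of DOLFINx or of this PR.

```cpp
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical helper: compares bs against 1..Max in turn and calls f with
// std::integral_constant<int, N> for the matching N.
template <typename F, int... Is>
void dispatch_impl(int bs, F& f, std::integer_sequence<int, Is...>)
{
  // Fold over ||: evaluation stops at the first candidate that matches.
  const bool found
      = ((bs == Is + 1 ? (f(std::integral_constant<int, Is + 1>{}), true)
                       : false)
         || ...);
  if (!found)
    throw std::runtime_error("Block size not supported");
}

// Entry point: Max is the largest block size that gets its own code path.
template <int Max, typename F>
void dispatch_block_size(int bs, F&& f)
{
  dispatch_impl(bs, f, std::make_integer_sequence<int, Max>{});
}
```

A caller passes a generic lambda, for example `dispatch_block_size<9>(bs, [&](auto n) { /* decltype(n)::value is a constant here */ });`. How the callback would be wired to `mat_add_values` in the wrapper is left open here and would need checking against the actual `MatrixCSR` interface.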

jhale · 2025-04-27T18:36:15Z

Looks very nice. Could we review the basic approach before you spend lots more time on it?

schnellerhase · 2025-04-27T18:49:26Z

Sure thing. It should be good to go as is and can be extended further once approved. One neat byproduct of these changes is that they would allow operations on the MatrixCSR whose block size is not fixed at compile time, which we are currently missing.

chrisrichardson

Looking good.

garth-wells · 2025-04-28T17:16:16Z

Looks really neat.

1. Should the name be more generic? It's basically a runtime or templated integer. I can think of applications outside of block size, e.g. geometric dimension, where it could be useful.
2. Should it support different integer types?
3. Could tests be added to check that when it's a compile-time integer, it really is a compile-time integer?

schnellerhase · 2025-04-28T17:56:29Z

For points 1 and 2 that should be no problem. How about `ConstexprType` as the name for the general concept?

Regarding 3: the interface to retrieve the value (here `block_size`) needs to be able to produce both a runtime value and a compile-time value, so it cannot be marked `constexpr`. Testing for inlining of the compile-time variant is also not straightforward, as this remains a compiler decision in all cases. The best way to check for its effect, I assume, would be a benchmark of those cases.

garth-wells · 2025-04-30T08:03:01Z

> Regarding 3: the interface to retrieve the value (here block_size) needs to be able to produce both a runtime value and a compile time value. Therefore it cannot be marked constexpr. Testing for inlining of the compile time variant is also not straightforward as this remains in all cases a compiler decision. Best way to check for its effect, I assume, would be with a benchmark of those cases.

I don't like relying on the compiler to inline things that we know are known at compile time. We have avoided this in the past and preferred being explicit over relying on the compiler and then not knowing what the compiler does.

schnellerhase · 2025-04-30T08:14:12Z

It would be best if the `block_size`/`value` function were `constexpr` in the compile-time case. I will see whether I can recover that behaviour.

schnellerhase · 2025-04-30T15:30:53Z

I think I have a fix: `value(ConstexprType<T, V>)` is now `constexpr` when `is_compile_v<T, V>` is true, and not otherwise. The test case shows that we can now assert at compile time. ~~(Block size is not yet adapted)~~.
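A rough sketch of how an accessor can be `constexpr` in the compile-time case only. It uses two overloads rather than the PR's concept, and `times_block` is a hypothetical caller added purely to show a compile-time assertion.

```cpp
#include <type_traits>

// Compile-time case: the value is part of the type, so the accessor can be
// constexpr and its result used in static_assert.
template <typename T, T v>
constexpr T value(std::integral_constant<T, v>) noexcept
{
  return v;
}

// Runtime case: a plain, deliberately non-constexpr function that returns
// the stored value.
template <typename T, std::enable_if_t<std::is_arithmetic_v<T>, int> = 0>
inline T value(T v) noexcept
{
  return v;
}

// The compile-time variant can be checked by the compiler itself.
static_assert(value(std::integral_constant<int, 3>{}) == 3,
              "compile-time value must be a constant expression");

// Hypothetical caller: asserts at compile time when the size is known then.
template <typename V>
int times_block(int n, V bs)
{
  if constexpr (!std::is_arithmetic_v<V>)
  {
    static_assert(value(V{}) > 0, "block size must be positive");
  }
  return n * value(bs);
}
```

With this shape, a test that places `value(...)` inside a `static_assert` fails to compile if the compile-time variant ever stops being a constant expression.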

This reverts commit 2652fb5.

cpp/dolfinx/common/constexpr_type.h

jhale · 2025-12-15T14:37:15Z

I made a few further tweaks to the concept.

cpp/dolfinx/common/constexpr_type.h

garth-wells · 2025-12-16T08:21:10Z

I'm not convinced by this approach. The block size in the assemblers is performance critical, and what's happening should be explicit, with no reliance on what a compiler might do. For me this PR is not explicit enough and relies too much on what the compiler might do.

schnellerhase · 2025-12-16T08:48:36Z

We extract the block size with `block_size`:

    auto bs = block_size(_bs)

where the type is inferred from the return type. `block_size` has the signatures

- `int block_size(V bs)` in the runtime case, and
- `consteval int block_size(V bs)` in the compile-time-valued case.

The compile-time-valued version thus guarantees evaluation to a constant, which makes this code identical to the previous one. The only possible risk is in the runtime version, where the inlining of

    int (int x) {return x;}

needs to happen for the code path to be identical to before.

Could we run some benchmarks to confirm this happens?
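A minimal sketch of the two overloads, under the assumption that the container is either `int` or `std::integral_constant<int, N>`. The macro `SKETCH_CONSTEVAL` and the caller `count_entries` are hypothetical. The macro only exists so that the snippet also builds where `consteval` is unavailable, in which case the compile-time guarantee is weakened to `constexpr`.

```cpp
#include <type_traits>

// consteval needs C++20; fall back to constexpr for older language modes.
#if defined(__cpp_consteval)
#define SKETCH_CONSTEVAL consteval
#else
#define SKETCH_CONSTEVAL constexpr
#endif

// Compile-time-valued case: with consteval, every call must be evaluated
// by the compiler, so the result is always a constant.
template <int N>
SKETCH_CONSTEVAL int block_size(std::integral_constant<int, N>) noexcept
{
  return N;
}

// Runtime case: an identity function, marked inline.
inline int block_size(int bs) noexcept { return bs; }

// Hypothetical caller: the same line extracts the size in both cases.
template <typename V>
int count_entries(int num_blocks, V _bs)
{
  const auto bs = block_size(_bs);
  return num_blocks * bs;
}
```

Overload resolution picks the exact match for `std::integral_constant`, so the runtime overload is only used when a plain `int` is passed.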

schnellerhase · 2025-12-16T09:06:53Z

Quick testing at https://godbolt.org/z/dze1aq9br suggests that we should definitely mark the runtime version `inline`; then `-O1` and above yield an inlined version. Adding `__attribute__((always_inline))` yields an inlined version even at `-O0`.

jhale · 2025-12-16T13:19:55Z

@garth-wells I've reviewed the code quite carefully and I don't think that e.g. BS<1> can lead to any other behaviour than the constant being compile time evaluated, ultimately due to the use of consteval. Nonetheless I take your point that the current way is more explicit, in the sense that it is 'easy' to see that it leads to the behaviour we desire. That said, the proposed method really does cut down on code duplication and templating.

What could @schnellerhase do in terms of performance tests or additional unit tests that might persuade you that it does work?

schnellerhase added 6 commits April 27, 2025 17:21

Introduce BlockSize concept

5f80934

Use BlockSize in packing

65ff61f

Use BlockSize in vector assembly

a708e7d

Adapt demo

5b65ad8

Introduce BS<> alias

29a1219

Use BlockSize in spmv

10cc79c

schnellerhase force-pushed the block_size branch from bd90307 to 10cc79c Compare April 27, 2025 18:19

doc

1fb65d4

schnellerhase marked this pull request as ready for review April 27, 2025 18:50

chrisrichardson self-requested a review April 28, 2025 15:26

chrisrichardson approved these changes Apr 28, 2025

schnellerhase added 4 commits April 30, 2025 01:38

Introduce generic ConstexprType

dcfef33

value()

b8b0f90

Add test case

152e8d0

format

0e2ad15

schnellerhase added 2 commits April 30, 2025 17:22

constexpr value access

31a146d

format

6a4d5b5

schnellerhase added 6 commits April 30, 2025 22:40

Bump PETSc/SLEPc

2652fb5

Revert "Bump PETSc/SLEPc"

c762822

This reverts commit 2652fb5.

Tidy up

796725c

Merge branch 'main' into block_size

7602862

Compiler limitation for floating point values

460b350

Misses year code

5c1d722

schnellerhase added 6 commits September 9, 2025 19:05

Merge branch 'main' into block_size

b96bdb5

Use template test case

adc20fc

Merge branch 'main' into block_size

e3dd4e6

macross...

6f674a6

Merge branch 'main' into block_size

9e6b920

Merge branch 'main' into block_size

f7841f3

schnellerhase requested a review from garth-wells September 22, 2025 14:04

schnellerhase added 4 commits October 22, 2025 16:00

Merge branch 'main' into block_size

293d685

Adapt new assembler

ed1ba3e

Merge branch 'main' into block_size

4fe2694

Merge branch 'main' into block_size

8845f73

schnellerhase added the enhancement New feature or request label Dec 8, 2025

schnellerhase and others added 4 commits December 15, 2025 12:48

Merge branch 'main' into block_size

1ffaa94

Require fundamental type

e89c3be

int -> int32

4e26138

Tweak

95d1b30

jhale reviewed Dec 15, 2025

cpp/dolfinx/common/constexpr_type.h

schnellerhase and others added 2 commits December 15, 2025 15:03

Fix: add missing requires

1ac9165

Tighten up concept further.

98aeadd

schnellerhase commented Dec 15, 2025

cpp/dolfinx/common/constexpr_type.h

schnellerhase added 4 commits December 15, 2025 19:31

Update cpp/dolfinx/common/constexpr_type.h

496b3e9

Merge branch 'main' into block_size

14176ea

format

dfa8b3b

Merge branch 'main' into block_size

941552e

Mark inline

148f6d6