Data gathered by @matthiasdiener:

271,000 elements

7,500 elements

- [ ] Why is the invoker taking so long?
- [ ] Invoker time seems to be dependent on problem size. It should not be.
- [ ] Given that `BoundPyOpenCLExecutor` waits, what is behind the long wait in `post_step`?
- [ ] In the large-scale case, why does the last invoker take twice the time of the others?

Added April 29, 2025:

- [ ] `t_step` and `t_2step` disagree. Why?
- [ ] Large meshes have been observed to take more host time than small ones. Why?

cc @MTCam