Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLD: compatibility with Cython 3.0 #4355

Closed
11 of 15 tasks
neutrinoceros opened this issue Mar 2, 2023 · 26 comments · Fixed by #4575
Closed
11 of 15 tasks

BLD: compatibility with Cython 3.0 #4355

neutrinoceros opened this issue Mar 2, 2023 · 26 comments · Fixed by #4575
Labels
build related to the build process refactor improve readability, maintainability, modularity
Milestone

Comments

@neutrinoceros
Copy link
Member

neutrinoceros commented Mar 2, 2023

Cython 3.0 is now in beta (after several years of alpha).

#4043 contains an initial effort to discover and patch incompatibilities, but at the moment it seems to me that this migration is better done incrementally, as only some of the patches will be seamless and uncontroversial.
I'll be using this issue to track progress on this effort.

The most urgent issues are listed in the following. A useful resource for this task is the migration guide.

blockers

restoring performance

performance issues are discussed in the present thread.

deprecation warnings

Since the migration path for some items is a moving target (see cython/cython#5242), I think the minimal effort approach would be to set our requirement to Cython>=3.0,<3.1 at first. According to the current plan for Cython 3.1 and 3.2, we should be able to drop the upper limit completely once we reach compatibility with version 3.2

low priority (expected gains from dropping Cython 0.29.x)

Details

with header.txt

# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION

and t.sh

for file in "$@"
do
    cat header.txt "$file" > /tmp/xx.$$
    mv /tmp/xx.$$ "$file"
done

apply the change as

git ls-files | grep .pyx | xargs bash t.sh
@neutrinoceros neutrinoceros added the refactor improve readability, maintainability, modularity label Mar 2, 2023
@neutrinoceros
Copy link
Member Author

neutrinoceros commented Mar 13, 2023

The last item without a PR on the "blockers" list is a little hard for me to decipher. It only affects yt/geometry/oct_container.pyx, which I am not familiar with. It looks like @matthewturk and @cphyc last authored the lines these errors are pointing to. Any chance you guys could give me a hand here ?

For reference, here's a simple script that can be used for testing compilation

python -m pip install -U --pre Cython build
python -m build --no-isolation .

update: nevermind I think I figured it out

@neutrinoceros
Copy link
Member Author

neutrinoceros commented Mar 17, 2023

We are currently compatible with Cython 3.0.0b1, but I note that some tests are showing performance regressions.
They are very visible in the bleeding-edge CI job history: running tests went from under 5min to over 13min. I can also reproduce this regression locally and measure which tests are most affected using pytest-durations. Here are my results

With Cython 3.0.0b1

total          name                                           num avg            min
0:00:21.818796                    RotationTest::test_rotation   1 0:00:21.818796 0:00:21.818796
0:00:19.647312                LensTest::test_perspective_lens   1 0:00:19.647312 0:00:19.647312
0:00:17.663738           test_add_gradient_fields_curvilinear   1 0:00:17.663738 0:00:17.663738
0:00:16.087412           VariousVRTests::test_multiple_fields   1 0:00:16.087412 0:00:16.087412
0:00:15.884202               SaveRenderTest::test_save_render   1 0:00:15.884202 0:00:15.884202
0:00:15.345128 VariousVRTests::test_rotation_volume_rendering   1 0:00:15.345128 0:00:15.345128
0:00:14.989918  VariousVRTests::test_modify_transfer_function   1 0:00:14.989918 0:00:14.989918
0:00:13.948768                            test_point_selector   1 0:00:13.948768 0:00:13.948768
0:00:13.781540                     test_amr_kdtree_set_fields   1 0:00:13.781540 0:00:13.781540
0:00:13.541609                           test_compose_overlap   1 0:00:13.541609 0:00:13.541609
0:00:10.365563                                test_projection   1 0:00:10.365563 0:00:10.365563
0:00:10.068168                                 test_ellipsoid   1 0:00:10.068168 0:00:10.068168
0:00:09.264413                        test_compose_no_overlap   1 0:00:09.264413 0:00:09.264413
0:00:08.577762                 test_boolean_cylinders_overlap   1 0:00:08.577762 0:00:08.577762
0:00:08.008635   VariousVRTests::test_simple_volume_rendering   1 0:00:08.008635 0:00:08.008635
0:00:07.843007                 SigmaClipTest::test_sigma_clip   1 0:00:07.843007 0:00:07.843007
0:00:07.822579                   SimpleVRTest::test_simple_vr   1 0:00:07.822579 0:00:07.822579
0:00:07.640429                              test_field_access   1 0:00:07.640429 0:00:07.640429
0:00:07.561692                                       test_ray   2 0:00:03.780846 0:00:00.717569
0:00:07.272802                   test_boolean_mix_periodicity   1 0:00:07.272802 0:00:07.272802
0:00:07.191133                               test_annotations   1 0:00:07.191133 0:00:07.191133
0:00:05.887486              test_boolean_cylinders_no_overlap   1 0:00:05.887486 0:00:05.887486
0:00:05.853210                      test_region_and_particles   1 0:00:05.853210 0:00:05.853210
0:00:05.568958                            test_offaxis_moment   1 0:00:05.568958 0:00:05.568958
0:00:05.147904                 ZBufferTest::test_composite_vr   1 0:00:05.147904 0:00:05.147904
0:00:04.973548                    test_boolean_slices_overlap   1 0:00:04.973548 0:00:04.973548
0:00:04.946019                 test_boolean_slices_no_overlap   1 0:00:04.946019 0:00:04.946019
0:00:04.929659             CompositeVRTest::test_composite_vr   1 0:00:04.929659 0:00:04.929659
0:00:04.911772                   PointsVRTest::test_points_vr   1 0:00:04.911772 0:00:04.911772
0:00:04.853476                   test_boolean_spheres_overlap   1 0:00:04.853476 0:00:04.853476
0:08:21.841224                                    grand total 488 0:00:00.234941 0:00:00.000044

With Cython 0.29.33

total          name                                           num avg            min
0:00:13.668872                            test_point_selector   1 0:00:13.668872 0:00:13.668872
0:00:13.002215           test_add_gradient_fields_curvilinear   1 0:00:13.002215 0:00:13.002215
0:00:10.452180                                test_projection   1 0:00:10.452180 0:00:10.452180
0:00:09.817984                     test_amr_kdtree_set_fields   1 0:00:09.817984 0:00:09.817984
0:00:06.553763                    RotationTest::test_rotation   1 0:00:06.553763 0:00:06.553763
0:00:06.185139                      test_region_and_particles   1 0:00:06.185139 0:00:06.185139
0:00:05.647264                LensTest::test_perspective_lens   1 0:00:05.647264 0:00:05.647264
0:00:05.333219                                       test_ray   2 0:00:02.666610 0:00:00.846876
0:00:04.999946               SaveRenderTest::test_save_render   1 0:00:04.999946 0:00:04.999946
0:00:04.983511                              test_field_access   1 0:00:04.983511 0:00:04.983511
0:00:04.841462           VariousVRTests::test_multiple_fields   1 0:00:04.841462 0:00:04.841462
0:00:04.511641 VariousVRTests::test_rotation_volume_rendering   1 0:00:04.511641 0:00:04.511641
0:00:04.410010  VariousVRTests::test_modify_transfer_function   1 0:00:04.410010 0:00:04.410010
0:00:04.339119                            test_arbitrary_grid   2 0:00:02.169559 0:00:00.711297
0:00:04.289835                    test_smoothed_covering_grid   1 0:00:04.289835 0:00:04.289835
0:00:04.234965                       test_testable_geometries   7 0:00:00.593930 0:00:00.510673
0:00:04.157473                                  test_profiles   1 0:00:04.157473 0:00:04.157473
0:00:03.841644                           test_icoords_to_ires   1 0:00:03.841644 0:00:03.841644
0:00:03.834416                 test_boolean_cylinders_overlap   1 0:00:03.834416 0:00:03.834416
0:00:03.485764                           test_compose_overlap   1 0:00:03.485764 0:00:03.485764
0:00:03.399844                                     test_slice   2 0:00:01.699922 0:00:00.614181
0:00:03.371764                             test_cutting_plane   1 0:00:03.371764 0:00:03.371764
0:00:03.177318                         test_particle_profiles   1 0:00:03.177318 0:00:03.177318
0:00:03.060623                                  test_chunking   1 0:00:03.060623 0:00:03.060623
0:00:03.012477                   test_boolean_mix_periodicity   1 0:00:03.012477 0:00:03.012477
0:00:02.848275                test_boolean_ray_region_overlap   1 0:00:02.848275 0:00:02.848275
0:00:02.755700                                 test_ellipsoid   1 0:00:02.755700 0:00:02.755700
0:00:02.735369                        test_mean_sum_integrate   1 0:00:02.735369 0:00:02.735369
0:00:02.638087                 SigmaClipTest::test_sigma_clip   1 0:00:02.638087 0:00:02.638087
0:00:02.623963   VariousVRTests::test_simple_volume_rendering   1 0:00:02.623963 0:00:02.623963
0:05:01.671335                                    grand total 488 0:00:00.204368 0:00:00.000045

I will investigate this, and report anything useful upstream.

@matthewturk
Copy link
Member

I can't do this right now, but probably the most effective thing to do would be to run cython -a on a handful of representative ones, like RotationTest::test_rotation and briefly inspect the generated code. My guess is there are some default behavior changes that we are not accounting for, likely to make things more python-like that we relied on being c-like.

@neutrinoceros
Copy link
Member Author

neutrinoceros commented Mar 17, 2023

Taking the worst offender as a starting point (RotationTest::test_rotation), I'm able to boil it down to

from time import monotonic

import numpy as np

from yt.testing import fake_random_ds
from yt.visualization.volume_rendering.api import volume_render

tstart = monotonic()

ds = fake_random_ds(8)
dd = ds.sphere(ds.domain_center, ds.domain_width[0] / 2)
im, sc = volume_render(dd, field=("gas", "density"))

tstop = monotonic()

print(f"{(tstop-tstart)*1000:.0f} ms")

and still see the perf difference.
Following this trail, I annotated (as suggested) yt/utilities/lib/bounding_volume_hierarchy.pyx, and I found some suspicious differences (Cython 3.0.0b1 is on the left, yellow lines hint at Python interaction)

Screenshot 2023-03-17 at 15 52 14

@matthewturk
Copy link
Member

@neutrinoceros Can you post the .html file? Clicking on the yellow lines expands them. My suspicion is that there's some function call context or something that's being added here that wasn't before. We call all these cdef nogil functions in order to save time and remove overhead, but looks like some is sneaking back in.

@neutrinoceros
Copy link
Member Author

Here are the annotations
http://use.yt/upload/66b16f8c
http://use.yt/upload/959c81a7

@matthewturk
Copy link
Member

Well, I'm not sure anymore.

Here's the new generated code:

+224:     cdef void _set_up_triangles(self, np.float64_t[:, :] vertices,
static void __pyx_f_2yt_9utilities_3lib_25bounding_volume_hierarchy_3BVH__set_up_triangles(struct __pyx_obj_2yt_9utilities_3lib_25bounding_volume_hierarchy_BVH *__pyx_v_self, __Pyx_memviewslice __pyx_v_vertices, __Pyx_memviewslice __pyx_v_indices) {
  __pyx_t_5numpy_int64_t __pyx_v_offset;
  __pyx_t_5numpy_int64_t __pyx_v_tri_index;
  __pyx_t_5numpy_int64_t __pyx_v_v0;
  __pyx_t_5numpy_int64_t __pyx_v_v1;
  __pyx_t_5numpy_int64_t __pyx_v_v2;
  struct __pyx_t_2yt_9utilities_3lib_10primitives_Triangle *__pyx_v_tri;
  __pyx_t_5numpy_int64_t __pyx_v_i;
  __pyx_t_5numpy_int64_t __pyx_v_j;
  __pyx_t_5numpy_int64_t __pyx_v_k;
/* … */
  /* function exit code */
  goto __pyx_L0;
  __pyx_L1_error:;
  #ifdef WITH_THREAD
  __pyx_gilstate_save = __Pyx_PyGILState_Ensure();
  #endif
  __Pyx_AddTraceback("yt.utilities.lib.bounding_volume_hierarchy.BVH._set_up_triangles", __pyx_clineno, __pyx_lineno, __pyx_filename);
  #ifdef WITH_THREAD
  __Pyx_PyGILState_Release(__pyx_gilstate_save);
  #endif
  __pyx_L0:;
}

Here's the old generated code:

224:     cdef void _set_up_triangles(self, np.float64_t[:, :] vertices,
static void __pyx_f_2yt_9utilities_3lib_25bounding_volume_hierarchy_3BVH__set_up_triangles(struct __pyx_obj_2yt_9utilities_3lib_25bounding_volume_hierarchy_BVH *__pyx_v_self, __Pyx_memviewslice __pyx_v_vertices, __Pyx_memviewslice __pyx_v_indices) {
  __pyx_t_5numpy_int64_t __pyx_v_offset;
  __pyx_t_5numpy_int64_t __pyx_v_tri_index;
  __pyx_t_5numpy_int64_t __pyx_v_v0;
  __pyx_t_5numpy_int64_t __pyx_v_v1;
  __pyx_t_5numpy_int64_t __pyx_v_v2;
  struct __pyx_t_2yt_9utilities_3lib_10primitives_Triangle *__pyx_v_tri;
  __pyx_t_5numpy_int64_t __pyx_v_i;
  __pyx_t_5numpy_int64_t __pyx_v_j;
  __pyx_t_5numpy_int64_t __pyx_v_k;
/* … */
  /* function exit code */
}

at first glance it looks like it's doing a bunch of GIL setting, etc. But, that should only be called in an error state -- which in principle should not be the dominant cost. So at this point, I'm not entirely sure what the issue is, unless I am misunderstanding and this code gets called a bunch of times when I think it should not.

Now, I looked a little bit more, and there is another difference, also related to the GIL. Again, this shouldn't be an issue, though? In the call to self.get_centroid we now have:

__pyx_v_self->get_centroid(__pyx_v_self->primitives, __pyx_v_tri_index, (__pyx_v_self->centroids[__pyx_v_tri_index])); if (unlikely(__Pyx_ErrOccurredWithGIL())) __PYX_ERR(0, 245, __pyx_L1_error)

So it's also possible now that we're calling __Pyx_ErrOccurredWithGIL() a lot, maybe? (I'm not sure what unlikely does in this case.) I note that get_centroid is a function pointer and declared nogil. So while I'm still not sure, I wonder if this is related to the ability to raise exceptions being changed in Cython 3.0 (we had to do a bunch of new except -1 decls, right?) or maybe it's losing track of what has and hasn't the GIL? (I think that is less likely than an error on our part.)

@neutrinoceros
Copy link
Member Author

While I'm far from mastering every relevant details here, I think you might find this paragraph enlightening (maybe ?)
https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#exception-values-and-noexcept

@matthewturk
Copy link
Member

matthewturk commented Mar 17, 2023 via email

@neutrinoceros
Copy link
Member Author

Looks appropriate indeed. I tried this patch

diff --git a/yt/utilities/lib/bounding_volume_hierarchy.pyx b/yt/utilities/lib/bounding_volume_hierarchy.pyx
index a743257..13f2384 100644
--- a/yt/utilities/lib/bounding_volume_hierarchy.pyx
+++ b/yt/utilities/lib/bounding_volume_hierarchy.pyx
@@ -1,3 +1,4 @@
+# cython: legacy_implicit_noexcept=True
 # distutils: libraries = STD_LIBS
 # distutils: include_dirs = LIB_DIR
 # distutils: language = c++

but it fails to compile on both versions (Cython 0.29.33 doesn't recognise this option, and Cython 3.0.0b1 crashes in a more exotic way). At this point I think it's worth reporting ?

@neutrinoceros
Copy link
Member Author

Nevermind, the reason it didn't compile is that I forgot to also include the directive in the header, resulting in mismatches between declarations and implementations.
Anyway, I see that noexcept was actually backported to Cython 0.29.x as a noop, so in principle it is possible to add it explicitly where we rely on the old behaviour. In the particular case we've been inspecting however, I do not see any significant improvement doing so, which seems to match your (@matthewturk) intuition that these functions are not actually called that much.
So I learned about noexpect, which might come handy later, but I'm back to square one regarding the regressions we're seeing. I'll try to scan for other known performance-affecting changes in Cython 3's release notes.

@neutrinoceros
Copy link
Member Author

This seems related cython/cython#5327

@neutrinoceros
Copy link
Member Author

neutrinoceros commented Mar 27, 2023

Looks like the problem was adressed upstream cython/cython#5324 (assuming it's indeed the same issue).
I'll try out the new beta release once it's available on PyPI

@neutrinoceros
Copy link
Member Author

Cython 3.0.0b2 still shows degraded performance (there are no obvious differences with 3.0.0b1). I really need to take a minute to report this properly.

@neutrinoceros
Copy link
Member Author

Enabling profiling in Cython helped me determine that the main offender is ImageSampler.cast_through_kdtree
Annotating the source file, I can spot one place were there's visibly more Python interaction with Cython 3.0.0b2 (left)
Screenshot 2023-03-27 at 23 21 26

An uneducated look at the code generated for these calls to i64clip seem to indicate that it includes error handling (which the old version does not).

For instance, for this line

iter[0] = i64clip(iter[0]-1, 0, self.nv[0])

this is the old generated code

(__pyx_v_iter[0]) = __pyx_f_2yt_9utilities_3lib_8fp_utils_i64clip(((__pyx_v_iter[0]) - 1), 0, (__pyx_v_self->nv[0]));

And this is the new code

__pyx_t_3 = __pyx_f_2yt_9utilities_3lib_8fp_utils_i64clip(((__pyx_v_iter[0]) - 1), 0, (__pyx_v_self->nv[0])); if (unlikely(__pyx_t_3 == ((__pyx_t_5numpy_int64_t)-1) && PyErr_Occurred())) __PYX_ERR(0, 163, __pyx_L1_error)
  (__pyx_v_iter[0]) = __pyx_t_3;

I further note that the code generated for the i64clip function itself is identical in both versions.

I tried applying the noexcept qualifier to cast_through_kdtree but it crashes at compile time (supposedly because this is a def function, not a cdef one).

@neutrinoceros
Copy link
Member Author

neutrinoceros commented Mar 27, 2023

I was close to solving it: in fact what's needed is to add the noexcept qualifier to i64clip itself. Done in #4386
Note that this measurably reduces the runtime overhead observed earlier, but further efforts are needed to get back to baseline perfs

@matthewturk
Copy link
Member

Haha, we call those fp_utils functions all the time! Everywhere! All at once! I can see why adding even a smidgen of runtime overhead to them would clobber us everywhere.

@neutrinoceros
Copy link
Member Author

"fp" : functional programming ?

@neutrinoceros
Copy link
Member Author

We've had 3 runs of bleeding edge CI with Cython 3.0.0b2 so far and they seem to take even longer than with 3.0.0b1: now tests take as much as 18min (previously 13, ref is 5). The itch is only getting stronger.

@neutrinoceros
Copy link
Member Author

ImageSampler.cast_through_kdtree is still the main source of overhead in the example I'm working on.
The generated code for the function signature itself is visibly different (and as far as I can tell that's the only difference), but it's difficult to pin down exactly what can be so time consuming in the new version.
@matthewturk, any chance you could have a look there ?
here is the reference: http://use.yt/upload/c98c3d98
and here's the new result: http://use.yt/upload/071b3456

@matthewturk
Copy link
Member

Yes, I'll take a look.

@neutrinoceros
Copy link
Member Author

I've escalated this question to Cython's user mailing list, hopefully it'll get some attention

@neutrinoceros
Copy link
Member Author

In the end it's again all about sprinkling noexcept where needed. The challenge being in the identification of functions where it's performance critical. Anyway VR regressions should be overcome with #4390

Next in line is test_add_gradient_fields_curvilinear, which we can boil down to the following self-contained script, still exposing the regression

from time import monotonic
from yt.testing import fake_amr_ds

ds = fake_amr_ds(fields=["density"], units=["g/cm**3"])
ds.add_gradient_fields(("gas", "density"))
ad = ds.all_data()

tstart = monotonic()
ret = ad[("gas", "density_gradient_x")]
tstop = monotonic()

print(f"{(tstop-tstart)*1000:.0f} ms")

(on my laptop it takes about 2200ms with Cython 0.29 and 3500ms with Cython 3.0.0b2)

@neutrinoceros
Copy link
Member Author

This is now also addressed in #4390

@neutrinoceros
Copy link
Member Author

Already identified performance issues are fixed, as far as I can tell. We should be able to close this issue when we do the switch from >=0.29.33,<3 to >=3.0,<3.1, which won't happen until Cython 3.0.0 final is released, and we take extra time to verify that performances do not degrade over a larger fraction of our tests.

@neutrinoceros
Copy link
Member Author

#4575 fixes the remaining issues with, and switch requirements to Cython>=3.0,<3.1
The remaining, low priority problems and further development listed above will be addressed in follow up works and for the most part will not be possible until Cython 3.1 is out.

@neutrinoceros neutrinoceros linked a pull request Jul 27, 2023 that will close this issue
@neutrinoceros neutrinoceros added the build related to the build process label Jul 31, 2023
@neutrinoceros neutrinoceros added this to the 4.3.0 milestone Aug 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build related to the build process refactor improve readability, maintainability, modularity
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants