Reduce generated functions: getindex #28

NHDaly · 2024-05-08T16:11:21Z

Convert all generated functions to regular functions.

The produced code is mostly unchanged, and the perf remains mostly the same.
Type stability is tested by a new testitem.
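
As a rough illustration of this kind of conversion (a sketch on a toy struct, not the code in this PR), the pair below puts a `@generated` accessor next to a plain-function equivalent and checks type stability with `Test.@inferred`. The names `Point`, `getfield_generated`, and `getfield_plain` are made up for this example.

```julia
using Test

# Toy type for illustration only; the PR itself targets `Blob{T}`.
struct Point
    x::Int64
    y::Float64
end

# Before: the method body is built as an expression when the method is generated.
@generated function getfield_generated(p::Point, ::Val{field}) where {field}
    return :(getfield(p, $(QuoteNode(field))))
end

# After: a plain function. `field` is a static parameter, so inference sees it
# as a constant and can work out the concrete return type.
getfield_plain(p::Point, ::Val{field}) where {field} = getfield(p, field)

p = Point(1, 2.0)

# `@inferred` throws if the inferred return type differs from the actual one.
@test (@inferred getfield_plain(p, Val(:y))) === 2.0
@test (@inferred getfield_generated(p, Val(:x))) === 1
```

The real functions do more work per call (index lookup, offset computation), so whether the plain version stays type stable there depends on how much the compiler can constant-fold; that is what a test like this is meant to catch.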

Compilation time comparisons:

julia> test_getproperty1(b) = b.e
test_getproperty1 (generic function with 1 method)

julia> @time test_getproperty1(bar)      # BEFORE
  0.039638 seconds (36.95 k allocations: 2.471 MiB)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f2d0ea0, 41, 361)

julia> @time test_getproperty1(bar)      # AFTER
  0.007395 seconds (15.06 k allocations: 1019.320 KiB)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f2d0ea0, 41, 361)

julia> @time unsafe_load(bar)      # BEFORE
  0.076596 seconds (86.41 k allocations: 5.710 MiB)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @time unsafe_load(bar)      # AFTER
  0.025392 seconds (78.95 k allocations: 5.357 MiB)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @time unsafe_store!(bar, bar_val)      # BEFORE
  0.076252 seconds (96.31 k allocations: 6.358 MiB)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @time unsafe_store!(bar, bar_val)      # AFTER
  0.039480 seconds (49.01 k allocations: 3.199 MiB)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

Runtime comparisons on Julia 1.10:

julia> test_getproperty1(b) = b.e
test_getproperty1 (generic function with 1 method)

julia> @btime test_getproperty1($bar)      # BEFORE
  1.416 ns (0 allocations: 0 bytes)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f2d0ea0, 41, 361)

julia> @btime test_getproperty1($bar)      # AFTER
  1.416 ns (0 allocations: 0 bytes)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f2d0ea0, 41, 361)

julia> @btime unsafe_load($bar)      # BEFORE
  2.166 ns (0 allocations: 0 bytes)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @btime unsafe_load($bar)      # AFTER
  5.250 ns (0 allocations: 0 bytes)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @btime unsafe_store!($bar, $bar_val)      # BEFORE
  5.250 ns (0 allocations: 0 bytes)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

julia> @btime unsafe_store!($bar, $bar_val)      # AFTER
  10.177 ns (0 allocations: 0 bytes)
Bar(10, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f2d0ea0, 217, 361))

Runtime comparisons on Julia 1.11:

julia> test_getproperty1(b) = b.e
test_getproperty1 (generic function with 1 method)

julia> @btime test_getproperty1($bar)      # BEFORE
  2.042 ns (0 allocations: 0 bytes)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f959380, 41, 361)

julia> @btime test_getproperty1($bar)      # AFTER
  2.000 ns (0 allocations: 0 bytes)
Blob{Blob{Quux}}(Ptr{Nothing} @0x000000013f959380, 41, 361)

julia> @btime unsafe_load($bar)      # BEFORE
  2.333 ns (0 allocations: 0 bytes)
Bar(0, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f959380, 217, 361))

julia> @btime unsafe_load($bar)      # AFTER
  2.292 ns (0 allocations: 0 bytes)
Bar(0, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f959380, 217, 361))

julia> @btime unsafe_store!($bar, $bar_val)      # BEFORE
  5.250 ns (0 allocations: 0 bytes)
Bar(0, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f959380, 217, 361))

julia> @btime unsafe_store!($bar, $bar_val)      # AFTER
  5.333 ns (0 allocations: 0 bytes)
Bar(0, Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], false, [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Blob{Quux}(Ptr{Nothing} @0x000000013f959380, 217, 361))

src/blob.jl

Managed via compiler annotations.
This new function is ~10x faster than the older `@generated` function: ~10ms down to ~1ms.

NHDaly · 2024-05-09T21:05:17Z

Okay, I think this is good to review! 🎉 Thanks again for the offline support! :)

NHDaly · 2024-05-10T20:29:50Z

src/blob.jl

+        # ~0.5ms for 5 fields, vs ~5ms for unrolling via splatting the fields.
+        # ~3ms for 20 fields, vs ~6ms for splatting.
+        # Note that splatting gives up after ~30 fields, whereas recursion remains robust.
+        _sum_field_sizes(T)


It looks like it might be a good idea to add some kind of cutoff for the recursion to fall back to runtime computations for very large types?

That's what @aviatesk did here:
https://github.com/JuliaLang/julia/pull/54026/files#diff-12e7a6522633012a408b1bdee7639e8cb722617fe1a8ed6a3881bf4ad1ebdbbdR1369-R1370

Did you test large types? I would fix it once we hit a problem, so the code does not get too complicated.

2x slower, but less compile time so worth it.
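
For reference, the cutoff idea raised in this thread could look roughly like the sketch below: recurse over the tuple of field types for small structs, and fall back to an ordinary runtime loop past some threshold. This is not the code in `src/blob.jl`; the helper names, the threshold of 32, and the use of `sizeof` as the per-field size are all assumptions made for illustration.

```julia
# Hypothetical names and threshold; not taken from src/blob.jl.
const RECURSION_CUTOFF = 32

# Recursive sum over a tuple of types: peel off the first element each step.
_sum_sizes(::Tuple{}) = 0
_sum_sizes(ts::Tuple) = sizeof(first(ts)) + _sum_sizes(Base.tail(ts))

# Runtime fallback: a plain loop over the field indices.
function _sum_sizes_loop(::Type{T}) where {T}
    total = 0
    for i in 1:fieldcount(T)
        total += sizeof(fieldtype(T, i))
    end
    return total
end

# Use recursion for small types and the loop for very large ones.
function sum_field_sizes(::Type{T}) where {T}
    if fieldcount(T) <= RECURSION_CUTOFF
        return _sum_sizes(fieldtypes(T))
    else
        return _sum_sizes_loop(T)
    end
end

struct Small
    a::Int64
    b::Float32
    c::Bool
end

sum_field_sizes(Small)               # recursive path
sum_field_sizes(NTuple{40, UInt8})   # loop path, since 40 > RECURSION_CUTOFF
```

The trade-off is the one discussed above: the extra branch adds some complexity, so it may only be worth adding once a large type actually causes a problem.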

robertbuessow · 2024-10-07T07:54:06Z

Did you test large types? I would fix it once we hit a problem, so the code does not get too complicated.

robertbuessow · 2024-10-07T08:05:19Z

src/blob.jl

-        Blob{$(fieldtype(T, i))}(blob + $(blob_offset(T, i)))
-    end
+    @assert i !== nothing "$T has no field $field"
+    Blob{fieldtype(T, i)}(blob + (blob_offset(T, i)))


The `+` creates a `Blob{T}` that we then cast to `Blob{fieldtype(T, i)}`. Wouldn't it be better to create the right type from the beginning? (I think the `+`/`-` operators don't make much sense.)

That sounds reasonable to me. Again, I just did a blind transformation on what was here... 🤔

I think the `+` operators are adding bytes, in which case you could do it either way? But yes, I agree this is confusing.
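
To make the two options in this thread concrete, here is a small sketch on a toy type. `ToyBlob` and its three fields are assumptions loosely modeled on the printed `Blob{...}(Ptr{Nothing} @..., 41, 361)` values above, and `via_plus` / `direct` are hypothetical helper names; the real `Blob` may differ.

```julia
# Toy stand-in for the package's `Blob`; layout and names are assumptions.
struct ToyBlob{T}
    base::Ptr{Nothing}
    offset::Int64
    limit::Int64
end

# Byte-offset addition that keeps the element type `T`.
Base.:+(b::ToyBlob{T}, n::Integer) where {T} = ToyBlob{T}(b.base, b.offset + n, b.limit)

# Re-tag an existing blob with a different element type.
ToyBlob{S}(b::ToyBlob) where {S} = ToyBlob{S}(b.base, b.offset, b.limit)

# Option 1: add bytes first (yielding ToyBlob{T}), then cast to ToyBlob{S}.
via_plus(b::ToyBlob, ::Type{S}, n::Integer) where {S} = ToyBlob{S}(b + n)

# Option 2: construct the target type directly from the parts.
direct(b::ToyBlob, ::Type{S}, n::Integer) where {S} =
    ToyBlob{S}(b.base, b.offset + n, b.limit)

b = ToyBlob{Int64}(C_NULL, 0, 64)
via_plus(b, Float64, 8) === direct(b, Float64, 8)
```

If `+` only adds bytes, as suggested above, both routes produce the same value; the direct form simply avoids the intermediate `ToyBlob{T}` that never describes valid data.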

NHDaly added 2 commits May 8, 2024 10:10

Convert getindex(::Blob, ::Val{field}) from generated func.

66daa86

- The produced code is unchanged, and the perf remains the same.
- This is tested by a new testitem

Commit some attempts to convert away from some other generated

5befac7

functions, but these had code-size and perf impacts. :(