feat(prof): move allocation profiler stack collection to safe point #3306

realFlowControl · 2025-06-23T10:16:08Z

Description

The allocation profiler currently collects a stack trace immediately when the sampling threshold (4MB) is reached. This can occur anywhere in the PHP VM, including outside safe points. While usually safe, there are rare cases where opline has been free()'d and not restored, leading to use-after-free issues during stack collection 💥

These are engine-level bugs, which we've addressed retroactively via upstream fixes and/or profiler-side workarounds. This PR takes a proactive approach: it defers stack trace collection until the next safe point by storing allocation samples in TLS and raising the engine interrupt flag. At the next interrupt, where we already handle wall/CPU time sampling, we check for a pending allocation sample and collect the trace then.
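The deferral described above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in this PR: `VM_INTERRUPT`, `PendingAllocation`, `on_sampling_threshold` and `on_interrupt` are hypothetical names, the flag is a plain process-wide `AtomicBool` standing in for the engine's interrupt flag, and the stack walk is passed in as a closure. For brevity the sketch keeps a single pending slot and does not handle several threshold hits before the interrupt.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicBool, Ordering};

/// Stand-in for the engine's interrupt flag (hypothetical; the real flag
/// is owned by the PHP engine, not by the profiler).
static VM_INTERRUPT: AtomicBool = AtomicBool::new(false);

/// An allocation sample recorded at the allocation site, without a stack trace.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct PendingAllocation {
    pub bytes: u64,
}

thread_local! {
    /// Per-thread slot holding a sample that still needs its stack trace.
    static PENDING: Cell<Option<PendingAllocation>> = Cell::new(None);
}

/// Called from the allocator hook when the sampling threshold is reached.
/// It never walks the stack; it only stores the sample and requests an interrupt.
pub fn on_sampling_threshold(bytes: u64) {
    PENDING.with(|p| p.set(Some(PendingAllocation { bytes })));
    VM_INTERRUPT.store(true, Ordering::SeqCst);
}

/// Called at the next safe point. If a sample is pending, the stack is
/// collected now, where the VM state is expected to be consistent.
pub fn on_interrupt(
    collect_stack: impl FnOnce() -> Vec<String>,
) -> Option<(PendingAllocation, Vec<String>)> {
    // Clear the flag; if it was not set there is nothing to do.
    if !VM_INTERRUPT.swap(false, Ordering::SeqCst) {
        return None;
    }
    // Take the pending sample (if any) and only then walk the stack.
    let sample = PENDING.with(|p| p.take())?;
    Some((sample, collect_stack()))
}
```

The point of the split is that the allocator hook stays trivially cheap and touches no VM state, while the only place that reads frames is the interrupt handler.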

Out of Scope / Follow-Up Work

1. Stack trace collection for I/O samples should also be deferred to safe points.
2. When multiple profiler events (e.g., wall time, CPU time, allocation) are pending, we should ideally collect only one stack trace to reduce overhead. This will require additional refactoring.
3. Investigate whether we can attach wall/CPU timing metrics to allocation samples when collected, potentially increasing resolution without introducing sampling bias.

Point 2 seems like a clear improvement, but due to its architectural implications, it's deferred for now. Importantly, this PR already avoids redundant stack walks in one case: when a single PHP function call performs multiple allocations and hits the sampling threshold multiple times, we previously performed multiple redundant stack walks. With this change, those are consolidated into a single stack trace collected at the next safe point.
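The consolidation of several threshold hits into one stack walk can be illustrated as follows. This is a sketch under assumptions, not the PR's implementation: `Pending`, `record_threshold_hit`, `flush_at_safe_point` and the `STACK_WALKS` counter are hypothetical, and the counter merely stands in for a real stack walk so the effect is observable.

```rust
use std::cell::Cell;

/// Accumulated allocation samples waiting for a stack trace (hypothetical type).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Pending {
    pub count: u32,
    pub bytes: u64,
}

thread_local! {
    static PENDING: Cell<Pending> = Cell::new(Pending::default());
    /// Counts simulated stack walks on this thread.
    static STACK_WALKS: Cell<u32> = Cell::new(0);
}

/// Called each time the sampling threshold is reached; only accumulates.
pub fn record_threshold_hit(bytes: u64) {
    PENDING.with(|p| {
        let mut cur = p.get();
        cur.count += 1;
        cur.bytes += bytes;
        p.set(cur);
    });
}

/// Called at the safe point: one stack walk covers everything accumulated so far.
pub fn flush_at_safe_point() -> Option<Pending> {
    // `take` resets the slot to its default (empty) value.
    let pending = PENDING.with(|p| p.take());
    if pending.count == 0 {
        return None;
    }
    // Stands in for a single real stack walk.
    STACK_WALKS.with(|w| w.set(w.get() + 1));
    Some(pending)
}

/// Number of simulated stack walks performed on this thread.
pub fn stack_walks() -> u32 {
    STACK_WALKS.with(|w| w.get())
}
```

In this sketch, three threshold hits inside one function call followed by one flush yield a single sample with `count == 3` and one stack walk, instead of three walks of the same stack.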

Reviewer checklist

Test coverage seems ok.
Appropriate labels assigned.

pr-commenter · 2025-06-23T10:40:59Z

Benchmarks [ profiler ]

Benchmark execution time: 2025-06-25 16:00:22

Comparing candidate commit 657058f in PR branch florian/safe-point with baseline commit 0db0b0e in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 28 metrics, 8 unstable metrics.

morrisonlevi

This is fine for a proof of concept; it lets us see how the accuracy of the allocation profiler is impacted.

However, it will walk the stacks twice, which is wasteful. If you are trying to compare overhead, you'll want to fix that first, or disable wall time sampling when testing.

We should think carefully about this, but I think it may make sense to gather time any time we walk the stack.

realFlowControl · 2025-06-23T16:18:32Z

> This is fine for a proof of concept; it lets us see how the accuracy of the allocation profiler is impacted.

That's the sole reason I opened a draft PR: that way I can see the prof-correctness tests (as well as all other tests) and have a binary that I can ship.

> However, it will walk the stacks twice, which is wasteful. If you are trying to compare overhead, you'll want to fix that first, or disable wall time sampling when testing.

Exactly, it is collecting two stack traces. This does not affect the outcome, only the overhead 😉

> We should think carefully about this, but I think it may make sense to gather time any time we walk the stack.

I thought about that as well: since we already have the stack trace, we can easily add wall/CPU time to it. Let's measure once we are there.

realFlowControl · 2025-06-25T12:29:23Z

profiling/src/allocation/mod.rs

+                    // TODO: If this thread needs to take an allocation sample, calling
+                    // `profiler.trigger_interrupt()` will not only trigger this thread's interrupt,
+                    // but all other PHP ZTS threads' interrupts as well. The interrupt handler is
+                    // pretty slim, and does not collect a stack trace if there is nothing pending,
+                    // yet we should only trigger an interrupt in the "current" thread.
+                    profiler.trigger_interrupt();


Even though we could fix it here, I'd say we are also safe to ignore this for the time being, as only very few people are using ZTS PHP.
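To make the TODO concrete, here is a rough sketch of the difference between a broadcast interrupt and a targeted one. Everything in it is assumed for illustration: `InterruptRegistry`, `register`, `trigger_all` and `trigger_one` are hypothetical and do not describe how `profiler.trigger_interrupt()` is actually implemented.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical registry holding one interrupt flag per PHP thread.
#[derive(Default)]
pub struct InterruptRegistry {
    flags: Mutex<Vec<Arc<AtomicBool>>>,
}

impl InterruptRegistry {
    /// Each thread registers once and keeps a handle to its own flag.
    pub fn register(&self) -> Arc<AtomicBool> {
        let flag = Arc::new(AtomicBool::new(false));
        self.flags.lock().unwrap().push(Arc::clone(&flag));
        flag
    }

    /// Broadcast variant: raises the flag of every registered thread,
    /// which is the behaviour the TODO describes.
    pub fn trigger_all(&self) {
        for flag in self.flags.lock().unwrap().iter() {
            flag.store(true, Ordering::SeqCst);
        }
    }

    /// Targeted variant: raises only the given thread's flag.
    pub fn trigger_one(flag: &AtomicBool) {
        flag.store(true, Ordering::SeqCst);
    }
}
```

With the targeted variant, a thread that has a pending allocation sample would raise only its own flag, so other ZTS threads would not enter their interrupt handlers just to find nothing pending.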

realFlowControl changed the title ~~Move allocation profiler stack collection to safe point~~ feat(prof) move allocation profiler stack collection to safe point Jun 23, 2025

github-actions bot added the profiling (Relates to the Continuous Profiler) and tracing labels Jun 23, 2025

realFlowControl changed the title ~~feat(prof) move allocation profiler stack collection to safe point~~ feat(prof): move allocation profiler stack collection to safe point Jun 23, 2025

realFlowControl force-pushed the florian/safe-point branch 3 times, most recently from 1edc9dd to 83a2a05 Compare June 23, 2025 13:54

morrisonlevi reviewed Jun 23, 2025

realFlowControl force-pushed the florian/safe-point branch 3 times, most recently from 7e12b7b to 931e657 Compare June 25, 2025 11:07

realFlowControl commented Jun 25, 2025

Move sample collection to interrupt handler

1244de0

realFlowControl force-pushed the florian/safe-point branch 4 times, most recently from ddac914 to 36331ef Compare June 25, 2025 15:39

Add correctness test for multiple pending allocation samples

657058f

realFlowControl force-pushed the florian/safe-point branch from 36331ef to 657058f Compare June 25, 2025 15:45
