Skip to content

feat(prof): move allocation profiler stack collection to safe point #3306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

realFlowControl
Copy link
Member

@realFlowControl realFlowControl commented Jun 23, 2025

Description

The allocation profiler currently collects a stack trace immediately when the sampling threshold (4MB) is reached. This can occur anywhere in the PHP VM, including outside safe points. While usually safe, there are rare cases where opline has been free()'d and not restored, leading to use-after-free issues during stack collection 💥

These are engine-level bugs, which we've addressed retroactively via upstream fixes and/or profiler-side workarounds. This PR takes a proactive approach by deferring stack trace collection until the next safe point, by storing allocation samples in TLS and raising the engine interrupt flag. At the next interrupt - where we already handle wall/CPU time sampling - we check for a pending allocation sample and collect the trace then.

Out of Scope / Follow-Up Work

  1. Stack trace collection for I/O samples should also be deferred to safe points.
  2. When multiple profiler events (e.g., wall time, CPU time, allocation) are pending, we should ideally collect only one stack trace to reduce overhead. This will require additional refactoring.
  3. Investigate whether we can attach wall/CPU timing metrics to allocation samples when collected, potentially increasing resolution without introducing sampling bias.

Point 2 seems like a clear improvement, but due to its architectural implications, it's deferred for now. Importantly, this PR already avoids redundant stack walks in one case: when a single PHP function call performs multiple allocations and hits the sampling threshold multiple times, we previously performed multiple redundant stack walks. With this change, those are consolidated into a single stack trace collected at the next safe point.

Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

@realFlowControl realFlowControl changed the title Move allocation profiler stack collection to safe point feat(prof) move allocation profiler stack collection to safe point Jun 23, 2025
@github-actions github-actions bot added profiling Relates to the Continuous Profiler tracing labels Jun 23, 2025
@realFlowControl realFlowControl changed the title feat(prof) move allocation profiler stack collection to safe point feat(prof): move allocation profiler stack collection to safe point Jun 23, 2025
@pr-commenter
Copy link

pr-commenter bot commented Jun 23, 2025

Benchmarks [ profiler ]

Benchmark execution time: 2025-06-25 16:00:22

Comparing candidate commit 657058f in PR branch florian/safe-point with baseline commit 0db0b0e in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 28 metrics, 8 unstable metrics.

@realFlowControl realFlowControl force-pushed the florian/safe-point branch 3 times, most recently from 1edc9dd to 83a2a05 Compare June 23, 2025 13:54
Copy link
Collaborator

@morrisonlevi morrisonlevi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for a proof of concept; it let's us see how the accuracy of the allocation profiler is impacted.

However, it will walk the stacks twice, which is wasteful. If you are trying to compare overhead, you'll want to fix that first, or disable wall time sampling when testing.

We should think carefully about this, but I think it may make sense to gather time any time we walk the stack.

@realFlowControl
Copy link
Member Author

realFlowControl commented Jun 23, 2025

This is fine for a proof of concept; it let's us see how the accuracy of the allocation profiler is impacted.

That's the sole reason I opened a draft PR, that way I can see prof-correctness tests (as well as all other tests) and have a binary that I can ship.

However, it will walk the stacks twice, which is wasteful. If you are trying to compare overhead, you'll want to fix that first, or disable wall time sampling when testing.

Exactly, it is collecting two stack traces. This does not affect the outcome, but the overhead 😉

We should think carefully about this, but I think it may make sense to gather time any time we walk the stack.

I thought about that as well, as we have the stack trace already, we can easily add wall-/cpu-time to it. Let's measure once we are there.

@realFlowControl realFlowControl force-pushed the florian/safe-point branch 3 times, most recently from 7e12b7b to 931e657 Compare June 25, 2025 11:07
Comment on lines +80 to +85
// TODO: If this thread needs to take an allocation sample, calling
// `profiler.trigger_interrupt()` will not only trigger this threads interrupt,
// but all other PHP ZTS threads interrupts as well. The interrupt handler is
// pretty slim, and does not collect a stack trace if there is nothing pending,
// yet we should only trigger an interrupt in the "current" thread.
profiler.trigger_interrupt();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though we could fix it here, I'd say we are also save to ignore this for the time being, as only very very very few people are using ZTS PHP.

@realFlowControl realFlowControl force-pushed the florian/safe-point branch 4 times, most recently from ddac914 to 36331ef Compare June 25, 2025 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Relates to the Continuous Profiler tracing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants