Store sampling context in ContinuationPreservedEmbedderData #186


Open · wants to merge 10 commits into main from szegedi/cped-context

Conversation


@szegedi szegedi commented Dec 11, 2024

What does this PR do?:
Moves the profiler's sampling context into the V8 isolate's ContinuationPreservedEmbedderData (CPED). More precisely, it moves it into a private symbol-keyed property on that object, since Node.js 24 and later store AsyncContextFrame objects in the CPED themselves.
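Conceptually, the write path looks like this hedged sketch (the symbol key and helper name are illustrative; GetContinuationPreservedEmbedderData is the real V8 entry point, though its availability can depend on the V8 build):

```cpp
#include <v8.h>

// Sketch: attach the profiler's sample context to the AsyncContextFrame
// object that Node.js 24+ stores in the CPED, under our own symbol key so
// we don't collide with Node's data. Not the PR's exact code.
void StoreSampleContext(v8::Isolate* isolate,
                        v8::Local<v8::Symbol> key,
                        v8::Local<v8::Value> sampleContext) {
  v8::Local<v8::Value> cped = isolate->GetContinuationPreservedEmbedderData();
  if (!cped->IsObject()) return;  // no async context frame is active
  cped.As<v8::Object>()
      ->Set(isolate->GetCurrentContext(), key, sampleContext)
      .Check();
}
```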

Motivation:
Before this change, we rely on async_hooks in dd-trace-js to update the sampling context on every context switch. Doing this on every async context switch is somewhat costly, and we should migrate away from async hooks anyway: profiling is their last user, they'll be deprecated in Node, and they bring their own performance overhead.

This PR includes a document explaining the design.

Additional Notes:

The code using this in dd-trace-js' wall.js sets the context on AsyncLocalStorage.{enterWith|run} and in the before callback of AsyncHooks.createHook. With CPED storage enabled, we can eliminate the before callback and keep only the AsyncLocalStorage.{enterWith|run} callbacks; they are the minimum amount of instrumentation needed for the feature to work correctly.

This branch can be tested with the same-named branch in dd-trace-js.

Jira: PROF-11771

@szegedi szegedi changed the base branch from main to szegedi/atomic-context on December 11, 2024 20:10
@szegedi szegedi force-pushed the szegedi/cped-context branch from 5b796ea to 1b15c3a on December 11, 2024 20:48
@szegedi szegedi changed the base branch from szegedi/atomic-context to main on December 11, 2024 20:53
@szegedi szegedi force-pushed the szegedi/cped-context branch from 1b15c3a to 3decb46 on December 11, 2024 21:04
@szegedi szegedi force-pushed the szegedi/cped-context branch 4 times, most recently from bfa23b5 to 32b2170 on December 17, 2024 11:26
@szegedi szegedi force-pushed the szegedi/cped-context branch from 32b2170 to 9f398f2 on January 20, 2025 16:43
@szegedi szegedi force-pushed the szegedi/cped-context branch from 9f398f2 to 965a627 on February 26, 2025 16:40
@szegedi szegedi changed the base branch from main to szegedi/async-id-2 on February 26, 2025 16:41
@szegedi szegedi force-pushed the szegedi/cped-context branch from 965a627 to 0c1791e on February 26, 2025 16:51
@szegedi szegedi force-pushed the szegedi/cped-context branch from 0c1791e to 8f88f3d on February 27, 2025 10:32
@@ -66,6 +66,7 @@ export interface TimeProfilerOptions {
withContexts?: boolean;
workaroundV8Bug?: boolean;
collectCpuTime?: boolean;
useCPED?: boolean;


Code Quality Violation

Prop name (useCPED) doesn't match rule (is|has)

Enforces a consistent naming pattern for boolean props.

The pattern is: "^(is|has)[A-Z]([A-Za-z0-9]?)+" to enforce is and has prefixes.


@szegedi szegedi force-pushed the szegedi/cped-context branch from 8f88f3d to 717f17b on March 5, 2025 12:53
@szegedi szegedi force-pushed the szegedi/async-id-2 branch from 3c86526 to 79992a0 on March 6, 2025 17:05
Base automatically changed from szegedi/async-id-2 to main on March 19, 2025 15:27
@szegedi szegedi force-pushed the szegedi/cped-context branch from 717f17b to 2b9041d on March 20, 2025 15:12
github-actions bot commented Mar 20, 2025

Overall package size

Self size: 1.62 MB
Deduped: 1.99 MB
No deduping: 1.99 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| source-map | 0.7.4 | 226 kB | 226 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| p-limit | 3.1.0 | 7.75 kB | 13.78 kB |
| delay | 5.0.0 | 11.17 kB | 11.17 kB |
| node-gyp-build | 3.9.0 | 8.81 kB | 8.81 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe


pr-commenter bot commented Mar 20, 2025

Benchmarks

Benchmark execution time: 2025-06-11 11:29:00

Comparing candidate commit 98b683c in PR branch szegedi/cped-context with baseline commit 2f6f2f8 in branch main.

Found 1 performance improvement and 0 performance regressions. Performance is unchanged for 89 metrics; 30 metrics were unstable.

scenario:profiler-light-load-with-wall-profiler-20

  • 🟩 cpu_user_time [-11.651ms; -2.413ms] or [-11.063%; -2.291%]

@szegedi szegedi changed the title from "Experiment for storing profiling context in the ContinuationPreservedEmbeddedData" to "Store profiling context in ContinuationPreservedEmbeddedData" on March 20, 2025
@szegedi szegedi changed the title from "Store profiling context in ContinuationPreservedEmbeddedData" to "Store sampling context in ContinuationPreservedEmbeddedData" on March 20, 2025
@szegedi szegedi added the semver-minor (Usually minor non-breaking improvements) label on March 20, 2025
@szegedi szegedi force-pushed the szegedi/cped-context branch from 2b9041d to c475e2f on March 20, 2025 16:14
@szegedi szegedi force-pushed the szegedi/cped-context branch 3 times, most recently from 9ba3bf1 to 9c4090e on May 7, 2025 12:00
@szegedi szegedi requested a review from nsavoire as a code owner on May 20, 2025 13:32
@szegedi szegedi changed the title from "Store sampling context in ContinuationPreservedEmbeddedData" to "Store sampling context in ContinuationPreservedEmbedderData" on May 20, 2025
@szegedi szegedi marked this pull request as draft on May 20, 2025 13:41
@szegedi szegedi force-pushed the szegedi/cped-context branch 5 times, most recently from 89ff3a8 to d774391 on May 21, 2025 12:58
@szegedi szegedi marked this pull request as ready for review on May 21, 2025 13:15
@@ -0,0 +1,190 @@
# Storing Sample Context in V8 Continuation-Preserved Embedder Data


huge thanks for the very comprehensive doc

@@ -521,12 +523,16 @@ WallProfiler::WallProfiler(std::chrono::microseconds samplingPeriod,
workaroundV8Bug_ = workaroundV8Bug && DD_WALL_USE_SIGPROF && detectV8Bug_;
collectCpuTime_ = collectCpuTime && withContexts;
collectAsyncId_ = collectAsyncId && withContexts;
#if NODE_MAJOR_VERSION >= 23


I thought this was Node 24? Can we really rely on 23 being OK?

szegedi (Author):

The AsyncContextFrame implementation was released in Node 23 but needed a command-line option to activate it. Node 24 just turned it on by default. It's pretty much the same code in both. I developed and tested against 23 for most of the time, as 24 was only released recently. (That is also the reason it says 23 here :-) )

FWIW, now that Node 24 is out, we no longer publish a release for 23, so we might as well bump this to 24. Technically, though, there's no reason not to support this in source code from 23 onwards.
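For illustration, the compile-time gate might look like this (DD_CPED_SUPPORTED is a hypothetical macro for this sketch; NODE_MAJOR_VERSION comes from Node's headers):

```cpp
// Node 23 shipped AsyncContextFrame behind a command-line flag; Node 24
// enabled it by default. Support can be compiled in from 23 onwards.
#if NODE_MAJOR_VERSION >= 23
#define DD_CPED_SUPPORTED 1  // hypothetical macro for this sketch
#else
#define DD_CPED_SUPPORTED 0
#endif
```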

isolate->RemoveGCPrologueCallback(&GCPrologueCallback, this);
isolate->RemoveGCEpilogueCallback(&GCEpilogueCallback, this);
}

for (auto it = liveContextPtrs_.begin(); it != liveContextPtrs_.end();) {
@r1viollet commented May 22, 2025:

Is it OK to have non-atomic operations on this structure?
Edit: it looks like we no longer rely on GC cleanups (so no risk there).
We have one instance per isolate and we always sample from the same thread?

szegedi (Author):

V8 isolates can only be safely used from a single thread at a time (they have enter/exit APIs for threads for this purpose). The way Node uses them, there's a 1:1 correspondence between an isolate and a thread. Normally this disposal method is invoked from WallProfiler.StopImpl() when the profiler is not restarted, and that itself happens from JS code for TimeProfiler.stop(), so this is expected to always execute on the thread belonging to the current isolate (the same one that calls SetContext, which puts elements into the set).

deadContextPtrs_, on the other hand, is inserted into from the GC callback, and I'm not 100% sure whether GC callbacks are always invoked from the thread currently running JS code. I'd think yes, as GC is triggered during execution of JS code in its allocation path when it needs more memory. But some GC algorithms can also be parallel. That's something we should look into and get 100% certainty on.
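A hedged sketch of the single-threaded reaping being discussed (ReapDeadContexts is a hypothetical helper name; the PR's actual loop over liveContextPtrs_ differs in its details):

```cpp
// Sketch: reap C++ context objects whose ACF was garbage collected.
// Assumes it runs on the isolate's thread, per the discussion above.
void WallProfiler::ReapDeadContexts() {  // hypothetical helper name
  for (auto* ptr : deadContextPtrs_) {
    liveContextPtrs_.erase(ptr);
    delete ptr;
  }
  deadContextPtrs_.clear();
}
```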


auto cpedObj = cped.As<Object>();
auto localSymbol = cpedSymbol_.Get(isolate);
auto maybeProfData = cpedObj->Get(v8Ctx, localSymbol);


I might need a refresher on re-entrant V8 calls and how to protect ourselves (mainly from GC?). The getters call into V8, right?

szegedi (Author):

We register callbacks with the V8 isolate to observe the start and end of GC operations, and when we know the engine is GCing, the signal handler won't touch the CPED. It's actually even better: when our GC prologue callback is invoked, we can still safely read the current sample context and store its shared_ptr<Global<Value>> in a profiler instance field, and the signal handler then just copies that shared_ptr into struct elements in a ring buffer that associates samples with contexts. This way, even GC samples will have a sample context associated with them: the one that was active when the GC started.

And yeah, Object::Get will read from the object. As it is an ordinary property (meaning it does not have a getter function) it will essentially execute a few dependent loads (pointer traversals and offset reads). I believe this is safe to do from a signal handler. There's a V8 blog post that describes this in more detail.

There can still be a concurrency issue between the property being set on the frame object by SetContext and it being read by the signal handler. In SetContext we define the property only once on every frame object. Its JS value is immutable: a v8::External with a pointer to a C++ PersistentContextPtr instance. All mutation happens within this C++ object, which is a subclass of AtomicContextPtr and is coded to be atomic and signal-safe. This leaves only making sure the signal handler does not try to execute Object::Get if it interrupted SetContext while it was executing Object::DefineProperty; we use the setInProgress atomic boolean for this purpose.

This covers all cases I can see. Some of this is captured in the design doc too.
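A hedged sketch of the GC bracketing described above (ProfilerState, gcInProgress, and ReadCurrentContext are illustrative names; the Add/RemoveGCPrologueCallback overloads taking a void* are real V8 API, mirroring the Remove* calls shown in an earlier diff):

```cpp
#include <atomic>
#include <memory>
#include <v8.h>

// Sketch only; stands in for the WallProfiler fields discussed above.
struct ProfilerState {
  std::atomic<bool> gcInProgress{false};
  std::shared_ptr<v8::Global<v8::Value>> gcContext;  // context at GC start
};

static void GCPrologue(v8::Isolate* isolate, v8::GCType, v8::GCCallbackFlags,
                       void* data) {
  auto* state = static_cast<ProfilerState*>(data);
  // Still safe to read the CPED here: capture the active sample context so
  // samples taken during this GC can be attributed to it.
  // state->gcContext = ReadCurrentContext(isolate);  // hypothetical helper
  state->gcInProgress.store(true, std::memory_order_relaxed);
}

static void GCEpilogue(v8::Isolate*, v8::GCType, v8::GCCallbackFlags,
                       void* data) {
  static_cast<ProfilerState*>(data)->gcInProgress.store(
      false, std::memory_order_relaxed);
}

// Registration, mirroring the removal calls shown in the earlier diff:
//   isolate->AddGCPrologueCallback(&GCPrologue, state);
//   isolate->AddGCEpilogueCallback(&GCEpilogue, state);
```

The signal handler would check gcInProgress and, when it is set, use gcContext instead of touching the CPED.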

PersistentContextPtr* contextPtr = nullptr;
auto profData = maybeProfData.ToLocalChecked();
if (profData->IsUndefined()) {
contextPtr = new PersistentContextPtr(&deadContextPtrs_);


could this be too many new operations?

szegedi (Author):

There will be at most as many as there are AsyncContextFrame (ACF) objects in Node (practically, there'll be exactly as many). With GC callbacks we'll keep deleting them in sync with how ACFs are being GCd in the V8 heap. V8 can create new ACF instances on its heap cheaply (as it allocates bigger chunks of memory up front), but we can't allocate our C++ object on that heap, so each one is a separate allocation. I would like to think it is very unlikely to matter. We might want to add comments here in the code to explain this, though.


So O(1000) elements per minute?

szegedi (Author):

Essentially, any time there's an AsyncLocalStorage.enterWith, so quite likely a lot. I could expose a counter and see if we can report it through telemetry?
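A minimal sketch of such a counter (contextPtrAllocations and the flush hook are hypothetical names, not existing telemetry API):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical allocation counter, bumped where PersistentContextPtr is
// created and drained periodically by a telemetry reporter.
static std::atomic<uint64_t> contextPtrAllocations{0};

// At the allocation site:
//   contextPtrAllocations.fetch_add(1, std::memory_order_relaxed);
// At telemetry flush time:
//   uint64_t n = contextPtrAllocations.exchange(0, std::memory_order_relaxed);
```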


auto external = External::New(isolate, contextPtr);
setInProgress.store(true, std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_release);


The release fence should be placed before the atomic operation.
This always confuses me, but the pattern should be:

     Thread                                          Signal handler

fence(Release);         A --------------
m.store(true, Relaxed); X ---------    |
                                  |    |
                                  |    |
                                  -------------> Y  if m.load(Relaxed) {
                                       |-------> B      fence(Acquire);
                                                        ...
                                                    }

szegedi (Author):

Yeah, it's tricky and confuses me too, especially since the specification only discusses acquire/release in the context of loads and stores, not fences, and when the documentation on signal fences says:

a fence with release semantics prevents reads or writes from being moved past subsequent writes and a fence with acquire semantics prevents reads or writes from being moved ahead of preceding reads

it does not say whether it's the atomic reads/writes or the non-atomic ones that are prevented from being moved.

I decided to review all uses of atomic_signal_fence in our code and frankly, there's a bunch of them that look incorrect by the above logic (you can see for yourself). We even have one release after a load, but we surely have a bunch of releases after stores too. I'll canonicalize 'em all.

@szegedi szegedi (Author) commented Jun 10, 2025:

I decided to fix fencing on main first: #217. Let me know there if everything looks right.


I found https://doc.rust-lang.org/std/sync/atomic/fn.compiler_fence.html clearer in its explanations.
From what I understand, the atomic variable provides the synchronization between the thread and the signal handler, and the fences ensure that:

  • no non-atomic writes are moved after the atomic write
  • no non-atomic reads are moved before the atomic read
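Putting that agreed-upon pattern into C++ terms, a minimal sketch (generic names, not the profiler's actual fields):

```cpp
#include <atomic>

int payload = 0;                  // non-atomic data shared with the handler
std::atomic<bool> ready{false};   // flag the signal handler polls

void writer() {
  payload = 42;                                         // write data first
  std::atomic_signal_fence(std::memory_order_release);  // data stays above...
  ready.store(true, std::memory_order_relaxed);         // ...the flag store
}

void handler() {  // runs on the same thread, interrupting writer()
  if (ready.load(std::memory_order_relaxed)) {          // read flag first
    std::atomic_signal_fence(std::memory_order_acquire);// data reads stay below
    int v = payload;                                    // safe to read now
    (void)v;
  }
}
```

Since the handler interrupts the same thread, atomic_signal_fence only has to prevent compiler reordering; no CPU-level fence is needed.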


std::atomic<int> gcCount = 0;
std::atomic<bool> setInProgress = false;


nit:

Suggested change
std::atomic<bool> setInProgress = false;
std::atomic<bool> setInProgress_ = false;

double gcAsyncId;
ContextPtr gcContext;


nit:

Suggested change
ContextPtr gcContext;
ContextPtr gcContext_;

contextPtr = new PersistentContextPtr(&deadContextPtrs_);

auto external = External::New(isolate, contextPtr);
setInProgress.store(true, std::memory_order_relaxed);


I really like the idea of the setInProgress atomic that prevents the read during the update.
With this, AtomicContextPtr should not be needed for CPED, right?
And we could use the same mechanism for the non-CPED flow and remove AtomicContextPtr entirely.

szegedi (Author):

🤔 The behavior would be different for sure. If we manage to sample just when the value is being set, then currently we can still bind the previous value (good question whether that's valid or not), whereas now we'd bail out.
However, it would allow for a fairly strong simplification. We still couldn't just store the JS value in the CPED, I think. At the very minimum, we would still need a v8::External with a std::shared_ptr<v8::Global<v8::Value>>, as the signal handler just copies the shared_ptr into the structs in the ring buffer. We could "just" be resetting the shared_ptr, though. And we'd still need to do the live/dead pointer tracking. But still simpler.


Just bailing out when a context set is in progress seems reasonable; we would lose context for a few samples.
I agree that, sadly, we would still need a shared_ptr on a v8::Global.
We could explore storing only the span ID / endpoint in an array buffer, but it's unclear to me whether that's possible.
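A sketch of the bail-out read path being discussed (illustrative names; assumes the External wraps the shared_ptr<Global<Value>> mentioned above):

```cpp
#include <atomic>
#include <memory>
#include <v8.h>

using SampleContext = std::shared_ptr<v8::Global<v8::Value>>;
extern std::atomic<bool> setInProgress;  // set by SetContext around the write

// Called from the signal handler: copy the current context's shared_ptr for
// the ring buffer, or return empty when a set is mid-flight (losing the
// context for that one sample, as discussed).
SampleContext ReadContext(v8::Isolate* isolate, v8::Local<v8::Symbol> key) {
  if (setInProgress.load(std::memory_order_relaxed)) return nullptr;
  std::atomic_signal_fence(std::memory_order_acquire);
  v8::Local<v8::Value> cped = isolate->GetContinuationPreservedEmbedderData();
  if (!cped->IsObject()) return nullptr;
  v8::Local<v8::Value> val;
  if (!cped.As<v8::Object>()
           ->Get(isolate->GetCurrentContext(), key)
           .ToLocal(&val) ||
      !val->IsExternal()) {
    return nullptr;
  }
  // The External's payload is the shared_ptr; reading it is plain dependent
  // loads, per the earlier discussion about signal safety of Object::Get.
  return *static_cast<SampleContext*>(val.As<v8::External>()->Value());
}
```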

}

WallProfiler::~WallProfiler() {
Dispose(nullptr);
}

class PersistentContextPtr : AtomicContextPtr {


nit: move this class to the top of the file instead of putting it in the middle of the WallProfiler methods

@szegedi szegedi force-pushed the szegedi/cped-context branch 2 times, most recently from c6d75ac to 6d6b07a on June 6, 2025 14:34
@szegedi szegedi force-pushed the szegedi/cped-context branch from 6d6b07a to 98b683c on June 11, 2025 11:25
Labels
semver-minor Usually minor non-breaking improvements
3 participants