Store sampling context in ContinuationPreservedEmbedderData #186
Conversation
```diff
@@ -66,6 +66,7 @@ export interface TimeProfilerOptions {
   withContexts?: boolean;
   workaroundV8Bug?: boolean;
   collectCpuTime?: boolean;
+  useCPED?: boolean;
```
Overall package size

Self size: 1.62 MB

Dependency sizes

| name | version | self size | total size |
|------|---------|-----------|------------|
| source-map | 0.7.4 | 226 kB | 226 kB |
| pprof-format | 2.1.0 | 111.69 kB | 111.69 kB |
| p-limit | 3.1.0 | 7.75 kB | 13.78 kB |
| delay | 5.0.0 | 11.17 kB | 11.17 kB |
| node-gyp-build | 3.9.0 | 8.81 kB | 8.81 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe
Benchmarks

Benchmark execution time: 2025-06-11 11:29:00. Comparing candidate commit 98b683c in PR branch. Found 1 performance improvement and 0 performance regressions. Performance is the same for 89 metrics; 30 unstable metrics.

scenario: profiler-light-load-with-wall-profiler-20
```diff
@@ -0,0 +1,190 @@
+# Storing Sample Context in V8 Continuation-Preserved Embedder Data
```
huge thanks for the very comprehensive doc
```diff
@@ -521,12 +523,16 @@ WallProfiler::WallProfiler(std::chrono::microseconds samplingPeriod,
   workaroundV8Bug_ = workaroundV8Bug && DD_WALL_USE_SIGPROF && detectV8Bug_;
   collectCpuTime_ = collectCpuTime && withContexts;
   collectAsyncId_ = collectAsyncId && withContexts;
+#if NODE_MAJOR_VERSION >= 23
```
I thought this was Node 24? Can we really rely on 23 being OK?
The `AsyncContextFrame` implementation was released in Node 23 but needed a command-line option to activate; Node 24 just turned it on by default. It's pretty much the same code in both. I developed and tested against 23 for most of the time, as 24 was only released recently. (That is also the reason it says 23 here :-) )

FWIW, now that Node 24 is out, we no longer publish a release for 23, so we might as well bump this to 24. Technically, though, there's no reason not to support this in source code from 23 onwards.
bindings/profilers/wall.cc
Outdated
```cpp
  isolate->RemoveGCPrologueCallback(&GCPrologueCallback, this);
  isolate->RemoveGCEpilogueCallback(&GCEpilogueCallback, this);
}

for (auto it = liveContextPtrs_.begin(); it != liveContextPtrs_.end();) {
```
Is it OK to have non-atomic operations on this structure?

edit: it looks like we no longer rely on GC cleanups (so no risk there). We have one instance per isolate and we always sample from the same thread?
V8 isolates can be safely used from only a single thread at a time (they have enter/exit APIs for threads for this purpose.) The way Node uses them, there's a 1:1 correspondence between an isolate and a thread. Normally this disposal method is invoked from `WallProfiler.StopImpl()` when the profiler is not restarted – and that itself happens from JS code for `TimeProfiler.stop()` – so this is expected to always execute on the thread belonging to the current isolate (the same one that calls `SetContext`, which puts elements into the set.)

`deadContextPtrs_`, OTOH, is inserted into from the GC callback, and I'm not 100% sure whether GC callbacks are always invoked from the thread currently running JS code. I'd think yes, as GC is triggered during execution of JS code, in its allocation path, when it needs more memory. But some GC algorithms can also be parallel. That's actually something we should look into and get 100% certainty on.
bindings/profilers/wall.cc
Outdated
```cpp
auto cpedObj = cped.As<Object>();
auto localSymbol = cpedSymbol_.Get(isolate);
auto maybeProfData = cpedObj->Get(v8Ctx, localSymbol);
```
I might need a refresher on re-entrant V8 calls and how to protect ourselves (mainly from GC?). The getters call into V8, right?
We register callbacks with the V8 isolate to observe the start and end of GC operations, and while we know the engine is GCing, the signal handler won't touch the CPED. It's actually even better: when our GC prologue callback is invoked, we can still safely read the current sample context and store its `shared_ptr<Global<Value>>` in a profiler instance field, and the signal handler then just copies that `shared_ptr<>` into struct elements in a ring buffer that associates samples with contexts. This way, even the GC samples will have a sample context associated with them – the one that was active when the GC started.

And yeah, `Object::Get` will read into the object. As it is an ordinary property (meaning it does not have a getter function) it will essentially execute a few dependent loads (pointer traversals and offset reads.) I believe this is safe to do from a signal handler. There's a V8 blog post that describes this in more detail.

There can still be a concurrency issue between the property on the frame object being set by `SetContext` and it being read by the signal handler. In `SetContext` we define the property only once on every frame object. Its JS value is immutable – a `v8::External` with a pointer to a C++ `PersistentContextPtr` instance. All mutation happens within this C++ object, which is a subclass of `AtomicContextPtr` and is coded to be atomic and signal-safe. That leaves only making sure the signal handler does not try to execute `Object::Get` if it interrupted `SetContext` while it was executing `Object::DefineProperty` – we use the `setInProgress` atomic boolean for this purpose.

This covers all the cases I can see. Some of this is captured in the design doc too.
```cpp
PersistentContextPtr* contextPtr = nullptr;
auto profData = maybeProfData.ToLocalChecked();
if (profData->IsUndefined()) {
  contextPtr = new PersistentContextPtr(&deadContextPtrs_);
```
Could this be too many `new` operations?
There will be at most as many as there are `AsyncContextFrame` (ACF) objects in Node (practically, there'll be exactly as many.) With GC callbacks we'll keep deleting them in sync with how ACFs are being GCd in the V8 heap. It's theoretically possible that V8 could still create new ACF instances cheaply on its heap (as it allocates bigger chunks of memory up front) while we can't allocate a matching C++ object that cheaply. I would like to think it is very unlikely to matter. We might want to add comments here in the code, though, to explain this.
So O(1000) elements per minute?
Essentially, any time there's an `AsyncLocalStorage.enterWith`, so quite likely a lot. I could expose a counter and see if we can report it through telemetry?
bindings/profilers/wall.cc
Outdated
```cpp
auto external = External::New(isolate, contextPtr);
setInProgress.store(true, std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_release);
```
The release fence should be placed before the atomic operation.

This always confuses me, but the pattern should be:

```
Thread                           Signal handler

A: fence(Release);
X: m.store(true, Relaxed);  ──►  Y: if (m.load(Relaxed)) {
                                 B:   fence(Acquire);
                                      ...
                                    }
```
Yeah, it's tricky and confuses me too, especially since the specification only discusses this in the context of loads and stores, not fences, and when the documentation on signal fences says:

> a fence with release semantics prevents reads or writes from being moved past subsequent writes and a fence with acquire semantics prevents reads or writes from being moved ahead of preceding reads

it does not say whether it is the atomic reads/writes or the non-atomic ones that are being prevented from moving.

I decided to review all uses of `atomic_signal_fence` in our code and, frankly, there's a bunch of them that look incorrect following the above logic (you can see for yourself.) We even have one release after a load, and we surely have a bunch of releases after stores too. I'll canonicalize 'em all.
I decided to fix fencing on main first: #217. Let me know there if everything looks right.
I found https://doc.rust-lang.org/std/sync/atomic/fn.compiler_fence.html clearer in its explanations.
From what I understand, the atomic variable provides the synchronization between thread and signal handler, and the fences ensure that:
- no non-atomic writes are moved after the atomic write
- no non-atomic reads are moved before the atomic read
bindings/profilers/wall.hh
Outdated
```cpp
std::atomic<int> gcCount = 0;
std::atomic<bool> setInProgress = false;
```
nit:
```diff
-std::atomic<bool> setInProgress = false;
+std::atomic<bool> setInProgress_ = false;
```
bindings/profilers/wall.hh
Outdated
```cpp
double gcAsyncId;
ContextPtr gcContext;
```
nit:
```diff
-ContextPtr gcContext;
+ContextPtr gcContext_;
```
bindings/profilers/wall.cc
Outdated
```cpp
contextPtr = new PersistentContextPtr(&deadContextPtrs_);

auto external = External::New(isolate, contextPtr);
setInProgress.store(true, std::memory_order_relaxed);
```
I really like the idea of the `setInProgress` atomic that prevents the read during the update.
With this, `AtomicContextPtr` should not be needed for CPED, right?
And we could use the same mechanism for the non-CPED flow and remove `AtomicContextPtr` altogether.
🤔 The behavior would be different for sure. If we manage to sample just when the value is being set, then currently we can still bind the previous value (good question whether that is valid or not), whereas now we'd bail out.

However, it would allow for a fairly strong simplification. We still couldn't just store the JS value in the CPED, I think. At the very minimum, we would still need a `v8::External` with a `std::shared_ptr<v8::Global<v8::Value>>`, as the signal handler just copies the `shared_ptr` into the structs in the ring buffer. We could "just" be resetting the shared_ptr, though. And we'd still need to do the live/dead pointer tracking. But still simpler.
Just bailing out when a context set is in progress seems reasonable; we would lose context for a few samples.
I agree that, sadly, we would still need a shared_ptr on a `v8::Global`.
We could explore storing only the span ID / endpoint in an array buffer, but it's unclear to me if that's possible.
bindings/profilers/wall.cc
Outdated
```cpp
}

WallProfiler::~WallProfiler() {
  Dispose(nullptr);
}

class PersistentContextPtr : AtomicContextPtr {
```
nit: move this class to the top of the file instead of putting it in the middle of `WallProfiler` methods
What does this PR do?:
Moves the profiler's sampling context into the V8 isolate's ContinuationPreservedEmbedderData (CPED). More precisely, it moves it into a private symbol-keyed property within that object, as Node.js 24 and later will store `AsyncContextFrame` objects in the CPED.

Motivation:
Before this change, we rely on async_hooks in dd-trace-js to update the sampling context on every context switch. Doing this on every async context switch is somewhat costly, and we should also migrate away from async hooks: profiling is their last user, they'll be deprecated in Node, and they bring their own performance overhead.

This PR comes with a document explaining the design.

Additional Notes:
The code using this in dd-trace-js's `wall.js` sets the context on `AsyncLocalStorage.{enterWith|run}` and in the `before` callback of `AsyncHooks.createHook`. With CPED storage enabled, we can eliminate the `before` callback and "only" keep the `AsyncLocalStorage.{enterWith|run}` callbacks – they are the minimum amount of instrumentation needed for the feature to work correctly.

This branch can be tested with the same-named branch in dd-trace-js.

Jira: PROF-11771