-
Notifications
You must be signed in to change notification settings - Fork 487
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Critical path tracing shows the wrong critical path for async script #2371
Comments
@dosubot |
All three foo spans are siblings in your trace (which does not exactly match the code structure where, one could argue, foo3 is a child of foo1 because foo1 doesn't succeed without foo3). There is no "right" way to do critical path when you just have equal siblings, Jaeger simply shows how the execution happened. The most basic definition of critical path is:
I believe Jaeger's depiction meets this definition. What were you expecting to see? |
Also, Jaeger is not claiming that foo3 is blocked by foo2, that's just what your trace is telling us has happened. |
@jkowall Dosu seems not to pick up UI issues, this would've been a good one to LLM. |
In this case, making
The true critical path is foo1() -> foo2(). That would be the critical path I as a human would highlight in the UI. (There would be a bit of unannotated unaccounted for time.)
Is this what the trace says happened? I looked at the JSON and I don't see that there's any indication that says foo3 is blocked on foo2. The only thing we know is that they're technically all sibling spans of main() in terms of nesting of annotations. (i.e. |
Just to elaborate on this:
fwiw yes I agree that in an idealized world, the foo spans might have their own explicit annotations, making this an easier problem to solve. However this is a minimal repro from an issue i'm seeing at scale, where the only thing we have guaranteed annotations for is outbound network requests, in a sea of deeply nested async code - making it difficult to know what was kicked of where by who. This minimal repro is very representative of a real world problem. |
You are basing that on your knowledge of the code, but that's not reflected in the trace. Jaeger can only make decision based on trace information, and in that trace the program was first "doing work" in foo2, then some work in main, then foo3.
It's in the timing of the span. First there was work in foo2, then in foo3. That's all that Jaeger knows. As I mentioned, if you want the trace to more accurately reflect the dependencies, you should have foo3 be a child of foo1. It would then affect how the CP is displayed. When spans are just siblings, Jaeger cannot make any assumptions - the could be running in parallel, OR they could have hidden inter-dependencies (e.g. foo1 and foo3 could be true siblings and yet have a logical dependency |
I think that's precisely the point i'm getting at here! Jaeger fundamentally doesn't have enough information provided in this trace to accurately confidently say "foo2 is in the critical path and i will highlight it as such".
I think this minimal repro shows where this assumption that this is enough to calculate CP breaks down when you have concurrent spans.
Per the above, it looks like Jaeger is making assumptions? To clarify, I think we're in agreement that:
My point is that there doesn't seem to be enough information currently provided in the trace JSON for jaeger to say that "foo2 is in the critical path" - because as we humans can see, it isn't! Maybe in such cases, Jaeger could color the line differently to highlight this is a heuristic/best guess? I think that would help boost confidence in using this feature (otherwise it's difficult to know when it's accurate or not) |
I'll ask again - what do you expect to see? CP is a heuristic, and yes Jaeger is making assumptions but based on the data present in the span (which eg. shows foo2 was running solo and therefor WAS on the critical path, same as foo3). If either of those weren't in the program the parent span would've been shorter. |
Ah I think "CP is a heuristic" is probably the key bit of information i was missing. TIL. Purely looking at the UI this wasn't clear to me that this should be assumed to be a heurstic. The only thing I'm able to see on this is the popover: TBH I don't think it's currently super obvious to users that this is a heuristic.
Again:
Additionally: Is this documented somewhere? I took a look on the docs (https://www.jaegertracing.io/docs/1.18/frontend-ui/) and couldn't see anything related to CP (mentioning that it's a heuristic or otherwise) - maybe I couldn't find the page though? I would either accept the status quo and:
OR maybe this solvable with a spec change + smarter SDK? open-telemetry/opentelemetry-specification#4079 related
I could shorten or remove foo2 altogether and it wouldn't make the request any shorter. Based on my understanding and your definition above ("if making the execution of that segment shorter would make the whole request shorter"), foo2 is not on the critical path. |
we could add a note in the popup |
Sounds like a good idea. I would also encourage exploring the idea of coloring the CP differently in cases where there is ambiguity (such as here). In a world where there are only sequential non async spans (and maybe this is provided as an input via the SDK) then the current approach seems sound. In an a microservice environment you may have a mix of both. |
I wouldn't do different coloring. OTEL data model is fundamentally lacking certain causality information, so the algorithm is always a heuristic, since we never know the true causal graph of events. Feel free to open a PR to add a warning to the tooltip. |
👋 I've been on a bit of journey trying ot understand how critical path tracing works. I'm still not 100% so apologies if this is wrong, but would appreciate y'alls input :)
Full context here open-telemetry/opentelemetry-specification#4079
Anyway, I was able to come up with the following minimal repro:
This visualization appears to be wrong ^
(JSON)
Per the following code (which produces this trace),
/get foo3
is not blocked by/get foo2
:full code (including imports)
I understand that the algorithm walks backwards, but i'm guessing it can only do so heuristically? Or is there something else i'm missing here? (I don't know if we provide the "previous sibling span id"? Is this an issue with the spec itself (hence the other issue)?)
Thanks!
Jaeger backend version
1.58.0
SDK
Pipeline
jaeger-1.58.0-darwin-amd64/jaeger-all-in-one
locally on my macbook.http://localhost:4317
Operating system
MacOS
Deployment model
Running
jaeger-all-in-one
locallyThe text was updated successfully, but these errors were encountered: