Conversation
Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>
This is useful for testing. Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>
This is in preparation for writing the new backend. After we've made substantial changes to `src/new_trace_writer.ml`, you'll still be able to track the history by running: ```bash git blame -C -C src/new_trace_writer.ml ``` Yes, you need to pass `-C` *twice*: > -C ... when this option is given twice, the command additionally looks for copies from other files in the commit that creates the file. Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>
The existing `Trace_writer` module works well enough (albeit not perfectly) most of the time. However, it is difficult to reason about, in large part because it writes the trace in a streaming fashion. That introduces significant additional complexity and book-keeping, and limits the ability of the trace-writer to make use of information discovered later in the trace (I believe the latter is why traces produced today often have the few frames closest to the root wrong). Because we want to extend the trace-writer in the near future, we're starting fresh with a different design that's easier to reason about. The new implementation currently exists alongside the original, but the goal is to eventually replace it entirely. Instead of writing the trace in a streaming fashion, we construct an internal representation of the trace in memory, and write out the trace in a separate, final pass once all of the events have been consumed. The module responsible for doing most of the heavy lifting is the new `Trace_segment`, which represents a continuous, **lossless, and error-free** segment of the trace; we create a new trace-segment whenever we encounter an error. **This PR does not represent a complete, finished product.** The code here does indeed work, and already produces better traces than the existing backend in several cases, but has several critical pieces missing: - [ ] Trace-segment stitching: At present we naively treat each trace-segment as disjoint. We need to add an additional "stitching" pass before the trace is written out, making a heuristic, best-effort attempt to join together adjacent trace-segments in a way that preserves control-flow continuity. - [ ] OCaml exception handling logic when OCaml-specific debug-info is *not* available - [ ] Golang support (are we keeping this?) It should also go without saying that while this code appears to work well on the traces I've tried it on, I would not at all be surprised if there are still bugs/edge-cases. Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>
…tation to use Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>
| type ('a : k) t | ||
|
|
||
| val create : 'a -> 'a t | ||
| val length : _ t -> int |
| thread. *) | ||
| type t | ||
|
|
||
| val create : Ocaml_exception_info.t option -> in_filtered_region:bool -> t |
| : trace_scope:Trace_scope.t | ||
| -> debug_info:Elf.Addr_table.t option | ||
| -> ocaml_exception_info:Ocaml_exception_info.t option | ||
| -> earliest_time:Time_ns.Span.t |
There was a problem hiding this comment.
Should all of these be Timestamp.ts?
| -> earliest_time:Time_ns.Span.t | ||
| -> hits:(string * Breakpoint.Hit.t) list | ||
| -> annotate_inferred_start_times:bool | ||
| -> (module Trace with type thread = _) |
| -> (module Trace_writer_intf.S_trace with type thread = 'thread) | ||
| -> 'thread | ||
| -> Elf.Addr_table.t | ||
| -> enter_initial_callstack:bool |
There was a problem hiding this comment.
Mmm, it feels a little strange that the state of whether this is the first or last trace segment is injected into write_trace... In theory we should already know this, depending on whether this t was created with create_continuing_from or not?
There was a problem hiding this comment.
...actually, what is the intent of this? I only see this being called with true for both params.
| Returns the matching frame (if found), and that frame's distance from the initial | ||
| frame (e.g. a call to [find my_frame my_symbol] with a return value of | ||
| [#(This _, ~distance:0)] indicates that [my_frame.location.symbol] is [my_symbol]). *) | ||
| val find : t -> Symbol.t -> #(t Or_null.t * distance:int) |
There was a problem hiding this comment.
What does #(Null, distance) represent? i.e., why is this not a #(t, distance:int) or_null?
| val find : t -> Symbol.t -> #(t Or_null.t * distance:int) | ||
|
|
||
| (** Iterate from leaf-to-root up to the given number of frames, or until encountering | ||
| the [Sentinel.t] *) |
There was a problem hiding this comment.
| the [Sentinel.t] *) | |
| the [Sentinel.t]. *) |
| tail-recursive, given that frames form a singly-linked list from leaf-to-root. *) | ||
| val iter_rev : t -> f:local_ (t -> unit) -> unit | ||
|
|
||
| val find_ancestor : t -> ancestor:t -> int Or_null.t |
There was a problem hiding this comment.
Missing comment, I'm assuming this returns a distance?
For consistency, should we return a #(Vnit.t, distance:int)?
| (* These fields are actually **immutable** except for [Sentinel.t] instances. *) | ||
| type t = private | ||
| { mutable location : Event.Location.t | ||
| ; mutable parent : t Or_null.t |
There was a problem hiding this comment.
Nit: prefer or_null over Or_null.t.
| type nonrec t = t | ||
|
|
||
| let sentinel_location : Location.t = | ||
| { instruction_pointer = 0L; symbol_offset = 0; symbol = From_perf "\x00" } |
There was a problem hiding this comment.
Is the From_perf ... semantically meaningful, or could we use Unknown here?
| type t = | ||
| #{ time : Timestamp.t | ||
| ; leaf : Frame.t | ||
| ; control_flow : Control_flow.t |
There was a problem hiding this comment.
Is this the control flow that led to this Callstack.t, or the one that is next made from this Callstack.t? I'm assuming the former.
But also, why do we care?
| In contrast to [callstacks] — which records the entire history of control-flow for | ||
| later examination — [exception_handlers] represents the state **as of the event we | ||
| are currently processing**, and as such is only used during the "ingestion" phase | ||
| (i.e. while calls are still being made to [add_event]). *) |
There was a problem hiding this comment.
I'm assuming that an invariant here is that every Frame.t in exception_handlers is present in tl callstacks. Is that right?
| In contrast to [callstacks] — which records the entire history of control-flow for | ||
| later examination — [exception_handlers] represents the state **as of the event we | ||
| are currently processing**, and as such is only used during the "ingestion" phase | ||
| (i.e. while calls are still being made to [add_event]). *) |
There was a problem hiding this comment.
An interesting question here is, what do we do if the program did
call apushtrapcall bcall ccall dpushtrapcall epushtrap
but due to an IPT error, we missed [1, 6]. So now we have a fresh Trace_segment.t containing only e.
In that case, I believe exception_handlers = {e}.
If we then see
entertap eentertrap dentertrap a
what will happen?
Certainly an incorrect thing to do here is to infer that the callstack must have been a -> d -> e: that will fail to ever stitch correctly.
Probably the most sane thing to do when we see an unmatched entertrap is to break the Trace_segment.t as we would on a regular error. Maybe you already do this; I haven't read that far into this file yet.
| are currently processing**, and as such is only used during the "ingestion" phase | ||
| (i.e. while calls are still being made to [add_event]). *) | ||
| ; mutable last_known_instruction_pointer : int64 | ||
| ; in_filtered_region : bool |
There was a problem hiding this comment.
Why is this a property of t, as opposed to a post-processing step? Not saying it's wrong, just want to understand.
| } | ||
| ;; | ||
|
|
||
| let in_filtered_region t = t.in_filtered_region |
There was a problem hiding this comment.
[@@deriving fields ~getters] + DCE?
|
|
||
| (** Mutate [t]'s contents to the provided [location] and [parent] and return [t] as a | ||
| [frame]. *) | ||
| val become_frame : t -> Location.t -> parent:frame -> frame |
There was a problem hiding this comment.
Is parent:frame ever not going to be sentinel:t?
| let in_filtered_region t = t.in_filtered_region | ||
| let[@inline always] current_frame t = (Nonempty_vec.last t.callstacks).#leaf | ||
|
|
||
| let replace_root t location = |
There was a problem hiding this comment.
Nit: this isn't really replacing the root, it's strictly an additive operation.
| match Frame.find (current_frame t) src.symbol with | ||
| | #(This _, ~distance:0) -> (* The happy case, [src] matches [current_frame t]. *) () | ||
| | #(This src_frame, ~distance) -> | ||
| (* [src] exists, but is higher up the callstack. *) |
| (* First, reconcile things such that [src] matches [current_frame t] if it doesn't | ||
| already. *) | ||
| let () = | ||
| match Frame.find (current_frame t) src.symbol with |
There was a problem hiding this comment.
Could we order these in the order of most-to-least likely to occur? They are not all of equal importance.
I think the happy case is happy, and then the beginning-of-trace call is the second most-happy.
The other two cases I don't really understand how we can hit.
|
|
||
| let handle_return (t : t) (time : Timestamp.t) ~(dst : Location.t) = | ||
| match (current_frame t).parent with | ||
| | Null -> |
There was a problem hiding this comment.
How can this happen? Does this mean we're executing with only the sentinel being around?
| (* 99% of the time [distance] should be 0, indicating we are returning to | ||
| [parent_frame] as expected. We allow for the possibility of "long" returns to | ||
| account for [Sysret]/[Iret] events that return to userspace directly from deep | ||
| within their kernel/interrupt stack. *) |
There was a problem hiding this comment.
Is this sufficient to handle something like rseq aborting, where the abort IP may not be present in our callstack at all?
| | Trace { src; dst; _ } -> | ||
| Ocaml_exception_info.iter_pushtraps_and_poptraps_in_range | ||
| ocaml_exception_info | ||
| ~from:t.last_known_instruction_pointer |
There was a problem hiding this comment.
Does this do anything reasonable when last_known_instruction_pointer = max_value as at initialization time?
| module Nonempty_vec = Nonempty_vec.Value | ||
|
|
||
| let debug = ref false | ||
| let is_kernel_address addr = Int64.(addr < 0L) |
There was a problem hiding this comment.
Rather than keep lots of old trace_writer.ml code in here, can we instead delete all of it, and have something like a Multi_trace_writer module that dispatches to both Trace_writer and New_trace_writer in the cases where you want to debug and compare?
Introduce new, experimental trace-writing backend
The existing
Trace_writermodule works well enough (albeit not perfectly) most of thetime. However, it is difficult to reason about, in large part because it writes the
trace in a streaming fashion. That introduces significant additional complexity and
book-keeping, and limits the ability of the trace-writer to make use of information
discovered later in the trace (I believe the latter is why traces produced today often
have the few frames closest to the root wrong). Because we want to extend the trace-writer
in the near future, we're starting fresh with a different design that's easier to reason about.
The new implementation currently exists alongside the original, but the goal is to eventually
replace it entirely.
Instead of writing the trace in a streaming fashion, we construct an internal
representation of the trace in memory, and write out the trace in a separate, final pass
once all of the events have been consumed. The module responsible for doing most of the
heavy lifting is the new
Trace_segment, which represents a continuous, lossless, anderror-free segment of the trace; we create a new trace-segment whenever we encounter an
error.
This PR does not represent a complete, finished product. The code here does indeed work,
and already produces better traces than the existing backend in several cases, but has several
critical pieces missing:
disjoint. We need to add an additional "stitching" pass before the trace is written out,
making a heuristic, best-effort attempt to join together adjacent trace-segments in a
way that preserves control-flow continuity.
It should also go without saying that while this code appears to work well on the traces I've
tried it on, I would not at all be surprised if there are still bugs/edge-cases. On account of this and the aforementioned missing features, using the new implementation is opt-in via the environment variable
MAGIC_TRACE_USE_NEW_TRACE_WRITER.