Trace segments alpha by Svetlitski · Pull Request #353 · janestreet/magic-trace

Svetlitski · 2026-02-09T21:18:48Z

Introduce new, experimental trace-writing backend

The existing Trace_writer module works well enough (albeit not perfectly) most of the
time. However, it is difficult to reason about, in large part because it writes the
trace in a streaming fashion. That introduces significant additional complexity and
book-keeping, and limits the ability of the trace-writer to make use of information
discovered later in the trace (I believe the latter is why traces produced today often
have the few frames closest to the root wrong). Because we want to extend the trace-writer
in the near future, we're starting fresh with a different design that's easier to reason about.
The new implementation currently exists alongside the original, but the goal is to eventually
replace it entirely.

Instead of writing the trace in a streaming fashion, we construct an internal
representation of the trace in memory, and write out the trace in a separate, final pass
once all of the events have been consumed. The module responsible for doing most of the
heavy lifting is the new Trace_segment, which represents a continuous, lossless, and
error-free segment of the trace; we create a new trace-segment whenever we encounter an
error.
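The two-pass design described above can be sketched roughly as follows. This is a minimal illustration in plain OCaml, not the PR's `Trace_segment`: the `event` type, `split_into_segments`, and `write_trace` are hypothetical names, and events are reduced to strings.

```ocaml
(* Hypothetical, simplified event type: the real decoder's events carry far
   more information. *)
type event =
  | Ok_event of string (* a successfully decoded event, named for illustration *)
  | Decode_error (* e.g. trace data was lost *)

(* A segment is a run of consecutive events with no error in between. *)
type segment = string list

(* Pass 1: consume every event, starting a new segment at each error.
   Segments and their events are accumulated in reverse, then flipped. *)
let split_into_segments (events : event list) : segment list =
  let finish current finished =
    match current with
    | [] -> finished (* do not record empty segments *)
    | _ -> List.rev current :: finished
  in
  let rec go current finished = function
    | [] -> List.rev (finish current finished)
    | Ok_event name :: rest -> go (name :: current) finished rest
    | Decode_error :: rest -> go [] (finish current finished) rest
  in
  go [] [] events

(* Pass 2: only now, with every segment in memory, is anything written. *)
let write_trace (segments : segment list) : string list =
  List.mapi
    (fun i segment -> Printf.sprintf "segment %d: %s" i (String.concat " " segment))
    segments
```

Because nothing is emitted until the first pass has finished, the writing pass is free to consult any segment, including ones that were built later.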

This PR does not represent a complete, finished product. The code here does indeed work,
and already produces better traces than the existing backend in several cases, but has several
critical pieces missing:

- [ ] Trace-segment stitching: At present we naively treat each trace-segment as
  disjoint. We need to add an additional "stitching" pass before the trace is written out,
  making a heuristic, best-effort attempt to join together adjacent trace-segments in a
  way that preserves control-flow continuity.
- [ ] OCaml exception handling logic when OCaml-specific debug-info is not available
- [ ] Golang support (are we keeping this?)

It should also go without saying that while this code appears to work well on the traces I've
tried it on, I would not at all be surprised if there are still bugs/edge-cases. On account of this and the aforementioned missing features, using the new implementation is opt-in via the environment variable MAGIC_TRACE_USE_NEW_TRACE_WRITER.
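As a rough sketch of what such an opt-in switch might look like: the description names only the variable, so the `backend` type, the `select_backend` function, and the rule that any value enables the new writer are all assumptions made for illustration.

```ocaml
(* Hypothetical selector type; the PR does not show how the variable is read. *)
type backend =
  | Original
  | New

(* Assumption: the variable being set at all, to any value, opts in.
   [lookup] is a parameter so the decision can be exercised without touching
   the real environment. *)
let select_backend (lookup : string -> string option) : backend =
  match lookup "MAGIC_TRACE_USE_NEW_TRACE_WRITER" with
  | Some _ -> New
  | None -> Original

(* In a real program the lookup would be the process environment. *)
let chosen_backend = select_backend Sys.getenv_opt
```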

Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>


This is in preparation for writing the new backend. After we've made substantial changes to `src/new_trace_writer.ml`, you'll still be able to track the history by running:

```bash
git blame -C -C src/new_trace_writer.ml
```

Yes, you need to pass `-C` *twice*:

> -C ... when this option is given twice, the command additionally looks for copies from other files in the commit that creates the file.

Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>



Xyene · 2026-02-23T04:44:37Z

src/nonempty_vec.mli

+  type ('a : k) t
+
+  val create : 'a -> 'a t
+  val length : _ t -> int


intentionally _ instead of 'a?
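For context, in a plain OCaml signature `_` in a `val` declaration is an anonymous type variable, so `_ t -> int` and `'a t -> int` describe the same type. The sketch below shows one module accepted under both spellings. The module names and the pair-based representation are illustrative only, and the sketch does not cover the `('a : k)` kind annotation used in the PR.

```ocaml
module type S = sig
  type 'a t

  val create : 'a -> 'a t
  val length : _ t -> int
end

(* A toy non-empty sequence: a head element plus a possibly empty tail. This
   representation is illustrative only, not the one in the PR. *)
module Nonempty : S = struct
  type 'a t = 'a * 'a list

  let create x = (x, [])
  let length (_, rest) = 1 + List.length rest
end

(* The same signature with a named type variable accepts the same module. *)
module Same : sig
  type 'a t

  val create : 'a -> 'a t
  val length : 'a t -> int
end =
  Nonempty
```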

Xyene · 2026-02-23T04:48:12Z

src/trace_segment.mli

+    thread. *)
+type t
+
+val create : Ocaml_exception_info.t option -> in_filtered_region:bool -> t


Nit: ~ocaml_exception_info

Xyene · 2026-02-23T04:55:12Z

src/trace_writer_implementation_intf.ml

+    :  trace_scope:Trace_scope.t
+    -> debug_info:Elf.Addr_table.t option
+    -> ocaml_exception_info:Ocaml_exception_info.t option
+    -> earliest_time:Time_ns.Span.t


Should all of these be Timestamp.ts?

Xyene · 2026-02-23T04:55:52Z

src/trace_writer_implementation_intf.ml

+    -> earliest_time:Time_ns.Span.t
+    -> hits:(string * Breakpoint.Hit.t) list
+    -> annotate_inferred_start_times:bool
+    -> (module Trace with type thread = _)


is the with necessary?

Xyene · 2026-03-03T05:15:19Z

src/trace_segment.mli

+  -> (module Trace_writer_intf.S_trace with type thread = 'thread)
+  -> 'thread
+  -> Elf.Addr_table.t
+  -> enter_initial_callstack:bool


Mmm, it feels a little strange that the state of whether this is the first or last trace segment is injected into write_trace... In theory we should already know this, depending on whether this t was created with create_continuing_from or not?

...actually, what is the intent of this? I only see this being called with true for both params.

Xyene · 2026-03-03T05:23:31Z

src/trace_segment.ml

+      Returns the matching frame (if found), and that frame's distance from the initial
+      frame (e.g. a call to [find my_frame my_symbol] with a return value of
+      [#(This _, ~distance:0)] indicates that [my_frame.location.symbol] is [my_symbol]). *)
+  val find : t -> Symbol.t -> #(t Or_null.t * distance:int)


What does #(Null, distance) represent? i.e., why is this not a #(t, distance:int) or_null?
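To make the question concrete, here is a minimal sketch of a leaf-to-root search that returns both a result and a distance. It uses ordinary `option` and a boxed pair in place of `Or_null.t` and the unboxed tuple, and the `frame` record is a hypothetical simplification. What the distance means when nothing matches is an assumption of this sketch, which is exactly the point the comment above asks about.

```ocaml
(* Hypothetical, simplified frame: the PR's [Frame.t] holds a full location and
   uses [Or_null.t] and unboxed tuples, approximated here with [option] and an
   ordinary pair. *)
type frame =
  { symbol : string
  ; parent : frame option (* [None] at the root *)
  }

(* Walk from [frame] towards the root looking for [symbol]. The second
   component counts the parent links followed. Assumption: when nothing
   matches, the count is the number of frames that were examined. *)
let find (frame : frame) (symbol : string) : frame option * int =
  let rec go current distance =
    if String.equal current.symbol symbol
    then (Some current, distance)
    else (
      match current.parent with
      | None -> (None, distance + 1)
      | Some parent -> go parent (distance + 1))
  in
  go frame 0
```

Returning the pair inside an option instead, as the comment suggests, would remove the need to give the distance any meaning in the not-found case.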

Xyene · 2026-03-03T05:23:54Z

src/trace_segment.ml

+  val find : t -> Symbol.t -> #(t Or_null.t * distance:int)
+
+  (** Iterate from leaf-to-root up to the given number of frames, or until encountering
+      the [Sentinel.t] *)


Suggested change:

-      the [Sentinel.t] *)
+      the [Sentinel.t]. *)

Xyene · 2026-03-03T05:25:02Z

src/trace_segment.ml

+      tail-recursive, given that frames form a singly-linked list from leaf-to-root. *)
+  val iter_rev : t -> f:local_ (t -> unit) -> unit
+
+  val find_ancestor : t -> ancestor:t -> int Or_null.t


Missing comment, I'm assuming this returns a distance?

For consistency, should we return a #(Vnit.t, distance:int)?

Xyene · 2026-03-03T05:25:46Z

src/trace_segment.ml

+  (* These fields are actually **immutable** except for [Sentinel.t] instances. *)
+  type t = private
+    { mutable location : Event.Location.t
+    ; mutable parent : t Or_null.t


Nit: prefer or_null over Or_null.t.

Xyene · 2026-03-03T05:28:42Z

src/trace_segment.ml

+    type nonrec t = t
+
+    let sentinel_location : Location.t =
+      { instruction_pointer = 0L; symbol_offset = 0; symbol = From_perf "\x00" }


Is the From_perf ... semantically meaningful, or could we use Unknown here?

Xyene · 2026-03-03T05:29:51Z

src/trace_segment.ml

+  type t =
+    #{ time : Timestamp.t
+     ; leaf : Frame.t
+     ; control_flow : Control_flow.t


Is this the control flow that led to this Callstack.t, or the one that is next made from this Callstack.t? I'm assuming the former.

But also, why do we care?

Xyene · 2026-03-03T05:32:15Z

src/trace_segment.ml

+      In contrast to [callstacks] — which records the entire history of control-flow for
+      later examination — [exception_handlers] represents the state **as of the event we
+      are currently processing**, and as such is only used during the "ingestion" phase
+      (i.e. while calls are still being made to [add_event]). *)


I'm assuming that an invariant here is that every Frame.t in exception_handlers is present in tl callstacks. Is that right?

Xyene · 2026-03-03T05:37:35Z

src/trace_segment.ml

+      In contrast to [callstacks] — which records the entire history of control-flow for
+      later examination — [exception_handlers] represents the state **as of the event we
+      are currently processing**, and as such is only used during the "ingestion" phase
+      (i.e. while calls are still being made to [add_event]). *)


An interesting question here is, what do we do if the program did

1. call a
2. pushtrap
3. call b
4. call c
5. call d
6. pushtrap
7. call e
8. pushtrap

but due to an IPT error, we missed [1, 6]. So now we have a fresh Trace_segment.t containing only e.

In that case, I believe exception_handlers = {e}.

If we then see

entertrap e

entertrap d

entertrap a

what will happen?

Certainly an incorrect thing to do here is to infer that the callstack must have been a -> d -> e: that will fail to ever stitch correctly.

Probably the most sane thing to do when we see an unmatched entertrap is to break the Trace_segment.t as we would on a regular error. Maybe you already do this; I haven't read that far into this file yet.
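The policy proposed in this comment could be sketched as below. This is not taken from the PR: the `trap_event` and `outcome` types and the `step` function are hypothetical, and handlers are identified by function name only.

```ocaml
(* Hypothetical event and outcome types, named for illustration only. *)
type trap_event =
  | Pushtrap of string (* a handler installed while in the named function *)
  | Entertrap of string (* control transferred to the named function's handler *)

type outcome =
  | Continue of string list (* the updated handler stack, innermost first *)
  | Break_segment (* the handler was installed before this segment began *)

(* Apply one event to the stack of handlers known to the current segment. An
   [Entertrap] that does not match the innermost known handler is treated like
   any other error: the segment ends rather than guessing at a callstack. *)
let step (handlers : string list) (event : trap_event) : outcome =
  match event, handlers with
  | Pushtrap fn, _ -> Continue (fn :: handlers)
  | Entertrap fn, innermost :: rest when String.equal fn innermost -> Continue rest
  | Entertrap _, _ -> Break_segment
```

In the scenario above, the fresh segment knows only the handler in `e`, so the `entertrap` for `e` is matched and the one for `d` breaks the segment.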

Xyene · 2026-03-03T05:39:02Z

src/trace_segment.ml

+      are currently processing**, and as such is only used during the "ingestion" phase
+      (i.e. while calls are still being made to [add_event]). *)
+  ; mutable last_known_instruction_pointer : int64
+  ; in_filtered_region : bool


Why is this a property of t, as opposed to a post-processing step? Not saying it's wrong, just want to understand.

Xyene · 2026-03-03T05:42:33Z

src/trace_segment.ml

+  }
+;;
+
+let in_filtered_region t = t.in_filtered_region


[@@deriving fields ~getters] + DCE?

Xyene · 2026-03-03T05:44:18Z

src/trace_segment.ml

+
+    (** Mutate [t]'s contents to the provided [location] and [parent] and return [t] as a
+        [frame]. *)
+    val become_frame : t -> Location.t -> parent:frame -> frame


Is parent:frame ever not going to be sentinel:t?

Xyene · 2026-03-03T05:45:35Z

src/trace_segment.ml

+let in_filtered_region t = t.in_filtered_region
+let[@inline always] current_frame t = (Nonempty_vec.last t.callstacks).#leaf
+
+let replace_root t location =


Nit: this isn't really replacing the root, it's strictly an additive operation.

Xyene · 2026-03-03T06:01:02Z

src/trace_segment.ml

+    match Frame.find (current_frame t) src.symbol with
+    | #(This _, ~distance:0) -> (* The happy case, [src] matches [current_frame t]. *) ()
+    | #(This src_frame, ~distance) ->
+      (* [src] exists, but is higher up the callstack. *)


When can this happen?

Xyene · 2026-03-03T06:02:24Z

src/trace_segment.ml

+  (* First, reconcile things such that [src] matches [current_frame t] if it doesn't
+     already. *)
+  let () =
+    match Frame.find (current_frame t) src.symbol with


Could we order these in the order of most-to-least likely to occur? They are not all of equal importance.

I think the happy case is happy, and then the beginning-of-trace call is the second most-happy.

The other two cases I don't really understand how we can hit.

Xyene · 2026-03-03T06:05:05Z

src/trace_segment.ml

+
+let handle_return (t : t) (time : Timestamp.t) ~(dst : Location.t) =
+  match (current_frame t).parent with
+  | Null ->


How can this happen? Does this mean we're executing with only the sentinel being around?

Xyene · 2026-03-03T06:07:20Z

src/trace_segment.ml

+       (* 99% of the time [distance] should be 0, indicating we are returning to
+          [parent_frame] as expected. We allow for the possibility of "long" returns to
+          account for [Sysret]/[Iret] events that return to userspace directly from deep
+          within their kernel/interrupt stack. *)


Is this sufficient to handle something like rseq aborting, where the abort IP may not be present in our callstack at all?
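As an illustration of the distinction being asked about, the following sketch models a return as a search through the ancestors of the current frame. It is a simplification under stated assumptions: the callstack is a list of symbol names from leaf to root, and `handle_return` here is a hypothetical stand-in, not the function in the diff.

```ocaml
(* [stack] lists symbols from the current frame (leaf) to the root. A normal
   return lands in the parent, which is distance 0 among the ancestors; a
   "long" return lands further up. The result is the stack after the return
   together with that distance, or [None] when [dst] is not an ancestor at
   all, which is the situation the rseq question describes. *)
let handle_return (stack : string list) ~(dst : string) : (string list * int) option =
  let rec search ancestors distance =
    match ancestors with
    | [] -> None
    | symbol :: _ when String.equal symbol dst -> Some (ancestors, distance)
    | _ :: rest -> search rest (distance + 1)
  in
  match stack with
  | [] -> None
  | _leaf :: ancestors -> search ancestors 0
```

Allowing a nonzero distance covers the long-return case, but a destination that is absent from the stack still needs its own policy.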

Xyene · 2026-03-03T06:17:39Z

src/trace_segment.ml

+      | Trace { src; dst; _ } ->
+        Ocaml_exception_info.iter_pushtraps_and_poptraps_in_range
+          ocaml_exception_info
+          ~from:t.last_known_instruction_pointer


Does this do anything reasonable when last_known_instruction_pointer = max_value as at initialization time?

Xyene · 2026-03-03T06:25:26Z

src/new_trace_writer.ml

+module Nonempty_vec = Nonempty_vec.Value
+
+let debug = ref false
+let is_kernel_address addr = Int64.(addr < 0L)


Rather than keep lots of old trace_writer.ml code in here, can we instead delete all of it, and have something like a Multi_trace_writer module that dispatches to both Trace_writer and New_trace_writer in the cases where you want to debug and compare?
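A dispatching module of the kind suggested here might look roughly like the sketch below. Everything in it is hypothetical: the `Writer` signature is a tiny stand-in for the PR's real interface, and `Recording_writer`, `Old_writer`, `New_writer`, and `Multi_writer` exist only to show the forwarding pattern.

```ocaml
(* Hypothetical minimal writer interface; the PR's real interface is
   considerably larger. *)
module type Writer = sig
  type t

  val create : unit -> t
  val write_event : t -> string -> unit
  val output : t -> string list
end

(* Stand-in implementation that records tagged events; both backends below
   reuse it for brevity. *)
module Recording_writer (Tag : sig
    val tag : string
  end) : Writer = struct
  type t = string list ref

  let create () = ref []
  let write_event t event = t := (Tag.tag ^ ":" ^ event) :: !t
  let output t = List.rev !t
end

module Old_writer = Recording_writer (struct
    let tag = "old"
  end)

module New_writer = Recording_writer (struct
    let tag = "new"
  end)

(* Forwards every event to both backends so their output can be compared. *)
module Multi_writer (A : Writer) (B : Writer) = struct
  type t =
    { a : A.t
    ; b : B.t
    }

  let create () = { a = A.create (); b = B.create () }

  let write_event t event =
    A.write_event t.a event;
    B.write_event t.b event

  let output t = (A.output t.a, B.output t.b)
end

module Both = Multi_writer (Old_writer) (New_writer)
```

With this shape neither backend needs to contain the other's code, and comparing them is a matter of diffing the two halves of `Both.output`.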

Svetlitski added 5 commits February 9, 2026 15:15

Add trace file formats to .gitignore

54d7cff

Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>

Add -filter argument to decode subcommand

97b0089

This is useful for testing.

Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>

Add an environment variable for selecting which trace-writer implementation to use

821bcfa

Signed-off-by: Kevin Svetlitski <ksvetlitski@janestreet.com>

Svetlitski mentioned this pull request Feb 9, 2026

Trace segments #350

Closed

6 tasks

Svetlitski marked this pull request as ready for review February 9, 2026 21:26

Svetlitski requested a review from Xyene February 9, 2026 21:27

Xyene reviewed Mar 3, 2026
