Integrate high-level observability using Tracer #392

aryanjassal · 2025-04-21T02:42:02Z

Specification

Some work has been done in js-logger for rolling out a custom implementation of tracing which supports streaming and potentially unending spans. (js-logger#47). We now need to use the tracing system in Polykey to gain observability during Polykey's runtime.

Obtaining observability helps debug resource leaks more easily, and given how frequently the seednodes shut down and polykey remains stuck in a stopping state, there are a lot of them. This is meant to simplify finding and fixing them.

Additional context

Integrate Tracing (derived from OpenTelemetry) js-logger#15

Tasks

Start by monkey-patching tracer to plan out integration (async-init for lifecycles)
Extract data and pipe it to a visualiser to see the spans and events
Fully integrate tracer if the small pilot experiment was successful

linear · 2025-04-21T02:42:05Z

ENG-585

aryanjassal · 2025-04-21T02:50:54Z

This will be starting off as more of a research effort, then later it might become more of an implementation issue.

aryanjassal · 2025-04-21T05:09:00Z

Logger is used in a lot of subpackages and the version needs to be updated everywhere. Instead of updating the version manually, the following needs to be done:

  "overrides": {
    "@matrixai/logger": "^4.0.4-alpha.0"
  }

This should match the version of the dependency in the current package.json, then this version will cascade down and override other dependencies. A clean-install might be required to properly update the dependency graph.

Without updating this in all the dependencies, Polykey will fail to compile.

CMCDragonkai · 2025-04-21T05:47:26Z

Well due to semantic versioning it should have been a minor update that easily flowed to all dependencies. No need to update everything.

aryanjassal · 2025-04-21T05:53:36Z

I think the issue was when we were trying to give the RPC a logger, when the RPC was using the older version and polykey used the newer version. That made the RPC logger instance different than the polykey one. The difference could have been also caused due to different symbols, as even with the same name, two symbols can be technically different in what they represent, which was why it was complaining. After overriding the version, all references to @matrixai/logger were the same deduplicated and latest one, which is probably why it is working now.

CMCDragonkai · 2025-04-21T07:59:02Z

Yes you have to beware of this. Keep track of it using npm command that gives you a tree of the dependencies.

CMCDragonkai · 2025-04-28T04:59:05Z

Is this dependent on #391?

aryanjassal · 2025-04-28T05:01:51Z

Not really, I can run the program just fine locally, and monkey-patching is local-only as well. I'm working on this right now, actually.

CMCDragonkai · 2025-05-04T23:58:54Z

Refer to MatrixAI/js-logger#52 (comment) - and define a few phases: after you have experimental visualisation, the next step is to apply it to areas where we know there are resource leakage.

Agent Start/Stop Lifecycle
Node Connection Lifecycle
RPC Lifecycle

aryanjassal · 2025-05-06T07:31:10Z

Tracer should be transparent. Must not affect operational behaviour, not even performance.

It turns out that the issue was in my implementation. Basically, I used an IIFE to wait for polykey to finish execution and then signal to end the tracing, but that didn't capture the entirety of polykey logs. After digging into it, the reason ended up being that the exit handlers did some additional processing, ending the tracer before polykey could end. To avoid this, previously I avoided ending tracer and let the process close naturally, which worked fine for running it interactively, but it ended up failing all the tests as the tests were finished but the generator remained active.

Also, the tracer is run in the background as a promise and awaited just before exiting to ensure the logs are written properly. This unfortunately results in tracer not being completely transparent as time is added before the application can end. This has a major effect on the tests where multiple instances are started and stopped.

Only running spans in specific modes like verbose or a specific mode to enable span tracing might be better here, but it still wont make tracing completely transparent. For full transparency, I'll have to fork the main process to run the tracer which will receive messages from the main process via IPC, ensuring the main process can end by itself and the forked process will continue until it has written all the logs, then exit. But this might have issues in runtimes like the polykey docker image.

CMCDragonkai · 2025-05-07T11:54:00Z

I think you should review the observability push-flow. The main thread running the JS code will always be pushing data, this will have a performance impact no matter what however there should be no IO delays. A secondary thread from node's IO thread pool can be used by the relevant sink to then push the tracing information into a designated location like append-only file descriptor - potentially with log rotation.

aryanjassal added the development Standard development label Apr 21, 2025

aryanjassal self-assigned this Apr 21, 2025

aryanjassal mentioned this issue May 5, 2025

Integrating tracer into object lifecycles #403

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Integrate high-level observability using Tracer #392

Integrate high-level observability using Tracer #392

aryanjassal commented Apr 21, 2025

linear bot commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 28, 2025

Uh oh!

aryanjassal commented Apr 28, 2025

Uh oh!

CMCDragonkai commented May 4, 2025 •

edited

Loading

Uh oh!

aryanjassal commented May 6, 2025

Uh oh!

CMCDragonkai commented May 7, 2025 via email

Uh oh!

Integrate high-level observability using Tracer #392

Integrate high-level observability using Tracer #392

Comments

aryanjassal commented Apr 21, 2025

Specification

Additional context

Tasks

linear bot commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 21, 2025

Uh oh!

aryanjassal commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 21, 2025

Uh oh!

CMCDragonkai commented Apr 28, 2025

Uh oh!

aryanjassal commented Apr 28, 2025

Uh oh!

CMCDragonkai commented May 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aryanjassal commented May 6, 2025

Uh oh!

CMCDragonkai commented May 7, 2025 via email

Uh oh!

CMCDragonkai commented May 4, 2025 •

edited

Loading