Integrate high-level observability using Tracer #392

Open
aryanjassal opened this issue Apr 21, 2025 · 11 comments
Labels: development (Standard development)

Comments

@aryanjassal
Member

Specification

Some work has been done in js-logger to roll out a custom tracing implementation that supports streaming and potentially unending spans (js-logger#47). We now need to use this tracing system in Polykey to gain observability during Polykey's runtime.

Observability makes resource leaks easier to debug. Given how frequently the seednodes shut down and Polykey remains stuck in a stopping state, there are a lot of such leaks. This work is meant to simplify finding and fixing them.

Additional context

Tasks

  1. Start by monkey-patching tracer to plan out integration (async-init for lifecycles)
  2. Extract data and pipe it to a visualiser to see the spans and events
  3. Fully integrate tracer if the small pilot experiment is successful
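
The first task can be sketched roughly as follows. This is only an illustration of the monkey-patching idea, not the js-logger tracer API: `SketchTracer`, `SpanEvent`, `patchLifecycle` and `DemoAgent` are hypothetical names invented for this example, and the real tracer may expose a different interface.

```typescript
// Hypothetical span record; the real js-logger tracer may use a different shape.
type SpanEvent = { phase: 'start' | 'end'; name: string };

// Hypothetical stand-in tracer that records span boundaries in memory.
class SketchTracer {
  public readonly events: SpanEvent[] = [];

  public startSpan(name: string): () => void {
    this.events.push({ phase: 'start', name });
    // The returned function closes the span.
    return () => {
      this.events.push({ phase: 'end', name });
    };
  }
}

// Replace the named async methods on `proto` with wrappers that open a span
// before the original runs and close it afterwards, even if it throws.
function patchLifecycle(proto: any, methods: string[], tracer: SketchTracer): void {
  for (const method of methods) {
    const original = proto[method];
    if (typeof original !== 'function') continue;
    proto[method] = async function (this: unknown, ...args: unknown[]) {
      const endSpan = tracer.startSpan(`${proto.constructor.name}.${method}`);
      try {
        return await original.apply(this, args);
      } finally {
        endSpan();
      }
    };
  }
}

// Hypothetical class with async start/stop lifecycle methods.
class DemoAgent {
  public running = false;
  public async start(): Promise<void> {
    this.running = true;
  }
  public async stop(): Promise<void> {
    this.running = false;
  }
}

async function demo(): Promise<SpanEvent[]> {
  const tracer = new SketchTracer();
  patchLifecycle(DemoAgent.prototype, ['start', 'stop'], tracer);
  const agent = new DemoAgent();
  await agent.start();
  await agent.stop();
  return tracer.events;
}
```

Patching the prototype keeps the experiment local and easy to remove. Note that the wrapper always returns a promise, so this sketch is only suitable for methods that are already async.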
@aryanjassal aryanjassal added the development Standard development label Apr 21, 2025

linear bot commented Apr 21, 2025

ENG-585

@aryanjassal aryanjassal self-assigned this Apr 21, 2025
Member Author

This will start off as a research effort and may later become an implementation issue.

Member Author

Logger is used in a lot of subpackages, and its version needs to be updated everywhere. Instead of updating the version in each one manually, add the following to package.json:

  "overrides": {
    "@matrixai/logger": "^4.0.4-alpha.0"
  }

The override should match the version of the dependency in the current package.json. That version then cascades down and overrides the one requested by other dependencies. A clean install might be required to properly update the dependency graph.

Without this update across all the dependencies, Polykey will fail to compile.

@CMCDragonkai
Member

Well, due to semantic versioning, it should have been a minor update that easily flowed to all dependencies. No need to update everything.

Member Author

I think the issue arose when we tried to give the RPC a logger while the RPC was using the older version and Polykey the newer one. That made the RPC's logger instance different from Polykey's. The mismatch could also have been caused by different symbols: even with the same name, two symbols are technically distinct values, which would explain why it was complaining. After overriding the version, all references to @matrixai/logger resolve to the same deduplicated, latest copy, which is probably why it is working now.
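
To illustrate the point about duplicated copies, here is a minimal sketch. It assumes nothing about the internals of @matrixai/logger: `loadLoggerCopy` and `brand` are hypothetical stand-ins that simulate the same module being evaluated twice, as happens when two versions sit at different paths in node_modules.

```typescript
// Simulates evaluating the same library source twice. Each call creates a
// fresh class and a fresh symbol, just as two installed copies would.
function loadLoggerCopy() {
  class Logger {
    public readonly key: string;
    constructor(key: string) {
      this.key = key;
    }
  }
  // Hypothetical brand symbol; same description in every copy.
  const brand = Symbol('Logger');
  return { Logger, brand };
}

const copyA = loadLoggerCopy(); // e.g. the copy Polykey resolves
const copyB = loadLoggerCopy(); // e.g. the copy a subpackage resolves

const logger = new copyA.Logger('polykey');

// An instance created from one copy is not an instance of the other copy.
const sameClass = logger instanceof copyB.Logger;

// Symbols with the same description are still distinct values.
const sameSymbol = copyA.brand === copyB.brand;
const sameLabel = copyA.brand.toString() === copyB.brand.toString();

console.log({ sameClass, sameSymbol, sameLabel });
```

Here `sameClass` and `sameSymbol` are both `false` while `sameLabel` is `true`, which is why deduplicating to a single installed copy removes the mismatch.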

Member

Yes, you have to beware of this. Keep track of it using the npm command that gives you a tree of the dependencies.

@CMCDragonkai
Member

Is this dependent on #391?

Member Author

Not really, I can run the program just fine locally, and monkey-patching is local-only as well. I'm working on this right now, actually.

Member

CMCDragonkai commented May 4, 2025

Refer to MatrixAI/js-logger#52 (comment) and define a few phases: once you have experimental visualisation, the next step is to apply it to areas where we know there is resource leakage.

  1. Agent Start/Stop Lifecycle
  2. Node Connection Lifecycle
  3. RPC Lifecycle

Member Author

The tracer should be transparent. It must not affect operational behaviour, not even performance.

It turns out that the issue was in my implementation. I used an IIFE to wait for Polykey to finish execution and then signal the tracer to end, but that didn't capture all of Polykey's logs. After digging into it, the cause turned out to be that the exit handlers did some additional processing, so the tracer was ended before Polykey itself had finished. Previously, I worked around this by never ending the tracer and letting the process close naturally. That worked fine when running interactively, but it failed all the tests, because the tests finished while the generator remained active.

Also, the tracer runs in the background as a promise, which is awaited just before exiting to ensure the logs are written properly. Unfortunately, this means the tracer is not completely transparent, as it adds time before the application can end. The effect is significant in the tests, where multiple instances are started and stopped.
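
The pattern described here can be sketched roughly as below. This is not the actual tracer code: `SpanStream`, `drain` and `main` are hypothetical names, and the sketch assumes the tracer exposes its output as an async generator that only finishes once it is explicitly ended.

```typescript
// Hypothetical unbounded stream of trace lines, consumed as an async generator.
class SpanStream {
  private queue: string[] = [];
  private wake: (() => void) | null = null;
  private ended = false;

  // Record a trace line; ignored once the stream has been ended.
  public push(line: string): void {
    if (this.ended) return;
    this.queue.push(line);
    this.wake?.();
  }

  // Signal that no more lines will arrive so the consumer can finish.
  public end(): void {
    this.ended = true;
    this.wake?.();
  }

  public async *[Symbol.asyncIterator](): AsyncGenerator<string> {
    while (true) {
      // Flush everything queued so far before checking for the end signal.
      while (this.queue.length > 0) yield this.queue.shift()!;
      if (this.ended) return;
      // Sleep until push() or end() wakes the generator.
      await new Promise<void>((resolve) => {
        this.wake = resolve;
      });
      this.wake = null;
    }
  }
}

// Background writer: resolves only after the stream has ended and drained.
async function drain(stream: SpanStream, sink: string[]): Promise<void> {
  for await (const line of stream) sink.push(line);
}

async function main(): Promise<string[]> {
  const stream = new SpanStream();
  const written: string[] = [];
  const writer = drain(stream, written); // started, not awaited yet
  stream.push('agent.start');
  // Work done by exit handlers still emits lines, so end() must come last.
  stream.push('agent.stop');
  stream.end();
  await writer; // the extra wait that makes the tracer non-transparent
  return written;
}
```

If `end()` is never called, `writer` never resolves and the generator stays active, which matches the hanging tests described above. If it is called before the exit handlers run, later lines are dropped.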

Enabling spans only in specific modes, such as verbose or a dedicated span-tracing mode, might be better here, but it still won't make tracing completely transparent. For full transparency, I'll have to fork the main process to run the tracer, which will receive messages from the main process via IPC. The main process can then end by itself, while the forked process continues until it has written all the logs and then exits. However, this might have issues in runtimes like the Polykey Docker image.

@CMCDragonkai
Member

CMCDragonkai commented May 7, 2025 via email
