Emit heartbeat and remoting metrics periodically to the EventStream #7427

Open
1 task
Aaronontheweb opened this issue Dec 20, 2024 · 2 comments

Aaronontheweb commented Dec 20, 2024

Is your feature request related to a problem? Please describe.

Similar to how we implemented some built-in telemetry for actor starts / stops in #6293, we've had a request on our Phobos issue tracker (petabridge/phobos-issues#79) to do the same for some of the moving parts inside Akka.Remote and Akka.Cluster.

I think we can probably achieve this - these metrics are already captured inside the remoting and clustering systems, but they're not exposed in any meaningful way that could easily be consumed for instrumentation purposes.

Describe the solution you'd like

I think we should create topics for each of the major heartbeat systems:

  1. Akka.Remote - Transport
  2. Akka.Remote - DeathWatch
  3. Akka.Cluster - Watch

and one for Akka.Remote - transport metrics.

These should be subscribable via the local EventStream, covering that node's traffic only.
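
A rough sketch of what consuming one of these topics could look like from user code. `HeartbeatMetricsEvent` is a hypothetical message type invented for illustration (nothing with that name exists in Akka.NET today); the `EventStream.Subscribe` / `Unsubscribe` calls are the existing API.

```csharp
using System;
using Akka.Actor;

// Hypothetical event type: a placeholder for whatever the topic would publish.
public sealed class HeartbeatMetricsEvent
{
    public HeartbeatMetricsEvent(string source, int peerCount)
    {
        Source = source;
        PeerCount = peerCount;
    }

    // Which heartbeat system produced the event, e.g. "Akka.Cluster - Watch".
    public string Source { get; }

    // Number of nodes currently being heartbeated to.
    public int PeerCount { get; }
}

// Subscribes itself to the local EventStream and logs each event it receives.
public sealed class HeartbeatMetricsListener : ReceiveActor
{
    public HeartbeatMetricsListener()
    {
        Receive<HeartbeatMetricsEvent>(m =>
            Console.WriteLine($"{m.Source}: heartbeating to {m.PeerCount} peer(s)"));
    }

    // The EventStream is local to the ActorSystem, so only this node's events arrive here.
    protected override void PreStart() =>
        Context.System.EventStream.Subscribe(Self, typeof(HeartbeatMetricsEvent));

    protected override void PostStop() =>
        Context.System.EventStream.Unsubscribe(Self);
}
```

On the publishing side, the remoting or cluster internals would presumably call `Context.System.EventStream.Publish(...)` with one of these events on a timer.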

Describe alternatives you've considered

The alternatives in this case are basically "not doing it" - or trying to do something really janky in Phobos (i.e. custom failure detector registries.) Doing this natively in Akka.NET is the way to do it - and this won't have much of a performance impact since these messages would only be shared once every 5s or so.

Metrics Checklist

  • Akka.Cluster failure detector heartbeat - want to emit a data structure that includes everyone we are heartbeating to and their response times. Emit this value periodically, like once every 10-30s or so

Aaronontheweb commented Jan 6, 2025

Want to know:

  1. Who we are heartbeating to (THIS VALUE WILL CHANGE)
  2. And how long it took them to reply back, on average - the PhiAccrualDetector, which we use by default, already has this information here:
    public ImmutableList<long> Intervals { get; }
    - that measures the durations as a long representing ticks or ms. But we can't rely solely on that, because users can and do change the failure detector to the DeadlineDetector, which is simpler and doesn't maintain a history.
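
A minimal sketch of the kind of data structure this implies, with the average left empty when the detector keeps no history. All type and member names here are hypothetical, and the intervals are assumed to be in milliseconds (the actual unit would need to be confirmed against the detector).

```csharp
#nullable enable
using System;
using System.Collections.Immutable;
using System.Linq;

// Hypothetical per-peer entry: who we heartbeat to and, if known, the mean interval.
public sealed record PeerHeartbeatStats(string Address, TimeSpan? MeanInterval);

// Hypothetical snapshot that would be emitted periodically.
public sealed record HeartbeatSnapshot(ImmutableList<PeerHeartbeatStats> Peers);

public static class HeartbeatStats
{
    // intervalsMs is null (or empty) when the failure detector maintains no history,
    // in which case no average can be reported for that peer.
    public static PeerHeartbeatStats ForPeer(string address, ImmutableList<long>? intervalsMs)
    {
        if (intervalsMs is null || intervalsMs.IsEmpty)
            return new PeerHeartbeatStats(address, null);

        // Assumption: values are milliseconds; if they are ticks, use TimeSpan.FromTicks instead.
        return new PeerHeartbeatStats(address, TimeSpan.FromMilliseconds(intervalsMs.Average()));
    }
}
```

Making `MeanInterval` nullable keeps the "who we are heartbeating to" part of the snapshot usable even when a history-free detector is configured.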

@Aaronontheweb

I would also like an event that measures the rough throughput of Akka.Remote, but that will take a bit more work to implement.
