
Add cluster-trace-collector tool #1247

Open · sharnoff wants to merge 2 commits into main from sharnoff/cluster-trace
Conversation

sharnoff (Member) commented Feb 7, 2025

This new tool exists so that we can capture a scheduling/scaling "trace" from production (basically: a record of all the operations, exactly as they happened). This way, we have some real data available to use in simulations of changes to our scheduling policies, AND we can use that data to extrapolate how changes in load will be handled.

The architecture is pretty simple:

  • Reuse pkg/util/watch to get a stream of events on Pod and Node objects
  • Extract the useful bits out of each event
  • If the useful bits changed, send it off to S3/Azure Blob with an event sink from pkg/reporting
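
The three steps above can be sketched roughly like this. This is a minimal, self-contained Go sketch of the dedup-then-sink step only: `TraceEvent`, `Collector`, and the field names are hypothetical stand-ins, not the tool's real types, and the real version consumes `pkg/util/watch` events and writes through a `pkg/reporting` sink rather than a callback.

```go
package main

import "fmt"

// TraceEvent is a hypothetical reduced view of a Pod/Node object, holding
// only the "useful bits" the trace cares about. The fields here are
// illustrative, not the tool's actual schema.
type TraceEvent struct {
	Kind string // "Pod" or "Node"
	Name string
	CPU  int64 // e.g. requested milli-CPU
}

// Collector forwards an event to sink only when the useful bits changed
// since the last event for the same object. sink stands in for the
// pkg/reporting event sink (S3/Azure Blob).
type Collector struct {
	last map[string]TraceEvent
	sink func(TraceEvent)
}

func NewCollector(sink func(TraceEvent)) *Collector {
	return &Collector{last: make(map[string]TraceEvent), sink: sink}
}

func (c *Collector) Handle(ev TraceEvent) {
	key := ev.Kind + "/" + ev.Name
	if prev, ok := c.last[key]; ok && prev == ev {
		return // nothing useful changed; skip the write
	}
	c.last[key] = ev
	c.sink(ev)
}

// run feeds a few synthetic watch events through the collector and returns
// the events that actually reached the sink.
func run() []TraceEvent {
	var sent []TraceEvent
	c := NewCollector(func(ev TraceEvent) { sent = append(sent, ev) })

	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 250})
	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 250}) // unchanged: dropped
	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 500}) // changed: forwarded
	return sent
}

func main() {
	fmt.Println(len(run())) // prints 2
}
```

The point of the dedup is just to keep the trace (and the object storage bill) proportional to real state changes rather than to raw watch-event volume.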

ref neondatabase/cloud#23343


Notes:

  • This is currently hacked together pretty quickly! Planning to add docs before requesting review.
  • Currently the deployment manifest isn't managed here. I think that maybe makes sense? Not sure -- would be good to discuss!


github-actions bot commented Feb 7, 2025

No changes to the coverage.


@sharnoff sharnoff force-pushed the sharnoff/cluster-trace branch from 6a2eccc to 7b004fe Compare March 11, 2025 09:50
@sharnoff sharnoff changed the base branch from main to sharnoff/build-images-cleanup March 11, 2025 09:50
@sharnoff sharnoff marked this pull request as ready for review March 11, 2025 09:54
@sharnoff sharnoff self-assigned this Mar 11, 2025
Base automatically changed from sharnoff/build-images-cleanup to main March 11, 2025 10:46
sharnoff added a commit that referenced this pull request Mar 11, 2025
Basically, make sure there's the same set of images every time and they're in
the same order. And if there's an image intentionally missing, comment
why.

Also, align spacing in the places where that's feasible, because it
makes it easier to see the set of images.

---

Noticed some painful merge conflicts trying to update #1247, figured
this bit of cleanup would help (it does!).