
Add cluster-trace-collector tool #1247

Open · sharnoff wants to merge 2 commits into main from sharnoff/cluster-trace
Conversation

sharnoff (Member) commented Feb 7, 2025

This new tool exists so that we can capture a scheduling/scaling "trace" from production (basically: a record of all the operations, exactly as they happened). This way, we have some real data available to use in simulations of changes to our scheduling policies, AND we can use that data to extrapolate how changes in load will be handled.

The architecture is pretty simple:

  • Reuse pkg/util/watch to get a stream of events on Pod and Node objects
  • Extract the useful bits out of each event
  • If the useful bits changed, send it off to S3/Azure Blob with an event sink from pkg/reporting
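
The three steps above can be sketched roughly like this. This is a minimal, self-contained Go sketch of the dedup-then-sink step only: `TraceEvent`, `Collector`, and the field names are hypothetical stand-ins, not the tool's real types, and the real version consumes `pkg/util/watch` events and writes through a `pkg/reporting` sink rather than a callback.

```go
package main

import "fmt"

// TraceEvent is a hypothetical reduced view of a Pod/Node object, holding
// only the "useful bits" the trace cares about. The fields here are
// illustrative, not the tool's actual schema.
type TraceEvent struct {
	Kind string // "Pod" or "Node"
	Name string
	CPU  int64 // e.g. requested milli-CPU
}

// Collector forwards an event to sink only when the useful bits changed
// since the last event for the same object. sink stands in for the
// pkg/reporting event sink (S3/Azure Blob).
type Collector struct {
	last map[string]TraceEvent
	sink func(TraceEvent)
}

func NewCollector(sink func(TraceEvent)) *Collector {
	return &Collector{last: make(map[string]TraceEvent), sink: sink}
}

func (c *Collector) Handle(ev TraceEvent) {
	key := ev.Kind + "/" + ev.Name
	if prev, ok := c.last[key]; ok && prev == ev {
		return // nothing useful changed; skip the write
	}
	c.last[key] = ev
	c.sink(ev)
}

// run feeds a few synthetic watch events through the collector and returns
// the events that actually reached the sink.
func run() []TraceEvent {
	var sent []TraceEvent
	c := NewCollector(func(ev TraceEvent) { sent = append(sent, ev) })

	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 250})
	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 250}) // unchanged: dropped
	c.Handle(TraceEvent{Kind: "Pod", Name: "vm-1", CPU: 500}) // changed: forwarded
	return sent
}

func main() {
	fmt.Println(len(run())) // prints 2
}
```

The point of the dedup is just to keep the trace (and the object storage bill) proportional to real state changes rather than to raw watch-event volume.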

ref neondatabase/cloud#23343


Notes:

  • This is currently hacked together pretty quickly! Planning to add docs before requesting review.
  • Currently the deployment manifest isn't managed here. I think that maybe makes sense? Not sure -- would be good to discuss!


github-actions bot commented Feb 7, 2025

No changes to the coverage.


@sharnoff sharnoff force-pushed the sharnoff/cluster-trace branch from 6a2eccc to 7b004fe Compare March 11, 2025 09:50
@sharnoff sharnoff changed the base branch from main to sharnoff/build-images-cleanup March 11, 2025 09:50
@sharnoff sharnoff marked this pull request as ready for review March 11, 2025 09:54
@sharnoff sharnoff self-assigned this Mar 11, 2025
Base automatically changed from sharnoff/build-images-cleanup to main March 11, 2025 10:46
sharnoff added a commit that referenced this pull request Mar 11, 2025
Basically, make sure there's the same set of images every time and they're in
the same order. And if there's an image intentionally missing, comment
why.

Also, align spacing in the places where that's feasible, because it
makes it easier to see the set of images.

---

Noticed some painful merge conflicts trying to update #1247, figured
this bit of cleanup would help (it does!).