Conversation

Contributor

@kdelemme kdelemme commented Nov 26, 2025

Fix #210049

Summary

This PR introduces a new route for purging stale summary documents.
The API is flexible: it can be used as a bulk action when a list of SLO ids is specified, or as a global action when no list is provided. When no stale threshold is specified, we default to the one configured in the SLO settings; when one is specified, we enforce a greater value, unless force is used.
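
To make the rules above concrete, here is a minimal sketch of how the scope and threshold could be resolved. All names (`PurgeParams`, `resolvePurge`, `staleDurationHours`) and the choice of hours as the unit are assumptions for illustration, not the code in this PR. It also assumes "a greater value" means "at least the configured threshold".

```typescript
// Hypothetical request shape; field names are illustrative only.
interface PurgeParams {
  sloIds?: string[]; // omitted or empty => global purge
  staleDurationHours?: number; // omitted => fall back to the configured threshold
  force?: boolean; // true => skip the lower-bound check
}

interface ResolvedPurge {
  scope: 'bulk' | 'global';
  sloIds: string[];
  staleDurationHours: number;
}

function resolvePurge(params: PurgeParams, configuredThresholdHours: number): ResolvedPurge {
  const sloIds = params.sloIds ?? [];
  const requested = params.staleDurationHours ?? configuredThresholdHours;

  // Assumption: a specified value must not be lower than the configured
  // threshold unless the caller explicitly forces the override.
  if (!params.force && requested < configuredThresholdHours) {
    throw new Error(
      `Stale duration ${requested}h is below the configured threshold of ${configuredThresholdHours}h; use force to override`
    );
  }

  return {
    scope: sloIds.length > 0 ? 'bulk' : 'global',
    sloIds,
    staleDurationHours: requested,
  };
}
```

Under these assumptions, `resolvePurge({}, 48)` resolves to a global purge with the configured threshold, while a lower value is only accepted together with `force: true`.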

The modal loads with the default stale threshold settings applied, and the user must confirm the override when entering a lower value.

The status API returns some information about the task, such as the number of deleted documents, its completion state, and timing information like when it started and how long it took. It is currently not in use...
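
As a rough illustration of the status payload described above, the sketch below models it as a TypeScript type. The field names (`completed`, `startedAt`, `deleted`, `tookMs`) are assumptions made for this example; the actual response shape is defined by the route in this PR.

```typescript
// Hypothetical status shape: counts and duration are only known once the task is done.
type PurgeTaskStatus =
  | { completed: false; startedAt: string }
  | { completed: true; startedAt: string; deleted: number; tookMs: number };

// Turns a status object into a one-line human-readable summary.
function describePurgeStatus(status: PurgeTaskStatus): string {
  if (status.completed) {
    return `Purge started at ${status.startedAt} deleted ${status.deleted} documents in ${status.tookMs} ms`;
  }
  return `Purge started at ${status.startedAt} is still running`;
}
```

A caller, for instance a future UI that polls the status route, could use such a helper to render progress.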

I've also cleaned up the purge rollup data actions to remove some duplication, and added some unit and integration tests for the purge instances flow.

@github-actions github-actions bot added the author:actionable-obs PRs authored by the actionable obs team label Nov 26, 2025
@kdelemme kdelemme force-pushed the feat/delete-stale-instances branch from 295175b to e98e7ac Compare December 1, 2025 15:28
@kdelemme kdelemme added release_note:skip Skip the PR/issue when compiling release notes Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. v9.3.0 backport:skip This PR does not require backporting labels Dec 1, 2025
@kdelemme kdelemme self-assigned this Dec 1, 2025
@kdelemme kdelemme added the ci:beta-faster-pr-build Uses an alternative PR build pipeline with speed optimizations label Dec 1, 2025
@kdelemme kdelemme marked this pull request as ready for review December 1, 2025 18:31
@kdelemme kdelemme requested review from a team as code owners December 1, 2025 18:31
@elasticmachine
Contributor

Pinging @elastic/actionable-obs-team (Team:actionable-obs)

@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

@kdelemme kdelemme changed the title from "feat(slo): Bulk delete stale summary data" to "feat(slo): Delete stale summary data" Dec 1, 2025
@mgiota mgiota self-requested a review December 2, 2025 09:14
refresh: false,
wait_for_completion: false,
conflicts: 'proceed',
slices: 'auto',
Contributor

If there are millions of documents to delete, slices: auto could overwhelm ES. I was looking into throttling and slicing in the official documentation. Have you considered using these optimization techniques?

Contributor Author

If there are millions of documents to delete, slices: auto could overwhelm ES

I don't think so. Where do you get this from?
slices: auto will pick the best settings based on the number of shards available. We should use it, since setting the value ourselves without knowing the customer's shard settings would be more complex.

Contributor

slices: auto caught my attention. I searched the official documentation to check what it means and what it does, and here is what I found: "Delete by query supports sliced scroll to parallelize the delete process. This can improve efficiency and provide a convenient way to break the request down into smaller parts." I also searched for scroll_size in the codebase and found a few references, for example here, used together with max_docs.

I am fine using it as is until we notice any performance issues. Just wanted to investigate if there are more performant ways for the deleteByQuery.
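
For reference, the two options discussed in this thread can be sketched as a small options builder. This is an illustration only: `buildDeleteByQueryOptions` and `DeleteByQueryOptions` are hypothetical names, and the PR itself uses the four options shown in the diff without throttling. `requests_per_second` is the delete by query throttling parameter from the Elasticsearch documentation; the value to pass is left to the caller.

```typescript
// Subset of delete by query options relevant to this discussion.
interface DeleteByQueryOptions {
  refresh: boolean;
  wait_for_completion: boolean;
  conflicts: 'proceed' | 'abort';
  slices: 'auto' | number;
  requests_per_second?: number; // optional throttle
}

// Hypothetical helper: starts from the options used in this PR and
// optionally adds a throttle if performance issues are ever observed.
function buildDeleteByQueryOptions(requestsPerSecond?: number): DeleteByQueryOptions {
  const options: DeleteByQueryOptions = {
    refresh: false,
    wait_for_completion: false, // run as a background task
    conflicts: 'proceed', // keep going on version conflicts
    slices: 'auto', // let Elasticsearch choose the number of slices
  };

  if (requestsPerSecond !== undefined) {
    if (!Number.isFinite(requestsPerSecond) || requestsPerSecond <= 0) {
      throw new Error('requestsPerSecond must be a positive number');
    }
    options.requests_per_second = requestsPerSecond;
  }

  return options;
}
```

Keeping the throttle optional matches the outcome of the thread: use slices: auto as is, and only add a limit if a performance problem shows up.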

Member

@dmlemeshko dmlemeshko left a comment

x-pack/solutions/observability/test/api_integration_deployment_agnostic/services/slo_api.ts changes LGTM

Contributor

mgiota commented Dec 2, 2025

I tried to delete one SLO and here's what I got. When I checked the Ignore purge policy restrictions checkbox or I selected 30 days, I didn't get any error.

Screenshot 2025-12-02 at 19 11 41

After I successfully purged the data (I selected 30 days), I expected the SLO to not appear as stale anymore, but it appears as stale. Is it expected? Can we have a tooltip that explains what purge does behind the scenes?

Screenshot 2025-12-02 at 19 15 46

Contributor Author

kdelemme commented Dec 2, 2025

@mgiota You've tested the "purge rollup data" action, not the "delete stale instance" one. That one is in the "Actions" menu (big blue button on the top right) in the header of the SLO Management page. I can't take a screenshot right now since I'm in the middle of something in another branch.

Contributor

mgiota commented Dec 3, 2025

@kdelemme Yep, thanks! I figured it out: I clicked the Purge stale instances action and it worked as expected. The stale SLO didn't appear as stale anymore in the SLO overview page, but it is still present in the SLO Management page. Is it expected?

Contributor Author

kdelemme commented Dec 3, 2025

@mgiota

The stale SLO didn't appear as stale anymore in the SLO overview page, but it is still present in the SLO Management page. Is it expected?

Yes, since the SLO Management page shows only the SLO definitions, regardless of the number of instances and their state (stale or not). When we purge stale instances, we delete all SLO instances that have not been updated for X time, but their SLO definition stays.
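
The distinction can be illustrated with a small in-memory sketch. The types and the `purgeStaleInstances` function below are hypothetical and only model the behaviour described above; the real purge runs as a delete by query against the summary documents.

```typescript
// Hypothetical, simplified models of the two kinds of data.
interface SloDefinition {
  id: string;
  name: string;
}

interface SloInstanceSummary {
  sloId: string;
  instanceId: string;
  summaryUpdatedAt: number; // epoch milliseconds of the last summary update
}

// Keeps only the instances updated within the stale window.
// Definitions are not an input here: purging never touches them.
function purgeStaleInstances(
  instances: SloInstanceSummary[],
  now: number,
  staleAfterMs: number
): SloInstanceSummary[] {
  return instances.filter((instance) => now - instance.summaryUpdatedAt <= staleAfterMs);
}
```

After such a purge, a definition may have no remaining instances, which is why it can disappear from the overview page while still being listed on the SLO Management page.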

@kdelemme kdelemme merged commit 26e14c0 into elastic:main Dec 3, 2025
14 checks passed
Labels

author:actionable-obs PRs authored by the actionable obs team backport:skip This PR does not require backporting ci:beta-faster-pr-build Uses an alternative PR build pipeline with speed optimizations release_note:skip Skip the PR/issue when compiling release notes Team:actionable-obs Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics. Team:obs-ux-management v9.3.0

Development

Successfully merging this pull request may close these issues.

[SLO] [Management] cleanup stale instances

4 participants