[ILM]: Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep #128361

happysubin · 2025-05-23T10:12:06Z

I added the WaitUntilTimeSeriesEndTimePassesStep between the WaitForFollowShardTasksStep and the PauseFollowerIndexStep in the step list of UnFollowAction, and updated the tests accordingly.

Please review my code!

cla-checker-service · 2025-05-23T10:12:11Z

💚 CLA has been signed

happysubin · 2025-05-25T08:40:43Z

Hi, @gmarouli
Could you take a look at this PR when you get a moment? I’d really appreciate your review!

gmarouli · 2025-05-27T06:28:57Z

Hi @happysubin , thank you for your contribution, @samxbr and I will review it as soon as possible.

elasticsearchmachine · 2025-05-27T06:30:02Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elasticsearchmachine · 2025-05-27T06:30:02Z

Pinging @elastic/es-data-management (Team:Data Management)

samxbr

@happysubin Thank you for contributing to Elasticsearch! We value external contributions and would love to work with you to get this PR merged. I have left some comments on the PR, please feel free to take your time to address them.

samxbr · 2025-05-28T18:10:13Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java

+        WaitForFollowShardTasksStep step2 = new WaitForFollowShardTasksStep(
+            waitForFollowShardTasks,
+            waitUntilTimeSeriesEndTimePassesStep,
+            client
+        );
+        WaitUntilTimeSeriesEndTimePassesStep step3 = new WaitUntilTimeSeriesEndTimePassesStep(
+            waitUntilTimeSeriesEndTimePassesStep,
+            pauseFollowerIndex,
+            Instant::now
+        );


The WaitUntilTimeSeriesEndTimePassesStep should come before the WaitForFollowShardTasksStep. That way the follower index can sync with the leader index one last time after the end_time has passed, which makes sure there are no new docs still coming in to the leader index.
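The ordering requested here can be illustrated with a small stand-alone model of a step chain, where each step names the key of the step that follows it. This is only a sketch: the `Step` class, the string keys, and the `next-step-in-action` placeholder are hypothetical and are not the ILM classes from the diff above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a step chain; not the real ILM Step API.
final class UnfollowStepOrder {

    /** A step is identified by its own key and names the key of the step that runs after it. */
    static final class Step {
        final String key;
        final String nextKey;

        Step(String key, String nextKey) {
            this.key = key;
            this.nextKey = nextKey;
        }
    }

    /** Follows nextKey links from firstKey until no matching step is found (or a cycle is hit). */
    public static List<String> order(List<Step> steps, String firstKey) {
        Map<String, Step> byKey = new HashMap<>();
        for (Step step : steps) {
            byKey.put(step.key, step);
        }
        List<String> visited = new ArrayList<>();
        Step current = byKey.get(firstKey);
        while (current != null && !visited.contains(current.key)) {
            visited.add(current.key);
            current = byKey.get(current.nextKey);
        }
        return visited;
    }

    /** The order asked for in review: end_time wait, then the final shard-task sync, then pause. */
    public static List<Step> reviewedChain() {
        return List.of(
            new Step("wait-until-time-series-end-time-passes", "wait-for-follow-shard-tasks"),
            new Step("wait-for-follow-shard-tasks", "pause-follower-index"),
            // Placeholder: the real action continues with further steps not modelled here.
            new Step("pause-follower-index", "next-step-in-action")
        );
    }

    public static void main(String[] args) {
        System.out.println(order(reviewedChain(), "wait-until-time-series-end-time-passes"));
    }
}
```

Because each step only knows the key of its successor, swapping two steps means changing which key each one receives as its next key, which is what the constructor arguments in the diff control.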

samxbr · 2025-05-28T18:18:57Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java

Please add a new integration test in the CCRIndexLifecycleIT test suite to cover this scenario. You can refer to this test as an example for verifying the WaitUntilTimeSeriesEndTimePassesStep. Essentially, we want to verify that after the leader index rolls over, the follower index goes into the WaitUntilTimeSeriesEndTimePassesStep, and that new documents written to the leader index are synced to the follower index until end_time has passed.

Thank you for the code review.
I'll review the existing test cases and add new test code.

samxbr · 2025-05-30T14:56:45Z

buildkite test this

happysubin · 2025-06-07T05:47:59Z

@samxbr
Sorry for the delay.
I’ve added the integration test code.
Please take a look when you have time !

samxbr

I posted some minor comments, but the change generally looks good to me. Nice!

samxbr · 2025-06-09T06:36:22Z

...ugin/ilm/qa/multi-cluster/src/test/java/org/elasticsearch/xpack/ilm/CCRIndexLifecycleIT.java

@@ -533,6 +571,94 @@ public void testILMUnfollowFailsToRemoveRetentionLeases() throws Exception {
        }
    }

+    @SuppressWarnings({ "checkstyle:LineLength", "unchecked" })


Please remove the line-length warning suppression; you can run ./gradlew spotlessApply to auto-format.

Also please take a look at the contribution guide for more details and other tips: https://github.com/elastic/elasticsearch/blob/main/CONTRIBUTING.md#java-language-formatting-guidelines

Thank you for your code review. I've reflected your suggestions!

samxbr · 2025-06-09T07:24:37Z

...ugin/ilm/qa/multi-cluster/src/test/java/org/elasticsearch/xpack/ilm/CCRIndexLifecycleIT.java

+        Request countRequest = new Request("GET", "/" + indexName + "/_count");
+        Response response = client.performRequest(countRequest);
+        Map<String, Object> result = entityAsMap(response);
+        System.out.println("result = " + result);


Please remove the println

samxbr · 2025-06-09T07:38:19Z

buildkite test this

gmarouli · 2025-06-12T06:22:17Z

@happysubin that is a very good finding, good job! I did not know we have moved such configuration there.

But this means that this is not a sufficient fix; we need to work out why the timing is off on CI.

happysubin · 2025-06-12T09:46:34Z

@gmarouli It might be similar to just increasing the timeout, but how about adding the following code instead?

Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
leaderClient.performRequest(rolloverRequest);
// Consider adding this code
assertBusy(
    () -> assertThat(getIndexSetting(client(), backingIndexName, "index.lifecycle.indexing_complete"), is("true")),
    60,
    TimeUnit.SECONDS
);

Manually setting the index like below doesn’t seem to align with the purpose of this integration test.

updateIndexSettings(leaderClient, backingIndexName, Settings.builder().put("index.lifecycle.indexing_complete", true).build());

I sincerely appreciate your kind and detailed feedback on this issue ! 👍

gmarouli · 2025-06-12T11:10:57Z

Hey @happysubin , you are right we could set it ourselves but this is taking it a step too far when it comes to manipulating state for a test.

But good news, I think I found the culprit. There are a few issues in this test and they create some flakiness:

1. It is not recommended for the follower cluster to have its own rollover action in the policy; it is meant to follow the rollover of the leader index. This means we need to replace putILMPolicy with putUnfollowOnlyPolicy, so that the follower will not try to roll over the follower index.
2. The second problem is the manual rollover we perform. We actually need to wait for the ILM-executed rollover, because it is the only one that sets index.lifecycle.indexing_complete to true (see docs). So we need to remove:

Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
leaderClient.performRequest(rolloverRequest);

My theory is that the rollover on the follower cluster was catching the problem locally.

Shall we try fixing these 2 issues and putting the timeout back to what it was?

Again great work being thorough!
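The idea of waiting for ILM to flip index.lifecycle.indexing_complete, rather than writing the setting by hand, comes down to polling a condition until a deadline. A minimal stand-alone sketch of that pattern follows; `SettingPoller` and `waitUntil` are hypothetical names for illustration and are not the `assertBusy` helper from the Elasticsearch test framework.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; a stand-in for the idea behind assertBusy, not its implementation.
final class SettingPoller {

    /**
     * Re-evaluates the condition until it holds or the timeout elapses.
     * Returns true if the condition was observed to hold, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                // Sleep at least 1 ms so a zero interval does not turn into a busy spin.
                Thread.sleep(Math.max(1L, interval.toMillis()));
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test, the condition would read the setting from the cluster and compare it to "true". The test then only observes state that ILM produced itself, which is the property being argued for in this comment.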

happysubin · 2025-06-12T11:42:03Z

@gmarouli
I've quickly incorporated your review, and thank you very much for your help !!

gmarouli · 2025-06-12T12:54:52Z

buildkite test this

gmarouli · 2025-06-12T12:55:16Z

@gmarouli I've quickly incorporated your review, and thank you very much for your help !!

It's been my pleasure, let's see if we got it!

gmarouli

LGTM! @happysubin great work, thank you! @samxbr any other comments before we merge?

samxbr · 2025-06-13T09:08:29Z

This is awesome, thanks @gmarouli for the thorough review!
These are subtle issues. @happysubin, I would recommend putting these findings as comments in the test so people in the future know why it's done this way.

samxbr · 2025-06-13T09:10:03Z

...ugin/ilm/qa/multi-cluster/src/test/java/org/elasticsearch/xpack/ilm/CCRIndexLifecycleIT.java

-                // rollover
-                Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
-                rolloverRequest.setJsonEntity("""
-                    {
-                        "conditions": {
-                        "max_docs": "1"
-                        }
-                    }""");
-                leaderClient.performRequest(rolloverRequest);


Could you add a comment to explain why we need to wait for ILM to rollover instead of manual rollover?

Of course! I’ll add comments explaining why we did it this way!

samxbr · 2025-06-13T09:10:57Z

...ugin/ilm/qa/multi-cluster/src/test/java/org/elasticsearch/xpack/ilm/CCRIndexLifecycleIT.java

-            putILMPolicy(policyName, null, 1, null);
+            putUnfollowOnlyPolicy(client(), policyName);


Similar here, could you add a comment to explain why we only want Unfollow action?

happysubin · 2025-06-13T10:46:14Z

@samxbr
Added some comments to explain the reasoning behind the implementation — thanks for pointing it out!

...ugin/ilm/qa/multi-cluster/src/test/java/org/elasticsearch/xpack/ilm/CCRIndexLifecycleIT.java

samxbr

LGTM, I think we are good to merge after Mary's comment change has been incorporated and CI has passed. This is a great change, thank you very much for your contribution and for addressing the comments so promptly!

Co-authored-by: Mary Gouseti <[email protected]>

happysubin · 2025-06-16T15:22:35Z

@gmarouli comments have been incorporated, and I'm really grateful for the help from both @samxbr and @gmarouli
It's been a pleasure!!

samxbr · 2025-06-16T16:59:47Z

buildkite test this

elasticsearchmachine · 2025-06-17T06:42:52Z

💔 Backport failed

Status	Branch	Result
❌	8.19	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 128361

samxbr · 2025-06-17T07:32:13Z

💚 All backports created successfully

Status	Branch	Result
✅	8.19

Questions ?

Please refer to the Backport tool documentation

…Step (elastic#128361)

The backing indices of a time series data stream (TSDS) have time ranges (start_time & end_time) and they include documents that belong to these time ranges. To ensure that we will not unfollow a leader TSDS index before the indexing is complete, we need to add a WaitUntilTimeSeriesEndTimePassesStep to the unfollow action. This will ensure that we will only unfollow after the end_time has passed.

This creates some weird semantics with the combination of the rollover and the unfollow, because we need the rollover of the leader index to finalise the end_time but the unfollow action is injected before the rollover. However, this should be fine, because the leader index will skip the unfollow action, so it will roll over and finalise the end_time, and the follower index will wait for the end_time to pass before it unfollows. Rolling over the follower index will have no effect since it’s already rolled over.

(cherry picked from commit ed7f2ca)

# Conflicts:
# x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java
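The "only unfollow after the end_time has passed" condition in this commit message can be sketched as a check against an injectable clock, similar in spirit to the `Instant::now` argument passed to the step in the reviewed diff. The `EndTimeGate` class below is a hypothetical illustration, and the strict `isAfter` comparison is an assumption rather than the confirmed behaviour of WaitUntilTimeSeriesEndTimePassesStep.

```java
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical illustration; not the real WaitUntilTimeSeriesEndTimePassesStep.
final class EndTimeGate {

    private final Supplier<Instant> nowSupplier;

    /** The clock is injected so tests can supply a fixed time instead of Instant::now. */
    public EndTimeGate(Supplier<Instant> nowSupplier) {
        this.nowSupplier = nowSupplier;
    }

    /** Assumption: the gate opens only once the current time is strictly after end_time. */
    public boolean endTimePassed(Instant endTime) {
        return nowSupplier.get().isAfter(endTime);
    }

    public static void main(String[] args) {
        Instant endTime = Instant.parse("2025-06-01T00:00:00Z");
        EndTimeGate gate = new EndTimeGate(Instant::now);
        System.out.println("end_time passed: " + gate.endTimePassed(endTime));
    }
}
```

Passing the clock as a `Supplier<Instant>` keeps the check deterministic in unit tests, where a lambda returning a fixed instant can stand in for the wall clock.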

…ePassesStep (#128361) (#129518)

* [ILM]: Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep (#128361)

The backing indices of a time series data streams (TSDS) have time ranges (start_time & end_time) and they include documents that belong to these time ranges. To ensure that we will not unfollow a leader TSDS index before the indexing is complete, we need to add a WaitUntilTimeSeriesEndTimePassesStep to the unfollow action. This will ensure that we will only unfollow after the end_time has passed.

This creates some weird semantics with the combination of the rollover and the unfollow. Because we need the rollover of the leader index to finalise the end_time but the unfollow action is injected before the rollover. However, this should be fine, because the leader index will skip the unfollow action so it will rollover and finalise the end_time and the follower index will wait the end_time to pass before it unfollows. Rolling over the follower index will have no effect since it’s already rolled over.

(cherry picked from commit ed7f2ca)

# Conflicts:
# x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java

* Fix missing method

* [CI] Auto commit changes from spotless

---------

Co-authored-by: 안수빈 <[email protected]>
Co-authored-by: elasticsearchmachine <[email protected]>

[ILM] Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep

e45b2ef

elasticsearchmachine added needs:triage Requires assignment of a team area label external-contributor Pull request authored by a developer outside the Elasticsearch team v9.1.0 labels May 23, 2025

happysubin closed this May 23, 2025

happysubin reopened this May 23, 2025

gmarouli added :Data Management/ILM+SLM Index and Snapshot lifecycle management :StorageEngine/TSDB You know, for Metrics >bug and removed needs:triage Requires assignment of a team area label labels May 27, 2025

gmarouli assigned gmarouli and samxbr May 27, 2025

elasticsearchmachine added Team:Data Management Meta label for data/management team Team:StorageEngine labels May 27, 2025

gmarouli requested a review from samxbr May 28, 2025 13:39

samxbr requested changes May 28, 2025

samxbr and others added 2 commits May 28, 2025 14:23

Merge branch 'main' into fix/ilm-unfollow-delay

731d62b

[ILM] Move WaitUntilTimeSeriesEndTimePassesStep before WaitForFollowShardTasksStep

e759e97

happysubin added 2 commits June 7, 2025 14:42

style: apply code format to UnfollowAction.java

4634388

test: Add integration test for TSDB rollover and CCR sync during ILM wait step

99d6ecf

samxbr requested changes Jun 9, 2025

Merge branch 'main' into fix/ilm-unfollow-delay

4f2845a

fix: Address flakiness in follower cluster rollover test

9598115

Merge branch 'main' into fix/ilm-unfollow-delay

808af43

gmarouli self-requested a review June 12, 2025 17:20

gmarouli approved these changes Jun 12, 2025

gmarouli requested a review from samxbr June 12, 2025 17:21

samxbr reviewed Jun 13, 2025

docs: explain why certain implementations were chosen

18bcd3d

gmarouli requested a review from samxbr June 16, 2025 14:37

gmarouli reviewed Jun 16, 2025

samxbr approved these changes Jun 16, 2025

Improve ILM rollover comment clarity

1ad5d31

Co-authored-by: Mary Gouseti <[email protected]>

Merge branch 'main' into fix/ilm-unfollow-delay

c0cc30e

gmarouli merged commit ed7f2ca into elastic:main Jun 17, 2025
26 checks passed

elasticsearchmachine added the backport pending label Jun 17, 2025

samxbr mentioned this pull request Jun 17, 2025

[8.19] [ILM]: Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep (#128361) #129518

Merged

samxbr removed the backport pending label Jun 18, 2025
