
[ILM]: Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep #128361

Merged
16 commits merged into elastic:main on Jun 17, 2025

Conversation

happysubin
Contributor

fix: #128129

I added the WaitUntilTimeSeriesEndTimePassesStep between the WaitForFollowShardTasksStep and the PauseFollowerIndexStep in the step list of UnfollowAction, and updated the tests accordingly.

Please review my code!


cla-checker-service bot commented May 23, 2025

💚 CLA has been signed

@elasticsearchmachine added the needs:triage, external-contributor, and v9.1.0 labels May 23, 2025
@happysubin happysubin closed this May 23, 2025
@happysubin happysubin reopened this May 23, 2025
@happysubin
Contributor Author

Hi, @gmarouli
Could you take a look at this PR when you get a moment? I’d really appreciate your review!

@gmarouli
Contributor

Hi @happysubin, thank you for your contribution. @samxbr and I will review it as soon as possible.

@gmarouli added the :Data Management/ILM+SLM, :StorageEngine/TSDB, and >bug labels and removed the needs:triage label May 27, 2025
@elasticsearchmachine added the Team:Data Management and Team:StorageEngine labels May 27, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli gmarouli requested a review from samxbr May 28, 2025 13:39
Contributor

@samxbr samxbr left a comment

@happysubin Thank you for contributing to Elasticsearch! We value external contributions and would love to work with you to get this PR merged. I have left some comments on the PR; please take your time addressing them.

Comment on lines 65 to 74
WaitForFollowShardTasksStep step2 = new WaitForFollowShardTasksStep(
    waitForFollowShardTasks,
    waitUntilTimeSeriesEndTimePassesStep,
    client
);
WaitUntilTimeSeriesEndTimePassesStep step3 = new WaitUntilTimeSeriesEndTimePassesStep(
    waitUntilTimeSeriesEndTimePassesStep,
    pauseFollowerIndex,
    Instant::now
);
Contributor

The WaitUntilTimeSeriesEndTimePassesStep should come before the WaitForFollowShardTasksStep. That way the follower index can sync with the leader index one last time after the end_time has passed, which makes sure no new docs are still coming in to the leader index.
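
For illustration, here is a minimal sketch of the step order suggested in this review comment. It models only the three steps discussed here. The `Step` class, the string keys, and `executionOrder` are hypothetical stand-ins, not the actual UnfollowAction code, which may chain more steps than shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a step chain: each step knows only its
// own key and the key of the step that runs after it.
final class UnfollowStepOrderSketch {

    static final class Step {
        final String key;
        final String nextKey; // null marks the end of this sketch's chain

        Step(String key, String nextKey) {
            this.key = key;
            this.nextKey = nextKey;
        }
    }

    // The order suggested in the review: wait for end_time first, then let the
    // follower catch up one last time, then pause following.
    static List<Step> suggestedSteps() {
        return List.of(
            new Step("WaitUntilTimeSeriesEndTimePassesStep", "WaitForFollowShardTasksStep"),
            new Step("WaitForFollowShardTasksStep", "PauseFollowerIndexStep"),
            new Step("PauseFollowerIndexStep", null)
        );
    }

    // Walks the nextKey links, starting from the first step in the list.
    static List<String> executionOrder(List<Step> steps) {
        Map<String, Step> byKey = new HashMap<>();
        for (Step step : steps) {
            byKey.put(step.key, step);
        }
        List<String> order = new ArrayList<>();
        Step current = steps.get(0);
        while (current != null) {
            order.add(current.key);
            current = current.nextKey == null ? null : byKey.get(current.nextKey);
        }
        return order;
    }

    public static void main(String[] args) {
        System.out.println(executionOrder(suggestedSteps()));
    }
}
```

Because each step only points at its successor, reordering the chain means changing which key is passed as the next step, which is what the constructor arguments in the snippet under review control.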

Contributor

Please add a new integration test in the CCRIndexLifecycleIT test suite to cover this scenario. You can refer to this test as an example for verifying the WaitUntilTimeSeriesEndTimePassesStep. Essentially, we want to verify that after the leader index rolls over, the follower index goes into the WaitUntilTimeSeriesEndTimePassesStep, and that new documents written to the leader index are synced to the follower index until the end_time has passed.

Contributor Author

Thank you for the code review.
I'll review the existing test cases and work on adding new ones. I'm going to add test code!

@samxbr
Contributor

samxbr commented May 30, 2025

buildkite test this

@happysubin
Contributor Author

@samxbr
Sorry for the delay.
I’ve added the integration test code.
Please take a look when you have time!

Contributor

@samxbr samxbr left a comment

I posted some minor comments, but the change generally looks good to me. Nice!

@@ -533,6 +571,94 @@ public void testILMUnfollowFailsToRemoveRetentionLeases() throws Exception {
}
}

@SuppressWarnings({ "checkstyle:LineLength", "unchecked" })
Contributor

Please remove the line-length warning suppression; you can run ./gradlew spotlessApply to auto-format.

Also please take a look at the contribution guide for more details and other tips: https://github.com/elastic/elasticsearch/blob/main/CONTRIBUTING.md#java-language-formatting-guidelines

Contributor Author

Thank you for your code review. I've applied your suggestions!

Request countRequest = new Request("GET", "/" + indexName + "/_count");
Response response = client.performRequest(countRequest);
Map<String, Object> result = entityAsMap(response);
System.out.println("result = " + result);
Contributor

Please remove the println

@samxbr
Contributor

samxbr commented Jun 9, 2025

buildkite test this

@gmarouli
Contributor

gmarouli commented Jun 12, 2025

@happysubin that is a very good finding, good job! I did not know we had moved that configuration there.

But this means that this is not a sufficient fix; we need to work out why the timing is off on CI.

@happysubin
Contributor Author

@gmarouli It might be similar to just increasing the timeout, but how about adding the following code instead?

Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
leaderClient.performRequest(rolloverRequest);
// Consider adding this code
assertBusy(
    () -> assertThat(getIndexSetting(client(), backingIndexName, "index.lifecycle.indexing_complete"), is("true")),
    60,
    TimeUnit.SECONDS
);

Manually updating the index setting as shown below doesn’t seem to align with the purpose of this integration test.

updateIndexSettings(leaderClient, backingIndexName, Settings.builder().put("index.lifecycle.indexing_complete", true).build());

I sincerely appreciate your kind and detailed feedback on this issue! 👍
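
As a rough illustration of the retry-until-true pattern that the assertBusy call in this comment relies on, here is a minimal, self-contained sketch. `PollUntilSketch` and `waitUntil` are hypothetical names, and the fixed polling interval is an assumption made for brevity; this is not the Elasticsearch test framework's implementation.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper: re-checks a condition until it holds or a timeout elapses.
final class PollUntilSketch {

    // Returns true as soon as the condition holds, false if the timeout elapses
    // (or the thread is interrupted) before that happens.
    static boolean waitUntil(BooleanSupplier condition, long timeout, TimeUnit unit, long pollMillis) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Example condition: becomes true roughly 50 ms after start.
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 50, 2, TimeUnit.SECONDS, 5);
        System.out.println("condition met: " + met);
    }
}
```

Polling like this only observes state; it does not change it. That is why it fits an integration test better than setting index.lifecycle.indexing_complete by hand, although raising the timeout alone cannot help if the condition never becomes true.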

@gmarouli
Contributor

Hey @happysubin, you are right: we could set it ourselves, but that would be taking it a step too far when it comes to manipulating state for a test.

But good news, I think I found the culprit. There are a few issues in this test and they create some flakiness:

  1. The follower cluster should not have its own rollover action in the policy; it is meant to follow the rollover of the leader index. This means we need to replace putILMPolicy with putUnfollowOnlyPolicy, so that the follower will not try to roll over the follower index.
  2. The second problem is the manual rollover we perform. We actually need to wait for the ILM-executed rollover, because it is the only one that sets index.lifecycle.indexing_complete to true (see docs). So we need to remove:
     Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
     leaderClient.performRequest(rolloverRequest);

My theory is that the rollover on the follower cluster was catching the problem locally.

Shall we try fixing these 2 issues and put back the timeout to what it was?

Again great work being thorough!
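
To make the second point concrete, here is a toy model. It is based only on the statement above that the ILM-executed rollover is the one that sets index.lifecycle.indexing_complete to true. Plain maps stand in for index settings, and every class and method name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (not Elasticsearch code): contrasts the settings left behind by a
// manual rollover with those left behind by an ILM-executed rollover.
final class RolloverFlagSketch {

    static final String INDEXING_COMPLETE = "index.lifecycle.indexing_complete";

    // Manual rollover in this model: settings of the old index are left as they were.
    static Map<String, String> afterManualRollover(Map<String, String> settings) {
        return new HashMap<>(settings);
    }

    // ILM-executed rollover in this model: the old index is marked as complete.
    static Map<String, String> afterIlmRollover(Map<String, String> settings) {
        Map<String, String> updated = new HashMap<>(settings);
        updated.put(INDEXING_COMPLETE, "true");
        return updated;
    }

    // The condition a test (or a waiting step) would poll for.
    static boolean indexingComplete(Map<String, String> settings) {
        return "true".equals(settings.get(INDEXING_COMPLETE));
    }

    public static void main(String[] args) {
        Map<String, String> initial = new HashMap<>();
        System.out.println("manual: " + indexingComplete(afterManualRollover(initial)));
        System.out.println("ilm:    " + indexingComplete(afterIlmRollover(initial)));
    }
}
```

In this model a test that waits for the flag after a manual rollover would wait indefinitely, which is consistent with removing the manual rollover request rather than extending the timeout.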

@happysubin
Contributor Author

@gmarouli
I've quickly incorporated your review. Thank you very much for your help!

@gmarouli
Contributor

buildkite test this

@gmarouli
Contributor

> @gmarouli I've quickly incorporated your review, and thank you very much for your help !!

It's been my pleasure, let's see if we got it!

@gmarouli gmarouli self-requested a review June 12, 2025 17:20
Contributor

@gmarouli gmarouli left a comment

LGTM! @happysubin great work, thank you! @samxbr any other comments before we merge?

@gmarouli gmarouli requested a review from samxbr June 12, 2025 17:21
@samxbr
Contributor

samxbr commented Jun 13, 2025

This is awesome, thanks @gmarouli for the thorough review!
These are subtle issues. @happysubin, I would recommend adding these findings as comments in the test so that future readers know why it's done this way.

Comment on lines 605 to 613
// rollover
Request rolloverRequest = new Request("POST", "/" + dataStream + "/_rollover");
rolloverRequest.setJsonEntity("""
    {
      "conditions": {
        "max_docs": "1"
      }
    }""");
leaderClient.performRequest(rolloverRequest);
Contributor

Could you add a comment explaining why we need to wait for ILM to roll over instead of rolling over manually?

Contributor Author

@happysubin happysubin Jun 13, 2025

Of course! I’ll add comments explaining why we did it this way!

Comment on lines 586 to 585
- putILMPolicy(policyName, null, 1, null);
+ putUnfollowOnlyPolicy(client(), policyName);
Contributor

Similarly here, could you add a comment explaining why we only want the unfollow action?

@happysubin
Contributor Author

@samxbr
Added some comments to explain the reasoning behind the implementation. Thanks for pointing it out!

@gmarouli gmarouli requested a review from samxbr June 16, 2025 14:37
Contributor

@samxbr samxbr left a comment

LGTM. I think we are good to merge once Mary's comment change is incorporated and CI has passed. This is a great change; thank you very much for your contribution and for addressing the comments so promptly!

@happysubin
Contributor Author

@gmarouli comments have been incorporated, and I'm really grateful for the help from both @samxbr and @gmarouli
It's been a pleasure!!

@samxbr
Contributor

samxbr commented Jun 16, 2025

buildkite test this

@gmarouli gmarouli merged commit ed7f2ca into elastic:main Jun 17, 2025
26 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Branch 8.19: Commit could not be cherry-picked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 128361

@samxbr
Contributor

samxbr commented Jun 17, 2025

💚 All backports created successfully

Branch: 8.19

Questions?

Please refer to the Backport tool documentation

samxbr pushed a commit to samxbr/elasticsearch that referenced this pull request Jun 17, 2025
…Step (elastic#128361)

The backing indices of a time series data stream (TSDS) have time ranges (start_time & end_time), and they include documents that belong to these time ranges.

To ensure that we do not unfollow a leader TSDS index before indexing is complete, we need to add a WaitUntilTimeSeriesEndTimePassesStep to the unfollow action. This ensures that we only unfollow after the end_time has passed.

This creates some odd semantics when rollover and unfollow are combined: we need the rollover of the leader index to finalise the end_time, but the unfollow action is injected before the rollover. However, this should be fine, because the leader index skips the unfollow action, so it will roll over and finalise the end_time, and the follower index will wait for the end_time to pass before it unfollows. Rolling over the follower index will have no effect since it’s already rolled over.
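
The end_time wait described in this commit message can be sketched as a small check with an injected clock, mirroring the `Instant::now` supplier passed to the step in the reviewed snippet. `EndTimeCheckSketch` and `endTimePassed` are hypothetical names. The strict "after" comparison and the treatment of a missing end_time are assumptions, not confirmed details of WaitUntilTimeSeriesEndTimePassesStep.

```java
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch of an "end_time has passed" condition with an injectable
// clock, so tests can control the current time instead of sleeping.
final class EndTimeCheckSketch {

    private final Supplier<Instant> nowSupplier;

    EndTimeCheckSketch(Supplier<Instant> nowSupplier) {
        this.nowSupplier = nowSupplier;
    }

    // Assumption: an index without an end_time is not a time series index,
    // so there is nothing to wait for.
    boolean endTimePassed(Instant endTime) {
        if (endTime == null) {
            return true;
        }
        return nowSupplier.get().isAfter(endTime);
    }

    public static void main(String[] args) {
        Instant endTime = Instant.parse("2025-06-17T12:00:00Z");
        EndTimeCheckSketch check = new EndTimeCheckSketch(Instant::now);
        System.out.println("safe to unfollow: " + check.endTimePassed(endTime));
    }
}
```

Passing the clock as a `Supplier<Instant>` keeps the condition deterministic in unit tests, while production code can pass `Instant::now`.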

(cherry picked from commit ed7f2ca)

# Conflicts:
#	x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java
elasticsearchmachine pushed a commit that referenced this pull request Jun 18, 2025
…ePassesStep (#128361) (#129518)

* [ILM]: Fix TSDS unfollow timing with WaitUntilTimeSeriesEndTimePassesStep (#128361)

(cherry picked from commit ed7f2ca)

# Conflicts:
#	x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/UnfollowAction.java

* Fix missing method

* [CI] Auto commit changes from spotless

---------

Co-authored-by: 안수빈 <[email protected]>
Co-authored-by: elasticsearchmachine <[email protected]>
Labels
auto-backport, >bug, :Data Management/ILM+SLM, external-contributor, :StorageEngine/TSDB, Team:Data Management, Team:StorageEngine, v8.19.0, v9.1.0
Development

Successfully merging this pull request may close these issues.

[ILM & TSDS] ILM needs to wait for end_time before it "unfollows"
4 participants