Modify the mechanism to pause indexing #128405


Open: ankikuma wants to merge 4 commits into base: main

ankikuma
Contributor

This PR changes the mechanism for pausing indexing, which was introduced in #127173.

The original PR caused IndexStatsIT#testThrottleStats to fail. See #126359.

@ankikuma ankikuma added Team:Distributed Indexing Meta label for Distributed Indexing team >bug and removed v9.1.0 labels May 23, 2025
@ankikuma
Contributor Author

The problem with the original implementation of the pause lock mechanism was as follows:

  1. First, note that EngineConcurrentMergeScheduler#beforeMerge and EngineConcurrentMergeScheduler#afterMerge can be called concurrently by two different merge threads.

  2. Also note that ThreadPoolMergeScheduler#checkMergeTaskThrottling can be called concurrently from two different threads, via ThreadPoolMergeScheduler#mergeTaskDone and ThreadPoolMergeScheduler#submitNewMergeTask.

  3. Next, note that this concurrency can cause Engine#IndexThrottle#activate and Engine#IndexThrottle#deactivate to be called concurrently. The IndexStatsIT#testThrottleStats test exposed that the pause mechanism did not handle this concurrency correctly.

This is because the semaphore-based approach relies on acquiring and releasing a precise number of permits during activate and deactivate, which is not possible to synchronize. Let me try to explain with the following scenario:
a. Thread 1: activates throttling on the shard --> acquires all but one permit.
b. Multiple indexing jobs arrive, all waiting to acquire the single available permit --> only one goes through at a time (all is good so far).
c. Thread 2: deactivates throttling on the shard --> releases all but one permit --> all is good, but note that the jobs in step (b) will still need to acquire and release these permits.
d. Thread 3: activates throttling immediately after the deactivate in step (c), but it cannot acquire the requested number of permits immediately because the indexing threads from step (b) have acquired permits in the meantime to work on indexing. So the activate waits for permits before it can switch the lock to pauseLock.
e. Thread 4: deactivates throttling immediately after step (d), finds that we still have a NOOP lock, and asserts. In any case, it cannot release the precise number of permits that deactivate would like to release, because activate hasn't even acquired them yet.
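
To make the scenario above easier to follow, here is a minimal single-threaded sketch that replays steps (a) to (e) against a plain java.util.concurrent.Semaphore. It is not the Elasticsearch code: the class name, the permit count of 4, the pauseLockInstalled flag, and the non-blocking tryActivate/tryDeactivate helpers are all hypothetical stand-ins, chosen so that the outcome is deterministic. In the real engine, activate would block where this sketch returns false.

```java
import java.util.Arrays;
import java.util.concurrent.Semaphore;

public class PauseSemaphoreSketch {
    // Hypothetical permit count; the real value in Elasticsearch may differ.
    static final int TOTAL_PERMITS = 4;

    final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    // Stands in for "the lock has been switched from NOOP to pauseLock".
    boolean pauseLockInstalled = false;

    // Non-blocking stand-in for activate. A blocking implementation would
    // wait here, which is the window described in step (d).
    boolean tryActivate() {
        if (permits.tryAcquire(TOTAL_PERMITS - 1)) {
            pauseLockInstalled = true;
            return true;
        }
        return false;
    }

    // Releases the permits taken by activate, but only if activate completed.
    boolean tryDeactivate() {
        if (!pauseLockInstalled) {
            return false; // still the NOOP lock: no precise count to release
        }
        pauseLockInstalled = false;
        permits.release(TOTAL_PERMITS - 1);
        return true;
    }

    // Replays steps (a) to (e) on one thread so the result is deterministic.
    static boolean[] runScenario() {
        PauseSemaphoreSketch s = new PauseSemaphoreSketch();
        boolean a = s.tryActivate();          // (a) takes 3 of the 4 permits
        boolean b1 = s.permits.tryAcquire();  // (b) one indexing job gets the last permit
        boolean b2 = s.permits.tryAcquire();  // (b) a second job has to wait
        s.permits.release();                  // the first job finishes
        boolean c = s.tryDeactivate();        // (c) returns the 3 permits
        s.permits.tryAcquire();               // two queued jobs from (b) now
        s.permits.tryAcquire();               // hold one permit each
        boolean d = s.tryActivate();          // (d) only 2 permits free, 3 needed
        boolean e = s.tryDeactivate();        // (e) finds the NOOP lock still in place
        return new boolean[] { a, b1, b2, c, d, e };
    }

    public static void main(String[] args) {
        // prints [true, true, false, true, false, false]
        System.out.println(Arrays.toString(runScenario()));
    }
}
```

The last two values are the interesting ones: the second activate cannot take its permits while indexing jobs hold them, so the following deactivate has nothing well-defined to release.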

@ankikuma ankikuma marked this pull request as ready for review May 27, 2025 17:48
@ankikuma ankikuma requested review from henningandersen and tlrx May 27, 2025 17:48
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label and removed Team:Distributed Indexing Meta label for Distributed Indexing team labels May 27, 2025
Labels
>bug needs:triage Requires assignment of a team area label