Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexCleanupService inevitably leads to race conditions #55

Open
theburge opened this issue Aug 26, 2022 · 0 comments
Open

IndexCleanupService inevitably leads to race conditions #55

theburge opened this issue Aug 26, 2022 · 0 comments

Comments

@theburge
Copy link
Contributor

theburge commented Aug 26, 2022

Background

#54 lead to an interesting discussion about the potential for race conditions between clean-up and new index creation. #53 also relates to race conditions arising from IndexCleanupService (although that race is between renaming index directories and shutting down Lucene components properly.

Between the two mentioned tickets, I've concluded that both deletion- and rename-based clean-up implementations are racy.

The race conditions are intrinsic to the implementation due to a separate actor being responsible for clean-up actions. This thought was triggered when considering IndexCleanupService.recursivelyDelete() and the case of trying to clean-up orphaned index-signature directories (i.e., ${SEARCH_ROOT}/shards/${RANGE}/${DB_NAME_AND_SUFFIX}/${INDEX_SIG}).

Clarification: Currently, the application does not remove the directories for orphaned indexes. (Although it really should because otherwise filesystem inodes are wasted on empty directories when users change their index definitions.) The reasoning in this ticket applies to both the current implementation and any implementation that also unlinks empty index directories in response to CleanupDbMsg: regardless of whether the directory is unlinked, the content is removed if the directory is unrecognised (meaning that all of the Lucene-created content will be removed), and that leads to potential problems.

Example

In the above case, clouseau sends a CleanupDbMsg with a list of active signatures to not delete. However, imagine the case where a user changes a design document in two steps to trigger this:

  1. Remove the old index definition (which triggers a CleanupDbMsg with the list of known signatures)
  2. (In a second operation) Create a new index definition (which has a signature unknown to the message sent in step 1)

Scanning the directories not in activeSigs can therefore race the new index creation, and potentially the new index's directory could end up being deleted through the follow interleaving:

  1. Send CleanupDbMsg
  2. User request arrives and causes creation of the new index structure
  3. Execute IndexCleanupService.recursivelyDelete() in response to the message sent in 1
  4. Find the new index directory, whose signature is not in activeSigs
  5. Delete the index directory's contents
  6. Etc.

In this case, I'm not sure exactly what happens, but it probably isn't good and I expect it manifests like #53, where shutdown of the writer would fail (but leave the associated file lock held, etc.). It would also be the case that the LRU would refer to a non-existent path, which I expect is a problem.

Recommendation

Clean-up actions that are issued during the lifetime of an IndexService should occur in that actor's flow of execution, not in another actor's. That will ensure that filesystem actions are properly sequenced from the application's point of view, at the cost of slowing down the IndexService if clean-up actions are performed frequently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant