IndexCleanupService inevitably leads to race conditions #55

theburge · 2022-08-26T11:33:14Z

Background

#54 lead to an interesting discussion about the potential for race conditions between clean-up and new index creation. #53 also relates to race conditions arising from IndexCleanupService (although that race is between renaming index directories and shutting down Lucene components properly.

Between the two mentioned tickets, I've concluded that both deletion- and rename-based clean-up implementations are racy.

The race conditions are intrinsic to the implementation due to a separate actor being responsible for clean-up actions. This thought was triggered when considering IndexCleanupService.recursivelyDelete() and the case of trying to clean-up orphaned index-signature directories (i.e., ${SEARCH_ROOT}/shards/${RANGE}/${DB_NAME_AND_SUFFIX}/${INDEX_SIG}).

Clarification: Currently, the application does not remove the directories for orphaned indexes. (Although it really should because otherwise filesystem inodes are wasted on empty directories when users change their index definitions.) The reasoning in this ticket applies to both the current implementation and any implementation that also unlinks empty index directories in response to CleanupDbMsg: regardless of whether the directory is unlinked, the content is removed if the directory is unrecognised (meaning that all of the Lucene-created content will be removed), and that leads to potential problems.

Example

In the above case, clouseau sends a CleanupDbMsg with a list of active signatures to not delete. However, imagine the case where a user changes a design document in two steps to trigger this:

Remove the old index definition (which triggers a CleanupDbMsg with the list of known signatures)
(In a second operation) Create a new index definition (which has a signature unknown to the message sent in step 1)

Scanning the directories not in activeSigs can therefore race the new index creation, and potentially the new index's directory could end up being deleted through the follow interleaving:

Send CleanupDbMsg
User request arrives and causes creation of the new index structure
Execute IndexCleanupService.recursivelyDelete() in response to the message sent in 1
Find the new index directory, whose signature is not in activeSigs
Delete the index directory's contents
Etc.

In this case, I'm not sure exactly what happens, but it probably isn't good and I expect it manifests like #53, where shutdown of the writer would fail (but leave the associated file lock held, etc.). It would also be the case that the LRU would refer to a non-existent path, which I expect is a problem.

Recommendation

Clean-up actions that are issued during the lifetime of an IndexService should occur in that actor's flow of execution, not in another actor's. That will ensure that filesystem actions are properly sequenced from the application's point of view, at the cost of slowing down the IndexService if clean-up actions are performed frequently.

The text was updated successfully, but these errors were encountered:

theburge mentioned this issue Aug 26, 2022

Delete empty shard folders when DB is deleted #54

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IndexCleanupService inevitably leads to race conditions #55

IndexCleanupService inevitably leads to race conditions #55

theburge commented Aug 26, 2022 •

edited

Loading

IndexCleanupService inevitably leads to race conditions #55

IndexCleanupService inevitably leads to race conditions #55

Comments

theburge commented Aug 26, 2022 • edited Loading

Background

Example

Recommendation

theburge commented Aug 26, 2022 •

edited

Loading