You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
#54 lead to an interesting discussion about the potential for race conditions between clean-up and new index creation. #53 also relates to race conditions arising from IndexCleanupService (although that race is between renaming index directories and shutting down Lucene components properly.
Between the two mentioned tickets, I've concluded that both deletion- and rename-based clean-up implementations are racy.
The race conditions are intrinsic to the implementation due to a separate actor being responsible for clean-up actions. This thought was triggered when considering IndexCleanupService.recursivelyDelete() and the case of trying to clean-up orphaned index-signature directories (i.e., ${SEARCH_ROOT}/shards/${RANGE}/${DB_NAME_AND_SUFFIX}/${INDEX_SIG}).
Clarification: Currently, the application does not remove the directories for orphaned indexes. (Although it really should because otherwise filesystem inodes are wasted on empty directories when users change their index definitions.) The reasoning in this ticket applies to both the current implementation and any implementation that also unlinks empty index directories in response to CleanupDbMsg: regardless of whether the directory is unlinked, the content is removed if the directory is unrecognised (meaning that all of the Lucene-created content will be removed), and that leads to potential problems.
Example
In the above case, clouseau sends a CleanupDbMsg with a list of active signatures to not delete. However, imagine the case where a user changes a design document in two steps to trigger this:
Remove the old index definition (which triggers a CleanupDbMsg with the list of known signatures)
(In a second operation) Create a new index definition (which has a signature unknown to the message sent in step 1)
Scanning the directories not in activeSigs can therefore race the new index creation, and potentially the new index's directory could end up being deleted through the follow interleaving:
Send CleanupDbMsg
User request arrives and causes creation of the new index structure
Execute IndexCleanupService.recursivelyDelete() in response to the message sent in 1
Find the new index directory, whose signature is not in activeSigs
Delete the index directory's contents
Etc.
In this case, I'm not sure exactly what happens, but it probably isn't good and I expect it manifests like #53, where shutdown of the writer would fail (but leave the associated file lock held, etc.). It would also be the case that the LRU would refer to a non-existent path, which I expect is a problem.
Recommendation
Clean-up actions that are issued during the lifetime of an IndexService should occur in that actor's flow of execution, not in another actor's. That will ensure that filesystem actions are properly sequenced from the application's point of view, at the cost of slowing down the IndexService if clean-up actions are performed frequently.
The text was updated successfully, but these errors were encountered:
Background
#54 lead to an interesting discussion about the potential for race conditions between clean-up and new index creation. #53 also relates to race conditions arising from
IndexCleanupService
(although that race is between renaming index directories and shutting down Lucene components properly.Between the two mentioned tickets, I've concluded that both deletion- and rename-based clean-up implementations are racy.
The race conditions are intrinsic to the implementation due to a separate actor being responsible for clean-up actions. This thought was triggered when considering
IndexCleanupService.recursivelyDelete()
and the case of trying to clean-up orphaned index-signature directories (i.e.,${SEARCH_ROOT}/shards/${RANGE}/${DB_NAME_AND_SUFFIX}/${INDEX_SIG}
).Clarification: Currently, the application does not remove the directories for orphaned indexes. (Although it really should because otherwise filesystem inodes are wasted on empty directories when users change their index definitions.) The reasoning in this ticket applies to both the current implementation and any implementation that also unlinks empty index directories in response to
CleanupDbMsg
: regardless of whether the directory is unlinked, the content is removed if the directory is unrecognised (meaning that all of the Lucene-created content will be removed), and that leads to potential problems.Example
In the above case, clouseau sends a
CleanupDbMsg
with a list of active signatures to not delete. However, imagine the case where a user changes a design document in two steps to trigger this:CleanupDbMsg
with the list of known signatures)Scanning the directories not in
activeSigs
can therefore race the new index creation, and potentially the new index's directory could end up being deleted through the follow interleaving:CleanupDbMsg
IndexCleanupService.recursivelyDelete()
in response to the message sent in 1activeSigs
In this case, I'm not sure exactly what happens, but it probably isn't good and I expect it manifests like #53, where shutdown of the writer would fail (but leave the associated file lock held, etc.). It would also be the case that the LRU would refer to a non-existent path, which I expect is a problem.
Recommendation
Clean-up actions that are issued during the lifetime of an
IndexService
should occur in that actor's flow of execution, not in another actor's. That will ensure that filesystem actions are properly sequenced from the application's point of view, at the cost of slowing down theIndexService
if clean-up actions are performed frequently.The text was updated successfully, but these errors were encountered: