Background
IndexCleanupService can rename an index directory while it is still open, when using "rename on delete" (aka clouseau index recovery after deletion; see around 7a601e9).
IndexService.exit() has recovery logic, added to handle race conditions, that emits the warning "Error while closing writer" and attempts to close the Lucene Directory. See also #23, which resulted from trying to protect against outright crashes of IndexService when the Lucene Directory could not be closed.
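A hedged paraphrase of the shape of that recovery logic, not Clouseau's actual code: try to roll back the writer on exit; if that throws (e.g. because the directory was renamed underneath it), log the warning and fall back to closing the Directory, with the intent of releasing resources such as the write lock. The Writer/Dir interfaces below are illustrative stand-ins for Lucene's IndexWriter and Directory.

```java
// Sketch of the exit-time recovery pattern described above (illustrative
// only; interface names and method shapes are assumptions, not Clouseau's).
public class ExitRecoverySketch {
    interface Writer { void rollback() throws Exception; }
    interface Dir { void close() throws Exception; }

    // Returns true if cleanup completed (writer rolled back, or the
    // directory was closed as a fallback).
    static boolean exit(Writer writer, Dir dir) {
        try {
            // Normal path: rolling back closes the writer and releases
            // its write lock.
            writer.rollback();
            return true;
        } catch (Exception e) {
            System.err.println("Error while closing writer: " + e);
            try {
                // Fallback: close the Directory directly, intended to
                // release resources such as the write lock.
                dir.close();
                return true;
            } catch (Exception e2) {
                // If this also fails, the write.lock can stay held.
                return false;
            }
        }
    }

    public static void main(String[] args) {
        boolean cleanedUp = exit(
                () -> { throw new Exception("directory moved"); },
                () -> {});
        System.out.println("cleanup completed: " + cleanedUp);
    }
}
```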
Observed problem
It seems that preventing IndexService from crashing and then trying to release the write lock doesn't (always?) work.
In some (but not all) cases, the recovery logic ends up emitting the warning "Error while closing writer" (which, IIRC, simply results from moving the directory while the index is open). Then, in a fraction of those cases, this correlates with the write lock being held rather than released.
From the clouseau logs:
2022-06-13 01:58:12,166 clouseau.cleanup [INFO] Renaming '/srv/search_index/shards/80000000-9fffffff/user/db.suffix' to '/srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix'
2022-06-13 01:58:12,171 clouseau [WARN] shards/80000000-9fffffff/user/db.suffix/hash Error while closing writer
org.apache.lucene.store.NoSuchDirectoryException: file '/srv/search_index/shards/80000000-9fffffff/user/db.suffix/hash' exists but is not a directory
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:221)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:324)
at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:360)
at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2095)
at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2033)
at com.cloudant.clouseau.IndexService.exit(IndexService.scala:221)
at scalang.Process.trapExit(Process.scala:58)
at scalang.Process.handleExit(Process.scala:46)
at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)
On the node in question:
me@somehost:~$ date
Mon 13 Jun 2022 10:30:26 AM UTC
me@somehost:~$ sudo lsof -p 5505 | grep -F 'shards/80000000-9fffffff/user/db'
java 5505 dbcore 158uW REG 254,0 0 1540214 /srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix/hash/write.lock
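The lsof output is consistent with POSIX rename semantics: an open file handle, and the OS-level lock Lucene's native lock factory takes on write.lock, follows the file when its parent directory is renamed. A minimal sketch without Lucene (temp-dir paths stand in for the shard paths above):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockSurvivesRename {

    // Returns {lockStillValid, lockFilePresentAtRenamedPath}.
    static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("index");
            Path lockFile = dir.resolve("write.lock");

            // Acquire an exclusive OS-level lock, as a Lucene writer
            // would when it opens the index.
            FileChannel ch = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = ch.lock();

            // Rename the directory out from under the open handle,
            // as "rename on delete" does to a live index.
            Path renamed = dir.resolveSibling(dir.getFileName() + ".deleted");
            Files.move(dir, renamed);

            boolean[] result = {
                lock.isValid(),                              // lock survives
                Files.exists(renamed.resolve("write.lock"))  // now under *.deleted
            };

            lock.release();
            ch.close();
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("lock still held after rename: " + r[0]);
        System.out.println("write.lock now under renamed dir: " + r[1]);
    }
}
```

This mirrors what lsof shows: the held write.lock ends up under the `*.deleted.suffix` path.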
If there is a subsequent attempt to build a clouseau index on the same database shard/design doc hash (i.e., an identically named shard is created anew), it will fail with an error (which bubbles up through dreyfus) like:
This can be demonstrated via an Erlang remote shell on the node with the write lock held:
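Setting the Erlang transcript aside, the underlying contention can be reproduced in plain Java (no Lucene required; the class and file names here are illustrative). Within a single JVM the second acquisition attempt fails with OverlappingFileLockException; across processes FileChannel.tryLock returns null instead, which Lucene typically reports as a failure to obtain the index write lock.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class StaleLockBlocksNewWriter {

    // Returns true if a second acquisition attempt on a held lock fails.
    static boolean secondAcquireFails() {
        try {
            Path lockFile =
                    Files.createTempDirectory("index").resolve("write.lock");
            try (FileChannel first = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                first.lock(); // held until the channel closes

                try (FileChannel second = FileChannel.open(lockFile,
                        StandardOpenOption.WRITE)) {
                    second.tryLock(); // contended acquisition
                    return false;     // unexpectedly succeeded
                } catch (OverlappingFileLockException e) {
                    return true;      // in-JVM signal that the lock is held
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("new writer blocked by stale lock: "
                + secondAcquireFails());
    }
}
```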
Impact