Lucene write locks are sometimes held open upon database (or shard) deletion #53

Open

theburge opened this issue Jun 13, 2022

Background

IndexCleanupService can rename an index directory while that index is still open when "rename on delete" is in use (i.e., clouseau index recovery after deletion; see around 7a601e9).

IndexService.exit() has recovery logic, added to handle race conditions, that emits the warning "Error while closing writer" and attempts to close the Lucene Directory. See also #23, which resulted from trying to protect IndexService against outright crashes when the Lucene Directory could not be closed.
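
For context, the recovery path in IndexService.exit() is shaped roughly as follows. This is a paraphrased fragment for illustration only, not the exact source; field and helper names are approximate:

// Paraphrased sketch of the IndexService.exit() recovery path (illustrative only).
override def exit(msg: Any) {
  try {
    // Normal path: rolling back the IndexWriter also releases write.lock.
    ctx.args.writer.rollback()
  } catch {
    case e: IOException =>
      // This is the "Error while closing writer" warning seen in the logs below.
      warn("Error while closing writer", e)
      try {
        // Best effort: close the underlying Directory rather than crash (cf. #23).
        // Note that FSDirectory.close() does not release a NativeFSLock that the
        // writer has already acquired.
        ctx.args.writer.getDirectory.close()
      } catch {
        case _: IOException => // nothing further can be done here
      }
  } finally {
    super.exit(msg)
  }
}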

Observed problem

It seems that the logic that prevents IndexService from crashing and then tries to release the write lock doesn't always succeed.

In some (but not all) cases, the recovery logic ends up emitting the warning "Error while closing writer" (which, IIRC, simply results from moving the directory while the index is open). Then, in a fraction of those cases, this correlates with the write lock being held rather than released.

From the clouseau logs:

2022-06-13 01:58:12,166 clouseau.cleanup [INFO] Renaming '/srv/search_index/shards/80000000-9fffffff/user/db.suffix' to '/srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix'
2022-06-13 01:58:12,171 clouseau [WARN] shards/80000000-9fffffff/user/db.suffix/hash Error while closing writer
	org.apache.lucene.store.NoSuchDirectoryException: file '/srv/search_index/shards/80000000-9fffffff/user/db.suffix/hash' exists but is not a directory
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:221)
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
	at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:324)
	at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:360)
	at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2095)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2033)
	at com.cloudant.clouseau.IndexService.exit(IndexService.scala:221)
	at scalang.Process.trapExit(Process.scala:58)
	at scalang.Process.handleExit(Process.scala:46)
	at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
	at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
	at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
	at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
	at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
	at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
	at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
	at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)

On the node in question:

me@somehost:~$ date
Mon 13 Jun 2022 10:30:26 AM UTC
me@somehost:~$ sudo lsof -p 5505 | grep -F 'shards/80000000-9fffffff/user/db'
java    5505 dbcore  158uW     REG              254,0        0  1540214 /srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix/hash/write.lock
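
The stack trace above points at the mechanism: rollback() fails inside IndexFileDeleter.refresh() because FSDirectory.listAll() can no longer resolve the original path, and that happens before the point where IndexWriter would normally release its write lock; as far as I can tell, closing the FSDirectory afterwards does not release the already-acquired NativeFSLock. The sequence of events can be approximated outside clouseau with a standalone sketch like the one below (untested, Lucene 4.6-era API, hypothetical /tmp paths; the comments describe the behaviour reported in this issue):

import java.io.{File, IOException}
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory
import org.apache.lucene.util.Version

object LockLeakSketch extends App {
  // Hypothetical paths; this mimics an index directory being renamed out from
  // under its open IndexWriter, as IndexCleanupService's rename-on-delete does.
  val path = new File("/tmp/lock-leak-demo/index")
  val dir = FSDirectory.open(path)
  val config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46))
  val writer = new IndexWriter(dir, config)
  writer.addDocument(new Document())
  writer.commit()

  // Rename the directory while the writer (and its write.lock) is still open.
  path.renameTo(new File("/tmp/lock-leak-demo/index.deleted"))

  try {
    // In the failure reported here, rollback() dies in IndexFileDeleter.refresh()
    // -> FSDirectory.listAll() (NoSuchDirectoryException), i.e. before the step
    // at which IndexWriter releases write.lock.
    writer.rollback()
  } catch {
    case e: IOException => println("Error while closing writer: " + e)
  }

  // Closing the FSDirectory only marks it closed; it does not release the
  // NativeFSLock the writer acquired, so the renamed write.lock file can stay
  // open (and visible in lsof) until the JVM exits.
  dir.close()
}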

If there is a subsequent attempt to build a clouseau index on the same database shard/design doc hash (i.e., an identically named shard is created anew), it will fail with an error (which bubbles up through dreyfus) like:

Lock obtain timed out: NativeFSLock@/path/to/write.lock

This can be demonstrated via an Erlang remote shell on the node with the write lock held:

> dreyfus_index_manager:get_index(<<"shards/80000000-9fffffff/user/db.suffix">>, #index{sig= <<"hash">>, analyzer= <<"perfield">>}).
{error,<<"Lock obtain timed out: NativeFSLock@/srv/search_index/shards/80000000-9fffffff/user/db.suffix/ha"...>>}

Impact

  • Potential for exhaustion of file handles
  • Inability to recreate the same shard without restarting clouseau to clear the lock
    • This is observable in scenarios where a single shard was previously deleted and is now being copied back