Lucene write locks are sometimes held open upon database (or shard) deletion #53

Open

theburge opened this issue Jun 13, 2022

Background

IndexCleanupService can rename an index directory while that index is still open when "rename on delete" is in use (i.e., clouseau index recovery after deletion; see around 7a601e9).

IndexService.exit() has recovery logic, added to handle race conditions, that emits the warning "Error while closing writer" and attempts to close the Lucene Directory. See also #23, which resulted from trying to protect IndexService against outright crashes when the Lucene Directory could not be closed.
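
For context, the recovery path in IndexService.exit() is shaped roughly as follows. This is a paraphrased fragment for illustration only, not the exact source; field and helper names are approximate:

// Paraphrased sketch of the IndexService.exit() recovery path (illustrative only).
override def exit(msg: Any) {
  try {
    // Normal path: rolling back the IndexWriter also releases write.lock.
    ctx.args.writer.rollback()
  } catch {
    case e: IOException =>
      // This is the "Error while closing writer" warning seen in the logs below.
      warn("Error while closing writer", e)
      try {
        // Best effort: close the underlying Directory rather than crash (cf. #23).
        // Note that FSDirectory.close() does not release a NativeFSLock that the
        // writer has already acquired.
        ctx.args.writer.getDirectory.close()
      } catch {
        case _: IOException => // nothing further can be done here
      }
  } finally {
    super.exit(msg)
  }
}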

Observed problem

It seems that the logic that prevents IndexService from crashing and then tries to release the write lock doesn't always succeed.

In some (but not all) cases, the recovery logic ends up emitting the warning "Error while closing writer" (which, IIRC, simply results from moving the directory while the index is open). Then, in a fraction of those cases, this correlates with the write lock being held rather than released.

From the clouseau logs:

2022-06-13 01:58:12,166 clouseau.cleanup [INFO] Renaming '/srv/search_index/shards/80000000-9fffffff/user/db.suffix' to '/srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix'
2022-06-13 01:58:12,171 clouseau [WARN] shards/80000000-9fffffff/user/db.suffix/hash Error while closing writer
	org.apache.lucene.store.NoSuchDirectoryException: file '/srv/search_index/shards/80000000-9fffffff/user/db.suffix/hash' exists but is not a directory
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:221)
	at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
	at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:324)
	at org.apache.lucene.index.IndexFileDeleter.refresh(IndexFileDeleter.java:360)
	at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2095)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:2033)
	at com.cloudant.clouseau.IndexService.exit(IndexService.scala:221)
	at scalang.Process.trapExit(Process.scala:58)
	at scalang.Process.handleExit(Process.scala:46)
	at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:55)
	at scalang.node.ProcessHolder$$anon$2.onMessage(ProcessAdapter.scala:52)
	at org.jetlang.channels.ChannelSubscription$1.run(ChannelSubscription.java:31)
	at org.jetlang.core.BatchExecutorImpl.execute(BatchExecutorImpl.java:11)
	at org.jetlang.fibers.PoolFiber.flush(PoolFiber.java:82)
	at org.jetlang.fibers.PoolFiber.access$100(PoolFiber.java:21)
	at org.jetlang.fibers.PoolFiber$1.run(PoolFiber.java:38)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
	at overlock.threadpool.ErrorLoggedThread.run(NamedThreadFactory.scala:40)

On the node in question:

me@somehost:~$ date
Mon 13 Jun 2022 10:30:26 AM UTC
me@somehost:~$ sudo lsof -p 5505 | grep -F 'shards/80000000-9fffffff/user/db'
java    5505 dbcore  158uW     REG              254,0        0  1540214 /srv/search_index/shards/80000000-9fffffff/user/db.20220613.015812.deleted.suffix/hash/write.lock
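
The stack trace above points at the mechanism: rollback() fails inside IndexFileDeleter.refresh() because FSDirectory.listAll() can no longer resolve the original path, and that happens before the point where IndexWriter would normally release its write lock; as far as I can tell, closing the FSDirectory afterwards does not release the already-acquired NativeFSLock. The sequence of events can be approximated outside clouseau with a standalone sketch like the one below (untested, Lucene 4.6-era API, hypothetical /tmp paths; the comments describe the behaviour reported in this issue):

import java.io.{File, IOException}
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.Document
import org.apache.lucene.index.{IndexWriter, IndexWriterConfig}
import org.apache.lucene.store.FSDirectory
import org.apache.lucene.util.Version

object LockLeakSketch extends App {
  // Hypothetical paths; this mimics an index directory being renamed out from
  // under its open IndexWriter, as IndexCleanupService's rename-on-delete does.
  val path = new File("/tmp/lock-leak-demo/index")
  val dir = FSDirectory.open(path)
  val config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46))
  val writer = new IndexWriter(dir, config)
  writer.addDocument(new Document())
  writer.commit()

  // Rename the directory while the writer (and its write.lock) is still open.
  path.renameTo(new File("/tmp/lock-leak-demo/index.deleted"))

  try {
    // In the failure reported here, rollback() dies in IndexFileDeleter.refresh()
    // -> FSDirectory.listAll() (NoSuchDirectoryException), i.e. before the step
    // at which IndexWriter releases write.lock.
    writer.rollback()
  } catch {
    case e: IOException => println("Error while closing writer: " + e)
  }

  // Closing the FSDirectory only marks it closed; it does not release the
  // NativeFSLock the writer acquired, so the renamed write.lock file can stay
  // open (and visible in lsof) until the JVM exits.
  dir.close()
}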

If there is a subsequent attempt to build a clouseau index on the same database shard/design doc hash (i.e., an identically named shard is created anew), it will fail with an error (which bubbles up through dreyfus) like:

Lock obtain timed out: NativeFSLock@/path/to/write.lock

This can be demonstrated via an Erlang remote shell on the node with the write lock held:

> dreyfus_index_manager:get_index(<<"shards/80000000-9fffffff/user/db.suffix">>, #index{sig= <<"hash">>, analyzer= <<"perfield">>}).
{error,<<"Lock obtain timed out: NativeFSLock@/srv/search_index/shards/80000000-9fffffff/user/db.suffix/ha"...>>}

Impact

  • Potential for exhaustion of file handles
  • Inability to recreate the same shard without restarting clouseau to clear the lock
    • This is observable in scenarios where a single shard was previously deleted and is now being copied back