
Do not hold mutex when destructing big object #178

Open
wants to merge 3 commits into base: master

Conversation

dongdaoguang

Signed-off-by: jason [email protected]
When running Titan in production, we frequently see get requests with latencies of several hundred milliseconds. The cause is that the get command calls mutex_.Lock() when it creates a snapshot, while the BlobGCJob object is destructed with mutex_ held. That object stores the KVs that need to be written back to the SST files; when there are many such KVs (offline tests saw several hundred thousand), the object becomes very large, its destruction takes a long time, and mutex_ is held for that whole period, so get requests block on the lock. We therefore suggest not holding the lock while destructing large objects. After applying this fix, we have been running it in production for a month and get no longer shows latency spikes.

Before the fix: roughly 100,000 get requests per day exceeded 200 ms latency.
After the fix: roughly 100 get requests per day exceeded 200 ms latency.
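
To make the pattern concrete, here is a minimal standalone sketch of the idea described above, using hypothetical names (BigJob, BackgroundWork) rather than the actual Titan code; the real change under review is shown in the diff below.

```cpp
// Minimal sketch of "do not destroy a large object while holding the mutex".
// BigJob and BackgroundWork are hypothetical names, not Titan's real types.
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct BigJob {
  // KV pairs that must be rewritten to SST files; offline tests saw
  // hundreds of thousands of entries, so destruction is slow.
  std::vector<std::pair<std::string, std::string>> rewritten_kvs;
};

std::mutex mutex_;

void BackgroundWork(BigJob* job) {
  mutex_.lock();
  // ... GC work that must run under the mutex ...
  mutex_.unlock();  // release the lock first ...
  delete job;       // ... so the slow destruction cannot block get
  mutex_.lock();
  // ... remaining bookkeeping under the mutex ...
  mutex_.unlock();
}
```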

Diff under review:

    }

      mutex_.Unlock();
      delete blob_gc_job;
Collaborator

Not sure if unlocking here could cause a race condition, and it is hard to check. A better pattern would be to pass a struct GCContext { std::unique_ptr<BlobGCJob> } created outside of the mutex (in BackgroundCallGC()) into BackgroundGC(), use the context struct to hold the pointer, and make sure the context struct is cleaned up outside of the mutex.
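
For illustration, a rough standalone sketch of this suggestion follows; GCContext's layout and the wiring of BackgroundCallGC()/BackgroundGC() here are simplified assumptions, not the actual Titan implementation.

```cpp
// Sketch of the suggested GCContext pattern: the context is owned outside
// the lock scope, so the large BlobGCJob is destroyed after mutex_ is
// released. Simplified, hypothetical code.
#include <memory>
#include <mutex>

struct BlobGCJob { /* large: holds the KVs to rewrite */ };

struct GCContext {
  std::unique_ptr<BlobGCJob> blob_gc_job;
};

std::mutex mutex_;

void BackgroundGC(GCContext* context) {
  // Runs with mutex_ held; parks the job in the caller-owned context
  // instead of deleting it here.
  context->blob_gc_job.reset(new BlobGCJob());
  // ... run the GC job ...
}

void BackgroundCallGC() {
  GCContext context;  // owned outside the lock scope
  {
    std::lock_guard<std::mutex> lock(mutex_);
    BackgroundGC(&context);
    // ... other bookkeeping under the mutex ...
  }
  // context (and the large BlobGCJob it owns) is destroyed here, after
  // the mutex has been released, so get requests are not blocked.
}
```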

Author

We checked the BlobGCJob destructor and are sure it will not cause a race condition, but using a GCContext to manage the BlobGCJob pointer is a good idea.

Collaborator

Not just the BlobGCJob destructor. I haven't checked whether it is safe if some other job slips in between blob_gc_job->Finish() and blob_gc->ReleaseGcFiles().

Collaborator

@yiwu-arbug left a comment

Thanks a lot for the fix. May I know where you are using Titan?

@yiwu-arbug
Collaborator

And do you mind updating the PR summary in English?

@yiwu-arbug
Collaborator

You can install clang-format and clang-format-diff and run scripts/format-diff.sh once to fix CI.

@dongdaoguang
Author

We use Titan for PIKA, which is a persistent, large-capacity storage service.

@Connor1996
Member

@dongdaoguang friendly ping

@yiwu-arbug
Collaborator

/run-tests

@yiwu-arbug
Collaborator

/run-tests
