-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
not hold mutex when destruct big object #178
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: jason <[email protected]>
} | ||
|
||
mutex_.Unlock(); | ||
delete blob_gc_job; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if unlocking would cause any race condition, and it is just hard to check. A better pattern is we can pass a struct GCContext { std::unique_ptr<BlobGCJob> }
from outside of the mutex (in BackgroundCallGC()
) into BackgroundGC
and use the context struct to hold the pointer, and make sure the context struct is cleanup outside of mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We checked the BlobGCJob destruct function and sure it will not cause race condition, but it is a good idea to use GCContext to manage the BlobGCJob pointer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not just BlobGCJob
destructor. I didn't check whether it is safe if some other job slip in between blob_gc_job->Finish()
and blob_gc->ReleaseGcFiles()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great thanks for the fix. May I know where are you using Titan?
And do you mind update the PR summary using English? |
You can install clang-format and clang-format-diff and run scripts/format-diff.sh once to fix CI. |
We use titan for PIKA whitch is a persistent huge storage service. |
@dongdaoguang friendly ping |
/run-tests |
/run-tests |
Signed-off-by: jason [email protected]
我们线上使用titan的时候经常发现get会产生几百毫秒延时,原因是get命令创建snapshot的时候会mutex_.Lock()。但BlobGCJob对象在析构的时候会持有mutex_,该对象中保存了需要回写sst中的kv,如果需要回写的kv很多(线下测试可能会有几十万条),那么该对象会很大,析构的时候比较耗时,持有mutex_的时间就会比较长,导致get命令加锁被阻塞。所以建议析构大对象的时候,不要持有锁。我们修复后,线上已经跑了一个月,get已经不会产生延时抖动。
优化前:一天get超过200ms延时大概在10w条左右
优化后:一天get超过200ms延时大概在100条左右