fix(cluster_family): Cancel slot migration from incoming node on OOM #5000
9c88c11 to 6a5d621
src/server/cluster/cluster_family.cc
Outdated
if (migration->GetState() == MigrationState::C_FATAL) {
  migration->Stop();
}
This looks like the wrong place for this logic.
If FLOW fails with C_FATAL, we'll call it. Where would you put migration->Stop to handle stopping the migration?
Maybe we can move it to reportError or SetState, where we set the fatal status
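One way to read this suggestion, as a minimal sketch rather than the PR's actual code: the method that sets the fatal status also triggers the stop, so callers such as the FLOW handler no longer need a separate `GetState()` check. The class name `IncomingMigrationSketch`, the trimmed `MigrationState` enum, and the counter-only `Stop()` are hypothetical stand-ins, and `std::mutex` is used in place of the fiber-aware locks in the real code.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, trimmed-down state enum; only C_FATAL appears in the thread.
enum class MigrationState { C_SYNC, C_FATAL };

// Hypothetical stand-in for the incoming migration object.
class IncomingMigrationSketch {
 public:
  // Records the error and, on the first transition to C_FATAL, stops the
  // migration. Later fatal reports keep the first error and do not stop again.
  void ReportFatalError(std::string err) {
    bool first_fatal = false;
    {
      std::lock_guard<std::mutex> lk(mu_);
      if (state_ != MigrationState::C_FATAL) {
        state_ = MigrationState::C_FATAL;
        last_error_ = std::move(err);
        first_fatal = true;
      }
    }
    // Call Stop() after releasing mu_, because Stop() takes mu_ itself.
    if (first_fatal) Stop();
  }

  // Placeholder: a real Stop() would cancel flows and close connections.
  void Stop() {
    std::lock_guard<std::mutex> lk(mu_);
    ++stop_calls_;
  }

  MigrationState state() const {
    std::lock_guard<std::mutex> lk(mu_);
    return state_;
  }
  std::string last_error() const {
    std::lock_guard<std::mutex> lk(mu_);
    return last_error_;
  }
  int stop_calls() const {
    std::lock_guard<std::mutex> lk(mu_);
    return stop_calls_;
  }

 private:
  mutable std::mutex mu_;
  MigrationState state_ = MigrationState::C_SYNC;
  std::string last_error_;
  int stop_calls_ = 0;
};
```

With this shape, two consecutive fatal reports leave the object in C_FATAL with the first error message and exactly one `Stop()` call.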
@@ -70,6 +76,20 @@ class ClusterShardMigration {
        break;
      }

      auto oom_check = [&]() -> bool {
        auto used_mem = used_mem_current.load(memory_order_relaxed);
        if ((used_mem + tx_data->command.cmd_len) > max_memory_limit) {
Is used_mem RSS? Maybe we need max_memory_limit - 100MB, for example. Otherwise, how can we guarantee 100% success of this condition?
Yes, it checks against whatever is set with maxmemory. I'll modify it so it uses 90% of max memory as the upper bound. That should be safe, I think.
Let's discuss it with @adiholden or @romange
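For reference while discussing, the headroom idea could be sketched as below. This is only an illustration under stated assumptions: `WouldExceedLimit` and its `percent` parameter are hypothetical names, the 90% default comes from the reply above and is not a settled value, and the caller is assumed to pass in the value loaded from `used_mem_current`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when applying a command of `cmd_len`
// bytes on top of `used_mem` would exceed `percent`% of `max_memory_limit`.
bool WouldExceedLimit(uint64_t used_mem, uint64_t cmd_len,
                      uint64_t max_memory_limit, uint64_t percent = 90) {
  assert(percent <= 100);
  // Divide first so the multiplication cannot overflow; this rounds the
  // bound down by at most `percent` bytes.
  const uint64_t bound = max_memory_limit / 100 * percent;
  // Compare without computing used_mem + cmd_len, which could wrap around
  // in unsigned arithmetic.
  return used_mem > bound || cmd_len > bound - used_mem;
}
```

In the lambda from the diff, this would be called roughly as `WouldExceedLimit(used_mem, tx_data->command.cmd_len, max_memory_limit)`. Whether the counter tracks RSS or allocator-reported usage is the open question above, and the sketch does not answer it.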
void ReportFatalError(dfly::GenericError err) ABSL_LOCKS_EXCLUDED(state_mu_, error_mu_) {
  errors_count_.fetch_add(1, std::memory_order_relaxed);
  util::fb2::LockGuard lk_state(state_mu_);
  util::fb2::LockGuard lk_error(error_mu_);
  state_ = MigrationState::C_FATAL;
  last_error_ = std::move(err);
}
I don't like this code. Let's think about how to make it better. Maybe we can use state_mu_ for the error too, or merge this method with reportError.
Agreed, but I changed ReportError to check GetState() != C_FATAL, and I wanted to update the state and set the error message while holding both locks.
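A rough sketch of the two-lock variant described here, assuming standard-library mutexes instead of `util::fb2` locks and a plain `std::string` instead of `dfly::GenericError`. `MigrationStatusSketch` and the trimmed enum are hypothetical. `std::scoped_lock` (C++17) takes both mutexes with a deadlock-avoidance algorithm, so the two methods cannot deadlock against each other on lock order.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, trimmed-down state enum; only C_FATAL appears in the thread.
enum class MigrationState { C_SYNC, C_FATAL };

// Hypothetical stand-in for the status-tracking part of the migration.
class MigrationStatusSketch {
 public:
  // Non-fatal errors are dropped once the migration is in C_FATAL, so they
  // cannot overwrite the fatal error message.
  void ReportError(std::string err) {
    std::scoped_lock lk(state_mu_, error_mu_);
    if (state_ == MigrationState::C_FATAL) return;
    errors_count_.fetch_add(1, std::memory_order_relaxed);
    last_error_ = std::move(err);
  }

  // State and error message change together while both locks are held.
  void ReportFatalError(std::string err) {
    errors_count_.fetch_add(1, std::memory_order_relaxed);
    std::scoped_lock lk(state_mu_, error_mu_);
    state_ = MigrationState::C_FATAL;
    last_error_ = std::move(err);
  }

  MigrationState state() const {
    std::scoped_lock lk(state_mu_);
    return state_;
  }
  std::string last_error() const {
    std::scoped_lock lk(error_mu_);
    return last_error_;
  }
  int errors_count() const { return errors_count_.load(std::memory_order_relaxed); }

 private:
  mutable std::mutex state_mu_;
  mutable std::mutex error_mu_;
  MigrationState state_ = MigrationState::C_SYNC;
  std::string last_error_;
  std::atomic<int> errors_count_{0};
};
```

If both fields are always updated together, a single mutex guarding `state_` and `last_error_` would be the simpler alternative raised in the review. The sketch keeps two only to mirror the snippet under discussion.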
e9049f0 to 16a27f6
If applying a command on the incoming node would result in OOM (exceeding max_memory_limit), we close the migration and switch its state to FATAL. Signed-off-by: mkaruza <[email protected]>
16a27f6 to f080a52
7e7ebf4 to 40851f9