-
Notifications
You must be signed in to change notification settings - Fork 108
[server] Add a heartbeat lag based replica auto resubscribe feature #2269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
642919d to
09ab18a
Compare
haoxu07
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for preparing this! Left a few comments.
internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java
Show resolved
Hide resolved
...ts/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java
Outdated
Show resolved
Hide resolved
...ts/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTask.java
Outdated
Show resolved
Hide resolved
| * We will delegate to KafkaConsumerService to determine whether it takes this request as it has information. | ||
| */ | ||
| if (getServerConfig().isLagBasedReplicaAutoResubscribeEnabled() && TimeUnit.SECONDS | ||
| .toMillis(getServerConfig().getLagBasedReplicaAutoResubscribeThresholdInSeconds()) < lag) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we only concern lag here, should we consider both lag and very few records polled from consumer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it is possible that underlying resubscribe operationh is very slow on waiting for the drainer to be cleared for specific partition, can we add some safeguard here to avoid enqueuing too many partititons into SIT resubscribe queue? We could check the SIT PSC # to prevent put too many partitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced offline - we have one config to control each replica's resubscribe interval - it will make sure resubscribe behavior does not happen too frequent
d3a410b to
a322c10
Compare
[server] Add a heartbeat lag based replica auto resubscribe feature
This is an optional feature that aims to improve consumption resiliency. When we observe current version resource heartbeat delay grows beyond certain threshold, it could potentially be that underlying consumer is not functioning well. While we are also working on figuring out the root cause of that, this feature aims to mitigate it in the runtime by asking the SIT to try to perform a resubscribe action that brings the replica to another consumer in the same consumer pool.
Solution
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
Work in progress
Does this PR introduce any user-facing or breaking changes?