-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats #85
base: master
Are you sure you want to change the base?
Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats #85
Conversation
I successfully implemented this check. The one negative is that you are required to do a func (p *sidekiqChecker) Check(ctx context.Context) error {
stats, err := p.manager.GetStats()
if err != nil {
return err
}
if !stats.HeartbeatLastPushedAt.IsZero() && time.Now().Add(time.Second*-30).After(stats.HeartbeatLastPushedAt) {
msg := fmt.Sprintf("heartbeat last pushed at time is too old, and probably lost redis connection. heartBeatLastPushedAt %s", stats.HeartbeatLastPushedAt)
log.Println(msg)
return errors.New(msg)
} else {
return nil
}
} |
Thank you for the changes, really solid update. However, have a question about the commented out code. |
} | ||
} | ||
|
||
//expireTS := heartbeatTime.Add(-m.opts.Heartbeat.HeartbeatTTL).Unix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, what should be done with the code that was commented out? Do you recommend to move that into an new afterheartbeathook
? Without that code work in progress from workers with expired heartbeats will never be enqueued back for processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to separate out the visual heartbeat ui code from the recoverable heartbeat (eg, what a node is working on) .
From what I've observed in the Sidekiq ruby behavior. (At least v6)
-
Heartbeats are for visual presentation of what a node is working on and utilize redis TTL expiration, and AFAIK are never read for consumption and resumption. Utilizing this in a mixed ruby/sidekiq environment is very useful for at-a-glance observability.
-
A specially named queue is created by each node to contain the in-progress work that is looked for upon boot, in order to properly recover a killed process. BRPOPLPUSH now BLMOVE, should be utilized to move the item from the queue into the specific worker's queue, after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted (along with the base assumption that all jobs are idempotent). This is where the additional logic that was attached to the heartbeat code could live. I will need to reread what the goals of the heartbeat hook logic are intended for (as my current mental model is more event based then poll based).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for looking into Sidekiq's behavior. When you mention "after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted," does this mean that work will resume when the corresponding manager of the queue is restarted? We have a use case where we have a cluster of managers, and if a given manager goes down with work in its active queue for a prolonged period of time, we would like for that work that to be picked up by another manager.
For what it's worth, our current stance is we are okay with deviating from an exact port of Sidekiq, and this is why we introduced a use case for reading the heartbeat. That said, if there are things we can do to maintain compatibility with the Sidekiq UI, we agree that's preferable. Is there anything with the new heartbeat changes, such as expiring via polling instead of TTL, which breaks compatibility with the UI somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand of how sidekiq works:
The current jobs that a given process is processing should be atomically moved into another data structure (not around heartbeat) that can be read upon the process being restarted.
Having the heartbeat separated from the manager's queue allows you to check for orphaned jobs easier, because if the heartbeat has expired, but a cron process discovers a LIST
of jobs in memory, you can assume that those jobs should be restarted by another manager.
This also has the added benefit of maintaining compatibility with the sidekiq UI, as current jobs for a given process do not live in the same key as the heartbeat and current job details.
Looking into the docs around super_fetch
there is logic around orphaned job discovery and reprocessing. I believe it will do a SCAN
looking for job sets for any host where the accompanying heartbeat has expired (by ttl), as this indicates that there are orphaned jobs to be picked up.
https://github.com/sidekiq/sidekiq/wiki/Reliability#recovering-jobs
I can refactor this to a more middle ground approach, but my end goal is to allow for discovery of orphaned jobs from persistent hosts (same name upon restart) along with dynamically named hosts (eg: replicas in k8s).
This MR does two things.
HeartBeatLastPushedAt
to manager stats so you can build a liveness probe in case the worker process crashes.Note:
I've commented out some code that reads the heartbeats, from my observations of sidekiq.rb's heartbeat code it is never read, just pushed to. If the recent code that was added that modifies heartbeats to be read is necessary, I would recommend it being moved into a different data structure.