Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats #85

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

skunkworker
Copy link
Contributor

@skunkworker skunkworker commented Apr 18, 2024

This MR does two things.

  1. Fixes compatibility with the sidekiq web UI, heartbeats were not autoexpiring, and jobs for an individual worker were not being set to the correct key.
  2. Adds HeartBeatLastPushedAt to manager stats so you can build a liveness probe in case the worker process crashes.

Note:
I've commented out some code that reads the heartbeats, from my observations of sidekiq.rb's heartbeat code it is never read, just pushed to. If the recent code that was added that modifies heartbeats to be read is necessary, I would recommend it being moved into a different data structure.

@skunkworker skunkworker changed the title add heartbeat last pushed at to Stats Add HeartBeatLastPushedAt to Manager Stats Apr 18, 2024
@skunkworker
Copy link
Contributor Author

I successfully implemented this check. The one negative is that you are required to do a IsZero() check as it is initialized but would be incorrect to set it to Now() upon initialization.

func (p *sidekiqChecker) Check(ctx context.Context) error {
	stats, err := p.manager.GetStats()
	if err != nil {
		return err
	}
	if !stats.HeartbeatLastPushedAt.IsZero() && time.Now().Add(time.Second*-30).After(stats.HeartbeatLastPushedAt) {
		msg := fmt.Sprintf("heartbeat last pushed at time is too old, and probably lost redis connection. heartBeatLastPushedAt %s", stats.HeartbeatLastPushedAt)
		log.Println(msg)

		return errors.New(msg)
	} else {
		return nil
	}
}

@skunkworker skunkworker changed the title Add HeartBeatLastPushedAt to Manager Stats Fix Heartbeat compatibility, Add HeartBeatLastPushedAt to Manager Stats Apr 19, 2024
@stefannegrea
Copy link
Collaborator

Thank you for the changes, really solid update. However, have a question about the commented out code.

}
}

//expireTS := heartbeatTime.Add(-m.opts.Heartbeat.HeartbeatTTL).Unix()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, what should be done with the code that was commented out? Do you recommend to move that into an new afterheartbeathook? Without that code work in progress from workers with expired heartbeats will never be enqueued back for processing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to separate out the visual heartbeat ui code from the recoverable heartbeat (eg, what a node is working on) .

From what I've observed in the Sidekiq ruby behavior. (At least v6)

  • Heartbeats are for visual presentation of what a node is working on and utilize redis TTL expiration, and AFAIK are never read for consumption and resumption. Utilizing this in a mixed ruby/sidekiq environment is very useful for at-a-glance observability.

  • A specially named queue is created by each node to contain the in-progress work that is looked for upon boot, in order to properly recover a killed process. BRPOPLPUSH now BLMOVE, should be utilized to move the item from the queue into the specific worker's queue, after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted (along with the base assumption that all jobs are idempotent). This is where the additional logic that was attached to the heartbeat code could live. I will need to reread what the goals of the heartbeat hook logic are intended for (as my current mental model is more event based then poll based).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for looking into Sidekiq's behavior. When you mention "after which if the process is killed, the specific queue is read upon boot and present items are assumed to be incomplete and should be restarted," does this mean that work will resume when the corresponding manager of the queue is restarted? We have a use case where we have a cluster of managers, and if a given manager goes down with work in its active queue for a prolonged period of time, we would like for that work that to be picked up by another manager.

For what it's worth, our current stance is we are okay with deviating from an exact port of Sidekiq, and this is why we introduced a use case for reading the heartbeat. That said, if there are things we can do to maintain compatibility with the Sidekiq UI, we agree that's preferable. Is there anything with the new heartbeat changes, such as expiring via polling instead of TTL, which breaks compatibility with the UI somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand of how sidekiq works:

The current jobs that a given process is processing should be atomically moved into another data structure (not around heartbeat) that can be read upon the process being restarted.
Having the heartbeat separated from the manager's queue allows you to check for orphaned jobs easier, because if the heartbeat has expired, but a cron process discovers a LIST of jobs in memory, you can assume that those jobs should be restarted by another manager.

This also has the added benefit of maintaining compatibility with the sidekiq UI, as current jobs for a given process do not live in the same key as the heartbeat and current job details.

Looking into the docs around super_fetch there is logic around orphaned job discovery and reprocessing. I believe it will do a SCAN looking for job sets for any host where the accompanying heartbeat has expired (by ttl), as this indicates that there are orphaned jobs to be picked up.

https://github.com/sidekiq/sidekiq/wiki/Reliability#recovering-jobs

I can refactor this to a more middle ground approach, but my end goal is to allow for discovery of orphaned jobs from persistent hosts (same name upon restart) along with dynamically named hosts (eg: replicas in k8s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants