Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workaround for 90 second LX systemd stop/reboot delay #1022

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

smokris
Copy link

@smokris smokris commented Apr 7, 2022

  • When shutting down a SmartOS LX zone, systemd-shutdown's main() calls broadcast_signal() first with SIGTERM, then with SIGKILL.
  • At that point, the root cgroup becomes empty.

This PR modifies vmadm to send the systemd-shutdown process SIGCHLD periodically during a vmadm stop or vmadm reboot command, which causes systemd-shutdown to re-check its list of remaining processes, thereby removing the unnecessary delay.

Prior to this change, vmadm stop and vmadm reboot were taking slightly more than 90 seconds. With this change, each takes about 10 seconds.

See discussion and @danmcd's initial testing on https://smartos.topicbox.com/groups/smartos-discuss/T7eaf4fad91e64ad8-M2e374d3cdfd9366191bc09b7.

On SmartOS LX, systemd uses its "legacy cgroup hierarchy" (cgroup-v1)
codepath, which does not signal when the root cgroup becomes empty,
causing it to wait 90 seconds before continuing to shut down.

This commit modifies vmadm to send the `systemd-shutdown` process
SIGCHLD every second during a `vmadm stop` or `vmadm reboot` command,
which causes `systemd-shutdown` to re-check its list of remaining
processes, which removes the unnecessary wait.
@danmcd danmcd self-assigned this Apr 7, 2022
Copy link
Contributor

@danmcd danmcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't studied vmadm(1M) much, but how are you sure that vmobj.pid is the pid for systemd-shutdown? What if you're sending SIGCHLD to something else? What about older or alternate LX images that don't have modern systemd?

@smokris
Copy link
Author

smokris commented Apr 7, 2022

Thanks for the feedback, @danmcd.

how are you sure that vmobj.pid is the pid for systemd-shutdown?

If I understand correctly, vmobj.pid is the PID of the init process for the zone.

During shutdown, systemd replaces the init process image with systemd-shutdown (specifically, it calls execve() without a fork()), reusing the same PID.

What if you're sending SIGCHLD to something else? What about older or alternate LX images that don't have modern systemd?

Ah, good point. To reduce that risk, I could have it check the process's comm and only send the signal if it's systemd-shutdown.

Copy link
Contributor

@danmcd danmcd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my meandering into procfs.

You should test this on multiple kinds of LX Zone images. I'm happy to build a test platform image for you that has this fix in it, OR you can use lofs tricks to mount yours over the one in use on the system.

Also, and maybe @bahamat might know, there may be a corresponding code segment in VMAPI that needs to have this as well. (I can't remember if it's duplicate code or if VMAPI direclty uses files here.)

@danmcd
Copy link
Contributor

danmcd commented Apr 7, 2022

@bahamat and I had a chat offline... we're going to look a little deeper into this problem. Your analysis of how systemd-shutdown behaves is EXCELLENT, but we're wondering if this fix belongs somewhere lower in the stack, so zoneadm(8) can DTRT as well.

@smokris
Copy link
Author

smokris commented Apr 7, 2022

DAMN...

@danmcd, heh, no problem. (I had first looked at /proc/PID/argv, then realized /proc also provided a symlink to the binary and figured it would be slightly simpler/less-error-prone to check that link, rather than parsing argv.)

I've been testing on PI 20220324T002253Z using a lofs bind mount to override VM.js. So far I've tested successfully with a modified image 3dbbdcca-2eab-11e8-b925-23bf77789921 (CentOS 7, which I updated to CentOS Stream 8 running systemd-v239).

we're going to look a little deeper into this problem

Sounds good; keep me posted.

@danmcd
Copy link
Contributor

danmcd commented Apr 11, 2022

I'm not sure if this requires fixes in both VM.js and in zone brand scripts. This one may get delayed for a bit; I'm sorry.

@bahamat
Copy link
Member

bahamat commented Apr 11, 2022

I think if zoneadm handles this properly, vmadm should be fine as-is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants