Reboot hangs for 10 minutes when VM does not have external IP address #483

Open
jpalermo opened this issue Jan 24, 2025 · 6 comments

@jpalermo

This is a stock Ubuntu 22.04 image on an e2-micro with no external IP. I SSHed into it and ran sudo reboot.

The VM was using the latest package, 20241011.01-0ubuntu1~22.04.0.

sudo reboot hangs for ~10 minutes before the system shuts down.

This behavior was not seen on 20240716.00-0ubuntu1~22.04.0.

If you give the VM an ephemeral external IP address, the behavior is not seen.

Here is a syslog fragment with a grep for "google". The reboot started at 22:20:04, but the Google guest services didn't finish shutting down until 22:30:05.

Jan 24 22:11:30 reboot-test google_guest_agent[662]: Updating keys for user root.
Jan 24 22:20:04 reboot-test systemd[1]: google-oslogin-cache.timer: Deactivated successfully.
Jan 24 22:20:05 reboot-test systemd[1]: google-osconfig-agent.service: Deactivated successfully.
Jan 24 22:20:05 reboot-test systemd[1]: google-osconfig-agent.service: Consumed 1.480s CPU time.
Jan 24 22:20:05 reboot-test google_guest_agent[662]: ERROR metadata.go:68 Error watching metadata: context canceled
Jan 24 22:20:05 reboot-test google_guest_agent[662]: GCE Agent Stopped
Jan 24 22:20:05 reboot-test google_metadata_script_runner[2370]: Starting shutdown scripts (version 20241011.01-0ubuntu1~22.04.0).
Jan 24 22:20:05 reboot-test google_metadata_script_runner[2370]: No shutdown scripts to run.
Jan 24 22:20:20 reboot-test google_guest_agent[662]: CRITICAL main.go:310 error registering service: failed to shutdown within timeout 15s
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: State 'stop-sigterm' timed out. Killing.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 662 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 670 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 671 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 673 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 760 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Killing process 761 (google_guest_ag) with signal SIGKILL.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Main process exited, code=killed, status=9/KILL
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Failed with result 'timeout'.
Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: Consumed 2.762s CPU time.
Jan 24 22:30:05 reboot-test systemd[1]: google-shutdown-scripts.service: Deactivated successfully.
Jan 24 22:30:25 reboot-test systemd[1]: Mounting Mount unit for google-cloud-cli, revision 297...
Jan 24 22:30:25 reboot-test systemd[1]: Mounted Mount unit for google-cloud-cli, revision 297.
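
The delays in the fragment above can be read off the timestamps directly. The following is a small sketch that does the arithmetic on three lines copied from the log; the helper names (`parse_ts`, `gap_seconds`) are made up for illustration, and the year 2025 is assumed from the issue date because syslog timestamps omit it.

```python
from datetime import datetime

# Three lines copied verbatim from the syslog fragment above.
REBOOT_START = "Jan 24 22:20:04 reboot-test systemd[1]: google-oslogin-cache.timer: Deactivated successfully."
AGENT_KILLED = "Jan 24 22:21:35 reboot-test systemd[1]: google-guest-agent.service: State 'stop-sigterm' timed out. Killing."
SCRIPTS_DONE = "Jan 24 22:30:05 reboot-test systemd[1]: google-shutdown-scripts.service: Deactivated successfully."


def parse_ts(line, year=2025):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (year is an assumption)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gap_seconds(first, last):
    """Seconds elapsed between two syslog lines."""
    return (parse_ts(last) - parse_ts(first)).total_seconds()


agent_gap = gap_seconds(REBOOT_START, AGENT_KILLED)
total_gap = gap_seconds(REBOOT_START, SCRIPTS_DONE)
print(f"guest agent killed after {agent_gap:.0f} s")
print(f"shutdown scripts unit finished after {total_gap:.0f} s")
```

On these lines the guest agent is killed 91 seconds after the reboot starts, while google-shutdown-scripts.service only deactivates after 601 seconds, which accounts for the roughly 10 minute hang.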

Full copy of the syslog is attached.

syslog.log

@ChaitanyaKulkarni28
Member

@jpalermo thanks for reporting. The issue is that the agent tries to flush buffered logs to Cloud Logging on exit, and blocks if there's no network. Unfortunately, the Cloud Logging library retries the flush indefinitely, so we see the hang until the systemd timeout (10 mins). This was fixed in the 20250116.00 version of guest-agent, but that version hasn't made it into Canonical images yet. They're working on picking up the new agent version.

In the meantime, if this is impacting you, you can disable Cloud Logging in the guest agent to prevent this from happening again. Setting cloud_logging_enabled = false should disable Cloud Logging in the agent. See this for more information on how it's set.
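
To illustrate the failure mode described above (a flush on exit that never returns without network), here is a minimal Python sketch of the general technique of bounding a blocking call with a deadline. It is not the guest agent's code and not the fix shipped in 20250116.00; `flush_with_deadline` and `blocked_flush` are hypothetical names used only for this example.

```python
import threading


def flush_with_deadline(flush, timeout_s):
    """Run flush() in a daemon thread; return True only if it finished in time."""
    done = threading.Event()

    def worker():
        flush()
        done.set()

    # A daemon thread does not keep the process alive if flush() never returns.
    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout_s)


def blocked_flush():
    """Hypothetical stand-in for a log flush that retries forever without network."""
    threading.Event().wait()


finished = flush_with_deadline(blocked_flush, timeout_s=0.2)
print("flush finished in time:", finished)
```

With a deadline, the caller gives up on buffered logs after a bounded wait and exits, rather than holding up shutdown until systemd kills the process.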

@jpalermo
Author

Thanks for the info. We tried to disable cloud logging but ran into the same issue. Maybe we're doing something wrong though. We modified /etc/default/instance_configs.cfg.template and here's the full contents of that file:

[Daemons]
accounts_daemon = true
clock_skew_daemon = true
ip_forwarding_daemon = true
[InstanceSetup]
network_enabled = true
optimize_local_ssd = true
set_host_keys = false
shutdown = false
startup = false
[Core]
cloud_logging_enabled = false
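
As a quick sanity check that the file above is well-formed and that the flag reads as false, it can be parsed with Python's standard configparser. This is only a sketch: it confirms what the file says, not whether the installed agent or metadata script runner honors the setting. The fallback of True for a missing key is an assumption made for this example.

```python
import configparser

# Contents of the file exactly as posted above.
CONFIG_TEXT = """\
[Daemons]
accounts_daemon = true
clock_skew_daemon = true
ip_forwarding_daemon = true
[InstanceSetup]
network_enabled = true
optimize_local_ssd = true
set_host_keys = false
shutdown = false
startup = false
[Core]
cloud_logging_enabled = false
"""


def cloud_logging_enabled(text):
    """Return the [Core] cloud_logging_enabled value from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Assumption: treat a missing key or section as enabled.
    return parser.getboolean("Core", "cloud_logging_enabled", fallback=True)


print("cloud logging enabled:", cloud_logging_enabled(CONFIG_TEXT))
```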

@ChaitanyaKulkarni28
Member

@jpalermo sorry for the late response. The configuration file looks fine. However, the agent needs to be restarted for any changes to take effect. After the initial restart, subsequent restarts should be quick.

@jpalermo
Author

jpalermo commented Feb 3, 2025

That's the configuration file we're building into the VM image, so it's there from boot up and we still see the problem.

@ChaitanyaKulkarni28
Member

I missed the fact that Canonical has not yet picked up the changes for the cloud_logging_enabled flag. The change was required in both the guest agent and the metadata script runner. The guest-agent change was merged earlier and has been released, but the metadata script runner change has not (it was merged in Dec 2024, and the latest version they have is from Oct 2024). So even with this flag set, the guest agent would shut down quickly but the metadata script runner would not, causing the same issue. The metadata script runner is run during shutdown and hangs on exit even if no scripts were passed.

We're working with Canonical on releasing a new version soon. I'll update here once it's available.

@jpalermo
Author

jpalermo commented Feb 3, 2025

Thanks for the update. We'll just pin back to the previous agent version for now if it starts causing us major problems.
