On a test server we had 3 deployments (1 pinned) and deleted 1, then 27s later we got a kernel crash (Kernel panic - not syncing: Fatal exception):
Mar 06 14:27:27 appliance-13214 systemd[1]: Starting Update kargs if needed...
Mar 06 14:27:27 appliance-13214 appliance-kargs[2783]: Removing deployment 1 (id: os-68eeb6d1db3e002e890b57cddb76874ce6b3e60f170cd486eb7fd8bcd36569d6.0, version 2.0.1-test1)
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting syncfs() for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed syncfs() for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting freeze/thaw cycle for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed freeze/thaw cycle for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting global sync()
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed global sync()
Mar 06 14:27:29 appliance-13214 ostree[2983]: Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
Mar 06 14:27:29 appliance-13214 appliance-kargs[2983]: Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
Mar 06 14:27:36 appliance-13214 appliance-kargs[2983]: Deleted deployment 68eeb6d1db3e002e890b57cddb76874ce6b3e60f170cd486eb7fd8bcd36569d6.0
Mar 06 14:27:37 appliance-13214 appliance-kargs[2783]: Unpinning deployment 1 (id: os-ba3e0bb90f8f8c6c373008558d6f66c71d3ee2274f4e44a6600c5138cba52c80.3, version: 2.4.0-rc1)
Mar 06 14:27:37 appliance-13214 appliance-kargs[3215]: Deployment 1 is now unpinned
Mar 06 14:27:37 appliance-13214 appliance-kargs[2783]: No kargs change needed
Mar 06 14:27:37 appliance-13214 systemd[1]: kargs.service: Deactivated successfully.
Mar 06 14:27:37 appliance-13214 systemd[1]: Finished Update kargs if needed.
Mar 06 14:28:03 appliance-13214 kernel: Kernel panic - not syncing: Fatal exception
On the following boot the system went into emergency mode: ostree-prepare-root couldn't find /sysroot//ostree/boot.0/... while /proc/cmdline had ... ostree=/ostree/boot.0/... A second reboot didn't fix the issue.
After mounting /boot to inspect it (I didn't take a screenshot, but I bet the mount replayed the XFS journal), the 2 grub config files were pointing to boot.1 and boot.1.1; there was no grub config with boot.0.
Simply rebooting after having mounted the filesystem once was enough for the system to boot properly on the next boot.
Looking at write_deployments_bootswap:
ostree/src/libostree/ostree-sysroot-deploy.c
Lines 2447 to 2451 in 8df797d
if (!full_system_sync (self, out_syncstats, cancellable, error))
  return FALSE;
if (!swap_bootloader (self, bootloader, self->bootversion, new_bootversion, cancellable, error))
  return FALSE;
We flush everything before doing the swap, but we do not flush again after the swap, so the swap itself may still sit only in the filesystem journal when the kernel crashes.
I think the fix is just to use fsfreeze_thaw_cycle in swap_bootloader instead of just fsync():
ostree/src/libostree/ostree-sysroot-deploy.c
Lines 2228 to 2238 in 8df797d
/* Now we explicitly fsync this directory, even though it
 * isn't required for atomicity, for two reasons:
 *  - It should be very cheap as we're just syncing whatever
 *    data was written since the last sync which was hopefully
 *    less than a second ago.
 *  - It should be sync'd before shutdown as that could crash
 *    for whatever reason, and we wouldn't want to confuse the
 *    admin by going back to the previous session.
 */
if (fsync (sysroot->boot_fd) != 0)
  return glnx_throw_errno_prefix (error, "fsync(boot)");