swap_bootloader not crash resistant #3392

@champtar

On a test server we had 3 deployments (1 pinned) and deleted 1. 27 seconds later the kernel crashed (Kernel panic - not syncing: Fatal exception):

Mar 06 14:27:27 appliance-13214 systemd[1]: Starting Update kargs if needed...
Mar 06 14:27:27 appliance-13214 appliance-kargs[2783]: Removing deployment 1 (id: os-68eeb6d1db3e002e890b57cddb76874ce6b3e60f170cd486eb7fd8bcd36569d6.0, version 2.0.1-test1)
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting syncfs() for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed syncfs() for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting freeze/thaw cycle for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed freeze/thaw cycle for system root
Mar 06 14:27:29 appliance-13214 ostree[2983]: Starting global sync()
Mar 06 14:27:29 appliance-13214 ostree[2983]: Completed global sync()
Mar 06 14:27:29 appliance-13214 ostree[2983]: Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
Mar 06 14:27:29 appliance-13214 appliance-kargs[2983]: Bootloader updated; bootconfig swap: yes; bootversion: boot.1.1, deployment count change: -1
Mar 06 14:27:36 appliance-13214 appliance-kargs[2983]: Deleted deployment 68eeb6d1db3e002e890b57cddb76874ce6b3e60f170cd486eb7fd8bcd36569d6.0
Mar 06 14:27:37 appliance-13214 appliance-kargs[2783]: Unpinning deployment 1 (id: os-ba3e0bb90f8f8c6c373008558d6f66c71d3ee2274f4e44a6600c5138cba52c80.3, version: 2.4.0-rc1)
Mar 06 14:27:37 appliance-13214 appliance-kargs[3215]: Deployment 1 is now unpinned
Mar 06 14:27:37 appliance-13214 appliance-kargs[2783]: No kargs change needed
Mar 06 14:27:37 appliance-13214 systemd[1]: kargs.service: Deactivated successfully.
Mar 06 14:27:37 appliance-13214 systemd[1]: Finished Update kargs if needed.
Mar 06 14:28:03 appliance-13214 kernel: Kernel panic - not syncing: Fatal exception

On the following boot the system went into emergency mode: ostree-prepare-root couldn't find /sysroot//ostree/boot.0/....
/proc/cmdline had ... ostree=/ostree/boot.0/..., and a second reboot didn't fix the issue.
After mounting /boot to inspect it (I didn't take a screenshot, but I bet the mount replayed the XFS journal), the 2 GRUB config files pointed to boot.1 and boot.1.1; there was no GRUB config with boot.0.
Mounting the filesystem once and then rebooting was enough for the system to boot properly.

Looking at write_deployments_bootswap:

if (!full_system_sync (self, out_syncstats, cancellable, error))
  return FALSE;
if (!swap_bootloader (self, bootloader, self->bootversion, new_bootversion, cancellable, error))
  return FALSE;

We flush everything before doing the swap, but we do not flush again after the swap.

I think the fix is just to use fsfreeze_thaw_cycle in swap_bootloader instead of a plain fsync():

/* Now we explicitly fsync this directory, even though it
 * isn't required for atomicity, for two reasons:
 *  - It should be very cheap as we're just syncing whatever
 *    data was written since the last sync which was hopefully
 *    less than a second ago.
 *  - It should be sync'd before shutdown as that could crash
 *    for whatever reason, and we wouldn't want to confuse the
 *    admin by going back to the previous session.
 */
if (fsync (sysroot->boot_fd) != 0)
  return glnx_throw_errno_prefix (error, "fsync(boot)");
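
For reference, a freeze/thaw cycle on a directory fd can be sketched in plain C as below. This is only a minimal sketch, not ostree's actual fsfreeze_thaw_cycle: the name freeze_thaw_or_syncfs and the set of errno values that trigger the syncfs() fallback are assumptions made for illustration. FIFREEZE and FITHAW are the Linux ioctls from <linux/fs.h>; they need CAP_SYS_ADMIN and a filesystem that supports freezing.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>     /* syncfs */

/* Hypothetical helper (not ostree API): freeze and immediately thaw the
 * filesystem containing dfd. If freezing is unavailable, fall back to
 * syncfs(). Returns 0 on success, -1 with errno set on failure. */
int
freeze_thaw_or_syncfs (int dfd)
{
  if (ioctl (dfd, FIFREEZE, 0) != 0)
    {
      /* Assumption: treat "unsupported", "not permitted" and
       * "already frozen" as reasons to fall back to syncfs(). */
      if (errno == EOPNOTSUPP || errno == ENOTTY || errno == EPERM || errno == EBUSY)
        return syncfs (dfd);
      return -1;
    }

  /* The filesystem is frozen at this point; thaw it right away so
   * writers are blocked only briefly. */
  if (ioctl (dfd, FITHAW, 0) != 0)
    return -1;

  return 0;
}
```

In swap_bootloader this would be called with sysroot->boot_fd in place of the fsync() above. A real implementation would also want to guarantee the thaw even if the process dies while the filesystem is frozen, which this sketch does not do.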
