Document how to configure kdump on CoreOS #28164

Merged
merged 1 commit into from
Jan 18, 2021


kelvinfan001
Contributor

RHCOS 4.7 includes kexec-tools (required for kdump) so
investigating kernel crashes through kdump is now supported.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 16, 2020
Member

@travier travier left a comment

Thanks, this looks good.

As we expect administrators to use MCs instead of relying on SSH for node setup, I'm wondering if we should first provide a full example using a MachineConfig with a filter to select only one or some of the nodes and keep the manual setup instructions in a second section.

See the docs for kargs changes for MC examples.

@cgwalters
Member

> As we expect administrators to use MCs instead of relying on SSH for node setup,

Yes...but I think it will be common for admins to want to enable kdump on just one node (or just a subset), and we don't support machine specific machineconfigs yet.

I think this is probably OK for now; an admin who actually wants to enable kdump on multiple nodes could indeed use a MachineConfig, and it'd probably be worth at least mentioning that.

@kelvinfan001
Contributor Author

> an admin who actually wants to enable kdump on multiple nodes could indeed use a MachineConfig, and it'd probably be worth at least mentioning that.

I'll mention this, but since admins are unlikely to want kdump enabled on all nodes, I'll omit the full MachineConfig example for now, especially since better kdump support through FCC is coming soon.

@travier
Member

travier commented Dec 18, 2020

This is planned for 4.7.

@cynepco3hahue

I tried to enable it and got an error:

Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Starting Crash recovery kernel arming...
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com kdumpctl[59248]: No kdump initial ramdisk found.
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com kdumpctl[59248]: Rebuilding /boot(hd0,gpt3)/ostree/rhcos-912fd9f507a1f7eb885c1c86689d8df3a72d383dcaada7254789a43fe1d7be87/initramfs-4.18.0-240.8.1.el8_3.x86_64kdump.img
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com kdumpctl[59248]: /boot(hd0,gpt3)/ostree/rhcos-912fd9f507a1f7eb885c1c86689d8df3a72d383dcaada7254789a43fe1d7be87 does not have write permission. Can not rebuild /boot(hd0,gpt3)/ostree/rhcos-912fd9f507a1f7eb885c1c86689d8df3a72d383dcaada7254789a43fe1d7be87/initramfs-4.18.0-240.8.1.el8_3.x86_64kdump.img
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com kdumpctl[59248]: Starting kdump: [FAILED]
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: kdump.service: Main process exited, code=exited, status=1/FAILURE
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: kdump.service: Failed with result 'exit-code'.
Dec 30 13:55:08 cnfdd3.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: Failed to start Crash recovery kernel arming.

@kelvinfan001
Contributor Author

@cynepco3hahue Did you set the KDUMP_BOOTDIR variable in /etc/sysconfig/kdump manually?

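# Derive the ostree boot directory (where vmlinuz lives) from the kernel command line.
BOOT_LOC=/boot$(grep -Eo "/ostree/.*/vmlinuz" /proc/cmdline | sed -e "s|/vmlinuz||g")
# Uncomment KDUMP_BOOTDIR and point it at that directory.
sudo sed -i "s|^#KDUMP_BOOTDIR=\"/boot\"|KDUMP_BOOTDIR=\"${BOOT_LOC}\"|" /etc/sysconfig/kdump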

This is currently required on RHCOS. It will no longer be required once kexec-tools-2.0.20-35.el8 or later lands. Related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1866611
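
For illustration, the boot-directory derivation above can be exercised against a sample kernel command line without touching a real node. This is only a sketch: `derive_boot_loc` is a hypothetical helper name, and the sample command line is an assumed shape that reuses the deployment hash from the log earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: derive the ostree boot directory from a kernel
# command line string, mirroring the grep/sed pipeline quoted above.
derive_boot_loc() {
    cmdline=$1
    # Keep the span from "/ostree/" through "/vmlinuz", then drop "/vmlinuz".
    boot_loc=/boot$(printf '%s\n' "$cmdline" \
        | grep -Eo '/ostree/.*/vmlinuz' \
        | sed -e 's|/vmlinuz||g')
    printf '%s\n' "$boot_loc"
}

# Sample command line (assumed shape, not captured from a real node).
sample='BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-912fd9f507a1f7eb885c1c86689d8df3a72d383dcaada7254789a43fe1d7be87/vmlinuz-4.18.0-240.8.1.el8_3.x86_64 crashkernel=256M'
derive_boot_loc "$sample"
```

Passing the command line as an argument instead of reading `/proc/cmdline` directly makes the pattern easy to check on any machine before running the real `sed` against `/etc/sysconfig/kdump`.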

@cynepco3hahue

@kelvinfan001 Thanks for the information!

@travier
Member

travier commented Jan 5, 2021

@cynepco3hahue Can you confirm that the documented steps work for you? Thanks

@kelvinfan001
Contributor Author

@openshift/team-documentation


. Ensure that `kdump` has loaded a crash kernel by checking that `kdump.service` has started and exited successfully and that `cat /sys/kernel/kexec_crash_loaded` prints `1`.

=== Enabling kdump on day-1
Contributor

I would consider making this a new module, especially if it ends up with additional procedures.

== Testing the kdump configuration

ifdef::openshift-enterprise[]
Please refer to the link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/system_design_guide/installing-and-configuring-kdump_system-design-guide#testing-the-kdump-configuration_installing-and-configuring-kdump["Testing the kdump configuration" section] over at the {op-system-base} documentation for `kdump`.
Contributor

The following apply throughout:

s/Please refer to the/See

s/over at the/in the

No quotations needed for titles

@bmcelvee
Contributor

@kelvinfan001 thank you, this is looking great! I did a first pass review and left some suggestions. Also tagging @bobfuru since he works directly with CoreOS content.

@bobfuru
Contributor

bobfuru commented Jan 12, 2021

Added a few more comments and agree with suggestions from @bmcelvee. This is great work, @kelvinfan001 - thank you!

@kelvinfan001
Contributor Author

Thank you @bmcelvee and @bobfuru for the review! I've made the suggested changes.

. Ensure that `kdump` has loaded a crash kernel by checking that `kdump.service` has started and exited successfully and that `cat /sys/kernel/kexec_crash_loaded` prints `1`.

== Enabling kdump on day-1
`kdump` is intended to be enabled per-node to debug kernel problems. It is not recommended to enable `kdump` on all of your nodes in the cluster. Although machine-specific `MachineConfigs` are not yet supported, it is possible to do the above through a systemd unit in a `MachineConfig` object on day-1 and have kdump enabled on all nodes in the cluster. You can create a `MachineConfig` object and inject that object into the set of manifest files used by Ignition during cluster setup. See "Customizing nodes" in the _Installing -> Installation configuration_ section for more information and examples on how to use Ignition configs.
Contributor

@bobfuru bobfuru Jan 14, 2021

A few suggestions in this paragraph:

  • So that the sentence doesn't start lowercase: s/kdump is intended/The kdump service is intended/
  • s/Although machine-specific MachineConfigs/Although machine-specific machine configs/ (do not pluralize an object ref, according to docs guidelines)
  • s/it is possible to do the above through a systemd unit/you can perform the previous step through a systemd unit/
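
As a rough illustration of the day-1 approach described in the paragraph under review, a machine config along the following lines could be written out and added to the installation manifests. This is a sketch, not the content of this PR: the object name `99-worker-kdump`, the `worker` role label, the Ignition version `3.1.0`, and the helper name `write_kdump_mc` are all assumptions to check against the target cluster version. It also leaves out the `KDUMP_BOOTDIR` workaround discussed earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: write a MachineConfig manifest that reserves crash
# kernel memory and enables kdump.service. Names and versions in the
# manifest are illustrative assumptions, not values from this PR.
write_kdump_mc() {
    out=$1
    cat > "$out" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-kdump
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - crashkernel=256M
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: kdump.service
          enabled: true
EOF
}

write_kdump_mc ./99-worker-kdump.yaml
```

Because the role label selects a whole pool, this enables kdump on every node in that pool, which matches the caveat in the paragraph that machine-specific machine configs are not yet supported.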

+
[source, terminal]
----
sudo rpm-ostree kargs --append='crashkernel=256M'
Contributor

Add $ prompts at the beginning of terminal commands (or # as root)

Contributor

To reserve memory for the crash kernel during the first kernel booting, provide kernel arguments by entering the following command:


.Procedure

The following steps are needed to enable `kdump` on {op-system}.
Contributor

Let's make this active voice. s/The following steps are needed to enable kdump on {op-system}./Perform the following steps to enable kdump on {op-system}:/

sudo rpm-ostree kargs --append='crashkernel=256M'
----
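
Once the node has rebooted with the new kernel argument, one way to confirm that the reservation was requested is to look for `crashkernel=` on the kernel command line. A minimal sketch, where `has_crashkernel` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a kernel command line file contains a
# crashkernel= argument. Defaults to the live /proc/cmdline.
has_crashkernel() {
    cmdline_file=${1:-/proc/cmdline}
    grep -Eq '(^|[[:space:]])crashkernel=[^[:space:]]+' "$cmdline_file"
}

if has_crashkernel; then
    echo "crashkernel argument present"
else
    echo "crashkernel argument missing"
fi
```

The optional file argument exists only so the check can be tried against a saved command line. On a node it would normally be called with no arguments.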

. By default, the path in which the vmcore will be saved is `/var/crash`. It is also possible to write the dump over the network or to some other location on the local system by editing `/etc/kdump.conf`. For example, assuming `/var/usrlocal/cores` exists, enter the following command to edit `/etc/kdump.conf` to save the vmcore to `/var/usrlocal/cores`:
Contributor

Tiny nit to single-space after . for list items:
s/.  By default/. By default/

Contributor Author

Ah, I wonder where I got the idea to put two spaces.

sed -i "s/^path.*/path \/var\/usrlocal\/cores/" /etc/kdump.conf
----
+
For additional information, see `kdump.conf`, a manual page for the `/etc/kdump.conf` configuration file containing the full documentation of available options, and the comments in `/etc/kdump.conf` and `/etc/sysconfig/kdump`.
Contributor

Maybe s/and the comments/and note the comments/ ?
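
The `path` edit in this step can be tried on a scratch file before changing the real configuration. The sketch below assumes GNU `sed` (for `-i`) and a file that already contains a `path` line. `set_kdump_path` is a hypothetical helper, and it uses `|` as the `sed` delimiter so the slashes in the directory need no escaping.

```shell
#!/bin/sh
# Hypothetical helper: point the "path" directive of a kdump.conf-style
# file at a new directory (requires GNU sed for in-place editing).
set_kdump_path() {
    conf=$1
    new_path=$2
    sed -i "s|^path.*|path ${new_path}|" "$conf"
}

# Try it on a scratch file rather than the real /etc/kdump.conf.
scratch=$(mktemp)
printf '# sample configuration\npath /var/crash\n' > "$scratch"
set_kdump_path "$scratch" /var/usrlocal/cores
grep '^path' "$scratch"
```

On a node, the same substitution would be run with `sudo` against `/etc/kdump.conf` after confirming that the target directory exists.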

sudo systemctl reboot
----

. Ensure that `kdump` has loaded a crash kernel by checking that `kdump.service` has started and exited successfully and that `cat /sys/kernel/kexec_crash_loaded` prints `1`.
Contributor

s/that kdump.service/that the kdump.service
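
The sysfs half of this verification step could be scripted roughly as follows. This is a sketch: `crash_kernel_loaded` is a hypothetical helper name, and it only checks the flag file, not the state of the `kdump.service` unit.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given flag file reports that a crash
# kernel is loaded (contents "1"). Defaults to the kernel's sysfs entry.
crash_kernel_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "crash kernel loaded"
else
    echo "crash kernel not loaded"
fi
```

A missing or unreadable flag file is treated the same as `0`, so the helper fails safely on systems without kexec support.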

endif::[]
* link:https://www.kernel.org/doc/html/latest/admin-guide/kdump/kdump.html[Linux kernel documentation for kdump]
* kdump.conf(5) — a manual page for the `/etc/kdump.conf` configuration file containing the full documentation of available options
* kexec(8) — a manual page for kexec
Contributor

s/kexec/`kexec`/

@bobfuru
Contributor

bobfuru commented Jan 14, 2021

Thanks for the updates, @kelvinfan001 - just a few more minor nits and otherwise LGTM.

RHCOS 4.7 includes `kexec-tools` (required for `kdump`) so
investigating kernel crashes through `kdump` is now supported.
@kelvinfan001
Contributor Author

Thanks again, @bobfuru. I've updated the PR with your additional suggestions.

@bobfuru
Contributor

bobfuru commented Jan 14, 2021

LGTM!! 👍

@bobfuru bobfuru added the peer-review-done Signifies that the peer review team has reviewed this PR label Jan 14, 2021
@bobfuru bobfuru merged commit 8320005 into openshift:master Jan 18, 2021
@bobfuru
Contributor

bobfuru commented Jan 18, 2021

/cherrypick enterprise-4.7

@openshift-cherrypick-robot

openshift-cherrypick-robot commented Jan 18, 2021

@bobfuru: new pull request created: #28657

In response to this:

/cherrypick enterprise-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
branch/enterprise-4.7 peer-review-done Signifies that the peer review team has reviewed this PR size/L Denotes a PR that changes 100-499 lines, ignoring generated files.