Skip to content
This repository has been archived by the owner on Apr 13, 2019. It is now read-only.

CONFIG_NO_HZ=y possibly broken #127

Open
sorear opened this issue Feb 15, 2018 · 8 comments
Open

CONFIG_NO_HZ=y possibly broken #127

sorear opened this issue Feb 15, 2018 · 8 comments

Comments

@sorear
Copy link
Contributor

sorear commented Feb 15, 2018

Adding that to the Fedora defconfig results in a kernel which fails to boot (hangs with high cpu usage immediately after "clocksource: Switched to clocksource riscv_clocksource").

@palmer-dabbelt
Copy link
Contributor

I tried the Fedora Kconfig and it appears to use a dummy console. If I add CONFIG_NO_HZ=y to my regular config then I can boot, but if I select CONFIG_NO_HZ_IDLE=y then I get a hang. Do you have a Kconfig that manifests your error in QEMU simulation (ie, without a PCIe graphics card)? Walking all the way across the lab to check if the machine boots is a headache :).

@atishp04
Copy link
Contributor

atishp04 commented Feb 22, 2018

For me, it booted but hangs after
"Starting logging: OK
Starting mdev...
sort: /sys/devices/platform/Fixed: No such file or directory
modprobe: can't change directory to '/lib/modules': No such file or directory
Initializing random number generator... done.
Starting network...
udhcpc (v1.24.2) started
Sending discover...
Sending select for 10.0.2.15...
Lease of 10.0.2.15 obtained, lease time 86400
deleting routers
adding dns 10.0.2.3
Starting dropbear sshd: OK"

Are you seeing the same hang ? I am seeing same hang for both
CONFIG_NO_HZ=y
or
CONFIG_NO_HZ=y
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_COMMON=y


However, it hangs during the boot for spike/smp combination (for both configuration).
[ 0.020000] clocksource: Switched to clocksource riscv_clocksource
[ 0.020000] NET: Registered protocol family 2
[ 0.020000] TCP established hash table entries: 16384 (order: 5, 131072 bytes)
[ 0.020000] TCP bind hash table entries: 16384 (order: 6, 262144 bytes)
[ 0.020000] TCP: Hash tables configured (established 16384 bind 16384)
[ 0.020000] UDP hash table entries: 1024 (order: 3, 32768 bytes)
[ 0.020000] UDP-Lite hash table entries: 1024 (order: 3, 32768 bytes)
[ 0.030000] NET: Registered protocol family 1
[ 0.050000] Unpacking initramfs...
[ 0.080000] Initialise system trusted keyrings
[ 0.080000] workingset: timestamp_bits=62 max_order=19 bucket_order=0
[ 21.020000] INFO: rcu_sched self-detected stall on CPU
[ 21.020000] INFO: rcu_sched self-detected stall on CPU
[ 21.020000] 3-...!: (698636 ticks this GP) idle=2b6/2/0 softirq=1/1 fqs=0
[ 21.020000] (t=2100 jiffies g=-299 c=-300 q=1376)
[ 21.020000] rcu_sched kthread starved for 2100 jiffies! g18446744073709551317 c18446744073709551316 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
[ 21.020000] rcu_sched I 0 8 2 0x00000000
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] __schedule+0x1c6/0x4ea
[ 21.020000] [< (ptrval)>] schedule+0x16/0x58
[ 21.020000] [< (ptrval)>] schedule_timeout+0x12a/0x248
[ 21.020000] [< (ptrval)>] rcu_gp_kthread+0x31e/0x6cc
[ 21.020000] [< (ptrval)>] kthread+0xcc/0xde
[ 21.020000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.020000] Task dump for CPU 2:
[ 21.020000] swapper/2 R running task 0 0 1 0x00000008
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] __schedule+0x1c6/0x4ea
[ 21.020000] Task dump for CPU 3:
[ 21.020000] swapper/3 R running task 0 0 1 0x00000008
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] walk_stackframe+0x0/0xa2
[ 21.020000] [< (ptrval)>] show_stack+0x26/0x34
[ 21.020000] [< (ptrval)>] sched_show_task+0xa6/0xfc
[ 21.020000] [< (ptrval)>] dump_cpu_task+0x3a/0x48
[ 21.020000] [< (ptrval)>] rcu_dump_cpu_stacks+0x7a/0xb8
[ 21.020000] [< (ptrval)>] rcu_check_callbacks+0x4a0/0x660
[ 21.020000] [< (ptrval)>] update_process_times+0x1e/0x48
[ 21.020000] [< (ptrval)>] tick_nohz_handler+0x7c/0xd2
[ 21.020000] [< (ptrval)>] riscv_timer_interrupt+0x26/0x32
[ 21.020000] [< (ptrval)>] riscv_intc_irq+0xaa/0xd2
[ 21.020000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.030000] 2-...!: (698635 ticks this GP) idle=31a/2/0 softirq=1/1 fqs=0
[ 21.040000] (t=2102 jiffies g=-299 c=-300 q=1376)
[ 21.040000] rcu_sched kthread starved for 2102 jiffies! g18446744073709551317 c18446744073709551316 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=3
[ 21.040000] rcu_sched R running task 0 8 2 0x00000000

I am running with fairly latest linux
commit 8fa96b8
Author: Palmer Dabbelt [email protected]
Date: Wed Feb 7 18:23:09 2018 -0800

irq-riscv-intc post-merge cleanups

Should I update anything else to verify?

@palmer-dabbelt
Copy link
Contributor

The difference is probably just my working directory being a mess, but I can reproduce some failures with

$ git diff
diff --git a/Makefile b/Makefile
index 408a4247fe64..6375cab35974 100644
--- a/Makefile
+++ b/Makefile
@@ -130,6 +130,7 @@ endif
 
 $(vmlinux): $(linux_srcdir) $(linux_wrkdir)/.config $(buildroot_initramfs_sysroot_stamp)
        $(MAKE) -C $< O=$(linux_wrkdir) \
+               CROSS_COMPILE=riscv64-unknown-linux-gnu- \
                CONFIG_INITRAMFS_SOURCE="$(confdir)/initramfs.txt $(buildroot_initramfs_sysroot)" \
                CONFIG_INITRAMFS_ROOT_UID=$(shell id -u) \
                CONFIG_INITRAMFS_ROOT_GID=$(shell id -g) \
diff --git a/conf/linux_defconfig b/conf/linux_defconfig
index eff1c1731aa8..e0aac738869a 100644
--- a/conf/linux_defconfig
+++ b/conf/linux_defconfig
@@ -83,3 +83,6 @@ CONFIG_MTD=y
 CONFIG_MTD_SPI_NOR=y
 CONFIG_MTD_M25P80=y
 CONFIG_MTD_BLOCK_RO=y
+CONFIG_NO_HZ=y
+CONFIG_NO_HZ_IDLE=y
+CONFIG_NO_HZ_COMMON=y
diff --git a/linux b/linux
index 8fa96b887786..632fc702e0ef 160000
--- a/linux
+++ b/linux
@@ -1 +1 @@
-Subproject commit 8fa96b8877861fbcc7340f8f0a975569da254d93
+Subproject commit 632fc702e0efd519bf75f55c7f85ac7300e08dab
diff --git a/riscv-gnu-toolchain b/riscv-gnu-toolchain
--- a/riscv-gnu-toolchain
+++ b/riscv-gnu-toolchain
@@ -1 +1 @@
-Subproject commit b4dae89f85bf882852c6186b1284df11065bfcd9
+Subproject commit b4dae89f85bf882852c6186b1284df11065bfcd9-dirty
diff --git a/riscv-qemu b/riscv-qemu
index cd2c21d70361..6256f8ac6004 160000
--- a/riscv-qemu
+++ b/riscv-qemu
@@ -1 +1 @@
-Subproject commit cd2c21d70361f6887e265860c4f0dbb49ad323ca
+Subproject commit 6256f8ac6004677cc4652b1fcd77c839dfc05015-dirty

so I'm going to go figure out what's going on.

@palmer-dabbelt
Copy link
Contributor

@atishp04 is going to take a look, thanks!

@atishp04
Copy link
Contributor

Just a quick update. I am able to get repeated stall warnings in qemu setup as well (with smp).

[ 21.020000] INFO: rcu_sched self-detected stall on CPU
[ 21.020000] INFO: rcu_sched self-detected stall on CPU
[ 21.020000] 3-...!: (698636 ticks this GP) idle=2b6/2/0 softirq=1/1 fqs=0
[ 21.020000] (t=2100 jiffies g=-299 c=-300 q=1376)
[ 21.020000] rcu_sched kthread starved for 2100 jiffies! g18446744073709551317 c18446744073709551316 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=1
[ 21.020000] rcu_sched I 0 8 2 0x00000000
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] __schedule+0x1c6/0x4ea
[ 21.020000] [< (ptrval)>] schedule+0x16/0x58
[ 21.020000] [< (ptrval)>] schedule_timeout+0x12a/0x248
[ 21.020000] [< (ptrval)>] rcu_gp_kthread+0x31e/0x6cc
[ 21.020000] [< (ptrval)>] kthread+0xcc/0xde
[ 21.020000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.020000] Task dump for CPU 2:
[ 21.020000] swapper/2 R running task 0 0 1 0x00000008
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] __schedule+0x1c6/0x4ea
[ 21.020000] Task dump for CPU 3:
[ 21.020000] swapper/3 R running task 0 0 1 0x00000008
[ 21.020000] Call Trace:
[ 21.020000] [< (ptrval)>] walk_stackframe+0x0/0xa2
[ 21.020000] [< (ptrval)>] show_stack+0x26/0x34
[ 21.020000] [< (ptrval)>] sched_show_task+0xa6/0xfc
[ 21.020000] [< (ptrval)>] dump_cpu_task+0x3a/0x48
[ 21.020000] [< (ptrval)>] rcu_dump_cpu_stacks+0x7a/0xb8
[ 21.020000] [< (ptrval)>] rcu_check_callbacks+0x4a0/0x660
[ 21.020000] [< (ptrval)>] update_process_times+0x1e/0x48
[ 21.020000] [< (ptrval)>] tick_nohz_handler+0x7c/0xd2
[ 21.020000] [< (ptrval)>] riscv_timer_interrupt+0x26/0x32
[ 21.020000] [< (ptrval)>] riscv_intc_irq+0xaa/0xd2
[ 21.020000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.030000] 2-...!: (698635 ticks this GP) idle=31a/2/0 softirq=1/1 fqs=0
[ 21.040000] (t=2102 jiffies g=-299 c=-300 q=1376)
[ 21.040000] rcu_sched kthread starved for 2102 jiffies! g18446744073709551317 c18446744073709551316 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=3
[ 21.040000] rcu_sched R running task 0 8 2 0x00000000
[ 21.040000] Call Trace:
[ 21.040000] [< (ptrval)>] __schedule+0x1c6/0x4ea
[ 21.040000] [< (ptrval)>] schedule+0x16/0x58
[ 21.040000] [< (ptrval)>] schedule_timeout+0x12a/0x248
[ 21.040000] [< (ptrval)>] rcu_gp_kthread+0x31e/0x6cc
[ 21.040000] [< (ptrval)>] kthread+0xcc/0xde
[ 21.040000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.040000] Task dump for CPU 2:
[ 21.040000] swapper/2 R running task 0 0 1 0x00000008
[ 21.040000] Call Trace:
[ 21.040000] [< (ptrval)>] walk_stackframe+0x0/0xa2
[ 21.040000] [< (ptrval)>] show_stack+0x26/0x34
[ 21.040000] [< (ptrval)>] sched_show_task+0xa6/0xfc
[ 21.050000] [< (ptrval)>] dump_cpu_task+0x3a/0x48
[ 21.050000] [< (ptrval)>] rcu_dump_cpu_stacks+0x7a/0xb8
[ 21.050000] [< (ptrval)>] rcu_check_callbacks+0x4a0/0x660
[ 21.050000] [< (ptrval)>] update_process_times+0x1e/0x48
[ 21.050000] [< (ptrval)>] tick_nohz_handler+0x7c/0xd2
[ 21.050000] [< (ptrval)>] riscv_timer_interrupt+0x26/0x32
[ 21.050000] [< (ptrval)>] riscv_intc_irq+0xaa/0xd2
[ 21.050000] [< (ptrval)>] ret_from_syscall+0xa/0xe
[ 21.050000] Task dump for CPU 3:
[ 21.050000] swapper/3 R running task 0 0 1 0x00000008
[ 21.050000] Call Trace:
[ 21.050000] [< (ptrval)>] __schedule+0x1c6/0x4ea

The first one is after 21sec and rest of them keeps appearing after each ~63 secs. Still investigating.

palmer-dabbelt pushed a commit that referenced this issue Mar 20, 2018
This is to fix warning got as:

[ 6730.476938] ------------[ cut here ]------------
[ 6730.476979] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLAB object 'gvt-g_vgpu_workload' (offset 120, size 4)!
[ 6730.477021] WARNING: CPU: 2 PID: 441 at mm/usercopy.c:81 usercopy_warn+0x7e/0xa0
[ 6730.477042] Modules linked in: tun(E) bridge(E) stp(E) llc(E) kvmgt(E) x86_pkg_temp_thermal(E) vfio_mdev(E) intel_powerclamp(E) mdev(E) coretemp(E) vfio_iommu_type1(E) vfio(E) kvm_intel(E) kvm(E) hid_generic(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) usbhid(E) i915(E) crc32c_intel(E) hid(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) cryptd(E) glue_helper(E) intel_cstate(E) idma64(E) evdev(E) virt_dma(E) iTCO_wdt(E) intel_uncore(E) intel_rapl_perf(E) intel_lpss_pci(E) sg(E) shpchp(E) mei_me(E) pcspkr(E) iTCO_vendor_support(E) intel_lpss(E) intel_pch_thermal(E) prime_numbers(E) mei(E) mfd_core(E) video(E) acpi_pad(E) button(E) binfmt_misc(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) fscrypto(E) sd_mod(E) e1000e(E) xhci_pci(E) sdhci_pci(E)
[ 6730.477244]  ptp(E) cqhci(E) xhci_hcd(E) pps_core(E) sdhci(E) mmc_core(E) i2c_i801(E) usbcore(E) thermal(E) fan(E)
[ 6730.477276] CPU: 2 PID: 441 Comm: gvt workload 0 Tainted: G            E    4.16.0-rc1-gvt-staging-0213+ #127
[ 6730.477303] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0039.2016.0316.1747 03/16/2016
[ 6730.477326] RIP: 0010:usercopy_warn+0x7e/0xa0
[ 6730.477340] RSP: 0018:ffffba6301223d18 EFLAGS: 00010286
[ 6730.477355] RAX: 0000000000000000 RBX: ffff8f41caae9838 RCX: 0000000000000006
[ 6730.477375] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff8f41dad166f0
[ 6730.477395] RBP: 0000000000000004 R08: 0000000000000576 R09: 0000000000000000
[ 6730.477415] R10: ffffffffb1293fb2 R11: 00000000ffffffff R12: 0000000000000001
[ 6730.477447] R13: ffff8f41caae983c R14: ffff8f41caae9838 R15: 00007f183ca2b000
[ 6730.477467] FS:  0000000000000000(0000) GS:ffff8f41dad00000(0000) knlGS:0000000000000000
[ 6730.477489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6730.477506] CR2: 0000559462817291 CR3: 000000028b46c006 CR4: 00000000003626e0
[ 6730.477526] Call Trace:
[ 6730.477537]  __check_object_size+0x9c/0x1a0
[ 6730.477562]  __kvm_write_guest_page+0x45/0x90 [kvm]
[ 6730.477585]  kvm_write_guest+0x46/0x80 [kvm]
[ 6730.477599]  kvmgt_rw_gpa+0x9b/0xf0 [kvmgt]
[ 6730.477642]  workload_thread+0xa38/0x1040 [i915]
[ 6730.477659]  ? do_wait_intr_irq+0xc0/0xc0
[ 6730.477673]  ? finish_wait+0x80/0x80
[ 6730.477707]  ? clean_workloads+0x120/0x120 [i915]
[ 6730.477722]  kthread+0x111/0x130
[ 6730.477733]  ? _kthread_create_worker_on_cpu+0x60/0x60
[ 6730.477750]  ? exit_to_usermode_loop+0x6f/0xb0
[ 6730.477766]  ret_from_fork+0x35/0x40
[ 6730.477777] Code: 48 c7 c0 20 e3 25 b1 48 0f 44 c2 41 50 51 41 51 48 89 f9 49 89 f1 4d 89 d8 4c 89 d2 48 89 c6 48 c7 c7 78 e3 25 b1 e8 b2 bc e4 ff <0f> ff 48 83 c4 18 c3 48 c7 c6 09 d0 26 b1 49 89 f1 49 89 f3 eb
[ 6730.477849] ---[ end trace cae869c1c323e45a ]---

By whitelist guest page write from workload struct allocated from kmem cache.

Reviewed-by: Hang Yuan <[email protected]>
Signed-off-by: Zhenyu Wang <[email protected]>
(cherry picked from commit 5627705406874df57fdfad3b4e0c9aedd3b007df)
@atishp04
Copy link
Contributor

atishp04 commented May 9, 2018

Got some time to look at the issue.
It looks like it's a self detected stall on one or more cpus where rcu_sched kthread couldn't run for more than 21 seconds. All other cpus seems to run the idle thread (swapper task).

arch_idle() function invokes wfi which puts the cpu to until "idle" mode until the next interrupt happens.
In periodic tick, the cpus are woken up by periodic timer tick. However, the no_hz idle case it should be only woken up when there is some task queued i.e. scheduler IPIs.

Here are possible causes:
1.All of the cpus are in idle. Thus nobody is forwarding the jiffies - This is not the case. We can see repeated stalls have correct jiffies.
2. The concerned cpu never returned from WFI after IPI. I am still investigating this cause. Doesn't seem likely because I have qemu logs which suggest the cpus do return from WFI after IPI.

  1. The concerned cpu immediately returns from WFI thus looping continuously without giving any other thread chance to run.

  2. My initial assumption was something wrong with timer_probe and clock source being initialized for each cpu. But that has been fixed with following patches.

http://lists.infradead.org/pipermail/linux-riscv/2018-April/000510.html

It is difficult to debug as I can't put printk in the arch_idle function as it is called so many times that serial console will hang in the first place. trace_printk is of no help as it is stuck during the boot itself. I am not aware of any method to access the trace buffer unless it's booted completely.

Looking for any ideas or suggestions to debug the problem further.

@sorear
Copy link
Contributor Author

sorear commented May 10, 2018

Add an instruction or CSR to qemu which does printf on a register, then debug by doing csr writes in arch_idle? If you're feeling ambitious implement Liviu's semihosting proposal (already in openocd and should be findable on sw-dev, it's a small change to EBREAK translation)

atishp04 added a commit to atishp04/riscv-linux that referenced this issue May 23, 2018
The timer interrupt pending bit is cleared in bbl while
reprogramming the timer. This works fine unless we are
in nohz mode. Since, the timer is not reprogrammed, the
pending bits are not cleared leading to continuous timer
interrupt firing and cpu stalls.

Clear timer interrupt in timer interrupt handler by calling
into SEE. This also introduces an SBI call to do this.

This patch requires following bbl fix.
riscv-software-src/riscv-pk#108

Both kernel & bbl patch are required to run kernel as tick less
mode in RISC-V.

The details of the stalls can be found in
riscvarchive#127

Signed-off-by: Atish Patra <[email protected]>
atishp04 added a commit to atishp04/riscv-linux that referenced this issue May 25, 2018
The timer interrupt pending bit is cleared in bbl while
reprogramming the timer. This works fine unless we are
in nohz mode. Since, the timer is not reprogrammed, the
pending bits are not cleared leading to continuous timer
interrupt firing and cpu stalls.

Disable timer interrupt in interrupt handler to ignore
the pending bit until next interrupt. Timer interrupt
is enabled during setting the next timer event.

The details of the stalls can be found in
riscvarchive#127

Other possible ideas discussion:
riscv-software-src/riscv-pk#108

Signed-off-by: Atish Patra <[email protected]>
@atishp04
Copy link
Contributor

atishp04 commented May 25, 2018

Fixed in
atishp04@fd1e893

We can close this issue now.

palmer-dabbelt pushed a commit that referenced this issue Jun 4, 2018
The timer interrupt pending bit is cleared in bbl while
reprogramming the timer. This works fine unless we are
in nohz mode. In nohz mode, the timer is not reprogrammed.
Thus, the pending bits are not cleared leading to continuous
timer interrupt firing and cpu stalls.

Disable timer interrupt in interrupt handler to ignore
the pending bit until next interrupt. Timer interrupt
is enabled again before next timer event is set.

The details of the stalls can be found in
#127

Other possible ideas discussion:
riscv-software-src/riscv-pk#108

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
palmer-dabbelt pushed a commit that referenced this issue Jul 23, 2018
The timer interrupt pending bit is cleared in bbl while
reprogramming the timer. This works fine unless we are
in nohz mode. In nohz mode, the timer is not reprogrammed.
Thus, the pending bits are not cleared leading to continuous
timer interrupt firing and cpu stalls.

Disable timer interrupt in interrupt handler to ignore
the pending bit until next interrupt. Timer interrupt
is enabled again before next timer event is set.

The details of the stalls can be found in
#127

Other possible ideas discussion:
riscv-software-src/riscv-pk#108

Signed-off-by: Atish Patra <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants