Description
Describe the bug
Setting restart-ms to a non-zero value (100 ms is the one I've been using) results in an occasional BUG about echo SKBs and, slightly less commonly, an OOPS.
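For reference, the same setting can also be applied by hand with iproute2 rather than through systemd-networkd. This is a minimal sketch, assuming an interface named canbus0 and a 500 kbit/s bus; the helper name can_restart_cmds is made up for illustration, and it only prints the commands so they can be reviewed before being run as root:

```shell
#!/bin/sh
# Print the iproute2 commands that enable automatic bus-off recovery.
# can_restart_cmds is a hypothetical helper name; it prints, it does not execute.
can_restart_cmds() {
    iface=$1        # interface name, e.g. canbus0
    bitrate=$2      # bits per second, e.g. 500000
    restart_ms=$3   # restart delay in ms; 0 disables automatic restart
    # The CAN parameters are changed while the interface is down.
    echo "ip link set $iface down"
    echo "ip link set $iface type can bitrate $bitrate berr-reporting on restart-ms $restart_ms"
    echo "ip link set $iface up"
}

can_restart_cmds canbus0 500000 100
```

To apply the printed commands, pipe them to a root shell, for example `can_restart_cmds canbus0 500000 100 | sudo sh`.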
Steps to reproduce the behaviour
- Set up a Raspberry Pi 4 with the Waveshare 2-channel CAN FD hat, SKU 17075 (the one with the MCP2518FD). I installed the latest Raspberry Pi OS Lite 64 and ran apt update and apt upgrade.
- Connect one of the two CAN buses to something that will acknowledge your frames (I just connected both of them together, but I have verified this works with a PCAN-USB as well).
- Install this udev rules file:
KERNELS=="spi0.0", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus0", ATTR{tx_queue_len}="1024"
KERNELS=="spi0.1", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus1", ATTR{tx_queue_len}="1024"
KERNELS=="spi1.0", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus2", ATTR{tx_queue_len}="1024"
KERNELS=="spi1.1", SUBSYSTEMS=="spi", DRIVERS=="mcp251xfd", ACTION=="add|bind|change", NAME="canbus3", ATTR{tx_queue_len}="1024"
- Disable NetworkManager, enable systemd-networkd, and add this .network file to /etc/systemd/network:
[Match]
Name=canbus*
[CAN]
BitRate=500K
BusErrorReporting=yes
RestartSec=500ms
[Link]
RequiredForOnline=no
- Add these lines to config.txt:
dtparam=spi=on
dtoverlay=spi1-3cs
dtoverlay=mcp251xfd,spi0-0,interrupt=25
dtoverlay=mcp251xfd,spi1-0,interrupt=24
- Install can-utils.
- Reboot.
- Verify that canbus0 and canbus2 are up (if your jumpers are set differently, you might have 0 and 1 instead).
- Assuming canbus0 has a partner that will acknowledge, start cangen -I 123 -g 1 canbus0 (it also works on canbus2 if it has a partner). Verify that traffic is flowing via counters or candump.
- Briefly short the CAN pair together with a wire, screwdriver, etc., while watching journalctl -f.
- Repeat the previous step (step 10) until you see an OOPS. It should not take many attempts (usually less than 30 seconds).
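To catch the failure without reading the log by eye, the kernel journal can be filtered for the first splat. This is a minimal sketch; the helper name first_kernel_splat and the match pattern (any line containing BUG or Oops) are assumptions, and the exact message wording may differ between kernel versions:

```shell
#!/bin/sh
# Print the first line on stdin that looks like a kernel BUG or Oops, then stop.
# The pattern is an assumption; widen it if your kernel words things differently.
first_kernel_splat() {
    grep -m 1 -E 'BUG|Oops'
}

# Usage (not run here, since it follows the journal until a match appears):
#   journalctl -k -f | first_kernel_splat
```

Running it in a second terminal while shorting the bus leaves the first matching line on screen, which makes it easier to count how many attempts were needed.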
Device (s)
Raspberry Pi 4 Mod. B
System
Logs
Included in raspinfo
Additional context
This follows up on an email conversation between myself and @marckleinebudde. I thought it best to move it somewhere we can track it rather than continue sending many emails.
I'm reporting this in raspberrypi/linux because, although I've reproduced it on multiple distros, I don't have another SPI master to try it with, so I can't tell whether it is Pi-specific.
I have reproduced this issue with 6.6 and 6.12 kernels, on both Yocto and Raspberry Pi OS, with the Waveshare hat as well as our custom board, with CM4 and Pi 4 Model B, etc.
I tried cherry-picking some later commits into the 6.12 kernel, and it didn't really help.
Marc suggested this patch, which did not seem to cure the issue:
--- a/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
+++ b/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c
@@ -759,6 +759,9 @@ static void mcp251xfd_chip_stop(struct mcp251xfd_priv *priv,
 {
 	priv->can.state = state;
+	hrtimer_cancel(&priv->rx_irq_timer);
+	hrtimer_cancel(&priv->tx_irq_timer);
+	cancel_work_sync(&priv->tx_work);
 	mcp251xfd_chip_interrupts_disable(priv);
 	mcp251xfd_chip_rx_int_disable(priv);
 	mcp251xfd_timestamp_stop(priv);
He also suggested this patch, which I have yet to try but will as soon as I submit this (I've been spending time verifying that the issue is not particular to our distro or hardware):
--- a/drivers/net/can/dev/dev.c
+++ b/drivers/net/can/dev/dev.c
@@ -185,7 +185,9 @@ static void can_restart_work(struct work_struct *work)
 	struct can_priv *priv = container_of(dwork, struct can_priv,
 					     restart_work);
+	netif_tx_lock(priv->dev);
 	can_restart(priv->dev);
+	netif_tx_unlock(priv->dev);
 }
 int can_restart_now(struct net_device *dev)