Description
As requested, this issue has been split off from #1176 (lcore management gets into a bad state) as it appears to be a different issue (just with similar symptoms).
In this issue, the VFIO BDF port itself (or a data structure related to it) gets into a bad state, and any new process that tries to initialize MTL on the troubled port fails.
When in this state (unlike #1176), changing the lcores list doesn't help, as it is the port itself which is problematic.
A reboot, or switching to a different BDF port, restores normal functionality.
We're not sure what leads to this situation, or if there's a way to resolve it other than a reboot.
We start up multiple processes at once, each with its own uniquely assigned BDF / VFIO port.
Any guidance, experiments, or commands to help get to the root cause (and solution) would be appreciated.
Here's a log of attempting to initialize MTL on a port that is in this bad state:
(Note - these logs are from launching a new, single MTL process with no other MTL processes running)
(MTL main branch - I believe this includes commits up to July 3rd)
07-05 04:25:53.890 INFO Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: Initializing MTL...
07-05 04:25:53.892 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: Logging thread started
07-05 04:25:53.893 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init(0), port_param: 0000:83:01.2
07-05 04:25:53.893 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init, wait eal_init_thread done
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL version: 25.2.0.DEV Thu Jul 3 11:02:26 2025 e285345a gcc-11.4.0, dpdk version: DPDK 25.03.0
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL_HAS_USDT is defined for this build
07-05 04:25:54.347 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_instance_init, connect to manager fail, assume single instance mode
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init(0), socket_id 0 port 0000:83:01.2
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_stat_init, stat period 10s
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), use user ptp source
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), user request queues tx 11 rx 40
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, start
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: mt_dev_if_init(0), dev_config_port fail -1
07-05 04:25:54.698 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_close_port(0), port not started
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: mtl_init, st dev if init fail -5
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9761742a0 priv 0x11802b6680 not found
07-05 04:25:54.698 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_cni_uinit, no cni for all ports
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: sch_lcore_shm_uinit, can not stat shared memory, Invalid argument
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802bf738 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802c6e58 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802ce578 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802d5c98 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802dd3b8 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802e4ad8 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802ec1f8 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802f3918 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802fb038 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180302758 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180309e78 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180311598 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180318cb8 not found
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11803203d8 not found
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180327af8 not found
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x118032f218 not found
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180336938 not found
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x118033e058 not found
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760e5600 priv 0x11802b6680 not found
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_stop_port(0), port not started
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_free, succ
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_main_free, succ
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_close_port(0), port not started
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, stop
07-05 04:25:54.700 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_uinit, succ
07-05 04:25:54.700 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_uninit, succ
07-05 04:25:54.700 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: mtl_init: mtl_init fail
07-05 04:25:54.700 SEVERE Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: mtl_init fail -1
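For triage, the first `Error:` line in the log above looks like the relevant one; the later errors and the `mt_stat_unregister` warnings appear to be teardown after the failure. A minimal Python sketch for pulling that first error out of a captured log follows. It assumes the line layout shown above (`MTL: Error: <function>, <message>`), and the helper name `first_mtl_error` is hypothetical, not part of MTL.

```python
import re

# Matches the "MTL: Error: <func>, <message>" tail of a log line, e.g.
# "... MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1"
ERROR_RE = re.compile(r"MTL: Error: (?P<func>[\w()]+), (?P<msg>.*)$")

def first_mtl_error(lines):
    """Return (function, message) of the first MTL error line, or None."""
    for line in lines:
        m = ERROR_RE.search(line.rstrip())
        if m:
            return m.group("func"), m.group("msg")
    return None

# Three lines copied from the failing log above.
sample = [
    "07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, start",
    "07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1",
    "07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: mt_dev_if_init(0), dev_config_port fail -1",
]
print(first_mtl_error(sample))
```

On the log above this points at `dev_config_port(0)`, that is, the `rte_eth_dev_configure` call returning -1.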
With the same settings, switching to the next BDF port works, and has output as follows:
07-05 04:27:43.955 INFO Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: Initializing MTL...
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: Logging thread started
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init(0), port_param: 0000:83:01.3
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init, wait eal_init_thread done
07-05 04:27:44.456 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL version: 25.2.0.DEV Thu Jul 3 11:02:26 2025 e285345a gcc-11.4.0, dpdk version: DPDK 25.03.0
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL_HAS_USDT is defined for this build
07-05 04:27:44.457 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_instance_init, connect to manager fail, assume single instance mode
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init(0), socket_id 0 port 0000:83:01.3
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_stat_init, stat period 10s
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), use user ptp source
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), user request queues tx 11 rx 40
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, start
07-05 04:27:44.594 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_config_port(0), tx_q(41 with 512 desc) rx_q (41 with 2048 desc)
07-05 04:27:44.595 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11804fdf00 size 4.310394m n 2047 d 2048 for T_P0_SYS_0
07-05 04:27:44.595 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_if_init_tx_queues(0), tx_queues 41 malloc succ
07-05 04:27:44.596 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x118061da80 size 8.622894m n 4095 d 2048 for R_P0Q0_MBUF_1
07-05 04:27:44.597 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11816fdf00 size 6.623383m n 4095 d 1536 for R_P0Q1_MBUF_2
07-05 04:27:44.598 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11820fdf00 size 6.623383m n 4095 d 1536 for R_P0Q2_MBUF_3
...
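Comparing the two logs above step by step, both runs follow the same sequence through `stat_thread, start` and then diverge at `dev_config_port`. A minimal Python sketch of that comparison is below. It assumes each MTL line has the form `MTL: [Warn: |Error: ]<function>, <message>`; `mtl_steps` and `first_divergence` are hypothetical helper names, and the sample lines are shortened excerpts with timestamps and logger names dropped.

```python
import re

# "MTL: " followed by an optional "Warn: " / "Error: " tag, then "<function>,".
STEP_RE = re.compile(r"MTL: (?:(?P<level>Warn|Error): )?(?P<func>[\w()]+),")

def mtl_steps(lines):
    """Reduce log lines to (level, function) pairs, skipping non-MTL lines."""
    steps = []
    for line in lines:
        m = STEP_RE.search(line)
        if m:
            steps.append((m.group("level") or "Info", m.group("func")))
    return steps

def first_divergence(bad_lines, good_lines):
    """Return (index, bad_step, good_step) at the first differing step, or None."""
    pairs = zip(mtl_steps(bad_lines), mtl_steps(good_lines))
    for i, (bad, good) in enumerate(pairs):
        if bad != good:
            return i, bad, good
    return None

bad = [
    "MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0",
    "MTL: stat_thread, start",
    "MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1",
]
good = [
    "MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0",
    "MTL: stat_thread, start",
    "MTL: dev_config_port(0), tx_q(41 with 512 desc) rx_q (41 with 2048 desc)",
]
print(first_divergence(bad, good))
```

Only the severity tag and function name are compared, so differences in the message text (such as the port address) do not count as a divergence.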
Some command output, in case it's useful (note: we create 16 VFIO devices on each port):
imagine@tor-dl345-3:/opt/ZeniumUtils$ dpdk-devbind.py -s
Network devices using DPDK-compatible driver
============================================
0000:83:01.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
Network devices using kernel driver
===================================
0000:43:00.0 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f0 drv=tg3 unused=vfio-pci *Active*
0000:43:00.1 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f1 drv=tg3 unused=vfio-pci
0000:43:00.2 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f2 drv=tg3 unused=vfio-pci
0000:43:00.3 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f3 drv=tg3 unused=vfio-pci
0000:83:00.0 'Ethernet Controller E810-XXV for SFP 159b' numa_node=0 if=ens3f0 drv=ice unused=vfio-pci *Active*
0000:83:00.1 'Ethernet Controller E810-XXV for SFP 159b' numa_node=0 if=ens3f1 drv=ice unused=vfio-pci *Active*
0000:c8:00.0 'BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller 16d7' numa_node=0 if=ens21f0np0 drv=bnxt_en unused=vfio-pci
0000:c8:00.1 'BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller 16d7' numa_node=0 if=ens21f1np1 drv=bnxt_en unused=vfio-pci
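To sanity-check the binding state, the `dpdk-devbind.py -s` output can be tallied by bound driver. The following is a minimal Python sketch, assuming the line format shown above (`<BDF> '<description>' ... drv=<driver> ...`); `count_by_driver` is a hypothetical helper, not part of DPDK.

```python
import re
from collections import Counter

# One device line: PCI address (domain:bus:device.function) ... drv=<driver>
DEV_RE = re.compile(
    r"^(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s.*\bdrv=(?P<drv>\S+)"
)

def count_by_driver(devbind_output):
    """Count devices per bound driver in `dpdk-devbind.py -s` text output."""
    counts = Counter()
    for line in devbind_output.splitlines():
        m = DEV_RE.match(line.strip())
        if m:
            counts[m.group("drv")] += 1
    return counts

# A few lines copied from the listing above; headers and rules are ignored.
sample = """\
Network devices using DPDK-compatible driver
============================================
0000:83:01.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
Network devices using kernel driver
===================================
0000:83:00.0 'Ethernet Controller E810-XXV for SFP 159b' numa_node=0 if=ens3f0 drv=ice unused=vfio-pci *Active*
"""
print(dict(count_by_driver(sample)))
```

Fed the full listing above, this should report 32 devices on `vfio-pci`, matching the 16 VFs created on each of the two E810 ports.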
sudo lshw -c network -businfo
Bus info Device Class Description
========================================================
pci@0000:43:00.0 ens22f0 network NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.1 ens22f1 network NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.2 ens22f2 network NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.3 ens22f3 network NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:83:00.0 ens3f0 network Ethernet Controller E810-XXV for SFP
pci@0000:83:00.1 ens3f1 network Ethernet Controller E810-XXV for SFP
pci@0000:83:01.0 network Ethernet Adaptive Virtual Function
pci@0000:83:01.1 network Ethernet Adaptive Virtual Function
pci@0000:83:01.2 network Ethernet Adaptive Virtual Function
pci@0000:83:01.3 network Ethernet Adaptive Virtual Function
pci@0000:83:01.4 network Ethernet Adaptive Virtual Function
pci@0000:83:01.5 network Ethernet Adaptive Virtual Function
pci@0000:83:01.6 network Ethernet Adaptive Virtual Function
pci@0000:83:01.7 network Ethernet Adaptive Virtual Function
pci@0000:83:02.0 network Ethernet Adaptive Virtual Function
pci@0000:83:02.1 network Ethernet Adaptive Virtual Function
pci@0000:83:02.2 network Ethernet Adaptive Virtual Function
pci@0000:83:02.3 network Ethernet Adaptive Virtual Function
pci@0000:83:02.4 network Ethernet Adaptive Virtual Function
pci@0000:83:02.5 network Ethernet Adaptive Virtual Function
pci@0000:83:02.6 network Ethernet Adaptive Virtual Function
pci@0000:83:02.7 network Ethernet Adaptive Virtual Function
pci@0000:83:11.0 network Ethernet Adaptive Virtual Function
pci@0000:83:11.1 network Ethernet Adaptive Virtual Function
pci@0000:83:11.2 network Ethernet Adaptive Virtual Function
pci@0000:83:11.3 network Ethernet Adaptive Virtual Function
pci@0000:83:11.4 network Ethernet Adaptive Virtual Function
pci@0000:83:11.5 network Ethernet Adaptive Virtual Function
pci@0000:83:11.6 network Ethernet Adaptive Virtual Function
pci@0000:83:11.7 network Ethernet Adaptive Virtual Function
pci@0000:83:12.0 network Ethernet Adaptive Virtual Function
pci@0000:83:12.1 network Ethernet Adaptive Virtual Function
pci@0000:83:12.2 network Ethernet Adaptive Virtual Function
pci@0000:83:12.3 network Ethernet Adaptive Virtual Function
pci@0000:83:12.4 network Ethernet Adaptive Virtual Function
pci@0000:83:12.5 network Ethernet Adaptive Virtual Function
pci@0000:83:12.6 network Ethernet Adaptive Virtual Function
pci@0000:83:12.7 network Ethernet Adaptive Virtual Function
pci@0000:c8:00.0 ens21f0np0 network BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
pci@0000:c8:00.1 ens21f1np1 network BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller