BDF port gets into bad state - MTL fails to init #1199

@cwhite102

As requested, this issue has been split off from #1176 (lcore management gets into a bad state) as it appears to be a different issue (just with similar symptoms).

In this issue, the VFIO BDF port itself (or a data structure related to it) gets into a bad state, and MTL fails to initialize in any new process that uses the affected port.
In this state (unlike #1176), changing the lcores list doesn't help, because the port itself is the problem.
A reboot restores normal functionality; switching to a different BDF port also works around it.

We're not sure what leads to this situation, or if there's a way to resolve it other than a reboot.
We start up multiple processes at once, each with their own uniquely assigned BDF / VFIO port.

Any guidance, experiments, or commands to help get to the root cause (and solution) would be appreciated.

Here's a log of attempting to initialize MTL on a port that is in this bad state:
(Note: these logs are from launching a new, single MTL process with no other MTL processes running.)
(Built from the MTL main branch; I believe this includes commits up to July 3rd.)

07-05 04:25:53.890 INFO Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: Initializing MTL... 
07-05 04:25:53.892 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: Logging thread started 
07-05 04:25:53.893 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init(0), port_param: 0000:83:01.2 
07-05 04:25:53.893 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init, wait eal_init_thread done 
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL version: 25.2.0.DEV Thu Jul  3 11:02:26 2025 e285345a gcc-11.4.0, dpdk version: DPDK 25.03.0 
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL_HAS_USDT is defined for this build 
07-05 04:25:54.347 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_instance_init, connect to manager fail, assume single instance mode 
07-05 04:25:54.347 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init(0), socket_id 0 port 0000:83:01.2 
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_stat_init, stat period 10s 
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), use user ptp source 
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), user request queues tx 11 rx 40 
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0 
07-05 04:25:54.348 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, start 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: mt_dev_if_init(0), dev_config_port fail -1 
07-05 04:25:54.698 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_close_port(0), port not started 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: mtl_init, st dev if init fail -5 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9761742a0 priv 0x11802b6680 not found 
07-05 04:25:54.698 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_cni_uinit, no cni for all ports 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Error: sch_lcore_shm_uinit, can not stat shared memory, Invalid argument 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802bf738 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802c6e58 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802ce578 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802d5c98 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802dd3b8 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802e4ad8 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802ec1f8 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802f3918 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11802fb038 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180302758 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180309e78 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180311598 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180318cb8 not found 
07-05 04:25:54.698 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x11803203d8 not found 
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180327af8 not found 
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x118032f218 not found 
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x1180336938 not found 
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760a71c0 priv 0x118033e058 not found 
07-05 04:25:54.699 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_stat_unregister, cb 0x7fe9760e5600 priv 0x11802b6680 not found 
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_stop_port(0), port not started 
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_free, succ 
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_main_free, succ 
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_close_port(0), port not started 
07-05 04:25:54.699 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, stop 
07-05 04:25:54.700 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_uinit, succ 
07-05 04:25:54.700 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_uninit, succ 
07-05 04:25:54.700 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: mtl_init: mtl_init fail 
07-05 04:25:54.700 SEVERE Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: mtl_init fail -1
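
The teardown warnings make the actual failure point easy to miss, so here is a small sketch for pulling the first `Error:` line out of a log like the one above. It assumes the line layout shown in this issue (`... MTL: Error: <function>(<port>), <message>`); `first_mtl_error` and the regex are hypothetical helpers written for illustration, not part of MTL.

```python
import re

# Assumed layout, taken from the log excerpt in this issue, e.g.
# "... MTL: Error: dev_config_port(0), rte_eth_dev_configure fail -1"
ERROR_RE = re.compile(
    r"MTL: Error: (?P<func>\w+)(?:\((?P<port>\d+)\))?, (?P<msg>.*\S)"
)


def first_mtl_error(log_text):
    """Return (function, port, message) for the first MTL error line, or None.

    port is None when the log line carries no "(N)" port index.
    """
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            port = match.group("port")
            return (
                match.group("func"),
                int(port) if port is not None else None,
                match.group("msg"),
            )
    return None
```

Run against the failing log, this should point at the `dev_config_port` / `rte_eth_dev_configure` line rather than the later `mtl_init` error, which is only a consequence of it.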

With the same settings, switching to the next BDF port works, and has output as follows:

07-05 04:27:43.955 INFO Intel2110Config_GLOBAL [Blueprint-GraphStateNotifier-2]: /: Initializing MTL... 
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: Logging thread started 
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init(0), port_param: 0000:83:01.3 
07-05 04:27:43.957 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_eal_init, wait eal_init_thread done 
07-05 04:27:44.456 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL version: 25.2.0.DEV Thu Jul  3 11:02:26 2025 e285345a gcc-11.4.0, dpdk version: DPDK 25.03.0 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init, MTL_HAS_USDT is defined for this build 
07-05 04:27:44.457 WARNING Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: Warn: mt_instance_init, connect to manager fail, assume single instance mode 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mtl_init(0), socket_id 0 port 0000:83:01.3 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_stat_init, stat period 10s 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), use user ptp source 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), user request queues tx 11 rx 40 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_dev_if_init(0), deprecated sessions tx 0 rx 0 
07-05 04:27:44.457 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: stat_thread, start 
07-05 04:27:44.594 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_config_port(0), tx_q(41 with 512 desc) rx_q (41 with 2048 desc) 
07-05 04:27:44.595 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11804fdf00 size 4.310394m n 2047 d 2048 for T_P0_SYS_0 
07-05 04:27:44.595 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: dev_if_init_tx_queues(0), tx_queues 41 malloc succ 
07-05 04:27:44.596 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x118061da80 size 8.622894m n 4095 d 2048 for R_P0Q0_MBUF_1 
07-05 04:27:44.597 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11816fdf00 size 6.623383m n 4095 d 1536 for R_P0Q1_MBUF_2 
07-05 04:27:44.598 INFO Intel2110Config_GLOBAL [Intel2110_Logging_BG-1]: /: MTL: mt_mempool_create_by_ops(0), succ at 0x11820fdf00 size 6.623383m n 4095 d 1536 for R_P0Q2_MBUF_3 
...
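
Because both runs use the same settings, comparing the two logs line by line (after stripping timestamps and the BDF address) shows where they first diverge. This is a minimal sketch that assumes the log layout shown above; `normalize` and `first_divergence` are hypothetical helper names.

```python
import re

# Leading "MM-DD HH:MM:SS.mmm " timestamp, as it appears in the logs above.
TIMESTAMP_RE = re.compile(r"^\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\s+")
# PCI address in domain:bus:device.function form, e.g. 0000:83:01.2
BDF_RE = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")


def normalize(line):
    """Drop the timestamp and mask the BDF so only real differences remain."""
    line = TIMESTAMP_RE.sub("", line.strip())
    return BDF_RE.sub("<bdf>", line)


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None.

    Only the common prefix of the two logs is compared; blank lines are skipped.
    """
    lines_a = [normalize(l) for l in log_a.splitlines() if l.strip()]
    lines_b = [normalize(l) for l in log_b.splitlines() if l.strip()]
    for index, (a, b) in enumerate(zip(lines_a, lines_b)):
        if a != b:
            return index, a, b
    return None
```

For the two excerpts in this issue, everything up to `stat_thread, start` matches once normalized, and the first differing line is the `dev_config_port(0)` one.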

Some command output in case it's useful... (note - we create 16 vfio devices on each port)

imagine@tor-dl345-3:/opt/ZeniumUtils$ dpdk-devbind.py -s

Network devices using DPDK-compatible driver
============================================
0000:83:01.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:01.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:02.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:11.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.0 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.1 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.2 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.3 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.4 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.5 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.6 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf
0000:83:12.7 'Ethernet Adaptive Virtual Function 1889' numa_node=0 drv=vfio-pci unused=iavf

Network devices using kernel driver
===================================
0000:43:00.0 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f0 drv=tg3 unused=vfio-pci *Active*
0000:43:00.1 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f1 drv=tg3 unused=vfio-pci 
0000:43:00.2 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f2 drv=tg3 unused=vfio-pci 
0000:43:00.3 'NetXtreme BCM5719 Gigabit Ethernet PCIe 1657' numa_node=0 if=ens22f3 drv=tg3 unused=vfio-pci 
0000:83:00.0 'Ethernet Controller E810-XXV for SFP 159b' numa_node=0 if=ens3f0 drv=ice unused=vfio-pci *Active*
0000:83:00.1 'Ethernet Controller E810-XXV for SFP 159b' numa_node=0 if=ens3f1 drv=ice unused=vfio-pci *Active*
0000:c8:00.0 'BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller 16d7' numa_node=0 if=ens21f0np0 drv=bnxt_en unused=vfio-pci 
0000:c8:00.1 'BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller 16d7' numa_node=0 if=ens21f1np1 drv=bnxt_en unused=vfio-pci 
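
To double-check the binding state without reading the listing by eye, the `dpdk-devbind.py -s` output can be summarized per driver. This is only a sketch that assumes the status-line layout shown above (`<bdf> '<description>' ... drv=<driver> ...`); `count_by_driver` is a hypothetical helper, not part of DPDK.

```python
import re
from collections import Counter

# One device line of `dpdk-devbind.py -s`, as shown in the output above.
DEVICE_RE = re.compile(
    r"^(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) '.*' .*\bdrv=(?P<drv>\S+)"
)


def count_by_driver(status_text):
    """Count devices per bound driver; headings and blank lines are ignored."""
    counts = Counter()
    for line in status_text.splitlines():
        match = DEVICE_RE.match(line.strip())
        if match:
            counts[match.group("drv")] += 1
    return counts
```

For the listing above, this should report 32 devices on `vfio-pci`, which matches 16 VFs on each of the two E810 ports.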




sudo lshw -c network -businfo
Bus info          Device      Class          Description
========================================================
pci@0000:43:00.0  ens22f0     network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.1  ens22f1     network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.2  ens22f2     network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:43:00.3  ens22f3     network        NetXtreme BCM5719 Gigabit Ethernet PCIe
pci@0000:83:00.0  ens3f0      network        Ethernet Controller E810-XXV for SFP
pci@0000:83:00.1  ens3f1      network        Ethernet Controller E810-XXV for SFP
pci@0000:83:01.0              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.1              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.2              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.3              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.4              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.5              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.6              network        Ethernet Adaptive Virtual Function
pci@0000:83:01.7              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.0              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.1              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.2              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.3              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.4              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.5              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.6              network        Ethernet Adaptive Virtual Function
pci@0000:83:02.7              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.0              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.1              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.2              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.3              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.4              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.5              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.6              network        Ethernet Adaptive Virtual Function
pci@0000:83:11.7              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.0              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.1              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.2              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.3              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.4              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.5              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.6              network        Ethernet Adaptive Virtual Function
pci@0000:83:12.7              network        Ethernet Adaptive Virtual Function
pci@0000:c8:00.0  ens21f0np0  network        BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
pci@0000:c8:00.1  ens21f1np1  network        BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
