[ISSUE] NVMe-of-TCP online migration failed #44

@elmemis

Description

The destination disk is never created on TrueNAS.

Establishing API connection with remote at '192.168.231.243'
2025-11-25 03:46:29 remote: started tunnel worker 'UPID:pve-org-3:00007DB8:000360B9:69256CE5:qmtunnel:116:root@pam!migracion:'
tunnel: -> sending command "version" to remote
tunnel: <- got reply
2025-11-25 03:46:29 local WS tunnel version: 2
2025-11-25 03:46:29 remote WS tunnel version: 2
2025-11-25 03:46:29 minimum required WS tunnel version: 2
websocket tunnel started
2025-11-25 03:46:29 starting migration of VM 116 to node 'pve-org-3' (192.168.231.243)
tunnel: -> sending command "bwlimit" to remote
tunnel: <- got reply
2025-11-25 03:46:29 found local disk 'zfs-ssd:vm-116-disk-0' (attached)
2025-11-25 03:46:29 mapped: net0 from vmbr0 to vmbr0
2025-11-25 03:46:29 Allocating volume for drive 'scsi0' on remote storage 'nvme-sas'..
tunnel: -> sending command "disk" to remote
tunnel: <- got reply
2025-11-25 03:46:30 volume 'zfs-ssd:vm-116-disk-0' is 'nvme-sas:vol-vm-116-disk-0-ns536e94db-2a96-4a72-b188-4713c9ab205c' on the target
tunnel: -> sending command "config" to remote
tunnel: <- got reply
tunnel: -> sending command "start" to remote
tunnel: <- got reply
2025-11-25 03:46:30 ERROR: online migrate failure - error - tunnel command '{"start_params":{"forcemachine":"pc-i440fx-9.2+pve1","statefile":"unix","forcecpu":null,"skiplock":1},"cmd":"start","migrate_opts":{"storagemap":{"default":"nvme-sas"},"type":"websocket","nbd":{"scsi0":{"drivestr":"nvme-sas:vol-vm-116-disk-0-ns536e94db-2a96-4a72-b188-4713c9ab205c,format=raw,size=40G","success":true,"volid":"nvme-sas:vol-vm-116-disk-0-ns536e94db-2a96-4a72-b188-4713c9ab205c"}},"nbd_proto_version":1,"remote_node":"pve-org-3","network":null,"spice_ticket":null,"migratedfrom":"pve-org-2"}}' failed - failed to handle 'start' command - start failed: QEMU exited with code -1
2025-11-25 03:46:30 aborting phase 2 - cleanup resources
2025-11-25 03:46:30 migrate_cancel
tunnel: -> sending command "stop" to remote
tunnel: <- got reply
tunnel: -> sending command "quit" to remote
tunnel: <- got reply
2025-11-25 03:46:38 ERROR: error - tunnel command '{"cmd":"quit","cleanup":1}' failed - failed to handle 'quit' command - Could not locate NVMe device for TrueNAS UUID 536e94db-2a96-4a72-b188-4713c9ab205c
2025-11-25 03:46:38 ERROR:   Subsystem NQN: nqn.2025-11.org.company.lan:infrastructure-nvme
2025-11-25 03:46:38 ERROR:
2025-11-25 03:46:38 ERROR: Troubleshooting steps:
2025-11-25 03:46:38 ERROR:   1. Verify NVMe subsystem connection:
2025-11-25 03:46:38 ERROR:      -> Check: nvme list | grep 'nqn.2025-11.org.company.lan:infrastructure-nvme'
2025-11-25 03:46:38 ERROR:   2. Check if namespace is visible:
2025-11-25 03:46:38 ERROR:      -> Check: nvme list-subsys | grep -A10 'nqn.2025-11.org.company.lan:infrastructure-nvme'
2025-11-25 03:46:38 ERROR:   3. Verify TrueNAS NVMe-oF service is running
2025-11-25 03:46:38 ERROR:      -> TrueNAS: System Settings > Services > NVMe-oF Target
2025-11-25 03:46:38 ERROR:   4. Check network connectivity:
2025-11-25 03:46:38 ERROR:      -> Check: ping 10.10.231.246
2025-11-25 03:46:38 ERROR:   5. Review kernel logs for NVMe errors:
2025-11-25 03:46:38 ERROR:      -> Check: dmesg | tail -50 | grep nvme
2025-11-25 03:46:38 ERROR:
2025-11-25 03:46:38 ERROR: The namespace exists on TrueNAS but the device did not appear.
2025-11-25 03:46:38 ERROR: Manual cleanup may be required. at /usr/share/perl5/PVE/Storage/Custom/TrueNASPlugin.pm line 2629.
print() on closed filehandle GEN35 at /usr/share/perl5/PVE/Tunnel.pm line 99.
readline() on closed filehandle GEN32 at /usr/share/perl5/PVE/Tunnel.pm line 71.
Use of uninitialized value $res in concatenation (.) or string at /usr/share/perl5/PVE/Tunnel.pm line 117.
2025-11-25 03:47:08 tunnel still running - terminating now with SIGTERM
2025-11-25 03:47:18 tunnel still running - terminating now with SIGKILL
2025-11-25 03:47:19 ERROR: tunnel child process (PID 188178) couldn't be collected
2025-11-25 03:47:19 ERROR: failed to decode tunnel reply '' (command '{"cleanup":0,"cmd":"quit"}') - malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/PVE/Tunnel.pm line 116.
2025-11-25 03:47:19 ERROR: migration finished with problems (duration 00:00:50)
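
For triage, the identifiers scattered through the log above can be pulled out in one pass. The following is only a rough sketch: the function name `summarize_migration_log` and the regular expressions are my own assumptions, based solely on the message formats visible in this log, not on anything the plugin provides.

```python
import re

# Hypothetical patterns, derived only from the log lines quoted in this issue.
UUID_RE = re.compile(
    r"TrueNAS UUID ([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)
NQN_RE = re.compile(r"Subsystem NQN: (\S+)")
VOLUME_RE = re.compile(r"volume '([^']+)' is '([^']+)' on the target")

# Three lines copied from the task log above.
SAMPLE_LOG = """\
2025-11-25 03:46:30 volume 'zfs-ssd:vm-116-disk-0' is 'nvme-sas:vol-vm-116-disk-0-ns536e94db-2a96-4a72-b188-4713c9ab205c' on the target
2025-11-25 03:46:38 ERROR: error - tunnel command '{"cmd":"quit","cleanup":1}' failed - failed to handle 'quit' command - Could not locate NVMe device for TrueNAS UUID 536e94db-2a96-4a72-b188-4713c9ab205c
2025-11-25 03:46:38 ERROR:   Subsystem NQN: nqn.2025-11.org.company.lan:infrastructure-nvme
"""


def summarize_migration_log(text):
    """Return the first match of each identifier, or None if it is absent."""
    uuid = UUID_RE.search(text)
    nqn = NQN_RE.search(text)
    volume = VOLUME_RE.search(text)
    return {
        "uuid": uuid.group(1) if uuid else None,
        "nqn": nqn.group(1) if nqn else None,
        "source_volume": volume.group(1) if volume else None,
        "target_volume": volume.group(2) if volume else None,
    }


if __name__ == "__main__":
    for key, value in summarize_migration_log(SAMPLE_LOG).items():
        print(f"{key}: {value}")
```

The UUID and NQN it returns are the values to look for in the `nvme list` and `nvme list-subsys` checks suggested by the error message.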

Healthcheck: 100% passed

After the migration fails, the VM cannot be deleted from the GUI or with "qm destroy"; we have to delete its .conf file manually first.

Labels: bug
