
RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration#1485

Open
leidwang wants to merge 2 commits into virtio-win:master from leidwang:fix-vsock-migration-bsod

Conversation

@leidwang
Contributor

@leidwang leidwang commented Jan 6, 2026

Problem

The viosock driver experiences connectivity issues and potential BSOD crashes after VM migration due to improper handling of transport
reset events. The current implementation has several critical issues:

  • Event size validation bug causing incorrect packet processing
  • Infinite loop in transport reset handler causing system hang
  • Race conditions during device state transitions
  • Missing queue cleanup leading to memory corruption
  • Improper socket state management after CID changes

Solution

This PR implements a comprehensive fix through two focused commits:

  1. Transport Reset Infrastructure and Bug Fixes
  • ✅ Fix critical event size validation bug (sizeof(*pEvt) vs sizeof(pEvt))
  • ✅ Move transport reset handling to workitem for PASSIVE_LEVEL processing
  • ✅ Add DeviceReady synchronization to prevent null pointer access
  • ✅ Add defensive null checks in transmit path during reset
  • ✅ Fix missing VOID return type in function declaration
  2. Complete Transport Reset Mechanism
  • ✅ Implement complete VirtIO queue rebuilding during transport reset
  • ✅ Add three-layer defense: queue state reset + socket handling + queue rebuild
  • ✅ Detect guest-CID changes and handle protocol mismatch after migration
  • ✅ Close connected sockets and update listen sockets appropriately
  • ✅ Add queue draining to prevent memory leaks during device shutdown
  • ✅ Add defensive cleanup for corrupted socket collections

Key Technical Changes

  • Transport Reset Workitem: Prevents infinite loops by processing resets at PASSIVE_LEVEL
  • Queue Rebuilding: Complete teardown and reconstruction of VirtIO queues after reset
  • Socket State Management: Proper handling of connected/listening sockets during migration
  • Synchronization: Device ready flag prevents operations during unsafe states
  • Memory Safety: Queue draining and defensive cleanup prevent leaks and corruption

Files Modified

  • viosock/sys/Device.c - Infrastructure, workitem, device lifecycle
  • viosock/sys/Evt.c - Event processing and transport reset detection
  • viosock/sys/Tx.c - Transmit path synchronization and null checks
  • viosock/sys/Rx.c - Receive path, queue draining, CID validation
  • viosock/sys/Socket.c - Socket state management during reset
  • viosock/sys/viosock.h - New declarations and structure members

Impact

This fix resolves the VM migration BSOD and connectivity issues, ensuring stable vsock operation across migration events while maintaining backward compatibility.

@YanVugenfirer
Collaborator

[Jenkins CI]: Can one of the admins verify this patch?

@gemini-code-assist

Summary of Changes

Hello @leidwang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and reliability of the viosock driver, particularly in virtualized environments involving VM migration. It addresses critical issues such as connection failures due to guest-CID changes, prevents system crashes (BSODs) by fixing an infinite loop in the transport reset logic, and introduces robust cleanup procedures during device shutdown. The changes ensure that the driver gracefully handles state inconsistencies that can arise post-migration, improving overall system resilience and user experience.

Highlights

  • Guest-CID Protocol Mismatch: Fixed an issue where a protocol mismatch in the guest-CID after VM migration caused connection failures in viosock.
  • Transport Reset Infinite Loop: Resolved an infinite loop bug in the transport reset handler, which was preventing BSODs during system reboots or shutdowns.
  • Migration Scenario Handling: Added defensive cleanup mechanisms and improved graceful error handling specifically for virtual machine migration scenarios to enhance stability.
  • Guest-CID Change Detection: Implemented detection and logging of guest-CID changes to provide clearer insights into connection behavior and debugging after migration events.


@gemini-code-assist bot left a comment
Code Review

This pull request introduces important fixes for vsock connectivity and stability after VM migration. The changes correctly identify guest-CID changes and add defensive cleanup logic. However, I've identified a critical resource leak in the receive path and two potential deadlocks in the transport reset handler. These issues could lead to resource exhaustion and system instability. I've provided detailed comments and code suggestions to address these critical problems. Additionally, there's a minor code duplication that could be refactored for better maintainability.

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 3 times, most recently from c3d4293 to ee6fb62 Compare January 6, 2026 01:53
@leidwang
Contributor Author

leidwang commented Jan 6, 2026

Hi @kostyanf14, could you please review this PR? Thanks!

@leidwang leidwang changed the title viosock: Fix vsock connectivity and BSOD issue after VM migration RHEL-137735: Fix vsock connectivity and BSOD issue after VM migration Jan 6, 2026
@leidwang leidwang changed the title RHEL-137735: Fix vsock connectivity and BSOD issue after VM migration RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration Jan 6, 2026
@kostyanf14
Member

BSOD stack (screenshot of the crash stack attached)

Tx.c#L333: virtqueue_disable_cb(pContext->TxVq);
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/viosock/sys/Tx.c#L333

@kostyanf14
Member

If I read the function arguments from the stack correctly, the faulting address is 0x50 (a very low value). (screenshot attached)

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 3 times, most recently from 4d54b68 to f5e7b1d Compare January 7, 2026 01:55
@kostyanf14
Member

ok to test

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch from f5e7b1d to 9a89f5e Compare January 12, 2026 08:39
@kostyanf14
Member

rerun tests

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 2 times, most recently from 3c2ecb0 to 7da81ca Compare January 15, 2026 08:13
@kostyanf14
Member

rerun tests

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 3 times, most recently from fffd144 to 0c27380 Compare January 16, 2026 04:02
@kostyanf14 kostyanf14 changed the title RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration CI-NO-BUILD Jan 16, 2026
@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 2 times, most recently from aa6e9a5 to 793310b Compare January 16, 2026 07:58
@leidwang leidwang changed the title RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration CI-NO-BUILD RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration Jan 17, 2026
@leidwang
Contributor Author

rerun tests

@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch from 793310b to f8f6aeb Compare January 20, 2026 03:24
@leidwang leidwang changed the title RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration CI-NO-BUILD Jan 20, 2026
@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 3 times, most recently from d2d0e75 to c829790 Compare January 21, 2026 01:41
@leidwang leidwang changed the title RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration CI-NO-BUILD RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration Jan 21, 2026
@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch 2 times, most recently from 8ee3009 to 349a5e5 Compare January 21, 2026 01:56
@leidwang
Contributor Author

Hi @kostyanf14, would you please review the code again? It is now working correctly in functional testing, but I'd still like to confirm whether there are any other issues. Thanks!

Collaborator

@YanVugenfirer left a comment

@leidwang Can you please break the PR into smaller commits? In my opinion, each bullet point in the commit message should be a separate commit.
Smaller commits make it easier to review.

- Fix event size validation and missing VOID return type
- Move transport reset to workitem to prevent infinite loop
- Add DeviceReady synchronization and null pointer protection

Signed-off-by: Leidong Wang <[email protected]>
- Rebuild VirtIO queues after transport reset
- Close connected sockets and handle CID changes
- Add queue draining and defensive cleanup for migration

Signed-off-by: Leidong Wang <[email protected]>
@leidwang leidwang force-pushed the fix-vsock-migration-bsod branch from 349a5e5 to c05d2eb Compare February 4, 2026 01:35
@leidwang
Contributor Author

leidwang commented Feb 4, 2026

@leidwang Can you please break the PR into smaller commits? In my opinion, each bullet point in the commit message should be a separate commit. Smaller commits make it easier to review.

Yes, I completely agree. I've updated the PR, splitting it into two commits, and updated the PR description to provide more details on the code changes. Thanks @YanVugenfirer!

@YanVugenfirer
Collaborator

rerun tests

@YanVugenfirer
Collaborator

Hi @leidwang.

After initial review with Konstantin, we have several questions:

  1. Can you take a trace of the driver during the migration?
  2. The reason we ask: it appears that either D0Exit is called or, for some other reason, the virtio queues become NULL. We want to understand what actually happens.
  3. Not for this PR, but the operation of the driver during S3/S4 should also be checked.

@leidwang
Contributor Author

leidwang commented Mar 9, 2026

Hi @leidwang.

After initial review with Konstantin, we have several questions:

  1. Can you take a trace of the driver during the migration?
  2. The reason we ask: it appears that either D0Exit is called or, for some other reason, the virtio queues become NULL. We want to understand what actually happens.
  3. Not for this PR, but the operation of the driver during S3/S4 should also be checked.

Hi @YanVugenfirer, thanks for the review.
Do you mean I need to collect the trace during the migration or add some function to take the trace during migration?

@YanVugenfirer
Collaborator

@leidwang Please collect the trace during the migration.

@leidwang
Contributor Author

@leidwang Please collect the trace during the migration.

I tried using the tools in Tools\trace to collect traces, but it seems I didn't capture any useful information. To trigger the blue screen I need to restart the VM, but restarting the VM interrupts trace collection. I've uploaded the trace file and dump file together; could you please take a look?

Trace file and dump file: https://drive.google.com/drive/folders/1rZ4H4v7mELI4DJVZI7F04-m5anTI2Ey4
