RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration#1485
RHEL-137735: [viosock] Fix vsock connectivity and BSOD issue after VM migration#1485leidwang wants to merge 2 commits intovirtio-win:masterfrom
Conversation
|
[Jenkins CI]: Can one of the admins verify this patch? |
Summary of ChangesHello @leidwang, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the stability and reliability of the viosock driver, particularly in virtualized environments involving VM migration. It addresses critical issues such as connection failures due to guest-CID changes, prevents system crashes (BSODs) by fixing an infinite loop in the transport reset logic, and introduces robust cleanup procedures during device shutdown. The changes ensure that the driver gracefully handles state inconsistencies that can arise post-migration, improving overall system resilience and user experience. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces important fixes for vsock connectivity and stability after VM migration. The changes correctly identify guest-CID changes and add defensive cleanup logic. However, I've identified a critical resource leak in the receive path and two potential deadlocks in the transport reset handler. These issues could lead to resource exhaustion and system instability. I've provided detailed comments and code suggestions to address these critical problems. Additionally, there's a minor code duplication that could be refactored for better maintainability.
c3d4293 to
ee6fb62
Compare
|
Hi @kostyanf14 Could you please review this PR?Thanks! |
|
Tx.c#L333: |
4d54b68 to
f5e7b1d
Compare
|
ok to test |
f5e7b1d to
9a89f5e
Compare
|
rerun tests |
3c2ecb0 to
7da81ca
Compare
|
rerun tests |
fffd144 to
0c27380
Compare
aa6e9a5 to
793310b
Compare
|
rerun tests |
793310b to
f8f6aeb
Compare
d2d0e75 to
c829790
Compare
8ee3009 to
349a5e5
Compare
|
Hi @kostyanf14 Would you please review the code again? It's working perfectly functionally right now, but I'd still like to confirm if there are any other issues. Thanks! |
YanVugenfirer
left a comment
There was a problem hiding this comment.
@leidwang Can you please break the PR into smaller commits. Basically each bullet point in commit message should be a separate commit in my opinion.
Smaller commits make it easier for review.
- Fix event size validation and missing VOID return type - Move transport reset to workitem to prevent infinite loop - Add DeviceReady synchronization and null pointer protection Signed-off-by: Leidong Wang <[email protected]>
- Rebuild VirtIO queues after transport reset - Close connected sockets and handle CID changes - Add queue draining and defensive cleanup for migration Signed-off-by: Leidong Wang <[email protected]>
349a5e5 to
c05d2eb
Compare
Yes, I completely agree. I've updated the PR, splitting it into two commits, and updated the PR description to provide more details of the code changes.Thanks @YanVugenfirer |
|
rerun tests |
|
Hi @leidwang. After initial review with Konstantin, we have several questions:
|
Hi @YanVugenfirer, thanks for the review. |
|
@leidwang Please collect the trace during the migration. |
I tried using the tools in Tools\trace to collect traces, but it seems I didn't collect any useful information. To trigger a blue screen, I need to restart the VM, but restarting the VM will interrupt the trace collection. I will add trace file and dump file together, could you please take a look? Trace file and dump file: https://drive.google.com/drive/folders/1rZ4H4v7mELI4DJVZI7F04-m5anTI2Ey4 |


Problem
The viosock driver experiences connectivity issues and potential BSOD crashes after VM migration due to improper handling of transport
reset events. The current implementation has several critical issues:
Solution
This PR implements a comprehensive fix through two focused commits:
Key Technical Changes
Files Modified
Impact
This fix resolves the VM migration BSOD and connectivity issues, ensuring stable vsock operation across migration events while maintaining backward compatibility.