-
Notifications
You must be signed in to change notification settings - Fork 184
memory backing: add VM state checks, crash diagnostics and login retry logic #6761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
WalkthroughThis change modifies a single test file to enhance VM lifecycle reliability. It introduces two new helper functions: Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
|
Before fix After fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)
49-74: Consider usingtest.log.exception()for better stack traces in diagnostic code.The broad exception handling is reasonable for diagnostic collection, but using
test.log.exception()instead oftest.log.error()on lines 64 and 71 would provide complete stack traces, which is valuable when these diagnostics are triggered.📊 Proposed logging enhancement
try: result = virsh.domstate(vm_name, "--reason", debug=True) test.log.error("VM state with reason: %s" % result.stdout_text) except Exception as e: - test.log.error("Failed to get VM state: %s" % e) + test.log.exception("Failed to get VM state: %s" % e) # Try to get libvirt logs try: result = virsh.dumpxml(vm_name, debug=True) test.log.debug("VM XML: %s" % result.stdout_text) except Exception as e: - test.log.error("Failed to dump VM XML: %s" % e) + test.log.exception("Failed to dump VM XML: %s" % e)
76-113: Consider parameterizing the retry delay and improving exception logging.The retry logic is well-designed and robust. Two minor improvements:
- The 30-second retry delay (line 108) is hardcoded. Consider making it a parameter for flexibility across different test scenarios.
- Using
test.log.exception()on line 102 would provide better diagnostics when the VM dies during login.🔧 Proposed improvements
Option 1: Parameterize the retry delay
- def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3): + def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3, retry_delay=30): """ Wait for VM login with retry logic and status checks. :param vm: VM object :param timeout: Login timeout in seconds :param retry_count: Number of retry attempts + :param retry_delay: Delay in seconds between retry attempts :return: session object """And update the sleep call:
if attempt < retry_count - 1: - test.log.info("Waiting 30 seconds before retry...") - time.sleep(30) + test.log.info("Waiting %d seconds before retry..." % retry_delay) + time.sleep(retry_delay)Option 2: Enhance logging for VM death scenarios
# Check if VM died during login attempt if not vm.is_alive(): - test.log.error("VM died during login attempt") + test.log.exception("VM died during login attempt") check_vm_status_and_log(vm, "after failed login") break
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-24T08:01:27.899Z
Learnt from: hholoubk
Repo: autotest/tp-libvirt PR: 6579
File: libvirt/tests/src/sriov/vIOMMU/iommu_device_lifecycle.py:95-97
Timestamp: 2025-09-24T08:01:27.899Z
Learning: In the libvirt test framework used in tp-libvirt, VM cleanup including destroying running VMs is handled by the teardown method (test_obj.teardown_iommu_test()) called in the finally block, so explicit VM destroy calls on timeout are not necessary according to the maintainers.
Applied to files:
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
📚 Learning: 2025-11-18T08:47:14.465Z
Learnt from: smitterl
Repo: autotest/tp-libvirt PR: 6072
File: libvirt/tests/src/passthrough/pci/libvirt_pci_passthrough_hotplug.py:239-241
Timestamp: 2025-11-18T08:47:14.465Z
Learning: In tp-libvirt tests, `utils_test.ping` is a valid function that exists in the avocado-vt framework (virttest.utils_test module). It is an alias to `utils_net.ping` and supports parameters including dest, count, interface, timeout, and session.
Applied to files:
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧬 Code graph analysis (1)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)
libvirt/tests/src/serial/serial_functional.py (1)
wait_for_login(235-258)v2v/tests/src/function_test_esx.py (1)
output(161-173)
🪛 Ruff (0.14.10)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
63-63: Do not catch blind exception: Exception
(BLE001)
64-64: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
70-70: Do not catch blind exception: Exception
(BLE001)
71-71: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
95-95: Consider moving this statement to an else block
(TRY300)
96-96: Do not catch blind exception: Exception
(BLE001)
102-102: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Python 3.12
- GitHub Check: Python 3.9
- GitHub Check: Python 3.8
- GitHub Check: Python 3.11
🔇 Additional comments (3)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (3)
11-11: LGTM: Required import for stabilization delays.The
timemodule is properly imported to support the new stabilization and retry delay logic introduced in this PR.
144-156: Excellent defensive programming with VM lifecycle checks.The additions properly implement the PR objectives:
- Stabilization delay after VM start helps ensure the VM is fully initialized
- Status checks at each critical point (after start, after login) will help identify crash points
- Enhanced login with retry logic improves test reliability
The 5-second stabilization delay is reasonable, though you could optionally make it a parameter (e.g.,
vm_stabilization_delay) if different test scenarios require different wait times.
167-176: Good placement of VM health checks around memory operations.The pre- and post-operation status checks around
consume_vm_freememoryare well-positioned to detect VM crashes during memory stress testing, which aligns perfectly with the PR's stated goal of identifying memory-pressure-related failures.
Could not reliably reproduce the VM crash locally, so added preventive
instrumentation to improve observability and robustness.
Changes:
collect diagnostics (domstate --reason, dumpxml) if VM is inactive
login and before/after memory stress
This should help identify VM shutdowns/crashes (e.g. memory pressure or
hugepages issues on aarch64 with 64k pagesize) if they occur again.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.