Conversation

@BulaYoungR BulaYoungR commented Jan 11, 2026

Could not reliably reproduce the VM crash locally, so added preventive
instrumentation to improve observability and robustness.

Changes:

  • Add check_vm_status_and_log() to verify VM state at critical stages and
    collect diagnostics (domstate --reason, dumpxml) if VM is inactive
  • Add wait_for_vm_login_with_retry() with retries, delays and VM state checks
  • Add stabilization delay and explicit VM state validation after start,
    login and before/after memory stress

This should help identify VM shutdowns/crashes (e.g. memory pressure or
hugepages issues on aarch64 with 64k pagesize) if they occur again.
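The retry-with-status-check pattern behind `wait_for_vm_login_with_retry` can be sketched generically. This is a hypothetical, framework-free version: the real helper calls `vm.wait_for_login()` and collects `virsh` diagnostics, while here `login_func` and `is_alive_func` stand in so the control flow is visible on its own.

```python
import time


def wait_with_retry(login_func, is_alive_func, retry_count=3, retry_delay=30):
    """Attempt login up to retry_count times; stop early if the VM dies.

    :param login_func: callable that returns a session or raises on failure
    :param is_alive_func: callable returning True while the VM is running
    :param retry_count: number of login attempts
    :param retry_delay: seconds to wait between attempts
    :return: whatever login_func returns on success
    """
    last_error = None
    for attempt in range(retry_count):
        try:
            return login_func()
        except Exception as e:
            last_error = e
            # If the VM died during the attempt, retrying cannot succeed;
            # this is where the real helper collects crash diagnostics.
            if not is_alive_func():
                break
            if attempt < retry_count - 1:
                time.sleep(retry_delay)
    raise RuntimeError("Login failed after %d attempts: %s"
                       % (retry_count, last_error))
```

The early `break` on a dead VM is what turns a silent crash into an attributable failure point in the logs.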

Summary by CodeRabbit

  • Tests
    • Enhanced memory test reliability with improved VM health monitoring and diagnostics
    • Added automatic retry mechanism for VM login with pre- and post-operation status checks
    • Strengthened error handling with comprehensive logging during VM lifecycle operations


coderabbitai bot commented Jan 11, 2026

Walkthrough

This change modifies a single test file to enhance VM lifecycle reliability. It introduces two new helper functions: check_vm_status_and_log to collect VM state and diagnostic logs when the VM is unresponsive, and wait_for_vm_login_with_retry to perform login attempts with retry logic and status validation. The existing test flow is updated to incorporate pre- and post-operation VM health checks around memory operations, include a stabilization delay after VM startup, and replace direct login calls with the new retry-enabled sequence.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main changes: adding VM state checks, crash diagnostics, and login retry logic to improve test robustness and observability.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@BulaYoungR (Author) commented:

Before fix

2026-01-06 04:22:55,080 avocado.core.task.statemachine statemachine     L0146 DEBUG| Task "1-type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288" finished with status: RuntimeTaskStatus.FINISHED
2026-01-06 04:22:55,131 avocado.job job              L0659 INFO | Test results available in /var/log/avocado/job-results/job-2026-01-06T04.18-d9a9af2
2026-01-06 04:22:55,131 avocado.app human            L0146 INFO | RESULTS    : PASS 0 | ERROR 1 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

After fix

JOB ID     : 60209dba7b756d7dd685d5369b0ee9c36eb5bf81
JOB LOG    : /var/log/avocado/job-results/job-2026-01-11T08.38-60209db/job.log
 (1/1) type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288: PASS (29.48 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /var/log/avocado/job-results/job-2026-01-11T08.38-60209db/results.html
JOB TIME   : 30.75 s

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)

49-74: Consider using test.log.exception() for better stack traces in diagnostic code.

The broad exception handling is reasonable for diagnostic collection, but using test.log.exception() instead of test.log.error() on lines 64 and 71 would provide complete stack traces, which is valuable when these diagnostics are triggered.

📊 Proposed logging enhancement
             try:
                 result = virsh.domstate(vm_name, "--reason", debug=True)
                 test.log.error("VM state with reason: %s" % result.stdout_text)
             except Exception as e:
-                test.log.error("Failed to get VM state: %s" % e)
+                test.log.exception("Failed to get VM state: %s" % e)
 
             # Try to get libvirt logs
             try:
                 result = virsh.dumpxml(vm_name, debug=True)
                 test.log.debug("VM XML: %s" % result.stdout_text)
             except Exception as e:
-                test.log.error("Failed to dump VM XML: %s" % e)
+                test.log.exception("Failed to dump VM XML: %s" % e)

76-113: Consider parameterizing the retry delay and improving exception logging.

The retry logic is well-designed and robust. Two minor improvements:

  1. The 30-second retry delay (line 108) is hardcoded. Consider making it a parameter for flexibility across different test scenarios.
  2. Using test.log.exception() on line 102 would provide better diagnostics when the VM dies during login.
🔧 Proposed improvements

Option 1: Parameterize the retry delay

-    def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3):
+    def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3, retry_delay=30):
         """
         Wait for VM login with retry logic and status checks.
 
         :param vm: VM object
         :param timeout: Login timeout in seconds
         :param retry_count: Number of retry attempts
+        :param retry_delay: Delay in seconds between retry attempts
         :return: session object
         """

And update the sleep call:

                 if attempt < retry_count - 1:
-                    test.log.info("Waiting 30 seconds before retry...")
-                    time.sleep(30)
+                    test.log.info("Waiting %d seconds before retry..." % retry_delay)
+                    time.sleep(retry_delay)

Option 2: Enhance logging for VM death scenarios

                 # Check if VM died during login attempt
                 if not vm.is_alive():
-                    test.log.error("VM died during login attempt")
+                    test.log.exception("VM died during login attempt")
                     check_vm_status_and_log(vm, "after failed login")
                     break
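A small self-contained demonstration (plain `logging`, no test framework) of why the review prefers `exception()` over `error()` inside an `except` block: `exception()` records the active traceback automatically, so the log shows where the failure originated, not just the message.

```python
import io
import logging

# Capture log output in a string buffer so the difference is visible.
log = logging.getLogger("login_retry_demo")
buf = io.StringIO()
log.addHandler(logging.StreamHandler(buf))

try:
    raise RuntimeError("VM died")
except RuntimeError:
    # Inside an except block, exception() logs at ERROR level and
    # appends the full traceback of the exception being handled.
    log.exception("VM died during login attempt")

output = buf.getvalue()
```

With `log.error()` the buffer would contain only the one-line message; with `log.exception()` it also contains the `Traceback (most recent call last):` block.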
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2f35e2b and 57f3bdb.

📒 Files selected for processing (1)
  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-24T08:01:27.899Z
Learnt from: hholoubk
Repo: autotest/tp-libvirt PR: 6579
File: libvirt/tests/src/sriov/vIOMMU/iommu_device_lifecycle.py:95-97
Timestamp: 2025-09-24T08:01:27.899Z
Learning: In the libvirt test framework used in tp-libvirt, VM cleanup including destroying running VMs is handled by the teardown method (test_obj.teardown_iommu_test()) called in the finally block, so explicit VM destroy calls on timeout are not necessary according to the maintainers.

Applied to files:

  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
📚 Learning: 2025-11-18T08:47:14.465Z
Learnt from: smitterl
Repo: autotest/tp-libvirt PR: 6072
File: libvirt/tests/src/passthrough/pci/libvirt_pci_passthrough_hotplug.py:239-241
Timestamp: 2025-11-18T08:47:14.465Z
Learning: In tp-libvirt tests, `utils_test.ping` is a valid function that exists in the avocado-vt framework (virttest.utils_test module). It is an alias to `utils_net.ping` and supports parameters including dest, count, interface, timeout, and session.

Applied to files:

  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧬 Code graph analysis (1)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)
libvirt/tests/src/serial/serial_functional.py (1)
  • wait_for_login (235-258)
v2v/tests/src/function_test_esx.py (1)
  • output (161-173)
🪛 Ruff (0.14.10)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py

63-63: Do not catch blind exception: Exception

(BLE001)


64-64: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


70-70: Do not catch blind exception: Exception

(BLE001)


71-71: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


95-95: Consider moving this statement to an else block

(TRY300)


96-96: Do not catch blind exception: Exception

(BLE001)


102-102: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Python 3.12
  • GitHub Check: Python 3.9
  • GitHub Check: Python 3.8
  • GitHub Check: Python 3.11
🔇 Additional comments (3)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (3)

11-11: LGTM: Required import for stabilization delays.

The time module is properly imported to support the new stabilization and retry delay logic introduced in this PR.


144-156: Excellent defensive programming with VM lifecycle checks.

The additions properly implement the PR objectives:

  • Stabilization delay after VM start helps ensure the VM is fully initialized
  • Status checks at each critical point (after start, after login) will help identify crash points
  • Enhanced login with retry logic improves test reliability

The 5-second stabilization delay is reasonable, though you could optionally make it a parameter (e.g., vm_stabilization_delay) if different test scenarios require different wait times.
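The parameterization suggested above could look like the following hypothetical sketch. `params` is assumed to be the usual dict-like avocado-vt params object (a plain dict stands in here); `vm_stabilization_delay` is an illustrative parameter name, not an existing one.

```python
import time


def stabilize_vm(params, default_delay=5):
    """Sleep for the configured post-start stabilization delay.

    :param params: dict-like test params (avocado-vt style)
    :param default_delay: fallback delay in seconds
    :return: the delay that was applied
    """
    # Test variants can override the delay via their cfg file instead
    # of editing the hardcoded time.sleep(5) in the test body.
    delay = int(params.get("vm_stabilization_delay", default_delay))
    time.sleep(delay)
    return delay
```

This keeps the default behavior identical while letting slower platforms (e.g. aarch64 with 64k pagesize) configure a longer wait.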


167-176: Good placement of VM health checks around memory operations.

The pre- and post-operation status checks around consume_vm_freememory are well-positioned to detect VM crashes during memory stress testing, which aligns perfectly with the PR's stated goal of identifying memory-pressure-related failures.
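The bracketing pattern praised here generalizes to any risky VM operation. A hypothetical framework-free sketch, with `check` standing in for `check_vm_status_and_log` and `operation` for the memory-stress call:

```python
def run_with_health_checks(operation, check, stage):
    """Run an operation bracketed by health checks.

    :param operation: zero-argument callable performing the risky step
    :param check: callable taking a stage-description string, which logs
                  VM state and diagnostics (stand-in for
                  check_vm_status_and_log)
    :param stage: human-readable name of the step, used in log messages
    :return: the operation's result
    """
    # A failure between these two checks is attributable to this stage.
    check("before %s" % stage)
    result = operation()
    check("after %s" % stage)
    return result
```

The value is in attribution: if the VM is alive at the "before" check and dead at the "after" check, the crash window is narrowed to exactly one operation.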
