Conversation

@BulaYoungR BulaYoungR commented Jan 11, 2026

Could not reliably reproduce the VM crash locally, so added preventive
instrumentation to improve observability and robustness.

Changes:

  • Add check_vm_status_and_log() to verify VM state at critical stages and
    collect diagnostics (domstate --reason, dumpxml) if VM is inactive
  • Add wait_for_vm_login_with_retry() with retries, delays and VM state checks
  • Add stabilization delay and explicit VM state validation after start,
    login and before/after memory stress

This should help identify VM shutdowns/crashes (e.g. memory pressure or
hugepages issues on aarch64 with 64k pagesize) if they occur again.
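The retry-with-status-check pattern behind `wait_for_vm_login_with_retry` can be sketched generically. This is a hypothetical, framework-free version: the real helper calls `vm.wait_for_login()` and collects `virsh` diagnostics, while here `login_func` and `is_alive_func` stand in so the control flow is visible on its own.

```python
import time


def wait_with_retry(login_func, is_alive_func, retry_count=3, retry_delay=30):
    """Attempt login up to retry_count times; stop early if the VM dies.

    :param login_func: callable that returns a session or raises on failure
    :param is_alive_func: callable returning True while the VM is running
    :param retry_count: number of login attempts
    :param retry_delay: seconds to wait between attempts
    :return: whatever login_func returns on success
    """
    last_error = None
    for attempt in range(retry_count):
        try:
            return login_func()
        except Exception as e:
            last_error = e
            # If the VM died during the attempt, retrying cannot succeed;
            # this is where the real helper collects crash diagnostics.
            if not is_alive_func():
                break
            if attempt < retry_count - 1:
                time.sleep(retry_delay)
    raise RuntimeError("Login failed after %d attempts: %s"
                       % (retry_count, last_error))
```

The early `break` on a dead VM is what turns a silent crash into an attributable failure point in the logs.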

Summary by CodeRabbit

  • Tests
    • Enhanced memory test reliability with improved VM health monitoring and diagnostics
    • Added automatic retry mechanism for VM login with pre- and post-operation status checks
    • Strengthened error handling with comprehensive logging during VM lifecycle operations


coderabbitai bot commented Jan 11, 2026

Walkthrough

This change modifies a single test file to enhance VM lifecycle reliability. It introduces two new helper functions: check_vm_status_and_log to collect VM state and diagnostic logs when the VM is unresponsive, and wait_for_vm_login_with_retry to perform login attempts with retry logic and status validation. The existing test flow is updated to incorporate pre- and post-operation VM health checks around memory operations, include a stabilization delay after VM startup, and replace direct login calls with the new retry-enabled sequence.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately describes the main changes: adding VM state checks, crash diagnostics, and login retry logic to improve test robustness and observability.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@BulaYoungR (Author) commented:

Before fix

2026-01-06 04:22:55,080 avocado.core.task.statemachine statemachine     L0146 DEBUG| Task "1-type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288" finished with status: RuntimeTaskStatus.FINISHED
2026-01-06 04:22:55,131 avocado.job job              L0659 INFO | Test results available in /var/log/avocado/job-results/job-2026-01-06T04.18-d9a9af2
2026-01-06 04:22:55,131 avocado.app human            L0146 INFO | RESULTS    : PASS 0 | ERROR 1 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0

After fix

JOB ID     : 60209dba7b756d7dd685d5369b0ee9c36eb5bf81
JOB LOG    : /var/log/avocado/job-results/job-2026-01-11T08.38-60209db/job.log
 (1/1) type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288: STARTED
 (1/1) type_specific.io-github-autotest-libvirt.memory.backing.page_locking_and_shared_pages.memory_allocation.page_default.ksm_default.lock_hard_limit.64k.524288: PASS (29.48 s)
RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
JOB HTML   : /var/log/avocado/job-results/job-2026-01-11T08.38-60209db/results.html
JOB TIME   : 30.75 s

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)

49-74: Consider using test.log.exception() for better stack traces in diagnostic code.

The broad exception handling is reasonable for diagnostic collection, but using test.log.exception() instead of test.log.error() on lines 64 and 71 would provide complete stack traces, which is valuable when these diagnostics are triggered.

📊 Proposed logging enhancement
             try:
                 result = virsh.domstate(vm_name, "--reason", debug=True)
                 test.log.error("VM state with reason: %s" % result.stdout_text)
             except Exception as e:
-                test.log.error("Failed to get VM state: %s" % e)
+                test.log.exception("Failed to get VM state: %s" % e)
 
             # Try to get libvirt logs
             try:
                 result = virsh.dumpxml(vm_name, debug=True)
                 test.log.debug("VM XML: %s" % result.stdout_text)
             except Exception as e:
-                test.log.error("Failed to dump VM XML: %s" % e)
+                test.log.exception("Failed to dump VM XML: %s" % e)

76-113: Consider parameterizing the retry delay and improving exception logging.

The retry logic is well-designed and robust. Two minor improvements:

  1. The 30-second retry delay (line 108) is hardcoded. Consider making it a parameter for flexibility across different test scenarios.
  2. Using test.log.exception() on line 102 would provide better diagnostics when the VM dies during login.
🔧 Proposed improvements

Option 1: Parameterize the retry delay

-    def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3):
+    def wait_for_vm_login_with_retry(vm, timeout=360, retry_count=3, retry_delay=30):
         """
         Wait for VM login with retry logic and status checks.
 
         :param vm: VM object
         :param timeout: Login timeout in seconds
         :param retry_count: Number of retry attempts
+        :param retry_delay: Delay in seconds between retry attempts
         :return: session object
         """

And update the sleep call:

                 if attempt < retry_count - 1:
-                    test.log.info("Waiting 30 seconds before retry...")
-                    time.sleep(30)
+                    test.log.info("Waiting %d seconds before retry..." % retry_delay)
+                    time.sleep(retry_delay)

Option 2: Enhance logging for VM death scenarios

                 # Check if VM died during login attempt
                 if not vm.is_alive():
-                    test.log.error("VM died during login attempt")
+                    test.log.exception("VM died during login attempt")
                     check_vm_status_and_log(vm, "after failed login")
                     break
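A small self-contained demonstration (plain `logging`, no test framework) of why the review prefers `exception()` over `error()` inside an `except` block: `exception()` records the active traceback automatically, so the log shows where the failure originated, not just the message.

```python
import io
import logging

# Capture log output in a string buffer so the difference is visible.
log = logging.getLogger("login_retry_demo")
buf = io.StringIO()
log.addHandler(logging.StreamHandler(buf))

try:
    raise RuntimeError("VM died")
except RuntimeError:
    # Inside an except block, exception() logs at ERROR level and
    # appends the full traceback of the exception being handled.
    log.exception("VM died during login attempt")

output = buf.getvalue()
```

With `log.error()` the buffer would contain only the one-line message; with `log.exception()` it also contains the `Traceback (most recent call last):` block.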
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2f35e2b and 57f3bdb.

📒 Files selected for processing (1)
  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-24T08:01:27.899Z
Learnt from: hholoubk
Repo: autotest/tp-libvirt PR: 6579
File: libvirt/tests/src/sriov/vIOMMU/iommu_device_lifecycle.py:95-97
Timestamp: 2025-09-24T08:01:27.899Z
Learning: In the libvirt test framework used in tp-libvirt, VM cleanup including destroying running VMs is handled by the teardown method (test_obj.teardown_iommu_test()) called in the finally block, so explicit VM destroy calls on timeout are not necessary according to the maintainers.

Applied to files:

  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
📚 Learning: 2025-11-18T08:47:14.465Z
Learnt from: smitterl
Repo: autotest/tp-libvirt PR: 6072
File: libvirt/tests/src/passthrough/pci/libvirt_pci_passthrough_hotplug.py:239-241
Timestamp: 2025-11-18T08:47:14.465Z
Learning: In tp-libvirt tests, `utils_test.ping` is a valid function that exists in the avocado-vt framework (virttest.utils_test module). It is an alias to `utils_net.ping` and supports parameters including dest, count, interface, timeout, and session.

Applied to files:

  • libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py
🧬 Code graph analysis (1)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (2)
libvirt/tests/src/serial/serial_functional.py (1)
  • wait_for_login (235-258)
v2v/tests/src/function_test_esx.py (1)
  • output (161-173)
🪛 Ruff (0.14.10)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py

63-63: Do not catch blind exception: Exception

(BLE001)


64-64: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


70-70: Do not catch blind exception: Exception

(BLE001)


71-71: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


95-95: Consider moving this statement to an else block

(TRY300)


96-96: Do not catch blind exception: Exception

(BLE001)


102-102: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Python 3.12
  • GitHub Check: Python 3.9
  • GitHub Check: Python 3.8
  • GitHub Check: Python 3.11
🔇 Additional comments (3)
libvirt/tests/src/memory/memory_backing/page_locking_and_shared_pages.py (3)

11-11: LGTM: Required import for stabilization delays.

The time module is properly imported to support the new stabilization and retry delay logic introduced in this PR.


144-156: Excellent defensive programming with VM lifecycle checks.

The additions properly implement the PR objectives:

  • Stabilization delay after VM start helps ensure the VM is fully initialized
  • Status checks at each critical point (after start, after login) will help identify crash points
  • Enhanced login with retry logic improves test reliability

The 5-second stabilization delay is reasonable, though you could optionally make it a parameter (e.g., vm_stabilization_delay) if different test scenarios require different wait times.
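The parameterization suggested above could look like the following hypothetical sketch. `params` is assumed to be the usual dict-like avocado-vt params object (a plain dict stands in here); `vm_stabilization_delay` is an illustrative parameter name, not an existing one.

```python
import time


def stabilize_vm(params, default_delay=5):
    """Sleep for the configured post-start stabilization delay.

    :param params: dict-like test params (avocado-vt style)
    :param default_delay: fallback delay in seconds
    :return: the delay that was applied
    """
    # Test variants can override the delay via their cfg file instead
    # of editing the hardcoded time.sleep(5) in the test body.
    delay = int(params.get("vm_stabilization_delay", default_delay))
    time.sleep(delay)
    return delay
```

This keeps the default behavior identical while letting slower platforms (e.g. aarch64 with 64k pagesize) configure a longer wait.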


167-176: Good placement of VM health checks around memory operations.

The pre- and post-operation status checks around consume_vm_freememory are well-positioned to detect VM crashes during memory stress testing, which aligns perfectly with the PR's stated goal of identifying memory-pressure-related failures.
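The bracketing pattern praised here generalizes to any risky VM operation. A hypothetical framework-free sketch, with `check` standing in for `check_vm_status_and_log` and `operation` for the memory-stress call:

```python
def run_with_health_checks(operation, check, stage):
    """Run an operation bracketed by health checks.

    :param operation: zero-argument callable performing the risky step
    :param check: callable taking a stage-description string, which logs
                  VM state and diagnostics (stand-in for
                  check_vm_status_and_log)
    :param stage: human-readable name of the step, used in log messages
    :return: the operation's result
    """
    # A failure between these two checks is attributable to this stage.
    check("before %s" % stage)
    result = operation()
    check("after %s" % stage)
    return result
```

The value is in attribution: if the VM is alive at the "before" check and dead at the "after" check, the crash window is narrowed to exactly one operation.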
