
File-based disk-only VM snapshot with KVM as hypervisor #10632


Open · wants to merge 11 commits into main

Conversation

JoaoJandre (Contributor)

Description

This PR implements the spec available at #9524. For more information regarding it, please read the spec.

Furthermore, the following changes, which are not covered by the spec, were also made:

  1. The snapshot.merge.timeout agent property was added. It is only considered if libvirt.events.enabled is true;
  2. A new snapshot merge process (which affects both normal volume snapshots and this feature) was created. When libvirt.events.enabled is true, ACS registers for Libvirt events, collects information on the merge, and provides a progress report in the logs (see the sketch after this list). If the configuration is false, the old process is used;
  3. Volumes attached to VMs that have file-based disk-only VM snapshots on KVM can now be resized.
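
A minimal sketch of how the timeout-bounded, event-driven merge wait from item 2 can look on the agent side. All names here are hypothetical illustrations rather than the PR's actual classes; the only assumptions carried over from the description are that a Libvirt event signals completion of the block job and that snapshot.merge.timeout bounds the wait.

```java
// Hypothetical sketch, not the PR's actual code.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SnapshotMergeWaitSketch {

    /**
     * Waits for a block-commit (merge) job to finish.
     *
     * @param mergeFinished  completed by the (hypothetical) Libvirt event listener when
     *                       the block job completion event for the disk arrives
     * @param timeoutSeconds value of the snapshot.merge.timeout agent property;
     *                       0 or a negative value means "wait indefinitely"
     */
    public static void waitForMerge(CompletableFuture<Void> mergeFinished, long timeoutSeconds)
            throws Exception {
        if (timeoutSeconds <= 0) {
            mergeFinished.get();
            return;
        }
        try {
            mergeFinished.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Here the real agent would have to abort the block job and fail the
            // snapshot deletion (see the snapshot.merge.timeout tests further below).
            throw new Exception("Block commit did not finish within " + timeoutSeconds + " seconds", e);
        }
    }
}
```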

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Basic Tests

I created a test VM to carry out the tests below. Additionally, after performing the relevant operations, the VM's XML and the storage were checked to verify that the snapshots existed.
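
As an illustration of the kind of verification described above (not necessarily the exact commands used; the domain name and paths are placeholders), both checks can be done directly on the KVM host:

> virsh dumpxml <vm-instance-name>
> qemu-img info --backing-chain /mnt/<pool-uuid>/<volume-or-delta-uuid>

The first shows the disk sources currently referenced by the VM's XML; the second lists every file in the volume's qcow2 backing chain on the primary storage, together with its backing file.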

Snapshot Creation

The tests below were also repeated with the VM stopped.

| N | Test | Result |
|---|------|--------|
| 1 | Take a snapshot of VM 1 without specifying quiesceVM | Snapshot created |
| 2 | Take a snapshot of VM 2 specifying quiesceVM | Snapshot created |

Snapshot Reversion

| N | Test | Result |
|---|------|--------|
| 1 | Revert VM in Running state to any snapshot | Error thrown |
| 2 | Revert VM in Stopped state to snapshot 1 and start it | VM reverted and started successfully |

Snapshot Removal

| N | Test | Result |
|---|------|--------|
| 1 | Create a new snapshot 3 after the second reversion test and delete snapshot 1 | Verified that the snapshot was no longer listed and had the correct database metadata; the file still existed because more than one delta depended on it |
| 2 | Delete snapshot 2 | Snapshot deleted; snapshot 1 was merged with snapshot 3 since it only had the latter as a dependency |
| 3 | Delete snapshot 3 (current) | Snapshot removed, merged with the VM's volume |
| 4 | Create 3 snapshots and remove the first one | Snapshot removed, merged with the second snapshot |
| 5 | Create two snapshots, revert to the first, and delete the second | Snapshot deleted |

Advanced Tests

Deletion Test

All tests were carried out with the VM stopped.

  1. I created 3 snapshots: s1, s2, and s3.
  2. I reverted the VM to snapshot s2.
  3. I created snapshot s4.
  4. I removed snapshot s2.

Snapshot s2 was marked as hidden and was not removed from storage.

  5. I removed snapshot s3.

Snapshot s3 was removed normally. Snapshot s2 was merged with snapshot s4.

  6. I created snapshot s5.
  7. I reverted to snapshot s4.
  8. I removed snapshot s4.

Snapshot s4 was marked as hidden and was not removed from storage.

  9. I removed snapshot s5. Snapshot s5 was removed normally. Snapshot s4 was merged with the delta of the VM's volume.
  10. I removed the last remaining snapshot (s1). It was removed normally.

Reversion Test

  1. I created two snapshots: s1 and s2.
  2. I reverted to snapshot s1.
  3. I removed snapshot s1.

Snapshot s1 was marked as hidden and was not removed from storage.

  4. I reverted to snapshot s2. Snapshot s1 was merged with the base volume.

Concurrent Test

I created 4 VMs and took a VM snapshot of each. Then, I triggered the removal of all of them at the same time. All snapshots were removed simultaneously and successfully.

Test with Multiple Volumes

I created a VM with one datadisk and attached 8 more datadisks (10 volumes in total, counting the root volume), took two VM snapshots, and then removed the snapshots one at a time. The snapshots were removed successfully.

Tests Changing the snapshot.merge.timeout Config

  1. I changed the config to 1 and restarted the host;
  2. I created a VM, took a VM snapshot, accessed the VM, and wrote 4 GB of data to it;
  3. I tried to remove the snapshot; an error occurred, and the logs showed that the merge had timed out;
  4. I manually aborted the blockcommit process (see the example after this list);
  5. I changed the config to 0 and restarted the host;
  6. I tried to remove the snapshot, and it was performed correctly.
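
For reference, the manual abort in step 4 can be performed on the KVM host with virsh (the domain name and disk path are placeholders; this is an illustration, not necessarily the exact command used):

> virsh blockjob <vm-instance-name> <disk-path> --abort

This cancels the active block commit job on the given disk, after which the snapshot deletion can be retried.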

Tests Related to Volume Resize with Disk-Only VM Snapshots on KVM

| Test | Result | Expected? |
|------|--------|-----------|
| Create a VM, take a snapshot, resize the volume | Resize performed successfully, both in metadata and when checked with qemu-img info | Y |
| Stop the VM and revert the snapshot | Revert performed successfully; volume size returned to original, both in metadata and qemu-img info | Y |
| Remove the snapshot with the VM stopped | The volume's delta was correctly merged with the snapshot's, and the final size was that of the volume | Y |
| Start the VM, take a new snapshot, resize the volume, and remove the snapshot | Deltas were correctly merged, and the final size was that of the volume | Y |

The last two tests were repeated on a VM with several snapshots, so that a merge between snapshots was performed. The result was the same.

Tests Related to Events:

  1. Create a VM, take a disk-only VM snapshot, resize the root volume by an extra 1 GB, stop the VM, and revert the snapshot. The cloud.usage_event table showed that the resize event was correctly triggered, and the GUI showed that the account's resource limit was updated.
  2. Repeat the test above with a VM that has two volumes, resizing only one of them. The result was the same, and only one resize event was triggered, for the volume that had been resized.

@JoaoJandre (Contributor, Author)

@blueorangutan package


codecov bot commented Mar 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 3.91%. Comparing base (6fdaf51) to head (d5f8d05).
Report is 117 commits behind head on main.

❗ There is a different number of reports uploaded between BASE (6fdaf51) and HEAD (d5f8d05).

HEAD has 1 upload less than BASE:

| Flag | BASE (6fdaf51) | HEAD (d5f8d05) |
|------|----------------|----------------|
| unittests | 1 | 0 |
Additional details and impacted files
@@              Coverage Diff              @@
##               main   #10632       +/-   ##
=============================================
- Coverage     16.41%    3.91%   -12.51%     
=============================================
  Files          5702      415     -5287     
  Lines        503405    33793   -469612     
  Branches      60976     6078    -54898     
=============================================
- Hits          82626     1322    -81304     
+ Misses       411594    32313   -379281     
+ Partials       9185      158     -9027     
| Flag | Coverage Δ | |
|------|------------|---|
| uitests | 3.91% <ø> (-0.09%) | ⬇️ |
| unittests | ? | |

Flags with carried forward coverage won't be shown.



This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@JoaoJandre (Contributor, Author)

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13204

@JoaoJandre (Contributor, Author)

@rohityadavcloud @sureshanaparti @weizhouapache could we run the CI?

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13177)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 54050 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10632-t13177-kvm-ol8.zip
Smoke tests completed. 140 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|------|--------|----------|-----------|
| test_02_restore_vm_strict_tags_failure | Failure | 53.35 | test_vm_strict_host_tags.py |
| test_02_scale_vm_strict_tags_failure | Failure | 54.75 | test_vm_strict_host_tags.py |
| test_06_deploy_vm_on_any_host_with_strict_tags_failure | Failure | 4.69 | test_vm_strict_host_tags.py |


github-actions bot commented May 2, 2025

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.


@sureshanaparti (Contributor)

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13793

@JoaoJandre (Contributor, Author)

@bernardodemarco I've fixed the reported errors and validated that the use case you reported is working. Could you check?

@JoaoJandre (Contributor, Author)

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13819

@JoaoJandre (Contributor, Author)

@sureshanaparti could we run the CI here?

@DaanHoogland (Contributor)

@blueorangutan LLtest

@blueorangutan

@DaanHoogland a [LL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@sureshanaparti moved this from Done to In Progress in Apache CloudStack 4.21.0 on Jun 26, 2025
@sureshanaparti (Contributor)

@blueorangutan test

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@DaanHoogland (Contributor)

@blueorangutan package

@blueorangutan

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13991

@bernardodemarco (Collaborator) left a comment

Here are some more tests that I have performed.

Test Descriptions

First, I reproduced the tests mentioned in #10632 (review) regarding creation, deletion, and reversion of disk-only VM snapshots. After performing the operations, I verified the consistency of the volume/snapshot chains. No inconsistencies or unexpected behavior were noticed.

Next, I performed tests regarding the limitations of the feature, as described in the specification (#9524):

  • Verified that it is not possible to create memory snapshots when VMs already have disk-only snapshots
  • Verified that it is not possible to create disk-only VM snapshots when VMs already have memory snapshots
  • Verified that it is not possible to create disk-only VM snapshots when the volumes of the VM already have volume snapshots
  • Verified that it is not possible to create volume snapshots when the VM already has disk-only VM snapshots
  • Verified that it is not possible to migrate a volume of a VM that has disk-only VM snapshots

Advanced Tests - Creation, reversion and deletion of disk-only VM snapshots

To cover more advanced test cases, I first verified that the scenario described in #10632 (review) had been correctly fixed. The steps performed are described below, along with the corresponding volume/snapshot chain illustrations:

  • Created snapshots s1, s2 and s3 (chain illustration)
  • Reverted to s2 (chain illustration)
  • Created s4 (chain illustration)
  • Deleted s2 (chain illustration)
  • Deleted s1 (chain illustrations)

After the above operations, I verified that the s3 snapshot (79faf6c9-f40f-40d3-a6fb-fa362a5b160b) correctly pointed to the s2 snapshot (3b90354c-1c03-4c4f-a71f-df95092bec68), which was marked as Hidden in the DB:

> qemu-img info 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
image: 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
file format: qcow2
virtual size: 50 MiB (52428800 bytes)
disk size: 644 KiB
cluster_size: 65536
backing file: /mnt/cb058c32-08a7-36ab-a540-8cee5f1a6b9f/3b90354c-1c03-4c4f-a71f-df95092bec68
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
Child node '/file':
    filename: 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
    protocol type: file
    file length: 704 KiB (720896 bytes)
    disk size: 644 KiB
  • Reverted to s3 (chain illustration)
  • Created s5 (chain illustration)
  • Deleted s3 (chain illustrations)
  • Deleted s4 (chain illustrations)

With #10632 (review), #10632 (review), and the test cases presented here, all creation, reversion, and deletion workflows depicted in the feature's specification (#9524) have been covered by manual tests. @JoaoJandre, amazing work!

@DaanHoogland (Contributor)

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-13678)
Environment: kvm-ol8 (x2), Advanced Networking with Mgmt server ol8
Total time taken: 63838 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr10632-t13678-kvm-ol8.zip
Smoke tests completed. 141 look OK, 0 have errors, 0 did not run

@hsato03 (Collaborator) left a comment

CLGTM, just a minor suggestion.

@winterhazel (Member)

@JoaoJandre while testing this, I found that when attempting to revert to a snapshot, if ACS is unable to find a host to perform the reversion, the snapshot gets stuck in Reverting. I think we can roll back this state change, as no operation has been performed on the VM yet.

@JoaoJandre (Contributor, Author)

> @JoaoJandre while testing this, I found that when attempting to revert to a snapshot, if ACS is unable to find a host to perform the reversion, the snapshot gets stuck in Reverting. I think we can roll back this state change, as no operation has been performed on the VM yet.

Could you check that the latest commit fixes this? I've just reordered the method calls.

return;
}

resourceLimitManager.decrementResourceCount(volumeVO.getAccountId(), Resource.ResourceType.primary_storage, volumeVO.getSize() - snapshotRef.getSize());
Review comment from a Member on the change above:

In a situation such as:

  1. Create a VM with a 50 GB volume
  2. Take snapshot A
  3. Resize the VM's volume to 100 GB
  4. Take snapshot B
  5. Restore snapshot A
  6. Restore snapshot B

In step 5, the primary storage resource limit is decremented correctly; however, in step 6, it is not incremented immediately.
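
For clarity, the symmetric accounting this comment asks for could look like the fragment below. This is a hypothetical sketch built around the decrementResourceCount call in the excerpt above; incrementResourceCount is assumed to be its counterpart on the same manager, and the variable names follow the excerpt. It is not the PR's actual code.

```java
// Hypothetical sketch: keep primary_storage accounting symmetric when a revert
// changes the volume size in either direction.
long delta = volumeVO.getSize() - snapshotRef.getSize();
if (delta > 0) {
    // Reverting to a smaller snapshot (step 5 above): release the difference.
    resourceLimitManager.decrementResourceCount(volumeVO.getAccountId(),
            Resource.ResourceType.primary_storage, delta);
} else if (delta < 0) {
    // Reverting to a larger snapshot (step 6 above): account for the difference immediately.
    resourceLimitManager.incrementResourceCount(volumeVO.getAccountId(),
            Resource.ResourceType.primary_storage, -delta);
}
```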

@winterhazel (Member)

> @JoaoJandre while testing this, I found that when attempting to revert to a snapshot, if ACS is unable to find a host to perform the reversion, the snapshot gets stuck in Reverting. I think we can roll back this state change, as no operation has been performed on the VM yet.
>
> Could you check that the latest commit fixes this? I've just reordered the method calls.

@JoaoJandre yup, it's fixed now. The snapshot remains Ready.

Projects
Status: In Progress

8 participants