File-based disk-only VM snapshot with KVM as hypervisor #10632
Conversation
@blueorangutan package
Codecov Report: All modified and coverable lines are covered by tests ✅

@@             Coverage Diff             @@
##               main   #10632       +/- ##
=============================================
- Coverage     16.41%    3.91%    -12.51%
=============================================
  Files          5702      415      -5287
  Lines        503405    33793    -469612
  Branches      60976     6078     -54898
=============================================
- Hits          82626     1322     -81304
+ Misses       411594    32313    -379281
+ Partials       9185      158      -9027
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@blueorangutan package
@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13204
@rohityadavcloud @sureshanaparti @weizhouapache could we run the CI?
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-13177)
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✖️ el8 ✖️ el9 ✔️ debian ✖️ suse15. SL-JID 13793
@bernardodemarco I've fixed the reported errors and validated that the use case you reported is working. Could you check?
@blueorangutan package
@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13819
@sureshanaparti could we run the CI here?
@blueorangutan LLtest
@DaanHoogland a [LL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
@blueorangutan test
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
core/src/main/java/com/cloud/agent/api/storage/DeleteDiskOnlyVmSnapshotCommand.java (outdated review thread, resolved)
core/src/main/java/com/cloud/agent/api/storage/MergeDiskOnlyVmSnapshotCommand.java (outdated review thread, resolved)
core/src/main/java/com/cloud/agent/api/storage/RevertDiskOnlyVmSnapshotCommand.java (outdated review thread, resolved)
plugins/hypervisors/kvm/src/main/java/org/apache/cloudstack/utils/qemu/QemuImg.java (outdated review thread, resolved)
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 13991
Here are some more tests that I have performed.
Test Descriptions
First, I repeated the tests mentioned in #10632 (review) regarding the creation, deletion and reversion of disk-only VM snapshots. After performing the operations, I verified the consistency of the volume/snapshot chains. No inconsistencies or unexpected behavior were noticed.
Next, I performed tests regarding the limitations of the feature, as described in the specification (#9524):
- Verified that it is not possible to create memory snapshots when VMs already have disk-only snapshots
- Verified that it is not possible to create disk-only VM snapshots when VMs already have memory snapshots
- Verified that it is not possible to create disk-only VM snapshots when the volumes of the VM already have volume snapshots
- Verified that it is not possible to create volume snapshots when the VM already has disk-only VM snapshots
- Verified that it is not possible to migrate a volume of a VM that has disk-only VM snapshots
Advanced Tests - Creation, reversion and deletion of disk-only VM snapshots
To perform more advanced test cases, first, I verified that the scenario described in #10632 (review) has been correctly fixed. The steps performed are described below, along with illustrations of the corresponding volume/snapshot chains:
After the above operations, I verified that the s3 snapshot (79faf6c9-f40f-40d3-a6fb-fa362a5b160b) correctly pointed to the s2 snapshot (3b90354c-1c03-4c4f-a71f-df95092bec68), which was marked as Hidden in the DB:
> qemu-img info 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
image: 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
file format: qcow2
virtual size: 50 MiB (52428800 bytes)
disk size: 644 KiB
cluster_size: 65536
backing file: /mnt/cb058c32-08a7-36ab-a540-8cee5f1a6b9f/3b90354c-1c03-4c4f-a71f-df95092bec68
backing file format: qcow2
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false
Child node '/file':
filename: 79faf6c9-f40f-40d3-a6fb-fa362a5b160b
protocol type: file
file length: 704 KiB (720896 bytes)
disk size: 644 KiB
Together with the test cases from #10632 (review) and #10632 (review), the tests presented here cover all of the creation, reversion and deletion workflows depicted in the feature's specification (#9524). @JoaoJandre, amazing work!
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-13678)
CLGTM, just a minor suggestion.
...apshot/src/main/java/org/apache/cloudstack/storage/vmsnapshot/DefaultVMSnapshotStrategy.java (outdated review thread, resolved)
...apshot/src/main/java/org/apache/cloudstack/storage/vmsnapshot/DefaultVMSnapshotStrategy.java (outdated review thread, resolved)
...apshot/src/main/java/org/apache/cloudstack/storage/vmsnapshot/DefaultVMSnapshotStrategy.java (outdated review thread, resolved)
@JoaoJandre while testing this, I found that when attempting to revert to a snapshot, if ACS is unable to find a host to perform the reversion, the snapshot gets stuck in an intermediate state.
Could you check that the latest commit fixes this? I've just reordered the method calls.
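For readers following along, here is a rough sketch of what such a reordering could look like; none of these identifiers come from the PR, this is purely illustrative of the idea of failing fast before any state transition:

```java
// Hypothetical sketch, not the PR's actual code. The point is the ordering:
// allocate the host first, and only transition the snapshot's state once the
// operation can actually proceed, so an allocation failure leaves it untouched.
Host host = findHostForSnapshotReversion(vmSnapshot);   // made-up helper
if (host == null) {
    // No state was changed yet, so the snapshot cannot get stranded here.
    throw new CloudRuntimeException("No host available to revert VM snapshot "
            + vmSnapshot.getUuid());
}
// Only now move the snapshot into its transient reverting state.
transitionSnapshotState(vmSnapshot, Event.RevertRequested);  // made-up helper
revertOnHost(host, vmSnapshot);                              // made-up helper
```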
        return;
    }

    resourceLimitManager.decrementResourceCount(volumeVO.getAccountId(), Resource.ResourceType.primary_storage, volumeVO.getSize() - snapshotRef.getSize());
In a situation such as:
- Create a VM with a 50 GB volume
- Take snapshot A
- Resize the VM's volume to 100 GB
- Take snapshot B
- Restore snapshot A
- Restore snapshot B
In step 5, the primary storage resource limit is decremented correctly; however, in step 6, it is not incremented immediately.
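To make the accounting concrete, here is a minimal sketch of the symmetric bookkeeping described above. It reuses the identifiers visible in the diff (resourceLimitManager, Resource.ResourceType.primary_storage); the helper method itself and its exact placement are illustrative, not the PR's actual fix:

```java
// Hypothetical helper: keep the account's primary_storage count in sync with
// the size the volume will have after the revert, in both directions.
private void syncPrimaryStorageCount(long accountId, long currentSize, long snapshotSize) {
    if (currentSize > snapshotSize) {
        // Reverting a grown volume to a smaller snapshot (step 5): release the excess.
        resourceLimitManager.decrementResourceCount(accountId,
                Resource.ResourceType.primary_storage, currentSize - snapshotSize);
    } else if (currentSize < snapshotSize) {
        // Reverting to a larger snapshot (step 6): reserve the difference again
        // immediately instead of leaving the count stale.
        resourceLimitManager.incrementResourceCount(accountId,
                Resource.ResourceType.primary_storage, snapshotSize - currentSize);
    }
}
```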
@JoaoJandre yup, it's fixed now. The snapshot remains in the expected state.
Description
This PR implements the spec available at #9524. For more information regarding it, please read the spec.
Furthermore, the following changes that are not contemplated in the spec were added:
- A snapshot.merge.timeout agent property was added. It is only considered if libvirt.events.enabled is true.
- When libvirt.events.enabled is true, ACS will register to gather events from Libvirt and will collect information on the process, providing a progress report in the logs. If the configuration is false, the old process is used.
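For reference, this is roughly how the two properties could be set in the KVM agent's agent.properties. The property names are taken from the description above; the timeout value and its unit are an arbitrary example, not a documented default:

```properties
# Enable Libvirt event handling so ACS can track merge progress.
libvirt.events.enabled=true
# Timeout for snapshot merge operations; only honored when
# libvirt.events.enabled=true. The value here is just an example.
snapshot.merge.timeout=3600
```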
How Has This Been Tested?
Basic Tests
I created a test VM to carry out the tests below. Additionally, after performing the relevant operations, I checked the VM's XML and the storage to verify whether the snapshots existed.
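As an illustration, the XML check can be scripted with the libvirt Java bindings. A minimal sketch, assuming a locally defined domain; the domain name and the line filter are mine, not part of the PR:

```java
import org.libvirt.Connect;
import org.libvirt.Domain;

// Prints each disk's <source file=.../> from the live domain XML; after a
// disk-only snapshot it should point at the newest qcow2 overlay.
public class DumpVmDiskSources {
    public static void main(String[] args) throws Exception {
        Connect conn = new Connect("qemu:///system");
        Domain dom = conn.domainLookupByName("test-vm"); // hypothetical name
        for (String line : dom.getXMLDesc(0).split("\n")) {
            if (line.contains("<source file=")) {
                System.out.println(line.trim());
            }
        }
        conn.close();
    }
}
```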
Snapshot Creation
The tests below were also repeated with the VM stopped.
Snapshot Reversion
Snapshot Removal
Advanced Tests
Deletion Test
All tests were carried out with the VM stopped.
The snapshot was marked as hidden and was not removed from storage.
Snapshot s3 was removed normally. Snapshot s2 was merged with snapshot s4.
Snapshot s4 was marked as hidden and was not removed from storage.
Snapshot s5 was removed normally. Snapshot s4 was merged with the delta of the VM's volume.
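To make the merge semantics above concrete, here is a hand-rolled sketch of what a disk-only snapshot and its later deletion boil down to at the qcow2 level, with the VM stopped. The file names are made up, and the PR drives these operations through the agent rather than code like this:

```java
import java.io.IOException;

// Illustrative only: reproduces the snapshot/merge steps manually with qemu-img.
public class QcowChainSketch {
    static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // Taking a snapshot = stacking a new qcow2 overlay on top of the chain.
        run("qemu-img", "create", "-f", "qcow2",
                "-b", "/mnt/pool/s4.qcow2", "-F", "qcow2", "/mnt/pool/s5.qcow2");
        // Deleting/merging a snapshot = committing an overlay's data down into
        // its backing file (for a running VM this goes through Libvirt instead).
        run("qemu-img", "commit", "/mnt/pool/s5.qcow2");
    }
}
```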
Reversion Test
Snapshot s1 was marked as hidden and was not removed from storage.
Concurrent Test
I created 4 VMs and took a VM snapshot of each. Then, I triggered the removal of all of them at the same time. All snapshots were removed concurrently and successfully.
Test with Multiple Volumes
I created a VM with one datadisk and attached 8 more datadisks (10 volumes in total), took two VM snapshots, and then removed them one at a time. The snapshots were removed successfully.
Tests Changing the snapshot.merge.timeout Config
Tests Related to Volume Resize with Disk-Only VM Snapshots on KVM
The results were verified with qemu-img info.
The last two tests were repeated on a VM with several snapshots, so that a merge between snapshots was performed. The result was the same.
Tests Related to Events
It was verified in the cloud.usage_event table that the resize event was correctly triggered, and it was also observed via the GUI that the account's resource limit was updated.