System information
| Type | Version/Name |
|---|---|
| Distribution Name | Proxmox |
| Distribution Version | 7.4.1 |
| Kernel Version | 5.15.158-2 |
| Architecture | x86_64 |
| OpenZFS Version | 2.1.15 |
Describe the problem you're observing
A zfs promote operation results in a total freeze of the system.
Describe how to reproduce the problem
- This pool receives (incremental) replication streams from multiple sources (backups)
- Replication streams are compressed (lz4) and encrypted (sha256 checksums), but not deduplicated
- The encryption key is not loaded; all operations apply to encrypted, unmounted datasets
- Replication streams consist of "cascades" of snapshots and clones, eg:
dsv0 -> dsv0@ro -> dsv1 -> dsv1@ro -> dsv2 -> ... -> dsvn@ro
Notice all snapshots are named @ro.
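For context, the sketch below shows one way such a snapshot/clone chain can be built locally. It is a dry run that only prints commands, with illustrative dataset names under an assumed pool called zpool; in my setup the links actually arrive as incremental replication streams:

```shell
# Build the chain dsv0 -> dsv0@ro -> dsv1 -> dsv1@ro -> ... as plain command
# strings (dry run: printed, not executed). "zpool" is an assumed pool name.
chain=""
prev="dsv0"
for next in dsv1 dsv2 dsv3 dsv4; do
  chain="${chain}zfs snapshot zpool/${prev}@ro\n"              # freeze the current version
  chain="${chain}zfs clone zpool/${prev}@ro zpool/${next}\n"   # branch the next version off it
  prev="$next"
done
printf "%b" "$chain"
```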
At some point I start pruning the cascade to discard old versions, e.g. everything before a given version.
For example, to get rid of everything before dsv4, I would apply the following sequence of 4 operations iteratively to dsv3, dsv2 and dsv1:
1. zfs rename dsv3@ro dsv3@ro_old
2. zfs promote dsv4
3. zfs destroy -R dsv4@ro_old
4. zfs destroy -r dsv3 (example edited: this destroy step is useless)
(And repeat with dsv2 and dsv1.)
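The whole pruning pass can be written as a loop. The following is a dry-run sketch that only prints the commands (dataset names are the examples above; nothing is executed against the pool):

```shell
# Dry run of the pruning sequence: for each obsolete version, rename its @ro
# snapshot aside, promote the kept version (promote also moves the renamed
# snapshot onto it), then destroy the renamed snapshot and its dependents.
keep="dsv4"
plan=""
for old in dsv3 dsv2 dsv1; do
  plan="${plan}zfs rename ${old}@ro ${old}@ro_old\n"   # step 1
  plan="${plan}zfs promote ${keep}\n"                  # step 2: the step that freezes
  plan="${plan}zfs destroy -R ${keep}@ro_old\n"        # step 3
  plan="${plan}zfs destroy -r ${old}\n"                # step 4 (noted above as useless)
done
printf "%b" "$plan"
```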
The issue happens in a totally reproducible way when I reach step 2 of the last iteration, i.e. when the origin of dsv4 is dsv0@ro, after the intermediate iterations have completed without error (dsv1 and dsv2 successfully deleted).
The pool hosts many such "cascades", but the issue does NOT happen with all of them.
The pool has no errors, and the last scrub completed without any error.
The pool has no log or cache device.
I also had quite a few errors using bookmarks in the past and totally gave up using them.
The datasets on which the error occurs have no bookmarks.
The pool was not created using whole devices but with partitions, with the following command:
```
root@stor4:~# zpool history zpool | head
History for 'zpool':
2022-04-13.23:34:30 zpool create -o ashift=12 -o feature@async_destroy=enabled -o feature@bookmarks=enabled -o feature@embedded_data=enabled -o feature@empty_bpobj=enabled -o feature@enabled_txg=enabled -o feature@extensible_dataset=enabled -o feature@filesystem_limits=enabled -o feature@hole_birth=enabled -o feature@large_blocks=enabled -o feature@lz4_compress=enabled -o feature@spacemap_histogram=enabled -o feature@zpool_checkpoint=enabled -O acltype=posixacl -O compression=lz4 -O normalization=formD -O relatime=on -O xattr=sa zpool raidz1 /dev/disk/by-id/wwn-0x5000c50084fa59f3-part4 /dev/disk/by-id/wwn-0x5000c500855c7907-part4 /dev/disk/by-id/wwn-0x5000c50097ce49bb-part4 /dev/disk/by-id/wwn-0x5000c50097cea04f-part4 /dev/disk/by-id/wwn-0x5000c50097cf5663-part4 /dev/disk/by-id/wwn-0x5000c50097cf59ab-part4
```
I have the same setup on multiple backup storage systems, and the issue seems to happen randomly on some of them.
I also tried replicating the whole "cascade" to another server with the latest Proxmox kernel and ZFS version 2.2.7, and the problem happens in exactly the same way.
Include any warning/errors/backtraces from the system logs
There is no error log, no console message, no disk activity; network connections are frozen.
NIC leds are still blinking though.
Please advise on how to increase the kernel module and/or kernel log level.
If needed, I might be able to recompile the ZFS module; I had the toolchain working some time ago.
I would love to enable a higher log level to trace operations just before starting the evil promote.
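For what it's worth, below are the knobs I understand (from the zfs(4) man page and standard kernel facilities) to be the usual way to raise ZFS module verbosity and to capture stacks of blocked tasks during a hang; corrections welcome. The snippet is guarded so it does nothing on a machine without the zfs module loaded:

```shell
# Assumed standard OpenZFS tunables (documented in zfs(4)); verify on your build.
p=/sys/module/zfs/parameters
if [ -w "$p/zfs_dbgmsg_enable" ]; then
  echo 1 > "$p/zfs_dbgmsg_enable"     # enable the internal debug message buffer
  echo 1 > "$p/zfs_flags"             # bit 0x1: also record dprintf() messages
  cat /proc/spl/kstat/zfs/dbgmsg      # read the accumulated debug messages
  # When the hang occurs, dump stacks of all blocked tasks to dmesg/console:
  # echo w > /proc/sysrq-trigger
  status="debug logging enabled"
else
  status="zfs module parameters not present on this machine"
fi
echo "$status"
```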