forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Snp host latest #6
Open
MelodyHuibo
wants to merge
1,985
commits into
AMDESE:snp-host-latest
Choose a base branch
from
MelodyHuibo:snp-host-latest
base: snp-host-latest
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Snp host latest #6
MelodyHuibo
wants to merge
1,985
commits into
AMDESE:snp-host-latest
from
MelodyHuibo:snp-host-latest
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Replace open coded functionalify of kstrdup_and_replace() with a call. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Luis Chamberlain <[email protected]> Cc: Jason Baron <[email protected]> Cc: Jim Cromie <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Found with git grep 'MODULE_AUTHOR(".*([^)]*@' Fixed with sed -i '/MODULE_AUTHOR(".*([^)]*@/{s/ (/ </g;s/)"/>"/;s/)and/> and/}' \ $(git grep -l 'MODULE_AUTHOR(".*([^)]*@') Also: in drivers/media/usb/siano/smsusb.c normalise ", INC" to ", Inc"; this is what every other MODULE_AUTHOR for this company says, and it's what the header says in drivers/sbus/char/openprom.c normalise a double-spaced separator; this is clearly copied from the copyright header, where the names are aligned on consecutive lines thusly: * Linux/SPARC PROM Configuration Driver * Copyright (C) 1996 Thomas K. Dyas ([email protected]) * Copyright (C) 1996 Eddie C. Dost ([email protected]) but the authorship branding is single-line Link: https://lkml.kernel.org/r/mk3geln4azm5binjjlfsgjepow4o73domjv6ajybws3tz22vb3@tarta.nabijaczleweli.xyz Signed-off-by: Ahelenia Ziemiańska <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Since commit aed65af ("drivers: make device_type const"), the driver core can properly handle constant struct device_type. Make sure that new usages of the struct already enter the tree as const. Link: https://lkml.kernel.org/r/20240218-device_cleanup-checkpatch-v1-1-8b0b89c4f6b1@marliere.net Signed-off-by: Ricardo B. Marliere <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Cc: Joe Perches <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
Bring the PDF build fix into docs-next so we can ... build PDFs
…inux/kernel/git/akpm/mm
…ernel/git/arm64/linux
…/git/geert/linux-m68k.git
…/powerpc/linux.git
…t/klassert/ipsec.git
…git/wireless/wireless.git
…/git/tiwai/sound.git
…/git/broonie/regulator.git
…/git/gregkh/tty.git
…/git/gregkh/usb.git
…/phy/linux-phy.git
…kernel/git/wbg/counter.git
…kernel/git/gregkh/char-misc.git
…/westeri/thunderbolt.git
…t/herbert/crypto-2.6.git
…/vkoul/dmaengine.git
…/git/mtd/linux.git
…l/git/kdave/linux.git
When SEV-SNP is enabled globally, the hardware places restrictions on all memory accesses based on the RMP entry, whether the hypervisor or a VM, performs the accesses. When hardware encounters an RMP access violation during a guest access, it will cause a #VMEXIT(NPF) with a number of additional bits set to indicate the reasons for the #NPF. Define those here. See APM2 section 16.36.10 for more details. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: add some additional details to commit message] Signed-off-by: Michael Roth <[email protected]>
For KVM_X86_SNP_VM, only the PFERR_GUEST_ENC_MASK flag is needed to determine with an #NPF is due to a private/shared access by the guest. Implement that handling here. Also add handling needed to deal with SNP guests which in some cases will make MMIO accesses with the encryption bit. Signed-off-by: Michael Roth <[email protected]>
SEV-SNP relies on restricted/protected memory support to run guests, so make sure to enable that support via the CONFIG_KVM_GENERIC_PRIVATE_MEM config option. Signed-off-by: Michael Roth <[email protected]>
Add support for AP Reset Hold being invoked using the GHCB MSR protocol, available in version 2 of the GHCB specification. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
Version 2 of the GHCB specification introduced advertisement of features that are supported by the Hypervisor. Now that KVM supports version 2 of the GHCB specification, bump the maximum supported protocol version. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
The next generation of SEV is called SEV-SNP (Secure Nested Paging). SEV-SNP builds upon existing SEV and SEV-ES functionality while adding new hardware-based security protection. SEV-SNP adds strong memory encryption and integrity protection to help prevent malicious hypervisor-based attacks such as data replay, memory re-mapping, and more, to create an isolated execution environment. Implement some initial infrastructure in KVM to check/report when SNP is enabled on the system. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: commit fixups, use similar ASID reporting as with SEV/SEV-ES] Signed-off-by: Michael Roth <[email protected]>
TODO: move to using common KVM_SEV_INIT2 interface. The flag discovery mechanism also needs to be changed since it provides no mechanism for discovery outside of trying to trigger the error path by guessing at unimplemented flags. May went to leverage something similar to the KVM_SEV_INIT2 VMSA feature machinery for this as well which involves a KVM device ioctl to return the feature bitmap. The KVM_SNP_INIT command is used by the hypervisor to initialize the SEV-SNP platform context. In a typical workflow, this command should be the first command issued. When creating SEV-SNP guest, the VMM must use this command instead of the KVM_SEV_INIT or KVM_SEV_ES_INIT. The flags value must be zero, it will be extended in future SNP support to communicate the optional features (such as restricted INT injection etc). Co-developed-by: Pavan Kumar Paluri <[email protected]> Signed-off-by: Pavan Kumar Paluri <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
KVM_SEV_SNP_LAUNCH_START begins the launch process for an SEV-SNP guest. The command initializes a cryptographic digest context used to construct the measurement of the guest. If the guest is expected to be migrated, the command also binds a migration agent (MA) to the guest. For more information see the SEV-SNP specification. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: hold sev_deactivate_lock when calling SEV_CMD_SNP_DECOMMISSION] Signed-off-by: Michael Roth <[email protected]>
The KVM_SEV_SNP_LAUNCH_UPDATE command can be used to insert data into the guest's memory. The data is encrypted with the cryptographic context created with the KVM_SEV_SNP_LAUNCH_START. In addition to the inserting data, it can insert a two special pages into the guests memory: the secrets page and the CPUID page. While terminating the guest, reclaim the guest pages added in the RMP table. If the reclaim fails, then the page is no longer safe to be released back to the system and leak them. For more information see the SEV-SNP specification. Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]>
TODO: see if ID block / Author Key enabled bits can be set automatically rather than exposed directly via KVM API. See if there are other opportunities to architect things in user-facing APIs. The KVM_SEV_SNP_LAUNCH_FINISH finalize the cryptographic digest and stores it as the measurement of the guest at launch. While finalizing the launch flow, it also issues the LAUNCH_UPDATE command to encrypt the VMSA pages. If its an SNP guest, then VMSA was added in the RMP entry as a guest owned page and also removed from the kernel direct map so flush it later after it is transitioned back to hypervisor state and restored in the direct map. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Harald Hoyer <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: always measure BSP first to get consistent launch measurements] Signed-off-by: Michael Roth <[email protected]>
SEV-SNP guests are required to perform a GHCB GPA registration. Before using a GHCB GPA for a vCPU the first time, a guest must register the vCPU GHCB GPA. If hypervisor can work with the guest requested GPA then it must respond back with the same GPA otherwise return -1. On VMEXIT, verify that the GHCB GPA matches with the registered value. If a mismatch is detected, then abort the guest. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP table to be private or shared using the Page State Change MSR protocol as defined in the GHCB specification. When using gmem, private/shared memory is allocated through separate pools, and KVM relies on userspace issuing a KVM_SET_MEMORY_ATTRIBUTES KVM ioctl to tell the KVM MMU whether or not a particular GFN should be backed by private memory or not. Forward these page state change requests to userspace so that it can issue the expected KVM ioctls. The KVM MMU will handle updating the RMP entries when it is ready to map a private page into a guest. Define a new KVM_EXIT_VMGEXIT for exits of this type, and structure it so that it can be extended for other cases where VMGEXITs need some level of handling in userspace. Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]>
SEV-SNP VMs can ask the hypervisor to change the page state in the RMP table to be private or shared using the Page State Change NAE event as defined in the GHCB specification version 2. Forward these requests to userspace as KVM_EXIT_VMGEXITs, similar to how it is done for requests that don't use a GHCB page. Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]>
While resolving the RMP page fault, there may be cases where the page level between the RMP entry and TDP does not match and the 2M RMP entry must be split into 4K RMP entries. Or a 2M TDP page need to be broken into multiple of 4K pages. To keep the RMP and TDP page level in sync, zap the gfn range after splitting the pages in the RMP entry. The zap should force the TDP to gets rebuilt with the new page level. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
When SEV-SNP is enabled in the guest, the hardware places restrictions on all memory accesses based on the contents of the RMP table. When hardware encounters RMP check failure caused by the guest memory access it raises the #NPF. The error code contains additional information on the access type. See the APM volume 2 for additional information. When using gmem, RMP faults resulting from mismatches between the state in the RMP table vs. what the guest expects via its page table result in KVM_EXIT_MEMORY_FAULTs being forwarded to userspace to handle. This means the only expected case that needs to be handled in the kernel is when the page size of the entry in the RMP table is larger than the mapping in the nested page table, in which case a PSMASH instruction needs to be issued to split the large RMP entry into individual 4K entries so that subsequent accesses can succeed. Signed-off-by: Brijesh Singh <[email protected]> Co-developed-by: Michael Roth <[email protected]> Signed-off-by: Michael Roth <[email protected]> Signed-off-by: Ashish Kalra <[email protected]>
In preparation to support SEV-SNP AP Creation, use a variable that holds the VMSA physical address rather than converting the virtual address. This will allow SEV-SNP AP Creation to set the new physical address that will be used should the vCPU reset path be taken. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> Signed-off-by: Michael Roth <[email protected]>
Add support for the SEV-SNP AP Creation NAE event. This allows SEV-SNP guests to alter the register state of the APs on their own. This allows the guest a way of simulating INIT-SIPI. A new event, KVM_REQ_UPDATE_PROTECTED_GUEST_STATE, is created and used so as to avoid updating the VMSA pointer while the vCPU is running. For CREATE The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID. The GPA is saved in the svm struct of the target vCPU, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to the vCPU and then the vCPU is kicked. For CREATE_ON_INIT: The guest supplies the GPA of the VMSA to be used for the vCPU with the specified APIC ID the next time an INIT is performed. The GPA is saved in the svm struct of the target vCPU. For DESTROY: The guest indicates it wishes to stop the vCPU. The GPA is cleared from the svm struct, the KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event is added to vCPU and then the vCPU is kicked. The KVM_REQ_UPDATE_PROTECTED_GUEST_STATE event handler will be invoked as a result of the event or as a result of an INIT. The handler sets the vCPU to the KVM_MP_STATE_UNINITIALIZED state, so that any errors will leave the vCPU as not runnable. Any previous VMSA pages that were installed as part of an SEV-SNP AP Creation NAE event are un-pinned. If a new VMSA is to be installed, the VMSA guest page is pinned and set as the VMSA in the vCPU VMCB and the vCPU state is set to KVM_MP_STATE_RUNNABLE. If a new VMSA is not to be installed, the VMSA is cleared in the vCPU VMCB and the vCPU state is left as KVM_MP_STATE_UNINITIALIZED to prevent it from being run. Signed-off-by: Tom Lendacky <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: add handling for gmem] Signed-off-by: Michael Roth <[email protected]>
GHCB version 2 adds support for a GHCB-based termination request that a guest can issue when it reaches an error state and wishes to inform the hypervisor that it should be terminated. Implement support for that similarly to GHCB MSR-based termination requests that are already available to SEV-ES guests via earlier versions of the GHCB protocol. See 'Termination Request' in the 'Invoking VMGEXIT' section of the GHCB specification for more details. Signed-off-by: Michael Roth <[email protected]>
This will handle RMP table updates and direct map changes needed to put a page into a private state before mapping it into an SEV-SNP guest. Signed-off-by: Michael Roth <[email protected]>
Implement a platform hook to do the work of restoring the direct map entries of gmem-managed pages and transitioning the corresponding RMP table entries back to the default shared/hypervisor-owned state. Signed-off-by: Michael Roth <[email protected]>
In the case of SEV-SNP, whether or not a 2MB page can be mapped via a 2MB mapping in the guest's nested page table depends on whether or not any subpages within the range have already been initialized as private in the RMP table. The existing mixed-attribute tracking in KVM is insufficient here, for instance: - gmem allocates 2MB page - guest issues PVALIDATE on 2MB page - guest later converts a subpage to shared - SNP host code issues PSMASH to split 2MB RMP mapping to 4K - KVM MMU splits NPT mapping to 4K At this point there are no mixed attributes, and KVM would normally allow for 2MB NPT mappings again, but this is actually not allowed because the RMP table mappings are 4K and cannot be promoted on the hypervisor side, so the NPT mappings must still be limited to 4K to match this. Add a hook to determine the max NPT mapping size in situations like this. Signed-off-by: Michael Roth <[email protected]>
With SNP/guest_memfd, private/encrypted memory should not be mappable, and MMU notifications for HVA-mapped memory will only be relevant to unencrypted guest memory. Therefore, the rationale behind issuing a wbinvd_on_all_cpus() in sev_guest_memory_reclaimed() should not apply for SNP guests and can be ignored. Signed-off-by: Ashish Kalra <[email protected]> [mdr: Add some clarifications in commit] Signed-off-by: Michael Roth <[email protected]>
Add a module parameter than can be used to enable or disable the SEV-SNP feature. Now that KVM contains the support for the SNP set the GHCB hypervisor feature flag to indicate that SNP is supported. Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]>
Version 2 of GHCB specification added support for the SNP Guest Request Message NAE event. The event allows for an SEV-SNP guest to make requests to the SEV-SNP firmware through hypervisor using the SNP_GUEST_REQUEST API defined in the SEV-SNP firmware specification. This is used by guests primarily to request attestation reports from firmware. There are other request types are available as well, but the specifics of what guest requests are being made are opaque to the hypervisor, which only serves as a proxy for the guest requests and firmware responses. Implement handling for these events. Co-developed-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Alexey Kardashevskiy <[email protected]> Signed-off-by: Brijesh Singh <[email protected]> Signed-off-by: Ashish Kalra <[email protected]> [mdr: ensure FW command failures are indicated to guest, drop extended request handling to be re-written as separate patch, massage commit] Signed-off-by: Michael Roth <[email protected]>
These commands can be used to create a transaction such that commands that update the reported TCB, such as SNP_SET_CONFIG/SNP_COMMIT, and updates to userspace-supplied certificates, can be handled atomically relative to any extended guest requests issued by any SNP guests while the updates are taking place. Without this interface, there is a risk that a guest will be given certificate information that does not correspond to the VCEK/VLEK used to sign a particular attestation report unless all the running guests are paused in advance, which would cause disruption to all guests in the system even if no attestation requests are being made. Even then, care is needed to ensure that KVM does not pass along certificate information that was fetched from userspace in advance of the guest being paused. This interface also provides some versatility with how similar firmware maintenance activity can be handled in the future without passing unnecessary management complexity on to userspace. Signed-off-by: Michael Roth <[email protected]>
Version 2 of GHCB specification added support for the SNP Extended Guest Request Message NAE event. This event serves a nearly identical purpose to the previously-added SNP_GUEST_REQUEST event, but allows for additional certificate data to be supplied via an additional guest-supplied buffer to be used mainly for verifying the signature of an attestation report as returned by firmware. This certificate data is supplied by userspace, so unlike with SNP_GUEST_REQUEST events, SNP_EXTENDED_GUEST_REQUEST events are first forwarded to userspace via a KVM_EXIT_VMGEXIT exit type, and then the firmware request is made only afterward. Implement handling for these events. Since there is a potential for race conditions where the userspace-supplied certificate data may be out-of-sync relative to the reported TCB that firmware will use when signing attestation reports, make use of the transaction/synchronization mechanisms added by the SNP_SET_CONFIG_{START,END} SEV device ioctls such that the guest will be told to retry the request when an update to reported TCB or userspace-supplied certificates may have occurred or is in progress while an extended guest request is being processed. Signed-off-by: Michael Roth <[email protected]>
mdroth
pushed a commit
that referenced
this pull request
Aug 19, 2024
…play During inode logging (and log replay too), we are holding a transaction handle and we often need to call btrfs_iget(), which will read an inode from its subvolume btree if it's not loaded in memory and that results in allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode() callback - and this may recurse into the filesystem in case we are under memory pressure and attempt to commit the current transaction, resulting in a deadlock since the logging (or log replay) task is holding a transaction handle open. Syzbot reported this with the following stack traces: WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted ------------------------------------------------------ syz-executor.1/9919 is trying to acquire lock: ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 but task is already holding lock: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&ei->log_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752 btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 new_sync_write fs/read_write.c:497 [inline] vfs_write+0x6b6/0x1140 fs/read_write.c:590 ksys_write+0x12f/0x260 fs/read_write.c:643 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170 close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324 generic_shutdown_super+0x159/0x3d0 fs/super.c:642 kill_anon_super+0x3a/0x60 fs/super.c:1226 btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096 deactivate_locked_super+0xbe/0x1a0 fs/super.c:473 deactivate_super+0xde/0x100 fs/super.c:506 cleanup_mnt+0x222/0x450 fs/namespace.c:1267 task_work_run+0x14e/0x250 kernel/task_work.c:180 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218 __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: __lock_release kernel/locking/lockdep.c:5468 [inline] lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774 percpu_up_read include/linux/percpu-rwsem.h:99 [inline] __sb_end_write include/linux/fs.h:1650 [inline] sb_end_intwrite include/linux/fs.h:1767 [inline] __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071 btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 dput.part.0+0x4b1/0x9b0 fs/dcache.c:845 dput+0x1f/0x30 fs/dcache.c:835 ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132 ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182 destroy_inode+0xc4/0x1b0 fs/inode.c:311 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 shrink_kill fs/dcache.c:1048 [inline] shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075 prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156 super_cache_scan+0x32a/0x550 fs/super.c:221 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab_memcg mm/shrinker.c:548 [inline] shrink_slab+0xa87/0x1310 mm/shrinker.c:626 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (fs_reclaim){+.+.}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e other info that might help us debug this: Chain exists of: fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->log_mutex); lock(btrfs_trans_num_extwriters); lock(&ei->log_mutex); lock(fs_reclaim); *** DEADLOCK *** 7 locks held by syz-executor.1/9919: #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072 #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline] #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385 #2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388 #3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952 #4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 #5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 #6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 stack backtrace: CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf7334579 Code: b8 01 10 06 03 (...) RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fix this by ensuring we are under a NOFS scope whenever we call btrfs_iget() during inode logging and log replay. Reported-by: [email protected] Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: 712e36c ("btrfs: use GFP_KERNEL in btrfs_alloc_inode") Reviewed-by: Johannes Thumshirn <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
mdroth
pushed a commit
that referenced
this pull request
Aug 19, 2024
The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents. Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem. To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written(). Heming Zhao said: ------ PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error" PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA" #0 machine_kexec at ffffffff8c069932 #1 __crash_kexec at ffffffff8c1338fa #2 panic at ffffffff8c1d69b9 #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2] #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2] #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2] #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2] torvalds#7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2] torvalds#8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2] torvalds#9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2] torvalds#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2] torvalds#11 dio_complete at ffffffff8c2b9fa7 torvalds#12 do_blockdev_direct_IO at ffffffff8c2bc09f torvalds#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2] torvalds#14 generic_file_direct_write at ffffffff8c1dcf14 torvalds#15 __generic_file_write_iter at ffffffff8c1dd07b torvalds#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2] torvalds#17 aio_write at ffffffff8c2cc72e torvalds#18 kmem_cache_alloc at ffffffff8c248dde torvalds#19 do_io_submit at ffffffff8c2ccada torvalds#20 do_syscall_64 at ffffffff8c004984 torvalds#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Jan Kara <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Reviewed-by: Heming Zhao <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Gang He <[email protected]> Cc: Jun Piao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
mdroth
pushed a commit
that referenced
this pull request
Aug 19, 2024
A kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] <TASK> [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in CMA area, so fast GUP will fail then fallback to the slow path due to the longterm pinnalbe check in try_grab_folio(). The slow path will try to pin the pages then migrate them out of CMA area. But the slow path also uses try_grab_folio() to pin the page, it will also fail due to the same check then the above warning is triggered. In addition, the try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page() to try_grab_folio(), and use them in the proper paths. This solves both the abuse and the kernel warning. The proper naming makes their usecase more clear and should prevent from abusing in the future. peterx said: : The user will see the pin fails, for gpu-slow it further triggers the WARN : right below that failure (as in the original report): : : folio = try_grab_folio(page, page_increm - 1, : foll_flags); : if (WARN_ON_ONCE(!folio)) { <------------------------ here : /* : * Release the 1st page ref if the : * folio is problematic, fail hard. : */ : gup_put_folio(page_folio(page), 1, : foll_flags); : ret = -EFAULT; : goto out; : } [1] https://lore.kernel.org/linux-mm/[email protected]/ [[email protected]: fix implicit declaration of function try_grab_folio_fast] Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Fixes: 57edfcf ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <[email protected]> Reported-by: yangge <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Peter Xu <[email protected]> Cc: <[email protected]> [6.6+] Signed-off-by: Andrew Morton <[email protected]>
mdroth
force-pushed
the
snp-host-latest
branch
from
August 19, 2024 19:33
05b1014
to
85ef1ac
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.