Commit 9f16d5e6 authored Nov 23, 2024 by Linus Torvalds

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The biggest change here is eliminating the awful idea that KVM had of
  essentially guessing which pfns are refcounted pages.

  The reason to do so was that KVM needs to map both non-refcounted
  pages (for example BARs of VFIO devices) and VM_PFNMAP/VM_MIXEDMAP
  VMAs that contain refcounted pages.

  However, the result was security issues in the past, and more recently
  the inability to map VM_IO and VM_PFNMAP memory that _is_ backed by
  struct page but is not refcounted. In particular this broke virtio-gpu
  blob resources (which directly map host graphics buffers into the
  guest as "vram" for the virtio-gpu device) with the amdgpu driver,
  because amdgpu allocates non-compound higher order pages and the tail
  pages could not be mapped into KVM.

  This requires adjusting all uses of struct page in the
  per-architecture code, to always work on the pfn whenever possible.
  The large series that did this, from David Stevens and Sean
  Christopherson, also cleaned up substantially the set of functions
  that provided arch code with the pfn for a host virtual address.

  The previous maze of twisty little passages, all different, is
  replaced by five functions (__gfn_to_page, __kvm_faultin_pfn, the
  non-__ versions of these two, and kvm_prefetch_pages) saving almost
  200 lines of code.

  ARM:

   - Support for stage-1 permission indirection (FEAT_S1PIE) and
     permission overlays (FEAT_S1POE), including nested virt + the
     emulated page table walker

   - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This
     call was introduced in PSCIv1.3 as a mechanism to request
     hibernation, similar to the S4 state in ACPI

   - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
     part of it, introduce trivial initialization of the host's MPAM
     context so KVM can use the corresponding traps

   - PMU support under nested virtualization, honoring the guest
     hypervisor's trap configuration and event filtering when running a
     nested guest

   - Fixes to vgic ITS serialization where stale device/interrupt table
     entries are not zeroed when the mapping is invalidated by the VM

   - Avoid emulated MMIO completion if userspace has requested
     synchronous external abort injection

   - Various fixes and cleanups affecting pKVM, vCPU initialization, and
     selftests

  LoongArch:

   - Add iocsr and mmio bus simulation in kernel.

   - Add in-kernel interrupt controller emulation.

   - Add support for virtualization extensions to the eiointc irqchip.

  PPC:

   - Drop lingering and utterly obsolete references to PPC970 KVM, which
     was removed 10 years ago.

   - Fix incorrect documentation references to non-existing ioctls

  RISC-V:

   - Accelerate KVM RISC-V when running as a guest

   - Perf support to collect KVM guest statistics from host side

  s390:

   - New selftests: more ucontrol selftests and CPU model sanity checks

   - Support for the gen17 CPU model

   - List registers supported by KVM_GET/SET_ONE_REG in the
     documentation

  x86:

   - Clean up KVM's handling of Accessed and Dirty bits to dedup code,
     improve documentation, and harden against unexpected changes.

     Even if the hardware A/D tracking is disabled, it is possible to
     use the hardware-defined A/D bits to track if a PFN is Accessed
     and/or Dirty, and that removes a lot of special cases.

   - Elide TLB flushes when aging secondary PTEs, as has been done in
     x86's primary MMU for over 10 years.

   - Recover huge pages in-place in the TDP MMU when dirty page logging
     is toggled off, instead of zapping them and waiting until the page
     is re-accessed to create a huge mapping. This reduces vCPU jitter.

   - Batch TLB flushes when dirty page logging is toggled off. This
     reduces the time it takes to disable dirty logging by ~3x.

   - Remove the shrinker that was (poorly) attempting to reclaim shadow
     page tables in low-memory situations.

   - Clean up and optimize KVM's handling of writes to
     MSR_IA32_APICBASE.

   - Advertise CPUIDs for new instructions in Clearwater Forest

   - Quirk KVM's misguided behavior of initializing certain feature MSRs
     to their maximum supported feature set, which can result in KVM
     creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to
     a non-zero value results in the vCPU having invalid state if
     userspace hides PDCM from the guest, which in turn can lead to
     save/restore failures.

   - Fix KVM's handling of non-canonical checks for vCPUs that support
     LA57 to better follow the "architecture", in quotes because the
     actual behavior is poorly documented. E.g. most MSR writes and
     descriptor table loads ignore CR4.LA57 and operate purely on
     whether the CPU supports LA57.

   - Bypass the register cache when querying CPL from kvm_sched_out(),
     as filling the cache from IRQ context is generally unsafe; harden
     the cache accessors to try to prevent similar issues from occurring
     in the future. The issue that triggered this change was already
     fixed in 6.12, but was still kinda latent.

   - Advertise AMD_IBPB_RET to userspace, and fix a related bug where
     KVM over-advertises SPEC_CTRL when trying to support cross-vendor
     VMs.

   - Minor cleanups

   - Switch hugepage recovery thread to use vhost_task.

     These kthreads can consume significant amounts of CPU time on
     behalf of a VM or in response to how the VM behaves (for example
     how it accesses its memory); therefore KVM tried to place the
     thread in the VM's cgroups and charge the CPU time consumed by that
     work to the VM's container.

     However the kthreads did not process SIGSTOP/SIGCONT, and therefore
     cgroups which had KVM instances inside could not complete freezing.

     Fix this by replacing the kthread with a PF_USER_WORKER thread, via
     the vhost_task abstraction. Another 100+ lines removed, with
     generally better behavior too like having these threads properly
     parented in the process tree.

   - Revert a workaround for an old CPU erratum (Nehalem/Westmere) that
     didn't really work; there was really nothing to work around anyway:
     the broken patch was meant to fix nested virtualization, but the
     PERF_GLOBAL_CTRL MSR is virtualized and therefore unaffected by the
     erratum.

   - Fix 6.12 regression where CONFIG_KVM will be built as a module even
     if asked to be builtin, as long as neither KVM_INTEL nor KVM_AMD is
     'y'.

  x86 selftests:

   - x86 selftests can now use AVX.

  Documentation:

   - Use rST internal links

   - Reorganize the introduction to the API document

  Generic:

   - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock
     instead of RCU, so that running a vCPU on a different task doesn't
     encounter long delays due to having to wait for all CPUs to become
     quiescent.

     In general both reads and writes are rare, but userspace that
     supports confidential computing is introducing the use of "helper"
     vCPUs that may jump from one host processor to another. Those will
     be very happy to trigger a synchronize_rcu(), and the effect on
     performance is quite the disaster"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (298 commits)
  KVM: x86: Break CONFIG_KVM_X86's direct dependency on KVM_INTEL || KVM_AMD
  KVM: x86: add back X86_LOCAL_APIC dependency
  Revert "KVM: VMX: Move LOAD_IA32_PERF_GLOBAL_CTRL errata handling out of setup_vmcs_config()"
  KVM: x86: switch hugepage recovery thread to vhost_task
  KVM: x86: expose MSR_PLATFORM_INFO as a feature MSR
  x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest
  Documentation: KVM: fix malformed table
  irqchip/loongson-eiointc: Add virt extension support
  LoongArch: KVM: Add irqfd support
  LoongArch: KVM: Add PCHPIC user mode read and write functions
  LoongArch: KVM: Add PCHPIC read and write functions
  LoongArch: KVM: Add PCHPIC device support
  LoongArch: KVM: Add EIOINTC user mode read and write functions
  LoongArch: KVM: Add EIOINTC read and write functions
  LoongArch: KVM: Add EIOINTC device support
  LoongArch: KVM: Add IPI user mode read and write function
  LoongArch: KVM: Add IPI read and write function
  LoongArch: KVM: Add IPI device support
  LoongArch: KVM: Add iocsr and mmio bus simulation in kernel
  KVM: arm64: Pass on SVE mapping failures
  ...

parents 42d9e8b7 9ee62c33
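
The non-canonical check item in the x86 section of the message can be illustrated with a short sketch. This is not KVM's code: the helper names below are hypothetical, and the only facts taken from the message are that most MSR writes and descriptor table loads ignore CR4.LA57 and depend purely on whether the CPU supports LA57. The canonical test itself is the usual sign-extension check for 48-bit or 57-bit virtual addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical for a given virtual-address width if bits
 * 63..(vaddr_bits - 1) are all copies of bit (vaddr_bits - 1).
 * The signed right shift is implementation-defined in ISO C but is an
 * arithmetic shift on mainstream compilers.
 */
static bool is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
    assert(vaddr_bits >= 1 && vaddr_bits <= 64);
    unsigned int shift = 64 - vaddr_bits;
    return (uint64_t)((int64_t)(addr << shift) >> shift) == addr;
}

/* Hypothetical helper: width used for MSR writes and descriptor table
 * loads, which look at CPU support for LA57 rather than CR4.LA57. */
static unsigned int vaddr_bits_for_msr_write(bool cpu_supports_la57)
{
    return cpu_supports_la57 ? 57 : 48;
}

/* Hypothetical helper: width used for ordinary memory accesses, which
 * follow the current paging mode (CR4.LA57). */
static unsigned int vaddr_bits_for_mem_access(bool cr4_la57)
{
    return cr4_la57 ? 57 : 48;
}
```

With these helpers, an address with only bit 47 set is rejected for a memory access while CR4.LA57 is clear, yet accepted for an MSR write on a CPU that supports LA57.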

Documentation/arch/arm64/cpu-feature-registers.rst

+2 −0

		@@ -152,6 +152,8 @@ infrastructure:
		+------------------------------+---------+---------+
		| DIT                          | [51-48] |    y    |
		+------------------------------+---------+---------+
		| MPAM                         | [43-40] |    n    |
		+------------------------------+---------+---------+
		| SVE                          | [35-32] |    y    |
		+------------------------------+---------+---------+
		| GIC                          | [27-24] |    n    |

Documentation/arch/loongarch/irq-chip-model.rst

+64 −0

		@@ -85,6 +85,70 @@ to CPUINTC directly::
		| Devices |
		+---------+

		Virtual Extended IRQ model
		==========================

		In this model, IPI (Inter-Processor Interrupt) and CPU Local Timer interrupt
		go to CPUINTC directly, CPU UARTS interrupts go to PCH-PIC, while all other
		devices interrupts go to PCH-PIC/PCH-MSI and gathered by V-EIOINTC (Virtual
		Extended I/O Interrupt Controller), and then go to CPUINTC directly::

		       +-----+    +-------------------+     +-------+
		       | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer |
		       +-----+    +-------------------+     +-------+
		                            ^
		                            |
		                      +-----------+
		                      | V-EIOINTC |
		                      +-----------+
		                       ^         ^
		                       |         |
		                +---------+ +---------+
		                | PCH-PIC | | PCH-MSI |
		                +---------+ +---------+
		                  ^      ^          ^
		                  |      |          |
		           +--------+ +---------+ +---------+
		           | UARTs  | | Devices | | Devices |
		           +--------+ +---------+ +---------+


		Description
		-----------
		V-EIOINTC (Virtual Extended I/O Interrupt Controller) is an extension of
		EIOINTC, it only works in VM mode which runs in KVM hypervisor. Interrupts can
		be routed to up to four vCPUs via standard EIOINTC, however with V-EIOINTC
		interrupts can be routed to up to 256 virtual cpus.

		With standard EIOINTC, interrupt routing setting includes two parts: eight
		bits for CPU selection and four bits for CPU IP (Interrupt Pin) selection.
		For CPU selection there is four bits for EIOINTC node selection, four bits
		for EIOINTC CPU selection. Bitmap method is used for CPU selection and
		CPU IP selection, so interrupt can only route to CPU0 - CPU3 and IP0-IP3 in
		one EIOINTC node.

		With V-EIOINTC it supports to route more CPUs and CPU IP (Interrupt Pin),
		there are two newly added registers with V-EIOINTC.

		EXTIOI_VIRT_FEATURES
		--------------------
		This register is read-only register, which indicates supported features with
		V-EIOINTC. Feature EXTIOI_HAS_INT_ENCODE and EXTIOI_HAS_CPU_ENCODE is added.

		Feature EXTIOI_HAS_INT_ENCODE is part of standard EIOINTC. If it is 1, it
		indicates that CPU Interrupt Pin selection can be normal method rather than
		bitmap method, so interrupt can be routed to IP0 - IP15.

		Feature EXTIOI_HAS_CPU_ENCODE is extension of V-EIOINTC. If it is 1, it
		indicates that CPU selection can be normal method rather than bitmap method,
		so interrupt can be routed to CPU0 - CPU255.

		EXTIOI_VIRT_CONFIG
		------------------
		This register is read-write register, for compatibility interrupt routed uses
		the default method which is the same with standard EIOINTC. If the bit is set
		with 1, it indicated HW to use normal method rather than bitmap method.

		Advanced Extended IRQ model
		===========================
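
The difference between the bitmap method and the normal method described in the hunk above can be sketched as follows. This is a minimal illustration, not the EIOINTC register layout: the function names and the idea of returning the raw field value are assumptions made for the example. A bitmap field spends one bit per target, so four bits reach only four targets (CPU0 - CPU3 or IP0 - IP3), while a normally encoded field stores the index itself, so four bits reach IP0 - IP15 and eight bits reach CPU0 - CPU255.

```c
#include <assert.h>
#include <stdint.h>

/* Bitmap method: one bit per target, so a 4-bit field selects one of
 * four targets. Hypothetical helper for illustration only. */
static uint8_t route_encode_bitmap(unsigned int index)
{
    assert(index < 4);
    return (uint8_t)(1u << index);
}

/* Normal method: the field holds the target index as a plain binary
 * number, so field_bits bits select one of 2^field_bits targets.
 * Hypothetical helper for illustration only. */
static uint8_t route_encode_normal(unsigned int index, unsigned int field_bits)
{
    assert(field_bits >= 1 && field_bits <= 8);
    assert(index < (1u << field_bits));
    return (uint8_t)index;
}
```

For example, selecting target 3 yields 0x8 with the bitmap method and 0x3 with the normal method, and only the normal method can express CPU 255 in an eight-bit field.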

Documentation/translations/zh_CN/arch/loongarch/irq-chip-model.rst

+55 −0

		@@ -87,6 +87,61 @@ PCH-LPC/PCH-MSI，然后被EIOINTC统一收集，再直接到达CPUINTC::
		| Devices |
		+---------+

		虚拟扩展IRQ模型
		===============

		在这种模型里面, IPI(Inter-Processor Interrupt) 和CPU本地时钟中断直接发送到CPUINTC,
		CPU串口 (UARTs) 中断发送到PCH-PIC, 而其他所有设备的中断则分别发送到所连接的PCH_PIC/
		PCH-MSI, 然后V-EIOINTC统一收集，再直接到达CPUINTC::

		       +-----+    +-------------------+     +-------+
		       | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer |
		       +-----+    +-------------------+     +-------+
		                            ^
		                            |
		                      +-----------+
		                      | V-EIOINTC |
		                      +-----------+
		                       ^         ^
		                       |         |
		                +---------+ +---------+
		                | PCH-PIC | | PCH-MSI |
		                +---------+ +---------+
		                  ^      ^          ^
		                  |      |          |
		           +--------+ +---------+ +---------+
		           | UARTs  | | Devices | | Devices |
		           +--------+ +---------+ +---------+

		V-EIOINTC 是EIOINTC的扩展, 仅工作在虚拟机模式下, 中断经EIOINTC最多可个路由到
		４个虚拟CPU. 但中断经V-EIOINTC最多可个路由到256个虚拟CPU.

		传统的EIOINTC中断控制器，中断路由分为两个部分：8比特用于控制路由到哪个CPU，
		4比特用于控制路由到特定CPU的哪个中断管脚。控制CPU路由的8比特前4比特用于控制
		路由到哪个EIOINTC节点，后4比特用于控制此节点哪个CPU。中断路由在选择CPU路由
		和CPU中断管脚路由时，使用bitmap编码方式而不是正常编码方式，所以对于一个
		EIOINTC中断控制器节点，中断只能路由到CPU0 - CPU3，中断管脚IP0-IP3。

		V-EIOINTC新增了两个寄存器，支持中断路由到更多CPU个和中断管脚。

		V-EIOINTC功能寄存器
		-------------------
		功能寄存器是只读寄存器，用于显示V-EIOINTC支持的特性，目前两个支持两个特性
		EXTIOI_HAS_INT_ENCODE 和 EXTIOI_HAS_CPU_ENCODE。

		特性EXTIOI_HAS_INT_ENCODE是传统EIOINTC中断控制器的一个特性，如果此比特为1，
		显示CPU中断管脚路由方式支持正常编码，而不是bitmap编码，所以中断可以路由到
		管脚IP0 - IP15。

		特性EXTIOI_HAS_CPU_ENCODE是V-EIOINTC新增特性，如果此比特为1，表示CPU路由
		方式支持正常编码，而不是bitmap编码，所以中断可以路由到CPU0 - CPU255。

		V-EIOINTC配置寄存器
		-------------------
		配置寄存器是可读写寄存器，为了兼容性考虑，如果不写此寄存器，中断路由采用
		和传统EIOINTC相同的路由设置。如果对应比特设置为1，表示采用正常路由方式而
		不是bitmap编码的路由方式。

		高级扩展IRQ模型
		===============

Documentation/virt/kvm/api.rst

+116 −74

		@@ -7,8 +7,19 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation
		1. General description
		======================

		The kvm API is a set of ioctls that are issued to control various aspects
		of a virtual machine. The ioctls belong to the following classes:
		The kvm API is centered around different kinds of file descriptors
		and ioctls that can be issued to these file descriptors. An initial
		open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
		can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
		handle will create a VM file descriptor which can be used to issue VM
		ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
		create a virtual cpu or device and return a file descriptor pointing to
		the new resource.

		In other words, the kvm API is a set of ioctls that are issued to
		different kinds of file descriptor in order to control various aspects of
		a virtual machine. Depending on the file descriptor that accepts them,
		ioctls belong to the following classes:

		- System ioctls: These query and set global attributes which affect the
		whole kvm subsystem. In addition a system ioctl is used to create
		@@ -35,18 +46,19 @@ of a virtual machine. The ioctls belong to the following classes:
		device ioctls must be issued from the same process (address space) that
		was used to create the VM.

		2. File descriptors
		===================
		While most ioctls are specific to one kind of file descriptor, in some
		cases the same ioctl can belong to more than one class.

		The KVM API grew over time. For this reason, KVM defines many constants
		of the form ``KVM_CAP_*``, each corresponding to a set of functionality
		provided by one or more ioctls. Availability of these "capabilities" can
		be checked with :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. Some
		capabilities also need to be enabled for VMs or VCPUs where their
		functionality is desired (see :ref:`cap_enable` and :ref:`cap_enable_vm`).

		The kvm API is centered around file descriptors. An initial
		open("/dev/kvm") obtains a handle to the kvm subsystem; this handle
		can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this
		handle will create a VM file descriptor which can be used to issue VM
		ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will
		create a virtual cpu or device and return a file descriptor pointing to
		the new resource. Finally, ioctls on a vcpu or device fd can be used
		to control the vcpu or device. For vcpus, this includes the important
		task of actually running guest code.

		2. Restrictions
		===============

		In general file descriptors can be migrated among processes by means
		of fork() and the SCM_RIGHTS facility of unix domain socket. These
		@@ -96,12 +108,9 @@ description:
		Capability:
		which KVM extension provides this ioctl. Can be 'basic',
		which means that is will be provided by any kernel that supports
		API version 12 (see section 4.1), a KVM_CAP_xyz constant, which
		means availability needs to be checked with KVM_CHECK_EXTENSION
		(see section 4.4), or 'none' which means that while not all kernels
		support this ioctl, there's no capability bit to check its
		availability: for kernels that don't support the ioctl,
		the ioctl returns -ENOTTY.
		API version 12 (see :ref:`KVM_GET_API_VERSION <KVM_GET_API_VERSION>`),
		or a KVM_CAP_xyz constant that can be checked with
		:ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`.

		Architectures:
		which instruction set architectures provide this ioctl.
		@@ -118,6 +127,8 @@ description:
		are not detailed, but errors with specific meanings are.


		.. _KVM_GET_API_VERSION:

		4.1 KVM_GET_API_VERSION
		-----------------------

		@@ -246,6 +257,8 @@ This list also varies by kvm version and host processor, but does not change
		otherwise.


		.. _KVM_CHECK_EXTENSION:

		4.4 KVM_CHECK_EXTENSION
		-----------------------

		@@ -288,7 +301,7 @@ the VCPU file descriptor can be mmap-ed, including:

		- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
		KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on
		KVM_CAP_DIRTY_LOG_RING, see section 8.3.
		KVM_CAP_DIRTY_LOG_RING, see :ref:`KVM_CAP_DIRTY_LOG_RING`.


		4.7 KVM_CREATE_VCPU
		@@ -338,8 +351,8 @@ KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual
		cpu's hardware control block.


		4.8 KVM_GET_DIRTY_LOG (vm ioctl)
		--------------------------------
		4.8 KVM_GET_DIRTY_LOG
		---------------------

		:Capability: basic
		:Architectures: all
		@@ -1298,7 +1311,7 @@ See KVM_GET_VCPU_EVENTS for the data structure.

		:Capability: KVM_CAP_DEBUGREGS
		:Architectures: x86
		:Type: vm ioctl
		:Type: vcpu ioctl
		:Parameters: struct kvm_debugregs (out)
		:Returns: 0 on success, -1 on error

		@@ -1320,7 +1333,7 @@ Reads debug registers from the vcpu.

		:Capability: KVM_CAP_DEBUGREGS
		:Architectures: x86
		:Type: vm ioctl
		:Type: vcpu ioctl
		:Parameters: struct kvm_debugregs (in)
		:Returns: 0 on success, -1 on error

		@@ -1429,6 +1442,8 @@ because of a quirk in the virtualization implementation (see the internals
		documentation when it pops into existence).


		.. _KVM_ENABLE_CAP:

		4.37 KVM_ENABLE_CAP
		-------------------

		@@ -2116,8 +2131,8 @@ TLB, prior to calling KVM_RUN on the associated vcpu.

		The "bitmap" field is the userspace address of an array. This array
		consists of a number of bits, equal to the total number of TLB entries as
		determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
		nearest multiple of 64.
		determined by the last successful call to ``KVM_ENABLE_CAP(KVM_CAP_SW_TLB)``,
		rounded up to the nearest multiple of 64.

		Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
		array.
		@@ -2170,42 +2185,6 @@ userspace update the TCE table directly which is useful in some
		circumstances.


		4.63 KVM_ALLOCATE_RMA
		---------------------

		:Capability: KVM_CAP_PPC_RMA
		:Architectures: powerpc
		:Type: vm ioctl
		:Parameters: struct kvm_allocate_rma (out)
		:Returns: file descriptor for mapping the allocated RMA

		This allocates a Real Mode Area (RMA) from the pool allocated at boot
		time by the kernel. An RMA is a physically-contiguous, aligned region
		of memory used on older POWER processors to provide the memory which
		will be accessed by real-mode (MMU off) accesses in a KVM guest.
		POWER processors support a set of sizes for the RMA that usually
		includes 64MB, 128MB, 256MB and some larger powers of two.

		::

		/* for KVM_ALLOCATE_RMA */
		struct kvm_allocate_rma {
		__u64 rma_size;
		};

		The return value is a file descriptor which can be passed to mmap(2)
		to map the allocated RMA into userspace. The mapped area can then be
		passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the
		RMA for a virtual machine. The size of the RMA in bytes (which is
		fixed at host kernel boot time) is returned in the rma_size field of
		the argument structure.

		The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl
		is supported; 2 if the processor requires all virtual machines to have
		an RMA, or 1 if the processor can use an RMA but doesn't require it,
		because it supports the Virtual RMA (VRMA) facility.


		4.64 KVM_NMI
		------------

		@@ -2602,7 +2581,7 @@ Specifically:
		======================= ========= ===== =======================================

		.. [1] These encodings are not accepted for SVE-enabled vcpus. See
		KVM_ARM_VCPU_INIT.
		:ref:`KVM_ARM_VCPU_INIT`.

		The equivalent register content can be accessed via bits [127:0] of
		the corresponding SVE Zn registers instead for vcpus that have SVE
		@@ -3593,6 +3572,27 @@ Errors:
		This ioctl returns the guest registers that are supported for the
		KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.

		Note that s390 does not support KVM_GET_REG_LIST for historical reasons
		(read: nobody cared). The set of registers in kernels 4.x and newer is:

		- KVM_REG_S390_TODPR

		- KVM_REG_S390_EPOCHDIFF

		- KVM_REG_S390_CPU_TIMER

		- KVM_REG_S390_CLOCK_COMP

		- KVM_REG_S390_PFTOKEN

		- KVM_REG_S390_PFCOMPARE

		- KVM_REG_S390_PFSELECT

		- KVM_REG_S390_PP

		- KVM_REG_S390_GBEA


		4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated)
		-----------------------------------------
		@@ -4956,8 +4956,8 @@ Coalesced pio is based on coalesced mmio. There is little difference
		between coalesced mmio and pio except that coalesced pio records accesses
		to I/O ports.

		4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl)
		------------------------------------
		4.117 KVM_CLEAR_DIRTY_LOG
		-------------------------

		:Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
		:Architectures: x86, arm64, mips
		@@ -5093,8 +5093,8 @@ Recognised values for feature:
		Finalizes the configuration of the specified vcpu feature.

		The vcpu must already have been initialised, enabling the affected feature, by
		means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in
		features[].
		means of a successful :ref:`KVM_ARM_VCPU_INIT <KVM_ARM_VCPU_INIT>` call with the
		appropriate flag set in features[].

		For affected vcpu features, this is a mandatory step that must be performed
		before the vcpu is fully usable.
		@@ -5266,7 +5266,7 @@ the cpu reset definition in the POP (Principles Of Operation).
		4.123 KVM_S390_INITIAL_RESET
		----------------------------

		:Capability: none
		:Capability: basic
		:Architectures: s390
		:Type: vcpu ioctl
		:Parameters: none
		@@ -6205,7 +6205,7 @@ applied.
		.. _KVM_ARM_GET_REG_WRITABLE_MASKS:

		4.139 KVM_ARM_GET_REG_WRITABLE_MASKS
		-------------------------------------------
		------------------------------------

		:Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES
		:Architectures: arm64
		@@ -6443,6 +6443,8 @@ the capability to be present.
		`flags` must currently be zero.


		.. _kvm_run:

		5. The kvm_run structure
		========================

		@@ -6855,6 +6857,10 @@ the first `ndata` items (possibly zero) of the data array are valid.
		the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI
		specification.

		- for arm64, data[0] is set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2
		if the guest issued a SYSTEM_OFF2 call according to v1.3 of the PSCI
		specification.

		- for RISC-V, data[0] is set to the value of the second argument of the
		``sbi_system_reset`` call.

		@@ -6888,6 +6894,12 @@ either:
		- Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2
		"Caller responsibilities" for possible return values.

		Hibernation using the PSCI SYSTEM_OFF2 call is enabled when PSCI v1.3
		is enabled. If a guest invokes the PSCI SYSTEM_OFF2 function, KVM will
		exit to userspace with the KVM_SYSTEM_EVENT_SHUTDOWN event type and with
		data[0] set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2. The only
		supported hibernate type for the SYSTEM_OFF2 function is HIBERNATE_OFF.

		::

		/* KVM_EXIT_IOAPIC_EOI */
		@@ -7162,11 +7174,15 @@ primary storage for certain register types. Therefore, the kernel may use the
		values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.


		.. _cap_enable:

		6. Capabilities that can be enabled on vCPUs
		============================================

		There are certain capabilities that change the behavior of the virtual CPU or
		the virtual machine when enabled. To enable them, please see section 4.37.
		the virtual machine when enabled. To enable them, please see
		:ref:`KVM_ENABLE_CAP`.

		Below you can find a list of capabilities and what their effect on the vCPU or
		the virtual machine is when enabling them.

		@@ -7375,7 +7391,7 @@ KVM API and also from the guest.
		sets are supported
		(bitfields defined in arch/x86/include/uapi/asm/kvm.h).

		As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
		As described above in the kvm_sync_regs struct info in section :ref:`kvm_run`,
		KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
		without having to call SET/GET_*REGS". This reduces overhead by eliminating
		repeated ioctl calls for setting and/or getting register values. This is
		@@ -7421,13 +7437,15 @@ Unused bitfields in the bitarrays must be set to zero.

		This capability connects the vcpu to an in-kernel XIVE device.

		.. _cap_enable_vm:

		7. Capabilities that can be enabled on VMs
		==========================================

		There are certain capabilities that change the behavior of the virtual
		machine when enabled. To enable them, please see section 4.37. Below
		you can find a list of capabilities and what their effect on the VM
		is when enabling them.
		machine when enabled. To enable them, please see section
		:ref:`KVM_ENABLE_CAP`. Below you can find a list of capabilities and
		what their effect on the VM is when enabling them.

		The following information is provided along with the description:

		@@ -8107,6 +8125,28 @@ KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM
		or moved memslot isn't reachable, i.e KVM
		_may_ invalidate only SPTEs related to the
		memslot.

		KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the
		vCPU's MSR_IA32_PERF_CAPABILITIES (0x345),
		MSR_IA32_ARCH_CAPABILITIES (0x10a),
		MSR_PLATFORM_INFO (0xce), and all VMX MSRs
		(0x480..0x492) to the maximal capabilities
		supported by KVM. KVM also sets
		MSR_IA32_UCODE_REV (0x8b) to an arbitrary
		value (which is different for Intel vs.
		AMD). Lastly, when guest CPUID is set (by
		userspace), KVM modifies select VMX MSR
		fields to force consistency between guest
		CPUID and L2's effective ISA. When this
		quirk is disabled, KVM zeroes the vCPU's MSR
		values (with two exceptions, see below),
		i.e. treats the feature MSRs like CPUID
		leaves and gives userspace full control of
		the vCPU model definition. This quirk does
		not affect VMX MSRs CR0/CR4_FIXED1 (0x487
		and 0x489), as KVM does now allow them to
		be set by userspace (KVM sets them based on
		guest CPUID, for safety purposes).
		=================================== ============================================

		7.32 KVM_CAP_MAX_VCPU_ID
		@@ -8588,6 +8628,8 @@ guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
		(0x40000001). Otherwise, a guest may use the paravirtual features
		regardless of what has actually been exposed through the CPUID leaf.

		.. _KVM_CAP_DIRTY_LOG_RING:

		8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
		----------------------------------------------------------
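
The file descriptor hierarchy described in the rewritten introduction of api.rst (system fd, then VM fd, then vCPU fd) can be sketched in C. This is a minimal sketch, assuming a Linux host with the uapi header `<linux/kvm.h>` installed; the function name `kvm_walk_fds` is hypothetical. It only creates and closes the descriptors, and it reports failure rather than aborting when `/dev/kvm` is missing or inaccessible.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Walks the hierarchy: open("/dev/kvm") gives the system fd,
 * KVM_CREATE_VM on it gives a VM fd, and KVM_CREATE_VCPU on the VM fd
 * gives a vCPU fd. Returns the KVM API version if every step worked,
 * or -1 if any step failed.
 */
static int kvm_walk_fds(void)
{
    int version = -1;

    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return -1;

    /* System ioctls are issued on the /dev/kvm handle. */
    int api = ioctl(kvm_fd, KVM_GET_API_VERSION, 0UL);
    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0UL);

    if (api >= 0 && vm_fd >= 0) {
        /* VM ioctls are issued on the VM fd; vCPU id 0 is used here. */
        int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0UL);
        if (vcpu_fd >= 0) {
            version = api;
            close(vcpu_fd);
        }
    }

    if (vm_fd >= 0)
        close(vm_fd);
    close(kvm_fd);
    return version;
}
```

On a host where KVM is usable the function is expected to return `KVM_API_VERSION`; everywhere else it returns -1.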

Documentation/virt/kvm/locking.rst

+41 −39

		@@ -135,8 +135,8 @@ We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap.
		For direct sp, we can easily avoid it since the spte of direct sp is fixed
		to gfn. For indirect sp, we disabled fast page fault for simplicity.

		A solution for indirect sp could be to pin the gfn, for example via
		gfn_to_pfn_memslot_atomic, before the cmpxchg. After the pinning:
		A solution for indirect sp could be to pin the gfn before the cmpxchg. After
		the pinning:

		- We have held the refcount of pfn; that means the pfn can not be freed and
		be reused for another gfn.
		@@ -147,22 +147,22 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn.

		2) Dirty bit tracking

		In the origin code, the spte can be fast updated (non-atomically) if the
		In the original code, the spte can be fast updated (non-atomically) if the
		spte is read-only and the Accessed bit has already been set since the
		Accessed bit and Dirty bit can not be lost.

		But it is not true after fast page fault since the spte can be marked
		writable between reading spte and updating spte. Like below case:

		+------------------------------------------------------------------------+
		+-------------------------------------------------------------------------+
		| At the beginning:: |
		| |
		| spte.W = 0 |
		| spte.Accessed = 1 |
		+------------------------------------+-----------------------------------+
		+-------------------------------------+-----------------------------------+
		| CPU 0: | CPU 1: |
		+------------------------------------+-----------------------------------+
		| In mmu_spte_clear_track_bits():: | |
		+-------------------------------------+-----------------------------------+
		| In mmu_spte_update():: | |
		| | |
		| old_spte = *spte; | |
		| | |
		@@ -170,8 +170,8 @@ writable between reading spte and updating spte. Like below case:
		| /* 'if' condition is satisfied. */ | |
		| if (old_spte.Accessed == 1 && | |
		| old_spte.W == 0) | |
		| spte = 0ull; | |
		+------------------------------------+-----------------------------------+
		| spte = new_spte; | |
		+-------------------------------------+-----------------------------------+
		| | on fast page fault path:: |
		| | |
		| | spte.W = 1 |
		@@ -179,17 +179,19 @@ writable between reading spte and updating spte. Like below case:
		| | memory write on the spte:: |
		| | |
		| | spte.Dirty = 1 |
		+------------------------------------+-----------------------------------+
		+-------------------------------------+-----------------------------------+
		| :: | |
		| | |
		| else | |
		| old_spte = xchg(spte, 0ull) | |
		| if (old_spte.Accessed == 1) | |
		| kvm_set_pfn_accessed(spte.pfn);| |
		| if (old_spte.Dirty == 1) | |
		| kvm_set_pfn_dirty(spte.pfn); | |
		| old_spte = xchg(spte, new_spte);| |
		| if (old_spte.Accessed && | |
		| !new_spte.Accessed) | |
		| flush = true; | |
		| if (old_spte.Dirty && | |
		| !new_spte.Dirty) | |
		| flush = true; | |
		| OOPS!!! | |
		+------------------------------------+-----------------------------------+
		+-------------------------------------+-----------------------------------+

		The Dirty bit is lost in this case.
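
The race in the table above can be replayed deterministically in user space. This is a sketch under stated assumptions, not KVM code: the bit positions, the function names, and the `race` flag that injects CPU 1's actions into the window are all hypothetical, and C11 atomics stand in for the kernel's primitives. It contrasts a read followed by a plain store with a single atomic exchange, and reports whether the updater noticed a Dirty bit being cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this illustration. */
#define SPTE_W        (1ull << 1)
#define SPTE_ACCESSED (1ull << 5)
#define SPTE_DIRTY    (1ull << 6)

/* CPU 1 in the table: the fast page fault path makes the spte
 * writable, then a guest memory write sets the Dirty bit. */
static void cpu1_fast_fault_and_write(_Atomic uint64_t *spte)
{
    atomic_fetch_or(spte, SPTE_W);
    atomic_fetch_or(spte, SPTE_DIRTY);
}

/* Read-then-store update. If CPU 1 runs between the read and the
 * store, its Dirty bit is overwritten without ever being observed.
 * Returns true if a Dirty bit was seen being cleared. */
static bool racy_update(_Atomic uint64_t *spte, uint64_t new_spte, bool race)
{
    uint64_t old_spte = atomic_load(spte);
    if (race)
        cpu1_fast_fault_and_write(spte);   /* the race window */
    atomic_store(spte, new_spte);
    return (old_spte & SPTE_DIRTY) && !(new_spte & SPTE_DIRTY);
}

/* Exchange-based update. The swap is atomic, so CPU 1's changes land
 * entirely before or entirely after it; here they land before, and the
 * returned old value carries the Dirty bit. */
static bool xchg_update(_Atomic uint64_t *spte, uint64_t new_spte, bool race)
{
    if (race)
        cpu1_fast_fault_and_write(spte);
    uint64_t old_spte = atomic_exchange(spte, new_spte);
    return (old_spte & SPTE_DIRTY) && !(new_spte & SPTE_DIRTY);
}
```

Starting from an spte with only the Accessed bit set, `racy_update` with the race enabled reports no Dirty transition even though the guest wrote to the page, while `xchg_update` reports it.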
