Commit 43db1111 authored by Linus Torvalds
Pull kvm updates from Paolo Bonzini:
 "As far as x86 goes this pull request "only" includes TDX host support.

  Quotes are appropriate because (at 6k lines and 100+ commits) it is
  much bigger than the rest, which will come later this week and
  consists mostly of bugfixes and selftests. s390 changes will also come
  in the second batch.

  ARM:

   - Add large stage-2 mapping (THP) support for non-protected guests
     when pKVM is enabled, clawing back some performance.

   - Enable nested virtualisation support on systems that support it,
     though it is disabled by default.

   - Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
     and protected modes.

   - Large rework of the way KVM tracks architecture features and links
     them with the effects of control bits. While this has no functional
     impact, it ensures correctness of emulation (the data is
     automatically extracted from the published JSON files), and helps
     dealing with the evolution of the architecture.

   - Significant changes to the way pKVM tracks ownership of pages,
     avoiding page table walks by storing the state in the hypervisor's
     vmemmap. This in turn enables the THP support described above.

   - New selftest checking the pKVM ownership transition rules

   - Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
     even if the host didn't have it.

   - Fixes for the address translation emulation, which happened to be
     rather buggy in some specific contexts.

   - Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
     from the number of counters exposed to a guest and addressing a
     number of issues in the process.

   - Add a new selftest for the SVE host state being corrupted by a
     guest.

   - Keep HCR_EL2.xMO set at all times for systems running with the
     kernel at EL2, ensuring that the window for interrupts is slightly
     bigger, and avoiding a pretty bad erratum on the AmpereOne HW.

   - Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
     from a pretty bad case of TLB corruption unless accesses to HCR_EL2
     are heavily synchronised.

   - Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
     tables in a human-friendly fashion.

   - and the usual random cleanups.

  LoongArch:

   - Don't flush tlb if the host supports hardware page table walks.

   - Add KVM selftests support.

  RISC-V:

   - Add vector registers to get-reg-list selftest

   - VCPU reset related improvements

   - Remove scounteren initialization from VCPU reset

   - Support VCPU reset from userspace using set_mpstate() ioctl

  x86:

   - Initial support for TDX in KVM.

     This finally makes it possible to use the TDX module to run
     confidential guests on Intel processors. This is quite a large
     series, including support for private page tables (managed by the
     TDX module and mirrored in KVM for efficiency), forwarding some
     TDVMCALLs to userspace, and handling several special VM exits from
     the TDX module.

     This has been in the works for literally years and it's not really
     possible to describe everything here, so I'll defer to the various
     merge commits up to and including commit 7bcf7246 ('Merge
     branch 'kvm-tdx-finish-initial' into HEAD')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
  x86/tdx: mark tdh_vp_enter() as __flatten
  Documentation: virt/kvm: remove unreferenced footnote
  RISC-V: KVM: lock the correct mp_state during reset
  KVM: arm64: Fix documentation for vgic_its_iter_next()
  KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
  KVM: arm64: Stage-2 huge mappings for np-guests
  KVM: arm64: Add a range to pkvm_mappings
  KVM: arm64: Convert pkvm_mappings to interval tree
  KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
  KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
  KVM: arm64: Add a range to __pkvm_host_unshare_guest()
  KVM: arm64: Add a range to __pkvm_host_share_guest()
  KVM: arm64: Introduce for_each_hyp_page
  KVM: arm64: Handle huge mappings for np-guest CMOs
  KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
  KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
  KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
  RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
  RISC-V: KVM: Remove scounteren initialization
  KVM: RISC-V: remove unnecessary SBI reset state
  ...
parents 12e9b9e5 e9f17038
+2 −0
@@ -57,6 +57,8 @@ stable kernels.
+----------------+-----------------+-----------------+-----------------------------+
| Ampere         | AmpereOne AC04  | AC04_CPU_10     | AMPERE_ERRATUM_AC03_CPU_38  |
+----------------+-----------------+-----------------+-----------------------------+
| Ampere         | AmpereOne AC04  | AC04_CPU_23     | AMPERE_ERRATUM_AC04_CPU_23  |
+----------------+-----------------+-----------------+-----------------------------+
+----------------+-----------------+-----------------+-----------------------------+
| ARM            | Cortex-A510     | #2457168        | ARM64_ERRATUM_2457168       |
+----------------+-----------------+-----------------+-----------------------------+
+61 −5
@@ -1411,6 +1411,9 @@ the memory region are automatically reflected into the guest. For example, an
mmap() that affects the region will be made visible immediately.  Another
example is madvise(MADV_DROP).

For a TDX guest, deleting or moving a memory region loses the guest memory
contents.  Read-only regions aren't supported.  Only as-id 0 is supported.

Note: On arm64, a write generated by the page-table walker (to update
the Access and Dirty flags, for example) never results in a
KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This
@@ -3460,7 +3463,8 @@ The initial values are defined as:
	- FPSIMD/NEON registers: set to 0
	- SVE registers: set to 0
	- System registers: Reset to their architecturally defined
	  values as for a warm reset to EL1 (resp. SVC)
	  values as for a warm reset to EL1 (resp. SVC) or EL2 (in the
	  case of EL2 being enabled).

Note that because some registers reflect machine topology, all vcpus
should be created before this ioctl is invoked.
@@ -3527,6 +3531,17 @@ Possible features:
	      - the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can
	        no longer be written using KVM_SET_ONE_REG.

	- KVM_ARM_VCPU_HAS_EL2: Enable Nested Virtualisation support,
	  booting the guest from EL2 instead of EL1.
	  Depends on KVM_CAP_ARM_EL2.
	  The VM is running with HCR_EL2.E2H being RES1 (VHE) unless
	  KVM_ARM_VCPU_HAS_EL2_E2H0 is also set.

	- KVM_ARM_VCPU_HAS_EL2_E2H0: Restrict Nested Virtualisation
	  support to HCR_EL2.E2H being RES0 (non-VHE).
	  Depends on KVM_CAP_ARM_EL2_E2H0.
	  KVM_ARM_VCPU_HAS_EL2 must also be set.
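
The dependency between the two feature bits can be checked in userspace before
KVM_ARM_VCPU_INIT is issued.  The following is a minimal sketch, not KVM code:
the bit numbers 7 and 8 are assumptions standing in for the real
KVM_ARM_VCPU_HAS_EL2 and KVM_ARM_VCPU_HAS_EL2_E2H0 values from the arm64 uapi
header, and nv_features_valid() is a hypothetical helper that looks at the
first word of the vCPU feature bitmap.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; take the real values from <asm/kvm.h> on arm64. */
#define SKETCH_VCPU_HAS_EL2       7
#define SKETCH_VCPU_HAS_EL2_E2H0  8

static_assert(SKETCH_VCPU_HAS_EL2_E2H0 < 32,
              "both bits must fit in the first feature word");

/*
 * Return 1 if the feature word respects the documented rule that
 * HAS_EL2_E2H0 may only be requested together with HAS_EL2, else 0.
 */
static int nv_features_valid(uint32_t features0)
{
        int has_el2 = (features0 >> SKETCH_VCPU_HAS_EL2) & 1;
        int e2h0 = (features0 >> SKETCH_VCPU_HAS_EL2_E2H0) & 1;

        return !e2h0 || has_el2;
}
```

A VMM would still need to confirm KVM_CAP_ARM_EL2 (and KVM_CAP_ARM_EL2_E2H0 for
the non-VHE case) before setting either bit.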

4.83 KVM_ARM_PREFERRED_TARGET
-----------------------------

@@ -4768,7 +4783,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.

:Capability: basic
:Architectures: x86
:Type: vm
:Type: vm ioctl, vcpu ioctl
:Parameters: an opaque platform specific structure (in/out)
:Returns: 0 on success; -1 on error

@@ -4776,9 +4791,11 @@ If the platform supports creating encrypted VMs then this ioctl can be used
for issuing platform-specific memory encryption commands to manage those
encrypted VMs.

Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virt/kvm/x86/amd-memory-encryption.rst.
Currently, this ioctl is used for issuing both Secure Encrypted Virtualization
(SEV) commands on AMD Processors and Trusted Domain Extensions (TDX) commands
on Intel Processors.  The detailed commands are defined in
Documentation/virt/kvm/x86/amd-memory-encryption.rst and
Documentation/virt/kvm/x86/intel-tdx.rst.

4.111 KVM_MEMORY_ENCRYPT_REG_REGION
-----------------------------------
@@ -6827,6 +6844,7 @@ should put the acknowledged interrupt vector into the 'epr' field.
  #define KVM_SYSTEM_EVENT_WAKEUP         4
  #define KVM_SYSTEM_EVENT_SUSPEND        5
  #define KVM_SYSTEM_EVENT_SEV_TERM       6
  #define KVM_SYSTEM_EVENT_TDX_FATAL      7
			__u32 type;
                        __u32 ndata;
                        __u64 data[16];
@@ -6853,6 +6871,11 @@ Valid values for 'type' are:
   reset/shutdown of the VM.
 - KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination.
   The guest physical address of the guest's GHCB is stored in `data[0]`.
 - KVM_SYSTEM_EVENT_TDX_FATAL -- a TDX guest reported a fatal error state.
   KVM doesn't do any parsing or conversion, it just dumps 16 general-purpose
   registers to userspace, in ascending order of the 4-bit indices for x86-64
   general-purpose registers in instruction encoding, as defined in the Intel
   SDM.
 - KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and
   KVM has recognized a wakeup event. Userspace may honor this event by
   marking the exiting vCPU as runnable, or deny it and call KVM_RUN again.
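
For KVM_SYSTEM_EVENT_TDX_FATAL, the register order described above can be
written down as a lookup table.  This is a sketch for userspace pretty-printing
only; the helper names are hypothetical, while the order itself is the standard
x86-64 instruction-encoding order (index 0 is RAX, index 15 is R15).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* x86-64 general-purpose registers in instruction-encoding order (0..15). */
static const char *const tdx_fatal_reg_names[16] = {
        "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8",  "r9",  "r10", "r11", "r12", "r13", "r14", "r15",
};

/* Name of the register stored in data[idx], or NULL if idx is out of range. */
static const char *tdx_fatal_reg_name(size_t idx)
{
        return idx < 16 ? tdx_fatal_reg_names[idx] : NULL;
}

/* Index into data[] for a register name, or -1 if the name is unknown. */
static int tdx_fatal_reg_index(const char *name)
{
        assert(name != NULL);
        for (int i = 0; i < 16; i++)
                if (strcmp(name, tdx_fatal_reg_names[i]) == 0)
                        return i;
        return -1;
}
```

With this table, the guest's R12 value, for example, would be read from
`data[tdx_fatal_reg_index("r12")]`.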
@@ -8194,6 +8217,28 @@ KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the
                                    and 0x489), as KVM does not allow them to
                                    be set by userspace (KVM sets them based on
                                    guest CPUID, for safety purposes).

KVM_X86_QUIRK_IGNORE_GUEST_PAT      By default, on Intel platforms, KVM ignores
                                    guest PAT and forces the effective memory
                                    type to WB in EPT.  The quirk is not available
                                    on Intel platforms which are incapable of
                                    safely honoring guest PAT (i.e., without CPU
                                    self-snoop, KVM always ignores guest PAT and
                                    forces effective memory type to WB).  It is
                                    also ignored on AMD platforms or, on Intel,
                                    when a VM has non-coherent DMA devices
                                    assigned; KVM always honors guest PAT in
                                    such cases. The quirk is needed to avoid
                                    slowdowns on certain Intel Xeon platforms
                                    (e.g. ICX, SPR) where the self-snoop
                                    feature is supported but UC is slow enough
                                    to cause issues with some older guests that
                                    use UC instead of WC to map the video RAM.
                                    Userspace can disable the quirk to honor
                                    guest PAT if it knows that there is no such
                                    guest software, for example if it does not
                                    expose a bochs graphics device (which is
                                    known to have had a buggy driver).
=================================== ============================================

7.32 KVM_CAP_MAX_VCPU_ID
@@ -8496,6 +8541,17 @@ aforementioned registers before the first KVM_RUN. These registers are VM
scoped, meaning that the same set of values are presented on all vCPUs in a
given VM.

7.43 KVM_CAP_RISCV_MP_STATE_RESET
---------------------------------

:Architectures: riscv
:Type: VM
:Parameters: None
:Returns: 0 on success, -EINVAL if arg[0] is not zero

When this capability is enabled, KVM resets the VCPU when setting
MP_STATE_INIT_RECEIVED through IOCTL.  The original MP_STATE is preserved.

8. Other capabilities.
======================

+24 −0
@@ -137,6 +137,30 @@ exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct by setting
hardware_entry_failure_reason field to KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED and
the cpu field to the processor id.

1.5 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS
--------------------------------------------------

:Parameters: in kvm_device_attr.addr the address to an unsigned int
	     representing the maximum value taken by PMCR_EL0.N

:Returns:

	 =======  ====================================================
	 -EBUSY   PMUv3 already initialized, a VCPU has already run or
                  an event filter has already been set
	 -EFAULT  Error accessing the value pointed to by addr
	 -ENODEV  PMUv3 not supported or GIC not initialized
	 -EINVAL  No PMUv3 explicitly selected, or value of N out of
	 	  range
	 =======  ====================================================

Set the number of implemented event counters in the virtual PMU. This
mandates that a PMU has explicitly been selected via
KVM_ARM_VCPU_PMU_V3_SET_PMU, and will fail when no PMU has been
explicitly selected, or the number of counters is out of range for the
selected PMU. Selecting a new PMU cancels the effect of setting this
attribute.
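
When the attribute is set from userspace, the error table above can be turned
into readable diagnostics.  A minimal sketch, assuming the caller passes the
positive errno value observed after a failed KVM_SET_DEVICE_ATTR call;
pmu_nr_counters_reason() is a hypothetical helper, and the strings simply
restate the table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Map an errno value to the documented failure reason for
 * KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS.  Returns NULL for values the table
 * does not list, so the caller can fall back to strerror().
 */
static const char *pmu_nr_counters_reason(int err)
{
        assert(err >= 0);       /* expects a positive errno value */

        switch (err) {
        case EBUSY:
                return "PMUv3 already initialized, a VCPU has already run, "
                       "or an event filter has already been set";
        case EFAULT:
                return "Error accessing the value pointed to by addr";
        case ENODEV:
                return "PMUv3 not supported or GIC not initialized";
        case EINVAL:
                return "No PMUv3 explicitly selected, or value of N out of range";
        default:
                return NULL;
        }
}
```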

2. GROUP: KVM_ARM_VCPU_TIMER_CTRL
=================================

+1 −0
@@ -11,6 +11,7 @@ KVM for x86 systems
   cpuid
   errata
   hypercalls
   intel-tdx
   mmu
   msr
   nested-vmx
+255 −0
.. SPDX-License-Identifier: GPL-2.0

===================================
Intel Trust Domain Extensions (TDX)
===================================

Overview
========
Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from the
host and from physical attacks.  A CPU-attested software module called 'the TDX
module' runs inside a new CPU-isolated range to provide the functionality to
manage and run protected VMs, a.k.a. TDX guests or TDs.

Please refer to [1] for the whitepaper, specifications and other resources.

This documentation describes TDX-specific KVM ABIs.  The TDX module needs to be
initialized before it can be used by KVM to run any TDX guests.  The host
core-kernel provides support for initializing the TDX module, which is
described in Documentation/arch/x86/tdx.rst.

API description
===============

KVM_MEMORY_ENCRYPT_OP
---------------------
:Type: vm ioctl, vcpu ioctl

For TDX operations, KVM_MEMORY_ENCRYPT_OP is repurposed as a generic
ioctl with TDX-specific sub-ioctl() commands.

::

  /* Trust Domain Extensions sub-ioctl() commands. */
  enum kvm_tdx_cmd_id {
          KVM_TDX_CAPABILITIES = 0,
          KVM_TDX_INIT_VM,
          KVM_TDX_INIT_VCPU,
          KVM_TDX_INIT_MEM_REGION,
          KVM_TDX_FINALIZE_VM,
          KVM_TDX_GET_CPUID,

          KVM_TDX_CMD_NR_MAX,
  };

  struct kvm_tdx_cmd {
        /* enum kvm_tdx_cmd_id */
        __u32 id;
        /* flags for sub-command. If sub-command doesn't use this, set zero. */
        __u32 flags;
        /*
         * data for each sub-command. An immediate or a pointer to the actual
         * data in process virtual address.  If sub-command doesn't use it,
         * set zero.
         */
        __u64 data;
        /*
         * Auxiliary error code.  The sub-command may return TDX SEAMCALL
         * status code in addition to -Exxx.
         */
        __u64 hw_error;
  };
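
As an illustration of the rules in the comments above (unused fields are set
to zero), the following sketch prepares a command before it is handed to
KVM_MEMORY_ENCRYPT_OP.  The struct and enum are local mirrors of the
definitions above using fixed-width types, so the sketch builds without kernel
headers; tdx_cmd_prepare() is a hypothetical helper, not part of the KVM ABI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of enum kvm_tdx_cmd_id. */
enum tdx_cmd_id_mirror {
        TDX_CMD_CAPABILITIES = 0,
        TDX_CMD_INIT_VM,
        TDX_CMD_INIT_VCPU,
        TDX_CMD_INIT_MEM_REGION,
        TDX_CMD_FINALIZE_VM,
        TDX_CMD_GET_CPUID,
};

/* Local mirror of struct kvm_tdx_cmd. */
struct tdx_cmd_mirror {
        uint32_t id;
        uint32_t flags;
        uint64_t data;
        uint64_t hw_error;
};

/* Two 32-bit fields followed by two 64-bit fields: 24 bytes, no padding. */
static_assert(sizeof(struct tdx_cmd_mirror) == 24, "unexpected layout");

/* Build a command with hw_error cleared, as every sub-command expects. */
static struct tdx_cmd_mirror tdx_cmd_prepare(uint32_t id, uint32_t flags,
                                             uint64_t data)
{
        struct tdx_cmd_mirror cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.id = id;
        cmd.flags = flags;
        cmd.data = data;
        return cmd;
}
```

With the real uapi header, the equivalent struct kvm_tdx_cmd would be passed
by pointer to ioctl() on the VM or vCPU file descriptor, depending on the
sub-command's type.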

KVM_TDX_CAPABILITIES
--------------------
:Type: vm ioctl
:Returns: 0 on success, <0 on error

Return the TDX capabilities that the current KVM supports with the specific TDX
module loaded in the system.  It reports which features/capabilities can be
configured for the TDX guest.

- id: KVM_TDX_CAPABILITIES
- flags: must be 0
- data: pointer to struct kvm_tdx_capabilities
- hw_error: must be 0

::

  struct kvm_tdx_capabilities {
        __u64 supported_attrs;
        __u64 supported_xfam;
        __u64 reserved[254];

        /* Configurable CPUID bits for userspace */
        struct kvm_cpuid2 cpuid;
  };
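
Because the structure ends in a variable-length CPUID array, userspace has to
size the buffer itself.  The sketch below computes the allocation size for a
given number of entries.  It uses local mirror types with the kvm_cpuid2 header
inlined, and tdx_caps_alloc_size() is a hypothetical helper; how the entry
count is negotiated with KVM is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct kvm_cpuid_entry2: ten 32-bit words. */
struct cpuid_entry_mirror {
        uint32_t function, index, flags;
        uint32_t eax, ebx, ecx, edx;
        uint32_t padding[3];
};

/* Local mirror of struct kvm_tdx_capabilities, cpuid header inlined. */
struct tdx_caps_mirror {
        uint64_t supported_attrs;
        uint64_t supported_xfam;
        uint64_t reserved[254];
        uint32_t nent;                          /* cpuid.nent */
        uint32_t padding;                       /* cpuid.padding */
        struct cpuid_entry_mirror entries[];    /* cpuid.entries */
};

static_assert(offsetof(struct tdx_caps_mirror, nent) == 2048,
              "fixed part before the CPUID header is 2 KiB");
static_assert(sizeof(struct cpuid_entry_mirror) == 40,
              "each CPUID entry is 40 bytes");

/* Bytes needed for the capabilities buffer with room for nent entries. */
static size_t tdx_caps_alloc_size(uint32_t nent)
{
        return sizeof(struct tdx_caps_mirror) +
               (size_t)nent * sizeof(struct cpuid_entry_mirror);
}
```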


KVM_TDX_INIT_VM
---------------
:Type: vm ioctl
:Returns: 0 on success, <0 on error

Perform TDX specific VM initialization.  This needs to be called after
KVM_CREATE_VM and before creating any VCPUs.

- id: KVM_TDX_INIT_VM
- flags: must be 0
- data: pointer to struct kvm_tdx_init_vm
- hw_error: must be 0

::

  struct kvm_tdx_init_vm {
          __u64 attributes;
          __u64 xfam;
          __u64 mrconfigid[6];          /* sha384 digest */
          __u64 mrowner[6];             /* sha384 digest */
          __u64 mrownerconfig[6];       /* sha384 digest */

          /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */
          __u64 reserved[12];

          /*
           * Call KVM_TDX_INIT_VM before vcpu creation, thus before
           * KVM_SET_CPUID2.
           * This configuration supersedes KVM_SET_CPUID2 for VCPUs because the
           * TDX module directly virtualizes those CPUIDs without the VMM.  The
           * userspace VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent
           * with those values.  If it doesn't, KVM may have the wrong idea of
           * the guest's CPUID values, and KVM may wrongly emulate CPUIDs or
           * MSRs that the TDX module doesn't virtualize.
           */
          struct kvm_cpuid2 cpuid;
  };
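
The "256 bytes before the CPUIDs" remark can be checked at compile time: two
64-bit fields, three 48-byte SHA-384 digests and twelve reserved words add up
to 16 + 144 + 96 = 256.  The sketch below uses a local mirror of the structure
(with only the fixed header of kvm_cpuid2) and a hypothetical helper that
copies a digest into mrconfigid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA384_DIGEST_BYTES 48

/* Local mirror of struct kvm_tdx_init_vm up to the CPUID header. */
struct tdx_init_vm_mirror {
        uint64_t attributes;
        uint64_t xfam;
        uint64_t mrconfigid[6];
        uint64_t mrowner[6];
        uint64_t mrownerconfig[6];
        uint64_t reserved[12];
        uint32_t nent;          /* cpuid.nent */
        uint32_t padding;       /* cpuid.padding */
};

static_assert(sizeof(((struct tdx_init_vm_mirror *)0)->mrconfigid) ==
              SHA384_DIGEST_BYTES, "digest fields hold one SHA-384 value");
static_assert(offsetof(struct tdx_init_vm_mirror, nent) == 256,
              "CPUID data starts 256 bytes into the structure");

/* Copy a 48-byte digest into the mrconfigid field. */
static void tdx_init_vm_set_mrconfigid(struct tdx_init_vm_mirror *vm,
                                       const uint8_t digest[SHA384_DIGEST_BYTES])
{
        assert(vm != NULL && digest != NULL);
        memcpy(vm->mrconfigid, digest, SHA384_DIGEST_BYTES);
}
```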


KVM_TDX_INIT_VCPU
-----------------
:Type: vcpu ioctl
:Returns: 0 on success, <0 on error

Perform TDX specific VCPU initialization.

- id: KVM_TDX_INIT_VCPU
- flags: must be 0
- data: initial value of the guest TD VCPU RCX
- hw_error: must be 0

KVM_TDX_INIT_MEM_REGION
-----------------------
:Type: vcpu ioctl
:Returns: 0 on success, <0 on error

Initialize @nr_pages pages of TDX guest private memory starting at @gpa with
userspace-provided data from @source_addr.

Note: before calling this sub-command, the memory attribute of the @nr_pages
pages starting at @gpa needs to be private.  Userspace can use
KVM_SET_MEMORY_ATTRIBUTES to set the attribute.

If the KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends the
measurement.

- id: KVM_TDX_INIT_MEM_REGION
- flags: currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
- data: pointer to struct kvm_tdx_init_mem_region
- hw_error: must be 0

::

  #define KVM_TDX_MEASURE_MEMORY_REGION   (1UL << 0)

  struct kvm_tdx_init_mem_region {
          __u64 source_addr;
          __u64 gpa;
          __u64 nr_pages;
  };


KVM_TDX_FINALIZE_VM
-------------------
:Type: vm ioctl
:Returns: 0 on success, <0 on error

Complete measurement of the initial TD contents and mark it ready to run.

- id: KVM_TDX_FINALIZE_VM
- flags: must be 0
- data: must be 0
- hw_error: must be 0


KVM_TDX_GET_CPUID
-----------------
:Type: vcpu ioctl
:Returns: 0 on success, <0 on error

Get the CPUID values that the TDX module virtualizes for the TD guest.
When it returns -E2BIG, userspace should allocate a larger buffer and
retry.  The minimum buffer size, as a number of entries, is reported in the
nent field of struct kvm_cpuid2.

- id: KVM_TDX_GET_CPUID
- flags: must be 0
- data: pointer to struct kvm_cpuid2 (in/out)
- hw_error: must be 0 (out)

::

  struct kvm_cpuid2 {
	  __u32 nent;
	  __u32 padding;
	  struct kvm_cpuid_entry2 entries[0];
  };

  struct kvm_cpuid_entry2 {
	  __u32 function;
	  __u32 index;
	  __u32 flags;
	  __u32 eax;
	  __u32 ebx;
	  __u32 ecx;
	  __u32 edx;
	  __u32 padding[3];
  };
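
The -E2BIG retry protocol for KVM_TDX_GET_CPUID can be sketched as a loop that
grows the buffer to the size reported in nent.  Everything here is
illustrative: the structs are local mirrors, the callback type is a
hypothetical stand-in for the real ioctl (it returns a negative errno directly
rather than -1 with errno set), and fake_get_cpuid() only exists so the loop
can be exercised without a TD.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct cpuid_entry_mirror {
        uint32_t function, index, flags;
        uint32_t eax, ebx, ecx, edx;
        uint32_t padding[3];
};

struct cpuid2_mirror {
        uint32_t nent;
        uint32_t padding;
        struct cpuid_entry_mirror entries[];
};

/* Hypothetical stand-in for issuing KVM_TDX_GET_CPUID on a vCPU. */
typedef int (*tdx_get_cpuid_fn)(struct cpuid2_mirror *buf, void *ctx);

/* Returns a malloc'ed buffer on success (caller frees), NULL on failure. */
static struct cpuid2_mirror *tdx_get_cpuid_retry(tdx_get_cpuid_fn call,
                                                 void *ctx, uint32_t nent)
{
        assert(call != NULL);

        for (int attempt = 0; attempt < 4; attempt++) {
                struct cpuid2_mirror *buf =
                        calloc(1, sizeof(*buf) +
                                  (size_t)nent * sizeof(buf->entries[0]));
                uint32_t wanted;
                int ret;

                if (!buf)
                        return NULL;
                buf->nent = nent;               /* capacity we offer */
                ret = call(buf, ctx);
                if (ret == 0)
                        return buf;
                wanted = buf->nent;             /* size reported on -E2BIG */
                free(buf);
                if (ret != -E2BIG || wanted <= nent)
                        return NULL;
                nent = wanted;
        }
        return NULL;
}

/* Fake backend: needs *ctx entries and reports that count via nent. */
static int fake_get_cpuid(struct cpuid2_mirror *buf, void *ctx)
{
        uint32_t need = *(uint32_t *)ctx;

        if (buf->nent < need) {
                buf->nent = need;
                return -E2BIG;
        }
        buf->nent = need;
        for (uint32_t i = 0; i < need; i++)
                buf->entries[i].function = i;
        return 0;
}
```

The attempt limit is a design choice of the sketch so that a misbehaving
backend cannot loop forever.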

KVM TDX creation flow
=====================
In addition to the standard KVM flow, new TDX ioctls need to be called.  The
control flow is as follows:

#. Check system wide capability

   * KVM_CAP_VM_TYPES: Check if VM type is supported and if KVM_X86_TDX_VM
     is supported.

#. Create VM

   * KVM_CREATE_VM
   * KVM_TDX_CAPABILITIES: Query TDX capabilities for creating TDX guests.
   * KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): Query maximum VCPUs the TD can
     support at VM level (TDX has its own limitation on this).
   * KVM_SET_TSC_KHZ: Configure TD's TSC frequency if a different TSC frequency
     than the host's is desired.  This is optional.
   * KVM_TDX_INIT_VM: Pass TDX specific VM parameters.

#. Create VCPU

   * KVM_CREATE_VCPU
   * KVM_TDX_INIT_VCPU: Pass TDX specific VCPU parameters.
   * KVM_SET_CPUID2: Configure TD's CPUIDs.
   * KVM_SET_MSRS: Configure TD's MSRs.

#. Initialize initial guest memory

   * Prepare content of initial guest memory.
   * KVM_TDX_INIT_MEM_REGION: Add initial guest memory.
   * KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest.

#. Run VCPU
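
The ordering of the flow above can be captured as a small phase check, which a
VMM might use as a debugging aid.  This is only a sketch of the order as
listed: the step names and helpers are hypothetical, optional calls such as
KVM_SET_TSC_KHZ, KVM_SET_CPUID2 and KVM_SET_MSRS are left out, and the kernel
may accept interleavings that this check rejects.

```c
#include <assert.h>
#include <stddef.h>

enum tdx_flow_step {
        TDX_STEP_CREATE_VM,
        TDX_STEP_CAPABILITIES,
        TDX_STEP_INIT_VM,
        TDX_STEP_CREATE_VCPU,
        TDX_STEP_INIT_VCPU,
        TDX_STEP_INIT_MEM_REGION,
        TDX_STEP_FINALIZE_VM,
        TDX_STEP_RUN,
};

/* Phases: 0 VM setup, 1 vCPU setup, 2 initial memory, 3 finalize, 4 run. */
static int tdx_step_phase(enum tdx_flow_step step)
{
        switch (step) {
        case TDX_STEP_CREATE_VM:
        case TDX_STEP_CAPABILITIES:
        case TDX_STEP_INIT_VM:
                return 0;
        case TDX_STEP_CREATE_VCPU:
        case TDX_STEP_INIT_VCPU:
                return 1;
        case TDX_STEP_INIT_MEM_REGION:
                return 2;
        case TDX_STEP_FINALIZE_VM:
                return 3;
        case TDX_STEP_RUN:
                return 4;
        }
        return -1;
}

/* Return 1 if the steps never go back to an earlier phase, else 0. */
static int tdx_flow_is_ordered(const enum tdx_flow_step *steps, size_t n)
{
        int phase = 0;

        assert(steps != NULL || n == 0);
        for (size_t i = 0; i < n; i++) {
                int p = tdx_step_phase(steps[i]);

                if (p < phase)
                        return 0;
                phase = p;
        }
        return 1;
}
```

Steps within one phase may repeat, which covers creating several vCPUs or
adding several memory regions.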

References
==========

[1] https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/documentation.html