Commit fcbe3482 authored Mar 06, 2025 by Paolo Bonzini

Merge branch 'kvm-tdx-mmu' into HEAD

This series picks up from commit 86eb1aef ("Merge branch
'kvm-mirror-page-tables' into HEAD", 2025-01-20), which focused on
changes to the generic x86 parts of the KVM MMU code, and adds support
for TDX's secure page tables to the Intel side of KVM.

Confidential computing solutions have concepts of private and shared
memory. Often the guest accesses either private or shared memory via a bit
in the guest PTE. Solutions like SEV treat this bit more like a permission
bit, where solutions like TDX and ARM CCA treat it more like a GPA bit. In
the latter case, the host maps private memory in one half of the address
space and shared in another. For TDX these two halves are mapped by
different EPT roots. The private half (also called Secure EPT in Intel
documentation) gets managed by the privileged TDX Module. The shared half
is managed by the untrusted part of the VMM (KVM).

In addition to the separate roots for private and shared, there are
limitations on what operations can be done on the private side. Like SNP,
TDX wants to protect against protected memory being reset or otherwise
scrambled by the host. In order to prevent this, the guest has to take
specific action to “accept” memory after changes are made by the VMM to
the private EPT. This prevents the VMM from performing many of the usual
memory management operations that involve zapping and refaulting memory.
Private memory is also always RWX and cannot have VMM-specified cache
attributes applied.

TDX memory implementation
=========================

Creating shared EPT
-------------------
Shared EPT handling is relatively simple compared to private memory. It is
managed from within KVM. The main differences between shared EPT and EPT
in a normal VM are that the root is set with a TDVMCS field (via SEAMCALL),
and that the GFN specified in the memslot perspective needs to be mapped
at an offset in the EPT. For the former, this series plumbs in the
load_mmu_pgd() operation to the correct field for the shared EPT. For the
latter, previous patches have laid the groundwork for mapping so-called
“direct roots” at an offset specified in kvm->arch.gfn_direct_bits.
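
As a rough illustration of the offset mapping, the sketch below ORs a
shared bit into a memslot GFN and strips it again. Only the
gfn_direct_bits name comes from the text above. The toy_* helpers and the
choice of the top GPA bit as the shared bit are assumptions made for
illustration, not code from this series.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define PAGE_SHIFT 12

/* Hypothetical stand-in for kvm->arch.gfn_direct_bits. */
struct toy_kvm_arch {
	gfn_t gfn_direct_bits;
};

/* Assumption: the shared bit is the top bit of a gpa_width-bit GPA. */
static void toy_set_shared_bit(struct toy_kvm_arch *arch, unsigned int gpa_width)
{
	assert(gpa_width > PAGE_SHIFT && gpa_width <= 64);
	arch->gfn_direct_bits = (1ULL << (gpa_width - 1)) >> PAGE_SHIFT;
}

/* GFN as seen by the memslot -> GFN mapped under the direct (shared) root. */
static gfn_t toy_direct_gfn(const struct toy_kvm_arch *arch, gfn_t memslot_gfn)
{
	return memslot_gfn | arch->gfn_direct_bits;
}

/* Strip the offset again to get back to the memslot's view. */
static gfn_t toy_memslot_gfn(const struct toy_kvm_arch *arch, gfn_t direct_gfn)
{
	return direct_gfn & ~arch->gfn_direct_bits;
}
```

With a 52-bit GPA width, for example, the shared bit lands at GFN bit 39,
so memslot GFN 0x1234 would be mapped at (1 << 39) | 0x1234 in the shared
EPT.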

Creating private EPT
--------------------
In previous patches, the concept of “mirrored roots” was introduced. Such
roots maintain a KVM side “mirror” of the “external” EPT by keeping an
unmapped EPT tree within the KVM MMU code. When changing these mirror
EPTs, the KVM MMU code calls out via x86_ops to update the external EPT.
This series adds implementations for these “external” ops for TDX to
create and manage “private” memory via TDX module APIs.

Managing S-EPT with the TDX Module
----------------------------------
The TDX module allows the TD’s private memory to be managed via SEAMCALLs.
This management consists of operating on two internal elements:

1. The private EPT, which the TDX module calls the S-EPT. It maps the
   actual mapped, private half of the GPA space using an EPT tree.

2. The HKID, which represents private encryption keys used for encrypting
   TD memory. The CPU doesn’t guarantee cache coherency between these
   encryption keys, so memory that is encrypted with one of these keys
   needs to be reclaimed for use on the host in special ways.

This series will primarily focus on the SEAMCALLs for managing the private
EPT. Consideration of the HKID is needed for when the TD is torn down.
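
To make the HKID a little more concrete: the tdx.h hunk in this merge
touches mk_keyed_paddr(), which shifts the HKID up by
boot_cpu_data.x86_phys_bits and ORs it into a physical address. A minimal
userspace model of that encoding, with phys_bits passed in as a parameter
(an assumption of this sketch, as is the toy_keyed_paddr name), might look
like:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of mk_keyed_paddr(): the HKID is placed in the bits
 * just above the usable physical address bits. phys_bits stands in for
 * boot_cpu_data.x86_phys_bits and is an assumption of this sketch.
 */
static uint64_t toy_keyed_paddr(uint16_t hkid, uint64_t paddr, unsigned int phys_bits)
{
	assert(phys_bits < 64);
	assert((paddr >> phys_bits) == 0);	/* no key bits set yet */

	return paddr | ((uint64_t)hkid << phys_bits);
}
```

Two addresses that differ only in these upper bits name the same page
under different keys, which is why cachelines written with a TD's HKID
need explicit handling before the page goes back to the host.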

Populating TDX Private memory
-----------------------------
TDX allows the EPT mapping the TD's private memory to be modified in
limited ways. There are SEAMCALLs for building and tearing down the EPT
tree, as well as mapping pages into the private EPT.

As for building and tearing down the EPT page tables, it is relatively
simple. There are SEAMCALLs for installing and removing them. However, the
current implementation only supports adding private EPT page tables, and
leaves them installed for the lifetime of the TD. For teardown, the
details are discussed in a later section.

As for populating and zapping private SPTE, there are SEAMCALLs for this
as well. The zapping case will be described in detail later. As for the
populating case, there are two categories: before TD is finalized and
after TD is finalized. Both of these scenarios go through the TDP MMU map
path. The changes done previously to introduce “mirror” and “external”
page tables handle directing SPTE installation operations through the
set_external_spte() op.

In the “after” case, the TDX set_external_spte() handler simply calls a
SEAMCALL (TDH.MEM.PAGE.AUG).

For the before case, it is a bit more complicated as it requires both
setting the private SPTE *and* copying in the initial contents of the page
at the same time. For TDX this is done via the KVM_TDX_INIT_MEM_REGION
ioctl, which is effectively the kvm_gmem_populate() operation.

For SNP, the private memory can be pre-populated first, and faulted in
later like normal. But for TDX these need to happen at the same time, and
the setting of the private SPTE needs to happen in a different way than
the “after” case described above. It needs to use the TDH.MEM.PAGE.ADD
SEAMCALL, which does both the copying in of the data and the setting of
the SPTE.

Without extensive modification to the fault path, it’s not possible to
utilize this SEAMCALL from the set_external_spte() handler because the
source page for the data to be copied in is not known deep down in this
callchain. So instead the post-populate callback does a three-step
process.

1. Pre-fault the memory into the mirror EPT, but have the
   set_external_spte() not make any SEAMCALLs.

2. Check that the page is still faulted into the mirror EPT under read
   mmu_lock that is held over this and the following step.

3. Call TDH.MEM.PAGE.ADD with the HPA of the page to copy data from, and
   the private page installed in the mirror EPT to use for the private
   mapping.

The scheme involves some assumptions about the operations that might
operate on the mirrored EPT before the VM is finalized. It assumes that no
other memory will be faulted into the mirror EPT that is not also added
via TDH.MEM.PAGE.ADD. If this is violated, the KVM MMU may not see private
memory faulted in there later and so not make the proper external SPTE
callbacks. To check this, KVM enforces that the number of pre-faulted
pages is the same as the number of pages added via
KVM_TDX_INIT_MEM_REGION.
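
The counting check can be sketched as a small state machine. This is a toy
model under stated assumptions, not the code in this series: the struct,
its field names and the toy_* helpers are all hypothetical, and the real
SEAMCALLs are reduced to counter updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-TD bookkeeping; the field names are made up. */
struct toy_td {
	bool finalized;
	uint64_t nr_premapped;	/* in mirror EPT, not yet added to S-EPT */
};

/* Step 1: pre-fault into the mirror EPT only; no SEAMCALL is made. */
static void toy_premap_page(struct toy_td *td)
{
	assert(!td->finalized);
	td->nr_premapped++;
}

/* Step 3: the page-add SEAMCALL consumes one pre-mapped page. */
static int toy_add_page(struct toy_td *td)
{
	if (td->finalized || td->nr_premapped == 0)
		return -1;
	td->nr_premapped--;
	return 0;
}

/* Finalize only if every pre-faulted page was also added. */
static int toy_finalize(struct toy_td *td)
{
	if (td->nr_premapped != 0)
		return -1;
	td->finalized = true;
	return 0;
}
```

In this model, finalizing while a pre-faulted page has not yet been added
fails, which mirrors the invariant described above.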

TDX TLB flushing
----------------
For TDX, TLB flushing needs to happen in different ways depending on
whether private and/or shared EPT needs to be flushed. Shared EPT can be
flushed like normal EPT with INVEPT. To avoid reading TD's EPTP out from
TDX module, this series flushes shared EPT with type 2 INVEPT. Private TLB
entries can be flushed this way too (via type 2). However, since the TDX
module needs to enforce some guarantees around which private memory is
mapped in the TD, it requires these operations to be done in special ways
for private memory.

For flushing private memory, two methods are possible.  The simple one
is the TDH.VP.FLUSH SEAMCALL; this flush is of the INVEPT type 1 variety
(i.e. mappings associated with the TD).

The second method is part of a sequence of SEAMCALLs for removing a guest
page. The sequence looks like:

1. TDH.MEM.RANGE.BLOCK - Remove RWX bits from entry (similar to KVM’s zap).

2. TDH.MEM.TRACK - Increment the TD TLB epoch, which is a per-TD counter.

3. Kick off all vCPUs - Force them to exit so that they have to re-enter.

4. TDH.MEM.PAGE.REMOVE - Actually remove the page and make it available for
   other use.

5. TDH.VP.ENTER - On re-entering TDX module will see the epoch is
   incremented and flush the TLB.
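
The five steps above can be modelled in userspace roughly as follows. This
is a deliberately simplified sketch of epoch-based tracking for a single
page; the data structures and toy_* names are assumptions, and the real
TDX module bookkeeping is more involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_VCPUS 2

struct toy_vcpu {
	bool in_guest;
	uint64_t entry_epoch;	/* TD epoch sampled at last entry */
	unsigned int tlb_flushes;
};

struct toy_td {
	uint64_t epoch;		/* per-TD TLB epoch */
	bool page_mapped;
	bool page_blocked;
	uint64_t block_epoch;	/* epoch when the page was blocked */
	struct toy_vcpu vcpus[TOY_NR_VCPUS];
};

/* 1. BLOCK: stop new translations and remember the current epoch. */
static void toy_range_block(struct toy_td *td)
{
	td->page_blocked = true;
	td->block_epoch = td->epoch;
}

/* 2. TRACK: bump the per-TD epoch. */
static void toy_track(struct toy_td *td)
{
	td->epoch++;
}

/* 3. Kick: force a vCPU out of the guest. */
static void toy_kick(struct toy_vcpu *vcpu)
{
	vcpu->in_guest = false;
}

/* 4. REMOVE: refuse while a stale translation could still be in use. */
static int toy_page_remove(struct toy_td *td)
{
	if (!td->page_blocked || td->epoch <= td->block_epoch)
		return -1;
	for (int i = 0; i < TOY_NR_VCPUS; i++) {
		const struct toy_vcpu *v = &td->vcpus[i];

		if (v->in_guest && v->entry_epoch <= td->block_epoch)
			return -1;
	}
	td->page_mapped = false;
	return 0;
}

/* 5. ENTER: flush if the epoch moved since this vCPU last entered. */
static void toy_vp_enter(struct toy_td *td, struct toy_vcpu *vcpu)
{
	assert(!vcpu->in_guest);
	if (vcpu->entry_epoch != td->epoch) {
		vcpu->tlb_flushes++;
		vcpu->entry_epoch = td->epoch;
	}
	vcpu->in_guest = true;
}
```

In the model, removal fails if BLOCK or TRACK is skipped, or if a vCPU
that entered before the epoch bump is still running, which is why the kick
in step 3 is needed before step 4.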

On top of this, during TDX module init TDH.SYS.LP.INIT (which is used
to online a CPU for TDX usage) invokes INVEPT to flush all mappings in
the TLB.

During runtime, for normal (TDP MMU, non-nested) guests, KVM will do TLB
flushes in 4 scenarios:

(1) kvm_mmu_load()

    After EPT is loaded, call kvm_x86_flush_tlb_current() to invalidate
    TLBs for current vCPU loaded EPT on current pCPU.

(2) Loading vCPU to a new pCPU

    Send request KVM_REQ_TLB_FLUSH to current vCPU, the request handler
    will call kvm_x86_flush_tlb_all() to flush all EPTs associated with the
    new pCPU.

(3) When EPT mapping has changed (after removing or permission reduction)
    (e.g. in kvm_flush_remote_tlbs())

    Send request KVM_REQ_TLB_FLUSH to all vCPUs by kicking them all off,
    the request handler on each vCPU will call kvm_x86_flush_tlb_all() to
    invalidate TLBs for all EPTs associated with the pCPU.

(4) When an EPT change only affects the current vCPU, e.g. the virtual
    APIC mode changed.

    Send request KVM_REQ_TLB_FLUSH_CURRENT, the request handler will call
    kvm_x86_flush_tlb_current() to invalidate TLBs for current vCPU loaded
    EPT on current pCPU.

Only the first 3 are relevant to TDX. They are implemented as follows.

(1) kvm_mmu_load()

    Only the shared EPT root is loaded in this path. The TDX module does
    not require any assurances about the operation, so the
    flush_tlb_current()->ept_sync_global() can be called as normal.

(2) vCPU load

    When a vCPU migrates to a new logical processor, it has to be flushed
    on the *old* pCPU, unlike normal VMs where the INVEPT is executed on
    the new pCPU to remove stale mappings from previous usage of the same
    EPTP on the new pCPU. The TDX behavior comes from a requirement
    that a vCPU can only be associated with one pCPU at a time. This
    flush happens via an IPI that invokes TDH.VP.FLUSH SEAMCALL, during
    the vcpu_load callback.

(3) Removing a private SPTE

    This is the more complicated flow. It is done in a simple way for now
    and is especially inefficient during VM teardown. The plan is to get a
    basic functional version working and optimize some of these flows
    later.

    When a private page mapping is removed, the core MMU code calls the
    newly added remove_external_spte() op, and flushes the TLB on all
    vCPUs. But TDX can’t rely on doing that for private memory, so it has
    its own process for making sure the private page is removed. This flow
    (TDH.MEM.RANGE.BLOCK, TDH.MEM.TRACK, TDH.MEM.PAGE.REMOVE) is done
    within the remove_external_spte() implementation as described in the
    “TDX TLB flushing” section above.

    After that, back in the core MMU code, KVM will call
    kvm_flush_remote_tlbs*() resulting in an INVEPT. Despite that, when
    the vCPUs re-enter (TDH.VP.ENTER) the TD, the TDX module will do
    another INVEPT for its own reassurance.
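
The vCPU load case (2) can be pictured with a small model of the
one-pCPU-at-a-time rule. This is only a sketch: the struct, the TOY_NO_CPU
marker and the toy_* helpers are hypothetical, and the IPI plus
TDH.VP.FLUSH are collapsed into a plain function call.

```c
#include <assert.h>

#define TOY_NO_CPU (-1)

struct toy_vcpu {
	int assoc_cpu;		/* pCPU currently associated with this vCPU */
	unsigned int flushes_on_old_cpu;
};

/* Stand-in for the IPI that runs TDH.VP.FLUSH on the old pCPU. */
static void toy_flush_on_cpu(struct toy_vcpu *vcpu, int cpu)
{
	assert(vcpu->assoc_cpu == cpu);
	vcpu->flushes_on_old_cpu++;
	vcpu->assoc_cpu = TOY_NO_CPU;
}

/* vcpu_load: break the old association before creating a new one. */
static void toy_vcpu_load(struct toy_vcpu *vcpu, int new_cpu)
{
	if (vcpu->assoc_cpu == new_cpu)
		return;
	if (vcpu->assoc_cpu != TOY_NO_CPU)
		toy_flush_on_cpu(vcpu, vcpu->assoc_cpu);
	vcpu->assoc_cpu = new_cpu;
}
```

Reloading on the same pCPU costs nothing in this model; only a migration
triggers the flush, and it always targets the pCPU being left.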

Private memory teardown
-----------------------
Tearing down private memory involves reclaiming three types of resources
from the TDX module:

 1. TD’s HKID

    To reclaim the TD’s HKID, no mappings may be mapped with it.

 2. Private guest pages (mapped with HKID)
 3. Private page tables that map private pages (mapped with HKID)

    From the TDX module’s perspective, to reclaim guest private pages they
    need to be prevented from being accessed via the HKID (unmapped and TLB
    flushed), their HKID associated cachelines need to be flushed, and
    they need to be marked as no longer in use by the TD in the TDX
    module’s internal tracking (PAMT).

During runtime private PTEs can be zapped as part of memslot deletion or
when memory converts from shared to private, but private page tables and
HKIDs are not torn down until the TD is being destructed. This means the
operation to zap private guest mapped pages needs to do the required cache
writeback under the assumption that other vCPUs may be active, but the
operation to reclaim the PTs does not.

TD teardown resource reclamation
--------------------------------
The code that does the TD teardown is organized such that when an HKID is
reclaimed:
1. vCPUs will no longer enter the TD
2. The TLB is flushed on all CPUs
3. The HKID associated cachelines have been flushed.

So at that point most of the steps needed to reclaim TD private pages and
page tables have already been done and the reclaim operation only needs to
update the TDX module’s tracking of page ownership. For simplicity each
operation only supports one scenario: before or after HKID reclaim. Since
zapping and reclaiming private pages has to function during runtime for
memslot deletion and converting from shared to private, the TD teardown is
arranged so this happens before HKID reclaim. Since private page tables
are never torn down during TD runtime, they can happen in a simpler and
more efficient way after HKID reclaim. The private page reclaim is
initiated from the kvm fd release. The callchain looks like this:

do_exit
  |->exit_mm --> tdx_mmu_release_hkid() was called here previously in v19
  |->exit_files
      |->1.release vcpu fd
      |->2.kvm_gmem_release
      |     |->kvm_gmem_invalidate_begin --> unmap all leaf entries, causing
      |                                      zapping of private guest pages
      |->3.release kvmfd
            |->kvm_destroy_vm
                |->kvm_arch_pre_destroy_vm
                |   |  kvm_x86_call(vm_pre_destroy)(kvm) -->tdx_mmu_release_hkid()
                |->kvm_arch_destroy_vm
                    |->kvm_unload_vcpu_mmus
                    |  kvm_destroy_vcpus(kvm)
                    |   |->kvm_arch_vcpu_destroy
                    |   |->kvm_x86_call(vcpu_free)(vcpu)
                    |   |  kvm_mmu_destroy(vcpu) -->unref mirror root
                    |  kvm_mmu_uninit_vm(kvm) --> mirror root ref is 1 here,
                    |                             zap private page tables
                    | static_call_cond(kvm_x86_vm_destroy)(kvm);
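
The ordering constraint behind this callchain (guest pages are zapped
before HKID reclaim, page tables are reclaimed after it) can be captured
in a toy model. The struct and the toy_* helpers below are hypothetical
and stand in for the real zap and reclaim paths.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_td {
	bool hkid_assigned;
	unsigned int private_pages;
	unsigned int private_page_tables;
};

/* Runtime-capable path: only supported while the HKID is assigned. */
static int toy_zap_private_page(struct toy_td *td)
{
	if (!td->hkid_assigned || td->private_pages == 0)
		return -1;
	td->private_pages--;
	return 0;
}

/* HKID reclaim; in the callchain this is the vm_pre_destroy step. */
static int toy_release_hkid(struct toy_td *td)
{
	if (!td->hkid_assigned)
		return -1;
	td->hkid_assigned = false;
	return 0;
}

/* Simpler path for page tables: only supported after HKID reclaim. */
static int toy_reclaim_page_table(struct toy_td *td)
{
	if (td->hkid_assigned || td->private_page_tables == 0)
		return -1;
	td->private_page_tables--;
	return 0;
}
```

Each helper supports exactly one scenario, before or after HKID reclaim,
matching the simplification described above.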

parents 0d20742b eac0b72f

arch/x86/include/asm/kvm_host.h

+7 −5

		@@ -1562,6 +1562,13 @@ struct kvm_arch {
		struct kvm_mmu_memory_cache split_desc_cache;

		gfn_t gfn_direct_bits;

		/*
		* Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A Zero
		* value indicates CPU dirty logging is unsupported or disabled in
		* current VM.
		*/
		int cpu_dirty_log_size;
		};

		struct kvm_vm_stat {
		@@ -1815,11 +1822,6 @@ struct kvm_x86_ops {
		struct x86_exception *exception);
		void (*handle_exit_irqoff)(struct kvm_vcpu *vcpu);

		/*
		* Size of the CPU's dirty log buffer, i.e. VMX's PML buffer. A zero
		* value indicates CPU dirty logging is unsupported or disabled.
		*/
		int cpu_dirty_log_size;
		void (*update_cpu_dirty_logging)(struct kvm_vcpu *vcpu);

		const struct kvm_x86_nested_ops *nested_ops;

arch/x86/include/asm/tdx.h

+14 −1

		@@ -148,7 +148,6 @@ struct tdx_vp {
		struct page **tdcx_pages;
		};


		static inline u64 mk_keyed_paddr(u16 hkid, struct page *page)
		{
		u64 ret;
		@@ -158,15 +157,26 @@ static inline u64 mk_keyed_paddr(u16 hkid, struct page *page)
		ret |= (u64)hkid << boot_cpu_data.x86_phys_bits;

		return ret;
		}

		static inline int pg_level_to_tdx_sept_level(enum pg_level level)
		{
		WARN_ON_ONCE(level == PG_LEVEL_NONE);
		return level - 1;
		}

		u64 tdh_mng_addcx(struct tdx_td *td, struct page *tdcs_page);
		u64 tdh_mem_page_add(struct tdx_td *td, u64 gpa, struct page *page, struct page *source, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_mem_sept_add(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_vp_addcx(struct tdx_vp *vp, struct page *tdcx_page);
		u64 tdh_mem_page_aug(struct tdx_td *td, u64 gpa, int level, struct page *page, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_mem_range_block(struct tdx_td *td, u64 gpa, int level, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_mng_key_config(struct tdx_td *td);
		u64 tdh_mng_create(struct tdx_td *td, u16 hkid);
		u64 tdh_vp_create(struct tdx_td *td, struct tdx_vp *vp);
		u64 tdh_mng_rd(struct tdx_td *td, u64 field, u64 *data);
		u64 tdh_mr_extend(struct tdx_td *td, u64 gpa, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_mr_finalize(struct tdx_td *td);
		u64 tdh_vp_flush(struct tdx_vp *vp);
		u64 tdh_mng_vpflushdone(struct tdx_td *td);
		u64 tdh_mng_key_freeid(struct tdx_td *td);
		@@ -175,8 +185,11 @@ u64 tdh_vp_init(struct tdx_vp *vp, u64 initial_rcx, u32 x2apicid);
		u64 tdh_vp_rd(struct tdx_vp *vp, u64 field, u64 *data);
		u64 tdh_vp_wr(struct tdx_vp *vp, u64 field, u64 data, u64 mask);
		u64 tdh_phymem_page_reclaim(struct page *page, u64 *tdx_pt, u64 *tdx_owner, u64 *tdx_size);
		u64 tdh_mem_track(struct tdx_td *tdr);
		u64 tdh_mem_page_remove(struct tdx_td *td, u64 gpa, u64 level, u64 *ext_err1, u64 *ext_err2);
		u64 tdh_phymem_cache_wb(bool resume);
		u64 tdh_phymem_page_wbinvd_tdr(struct tdx_td *td);
		u64 tdh_phymem_page_wbinvd_hkid(u64 hkid, struct page *page);
		#else
		static inline void tdx_init(void) { }
		static inline int tdx_cpu_enable(void) { return -ENODEV; }

arch/x86/include/asm/vmx.h

+1 −0

		@@ -256,6 +256,7 @@ enum vmcs_field {
		TSC_MULTIPLIER_HIGH = 0x00002033,
		TERTIARY_VM_EXEC_CONTROL = 0x00002034,
		TERTIARY_VM_EXEC_CONTROL_HIGH = 0x00002035,
		SHARED_EPT_POINTER = 0x0000203C,
		PID_POINTER_TABLE = 0x00002042,
		PID_POINTER_TABLE_HIGH = 0x00002043,
		GUEST_PHYSICAL_ADDRESS = 0x00002400,

arch/x86/include/uapi/asm/kvm.h

+10 −0

		@@ -932,6 +932,8 @@ enum kvm_tdx_cmd_id {
		KVM_TDX_CAPABILITIES = 0,
		KVM_TDX_INIT_VM,
		KVM_TDX_INIT_VCPU,
		KVM_TDX_INIT_MEM_REGION,
		KVM_TDX_FINALIZE_VM,
		KVM_TDX_GET_CPUID,

		KVM_TDX_CMD_NR_MAX,
		@@ -987,4 +989,12 @@ struct kvm_tdx_init_vm {
		struct kvm_cpuid2 cpuid;
		};

		#define KVM_TDX_MEASURE_MEMORY_REGION _BITULL(0)

		struct kvm_tdx_init_mem_region {
		__u64 source_addr;
		__u64 gpa;
		__u64 nr_pages;
		};

		#endif /* _ASM_X86_KVM_H */

arch/x86/kvm/mmu.h

+4 −0

		@@ -79,6 +79,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
		u8 kvm_mmu_get_max_tdp_level(void);

		void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
		void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
		void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
		void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

		@@ -253,6 +254,9 @@ extern bool tdp_mmu_enabled;
		#define tdp_mmu_enabled false
		#endif

		bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
		int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);

		static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
		{
		return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);