Commit 83a39807 authored by Marc Zyngier

Merge branch kvm-arm64/pkvm-protected-guest into kvmarm-master/next



* kvm-arm64/pkvm-protected-guest: (41 commits)
  : .
  : pKVM support for protected guests, implementing the very long
  : awaited support for anonymous memory, as the elusive guestmem
  : has failed to deliver on its promises despite a multi-year
  : effort. Patches courtesy of Will Deacon. From the initial cover
  : letter:
  :
  : "[...] this patch series implements support for protected guest
  : memory with pKVM, where pages are unmapped from the host as they are
  : faulted into the guest and can be shared back from the guest using pKVM
  : hypercalls. Protected guests are created using a new machine type
  : identifier and can be booted to a shell using the kvmtool patches
  : available at [2], which finally means that we are able to test the pVM
  : logic in pKVM. Since this is an incremental step towards full isolation
  : from the host (for example, the CPU register state and DMA accesses are
  : not yet isolated), creating a pVM requires a developer Kconfig option to
  : be enabled in addition to booting with 'kvm-arm.mode=protected' and
  : results in a kernel taint."
  : .
  KVM: arm64: Don't hold 'vm_table_lock' across guest page reclaim
  KVM: arm64: Allow get_pkvm_hyp_vm() to take a reference to a dying VM
  KVM: arm64: Prevent teardown finalisation of referenced 'hyp_vm'
  drivers/virt: pkvm: Add Kconfig dependency on DMA_RESTRICTED_POOL
  KVM: arm64: Rename PKVM_PAGE_STATE_MASK
  KVM: arm64: Extend pKVM page ownership selftests to cover guest hvcs
  KVM: arm64: Extend pKVM page ownership selftests to cover forced reclaim
  KVM: arm64: Register 'selftest_vm' in the VM table
  KVM: arm64: Extend pKVM page ownership selftests to cover guest donation
  KVM: arm64: Add some initial documentation for pKVM
  KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled
  KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs
  KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs
  KVM: arm64: Add hvc handler at EL2 for hypercalls from protected VMs
  KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte
  KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler
  KVM: arm64: Introduce hypercall to force reclaim of a protected page
  KVM: arm64: Annotate guest donations with handle and gfn in host stage-2
  KVM: arm64: Change 'pkvm_handle_t' to u16
  KVM: arm64: Introduce host_stage2_set_owner_metadata_locked()
  ...

Signed-off-by: Marc Zyngier <maz@kernel.org>
parents 73bb0bc2 bc20692f
+2 −2
@@ -3247,8 +3247,8 @@ Kernel parameters
			for the host. To force nVHE on VHE hardware, add
			"arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the
			command-line.
-			"nested" is experimental and should be used with
-			extreme caution.
+			"nested" and "protected" are experimental and should be
+			used with extreme caution.

	kvm-arm.vgic_v3_group0_trap=
			[KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0
+1 −0
@@ -10,6 +10,7 @@ ARM
   fw-pseudo-registers
   hyp-abi
   hypercalls
   pkvm
   pvtime
   ptp_kvm
   vcpu-features
+106 −0
.. SPDX-License-Identifier: GPL-2.0

====================
Protected KVM (pKVM)
====================

**NOTE**: pKVM is currently an experimental, development feature and
subject to breaking changes as new isolation features are implemented.
Please reach out to the developers at kvmarm@lists.linux.dev if you have
any questions.

Overview
========

Booting a host kernel with '``kvm-arm.mode=protected``' enables
"Protected KVM" (pKVM). During boot, pKVM installs a stage-2 identity
map page-table for the host and uses it to isolate the hypervisor
running at EL2 from the rest of the host running at EL1/0.

pKVM permits creation of protected virtual machines (pVMs) by passing
the ``KVM_VM_TYPE_ARM_PROTECTED`` machine type identifier to the
``KVM_CREATE_VM`` ioctl(). The hypervisor isolates pVMs from the host by
unmapping pages from the stage-2 identity map as they are accessed by a
pVM. Hypercalls are provided for a pVM to share specific regions of its
IPA space back with the host, allowing for communication with the VMM.
A Linux guest must be configured with ``CONFIG_ARM_PKVM_GUEST=y`` in
order to issue these hypercalls.

See hypercalls.rst for more details.
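
As a rough illustration of the creation step described above, the following C sketch issues ``KVM_CREATE_VM`` with a caller-supplied machine type. It assumes uapi headers from a kernel carrying this series, so that ``KVM_VM_TYPE_ARM_PROTECTED`` is provided by ``<linux/kvm.h>``; the helper name ``create_vm_of_type()`` is hypothetical and not part of the patches.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: create a VM of the given machine type on an
 * already-open /dev/kvm file descriptor.
 *
 * Returns the new VM file descriptor on success, or a negative errno
 * value on failure.
 */
static int create_vm_of_type(int kvm_fd, unsigned long machine_type)
{
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, machine_type);

	return vm_fd < 0 ? -errno : vm_fd;
}
```

A VMM would pass ``KVM_VM_TYPE_ARM_PROTECTED`` as ``machine_type``. According to the cover letter, creating a pVM also requires a developer Kconfig option and booting with '``kvm-arm.mode=protected``', so the call should be expected to fail on hosts without them.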

Isolation mechanisms
====================

pKVM relies on a number of mechanisms to isolate pVMs from the host:

CPU memory isolation
--------------------

Status: Isolation of anonymous memory and metadata pages.

Metadata pages (e.g. page-table pages and '``struct kvm_vcpu``' pages)
are donated from the host to the hypervisor during pVM creation and
are consequently unmapped from the stage-2 identity map until the pVM is
destroyed.

Similarly to regular KVM, pages are lazily mapped into the guest in
response to stage-2 page faults handled by the host. However, when
running a pVM, these pages are first pinned and then unmapped from the
stage-2 identity map as part of the donation procedure. This gives rise
to some user-visible differences when compared to non-protected VMs,
largely due to the lack of MMU notifiers:

* Memslots cannot be moved or deleted once the pVM has started running.
* Read-only memslots and dirty logging are not supported.
* With the exception of swap, file-backed pages cannot be mapped into a
  pVM.
* Donated pages are accounted against ``RLIMIT_MEMLOCK`` and so the VMM
  must have a sufficient resource limit or be granted ``CAP_IPC_LOCK``.
  The lack of a runtime reclaim mechanism means that memory locked for
  a pVM will remain locked until the pVM is destroyed.
* Changes to the VMM address space (e.g. a ``MAP_FIXED`` mmap() over a
  mapping associated with a memslot) are not reflected in the guest and
  may lead to loss of coherency.
* Accessing pVM memory that has not been shared back will result in the
  delivery of a SIGSEGV.
* If a system call accesses pVM memory that has not been shared back
  then it will either return ``-EFAULT`` or forcefully reclaim the
  memory pages. Reclaimed memory is zeroed by the hypervisor and a
  subsequent attempt to access it in the pVM will return ``-EFAULT``
  from the ``VCPU_RUN`` ioctl().
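
Because donated pages count against the locked-memory limit, a VMM may want to check its budget before populating a pVM. The sketch below is one way to do that in C. The helper names are hypothetical, the check uses the ``RLIMIT_MEMLOCK`` resource from ``<sys/resource.h>``, and it deliberately ignores both ``CAP_IPC_LOCK`` (which would lift the limit) and any memory the process has already locked.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical helper: does a soft limit cover 'guest_bytes' of pinned memory? */
static bool memlock_limit_covers(rlim_t soft_limit, uint64_t guest_bytes)
{
	if (soft_limit == RLIM_INFINITY)
		return true;

	return (uint64_t)soft_limit >= guest_bytes;
}

/*
 * Hypothetical helper: check the calling process's current soft limit.
 * Returns 1 if it covers 'guest_bytes', 0 if it does not, and -1 if
 * getrlimit() itself failed.
 */
static int memlock_check_current(uint64_t guest_bytes)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
		return -1;

	return memlock_limit_covers(lim.rlim_cur, guest_bytes) ? 1 : 0;
}
```

Since memory locked for a pVM stays locked until the pVM is destroyed, a check like this is most useful before the guest first runs rather than as a runtime guard.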

CPU state isolation
-------------------

Status: **Unimplemented.**

DMA isolation using an IOMMU
----------------------------

Status: **Unimplemented.**

Proxying of TrustZone services
------------------------------

Status: FF-A and PSCI calls from the host are proxied by the pKVM
hypervisor.

The FF-A proxy ensures that the host cannot share pVM or hypervisor
memory with TrustZone as part of a "confused deputy" attack.

The PSCI proxy ensures that CPUs always have the stage-2 identity map
installed when they are executing in the host.

Protected VM firmware (pvmfw)
-----------------------------

Status: **Unimplemented.**

Resources
=========

Quentin Perret's KVM Forum 2022 talk entitled "Protected KVM on arm64: A
technical deep dive" remains a good resource for learning more about
pKVM, despite some of the details having changed in the meantime:

https://www.youtube.com/watch?v=9npebeVFbFw
+20 −11
@@ -51,7 +51,7 @@
#include <linux/mm.h>

enum __kvm_host_smccc_func {
-	/* Hypercalls available only prior to pKVM finalisation */
+	/* Hypercalls that are unavailable once pKVM has finalised. */
	/* __KVM_HOST_SMCCC_FUNC___kvm_hyp_init */
	__KVM_HOST_SMCCC_FUNC___pkvm_init = __KVM_HOST_SMCCC_FUNC___kvm_hyp_init + 1,
	__KVM_HOST_SMCCC_FUNC___pkvm_create_private_mapping,
@@ -60,16 +60,9 @@ enum __kvm_host_smccc_func {
	__KVM_HOST_SMCCC_FUNC___vgic_v3_init_lrs,
	__KVM_HOST_SMCCC_FUNC___vgic_v3_get_gic_config,
	__KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize,
+	__KVM_HOST_SMCCC_FUNC_MIN_PKVM = __KVM_HOST_SMCCC_FUNC___pkvm_prot_finalize,

-	/* Hypercalls available after pKVM finalisation */
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest,
-	__KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest,
+	/* Hypercalls that are always available and common to [nh]VHE/pKVM. */
	__KVM_HOST_SMCCC_FUNC___kvm_adjust_pc,
	__KVM_HOST_SMCCC_FUNC___kvm_vcpu_run,
	__KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context,
@@ -83,11 +76,27 @@ enum __kvm_host_smccc_func {
	__KVM_HOST_SMCCC_FUNC___vgic_v3_restore_vmcr_aprs,
	__KVM_HOST_SMCCC_FUNC___vgic_v5_save_apr,
	__KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
+	__KVM_HOST_SMCCC_FUNC_MAX_NO_PKVM = __KVM_HOST_SMCCC_FUNC___vgic_v5_restore_vmcr_apr,
+
+	/* Hypercalls that are available only when pKVM has finalised. */
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_hyp,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_hyp,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_donate_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_share_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_unshare_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_relax_perms_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_wrprotect_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_test_clear_young_guest,
+	__KVM_HOST_SMCCC_FUNC___pkvm_host_mkyoung_guest,
	__KVM_HOST_SMCCC_FUNC___pkvm_reserve_vm,
	__KVM_HOST_SMCCC_FUNC___pkvm_unreserve_vm,
	__KVM_HOST_SMCCC_FUNC___pkvm_init_vm,
	__KVM_HOST_SMCCC_FUNC___pkvm_init_vcpu,
-	__KVM_HOST_SMCCC_FUNC___pkvm_teardown_vm,
+	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_in_poison_fault,
+	__KVM_HOST_SMCCC_FUNC___pkvm_force_reclaim_guest_page,
+	__KVM_HOST_SMCCC_FUNC___pkvm_reclaim_dying_guest_page,
+	__KVM_HOST_SMCCC_FUNC___pkvm_start_teardown_vm,
+	__KVM_HOST_SMCCC_FUNC___pkvm_finalize_teardown_vm,
	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_load,
	__KVM_HOST_SMCCC_FUNC___pkvm_vcpu_put,
	__KVM_HOST_SMCCC_FUNC___pkvm_tlb_flush_vmid,
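
The two range markers in the enum above (``MIN_PKVM`` and ``MAX_NO_PKVM``) suggest that availability can be decided with a simple comparison on the function ID. The following self-contained C sketch illustrates that idea with a toy enum. Every name in it is hypothetical, and it is an assumption about how such markers might be consumed, not the dispatch code from this series.

```c
#include <stdbool.h>

/* Toy stand-in for the three groups of host hypercalls. */
enum toy_hcall {
	/* Unavailable once pKVM has finalised. */
	TOY_HCALL_INIT,
	TOY_HCALL_PROT_FINALIZE,
	/* As in the diff, the lower bound aliases the finalise call itself. */
	TOY_HCALL_MIN_PKVM = TOY_HCALL_PROT_FINALIZE,

	/* Always available. */
	TOY_HCALL_VCPU_RUN,
	TOY_HCALL_FLUSH_VM_CONTEXT,
	TOY_HCALL_MAX_NO_PKVM = TOY_HCALL_FLUSH_VM_CONTEXT,

	/* Available only when pKVM has finalised. */
	TOY_HCALL_HOST_SHARE_HYP,
	TOY_HCALL_HOST_DONATE_GUEST,

	TOY_HCALL_COUNT,
};

/* Return true if 'id' falls inside the range allowed for the current state. */
static bool toy_hcall_allowed(unsigned int id, bool pkvm_finalised)
{
	unsigned int min = 0;
	unsigned int max = TOY_HCALL_COUNT - 1;

	if (pkvm_finalised)
		min = TOY_HCALL_MIN_PKVM;	/* drop the init-only calls */
	else
		max = TOY_HCALL_MAX_NO_PKVM;	/* drop the pKVM-only calls */

	return id >= min && id <= max;
}
```

Keeping each group contiguous is what makes the check a pair of comparisons; interleaving the groups would require a per-ID lookup instead.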
+8 −1
@@ -251,7 +251,7 @@ struct kvm_smccc_features {
	unsigned long vendor_hyp_bmap_2; /* Function numbers 64-127 */
};

-typedef unsigned int pkvm_handle_t;
+typedef u16 pkvm_handle_t;

struct kvm_protected_vm {
	pkvm_handle_t handle;
@@ -259,6 +259,13 @@ struct kvm_protected_vm {
	struct kvm_hyp_memcache stage2_teardown_mc;
	bool is_protected;
	bool is_created;

	/*
	 * True when the guest is being torn down. When in this state, the
	 * guest's vCPUs can't be loaded anymore, but its pages can be
	 * reclaimed by the host.
	 */
	bool is_dying;
};

struct kvm_mpidr_data {