Commit Graph

308 Commits

Author SHA1 Message Date
Isaku Yamahata
52f52ea79a Documentation/virt/kvm: Document on Trust Domain Extensions (TDX)
Add documentation to Intel Trusted Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250227012021.1778144-21-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Yan Zhao
c9c1e20b4c KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have
KVM ignore guest PAT when this quirk is enabled.

On AMD platforms, KVM always honors guest PAT.  On Intel however there are
two issues.  First, KVM *cannot* honor guest PAT if CPU feature self-snoop
is not supported. Second, UC access on certain Intel platforms can be very
slow[1] and honoring guest PAT on those platforms may break some old
guests that accidentally specify video RAM as UC. Those old guests may
never expect the slowness since KVM always forces WB previously. See [2].

So, introduce a quirk that KVM can enable by default on all Intel platforms
to avoid breaking old unmodifiable guests. Newer userspace can disable this
quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail
if self-snoop is not supported, i.e. if KVM cannot obey the wish.

The quirk is a no-op on AMD and also if any assigned devices have
non-coherent DMA.  This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is
another example of a quirk that is sometimes automatically disabled.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1]
Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2]
Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com>
[Use supported_quirks/inapplicable_quirks to support both AMD and
 no-self-snoop cases, as well as to remove the shadow_memtype_mask check
 from kvm_mmu_may_ignore_guest_pat(). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:58 -04:00
Binbin Wu
79462faa2b KVM: TDX: Handle TDG.VP.VMCALL<ReportFatalError>
Convert TDG.VP.VMCALL<ReportFatalError> to KVM_EXIT_SYSTEM_EVENT with
a new type KVM_SYSTEM_EVENT_TDX_FATAL and forward it to userspace for
handling.

TD guest can use TDG.VP.VMCALL<ReportFatalError> to report the fatal
error it has experienced.  This hypercall is special because TD guest
is requesting a termination with the error information, KVM needs to
forward the hypercall to userspace anyway, KVM doesn't do parsing or
conversion, it just dumps the 16 general-purpose registers to userspace
and let userspace decide what to do.

Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com>
Message-ID: <20250222014225.897298-8-binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14 14:20:55 -04:00
Claudio Imbrenda
413c98f24c KVM: s390: fake memslot for ucontrol VMs
Create a fake memslot for ucontrol VMs. The fake memslot identity-maps
userspace.

Now memslots will always be present, and ucontrol is not a special case
anymore.

Suggested-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20250123144627.312456-4-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-4-imbrenda@linux.ibm.com>
2025-01-31 12:03:52 +01:00
Linus Torvalds
0f8e26b38d Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "Loongarch:

   - Clear LLBCTL if secondary mmu mapping changes

   - Add hypercall service support for usermode VMM

  x86:

   - Add a comment to kvm_mmu_do_page_fault() to explain why KVM
     performs a direct call to kvm_tdp_page_fault() when RETPOLINE is
     enabled

   - Ensure that all SEV code is compiled out when disabled in Kconfig,
     even if building with less brilliant compilers

   - Remove a redundant TLB flush on AMD processors when guest CR4.PGE
     changes

   - Use str_enabled_disabled() to replace open coded strings

   - Drop kvm_x86_ops.hwapic_irr_update() as KVM updates hardware's
     APICv cache prior to every VM-Enter

   - Overhaul KVM's CPUID feature infrastructure to track all vCPU
     capabilities instead of just those where KVM needs to manage state
     and/or explicitly enable the feature in hardware. Along the way,
     refactor the code to make it easier to add features, and to make it
     more self-documenting how KVM is handling each feature

   - Rework KVM's handling of VM-Exits during event vectoring; this
     plugs holes where KVM unintentionally puts the vCPU into infinite
     loops in some scenarios (e.g. if emulation is triggered by the
     exit), and brings parity between VMX and SVM

   - Add pending request and interrupt injection information to the
     kvm_exit and kvm_entry tracepoints respectively

   - Fix a relatively benign flaw where KVM would end up redoing RDPKRU
     when loading guest/host PKRU, due to a refactoring of the kernel
     helpers that didn't account for KVM's pre-checking of the need to
     do WRPKRU

   - Make the completion of hypercalls go through the complete_hypercall
     function pointer argument, no matter if the hypercall exits to
     userspace or not.

     Previously, the code assumed that KVM_HC_MAP_GPA_RANGE specifically
     went to userspace, and all the others did not; the new code need
     not special case KVM_HC_MAP_GPA_RANGE and in fact does not care at
     all whether there was an exit to userspace or not

   - As part of enabling TDX virtual machines, support support
     separation of private/shared EPT into separate roots.

     When TDX will be enabled, operations on private pages will need to
     go through the privileged TDX Module via SEAMCALLs; as a result,
     they are limited and relatively slow compared to reading a PTE.

     The patches included in 6.14 allow KVM to keep a mirror of the
     private EPT in host memory, and define entries in kvm_x86_ops to
     operate on external page tables such as the TDX private EPT

   - The recently introduced conversion of the NX-page reclamation
     kthread to vhost_task moved the task under the main process. The
     task is created as soon as KVM_CREATE_VM was invoked and this, of
     course, broke userspace that didn't expect to see any child task of
     the VM process until it started creating its own userspace threads.

     In particular crosvm refuses to fork() if procfs shows any child
     task, so unbreak it by creating the task lazily. This is arguably a
     userspace bug, as there can be other kinds of legitimate worker
     tasks and they wouldn't impede fork(); but it's not like userspace
     has a way to distinguish kernel worker tasks right now. Should they
     show as "Kthread: 1" in proc/.../status?

  x86 - Intel:

   - Fix a bug where KVM updates hardware's APICv cache of the highest
     ISR bit while L2 is active, while ultimately results in a
     hardware-accelerated L1 EOI effectively being lost

   - Honor event priority when emulating Posted Interrupt delivery
     during nested VM-Enter by queueing KVM_REQ_EVENT instead of
     immediately handling the interrupt

   - Rework KVM's processing of the Page-Modification Logging buffer to
     reap entries in the same order they were created, i.e. to mark gfns
     dirty in the same order that hardware marked the page/PTE dirty

   - Misc cleanups

  Generic:

   - Cleanup and harden kvm_set_memory_region(); add proper lockdep
     assertions when setting memory regions and add a dedicated API for
     setting KVM-internal memory regions. The API can then explicitly
     disallow all flags for KVM-internal memory regions

   - Explicitly verify the target vCPU is online in kvm_get_vcpu() to
     fix a bug where KVM would return a pointer to a vCPU prior to it
     being fully online, and give kvm_for_each_vcpu() similar treatment
     to fix a similar flaw

   - Wait for a vCPU to come online prior to executing a vCPU ioctl, to
     fix a bug where userspace could coerce KVM into handling the ioctl
     on a vCPU that isn't yet onlined

   - Gracefully handle xarray insertion failures; even though such
     failures are impossible in practice after xa_reserve(), reserving
     an entry is always followed by xa_store() which does not know (or
     differentiate) whether there was an xa_reserve() before or not

  RISC-V:

   - Zabha, Svvptc, and Ziccrse extension support for guests. None of
     them require anything in KVM except for detecting them and marking
     them as supported; Zabha adds byte and halfword atomic operations,
     while the others are markers for specific operation of the TLB and
     of LL/SC instructions respectively

   - Virtualize SBI system suspend extension for Guest/VM

   - Support firmware counters which can be used by the guests to
     collect statistics about traps that occur in the host

  Selftests:

   - Rework vcpu_get_reg() to return a value instead of using an
     out-param, and update all affected arch code accordingly

   - Convert the max_guest_memory_test into a more generic
     mmu_stress_test. The basic gist of the "conversion" is to have the
     test do mprotect() on guest memory while vCPUs are accessing said
     memory, e.g. to verify KVM and mmu_notifiers are working as
     intended

   - Play nice with treewrite builds of unsupported architectures, e.g.
     arm (32-bit), as KVM selftests' Makefile doesn't do anything to
     ensure the target architecture is actually one KVM selftests
     supports

   - Use the kernel's $(ARCH) definition instead of the target triple
     for arch specific directories, e.g. arm64 instead of aarch64,
     mainly so as not to be different from the rest of the kernel

   - Ensure that format strings for logging statements are checked by
     the compiler even when the logging statement itself is disabled

   - Attempt to whack the last LLC references/misses mole in the Intel
     PMU counters test by adding a data load and doing CLFLUSH{OPT} on
     the data instead of the code being executed. It seems that modern
     Intel CPUs have learned new code prefetching tricks that bypass the
     PMU counters

   - Fix a flaw in the Intel PMU counters test where it asserts that
     events are counting correctly without actually knowing what the
     events count given the underlying hardware; this can happen if
     Intel reuses a formerly microarchitecture-specific event encoding
     as an architectural event, as was the case for Top-Down Slots"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (151 commits)
  kvm: defer huge page recovery vhost task to later
  KVM: x86/mmu: Return RET_PF* instead of 1 in kvm_mmu_page_fault()
  KVM: Disallow all flags for KVM-internal memslots
  KVM: x86: Drop double-underscores from __kvm_set_memory_region()
  KVM: Add a dedicated API for setting KVM-internal memslots
  KVM: Assert slots_lock is held when setting memory regions
  KVM: Open code kvm_set_memory_region() into its sole caller (ioctl() API)
  LoongArch: KVM: Add hypercall service support for usermode VMM
  LoongArch: KVM: Clear LLBCTL if secondary mmu mapping is changed
  KVM: SVM: Use str_enabled_disabled() helper in svm_hardware_setup()
  KVM: VMX: read the PML log in the same order as it was written
  KVM: VMX: refactor PML terminology
  KVM: VMX: Fix comment of handle_vmx_instruction()
  KVM: VMX: Reinstate __exit attribute for vmx_exit()
  KVM: SVM: Use str_enabled_disabled() helper in sev_hardware_setup()
  KVM: x86: Avoid double RDPKRU when loading host/guest PKRU
  KVM: x86: Use LVT_TIMER instead of an open coded literal
  RISC-V: KVM: Add new exit statstics for redirected traps
  RISC-V: KVM: Update firmware counters for various events
  RISC-V: KVM: Redirect instruction access fault trap to guest
  ...
2025-01-25 09:55:09 -08:00
Linus Torvalds
d0f93ac2c3 Merge tag 'docs-6.14' of git://git.lwn.net/linux
Pull Documentation updates from Jonathan Corbet:

 - Quite a bit of Chinese and Spanish translation work

 - Clarifying that Git commit IDs >12chars are OK

 - A new nvme-multipath document

 - A reorganization of the admin-guide top-level page to make it
   readable

 - Clarification of the role of Acked-by and maintainer discretion on
   their acceptance

 - Some reorganization of debugging-oriented docs

... and typo fixes, documentation updates, etc as usual

* tag 'docs-6.14' of git://git.lwn.net/linux: (50 commits)
  Documentation: Fix x86_64 UEFI outdated references to elilo
  Documentation/sysctl: Add timer_migration to kernel.rst
  docs/mm: Physical memory: Remove zone_t
  docs: submitting-patches: clarify that signers may use their discretion on tags
  docs: submitting-patches: clarify difference between Acked-by and Reviewed-by
  docs: submitting-patches: clarify Acked-by and introduce "# Suffix"
  Documentation: bug-hunting.rst: remove odd contact information
  docs/zh_CN: Add sak index Chinese translation
  doc: module: DEFAULT_SYMBOL_NAMESPACE must be defined before #includes
  doc: module: Fix documented type of namespace
  Documentation/kernel-parameters: Fix a reference to vga-softcursor.rst
  docs/zh_CN: Add landlock index Chinese translation
  Documentation: Fix typo localmodonfig -> localmodconfig
  overlayfs.rst: Fix and improve grammar
  docs/zh_CN: Add siphash index Chinese translation
  docs/zh_CN: Add security IMA-templates Chinese translation
  docs/zh_CN: Add security digsig Chinese translation
  Align git commit ID abbreviation guidelines and checks
  docs: process: submitting-patches: split canonical patch format section
  docs/zh_CN: Add security lsm Chinese translation
  ...
2025-01-21 18:00:00 -08:00
Paolo Bonzini
4f7ff70c05 Merge tag 'kvm-x86-misc-6.14' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.14:

 - Overhaul KVM's CPUID feature infrastructure to track all vCPU capabilities
   instead of just those where KVM needs to manage state and/or explicitly
   enable the feature in hardware.  Along the way, refactor the code to make
   it easier to add features, and to make it more self-documenting how KVM
   is handling each feature.

 - Rework KVM's handling of VM-Exits during event vectoring; this plugs holes
   where KVM unintentionally puts the vCPU into infinite loops in some scenarios
   (e.g. if emulation is triggered by the exit), and brings parity between VMX
   and SVM.

 - Add pending request and interrupt injection information to the kvm_exit and
   kvm_entry tracepoints respectively.

 - Fix a relatively benign flaw where KVM would end up redoing RDPKRU when
   loading guest/host PKRU, due to a refactoring of the kernel helpers that
   didn't account for KVM's pre-checking of the need to do WRPKRU.
2025-01-20 06:49:39 -05:00
Christoph Schlameuss
5021fd77d6 KVM: s390: Reject KVM_SET_GSI_ROUTING on ucontrol VMs
Prevent null pointer dereference when processing
KVM_IRQ_ROUTING_S390_ADAPTER routing entries.
The ioctl cannot be processed for ucontrol VMs.

Fixes: f65470661f ("KVM: s390/interrupt: do not pin adapter interrupt pages")
Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Tested-by: Hariharan Mari <hari55@linux.ibm.com>
Reviewed-by: Hariharan Mari <hari55@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20241216092140.329196-4-schlameuss@linux.ibm.com
Message-ID: <20241216092140.329196-4-schlameuss@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
2025-01-07 16:36:11 +01:00
Sean Christopherson
9be4ec35d6 KVM: x86: Advertise TSC_DEADLINE_TIMER in KVM_GET_SUPPORTED_CPUID
Unconditionally advertise TSC_DEADLINE_TIMER via KVM_GET_SUPPORTED_CPUID,
as KVM always emulates deadline mode, *if* the VM has an in-kernel local
APIC.  The odds of a VMM emulating the local APIC in userspace, not
emulating the TSC deadline timer, _and_ reflecting
KVM_GET_SUPPORTED_CPUID back into KVM_SET_CPUID2, i.e. the risk of
over-advertising and breaking any setups, is extremely low.

KVM has _unconditionally_ advertised X2APIC via CPUID since commit
0d1de2d901 ("KVM: Always report x2apic as supported feature"), and it
is completely impossible for userspace to emulate X2APIC as KVM doesn't
support forwarding the MSR accesses to userspace.  I.e. KVM has relied on
userspace VMMs to not misreport local APIC capabilities for nearly 13
years.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-38-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:20:01 -08:00
Sean Christopherson
04cd8f8628 KVM: x86: Disallow KVM_CAP_X86_DISABLE_EXITS after vCPU creation
Reject KVM_CAP_X86_DISABLE_EXITS if vCPUs have been created, as disabling
PAUSE/MWAIT/HLT exits after vCPUs have been created is broken and useless,
e.g. except for PAUSE on SVM, the relevant intercepts aren't updated after
vCPU creation.  vCPUs may also end up with an inconsistent configuration
if exits are disabled between creation of multiple vCPUs.

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Link: https://lore.kernel.org/all/9227068821b275ac547eb2ede09ec65d2281fe07.1680179693.git.houwenlong.hwl@antgroup.com
Link: https://lore.kernel.org/all/20230121020738.2973-2-kechenl@nvidia.com
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20241128013424.4096668-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-12-18 14:19:35 -08:00
Gianfranco Trad
311cf7164a Documentation: kvm: fix typo in api.rst
Fix minor typo in api.rst where the word physical was misspelled
as physcial.

Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20241115011831.300705-5-gianf.trad@gmail.com
2024-12-11 09:07:40 -07:00
Paolo Bonzini
7b541d557f Merge tag 'kvmarm-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.13, part #1

 - Support for stage-1 permission indirection (FEAT_S1PIE) and
   permission overlays (FEAT_S1POE), including nested virt + the
   emulated page table walker

 - Introduce PSCI SYSTEM_OFF2 support to KVM + client driver. This call
   was introduced in PSCIv1.3 as a mechanism to request hibernation,
   similar to the S4 state in ACPI

 - Explicitly trap + hide FEAT_MPAM (QoS controls) from KVM guests. As
   part of it, introduce trivial initialization of the host's MPAM
   context so KVM can use the corresponding traps

 - PMU support under nested virtualization, honoring the guest
   hypervisor's trap configuration and event filtering when running a
   nested guest

 - Fixes to vgic ITS serialization where stale device/interrupt table
   entries are not zeroed when the mapping is invalidated by the VM

 - Avoid emulated MMIO completion if userspace has requested synchronous
   external abort injection

 - Various fixes and cleanups affecting pKVM, vCPU initialization, and
   selftests
2024-11-14 07:05:36 -05:00
Paolo Bonzini
2e9a2c624e Merge branch 'kvm-docs-6.13' into HEAD
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago.

- Fix incorrect references to non-existing ioctls

- List registers supported by KVM_GET/SET_ONE_REG on s390

- Use rST internal links

- Reorganize the introduction to the API document
2024-11-13 07:18:12 -05:00
Oliver Upton
7ccd615bc6 Merge branch kvm-arm64/psci-1.3 into kvmarm/next
* kvm-arm64/psci-1.3:
  : PSCI v1.3 support, courtesy of David Woodhouse
  :
  : Bump KVM's PSCI implementation up to v1.3, with the added bonus of
  : implementing the SYSTEM_OFF2 call. Like other system-scoped PSCI calls,
  : this gets relayed to userspace for further processing with a new
  : KVM_SYSTEM_EVENT_SHUTDOWN flag.
  :
  : As an added bonus, implement client-side support for hibernation with
  : the SYSTEM_OFF2 call.
  arm64: Use SYSTEM_OFF2 PSCI call to power off for hibernate
  KVM: arm64: nvhe: Pass through PSCI v1.3 SYSTEM_OFF2 call
  KVM: selftests: Add test for PSCI SYSTEM_OFF2
  KVM: arm64: Add support for PSCI v1.2 and v1.3
  KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
  firmware/psci: Add definitions for PSCI v1.3 specification

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-11-11 18:36:46 +00:00
Paolo Bonzini
5b47f5a725 Documentation: kvm: reorganize introduction
Reorganize the text to mention file descriptors as early as possible.
Also mention capabilities early as they are a central part of KVM's
API.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
badd5372ec Documentation: kvm: replace section numbers with links
In order to simplify further introduction of hyperlinks, replace explicit
section numbers with rST hyperlinks.  The section numbers could actually
be removed now, but I'm not going to do a huge change throughout the file
for an RFC...

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
aae7527ea9 Documentation: kvm: fix a few mistakes
The only occurrence "Capability: none" actually meant the same as "basic".
Fix that and a few more aesthetic or content issues in the document.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Paolo Bonzini
e16e018e82 KVM: powerpc: remove remaining traces of KVM_CAP_PPC_RMA
This was only needed for PPC970 support, which is long gone: the
implementation was removed in 2014.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241023124507.280382-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08 05:57:13 -05:00
Sean Christopherson
dcb988cdac KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration
Add a quirk to control KVM's misguided initialization of select feature
MSRs to KVM's max configuration, as enabling features by default violates
KVM's approach of letting userspace own the vCPU model, and is actively
problematic for MSRs that are conditionally supported, as the vCPU will
end up with an MSR value that userspace can't restore.  E.g. if the vCPU
is configured with PDCM=0, userspace will save and attempt to restore a
non-zero PERF_CAPABILITIES, thanks to KVM's meddling.

Link: https://lore.kernel.org/r/20240802185511.305849-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01 09:22:31 -07:00
David Woodhouse
97413cea1c KVM: arm64: Add PSCI v1.3 SYSTEM_OFF2 function for hibernation
The PSCI v1.3 specification adds support for a SYSTEM_OFF2 function
which is analogous to ACPI S4 state. This will allow hosting
environments to determine that a guest is hibernated rather than just
powered off, and ensure that they preserve the virtual environment
appropriately to allow the guest to resume safely (or bump the
hardware_signature in the FACS to trigger a clean reboot instead).

This feature is safe to enable unconditionally (in a subsequent commit)
because it is exposed to userspace through the existing
KVM_SYSTEM_EVENT_SHUTDOWN event, just with an additional flag which
userspace can use to know that the instance intended hibernation instead
of a plain power-off.

As with SYSTEM_RESET2, there is only one type available (in this case
HIBERNATE_OFF), and it is not explicitly reported to userspace through
the event; userspace can get it from the registers if it cares).

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Miguel Luis <miguel.luis@oracle.com>
Link: https://lore.kernel.org/r/20241019172459.2241939-3-dwmw2@infradead.org
[oliver: slight cleanup of comments]
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-10-24 16:36:51 -07:00
Sean Christopherson
5a27984244 KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL
Massage the documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL to call out that
it applies to moved memslots as well as deleted memslots, to avoid KVM's
"fast zap" terminology (which has no meaning for userspace), and to reword
the documented targeted zap behavior to specifically say that KVM _may_
zap a subset of all SPTEs.  As evidenced by the fix to zap non-leafs SPTEs
with gPTEs, formally documenting KVM's exact internal behavior is risky
and unnecessary.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241009192345.1148353-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20 07:31:05 -04:00
Paolo Bonzini
3f8df62852 Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD
KVM VMX changes for 6.12:

 - Set FINAL/PAGE in the page fault error code for EPT Violations if and only
   if the GVA is valid.  If the GVA is NOT valid, there is no guest-side page
   table walk and so stuffing paging related metadata is nonsensical.

 - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of
   emulating posted interrupt delivery to L2.

 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.

 - Harden eVMCS loading against an impossible NULL pointer deref (really truly
   should be impossible).

 - Minor SGX fix and a cleanup.
2024-09-17 12:41:23 -04:00
Sean Christopherson
653ea4489e KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if
L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has
disallowed access to via an MSR filter.  Intel already disallows including
a handful of "special" MSRs in the VMCS lists, so denying access isn't
completely without precedent.

More importantly, the behavior is well-defined _and_ can be communicated
the end user, e.g. to the customer that owns a VM running as L1 on top of
KVM.  On the other hand, ignoring userspace MSR filters is all but
guaranteed to result in unexpected behavior as the access will hit KVM's
internal state, which is likely not up-to-date.

Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS
fields, the MSRs in the VMCS load/store lists are 100% guest controlled,
thus making it all but impossible to reason about the correctness of
ignoring the MSR filter.  And if userspace *really* wants to deny access
to MSRs via the aforementioned scenarios, userspace can hide the
associated feature from the guest, e.g. by disabling the PMU to prevent
accessing PERF_GLOBAL_CTRL via its VMCS field.  But for the MSR lists, KVM
is blindly processing MSRs; the  MSR filters are the _only_ way for
userspace to deny access.

This partially reverts commit ac8d6cad3c ("KVM: x86: Only do MSR
filtering when access MSR by rdmsr/wrmsr").

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22 11:35:16 -07:00
Yan Zhao
aa8d1f48d3 KVM: x86/mmu: Introduce a quirk to control memslot zap behavior
Introduce the quirk KVM_X86_QUIRK_SLOT_ZAP_ALL to allow users to select
KVM's behavior when a memslot is moved or deleted for KVM_X86_DEFAULT_VM
VMs. Make sure KVM behave as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs.

The KVM_X86_QUIRK_SLOT_ZAP_ALL quirk offers two behavior options:
- when enabled:  Invalidate/zap all SPTEs ("zap-all"),
- when disabled: Precisely zap only the leaf SPTEs within the range of the
                 moving/deleting memory slot ("zap-slot-leafs-only").

"zap-all" is today's KVM behavior to work around a bug [1] where the
changing the zapping behavior of memslot move/deletion would cause VM
instability for VMs with an Nvidia GPU assigned; while
"zap-slot-leafs-only" allows for more precise zapping of SPTEs within the
memory slot range, improving performance in certain scenarios [2], and
meeting the functional requirements for TDX.

Previous attempts to select "zap-slot-leafs-only" include a per-VM
capability approach [3] (which was not preferred because the root cause of
the bug remained unidentified) and a per-memslot flag approach [4]. Sean
and Paolo finally recommended the implementation of this quirk and
explained that it's the least bad option [5].

By default, the quirk is enabled on KVM_X86_DEFAULT_VM VMs to use
"zap-all". Users have the option to disable the quirk to select
"zap-slot-leafs-only" for specific KVM_X86_DEFAULT_VM VMs that are
unaffected by this bug.

For non-KVM_X86_DEFAULT_VM VMs, the "zap-slot-leafs-only" behavior is
always selected without user's opt-in, regardless of if the user opts for
"zap-all".
This is because it is assumed until proven otherwise that non-
KVM_X86_DEFAULT_VM VMs will not be exposed to the bug [1], and most
importantly, it's because TDX must have "zap-slot-leafs-only" always
selected. In TDX's case a memslot's GPA range can be a mixture of "private"
or "shared" memory. Shared is roughly analogous to how EPT is handled for
normal VMs, but private GPAs need lots of special treatment:
1) "zap-all" would require to zap private root page or non-leaf entries or
   at least leaf-entries beyond the deleting memslot scope. However, TDX
   demands that the root page of the private page table remains unchanged,
   with leaf entries being zapped before non-leaf entries, and any dropped
   private guest pages must be re-accepted by the guest.
2) if "zap-all" zaps only shared page tables, it would result in private
   pages still being mapped when the memslot is gone. This may affect even
   other processes if later the gmem fd was whole punched, causing the
   pages being freed on the host while still mapped in the TD, because
   there's no pgoff to the gfn information to zap the private page table
   after memslot is gone.

So, simply go "zap-slot-leafs-only" as if the quirk is always disabled for
non-KVM_X86_DEFAULT_VM VMs to avoid manual opt-in for every VM type [6] or
complicating quirk disabling interface (current quirk disabling interface
is limited, no way to query quirks, or force them to be disabled).

Add a new function kvm_mmu_zap_memslot_leafs() to implement
"zap-slot-leafs-only". This function does not call kvm_unmap_gfn_range(),
bypassing special handling to APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, as
1) The APIC_ACCESS_PAGE_PRIVATE_MEMSLOT cannot be created by users, nor can
   it be moved. It is only deleted by KVM when APICv is permanently
   inhibited.
2) kvm_vcpu_reload_apic_access_page() effectively does nothing when
   APIC_ACCESS_PAGE_PRIVATE_MEMSLOT is deleted.
3) Avoid making all cpus request of KVM_REQ_APIC_PAGE_RELOAD can save on
   costly IPIs.

Suggested-by: Kai Huang <kai.huang@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com [1]
Link: https://patchwork.kernel.org/project/kvm/patch/20190205210137.1377-11-sean.j.christopherson@intel.com/#25054908 [2]
Link: https://lore.kernel.org/kvm/20200713190649.GE29725@linux.intel.com/T/#mabc0119583dacf621025e9d873c85f4fbaa66d5c [3]
Link: https://lore.kernel.org/all/20240515005952.3410568-3-rick.p.edgecombe@intel.com [4]
Link: https://lore.kernel.org/all/7df9032d-83e4-46a1-ab29-6c7973a2ab0b@redhat.com [5]
Link: https://lore.kernel.org/all/ZnGa550k46ow2N3L@google.com [6]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20240703021043.13881-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-08-14 12:29:11 -04:00
Paolo Bonzini
747cfbf161 Merge tag 'kvmarm-fixes-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.11, round #1

 - Use kvfree() for the kvmalloc'd nested MMUs array

 - Set of fixes to address warnings in W=1 builds

 - Make KVM depend on assembler support for ARMv8.4

 - Fix for vgic-debug interface for VMs without LPIs

 - Actually check ID_AA64MMFR3_EL1.S1PIE in get-reg-list selftest

 - Minor code / comment cleanups for configuring PAuth traps

 - Take kvm->arch.config_lock to prevent destruction / initialization
   race for a vCPU's CPUIF which may lead to a UAF
2024-08-13 06:06:27 -04:00
Takahiro Itazuri
7fef1eb0b0 docs: KVM: Fix register ID of SPSR_FIQ
Fixes the register ID of SPSR_FIQ.

SPSR_FIQ is a 64-bit register and the 64-bit register size mask is
0x0030000000000000ULL.

Fixes: fd3bc912d3 ("KVM: Documentation: Document arm64 core registers in detail")
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230606154628.95498-1-itazur@amazon.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-08-05 20:16:35 +00:00
Paolo Bonzini
1773014a97 Merge branch 'kvm-fixes' into HEAD
* fix latent bug in how usage of large pages is determined for
  confidential VMs

* fix "underline too short" in docs

* eliminate log spam from limited APIC timer periods

* disallow pre-faulting of memory before SEV-SNP VMs are initialized

* delay clearing and encrypting private memory until it is added to
  guest page tables

* this change also enables another small cleanup: the checks in
  SNP_LAUNCH_UPDATE that limit it to non-populated, private pages
  can now be moved in the common kvm_gmem_populate() function
2024-08-02 12:33:43 -04:00
Paolo Bonzini
5932ca411e KVM: x86: disallow pre-fault for SNP VMs before initialization
KVM_PRE_FAULT_MEMORY for an SNP guest can race with
sev_gmem_post_populate() in bad ways. The following sequence for
instance can potentially trigger an RMP fault:

  thread A, sev_gmem_post_populate: called
  thread B, sev_gmem_prepare: places below 'pfn' in a private state in RMP
  thread A, sev_gmem_post_populate: *vaddr = kmap_local_pfn(pfn + i);
  thread A, sev_gmem_post_populate: copy_from_user(vaddr, src + i * PAGE_SIZE, PAGE_SIZE);
  RMP #PF

Fix this by only allowing KVM_PRE_FAULT_MEMORY to run after a guest's
initial private memory contents have been finalized via
KVM_SEV_SNP_LAUNCH_FINISH.

Beyond fixing this issue, it just sort of makes sense to enforce this,
since the KVM_PRE_FAULT_MEMORY documentation states:

  "KVM maps memory as if the vCPU generated a stage-2 read page fault"

which sort of implies we should be acting on the same guest state that a
vCPU would see post-launch after the initial guest memory is all set up.

Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 14:46:14 -04:00
Chang Yu
c2adcf051b KVM: Documentation: Fix title underline too short warning
Fix "WARNING: Title underline too short" by extending title line to the
proper length.

Signed-off-by: Chang Yu <marcus.yu.56@gmail.com>
Message-ID: <ZqB3lofbzMQh5Q-5@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-26 13:35:45 -04:00
Linus Torvalds
2c9b351240 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
 "ARM:

   - Initial infrastructure for shadow stage-2 MMUs, as part of nested
     virtualization enablement

   - Support for userspace changes to the guest CTR_EL0 value, enabling
     (in part) migration of VMs between heterogenous hardware

   - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1
     of the protocol

   - FPSIMD/SVE support for nested, including merged trap configuration
     and exception routing

   - New command-line parameter to control the WFx trap behavior under
     KVM

   - Introduce kCFI hardening in the EL2 hypervisor

   - Fixes + cleanups for handling presence/absence of FEAT_TCRX

   - Miscellaneous fixes + documentation updates

  LoongArch:

   - Add paravirt steal time support

   - Add support for KVM_DIRTY_LOG_INITIALLY_SET

   - Add perf kvm-stat support for loongarch

  RISC-V:

   - Redirect AMO load/store access fault traps to guest

   - perf kvm stat support

   - Use guest files for IMSIC virtualization, when available

  s390:

   - Assortment of tiny fixes which are not time critical

  x86:

   - Fixes for Xen emulation

   - Add a global struct to consolidate tracking of host values, e.g.
     EFER

   - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the
     effective APIC bus frequency, because TDX

   - Print the name of the APICv/AVIC inhibits in the relevant
     tracepoint

   - Clean up KVM's handling of vendor specific emulation to
     consistently act on "compatible with Intel/AMD", versus checking
     for a specific vendor

   - Drop MTRR virtualization, and instead always honor guest PAT on
     CPUs that support self-snoop

   - Update to the newfangled Intel CPU FMS infrastructure

   - Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as
     it reads '0' and writes from userspace are ignored

   - Misc cleanups

  x86 - MMU:

   - Small cleanups, renames and refactoring extracted from the upcoming
     Intel TDX support

   - Don't allocate kvm_mmu_page.shadowed_translation for shadow pages
     that can't hold leafs SPTEs

   - Unconditionally drop mmu_lock when allocating TDP MMU page tables
     for eager page splitting, to avoid stalling vCPUs when splitting
     huge pages

   - Bug the VM instead of simply warning if KVM tries to split a SPTE
     that is non-present or not-huge. KVM is guaranteed to end up in a
     broken state because the callers fully expect a valid SPTE, it's
     all but dangerous to let more MMU changes happen afterwards

  x86 - AMD:

   - Make per-CPU save_area allocations NUMA-aware

   - Force sev_es_host_save_area() to be inlined to avoid calling into
     an instrumentable function from noinstr code

   - Base support for running SEV-SNP guests. API-wise, this includes a
     new KVM_X86_SNP_VM type, encrypting/measure the initial image into
     guest memory, and finalizing it before launching it. Internally,
     there are some gmem/mmu hooks needed to prepare gmem-allocated
     pages before mapping them into guest private memory ranges

     This includes basic support for attestation guest requests, enough
     to say that KVM supports the GHCB 2.0 specification

     There is no support yet for loading into the firmware those signing
     keys to be used for attestation requests, and therefore no need yet
     for the host to provide certificate data for those keys.

     To support fetching certificate data from userspace, a new KVM exit
     type will be needed to handle fetching the certificate from
     userspace.

     An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS
     exit type to handle this was introduced in v1 of this patchset, but
     is still being discussed by community, so for now this patchset
     only implements a stub version of SNP Extended Guest Requests that
     does not provide certificate data

  x86 - Intel:

   - Remove an unnecessary EPT TLB flush when enabling hardware

   - Fix a series of bugs that cause KVM to fail to detect nested
     pending posted interrupts as valid wake eents for a vCPU executing
     HLT in L2 (with HLT-exiting disable by L1)

   - KVM: x86: Suppress MMIO that is triggered during task switch
     emulation

     Explicitly suppress userspace emulated MMIO exits that are
     triggered when emulating a task switch as KVM doesn't support
     userspace MMIO during complex (multi-step) emulation

     Silently ignoring the exit request can result in the
     WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace
     for some other reason prior to purging mmio_needed

     See commit 0dc902267c ("KVM: x86: Suppress pending MMIO write
     exits if emulator detects exception") for more details on KVM's
     limitations with respect to emulated MMIO during complex emulator
     flows

  Generic:

   - Rename the AS_UNMOVABLE flag that was introduced for KVM to
     AS_INACCESSIBLE, because the special casing needed by these pages
     is not due to just unmovability (and in fact they are only
     unmovable because the CPU cannot access them)

   - New ioctl to populate the KVM page tables in advance, which is
     useful to mitigate KVM page faults during guest boot or after live
     migration. The code will also be used by TDX, but (probably) not
     through the ioctl

   - Enable halt poll shrinking by default, as Intel found it to be a
     clear win

   - Setup empty IRQ routing when creating a VM to avoid having to
     synchronize SRCU when creating a split IRQCHIP on x86

   - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with
     a flag that arch code can use for hooking both sched_in() and
     sched_out()

   - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
     truncating a bogus value from userspace, e.g. to help userspace
     detect bugs

   - Mark a vCPU as preempted if and only if it's scheduled out while in
     the KVM_RUN loop, e.g. to avoid marking it preempted and thus
     writing guest memory when retrieving guest state during live
     migration blackout

  Selftests:

   - Remove dead code in the memslot modification stress test

   - Treat "branch instructions retired" as supported on all AMD Family
     17h+ CPUs

   - Print the guest pseudo-RNG seed only when it changes, to avoid
     spamming the log for tests that create lots of VMs

   - Make the PMU counters test less flaky when counting LLC cache
     misses by doing CLFLUSH{OPT} in every loop iteration"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  crypto: ccp: Add the SNP_VLEK_LOAD command
  KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
  KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
  KVM: x86: Replace static_call_cond() with static_call()
  KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
  x86/sev: Move sev_guest.h into common SEV header
  KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
  KVM: x86: Suppress MMIO that is triggered during task switch emulation
  KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
  KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
  KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
  KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
  KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
  KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
  KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
  KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: Document KVM_PRE_FAULT_MEMORY ioctl
  mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
  perf kvm: Add kvm-stat for loongarch64
  LoongArch: KVM: Add PV steal time support in guest side
  ...
2024-07-20 12:41:03 -07:00
Paolo Bonzini
5c5ddf7107 Merge tag 'kvm-x86-mtrrs-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 MTRR virtualization removal

Remove support for virtualizing MTRRs on Intel CPUs, along with a nasty CR0.CD
hack, and instead always honor guest PAT on CPUs that support self-snoop.
2024-07-16 09:54:57 -04:00
Paolo Bonzini
5dcc1e7614 Merge tag 'kvm-x86-misc-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM x86 misc changes for 6.11

 - Add a global struct to consolidate tracking of host values, e.g. EFER, and
   move "shadow_phys_bits" into the structure as "maxphyaddr".

 - Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
   bus frequency, because TDX.

 - Print the name of the APICv/AVIC inhibits in the relevant tracepoint.

 - Clean up KVM's handling of vendor specific emulation to consistently act on
   "compatible with Intel/AMD", versus checking for a specific vendor.

 - Misc cleanups
2024-07-16 09:53:05 -04:00
Paolo Bonzini
86014c1e20 Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11

 - Enable halt poll shrinking by default, as Intel found it to be a clear win.

 - Setup empty IRQ routing when creating a VM to avoid having to synchronize
   SRCU when creating a split IRQCHIP on x86.

 - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
   that arch code can use for hooking both sched_in() and sched_out().

 - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
   truncating a bogus value from userspace, e.g. to help userspace detect bugs.

 - Mark a vCPU as preempted if and only if it's scheduled out while in the
   KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
   memory when retrieving guest state during live migration blackout.

 - A few minor cleanups
2024-07-16 09:51:36 -04:00
Paolo Bonzini
1c5a0b55ab Merge tag 'kvmarm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 changes for 6.11

 - Initial infrastructure for shadow stage-2 MMUs, as part of nested
   virtualization enablement

 - Support for userspace changes to the guest CTR_EL0 value, enabling
   (in part) migration of VMs between heterogenous hardware

 - Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of
   the protocol

 - FPSIMD/SVE support for nested, including merged trap configuration
   and exception routing

 - New command-line parameter to control the WFx trap behavior under KVM

 - Introduce kCFI hardening in the EL2 hypervisor

 - Fixes + cleanups for handling presence/absence of FEAT_TCRX

 - Miscellaneous fixes + documentation updates
2024-07-16 09:50:44 -04:00
Paolo Bonzini
f0a23883fa Merge tag 'kvm-s390-next-6.11-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Assortment of tiny fixes which are not time critical:
- Rejecting memory region operations for ucontrol mode VMs
- Rewind the PSW on host intercepts for VSIE
- Remove unneeded include
2024-07-12 11:23:19 -04:00
Isaku Yamahata
9aed7a6c0b KVM: Document KVM_PRE_FAULT_MEMORY ioctl
Adds documentation of KVM_PRE_FAULT_MEMORY ioctl. [1]

It populates guest memory.  It doesn't do extra operations on the
underlying technology-specific initialization [2].  For example,
CoCo-related operations won't be performed.  Concretely for TDX, this API
won't invoke TDH.MEM.PAGE.ADD() or TDH.MR.EXTEND().  Vendor-specific APIs
are required for such operations.

The key point is to adapt of vcpu ioctl instead of VM ioctl.  First,
populating guest memory requires vcpu.  If it is VM ioctl, we need to pick
one vcpu somehow.  Secondly, vcpu ioctl allows each vcpu to invoke this
ioctl in parallel.  It helps to scale regarding guest memory size, e.g.,
hundreds of GB.

[1] https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/
[2] https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <9a060293c9ad9a78f1d8994cfe1311e818e99257.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-07-12 11:17:35 -04:00
Christoph Schlameuss
7816e58967 kvm: s390: Reject memory region operations for ucontrol VMs
This change rejects the KVM_SET_USER_MEMORY_REGION and
KVM_SET_USER_MEMORY_REGION2 ioctls when called on a ucontrol VM.
This is necessary since ucontrol VMs have kvm->arch.gmap set to 0 and
would thus result in a null pointer dereference further in.
Memory management needs to be performed in userspace and using the
ioctls KVM_S390_UCAS_MAP and KVM_S390_UCAS_UNMAP.

Also improve s390 specific documentation for KVM_SET_USER_MEMORY_REGION
and KVM_SET_USER_MEMORY_REGION2.

Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Fixes: 27e0393f15 ("KVM: s390: ucontrol: per vcpu address spaces")
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Link: https://lore.kernel.org/r/20240624095902.29375-1-schlameuss@linux.ibm.com
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
[frankja@linux.ibm.com: commit message spelling fix, subject prefix fix]
Message-ID: <20240624095902.29375-1-schlameuss@linux.ibm.com>
2024-07-04 09:07:24 +02:00
Changyuan Lyu
03bd36a387 KVM: Documentation: Enumerate allowed value macros of irq_type
The expression `irq_type[n]` may confuse readers to interpret `n`
as the bit position and think of CPU = 1 << 0, SPI = 1 << 1, and
PPI = 1 << 2.

Since arch/arm64/include/uapi/asm/kvm.h already has macro definitions
for the allowed values, this commit uses these symbols to clear up
the ambiguity.

Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Link: https://lore.kernel.org/r/20240623164542.2999626-2-changyuanl@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-06-28 22:00:18 +00:00
Changyuan Lyu
dc1d234cdd KVM: Documentation: Fix typo BFD
BDF is the acronym for Bus, Device, Function.

Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Link: https://lore.kernel.org/r/20240623164542.2999626-1-changyuanl@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-06-28 22:00:13 +00:00
Thomas Prescher
85542adb65 KVM: x86: Add KVM_RUN_X86_GUEST_MODE kvm_run flag
When a vCPU is interrupted by a signal while running a nested guest,
KVM will exit to userspace with L2 state. However, userspace has no
way to know whether it sees L1 or L2 state (besides calling
KVM_GET_STATS_FD, which does not have a stable ABI).

This causes multiple problems:

The simplest one is L2 state corruption when userspace marks the sregs
as dirty. See this mailing list thread [1] for a complete discussion.

Another problem is that if userspace decides to continue by emulating
instructions, it will unknowingly emulate with L2 state as if L1
doesn't exist, which can be considered a weird guest escape.

Introduce a new flag, KVM_RUN_X86_GUEST_MODE, in the kvm_run data
structure, which is set when the vCPU exited while running a nested
guest.  Also introduce a new capability, KVM_CAP_X86_GUEST_MODE, to
advertise the functionality to userspace.

[1] https://lore.kernel.org/kvm/20240416123558.212040-1-julian.stecklina@cyberus-technology.de/T/#m280aadcb2e10ae02c191a7dc4ed4b711a74b1f55

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20240508132502.184428-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:24:31 -07:00
Carlos López
508f0c7bf6 KVM: x86: Improve documentation for KVM_CAP_X86_BUS_LOCK_EXIT
Improve the description for the KVM_CAP_X86_BUS_LOCK_EXIT capability to
fix a few typos and grammar issues, and to clarify the purpose of the
capability.

Signed-off-by: Carlos López <clopez@suse.de>
Link: https://lore.kernel.org/r/20240424105616.29596-1-clopez@suse.de
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-11 09:22:57 -07:00
Shivaprasad G Bhat
9a0d2f4995 KVM: PPC: Book3S HV: Add one-reg interface for HASHPKEYR register
The patch adds a one-reg register identifier which can be used to
read and set the virtual HASHPKEYR for the guest during enter/exit
with KVM_REG_PPC_HASHPKEYR. The specific SPR KVM API documentation
too updated.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171759285547.1480.12374595786792346073.stgit@linux.ibm.com
2024-06-06 22:39:04 +10:00
Shivaprasad G Bhat
e9eb790b25 KVM: PPC: Book3S HV: Add one-reg interface for HASHKEYR register
The patch adds a one-reg register identifier which can be used to
read and set the virtual HASHKEYR for the guest during enter/exit
with KVM_REG_PPC_HASHKEYR. The specific SPR KVM API documentation
too updated.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171759283170.1480.12904332463112769129.stgit@linux.ibm.com
2024-06-06 22:39:04 +10:00
Shivaprasad G Bhat
1a1e6865f5 KVM: PPC: Book3S HV: Add one-reg interface for DEXCR register
The patch adds a one-reg register identifier which can be used to
read and set the DEXCR for the guest during enter/exit with
KVM_REG_PPC_DEXCR. The specific SPR KVM API documentation
too updated.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171759279613.1480.12873911783530175699.stgit@linux.ibm.com
2024-06-06 22:39:03 +10:00
Sean Christopherson
e1548088ff KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer
honors guest MTRR memtypes, as forcing UC memory for VMs with
non-coherent DMA only makes sense if the guest is using something other
than PAT to configure the memtype for the DMA region.

Furthermore, KVM has forced WB memory for CR0.CD=1 since commit
fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known
VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does
so with non-coherent DMA.

Lastly, commit fb279950ba ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was
from the same author as commit b18d5431ac ("KVM: x86: fix CR0.CD
virtualization"), and followed by a mere month.  I.e. forcing UC memory
was likely the result of code inspection or perhaps misdiagnosed failures,
and not the necessitate by a concrete use case.

Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now
AMD-only, and to take an erratum for lack of CR0.CD virtualization on
Intel.

Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 08:13:14 -07:00
Isaku Yamahata
6fef518594 KVM: x86: Add a capability to configure bus frequency for APIC timer
Add KVM_CAP_X86_APIC_BUS_CYCLES_NS capability to configure the APIC
bus clock frequency for APIC timer emulation.
Allow KVM_ENABLE_CAPABILITY(KVM_CAP_X86_APIC_BUS_CYCLES_NS) to set the
frequency in nanoseconds. When using this capability, the user space
VMM should configure CPUID leaf 0x15 to advertise the frequency.

Vishal reported that the TDX guest kernel expects a 25MHz APIC bus
frequency but ends up getting interrupts at a significantly higher rate.

The TDX architecture hard-codes the core crystal clock frequency to
25MHz and mandates exposing it via CPUID leaf 0x15. The TDX architecture
does not allow the VMM to override the value.

In addition, per Intel SDM:
    "The APIC timer frequency will be the processor’s bus clock or core
     crystal clock frequency (when TSC/core crystal clock ratio is
     enumerated in CPUID leaf 0x15) divided by the value specified in
     the divide configuration register."

The resulting 25MHz APIC bus frequency conflicts with the KVM hardcoded
APIC bus frequency of 1GHz.

The KVM doesn't enumerate CPUID leaf 0x15 to the guest unless the user
space VMM sets it using KVM_SET_CPUID. If the CPUID leaf 0x15 is
enumerated, the guest kernel uses it as the APIC bus frequency. If not,
the guest kernel measures the frequency based on other known timers like
the ACPI timer or the legacy PIT. As reported by Vishal the TDX guest
kernel expects a 25MHz timer frequency but gets timer interrupt more
frequently due to the 1GHz frequency used by KVM.

To ensure that the guest doesn't have a conflicting view of the APIC bus
frequency, allow the userspace to tell KVM to use the same frequency that
TDX mandates instead of the default 1Ghz.

Reported-by: Vishal Annapurve <vannapurve@google.com>
Closes: https://lore.kernel.org/lkml/20231006011255.4163884-1-vannapurve@google.com
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Co-developed-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/6748a4c12269e756f0c48680da8ccc5367c31ce7.1714081726.git.reinette.chatre@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05 06:18:27 -07:00
Julian Stecklina
f2362c0475 KVM: fix documentation rendering for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
The documentation for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM doesn't use the
correct keyword formatting, which breaks rendering on
https://www.kernel.org/doc/html/latest/virt/kvm/api.html.

Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
Link: https://lore.kernel.org/r/20240520143220.340737-1-julian.stecklina@cyberus-technology.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03 08:56:39 -07:00
Linus Torvalds
ff2632d7d0 Merge tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:

 - Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT.

 - Allow per-process DEXCR (Dynamic Execution Control Register) settings
   via prctl, notably NPHIE which controls hashst/hashchk for ROP
   protection.

 - Install powerpc selftests in sub-directories. Note this changes the
   way run_kselftest.sh needs to be invoked for powerpc selftests.

 - Change fadump (Firmware Assisted Dump) to better handle memory
   add/remove.

 - Add support for passing additional parameters to the fadump kernel.

 - Add support for updating the kdump image on CPU/memory add/remove
   events.

 - Other small features, cleanups and fixes.

Thanks to Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd
Bergmann, Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe
Jaillet, Christophe Leroy, Colin Ian King, Cédric Le Goater, Dr. David
Alan Gilbert, Erhard Furtner, Frank Li, GUO Zihua, Ganesh Goudar, Geoff
Levand, Ghanshyam Agrawal, Greg Kurz, Hari Bathini, Joel Stanley, Justin
Stitt, Kunwu Chan, Li Yang, Lidong Zhong, Madhavan Srinivasan, Mahesh
Salgaonkar, Masahiro Yamada, Matthias Schiffer, Naresh Kamboju, Nathan
Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Miehlbradt, Ran Wang,
Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta, Shrikanth
Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav
Jain, Xiaowei Bao, Yang Li, and Zhao Chenhui.

* tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (85 commits)
  powerpc/fadump: Fix section mismatch warning
  powerpc/85xx: fix compile error without CONFIG_CRASH_DUMP
  powerpc/fadump: update documentation about bootargs_append
  powerpc/fadump: pass additional parameters when fadump is active
  powerpc/fadump: setup additional parameters for dump capture kernel
  powerpc/pseries/fadump: add support for multiple boot memory regions
  selftests/powerpc/dexcr: Fix spelling mistake "predicition" -> "prediction"
  KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info()
  KVM: PPC: Fix documentation for ppc mmu caps
  KVM: PPC: code cleanup for kvmppc_book3s_irqprio_deliver
  KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception
  powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#"
  powerpc/code-patching: Use dedicated memory routines for patching
  powerpc/code-patching: Test patch_instructions() during boot
  powerpc64/kasan: Pass virtual addresses to kasan_init_phys_region()
  powerpc: rename SPRN_HID2 define to SPRN_HID2_750FX
  powerpc: Fix typos
  powerpc/eeh: Fix spelling of the word "auxillary" and update comment
  macintosh/ams: Fix unused variable warning
  powerpc/Makefile: Remove bits related to the previous use of -mcmodel=large
  ...
2024-05-17 09:05:46 -07:00
Paolo Bonzini
f4bc1373d5 Merge tag 'kvm-x86-generic-6.10' of https://github.com/kvm-x86/linux into HEAD
KVM cleanups for 6.10:

 - Misc cleanups extracted from the "exit on missing userspace mapping" series,
   which has been put on hold in anticipation of a "KVM Userfault" approach,
   which should provide a superset of functionality.

 - Remove kvm_make_all_cpus_request_except(), which got added to hack around an
   AVIC bug, and then became dead code when a more robust fix came along.

 - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation.
2024-05-12 03:16:47 -04:00
Paolo Bonzini
e5f62e27b1 Merge tag 'kvmarm-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 6.10

- Move a lot of state that was previously stored on a per vcpu
  basis into a per-CPU area, because it is only pertinent to the
  host while the vcpu is loaded. This results in better state
  tracking, and a smaller vcpu structure.

- Add full handling of the ERET/ERETAA/ERETAB instructions in
  nested virtualisation. The last two instructions also require
  emulating part of the pointer authentication extension.
  As a result, the trap handling of pointer authentication has
  been greattly simplified.

- Turn the global (and not very scalable) LPI translation cache
  into a per-ITS, scalable cache, making non directly injected
  LPIs much cheaper to make visible to the vcpu.

- A batch of pKVM patches, mostly fixes and cleanups, as the
  upstreaming process seems to be resuming. Fingers crossed!

- Allocate PPIs and SGIs outside of the vcpu structure, allowing
  for smaller EL2 mapping and some flexibility in implementing
  more or less than 32 private IRQs.

- Purge stale mpidr_data if a vcpu is created after the MPIDR
  map has been created.

- Preserve vcpu-specific ID registers across a vcpu reset.

- Various minor cleanups and improvements.
2024-05-12 03:15:53 -04:00