Commit bf2c3138 authored by Paolo Bonzini

Merge tag 'kvm-x86-pmu-6.20' of https://github.com/kvm-x86/linux into HEAD

KVM mediated PMU support for 6.20

Add support for mediated PMUs, where KVM gives the guest full ownership of PMU
hardware (context switched around the fastpath run loop) and allows direct
access to data MSRs and PMCs (restricted by the vPMU model), but intercepts
access to control registers, e.g. to enforce event filtering and to prevent the
guest from profiling sensitive host state.

To keep overall complexity reasonable, mediated PMU usage is all or nothing
for a given instance of KVM (controlled via module param).  The mediated PMU
is disabled by default, partly to maintain backwards compatibility for existing
setups, partly because there are tradeoffs when running with a mediated PMU that
may be non-starters for some use cases, e.g. the host loses the ability to
profile guests with mediated PMUs, the fastpath run loop is also a blind spot,
entry/exit transitions are more expensive, etc.
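
Because the mediated PMU is opt-in per KVM instance, it can help to check
which mode a host is actually running in. The following is a minimal sketch,
not part of this merge: it assumes the parameter is readable through the usual
sysfs location for module parameters (e.g. the hypothetical path
/sys/module/kvm_intel/parameters/enable_mediated_pmu, or kvm_amd on AMD hosts)
and relies on boolean module parameters reading back as "Y" or "N". The helper
names parse_bool_param() and read_bool_param() are illustrative only.

```c
#include <stdio.h>

/*
 * Parse the text of a boolean module parameter as read from sysfs.
 * Returns 1 if enabled, 0 if disabled, -1 if the text is not recognized.
 */
int parse_bool_param(const char *text)
{
	if (!text)
		return -1;

	switch (text[0]) {
	case 'Y':
	case 'y':
	case '1':
		return 1;
	case 'N':
	case 'n':
	case '0':
		return 0;
	default:
		return -1;
	}
}

/*
 * Read a boolean module parameter from the given sysfs path.
 * The path is supplied by the caller; which path exists on a given host
 * (kvm_intel vs. kvm_amd, readable or not) is an assumption to verify.
 * Returns 1/0 as parse_bool_param() does, or -1 if the file is unreadable.
 */
int read_bool_param(const char *path)
{
	char buf[8] = "";
	FILE *f = fopen(path, "r");
	int ret;

	if (!f)
		return -1;

	ret = fgets(buf, sizeof(buf), f) ? parse_bool_param(buf) : -1;
	fclose(f);
	return ret;
}
```

A caller would pass the path for the vendor module that is loaded and treat a
return value of -1 as "unknown", e.g. because the module is not loaded or the
kernel predates the parameter.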

Versus the emulated PMU, where KVM is "just another perf user", the mediated
PMU delivers more accurate profiling and monitoring (no risk of contention and
thus dropped events), with significantly less overhead (fewer exits and faster
emulation/programming of event selectors).  E.g. when running Specint-2017 on
a single-socket Sapphire Rapids with 56 cores and no-SMT, and using perf from
within the guest:

  Perf command:
  a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
  b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

  Guest performance overhead:
  ---------------------------------------------------------------------------
  | Test case          | emulated vPMU | all passthrough | passthrough with |
  |                    |               |                 | event filters    |
  ---------------------------------------------------------------------------
  | basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
  ---------------------------------------------------------------------------
  | multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
  ---------------------------------------------------------------------------
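
To put the table in relative terms, the sketch below divides the emulated vPMU
overhead by each passthrough overhead. It is only arithmetic on the figures
quoted above; the struct, array, and function names are hypothetical and do
not come from the kernel or from the benchmark tooling.

```c
#include <stdio.h>

/* One row of the "Guest performance overhead" table, in percent. */
struct overhead_row {
	const char *test_case;
	double emulated;	/* emulated vPMU */
	double passthrough;	/* all passthrough */
	double filtered;	/* passthrough with event filters */
};

/* Figures copied from the table in the merge message. */
static const struct overhead_row overhead_rows[] = {
	{ "basic-sampling",     33.62, 4.24,  6.21 },
	{ "multiplex-sampling", 79.32, 7.34, 10.45 },
};

/*
 * How many times larger the emulated overhead is than the mediated one.
 * Returns 0.0 for a non-positive denominator instead of dividing by zero.
 */
double overhead_ratio(double emulated_pct, double mediated_pct)
{
	if (mediated_pct <= 0.0)
		return 0.0;
	return emulated_pct / mediated_pct;
}

/* Print both ratios for every row of the table. */
void print_overhead_ratios(void)
{
	size_t n = sizeof(overhead_rows) / sizeof(overhead_rows[0]);

	for (size_t i = 0; i < n; i++) {
		const struct overhead_row *r = &overhead_rows[i];

		printf("%-20s passthrough: %.1fx  with filters: %.1fx\n",
		       r->test_case,
		       overhead_ratio(r->emulated, r->passthrough),
		       overhead_ratio(r->emulated, r->filtered));
	}
}
```

Calling print_overhead_ratios() shows the emulated vPMU costing roughly 7.9x
(basic) and 10.8x (multiplex) the overhead of full passthrough for these two
workloads; the numbers say nothing about other hardware or event mixes.
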
parents 1b13885e d374b89e
+49 −0
@@ -3079,6 +3079,26 @@ Kernel parameters

			Default is Y (on).

	kvm.enable_pmu=[KVM,X86]
			If enabled, KVM will virtualize PMU functionality based
			on the virtual CPU model defined by userspace.  This
			can be overridden on a per-VM basis via
			KVM_CAP_PMU_CAPABILITY.

			If disabled, KVM will not virtualize PMU functionality,
			e.g. MSRs, PMCs, PMIs, etc., even if userspace defines
			a virtual CPU model that contains PMU assets.

			Note, KVM's vPMU support implicitly requires running
			with an in-kernel local APIC, e.g. to deliver PMIs to
			the guest.  Running without an in-kernel local APIC is
			not supported, though KVM will allow such a combination
			(with severely degraded functionality).

			See also enable_mediated_pmu.

			Default is Y (on).

	kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86]
			If enabled, KVM will enable virtualization in hardware
			when KVM is loaded, and disable virtualization when KVM
@@ -3125,6 +3145,35 @@ Kernel parameters
			If the value is 0 (the default), KVM will pick a period based
			on the ratio, such that a page is zapped after 1 hour on average.

	kvm-{amd,intel}.enable_mediated_pmu=[KVM,AMD,INTEL]
			If enabled, KVM will provide a mediated virtual PMU,
			instead of the default perf-based virtual PMU (if
			kvm.enable_pmu is true and PMU is enumerated via the
			virtual CPU model).

			With a perf-based vPMU, KVM operates as a user of perf,
			i.e. emulates guest PMU counters using perf events.
			KVM-created perf events are managed by perf as regular
			(guest-only) events, e.g. are scheduled in/out, contend
			for hardware resources, etc.  Using a perf-based vPMU
			allows guest and host usage of the PMU to co-exist, but
			incurs non-trivial overhead and can result in silently
			dropped guest events (due to resource contention).

			With a mediated vPMU, hardware PMU state is context
			switched around the world switch to/from the guest.
			KVM mediates which events the guest can utilize, but
			gives the guest direct access to all other PMU assets
			when possible (KVM may intercept some accesses if the
			virtual CPU model provides a subset of hardware PMU
			functionality).  Using a mediated vPMU significantly
			reduces PMU virtualization overhead and eliminates lost
			guest events, but is mutually exclusive with using perf
			to profile KVM guests and adds latency to most VM-Exits
			(to context switch PMU state).

			Default is N (off).

	kvm-amd.nested=	[KVM,AMD] Control nested virtualization feature in
			KVM/SVM. Default is 1 (enabled).

+1 −1
@@ -2413,7 +2413,7 @@ static int __init init_subsystems(void)
	if (err)
		goto out;

-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();

out:
	if (err)
+1 −1
@@ -402,7 +402,7 @@ static int kvm_loongarch_env_init(void)
	}

	kvm_init_gcsr_flag();
-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();

	/* Register LoongArch IPI interrupt controller interface. */
	ret = kvm_loongarch_register_ipi_device();
+1 −1
@@ -174,7 +174,7 @@ static int __init riscv_kvm_init(void)

	kvm_riscv_setup_vendor_features();

-	kvm_register_perf_callbacks(NULL);
+	kvm_register_perf_callbacks();

	rc = kvm_init(sizeof(struct kvm_vcpu), 0, THIS_MODULE);
	if (rc) {
+1 −0
@@ -114,6 +114,7 @@ static idtentry_t sysvec_table[NR_SYSTEM_VECTORS] __ro_after_init = {

	SYSVEC(IRQ_WORK_VECTOR,			irq_work),

	SYSVEC(PERF_GUEST_MEDIATED_PMI_VECTOR,	perf_guest_mediated_pmi_handler),
	SYSVEC(POSTED_INTR_VECTOR,		kvm_posted_intr_ipi),
	SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	kvm_posted_intr_wakeup_ipi),
	SYSVEC(POSTED_INTR_NESTED_VECTOR,	kvm_posted_intr_nested_ipi),