Commit f5caf26f authored by Besar Wicaksono, committed by Will Deacon

perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU



The Unified Coherence Fabric (UCF) contains the last level cache
and cache coherent interconnect in the Tegra410 SoC. The PMU in
this device can be used to capture events related to accesses
to the last level cache and memory from different sources.

Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Signed-off-by: Will Deacon <will@kernel.org>
parent d332424d
+1 −0
@@ -25,6 +25,7 @@ Performance monitor support
   alibaba_pmu
   dwc_pcie_pmu
   nvidia-tegra241-pmu
   nvidia-tegra410-pmu
   meson-ddr-pmu
   cxl
   ampere_cspmu
+106 −0
=====================================================================
NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
=====================================================================

The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:

* Unified Coherence Fabric (UCF)

PMU Driver
----------

The PMU driver describes the available events and configuration of each PMU in
sysfs. See the sections below for the sysfs path of each PMU. Like other uncore
PMU drivers, the driver provides a "cpumask" sysfs attribute to show the CPU id
used to handle the PMU event. There is also an "associated_cpus" sysfs
attribute, which contains a list of CPUs associated with the PMU instance.

UCF PMU
-------

The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
distributed last level cache for CPU memory and CXL memory, and as a cache
coherent interconnect that supports hardware coherence across multiple
coherently caching agents, including:

  * CPU clusters
  * GPU
  * PCIe Ordering Controller Unit (OCU)
  * Other IO-coherent requesters

The events and configuration options of this PMU device are described in sysfs;
see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.

Some of the events available in this PMU can be used to measure bandwidth and
utilization:

  * slc_access_rd: count the number of read requests to SLC.
  * slc_access_wr: count the number of write requests to SLC.
  * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
  * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
  * mem_access_rd: count the number of read requests to local or remote memory.
  * mem_access_wr: count the number of write requests to local or remote memory.
  * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
  * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
  * cycles: count the number of UCF cycles.

The average bandwidth is calculated as::

   AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
   AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
   AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS

The average request rate is calculated as::

   AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
   AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
   AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
   AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
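
As a worked illustration of the formulas above, the following C sketch computes
both quantities from raw counter values. The function names are hypothetical
and not part of the driver; the counter values are assumed to come from a tool
such as perf stat, with the elapsed time measured over the same interval.

```c
#include <stdint.h>

/*
 * Average bandwidth in GB/s: bytes divided by elapsed nanoseconds.
 * 1 byte/ns equals 10^9 bytes/s, so no extra scaling is needed.
 * Returns 0.0 if no time has elapsed.
 */
double ucf_avg_bandwidth_gbps(uint64_t bytes, uint64_t elapsed_ns)
{
	if (elapsed_ns == 0)
		return 0.0;
	return (double)bytes / (double)elapsed_ns;
}

/*
 * Average request rate in requests per UCF cycle.
 * Returns 0.0 if the cycle counter did not advance.
 */
double ucf_avg_request_rate(uint64_t accesses, uint64_t cycles)
{
	if (cycles == 0)
		return 0.0;
	return (double)accesses / (double)cycles;
}
```

For instance, 2000 bytes reported by slc_bytes_rd over 1000 ns corresponds to
2.0 GB/s, and 500 slc_access_rd requests over 1000 cycles corresponds to 0.5
requests per cycle.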

More details about the other available events can be found in the Tegra410 SoC
technical reference manual.

The events can be filtered based on source or destination. The source filter
indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
remote socket. The destination filter specifies the destination memory type,
e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
local/remote classification of the destination filter is based on the home
socket of the address, not where the data actually resides. The available
filters are described in
/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.

The list of UCF PMU event filters:

* Source filter:

  * src_loc_cpu: if set, count events from local CPU
  * src_loc_noncpu: if set, count events from local non-CPU device
  * src_rem: if set, count events from CPU, GPU, and PCIe devices of the remote socket

* Destination filter:

  * dst_loc_cmem: if set, count events to local system memory (CMEM) address
  * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
  * dst_loc_other: if set, count events to local CXL memory address
  * dst_rem: if set, count events to CPU, GPU, and CXL memory addresses of the remote socket

If the source is not specified, the PMU will count events from all sources. If
the destination is not specified, the PMU will count events to all destinations.
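
The default behavior described above can be sketched in userspace C. The bit
positions follow the format attributes declared by the driver in this commit
(config1 bits 0-2 for sources, bits 8-11 for destinations); on a running system
the format directory in sysfs remains the authoritative description. The macro
and function names are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Bit positions as declared by the driver's format attributes (config1). */
#define UCF_SRC_LOC_NONCPU  (1u << 0)
#define UCF_SRC_LOC_CPU     (1u << 1)
#define UCF_SRC_REM         (1u << 2)
#define UCF_DST_LOC_CMEM    (1u << 8)
#define UCF_DST_LOC_GMEM    (1u << 9)
#define UCF_DST_LOC_OTHER   (1u << 10)
#define UCF_DST_REM         (1u << 11)

#define UCF_SRC_MASK        0x007u  /* bits 2:0 */
#define UCF_DST_MASK        0xF00u  /* bits 11:8 */

/*
 * Hypothetical helper: return the filter value that takes effect for a
 * requested config1 value. An empty source or destination group is
 * widened to "all", matching the documented default. Bits outside the
 * two groups are ignored.
 */
uint32_t ucf_effective_filter(uint32_t config1)
{
	uint32_t src = config1 & UCF_SRC_MASK;
	uint32_t dst = config1 & UCF_DST_MASK;

	if (src == 0)
		src = UCF_SRC_MASK;
	if (dst == 0)
		dst = UCF_DST_MASK;

	return src | dst;
}
```

With no filter set, the sketch yields 0xF07 (all sources and all destinations).
With only src_loc_cpu and dst_loc_cmem set, it yields 0x102.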

Example usage:

* Count event id 0x0 in socket 0 from all sources and to all destinations::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0/

* Count event id 0x0 in socket 0 with source filter = local CPU and destination
  filter = local system memory (CMEM)::

    perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/

* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
  destination filter = remote memory::

    perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
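
When such command lines are generated by a tool rather than typed by hand, the
event specifier can be assembled programmatically. The following is a minimal
sketch; the helper name and its parameters are hypothetical, and the output
simply mirrors the syntax of the examples above.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: format
 *   "nvidia_ucf_pmu_<socket-id>/event=<id>[,<filters>]/"
 * into buf. filters is a comma-separated list such as
 * "src_loc_cpu=0x1,dst_loc_cmem=0x1", or NULL/"" for no filter.
 * Returns the snprintf() result: the length required, excluding the
 * terminating NUL byte.
 */
int ucf_event_spec(char *buf, size_t len, unsigned int socket_id,
		   unsigned int event_id, const char *filters)
{
	if (filters && filters[0] != '\0')
		return snprintf(buf, len, "nvidia_ucf_pmu_%u/event=0x%x,%s/",
				socket_id, event_id, filters);
	return snprintf(buf, len, "nvidia_ucf_pmu_%u/event=0x%x/",
			socket_id, event_id);
}
```

The resulting string can be passed to perf stat after the -e option. A return
value greater than or equal to len indicates that the output was truncated.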
+86 −1
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 *
 */

@@ -21,6 +21,13 @@
#define NV_CNVL_PORT_COUNT           4ULL
#define NV_CNVL_FILTER_ID_MASK       GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)

#define NV_UCF_SRC_COUNT             3ULL
#define NV_UCF_DST_COUNT             4ULL
#define NV_UCF_FILTER_ID_MASK        GENMASK_ULL(11, 0)
#define NV_UCF_FILTER_SRC            GENMASK_ULL(2, 0)
#define NV_UCF_FILTER_DST            GENMASK_ULL(11, 8)
#define NV_UCF_FILTER_DEFAULT        (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)

#define NV_GENERIC_FILTER_ID_MASK    GENMASK_ULL(31, 0)

#define NV_PRODID_MASK	(PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -124,6 +131,36 @@ static struct attribute *mcf_pmu_event_attrs[] = {
	NULL,
};

static struct attribute *ucf_pmu_event_attrs[] = {
	ARM_CSPMU_EVENT_ATTR(bus_cycles,            0x1D),

	ARM_CSPMU_EVENT_ATTR(slc_allocate,          0xF0),
	ARM_CSPMU_EVENT_ATTR(slc_wb,                0xF3),
	ARM_CSPMU_EVENT_ATTR(slc_refill_rd,         0x109),
	ARM_CSPMU_EVENT_ATTR(slc_refill_wr,         0x10A),
	ARM_CSPMU_EVENT_ATTR(slc_hit_rd,            0x119),

	ARM_CSPMU_EVENT_ATTR(slc_access_dataless,   0x183),
	ARM_CSPMU_EVENT_ATTR(slc_access_atomic,     0x184),

	ARM_CSPMU_EVENT_ATTR(slc_access_rd,         0x111),
	ARM_CSPMU_EVENT_ATTR(slc_access_wr,         0x112),
	ARM_CSPMU_EVENT_ATTR(slc_bytes_rd,          0x113),
	ARM_CSPMU_EVENT_ATTR(slc_bytes_wr,          0x114),

	ARM_CSPMU_EVENT_ATTR(mem_access_rd,         0x121),
	ARM_CSPMU_EVENT_ATTR(mem_access_wr,         0x122),
	ARM_CSPMU_EVENT_ATTR(mem_bytes_rd,          0x123),
	ARM_CSPMU_EVENT_ATTR(mem_bytes_wr,          0x124),

	ARM_CSPMU_EVENT_ATTR(local_snoop,           0x180),
	ARM_CSPMU_EVENT_ATTR(ext_snp_access,        0x181),
	ARM_CSPMU_EVENT_ATTR(ext_snp_evict,         0x182),

	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
	NULL
};

static struct attribute *generic_pmu_event_attrs[] = {
	ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
	NULL,
@@ -152,6 +189,18 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
	NULL,
};

static struct attribute *ucf_pmu_format_attrs[] = {
	ARM_CSPMU_FORMAT_EVENT_ATTR,
	ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
	ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
	ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
	ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
	ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
	ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
	ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
	NULL
};

static struct attribute *generic_pmu_format_attrs[] = {
	ARM_CSPMU_FORMAT_EVENT_ATTR,
	ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -236,6 +285,27 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
	writel(filter, cspmu->base0 + PMCCFILTR);
}

static u32 ucf_pmu_event_filter(const struct perf_event *event)
{
	u32 ret, filter, src, dst;

	filter = nv_cspmu_event_filter(event);

	/* Monitor all sources if none is selected. */
	src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
	if (src == 0)
		src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);

	/* Monitor all destinations if none is selected. */
	dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
	if (dst == 0)
		dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);

	ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
	ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);

	return ret;
}

enum nv_cspmu_name_fmt {
	NAME_FMT_GENERIC,
@@ -342,6 +412,21 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
		.init_data = NULL
	  },
	},
	{
	  .prodid = 0x2CF20000,
	  .prodid_mask = NV_PRODID_MASK,
	  .name_pattern = "nvidia_ucf_pmu_%u",
	  .name_fmt = NAME_FMT_SOCKET,
	  .template_ctx = {
		.event_attr = ucf_pmu_event_attrs,
		.format_attr = ucf_pmu_format_attrs,
		.filter_mask = NV_UCF_FILTER_ID_MASK,
		.filter_default_val = NV_UCF_FILTER_DEFAULT,
		.filter2_mask = 0x0,
		.filter2_default_val = 0x0,
		.get_filter = ucf_pmu_event_filter,
	  },
	},
	{
	  .prodid = 0,
	  .prodid_mask = 0,