Commit 3f071d00 authored by Dave Airlie

Merge tag 'drm-xe-next-2026-03-12' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-next



UAPI Changes:
- add VM_BIND DECOMPRESS support and on-demand decompression (Nitin)
- Allow per queue programming of COMMON_SLICE_CHICKEN3 bit13 (Lionel)

Cross-subsystem Changes:
- Introduce the DRM RAS infrastructure over generic netlink (Riana, Rodrigo)

Core Changes:
- Two-pass MMU interval notifiers (Thomas)

Driver Changes:
- Merge drm/drm-next into drm-xe-next (Brost)
- Fix overflow in guc_ct_snapshot_capture (Mika, Fixes)
- Extract gt_pta_entry (Gustavo)
- Extra enabling patches for NVL-P (Gustavo)
- Add Wa_14026578760 (Varun)
- Add type-specific GT loop iterator (Roper)
- Refactor xe_migrate_prepare_vm (Raag)
- Don't disable GuCRC in suspend path (Vinay, Fixes)
- Add missing kernel docs in xe_exec_queue.c (Niranjana)
- Change TEST_VRAM to work with 32-bit resource_size_t (Wajdeczko)
- Fix memory leak in xe_vm_madvise_ioctl (Varun, Fixes)
- Skip access counter queue init for unsupported platforms (Himal)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/abLUVfSHu8EHRF9q@lstrano-desk.jf.intel.com
parents 38cb89a6 42d3b66d
+103 −0
.. SPDX-License-Identifier: GPL-2.0+

============================
DRM RAS over Generic Netlink
============================

The DRM RAS (Reliability, Availability, Serviceability) interface provides a
standardized way for GPU/accelerator drivers to expose error counters and
other reliability nodes to user space via Generic Netlink. This allows
diagnostic tools, monitoring daemons, or test infrastructure to query hardware
health in a uniform way across different DRM drivers.

Key Goals:

* Provide a standardized RAS solution for GPU and accelerator drivers, enabling
  data center monitoring and reliability operations.
* Implement a single drm-ras Generic Netlink family to meet modern Netlink YAML
  specifications and centralize all RAS-related communication in one namespace.
* Support a basic error counter interface, addressing the immediate, essential
  monitoring needs.
* Offer a flexible, future-proof interface that can be extended to support
  additional types of RAS data in the future.
* Allow multiple nodes per driver, enabling drivers to register separate
  nodes for different IP blocks, sub-blocks, or other logical subdivisions
  as applicable.

Nodes
=====

Nodes are logical abstractions representing an error type or error source within
the device. Currently, only error counter nodes are supported.

Drivers are responsible for registering and unregistering nodes via the
`drm_ras_node_register()` and `drm_ras_node_unregister()` APIs.

Node Management
-------------------

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :doc: DRM RAS Node Management

.. kernel-doc:: drivers/gpu/drm/drm_ras.c
   :internal:

Generic Netlink Usage
=====================

The interface is implemented as a Generic Netlink family named ``drm-ras``.
User space tools can:

* List registered nodes with the ``list-nodes`` command.
* List all error counters in a node with the ``get-error-counter`` command,
  passing ``node-id`` as a parameter.
* Query a specific error counter value with the ``get-error-counter`` command,
  passing both ``node-id`` and ``error-id`` as parameters.
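
The three operations above can be pictured with a small in-memory model. This is only an illustrative sketch, not the kernel implementation: the ``NODES`` table and both helper functions are hypothetical, and the sample names and values are taken from the ynl examples in this document. It shows that ``list-nodes`` returns one entry per node, that ``get-error-counter`` with only ``node-id`` returns every counter of that node, and that adding ``error-id`` narrows the reply to a single counter.

```python
# Hypothetical in-memory stand-in for the registered RAS nodes.
# Keys are node IDs; each node carries its counters as
# error-id -> (error-name, error-value).
NODES = {
    0: {
        "device-name": "0000:03:00.0",
        "node-name": "correctable-errors",
        "node-type": "error-counter",
        "counters": {1: ("error_name1", 0), 2: ("error_name2", 0)},
    },
    1: {
        "device-name": "0000:03:00.0",
        "node-name": "uncorrectable-errors",
        "node-type": "error-counter",
        "counters": {},
    },
}


def list_nodes():
    """Model of list-nodes: one reply entry per registered node."""
    return [
        {
            "node-id": node_id,
            "device-name": node["device-name"],
            "node-name": node["node-name"],
            "node-type": node["node-type"],
        }
        for node_id, node in sorted(NODES.items())
    ]


def get_error_counter(node_id, error_id=None):
    """Model of get-error-counter.

    With only node_id, return all counters of the node (the dump form).
    With node_id and error_id, return that single counter (the do form).
    """
    counters = NODES[node_id]["counters"]

    def reply(eid):
        name, value = counters[eid]
        return {"error-id": eid, "error-name": name, "error-value": value}

    if error_id is None:
        return [reply(eid) for eid in sorted(counters)]
    return reply(error_id)


if __name__ == "__main__":
    print(list_nodes())
    print(get_error_counter(0))
    print(get_error_counter(0, 1))
```

In this model an unknown ``node-id`` or ``error-id`` simply raises ``KeyError``; how the real interface reports such errors is not covered here.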

YAML-based Interface
--------------------

The interface is described in a YAML specification,
``Documentation/netlink/specs/drm_ras.yaml``.

This YAML is used to auto-generate user space bindings via
``tools/net/ynl/pyynl/ynl_gen_c.py``, and drives the structure of netlink
attributes and operations.

Usage Notes
-----------

* User space must first enumerate nodes to obtain their IDs.
* Node IDs or Node names can be used for all further queries, such as error counters.
* Error counters can be queried by either the Error ID or Error name.
* Query Parameters should be defined as part of the uAPI to ensure user interface stability.
* The interface supports future extension by adding new node types and
  additional attributes.

Example: List nodes using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump list-nodes
    [{'device-name': '0000:03:00.0',
      'node-id': 0,
      'node-name': 'correctable-errors',
      'node-type': 'error-counter'},
     {'device-name': '0000:03:00.0',
      'node-id': 1,
      'node-name': 'uncorrectable-errors',
      'node-type': 'error-counter'}]

Example: List all error counters using ynl

.. code-block:: bash

    sudo ynl --family drm_ras --dump get-error-counter --json '{"node-id":0}'
    [{'error-id': 1, 'error-name': 'error_name1', 'error-value': 0},
     {'error-id': 2, 'error-name': 'error_name2', 'error-value': 0}]

Example: Query an error counter for a given node

.. code-block:: bash

    sudo ynl --family drm_ras --do get-error-counter --json '{"node-id":0, "error-id":1}'
    {'error-id': 1, 'error-name': 'error_name1', 'error-value': 0}
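
Because node IDs are assigned dynamically, a script would typically resolve a node name to its ID before querying counters. The following is a minimal sketch of that step, assuming the ``list-nodes`` output is printed as a Python literal exactly as in the example above. The helper names ``index_nodes`` and ``counter_request`` are hypothetical and not part of ynl; the sketch only builds the JSON argument for ``--json`` and does not talk to the kernel.

```python
import ast
import json

# Sample text copied from the list-nodes example in this document.
LIST_NODES_OUTPUT = """
[{'device-name': '0000:03:00.0',
  'node-id': 0,
  'node-name': 'correctable-errors',
  'node-type': 'error-counter'},
 {'device-name': '0000:03:00.0',
  'node-id': 1,
  'node-name': 'uncorrectable-errors',
  'node-type': 'error-counter'}]
"""


def index_nodes(text):
    """Map (device-name, node-name) to node-id from list-nodes output."""
    nodes = ast.literal_eval(text.strip())
    return {(n["device-name"], n["node-name"]): n["node-id"] for n in nodes}


def counter_request(node_id, error_id=None):
    """Build the JSON request string for get-error-counter.

    Only node-id is set for the dump form; error-id is added for the
    single-counter form. The check is against None so that an error-id
    of 0 would still be included.
    """
    request = {"node-id": node_id}
    if error_id is not None:
        request["error-id"] = error_id
    return json.dumps(request)


node_ids = index_nodes(LIST_NODES_OUTPUT)
uncorrectable = node_ids[("0000:03:00.0", "uncorrectable-errors")]

if __name__ == "__main__":
    print(counter_request(uncorrectable))     # {"node-id": 1}
    print(counter_request(uncorrectable, 1))  # {"node-id": 1, "error-id": 1}
```

Keying the index on both device name and node name is a design choice for this sketch: the example output shows the same device exposing several nodes, and a system with more than one device could plausibly repeat node names.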
+1 −0
@@ -9,6 +9,7 @@ GPU Driver Developer's Guide
   drm-mm
   drm-kms
   drm-kms-helpers
   drm-ras
   drm-uapi
   drm-usage-stats
   driver-uapi
+115 −0
# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)
---
name: drm-ras
protocol: genetlink
uapi-header: drm/drm_ras.h

doc: >-
  DRM RAS (Reliability, Availability, Serviceability) over Generic Netlink.
  Provides a standardized mechanism for DRM drivers to register "nodes"
  representing hardware/software components capable of reporting error counters.
  Userspace tools can query the list of nodes or individual error counters
  via the Generic Netlink interface.

definitions:
  -
    type: enum
    name: node-type
    value-start: 1
    entries: [error-counter]
    doc: >-
         Type of the node. Currently, only error-counter nodes are
         supported, which expose reliability counters for a hardware/software
         component.

attribute-sets:
  -
    name: node-attrs
    attributes:
      -
        name: node-id
        type: u32
        doc: >-
             Unique identifier for the node.
             Assigned dynamically by the DRM RAS core upon registration.
      -
        name: device-name
        type: string
        doc: >-
             Device name chosen by the driver at registration.
             Can be a PCI BDF, UUID, or module name if unique.
      -
        name: node-name
        type: string
        doc: >-
             Node name chosen by the driver at registration.
             Can be an IP block name, or any name that identifies the
             RAS node inside the device.
      -
        name: node-type
        type: u32
        doc: Type of this node, identifying its function.
        enum: node-type
  -
    name: error-counter-attrs
    attributes:
      -
        name: node-id
        type: u32
        doc: Node ID targeted by this error counter operation.
      -
        name: error-id
        type: u32
        doc: Unique identifier for a specific error counter within a node.
      -
        name: error-name
        type: string
        doc: Name of the error.
      -
        name: error-value
        type: u32
        doc: Current value of the requested error counter.

operations:
  list:
    -
      name: list-nodes
      doc: >-
           Retrieve the full list of currently registered DRM RAS nodes.
           Each node includes its dynamically assigned ID, name, and type.
           **Important:** User space must call this operation first to obtain
           the node IDs. These IDs are required for all subsequent
           operations on nodes, such as querying error counters.
      attribute-set: node-attrs
      flags: [admin-perm]
      dump:
        reply:
          attributes:
            - node-id
            - device-name
            - node-name
            - node-type
    -
      name: get-error-counter
      doc: >-
           Retrieve error counters for a given node.
           The response includes the ID, the name, and the current
           value of each counter.
      attribute-set: error-counter-attrs
      flags: [admin-perm]
      do:
        request:
          attributes:
            - node-id
            - error-id
        reply:
          attributes: &errorinfo
            - error-id
            - error-name
            - error-value
      dump:
        request:
          attributes:
            - node-id
        reply:
          attributes: *errorinfo
+10 −0
@@ -130,6 +130,16 @@ config DRM_PANIC_SCREEN_QR_VERSION
	  Smaller QR code are easier to read, but will contain less debugging
	  data. Default is 40.

config DRM_RAS
	bool "DRM RAS support"
	depends on DRM
	depends on NET
	help
	  Enables the DRM RAS (Reliability, Availability and Serviceability)
	  support for DRM drivers. This provides a Generic Netlink interface
	  for error reporting and queries.
	  If in doubt, say "N".

config DRM_DEBUG_DP_MST_TOPOLOGY_REFS
        bool "Enable refcount backtrace history in the DP MST helpers"
	depends on STACKTRACE_SUPPORT
+1 −0
@@ -93,6 +93,7 @@ drm-$(CONFIG_DRM_ACCEL) += ../../accel/drm_accel.o
drm-$(CONFIG_DRM_PANIC) += drm_panic.o
drm-$(CONFIG_DRM_DRAW) += drm_draw.o
drm-$(CONFIG_DRM_PANIC_SCREEN_QR_CODE) += drm_panic_qr.o
drm-$(CONFIG_DRM_RAS) += drm_ras.o drm_ras_nl.o drm_ras_genl_family.o
obj-$(CONFIG_DRM)	+= drm.o

obj-$(CONFIG_DRM_PANEL_ORIENTATION_QUIRKS) += drm_panel_orientation_quirks.o