Commit f646c9f9 authored by Riana Tauro, committed by Rodrigo Vivi

drm/xe/doc: Document device wedged and runtime survivability



Add documentation for vendor specific device wedged recovery method
and runtime survivability.

v2: fix documentation (Raag)
v3: add userspace tool for firmware update (Raag)
v4: use consistent documentation (Raag)
v5: add more documentation

Signed-off-by: Riana Tauro <riana.tauro@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Raag Jadav <raag.jadav@intel.com>
Link: https://lore.kernel.org/r/20250826063419.3022216-8-riana.tauro@intel.com


Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
parent a2ca0633
+4 −2
@@ -13,9 +13,11 @@ Internal API
 .. kernel-doc:: drivers/gpu/drm/xe/xe_pcode.c
    :internal:

+.. _xe-survivability-mode:
+
 ==================
-Boot Survivability
+Survivability Mode
 ==================

 .. kernel-doc:: drivers/gpu/drm/xe/xe_survivability_mode.c
-   :doc: Xe Boot Survivability
+   :doc: Survivability Mode
+27 −0
@@ -1174,6 +1174,33 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * used. Certain critical errors like gt reset failure, firmware failures can cause
  * the device to be wedged. The default recovery method for a wedged state
  * is rebind/bus-reset.
+ *
+ * Another recovery method is vendor-specific. Below are the cases that send
+ * ``WEDGED=vendor-specific`` recovery method in drm device wedged uevent.
+ *
+ * Case: Firmware Flash
+ * --------------------
+ *
+ * Identification Hint
+ * +++++++++++++++++++
+ *
+ * ``WEDGED=vendor-specific`` drm device wedged uevent with
+ * :ref:`Runtime Survivability mode <xe-survivability-mode>` is used to notify
+ * admin/userspace consumer about the need for a firmware flash.
+ *
+ * Recovery Procedure
+ * ++++++++++++++++++
+ *
+ * Once ``WEDGED=vendor-specific`` drm device wedged uevent is received, follow
+ * the below steps
+ *
+ * - Check Runtime Survivability mode sysfs.
+ *   If enabled, firmware flash is required to recover the device.
+ *
+ *   /sys/bus/pci/devices/<device>/survivability_mode
+ *
+ * - Admin/userspace consumer can use firmware flashing tools like fwupd to flash
+ *   firmware and restore device to normal operation.
  */

 /**
+27 −8
@@ -22,15 +22,18 @@
 #define MAX_SCRATCH_MMIO 8

 /**
- * DOC: Xe Boot Survivability
+ * DOC: Survivability Mode
  *
- * Boot Survivability is a software based workflow for recovering a system in a failed boot state
+ * Survivability Mode is a software based workflow for recovering a system in a failed boot state
  * Here system recoverability is concerned with recovering the firmware responsible for boot.
  *
- * This is implemented by loading the driver with bare minimum (no drm card) to allow the firmware
- * to be flashed through mei and collect telemetry. The driver's probe flow is modified
- * such that it enters survivability mode when pcode initialization is incomplete and boot status
- * denotes a failure.
+ * Boot Survivability
+ * ===================
+ *
+ * Boot Survivability is implemented by loading the driver with bare minimum (no drm card) to allow
+ * the firmware to be flashed through mei driver and collect telemetry. The driver's probe flow is
+ * modified such that it enters survivability mode when pcode initialization is incomplete and boot
+ * status denotes a failure.
  *
  * Survivability mode can also be entered manually using the survivability mode attribute available
  * through configfs which is beneficial in several usecases. It can be used to address scenarios
@@ -46,7 +49,7 @@
  * Survivability mode is indicated by the below admin-only readable sysfs which provides additional
  * debug information::
  *
- *	/sys/bus/pci/devices/<device>/surivability_mode
+ *	/sys/bus/pci/devices/<device>/survivability_mode
  *
  * Capability Information:
  *	Provides boot status
@@ -56,6 +59,22 @@
  *	Provides history of previous failures
  * Auxiliary Information
  *	Certain failures may have information in addition to postcode information
+ *
+ * Runtime Survivability
+ * =====================
+ *
+ * Certain runtime firmware errors can cause the device to enter a wedged state
+ * (:ref:`xe-device-wedging`) requiring a firmware flash to restore normal operation.
+ * Runtime Survivability Mode indicates that a firmware flash is necessary to recover the device and
+ * is indicated by the presence of survivability mode sysfs::
+ *
+ *	/sys/bus/pci/devices/<device>/survivability_mode
+ *
+ * Survivability mode sysfs provides information about the type of survivability mode.
+ *
+ * When such errors occur, userspace is notified with the drm device wedged uevent and runtime
+ * survivability mode. User can then initiate a firmware flash using userspace tools like fwupd
+ * to restore device to normal operation.
  */

 static u32 aux_history_offset(u32 reg_value)
@@ -327,7 +346,7 @@ int xe_survivability_mode_runtime_enable(struct xe_device *xe)

 	xe_device_set_wedged_method(xe, DRM_WEDGE_RECOVERY_VENDOR);
 	xe_device_declare_wedged(xe);
-	dev_err(&pdev->dev, "Firmware update required, Refer the userspace documentation for more details!\n");
+	dev_err(&pdev->dev, "Firmware flash required, Refer the userspace documentation for more details!\n");

 	return 0;
 }
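
The recovery procedure documented in this patch (receive a ``WEDGED=vendor-specific`` uevent, then check for the ``survivability_mode`` sysfs attribute) could be approximated from userspace roughly as follows. This is only a sketch and not part of the patch: the function names are hypothetical, the uevent property is assumed to be a comma-separated list of recovery methods after ``WEDGED=``, and the caller is assumed to already know the PCI device directory under ``/sys/bus/pci/devices/``. Receiving the uevent itself and invoking a flashing tool such as fwupd are left out.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: return true if a uevent property of the assumed
 * form "WEDGED=<method>[,<method>...]" lists the given recovery method.
 */
bool wedged_has_method(const char *prop, const char *method)
{
	static const char key[] = "WEDGED=";
	size_t mlen = strlen(method);
	const char *p;

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return false;

	p = prop + sizeof(key) - 1;
	while (*p) {
		/* Length of the current comma-delimited entry. */
		size_t len = strcspn(p, ",");

		if (len == mlen && strncmp(p, method, len) == 0)
			return true;
		p += len;
		if (*p == ',')
			p++;
	}
	return false;
}

/*
 * Hypothetical helper: return true if <pci_dev_dir>/survivability_mode
 * exists. Only presence is checked; reading the attribute needs admin
 * privileges according to the documentation in this patch.
 */
bool survivability_mode_present(const char *pci_dev_dir)
{
	char path[4096];
	int n = snprintf(path, sizeof(path), "%s/survivability_mode", pci_dev_dir);

	if (n < 0 || (size_t)n >= sizeof(path))
		return false;
	return access(path, F_OK) == 0;
}

/* A firmware flash is suggested only when both hints agree. */
bool needs_firmware_flash(const char *uevent_prop, const char *pci_dev_dir)
{
	return wedged_has_method(uevent_prop, "vendor-specific") &&
	       survivability_mode_present(pci_dev_dir);
}
```

A consumer would call ``needs_firmware_flash()`` with the property string taken from the drm device wedged uevent and the sysfs directory of the affected device, and start the firmware flash only if it returns true.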