Commit ae8371a4 authored by Linus Torvalds

Merge tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add infrastructure support to EDAC in order to be able to register
   memory scrubbing RAS functionality with the kernel and expose sysfs
   nodes to control such scrubbing functionality.

   The main use case is CXL devices which provide different scrubbers
   for their built-in memories so that tools like rasdaemon can
   configure and control memory scrubbing and other, more advanced RAS
   functionality (Shiju Jose and Jonathan Cameron)

 - Add support to ie31200_edac for client SoCs like Raptor Lake-S which
   have multiple memory controllers and out-of-band ECC capability
   (Qiuxu Zhuo)

 - The usual round of cleanups, simplifications and fixlets

* tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits)
  MAINTAINERS: Add a secondary maintainer for bluefield_edac
  EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
  EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
  EDAC/ie31200: Break up ie31200_probe1()
  EDAC/ie31200: Fold the two channel loops into one loop
  EDAC/ie31200: Make struct dimm_data contain decoded information
  EDAC/ie31200: Make the memory controller resources configurable
  EDAC/ie31200: Simplify the pci_device_id table
  EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
  EDAC/ie31200: Fix the error path order of ie31200_init()
  EDAC/ie31200: Fix the DIMM size mask for several SoCs
  EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
  EDAC/device: Fix dev_set_name() format string
  EDAC/pnd2: Make read-only const array intlv static
  EDAC/igen6: Constify struct res_config
  EDAC/amd64: Simplify return statement in dct_ecc_enabled()
  EDAC: Update memory repair control interface for memory sparing feature
  EDAC: Add a memory repair control feature
  EDAC: Use string choice helper functions
  EDAC: Add a Error Check Scrub control feature
  ...
parents 2899aa39 298ffd53
+74 −0
What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		The sysfs EDAC bus devices /<dev-name>/ecs_fruX subdirectory
		pertains to the memory media ECS (Error Check Scrub) control
		feature, where <dev-name> directory corresponds to a device
		registered with the EDAC device driver for the ECS feature.
		/ecs_fruX belongs to the media FRUs (Field Replaceable Unit)
		under the memory device.

		The sysfs ECS attr nodes are only present if the parent
		driver has implemented the corresponding attr callback
		function and provided the necessary operations to the EDAC
		device driver during registration.

What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/log_entry_type
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The log entry type, which selects how the DDR5 ECS log
		is reported.

		- 0 - per DRAM.

		- 1 - per memory media FRU.

		- All other values are reserved.

What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/mode
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The mode in which the DDR5 ECS counts errors.
		The error count is tracked in one of two modes selected
		by the DDR5 ECS Control Feature: Codeword mode and Row
		Count mode. In Codeword mode, the error count increments
		each time a codeword with check bit errors is detected.
		In Row Count mode, the error count increments each time
		a row with check bit errors is detected.

		- 0 - ECS counts rows in the memory media that have ECC errors.

		- 1 - ECS counts codewords with errors, specifically, it counts
		      the number of ECC-detected errors in the memory media.

		- All other values are reserved.

What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/reset
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(WO) Reset the ECS ECC counter.

		- 1 - Reset the ECC counter to the default value.

		- All other values are reserved.

What:		/sys/bus/edac/devices/<dev-name>/ecs_fruX/threshold
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) DDR5 ECS threshold count per gigabit of memory cells.
		The ECS error count is subject to the ECS Threshold count
		per Gbit, which masks error counts less than the Threshold.

		Supported values are 256, 1024 and 4096.

		All other values are reserved.
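
The ECS attributes above can be set from userspace by writing to the sysfs nodes. The following is a minimal sketch, not an existing tool: the function name `configure_ecs`, the symbolic names in the lookup tables and the device name `cxl_mem0` in the usage note are assumptions made for illustration. It validates the requested values against the ones listed above and skips nodes the parent driver does not expose.

```python
from pathlib import Path

# Numeric values as listed in the attribute descriptions above.
# The symbolic keys are hypothetical names chosen for this sketch.
LOG_ENTRY_TYPES = {"per_dram": 0, "per_fru": 1}
COUNT_MODES = {"rows": 0, "codewords": 1}
THRESHOLDS = (256, 1024, 4096)


def configure_ecs(fru_dir, log_entry_type, mode, threshold):
    """Validate the requested ECS settings, then write them to sysfs.

    Returns the names of the attributes that were actually written.
    """
    if threshold not in THRESHOLDS:
        raise ValueError(f"unsupported ECS threshold: {threshold}")
    settings = {
        "log_entry_type": LOG_ENTRY_TYPES[log_entry_type],
        "mode": COUNT_MODES[mode],
        "threshold": threshold,
    }
    fru = Path(fru_dir)
    written = []
    for name, value in settings.items():
        node = fru / name
        # Attribute nodes exist only if the parent driver implements them.
        if not node.exists():
            continue
        node.write_text(f"{value}\n")
        written.append(name)
    return written
```

A call might look like `configure_ecs("/sys/bus/edac/devices/cxl_mem0/ecs_fru0", "per_dram", "codewords", 1024)`, where the path is a placeholder for a real device directory.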
+206 −0
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
		pertains to the control of memory media repair features, such
		as PPR (Post Package Repair) and memory sparing, where the
		<dev-name> directory corresponds to a device registered with
		the EDAC device driver for the memory repair features.

		Post Package Repair is a maintenance operation that requests the
		memory device to perform a repair operation on its media. It is a
		memory self-healing feature that fixes a failing memory location by
		replacing it with a spare row in a DRAM device. For example, a
		CXL memory device with DRAM components that support PPR features may
		implement PPR maintenance operations. DRAM components may support
		two types of PPR functions: hard PPR, for a permanent row repair, and
		soft PPR, for a temporary row repair. Soft PPR may be much faster
		than hard PPR, but the repair is lost with a power cycle.

		The sysfs attribute nodes for a repair feature are only
		present if the parent driver has implemented the corresponding
		attr callback function and provided the necessary operations
		to the EDAC device driver during registration.

		In some states of system configuration (e.g. before address
		decoders have been configured), memory devices (e.g. CXL)
		may not have an active mapping in the main host physical
		address map. As such, the memory to repair must be
		identified by a device specific physical addressing scheme
		using a device physical address (DPA). The DPA and other
		control attributes to use will be presented in related
		error records.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_type
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) Memory repair type, for example post package repair
		or memory sparing. Valid values are:

		- ppr - Post package repair.

		- cacheline-sparing

		- row-sparing

		- bank-sparing

		- rank-sparing

		- All other values are reserved.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Get/Set the current persist repair mode set for a
		repair function. Depending on the memory repair function,
		the persist repair modes supported by the device are either
		temporary, in which case the repair is lost with a power
		cycle, or permanent. Valid values are:

		- 0 - Soft memory repair (temporary repair).

		- 1 - Hard memory repair (permanent repair).

		- All other values are reserved.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) True if memory media is accessible and data is retained
		during the memory repair operation.
		Otherwise, the data may not be retained and memory requests
		may not be correctly processed during a repair operation. In
		that case the repair operation cannot be executed at runtime
		and the memory must be taken offline.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Host Physical Address (HPA) of the memory to repair.
		The HPA to use will be provided in related error records.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Device Physical Address (DPA) of the memory to repair.
		The specific DPA to use will be provided in related error
		records.

		In some states of system configuration (e.g. before address
		decoders have been configured), memory devices (e.g. CXL)
		may not have an active mapping in the main host physical
		address map. As such, the memory to repair must be
		identified by a device specific physical addressing scheme
		using a device physical address (DPA). The DPA to use will
		be presented in related error records.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Read/Write Nibble mask of the memory to repair.
		Nibble mask identifies one or more nibbles in error on the
		memory bus that produced the error event. Nibble Mask bit 0
		shall be set if nibble 0 on the memory bus produced the
		event, etc. For example, for CXL PPR and sparing, a nibble
		mask bit set to 1 indicates the request to perform the repair
		operation in the specific device. All nibble mask bits set
		to 1 indicates the request to perform the operation in all
		devices. For CXL memory repair, the specific value of the
		nibble mask to use will be provided in related error records.
		For more details, see the nibble mask field in CXL spec
		ver 3.1, section 8.2.9.7.1.2 Table 8-103 soft PPR, section
		8.2.9.7.1.3 Table 8-104 hard PPR and section 8.2.9.7.1.4
		Table 8-105 memory sparing.
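
The bit layout described above (bit n set when nibble n produced the event, all bits set to address all devices) can be illustrated with a short sketch. The helper names `nibble_mask` and `all_nibbles` are hypothetical, and the mask width is left as a parameter because this text does not state it; in practice the value to write comes from the related error record.

```python
def nibble_mask(nibbles):
    """Return a mask with bit n set for every nibble index n in `nibbles`."""
    mask = 0
    for n in nibbles:
        if n < 0:
            raise ValueError("nibble index must be non-negative")
        mask |= 1 << n
    return mask


def all_nibbles(width):
    """Return a mask with the low `width` bits set (request all devices).

    `width` is an assumption supplied by the caller, not a value taken
    from this document.
    """
    return (1 << width) - 1
```

For instance, `nibble_mask([0, 3])` sets bits 0 and 3, giving `0b1001`.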

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The supported range of memory addresses to be
		repaired. The memory device may report the supported range
		of attributes to use, which depends on the memory device
		and the portion of memory to repair.
		Userspace may receive the specific attribute values to use
		for a repair operation from the memory device via related
		error records and trace events, for example CXL DRAM and
		CXL general media error records in CXL memory devices.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The control attributes for the memory to be repaired.
		The specific value of attributes to use depends on the
		portion of memory to repair and will be reported to the host
		in related error records and be available to userspace
		in trace events, such as CXL DRAM and CXL general media
		error records of CXL memory devices.

		When reading back these attributes, the current value of
		the memory requested to be repaired is returned.

		bank_group - The bank group of the memory to repair.

		bank - The bank number of the memory to repair.

		rank - The rank of the memory to repair. Rank is defined as a
		set of memory devices on a channel that together execute a
		transaction.

		row - The row number of the memory to repair.

		column - The column number of the memory to repair.

		channel - The channel of the memory to repair. Channel is
		defined as an interface that can be independently accessed
		for a transaction.

		sub_channel - The subchannel of the memory to repair.

		The requirement to set these attributes varies based on the
		repair function. The attributes in sysfs are not present
		unless required for a repair function.

		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.2
		Table 8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard
		PPR operations, these attributes are not required to be set.
		For the CXL spec ver 3.1, Section 8.2.9.7.1.4 Table 8-105
		memory sparing, these attributes are required to be set based
		on the memory sparing granularity.

What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(WO) Issue the memory repair operation for the specified
		memory repair attributes. The operation may fail if resources
		are insufficient based on the requirements of the memory
		device and repair function.

		- 1 - Issue the repair operation.

		- All other values are reserved.
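
Taken together, the attributes above suggest a simple userspace sequence: check `repair_safe_when_in_use`, write the addressing attributes taken from the error record, then write 1 to `repair`. The sketch below is one possible way to do that, not the kernel's or rasdaemon's implementation. The function name `request_repair`, the `memory_offline` flag and the assumption that `repair_safe_when_in_use` reads back as `1` when true are all illustrative.

```python
from pathlib import Path


def request_repair(repair_dir, attrs, memory_offline=False):
    """Write the repair attributes, then trigger the repair.

    `attrs` maps attribute names (for example "dpa" or "nibble_mask") to
    the strings to write, normally copied from the related error record.
    """
    repair = Path(repair_dir)

    safe_node = repair / "repair_safe_when_in_use"
    if safe_node.exists():
        # Assumption: the node reads back as "1" when a repair is safe
        # while the memory is in use.
        safe = safe_node.read_text().strip() == "1"
        if not safe and not memory_offline:
            raise RuntimeError("repair is not safe while the memory is in use")

    for name in attrs:
        # Attributes not required by the repair function are absent.
        if not (repair / name).exists():
            raise FileNotFoundError(f"attribute not exposed: {name}")

    for name, value in attrs.items():
        (repair / name).write_text(f"{value}\n")

    # Writing 1 issues the repair operation; other values are reserved.
    (repair / "repair").write_text("1\n")
```

All attributes are checked before any is written, so a missing node does not leave a half-configured request behind. The write to `repair` may still fail on real hardware, for example when the device has insufficient resources.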
+69 −0
What:		/sys/bus/edac/devices/<dev-name>/scrubX
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		The sysfs EDAC bus devices /<dev-name>/scrubX subdirectory
		belongs to an instance of memory scrub control feature,
		where <dev-name> directory corresponds to a device/memory
		region registered with the EDAC device driver for the
		scrub control feature.

		The sysfs scrub attr nodes are only present if the parent
		driver has implemented the corresponding attr callback
		function and provided the necessary operations to the EDAC
		device driver during registration.

What:		/sys/bus/edac/devices/<dev-name>/scrubX/addr
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The base address of the memory region to be scrubbed
		for on-demand scrubbing. Setting the address starts
		scrubbing, so the size must be set first.

		The readback addr value is non-zero if the requested
		on-demand scrubbing is in progress, zero otherwise.

What:		/sys/bus/edac/devices/<dev-name>/scrubX/size
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The size of the memory region to be scrubbed
		(on-demand scrubbing).
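
Because writing `addr` starts the scrub, the order of the two writes matters: `size` first, then `addr`. A minimal sketch of that ordering follows. The function names are hypothetical, and writing the values as `0x`-prefixed hexadecimal strings is an assumption about the accepted input format.

```python
from pathlib import Path


def start_on_demand_scrub(scrub_dir, base, size):
    """Start an on-demand scrub of `size` bytes beginning at `base`."""
    scrub = Path(scrub_dir)
    # The size must be in place before the address write starts the scrub.
    (scrub / "size").write_text(f"{size:#x}\n")
    (scrub / "addr").write_text(f"{base:#x}\n")


def on_demand_scrub_in_progress(scrub_dir):
    """Return True while the readback of `addr` is non-zero."""
    text = (Path(scrub_dir) / "addr").read_text().strip()
    # Base 0 accepts both decimal and 0x-prefixed readback values.
    return int(text, 0) != 0
```

A caller could poll `on_demand_scrub_in_progress()` until it returns False before requesting another region.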

What:		/sys/bus/edac/devices/<dev-name>/scrubX/enable_background
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) Start/Stop background (patrol) scrubbing if supported.

What:		/sys/bus/edac/devices/<dev-name>/scrubX/min_cycle_duration
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) Supported minimum scrub cycle duration in seconds
		by the memory scrubber.

What:		/sys/bus/edac/devices/<dev-name>/scrubX/max_cycle_duration
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RO) Supported maximum scrub cycle duration in seconds
		by the memory scrubber.

What:		/sys/bus/edac/devices/<dev-name>/scrubX/current_cycle_duration
Date:		March 2025
KernelVersion:	6.15
Contact:	linux-edac@vger.kernel.org
Description:
		(RW) The current scrub cycle duration in seconds. It must be
		within the range supported by the memory scrubber.

		Scrubbing has an overhead while running, which can be
		reduced by using a longer scrub cycle.
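
The cycle duration attributes above can be combined as follows: read the supported range, clamp the requested duration into it, write the result and then start background scrubbing. This is a sketch under stated assumptions: the function name `set_background_scrub` is hypothetical, and writing `1` to `enable_background` to start scrubbing is an assumption, since the accepted values are not listed here.

```python
from pathlib import Path


def _read_int(node):
    """Read an integer sysfs attribute (decimal or 0x-prefixed)."""
    return int(node.read_text().strip(), 0)


def set_background_scrub(scrub_dir, cycle_seconds):
    """Clamp the cycle duration to the supported range and start scrubbing.

    Returns the duration that was actually written.
    """
    scrub = Path(scrub_dir)
    lowest = _read_int(scrub / "min_cycle_duration")
    highest = _read_int(scrub / "max_cycle_duration")
    clamped = max(lowest, min(highest, cycle_seconds))
    (scrub / "current_cycle_duration").write_text(f"{clamped}\n")
    # Assumption: "1" starts background (patrol) scrubbing.
    (scrub / "enable_background").write_text("1\n")
    return clamped
```

Choosing a value near `max_cycle_duration` spreads the scrub over a longer period, which is the trade-off described above for reducing scrub overhead.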
+103 −0
.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

=================
EDAC/RAS features
=================

Copyright (c) 2024-2025 HiSilicon Limited.

:Author:   Shiju Jose <shiju.jose@huawei.com>
:License:  The GNU Free Documentation License, Version 1.2 without
           Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
           (dual licensed under the GPL v2)

- Written for: 6.15

Introduction
------------

EDAC/RAS components plugging and high-level design:

1. Scrub control

2. Error Check Scrub (ECS) control

3. ACPI RAS2 features

4. Post Package Repair (PPR) control

5. Memory Sparing Repair control

High level design is illustrated in the following diagram::

        +-----------------------------------------------+
        |   Userspace - Rasdaemon                       |
        | +-------------+                               |
        | | RAS CXL mem |     +---------------+         |
        | |error handler|---->|               |         |
        | +-------------+     | RAS dynamic   |         |
        | +-------------+     | scrub, memory |         |
        | | RAS memory  |---->| repair control|         |
        | |error handler|     +----|----------+         |
        | +-------------+          |                    |
        +--------------------------|--------------------+
                                   |
                                   |
   +-------------------------------|------------------------------+
   |     Kernel EDAC extension for | controlling RAS Features     |
   |+------------------------------|----------------------------+ |
   || EDAC Core          Sysfs EDAC| Bus                        | |
   ||   +--------------------------|---------------------------+| |
   ||   |/sys/bus/edac/devices/<dev>/scrubX/ |   | EDAC device || |
   ||   |/sys/bus/edac/devices/<dev>/ecsX/   |<->| EDAC MC     || |
   ||   |/sys/bus/edac/devices/<dev>/repairX |   | EDAC sysfs  || |
   ||   +---------------------------|--------------------------+| |
   ||                           EDAC|Bus                        | |
   ||                               |                           | |
   ||   +----------+ Get feature    |      Get feature          | |
   ||   |          | desc +---------|------+ desc +----------+  | |
   ||   |EDAC scrub|<-----| EDAC device    |      |          |  | |
   ||   +----------+      | driver- RAS    |----->| EDAC mem |  | |
   ||   +----------+      | feature control|      | repair   |  | |
   ||   |          |<-----|                |      +----------+  | |
   ||   |EDAC ECS  |      +---------|------+                    | |
   ||   +----------+    Register RAS|features                   | |
   ||         ______________________|_____________              | |
   |+---------|---------------|------------------|--------------+ |
   |  +-------|----+  +-------|-------+     +----|----------+     |
   |  |            |  | CXL mem driver|     | Client driver |     |
   |  | ACPI RAS2  |  | scrub, ECS,   |     | memory repair |     |
   |  | driver     |  | sparing, PPR  |     | features      |     |
   |  +-----|------+  +-------|-------+     +------|--------+     |
   |        |                 |                    |              |
   +--------|-----------------|--------------------|--------------+
            |                 |                    |
   +--------|-----------------|--------------------|--------------+
   |    +---|-----------------|--------------------|-------+      |
   |    |                                                  |      |
   |    |            Platform HW and Firmware              |      |
   |    +--------------------------------------------------+      |
   +--------------------------------------------------------------+


1. EDAC Features components - Create feature-specific descriptors. For
   example: scrub, ECS, memory repair in the above diagram.

2. EDAC device driver for controlling RAS Features - Gets the feature's
   attribute descriptors from the EDAC RAS feature component, registers the
   device's RAS features with the EDAC bus and exposes the feature control
   attributes via sysfs. For example, /sys/bus/edac/devices/<dev-name>/<feature>X/

3. RAS dynamic feature controller - Userspace sample modules in rasdaemon for
   dynamic scrub/repair control, which issue scrubbing/repair when an excessive
   number of corrected memory errors is reported in a short span of time.
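
The trigger condition in item 3 (many corrected errors within a short time) can be modeled with a sliding time window. The sketch below is not rasdaemon's implementation; the class name `CorrectedErrorWindow` and its `threshold` and `window_seconds` parameters are hypothetical and only illustrate the idea.

```python
from collections import deque


class CorrectedErrorWindow:
    """Count corrected errors seen within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp):
        """Record one corrected error at `timestamp` (in seconds).

        Returns True when the number of errors inside the window reaches
        the threshold, meaning a scrub or repair should be requested.
        """
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        if len(self._events) >= self.threshold:
            # Start counting afresh after triggering an action.
            self._events.clear()
            return True
        return False
```

A controller would call `record()` for each corrected error event and, when it returns True, write to the scrub or repair sysfs attributes of the affected device.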

RAS features
------------
1. Memory Scrub

Memory scrub features are documented in `Documentation/edac/scrub.rst`.

2. Memory Repair

Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
+12 −0
.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later

==============
EDAC Subsystem
==============

.. toctree::
   :maxdepth: 1

   features
   memory_repair
   scrub