Commit b0402403 authored by Linus Torvalds's avatar Linus Torvalds
Browse files

Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC updates from Borislav Petkov:

 - Add a FRU (Field Replaceable Unit) memory poison manager which
   collects and manages previously encountered hw errors in order to
   save them to persistent storage across reboots. Previously recorded
   errors are "replayed" upon reboot in order to poison memory which has
   caused said errors in the past.

   The main use case is stacked, on-chip memory which cannot simply be
   replaced so poisoning faulty areas of it and thus making them
   inaccessible is the only strategy to prolong its lifetime.

 - Add an AMD address translation library glue which converts the
   reported addresses of hw errors into system physical addresses in
   order to be used by other subsystems like memory failure, for
   example. Add support for MI300 accelerators to that library.

 - igen6: Add support for Alder Lake-N SoC

 - i10nm: Add Grand Ridge support

 - The usual fixlets and cleanups

* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/versal: Convert to platform remove callback returning void
  RAS/AMD/FMPM: Fix off by one when unwinding on error
  RAS/AMD/FMPM: Add debugfs interface to print record entries
  RAS/AMD/FMPM: Save SPA values
  RAS: Export helper to get ras_debugfs_dir
  RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
  RAS: Introduce a FRU memory poison manager
  RAS/AMD/ATL: Add MI300 row retirement support
  Documentation: Move RAS section to admin-guide
  EDAC/versal: Make the bit position of injected errors configurable
  EDAC/i10nm: Add Intel Grand Ridge micro-server support
  EDAC/igen6: Add one more Intel Alder Lake-N SoC support
  RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
  RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
  RAS/AMD/ATL: Add MI300 support
  Documentation: RAS: Add index and address translation section
  EDAC/amd64: Use new AMD Address Translation Library
  RAS: Introduce AMD Address Translation Library
  EDAC/synopsys: Convert to devm_platform_ioremap_resource()
parents 1f75619a af65545a
Loading
Loading
Loading
Loading
+24 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

Address translation
===================

x86 AMD
-------

Zen-based AMD systems include a Data Fabric that manages the layout of
physical memory. Devices attached to the Fabric, like memory controllers,
I/O, etc., may not have a complete view of the system physical memory map.
These devices may provide a "normalized", i.e. device physical, address
when reporting memory errors. Normalized addresses must be translated to
a system physical address for the kernel to action on the memory.

AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
this case.

Glossary of acronyms used in address translation for Zen-based systems

* CCM               = Cache Coherent Moderator
* COD               = Cluster-on-Die
* COH_ST            = Coherent Station
* DF                = Data Fabric
+3 −8
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

Reliability, Availability and Serviceability features
=====================================================

This documents different aspects of the RAS functionality present in the
kernel.

Error decoding
---------------
==============

* x86
x86
---

Error decoding on AMD systems should be done using the rasdaemon tool:
https://github.com/mchehab/rasdaemon/
+7 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0
.. toctree::
   :maxdepth: 2

   main
   error-decoding
   address-translation
+7 −3
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================
==================================================
Reliability, Availability and Serviceability (RAS)
==================================================

This documents different aspects of the RAS functionality present in the
kernel.

RAS concepts
************
+1 −1
Original line number Diff line number Diff line
@@ -122,7 +122,7 @@ configure specific aspects of kernel behavior to your liking.
   pmf
   pnp
   rapidio
   ras
   RAS/index
   rtc
   serial-console
   svga
Loading