Loading
x86/mce/amd: Filter bogus hardware errors on Zen3 clients
Users have been observing multiple L3 cache deferred errors after recent kernel rework of deferred error handling.¹ ⁴ The errors are bogus due to inconsistent status values. Also, user verified that bogus MCA_DESTAT values are present on the system even with an older kernel.² The errors seem to be garbage values present in the MCA_DESTAT of some L3 cache banks. These were implicitly ignored before the recent kernel rework because these do not generate a deferred error interrupt. A later revision of the rework patch was merged for v6.19. This naturally filtered out most of the bogus error logs. However, a few signatures still remain.³ Minimize the scope of the filter to the reported CPU family/model/stepping and only for errors which don't have the Enabled bit in the MCi status MSR. ¹ https://lore.kernel.org/20250915010010.3547-1-spasswolf@web.de ² https://lore.kernel.org/6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@web.de ³ https://lore.kernel.org/21ba47fa8893b33b94370c2a42e5084cf0d2e975.camel@web.de ⁴ https://lore.kernel.org/r/CAKFB093B2k3sKsGJ_QNX1jVQsaXVFyy=wNwpzCGLOXa_vSDwXw@mail.gmail.com [ bp: Generalize the condition according to which errors are bogus. ] Fixes: 7cb735d7 ("x86/mce: Unify AMD DFR handler with MCA Polling") Closes: https://lore.kernel.org/20250915010010.3547-1-spasswolf@web.de Reported-by:Bert Karwatzki <spasswolf@web.de> Signed-off-by:
Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by:
Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by:
Mario Limonciello <mario.limonciello@amd.com> Tested-By:
Bert Karwatzki <spasswolf@web.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250915010010.3547-1-spasswolf@web.de