drm/amdgpu: harvest edc status when connected to host via xGMI

When connected to a host via xGMI, system fatal errors may trigger
warm reset, driver has no change to query edc status before reset.
Therefore in this case, driver should harvest previous error loging
registers during boot, instead of only resetting them.

v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called

2. change to enable watchdog timer after finishing gfx edc init

Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reivewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
This commit is contained in:
Dennis Li
2021-02-04 13:32:05 +08:00
committed by Alex Deucher
parent 63dbb0db3a
commit 761d86d37f
7 changed files with 107 additions and 28 deletions

View File

@@ -588,9 +588,12 @@ int amdgpu_ras_sysfs_remove(struct amdgpu_device *adev,
void amdgpu_ras_debugfs_create_all(struct amdgpu_device *adev);
int amdgpu_ras_error_query(struct amdgpu_device *adev,
int amdgpu_ras_query_error_status(struct amdgpu_device *adev,
struct ras_query_if *info);
int amdgpu_ras_reset_error_status(struct amdgpu_device *adev,
enum amdgpu_ras_block block);
int amdgpu_ras_error_inject(struct amdgpu_device *adev,
struct ras_inject_if *info);