Commit 246d8b6c authored by Tomer Tayar, committed by Oded Gabbay

accel/habanalabs: abort device reset for consecutive heartbeat failures



The mechanism of aborting device reset for consecutive fatal errors
currently covers only fatal errors that are reported by FW.
A non-responsive FW with consecutive heartbeat failures is also
considered fatal, so add heartbeat failures to this mechanism to avoid
recurring device resets in such a case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
parent d0df8a35
+8 −6
@@ -1769,14 +1769,16 @@ int hl_device_reset(struct hl_device *hdev, u32 flags)
 		hdev->device_cpu_disabled = false;
 		hdev->reset_info.hard_reset_pending = false;

-		if (hdev->reset_info.reset_trigger_repeated &&
-				(hdev->reset_info.prev_reset_trigger ==
-						HL_DRV_RESET_FW_FATAL_ERR)) {
-			/* if there 2 back to back resets from FW,
-			 * ensure driver puts the driver in a unusable state
+		/*
+		 * Put the device in an unusable state if there are 2 back to back resets due to
+		 * fatal errors.
 		 */
+		if (hdev->reset_info.reset_trigger_repeated &&
+				(hdev->reset_info.prev_reset_trigger == HL_DRV_RESET_FW_FATAL_ERR ||
+						hdev->reset_info.prev_reset_trigger ==
+								HL_DRV_RESET_HEARTBEAT)) {
 			dev_crit(hdev->dev,
-				"%s Consecutive FW fatal errors received, stopping hard reset\n",
+				"%s Consecutive fatal errors, stopping hard reset\n",
 				dev_name(&(hdev)->pdev->dev));
 			rc = -EIO;
 			goto out_err;
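
The condition introduced by this hunk can be read in isolation as a small predicate. The following is a minimal user-space sketch, not the driver's code: the enum, the struct, and the function names (`reset_trigger`, `reset_info_sketch`, `is_fatal_trigger`, `should_abort_reset`) are hypothetical stand-ins for `HL_DRV_RESET_FW_FATAL_ERR`, `HL_DRV_RESET_HEARTBEAT`, and `hdev->reset_info`, and the enum values are illustrative only.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's reset trigger flags. */
enum reset_trigger {
	TRIGGER_NONE = 0,
	TRIGGER_FW_FATAL_ERR,	/* models HL_DRV_RESET_FW_FATAL_ERR */
	TRIGGER_HEARTBEAT,	/* models HL_DRV_RESET_HEARTBEAT */
	TRIGGER_OTHER,		/* any trigger that is not treated as fatal */
};

/* Hypothetical reduction of hdev->reset_info to the two fields the check reads. */
struct reset_info_sketch {
	bool reset_trigger_repeated;
	enum reset_trigger prev_reset_trigger;
};

/* After this patch, both FW fatal errors and heartbeat failures count as fatal. */
static bool is_fatal_trigger(enum reset_trigger trigger)
{
	return trigger == TRIGGER_FW_FATAL_ERR || trigger == TRIGGER_HEARTBEAT;
}

/*
 * Return -EIO when the reset should be aborted (the same fatal trigger
 * occurred twice in a row), 0 when the reset may proceed.
 */
static int should_abort_reset(const struct reset_info_sketch *info)
{
	if (info->reset_trigger_repeated && is_fatal_trigger(info->prev_reset_trigger))
		return -EIO;

	return 0;
}
```

Under these assumptions, a repeated heartbeat trigger now yields `-EIO` just as a repeated FW fatal error does, while a single occurrence of either, or a repeated non-fatal trigger, still lets the reset continue.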