Commit b2ef8087 authored by Christian König, committed by Christian König

drm/sched: add optional errno to drm_sched_start()



The current implementation of drm_sched_start uses a hardcoded
-ECANCELED to dispose of a job when the parent/hw fence is NULL.
This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult
to tell which recovery method was used: a queue reset or a full GPU
reset.

To improve this, we first try a soft recovery for timed-out jobs and
use the error code -ENODATA. If soft recovery fails, we proceed with
a queue reset, where the error code remains -ENODATA for the job.
Finally, for a full GPU reset, we use error codes -ECANCELED or
-ETIME. This patch adds an error code parameter to drm_sched_start,
allowing us to differentiate between queue reset and GPU reset
failures. This enables user mode and test applications to validate
the expected correctness of the requested operation. After a
successful queue reset, the only way to continue normal operation is
to call drm_sched_job_done with the specific error code -ENODATA.
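The disposal logic described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `model_fence`, `model_job`, `model_job_done` and `model_sched_start` are hypothetical stand-ins for the DRM scheduler types and functions, the `signaled` flag replaces the real fence callback machinery, and the fallback from an errno of 0 to -ECANCELED is inferred from the fact that all updated callers pass 0 while the old behaviour is expected to stay intact.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a hardware fence; not the real dma_fence. */
struct model_fence {
	int signaled;	/* non-zero once the hardware has finished the job */
	int error;	/* error already recorded on the fence, or 0 */
};

/* Hypothetical stand-in for a job on the scheduler's pending list. */
struct model_job {
	struct model_fence *parent;	/* NULL when no hw fence exists */
	int result;			/* error the job was completed with */
	int done;			/* non-zero once the job was disposed of */
};

/* Complete one job with the given error (models drm_sched_job_done). */
static void model_job_done(struct model_job *job, int result)
{
	job->result = result;
	job->done = 1;
}

/*
 * Model of restarting the scheduler with an optional error code.
 * Assumption: err == 0 keeps the previous behaviour, i.e. jobs without
 * a hw fence are still completed with -ECANCELED.
 */
static void model_sched_start(struct model_job *jobs, size_t count, int err)
{
	assert(jobs != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		struct model_job *job = &jobs[i];
		struct model_fence *fence = job->parent;

		if (!fence) {
			/* No hw fence: dispose of the job with the caller's errno. */
			model_job_done(job, err ? err : -ECANCELED);
			continue;
		}

		/* hw fence already signaled: prefer its own error, else errno. */
		if (fence->signaled)
			model_job_done(job, fence->error ? fence->error : err);
		/* Otherwise the job stays pending until the hw fence signals. */
	}
}
```

In this model a queue reset path would call `model_sched_start(jobs, n, -ENODATA)`, while a caller that passes 0 still sees -ECANCELED on jobs without a hw fence.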

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain
    and amdgpu_device_unlock_reset_domain to allow user mode to track
    the queue reset status and distinguish between queue reset and
    GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset
    and -ECANCELED or -ETIME for GPU reset, returned to
    amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function
    drm_sched_start_ex with an additional parameter to set
    dma_fence_set_error, allowing us to handle the specific error
    codes appropriately and dispose of bad jobs with the selected
    error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error,
    which more accurately describes the function's purpose.
    Additionally, it was recommended to add documentation details
    about the new method.
v5: Fixed the declaration of the new function
    drm_sched_start_with_recovery_error (Alex).
v6 (chk): rebase on upstream changes, cleanup the commit message,
          drop the new function again and update all callers,
          apply the errno also to scheduler fences with hw fences
v7 (chk): rebased
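
The error-code convention from the commit message (-ENODATA for soft recovery or queue reset, -ECANCELED or -ETIME for a full GPU reset) could be checked by a test application roughly as follows. This is a hedged sketch: `recovery_kind`, `classify_job_error` and `expect_recovery` are hypothetical names introduced for illustration and are not part of any kernel or libdrm API, and how the error reaches user mode is outside the scope of the example.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical classification of how a job was recovered. */
enum recovery_kind {
	RECOVERY_NONE,			/* job completed without error */
	RECOVERY_SOFT_OR_QUEUE_RESET,	/* -ENODATA */
	RECOVERY_GPU_RESET,		/* -ECANCELED or -ETIME */
	RECOVERY_UNKNOWN		/* any other error code */
};

/* Map a job's error code to the recovery method it indicates. */
static enum recovery_kind classify_job_error(int err)
{
	switch (err) {
	case 0:
		return RECOVERY_NONE;
	case -ENODATA:
		return RECOVERY_SOFT_OR_QUEUE_RESET;
	case -ECANCELED:
	case -ETIME:
		return RECOVERY_GPU_RESET;
	default:
		return RECOVERY_UNKNOWN;
	}
}

/* Test-style check: abort if the error does not match the requested recovery. */
static void expect_recovery(int err, enum recovery_kind expected)
{
	assert(classify_job_error(err) == expected);
}
```

A test that triggered a queue reset would then call `expect_recovery(err, RECOVERY_SOFT_OR_QUEUE_RESET)` on the error it observed for the affected job.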

Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240826122541.85663-1-christian.koenig@amd.com
parent 498ba746
+1 −1
@@ -300,7 +300,7 @@ static int suspend_resume_compute_scheduler(struct amdgpu_device *adev, bool sus
			if (r)
				goto out;
		} else {
-			drm_sched_start(&ring->sched);
+			drm_sched_start(&ring->sched, 0);
		}
	}

+2 −2
@@ -5907,7 +5907,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
			if (!amdgpu_ring_sched_ready(ring))
				continue;

-			drm_sched_start(&ring->sched);
+			drm_sched_start(&ring->sched, 0);
		}

		if (!drm_drv_uses_atomic_modeset(adev_to_drm(tmp_adev)) && !job_signaled)
@@ -6414,7 +6414,7 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
		if (!amdgpu_ring_sched_ready(ring))
			continue;

-		drm_sched_start(&ring->sched);
+		drm_sched_start(&ring->sched, 0);
	}

	amdgpu_device_unset_mp1_state(adev);
+1 −1
@@ -87,7 +87,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
			atomic_inc(&ring->adev->gpu_reset_counter);
			amdgpu_fence_driver_force_completion(ring);
			if (amdgpu_ring_sched_ready(ring))
-				drm_sched_start(&ring->sched);
+				drm_sched_start(&ring->sched, 0);
			goto exit;
		}
	}
+1 −1
@@ -72,7 +72,7 @@ static enum drm_gpu_sched_stat etnaviv_sched_timedout_job(struct drm_sched_job

	drm_sched_resubmit_jobs(&gpu->sched);

-	drm_sched_start(&gpu->sched);
+	drm_sched_start(&gpu->sched, 0);
	return DRM_GPU_SCHED_STAT_NOMINAL;

out_no_timeout:
+2 −2
@@ -782,7 +782,7 @@ static void pvr_queue_start(struct pvr_queue *queue)
		}
	}

-	drm_sched_start(&queue->scheduler);
+	drm_sched_start(&queue->scheduler, 0);
}

/**
@@ -842,7 +842,7 @@ pvr_queue_timedout_job(struct drm_sched_job *s_job)
	}
	mutex_unlock(&pvr_dev->queues.lock);

-	drm_sched_start(sched);
+	drm_sched_start(sched, 0);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}