drm/panthor: Make the timeout per-queue instead of per-job

The timeout logic provided by drm_sched leads to races when we try
to suspend it while the drm_sched workqueue queues more jobs. Let's
overhaul the timeout handling in panthor to have our own delayed work
that's resumed/suspended when a group is resumed/suspended. When an
actual timeout occurs, we call drm_sched_fault() to report it
through drm_sched, still. But otherwise, the drm_sched timeout is
disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
how we protect modifications on the timer.

One issue seems to be when we call drm_sched_suspend_timeout() from
both queue_run_job() and tick_work() which could lead to races due to
drm_sched_suspend_timeout() not having a lock. Another issue seems to
be in queue_run_job() if the group is not scheduled, we suspend the
timeout again which undoes what drm_sched_job_begin() did when calling
drm_sched_start_timeout(). So the timeout does not reset when a job
is finished.

v2:
- Fix syntax error

v3:
- Split the changes in two commits

v4:
- No changes

v5:
- No changes

v6:
- Fix a NULL deref in group_can_run(), and narrow the group variable
  scope to avoid such mistakes in the future
- Add a queue_timeout_is_suspended() helper to clarify things

v7:
- No changes

v8:
- Don't touch drm_gpu_scheduler::timeout in queue_timedout_job()

Fixes: de854881 ("drm/panthor: Add the scheduler logical block")
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Reviewed-by: Adrián Larumbe <adrian.larumbe@collabora.com>
Signed-off-by: Ashley Smith <ashley.smith@collabora.com>
Co-developed-by: Boris Brezillon <boris.brezillon@collabora.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Link: https://patch.msgid.link/20251113105734.1520338-2-boris.brezillon@collabora.com
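
For readers unfamiliar with the pattern, the per-queue timer described above can be modelled in plain user-space C. This is only an illustrative sketch, not the panthor code: the struct and every function name below are hypothetical, time is passed in as a millisecond counter instead of using a kernel delayed work, and preserving the remaining budget across a suspend is an assumption about the intended behaviour. Locking is omitted; the real driver would need to serialize these updates, which is the reason the message gives for taking control of the timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a per-queue timeout; not the driver's types. */
struct queue_timeout {
	uint64_t period_ms;    /* full budget granted to a job */
	uint64_t remaining_ms; /* budget left while the timer is suspended */
	uint64_t deadline_ms;  /* absolute expiry while the timer is running */
	bool suspended;        /* true while the owning group is not scheduled */
};

/* A new queue starts suspended with its full budget available. */
static void queue_timeout_init(struct queue_timeout *t, uint64_t period_ms)
{
	assert(period_ms > 0);
	t->period_ms = period_ms;
	t->remaining_ms = period_ms;
	t->deadline_ms = 0;
	t->suspended = true;
}

/* Group resumed: re-arm the timer with whatever budget was left. */
static void queue_timeout_resume(struct queue_timeout *t, uint64_t now_ms)
{
	if (!t->suspended)
		return;
	t->deadline_ms = now_ms + t->remaining_ms;
	t->suspended = false;
}

/* Group suspended: stop the clock and remember the unused budget. */
static void queue_timeout_suspend(struct queue_timeout *t, uint64_t now_ms)
{
	if (t->suspended)
		return;
	t->remaining_ms = t->deadline_ms > now_ms ? t->deadline_ms - now_ms : 0;
	t->suspended = true;
}

/* Job finished: the next job gets a fresh, full budget. */
static void queue_timeout_reset(struct queue_timeout *t, uint64_t now_ms)
{
	t->remaining_ms = t->period_ms;
	if (!t->suspended)
		t->deadline_ms = now_ms + t->period_ms;
}

/*
 * A suspended timer never expires. On expiry, the commit message says
 * the driver reports the fault through drm_sched_fault().
 */
static bool queue_timeout_expired(const struct queue_timeout *t, uint64_t now_ms)
{
	return !t->suspended && now_ms >= t->deadline_ms;
}
```

In this model, the time a group spends descheduled does not count against its queue, and completing a job restores the full budget, which is the behaviour the message says was lost when queue_run_job() suspended the drm_sched timeout again.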