Commit ade5add0 authored by Dave Airlie

Merge tag 'amd-drm-next-6.13-2024-11-15' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.13-2024-11-15:

amdgpu:
- Partition fixes
- GFX 12 fixes
- SR-IOV fixes
- MES fixes
- RAS fixes
- GC queue handling fixes
- VCN fixes
- Add sysfs reset masks
- Better error messages for P2P failures
- SMU fixes
- Documentation updates
- GFX11 enforce isolation updates
- Display HPD fixes
- PSR fixes
- Panel replay fixes
- DP MST fixes
- USB4 fixes
- Misc display fixes and cleanups
- VRAM handling fix for APUs
- NBIO fix

amdkfd:
- INIT_WORK fix
- Refcount fix
- KFD MES scheduling fixes

drm/fourcc:
- Add missing tiling mode

Signed-off-by: Dave Airlie <airlied@redhat.com>

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115165012.573465-1-alexander.deucher@amd.com
parents 56b70bf9 447a54a0
+1 −0
@@ -16,4 +16,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
   thermal
   driver-misc
   debugging
   process-isolation
   amdgpu-glossary
+59 −0
.. SPDX-License-Identifier: GPL-2.0

=========================
 AMDGPU Process Isolation
=========================

The AMDGPU driver includes a feature that enables automatic process isolation on the graphics engine. This feature serializes access to the graphics engine and adds a cleaner shader which clears the Local Data Store (LDS) and General Purpose Registers (GPRs) between jobs. All processes using the GPU, including both graphics and compute workloads, are serialized when this feature is enabled. On GPUs that support partitionable graphics engines, this feature can be enabled on a per-partition basis.

In addition, there is an interface to manually run the cleaner shader when the use of the GPU is complete. This may be preferable in some use cases, such as a single-user system where the login manager triggers the cleaner shader when the user logs out.

Process Isolation
=================

The `run_cleaner_shader` and `enforce_isolation` sysfs interfaces allow users to manually execute the cleaner shader and control the process isolation feature, respectively.

Partition Handling
------------------

The `enforce_isolation` file in sysfs can be used to enable process isolation and automatic shader cleanup between processes. On GPUs that support graphics engine partitioning, this can be enabled per partition. The partition and its current setting (0 disabled, 1 enabled) can be read from sysfs. On GPUs that do not support graphics engine partitioning, only a single partition will be present. Writing 1 to the partition position enables enforce isolation; writing 0 disables it.

Example of enabling enforce isolation on a GPU with multiple partitions:

.. code-block:: console

    $ echo 1 0 1 0 > /sys/class/drm/card0/device/enforce_isolation
    $ cat /sys/class/drm/card0/device/enforce_isolation
    1 0 1 0

The output indicates that enforce isolation is enabled on the zeroth and second partitions and disabled on the first and third partitions.

For devices with a single partition or those that do not support partitions, there will be only one element:

.. code-block:: console

    $ echo 1 > /sys/class/drm/card0/device/enforce_isolation
    $ cat /sys/class/drm/card0/device/enforce_isolation
    1

Cleaner Shader Execution
========================

The driver can trigger a cleaner shader to clean up the LDS and GPR state on the graphics engine. When process isolation is enabled, this happens automatically between processes. In addition, there is a sysfs file to manually trigger cleaner shader execution.

To manually trigger the execution of the cleaner shader, write `0` to the `run_cleaner_shader` sysfs file:

.. code-block:: console

    $ echo 0 > /sys/class/drm/card0/device/run_cleaner_shader

For multi-partition devices, you can specify the partition index when triggering the cleaner shader:

.. code-block:: console

    $ echo 0 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 0
    $ echo 1 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 1
    $ echo 2 > /sys/class/drm/card0/device/run_cleaner_shader # For partition 2
    # ... and so on for each partition

This command initiates the cleaner shader, which will run and complete before any new tasks are scheduled on the GPU.
+8 −0
@@ -299,6 +299,12 @@ extern int amdgpu_wbrf;
#define AMDGPU_RESET_VCE			(1 << 13)
#define AMDGPU_RESET_VCE1			(1 << 14)

/* reset mask */
#define AMDGPU_RESET_TYPE_FULL (1 << 0) /* full adapter reset, mode1/mode2/BACO/etc. */
#define AMDGPU_RESET_TYPE_SOFT_RESET (1 << 1) /* IP level soft reset */
#define AMDGPU_RESET_TYPE_PER_QUEUE (1 << 2) /* per queue */
#define AMDGPU_RESET_TYPE_PER_PIPE (1 << 3) /* per pipe */

/* max cursor sizes (in pixels) */
#define CIK_CURSOR_WIDTH 128
#define CIK_CURSOR_HEIGHT 128
@@ -1464,6 +1470,8 @@ struct dma_fence *amdgpu_device_get_gang(struct amdgpu_device *adev);
struct dma_fence *amdgpu_device_switch_gang(struct amdgpu_device *adev,
					    struct dma_fence *gang);
bool amdgpu_device_has_display_hardware(struct amdgpu_device *adev);
ssize_t amdgpu_get_soft_full_reset_mask(struct amdgpu_ring *ring);
ssize_t amdgpu_show_reset_mask(char *buf, uint32_t supported_reset);

/* atpx handler */
#if defined(CONFIG_VGA_SWITCHEROO)
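
The reset-mask defines added in this hunk are independent bit flags, so a set of supported reset types can be carried in a single `uint32_t`. The sketch below shows one way such a mask could be turned into readable text in userspace. The helper `format_reset_mask` and its label strings are hypothetical; the body of `amdgpu_show_reset_mask()` is not shown on this page, and this is not a reproduction of it.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit values copied from the reset-mask defines in this commit. */
#define AMDGPU_RESET_TYPE_FULL		(1 << 0)
#define AMDGPU_RESET_TYPE_SOFT_RESET	(1 << 1)
#define AMDGPU_RESET_TYPE_PER_QUEUE	(1 << 2)
#define AMDGPU_RESET_TYPE_PER_PIPE	(1 << 3)

/*
 * Hypothetical helper: write a space-separated list of the reset types set
 * in mask into buf. The labels are illustrative, not the driver's output.
 * Returns the number of characters written, excluding the terminator.
 */
static size_t format_reset_mask(char *buf, size_t len, uint32_t mask)
{
	static const struct {
		uint32_t bit;
		const char *name;
	} types[] = {
		{ AMDGPU_RESET_TYPE_FULL, "full" },
		{ AMDGPU_RESET_TYPE_SOFT_RESET, "soft" },
		{ AMDGPU_RESET_TYPE_PER_QUEUE, "queue" },
		{ AMDGPU_RESET_TYPE_PER_PIPE, "pipe" },
	};
	size_t used = 0;
	size_t i;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
		int n;

		if (!(mask & types[i].bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? " " : "", types[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			/* Out of space: drop the partial label and stop. */
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
	}

	return used;
}
```

Under these assumed labels, a mask of `AMDGPU_RESET_TYPE_FULL | AMDGPU_RESET_TYPE_PER_QUEUE` would format as `full queue`.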
+1 −1
@@ -158,7 +158,7 @@ static int aca_smu_get_valid_aca_banks(struct amdgpu_device *adev, enum aca_smu_
		return -EINVAL;
	}

-	if (start + count >= max_count)
+	if (start + count > max_count)
		return -EINVAL;

	count = min_t(int, count, max_count);
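
The effect of replacing `>=` with `>` in this hunk can be seen in a small standalone sketch. It assumes the request describes the half-open range `[start, start + count)` over `max_count` banks, which is an interpretation of the code rather than something stated in the commit. The function names are hypothetical.

```c
#include <stdbool.h>

/* Check before the change: also rejects a range ending exactly at max_count. */
static bool range_rejected_old(int start, int count, int max_count)
{
	return start + count >= max_count;
}

/* Check after the change: rejects only ranges that run past max_count. */
static bool range_rejected_new(int start, int count, int max_count)
{
	return start + count > max_count;
}
```

With `max_count` of 4, a request for all four banks (`start` 0, `count` 4) is rejected by the old comparison and accepted by the new one, while a request that overruns the end (`start` 1, `count` 4) is rejected by both.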
+11 −2
@@ -834,6 +834,9 @@ int amdgpu_amdkfd_unmap_hiq(struct amdgpu_device *adev, u32 doorbell_off,
	if (!kiq->pmf || !kiq->pmf->kiq_unmap_queues)
		return -EINVAL;

	if (!kiq_ring->sched.ready || adev->job_hang)
		return 0;

	ring_funcs = kzalloc(sizeof(*ring_funcs), GFP_KERNEL);
	if (!ring_funcs)
		return -ENOMEM;
@@ -858,7 +861,13 @@ int amdgpu_amdkfd_unmap_hiq(struct amdgpu_device *adev, u32 doorbell_off,

	kiq->pmf->kiq_unmap_queues(kiq_ring, ring, RESET_QUEUES, 0, 0);

-	if (kiq_ring->sched.ready && !adev->job_hang)
	/* Submit unmap queue packet */
	amdgpu_ring_commit(kiq_ring);
	/*
	 * Ring test will do a basic scratch register change check. Just run
	 * this to ensure that unmap queues that is submitted before got
	 * processed successfully before returning.
	 */
	r = amdgpu_ring_test_helper(kiq_ring);

	spin_unlock(&kiq->ring_lock);