Commit 34633158 authored by Dave Airlie

Merge tag 'amd-drm-next-6.10-2024-04-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next

amd-drm-next-6.10-2024-04-13:

amdgpu:
- HDCP fixes
- ODM fixes
- RAS fixes
- Devcoredump improvements
- Misc code cleanups
- Expose VCN activity via sysfs
- SMU 13.0.x updates
- Enable fast updates on DCN 3.1.4
- Add dclk and vclk reporting on additional devices
- Add ACA RAS infrastructure
- Implement TLB flush fence
- EEPROM handling fixes
- SMUIO 14.0.2 support
- SMU 14.0.1 Updates
- Sync page table freeing with TLB flushes
- DML2 refactor
- DC debug improvements
- SR-IOV fixes
- Suspend and Resume fixes
- DCN 3.5.x Updates
- Z8 fixes
- UMSCH fixes
- GPU reset fixes
- HDP fix for second GFX pipe on GC 10.x
- Enable secondary GFX pipe on GC 10.3
- Refactor and clean up BACO/BOCO/BAMACO handling
- VCN partitioning fix
- DC DWB fixes
- VSC SDP fixes
- DCN 3.1.6 fix
- GC 11.5 fixes
- Remove invalid TTM resource start check
- DCN 1.0 fixes

amdkfd:
- MQD handling cleanup
- Preemption handling fixes for XCDs
- TLB flush fix for GC 9.4.2
- Properly clean up workqueue during module unload
- Fix memory leak on process create failure
- Range check CP bad op exception targets to avoid reporting invalid exceptions to userspace

radeon:
- Misc code cleanups

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240413213708.3427038-1-alexander.deucher@amd.com


Signed-off-by: Dave Airlie <airlied@redhat.com>
parents 6e1f415e ab956ed9
+80 −0
===============
 GPU Debugging
===============

GPUVM Debugging
===============

To aid in debugging GPU virtual memory related problems, the driver supports a
number of optional module parameters:

`vm_fault_stop` - If non-0, halt the GPU memory controller on a GPU page fault.

`vm_update_mode` - If non-0, use the CPU to update GPU page tables rather than
the GPU.
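
As a minimal sketch of how to check these settings from userspace, the C
helpers below read an integer parameter from a file.  They assume the
parameters are exported under ``/sys/module/amdgpu/parameters/``, the usual
sysfs location for module parameters.  The helper names ``read_int_param`` and
``print_amdgpu_param`` are hypothetical and are not part of the driver.

```c
#include <stdio.h>

/*
 * Read an integer parameter from a sysfs-style file.
 * Returns 0 on success and stores the value in *out; returns -1 if the
 * file cannot be opened or does not start with an integer.
 */
int read_int_param(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Print one amdgpu parameter, for example name = "vm_fault_stop". */
void print_amdgpu_param(const char *name)
{
	char path[256];
	long val;

	/* Assumed location: the standard sysfs directory for module parameters. */
	snprintf(path, sizeof(path), "/sys/module/amdgpu/parameters/%s", name);
	if (read_int_param(path, &val) == 0)
		printf("%s = %ld\n", name, val);
	else
		printf("%s: not readable (is amdgpu loaded?)\n", name);
}
```

Calling ``print_amdgpu_param("vm_fault_stop")`` and
``print_amdgpu_param("vm_update_mode")`` from a small ``main`` reports the
values currently in effect, or a hint if the files are not present.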


Decoding a GPUVM Page Fault
===========================

If you see a GPU page fault in the kernel log, you can decode it to figure
out what is going wrong in your application.  A page fault in your kernel
log may look something like this:

::

 [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process glxinfo pid 2424 thread glxinfo:cs0 pid 2425)
   in page starting at address 0x0000800102800000 from IH client 0x1b (UTCL2)
 VM_L2_PROTECTION_FAULT_STATUS:0x00301030
 	Faulty UTCL2 client ID: TCP (0x8)
 	MORE_FAULTS: 0x0
 	WALKER_ERROR: 0x0
 	PERMISSION_FAULTS: 0x3
 	MAPPING_ERROR: 0x0
 	RW: 0x0

The first field identifies the memory hub, either gfxhub or mmhub.  gfxhub is
the memory hub used for graphics, compute, and, on some chips, sdma.  mmhub is
the memory hub used for multimedia and, on some chips, sdma.

Next you have the vmid and pasid.  If the vmid is 0, this fault was likely
caused by the kernel driver or firmware.  If the vmid is non-0, it is generally
a fault in a user application.  The pasid is used to link a vmid to a system
process id.  If the process is active when the fault happens, the process
information will be printed.
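
The vmid rule above can be applied mechanically.  The following is a sketch
only: it assumes the first line of the report has the same ``vmid:<n>`` and
``pasid:<n>`` layout as the example, and the names ``struct fault_ids``,
``parse_fault_ids`` and ``fault_origin`` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct fault_ids {
	unsigned int vmid;
	unsigned int pasid;
};

/*
 * Extract "vmid:<n>" and "pasid:<n>" from the first line of a page fault
 * report.  Returns 0 on success, -1 if either field is missing.
 */
int parse_fault_ids(const char *line, struct fault_ids *ids)
{
	const char *v = strstr(line, "vmid:");
	const char *p = strstr(line, "pasid:");

	if (!v || !p)
		return -1;
	if (sscanf(v, "vmid:%u", &ids->vmid) != 1)
		return -1;
	if (sscanf(p, "pasid:%u", &ids->pasid) != 1)
		return -1;
	return 0;
}

/* vmid 0 points at the kernel driver or firmware, anything else at userspace. */
const char *fault_origin(unsigned int vmid)
{
	return vmid == 0 ? "kernel driver or firmware (likely)"
			 : "user application (generally)";
}
```

For the example line, this yields vmid 3 and pasid 32777, so the fault is
attributed to a user application.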

The GPU virtual address that caused the fault comes next.

The client ID indicates the GPU block that caused the fault.
Some common client IDs:

- CB/DB: The color/depth backend of the graphics pipe
- CPF: Command Processor Frontend
- CPC: Command Processor Compute
- CPG: Command Processor Graphics
- TCP/SQC/SQG: Shaders
- SDMA: SDMA engines
- VCN: Video encode/decode engines
- JPEG: JPEG engines

PERMISSION_FAULTS describes which faults were encountered:

- bit 0: the PTE was not valid
- bit 1: the PTE read bit was not set
- bit 2: the PTE write bit was not set
- bit 3: the PTE execute bit was not set

Finally, RW indicates whether the access was a read (0) or a write (1).

In the example above, a shader (client ID = TCP) generated a read (RW = 0x0) to
an invalid page (PERMISSION_FAULTS = 0x3, that is, bits 0 and 1 set) at GPU
virtual address 0x0000800102800000.  The user can then inspect their shader code
and resource descriptor state to determine what caused the GPU page fault.
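
The PERMISSION_FAULTS and RW fields can be decoded with a few lines of C.
This is a sketch based only on the bit meanings listed above.  The function
names ``describe_permission_faults`` and ``access_type`` are hypothetical, and
the inputs are the already-extracted field values printed in the log, not the
raw status register.

```c
#include <stdio.h>

/* Bit meanings for PERMISSION_FAULTS, indexed by bit number. */
static const char *const perm_fault_names[4] = {
	"PTE not valid",
	"PTE read bit not set",
	"PTE write bit not set",
	"PTE execute bit not set",
};

/*
 * Write a comma-separated description of the set bits into buf.
 * Returns the number of set bits among bits 0-3.  If buf is too small,
 * the text is truncated but the returned count is still correct.
 */
int describe_permission_faults(unsigned int faults, char *buf, size_t len)
{
	size_t used = 0;
	int count = 0;

	if (len > 0)
		buf[0] = '\0';
	for (int bit = 0; bit < 4; bit++) {
		if (!(faults & (1u << bit)))
			continue;
		if (used < len) {
			int n = snprintf(buf + used, len - used, "%s%s",
					 count ? ", " : "",
					 perm_fault_names[bit]);
			if (n > 0)
				used += (size_t)n; /* may pass len when truncated */
		}
		count++;
	}
	return count;
}

/* RW field: 0 means read, 1 means write. */
const char *access_type(unsigned int rw)
{
	return rw ? "write" : "read";
}
```

With the values from the example (PERMISSION_FAULTS = 0x3, RW = 0x0), the
description is "PTE not valid, PTE read bit not set" and the access type is
"read".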

UMR
===

`umr <https://gitlab.freedesktop.org/tomstdenis/umr>`_ is a general purpose
GPU debugging and diagnostics tool.  Please see the umr
`documentation <https://umr.readthedocs.io/en/main/>`_ for more information
about its capabilities.
+1 −1
@@ -135,7 +135,7 @@ Enable underlay
---------------

AMD display has this feature called underlay (which you can read more about at
-'Documentation/GPU/amdgpu/display/mpo-overview.rst') which is intended to
+'Documentation/gpu/amdgpu/display/mpo-overview.rst') which is intended to
save power when playing a video. The basic idea is to put a video in the
underlay plane at the bottom and the desktop in the plane above it with a hole
in the video area. This feature is enabled in ChromeOS, and from our data
+1 −0
@@ -15,4 +15,5 @@ Next (GCN), Radeon DNA (RDNA), and Compute DNA (CDNA) architectures.
   ras
   thermal
   driver-misc
+   debugging
   amdgpu-glossary
+5 −3
@@ -70,7 +70,8 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
	amdgpu_cs.o amdgpu_bios.o amdgpu_benchmark.o \
	atombios_dp.o amdgpu_afmt.o amdgpu_trace_points.o \
	atombios_encoders.o amdgpu_sa.o atombios_i2c.o \
-	amdgpu_dma_buf.o amdgpu_vm.o amdgpu_vm_pt.o amdgpu_ib.o amdgpu_pll.o \
+	amdgpu_dma_buf.o amdgpu_vm.o amdgpu_vm_pt.o amdgpu_vm_tlb_fence.o \
+	amdgpu_ib.o amdgpu_pll.o \
	amdgpu_ucode.o amdgpu_bo_list.o amdgpu_ctx.o amdgpu_sync.o \
	amdgpu_gtt_mgr.o amdgpu_preempt_mgr.o amdgpu_vram_mgr.o amdgpu_virt.o \
	amdgpu_atomfirmware.o amdgpu_vf_error.o amdgpu_sched.o \
@@ -80,7 +81,7 @@ amdgpu-y += amdgpu_device.o amdgpu_doorbell_mgr.o amdgpu_kms.o \
	amdgpu_umc.o smu_v11_0_i2c.o amdgpu_fru_eeprom.o amdgpu_rap.o \
	amdgpu_fw_attestation.o amdgpu_securedisplay.o \
	amdgpu_eeprom.o amdgpu_mca.o amdgpu_psp_ta.o amdgpu_lsdma.o \
-	amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o
+	amdgpu_ring_mux.o amdgpu_xcp.o amdgpu_seq64.o amdgpu_aca.o amdgpu_dev_coredump.o

amdgpu-$(CONFIG_PROC_FS) += amdgpu_fdinfo.o

@@ -247,7 +248,8 @@ amdgpu-y += \
	smuio_v11_0_6.o \
	smuio_v13_0.o \
	smuio_v13_0_3.o \
-	smuio_v13_0_6.o
+	smuio_v13_0_6.o \
+	smuio_v14_0_2.o

# add reset block
amdgpu-y += \
+3 −2
@@ -210,6 +210,7 @@ extern int amdgpu_async_gfx_ring;
extern int amdgpu_mcbp;
extern int amdgpu_discovery;
extern int amdgpu_mes;
+extern int amdgpu_mes_log_enable;
extern int amdgpu_mes_kiq;
extern int amdgpu_noretry;
extern int amdgpu_force_asic_type;
@@ -605,7 +606,7 @@ struct amdgpu_asic_funcs {
	/* PCIe replay counter */
	uint64_t (*get_pcie_replay_count)(struct amdgpu_device *adev);
	/* device supports BACO */
-	bool (*supports_baco)(struct amdgpu_device *adev);
+	int (*supports_baco)(struct amdgpu_device *adev);
	/* pre asic_init quirks */
	void (*pre_asic_init)(struct amdgpu_device *adev);
	/* enter/exit umd stable pstate */
@@ -1407,7 +1408,7 @@ bool amdgpu_device_supports_atpx(struct drm_device *dev);
bool amdgpu_device_supports_px(struct drm_device *dev);
bool amdgpu_device_supports_boco(struct drm_device *dev);
bool amdgpu_device_supports_smart_shift(struct drm_device *dev);
-bool amdgpu_device_supports_baco(struct drm_device *dev);
+int amdgpu_device_supports_baco(struct drm_device *dev);
bool amdgpu_device_is_peer_accessible(struct amdgpu_device *adev,
				      struct amdgpu_device *peer_adev);
int amdgpu_device_baco_enter(struct drm_device *dev);