Commit Graph

273 Commits

Author SHA1 Message Date
Jonathan Kim
80a780ab27 drm/amdkfd: decrement queue count on mes queue destroy
Queue count should decrement on queue destruction regardless of HWS
support type.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:58 -04:00
Ruili Ji
fb120e84b0 drm/amdkfd: To enable traps for GC_11_0_4 and up
Flag trap_en should be enabled for trap handler.

Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:57 -04:00
Jonathan Kim
09d49e14ea drm/amdkfd: fix and enable debugging for gfx11
There are a couple of fixes required to enable gfx11 debugging.

First, ADD_QUEUE.trap_en is an inappropriate place to toggle
a per-process register so move it to SET_SHADER_DEBUGGER.trap_en.
When ADD_QUEUE.skip_process_ctx_clear is set, MES will prioritize
the SET_SHADER_DEBUGGER.trap_en setting.

Second, to preserve correct save/restore priviledged wave states
in coordination with the trap enablement setting, resume suspended
waves early in the disable call.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:48:19 -04:00
Mukul Joshi
597364adc0 drm/amdkfd: Fix reserved SDMA queues handling
This patch fixes a regression caused by a bad merge where
the handling of reserved SDMA queues was accidentally removed.
With the fix, the reserved SDMA queues are again correctly
marked as unavailable for allocation.

Fixes: a805889a15 ("drm/amdkfd: Update SDMA queue management for GFX9.4.3")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:44:56 -04:00
Jonathan Kim
b17bd5dbf6 drm/amdkfd: add debug queue snapshot operation
Allow the debugger to get a snapshot of a specified number of queues
containing various queue property information that is copied to the
debugger.

Since the debugger doesn't know how many queues exist at any given time,
allow the debugger to pass the requested number of snapshots as 0 to get
the actual number of potential snapshots to use for a subsequent snapshot
request for actual information.

To prevent future ABI breakage, pass in the requested entry_size.
The KFD will return it's own entry_size in case the debugger still wants
log the information in a core dump on sizing failure.

Also allow the debugger to clear exceptions when doing a snapshot.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:36:57 -04:00
Jonathan Kim
a70a93fa56 drm/amdkfd: add debug suspend and resume process queues operation
In order to inspect waves from the saved context at any point during a
debug session, the debugger must be able to preempt queues to trigger
context save by suspending them.

On queue suspend, the KFD will copy the context save header information
so that the debugger can correctly crawl the appropriate size of the saved
context. The debugger must then also be allowed to resume suspended queues.

A queue that is newly created cannot be suspended because queue ids are
recycled after destruction so the debugger needs to know that this has
occurred.  Query functions will be later added that will clear a given
queue of its new queue status.

A queue cannot be destroyed while it is suspended to preserve its saved
context during debugger inspection.  Have queue destruction block while
a queue is suspended and unblocked when it is resumed.  Likewise, if a
queue is about to be destroyed, it cannot be suspended.

Return the number of queues successfully suspended or resumed along with
a per queue status array where the upper bits per queue status show that
the request was invalid (new/destroyed queue suspend request, missing
queue) or an error occurred (HWS in a fatal state so it can't suspend or
resume queues).

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:36:43 -04:00
Jonathan Kim
69a8c3ae2d drm/amdkfd: apply trap workaround for gfx11
Due to a HW bug, waves in only half the shader arrays can enter trap.

When starting a debug session, relocate all waves to the first shader
array of each shader engine and mask off the 2nd shader array as
unavailable.

When ending a debug session, re-enable the 2nd shader array per
shader engine.

User CU masking per queue cannot be guaranteed to remain functional
if requested during debugging (e.g. user cu mask requests only 2nd shader
array as an available resource leading to zero HW resources available)
nor can runtime be alerted of any of these changes during execution.

Make user CU masking and debugging mutual exclusive with respect to
availability.

If the debugger tries to attach to a process with a user cu masked
queue, return the runtime status as enabled but busy.

If the debugger tries to attach and fails to reallocate queue waves to
the first shader array of each shader engine, return the runtime status
as enabled but with an error.

In addition, like any other mutli-process debug supported devices,
disable trap temporary setup per-process to avoid performance impact from
setup overhead.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:35:52 -04:00
Jonathan Kim
0de4ec9a03 drm/amdgpu: prepare map process for multi-process debug devices
Unlike single process debug devices, multi-process debug devices allow
debug mode setting per-VMID (non-device-global).

Because the HWS manages PASID-VMID mapping, the new MAP_PROCESS API allows
the KFD to forward the required SPI debug register write requests.

To request a new debug mode setting change, the KFD must be able to
preempt all queues then remap all queues with these new setting
requests for MAP_PROCESS to take effect.

Note that by default, trap enablement in non-debug mode must be disabled
for performance reasons for multi-process debug devices due to setup
overhead in FW.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:35:39 -04:00
Jonathan Kim
97ae3c8cce drm/amdkfd: prepare map process for single process debug devices
Older HW only supports debugging on a single process because the
SPI debug mode setting registers are device global.

The HWS has supplied a single pinned VMID (0xf) for MAP_PROCESS
for debug purposes. To pin the VMID, the KFD will remove the VMID from
the HWS dynamic VMID allocation via SET_RESOUCES so that a debugged
process will never migrate away from its pinned VMID.

The KFD is responsible for reserving and releasing this pinned VMID
accordingly whenever the debugger attaches and detaches respectively.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:35:36 -04:00
Jonathan Kim
7cee6a6824 drm/amdgpu: add configurable grace period for unmap queues
The HWS schedule allows a grace period for wave completion prior to
preemption for better performance by avoiding CWSR on waves that can
potentially complete quickly. The debugger, on the other hand, will
want to inspect wave status immediately after it actively triggers
preemption (a suspend function to be provided).

To minimize latency between preemption and debugger wave inspection, allow
immediate preemption by setting the grace period to 0.

Note that setting the preepmtion grace period to 0 will result in an
infinite grace period being set due to a CP FW bug so set it to 1 for now.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:35:31 -04:00
Jonathan Kim
0ab2d7532b drm/amdkfd: prepare per-process debug enable and disable
The ROCm debugger will attach to a process to debug by PTRACE and will
expect the KFD to prepare a process for the target PID, whether the
target PID has opened the KFD device or not.

This patch is to explicity handle this requirement.  Further HW mode
setting and runtime coordination requirements will be handled in
following patches.

In the case where the target process has not opened the KFD device,
a new KFD process must be created for the target PID.
The debugger as well as the target process for this case will have not
acquired any VMs so handle process restoration to correctly account for
this.

To coordinate with HSA runtime, the debugger must be aware of the target
process' runtime enablement status and will copy the runtime status
information into the debugged KFD process for later query.

On enablement, the debugger will subscribe to a set of exceptions where
each exception events will notify the debugger through a pollable FIFO
file descriptor that the debugger provides to the KFD to manage.

Finally on process termination of either the debugger or the target,
debugging must be disabled if it has not been done so.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:34:48 -04:00
Lijo Lazar
b695c97b58 drm/amdkfd: Fix MEC pipe interrupt enablement
for_each_inst modifies xcc_mask and therefore the loop doesn't
initialize properly interrupts on all pipes. Keep looping through xcc as
the outer loop to fix this issue.

Fixes: c4050ff1a4 ("drm/amdkfd: Use xcc mask for identifying xcc")
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:31:35 -04:00
Lijo Lazar
c4050ff1a4 drm/amdkfd: Use xcc mask for identifying xcc
Instead of start xcc id and number of xcc per node, use the xcc mask
which is the mask of logical ids of xccs belonging to a parition.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:57:20 -04:00
Mukul Joshi
643e40d4c0 drm/amdkfd: Fix SDMA in CPX mode
When creating a user-mode SDMA queue, CP FW expects
driver to use/set virtual SDMA engine id in MAP_QUEUES
packet instead of using the physical SDMA engine id.
Each partition node's virtual SDMA number should start
from 0. However, when allocating doorbell for the queue,
KFD needs to allocate the doorbell from doorbell space
corresponding to the physical SDMA engine id, otherwise
the hwardware will not see the doorbell press.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:44:13 -04:00
Mukul Joshi
a805889a15 drm/amdkfd: Update SDMA queue management for GFX9.4.3
This patch updates SDMA queue management for multi XCC in GFX9.4.3.
- Allocate/deallocate SDMA queues from the correct SDMA engines
  based on the partition mode.
- Updates the kgd2kfd interface to fetch the correct SDMA register
  addresses.
- It also fixes dumping correct SDMA queue info in debugfs.

v2: squash in fix "drm/amdkfd: Fix XGMI SDMA user-mode queue allocation"

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:43:05 -04:00
Mukul Joshi
5e40601236 drm/amdkfd: Call DQM stop during DQM uninitialize
During DQM tear down, call DQM stop to unitialize HIQ and
associated memory allocated during packet manager init.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:58 -04:00
Mukul Joshi
7fe51e6fd2 drm/amdkfd: Update context save handling for multi XCC setup (v2)
Context save handling needs to be updated for a multi XCC
setup:
- On a multi XCC setup, KFD needs to report context save base
  address and size for each XCC in MQD.
- Thunk will allocate a large context save area covering all
  XCCs which will be equal to: num_of_xccs in a partition * size
  of context save area for 1 XCC. However, it will report only the
  size of context save area for 1 XCC only in the ioctl call.
- Driver then setups the MQD correctly using the size passed from
  Thunk and information about number of XCCs in a partition.
- Update get_wave_state function to return context save area
  for all XCCs in the partition.

v2: update the get_wave_state function for mqd manager v11 (Morris)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:50 -04:00
Mukul Joshi
e2069a7b08 drm/amdkfd: Add XCC instance to kgd2kfd interface (v3)
Gfx 9 starts to have multiple XCC instances in one device. Add instance
parameter to kgd2kfd functions where XCC instance was hard coded as 0.
Also, update code to pass the correct instance number when running
on a multi-XCC setup.

v2: introduce the XCC instance to gfx v11 (Morris)
v3: rebase (Alex)

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:43 -04:00
Mukul Joshi
2f77b9a242 drm/amdkfd: Update MQD management on multi XCC setup
Update MQD management for both HIQ and user-mode compute
queues on a multi XCC setup. MQDs needs to be allocated,
initialized, loaded and destroyed for each XCC in the KFD
node.

v2: squash in fix "drm/amdkfd: Fix SDMA+HIQ HQD allocation on GFX9.4.3"

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:36 -04:00
Mukul Joshi
74c5b85da7 drm/amdkfd: Add spatial partitioning support in KFD
This patch introduces multi-partition support in KFD.
This patch includes:
- Support for maximum 8 spatial partitions in KFD.
- Initialize one HIQ per partition.
- Management of VMID range depending on partition mode.
- Management of doorbell aperture space between all
  partitions.
- Each partition does its own queue management, interrupt
  handling, SMI event reporting.
- IOMMU, if enabled with multiple partitions, will only work
  on first partition.
- SPM is only supported on the first partition.
- Currently, there is no support for resetting individual
  partitions. All partitions will reset together.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:33 -04:00
Mukul Joshi
8dc1db3172 drm/amdkfd: Introduce kfd_node struct (v5)
Introduce a new structure, kfd_node, which will now represent
a compute node. kfd_node is carved out of kfd_dev structure.
kfd_dev struct now will become the parent of kfd_node, and will
store common resources such as doorbells, GTT sub-alloctor etc.
kfd_node struct will store all resources specific to a compute
node, such as device queue manager, interrupt handling etc.

This is the first step in adding compute partition support in KFD.

v2: introduce kfd_node struct to gc v11 (Hawking)
v3: make reference to kfd_dev struct through kfd_node (Morris)
v4: use kfd_node instead for kfd isr/mqd functions (Morris)
v5: rebase (Alex)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Tested-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Morris Zhang <Shiwu.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:42:27 -04:00
Ruili Ji
2e2b9baf00 drm/amdkfd: To fix sdma page fault issue for GC 11
For the MQD memory, KMD would always allocate 4K memory,
and mes scheduler would write to the end of MQD for unmap flag.

Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23 17:35:57 -05:00
Eric Huang
7f347e3f82 drm/amdkfd: Fix NULL pointer error for GC 11.0.1 on mGPU
The point bo->kfd_bo is NULL for queue's write pointer BO
when creating queue on mGPU. To avoid using the pointer
fixes the error.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-10 18:05:03 -05:00
Felix Kuehling
b292cafe2d drm/amdkfd: Fix UBSAN shift-out-of-bounds warning
This was fixed in initialize_cpsch before, but not in initialize_nocpsch.
Factor sdma bitmap initialization into a helper function to apply the
correct implementation in both cases without duplicating it.

v2: Added a range check

Reported-by: Ellis Michael <ellis@ellismichael.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-30 11:21:25 -04:00
Graham Sider
3e9cf23428 drm/amdgpu: pass queue size and is_aql_queue to MES
Update mes_v11_api_def.h add_queue API with is_aql_queue parameter. Also
re-use gds_size for the queue size (unused for KFD). MES requires the
queue size in order to compute the actual wptr offset within the queue
RB since it increases monotonically for AQL queues.

v2: Make is_aql_queue assign clearer

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:41:44 -04:00
xinhui pan
cc3cb791f1 drm/amdgpu: Fix one list corruption when create queue fails
Queue would be freed when create_queue_cpsch fails
So lets do queue cleanup otherwise various list and memory issues
happen.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-07 15:55:56 -04:00
Philip Yang
8c07f33ea0 Revert "drm/amdkfd: Free queue after unmap queue success"
This reverts commit ab8529b0cd.

This causes KFDTest KFDMemoryTest.MemoryRegister test failed on gfx9.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:59 -04:00
Jack Xiao
fe4e9ff987 drm/amdgpu: add mc wptr addr support for mes
MES requires mc wptr address for usermode queues.
Export bo gart address for mc wptr address.

Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-28 11:24:05 -04:00
Graham Sider
a957995618 drm/amdgpu: Update mes_v11_api_def.h
Update MES API to support oversubscription without aggregated doorbell
for usermode queues.

v2: Change oversubscription_no_aggregated_en to is_kfd_process (align
with MES)

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-23 17:22:31 -04:00
Graham Sider
e77a541f5d drm/amdkfd: Enable GFX11 usermode queue oversubscription
Starting with GFX11, MES requires wptr BOs to be GTT allocated/mapped to
GART for usermode queues in order to support oversubscription. In the
case that work is submitted to an unmapped queue, MES must have a GART
wptr address to determine whether the queue should be mapped.

This change is accompanied with changes in MES and is applicable for
MES_API_VERSION >= 2.

v3:
- Use amdgpu_vm_bo_lookup_mapping for wptr_bo mapping lookup
- Move wptr_bo refcount increment to amdgpu_amdkfd_map_gtt_bo_to_gart
- Remove list_del_init from amdgpu_amdkfd_map_gtt_bo_to_gart
- Cleanup/fix create_queue wptr_bo error handling
v4:
- Add MES version shift/mask defines to amdgpu_mes.h
- Change version check from MES_VERSION to MES_API_VERSION
- Add check in kfd_ioctl_create_queue before wptr bo pin/GART map to
ensure bo is a single page.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-23 17:22:12 -04:00
Philip Yang
ab8529b0cd drm/amdkfd: Free queue after unmap queue success
After queue unmap or remove from MES successfully, free queue sysfs
entries, doorbell and remove from queue list. Otherwise, application may
destroy queue again, cause below kernel warning or crash backtrace.

For outstanding queues, either application forget to destroy or failed
to destroy, kfd_process_notifier_release will remove queue sysfs
entries, kfd_process_wq_release will free queue doorbell.

v2: decrement_queue_count for MES queue

 refcount_t: underflow; use-after-free.
 WARNING: CPU: 7 PID: 3053 at lib/refcount.c:28
  Call Trace:
   kobject_put+0xd6/0x1a0
   kfd_procfs_del_queue+0x27/0x30 [amdgpu]
   pqm_destroy_queue+0xeb/0x240 [amdgpu]
   kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
   kfd_ioctl+0x27d/0x500 [amdgpu]
   do_syscall_64+0x35/0x80

 WARNING: CPU: 2 PID: 3053 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:400
  Call Trace:
   deallocate_doorbell.isra.0+0x39/0x40 [amdgpu]
   destroy_queue_cpsch+0xb3/0x270 [amdgpu]
   pqm_destroy_queue+0x108/0x240 [amdgpu]
   kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
   kfd_ioctl+0x27d/0x500 [amdgpu]

 general protection fault, probably for non-canonical address
0xdead000000000108:
 Call Trace:
  pqm_destroy_queue+0xf0/0x200 [amdgpu]
  kfd_ioctl_destroy_queue+0x2f/0x60 [amdgpu]
  kfd_ioctl+0x19b/0x600 [amdgpu]

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-22 16:55:03 -04:00
Philip Yang
f4f9b827d7 drm/amdkfd: Add queue to MES if it becomes active
We remove the user queue from MES scheduler to update queue properties.
If the queue becomes active after updating, add the user queue to MES
scheduler, to be able to handle command packet submission.

v2: don't break pqm_set_gws

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-22 16:54:45 -04:00
Graham Sider
04fd07397e drm/amdkfd: Fix static checker warning on MES queue type
convert_to_mes_queue_type return can be negative, but
queue_input.queue_type is uint32_t. Put return in integer var and cast
to unsigned after negative check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-16 10:02:58 -04:00
Mukul Joshi
cc009e613d drm/amdkfd: Add KFD support for soc21 v3
Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-04 10:43:54 -04:00
David Yat Sin
ab4d51d47f drm/amdkfd: Fix GWS queue count
dqm->gws_queue_count and pdd->qpd.mapped_gws_queue need to be updated
each time the queue gets evicted.

Fixes: b8020b0304 ("drm/amdkfd: Enable over-subscription with >1 GWS queue")
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-21 11:30:42 -04:00
Yifan Zhang
d55957fb29 drm/amdkfd: bail out early if no get_atc_vmid_pasid_mapping_info
it makes no sense to continue with an undefined vmid.

Fixes: c8b0507f40 ("drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before call")
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reported-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-09 17:27:50 -05:00
Yifan Zhang
c8b0507f40 drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before call
Fix the NULL point issue:

[ 3076.255609] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 3076.255624] #PF: supervisor instruction fetch in kernel mode
[ 3076.255637] #PF: error_code(0x0010) - not-present page
[ 3076.255649] PGD 0 P4D 0
[ 3076.255660] Oops: 0010 [#1] SMP NOPTI
[ 3076.255669] CPU: 20 PID: 2415 Comm: kfdtest Tainted: G        W  OE     5.11.0-41-generic #45~20.04.1-Ubuntu
[ 3076.255691] Hardware name: AMD Splinter/Splinter-RPL, BIOS VS2326337N.FD 02/07/2022
[ 3076.255706] RIP: 0010:0x0
[ 3076.255718] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 3076.255732] RSP: 0018:ffffb64283e3fc10 EFLAGS: 00010297
[ 3076.255744] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000027
[ 3076.255759] RDX: ffffb64283e3fc1e RSI: 0000000000000008 RDI: ffff8c7a87f60000
[ 3076.255776] RBP: ffffb64283e3fc78 R08: ffff8c7d88518ac0 R09: ffffb64283e3fa60
[ 3076.255791] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000000f
[ 3076.255805] R13: ffff8c7bdcea5800 R14: ffff8c7a9f3f3000 R15: ffff8c7a8696bc00
[ 3076.255820] FS:  0000000000000000(0000) GS:ffff8c7d88500000(0000) knlGS:0000000000000000
[ 3076.255839] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3076.255851] CR2: ffffffffffffffd6 CR3: 0000000109e3c000 CR4: 0000000000750ee0
[ 3076.255866] PKRU: 55555554
[ 3076.255873] Call Trace:
[ 3076.255884]  dbgdev_wave_reset_wavefronts+0x72/0x160 [amdgpu]
[ 3076.256025]  process_termination_cpsch.cold+0x26/0x2f [amdgpu]
[ 3076.256182]  ? ktime_get_mono_fast_ns+0x4e/0xa0
[ 3076.256196]  kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu]
[ 3076.256328]  kfd_process_notifier_release+0x187/0x2b0 [amdgpu]
[ 3076.256451]  ? mn_itree_inv_end+0xdc/0x110
[ 3076.256463]  __mmu_notifier_release+0x74/0x1f0
[ 3076.256474]  exit_mmap+0x170/0x200
[ 3076.256484]  ? __handle_mm_fault+0x677/0x920
[ 3076.256496]  ? _cond_resched+0x19/0x30
[ 3076.256507]  mmput+0x5d/0x130
[ 3076.256518]  do_exit+0x332/0xaf0
[ 3076.256526]  ? handle_mm_fault+0xd7/0x2b0
[ 3076.256537]  do_group_exit+0x43/0xa0
[ 3076.256548]  __x64_sys_exit_group+0x18/0x20
[ 3076.256559]  do_syscall_64+0x38/0x90
[ 3076.256569]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-04 13:03:30 -05:00
Jonathan Kim
d2cb0b21b8 drm/amdkfd: remove unneeded unmap single queue option
The KFD only unmaps all queues, all dynamics queues or all process queues
since RUN_LIST is mapped with all KFD queues.

There's no need to provide a single type unmap so remove this option.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
Rajneesh Bhardwaj
2243f4937a drm/amdkfd: Fix leftover errors and warnings
A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Rajneesh Bhardwaj
d87f36a063 drm/amdkfd: update SPDX license header
Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Mukul Joshi
5bdd3eb253 drm/amdkfd: Remove unused old debugger implementation
Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 16:57:51 -05:00
Tao Zhou
03e5b167bd drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid
As the function is used in more different cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-09 14:14:53 -05:00
David Yat Sin
3a9822d7bd drm/amdkfd: CRIU checkpoint and restore queue control stack
Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
42c6c48214 drm/amdkfd: CRIU checkpoint and restore queue mqds
Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
5bb6a8fa75 drm/amdkfd: CRIU restore queue doorbell id
When re-creating queues during CRIU restore, restore the queue with the
same doorbell id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
David Yat Sin
2485c12c98 drm/amdkfd: CRIU restore sdma id for queues
When re-creating queues during CRIU restore, restore the queue with the
same sdma id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:52 -05:00
Felix Kuehling
6f4cb84ae0 drm/amdkfd: Fix DQM asserts on Hawaii
start_nocpsch would never set dqm->sched_running on Hawaii due to an
early return statement. This would trigger asserts in other functions
and end up in inconsistent states.

Bug: https://github.com/RadeonOpenCompute/ROCm/issues/1624
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-11 15:44:28 -05:00
Tao Zhou
dec6344338 drm/amdkfd: add reset queue function for RAS poison (v2)
The new interface unmaps queues with reset mode for the process consumes
RAS poison, it's only for compute queue.

v2: rename the function to reset_queues.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:47 -05:00
Tao Zhou
f6b80c04aa drm/amdkfd: add reset parameter for unmap queues
So we can set reset mode for unmap operation, no functional change.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-28 16:02:39 -05:00
Graham Sider
f0dc99a6f7 drm/amdkfd: add kfd_device_info_init function
Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2021-12-01 16:15:37 -05:00