Commit 1746db26 authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Make pci_stop_dev() and pci_destroy_dev() safe so concurrent
     callers can't stop a device multiple times, even as we migrate from
     the global pci_rescan_remove_lock to finer-grained locking (Keith
     Busch)

   - Improve pci_walk_bus() implementation by making it recursive and
     moving locking up to avoid need for a 'locked' parameter (Keith
     Busch)

   - Unexport pci_walk_bus_locked(), which is only used internally by
     the PCI core (Keith Busch)

   - Detect some Thunderbolt chips that are built-in and hence
     'trustworthy' by a heuristic since the 'ExternalFacingPort' and
     'usb4-host-interface' ACPI properties are not quite enough (Esther
     Shimanovich)

  Resource management:

   - Use PCI bus addresses (not CPU addresses) in 'ranges' properties
     when building dynamic DT nodes so systems where PCI and CPU
     addresses differ work correctly (Andrea della Porta)

   - Tidy resource sizing and assignment with helpers to reduce
     redundancy (Ilpo Järvinen)

   - Improve pdev_sort_resources() 'bogus alignment' warning to be more
     specific (Ilpo Järvinen)

  Driver binding:

   - Convert driver .remove_new() callbacks to .remove() again to finish
     the conversion from returning 'int' to being 'void' (Sergio
     Paracuellos)

   - Export pcim_request_all_regions(), a managed interface to request
     all BARs (Philipp Stanner)

   - Replace pcim_iomap_regions_request_all() with
     pcim_request_all_regions(), and pcim_iomap_table()[n] with
     pcim_iomap(n), in the following drivers: ahci, crypto qat, crypto
     octeontx2, intel_th, iwlwifi, ntb idt, serial rp2, ALSA korg1212
     (Philipp Stanner)

   - Remove the now unused pcim_iomap_regions_request_all() (Philipp
     Stanner)

   - Export pcim_iounmap_region(), a managed interface to unmap and
     release a PCI BAR (Philipp Stanner)

   - Replace pcim_iomap_regions(mask) with pcim_iomap_region(n), and
     pcim_iounmap_regions(mask) with pcim_iounmap_region(n), in the
     following drivers: fpga dfl-pci, block mtip32xx, gpio-merrifield,
     cavium (Philipp Stanner)

  Error handling:

   - Add sysfs 'reset_subordinate' to reset the entire hierarchy below a
     bridge; previously Secondary Bus Reset could only be used when
     there was a single device below a bridge (Keith Busch)

   - Warn if we reset a running device where the driver didn't register
     pci_error_handlers notification callbacks (Keith Busch)

  ASPM:

   - Disable ASPM L1 before touching L1 PM Substates to follow the spec
     closer and avoid a CPU load timeout on some platforms (Ajay
     Agarwal)

   - Set devices below Intel VMD to D0 before enabling ASPM L1 Substates
     as required per spec for all L1 Substates changes (Jian-Hong Pan)

  Power management:

   - Enable starfive controller runtime PM before probing host bridge
     (Mayank Rana)

   - Enable runtime power management for host bridges (Krishna chaitanya
     chundru)

  Power control:

   - Use of_platform_device_create() instead of of_platform_populate()
     to create pwrctl platform devices so we can control it based on the
     child nodes (Manivannan Sadhasivam)

   - Create pwrctrl platform devices only if there's a relevant power
     supply property (Manivannan Sadhasivam)

   - Add device link from the pwrctl supplier to the PCI dev to ensure
     pwrctl drivers are probed before the PCI dev driver; this avoids a
     race where pwrctl could change device power state while the PCI
     driver was active (Manivannan Sadhasivam)

   - Find pwrctl device for removal with of_find_device_by_node()
     instead of searching all children of the parent (Manivannan
     Sadhasivam)

   - Rename 'pwrctl' to 'pwrctrl' to match new bandwidth controller
     ('bwctrl') and hotplug files (Bjorn Helgaas)

  Bandwidth control:

   - Add read/modify/write locking for Link Control 2, which is used to
     manage Link speed (Ilpo Järvinen)

   - Extract Link Bandwidth Management Status check into
     pcie_lbms_seen(), where it can be shared between the bandwidth
     controller and quirks that use it to help retrain failed links
     (Ilpo Järvinen)

   - Re-add Link Bandwidth notification support with updates to address
     the reasons it was previously reverted (Alexandru Gagniuc, Ilpo
     Järvinen)

   - Add pcie_set_target_speed() and related functionality so drivers
     can manage PCIe Link speed based on thermal or other constraints
     (Ilpo Järvinen)

   - Add a thermal cooling driver to throttle PCIe Links via the
     existing thermal management framework (Ilpo Järvinen)

   - Add a userspace selftest for the PCIe bandwidth controller (Ilpo
     Järvinen)

  PCI device hotplug:

   - Add hotplug controller driver for Marvell OCTEON multi-function
     device where function 0 has a management console interface to
     enable/disable and provision various personalities for the other
     functions (Shijith Thotton)

   - Retain a reference to the pci_bus for the lifetime of a pci_slot to
     avoid a use-after-free when the thunderbolt driver resets USB4 host
     routers on boot, causing hotplug remove/add of downstream docks or
     other devices (Lukas Wunner)

   - Remove unused cpcihp struct cpci_hp_controller_ops.hardware_test
     (Guilherme Giacomo Simoes)

   - Remove unused cpqphp struct ctrl_dbg.ctrl (Christophe JAILLET)

   - Use pci_bus_read_dev_vendor_id() instead of hand-coded presence
     detection in cpqphp (Ilpo Järvinen)

   - Simplify cpqphp enumeration, which is already simple-minded and
     doesn't handle devices below hot-added bridges (Ilpo Järvinen)

  Virtualization:

   - Add ACS quirk for Wangxun FF5xxx NICs, which don't advertise an ACS
     capability but do isolate functions as though PCI_ACS_RR and
     PCI_ACS_CR were set, so the functions can be in independent IOMMU
     groups (Mengyuan Lou)

  TLP Processing Hints (TPH):

   - Add and document TLP Processing Hints (TPH) support so drivers can
     enable and disable TPH and the kernel can save/restore TPH
     configuration (Wei Huang)

   - Add TPH Steering Tag support so drivers can retrieve Steering Tag
     values associated with specific CPUs via an ACPI _DSM to improve
     performance by directing DMA writes closer to their consumers (Wei
     Huang)

  Data Object Exchange (DOE):

   - Wait up to 1 second for DOE Busy bit to clear before writing a
     request to the mailbox to avoid failures if the mailbox is still
     busy from a previous transfer (Gregory Price)

  Endpoint framework:

   - Skip attempts to allocate from endpoint controller memory window if
     the requested size is larger than the window (Damien Le Moal)

   - Add and document pci_epc_mem_map() and pci_epc_mem_unmap() to
     handle controller-specific size and alignment constraints, and add
     test cases to the endpoint test driver (Damien Le Moal)

   - Implement dwc pci_epc_ops.align_addr() so pci_epc_mem_map() can
     observe DWC-specific alignment requirements (Damien Le Moal)

   - Synchronously cancel command handler work in endpoint test before
     cleaning up DMA and BARs (Damien Le Moal)

   - Respect endpoint page size in dw_pcie_ep_align_addr() (Niklas
     Cassel)

   - Use dw_pcie_ep_align_addr() in dw_pcie_ep_raise_msi_irq() and
     dw_pcie_ep_raise_msix_irq() instead of open coding the equivalent
     (Niklas Cassel)

   - Avoid NULL dereference if Modem Host Interface Endpoint lacks
     'mmio' DT property (Zhongqiu Han)

   - Release PCI domain ID of Endpoint controller parent (not controller
     itself) and before unregistering the controller, to avoid
     use-after-free (Zijun Hu)

   - Clear secondary (not primary) EPC in pci_epc_remove_epf() when
     removing the secondary controller associated with an NTB (Zijun Hu)

  Cadence PCIe controller driver:

   - Lower severity of 'phy-names' message (Bartosz Wawrzyniak)

  Freescale i.MX6 PCIe controller driver:

   - Fix suspend/resume support on i.MX6QDL, which has a hardware
     erratum that prevents use of L2 (Stefan Eichenberger)

  Intel VMD host bridge driver:

   - Add 0xb60b and 0xb06f Device IDs for client SKUs (Nirmal Patel)

  MediaTek PCIe Gen3 controller driver:

   - Update mediatek-gen3 DT binding to require the exact number of
     clocks for each SoC (Fei Shao)

   - Add support for DT 'max-link-speed' and 'num-lanes' properties to
     restrict the link speed and width (AngeloGioacchino Del Regno)

  Microchip PolarFlare PCIe controller driver:

   - Add DT and driver support for using either of the two PolarFire
     Root Ports (Conor Dooley)

  NVIDIA Tegra194 PCIe controller driver:

   - Move endpoint controller cleanups that depend on refclk from the
     host to the notifier that tells us the host has deasserted PERST#,
     when refclk should be valid (Manivannan Sadhasivam)

  Qualcomm PCIe controller driver:

   - Add qcom SAR2130P DT binding with an additional clock (Dmitry
     Baryshkov)

   - Enable MSI interrupts if 'global' IRQ is supported, since a
     previous commit unintentionally masked them (Manivannan Sadhasivam)

   - Move endpoint controller cleanups that depend on refclk from the
     host to the notifier that tells us the host has deasserted PERST#,
     when refclk should be valid (Manivannan Sadhasivam)

   - Add DT binding and driver support for IPQ9574, with Synopsys IP
     v5.80a and Qcom IP 1.27.0 (devi priya)

   - Move the OPP "operating-points-v2" table from the
     qcom,pcie-sm8450.yaml DT binding to qcom,pcie-common.yaml, where it
     can be used by other Qcom platforms (Qiang Yu)

   - Add 'global' SPI interrupt for events like link-up, link-down to
     qcom,pcie-x1e80100 DT binding so we can start enumeration when the
     link comes up (Qiang Yu)

   - Disable ASPM L0s for qcom,pcie-x1e80100 since the PHY is not tuned
     to support this (Qiang Yu)

   - Add ops_1_21_0 for SC8280X family SoC, which doesn't use the
     'iommu-map' DT property and doesn't need BDF-to-SID translation
     (Qiang Yu)

  Rockchip PCIe controller driver:

   - Define ROCKCHIP_PCIE_AT_SIZE_ALIGN to replace magic 256 endpoint
     .align value (Damien Le Moal)

   - When unmapping an endpoint window, compute the region index instead
     of searching for it, and verify that the address was mapped (Damien
     Le Moal)

   - When mapping an endpoint window, verify that the address hasn't
     been mapped already (Damien Le Moal)

   - Implement pci_epc_ops.align_addr() for rockchip-ep (Damien Le Moal)

   - Fix MSI IRQ data mapping to observe the alignment constraint, which
     fixes intermittent page faults in memcpy_toio() and memcpy_fromio()
     (Damien Le Moal)

   - Rename rockchip_pcie_parse_ep_dt() to
     rockchip_pcie_ep_get_resources() for consistency with similar DT
     interfaces (Damien Le Moal)

   - Skip the unnecessary link train in rockchip_pcie_ep_probe() and do
     it only in the endpoint start operation (Damien Le Moal)

   - Implement pci_epc_ops.stop_link() to disable link training and
     controller configuration (Damien Le Moal)

   - Attempt link training at 5 GT/s when both partners support it
     (Damien Le Moal)

   - Add a handler for PERST# signal so we can detect host-initiated
     resets and start link training after PERST# is deasserted (Damien
     Le Moal)

  Synopsys DesignWare PCIe controller driver:

   - Clear outbound address on unmap so dw_pcie_find_index() won't match
     an ATU index that was already unmapped (Damien Le Moal)

   - Use of_property_present() instead of of_property_read_bool() when
     testing for presence of non-boolean DT properties (Rob Herring)

   - Advertise 1MB size if endpoint supports Resizable BARs, which was
     inadvertently lost in v6.11 (Niklas Cassel)

  TI J721E PCIe driver:

   - Add PCIe support for J722S SoC (Siddharth Vadapalli)

   - Delay PCIE_T_PVPERL_MS (100 ms), not just PCIE_T_PERST_CLK_US (100
     us), before deasserting PERST# to ensure power and refclk are
     stable (Siddharth Vadapalli)

  TI Keystone PCIe controller driver:

   - Set the 'ti,keystone-pcie' mode so v3.65a devices work in Root
     Complex mode (Kishon Vijay Abraham I)

   - Try to avoid unrecoverable SError for attempts to issue config
     transactions when the link is down; this is racy but the best we
     can do (Kishon Vijay Abraham I)

  Miscellaneous:

   - Reorganize kerneldoc parameter names to match order in function
     signature (Julia Lawall)

   - Fix sysfs reset_method_store() memory leak (Todd Kjos)

   - Simplify pci_create_slot() (Ilpo Järvinen)

   - Fix incorrect printf format specifiers in pcitest (Luo Yifan)"

* tag 'pci-v6.13-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (127 commits)
  PCI: rockchip-ep: Handle PERST# signal in EP mode
  PCI: rockchip-ep: Improve link training
  PCI: rockship-ep: Implement the pci_epc_ops::stop_link() operation
  PCI: rockchip-ep: Refactor endpoint link training enable
  PCI: rockchip-ep: Refactor rockchip_pcie_ep_probe() MSI-X hiding
  PCI: rockchip-ep: Refactor rockchip_pcie_ep_probe() memory allocations
  PCI: rockchip-ep: Rename rockchip_pcie_parse_ep_dt()
  PCI: rockchip-ep: Fix MSI IRQ data mapping
  PCI: rockchip-ep: Implement the pci_epc_ops::align_addr() operation
  PCI: rockchip-ep: Improve rockchip_pcie_ep_map_addr()
  PCI: rockchip-ep: Improve rockchip_pcie_ep_unmap_addr()
  PCI: rockchip-ep: Use a macro to define EP controller .align feature
  PCI: rockchip-ep: Fix address translation unit programming
  PCI/pwrctrl: Rename pwrctrl functions and structures
  PCI/pwrctrl: Rename pwrctl files to pwrctrl
  PCI/pwrctl: Remove pwrctl device without iterating over all children of pwrctl parent
  PCI/pwrctl: Ensure that pwrctl drivers are probed before PCI client drivers
  PCI/pwrctl: Create pwrctl device only if at least one power supply is present
  PCI/pwrctl: Use of_platform_device_create() to create pwrctl devices
  tools: PCI: Fix incorrect printf format specifiers
  ...
parents 1dc707e6 10099266
Loading
Loading
Loading
Loading
+11 −0
Original line number Diff line number Diff line
@@ -163,6 +163,17 @@ Description:
		will be present in sysfs.  Writing 1 to this file
		will perform reset.

What:		/sys/bus/pci/devices/.../reset_subordinate
Date:		October 2024
Contact:	linux-pci@vger.kernel.org
Description:
		This is visible only for bridge devices. If you want to reset
		all devices attached through the subordinate bus of a specific
		bridge device, writing 1 to this will try to do it.  This will
		affect all devices attached to the system through this bridge
		similiar to writing 1 to their individual "reset" file, so use
		with caution.

What:		/sys/bus/pci/devices/.../vpd
Date:		February 2008
Contact:	Ben Hutchings <bwh@kernel.org>
+29 −0
Original line number Diff line number Diff line
@@ -117,6 +117,35 @@ by the PCI endpoint function driver.
   The PCI endpoint function driver should use pci_epc_mem_free_addr() to
   free the memory space allocated using pci_epc_mem_alloc_addr().

* pci_epc_map_addr()

  A PCI endpoint function driver should use pci_epc_map_addr() to map to a RC
  PCI address the CPU address of local memory obtained with
  pci_epc_mem_alloc_addr().

* pci_epc_unmap_addr()

  A PCI endpoint function driver should use pci_epc_unmap_addr() to unmap the
  CPU address of local memory mapped to a RC address with pci_epc_map_addr().

* pci_epc_mem_map()

  A PCI endpoint controller may impose constraints on the RC PCI addresses that
  can be mapped. The function pci_epc_mem_map() allows endpoint function
  drivers to allocate and map controller memory while handling such
  constraints. This function will determine the size of the memory that must be
  allocated with pci_epc_mem_alloc_addr() for successfully mapping a RC PCI
  address range. This function will also indicate the size of the PCI address
  range that was actually mapped, which can be less than the requested size, as
  well as the offset into the allocated memory to use for accessing the mapped
  RC PCI address range.

* pci_epc_mem_unmap()

  A PCI endpoint function driver can use pci_epc_mem_unmap() to unmap and free
  controller memory that was allocated and mapped using pci_epc_mem_map().


Other EPC APIs
~~~~~~~~~~~~~~

+1 −0
Original line number Diff line number Diff line
@@ -18,3 +18,4 @@ PCI Bus Subsystem
   pcieaer-howto
   endpoint/index
   boot-interrupts
   tph
+9 −5
Original line number Diff line number Diff line
@@ -217,8 +217,12 @@ capability structure except the PCI Express capability structure,
that is shared between many drivers including the service drivers.
RMW Capability accessors (pcie_capability_clear_and_set_word(),
pcie_capability_set_word(), and pcie_capability_clear_word()) protect
a selected set of PCI Express Capability Registers (Link Control
Register and Root Control Register). Any change to those registers
should be performed using RMW accessors to avoid problems due to
concurrent updates. For the up-to-date list of protected registers,
see pcie_capability_clear_and_set_word().
a selected set of PCI Express Capability Registers:

* Link Control Register
* Root Control Register
* Link Control 2 Register

Any change to those registers should be performed using RMW accessors to
avoid problems due to concurrent updates. For the up-to-date list of
protected registers, see pcie_capability_clear_and_set_word().
+132 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0


===========
TPH Support
===========

:Copyright: 2024 Advanced Micro Devices, Inc.
:Authors: - Eric van Tassell <eric.vantassell@amd.com>
          - Wei Huang <wei.huang2@amd.com>


Overview
========

TPH (TLP Processing Hints) is a PCIe feature that allows endpoint devices
to provide optimization hints for requests that target memory space.
These hints, in a format called Steering Tags (STs), are embedded in the
requester's TLP headers, enabling the system hardware, such as the Root
Complex, to better manage platform resources for these requests.

For example, on platforms with TPH-based direct data cache injection
support, an endpoint device can include appropriate STs in its DMA
traffic to specify which cache the data should be written to. This allows
the CPU core to have a higher probability of getting data from cache,
potentially improving performance and reducing latency in data
processing.


How to Use TPH
==============

TPH is presented as an optional extended capability in PCIe. The Linux
kernel handles TPH discovery during boot, but it is up to the device
driver to request TPH enablement if it is to be utilized. Once enabled,
the driver uses the provided API to obtain the Steering Tag for the
target memory and to program the ST into the device's ST table.

Enable TPH support in Linux
---------------------------

To support TPH, the kernel must be built with the CONFIG_PCIE_TPH option
enabled.

Manage TPH
----------

To enable TPH for a device, use the following function::

  int pcie_enable_tph(struct pci_dev *pdev, int mode);

This function enables TPH support for device with a specific ST mode.
Current supported modes include:

  * PCI_TPH_ST_NS_MODE - NO ST Mode
  * PCI_TPH_ST_IV_MODE - Interrupt Vector Mode
  * PCI_TPH_ST_DS_MODE - Device Specific Mode

`pcie_enable_tph()` checks whether the requested mode is actually
supported by the device before enabling. The device driver can figure out
which TPH mode is supported and can be properly enabled based on the
return value of `pcie_enable_tph()`.

To disable TPH, use the following function::

  void pcie_disable_tph(struct pci_dev *pdev);

Manage ST
---------

Steering Tags are platform specific. PCIe spec does not specify where STs
are from. Instead PCI Firmware Specification defines an ACPI _DSM method
(see the `Revised _DSM for Cache Locality TPH Features ECN
<https://members.pcisig.com/wg/PCI-SIG/document/15470>`_) for retrieving
STs for a target memory of various properties. This method is what is
supported in this implementation.

To retrieve a Steering Tag for a target memory associated with a specific
CPU, use the following function::

  int pcie_tph_get_cpu_st(struct pci_dev *pdev, enum tph_mem_type type,
                          unsigned int cpu_uid, u16 *tag);

The `type` argument is used to specify the memory type, either volatile
or persistent, of the target memory. The `cpu_uid` argument specifies the
CPU where the memory is associated to.

After the ST value is retrieved, the device driver can use the following
function to write the ST into the device::

  int pcie_tph_set_st_entry(struct pci_dev *pdev, unsigned int index,
                            u16 tag);

The `index` argument is the ST table entry index the ST tag will be
written into. `pcie_tph_set_st_entry()` will figure out the proper
location of ST table, either in the MSI-X table or in the TPH Extended
Capability space, and write the Steering Tag into the ST entry pointed by
the `index` argument.

It is completely up to the driver to decide how to use these TPH
functions. For example a network device driver can use the TPH APIs above
to update the Steering Tag when interrupt affinity of a RX/TX queue has
been changed. Here is a sample code for IRQ affinity notifier:

.. code-block:: c

    static void irq_affinity_notified(struct irq_affinity_notify *notify,
                                      const cpumask_t *mask)
    {
         struct drv_irq *irq;
         unsigned int cpu_id;
         u16 tag;

         irq = container_of(notify, struct drv_irq, affinity_notify);
         cpumask_copy(irq->cpu_mask, mask);

         /* Pick a right CPU as the target - here is just an example */
         cpu_id = cpumask_first(irq->cpu_mask);

         if (pcie_tph_get_cpu_st(irq->pdev, TPH_MEM_TYPE_VM, cpu_id,
                                 &tag))
             return;

         if (pcie_tph_set_st_entry(irq->pdev, irq->msix_nr, tag))
             return;
    }

Disable TPH system-wide
-----------------------

There is a kernel command line option available to control TPH feature:
    * "notph": TPH will be disabled for all endpoint devices.
Loading