Commit a3ebb59e authored by Linus Torvalds

Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio

Pull VFIO updates from Alex Williamson:

 - Move libvfio selftest artifacts in preparation for more tightly
   coupled integration with KVM selftests (David Matlack)

 - Fix comment typo in mtty driver (Chu Guangqing)

 - Support for new hardware revision in the hisi_acc vfio-pci variant
   driver where the migration registers can now be accessed via the PF.
   When enabled for this support, the full BAR can be exposed to the
   user (Longfang Liu)

 - Fix vfio cdev support for VF token passing, using the correct size
   for the kernel structure, thereby actually allowing userspace to
   provide a non-zero UUID token. Also set the match token callback for
   hisi_acc, fixing VF token support for this vfio-pci variant
   driver (Raghavendra Rao Ananta)

 - Introduce internal callbacks on vfio devices to simplify and
   consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
   data, removing various ioctl intercepts with a more structured
   solution (Jason Gunthorpe)

 - Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
   to be exposed through dma-buf objects with lifecycle managed through
   move operations. This enables low-level interactions such as
   vfio-pci based SPDK drivers interacting directly with dma-buf capable
   RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
   able to build upon this support to fill a long-standing feature gap
   versus the legacy vfio type1 IOMMU backend with an implementation of
   P2P support for VM use cases that better manages the lifecycle of the
   P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)

 - Convert eventfd triggering for error and request signals to use RCU
   mechanisms in order to avoid a 3-way deadlock reported by lockdep
   (Alex Williamson)

 - Fix a 32-bit overflow introduced via dma-buf support manifesting with
   large DMA buffers (Alex Mastro)

 - Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
   fault rather than at mmap time. This conversion makes use of huge
   PFNMAPs, avoids corrected RAS events during reset by now being
   subject to vfio-pci-core's use of unmap_mapping_range(), and enables
   a device readiness test after reset (Ankit Agrawal)

 - Refactoring of vfio selftests to support multi-device tests and split
   code to provide better separation between IOMMU and device objects.
   This work also enables a new test suite addition to measure parallel
   device initialization latency (David Matlack)

* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
  vfio: selftests: Add vfio_pci_device_init_perf_test
  vfio: selftests: Eliminate INVALID_IOVA
  vfio: selftests: Split libvfio.h into separate header files
  vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
  vfio: selftests: Rename vfio_util.h to libvfio.h
  vfio: selftests: Stop passing device for IOMMU operations
  vfio: selftests: Move IOVA allocator into iova_allocator.c
  vfio: selftests: Move IOMMU library code into iommu.c
  vfio: selftests: Rename struct vfio_dma_region to dma_region
  vfio: selftests: Upgrade driver logging to dev_err()
  vfio: selftests: Prefix logs with device BDF where relevant
  vfio: selftests: Eliminate overly chatty logging
  vfio: selftests: Support multiple devices in the same container/iommufd
  vfio: selftests: Introduce struct iommu
  vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
  vfio: selftests: Allow passing multiple BDFs on the command line
  vfio: selftests: Split run.sh into separate scripts
  vfio: selftests: Move run.sh into scripts directory
  vfio/nvgrace-gpu: wait for the GPU mem to be ready
  vfio/nvgrace-gpu: Inform devmem unmapped after reset
  ...
parents ce5cfb0f d721f52e
+74 −23
@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.

The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.

However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or routed internally to the SOC.

The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains and the kernel defaults to blocking such routing. There is an
allow list for detecting known-good HW, in which case P2P between any
two PCIe devices will be permitted.

Since P2P inherently is doing transactions between two devices it requires two
drivers to be co-operating inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mappings removed, all CPU accesses prevented, and all page
table mappings undone before the providing driver completes remove().

This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.

At the lowest level the P2P subsystem offers a naked struct p2pdma_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.

Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.


Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------

Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related functions will not return them unless
FOLL_PCI_P2PDMA is set.

P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.


Usage With DMABUF
=================

DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.

Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.

In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.

Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.

No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2pdma_provider.
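
The ordering rule described above (invalidate every importer, confirm all DMA mappings are gone, and only then destroy the provider) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, nothing here is kernel API, and the real flow goes through the DMABUF `move_notify()` callback while holding the reservation lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_IMPORTERS 4

/* Hypothetical stand-in for an importing driver's attachment. */
struct toy_importer {
	bool dma_mapped;
};

/* Hypothetical stand-in for the exporting driver and its provider. */
struct toy_exporter {
	struct toy_importer *importers[TOY_MAX_IMPORTERS];
	size_t nr_importers;
	bool revoked;        /* set once removal has started */
	bool provider_alive; /* models the lifetime of the provider */
};

static void toy_exporter_init(struct toy_exporter *exp)
{
	exp->nr_importers = 0;
	exp->revoked = false;
	exp->provider_alive = true;
}

/* New mappings are refused once the exporter has been revoked. */
static bool toy_attach_and_map(struct toy_exporter *exp,
			       struct toy_importer *imp)
{
	if (exp->revoked || !exp->provider_alive ||
	    exp->nr_importers == TOY_MAX_IMPORTERS)
		return false;

	exp->importers[exp->nr_importers++] = imp;
	imp->dma_mapped = true;
	return true;
}

/* Models the importer side of move_notify(): unmap synchronously. */
static void toy_move_notify(struct toy_importer *imp)
{
	imp->dma_mapped = false;
}

/* Models remove(): revoke, notify everyone, then drop the provider. */
static void toy_exporter_remove(struct toy_exporter *exp)
{
	size_t i;

	exp->revoked = true;
	for (i = 0; i < exp->nr_importers; i++)
		toy_move_notify(exp->importers[i]);

	/* No importer may still hold a DMA mapping at this point. */
	for (i = 0; i < exp->nr_importers; i++)
		assert(!exp->importers[i]->dma_mapped);

	exp->provider_alive = false;
}
```

The model keeps the provider alive until the last importer has unmapped, which is the invariant the text states; it deliberately ignores locking and concurrency.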


P2P DMA Support Library
+1 −1
@@ -84,7 +84,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,

static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
	iter->len = vec->len;
	return true;
}
+27 −0
@@ -3032,11 +3032,36 @@ static void qm_put_pci_res(struct hisi_qm *qm)
	pci_release_mem_regions(pdev);
}

static void hisi_mig_region_clear(struct hisi_qm *qm)
{
	u32 val;

	/* Clear migration region set of PF */
	if (qm->fun_type == QM_HW_PF && qm->ver > QM_HW_V3) {
		val = readl(qm->io_base + QM_MIG_REGION_SEL);
		val &= ~QM_MIG_REGION_EN;
		writel(val, qm->io_base + QM_MIG_REGION_SEL);
	}
}

static void hisi_mig_region_enable(struct hisi_qm *qm)
{
	u32 val;

	/* Select migration region of PF */
	if (qm->fun_type == QM_HW_PF && qm->ver > QM_HW_V3) {
		val = readl(qm->io_base + QM_MIG_REGION_SEL);
		val |= QM_MIG_REGION_EN;
		writel(val, qm->io_base + QM_MIG_REGION_SEL);
	}
}

static void hisi_qm_pci_uninit(struct hisi_qm *qm)
{
	struct pci_dev *pdev = qm->pdev;

	pci_free_irq_vectors(pdev);
	hisi_mig_region_clear(qm);
	qm_put_pci_res(qm);
	pci_disable_device(pdev);
}
@@ -5752,6 +5777,7 @@ int hisi_qm_init(struct hisi_qm *qm)
		goto err_free_qm_memory;

	qm_cmd_init(qm);
	hisi_mig_region_enable(qm);

	return 0;

@@ -5890,6 +5916,7 @@ static int qm_rebuild_for_resume(struct hisi_qm *qm)
	}

	qm_cmd_init(qm);
	hisi_mig_region_enable(qm);
	hisi_qm_dev_err_init(qm);
	/* Set the doorbell timeout to QM_DB_TIMEOUT_CFG ns. */
	writel(QM_DB_TIMEOUT_SET, qm->io_base + QM_DB_TIMEOUT_CFG);
+1 −1
# SPDX-License-Identifier: GPL-2.0-only
obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
	 dma-fence-unwrap.o dma-resv.o
	 dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
+248 −0
// SPDX-License-Identifier: GPL-2.0-only
/*
 * DMA BUF Mapping Helpers
 *
 */
#include <linux/dma-buf-mapping.h>
#include <linux/dma-resv.h>

static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
					 dma_addr_t addr)
{
	unsigned int len, nents;
	int i;

	nents = DIV_ROUND_UP(length, UINT_MAX);
	for (i = 0; i < nents; i++) {
		len = min_t(size_t, length, UINT_MAX);
		length -= len;
		/*
		 * DMABUF abuses scatterlist to create a scatterlist
		 * that does not have any CPU list, only the DMA list.
		 * Always set the page related values to NULL to ensure
		 * importers can't use it. The phys_addr based DMA API
		 * does not require the CPU list for mapping or unmapping.
		 */
		sg_set_page(sgl, NULL, 0, 0);
		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
		sg_dma_len(sgl) = len;
		sgl = sg_next(sgl);
	}

	return sgl;
}

static unsigned int calc_sg_nents(struct dma_iova_state *state,
				  struct dma_buf_phys_vec *phys_vec,
				  size_t nr_ranges, size_t size)
{
	unsigned int nents = 0;
	size_t i;

	if (!state || !dma_use_iova(state)) {
		for (i = 0; i < nr_ranges; i++)
			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
	} else {
		/*
		 * In the IOVA case a single SG entry spans the whole IOVA
		 * range, but it must still fit in sg->length, so more than
		 * one entry may be needed.
		 */
		nents = DIV_ROUND_UP(size, UINT_MAX);
	}

	return nents;
}

/**
 * struct dma_buf_dma - holds DMA mapping information
 * @sgt:    Scatter-gather table
 * @state:  DMA IOVA state relevant in IOMMU-based DMA
 * @size:   Total size of DMA transfer
 */
struct dma_buf_dma {
	struct sg_table sgt;
	struct dma_iova_state *state;
	size_t size;
};

/**
 * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
 * from arrays of physical vectors. This function is intended for MMIO memory
 * only.
 * @attach:	[in]	attachment whose scatterlist is to be returned
 * @provider:	[in]	p2pdma provider
 * @phys_vec:	[in]	array of physical vectors
 * @nr_ranges:	[in]	number of entries in phys_vec array
 * @size:	[in]	total size of phys_vec
 * @dir:	[in]	direction of DMA transfer
 *
 * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
 * on error. May return -EINTR if it is interrupted by a signal.
 *
 * On success, the DMA addresses and lengths in the returned scatterlist are
 * PAGE_SIZE aligned.
 *
 * A mapping must be unmapped by using dma_buf_free_sgt().
 *
 * NOTE: This function is intended for exporters. If direct traffic routing is
 * mandatory, the exporter should call pci_p2pdma_map_type() before calling
 * this function.
 */
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
					 struct p2pdma_provider *provider,
					 struct dma_buf_phys_vec *phys_vec,
					 size_t nr_ranges, size_t size,
					 enum dma_data_direction dir)
{
	unsigned int nents, mapped_len = 0;
	struct dma_buf_dma *dma;
	struct scatterlist *sgl;
	dma_addr_t addr;
	size_t i;
	int ret;

	if (WARN_ON(!attach || !attach->dmabuf || !provider))
		/* This function is supposed to work on MMIO memory only */
		return ERR_PTR(-EINVAL);

	/* Dereference attach only after the NULL checks above. */
	dma_resv_assert_held(attach->dmabuf->resv);

	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
	if (!dma)
		return ERR_PTR(-ENOMEM);

	switch (pci_p2pdma_map_type(provider, attach->dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/*
		 * No IOVA is needed at all for this flow.
		 */
		break;
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
		if (!dma->state) {
			ret = -ENOMEM;
			goto err_free_dma;
		}

		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
		break;
	default:
		ret = -EINVAL;
		goto err_free_dma;
	}

	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
	if (ret)
		goto err_free_state;

	sgl = dma->sgt.sgl;

	for (i = 0; i < nr_ranges; i++) {
		if (!dma->state) {
			addr = pci_p2pdma_bus_addr_map(provider,
						       phys_vec[i].paddr);
		} else if (dma_use_iova(dma->state)) {
			ret = dma_iova_link(attach->dev, dma->state,
					    phys_vec[i].paddr, 0,
					    phys_vec[i].len, dir,
					    DMA_ATTR_MMIO);
			if (ret)
				goto err_unmap_dma;

			mapped_len += phys_vec[i].len;
		} else {
			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
					    phys_vec[i].len, dir,
					    DMA_ATTR_MMIO);
			ret = dma_mapping_error(attach->dev, addr);
			if (ret)
				goto err_unmap_dma;
		}

		if (!dma->state || !dma_use_iova(dma->state))
			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
	}

	if (dma->state && dma_use_iova(dma->state)) {
		WARN_ON_ONCE(mapped_len != size);
		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
		if (ret)
			goto err_unmap_dma;

		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
	}

	dma->size = size;

	/*
	 * No CPU list is included; set orig_nents = 0 so others can detect
	 * this via the SG table (use nents only).
	 */
	dma->sgt.orig_nents = 0;

	/*
	 * sgl must be NULL here: that shows the last entry was filled
	 * and that sg_alloc_table() allocated the correct number of entries.
	 */
	WARN_ON_ONCE(sgl);
	return &dma->sgt;

err_unmap_dma:
	if (!i || !dma->state) {
		; /* Do nothing */
	} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
				 DMA_ATTR_MMIO);
	} else {
		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
	}
	sg_free_table(&dma->sgt);
err_free_state:
	kfree(dma->state);
err_free_dma:
	kfree(dma);
	return ERR_PTR(ret);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");

/**
 * dma_buf_free_sgt - unmaps the buffer
 * @attach:	[in]	attachment to unmap buffer from
 * @sgt:	[in]	scatterlist info of the buffer to unmap
 * @dir:	[in]	direction of DMA transfer
 *
 * This unmaps a DMA mapping for @attach obtained
 * by dma_buf_phys_vec_to_sgt().
 */
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
		      enum dma_data_direction dir)
{
	struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
	int i;

	dma_resv_assert_held(attach->dmabuf->resv);

	if (!dma->state) {
		; /* Do nothing */
	} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
				 DMA_ATTR_MMIO);
	} else {
		struct scatterlist *sgl;

		for_each_sgtable_dma_sg(sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
	}

	sg_free_table(sgt);
	kfree(dma->state);
	kfree(dma);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
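
The splitting arithmetic used by fill_sg_entry() and calc_sg_nents() above can be tried in userspace. The sketch below is not the kernel code: `struct chunk`, `chunks_needed()` and `split_range()` are hypothetical names, and it assumes a platform where `unsigned int` is 32 bits wide, matching the width of the scatterlist DMA length field that forces the split.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-only scatterlist entry. */
struct chunk {
	uint64_t addr;    /* plays the role of sg_dma_address() */
	unsigned int len; /* plays the role of sg_dma_len() */
};

/* Same result as DIV_ROUND_UP(length, UINT_MAX), without overflow. */
static size_t chunks_needed(uint64_t length)
{
	return (size_t)(length / UINT_MAX + (length % UINT_MAX != 0));
}

/*
 * Split [addr, addr + length) into entries of at most UINT_MAX bytes.
 * Returns the number of entries written, or 0 if out[] is too small.
 */
static size_t split_range(uint64_t addr, uint64_t length,
			  struct chunk *out, size_t max_out)
{
	size_t nents = chunks_needed(length);
	size_t i;

	if (nents > max_out)
		return 0;

	for (i = 0; i < nents; i++) {
		uint64_t len = length < UINT_MAX ? length : UINT_MAX;

		/* Widen i before multiplying so the offset cannot wrap. */
		out[i].addr = addr + (uint64_t)i * UINT_MAX;
		out[i].len = (unsigned int)len;
		length -= len;
	}

	/* Every byte of the range must be covered by some entry. */
	assert(length == 0);
	return nents;
}
```

Calling `split_range()` on a 10 GiB range yields three entries under that assumption: two of UINT_MAX bytes and one holding the remainder. The widening cast on the index mirrors the `(dma_addr_t)i * UINT_MAX` expression in fill_sg_entry().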