Commit 5185c4d8 authored by Jason Gunthorpe

Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next

Jason Gunthorpe says:

====================
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to only work with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:

   https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com/

The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fd's, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas and if the underlying alignment
requirements are met it will be placed in the iommu_domain.

Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature that the VFIO
type 1 container has that iommufd couldn't do.

The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.

Instead iommufd relies on revokable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd
which will unmap it from the iommu_domain. There is no automatic remap,
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue FLR it should
do the map/unmap to iommufd.

Since DMABUF is missing some key general features for this use case it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.

The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().

Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases so future series will work
to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().

I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.

The latest series for interconnect negotiation to exchange a phys_addr is:
 https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com

And the discussion for design of revoke is here:
 https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/


====================
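
The slice selection described in the cover letter can be illustrated with a small user-space sketch. It is not iommufd code: `slice_is_mappable()` is a hypothetical helper, and the rule that offset, length and IOVA must all be page aligned is an assumption made for illustration. The real alignment requirements are decided by the iommufd implementation and the underlying iommu_domain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of iommufd: decide whether a slice of a
 * DMABUF of buf_size bytes could be mapped at the given IOVA.
 */
bool slice_is_mappable(uint64_t buf_size, uint64_t start, uint64_t length,
		       uint64_t iova, uint64_t page_size)
{
	uint64_t mask = page_size - 1;

	/* The sketch only handles power-of-two page sizes. */
	assert(page_size != 0 && (page_size & mask) == 0);

	/* An empty slice maps nothing. */
	if (length == 0)
		return false;

	/* The slice must lie inside the buffer; written to avoid overflow. */
	if (start > buf_size || length > buf_size - start)
		return false;

	/* The IOVA range must not wrap around the address space. */
	if (iova > UINT64_MAX - (length - 1))
		return false;

	/* Assumed rule: offset, length and IOVA are all page aligned. */
	return ((start | length | iova) & mask) == 0;
}
```

A VMM would run a check of this kind before issuing IOMMU_IOAS_MAP_FILE, so that a misaligned BAR slice is rejected early instead of failing in the kernel.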

Based on a shared branch with vfio.

* iommufd_dmabuf:
  iommufd/selftest: Add some tests for the dmabuf flow
  iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
  iommufd: Have iopt_map_file_pages convert the fd to a file
  iommufd: Have pfn_reader process DMABUF iopt_pages
  iommufd: Allow MMIO pages in a batch
  iommufd: Allow a DMABUF to be revoked
  iommufd: Do not map/unmap revoked DMABUFs
  iommufd: Add DMABUF to iopt_pages
  vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
  vfio/nvgrace: Support get_dmabuf_phys
  vfio/pci: Add dma-buf export support for MMIO regions
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Share the core device pointer while invoking feature functions
  vfio: Export vfio device get and put registration helpers
  dma-buf: provide phys_vec to scatter-gather mapping routine
  PCI/P2PDMA: Document DMABUF model
  PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Separate the mmap() support from the core logic

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
parents 81c45c62 d2041f1f
+74 −23
@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

-One of the biggest issues is that PCI doesn't require forwarding
-transactions between hierarchy domains, and in PCIe, each Root Port
-defines a separate hierarchy domain. To make things worse, there is no
-simple way to determine if a given Root Complex supports this or not.
-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
-only supports doing P2P when the endpoints involved are all behind the
-same PCI bridge, as such devices are all in the same PCI hierarchy
-domain, and the spec guarantees that all transactions within the
-hierarchy will be routable, but it does not require routing
-between hierarchies.
-
-The second issue is that to make use of existing interfaces in Linux,
-memory that is used for P2P transactions needs to be backed by struct
-pages. However, PCI BARs are not typically cache coherent so there are
-a few corner case gotchas with these pages so developers need to
-be careful about what they do with them.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.

However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SOC to another
PCIe root port, or routed internally to the SOC.

The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains and the kernel defaults to blocking such routing. There is an
allow list for detecting known-good HW, in which case P2P between any
two PCIe devices will be permitted.

Since P2P inherently is doing transactions between two devices it requires two
drivers to be co-operating inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules the
MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
table mappings undone before the providing driver completes remove().

This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.

At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.

Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows, in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.


Driver Writer's Guide
@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------

-Driver writers should be very careful about not passing these special
-struct pages to code that isn't prepared for it. At this time, the kernel
-interfaces do not have any checks for ensuring this. This obviously
-precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.

-P2P memory is also technically IO memory but should never have any side
-effects behind it. Thus, the order of loads and stores should not be important
-and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.


Usage With DMABUF
=================

DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.

Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.

In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.

Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.

No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2p_provider.
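
The lifecycle rule above can be modelled in a few lines of user-space C. This is only a toy sketch: `struct exporter`, `struct importer` and every function name below are hypothetical stand-ins, not the DMABUF API. It shows the two properties the text relies on, namely that a revoke synchronously removes every importer mapping, and that ending the revoke does not restore any mapping by itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IMPORTERS 4

struct importer {
	bool mapped;		/* does this importer hold a DMA mapping? */
};

struct exporter {
	bool revoked;		/* set while access to the MMIO is forbidden */
	struct importer *importers[MAX_IMPORTERS];
	size_t nr_importers;
};

int exporter_attach(struct exporter *exp, struct importer *imp)
{
	assert(exp != NULL && imp != NULL);

	if (exp->nr_importers == MAX_IMPORTERS)
		return -ENOSPC;
	exp->importers[exp->nr_importers++] = imp;
	return 0;
}

/* An importer asks for a mapping; refused while the buffer is revoked. */
int importer_map(struct exporter *exp, struct importer *imp)
{
	if (exp->revoked)
		return -ENODEV;
	imp->mapped = true;
	return 0;
}

/* Stand-in for move_notify(): the importer must unmap before returning. */
static void importer_move_notify(struct importer *imp)
{
	imp->mapped = false;
}

/* Synchronous invalidation: no mapping survives this call. */
void exporter_revoke(struct exporter *exp)
{
	size_t i;

	exp->revoked = true;
	for (i = 0; i < exp->nr_importers; i++)
		importer_move_notify(exp->importers[i]);
}

/* Ending the revoke does not remap anything; importers must map again. */
void exporter_unrevoke(struct exporter *exp)
{
	exp->revoked = false;
}
```

In this model the caller has to invoke `importer_map()` again after `exporter_unrevoke()`, which mirrors the expectation that the mapping is re-established explicitly rather than automatically.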


P2P DMA Support Library
+1 −1
@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,

static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
	iter->len = vec->len;
	return true;
}
+1 −1
# SPDX-License-Identifier: GPL-2.0-only
obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
-	 dma-fence-unwrap.o dma-resv.o
	 dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
+248 −0
// SPDX-License-Identifier: GPL-2.0-only
/*
 * DMA BUF Mapping Helpers
 *
 */
#include <linux/dma-buf-mapping.h>
#include <linux/dma-resv.h>

static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
					 dma_addr_t addr)
{
	unsigned int len, nents;
	int i;

	nents = DIV_ROUND_UP(length, UINT_MAX);
	for (i = 0; i < nents; i++) {
		len = min_t(size_t, length, UINT_MAX);
		length -= len;
		/*
		 * DMABUF abuses scatterlist to create a scatterlist
		 * that does not have any CPU list, only the DMA list.
		 * Always set the page related values to NULL to ensure
		 * importers can't use it. The phys_addr based DMA API
		 * does not require the CPU list for mapping or unmapping.
		 */
		sg_set_page(sgl, NULL, 0, 0);
		sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
		sg_dma_len(sgl) = len;
		sgl = sg_next(sgl);
	}

	return sgl;
}

static unsigned int calc_sg_nents(struct dma_iova_state *state,
				  struct dma_buf_phys_vec *phys_vec,
				  size_t nr_ranges, size_t size)
{
	unsigned int nents = 0;
	size_t i;

	if (!state || !dma_use_iova(state)) {
		for (i = 0; i < nr_ranges; i++)
			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
	} else {
		/*
		 * In the IOVA case a single contiguous IOVA range covers
		 * the whole mapping, but sg->length is only 32 bits wide,
		 * so more than one SG entry may still be needed.
		 */
		nents = DIV_ROUND_UP(size, UINT_MAX);
	}

	return nents;
}

/**
 * struct dma_buf_dma - holds DMA mapping information
 * @sgt:    Scatter-gather table
 * @state:  DMA IOVA state relevant in IOMMU-based DMA
 * @size:   Total size of DMA transfer
 */
struct dma_buf_dma {
	struct sg_table sgt;
	struct dma_iova_state *state;
	size_t size;
};

/**
 * dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
 * from arrays of physical vectors. This function is intended for MMIO memory
 * only.
 * @attach:	[in]	attachment whose scatterlist is to be returned
 * @provider:	[in]	p2pdma provider
 * @phys_vec:	[in]	array of physical vectors
 * @nr_ranges:	[in]	number of entries in phys_vec array
 * @size:	[in]	total size of phys_vec
 * @dir:	[in]	direction of DMA transfer
 *
 * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
 * on error. May return -EINTR if it is interrupted by a signal.
 *
 * On success, the DMA addresses and lengths in the returned scatterlist are
 * PAGE_SIZE aligned.
 *
 * A mapping must be unmapped by using dma_buf_free_sgt().
 *
 * NOTE: This function is intended for exporters. If direct traffic routing is
 * mandatory, the exporter should call pci_p2pdma_map_type() before calling
 * this function.
 */
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
					 struct p2pdma_provider *provider,
					 struct dma_buf_phys_vec *phys_vec,
					 size_t nr_ranges, size_t size,
					 enum dma_data_direction dir)
{
	unsigned int nents, mapped_len = 0;
	struct dma_buf_dma *dma;
	struct scatterlist *sgl;
	dma_addr_t addr;
	size_t i;
	int ret;

	if (WARN_ON(!attach || !attach->dmabuf || !provider))
		/* This function is supposed to work on MMIO memory only */
		return ERR_PTR(-EINVAL);

	/* Only dereference attach once it has been validated above. */
	dma_resv_assert_held(attach->dmabuf->resv);

	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
	if (!dma)
		return ERR_PTR(-ENOMEM);

	switch (pci_p2pdma_map_type(provider, attach->dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/*
		 * This flow does not need an IOVA at all.
		 */
		break;
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
		if (!dma->state) {
			ret = -ENOMEM;
			goto err_free_dma;
		}

		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
		break;
	default:
		ret = -EINVAL;
		goto err_free_dma;
	}

	nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
	ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
	if (ret)
		goto err_free_state;

	sgl = dma->sgt.sgl;

	for (i = 0; i < nr_ranges; i++) {
		if (!dma->state) {
			addr = pci_p2pdma_bus_addr_map(provider,
						       phys_vec[i].paddr);
		} else if (dma_use_iova(dma->state)) {
			ret = dma_iova_link(attach->dev, dma->state,
					    phys_vec[i].paddr, 0,
					    phys_vec[i].len, dir,
					    DMA_ATTR_MMIO);
			if (ret)
				goto err_unmap_dma;

			mapped_len += phys_vec[i].len;
		} else {
			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
					    phys_vec[i].len, dir,
					    DMA_ATTR_MMIO);
			ret = dma_mapping_error(attach->dev, addr);
			if (ret)
				goto err_unmap_dma;
		}

		if (!dma->state || !dma_use_iova(dma->state))
			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
	}

	if (dma->state && dma_use_iova(dma->state)) {
		WARN_ON_ONCE(mapped_len != size);
		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
		if (ret)
			goto err_unmap_dma;

		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
	}

	dma->size = size;

	/*
	 * No CPU list is included. Set orig_nents = 0 so that others can
	 * detect this via the SG table (use nents only).
	 */
	dma->sgt.orig_nents = 0;


	/*
	 * SGL must be NULL to indicate that SGL is the last one
	 * and we allocated correct number of entries in sg_alloc_table()
	 */
	WARN_ON_ONCE(sgl);
	return &dma->sgt;

err_unmap_dma:
	if (!i || !dma->state) {
		; /* Do nothing */
	} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
				 DMA_ATTR_MMIO);
	} else {
		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
	}
	sg_free_table(&dma->sgt);
err_free_state:
	kfree(dma->state);
err_free_dma:
	kfree(dma);
	return ERR_PTR(ret);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");

/**
 * dma_buf_free_sgt - unmaps the buffer
 * @attach:	[in]	attachment to unmap buffer from
 * @sgt:	[in]	scatterlist info of the buffer to unmap
 * @dir:	[in]	direction of DMA transfer
 *
 * This unmaps a DMA mapping for @attach obtained
 * by dma_buf_phys_vec_to_sgt().
 */
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
		      enum dma_data_direction dir)
{
	struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
	int i;

	dma_resv_assert_held(attach->dmabuf->resv);

	if (!dma->state) {
		; /* Do nothing */
	} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
				 DMA_ATTR_MMIO);
	} else {
		struct scatterlist *sgl;

		for_each_sgtable_dma_sg(sgt, sgl, i)
			dma_unmap_phys(attach->dev, sg_dma_address(sgl),
				       sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
	}

	sg_free_table(sgt);
	kfree(dma->state);
	kfree(dma);

}
EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
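
The segment splitting done by calc_sg_nents() and fill_sg_entry() can be tried out in user space. The following is a minimal sketch, not the kernel code: `chunk_count()`, `chunk_at()` and `struct chunk` are hypothetical names, and `uint64_t` stands in for dma_addr_t and size_t. It splits a range into pieces of at most UINT_MAX bytes and computes each piece's offset with a 64-bit multiply, since `i * UINT_MAX` no longer fits in 32 bits once i reaches 2.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

struct chunk {
	uint64_t addr;		/* start address of this segment */
	unsigned int len;	/* segment length, at most UINT_MAX */
};

/* Number of segments needed when each length must fit in 32 bits. */
uint64_t chunk_count(uint64_t length)
{
	/* Same result as DIV_ROUND_UP(length, UINT_MAX), without overflow. */
	return length / UINT_MAX + (length % UINT_MAX != 0);
}

/* Describe segment i of a range that starts at addr and spans length bytes. */
struct chunk chunk_at(uint64_t addr, uint64_t length, uint64_t i)
{
	struct chunk c;
	uint64_t offset, remaining;

	assert(i < chunk_count(length));

	/* Widen before multiplying so the offset is not truncated. */
	offset = i * (uint64_t)UINT_MAX;
	remaining = length - offset;

	c.addr = addr + offset;
	c.len = remaining < UINT_MAX ? (unsigned int)remaining : UINT_MAX;
	return c;
}
```

For example, a range of 2^32 bytes needs two segments under this scheme: one of UINT_MAX bytes and a second of a single byte.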
+2 −2
@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
			 * as a bus address, __finalise_sg() will copy the dma
			 * address into the output segment.
			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-						sg_phys(s));
			s->dma_address = pci_p2pdma_bus_addr_map(
				p2pdma_state.mem, sg_phys(s));
			sg_dma_len(s) = sg->length;
			sg_dma_mark_bus_address(s);
			continue;