Merge branch 'iommufd_dmabuf' into k.o-iommufd/for-next (5185c4d8)

Documentation/driver-api/pci/p2pdma.rst

+74 −23

Original line number	Diff line number	Diff line
		@@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
		called Peer-to-Peer (or P2P). However, there are a number of issues that
		make P2P transactions tricky to do in a perfectly safe way.

		-One of the biggest issues is that PCI doesn't require forwarding
		-transactions between hierarchy domains, and in PCIe, each Root Port
		-defines a separate hierarchy domain. To make things worse, there is no
		-simple way to determine if a given Root Complex supports this or not.
		-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
		-only supports doing P2P when the endpoints involved are all behind the
		-same PCI bridge, as such devices are all in the same PCI hierarchy
		-domain, and the spec guarantees that all transactions within the
		-hierarchy will be routable, but it does not require routing
		-between hierarchies.
		-
		-The second issue is that to make use of existing interfaces in Linux,
		-memory that is used for P2P transactions needs to be backed by struct
		-pages. However, PCI BARs are not typically cache coherent so there are
		-a few corner case gotchas with these pages so developers need to
		-be careful about what they do with them.
		+For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
		+until they reach a host bridge or root port. If the path includes PCIe switches
		+then based on the ACS settings the transaction can route entirely within
		+the PCIe hierarchy and never reach the root port. The kernel will evaluate
		+the PCIe topology and always permit P2P in these well-defined cases.
		+
		+However, if the P2P transaction reaches the host bridge then it might have to
		+hairpin back out the same root port, be routed inside the CPU SOC to another
		+PCIe root port, or routed internally to the SOC.
		+
		+The PCIe specification doesn't define the forwarding of transactions between
		+hierarchy domains and the kernel defaults to blocking such routing. There is an
		+allow list to allow detecting known-good HW, in which case P2P between any
		+two PCIe devices will be permitted.
		+
		+Since P2P inherently is doing transactions between two devices it requires two
		+drivers to be co-operating inside the kernel. The providing driver has to convey
		+its MMIO to the consuming driver. To meet the driver model lifecycle rules the
		+MMIO must have all DMA mapping removed, all CPU accesses prevented, and all page
		+table mappings undone before the providing driver completes remove().
		+
		+This requires the providing and consuming driver to actively work together to
		+guarantee that the consuming driver has stopped using the MMIO during a removal
		+cycle. This is done by either a synchronous invalidation shutdown or waiting
		+for all usage refcounts to reach zero.
		+
		+At the lowest level the P2P subsystem offers a naked struct p2p_provider that
		+delegates lifecycle management to the providing driver. It is expected that
		+drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
		+to provide an invalidation shutdown. These MMIO addresses have no struct page, and
		+if used with mmap() must create special PTEs. As such there are very few
		+kernel uAPIs that can accept pointers to them; in particular they cannot be used
		+with read()/write(), including O_DIRECT.
		+
		+Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
		+pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
		+pgmap ensures that when the pgmap is destroyed all other drivers have stopped
		+using the MMIO. This option works with O_DIRECT flows, in some cases, if the
		+underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
		+FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
		+it also relies on architecture support along with alignment and minimum size
		+limitations.


		Driver Writer's Guide
		@@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
		Struct Page Caveats
		-------------------

		-Driver writers should be very careful about not passing these special
		-struct pages to code that isn't prepared for it. At this time, the kernel
		-interfaces do not have any checks for ensuring this. This obviously
		-precludes passing these pages to userspace.
		+While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
		+pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.

		-P2P memory is also technically IO memory but should never have any side
		-effects behind it. Thus, the order of loads and stores should not be important
		-and ioreadX(), iowriteX() and friends should not be necessary.
		+The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
		+KVA is still MMIO and must still be accessed through the normal
		+readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
		+like any other MMIO mapping. While this will actually work on some
		+architectures, others will experience corruption or just crash in the kernel.
		+Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
		+access happens.
		+
		+
		+Usage With DMABUF
		+=================
		+
		+DMABUF provides an alternative to the above struct page-based
		+client/provider/orchestrator system and should be used when struct page
		+doesn't exist. In this mode the exporting driver will wrap
		+some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
		+
		+Userspace can then pass the FD to an importing driver which will ask the
		+exporting driver to map it to the importer.
		+
		+In this case the initiator and target pci_devices are known and the P2P subsystem
		+is used to determine the mapping type. The phys_addr_t-based DMA API is used to
		+establish the dma_addr_t.
		+
		+Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
		+to remove() it must deliver an invalidation shutdown to all DMABUF importing
		+drivers through move_notify() and synchronously DMA unmap all the MMIO.
		+
		+No importing driver can continue to have a DMA map to the MMIO after the
		+exporting driver has destroyed its p2p_provider.


		P2P DMA Support Library
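
Illustrative note, not part of the commit: the "Usage With DMABUF" text above describes a revocation protocol in which the exporter must invalidate every importer through move_notify() before it destroys its p2p_provider. The following is a minimal userspace model of that ordering, written only to make the rule concrete. Every type and function name in it (`struct importer`, `struct exporter`, `importer_map()`, `exporter_remove()`) is hypothetical and does not correspond to a kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; none of these exist in the kernel. */
struct importer {
	bool dma_mapped;            /* importer currently holds a DMA mapping */
	unsigned int invalidations; /* how many times it was told to unmap */
};

struct exporter {
	struct importer **attached; /* importers attached to the buffer */
	size_t nr_attached;
	bool revoked;               /* set once removal has started */
	bool provider_alive;        /* stands in for the p2p_provider lifetime */
};

/* Models the importer side of move_notify(): drop the mapping synchronously. */
static void importer_move_notify(struct importer *imp)
{
	imp->dma_mapped = false;
	imp->invalidations++;
}

/* Mapping is refused once the exporter has started to go away. */
static bool importer_map(struct exporter *exp, struct importer *imp)
{
	if (exp->revoked)
		return false;
	imp->dma_mapped = true;
	return true;
}

/* Models the exporter's remove(): revoke first, destroy the provider last. */
static void exporter_remove(struct exporter *exp)
{
	size_t i;

	exp->revoked = true;
	for (i = 0; i < exp->nr_attached; i++)
		importer_move_notify(exp->attached[i]);

	/* The invariant from the text: no DMA map may outlive the provider. */
	for (i = 0; i < exp->nr_attached; i++)
		assert(!exp->attached[i]->dma_mapped);

	exp->provider_alive = false;
}
```

The point of the model is the ordering inside `exporter_remove()`: new mappings are blocked, existing ones are torn down, and only then does the provider go away.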

block/blk-mq-dma.c

+1 −1

		@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,

		static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
		{
		-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
		+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
		iter->len = vec->len;
		return true;
		}
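
Illustrative note, not part of the commit: the call-site change above passes the provider (`iter->p2pdma.mem`) rather than the whole P2PDMA state. A rough userspace sketch of why that is sufficient follows, assuming the bus address is simply the physical address plus a per-provider offset. That offset field is not shown in this diff, so `bus_offset` and all `model_*` names below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the provider; the bus_offset field is assumed. */
struct model_provider {
	uint64_t bus_offset; /* difference between bus and CPU physical address */
};

/* Hypothetical stand-in for the per-request state that caches the provider. */
struct model_p2pdma_state {
	struct model_provider *mem;
	int map_type;
};

/* Only the provider is needed to translate a physical address. */
static uint64_t model_bus_addr_map(const struct model_provider *provider,
				   uint64_t paddr)
{
	return paddr + provider->bus_offset;
}
```

Under that assumption a caller that already holds the state passes `state.mem`, and a caller that has only a provider, such as a DMABUF exporter, can use the same helper without building a state object.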

drivers/dma-buf/Makefile

+1 −1

		# SPDX-License-Identifier: GPL-2.0-only
		obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
		-	dma-fence-unwrap.o dma-resv.o
		+	dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
		obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
		obj-$(CONFIG_DMABUF_HEAPS) += heaps/
		obj-$(CONFIG_SYNC_FILE) += sync_file.o

drivers/dma-buf/dma-buf-mapping.c

0 → 100644

+248 −0

		// SPDX-License-Identifier: GPL-2.0-only
		/*
		* DMA BUF Mapping Helpers
		*
		*/
		#include <linux/dma-buf-mapping.h>
		#include <linux/dma-resv.h>

		static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
		dma_addr_t addr)
		{
		unsigned int len, nents;
		int i;

		nents = DIV_ROUND_UP(length, UINT_MAX);
		for (i = 0; i < nents; i++) {
		len = min_t(size_t, length, UINT_MAX);
		length -= len;
		/*
		* DMABUF abuses scatterlist to create a scatterlist
		* that does not have any CPU list, only the DMA list.
		* Always set the page related values to NULL to ensure
		* importers can't use it. The phys_addr based DMA API
		* does not require the CPU list for mapping or unmapping.
		*/
		sg_set_page(sgl, NULL, 0, 0);
		sg_dma_address(sgl) = addr + i * UINT_MAX;
		sg_dma_len(sgl) = len;
		sgl = sg_next(sgl);
		}

		return sgl;
		}

		static unsigned int calc_sg_nents(struct dma_iova_state *state,
		struct dma_buf_phys_vec *phys_vec,
		size_t nr_ranges, size_t size)
		{
		unsigned int nents = 0;
		size_t i;

		if (!state || !dma_use_iova(state)) {
		for (i = 0; i < nr_ranges; i++)
		nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
		} else {
		/*
		* In IOVA case, there is only one SG entry which spans
		* for whole IOVA address space, but we need to make sure
		* that it fits sg->length, maybe we need more.
		*/
		nents = DIV_ROUND_UP(size, UINT_MAX);
		}

		return nents;
		}

		/**
		* struct dma_buf_dma - holds DMA mapping information
		* @sgt: Scatter-gather table
		* @state: DMA IOVA state relevant in IOMMU-based DMA
		* @size: Total size of DMA transfer
		*/
		struct dma_buf_dma {
		struct sg_table sgt;
		struct dma_iova_state *state;
		size_t size;
		};

		/**
		* dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
		* from arrays of physical vectors. This function is intended for MMIO memory
		* only.
		* @attach: [in] attachment whose scatterlist is to be returned
		* @provider: [in] p2pdma provider
		* @phys_vec: [in] array of physical vectors
		* @nr_ranges: [in] number of entries in phys_vec array
		* @size: [in] total size of phys_vec
		* @dir: [in] direction of DMA transfer
		*
		* Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
		* on error. May return -EINTR if it is interrupted by a signal.
		*
		* On success, the DMA addresses and lengths in the returned scatterlist are
		* PAGE_SIZE aligned.
		*
		* A mapping must be unmapped by using dma_buf_free_sgt().
		*
		* NOTE: This function is intended for exporters. If direct traffic routing is
		* mandatory, the exporter should call pci_p2pdma_map_type() before calling
		* this function.
		*/
		struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
		struct p2pdma_provider *provider,
		struct dma_buf_phys_vec *phys_vec,
		size_t nr_ranges, size_t size,
		enum dma_data_direction dir)
		{
		unsigned int nents, mapped_len = 0;
		struct dma_buf_dma *dma;
		struct scatterlist *sgl;
		dma_addr_t addr;
		size_t i;
		int ret;

		dma_resv_assert_held(attach->dmabuf->resv);

		if (WARN_ON(!attach || !attach->dmabuf || !provider))
		/* This function is supposed to work on MMIO memory only */
		return ERR_PTR(-EINVAL);

		dma = kzalloc(sizeof(*dma), GFP_KERNEL);
		if (!dma)
		return ERR_PTR(-ENOMEM);

		switch (pci_p2pdma_map_type(provider, attach->dev)) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
		/*
		* There is no need for an IOVA at all in this flow.
		*/
		break;
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
		if (!dma->state) {
		ret = -ENOMEM;
		goto err_free_dma;
		}

		dma_iova_try_alloc(attach->dev, dma->state, 0, size);
		break;
		default:
		ret = -EINVAL;
		goto err_free_dma;
		}

		nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
		ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
		if (ret)
		goto err_free_state;

		sgl = dma->sgt.sgl;

		for (i = 0; i < nr_ranges; i++) {
		if (!dma->state) {
		addr = pci_p2pdma_bus_addr_map(provider,
		phys_vec[i].paddr);
		} else if (dma_use_iova(dma->state)) {
		ret = dma_iova_link(attach->dev, dma->state,
		phys_vec[i].paddr, 0,
		phys_vec[i].len, dir,
		DMA_ATTR_MMIO);
		if (ret)
		goto err_unmap_dma;

		mapped_len += phys_vec[i].len;
		} else {
		addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
		phys_vec[i].len, dir,
		DMA_ATTR_MMIO);
		ret = dma_mapping_error(attach->dev, addr);
		if (ret)
		goto err_unmap_dma;
		}

		if (!dma->state || !dma_use_iova(dma->state))
		sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
		}

		if (dma->state && dma_use_iova(dma->state)) {
		WARN_ON_ONCE(mapped_len != size);
		ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
		if (ret)
		goto err_unmap_dma;

		sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
		}

		dma->size = size;

		/*
		* No CPU list included — set orig_nents = 0 so others can detect
		* this via SG table (use nents only).
		*/
		dma->sgt.orig_nents = 0;


		/*
		* SGL must be NULL to indicate that SGL is the last one
		* and we allocated correct number of entries in sg_alloc_table()
		*/
		WARN_ON_ONCE(sgl);
		return &dma->sgt;

		err_unmap_dma:
		if (!i || !dma->state) {
		; /* Do nothing */
		} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
		DMA_ATTR_MMIO);
		} else {
		for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
		dma_unmap_phys(attach->dev, sg_dma_address(sgl),
		sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
		}
		sg_free_table(&dma->sgt);
		err_free_state:
		kfree(dma->state);
		err_free_dma:
		kfree(dma);
		return ERR_PTR(ret);
		}
		EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");

		/**
		* dma_buf_free_sgt - unmaps the buffer
		* @attach: [in] attachment to unmap buffer from
		* @sgt: [in] scatterlist info of the buffer to unmap
		* @dir: [in] direction of DMA transfer
		*
		* This unmaps a DMA mapping for @attach obtained
		* by dma_buf_phys_vec_to_sgt().
		*/
		void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
		enum dma_data_direction dir)
		{
		struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
		int i;

		dma_resv_assert_held(attach->dmabuf->resv);

		if (!dma->state) {
		; /* Do nothing */
		} else if (dma_use_iova(dma->state)) {
		dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
		DMA_ATTR_MMIO);
		} else {
		struct scatterlist *sgl;

		for_each_sgtable_dma_sg(sgt, sgl, i)
		dma_unmap_phys(attach->dev, sg_dma_address(sgl),
		sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
		}

		sg_free_table(sgt);
		kfree(dma->state);
		kfree(dma);

		}
		EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");
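
Illustrative note, not part of the commit: both calc_sg_nents() and fill_sg_entry() above rely on the same arithmetic, namely that a scatterlist DMA length is an unsigned int, so a range longer than UINT_MAX bytes has to be split across several entries. The userspace sketch below reproduces only that arithmetic so it can be checked in isolation. `struct dma_chunk`, `chunks_needed()` and `split_range()` are hypothetical names, and the sketch uses 64-bit arithmetic throughout rather than mirroring the kernel types.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-only scatterlist entry. */
struct dma_chunk {
	uint64_t addr;    /* DMA address of this piece */
	unsigned int len; /* length of this piece, at most UINT_MAX */
};

/* Entries needed when each one holds at most UINT_MAX bytes (round up). */
static size_t chunks_needed(uint64_t length)
{
	return (size_t)((length + UINT_MAX - 1) / UINT_MAX);
}

/*
 * Split [addr, addr + length) into consecutive pieces of at most UINT_MAX
 * bytes and return how many pieces were written to out[].
 */
static size_t split_range(uint64_t addr, uint64_t length,
			  struct dma_chunk *out, size_t max_out)
{
	size_t nents = chunks_needed(length);
	size_t i;

	assert(nents <= max_out);
	for (i = 0; i < nents; i++) {
		uint64_t len = length < UINT_MAX ? length : UINT_MAX;

		/* Widen before multiplying so the offset cannot wrap at 32 bits. */
		out[i].addr = addr + (uint64_t)i * UINT_MAX;
		out[i].len = (unsigned int)len;
		length -= len;
	}
	return nents;
}
```

For example, a 5 GiB range needs two entries: the first carries UINT_MAX bytes and the second carries the remainder, starting UINT_MAX bytes after the base address.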

drivers/iommu/dma-iommu.c

+2 −2

		@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
		* as a bus address, __finalise_sg() will copy the dma
		* address into the output segment.
		*/
		-	s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
		-	sg_phys(s));
		+	s->dma_address = pci_p2pdma_bus_addr_map(
		+	p2pdma_state.mem, sg_phys(s));
		sg_dma_len(s) = sg->length;
		sg_dma_mark_bus_address(s);
		continue;