Commit ca0b04ba authored by Linus Torvalds

Merge tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux

Pull io_uring zero-copy receive support from Jens Axboe:
 "This adds support for zero-copy receive with io_uring, enabling fast
  bulk receive of data directly into application memory, rather than
  needing to copy the data out of kernel memory.

  While this version only supports host memory as that was the initial
  target, other memory types are planned as well, with notably GPU
  memory coming next.

  This work depends on some networking components which were queued up
  on the networking side, but have now landed in your tree.

  This is the work of Pavel Begunkov and David Wei. From the v14 posting:

    'We configure a page pool that a driver uses to fill a hw rx queue
     to hand out user pages instead of kernel pages. Any data that ends
     up hitting this hw rx queue will thus be dma'd into userspace
     memory directly, without needing to be bounced through kernel
     memory. 'Reading' data out of a socket instead becomes a
     _notification_ mechanism, where the kernel tells userspace where
     the data is. The overall approach is similar to the devmem TCP
     proposal.

     This relies on hw header/data split, flow steering and RSS to
     ensure packet headers remain in kernel memory and only desired
     flows hit a hw rx queue configured for zero copy. Configuring this
     is outside of the scope of this patchset.

     We share netdev core infra with devmem TCP. The main difference is
     that io_uring is used for the uAPI and the lifetime of all objects
     is bound to an io_uring instance. Data is 'read' using a new
     io_uring request type. When done, data is returned via a new shared
     refill queue. A zero copy page pool refills a hw rx queue from this
     refill queue directly. Of course, the lifetime of these data
     buffers is managed by io_uring rather than the networking stack,
     with different refcounting rules.

     This patchset is the first step adding basic zero copy support. We
     will extend this iteratively with new features e.g. dynamically
     allocated zero copy areas, THP support, dmabuf support, improved
     copy fallback, general optimisations and more'

  In a local setup, I was able to saturate a 200G link with a single CPU
  core, and at netdev conf 0x19 earlier this month, Jamal reported
  188Gbit of bandwidth using a single core (no HT, including soft-irq).

  Safe to say the efficiency is there, as bigger links would be needed
  to find the per-core limit, and it's considerably more efficient and
  faster than the existing devmem solution"

* tag 'for-6.15/io_uring-rx-zc-20250325' of git://git.kernel.dk/linux:
  io_uring/zcrx: add selftest case for recvzc with read limit
  io_uring/zcrx: add a read limit to recvzc requests
  io_uring: add missing IORING_MAP_OFF_ZCRX_REGION in io_uring_mmap
  io_uring: Rename KConfig to Kconfig
  io_uring/zcrx: fix leaks on failed registration
  io_uring/zcrx: recheck ifq on shutdown
  io_uring/zcrx: add selftest
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: set pp memory provider for an rx queue
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: grab a net device
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add interface queue and refill queue
parents 15cb9a2b 89baa22d
+1 −0
@@ -63,6 +63,7 @@ Contents:
   gtp
   ila
   ioam6-sysctl
   iou-zcrx
   ip_dynaddr
   ipsec
   ip-sysctl
+202 −0
.. SPDX-License-Identifier: GPL-2.0

=====================
io_uring zero copy Rx
=====================

Introduction
============

io_uring zero copy Rx (ZC Rx) is a feature that removes the kernel-to-user copy
on the network receive path, allowing packet data to be received directly into
userspace memory. This feature differs from TCP_ZEROCOPY_RECEIVE in that
there are no strict alignment requirements and no need to mmap()/munmap().
Unlike kernel bypass solutions such as DPDK, the packet headers are
processed by the kernel TCP stack as normal.

NIC HW Requirements
===================

Several NIC HW features are required for io_uring ZC Rx to work. For now the
kernel API does not configure the NIC, so the user must do it.

Header/data split
-----------------

Required to split packets at the L4 boundary into a header and a payload.
Headers are received into kernel memory as normal and processed by the TCP
stack as normal. Payloads are received into userspace memory directly.

Flow steering
-------------

Specific HW Rx queues are configured for this feature, but modern NICs
typically distribute flows across all HW Rx queues. Flow steering is required
to ensure that only desired flows are directed towards HW queues that are
configured for io_uring ZC Rx.

RSS
---

In addition to flow steering above, RSS is required to steer all other flows
(those not using zero copy) away from queues that are configured for io_uring
ZC Rx.

Usage
=====

Setup NIC
---------

Must be done out of band for now.

Ensure there are at least two queues::

  ethtool -L eth0 combined 2

Enable header/data split::

  ethtool -G eth0 tcp-data-split on

Carve out half of the HW Rx queues for zero copy by restricting RSS to the first
queue::

  ethtool -X eth0 equal 1

Set up flow steering, bearing in mind that queues are 0-indexed::

  ethtool -N eth0 flow-type tcp6 ... action 1

Setup io_uring
--------------

This section describes the low level io_uring kernel API. Please refer to
liburing documentation for how to use the higher level API.

Create an io_uring instance with the following required setup flags::

  IORING_SETUP_SINGLE_ISSUER
  IORING_SETUP_DEFER_TASKRUN
  IORING_SETUP_CQE32
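
The three flags are ORed into the ``flags`` field of the setup parameters. A
minimal sketch of a sanity check, assuming the bit values below match
``include/uapi/linux/io_uring.h`` (they are repeated only so the snippet builds
without kernel headers; real code should include the header).
``ZCRX_REQUIRED_SETUP_FLAGS`` and ``zcrx_setup_flags_ok()`` are hypothetical
names used for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror include/uapi/linux/io_uring.h; prefer the real header. */
#ifndef IORING_SETUP_CQE32
#define IORING_SETUP_CQE32		(1U << 11)
#endif
#ifndef IORING_SETUP_SINGLE_ISSUER
#define IORING_SETUP_SINGLE_ISSUER	(1U << 12)
#endif
#ifndef IORING_SETUP_DEFER_TASKRUN
#define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
#endif

/* Hypothetical convenience macro: every flag this document lists as required. */
#define ZCRX_REQUIRED_SETUP_FLAGS \
	(IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN | IORING_SETUP_CQE32)

/* Hypothetical helper: true only if all required flags are present. */
static inline bool zcrx_setup_flags_ok(uint32_t flags)
{
	return (flags & ZCRX_REQUIRED_SETUP_FLAGS) == ZCRX_REQUIRED_SETUP_FLAGS;
}
```

Other setup flags may be added on top; the check only verifies that none of the
required ones is missing.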

Create memory area
------------------

Allocate userspace memory area for receiving zero copy data::

  /* fd is ignored for MAP_ANONYMOUS, but portable code passes -1 */
  void *area_ptr = mmap(NULL, area_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        -1, 0);

Create refill ring
------------------

Allocate memory for a shared ringbuf used for returning consumed buffers::

  void *ring_ptr = mmap(NULL, ring_size,
                        PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE,
                        0, 0);

This refill ring consists of some space for the header, followed by an array of
``struct io_uring_zcrx_rqe``. Compute ``ring_size`` as follows before calling
mmap() above::

  size_t rq_entries = 4096;
  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
  /* align to page size */
  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
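
The same calculation can be wrapped in a helper that queries the page size at
run time instead of relying on a ``PAGE_SIZE`` macro. This is only a sketch:
``zcrx_refill_ring_size()`` and ``struct zcrx_rqe_sketch`` are hypothetical
names, and the struct is a local copy of the 16-byte ``struct
io_uring_zcrx_rqe`` layout so the snippet builds without kernel headers:

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Local stand-in for struct io_uring_zcrx_rqe (u64 off, u32 len, u32 pad). */
struct zcrx_rqe_sketch {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/*
 * Hypothetical helper: one page of header space plus the rqe array,
 * rounded up to a whole number of pages.
 */
static inline size_t zcrx_refill_ring_size(size_t rq_entries)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t size = rq_entries * sizeof(struct zcrx_rqe_sketch) + page;

	/* page is a power of two, so masking rounds up correctly */
	return (size + page - 1) & ~(page - 1);
}
```

The result can be passed as the length of the ring mapping and as the ``size``
of the region descriptor.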

Register ZC Rx
--------------

Fill in registration structs::

  struct io_uring_zcrx_area_reg area_reg = {
    .addr = (__u64)(unsigned long)area_ptr,
    .len = area_size,
    .flags = 0,
  };

  struct io_uring_region_desc region_reg = {
    .user_addr = (__u64)(unsigned long)ring_ptr,
    .size = ring_size,
    .flags = IORING_MEM_REGION_TYPE_USER,
  };

  struct io_uring_zcrx_ifq_reg reg = {
    .if_idx = if_nametoindex("eth0"),
    /* this is the HW queue with desired flow steered into it */
    .if_rxq = 1,
    .rq_entries = rq_entries,
    .area_ptr = (__u64)(unsigned long)&area_reg,
    .region_ptr = (__u64)(unsigned long)&region_reg,
  };

Register with kernel::

  io_uring_register_ifq(ring, &reg);

Map refill ring
---------------

The kernel fills in fields for the refill ring in the registration ``struct
io_uring_zcrx_ifq_reg``. Map it into userspace::

  struct io_uring_zcrx_rq refill_ring;

  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
  refill_ring.rqes =
    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
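  refill_ring.rq_tail = 0;
  /* used later to wrap the tail index when recycling buffers */
  refill_ring.ring_entries = reg.rq_entries;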
  refill_ring.ring_ptr = ring_ptr;

Receiving data
--------------

Prepare a zero copy recv request::

  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;

Now, submit and wait::

  io_uring_submit_and_wait(ring, 1);

Finally, process completions::

  struct io_uring_cqe *cqe;
  unsigned int count = 0;
  unsigned int head;

  io_uring_for_each_cqe(ring, head, cqe) {
    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);

    /* 64-bit mask so the low 48 offset bits survive on 32-bit builds */
    __u64 mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
    unsigned char *data = (unsigned char *)area_ptr + (rcqe->off & mask);
    /* do something with the data */

    count++;
  }
  io_uring_cq_advance(ring, count);
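
The ``off`` field of the zero copy CQE packs two values. A sketch of the
decoding, assuming (as the uAPI comment on ``IORING_ZCRX_AREA_SHIFT`` suggests)
that the area id sits in the bits at and above the shift and the byte offset
into the area in the bits below it. The macros are repeated so the snippet
builds without kernel headers, and the ``zcrx_off_*`` helpers are hypothetical:

```c
#include <stdint.h>

/* Assumed to mirror the uAPI definitions; prefer the real header. */
#ifndef IORING_ZCRX_AREA_SHIFT
#define IORING_ZCRX_AREA_SHIFT	48
#define IORING_ZCRX_AREA_MASK	(~(((uint64_t)1 << IORING_ZCRX_AREA_SHIFT) - 1))
#endif

/* Byte offset of the payload inside the registered area. */
static inline uint64_t zcrx_off_area_offset(uint64_t off)
{
	return off & ~IORING_ZCRX_AREA_MASK;
}

/* Area id carried in the upper bits of the offset. */
static inline uint64_t zcrx_off_area_id(uint64_t off)
{
	return off >> IORING_ZCRX_AREA_SHIFT;
}

/* Pointer to the payload, given the base address of the area. */
static inline unsigned char *zcrx_off_to_data(void *area_base, uint64_t off)
{
	return (unsigned char *)area_base + zcrx_off_area_offset(off);
}
```

The payload length is not part of ``off``; it is reported in ``cqe->res``.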

Recycling buffers
-----------------

Once the data has been consumed, return the buffer to the kernel so it can be
reused. ``rcqe`` and ``cqe`` refer to the completion being processed::

  struct io_uring_zcrx_rqe *rqe;
  unsigned mask = refill_ring.ring_entries - 1;
  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];

  __u64 area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
  rqe->off = area_offset | area_reg.rq_area_token;
  rqe->len = cqe->res;
  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
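
The refill ring behaves like a single-producer ring: userspace writes an entry
and then publishes it by advancing the tail. The following is a stand-alone
sketch of that pattern using C11 atomics rather than the liburing macros. The
struct and function names are hypothetical, the structs are local stand-ins for
the real types, and the "ring full" check against the kernel-owned head is an
addition that the snippet above does not perform:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for struct io_uring_zcrx_rqe. */
struct zcrx_rqe_sketch {
	uint64_t off;
	uint32_t len;
	uint32_t pad;
};

/* Local stand-in for the userspace view of the refill ring. */
struct zcrx_refill_ring_sketch {
	_Atomic uint32_t *khead;	/* advanced by the consumer (kernel) */
	_Atomic uint32_t *ktail;	/* advanced by the producer (userspace) */
	struct zcrx_rqe_sketch *rqes;
	uint32_t rq_tail;		/* producer's private copy of the tail */
	uint32_t ring_entries;		/* must be a power of two */
};

/* Queue one buffer for reuse; returns false if the ring is full. */
static inline bool zcrx_refill_push(struct zcrx_refill_ring_sketch *rq,
				    uint64_t off, uint32_t len)
{
	uint32_t head = atomic_load_explicit(rq->khead, memory_order_acquire);

	/* unsigned subtraction stays correct when the counters wrap */
	if (rq->rq_tail - head >= rq->ring_entries)
		return false;

	struct zcrx_rqe_sketch *rqe =
		&rq->rqes[rq->rq_tail & (rq->ring_entries - 1)];
	rqe->off = off;
	rqe->len = len;
	rqe->pad = 0;

	/* release: the entry must be visible before the new tail is */
	atomic_store_explicit(rq->ktail, ++rq->rq_tail, memory_order_release);
	return true;
}
```

Here ``off`` would be the area offset ORed with ``area_reg.rq_area_token`` and
``len`` the value of ``cqe->res``, as in the snippet above.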

Testing
=======

See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
+2 −0
@@ -30,3 +30,5 @@ source "lib/Kconfig"
source "lib/Kconfig.debug"

source "Documentation/Kconfig"

source "io_uring/Kconfig"
+6 −0
@@ -40,6 +40,8 @@ enum io_uring_cmd_flags {
	IO_URING_F_TASK_DEAD		= (1 << 13),
};

struct io_zcrx_ifq;

struct io_wq_work_node {
	struct io_wq_work_node *next;
};
@@ -384,6 +386,8 @@ struct io_ring_ctx {
	struct wait_queue_head		poll_wq;
	struct io_restriction		restrictions;

	struct io_zcrx_ifq		*ifq;

	u32			pers_next;
	struct xarray		personalities;

@@ -436,6 +440,8 @@ struct io_ring_ctx {
	struct io_mapped_region		ring_region;
	/* used for optimised request parameter and wait argument passing  */
	struct io_mapped_region		param_region;
	/* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
	struct io_mapped_region		zcrx_region;
};

/*
+53 −1
@@ -87,6 +87,7 @@ struct io_uring_sqe {
	union {
		__s32	splice_fd_in;
		__u32	file_index;
		__u32	zcrx_ifq_idx;
		__u32	optlen;
		struct {
			__u16	addr_len;
@@ -278,6 +279,7 @@ enum io_uring_op {
	IORING_OP_FTRUNCATE,
	IORING_OP_BIND,
	IORING_OP_LISTEN,
	IORING_OP_RECV_ZC,

	/* this goes last, obviously */
	IORING_OP_LAST,
@@ -641,7 +643,8 @@ enum io_uring_register_op {
	/* send MSG_RING without having a ring */
	IORING_REGISTER_SEND_MSG_RING		= 31,

	/* 32 reserved for zc rx */
	/* register a netdev hw rx queue for zerocopy */
	IORING_REGISTER_ZCRX_IFQ		= 32,

	/* resize CQ ring */
	IORING_REGISTER_RESIZE_RINGS		= 33,
@@ -958,6 +961,55 @@ enum io_uring_socket_op {
	SOCKET_URING_OP_SETSOCKOPT,
};

/* Zero copy receive refill queue entry */
struct io_uring_zcrx_rqe {
	__u64	off;
	__u32	len;
	__u32	__pad;
};

struct io_uring_zcrx_cqe {
	__u64	off;
	__u64	__pad;
};

/* The bit from which area id is encoded into offsets */
#define IORING_ZCRX_AREA_SHIFT	48
#define IORING_ZCRX_AREA_MASK	(~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))

struct io_uring_zcrx_offsets {
	__u32	head;
	__u32	tail;
	__u32	rqes;
	__u32	__resv2;
	__u64	__resv[2];
};

struct io_uring_zcrx_area_reg {
	__u64	addr;
	__u64	len;
	__u64	rq_area_token;
	__u32	flags;
	__u32	__resv1;
	__u64	__resv2[2];
};

/*
 * Argument for IORING_REGISTER_ZCRX_IFQ
 */
struct io_uring_zcrx_ifq_reg {
	__u32	if_idx;
	__u32	if_rxq;
	__u32	rq_entries;
	__u32	flags;

	__u64	area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
	__u64	region_ptr; /* struct io_uring_region_desc * */

	struct io_uring_zcrx_offsets offsets;
	__u64	__resv[4];
};

#ifdef __cplusplus
}
#endif