Merge branch 'device-memory-tcp' (e331673a) · Commits · git / linux-nf

Documentation/netlink/specs/netdev.yaml

+61 −0

Original line number	Diff line number	Diff line
		@@ -167,6 +167,10 @@ attribute-sets:
		"re-attached", they are just waiting to disappear.
		Attribute is absent if Page Pool has not been detached, and
		can still be used to allocate new memory.
		-
		name: dmabuf
		doc: ID of the dmabuf this page-pool is attached to.
		type: u32
		-
		name: page-pool-info
		subset-of: page-pool
		@@ -268,6 +272,10 @@ attribute-sets:
		name: napi-id
		doc: ID of the NAPI instance which services this queue.
		type: u32
		-
		name: dmabuf
		doc: ID of the dmabuf attached to this queue, if any.
		type: u32

		-
		name: qstats
		@@ -457,6 +465,39 @@ attribute-sets:
		Number of times driver re-started accepting send
		requests to this queue from the stack.
		type: uint
		-
		name: queue-id
		subset-of: queue
		attributes:
		-
		name: id
		-
		name: type
		-
		name: dmabuf
		attributes:
		-
		name: ifindex
		doc: netdev ifindex to bind the dmabuf to.
		type: u32
		checks:
		min: 1
		-
		name: queues
		doc: receive queues to bind the dmabuf to.
		type: nest
		nested-attributes: queue-id
		multi-attr: true
		-
		name: fd
		doc: dmabuf file descriptor to bind.
		type: u32
		-
		name: id
		doc: id of the dmabuf binding
		type: u32
		checks:
		min: 1

		operations:
		list:
		@@ -510,6 +551,7 @@ operations:
		- inflight
		- inflight-mem
		- detach-time
		- dmabuf
		dump:
		reply: *pp-reply
		config-cond: page-pool
		@@ -574,6 +616,7 @@ operations:
		- type
		- napi-id
		- ifindex
		- dmabuf
		dump:
		request:
		attributes:
		@@ -619,6 +662,24 @@ operations:
		- rx-bytes
		- tx-packets
		- tx-bytes
		-
		name: bind-rx
		doc: Bind dmabuf to netdev
		attribute-set: dmabuf
		flags: [ admin-perm ]
		do:
		request:
		attributes:
		- ifindex
		- fd
		- queues
		reply:
		attributes:
		- id

		kernel-family:
		headers: [ "linux/list.h"]
		sock-priv: struct list_head

		mcast-groups:
		list:

Documentation/networking/devmem.rst

0 → 100644

+269 −0

		.. SPDX-License-Identifier: GPL-2.0

		=================
		Device Memory TCP
		=================


		Intro
		=====

		Device memory TCP (devmem TCP) enables receiving data directly into device
		memory (dmabuf). The feature is currently implemented for TCP sockets.


		Opportunity
		-----------

		A large number of data transfers have device memory as the source and/or
		destination. Accelerators drastically increased the prevalence of such
		transfers. Some examples include:

		- Distributed training, where ML accelerators, such as GPUs on different hosts,
		exchange data.

		- Distributed raw block storage applications transfer large amounts of data with
		remote SSDs. Much of this data does not require host processing.

		Typically the Device-to-Device data transfers in the network are implemented as
		the following low-level operations: Device-to-Host copy, Host-to-Host network
		transfer, and Host-to-Device copy.

		The flow involving host copies is suboptimal, especially for bulk data transfers,
		and can put significant strain on system resources such as host memory
		bandwidth and PCIe bandwidth.

		Devmem TCP optimizes this use case by implementing socket APIs that enable
		the user to receive incoming network packets directly into device memory.

		Packet payloads go directly from the NIC to device memory.

		Packet headers go to host memory and are processed by the TCP/IP stack
		normally. The NIC must support header split to achieve this.

		Advantages:

		- Alleviate host memory bandwidth pressure, compared to existing
		network-transfer + device-copy semantics.

		- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
		level of the PCIe tree, compared to the traditional path which sends data
		through the root complex.


		More Info
		---------

		slides, video
		https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

		patchset
		[PATCH net-next v24 00/13] Device Memory TCP
		https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


		Interface
		=========


		Example
		-------

		tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
		the RX path of this API.


		NIC Setup
		---------

		Header split, flow steering, & RSS are required features for devmem TCP.

		Header split is used to split incoming packets into a header buffer in host
		memory, and a payload buffer in device memory.

		Flow steering & RSS are used to ensure that only flows targeting devmem land on
		an RX queue bound to devmem.

		Enable header split & flow steering::

		# enable header split
		ethtool -G eth1 tcp-data-split on


		# enable flow steering
		ethtool -K eth1 ntuple on

		Configure RSS to steer all traffic away from the target RX queue (queue 15 in
		this example)::

		ethtool --set-rxfh-indir eth1 equal 15


		The user must bind a dmabuf to any number of RX queues on a given NIC using
		the netlink API::

		/* Bind dmabuf to NIC RX queue 15 */
		struct netdev_queue *queues;
		queues = malloc(sizeof(*queues) * 1);

		queues[0]._present.type = 1;
		queues[0]._present.idx = 1;
		queues[0].type = NETDEV_RX_QUEUE_TYPE_RX;
		queues[0].idx = 15;

		*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

		req = netdev_bind_rx_req_alloc();
		netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
		netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd);
		__netdev_bind_rx_req_set_queues(req, queues, n_queue_index);

		rsp = netdev_bind_rx(*ys, req);

		dmabuf_id = rsp->dmabuf_id;


		The netlink API returns a dmabuf_id: a unique ID that refers to the dmabuf
		binding that was just created.

		The user can unbind the dmabuf from the netdevice by closing the netlink socket
		that established the binding. We do this so that the binding is automatically
		unbound even if the userspace process crashes.

		Note that any reasonably well-behaved dmabuf from any exporter should work with
		devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
		this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


		Socket Setup
		------------

		The socket must be flow steered to the dmabuf bound RX queue::

		ethtool -N eth1 flow-type tcp4 ... queue 15


		Receiving data
		--------------

		The user application must signal to the kernel that it is capable of receiving
		devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

		ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

		Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
		on devmem data.

		Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
		Setup', and the kernel signals this to the user via the SCM_DEVMEM_* cmsgs::

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET ||
		(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		cm->cmsg_type != SCM_DEVMEM_LINEAR))
		continue;

		dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
		/* Frag landed in dmabuf.
		*
		* dmabuf_cmsg->dmabuf_id is the dmabuf the
		* frag landed on.
		*
		* dmabuf_cmsg->frag_offset is the offset into
		* the dmabuf where the frag starts.
		*
		* dmabuf_cmsg->frag_size is the size of the
		* frag.
		*
		* dmabuf_cmsg->frag_token is a token used to
		* refer to this frag for later freeing.
		*/

		struct dmabuf_token token;
		token.token_start = dmabuf_cmsg->frag_token;
		token.token_count = 1;
		continue;
		}

		if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
		/* Frag landed in linear buffer.
		*
		* dmabuf_cmsg->frag_size is the size of the
		* frag.
		*/
		continue;

		}

		Applications may receive 2 cmsgs:

		- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
		by dmabuf_id.

		- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
		This typically happens when the NIC is unable to split the packet at the
		header boundary, such that part (or all) of the payload landed in host
		memory.

		Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
		regular TCP data that landed on an RX queue not bound to a dmabuf.
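
As a rough sketch of the cmsg handling described above, the helper below walks the control messages of one recvmsg() result and adds up how many payload bytes landed in the dmabuf and how many fell back to the linear buffer. Several things here are assumptions for illustration only: `struct devmem_frag` is a hypothetical local mirror holding just the fields named in the text (its field order is a guess, so real code should use the kernel's own cmsg structure), `tally_devmem_cmsgs()` is a made-up name, and the fallback values 78 and 79 are the ones the alpha and mips headers in this series define and may differ on other architectures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fallbacks for older headers. 78 and 79 match the alpha and mips headers in
 * this series; other architectures may use different numbers, so prefer the
 * definitions from the installed UAPI headers. */
#ifndef SCM_DEVMEM_LINEAR
#define SCM_DEVMEM_LINEAR 78
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79
#endif

/* Hypothetical local mirror of the cmsg payload. Only the fields named in the
 * text are listed and their order is an assumption. */
struct devmem_frag {
	uint64_t frag_offset;
	uint32_t frag_size;
	uint32_t frag_token;
	uint32_t dmabuf_id;
};

struct devmem_tally {
	uint64_t dmabuf_bytes;	/* payload bytes reported via SCM_DEVMEM_DMABUF */
	uint64_t linear_bytes;	/* payload bytes reported via SCM_DEVMEM_LINEAR */
};

/* Walk every cmsg in msg and sum frag sizes per cmsg type. */
static struct devmem_tally tally_devmem_cmsgs(struct msghdr *msg)
{
	struct devmem_tally tally = { 0, 0 };
	struct cmsghdr *cm;

	assert(msg != NULL);

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		struct devmem_frag frag;

		if (cm->cmsg_level != SOL_SOCKET)
			continue;
		if (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		    cm->cmsg_type != SCM_DEVMEM_LINEAR)
			continue;
		/* Skip cmsgs too short to hold the payload we expect. */
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			continue;

		/* Copy out instead of casting to avoid unaligned access. */
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF)
			tally.dmabuf_bytes += frag.frag_size;
		else
			tally.linear_bytes += frag.frag_size;
	}

	return tally;
}
```

A nonzero `linear_bytes` after a receive is a hint that the NIC could not split some packets at the header boundary, which is the case the SCM_DEVMEM_LINEAR description above covers.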


		Freeing frags
		-------------

		Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
		processes the frag. The user must return the frag to the kernel via
		SO_DEVMEM_DONTNEED::

		ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
		sizeof(token));

		The user must ensure the tokens are returned to the kernel in a timely manner.
		Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
		and will lead to packet drops.
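
One way to keep token return cheap is to batch it. The sketch below collapses a list of frag tokens into (token_start, token_count) ranges so that a run of consecutive tokens becomes a single entry. It assumes that a range with token_count greater than 1 releases that many consecutive tokens starting at token_start, and that SO_DEVMEM_DONTNEED accepts an array of such ranges in one call; neither is spelled out in the text above, so check both against the kernel you target. `struct devmem_token_range` and `coalesce_tokens()` are hypothetical names, with the struct mirroring only the two fields used in the receive example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local mirror of the token structure used in the text above. */
struct devmem_token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/*
 * Coalesce tokens[0..n_tokens) into ranges, merging each token that directly
 * follows the previous range. Stops early if out is full. Returns the number
 * of ranges written and stores the number of tokens consumed in *consumed so
 * the caller can continue with the remainder.
 */
static size_t coalesce_tokens(const uint32_t *tokens, size_t n_tokens,
			      struct devmem_token_range *out, size_t out_cap,
			      size_t *consumed)
{
	size_t used = 0;
	size_t i;

	assert(tokens != NULL && out != NULL && consumed != NULL);

	for (i = 0; i < n_tokens; i++) {
		/* Extend the last range when this token is its successor. */
		if (used > 0 &&
		    out[used - 1].token_start + out[used - 1].token_count ==
		    tokens[i]) {
			out[used - 1].token_count++;
			continue;
		}

		if (used == out_cap)
			break;

		out[used].token_start = tokens[i];
		out[used].token_count = 1;
		used++;
	}

	*consumed = i;
	return used;
}
```

Under the assumptions above, the caller would hand `out` and `used * sizeof(out[0])` to the setsockopt() call shown earlier, then repeat from `tokens + consumed` if not every token fit.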


		Implementation & Caveats
		========================

		Unreadable skbs
		---------------

		Devmem payloads are inaccessible to the kernel processing the packets. This
		results in a few quirks for payloads of devmem skbs:

		- Loopback is not functional. Loopback relies on copying the payload, which is
		not possible with devmem skbs.

		- Software checksum calculation fails.

		- tcpdump and BPF can't access devmem packet payloads.


		Testing
		=======

		More realistic example code can be found in the kernel source under
		``tools/testing/selftests/net/ncdevmem.c``

		ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
		receives data directly into a udmabuf.

		To run ncdevmem, you need to run it on a server on the machine under test, and
		you need to run netcat on a peer to provide the TX data.

		ncdevmem has a validation mode as well that expects a repeating pattern of
		incoming data and validates it as such. For example, you can launch
		ncdevmem on the server by::

		ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
		-p 5201 -v 7

		On client side, use regular netcat to send TX data to ncdevmem process
		on the server::

		yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201

Documentation/networking/index.rst

+1 −0

		@@ -49,6 +49,7 @@ Contents:
		cdc_mbim
		dccp
		dctcp
		devmem
		dns_resolver
		driver
		eql

arch/alpha/include/uapi/asm/socket.h

+6 −0

		@@ -140,6 +140,12 @@
		#define SO_PASSPIDFD 76
		#define SO_PEERPIDFD 77

		#define SO_DEVMEM_LINEAR 78
		#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR
		#define SO_DEVMEM_DMABUF 79
		#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF
		#define SO_DEVMEM_DONTNEED 80

		#if !defined(__KERNEL__)

		#if __BITS_PER_LONG == 64

arch/mips/include/uapi/asm/socket.h

+6 −0

		@@ -151,6 +151,12 @@
		#define SO_PASSPIDFD 76
		#define SO_PEERPIDFD 77

		#define SO_DEVMEM_LINEAR 78
		#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR
		#define SO_DEVMEM_DMABUF 79
		#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF
		#define SO_DEVMEM_DONTNEED 80

		#if !defined(__KERNEL__)

		#if __BITS_PER_LONG == 64