Commit e331673a authored by Jakub Kicinski

Merge branch 'device-memory-tcp'

Mina Almasry says:

====================
Device Memory TCP

Device memory TCP (devmem TCP) is a proposal for transferring data
to and/or from device memory efficiently, without bouncing the data
to a host memory buffer.

* Problem:

A large number of data transfers have device memory as the source
and/or destination. Accelerators drastically increased the volume
of such transfers. Some examples include:

- ML accelerators transferring large amounts of training data from storage
  into GPU/TPU memory. In some cases ML training setup time can be as long
  as 50% of TPU compute time; improving data transfer throughput and
  efficiency can help improve GPU/TPU utilization.

- Distributed training, where ML accelerators, such as GPUs on different
  hosts, exchange data among them.

- Distributed raw block storage applications transfer large amounts of
  data with remote SSDs. Much of this data does not require host
  processing.

Today, the majority of Device-to-Device data transfers over the network
are implemented as the following low level operations: Device-to-Host
copy, Host-to-Host network transfer, and Host-to-Device copy.

The implementation is suboptimal, especially for bulk data transfers,
and can put significant strain on system resources, such as host memory
bandwidth, PCIe bandwidth, etc. One important reason behind the current
state is the kernel’s lack of semantics to express device to network
transfers.

* Proposal:

In this patch series we attempt to optimize this use case by implementing
socket APIs that enable the user to:

1. send device memory across the network directly, and
2. receive incoming network packets directly into device memory.

Packet _payloads_ go directly from the NIC to device memory for receive
and from device memory to NIC for transmit.
Packet _headers_ go to/from host memory and are processed by the TCP/IP
stack normally. The NIC _must_ support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe BW pressure, by limiting data transfer to the lowest level
  of the PCIe tree, compared to traditional path which sends data through
  the root complex.

* Patch overview:

** Part 1: netlink API

Gives user ability to bind dma-buf to an RX queue.

** Part 2: scatterlist support

Currently the standard for device memory sharing is DMABUF, which doesn't
generate struct pages. On the other hand, the networking stack (skbs, drivers,
and page pool) operates on pages. We have 2 options:

1. Generate struct pages for dmabuf device memory, or,
2. Modify the networking stack to process scatterlist.

Approach #1 was attempted in RFC v1. RFC v2 implements approach #2.

** Part 3: page pool support

We piggyback on the page pool memory providers proposal:
https://github.com/kuba-moo/linux/tree/pp-providers

It allows the page pool to define a memory provider that provides the
page allocation and freeing. It helps abstract most of the device memory
TCP changes from the driver.

** Part 4: support for unreadable skb frags

Page pool iovs are not accessible by the host; we implement changes
throughout the networking stack to correctly handle skbs with unreadable
frags.

** Part 5: recvmsg() APIs

We define user APIs for the user to send and receive device memory.

Not included with this series is the GVE devmem TCP support, just to
simplify the review. Code available here if desired:
https://github.com/mina/linux/tree/tcpdevmem

This series is built on top of net-next with Jakub's pp-providers changes
cherry-picked.

* NIC dependencies:

1. (strict) Devmem TCP requires the NIC to support header split, i.e. the
   capability to split incoming packets into a header + payload and to put
   each into a separate buffer. Devmem TCP works by using device memory
   for the packet payload, and host memory for the packet headers.

2. (optional) Devmem TCP works better with flow steering support & RSS
   support, i.e. the NIC's ability to steer flows into certain rx queues.
   This allows the sysadmin to enable devmem TCP on a subset of the rx
   queues, and steer devmem TCP traffic onto these queues and non devmem
   TCP elsewhere.

The NIC I have access to with these properties is the GVE with DQO support
running in Google Cloud, but any NIC that supports these features would
suffice. I may be able to help reviewers bring up devmem TCP on their NICs.

* Testing:

The series includes a udmabuf kselftest that shows a simple use case of
devmem TCP and validates the entire data path end to end without
a dependency on a specific dmabuf provider.

** Test Setup

Kernel: net-next with this series and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.
====================

Link: https://patch.msgid.link/20240910171458.219195-1-almasrymina@google.com


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 24b8c193 d0caf987
+61 −0
@@ -167,6 +167,10 @@ attribute-sets:
          "re-attached", they are just waiting to disappear.
          Attribute is absent if Page Pool has not been detached, and
          can still be used to allocate new memory.
      -
        name: dmabuf
        doc: ID of the dmabuf this page-pool is attached to.
        type: u32
  -
    name: page-pool-info
    subset-of: page-pool
@@ -268,6 +272,10 @@ attribute-sets:
        name: napi-id
        doc: ID of the NAPI instance which services this queue.
        type: u32
      -
        name: dmabuf
        doc: ID of the dmabuf attached to this queue, if any.
        type: u32

  -
    name: qstats
@@ -457,6 +465,39 @@ attribute-sets:
          Number of times driver re-started accepting send
          requests to this queue from the stack.
        type: uint
  -
    name: queue-id
    subset-of: queue
    attributes:
      -
        name: id
      -
        name: type
  -
    name: dmabuf
    attributes:
      -
        name: ifindex
        doc: netdev ifindex to bind the dmabuf to.
        type: u32
        checks:
          min: 1
      -
        name: queues
        doc: receive queues to bind the dmabuf to.
        type: nest
        nested-attributes: queue-id
        multi-attr: true
      -
        name: fd
        doc: dmabuf file descriptor to bind.
        type: u32
      -
        name: id
        doc: id of the dmabuf binding
        type: u32
        checks:
          min: 1

operations:
  list:
@@ -510,6 +551,7 @@ operations:
            - inflight
            - inflight-mem
            - detach-time
            - dmabuf
      dump:
        reply: *pp-reply
      config-cond: page-pool
@@ -574,6 +616,7 @@ operations:
            - type
            - napi-id
            - ifindex
            - dmabuf
      dump:
        request:
          attributes:
@@ -619,6 +662,24 @@ operations:
            - rx-bytes
            - tx-packets
            - tx-bytes
    -
      name: bind-rx
      doc: Bind dmabuf to netdev
      attribute-set: dmabuf
      flags: [ admin-perm ]
      do:
        request:
          attributes:
            - ifindex
            - fd
            - queues
        reply:
          attributes:
            - id

kernel-family:
  headers: [ "linux/list.h"]
  sock-priv: struct list_head

mcast-groups:
  list:
+269 −0
.. SPDX-License-Identifier: GPL-2.0

=================
Device Memory TCP
=================


Intro
=====

Device memory TCP (devmem TCP) enables receiving data directly into device
memory (dmabuf). The feature is currently implemented for TCP sockets.


Opportunity
-----------

A large number of data transfers have device memory as the source and/or
destination. Accelerators drastically increased the prevalence of such
transfers.  Some examples include:

- Distributed training, where ML accelerators, such as GPUs on different hosts,
  exchange data.

- Distributed raw block storage applications transfer large amounts of data with
  remote SSDs. Much of this data does not require host processing.

Typically the Device-to-Device data transfers in the network are implemented as
the following low-level operations: Device-to-Host copy, Host-to-Host network
transfer, and Host-to-Device copy.

The flow involving host copies is suboptimal, especially for bulk data transfers,
and can put significant strains on system resources such as host memory
bandwidth and PCIe bandwidth.

Devmem TCP optimizes this use case by implementing socket APIs that enable
the user to receive incoming network packets directly into device memory.

Packet payloads go directly from the NIC to device memory.

Packet headers go to host memory and are processed by the TCP/IP stack
normally. The NIC must support header split to achieve this.

Advantages:

- Alleviate host memory bandwidth pressure, compared to existing
  network-transfer + device-copy semantics.

- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest
  level of the PCIe tree, compared to the traditional path which sends data
  through the root complex.


More Info
---------

  slides, video
    https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html

  patchset
    [PATCH net-next v24 00/13] Device Memory TCP
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


Interface
=========


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.


NIC Setup
---------

Header split, flow steering, & RSS are required features for devmem TCP.

Header split is used to split incoming packets into a header buffer in host
memory, and a payload buffer in device memory.

Flow steering & RSS are used to ensure that only flows targeting devmem land on
an RX queue bound to devmem.

Enable header split & flow steering::

	# enable header split
	ethtool -G eth1 tcp-data-split on


	# enable flow steering
	ethtool -K eth1 ntuple on

Configure RSS to steer all traffic away from the target RX queue (queue 15 in
this example)::

	ethtool --set-rxfh-indir eth1 equal 15
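
To illustrate why ``equal 15`` keeps regular traffic off queue 15: ethtool's
``equal N`` option spreads the indirection table entries evenly over the first
N RX queues (0 through N-1). The following is a minimal sketch of that fill
rule, not ethtool's actual code. The helper names and the table size used in
the example are hypothetical; the real table size is NIC-specific.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill an RSS indirection table the way
 * "ethtool --set-rxfh-indir <dev> equal N" is understood to behave,
 * i.e. entry i points at queue (i % n_queues).
 */
void rss_fill_equal(uint32_t *table, size_t table_size, uint32_t n_queues)
{
	size_t i;

	assert(table != NULL && n_queues > 0);

	for (i = 0; i < table_size; i++)
		table[i] = (uint32_t)(i % n_queues);
}

/* Hypothetical helper: return 1 if any table entry points at 'queue'. */
int rss_table_uses_queue(const uint32_t *table, size_t table_size,
			 uint32_t queue)
{
	size_t i;

	for (i = 0; i < table_size; i++)
		if (table[i] == queue)
			return 1;

	return 0;
}
```

With ``n_queues`` set to 15, no entry refers to queue 15, so hashed (non-steered)
flows never land on the queue that will be bound to the dmabuf.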


The user must bind a dmabuf to any number of RX queues on a given NIC using
the netlink API::

	/* Bind dmabuf to NIC RX queue 15 */
	/* Names below follow the attributes in the netdev netlink spec
	 * (queue-id: id, type; dmabuf: ifindex, fd, queues, id).
	 */
	struct netdev_queue_id *queues;
	queues = malloc(sizeof(*queues) * 1);

	/* Mark both attributes as present, then fill them in */
	queues[0]._present.type = 1;
	queues[0]._present.id = 1;
	queues[0].type = NETDEV_QUEUE_TYPE_RX;
	queues[0].id = 15;

	*ys = ynl_sock_create(&ynl_netdev_family, &yerr);

	req = netdev_bind_rx_req_alloc();
	netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */);
	netdev_bind_rx_req_set_fd(req, dmabuf_fd);
	__netdev_bind_rx_req_set_queues(req, queues, 1 /* number of queues */);

	rsp = netdev_bind_rx(*ys, req);

	/* ID of the new binding, from the "id" reply attribute */
	dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.


Socket Setup
------------

The socket must be flow steered to the dmabuf bound RX queue::

	ethtool -N eth1 flow-type tcp4 ... queue 15


Receiving data
--------------

The user application must signal to the kernel that it is capable of receiving
devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg::

	ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM);

Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT
on devmem data.

Devmem data is received directly into the dmabuf bound to the NIC in 'NIC
Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs::

		for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
			if (cm->cmsg_level != SOL_SOCKET ||
				(cm->cmsg_type != SCM_DEVMEM_DMABUF &&
				 cm->cmsg_type != SCM_DEVMEM_LINEAR))
				continue;

			dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm);

			if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
				/* Frag landed in dmabuf.
				 *
				 * dmabuf_cmsg->dmabuf_id is the dmabuf the
				 * frag landed on.
				 *
				 * dmabuf_cmsg->frag_offset is the offset into
				 * the dmabuf where the frag starts.
				 *
				 * dmabuf_cmsg->frag_size is the size of the
				 * frag.
				 *
				 * dmabuf_cmsg->frag_token is a token used to
				 * refer to this frag for later freeing.
				 */

				struct dmabuf_token token;
				token.token_start = dmabuf_cmsg->frag_token;
				token.token_count = 1;
				continue;
			}

			if (cm->cmsg_type == SCM_DEVMEM_LINEAR)
				/* Frag landed in linear buffer.
				 *
				 * dmabuf_cmsg->frag_size is the size of the
				 * frag.
				 */
				continue;

		}

Applications may receive 2 cmsgs:

- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated
  by dmabuf_id.

- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer.
  This typically happens when the NIC is unable to split the packet at the
  header boundary, such that part (or all) of the payload landed in host
  memory.

Applications may receive no SCM_DEVMEM_* cmsgs. That indicates non-devmem,
regular TCP data that landed on an RX queue not bound to a dmabuf.
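
The three cases above (dmabuf frag, linear frag, no devmem cmsg) can be told
apart with a small accounting helper. The following is a minimal sketch under
stated assumptions: ``struct devmem_frag_info`` is a hypothetical local mirror
of the ``struct dmabuf_cmsg`` fields named in this document (real code should
use the kernel's definition), and the helper and summary struct names are
illustrative only. The fallback constants are the values from the uapi
``socket.h`` hunk in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_DEVMEM_LINEAR
#define SCM_DEVMEM_LINEAR 78
#endif
#ifndef SCM_DEVMEM_DMABUF
#define SCM_DEVMEM_DMABUF 79
#endif

/* Hypothetical mirror of struct dmabuf_cmsg; the layout is an assumption. */
struct devmem_frag_info {
	uint64_t frag_offset;	/* offset into the dmabuf */
	uint32_t frag_size;	/* size of the frag */
	uint32_t frag_token;	/* token to return via SO_DEVMEM_DONTNEED */
	uint32_t dmabuf_id;	/* dmabuf the frag landed in */
	uint32_t flags;
};

/* Illustrative per-recvmsg() accounting. */
struct devmem_rx_summary {
	size_t dmabuf_bytes;	/* payload that landed in the dmabuf */
	size_t linear_bytes;	/* payload that landed in host memory */
	unsigned int dmabuf_frags;
};

/* Walk the control messages of 'msg' and tally devmem frags.
 * All fields stay zero if no SCM_DEVMEM_* cmsg is present.
 */
void devmem_summarize_cmsgs(struct msghdr *msg, struct devmem_rx_summary *out)
{
	struct devmem_frag_info frag;
	struct cmsghdr *cm;

	assert(msg != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != SOL_SOCKET)
			continue;
		if (cm->cmsg_type != SCM_DEVMEM_DMABUF &&
		    cm->cmsg_type != SCM_DEVMEM_LINEAR)
			continue;
		if (cm->cmsg_len < CMSG_LEN(sizeof(frag)))
			continue;	/* too short to hold a frag record */

		/* memcpy avoids any alignment assumption on CMSG_DATA() */
		memcpy(&frag, CMSG_DATA(cm), sizeof(frag));

		if (cm->cmsg_type == SCM_DEVMEM_DMABUF) {
			out->dmabuf_bytes += frag.frag_size;
			out->dmabuf_frags++;
		} else {
			out->linear_bytes += frag.frag_size;
		}
	}
}
```

A caller would run this on the ``msghdr`` filled in by ``recvmsg()``. A non-zero
``linear_bytes`` suggests the NIC could not split some packets at the header
boundary.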


Freeing frags
-------------

Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user
processes the frag. The user must return the frag to the kernel via
SO_DEVMEM_DONTNEED::

	ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token,
			 sizeof(token));

The user must ensure the tokens are returned to the kernel in a timely manner.
Failure to do so will exhaust the limited dmabuf that is bound to the RX queue
and will lead to packet drops.
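
Because tokens must be returned promptly, an application may want to return
several at once. The following is a minimal sketch of coalescing tokens into
ranges. It assumes that ``token_count`` covers that many consecutive tokens
starting at ``token_start``, which the field names suggest but this document
does not state. ``struct devmem_token_range`` is a hypothetical local mirror of
``struct dmabuf_token``, and the helper name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct dmabuf_token. */
struct devmem_token_range {
	uint32_t token_start;
	uint32_t token_count;
};

/* Merge runs of consecutive tokens (in arrival order) into ranges.
 *
 * 'out' must have room for n_tokens entries, which is the worst case
 * when no two tokens are adjacent. Returns the number of ranges written.
 */
size_t devmem_coalesce_tokens(const uint32_t *tokens, size_t n_tokens,
			      struct devmem_token_range *out)
{
	size_t n_out = 0;
	size_t i;

	assert(out != NULL);
	assert(tokens != NULL || n_tokens == 0);

	for (i = 0; i < n_tokens; i++) {
		struct devmem_token_range *last = n_out ? &out[n_out - 1] : NULL;

		/* Extend the previous range if this token directly follows it */
		if (last && last->token_start + last->token_count == tokens[i]) {
			last->token_count++;
			continue;
		}

		out[n_out].token_start = tokens[i];
		out[n_out].token_count = 1;
		n_out++;
	}

	return n_out;
}
```

Each resulting range could then be copied into a ``struct dmabuf_token`` and
passed to ``setsockopt()`` with ``SO_DEVMEM_DONTNEED`` as shown above. Whether
several ranges may be passed in a single call is not covered by this document.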


Implementation & Caveats
========================

Unreadable skbs
---------------

Devmem payloads are inaccessible to the kernel processing the packets. This
results in a few quirks for payloads of devmem skbs:

- Loopback is not functional. Loopback relies on copying the payload, which is
  not possible with devmem skbs.

- Software checksum calculation fails.

- tcpdump and BPF can't access devmem packet payloads.


Testing
=======

More realistic example code can be found in the kernel source under
``tools/testing/selftests/net/ncdevmem.c``

ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but
receives data directly into a udmabuf.

To run ncdevmem, run it as a server on the machine under test, and run netcat
on a peer to provide the TX data.

ncdevmem has a validation mode as well that expects a repeating pattern of
incoming data and validates it as such. For example, you can launch
ncdevmem on the server by::

	ncdevmem -s <server IP> -c <client IP> -f eth1 -d 3 -n 0000:06:00.0 -l \
		 -p 5201 -v 7

On client side, use regular netcat to send TX data to ncdevmem process
on the server::

	yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \
		tr \\n \\0 | head -c 5G | nc <server IP> 5201 -p 5201
+1 −0
@@ -49,6 +49,7 @@ Contents:
   cdc_mbim
   dccp
   dctcp
   devmem
   dns_resolver
   driver
   eql
+6 −0
@@ -140,6 +140,12 @@
#define SO_PASSPIDFD		76
#define SO_PEERPIDFD		77

#define SO_DEVMEM_LINEAR	78
#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
#define SO_DEVMEM_DMABUF	79
#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
#define SO_DEVMEM_DONTNEED	80

#if !defined(__KERNEL__)

#if __BITS_PER_LONG == 64
+6 −0
@@ -151,6 +151,12 @@
#define SO_PASSPIDFD		76
#define SO_PEERPIDFD		77

#define SO_DEVMEM_LINEAR	78
#define SCM_DEVMEM_LINEAR	SO_DEVMEM_LINEAR
#define SO_DEVMEM_DMABUF	79
#define SCM_DEVMEM_DMABUF	SO_DEVMEM_DMABUF
#define SO_DEVMEM_DONTNEED	80

#if !defined(__KERNEL__)

#if __BITS_PER_LONG == 64