Commit ac4d1baf authored by Paolo Abeni

Merge branch 'device-memory-tcp-tx'

Mina Almasry says:

====================
Device memory TCP TX

The TX path had been dropped from the Device Memory TCP patch series
post RFCv1 [1], to make that series slightly easier to review. This
series rebases the implementation of the TX path on top of the
net_iov/netmem framework agreed upon and merged. The motivation for
the feature is thoroughly described in the docs & cover letter of the
original proposal, so I don't repeat the lengthy descriptions here, but
they are available in [1].

Full outline on usage of the TX path is detailed in the documentation
included with this series.

Test example is available via the kselftest included in the series as well.

The series is relatively small, as the TX path for this feature largely
piggybacks on the existing MSG_ZEROCOPY implementation.

Patch Overview:
---------------

1. Documentation & tests to give high level overview of the feature
   being added.

1. Add netmem refcounting needed for the TX path.

2. Devmem TX netlink API.

3. Devmem TX net stack implementation.

4. Make dma-buf unbinding scheduled work to handle TX cases where it gets
   freed from contexts where we can't sleep.

5. Add devmem TX documentation.

6. Add scaffolding enabling driver support for netmem_tx. Add helpers, driver
   feature flag, and docs to enable drivers to declare netmem_tx support.

7. Guard netmem_tx against being enabled against drivers that don't
   support it.

8. Add devmem_tx selftests. Add TX path to ncdevmem and add a test to
   devmem.py.

Testing:
--------

Testing is very similar to the devmem TCP RX path. The ncdevmem test used
for the RX path is now augmented with client functionality to test the TX
path.

* Test Setup:

Kernel: net-next with this RFC and memory provider API cherry-picked
locally.

Hardware: Google Cloud A3 VMs.

NIC: GVE with header split & RSS & flow steering support.

Performance results are not included with this version, unfortunately.
I'm having issues running the dma-buf exporter driver against the
upstream kernel on my test setup. The issues are specific to that
dma-buf exporter and do not affect this patch series. I plan to follow
up this series with perf fixes if the tests point to issues once they're
up and running.

Special thanks to Stan who took a stab at rebasing the TX implementation
on top of the netmem/net_iov framework merged. Parts of his proposal [2]
that are reused as-is are forked off into their own patches to give full
credit.

[1] https://lore.kernel.org/netdev/20240909054318.1809580-1-almasrymina@google.com/
[2] https://lore.kernel.org/netdev/20240913150913.1280238-2-sdf@fomichev.me/T/#m066dd407fbed108828e2c40ae50e3f4376ef57fd

Cc: sdf@fomichev.me
Cc: asml.silence@gmail.com
Cc: dw@davidwei.uk
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Victor Nogueira <victor@mojatatu.com>
Cc: Pedro Tammela <pctammela@mojatatu.com>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>

v14: https://lore.kernel.org/netdev/20250429032645.363766-1-almasrymina@google.com/
v13: https://lore.kernel.org/netdev/20250425204743.617260-1-almasrymina@google.com/
v12: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.com/
v11: https://lore.kernel.org/netdev/20250423031117.907681-1-almasrymina@google.com/
v10: https://lore.kernel.org/netdev/20250417231540.2780723-1-almasrymina@google.com/
v9: https://lore.kernel.org/netdev/20250415224756.152002-1-almasrymina@google.com/
v8: https://lore.kernel.org/netdev/20250308214045.1160445-1-almasrymina@google.com/
v7: https://lore.kernel.org/netdev/20250227041209.2031104-1-almasrymina@google.com/
v6: https://lore.kernel.org/netdev/20250222191517.743530-1-almasrymina@google.com/
v5: https://lore.kernel.org/netdev/20250220020914.895431-1-almasrymina@google.com/
v4: https://lore.kernel.org/netdev/20250203223916.1064540-1-almasrymina@google.com/
v3: https://patchwork.kernel.org/project/netdevbpf/list/?series=929401&state=*
RFC v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=920056&state=*
====================

Link: https://patch.msgid.link/20250508004830.4100853-1-almasrymina@google.com


Signed-off-by: Paolo Abeni <pabeni@redhat.com>
parents e39d14a7 2f1a805f
+12 −0
@@ -743,6 +743,18 @@ operations:
            - defer-hard-irqs
            - gro-flush-timeout
            - irq-suspend-timeout
    -
      name: bind-tx
      doc: Bind dmabuf to netdev for TX
      attribute-set: dmabuf
      do:
        request:
          attributes:
            - ifindex
            - fd
        reply:
          attributes:
            - id

kernel-family:
  headers: [ "net/netdev_netlink.h"]
+146 −4
@@ -62,15 +62,15 @@ More Info
    https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/


Interface
=========
RX Interface
============


Example
-------

tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
the RX path of this API.
./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
setting up the RX path of this API.


NIC Setup
@@ -235,6 +235,148 @@ can be less than the tokens provided by the user in case of:
(a) an internal kernel leak bug.
(b) the user passed more than 1024 frags.

TX Interface
============


Example
-------

./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
setting up the TX path of this API.


NIC Setup
---------

The user must bind a TX dmabuf to a given NIC using the netlink API::

        struct netdev_bind_tx_req *req = NULL;
        struct netdev_bind_tx_rsp *rsp = NULL;
        struct ynl_error yerr;

        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);

        req = netdev_bind_tx_req_alloc();
        netdev_bind_tx_req_set_ifindex(req, ifindex);
        netdev_bind_tx_req_set_fd(req, dmabuf_fd);

        rsp = netdev_bind_tx(*ys, req);

        tx_dmabuf_id = rsp->id;


The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
that has been bound.

The user can unbind the dmabuf from the netdevice by closing the netlink socket
that established the binding. We do this so that the binding is automatically
unbound even if the userspace process crashes.

Note that any reasonably well-behaved dmabuf from any exporter should work with
devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.

Socket Setup
------------

The user application must use the MSG_ZEROCOPY flag when sending devmem TCP
data. Devmem cannot be copied by the kernel, so the semantics of devmem TX are
similar to the semantics of MSG_ZEROCOPY::

	int opt = 1;

	setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));

It is also recommended that the user use SO_BINDTODEVICE to bind the TX socket
to the same interface the dma-buf has been bound to::

	setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);


Sending data
------------

Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.

The user should create a msghdr where:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf

The user passes the ID of the dma-buf to send from via dmabuf_tx_cmsg.dmabuf_id.

The example below sends 1024 bytes from offset 100 into the dmabuf, and 2048
bytes from offset 2000 into the dmabuf. The dmabuf to send from is
tx_dmabuf_id::

       char ctrl_data[CMSG_SPACE(sizeof(struct dmabuf_tx_cmsg))];
       struct dmabuf_tx_cmsg ddmabuf;
       struct msghdr msg = {};
       struct cmsghdr *cmsg;
       struct iovec iov[2];

       iov[0].iov_base = (void*)100;
       iov[0].iov_len = 1024;
       iov[1].iov_base = (void*)2000;
       iov[1].iov_len = 2048;

       msg.msg_iov = iov;
       msg.msg_iovlen = 2;

       msg.msg_control = ctrl_data;
       msg.msg_controllen = sizeof(ctrl_data);

       cmsg = CMSG_FIRSTHDR(&msg);
       cmsg->cmsg_level = SOL_SOCKET;
       cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
       cmsg->cmsg_len = CMSG_LEN(sizeof(struct dmabuf_tx_cmsg));

       ddmabuf.dmabuf_id = tx_dmabuf_id;

       *((struct dmabuf_tx_cmsg *)CMSG_DATA(cmsg)) = ddmabuf;

       sendmsg(socket_fd, &msg, MSG_ZEROCOPY);
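
Because iov_base carries an offset rather than a pointer, a small userspace
helper can make the cast explicit and catch ranges that fall outside the
dmabuf before they reach sendmsg(). The following is a sketch only:
devmem_tx_set_iov() and its dmabuf_size parameter are hypothetical names, and
the size is assumed to be whatever the application allocated the dmabuf with.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: describe `len` bytes starting `offset` bytes into
 * a TX dmabuf. For devmem TX, iov_base holds the offset into the dmabuf,
 * not a dereferenceable pointer. dmabuf_size is supplied by the caller
 * (an assumption of this sketch).
 * Returns 0 on success, -1 if the range falls outside the dmabuf.
 */
static int devmem_tx_set_iov(struct iovec *iov, size_t offset, size_t len,
			     size_t dmabuf_size)
{
	/* Written this way so that offset + len cannot overflow. */
	if (offset > dmabuf_size || len > dmabuf_size - offset)
		return -1;

	iov->iov_base = (void *)(uintptr_t)offset;
	iov->iov_len = len;
	return 0;
}
```

With a 4096-byte dmabuf, devmem_tx_set_iov(&iov[0], 100, 1024, 4096) and
devmem_tx_set_iov(&iov[1], 2000, 2048, 4096) would fill in the same two
entries as the example above.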


Reusing TX dmabufs
------------------

Similar to MSG_ZEROCOPY with regular memory, the user should not modify the
contents of the dma-buf while a send operation is in progress. This is because
the kernel does not keep a copy of the dmabuf contents. Instead, the kernel
will pin and send data from the buffer available to the userspace.

Just as in MSG_ZEROCOPY, the kernel notifies the userspace of send completions
using MSG_ERRQUEUE::

        int64_t tstop = gettimeofday_ms() + waittime_ms;
        char control[CMSG_SPACE(100)] = {};
        struct sock_extended_err *serr;
        struct msghdr msg = {};
        struct cmsghdr *cm;
        int ret;
        __u32 hi, lo;

        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        while (gettimeofday_ms() < tstop) {
                if (!do_poll(fd))
                        continue;

                ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
                if (ret < 0)
                        continue;

                for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                        serr = (void *)CMSG_DATA(cm);

                        hi = serr->ee_data;
                        lo = serr->ee_info;

                        fprintf(stdout, "tx complete [%u,%u]\n", lo, hi);
                }
        }

After the associated sendmsg has been completed, the dmabuf can be reused by
the userspace.
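
To decide when a particular region of the dmabuf may be reused, the
application needs to match completions to sends. The sketch below assumes
devmem TX completions follow the notification semantics described in
Documentation/networking/msg_zerocopy.rst: each successful zerocopy send on a
socket is numbered by a per-socket 32-bit counter starting at 0, and a
completion reports an inclusive range [lo, hi]. zc_send_completed() is a
hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if send number `id` is covered by
 * the inclusive completion range [lo, hi], where lo is ee_info and hi
 * is ee_data. Unsigned subtraction keeps the comparison valid when the
 * 32-bit counter wraps around.
 */
static bool zc_send_completed(uint32_t id, uint32_t lo, uint32_t hi)
{
	return (uint32_t)(id - lo) <= (uint32_t)(hi - lo);
}
```

An application would record the counter value of each sendmsg() that
references a dmabuf region, then treat that region as reusable once
zc_send_completed() returns true for the recorded value.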


Implementation & Caveats
========================

+1 −0
@@ -10,6 +10,7 @@ Type Name fastpath_tx_acce
=================================== =========================== =================== =================== ===================================================================================
unsigned_long:32                    priv_flags                  read_mostly                             __dev_queue_xmit(tx)
unsigned_long:1                     lltx                        read_mostly                             HARD_TX_LOCK,HARD_TX_TRYLOCK,HARD_TX_UNLOCK(tx)
unsigned_long:1                     netmem_tx                   read_mostly
char                                name[16]
struct netdev_name_node*            name_node
struct dev_ifalias*                 ifalias
+5 −0
@@ -188,3 +188,8 @@ Redundancy) frames from one port to another in hardware.
This should be set for devices which duplicate outgoing HSR (High-availability
Seamless Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically
frames in hardware.

* netmem-tx

This should be set for devices which support netmem TX. See
Documentation/networking/netmem.rst
+21 −2
@@ -19,8 +19,8 @@ Benefits of Netmem :
* Simplified Development: Drivers interact with a consistent API,
  regardless of the underlying memory implementation.

Driver Requirements
===================
Driver RX Requirements
======================

1. The driver must support page_pool.

@@ -77,3 +77,22 @@ Driver Requirements
   that purpose, but be mindful that some netmem types might have longer
   circulation times, such as when userspace holds a reference in zerocopy
   scenarios.

Driver TX Requirements
======================

1. The driver must not pass the netmem dma_addr to any of the dma-mapping APIs
   directly. This is because netmem dma_addrs may come from a source like
   dma-buf that is not compatible with the dma-mapping APIs.

   Helpers like netmem_dma_unmap_page_attrs() & netmem_dma_unmap_addr_set()
   should be used in lieu of dma_unmap_page[_attrs](), dma_unmap_addr_set().
   The netmem variants will handle netmem dma_addrs correctly regardless of the
   source, delegating to the dma-mapping APIs when appropriate.

   Not all dma-mapping APIs have netmem equivalents at the moment. If your
   driver relies on a missing netmem API, feel free to add it and propose it
   to netdev@, or reach out to the maintainers and/or almasrymina@google.com
   for help adding the netmem API.

2. The driver should declare support by setting `netdev->netmem_tx = true`.