Commit d7e14e53 authored by Jakub Kicinski's avatar Jakub Kicinski
Browse files

Merge tag 'mlx5-socket-direct-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Support Multi-PF netdev (Socket Direct)

This series adds support for combining multiple devices (PFs) of the
same port under one netdev instance. Passing traffic through different
devices belonging to different NUMA sockets saves cross-numa traffic and
allows apps running on the same netdev from different numas to still
feel a sense of proximity to the device and achieve improved
performance.

We achieve this by grouping PFs together, and creating the netdev only
once all group members are probed. Symmetrically, we destroy the netdev
once any of the PFs is removed.

The channels are distributed between all devices, a proper configuration
would utilize the correct close numa when working on a certain app/cpu.

We pick one device to be a primary (leader), and it fills a special
role.  The other devices (secondaries) are disconnected from the network
in the chip level (set to silent mode). All RX/TX traffic is steered
through the primary to/from the secondaries.

Currently, we limit the support to PFs only, and up to two devices
(sockets).

* tag 'mlx5-socket-direct-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  Documentation: networking: Add description for multi-pf netdev
  net/mlx5: Enable SD feature
  net/mlx5e: Block TLS device offload on combined SD netdev
  net/mlx5e: Support per-mdev queue counter
  net/mlx5e: Support cross-vhca RSS
  net/mlx5e: Let channels be SD-aware
  net/mlx5e: Create EN core HW resources for all secondary devices
  net/mlx5e: Create single netdev per SD group
  net/mlx5: SD, Add debugfs
  net/mlx5: SD, Add informative prints in kernel log
  net/mlx5: SD, Implement steering for primary and secondaries
  net/mlx5: SD, Implement devcom communication and primary election
  net/mlx5: SD, Implement basic query and instantiation
  net/mlx5: SD, Introduce SD lib
  net/mlx5: Add MPIR bit in mcam_access_reg
====================

Link: https://lore.kernel.org/r/20240307084229.500776-1-saeed@kernel.org


Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
parents 2f901582 77d9ec3f
Loading
Loading
Loading
Loading
+1 −0
Original line number Diff line number Diff line
@@ -74,6 +74,7 @@ Contents:
   mpls-sysctl
   mptcp-sysctl
   multiqueue
   multi-pf-netdev
   napi
   net_cachelines/index
   netconsole
+174 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. Through either a connection harness that
splits the PCIe lanes between two cards or by bifurcating a PCIe slot for a single card. This
results in eliminating the network traffic traversing over the internal bus between the sockets,
significantly reducing overhead and latency, in addition to reducing CPU utilization and increasing
network throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like pci func,
sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMAs to still feel a sense of
proximity to the device and achieve improved performance.

mlx5 implementation
===================

Multi-PF or Socket-direct in mlx5 is achieved by grouping PFs together which belong to the same
NIC and has the socket-direct property enabled, once all PFs are probed, we create a single netdev
to represent all of them, symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices, a proper configuration would utilize
the correct close NUMA node when working on a certain app/CPU.

We pick one PF to be a primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flowing directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).

Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+


The reason we prefer round-robin is, it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
As the channel stats are persistent across channel's closure, changing the mapping every single time
would turn the accumulative stats less representing of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".

Observability
=============
The relation between PF, irq, napi, and queue can be observed via netlink spec:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
[{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
 {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
 {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
 {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
 {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
 {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
 {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
 {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
 {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
 {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
[{'id': 543, 'ifindex': 13, 'irq': 42},
 {'id': 542, 'ifindex': 13, 'irq': 41},
 {'id': 541, 'ifindex': 13, 'irq': 40},
 {'id': 540, 'ifindex': 13, 'irq': 39},
 {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy:

$ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
/proc/irq/36/mlx5_comp1@pci:0000:08:00.0
/proc/irq/39/mlx5_comp1@pci:0000:09:00.0
/proc/irq/40/mlx5_comp2@pci:0000:08:00.0
/proc/irq/41/mlx5_comp2@pci:0000:09:00.0
/proc/irq/42/mlx5_comp3@pci:0000:08:00.0

Steering
========
Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. Still maintain a single default RSS table,
that is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set default XPS configuration that, based on the CPU, selects an SQ belonging to the
PF on the same node as the CPU.

XPS default config example:

NUMA node(s):          2
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000

Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.
+1 −1
Original line number Diff line number Diff line
@@ -29,7 +29,7 @@ mlx5_core-$(CONFIG_MLX5_CORE_EN) += en/rqt.o en/tir.o en/rss.o en/rx_res.o \
		en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/pool.o \
		en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o en/ptp.o \
		en/qos.o en/htb.o en/trap.o en/fs_tt_redirect.o en/selq.o \
		lib/crypto.o
		lib/crypto.o lib/sd.o

#
# Netdev extra
+6 −3
Original line number Diff line number Diff line
@@ -60,6 +60,7 @@
#include "lib/clock.h"
#include "en/rx_res.h"
#include "en/selq.h"
#include "lib/sd.h"

extern const struct net_device_ops mlx5e_netdev_ops;
struct page_pool;
@@ -791,6 +792,8 @@ struct mlx5e_channel {
	struct hwtstamp_config    *tstamp;
	DECLARE_BITMAP(state, MLX5E_CHANNEL_NUM_STATES);
	int                        ix;
	int                        vec_ix;
	int                        sd_ix;
	int                        cpu;
	/* Sync between icosq recovery and XSK enable/disable. */
	struct mutex               icosq_recovery_lock;
@@ -914,7 +917,7 @@ struct mlx5e_priv {
	bool                       tx_ptp_opened;
	bool                       rx_ptp_opened;
	struct hwtstamp_config     tstamp;
	u16                        q_counter;
	u16                        q_counter[MLX5_SD_MAX_GROUP_SZ];
	u16                        drop_rq_q_counter;
	struct notifier_block      events_nb;
	struct notifier_block      blocking_events_nb;
@@ -1029,12 +1032,12 @@ struct mlx5e_xsk_param;

struct mlx5e_rq_param;
int mlx5e_open_rq(struct mlx5e_params *params, struct mlx5e_rq_param *param,
		  struct mlx5e_xsk_param *xsk, int node,
		  struct mlx5e_xsk_param *xsk, int node, u16 q_counter,
		  struct mlx5e_rq *rq);
#define MLX5E_RQ_WQES_TIMEOUT 20000 /* msecs */
int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time);
void mlx5e_close_rq(struct mlx5e_rq *rq);
int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param);
int mlx5e_create_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param, u16 q_counter);
void mlx5e_destroy_rq(struct mlx5e_rq *rq);

struct mlx5e_sq_param;
+8 −2
Original line number Diff line number Diff line
@@ -23,20 +23,26 @@ bool mlx5e_channels_is_xsk(struct mlx5e_channels *chs, unsigned int ix)
	return test_bit(MLX5E_CHANNEL_STATE_XSK, c->state);
}

void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn)
void mlx5e_channels_get_regular_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
				    u32 *vhca_id)
{
	struct mlx5e_channel *c = mlx5e_channels_get(chs, ix);

	*rqn = c->rq.rqn;
	if (vhca_id)
		*vhca_id = MLX5_CAP_GEN(c->mdev, vhca_id);
}

void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn)
void mlx5e_channels_get_xsk_rqn(struct mlx5e_channels *chs, unsigned int ix, u32 *rqn,
				u32 *vhca_id)
{
	struct mlx5e_channel *c = mlx5e_channels_get(chs, ix);

	WARN_ON_ONCE(!test_bit(MLX5E_CHANNEL_STATE_XSK, c->state));

	*rqn = c->xskrq.rqn;
	if (vhca_id)
		*vhca_id = MLX5_CAP_GEN(c->mdev, vhca_id);
}

bool mlx5e_channels_get_ptp_rqn(struct mlx5e_channels *chs, u32 *rqn)
Loading