Commit 8f7aa3d3 authored by Linus Torvalds
Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Replace busylock at the Tx queuing layer with a lockless list.

     This yields a 300% (4x) improvement on heavy Tx workloads: twice
     the number of packets per second, for half the CPU cycles.

   - Allow constantly busy flows to migrate to a more suitable CPU/NIC
     queue.

     Normally we perform queue re-selection when flow comes out of idle,
     but under extreme circumstances the flows may be constantly busy.

     Add sysctl to allow periodic rehashing even if it'd risk packet
     reordering.

   - Optimize the NAPI skb cache, make it larger, use it in more paths.

   - Attempt returning Tx skbs to the originating CPU (like we already
     did for Rx skbs).

   - Various data structure layout and prefetch optimizations from Eric.

   - Remove ktime_get() from the recvmsg() fast path; ktime_get() is
     sadly quite expensive on recent AMD machines.

   - Extend threaded NAPI polling to allow the kthread busy poll for
     packets.

   - Make MPTCP use Rx backlog processing. This lowers the lock
     pressure, improving the Rx performance.

   - Support memcg accounting of MPTCP socket memory.

   - Allow admin to opt sockets out of global protocol memory accounting
     (using a sysctl or BPF-based policy). The global limits are a poor
     fit for modern container workloads, where limits are imposed using
     cgroups.

   - Improve heuristics for when to kick off AF_UNIX garbage collection.

   - Allow users to control TCP SACK compression, and default to 33% of
     RTT.

   - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
     unnecessarily aggressive rcvbuf growth and overshoot when the
     connection RTT is low.

   - Preserve skb metadata space across skb_push / skb_pull operations.

   - Support for IPIP encapsulation in the nftables flowtable offload.

   - Support appending IP interface information to ICMP messages (RFC
     5837).

   - Support setting max record size in TLS (RFC 8449).

   - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.

   - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.

   - Let users configure the number of write buffers in SMC.

   - Add new struct sockaddr_unsized for sockaddr of unknown length,
     from Kees.

   - Some conversions away from the crypto_ahash API, from Eric Biggers.

   - Some preparations for slimming down struct page.

   - YAML Netlink protocol spec for WireGuard.

   - Add a tool on top of YAML Netlink specs/lib for reporting commonly
     computed derived statistics and summarized system state.

  Driver API:

   - Add CAN XL support to the CAN Netlink interface.

   - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
     defined by the OPEN Alliance's "Advanced diagnostic features for
     100BASE-T1 automotive Ethernet PHYs" specification.

   - Add DPLL phase-adjust-gran pin attribute (and implement it in
     zl3073x).

   - Refactor xfrm_input lock to reduce contention when NIC offloads
     IPsec and performs RSS.

   - Add info to devlink params whether the current setting is the
     default or a user override. Allow resetting back to default.

   - Add standard device stats for PSP crypto offload.

   - Leverage DSA frame broadcast to implement simple HSR frame
     duplication for a lot of switches without dedicated HSR offload.

   - Add uAPI defines for 1.6Tbps link modes.

  Device drivers:

   - Add Motorcomm YT921x gigabit Ethernet switch support.

   - Add MUCSE driver for N500/N210 1GbE NIC series.

   - Convert drivers to support dedicated ops for timestamping control,
     and away from the direct IOCTL handling. While at it support GET
     operations for PHY timestamping.

   - Add (and convert most drivers to) a dedicated ethtool callback for
     reading the Rx ring count.

   - Significant refactoring efforts in the STMMAC driver, which
     supports Synopsys turn-key MAC IP integrated into a ton of SoCs.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - support PPS in/out on all pins
      - Intel (100G, ice, idpf):
         - ice: implement standard ethtool and timestamping stats
         - i40e: support setting the max number of MAC addresses per VF
         - iavf: support RSS of GTP tunnels for 5G and LTE deployments
      - nVidia/Mellanox (mlx5):
         - reduce downtime on interface reconfiguration
         - disable being an XDP redirect target by default (same as
           other drivers) to avoid wasting resources if feature is
           unused
      - Meta (fbnic):
         - add support for Linux-managed PCS on 25G, 50G, and 100G links
      - Wangxun:
         - support Rx descriptor merge, and Tx head writeback
         - support Rx coalescing offload
         - support 25G SFP and 40G QSFP modules

   - Ethernet virtual:
      - Google (gve):
         - allow ethtool to configure rx_buf_len
         - implement XDP HW RX Timestamping support for DQ descriptor
           format
      - Microsoft vNIC (mana):
         - support HW link state events
         - handle hardware recovery events when probing the device

   - Ethernet NICs consumer, and embedded:
      - usbnet: add support for Byte Queue Limits (BQL)
      - AMD (amd-xgbe):
         - add device selftests
      - NXP (enetc):
         - add i.MX94 support
      - Broadcom integrated MACs (bcmgenet, bcmasp):
         - bcmasp: add support for PHY-based Wake-on-LAN
      - Broadcom switches (b53):
         - support port isolation
         - support BCM5389/97/98 and BCM63XX ARL formats
      - Lantiq/MaxLinear switches:
         - support bridge FDB entries on the CPU port
         - use regmap for register access
         - allow user to enable/disable learning
         - support Energy Efficient Ethernet
         - support configuring RMII clock delays
         - add tagging driver for MaxLinear GSW1xx switches
      - Synopsys (stmmac):
         - support using the HW clock in free running mode
         - add Eswin EIC7700 support
         - add Rockchip RK3506 support
         - add Altera Agilex5 support
      - Cadence (macb):
         - cleanup and consolidate descriptor and DMA address handling
         - add EyeQ5 support
      - TI:
         - icssg-prueth: support AF_XDP
      - Airoha access points:
         - add missing Ethernet stats and link state callback
         - add AN7583 support
         - support out-of-order Tx completion processing
      - Power over Ethernet:
         - pd692x0: preserve PSE configuration across reboots
         - add support for TPS23881B devices

   - Ethernet PHYs:
      - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
      - Support 50G SerDes and 100G interfaces in Linux-managed PHYs
      - micrel:
         - support for non PTP SKUs of lan8814
         - enable in-band auto-negotiation on lan8814
      - realtek:
         - cable testing support on RTL8224
         - interrupt support on RTL8221B
      - motorcomm: support for PHY LEDs on YT853
      - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
      - mscc: support for PHY LED control

   - CAN drivers:
      - m_can: add support for optional reset and system wake up
      - remove can_change_mtu() obsoleted by core handling
      - mcp251xfd: support GPIO controller functionality

   - Bluetooth:
      - add initial support for PASTa

   - WiFi:
      - split ieee80211.h file, it's way too big
      - improvements in VHT radiotap reporting, S1G, Channel Switch
        Announcement handling, rate tracking in mesh networks
      - improve multi-radio monitor mode support, and add a cfg80211
        debugfs interface for it
      - HT action frame handling on 6 GHz
      - initial chanctx work towards NAN
      - MU-MIMO sniffer improvements

   - WiFi drivers:
      - RealTek (rtw89):
         - support USB devices RTL8852AU and RTL8852CU
         - initial work for RTL8922DE
         - improved injection support
      - Intel:
         - iwlwifi: new sniffer API support
      - MediaTek (mt76):
         - WED support for >32-bit DMA
         - airoha NPU support
         - regdomain improvements
         - continued WiFi7/MLO work
      - Qualcomm/Atheros:
         - ath10k: factory test support
         - ath11k: TX power insertion support
         - ath12k: BSS color change support
         - ath12k: statistics improvements
      - brcmfmac: Acer A1 840 tablet quirk
      - rtl8xxxu: 40 MHz connection fixes/support"

* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
  net: page_pool: sanitise allocation order
  net: page pool: xa init with destroy on pp init
  net/mlx5e: Support XDP target xmit with dummy program
  net/mlx5e: Update XDP features in switch channels
  selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
  net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
  wireguard: netlink: generate netlink code
  wireguard: uapi: generate header with ynl-gen
  wireguard: uapi: move flag enums
  wireguard: uapi: move enum wg_cmd
  wireguard: netlink: add YNL specification
  selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
  selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
  selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
  selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
  selftests: drv-net: introduce Iperf3Runner for measurement use cases
  selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
  net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
  Documentation: net: dsa: mention simple HSR offload helpers
  Documentation: net: dsa: mention availability of RedBox
  ...
parents 015e7b0b 4de44542
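
The "300% (4x)" figure quoted for the lockless Tx queuing change can be checked with a little arithmetic. The sketch below is only an illustration: it normalises the baseline to 1 (an assumption, since the commit message gives no absolute numbers) and measures efficiency as packets per CPU cycle.

```python
# Relative figures from the commit message: twice the packets per second,
# for half the CPU cycles. Baseline values are normalised to 1 (assumption).
baseline_pps, baseline_cycles = 1.0, 1.0
new_pps, new_cycles = 2.0 * baseline_pps, 0.5 * baseline_cycles


def packets_per_cycle(pps, cycles):
    """Efficiency metric: packets sent per CPU cycle spent."""
    return pps / cycles


speedup = packets_per_cycle(new_pps, new_cycles) / packets_per_cycle(
    baseline_pps, baseline_cycles
)
improvement_percent = (speedup - 1.0) * 100.0
print(f"{speedup:.0f}x, i.e. a {improvement_percent:.0f}% improvement")
```

Doubling throughput while halving cost multiplies efficiency by four, which is the same thing as a 300% improvement over the baseline.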
+27 −2
@@ -212,6 +212,14 @@ mem_pcpu_rsv

Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.

bypass_prot_mem
---------------

Skip charging socket buffers to the global per-protocol memory
accounting controlled by net.ipv4.tcp_mem, net.ipv4.udp_mem, etc.

Default: 0 (off)

rmem_default
------------

@@ -347,9 +355,9 @@ skb_defer_max
-------------

Max size (in skbs) of the per-cpu list of skbs being freed
-by the cpu which allocated them. Used by TCP stack so far.
+by the cpu which allocated them.

-Default: 64
+Default: 128

optmem_max
----------
@@ -406,6 +414,23 @@ to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt).
If set to 1 (default), hash rethink is performed on listening socket.
If set to 0, hash rethink is not performed.

txq_reselection_ms
------------------

Controls how often (in ms) a busy connected flow can select another tx queue.

A reselection is desirable when the user thread has migrated and XPS
would select a different queue. The same can occur without XPS
if the flow hash has changed.

But switching txq can introduce reorders, especially if the
old queue is under high pressure. Modern TCP stacks deal
well with reorders if they happen not too often.

To disable this feature, set the value to 0.

Default: 1000

gro_normal_batch
----------------

+34 −1
@@ -17,6 +17,7 @@ properties:
  compatible:
    enum:
      - airoha,en7581-eth
      - airoha,an7583-eth

  reg:
    items:
@@ -44,6 +45,7 @@ properties:
      - description: PDMA irq

  resets:
    minItems: 7
    maxItems: 8

  reset-names:
@@ -54,8 +56,9 @@ properties:
      - const: xsi-mac
      - const: hsi0-mac
      - const: hsi1-mac
-      - const: hsi-mac
+      - enum: [ hsi-mac, xfp-mac ]
      - const: xfp-mac
    minItems: 7

  memory-region:
    items:
@@ -81,6 +84,36 @@ properties:
      interface to implement hardware flow offloading programming Packet
      Processor Engine (PPE) flow table.

allOf:
  - $ref: ethernet-controller.yaml#
  - if:
      properties:
        compatible:
          contains:
            enum:
              - airoha,en7581-eth
    then:
      properties:
        resets:
          minItems: 8

        reset-names:
          minItems: 8

  - if:
      properties:
        compatible:
          contains:
            enum:
              - airoha,an7583-eth
    then:
      properties:
        resets:
          maxItems: 7

        reset-names:
          maxItems: 7

patternProperties:
  "^ethernet@[1-4]$":
    type: object
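
The allOf/if/then blocks in this hunk narrow the base `resets` range (7 to 8 entries) per compatible string. The sketch below restates that logic in plain Python to make the resulting limits explicit. It is an illustration only and does not use or replace the dt-schema tooling; the helper names and the table layout are hypothetical.

```python
# Limits transcribed from the schema hunks above: the base property allows
# 7..8 entries, and each compatible narrows that range.
BASE_LIMITS = (7, 8)
PER_COMPATIBLE = {
    "airoha,en7581-eth": {"min": 8},  # then: minItems: 8
    "airoha,an7583-eth": {"max": 7},  # then: maxItems: 7
}


def allowed_reset_count(compatible):
    """Return the (min, max) number of resets allowed for a compatible."""
    low, high = BASE_LIMITS
    extra = PER_COMPATIBLE.get(compatible, {})
    low = max(low, extra.get("min", low))
    high = min(high, extra.get("max", high))
    return low, high


def resets_valid(compatible, reset_names):
    """Check a reset-names list against the combined limits."""
    low, high = allowed_reset_count(compatible)
    return low <= len(reset_names) <= high


print(allowed_reset_count("airoha,en7581-eth"))
print(allowed_reset_count("airoha,an7583-eth"))
```

Under these rules the EN7581 effectively needs exactly eight resets and the AN7583 exactly seven.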
+1 −0
@@ -18,6 +18,7 @@ properties:
  compatible:
    enum:
      - airoha,en7581-npu
      - airoha,an7583-npu

  reg:
    maxItems: 1
+147 −0
# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
%YAML 1.2
---
$id: http://devicetree.org/schemas/net/amd,xgbe-seattle-v1a.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#

title: AMD XGBE Seattle v1a

maintainers:
  - Shyam Sundar S K <Shyam-sundar.S-k@amd.com>

allOf:
  - $ref: /schemas/net/ethernet-controller.yaml#

properties:
  compatible:
    const: amd,xgbe-seattle-v1a

  reg:
    items:
      - description: MAC registers
      - description: PCS registers
      - description: SerDes Rx/Tx registers
      - description: SerDes integration registers (1/2)
      - description: SerDes integration registers (2/2)

  interrupts:
    description: Device interrupts. The first entry is the general device
      interrupt. If amd,per-channel-interrupt is specified, each DMA channel
      interrupt must be specified. The last entry is the PCS auto-negotiation
      interrupt.
    minItems: 2
    maxItems: 6

  clocks:
    items:
      - description: DMA clock for the device
      - description: PTP clock for the device

  clock-names:
    items:
      - const: dma_clk
      - const: ptp_clk

  iommus:
    maxItems: 1

  phy-mode: true

  dma-coherent: true

  amd,per-channel-interrupt:
    description: Indicates that Rx and Tx complete will generate a unique
      interrupt for each DMA channel.
    type: boolean

  amd,speed-set:
    description: >
      Speed capabilities of the device.
        0 = 1GbE and 10GbE
        1 = 2.5GbE and 10GbE
    $ref: /schemas/types.yaml#/definitions/uint32
    enum: [0, 1]

  amd,serdes-blwc:
    description: Baseline wandering correction enablement for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    minItems: 3
    maxItems: 3
    items:
      enum: [0, 1]

  amd,serdes-cdr-rate:
    description: CDR rate speed selection for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    items:
      - description: CDR rate for 1GbE
      - description: CDR rate for 2.5GbE
      - description: CDR rate for 10GbE

  amd,serdes-pq-skew:
    description: PQ data sampling skew for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    items:
      - description: PQ skew for 1GbE
      - description: PQ skew for 2.5GbE
      - description: PQ skew for 10GbE

  amd,serdes-tx-amp:
    description: TX amplitude boost for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    items:
      - description: TX amplitude for 1GbE
      - description: TX amplitude for 2.5GbE
      - description: TX amplitude for 10GbE

  amd,serdes-dfe-tap-config:
    description: DFE taps available to run for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    items:
      - description: DFE taps available for 1GbE
      - description: DFE taps available for 2.5GbE
      - description: DFE taps available for 10GbE

  amd,serdes-dfe-tap-enable:
    description: DFE taps to enable for each speed.
    $ref: /schemas/types.yaml#/definitions/uint32-array
    items:
      - description: DFE taps to enable for 1GbE
      - description: DFE taps to enable for 2.5GbE
      - description: DFE taps to enable for 10GbE

required:
  - compatible
  - reg
  - interrupts
  - clocks
  - clock-names
  - phy-mode

unevaluatedProperties: false

examples:
  - |
    ethernet@e0700000 {
        compatible = "amd,xgbe-seattle-v1a";
        reg = <0xe0700000 0x80000>,
              <0xe0780000 0x80000>,
              <0xe1240800 0x00400>,
              <0xe1250000 0x00060>,
              <0xe1250080 0x00004>;
        interrupts = <0 325 4>,
                     <0 326 1>, <0 327 1>, <0 328 1>, <0 329 1>,
                     <0 323 4>;
        amd,per-channel-interrupt;
        clocks = <&xgbe_dma_clk>, <&xgbe_ptp_clk>;
        clock-names = "dma_clk", "ptp_clk";
        phy-mode = "xgmii";
        mac-address = [ 02 a1 a2 a3 a4 a5 ];
        amd,speed-set = <0>;
        amd,serdes-blwc = <1>, <1>, <0>;
        amd,serdes-cdr-rate = <2>, <2>, <7>;
        amd,serdes-pq-skew = <10>, <10>, <30>;
        amd,serdes-tx-amp = <15>, <15>, <10>;
        amd,serdes-dfe-tap-config = <3>, <3>, <1>;
        amd,serdes-dfe-tap-enable = <0>, <0>, <127>;
    };
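
The `interrupts` description in this binding fixes the order of entries: the general device interrupt first, any per-channel DMA interrupts in the middle, and the PCS auto-negotiation interrupt last. The sketch below splits a list accordingly, using the values from the example node. The helper name and the tuple representation of each interrupt specifier are assumptions for illustration, not part of the binding or the driver.

```python
def split_interrupts(entries):
    """Split an interrupts list by the ordering the binding describes.

    Hypothetical helper: first entry is the general device interrupt,
    last entry is the PCS auto-negotiation interrupt, anything between
    is a per-channel DMA interrupt.
    """
    if not 2 <= len(entries) <= 6:
        # minItems: 2 / maxItems: 6 in the schema
        raise ValueError("expected between 2 and 6 interrupt entries")
    general, *channels, autoneg = entries
    return {"general": general, "channels": channels, "autoneg": autoneg}


# Interrupt specifiers copied from the example node above.
example = [
    (0, 325, 4),
    (0, 326, 1), (0, 327, 1), (0, 328, 1), (0, 329, 1),
    (0, 323, 4),
]
layout = split_interrupts(example)
print(layout["general"], len(layout["channels"]), layout["autoneg"])
```

With the example node this yields one general interrupt, four channel interrupts, and one auto-negotiation interrupt, which matches the schema's maximum of six entries.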
+0 −76
* AMD 10GbE driver (amd-xgbe)

Required properties:
- compatible: Should be "amd,xgbe-seattle-v1a"
- reg: Address and length of the register sets for the device
   - MAC registers
   - PCS registers
   - SerDes Rx/Tx registers
   - SerDes integration registers (1/2)
   - SerDes integration registers (2/2)
- interrupts: Should contain the amd-xgbe interrupt(s). The first interrupt
  listed is required and is the general device interrupt. If the optional
  amd,per-channel-interrupt property is specified, then one additional
  interrupt for each DMA channel supported by the device should be specified.
  The last interrupt listed should be the PCS auto-negotiation interrupt.
- clocks:
   - DMA clock for the amd-xgbe device (used for calculating the
     correct Rx interrupt watchdog timer value on a DMA channel
     for coalescing)
   - PTP clock for the amd-xgbe device
- clock-names: Should be the names of the clocks
   - "dma_clk" for the DMA clock
   - "ptp_clk" for the PTP clock
- phy-mode: See ethernet.txt file in the same directory

Optional properties:
- dma-coherent: Present if dma operations are coherent
- amd,per-channel-interrupt: Indicates that Rx and Tx complete will generate
  a unique interrupt for each DMA channel - this requires an additional
  interrupt be configured for each DMA channel
- amd,speed-set: Speed capabilities of the device
    0 - 1GbE and 10GbE (default)
    1 - 2.5GbE and 10GbE

The MAC address will be determined using the optional properties defined in
ethernet.txt.

The following optional properties are represented by an array with each
value corresponding to a particular speed. The first array value represents
the setting for the 1GbE speed, the second value for the 2.5GbE speed and
the third value for the 10GbE speed.  All three values are required if the
property is used.
- amd,serdes-blwc: Baseline wandering correction enablement
    0 - Off
    1 - On
- amd,serdes-cdr-rate: CDR rate speed selection
- amd,serdes-pq-skew: PQ (data sampling) skew
- amd,serdes-tx-amp: TX amplitude boost
- amd,serdes-dfe-tap-config: DFE taps available to run
- amd,serdes-dfe-tap-enable: DFE taps to enable

Example:
	xgbe@e0700000 {
		compatible = "amd,xgbe-seattle-v1a";
		reg = <0 0xe0700000 0 0x80000>,
		      <0 0xe0780000 0 0x80000>,
		      <0 0xe1240800 0 0x00400>,
		      <0 0xe1250000 0 0x00060>,
		      <0 0xe1250080 0 0x00004>;
		interrupt-parent = <&gic>;
		interrupts = <0 325 4>,
			     <0 326 1>, <0 327 1>, <0 328 1>, <0 329 1>,
			     <0 323 4>;
		amd,per-channel-interrupt;
		clocks = <&xgbe_dma_clk>, <&xgbe_ptp_clk>;
		clock-names = "dma_clk", "ptp_clk";
		phy-mode = "xgmii";
		mac-address = [ 02 a1 a2 a3 a4 a5 ];
		amd,speed-set = <0>;
		amd,serdes-blwc = <1>, <1>, <0>;
		amd,serdes-cdr-rate = <2>, <2>, <7>;
		amd,serdes-pq-skew = <10>, <10>, <30>;
		amd,serdes-tx-amp = <15>, <15>, <10>;
		amd,serdes-dfe-tap-config = <3>, <3>, <1>;
		amd,serdes-dfe-tap-enable = <0>, <0>, <127>;
	};