13372 Commits

Author SHA1 Message Date
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
                      kzalloc_obj,kzalloc_objs,kzalloc_flex,
                      kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
                      kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that for a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with the alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
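To illustrate what the sed expression matches, here is a small userspace
sketch that applies the same POSIX basic regular expression through
regcomp()/regexec(). The function name drop_gfp_kernel() is made up for this
illustration and is not part of the conversion. It only shows that a call
written on one line is rewritten, while a call split across lines (or one
using a different GFP flag) is left alone.

```c
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: apply s/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/ once
 * to a single line. Returns 1 if the line was rewritten, 0 if it did not
 * match (out then holds a copy of line), and -1 on a regex error.
 */
int drop_gfp_kernel(const char *line, char *out, size_t outlen)
{
	regex_t re;
	regmatch_t m[2];
	int rc;

	/* No REG_EXTENDED: basic syntax, which is what sed uses by default. */
	if (regcomp(&re, "\\(alloc_objs*(.*\\), GFP_KERNEL)", 0) != 0)
		return -1;
	rc = regexec(&re, line, 2, m, 0);
	regfree(&re);

	if (rc != 0) {
		snprintf(out, outlen, "%s", line);
		return 0;
	}

	/* Text before the match + group 1 + ")" + text after the match. */
	snprintf(out, outlen, "%.*s%.*s)%s",
		 (int)m[0].rm_so, line,
		 (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so,
		 line + m[0].rm_eo);
	return 1;
}
```

Because the expression is applied per line, a call whose GFP_KERNEL argument
sits on a continuation line never matches, which is the limitation described
above.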

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".
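The typed-return idea can be sketched in userspace. The macros below are
hypothetical stand-ins built on malloc()/calloc() and the GNU __typeof__
extension; they are not the kernel's definitions, take no GFP flags, and skip
the size overflow handling that struct_size() provides. The names alloc_obj,
alloc_objs, alloc_flex, struct point and struct msg are assumptions made for
this example.

```c
#include <stdlib.h>

/* Single object: the result is "TYPE *" rather than "void *". */
#define alloc_obj(TYPE) \
	((__typeof__(TYPE) *)malloc(sizeof(TYPE)))

/* Array of COUNT objects; calloc() checks the multiplication (and zeroes). */
#define alloc_objs(TYPE, COUNT) \
	((__typeof__(TYPE) *)calloc((COUNT), sizeof(TYPE)))

/* Object VAR with COUNT trailing elements in flexible array member FAM. */
#define alloc_flex(VAR, FAM, COUNT) \
	((__typeof__(VAR) *)malloc(sizeof(VAR) + \
				   (COUNT) * sizeof((VAR).FAM[0])))

struct point {
	int x, y;
};

struct msg {
	size_t len;
	unsigned char data[];	/* flexible array member */
};

/* TYPE may also be *VAR, mirroring the "(where TYPE may also be *VAR)" note. */
struct msg *msg_new(size_t len)
{
	struct msg *m = alloc_flex(*m, data, len);

	if (m)
		m->len = len;
	return m;
}
```

With a typed result, assigning alloc_obj(struct point) to a "struct msg *"
draws an incompatible-pointer diagnostic from the compiler, which a plain
"void *" return would not.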

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Linus Torvalds
8bf22c33e7 Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from Netfilter.

  Current release - new code bugs:

   - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT

   - eth: mlx5e: XSK, Fix unintended ICOSQ change

   - phy_port: correctly recompute the port's linkmodes

   - vsock: prevent child netns mode switch from local to global

   - couple of kconfig fixes for new symbols

  Previous releases - regressions:

   - nfc: nci: fix false-positive parameter validation for packet data

   - net: do not delay zero-copy skbs in skb_attempt_defer_free()

  Previous releases - always broken:

   - mctp: ensure our nlmsg responses to user space are zero-initialised

   - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()

   - fixes for ICMP rate limiting

  Misc:

   - intel: fix PCI device ID conflict between i40e and ipw2200"

* tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
  net: nfc: nci: Fix parameter validation for packet data
  net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
  net/mlx5e: Fix deadlocks between devlink and netdev instance locks
  net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event
  net/mlx5: Fix misidentification of write combining CQE during poll loop
  net/mlx5e: Fix misidentification of ASO CQE during poll loop
  net/mlx5: Fix multiport device check over light SFs
  bonding: alb: fix UAF in rlb_arp_recv during bond up/down
  bnge: fix reserving resources from FW
  eth: fbnic: Advertise supported XDP features.
  rds: tcp: fix uninit-value in __inet_bind
  net/rds: Fix NULL pointer dereference in rds_tcp_accept_one
  octeontx2-af: Fix default entries mcam entry action
  net/mlx5e: XSK, Fix unintended ICOSQ change
  ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero
  ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero
  ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()
  inet: move icmp_global_{credit,stamp} to a separate cache line
  icmp: prevent possible overflow in icmp_global_allow()
  selftests/net: packetdrill: add ipv4-mapped-ipv6 tests
  ...
2026-02-19 10:39:08 -08:00
Eric Dumazet
d8d9ef2988 ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero
If net.ipv4.icmp_ratelimit is zero, we do not have to call
inet_getpeer_v4() and inet_peer_xrlim_allow().

Both can be very expensive under DDOS.
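The shape of this fast path can be sketched as follows. Everything here is a
hypothetical stand-in (demo_peer_allow() plays the role of the expensive peer
lookup and limit check, and demo_peer_lookups only counts how often it runs);
it is not the kernel code from this commit.

```c
#include <stdbool.h>

/* Counts how many times the simulated expensive path was taken. */
int demo_peer_lookups;

/* Stand-in for the per-peer lookup plus rate check; always allows here. */
bool demo_peer_allow(int ratelimit)
{
	demo_peer_lookups++;
	return ratelimit >= 0;
}

/* A ratelimit of zero means limiting is disabled, so skip the lookup. */
bool demo_xrlim_allow(int ratelimit)
{
	if (ratelimit == 0)
		return true;
	return demo_peer_allow(ratelimit);
}
```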

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216142832.3834174-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18 16:46:36 -08:00
Eric Dumazet
034bbd8062 icmp: prevent possible overflow in icmp_global_allow()
The following expression can overflow
if sysctl_icmp_msgs_per_sec is big enough.

sysctl_icmp_msgs_per_sec * delta / HZ;
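A small sketch of why the product is the problem: the multiplication happens
before the division, so a large rate wraps in 32-bit arithmetic. The code
below uses unsigned types so the wraparound is well defined in C, assumes
HZ is 1000 purely for illustration, and shows one generic remedy (widening to
64 bits and clamping). That remedy is an assumption for the example, not
necessarily what this commit does.

```c
#include <stdint.h>

#define DEMO_HZ 1000u	/* assumed tick rate, for illustration only */

/* 32-bit arithmetic: msgs_per_sec * delta wraps before the division. */
uint32_t demo_credit_narrow(uint32_t msgs_per_sec, uint32_t delta)
{
	return msgs_per_sec * delta / DEMO_HZ;
}

/* Widen first, then divide, then clamp back into the 32-bit range. */
uint32_t demo_credit_wide(uint32_t msgs_per_sec, uint32_t delta)
{
	uint64_t v = (uint64_t)msgs_per_sec * delta / DEMO_HZ;

	return v > UINT32_MAX ? UINT32_MAX : (uint32_t)v;
}
```

For small inputs both functions agree; with msgs_per_sec = 10000000 and
delta = 1000 the 32-bit product wraps and the narrow version returns a much
smaller credit than the intended 10000000.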

Fixes: 4cdf507d54 ("icmp: add a global rate limitation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-18 16:46:36 -08:00
Eric Dumazet
ad5dfde2a5 ping: annotate data-races in ping_lookup()
isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if
are read locklessly in ping_lookup().

Add READ_ONCE()/WRITE_ONCE() annotations.
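As a rough userspace illustration of the annotation pattern, the macros below
are the classic simplified volatile-cast forms, not the kernel's full
definitions, and struct demo_sock with its helpers is hypothetical. Each
field is loaded exactly once into a local before being tested, so the
compiler cannot re-read a value that a concurrent writer may change.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; relies on the GNU __typeof__ extension. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct demo_sock {
	uint16_t num;		/* bound ident */
	uint32_t rcv_saddr;	/* bound address, 0 means any */
	int bound_dev_if;	/* bound device index, 0 means any */
};

/* Writer side: every store that a lockless reader may observe is marked. */
void demo_bind(struct demo_sock *sk, uint16_t ident, uint32_t addr, int dif)
{
	WRITE_ONCE(sk->rcv_saddr, addr);
	WRITE_ONCE(sk->bound_dev_if, dif);
	WRITE_ONCE(sk->num, ident);
}

/* Reader side: one marked load per field, then compare the local copies. */
bool demo_match(struct demo_sock *sk, uint16_t ident, uint32_t daddr, int dif)
{
	uint32_t saddr = READ_ONCE(sk->rcv_saddr);
	int bound = READ_ONCE(sk->bound_dev_if);

	if (READ_ONCE(sk->num) != ident)
		return false;
	if (saddr && saddr != daddr)
		return false;
	if (bound && bound != dif)
		return false;
	return true;
}
```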

The race on isk->inet_rcv_saddr is probably coming from IPv6 support,
but does not deserve a specific backport.

Fixes: dbca1596bb ("ping: convert to RCU lookups, get rid of rwlock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-17 17:11:08 -08:00
Linus Torvalds
136114e0ab Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
   disk space by teaching ocfs2 to reclaim suballocator block group
   space (Heming Zhao)

 - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
   ARRAY_END() macro and uses it in various places (Alejandro Colomar)

 - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
   the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
   page size (Pnina Feder)

 - "kallsyms: Prevent invalid access when showing module buildid" cleans
   up kallsyms code related to module buildid and fixes an invalid
   access crash when printing backtraces (Petr Mladek)

 - "Address page fault in ima_restore_measurement_list()" fixes a
   kexec-related crash that can occur when booting the second-stage
   kernel on x86 (Harshit Mogalapalli)

 - "kho: ABI headers and Documentation updates" updates the kexec
   handover ABI documentation (Mike Rapoport)

 - "Align atomic storage" adds the __aligned attribute to atomic_t and
   atomic64_t definitions to get natural alignment of both types on
   csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)

 - "kho: clean up page initialization logic" simplifies the page
   initialization logic in kho_restore_page() (Pratyush Yadav)

 - "Unload linux/kernel.h" moves several things out of kernel.h and into
   more appropriate places (Yury Norov)

 - "don't abuse task_struct.group_leader" removes the usage of
   ->group_leader when it is "obviously unnecessary" (Oleg Nesterov)

 - "list private v2 & luo flb" adds some infrastructure improvements to
   the live update orchestrator (Pasha Tatashin)
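For the ARRAY_END() item in the list above, a minimal sketch of how such a
macro helps avoid off-by-one loop bounds. The definition shown is an
assumption about its general shape (a pointer one past the last element), and
demo_sum() is a made-up helper; consult the series itself for the real
definition.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))
/* Assumed shape: address one past the final element of a true array. */
#define ARRAY_END(a)  (&(a)[ARRAY_SIZE(a)])

/* Walk the array by pointer; the end bound needs no "- 1" or "<=". */
int demo_sum(void)
{
	static const int values[] = { 1, 2, 3, 4 };
	int sum = 0;

	for (const int *p = values; p < ARRAY_END(values); p++)
		sum += *p;
	return sum;
}
```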

* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
  watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
  procfs: fix missing RCU protection when reading real_parent in do_task_stat()
  watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
  kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
  kho: fix doc for kho_restore_pages()
  tests/liveupdate: add in-kernel liveupdate test
  liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
  liveupdate: luo_file: Use private list
  list: add kunit test for private list primitives
  list: add primitives for private list manipulations
  delayacct: fix uapi timespec64 definition
  panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
  netclassid: use thread_group_leader(p) in update_classid_task()
  RDMA/umem: don't abuse current->group_leader
  drm/pan*: don't abuse current->group_leader
  drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
  drm/amdgpu: don't abuse current->group_leader
  android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
  android/binder: don't abuse current->group_leader
  kho: skip memoryless NUMA nodes when reserving scratch areas
  ...
2026-02-12 12:13:01 -08:00
Linus Torvalds
37a93dd5c4 Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - A significant effort all around the stack to guide the compiler to
     make the right choice when inlining code, to avoid unneeded calls
     for small helpers and stack canary overhead in the fast-path.

     This generates better and faster code with very small or no text
     size increases, as in many cases the call generated more code than
     the actual inlined helper.

   - Extend the AccECN implementation so that it is now functionally
     complete, and allow user space to enable it on a per network
     namespace basis.

   - Add support for memory providers with large (above 4K) rx buffer.
     Paired with hw-gro, larger rx buffer sizes reduce the number of
     buffers traversing the stack, decreasing single stream CPU usage
     by up to ~30%.

   - Do not add HBH header to Big TCP GSO packets. This simplifies the
     RX path, the TX path and the NIC drivers, and is possible because
     user-space taps can now correctly interpret such packets without
     the HBH hint.

   - Allow IPv6 routes to be configured with a gateway address that is
     resolved out of a different interface than the one specified,
     aligning IPv6 to IPv4 behavior.

   - Multi-queue aware sch_cake. This makes it possible to scale the
     rate shaper of sch_cake across multiple CPUs, while still enforcing
     a single global rate on the interface.

   - Add support for the nbcon (new buffer console) infrastructure to
     netconsole, enabling lock-free, priority-based console operations
     that are safer in crash scenarios.

   - Improve the TCP ipv6 output path to cache the flow information,
     saving cpu cycles, reducing cache line misses and stack use.

   - Improve netfilter packet tracker to resolve clashes for most
     protocols, avoiding unneeded drops on rare occasions.

   - Add IP6IP6 tunneling acceleration to the flowtable infrastructure.

   - Reduce tcp socket size by one cache line.

   - Notify neighbour changes atomically, avoiding inconsistencies
     between the notification sequence and the actual states sequence.

   - Add vsock namespace support, allowing complete isolation of vsocks
     across different network namespaces.

   - Improve generic xsk performance with cache-alignment-oriented
     optimizations.

   - Support netconsole automatic target recovery, allowing netconsole
     to reestablish targets when the underlying low-level interface comes
     back online.

  Driver API:

   - Support for switching the working mode (automatic vs manual) of a
     DPLL device via netlink.

   - Introduce PHY ports representation to expose multiple front-facing
     media ports over a single MAC.

   - Introduce "rx-polarity" and "tx-polarity" device tree properties,
     to generalize polarity inversion requirements for differential
     signaling.

   - Add helper to create, prepare and enable managed clocks.

  Device drivers:

   - Add Huawei hinic3 PF ethernet driver.

   - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
     controller.

   - Add ethernet driver for MaxLinear MxL862xx switches

   - Remove parallel-port Ethernet driver.

   - Convert existing driver timestamp configuration reporting to
     hwtstamp_get and remove legacy ioctl().

   - Convert existing drivers to .get_rx_ring_count(), simplifying the RX
     ring count retrieval. Also remove the legacy fallback path.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt, bng):
         - bnxt: add FW interface update to support FEC stats histogram
           and NVRAM defragmentation
         - bng: add TSO and H/W GRO support
      - nVidia/Mellanox (mlx5):
         - improve latency of channel restart operations, reducing the
           used H/W resources
         - add TSO support for UDP over GRE over VLAN
         - add flow counters support for hardware steering (HWS) rules
         - use a static memory area to store headers for H/W GRO,
           leading to 12% RX tput improvement
      - Intel (100G, ice, idpf):
         - ice: reorganizes layout of Tx and Rx rings for cacheline
           locality and utilizes __cacheline_group* macros on the new
           layouts
         - ice: introduces Synchronous Ethernet (SyncE) support
      - Meta (fbnic):
         - adds debugfs for firmware mailbox and tx/rx rings vectors

   - Ethernet virtual:
      - geneve: introduce GRO/GSO support for double UDP encapsulation

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - some code refactoring and cleanups
      - RealTek (r8169):
         - add support for RTL8127ATF (10G Fiber SFP)
         - add dash and LTR support
      - Airoha:
         - AN8811HB 2.5 Gbps phy support
      - Freescale (fec):
         - add XDP zero-copy support
      - Thunderbolt:
         - add get link setting support to allow bonding
      - Renesas:
         - add support for RZ/G3L GBETH SoC

   - Ethernet switches:
      - Maxlinear:
         - support R(G)MII slow rate configuration
         - add support for Intel GSW150
      - Motorcomm (yt921x):
         - add DCB/QoS support
      - TI:
         - icssm-prueth: support bridging (STP/RSTP) via the switchdev
           framework

   - Ethernet PHYs:
      - Realtek:
         - enable SGMII and 2500Base-X in-band auto-negotiation
         - simplify and reunify C22/C45 drivers
      - Micrel: convert bindings to DT schema

   - CAN:
      - move skb headroom content into skb extensions, making CAN
        metadata access more robust

   - CAN drivers:
      - rcar_canfd:
         - add support for FD-only mode
         - add support for the RZ/T2H SoC
      - sja1000: cleanup the CAN state handling

   - WiFi:
      - implement EPPKE/802.1X over auth frames support
      - split up drop reasons better, removing generic RX_DROP
      - additional FTM capabilities: 6 GHz support, supported number of
        spatial streams and supported number of LTF repetitions
      - better mac80211 iterators to enumerate resources
      - initial UHR (Wi-Fi 8) support for cfg80211/mac80211

   - WiFi drivers:
      - Qualcomm/Atheros:
         - ath11k: support for Channel Frequency Response measurement
         - ath12k: a significant driver refactor to support multi-wiphy
           devices and pave the way for future device support in the
           same driver (rather than splitting to ath13k)
         - ath12k: support for the QCC2072 chipset
      - Intel:
         - iwlwifi: partial Neighbor Awareness Networking (NAN) support
         - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
      - RealTek (rtw89):
         - preparations for RTL8922DE support

   - Bluetooth:
      - implement setsockopt(BT_PHY) to set the connection packet type/PHY
      - set link_policy on incoming ACL connections

   - Bluetooth drivers:
      - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
      - btqca: add WCN6855 firmware priority selection feature"

* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
  bnge/bng_re: Add a new HSI
  net: macb: Fix tx/rx malfunction after phy link down and up
  af_unix: Fix memleak of newsk in unix_stream_connect().
  net: ti: icssg-prueth: Add optional dependency on HSR
  net: dsa: add basic initial driver for MxL862xx switches
  net: mdio: add unlocked mdiodev C45 bus accessors
  net: dsa: add tag format for MxL862xx switches
  dt-bindings: net: dsa: add MaxLinear MxL862xx
  selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
  octeontx2-pf: Unregister devlink on probe failure
  net: renesas: rswitch: fix forwarding offload statemachine
  ionic: Rate limit unknown xcvr type messages
  tcp: inet6_csk_xmit() optimization
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
  ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
  ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
  ipv6: use np->final in inet6_sk_rebuild_header()
  ipv6: add daddr/final storage in struct ipv6_pinfo
  net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
  ...
2026-02-11 19:31:52 -08:00
Paolo Abeni
83310d6133 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes in preparation for the net-next PR.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-11 15:14:35 +01:00
Linus Torvalds
0923fd0419 Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "Lock debugging:

   - Implement compiler-driven static analysis locking context checking,
     using the upcoming Clang 22 compiler's context analysis features
     (Marco Elver)

     We removed Sparse context analysis support, because prior to
     removal even a defconfig kernel produced 1,700+ context tracking
     Sparse warnings, the overwhelming majority of which are false
     positives. On an allmodconfig kernel the number of false positive
     context tracking Sparse warnings grows to over 5,200... On the plus
     side of the balance actual locking bugs found by Sparse context
     analysis are also rather ... sparse: I found only 3 such commits in
     the last 3 years. So the rate of false positives and the
     maintenance overhead is rather high and there appears to be no
     active policy in place to achieve a zero-warnings baseline to move
     the annotations & fixers to developers who introduce new code.

     Clang context analysis is more complete and more aggressive in
     trying to find bugs, at least in principle. Plus it has a different
     model to enabling it: it's enabled subsystem by subsystem, which
     results in zero warnings on all relevant kernel builds (as far as
     our testing managed to cover it). Which allowed us to enable it by
     default, similar to other compiler warnings, with the expectation
     that there are no warnings going forward. This enforces a
     zero-warnings baseline on clang-22+ builds (Which are still limited
     in distribution, admittedly)

     Hopefully the Clang approach can lead to a more maintainable
     zero-warnings status quo and policy, with more and more subsystems
     and drivers enabling the feature. Context tracking can be enabled
     for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default
     disabled), but this will generate a lot of false positives.

     ( Having said that, Sparse support could still be added back,
       if anyone is interested - the removal patch is still
       relatively straightforward to revert at this stage. )

  Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng)

    - Add support for Atomic<i8/i16/bool> and replace most Rust native
      AtomicBool usages with Atomic<bool>

    - Clean up LockClassKey and improve its documentation

    - Add missing Send and Sync trait implementation for SetOnce

    - Make ARef Unpin as it is supposed to be

    - Add __rust_helper to a few Rust helpers as a preparation for
      helper LTO

    - Inline various lock related functions to avoid additional function
      calls

  WW mutexes:

    - Extend ww_mutex tests and other test-ww_mutex updates (John
      Stultz)

  Misc fixes and cleanups:

    - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd
      Bergmann)

    - locking/local_lock: Include more missing headers (Peter Zijlstra)

    - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap)

    - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir
      Duberstein)"

* tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits)
  locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK
  rcu: Mark lockdep_assert_rcu_helper() __always_inline
  compiler-context-analysis: Remove __assume_ctx_lock from initializers
  tomoyo: Use scoped init guard
  crypto: Use scoped init guard
  kcov: Use scoped init guard
  compiler-context-analysis: Introduce scoped init guards
  cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers
  seqlock: fix scoped_seqlock_read kernel-doc
  tools: Update context analysis macros in compiler_types.h
  rust: sync: Replace `kernel::c_str!` with C-Strings
  rust: sync: Inline various lock related methods
  rust: helpers: Move #define __rust_helper out of atomic.c
  rust: wait: Add __rust_helper to helpers
  rust: time: Add __rust_helper to helpers
  rust: task: Add __rust_helper to helpers
  rust: sync: Add __rust_helper to helpers
  rust: refcount: Add __rust_helper to helpers
  rust: rcu: Add __rust_helper to helpers
  rust: processor: Add __rust_helper to helpers
  ...
2026-02-10 12:28:44 -08:00
Linus Torvalds
f17b474e36 Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where a single BPF program is executed
   upon entry and exit from a traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all program types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Jiayuan Chen
81b84de32b xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path
icmp_route_lookup() performs multiple route lookups to find a suitable
route for sending ICMP error messages, with special handling for XFRM
(IPsec) policies.

The lookup sequence is:
1. First, lookup output route for ICMP reply (dst = original src)
2. Pass through xfrm_lookup() for policy check
3. If blocked (-EPERM) or dst is not local, enter "reverse path"
4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec
   which reverses the original packet's flow (saddr<->daddr swapped)
5. If fl4_dec.saddr is local (we are the original destination), use
   __ip_route_output_key() for output route lookup
6. If fl4_dec.saddr is NOT local (we are a forwarding node), use
   ip_route_input() to simulate the reverse packet's input path
7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag

The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr
(original packet's source) as destination. If this address becomes local
between the initial check and ip_route_input() call (e.g., due to
concurrent "ip addr add"), ip_route_input() returns a LOCAL route with
dst.output set to ip_rt_bug.

This route is then used for ICMP output, causing dst_output() to call
ip_rt_bug(), triggering a WARN_ON:

 ------------[ cut here ]------------
 WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1
 Call Trace:
  <TASK>
  ip_push_pending_frames+0x202/0x240
  icmp_push_reply+0x30d/0x430
  __icmp_send+0x1149/0x24f0
  ip_options_compile+0xa2/0xd0
  ip_rcv_finish_core+0x829/0x1950
  ip_rcv+0x2d7/0x420
  __netif_receive_skb_one_core+0x185/0x1f0
  netif_receive_skb+0x90/0x450
  tun_get_user+0x3413/0x3fb0
  tun_chr_write_iter+0xe4/0x220
  ...

Fix this by checking rt2->rt_type after ip_route_input(). If it's
RTN_LOCAL, the route cannot be used for output, so treat it as an error.
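The check described above can be sketched in isolation. The types, the enum
values and the -EINVAL return code below are hypothetical stand-ins chosen for
this example; the real fix operates on the kernel's rtable and picks its own
error handling.

```c
#include <errno.h>

/* Stand-ins for the kernel's route types. */
enum demo_route_type {
	DEMO_RTN_UNICAST,
	DEMO_RTN_LOCAL,
};

struct demo_route {
	enum demo_route_type type;
};

/*
 * Validate the route returned by the simulated input lookup. A LOCAL route
 * means the address became local after the earlier check, so it must not be
 * used for output and is reported as an error instead.
 */
int demo_check_reverse_route(const struct demo_route *rt2)
{
	if (!rt2)
		return -EINVAL;
	if (rt2->type == DEMO_RTN_LOCAL)
		return -EINVAL;
	return 0;
}
```

Re-checking the result of the lookup, rather than trusting the earlier
locality test, is what closes the window between the two steps.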

The reproducer requires kernel modification to widen the race window,
making it unsuitable as a selftest. It is available at:

  https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81

Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/
Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev
Fixes: 8b7817f3a9 ("[IPSEC]: Add ICMP host relookup support")
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-10 15:06:11 +01:00
Eric Dumazet
a35b6e4863 tcp: inline tcp_filter()
This helper is already (auto)inlined from IPv4 TCP stack.

Make it an inline function to benefit IPv6 as well.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19)
Function                                     old     new   delta
tcp_v6_rcv                                  3448    3478     +30
__pfx_tcp_filter                              16       -     -16
tcp_filter                                    33       -     -33
Total: Before=24891904, After=24891885, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:12:11 -08:00
Eric Dumazet
c89477ad79 inet: RAW sockets using IPPROTO_RAW MUST drop incoming ICMP
Yizhou Zhao reported that simply having one RAW socket on protocol
IPPROTO_RAW (255) was dangerous.

  socket(AF_INET, SOCK_RAW, 255);

A malicious incoming ICMP packet can set the protocol field to 255
and match this socket, leading to FNHE cache changes.

inner = IP(src="192.168.2.1", dst="8.8.8.8", proto=255)/Raw("TEST")
pkt = IP(src="192.168.1.1", dst="192.168.2.1")/ICMP(type=3, code=4, nexthopmtu=576)/inner

"man 7 raw" states:

  A protocol of IPPROTO_RAW implies enabled IP_HDRINCL and is able
  to send any IP protocol that is specified in the passed header.
  Receiving of all IP protocols via IPPROTO_RAW is not possible
  using raw sockets.

Make sure we drop these malicious packets.
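
The intended rule can be sketched as a small user-space model. This is
only an illustration, not the kernel change itself: RawSocket and
sockets_for_icmp_error() are hypothetical names, and the real lookup
matches on more than the protocol number.

```python
from dataclasses import dataclass

IPPROTO_RAW = 255  # protocol number quoted in the commit message


@dataclass
class RawSocket:
    """Hypothetical stand-in for a raw socket; only the protocol matters here."""
    protocol: int


def sockets_for_icmp_error(inner_protocol, sockets):
    """Return the raw sockets that may be notified of an ICMP error.

    inner_protocol is the protocol field of the IP header embedded in the
    ICMP payload. IPPROTO_RAW sockets are send-only, so an error that
    claims protocol 255 is dropped instead of being matched.
    """
    if inner_protocol == IPPROTO_RAW:
        return []
    return [s for s in sockets if s.protocol == inner_protocol]
```

With this model, a forged ICMP error quoting protocol 255 reaches no
socket, while errors for other protocols are matched as before.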

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/netdev/20251109134600.292125-1-zhaoyz24@mails.tsinghua.edu.cn/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260203192509.682208-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 12:36:49 -08:00
Eric Dumazet
22c1264415 tcp: move __reqsk_free() out of line
Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113)
Function                                     old     new   delta
__reqsk_free                                   -     114    +114
sock_edemux                                   18      82     +64
inet_csk_listen_start                        233     264     +31
__pfx___reqsk_free                             -      16     +16
__pfx_reqsk_queue_alloc                       16       -     -16
__pfx_reqsk_free                              16       -     -16
reqsk_queue_alloc                             46       -     -46
tcp_req_err                                  272     177     -95
reqsk_fastopen_remove                        348     253     -95
cookie_bpf_check                             157      62     -95
cookie_tcp_reqsk_alloc                       387     290     -97
cookie_v4_check                             1568    1465    -103
reqsk_free                                   105       -    -105
cookie_v6_check                             1519    1412    -107
sock_gen_put                                 187      78    -109
sock_pfree                                   212      82    -130
tcp_try_fastopen                            1818    1683    -135
tcp_v4_rcv                                  3478    3294    -184
reqsk_put                                    306      90    -216
tcp_get_cookie_sock                          551     318    -233
tcp_v6_rcv                                  3404    3141    -263
tcp_conn_request                            2677    2384    -293
Total: Before=24887415, After=24885302, chg -0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:23:06 -08:00
Eric Dumazet
a90765c6f6 tcp: move reqsk_fastopen_remove to net/ipv4/tcp_fastopen.c
This function belongs to TCP stack, not to net/core/request_sock.c

We get rid of the now empty request_sock.c in the following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:23:05 -08:00
Eric Dumazet
d5c5391554 inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c
Only called once from inet_csk_listen_start(), it can be static.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:23:05 -08:00
Eric Dumazet
309dd99421 tcp: split tcp_check_space() in two parts
tcp_check_space() is fat and not inlined.

Move its slow path in (out of line) __tcp_check_space()
and make tcp_check_space() an inline function for better TCP performance.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126)
Function                                     old     new   delta
__tcp_check_space                              -     521    +521
tcp_rcv_established                         1860    1916     +56
tcp_rcv_state_process                       3342    3384     +42
tcp_event_new_data_sent                      248     286     +38
tcp_data_snd_check                            71     106     +35
__pfx___tcp_check_space                        -      16     +16
__pfx_tcp_check_space                         16       -     -16
tcp_check_space                              566       -    -566
Total: Before=24896373, After=24896499, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:37:06 -08:00
Eric Dumazet
7c1db78ff7 tcp: move tcp_rbtree_insert() to tcp_output.c
tcp_rbtree_insert() is primarily used from tcp_output.c
In tcp_input.c, only (slow path) tcp_collapse() uses it.

Move it to tcp_output.c to allow its (auto)inlining to improve
TCP tx fast path.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 4/1 up/down: 445/-115 (330)
Function                                     old     new   delta
tcp_connect                                 4277    4478    +201
tcp_event_new_data_sent                      162     248     +86
tcp_send_synack                              780     862     +82
tcp_fragment                                1185    1261     +76
tcp_collapse                                1524    1409    -115
Total: Before=24896043, After=24896373, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203045110.3499713-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:36:50 -08:00
Eric Dumazet
59b5e7f47c tcp: use __skb_push() in __tcp_transmit_skb()
We trust MAX_TCP_HEADER to be large enough.

Using the inlined version of skb_push() trades 8 bytes
of text for better performance of TCP TX fast path.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8)
Function                                     old     new   delta
__tcp_transmit_skb                          3181    3189      +8
Total: Before=24896035, After=24896043, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260203044226.3489941-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-04 20:36:27 -08:00
Chia-Yu Chang
8ae3e8e6ce tcp: accecn: enable AccECN
Enable Accurate ECN negotiation and request for incoming and
outgoing connections by setting sysctl_tcp_ecn:

+==============+===========================================+
|              |  Highest ECN variant (Accurate ECN, ECN,  |
|   tcp_ecn    |  or no ECN) to be negotiated & requested  |
|              +---------------------+---------------------+
|              | Incoming connection | Outgoing connection |
+==============+=====================+=====================+
|      0       |        No ECN       |        No ECN       |
|      1       |         ECN         |         ECN         |
|      2       |         ECN         |        No ECN       |
+--------------+---------------------+---------------------+
|      3       |     Accurate ECN    |     Accurate ECN    |
|      4       |     Accurate ECN    |         ECN         |
|      5       |     Accurate ECN    |        No ECN       |
+==============+=====================+=====================+

Refer to Documentation/networking/ip-sysctl.rst for more details.
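
For readers who prefer code to tables, the mapping above can be written
as a lookup. This is a plain transcription of the table into Python;
highest_ecn_variant() is a hypothetical helper name, not a kernel
function.

```python
# tcp_ecn value -> (incoming connection, outgoing connection),
# copied from the table in the commit message.
TCP_ECN_MODES = {
    0: ("No ECN", "No ECN"),
    1: ("ECN", "ECN"),
    2: ("ECN", "No ECN"),
    3: ("Accurate ECN", "Accurate ECN"),
    4: ("Accurate ECN", "ECN"),
    5: ("Accurate ECN", "No ECN"),
}


def highest_ecn_variant(tcp_ecn, outgoing):
    """Return the highest ECN variant negotiated or requested.

    outgoing=True selects the "Outgoing connection" column,
    outgoing=False the "Incoming connection" column.
    """
    incoming_variant, outgoing_variant = TCP_ECN_MODES[tcp_ecn]
    return outgoing_variant if outgoing else incoming_variant
```

For example, tcp_ecn=4 accepts Accurate ECN on incoming connections but
requests only classic ECN on outgoing ones.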

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-15-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
4fa4ac5e58 tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info
Add 2-bit tcpi_ecn_mode field within tcp_info to indicate which ECN
mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN,
or ECN_MODE_PENDING. This is done by utilizing available bits from
tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and
tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits).

Also, an extra 24-bit tcpi_options2 field is identified to represent
newer options and connection features, as all 8 bits of tcpi_options
field have been used.
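
The bit budget can be checked with a short sketch. The field widths
(2, 2, 4 and 24 bits) come from the text above; the ordering of the
fields within the 32-bit word and the pack()/unpack() helpers are
assumptions made for illustration only.

```python
# (name, width in bits); widths are from the commit message,
# the order inside the 32-bit word is assumed.
FIELDS = (
    ("tcpi_ecn_mode", 2),
    ("tcpi_accecn_opt_seen", 2),
    ("tcpi_accecn_fail_mode", 4),
    ("tcpi_options2", 24),
)


def pack(values):
    """Pack a dict of field values into one 32-bit integer."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(word):
    """Split a 32-bit integer back into the individual fields."""
    values = {}
    shift = 0
    for name, width in FIELDS:
        values[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return values
```

The four widths add up to exactly 32 bits, which is the space freed by
shrinking the two former 16-bit fields.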

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
1247fb19ca tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST
Detect spurious retransmission of a previously sent ACK carrying the
AccECN option after the second retransmission. Since this might be caused
by the middlebox dropping ACK with options it does not recognize, disable
the sending of the AccECN option in all subsequent ACKs. This patch
follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field
(accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was
sent with duplicate SACK info.

Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl:
(TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and
persistently sends AccECN option once it fits into TCP option space.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:25 +01:00
Chia-Yu Chang
4024081feb tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiation
Based on specification:
  https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt

Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in
AccECN mode MUST NOT set ECT on any packet for the rest of the connection,
if it has received or sent at least one valid SYN or Acceptable SYN/ACK
with (AE,CWR,ECE) = (0,0,0) during the handshake.

In addition, a host in AccECN mode that is feeding back the IP-ECN
field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the
latest valid SYN or acceptable SYN/ACK to arrive.

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-11-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
f326f1f17f tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK
For Accurate ECN, the first SYN/ACK sent by the TCP server shall set
the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the
capability negotiation. However, if the TCP server needs to retransmit
such a SYN/ACK (for example, because it did not receive an ACK
acknowledging its SYN/ACK, or received a second SYN requesting AccECN
support), the TCP server retransmits the SYN/ACK without the AccECN
option. This is because the SYN/ACK may be lost due to congestion, or a
middlebox may block the AccECN option. Furthermore, if this retransmission
also times out, to expedite connection establishment, the TCP server
should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the
AccECN option, while maintaining AccECN feedback mode.

This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768.
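
The retransmission ladder described above can be summarised as a small
decision function. This is a sketch of the message text only:
synack_plan() is a hypothetical name, and the ACE flag combination is
reduced to a boolean because the exact bits depend on Table 1 of
RFC9768.

```python
def synack_plan(retransmit_count):
    """Describe the SYN/ACK sent after `retransmit_count` retransmissions.

    Returns a dict with two booleans:
      ace_flags     - the SYN/ACK still carries the AccECN ACE flags
      accecn_option - the SYN/ACK carries the AccECN option
    """
    if retransmit_count < 0:
        raise ValueError("retransmit_count must be non-negative")
    if retransmit_count == 0:
        # First SYN/ACK: ACE flags plus the AccECN option.
        return {"ace_flags": True, "accecn_option": True}
    if retransmit_count == 1:
        # First retransmission: drop only the option, in case a
        # middlebox blocks it.
        return {"ace_flags": True, "accecn_option": False}
    # Later retransmissions: (AE,CWR,ECE) = (0,0,0) and no option,
    # while the server itself stays in AccECN feedback mode.
    return {"ace_flags": False, "accecn_option": False}
```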

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
f1eaea5585 tcp: add TCP_SYNACK_RETRANS synack_type
Before this patch, retransmitted SYN/ACK did not have a specific
synack_type; however, the upcoming patch needs to distinguish between
retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to
transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this
patch introduces a new synack_type (TCP_SYNACK_RETRANS).

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
3ae62b8b4a tcp: accecn: retransmit downgraded SYN in AccECN negotiation
Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an
AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it
SHOULD attempt to negotiate the use of AccECN at least one more time
by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on
the first retransmitted SYN (using the usual retransmission time-outs).

If this first retransmission also fails to be acknowledged, in
deployment scenarios where AccECN path traversal might be problematic,
the TCP Client SHOULD send subsequent retransmissions of the SYN with
the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0).
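
As a sketch, the client-side rule reads as follows. syn_ecn_flags() is a
hypothetical helper, and it assumes the deployment is one where
downgrading after the first retransmission is wanted.

```python
def syn_ecn_flags(retransmit_count):
    """Return (AE, CWR, ECE) for a SYN after `retransmit_count` retransmissions.

    The original SYN and the first retransmission request AccECN;
    subsequent retransmissions are sent with all three flags cleared.
    """
    if retransmit_count < 0:
        raise ValueError("retransmit_count must be non-negative")
    if retransmit_count <= 1:
        return (1, 1, 1)
    return (0, 0, 0)
```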

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
e68c28f22f tcp: disable RFC3168 fallback identifier for CC modules
When AccECN is not successfully negotiated for a TCP flow, the flow
falls back to classic ECN (RFC3168) by default. However, an L4S service
will fall back to non-ECN.

This patch lets a congestion control module decide whether the flow
should skip the fallback to classic ECN after an unsuccessful AccECN
negotiation. A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168)
identifies this behavior expected by the CA.
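
A minimal model of the resulting decision is shown below. The flag name
is taken from this commit, but its bit value here is made up, and
ecn_mode_after_negotiation() is a hypothetical helper that assumes the
peer is otherwise capable of classic ECN.

```python
# Flag name from the commit message; the bit position is hypothetical.
TCP_CONG_NO_FALLBACK_RFC3168 = 1 << 0


def ecn_mode_after_negotiation(accecn_negotiated, ca_flags):
    """Pick the ECN mode once the AccECN handshake outcome is known."""
    if accecn_negotiated:
        return "accecn"
    if ca_flags & TCP_CONG_NO_FALLBACK_RFC3168:
        # The CA (for example an L4S one) does not want classic ECN.
        return "no-ecn"
    return "rfc3168"
```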

Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Chia-Yu Chang
100f946b8d tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers
Two flags for congestion control (CC) module are added in this patch
related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN)
defines that the CC expects to negotiate AccECN functionality using the
ECE, CWR and AE flags in the TCP header.

Second, during ECN negotiation, ECT(0) in the IP header is used. This
patch enables CC to control whether ECT(0) or ECT(1) should be used on
a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the
expected ECT value in the IP header by the CA when not-yet initialized
for the connection.

The detailed AccECN negotiation can be found in IETF RFC9768.

Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Ilpo Järvinen
ab4c8b6f7f gro: flushing when CWR is set negatively affects AccECN
As AccECN may keep CWR bit asserted due to different
interpretation of the bit, flushing with GRO because of
CWR may effectively disable GRO until AccECN counter
field changes such that CWR-bit becomes 0.

There is no harm done from not immediately forwarding the
CWR'ed segment with RFC3168 ECN.

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Ilpo Järvinen
7885ce0147 tcp: try to avoid safer when ACKs are thinned
Add newly acked pkts EWMA. When ACK thinning occurs, select
between safer and unsafe cep delta in AccECN processing based
on it. If the number of packets ACKed per ACK tends to be large,
don't conservatively assume ACE field overflow.
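
The idea can be illustrated with a fixed-point moving average that fits
in a u16. Everything numeric below is an assumption chosen for the
example (the 1/8 weight, the 6 fractional bits and the threshold), and
the function names are hypothetical; the kernel's actual constants may
differ.

```python
EWMA_WEIGHT_DIV = 8   # hypothetical: each new sample contributes 1/8
FRAC_BITS = 6         # hypothetical fixed-point fraction
U16_MAX = 0xFFFF      # the value is stored in a u16


def update_pkts_acked_ewma(ewma, newly_acked_pkts):
    """Fold one ACK's newly acked packet count into the average."""
    sample = newly_acked_pkts << FRAC_BITS
    ewma = (ewma * (EWMA_WEIGHT_DIV - 1) + sample) // EWMA_WEIGHT_DIV
    return min(ewma, U16_MAX)


def prefer_safer_delta(ewma, threshold_pkts):
    """Assume ACE overflow (the safer delta) only while ACKs cover few packets."""
    return ewma < (threshold_pkts << FRAC_BITS)
```

With a steady stream of ACKs that each cover many packets, the average
rises above the threshold and the conservative overflow assumption is
no longer applied.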

This patch uses the existing 2-byte holes in the rx group for new
u16 variables without creating more holes. Below are the pahole
outcomes before and after this patch:

[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u32                        delivered_ecn_bytes[3]; /*  2744    12 */
    /* XXX 4 bytes hole, try to pack */

    [...]
    __cacheline_group_end__tcp_sock_write_rx[0];       /*  2816     0 */

    [...]
    /* size: 3264, cachelines: 51, members: 177 */
}

[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u32                        delivered_ecn_bytes[3]; /*  2744    12 */
    u16                        pkts_acked_ewma;        /*  2756     2 */
    /* XXX 2 bytes hole, try to pack */

    [...]
    __cacheline_group_end__tcp_sock_write_rx[0];       /*  2816     0 */

    [...]
    /* size: 3264, cachelines: 51, members: 178 */
}

Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-02-03 15:13:24 +01:00
Geliang Tang
2d85088d46 tcp: export tcp_splice_state
Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h
so that they can be used by MPTCP in the next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Acked-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 18:15:32 -08:00
Eric Dumazet
fe8570186f ipv4: use dst4_mtu() instead of dst_mtu()
When we expect an IPv4 dst, use dst4_mtu() instead of dst_mtu()
to save some code space.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260130210303.3888261-8-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-02 17:49:29 -08:00
Mahdi Faramarzpour
820990d665 udp: add drop count for packets in udp_prod_queue
This commit adds an SNMP drop count increment for the packets in the
per-NUMA queues introduced in commit b650bf0977 ("udp: remove busylock
and add per NUMA queues"). Note that the SNMP counters are currently
incremented by the caller for the skb it passes in. The skbs already
sitting on the intermediate queue cannot be counted there, so they
need similar logic in their own error path.

Signed-off-by: Mahdi Faramarzpour <mahdifrmx@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129083806.204752-1-mahdifrmx@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-30 17:18:53 -08:00
Eric Dumazet
ed9b70040d tcp: reduce tcp sockets size by one cache line
By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU,
slub has to use extra storage for the freelist pointer after each
object, because slub assumes that any bit in the object
can be used by RCU readers.

Because proto_register() is also using SLAB_HWCACHE_ALIGN,
this forces slub to use one extra cache line per object.

We can instead put the slub freelist pointer anywhere in the object,
provided that concurrent RCU readers are not supposed to
use the pointer value.

Add a new (struct sock)sk_freeptr field, in a union
with sk_rcu: No RCU readers would need to look at sk_rcu,
which is only used at free phase.

Tested:

grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab}
grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab}

Before:

/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2432
/sys/kernel/slab/TCP/objs_per_slab:13

/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2560
/sys/kernel/slab/TCPv6/objs_per_slab:12

After this patch, we can pack one more TCPv6 object per slab,
and object_size == slab_size.

/sys/kernel/slab/TCP/object_size:2368
/sys/kernel/slab/TCP/slab_size:2368
/sys/kernel/slab/TCP/objs_per_slab:13

/sys/kernel/slab/TCPv6/object_size:2496
/sys/kernel/slab/TCPv6/slab_size:2496
/sys/kernel/slab/TCPv6/objs_per_slab:13
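
The numbers above can be reproduced with a little arithmetic. The sketch
assumes a 64-byte cache line, an 8-byte freelist pointer and a 32 KiB
slab (eight 4 KiB pages); these values are not stated in the commit but
are consistent with the sysfs output. slab_layout() is a hypothetical
helper.

```python
CACHE_LINE = 64          # assumed SLAB_HWCACHE_ALIGN granularity
FREEPTR_SIZE = 8         # assumed pointer size on x86_64
SLAB_BYTES = 8 * 4096    # assumed slab size (order-3, 4 KiB pages)


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def slab_layout(object_size, freeptr_inside_object):
    """Return (slab_size, objs_per_slab) for one object size."""
    size = object_size
    if not freeptr_inside_object:
        # Freelist pointer stored after the object costs extra bytes,
        # which alignment then rounds up to a whole cache line.
        size += FREEPTR_SIZE
    slab_size = align_up(size, CACHE_LINE)
    return slab_size, SLAB_BYTES // slab_size
```

Under these assumptions, TCPv6 goes from (2560, 12) to (2496, 13),
matching the before and after listings.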

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-30 17:15:51 -08:00
Jakub Kicinski
a010fe8d86 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.19-rc8).

No adjacent changes, conflicts:

drivers/net/ethernet/spacemit/k1_emac.c
  2c84959167 ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()")
  f66086798f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-29 17:28:54 -08:00
Jibin Zhang
426ca15c7f net: fix segmentation of forwarding fraglist GRO
This patch enhances GSO segment handling by properly checking
the SKB_GSO_DODGY flag for frag_list GSO packets, addressing
low throughput issues observed when a station accesses IPv4
servers via hotspots with an IPv6-only upstream interface.

Specifically, it fixes a bug in GSO segmentation when forwarding
GRO packets containing a frag_list. The function skb_segment_list
cannot correctly process GRO skbs that have been converted by XLAT,
since XLAT only translates the header of the head skb. Consequently,
skbs in the frag_list may remain untranslated, resulting in protocol
inconsistencies and reduced throughput.

To address this, the patch explicitly sets the SKB_GSO_DODGY flag
for GSO packets in XLAT's IPv4/IPv6 protocol translation helpers
(bpf_skb_proto_4_to_6 and bpf_skb_proto_6_to_4). This marks GSO
packets as potentially modified after protocol translation. As a
result, GSO segmentation will avoid using skb_segment_list and
instead fall back to skb_segment for packets with the SKB_GSO_DODGY
flag. This ensures that only safe and fully translated frag_list
packets are processed by skb_segment_list, resolving protocol
inconsistencies and improving throughput when forwarding GRO packets
converted by XLAT.
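
The selection logic described above can be modelled as follows. The flag
names are the kernel's, but the bit values used here are placeholders,
and translate_proto() and choose_segmenter() are hypothetical stand-ins
for the BPF translation helpers and the GSO segmentation path.

```python
# Flag names from the commit message; bit values are placeholders.
SKB_GSO_DODGY = 1 << 0
SKB_GSO_FRAGLIST = 1 << 1


def translate_proto(gso_type):
    """Model of an XLAT 4-to-6 or 6-to-4 translation on a GSO packet.

    Only the head skb header is rewritten, so the packet is marked
    as potentially inconsistent.
    """
    return gso_type | SKB_GSO_DODGY


def choose_segmenter(gso_type):
    """Pick the segmentation routine for a GSO packet."""
    if gso_type & SKB_GSO_FRAGLIST and not gso_type & SKB_GSO_DODGY:
        return "skb_segment_list"
    return "skb_segment"
```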

Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com>
Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260126152114.1211-1-jibin.zhang@mediatek.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-29 14:40:12 +01:00
Eric Dumazet
838eb96876 tcp: tcp_tx_timestamp() must look at the rtx queue
tcp_tx_timestamp() is only called at the end of tcp_sendmsg_locked()
before the final tcp_push().

By the time it is called, it is possible all the copied data
has been sent already (transmit queue is empty).

If this is the case, use the last skb in the rtx queue.
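
A rough model of the lookup order is given below. The queues are shown
as plain Python lists ordered by sequence number, whereas the kernel
uses a list for the write queue and an rb-tree for the rtx queue;
skb_for_tx_timestamp() is a hypothetical name.

```python
def skb_for_tx_timestamp(write_queue, rtx_queue):
    """Return the skb that should carry the tx timestamp request.

    Prefer the tail of the write queue. If everything copied by
    sendmsg() has already been transmitted, fall back to the last
    skb in the retransmit queue.
    """
    if write_queue:
        return write_queue[-1]
    if rtx_queue:
        return rtx_queue[-1]
    return None
```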

Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260127123828.4098577-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:35:35 -08:00
Kuniyuki Iwashima
5b71de34b7 ipv4: Use EXPORT_IPV6_MOD_GPL() for ip_fib_metrics_init().
ip_fib_metrics_init() is only called from fib_create_info()
and ip6_route_info_create().

Let's use EXPORT_IPV6_MOD_GPL() instead.

Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260127081335.646666-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:33:38 -08:00
Kuniyuki Iwashima
6e84fc395e ipv4: fib: Annotate access to struct fib_alias.fa_state.
syzbot reported that struct fib_alias.fa_state can be
modified locklessly by RCU readers. [0]

Let's use READ_ONCE()/WRITE_ONCE() properly.

[0]:
BUG: KCSAN: data-race in fib_table_lookup / fib_table_lookup

write to 0xffff88811b06a7fa of 1 bytes by task 4167 on cpu 0:
 fib_alias_accessed net/ipv4/fib_lookup.h:32 [inline]
 fib_table_lookup+0x361/0xd60 net/ipv4/fib_trie.c:1565
 fib_lookup include/net/ip_fib.h:390 [inline]
 ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814
 ip_route_output_key_hash net/ipv4/route.c:2705 [inline]
 __ip_route_output_key include/net/route.h:169 [inline]
 ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932
 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450
 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859
 sock_sendmsg_nosec net/socket.c:727 [inline]
 __sock_sendmsg net/socket.c:742 [inline]
 ____sys_sendmsg+0x53a/0x600 net/socket.c:2592
 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646
 __sys_sendmmsg+0x185/0x320 net/socket.c:2735
 __do_sys_sendmmsg net/socket.c:2762 [inline]
 __se_sys_sendmmsg net/socket.c:2759 [inline]
 __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759
 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811b06a7fa of 1 bytes by task 4168 on cpu 1:
 fib_alias_accessed net/ipv4/fib_lookup.h:31 [inline]
 fib_table_lookup+0x338/0xd60 net/ipv4/fib_trie.c:1565
 fib_lookup include/net/ip_fib.h:390 [inline]
 ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814
 ip_route_output_key_hash net/ipv4/route.c:2705 [inline]
 __ip_route_output_key include/net/route.h:169 [inline]
 ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932
 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450
 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859
 sock_sendmsg_nosec net/socket.c:727 [inline]
 __sock_sendmsg net/socket.c:742 [inline]
 ____sys_sendmsg+0x53a/0x600 net/socket.c:2592
 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646
 __sys_sendmmsg+0x185/0x320 net/socket.c:2735
 __do_sys_sendmmsg net/socket.c:2762 [inline]
 __se_sys_sendmmsg net/socket.c:2759 [inline]
 __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759
 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x00 -> 0x01

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 4168 Comm: syz.4.206 Not tainted syzkaller #0 PREEMPT(voluntary)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025

Reported-by: syzbot+d24f940f770afda885cf@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69783ead.050a0220.c9109.0013.GAE@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260127043528.514160-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:33:07 -08:00
Eric Dumazet
d5fb143dbe tcp: move tcp_rack_advance() to tcp_input.c
tcp_rack_advance() is called from tcp_ack() and tcp_sacktag_one().

Moving it to tcp_input.c allows the compiler to inline it and save
both space and cpu cycles in TCP fast path.

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 0/2 grow/shrink: 1/1 up/down: 98/-132 (-34)
Function                                     old     new   delta
tcp_ack                                     5741    5839     +98
tcp_sacktag_one                              407     395     -12
__pfx_tcp_rack_advance                        16       -     -16
tcp_rack_advance                             104       -    -104
Total: Before=22572680, After=22572646, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260127032147.3498272-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:31:51 -08:00
Eric Dumazet
629a68865a tcp: move tcp_rack_update_reo_wnd() to tcp_input.c
tcp_rack_update_reo_wnd() is called only once from tcp_ack()

Move it to tcp_input.c so that it can be inlined by the compiler
to save space and cpu cycles.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 110/-153 (-43)
Function                                     old     new   delta
tcp_ack                                     5631    5741    +110
__pfx_tcp_rack_update_reo_wnd                 16       -     -16
tcp_rack_update_reo_wnd                      137       -    -137
Total: Before=22572723, After=22572680, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260127032147.3498272-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:31:51 -08:00
Eric Dumazet
773a700213 tcp: mark tcp_process_tlp_ack() as unlikely
It is unlikely we have to call tcp_process_tlp_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260127032147.3498272-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-28 19:31:51 -08:00
Gal Pressman
b10b446ce7 udp: gso: Use single MSS length in UDP header for GSO_PARTIAL
In GSO_PARTIAL segmentation, set the UDP length field to the single
segment size (gso_size + UDP header) instead of the large MSS size.
This provides hardware with a template length value for final
segmentation, similar to how tunnel GSO_PARTIAL handles outer headers
in UDP tunnels.
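
In numbers, the template length is simply one segment's payload plus the
UDP header. The helper below is a hypothetical illustration that assumes
the standard 8-byte UDP header.

```python
UDP_HEADER_LEN = 8  # size of a UDP header in bytes


def gso_partial_udp_len(gso_size):
    """UDP length field written for GSO_PARTIAL: a single segment's size."""
    length = gso_size + UDP_HEADER_LEN
    if not 0 < length <= 0xFFFF:
        raise ValueError("UDP length must fit in 16 bits")
    return length
```

For a gso_size of 1400 bytes, the header would carry 1408 regardless of
how many segments the super-packet contains.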

This will remove the need to manually adjust the UDP header length in
the drivers, as can be seen in subsequent patches.

This was suggested by Alex in 2018:
https://lore.kernel.org/netdev/CAKgT0UcdnUWgr3KQ=RnLKigokkiUuYefmL-ePpDvJOBNpKScFA@mail.gmail.com/

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260125121649.778086-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-27 17:30:51 -08:00
Jiayuan Chen
929e30f931 bpf, sockmap: Fix FIONREAD for sockmap
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new msg_tot_len field in the psock structure to record
the data length in ingress_msg. Additionally, we implement new ioctl
interfaces for TCP and UDP to intercept FIONREAD operations.

Note that we intentionally do not include sk_receive_queue data in the
FIONREAD result. Data in sk_receive_queue has not yet been processed by
the BPF verdict program, and may be redirected to other sockets or
dropped. Including it would create semantic ambiguity since this data
may never be readable by the user.
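
The accounting described above can be pictured with a small user-space model. This is only a sketch under simplifying assumptions: struct psock_model, rcvq_len and all function names are hypothetical, and callers are assumed to pass lengths that do not exceed what is queued. Only msg_tot_len is a name taken from the commit text.

```c
#include <stddef.h>

/* Hypothetical model of the two places data can sit. */
struct psock_model {
	size_t msg_tot_len; /* bytes queued in ingress_msg */
	size_t rcvq_len;    /* bytes in sk_receive_queue, verdict not yet run */
};

/* Data arrives from the native stack; the verdict program has not seen it. */
static void rcvq_arrive(struct psock_model *p, size_t len)
{
	p->rcvq_len += len;
}

/* Verdict passes data to this socket: it moves into ingress_msg. */
static void verdict_pass(struct psock_model *p, size_t len)
{
	p->rcvq_len -= len;
	p->msg_tot_len += len;
}

/* Verdict drops or redirects data: it never becomes readable here. */
static void verdict_drop(struct psock_model *p, size_t len)
{
	p->rcvq_len -= len;
}

/* The user reads len bytes out of ingress_msg. */
static void ingress_read(struct psock_model *p, size_t len)
{
	p->msg_tot_len -= len;
}

/* FIONREAD in this model reports ingress_msg bytes only. */
static size_t fionread_model(const struct psock_model *p)
{
	return p->msg_tot_len;
}
```

In this model, bytes that are later dropped or redirected never show up in the reported count, which is the ambiguity the commit message says it wants to avoid.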

Unix and VSOCK sockets have similar issues, but fixing them is outside
the scope of this patch as it would require more intrusive changes.

Previous work by John Fastabend made some efforts towards FIONREAD support:
commit e5c6de5fa0 ("bpf, sockmap: Incorrectly handling copied_seq")
Although the current patch is based on the previous work by John Fastabend,
it is acceptable for our Fixes tag to point to the same commit.

                                                      FD1:read()
                                                      --  FD1->copied_seq++
                                                          |  [read data]
                                                          |
                                   [enqueue data]         v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27 09:11:30 -08:00
Jiayuan Chen
b40cc5adaa bpf, sockmap: Fix incorrect copied_seq calculation
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that when reading from ingress_msg, we update tp->copied_seq
by default. However, if the data is not from its own protocol stack,
tp->rcv_nxt is not increased. Later, if we convert this socket to a
native socket, reading from this socket may fail because copied_seq might
be significantly larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg. When reading, we update
copied_seq only if the data is from its own protocol stack.
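
A minimal sketch of the rule, assuming a per-message flag that records whether the data came from the socket's own protocol stack. The struct and function names below (sk_msg_model, tcp_model, from_self and so on) are hypothetical and chosen for illustration; only copied_seq and rcv_nxt come from the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one entry in the ingress_msg queue. */
struct sk_msg_model {
	uint32_t len;
	bool from_self; /* true if enqueued by this socket's own stack */
};

/* Hypothetical model of the two sequence counters involved. */
struct tcp_model {
	uint32_t copied_seq;
	uint32_t rcv_nxt;
};

/* Data received by the socket's own stack advances rcv_nxt. */
static void native_rx_model(struct tcp_model *tp, uint32_t len)
{
	tp->rcv_nxt += len;
}

/*
 * Reading a message from ingress_msg advances copied_seq only when the
 * message originated from this socket's own stack, so copied_seq can
 * never run ahead of rcv_nxt because of redirected data.
 */
static void read_ingress_model(struct tcp_model *tp,
			       const struct sk_msg_model *msg)
{
	if (msg->from_self)
		tp->copied_seq += msg->len;
}
```

With this rule, reading 100 bytes of own data plus 50 bytes redirected from another socket leaves copied_seq equal to rcv_nxt, whereas unconditional updates would leave it 50 bytes ahead.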

                                                      FD1:read()
                                                      --  FD1->copied_seq++
                                                          |  [read data]
                                                          |
                                   [enqueue data]         v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27 09:11:30 -08:00
Eric Dumazet
a18056a6c1 tcp: move sk_forced_mem_schedule() to tcp.c
TCP fast path can (auto)inline this helper, instead
of (auto)inlining it from tcp_send_fin().

No change of overall code size, but tcp_sendmsg() is faster.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 1/1 up/down: 141/-140 (1)
Function                                     old     new   delta
tcp_stream_alloc_skb                         216     357    +141
tcp_send_fin                                 688     548    -140
Total: Before=22236729, After=22236730, chg +0.00%

BTW, we might change tcp_send_fin() to use tcp_stream_alloc_skb().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20260123111605.4089200-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-27 14:58:13 +01:00