7144 Commits

Author SHA1 Message Date
Kees Cook
189f164e57 Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses
Conversion performed via this Coccinelle script:

  // SPDX-License-Identifier: GPL-2.0-only
  // Options: --include-headers-for-types --all-includes --include-headers --keep-comments
  virtual patch

  @gfp depends on patch && !(file in "tools") && !(file in "samples")@
  identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
 		    kzalloc_obj,kzalloc_objs,kzalloc_flex,
		    kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
		    kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
  @@

  	ALLOC(...
  -		, GFP_KERNEL
  	)

  $ make coccicheck MODE=patch COCCI=gfp.cocci

Build and boot tested x86_64 with Fedora 42's GCC and Clang:

Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22 08:26:33 -08:00
Linus Torvalds
32a92f8c89 Convert more 'alloc_obj' cases to default GFP_KERNEL arguments
This converts some of the visually simpler cases that have been split
over multiple lines.  I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.

Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script.  I probably had made it a bit _too_ trivial.

So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.

The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 20:03:00 -08:00
Linus Torvalds
323bbfcf1e Convert 'alloc_flex' family to use the new default GFP_KERNEL argument
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.

As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Linus Torvalds
bf4afc53b7 Convert 'alloc_obj' family to use the new default GFP_KERNEL argument
This was done entirely with mindless brute force, using

    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'

to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.

Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.

For the same reason the 'flex' versions will be done as a separate
conversion.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21 17:09:51 -08:00
Kees Cook
69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Inseo An
71e99ee20f netfilter: nf_tables: fix use-after-free in nf_tables_addchain()
nf_tables_addchain() publishes the chain to table->chains via
list_add_tail_rcu() (in nft_chain_add()) before registering hooks.
If nf_tables_register_hook() then fails, the error path calls
nft_chain_del() (list_del_rcu()) followed by nf_tables_chain_destroy()
with no RCU grace period in between.

This creates two use-after-free conditions:

 1) Control-plane: nf_tables_dump_chains() traverses table->chains
    under rcu_read_lock(). A concurrent dump can still be walking
    the chain when the error path frees it.

 2) Packet path: for NFPROTO_INET, nf_register_net_hook() briefly
    installs the IPv4 hook before IPv6 registration fails.  Packets
    entering nft_do_chain() via the transient IPv4 hook can still be
    dereferencing chain->blob_gen_X when the error path frees the
    chain.

Add synchronize_rcu() between nft_chain_del() and the chain destroy
so that all RCU readers -- both dump threads and in-flight packet
evaluation -- have finished before the chain is freed.

Fixes: 91c7b38dc9 ("netfilter: nf_tables: use new transaction infrastructure to handle chain")
Signed-off-by: Inseo An <y0un9sa@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Julian Anastasov
8fde939b02 ipvs: do not keep dest_dst if dev is going down
There is race between the netdev notifier ip_vs_dst_event()
and the code that caches dst with dev that is going down.
As the FIB can be notified for the closed device after our
handler finishes, it is possible valid route to be returned
and cached resuling in a leaked dev reference until the dest
is not removed.

To prevent new dest_dst to be attached to dest just after the
handler dropped the old one, add a netif_running() check
to make sure the notifier handler is not currently running
for device that is closing.

Fixes: 7a4f0761fc ("IPVS: init and cleanup restructuring")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Julian Anastasov
05cfe9863e ipvs: skip ipv6 extension headers for csum checks
Protocol checksum validation fails for IPv6 if there are extension
headers before the protocol header. iph->len already contains its
offset, so use it to fix the problem.

Fixes: 2906f66a56 ("ipvs: SCTP Trasport Loadbalancing Support")
Fixes: 0bbdd42b7e ("IPVS: Extend protocol DNAT/SNAT and state handlers")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Florian Westphal
a6d28eb8ef netfilter: nf_conntrack_h323: don't pass uninitialised l3num value
Mihail Milev reports: Error: UNINIT (CWE-457):
 net/netfilter/nf_conntrack_h323_main.c:1189:2: var_decl:
	Declaring variable "tuple" without initializer.
 net/netfilter/nf_conntrack_h323_main.c:1197:2:
	uninit_use_in_call: Using uninitialized value "tuple.src.l3num" when calling "__nf_ct_expect_find".
 net/netfilter/nf_conntrack_expect.c:142:2:
	read_value: Reading value "tuple->src.l3num" when calling "nf_ct_expect_dst_hash".

  1195|   	tuple.dst.protonum = IPPROTO_TCP;
  1196|
  1197|-> 	exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
  1198|   	if (exp && exp->master == ct)
  1199|   		return exp;

Switch this to a C99 initialiser and set the l3num value.

Fixes: f587de0e2f ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Brian Witte
7f261bb906 netfilter: nf_tables: revert commit_mutex usage in reset path
It causes circular lock dependency between commit_mutex, nfnl_subsys_ipset
and nlk_cb_mutex when nft reset, ipset list, and iptables-nft with '-m set'
rule run at the same time.

Previous patches made it safe to run individual reset handlers concurrently
so commit_mutex is no longer required to prevent this.

Fixes: bd662c4218 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests")
Fixes: 3d483faa66 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests")
Fixes: 3cb03edb4d ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests")
Link: https://lore.kernel.org/all/aUh_3mVRV8OrGsVo@strlen.de/
Reported-by: <syzbot+ff16b505ec9152e5f448@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=ff16b505ec9152e5f448
Signed-off-by: Brian Witte <brianwitte@mailfence.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Brian Witte
30c4d7fb59 netfilter: nft_quota: use atomic64_xchg for reset
Use atomic64_xchg() to atomically read and zero the consumed value
on reset, which is simpler than the previous read+sub pattern and
doesn't require lock serialization.

Fixes: bd662c4218 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests")
Fixes: 3d483faa66 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests")
Fixes: 3cb03edb4d ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests")
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Brian Witte <brianwitte@mailfence.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Brian Witte
779c60a519 netfilter: nft_counter: serialize reset with spinlock
Add a global static spinlock to serialize counter fetch+reset
operations, preventing concurrent dump-and-reset from underrunning
values.

The lock is taken before fetching the total so that two parallel
resets cannot both read the same counter values and then both
subtract them.

A global lock is used for simplicity since resets are infrequent.
If this becomes a bottleneck, it can be replaced with a per-net
lock later.

Fixes: bd662c4218 ("netfilter: nf_tables: Add locking for NFT_MSG_GETOBJ_RESET requests")
Fixes: 3d483faa66 ("netfilter: nf_tables: Add locking for NFT_MSG_GETSETELEM_RESET requests")
Fixes: 3cb03edb4d ("netfilter: nf_tables: Add locking for NFT_MSG_GETRULE_RESET requests")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Brian Witte <brianwitte@mailfence.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Sun Jian
07919126ec netfilter: annotate NAT helper hook pointers with __rcu
The NAT helper hook pointers are updated and dereferenced under RCU rules,
but lack the proper __rcu annotation.

This makes sparse report address space mismatches when the hooks are used
with rcu_dereference().

Add the missing __rcu annotations to the global hook pointer declarations
and definitions in Amanda, FTP, IRC, SNMP and TFTP.

No functional change intended.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17 15:04:20 +01:00
Linus Torvalds
37a93dd5c4 Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
 "Core & protocols:

   - A significant effort all around the stack to guide the compiler to
     make the right choice when inlining code, to avoid unneeded calls
     for small helper and stack canary overhead in the fast-path.

     This generates better and faster code with very small or no text
     size increases, as in many cases the call generated more code than
     the actual inlined helper.

   - Extend AccECN implementation so that is now functionally complete,
     also allow the user-space enabling it on a per network namespace
     basis.

   - Add support for memory providers with large (above 4K) rx buffer.
     Paired with hw-gro, larger rx buffer sizes reduce the number of
     buffers traversing the stack, dincreasing single stream CPU usage
     by up to ~30%.

   - Do not add HBH header to Big TCP GSO packets. This simplifies the
     RX path, the TX path and the NIC drivers, and is possible because
     user-space taps can now interpret correctly such packets without
     the HBH hint.

   - Allow IPv6 routes to be configured with a gateway address that is
     resolved out of a different interface than the one specified,
     aligning IPv6 to IPv4 behavior.

   - Multi-queue aware sch_cake. This makes it possible to scale the
     rate shaper of sch_cake across multiple CPUs, while still enforcing
     a single global rate on the interface.

   - Add support for the nbcon (new buffer console) infrastructure to
     netconsole, enabling lock-free, priority-based console operations
     that are safer in crash scenarios.

   - Improve the TCP ipv6 output path to cache the flow information,
     saving cpu cycles, reducing cache line misses and stack use.

   - Improve netfilter packet tracker to resolve clashes for most
     protocols, avoiding unneeded drops on rare occasions.

   - Add IP6IP6 tunneling acceleration to the flowtable infrastructure.

   - Reduce tcp socket size by one cache line.

   - Notify neighbour changes atomically, avoiding inconsistencies
     between the notification sequence and the actual states sequence.

   - Add vsock namespace support, allowing complete isolation of vsocks
     across different network namespaces.

   - Improve xsk generic performances with cache-alignment-oriented
     optimizations.

   - Support netconsole automatic target recovery, allowing netconsole
     to reestablish targets when underlying low-level interface comes
     back online.

  Driver API:

   - Support for switching the working mode (automatic vs manual) of a
     DPLL device via netlink.

   - Introduce PHY ports representation to expose multiple front-facing
     media ports over a single MAC.

   - Introduce "rx-polarity" and "tx-polarity" device tree properties,
     to generalize polarity inversion requirements for differential
     signaling.

   - Add helper to create, prepare and enable managed clocks.

  Device drivers:

   - Add Huawei hinic3 PF etherner driver.

   - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet
     controller.

   - Add ethernet driver for MaxLinear MxL862xx switches

   - Remove parallel-port Ethernet driver.

   - Convert existing driver timestamp configuration reporting to
     hwtstamp_get and remove legacy ioctl().

   - Convert existing drivers to .get_rx_ring_count(), simplifing the RX
     ring count retrieval. Also remove the legacy fallback path.

   - Ethernet high-speed NICs:
      - Broadcom (bnxt, bng):
         - bnxt: add FW interface update to support FEC stats histogram
           and NVRAM defragmentation
         - bng: add TSO and H/W GRO support
      - nVidia/Mellanox (mlx5):
         - improve latency of channel restart operations, reducing the
           used H/W resources
         - add TSO support for UDP over GRE over VLAN
         - add flow counters support for hardware steering (HWS) rules
         - use a static memory area to store headers for H/W GRO,
           leading to 12% RX tput improvement
      - Intel (100G, ice, idpf):
         - ice: reorganizes layout of Tx and Rx rings for cacheline
           locality and utilizes __cacheline_group* macros on the new
           layouts
         - ice: introduces Synchronous Ethernet (SyncE) support
      - Meta (fbnic):
         - adds debugfs for firmware mailbox and tx/rx rings vectors

   - Ethernet virtual:
      - geneve: introduce GRO/GSO support for double UDP encapsulation

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - some code refactoring and cleanups
      - RealTek (r8169):
         - add support for RTL8127ATF (10G Fiber SFP)
         - add dash and LTR support
      - Airoha:
         - AN8811HB 2.5 Gbps phy support
      - Freescale (fec):
         - add XDP zero-copy support
      - Thunderbolt:
         - add get link setting support to allow bonding
      - Renesas:
         - add support for RZ/G3L GBETH SoC

   - Ethernet switches:
      - Maxlinear:
         - support R(G)MII slow rate configuration
         - add support for Intel GSW150
      - Motorcomm (yt921x):
         - add DCB/QoS support
      - TI:
         - icssm-prueth: support bridging (STP/RSTP) via the switchdev
           framework

   - Ethernet PHYs:
      - Realtek:
         - enable SGMII and 2500Base-X in-band auto-negotiation
         - simplify and reunify C22/C45 drivers
      - Micrel: convert bindings to DT schema

   - CAN:
      - move skb headroom content into skb extensions, making CAN
        metadata access more robust

   - CAN drivers:
      - rcar_canfd:
         - add support for FD-only mode
         - add support for the RZ/T2H SoC
      - sja1000: cleanup the CAN state handling

   - WiFi:
      - implement EPPKE/802.1X over auth frames support
      - split up drop reasons better, removing generic RX_DROP
      - additional FTM capabilities: 6 GHz support, supported number of
        spatial streams and supported number of LTF repetitions
      - better mac80211 iterators to enumerate resources
      - initial UHR (Wi-Fi 8) support for cfg80211/mac80211

   - WiFi drivers:
      - Qualcomm/Atheros:
         - ath11k: support for Channel Frequency Response measurement
         - ath12k: a significant driver refactor to support multi-wiphy
           devices and and pave the way for future device support in the
           same driver (rather than splitting to ath13k)
         - ath12k: support for the QCC2072 chipset
      - Intel:
         - iwlwifi: partial Neighbor Awareness Networking (NAN) support
         - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn
      - RealTek (rtw89):
         - preparations for RTL8922DE support

   - Bluetooth:
      - implement setsockopt(BT_PHY) to set the connection packet type/PHY
      - set link_policy on incoming ACL connections

   - Bluetooth drivers:
      - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE
      - btqca: add WCN6855 firmware priority selection feature"

* tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits)
  bnge/bng_re: Add a new HSI
  net: macb: Fix tx/rx malfunction after phy link down and up
  af_unix: Fix memleak of newsk in unix_stream_connect().
  net: ti: icssg-prueth: Add optional dependency on HSR
  net: dsa: add basic initial driver for MxL862xx switches
  net: mdio: add unlocked mdiodev C45 bus accessors
  net: dsa: add tag format for MxL862xx switches
  dt-bindings: net: dsa: add MaxLinear MxL862xx
  selftests: drivers: net: hw: Modify toeplitz.c to poll for packets
  octeontx2-pf: Unregister devlink on probe failure
  net: renesas: rswitch: fix forwarding offload statemachine
  ionic: Rate limit unknown xcvr type messages
  tcp: inet6_csk_xmit() optimization
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock()
  tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect()
  ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6
  ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update()
  ipv6: use np->final in inet6_sk_rebuild_header()
  ipv6: add daddr/final storage in struct ipv6_pinfo
  net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup()
  ...
2026-02-11 19:31:52 -08:00
Jakub Kicinski
792aaea994 Merge tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:

====================
netfilter: updates for net-next

The following patchset contains Netfilter updates for *net-next*:

1) Fix net-next-only use-after-free bug in nf_tables rbtree set:
   Expired elements cannot be released right away after unlink anymore
   because there is no guarantee that the binary-search blob is going to
   be updated.  Spotted by syzkaller.

2) Fix esoteric bug in nf_queue with udp fraglist gro, broken since
   6.11. Patch 3 adds extends the nfqueue selftest for this.

4) Use dedicated slab for flowtable entries, currently the -512 cache
   is used, which is wasteful.  From Qingfang Deng.

5) Recent net-next update extended existing test for ip6ip6 tunnels, add
   the required /config entry.  Test still passed by accident because the
   previous tests network setup gets re-used, so also update the test so
   it will fail in case the ip6ip6 tunnel interface cannot be added.

6) Fix 'nft get element mytable myset { 1.2.3.4 }' on big endian
   platforms, this was broken since code was added in v5.1.

7) Fix nf_tables counter reset support on 32bit platforms, where counter
   reset may cause huge values to appear due to wraparound.
   Broken since reset feature was added in v6.11.  From Anders Grahn.

8-11) update nf_tables rbtree set type to detect partial
   operlaps.  This will eventually speed up nftables userspace: at this
   time userspace does a netlink dump of the set content which slows down
   incremental updates on interval sets.  From Pablo Neira Ayuso.

* tag 'nf-next-26-02-06' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nft_set_rbtree: validate open interval overlap
  netfilter: nft_set_rbtree: validate element belonging to interval
  netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets
  netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval
  netfilter: nft_counter: fix reset of counters on 32bit archs
  netfilter: nft_set_hash: fix get operation on big endian
  selftests: netfilter: add IPV6_TUNNEL to config
  netfilter: flowtable: dedicated slab for flow entry
  selftests: netfilter: nft_queue.sh: add udp fraglist gro test case
  netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation
  netfilter: nft_set_rbtree: don't gc elements on insert
====================

Link: https://patch.msgid.link/20260206153048.17570-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-10 20:25:38 -08:00
Linus Torvalds
f17b474e36 Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Pull bpf updates from Alexei Starovoitov:

 - Support associating BPF program with struct_ops (Amery Hung)

 - Switch BPF local storage to rqspinlock and remove recursion detection
   counters which were causing false positives (Amery Hung)

 - Fix live registers marking for indirect jumps (Anton Protopopov)

 - Introduce execution context detection BPF helpers (Changwoo Min)

 - Improve verifier precision for 32bit sign extension pattern
   (Cupertino Miranda)

 - Optimize BTF type lookup by sorting vmlinux BTF and doing binary
   search (Donglin Peng)

 - Allow states pruning for misc/invalid slots in iterator loops (Eduard
   Zingerman)

 - In preparation for ASAN support in BPF arenas teach libbpf to move
   global BPF variables to the end of the region and enable arena kfuncs
   while holding locks (Emil Tsalapatis)

 - Introduce support for implicit arguments in kfuncs and migrate a
   number of them to new API. This is a prerequisite for cgroup
   sub-schedulers in sched-ext (Ihor Solodrai)

 - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen)

 - Fix ORC stack unwind from kprobe_multi (Jiri Olsa)

 - Speed up fentry attach by using single ftrace direct ops in BPF
   trampolines (Jiri Olsa)

 - Require frozen map for calculating map hash (KP Singh)

 - Fix lock entry creation in TAS fallback in rqspinlock (Kumar
   Kartikeya Dwivedi)

 - Allow user space to select cpu in lookup/update operations on per-cpu
   array and hash maps (Leon Hwang)

 - Make kfuncs return trusted pointers by default (Matt Bobrowski)

 - Introduce "fsession" support where single BPF program is executed
   upon entry and exit from traced kernel function (Menglong Dong)

 - Allow bpf_timer and bpf_wq use in all programs types (Mykyta
   Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei
   Starovoitov)

 - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their
   definition across the tree (Puranjay Mohan)

 - Allow BPF arena calls from non-sleepable context (Puranjay Mohan)

 - Improve register id comparison logic in the verifier and extend
   linked registers with negative offsets (Puranjay Mohan)

 - In preparation for BPF-OOM introduce kfuncs to access memcg events
   (Roman Gushchin)

 - Use CFI compatible destructor kfunc type (Sami Tolvanen)

 - Add bitwise tracking for BPF_END in the verifier (Tianci Cao)

 - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou
   Tang)

 - Make BPF selftests work with 64k page size (Yonghong Song)

* tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits)
  selftests/bpf: Fix outdated test on storage->smap
  selftests/bpf: Choose another percpu variable in bpf for btf_dump test
  selftests/bpf: Remove test_task_storage_map_stress_lookup
  selftests/bpf: Update task_local_storage/task_storage_nodeadlock test
  selftests/bpf: Update task_local_storage/recursion test
  selftests/bpf: Update sk_storage_omem_uncharge test
  bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy}
  bpf: Support lockless unlink when freeing map or local storage
  bpf: Prepare for bpf_selem_unlink_nofail()
  bpf: Remove unused percpu counter from bpf_local_storage_map_free
  bpf: Remove cgroup local storage percpu counter
  bpf: Remove task local storage percpu counter
  bpf: Change local_storage->lock and b->lock to rqspinlock
  bpf: Convert bpf_selem_unlink to failable
  bpf: Convert bpf_selem_link_map to failable
  bpf: Convert bpf_selem_unlink_map to failable
  bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
  selftests/xsk: fix number of Tx frags in invalid packet
  selftests/xsk: properly handle batch ending in the middle of a packet
  bpf: Prevent reentrance into call_rcu_tasks_trace()
  ...
2026-02-10 11:26:21 -08:00
Linus Torvalds
698749164a Merge tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:

 - Improve the NETFILTER_PKT audit records

   Add source and destination ports to the NETFILTER_PKT audit records
   while also consolidating a lot of the code into a new, singular
   audit_log_nf_skb() function. This new approach to structuring the
   NETFILTER_PKT record generation should eliminate some unnecessary
   overhead when audit is not built into the kernel.

 - Update the audit syscall classifier code

   Add the listxattrat(), getxattrat(), and fchmodat2() syscall to the
   audit code which classifies syscalls into categories of operations,
   e.g. "read" or "change attributes".

 - Move the syscall classifier declarations into audit_arch.h

   Shuffle around some header file declarations to resolve some sparse
   warnings.

* tag 'audit-pr-20260203' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: move the compat_xxx_class[] extern declarations to audit_arch.h
  audit: add missing syscalls to read class
  audit: include source and destination ports to NETFILTER_PKT
  audit: add audit_log_nf_skb helper function
  audit: add fchmodat2() to change attributes class
2026-02-09 10:13:03 -08:00
Alice Mikityanska
b2936b4fd5 net/ipv6: Introduce payload_len helpers
The next commits will transition away from using the hop-by-hop
extension header to encode packet length for BIG TCP. Add wrappers
around ip6->payload_len that return the actual value if it's non-zero,
and calculate it from skb->len if payload_len is set to zero (and a
symmetrical setter).

The new helpers are used wherever the surrounding code supports the
hop-by-hop jumbo header for BIG TCP IPv6, or the corresponding IPv4 code
uses skb_ip_totlen (e.g., in include/net/netfilter/nf_tables_ipv6.h).

No behavioral change in this commit.

Signed-off-by: Alice Mikityanska <alice@isovalent.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260205133925.526371-2-alice.kernel@fastmail.im
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-06 20:50:03 -08:00
Pablo Neira Ayuso
648946966a netfilter: nft_set_rbtree: validate open interval overlap
Open intervals do not have an end element, in particular an open
interval at the end of the set is hard to validate because of it is
lacking the end element, and interval validation relies on such end
element to perform the checks.

This patch adds a new flag field to struct nft_set_elem, this is not an
issue because this is a temporary object that is allocated in the stack
from the insert/deactivate path. This flag field is used to specify that
this is the last element in this add/delete command.

The last flag is used, in combination with the start element cookie, to
check if there is a partial overlap, eg.

   Already exists:   255.255.255.0-255.255.255.254
   Add interval:     255.255.255.0-255.255.255.255
                     ~~~~~~~~~~~~~
             start element overlap

Basically, the idea is to check for an existing end element in the set
if there is an overlap with an existing start element.

However, the last open interval can come in any position in the add
command, the corner case can get a bit more complicated:

   Already exists:   255.255.255.0-255.255.255.254
   Add intervals:    255.255.255.0-255.255.255.255,255.255.255.0-255.255.255.254
                     ~~~~~~~~~~~~~
             start element overlap

To catch this overlap, annotate that the new start element is a possible
overlap, then report the overlap if the next element is another start
element that confirms that previous element in an open interval at the
end of the set.

For deletions, do not update the start cookie when deleting an open
interval, otherwise this can trigger spurious EEXIST when adding new
elements.

Unfortunately, there is no NFT_SET_ELEM_INTERVAL_OPEN flag which would
make easier to detect open interval overlaps.

Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso
782f268812 netfilter: nft_set_rbtree: validate element belonging to interval
The existing partial overlap detection does not check if the elements
belong to the interval, eg.

  add element inet x y { 1.1.1.1-2.2.2.2, 4.4.4.4-5.5.5.5 }
  add element inet x y { 1.1.1.1-5.5.5.5 } => this should fail: ENOENT

Similar situation occurs with deletions:

  add element inet x y { 1.1.1.1-2.2.2.2, 4.4.4.4-5.5.5.5}
  delete element inet x y { 1.1.1.1-5.5.5.5 } => this should fail: ENOENT

This currently works via mitigation by nft in userspace, which is
performing the overlap detection before sending the elements to the
kernel. This requires a previous netlink dump of the set content which
slows down incremental updates on interval sets, because a netlink set
content dump is needed.

This patch extends the existing overlap detection to track the most
recent start element that already exists. The pointer to the existing
start element is stored as a cookie (no pointer dereference is ever
possible). If the end element is added and it already exists, then
check that the existing end element is adjacent to the already existing
start element. Similar logic applies to element deactivation.

This patch also annotates the timestamp to identify if start cookie
comes from an older batch, in such case reset it. Otherwise, a failing
create element command leaves the start cookie in place, resulting in
bogus error reporting.

There is still a few more corner cases of overlap detection related to
the open interval that are addressed in follow up patches.

This is address an early design mistake where an interval is expressed
as two elements, using the NFT_SET_ELEM_INTERVAL_END flag, instead of
the more recent NFTA_SET_ELEM_KEY_END attribute that pipapo already
uses.

Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso
4780ec142c netfilter: nft_set_rbtree: check for partial overlaps in anonymous sets
Userspace provides an optimized representation in case intervals are
adjacent, where the end element is omitted.

The existing partial overlap detection logic skips anonymous set checks
on start elements for this reason.

However, it is possible to add intervals that overlap to this anonymous
where two start elements with the same, eg. A-B, A-C where C < B.

      start     end
	A        B
      start  end
        A     C

Restore the check on overlapping start elements to report an overlap.

Fixes: c9e6978e27 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:36:07 +01:00
Pablo Neira Ayuso
7f9203f41a netfilter: nft_set_rbtree: fix bogus EEXIST with NLM_F_CREATE with null interval
Userspace adds a non-matching null element to the kernel for historical
reasons. This null element is added when the set is populated with
elements. Inclusion of this element is conditional, therefore,
userspace needs to dump the set content to check for its presence.

If the NLM_F_CREATE flag is turned on, this becomes an issue because
kernel bogusly reports EEXIST.

Add special case to ignore NLM_F_CREATE in this case, therefore,
re-adding the nul-element never fails.

Fixes: c016c7e45d ("netfilter: nf_tables: honor NLM_F_EXCL flag in set element insertion")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:36:07 +01:00
Anders Grahn
1e13f27e06 netfilter: nft_counter: fix reset of counters on 32bit archs
nft_counter_reset() calls u64_stats_add() with a negative value to reset
the counter. This will work on 64bit archs, hence the negative value
added will wrap as a 64bit value which then can wrap the stat counter as
well.

On 32bit archs, the added negative value will wrap as a 32bit value and
_not_ wrapping the stat counter properly. In most cases, this would just
lead to a very large 32bit value being added to the stat counter.

Fix by introducing u64_stats_sub().

Fixes: 4a1d3acd6e ("netfilter: nft_counter: Use u64_stats_t for statistic.")
Signed-off-by: Anders Grahn <anders.grahn@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:55 +01:00
Florian Westphal
2f635adbe2 netfilter: nft_set_hash: fix get operation on big endian
tests/shell/testcases/packetpath/set_match_nomatch_hash_fast
fails on big endian with:

Error: Could not process rule: No such file or directory
reset element ip test s { 244.147.90.126 }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Fatal: Cannot fetch element "244.147.90.126"

... because the wrong bucket is searched, jhash() and jhash1_word are
not interchangeable on big endian.

Fixes: 3b02b0adc2 ("netfilter: nft_set_hash: fix lookups with fixed size hash on big endian")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:55 +01:00
Qingfang Deng
2a441a9aac netfilter: flowtable: dedicated slab for flow entry
The size of `struct flow_offload` has grown beyond 256 bytes on 64-bit
kernels (currently 280 bytes) because of the `flow_offload_tunnel`
member added recently. So kmalloc() allocates from the kmalloc-512 slab,
causing significant memory waste per entry.

Introduce a dedicated slab cache for flow entries to reduce memory
footprint. Results in a reduction from 512 bytes to 320 bytes per entry
on x86_64 kernels.

Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:55 +01:00
Florian Westphal
207b3ebacb netfilter: nfnetlink_queue: do shared-unconfirmed check before segmentation
Ulrich reports a regression with nfqueue:

If an application did not set the 'F_GSO' capability flag and a gso
packet with an unconfirmed nf_conn entry is received all packets are
now dropped instead of queued, because the check happens after
skb_gso_segment().  In that case, we did have exclusive ownership
of the skb and its associated conntrack entry.  The elevated use
count is due to skb_clone happening via skb_gso_segment().

Move the check so that its peformed vs. the aggregated packet.

Then, annotate the individual segments except the first one so we
can do a 2nd check at reinject time.

For the normal case, where userspace does in-order reinjects, this avoids
packet drops: first reinjected segment continues traversal and confirms
entry, remaining segments observe the confirmed entry.

While at it, simplify nf_ct_drop_unconfirmed(): We only care about
unconfirmed entries with a refcnt > 1, there is no need to special-case
dying entries.

This only happens with UDP.  With TCP, the only unconfirmed packet will
be the TCP SYN, those aren't aggregated by GRO.

Next patch adds a udpgro test case to cover this scenario.

Reported-by: Ulrich Weber <ulrich.weber@gmail.com>
Fixes: 7d8dc1c7be ("netfilter: nf_queue: drop packets with cloned unconfirmed conntracks")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:55 +01:00
Florian Westphal
35f83a7552 netfilter: nft_set_rbtree: don't gc elements on insert
During insertion we can queue up expired elements for garbage
collection.

In case of later abort, the commit hook will never be called.
Packet path and 'get' requests will find free'd elements in the
binary search blob:

 nft_set_ext_key include/net/netfilter/nf_tables.h:800 [inline]
 nft_array_get_cmp+0x1f6/0x2a0 net/netfilter/nft_set_rbtree.c:133
 __inline_bsearch include/linux/bsearch.h:15 [inline]
 bsearch+0x50/0xc0 lib/bsearch.c:33
 nft_rbtree_get+0x16b/0x400 net/netfilter/nft_set_rbtree.c:169
 nft_setelem_get net/netfilter/nf_tables_api.c:6495 [inline]
 nft_get_set_elem+0x420/0xaa0 net/netfilter/nf_tables_api.c:6543
 nf_tables_getsetelem+0x448/0x5e0 net/netfilter/nf_tables_api.c:6632
 nfnetlink_rcv_msg+0x8ae/0x12c0 net/netfilter/nfnetlink.c:290

Also, when we insert an element that triggers -EEXIST, and that insertion
happens to also zap a timed-out entry, we end up with same issue:
Neither commit nor abort hook is called.

Fix this by removing gc api usage during insertion.

The blamed commit also removes concurrency of the rbtree with the
packet path, so we can now safely rb_erase() the element and move
it to a new expired list that can be reaped in the commit hook
before building the next blob iteration.

This also avoids the need to rebuild the blob in the abort path:
Expired elements seen during insertion attempts are kept around
until a transaction passes.

Reported-by: syzbot+d417922a3e7935517ef6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d417922a3e7935517ef6
Fixes: 7e43e0a114 ("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-06 13:34:41 +01:00
Jakub Kicinski
a182a62ff7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.19-rc9).

No adjacent changes, conflicts:

drivers/net/ethernet/spacemit/k1_emac.c
  3125fc1701 ("net: spacemit: k1-emac: fix jumbo frame support")
  f66086798f ("net: spacemit: Remove broken flow control support")
https://lore.kernel.org/aYIysFIE9ooavWia@sirena.org.uk

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-05 09:54:08 -08:00
Andrew Fasano
f41c5d1510 netfilter: nf_tables: fix inverted genmask check in nft_map_catchall_activate()
nft_map_catchall_activate() has an inverted element activity check
compared to its non-catchall counterpart nft_mapelem_activate() and
compared to what is logically required.

nft_map_catchall_activate() is called from the abort path to re-activate
catchall map elements that were deactivated during a failed transaction.
It should skip elements that are already active (they don't need
re-activation) and process elements that are inactive (they need to be
restored). Instead, the current code does the opposite: it skips inactive
elements and processes active ones.

Compare the non-catchall activate callback, which is correct:

  nft_mapelem_activate():
    if (nft_set_elem_active(ext, iter->genmask))
        return 0;   /* skip active, process inactive */

With the buggy catchall version:

  nft_map_catchall_activate():
    if (!nft_set_elem_active(ext, genmask))
        continue;   /* skip inactive, process active */

The consequence is that when a DELSET operation is aborted,
nft_setelem_data_activate() is never called for the catchall element.
For NFT_GOTO verdict elements, this means nft_data_hold() is never
called to restore the chain->use reference count. Each abort cycle
permanently decrements chain->use. Once chain->use reaches zero,
DELCHAIN succeeds and frees the chain while catchall verdict elements
still reference it, resulting in a use-after-free.

This is exploitable for local privilege escalation from an unprivileged
user via user namespaces + nftables on distributions that enable
CONFIG_USER_NS and CONFIG_NF_TABLES.

Fix by removing the negation so the check matches nft_mapelem_activate():
skip active elements, process inactive ones.

Fixes: 628bd3e49c ("netfilter: nf_tables: drop map element references from preparation phase")
Signed-off-by: Andrew Fasano <andrew.fasano@nist.gov>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-05 08:36:59 +01:00
Scott Mitchell
e19079adcd netfilter: nfnetlink_queue: optimize verdict lookup with hash table
The current implementation uses a linear list to find queued packets by
ID when processing verdicts from userspace. With large queue depths and
out-of-order verdicting, this O(n) lookup becomes a significant
bottleneck, causing userspace verdict processing to dominate CPU time.

Replace the linear search with a hash table for O(1) average-case
packet lookup by ID. A global rhashtable spanning all network
namespaces attributes hash bucket memory to kernel but is subject to
fixed upper bound.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:07 +01:00
Jinjie Ruan
77fd1b4c6e netfilter: xt_time: use is_leap_year() helper
Use the is_leap_year() helper from rtc.h instead of
writing it by hand

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:07 +01:00
Lorenzo Bianconi
93cf357fa7 netfilter: flowtable: Add IP6IP6 tx sw acceleration
Introduce sw acceleration for tx path of IP6IP6 tunnels relying on the
netfilter flowtable infrastructure.
IP6IP6 tx sw acceleration can be tested running the following scenario
where the traffic is forwarded between two NICs (eth0 and eth1) and an
IP6IP6 tunnel is used to access a remote site (using eth1 as the underlay
device):

ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (2001:db8:3::2)

$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:1::2/64 scope global nodad
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:2::1/64 scope global nodad
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr ce9c:2940:7dcc::
    inet6 2002:db8:1::1/64 scope global nodad
       valid_lft forever preferred_lft forever

$ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium
2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium
default via 2002:db8:1::2 dev tun0 metric 1024 pref medium

$nft list ruleset
table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth0, eth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp } flow add @ft
        }
}

Reproducing the scenario described above using veths I got the following
results:
- TCP stream received from the IPIP tunnel:
  - net-next: (baseline)                  ~93Gbps
  - net-next + IP6IP6 flowtbale support:  ~98Gbps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:06 +01:00
Lorenzo Bianconi
d98103575d netfilter: flowtable: Add IP6IP6 rx sw acceleration
Introduce sw acceleration for rx path of IP6IP6 tunnels relying on the
netfilter flowtable infrastructure. Subsequent patches will add sw
acceleration for IP6IP6 tunnels tx path.
IP6IP6 rx sw acceleration can be tested running the following scenario
where the traffic is forwarded between two NICs (eth0 and eth1) and an
IP6IP6 tunnel is used to access a remote site (using eth1 as the underlay
device):

ETH0 -- TUN0 <==> ETH1 -- [IP network] -- TUN1 (2001:db8:3::2)

$ip addr show
6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:00:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:1::2/64 scope global nodad
       valid_lft forever preferred_lft forever
7: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 00:11:22:33:11:55 brd ff:ff:ff:ff:ff:ff
    inet6 2001:db8:2::1/64 scope global nodad
       valid_lft forever preferred_lft forever
8: tun0@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1480 qdisc noqueue state UNKNOWN group default qlen 1000
    link/tunnel6 2001:db8:2::1 peer 2001:db8:2::2 permaddr ce9c:2940:7dcc::
    inet6 2002:db8:1::1/64 scope global nodad
       valid_lft forever preferred_lft forever

$ip -6 route show
2001:db8:1::/64 dev eth0 proto kernel metric 256 pref medium
2001:db8:2::/64 dev eth1 proto kernel metric 256 pref medium
2002:db8:1::/64 dev tun0 proto kernel metric 256 pref medium
default via 2002:db8:1::2 dev tun0 metric 1024 pref medium

$nft list ruleset
table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth0, eth1 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                meta l4proto { tcp, udp } flow add @ft
        }
}

Reproducing the scenario described above using veths I got the following
results:
- TCP stream received from the IPIP tunnel:
  - net-next: (baseline)                  ~ 81Gbps
  - net-next + IP6IP6 flowtbale support:  ~112Gbps

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:06 +01:00
Lorenzo Bianconi
c64436daf6 netfilter: Introduce tunnel metadata info in nf_flowtable_ctx struct
Add tunnel hdr_size and tunnel proto fields in nf_flowtable_ctx struct
in order to store IP tunnel header size and protocol used during IPIP
and IP6IP6 tunnel sw offloading decapsulation and avoid recomputing them
during tunnel header pop since this is constant for IPv6.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:06 +01:00
Lorenzo Bianconi
baa501b12a netfilter: Add ctx pointer in nf_flow_skb_encap_protocol/nf_flow_ip4_tunnel_proto signature
Rely on nf_flowtable_ctx struct pointer in nf_flow_ip4_tunnel_proto and
nf_flow_skb_encap_protocol routine signature. This is a preliminary patch
to introduce IP6IP6 flowtable acceleration since nf_flowtable_ctx will
be used to store IP6IP6 tunnel info.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-29 09:52:06 +01:00
Pablo Neira Ayuso
5599fa810b netfilter: nft_set_rbtree: remove seqcount_rwlock_t
After the conversion to binary search array, this is not required anymore.
Remove it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 17:18:13 +01:00
Pablo Neira Ayuso
2aa34191f0 netfilter: nft_set_rbtree: use binary search array in get command
Rework .get interface to use the binary search array, this needs a specific
lookup function to match on end intervals (<=). Packet path lookup is slight
different because match is on lesser value, not equal (ie. <).

After this patch, seqcount can be removed in a follow up patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 17:18:13 +01:00
Pablo Neira Ayuso
7e43e0a114 netfilter: nft_set_rbtree: translate rbtree to array for binary search
The rbtree can temporarily store overlapping inactive elements during
the transaction processing, leading to false negative lookups.

To address this issue, this patch adds a .commit function that walks the
the rbtree to build a array of intervals of ordered elements. This
conversion compacts the two singleton elements that represent the start
and the end of the interval into a single interval object for space
efficient.

Binary search is O(log n), similar to rbtree lookup time, therefore,
performance number should be similar, and there is an implementation
available under lib/bsearch.c and include/linux/bsearch.h that is used
for this purpose.

This slightly increases memory consumption for this new array that
stores pointers to the start and the end of the interval.

With this patch:

 # time nft -f 100k-intervals-set.nft

 real    0m4.218s
 user    0m3.544s
 sys     0m0.400s

Without this patch:

 # time nft -f 100k-intervals-set.nft

 real    0m3.920s
 user    0m3.547s
 sys     0m0.276s

With this patch, with IPv4 intervals:

  baseline rbtree (match on first field only):   15254954pps

Without this patch:

  baseline rbtree (match on first field only):   10256119pps

This provides a ~50% improvement in matching intervals from packet path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 17:18:13 +01:00
Pablo Neira Ayuso
f175b46d91 netfilter: nf_tables: add .abort_skip_removal flag for set types
The pipapo set backend is the only user of the .abort interface so far.
To speed up pipapo abort path, removals are skipped.

The follow up patch updates the rbtree to use to build an array of
ordered elements, then use binary search. This needs a new .abort
interface but, unlike pipapo, it also need to undo/remove elements.

Add a flag and use it from the pipapo set backend.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-22 17:18:13 +01:00
Florian Westphal
735ee8582d netfilter: xt_tcpmss: check remaining length before reading optlen
Quoting reporter:
  In net/netfilter/xt_tcpmss.c (lines 53-68), the TCP option parser reads
 op[i+1] directly without validating the remaining option length.

  If the last byte of the option field is not EOL/NOP (0/1), the code attempts
  to index op[i+1]. In the case where i + 1 == optlen, this causes an
  out-of-bounds read, accessing memory past the optlen boundary
  (either reading beyond the stack buffer _opt or the
  following payload).

Reported-by: sungzii <sungzii@pm.me>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:38 +01:00
Fernando Fernandez Mancera
de8a70cefc netfilter: nf_conncount: fix tracking of connections from localhost
Since commit be102eb6a0 ("netfilter: nf_conncount: rework API to use
sk_buff directly"), we skip the adding and trigger a GC when the ct is
confirmed. For connections originated from local to local it doesn't
work because the connection is confirmed on POSTROUTING, therefore
tracking on the INPUT hook is always skipped.

In order to fix this, we check whether skb input ifindex is set to
loopback ifindex. If it is then we fallback on a GC plus track operation
skipping the optimization. This fallback is necessary to avoid
duplicated tracking of a packet train e.g 10 UDP datagrams sent on a
burst when initiating the connection.

Tested with xt_connlimit/nft_connlimit and OVS limit and with a HTTP
server and iperf3 on UDP mode.

Fixes: be102eb6a0 ("netfilter: nf_conncount: rework API to use sk_buff directly")
Reported-by: Michal Slabihoudek <michal.slabihoudek@gooddata.com>
Closes: https://lore.kernel.org/netfilter/6989BD9F-8C24-4397-9AD7-4613B28BF0DB@gooddata.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:38 +01:00
Florian Westphal
cda26c6459 netfilter: nft_compat: add more restrictions on netlink attributes
As far as I can see nothing bad can happen when NFTA_TARGET/MATCH_NAME
are too large because this calls x_tables helpers which check for the
length, but it seems better to already reject it during netlink parsing.

Rest of the changes avoid silent u8/u16 truncations.

For _TYPE, its expected to be only 1 or 0. In x_tables world, this
variable is set by kernel, for IPT_SO_GET_REVISION_TARGET its 1, for
all others its set to 0.

As older versions of nf_tables permitted any value except 1 to mean 'match',
keep this as-is but sanitize the value for consistency.

Fixes: 0ca743a559 ("netfilter: nf_tables: add compatibility layer for x_tables")
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Scott Mitchell
a4400a5b34 netfilter: nfnetlink_queue: nfqnl_instance GFP_ATOMIC -> GFP_KERNEL_ACCOUNT allocation
Currently, instance_create() uses GFP_ATOMIC because it's called while
holding instances_lock spinlock. This makes allocation more likely to
fail under memory pressure.

Refactor nfqnl_recv_config() to drop RCU lock after instance_lookup()
and peer_portid verification. A socket cannot simultaneously send a
message and close, so the queue owned by the sending socket cannot be
destroyed while processing its CONFIG message. This allows
instance_create() to allocate with GFP_KERNEL_ACCOUNT before taking
the spinlock.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Florian Westphal
d00453b6e3 netfilter: nf_conntrack: don't rely on implicit includes
several netfilter compilation units rely on implicit includes
coming from nf_conntrack_proto_gre.h.

Clean this up and add the required dependencies where needed.

nf_conntrack.h requires net_generic() helper.
Place various gre/ppp/vlan includes to where they are needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Florian Westphal
910d271227 netfilter: don't include xt and nftables.h in unrelated subsystems
conntrack, xtables and nftables are distinct subsystems, don't use them
in other subystems.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Florian Westphal
f7becf0dad netfilter: nf_conntrack: enable icmp clash support
Not strictly required, but should not be harmful either:
This isn't a stateful protocol, hence clash resolution should work fine.

Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Fernando Fernandez Mancera
21d033e472 netfilter: nf_conncount: increase the connection clean up limit to 64
After the optimization to only perform one GC per jiffy, a new problem
was introduced. If more than 8 new connections are tracked per jiffy the
list won't be cleaned up fast enough possibly reaching the limit
wrongly.

In order to prevent this issue, only skip the GC if it was already
triggered during the same jiffy and the increment is lower than the
clean up limit. In addition, increase the clean up limit to 64
connections to avoid triggering GC too often and do more effective GCs.

This has been tested using a HTTP server and several
performance tools while having nft_connlimit/xt_connlimit or OVS limit
configured.

Output of slowhttptest + OVS limit at 52000 connections:

 slow HTTP test status on 340th second:
 initializing:        0
 pending:             432
 connected:           51998
 error:               0
 closed:              0
 service available:   YES

Fixes: d265929930 ("netfilter: nf_conncount: reduce unnecessary GC")
Reported-by: Aleksandra Rukomoinikova <ARukomoinikova@k2.cloud>
Closes: https://lore.kernel.org/netfilter/b2064e7b-0776-4e14-adb6-c68080987471@k2.cloud/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Yuto Hamaguchi
8a49fc8d8a netfilter: nf_conntrack: Add allow_clash to generic protocol handler
The upstream commit, 71d8c47fc6
 ("netfilter: conntrack: introduce clash resolution on insertion race"),
sets allow_clash=true in the UDP/UDPLITE protocol handler
but does not set it in the generic protocol handler.

As a result, packets composed of connectionless protocols at each layer,
such as UDP over IP-in-IP, still drop packets due to conflicts during conntrack insertion.

To resolve this, this patch sets allow_clash in the nf_conntrack_l4proto_generic.

Signed-off-by: Yuto Hamaguchi <Hamaguchi.Yuto@da.MitsubishiElectric.co.jp>
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Florian Westphal
6f93616a73 netfilter: nf_tables: reset table validation state on abort
If a transaction fails the final validation in the commit hook, the table
validation state is changed to NFT_VALIDATE_DO and a replay of the batch is
performed.  Every rule insert will then do a graph validation.

This is much slower, but provides better error reporting to the user
because we can point at the rule that introduces the validation issue.

Without this reset the affected table(s) remain in full validation mode,
i.e. on next transaction we start with slow-mode.

This makes the next transaction after a failed incremental update very slow:

 # time iptables-restore < /tmp/ruleset
 real    0m0.496s [..]
 # time iptables -A CALLEE -j CALLER
 iptables v1.8.11 (nf_tables):  RULE_APPEND failed (Too many links): rule in chain CALLEE
 real    0m0.022s [..]
 # time iptables-restore < /tmp/ruleset
 real    1m22.355s [..]

After this patch, 2nd iptables-restore is back to ~0.5s.

Fixes: 9a32e98506 ("netfilter: nf_tables: don't write table validation state without mutex")
Signed-off-by: Florian Westphal <fw@strlen.de>
2026-01-20 16:23:37 +01:00
Alexei Starovoitov
e3d0dbb3b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5
Cross-merge BPF and other fixes after downstream PR.

No conflicts.

Adjacent:
Auto-merging MAINTAINERS
Auto-merging Makefile
Auto-merging kernel/bpf/verifier.c
Auto-merging kernel/sched/ext.c
Auto-merging mm/memcontrol.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14 15:22:01 -08:00