linux-cryptodev-2.6

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git synced 2026-04-04 04:37:39 -04:00

Author	SHA1	Message	Date
Kees Cook	189f164e57	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-22 08:26:33 -08:00
Linus Torvalds	32a92f8c89	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 20:03:00 -08:00
Linus Torvalds	323bbfcf1e	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Linus Torvalds	8bf22c33e7	Merge tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from Netfilter. Current release - new code bugs: - net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT - eth: mlx5e: XSK, Fix unintended ICOSQ change - phy_port: correctly recompute the port's linkmodes - vsock: prevent child netns mode switch from local to global - couple of kconfig fixes for new symbols Previous releases - regressions: - nfc: nci: fix false-positive parameter validation for packet data - net: do not delay zero-copy skbs in skb_attempt_defer_free() Previous releases - always broken: - mctp: ensure our nlmsg responses to user space are zero-initialised - ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data() - fixes for ICMP rate limiting Misc: - intel: fix PCI device ID conflict between i40e and ipw2200" * tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits) net: nfc: nci: Fix parameter validation for packet data net/mlx5e: Use unsigned for mlx5e_get_max_num_channels net/mlx5e: Fix deadlocks between devlink and netdev instance locks net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event net/mlx5: Fix misidentification of write combining CQE during poll loop net/mlx5e: Fix misidentification of ASO CQE during poll loop net/mlx5: Fix multiport device check over light SFs bonding: alb: fix UAF in rlb_arp_recv during bond up/down bnge: fix reserving resources from FW eth: fbnic: Advertise supported XDP features. rds: tcp: fix uninit-value in __inet_bind net/rds: Fix NULL pointer dereference in rds_tcp_accept_one octeontx2-af: Fix default entries mcam entry action net/mlx5e: XSK, Fix unintended ICOSQ change ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow() inet: move icmp_global_{credit,stamp} to a separate cache line icmp: prevent possible overflow in icmp_global_allow() selftests/net: packetdrill: add ipv4-mapped-ipv6 tests ...	2026-02-19 10:39:08 -08:00
Eric Dumazet	d8d9ef2988	ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero If net.ipv4.icmp_ratelimit is zero, we do not have to call inet_getpeer_v4() and inet_peer_xrlim_allow(). Both can be very expensive under DDOS. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
Eric Dumazet	034bbd8062	icmp: prevent possible overflow in icmp_global_allow() Following expression can overflow if sysctl_icmp_msgs_per_sec is big enough. sysctl_icmp_msgs_per_sec * delta / HZ; Fixes: `4cdf507d54` ("icmp: add a global rate limitation") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216142832.3834174-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-18 16:46:36 -08:00
Eric Dumazet	ad5dfde2a5	ping: annotate data-races in ping_lookup() isk->inet_num, isk->inet_rcv_saddr and sk->sk_bound_dev_if are read locklessly in ping_lookup(). Add READ_ONCE()/WRITE_ONCE() annotations. The race on isk->inet_rcv_saddr is probably coming from IPv6 support, but does not deserve a specific backport. Fixes: `dbca1596bb` ("ping: convert to RCU lookups, get rid of rwlock") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260216100149.3319315-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-17 17:11:08 -08:00
Linus Torvalds	136114e0ab	Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...	2026-02-12 12:13:01 -08:00
Linus Torvalds	37a93dd5c4	Merge tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - A significant effort all around the stack to guide the compiler to make the right choice when inlining code, to avoid unneeded calls for small helper and stack canary overhead in the fast-path. This generates better and faster code with very small or no text size increases, as in many cases the call generated more code than the actual inlined helper. - Extend AccECN implementation so that is now functionally complete, also allow the user-space enabling it on a per network namespace basis. - Add support for memory providers with large (above 4K) rx buffer. Paired with hw-gro, larger rx buffer sizes reduce the number of buffers traversing the stack, dincreasing single stream CPU usage by up to ~30%. - Do not add HBH header to Big TCP GSO packets. This simplifies the RX path, the TX path and the NIC drivers, and is possible because user-space taps can now interpret correctly such packets without the HBH hint. - Allow IPv6 routes to be configured with a gateway address that is resolved out of a different interface than the one specified, aligning IPv6 to IPv4 behavior. - Multi-queue aware sch_cake. This makes it possible to scale the rate shaper of sch_cake across multiple CPUs, while still enforcing a single global rate on the interface. - Add support for the nbcon (new buffer console) infrastructure to netconsole, enabling lock-free, priority-based console operations that are safer in crash scenarios. - Improve the TCP ipv6 output path to cache the flow information, saving cpu cycles, reducing cache line misses and stack use. - Improve netfilter packet tracker to resolve clashes for most protocols, avoiding unneeded drops on rare occasions. - Add IP6IP6 tunneling acceleration to the flowtable infrastructure. - Reduce tcp socket size by one cache line. - Notify neighbour changes atomically, avoiding inconsistencies between the notification sequence and the actual states sequence. - Add vsock namespace support, allowing complete isolation of vsocks across different network namespaces. - Improve xsk generic performances with cache-alignment-oriented optimizations. - Support netconsole automatic target recovery, allowing netconsole to reestablish targets when underlying low-level interface comes back online. Driver API: - Support for switching the working mode (automatic vs manual) of a DPLL device via netlink. - Introduce PHY ports representation to expose multiple front-facing media ports over a single MAC. - Introduce "rx-polarity" and "tx-polarity" device tree properties, to generalize polarity inversion requirements for differential signaling. - Add helper to create, prepare and enable managed clocks. Device drivers: - Add Huawei hinic3 PF etherner driver. - Add DWMAC glue driver for Motorcomm YT6801 PCIe ethernet controller. - Add ethernet driver for MaxLinear MxL862xx switches - Remove parallel-port Ethernet driver. - Convert existing driver timestamp configuration reporting to hwtstamp_get and remove legacy ioctl(). - Convert existing drivers to .get_rx_ring_count(), simplifing the RX ring count retrieval. Also remove the legacy fallback path. - Ethernet high-speed NICs: - Broadcom (bnxt, bng): - bnxt: add FW interface update to support FEC stats histogram and NVRAM defragmentation - bng: add TSO and H/W GRO support - nVidia/Mellanox (mlx5): - improve latency of channel restart operations, reducing the used H/W resources - add TSO support for UDP over GRE over VLAN - add flow counters support for hardware steering (HWS) rules - use a static memory area to store headers for H/W GRO, leading to 12% RX tput improvement - Intel (100G, ice, idpf): - ice: reorganizes layout of Tx and Rx rings for cacheline locality and utilizes __cacheline_group* macros on the new layouts - ice: introduces Synchronous Ethernet (SyncE) support - Meta (fbnic): - adds debugfs for firmware mailbox and tx/rx rings vectors - Ethernet virtual: - geneve: introduce GRO/GSO support for double UDP encapsulation - Ethernet NICs consumer, and embedded: - Synopsys (stmmac): - some code refactoring and cleanups - RealTek (r8169): - add support for RTL8127ATF (10G Fiber SFP) - add dash and LTR support - Airoha: - AN8811HB 2.5 Gbps phy support - Freescale (fec): - add XDP zero-copy support - Thunderbolt: - add get link setting support to allow bonding - Renesas: - add support for RZ/G3L GBETH SoC - Ethernet switches: - Maxlinear: - support R(G)MII slow rate configuration - add support for Intel GSW150 - Motorcomm (yt921x): - add DCB/QoS support - TI: - icssm-prueth: support bridging (STP/RSTP) via the switchdev framework - Ethernet PHYs: - Realtek: - enable SGMII and 2500Base-X in-band auto-negotiation - simplify and reunify C22/C45 drivers - Micrel: convert bindings to DT schema - CAN: - move skb headroom content into skb extensions, making CAN metadata access more robust - CAN drivers: - rcar_canfd: - add support for FD-only mode - add support for the RZ/T2H SoC - sja1000: cleanup the CAN state handling - WiFi: - implement EPPKE/802.1X over auth frames support - split up drop reasons better, removing generic RX_DROP - additional FTM capabilities: 6 GHz support, supported number of spatial streams and supported number of LTF repetitions - better mac80211 iterators to enumerate resources - initial UHR (Wi-Fi 8) support for cfg80211/mac80211 - WiFi drivers: - Qualcomm/Atheros: - ath11k: support for Channel Frequency Response measurement - ath12k: a significant driver refactor to support multi-wiphy devices and and pave the way for future device support in the same driver (rather than splitting to ath13k) - ath12k: support for the QCC2072 chipset - Intel: - iwlwifi: partial Neighbor Awareness Networking (NAN) support - iwlwifi: initial support for U-NII-9 and IEEE 802.11bn - RealTek (rtw89): - preparations for RTL8922DE support - Bluetooth: - implement setsockopt(BT_PHY) to set the connection packet type/PHY - set link_policy on incoming ACL connections - Bluetooth drivers: - btusb: add support for MediaTek7920, Realtek RTL8761BU and 8851BE - btqca: add WCN6855 firmware priority selection feature" * tag 'net-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1254 commits) bnge/bng_re: Add a new HSI net: macb: Fix tx/rx malfunction after phy link down and up af_unix: Fix memleak of newsk in unix_stream_connect(). net: ti: icssg-prueth: Add optional dependency on HSR net: dsa: add basic initial driver for MxL862xx switches net: mdio: add unlocked mdiodev C45 bus accessors net: dsa: add tag format for MxL862xx switches dt-bindings: net: dsa: add MaxLinear MxL862xx selftests: drivers: net: hw: Modify toeplitz.c to poll for packets octeontx2-pf: Unregister devlink on probe failure net: renesas: rswitch: fix forwarding offload statemachine ionic: Rate limit unknown xcvr type messages tcp: inet6_csk_xmit() optimization tcp: populate inet->cork.fl.u.ip6 in tcp_v6_syn_recv_sock() tcp: populate inet->cork.fl.u.ip6 in tcp_v6_connect() ipv6: inet6_csk_xmit() and inet6_csk_update_pmtu() use inet->cork.fl.u.ip6 ipv6: use inet->cork.fl.u.ip6 and np->final in ip6_datagram_dst_update() ipv6: use np->final in inet6_sk_rebuild_header() ipv6: add daddr/final storage in struct ipv6_pinfo net: stmmac: qcom-ethqos: fix qcom_ethqos_serdes_powerup() ...	2026-02-11 19:31:52 -08:00
Paolo Abeni	83310d6133	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Merge in late fixes in preparation for the net-next PR. Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-11 15:14:35 +01:00
Linus Torvalds	0923fd0419	Merge tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Lock debugging: - Implement compiler-driven static analysis locking context checking, using the upcoming Clang 22 compiler's context analysis features (Marco Elver) We removed Sparse context analysis support, because prior to removal even a defconfig kernel produced 1,700+ context tracking Sparse warnings, the overwhelming majority of which are false positives. On an allmodconfig kernel the number of false positive context tracking Sparse warnings grows to over 5,200... On the plus side of the balance actual locking bugs found by Sparse context analysis is also rather ... sparse: I found only 3 such commits in the last 3 years. So the rate of false positives and the maintenance overhead is rather high and there appears to be no active policy in place to achieve a zero-warnings baseline to move the annotations & fixers to developers who introduce new code. Clang context analysis is more complete and more aggressive in trying to find bugs, at least in principle. Plus it has a different model to enabling it: it's enabled subsystem by subsystem, which results in zero warnings on all relevant kernel builds (as far as our testing managed to cover it). Which allowed us to enable it by default, similar to other compiler warnings, with the expectation that there are no warnings going forward. This enforces a zero-warnings baseline on clang-22+ builds (Which are still limited in distribution, admittedly) Hopefully the Clang approach can lead to a more maintainable zero-warnings status quo and policy, with more and more subsystems and drivers enabling the feature. Context tracking can be enabled for all kernel code via WARN_CONTEXT_ANALYSIS_ALL=y (default disabled), but this will generate a lot of false positives. ( Having said that, Sparse support could still be added back, if anyone is interested - the removal patch is still relatively straightforward to revert at this stage. ) Rust integration updates: (Alice Ryhl, Fujita Tomonori, Boqun Feng) - Add support for Atomic<i8/i16/bool> and replace most Rust native AtomicBool usages with Atomic<bool> - Clean up LockClassKey and improve its documentation - Add missing Send and Sync trait implementation for SetOnce - Make ARef Unpin as it is supposed to be - Add __rust_helper to a few Rust helpers as a preparation for helper LTO - Inline various lock related functions to avoid additional function calls WW mutexes: - Extend ww_mutex tests and other test-ww_mutex updates (John Stultz) Misc fixes and cleanups: - rcu: Mark lockdep_assert_rcu_helper() __always_inline (Arnd Bergmann) - locking/local_lock: Include more missing headers (Peter Zijlstra) - seqlock: fix scoped_seqlock_read kernel-doc (Randy Dunlap) - rust: sync: Replace `kernel::c_str!` with C-Strings (Tamir Duberstein)" * tag 'locking-core-2026-02-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (90 commits) locking/rwlock: Fix write_trylock_irqsave() with CONFIG_INLINE_WRITE_TRYLOCK rcu: Mark lockdep_assert_rcu_helper() __always_inline compiler-context-analysis: Remove __assume_ctx_lock from initializers tomoyo: Use scoped init guard crypto: Use scoped init guard kcov: Use scoped init guard compiler-context-analysis: Introduce scoped init guards cleanup: Make __DEFINE_LOCK_GUARD handle commas in initializers seqlock: fix scoped_seqlock_read kernel-doc tools: Update context analysis macros in compiler_types.h rust: sync: Replace `kernel::c_str!` with C-Strings rust: sync: Inline various lock related methods rust: helpers: Move #define __rust_helper out of atomic.c rust: wait: Add __rust_helper to helpers rust: time: Add __rust_helper to helpers rust: task: Add __rust_helper to helpers rust: sync: Add __rust_helper to helpers rust: refcount: Add __rust_helper to helpers rust: rcu: Add __rust_helper to helpers rust: processor: Add __rust_helper to helpers ...	2026-02-10 12:28:44 -08:00
Linus Torvalds	f17b474e36	Merge tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Support associating BPF program with struct_ops (Amery Hung) - Switch BPF local storage to rqspinlock and remove recursion detection counters which were causing false positives (Amery Hung) - Fix live registers marking for indirect jumps (Anton Protopopov) - Introduce execution context detection BPF helpers (Changwoo Min) - Improve verifier precision for 32bit sign extension pattern (Cupertino Miranda) - Optimize BTF type lookup by sorting vmlinux BTF and doing binary search (Donglin Peng) - Allow states pruning for misc/invalid slots in iterator loops (Eduard Zingerman) - In preparation for ASAN support in BPF arenas teach libbpf to move global BPF variables to the end of the region and enable arena kfuncs while holding locks (Emil Tsalapatis) - Introduce support for implicit arguments in kfuncs and migrate a number of them to new API. This is a prerequisite for cgroup sub-schedulers in sched-ext (Ihor Solodrai) - Fix incorrect copied_seq calculation in sockmap (Jiayuan Chen) - Fix ORC stack unwind from kprobe_multi (Jiri Olsa) - Speed up fentry attach by using single ftrace direct ops in BPF trampolines (Jiri Olsa) - Require frozen map for calculating map hash (KP Singh) - Fix lock entry creation in TAS fallback in rqspinlock (Kumar Kartikeya Dwivedi) - Allow user space to select cpu in lookup/update operations on per-cpu array and hash maps (Leon Hwang) - Make kfuncs return trusted pointers by default (Matt Bobrowski) - Introduce "fsession" support where single BPF program is executed upon entry and exit from traced kernel function (Menglong Dong) - Allow bpf_timer and bpf_wq use in all programs types (Mykyta Yatsenko, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Alexei Starovoitov) - Make KF_TRUSTED_ARGS the default for all kfuncs and clean up their definition across the tree (Puranjay Mohan) - Allow BPF arena calls from non-sleepable context (Puranjay Mohan) - Improve register id comparison logic in the verifier and extend linked registers with negative offsets (Puranjay Mohan) - In preparation for BPF-OOM introduce kfuncs to access memcg events (Roman Gushchin) - Use CFI compatible destructor kfunc type (Sami Tolvanen) - Add bitwise tracking for BPF_END in the verifier (Tianci Cao) - Add range tracking for BPF_DIV and BPF_MOD in the verifier (Yazhou Tang) - Make BPF selftests work with 64k page size (Yonghong Song) * tag 'bpf-next-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (268 commits) selftests/bpf: Fix outdated test on storage->smap selftests/bpf: Choose another percpu variable in bpf for btf_dump test selftests/bpf: Remove test_task_storage_map_stress_lookup selftests/bpf: Update task_local_storage/task_storage_nodeadlock test selftests/bpf: Update task_local_storage/recursion test selftests/bpf: Update sk_storage_omem_uncharge test bpf: Switch to bpf_selem_unlink_nofail in bpf_local_storage_{map_free, destroy} bpf: Support lockless unlink when freeing map or local storage bpf: Prepare for bpf_selem_unlink_nofail() bpf: Remove unused percpu counter from bpf_local_storage_map_free bpf: Remove cgroup local storage percpu counter bpf: Remove task local storage percpu counter bpf: Change local_storage->lock and b->lock to rqspinlock bpf: Convert bpf_selem_unlink to failable bpf: Convert bpf_selem_link_map to failable bpf: Convert bpf_selem_unlink_map to failable bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage selftests/xsk: fix number of Tx frags in invalid packet selftests/xsk: properly handle batch ending in the middle of a packet bpf: Prevent reentrance into call_rcu_tasks_trace() ...	2026-02-10 11:26:21 -08:00
Jiayuan Chen	81b84de32b	xfrm: fix ip_rt_bug race in icmp_route_lookup reverse path icmp_route_lookup() performs multiple route lookups to find a suitable route for sending ICMP error messages, with special handling for XFRM (IPsec) policies. The lookup sequence is: 1. First, lookup output route for ICMP reply (dst = original src) 2. Pass through xfrm_lookup() for policy check 3. If blocked (-EPERM) or dst is not local, enter "reverse path" 4. In reverse path, call xfrm_decode_session_reverse() to get fl4_dec which reverses the original packet's flow (saddr<->daddr swapped) 5. If fl4_dec.saddr is local (we are the original destination), use __ip_route_output_key() for output route lookup 6. If fl4_dec.saddr is NOT local (we are a forwarding node), use ip_route_input() to simulate the reverse packet's input path 7. Finally, pass rt2 through xfrm_lookup() with XFRM_LOOKUP_ICMP flag The bug occurs in step 6: ip_route_input() is called with fl4_dec.daddr (original packet's source) as destination. If this address becomes local between the initial check and ip_route_input() call (e.g., due to concurrent "ip addr add"), ip_route_input() returns a LOCAL route with dst.output set to ip_rt_bug. This route is then used for ICMP output, causing dst_output() to call ip_rt_bug(), triggering a WARN_ON: ------------[ cut here ]------------ WARNING: net/ipv4/route.c:1275 at ip_rt_bug+0x21/0x30, CPU#1 Call Trace: <TASK> ip_push_pending_frames+0x202/0x240 icmp_push_reply+0x30d/0x430 __icmp_send+0x1149/0x24f0 ip_options_compile+0xa2/0xd0 ip_rcv_finish_core+0x829/0x1950 ip_rcv+0x2d7/0x420 __netif_receive_skb_one_core+0x185/0x1f0 netif_receive_skb+0x90/0x450 tun_get_user+0x3413/0x3fb0 tun_chr_write_iter+0xe4/0x220 ... Fix this by checking rt2->rt_type after ip_route_input(). If it's RTN_LOCAL, the route cannot be used for output, so treat it as an error. The reproducer requires kernel modification to widen the race window, making it unsuitable as a selftest. It is available at: https://gist.github.com/mrpre/eae853b72ac6a750f5d45d64ddac1e81 Reported-by: syzbot+e738404dcd14b620923c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000b1060905eada8881@google.com/T/ Closes: https://lore.kernel.org/r/20260128090523.356953-1-jiayuan.chen@linux.dev Fixes: `8b7817f3a9` ("[IPSEC]: Add ICMP host relookup support") Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://patch.msgid.link/20260206050220.59642-1-jiayuan.chen@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-10 15:06:11 +01:00
Eric Dumazet	a35b6e4863	tcp: inline tcp_filter() This helper is already (auto)inlined from IPv4 TCP stack. Make it an inline function to benefit IPv6 as well. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 30/-49 (-19) Function old new delta tcp_v6_rcv 3448 3478 +30 __pfx_tcp_filter 16 - -16 tcp_filter 33 - -33 Total: Before=24891904, After=24891885, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260205164329.3401481-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-06 20:12:11 -08:00
Eric Dumazet	c89477ad79	inet: RAW sockets using IPPROTO_RAW MUST drop incoming ICMP Yizhou Zhao reported that simply having one RAW socket on protocol IPPROTO_RAW (255) was dangerous. socket(AF_INET, SOCK_RAW, 255); A malicious incoming ICMP packet can set the protocol field to 255 and match this socket, leading to FNHE cache changes. inner = IP(src="192.168.2.1", dst="8.8.8.8", proto=255)/Raw("TEST") pkt = IP(src="192.168.1.1", dst="192.168.2.1")/ICMP(type=3, code=4, nexthopmtu=576)/inner "man 7 raw" states: A protocol of IPPROTO_RAW implies enabled IP_HDRINCL and is able to send any IP protocol that is specified in the passed header. Receiving of all IP protocols via IPPROTO_RAW is not possible using raw sockets. Make sure we drop these malicious packets. Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/netdev/20251109134600.292125-1-zhaoyz24@mails.tsinghua.edu.cn/ Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260203192509.682208-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 12:36:49 -08:00
Eric Dumazet	22c1264415	tcp: move __reqsk_free() out of line Inlining __reqsk_free() is overkill, let's reclaim 2 Kbytes of text. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/4 grow/shrink: 2/14 up/down: 225/-2338 (-2113) Function old new delta __reqsk_free - 114 +114 sock_edemux 18 82 +64 inet_csk_listen_start 233 264 +31 __pfx___reqsk_free - 16 +16 __pfx_reqsk_queue_alloc 16 - -16 __pfx_reqsk_free 16 - -16 reqsk_queue_alloc 46 - -46 tcp_req_err 272 177 -95 reqsk_fastopen_remove 348 253 -95 cookie_bpf_check 157 62 -95 cookie_tcp_reqsk_alloc 387 290 -97 cookie_v4_check 1568 1465 -103 reqsk_free 105 - -105 cookie_v6_check 1519 1412 -107 sock_gen_put 187 78 -109 sock_pfree 212 82 -130 tcp_try_fastopen 1818 1683 -135 tcp_v4_rcv 3478 3294 -184 reqsk_put 306 90 -216 tcp_get_cookie_sock 551 318 -233 tcp_v6_rcv 3404 3141 -263 tcp_conn_request 2677 2384 -293 Total: Before=24887415, After=24885302, chg -0.01% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:06 -08:00
Eric Dumazet	a90765c6f6	tcp: move reqsk_fastopen_remove to net/ipv4/tcp_fastopen.c This function belongs to TCP stack, not to net/core/request_sock.c We get rid of the now empty request_sock.c n the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
Eric Dumazet	d5c5391554	inet: move reqsk_queue_alloc() to net/ipv4/inet_connection_sock.c Only called once from inet_csk_listen_start(), it can be static. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260204055147.1682705-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-05 09:23:05 -08:00
Eric Dumazet	309dd99421	tcp: split tcp_check_space() in two parts tcp_check_space() is fat and not inlined. Move its slow path in (out of line) __tcp_check_space() and make tcp_check_space() an inline function for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 2/2 grow/shrink: 4/0 up/down: 708/-582 (126) Function old new delta __tcp_check_space - 521 +521 tcp_rcv_established 1860 1916 +56 tcp_rcv_state_process 3342 3384 +42 tcp_event_new_data_sent 248 286 +38 tcp_data_snd_check 71 106 +35 __pfx___tcp_check_space - 16 +16 __pfx_tcp_check_space 16 - -16 tcp_check_space 566 - -566 Total: Before=24896373, After=24896499, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203050932.3522221-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:37:06 -08:00
Eric Dumazet	7c1db78ff7	tcp: move tcp_rbtree_insert() to tcp_output.c tcp_rbtree_insert() is primarily used from tcp_output.c In tcp_input.c, only (slow path) tcp_collapse() uses it. Move it to tcp_output.c to allow its (auto)inlining to improve TCP tx fast path. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 4/1 up/down: 445/-115 (330) Function old new delta tcp_connect 4277 4478 +201 tcp_event_new_data_sent 162 248 +86 tcp_send_synack 780 862 +82 tcp_fragment 1185 1261 +76 tcp_collapse 1524 1409 -115 Total: Before=24896043, After=24896373, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203045110.3499713-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:36:50 -08:00
Eric Dumazet	59b5e7f47c	tcp: use __skb_push() in __tcp_transmit_skb() We trust MAX_TCP_HEADER to be large enough. Using the inlined version of skb_push() trades 8 bytes of text for better performance of TCP TX fast path. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/0 up/down: 8/0 (8) Function old new delta __tcp_transmit_skb 3181 3189 +8 Total: Before=24896035, After=24896043, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260203044226.3489941-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-04 20:36:27 -08:00
Chia-Yu Chang	8ae3e8e6ce	tcp: accecn: enable AccECN Enable Accurate ECN negotiation and request for incoming and outgoing connection by setting sysctl_tcp_ecn: +==============+===========================================+ \| \| Highest ECN variant (Accurate ECN, ECN, \| \| tcp_ecn \| or no ECN) to be negotiated & requested \| \| +---------------------+---------------------+ \| \| Incoming connection \| Outgoing connection \| +==============+=====================+=====================+ \| 0 \| No ECN \| No ECN \| \| 1 \| ECN \| ECN \| \| 2 \| ECN \| No ECN \| +--------------+---------------------+---------------------+ \| 3 \| Accurate ECN \| Accurate ECN \| \| 4 \| Accurate ECN \| ECN \| \| 5 \| Accurate ECN \| No ECN \| +==============+=====================+=====================+ Refer Documentation/networking/ip-sysctl.rst for more details. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-15-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	4fa4ac5e58	tcp: accecn: add tcpi_ecn_mode and tcpi_option2 in tcp_info Add 2-bit tcpi_ecn_mode feild within tcp_info to indicate which ECN mode is negotiated: ECN_MODE_DISABLED, ECN_MODE_RFC3168, ECN_MODE_ACCECN, or ECN_MODE_PENDING. This is done by utilizing available bits from tcpi_accecn_opt_seen (reduced from 16 bits to 2 bits) and tcpi_accecn_fail_mode (reduced from 16 bits to 4 bits). Also, an extra 24-bit tcpi_options2 field is identified to represent newer options and connection features, as all 8 bits of tcpi_options field have been used. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-14-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	1247fb19ca	tcp: accecn: detect loss ACK w/ AccECN option and add TCP_ACCECN_OPTION_PERSIST Detect spurious retransmission of a previously sent ACK carrying the AccECN option after the second retransmission. Since this might be caused by the middlebox dropping ACK with options it does not recognize, disable the sending of the AccECN option in all subsequent ACKs. This patch follows Section 3.2.3.2.2 of AccECN spec (RFC9768), and a new field (accecn_opt_sent_w_dsack) is added to indicate that an AccECN option was sent with duplicate SACK info. Also, a new AccECN option sending mode is added to tcp_ecn_option sysctl: (TCP_ECN_OPTION_PERSIST), which ignores the AccECN fallback policy and persistently sends AccECN option once it fits into TCP option space. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-13-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:25 +01:00
Chia-Yu Chang	4024081feb	tcp: accecn: unset ECT if receive or send ACE=0 in AccECN negotiaion Based on specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Based on Section 3.1.5 of AccECN spec (RFC9768), a TCP Server in AccECN mode MUST NOT set ECT on any packet for the rest of the connection, if it has received or sent at least one valid SYN or Acceptable SYN/ACK with (AE,CWR,ECE) = (0,0,0) during the handshake. In addition, a host in AccECN mode that is feeding back the IP-ECN field on a SYN or SYN/ACK MUST feed back the IP-ECN field on the latest valid SYN or acceptable SYN/ACK to arrive. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-11-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	f326f1f17f	tcp: accecn: retransmit SYN/ACK without AccECN option or non-AccECN SYN/ACK For Accurate ECN, the first SYN/ACK sent by the TCP server shall set the ACE flag (Table 1 of RFC9768) and the AccECN option to complete the capability negotiation. However, if the TCP server needs to retransmit such a SYN/ACK (for example, because it did not receive an ACK acknowledging its SYN/ACK, or received a second SYN requesting AccECN support), the TCP server retransmits the SYN/ACK without the AccECN option. This is because the SYN/ACK may be lost due to congestion, or a middlebox may block the AccECN option. Furthermore, if this retransmission also times out, to expedite connection establishment, the TCP server should retransmit the SYN/ACK with (AE,CWR,ECE) = (0,0,0) and without the AccECN option, while maintaining AccECN feedback mode. This complies with Section 3.2.3.2.2 of the AccECN spec RFC9768. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-10-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	f1eaea5585	tcp: add TCP_SYNACK_RETRANS synack_type Before this patch, retransmitted SYN/ACK did not have a specific synack_type; however, the upcoming patch needs to distinguish between retransmitted and non-retransmitted SYN/ACK for AccECN negotiation to transmit the fallback SYN/ACK during AccECN negotiation. Therefore, this patch introduces a new synack_type (TCP_SYNACK_RETRANS). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-9-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	3ae62b8b4a	tcp: accecn: retransmit downgraded SYN in AccECN negotiation Based on AccECN spec (RFC9768) Section 3.1.4.1, if the sender of an AccECN SYN (the TCP Client) times out before receiving the SYN/ACK, it SHOULD attempt to negotiate the use of AccECN at least one more time by continuing to set all three TCP ECN flags (AE,CWR,ECE) = (1,1,1) on the first retransmitted SYN (using the usual retransmission time-outs). If this first retransmission also fails to be acknowledged, in deployment scenarios where AccECN path traversal might be problematic, the TCP Client SHOULD send subsequent retransmissions of the SYN with the three TCP-ECN flags cleared (AE,CWR,ECE) = (0,0,0). Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-8-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	e68c28f22f	tcp: disable RFC3168 fallback identifier for CC modules When AccECN is not successfully negociated for a TCP flow, it defaults fallback to classic ECN (RFC3168). However, L4S service will fallback to non-ECN. This patch enables congestion control module to control whether it should not fallback to classic ECN after unsuccessful AccECN negotiation. A new CA module flag (TCP_CONG_NO_FALLBACK_RFC3168) identifies this behavior expected by the CA. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-6-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Chia-Yu Chang	100f946b8d	tcp: ECT_1_NEGOTIATION and NEEDS_ACCECN identifiers Two flags for congestion control (CC) module are added in this patch related to AccECN negotiation. First, a new flag (TCP_CONG_NEEDS_ACCECN) defines that the CC expects to negotiate AccECN functionality using the ECE, CWR and AE flags in the TCP header. Second, during ECN negotiation, ECT(0) in the IP header is used. This patch enables CC to control whether ECT(0) or ECT(1) should be used on a per-segment basis. A new flag (TCP_CONG_ECT_1_NEGOTIATION) defines the expected ECT value in the IP header by the CA when not-yet initialized for the connection. The detailed AccECN negotiaotn can be found in IETF RFC9768. Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Ilpo Järvinen	ab4c8b6f7f	gro: flushing when CWR is set negatively affects AccECN As AccECN may keep CWR bit asserted due to different interpretation of the bit, flushing with GRO because of CWR may effectively disable GRO until AccECN counter field changes such that CWR-bit becomes 0. There is no harm done from not immediately forwarding the CWR'ed segment with RFC3168 ECN. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Ilpo Järvinen	7885ce0147	tcp: try to avoid safer when ACKs are thinned Add newly acked pkts EWMA. When ACK thinning occurs, select between safer and unsafe cep delta in AccECN processing based on it. If the packets ACKed per ACK tends to be large, don't conservatively assume ACE field overflow. This patch uses the existing 2-byte holes in the rx group for new u16 variables withtout creating more holes. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; /* 2744 12 / / XXX 4 bytes hole, try to pack / [...] __cacheline_group_end__tcp_sock_write_rx[0]; / 2816 0 / [...] / size: 3264, cachelines: 51, members: 177 / } [AFTER THIS PATCH] struct tcp_sock { [...] u32 delivered_ecn_bytes[3]; / 2744 12 / u16 pkts_acked_ewma; / 2756 2 / / XXX 2 bytes hole, try to pack / [...] __cacheline_group_end__tcp_sock_write_rx[0]; / 2816 0 / [...] / size: 3264, cachelines: 51, members: 178 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260131222515.8485-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-02-03 15:13:24 +01:00
Geliang Tang	2d85088d46	tcp: export tcp_splice_state Export struct tcp_splice_state and tcp_splice_data_recv() in net/tcp.h so that they can be used by MPTCP in the next patch. Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Acked-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260130-net-next-mptcp-splice-v2-3-31332ba70d7f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 18:15:32 -08:00
Eric Dumazet	fe8570186f	ipv4: use dst4_mtu() instead of dst_mtu() When we expect an IPv4 dst, use dst4_mtu() instead of dst_mtu() to save some code space. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260130210303.3888261-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-02-02 17:49:29 -08:00
Mahdi Faramarzpour	820990d665	udp: add drop count for packets in udp_prod_queue This commit adds SNMP drop count increment for the packets in per NUMA queues which were introduced in commit `b650bf0977` ("udp: remove busylock and add per NUMA queues"). note that SNMP counters are incremented currently by the caller for skb. And that these skbs on the intermediate queue cannot be counted there so need similar logic in their error path. Signed-off-by: Mahdi Faramarzpour <mahdifrmx@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260129083806.204752-1-mahdifrmx@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-30 17:18:53 -08:00
Eric Dumazet	ed9b70040d	tcp: reduce tcp sockets size by one cache line By default, when a kmem_cache is created with SLAB_TYPESAFE_BY_RCU, slub has to use extra storage for the freelist pointer after each object, because slub assumes that any bit in the object can be used by RCU readers. Because proto_register() is also using SLAB_HWCACHE_ALIGN, this forces slub to use one extra cache line per object. We can instead put the slub freelist anywhere in the object, granted the concurrent RCU readers are not supposed to use the pointer value. Add a new (struct sock)sk_freeptr field, in an union with sk_rcu: No RCU readers would need to look at sk_rcu, which is only used at free phase. Tested: grep . /sys/kernel/slab/TCP/{object_size,slab_size,objs_per_slab} grep . /sys/kernel/slab/TCPv6/{object_size,slab_size,objs_per_slab} Before: /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2432 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2560 /sys/kernel/slab/TCPv6/objs_per_slab:12 After this patch, we can pack one more TCPv6 object per slab, and object_size == slab_size. /sys/kernel/slab/TCP/object_size:2368 /sys/kernel/slab/TCP/slab_size:2368 /sys/kernel/slab/TCP/objs_per_slab:13 /sys/kernel/slab/TCPv6/object_size:2496 /sys/kernel/slab/TCPv6/slab_size:2496 /sys/kernel/slab/TCPv6/objs_per_slab:13 Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260129153458.4163797-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-30 17:15:51 -08:00
Jakub Kicinski	a010fe8d86	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR (net-6.19-rc8). No adjacent changes, conflicts: drivers/net/ethernet/spacemit/k1_emac.c `2c84959167` ("net: spacemit: Check for netif_carrier_ok() in emac_stats_update()") `f66086798f` ("net: spacemit: Remove broken flow control support") https://lore.kernel.org/aXjAqZA3iEWD_DGM@sirena.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-29 17:28:54 -08:00
Jibin Zhang	426ca15c7f	net: fix segmentation of forwarding fraglist GRO This patch enhances GSO segment handling by properly checking the SKB_GSO_DODGY flag for frag_list GSO packets, addressing low throughput issues observed when a station accesses IPv4 servers via hotspots with an IPv6-only upstream interface. Specifically, it fixes a bug in GSO segmentation when forwarding GRO packets containing a frag_list. The function skb_segment_list cannot correctly process GRO skbs that have been converted by XLAT, since XLAT only translates the header of the head skb. Consequently, skbs in the frag_list may remain untranslated, resulting in protocol inconsistencies and reduced throughput. To address this, the patch explicitly sets the SKB_GSO_DODGY flag for GSO packets in XLAT's IPv4/IPv6 protocol translation helpers (bpf_skb_proto_4_to_6 and bpf_skb_proto_6_to_4). This marks GSO packets as potentially modified after protocol translation. As a result, GSO segmentation will avoid using skb_segment_list and instead falls back to skb_segment for packets with the SKB_GSO_DODGY flag. This ensures that only safe and fully translated frag_list packets are processed by skb_segment_list, resolving protocol inconsistencies and improving throughput when forwarding GRO packets converted by XLAT. Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com> Fixes: `9fd1ff5d2a` ("udp: Support UDP fraglist GRO/GSO.") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260126152114.1211-1-jibin.zhang@mediatek.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-29 14:40:12 +01:00
Eric Dumazet	838eb96876	tcp: tcp_tx_timestamp() must look at the rtx queue tcp_tx_timestamp() is only called at the end of tcp_sendmsg_locked() before the final tcp_push(). By the time it is called, it is possible all the copied data has been sent already (transmit queue is empty). If this is the case, use the last skb in the rtx queue. Fixes: `75c119afe1` ("tcp: implement rb-tree based retransmit queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20260127123828.4098577-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:35:35 -08:00
Kuniyuki Iwashima	5b71de34b7	ipv4: Use EXPORT_IPV6_MOD_GPL() for ip_fib_metrics_init(). ip_fib_metrics_init() is only called from fib_create_info() and ip6_route_info_create(). Let's use EXPORT_IPV6_MOD_GPL() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260127081335.646666-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:33:38 -08:00
Kuniyuki Iwashima	6e84fc395e	ipv4: fib: Annotate access to struct fib_alias.fa_state. syzbot reported that struct fib_alias.fa_state can be modified locklessly by RCU readers. [0] Let's use READ_ONCE()/WRITE_ONCE() properly. [0]: BUG: KCSAN: data-race in fib_table_lookup / fib_table_lookup write to 0xffff88811b06a7fa of 1 bytes by task 4167 on cpu 0: fib_alias_accessed net/ipv4/fib_lookup.h:32 [inline] fib_table_lookup+0x361/0xd60 net/ipv4/fib_trie.c:1565 fib_lookup include/net/ip_fib.h:390 [inline] ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814 ip_route_output_key_hash net/ipv4/route.c:2705 [inline] __ip_route_output_key include/net/route.h:169 [inline] ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0x53a/0x600 net/socket.c:2592 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646 __sys_sendmmsg+0x185/0x320 net/socket.c:2735 __do_sys_sendmmsg net/socket.c:2762 [inline] __se_sys_sendmmsg net/socket.c:2759 [inline] __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffff88811b06a7fa of 1 bytes by task 4168 on cpu 1: fib_alias_accessed net/ipv4/fib_lookup.h:31 [inline] fib_table_lookup+0x338/0xd60 net/ipv4/fib_trie.c:1565 fib_lookup include/net/ip_fib.h:390 [inline] ip_route_output_key_hash_rcu+0x378/0x1380 net/ipv4/route.c:2814 ip_route_output_key_hash net/ipv4/route.c:2705 [inline] __ip_route_output_key include/net/route.h:169 [inline] ip_route_output_flow+0x65/0x110 net/ipv4/route.c:2932 udp_sendmsg+0x13c3/0x15d0 net/ipv4/udp.c:1450 inet_sendmsg+0xac/0xd0 net/ipv4/af_inet.c:859 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg net/socket.c:742 [inline] ____sys_sendmsg+0x53a/0x600 net/socket.c:2592 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646 __sys_sendmmsg+0x185/0x320 net/socket.c:2735 __do_sys_sendmmsg net/socket.c:2762 [inline] __se_sys_sendmmsg net/socket.c:2759 [inline] __x64_sys_sendmmsg+0x57/0x70 net/socket.c:2759 x64_sys_call+0x1e28/0x3000 arch/x86/include/generated/asm/syscalls_64.h:308 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc0/0x2a0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x00 -> 0x01 Reported by Kernel Concurrency Sanitizer on: CPU: 1 UID: 0 PID: 4168 Comm: syz.4.206 Not tainted syzkaller #0 PREEMPT(voluntary) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Reported-by: syzbot+d24f940f770afda885cf@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69783ead.050a0220.c9109.0013.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260127043528.514160-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:33:07 -08:00
Eric Dumazet	d5fb143dbe	tcp: move tcp_rack_advance() to tcp_input.c tcp_rack_advance() is called from tcp_ack() and tcp_sacktag_one(). Moving it to tcp_input.c allows the compiler to inline it and save both space and cpu cycles in TCP fast path. $ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2 add/remove: 0/2 grow/shrink: 1/1 up/down: 98/-132 (-34) Function old new delta tcp_ack 5741 5839 +98 tcp_sacktag_one 407 395 -12 __pfx_tcp_rack_advance 16 - -16 tcp_rack_advance 104 - -104 Total: Before=22572680, After=22572646, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260127032147.3498272-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:31:51 -08:00
Eric Dumazet	629a68865a	tcp: move tcp_rack_update_reo_wnd() to tcp_input.c tcp_rack_update_reo_wnd() is called only once from tcp_ack() Move it to tcp_input.c so that it can be inlined by the compiler to save space and cpu cycles. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 110/-153 (-43) Function old new delta tcp_ack 5631 5741 +110 __pfx_tcp_rack_update_reo_wnd 16 - -16 tcp_rack_update_reo_wnd 137 - -137 Total: Before=22572723, After=22572680, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260127032147.3498272-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:31:51 -08:00
Eric Dumazet	773a700213	tcp: mark tcp_process_tlp_ack() as unlikely It is unlikely we have to call tcp_process_tlp_ack(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260127032147.3498272-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-28 19:31:51 -08:00
Gal Pressman	b10b446ce7	udp: gso: Use single MSS length in UDP header for GSO_PARTIAL In GSO_PARTIAL segmentation, set the UDP length field to the single segment size (gso_size + UDP header) instead of the large MSS size. This provides hardware with a template length value for final segmentation, similar to how tunnel GSO_PARTIAL handles outer headers in UDP tunnels. This will remove the need to manually adjust the UDP header length in the drivers, as can be seen in subsequent patches. This was suggested by Alex in 2018: https://lore.kernel.org/netdev/CAKgT0UcdnUWgr3KQ=RnLKigokkiUuYefmL-ePpDvJOBNpKScFA@mail.gmail.com/ Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260125121649.778086-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2026-01-27 17:30:51 -08:00
Jiayuan Chen	929e30f931	bpf, sockmap: Fix FIONREAD for sockmap A socket using sockmap has its own independent receive queue: ingress_msg. This queue may contain data from its own protocol stack or from other sockets. Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to calculate FIONREAD is not enough. This patch adds a new msg_tot_len field in the psock structure to record the data length in ingress_msg. Additionally, we implement new ioctl interfaces for TCP and UDP to intercept FIONREAD operations. Note that we intentionally do not include sk_receive_queue data in the FIONREAD result. Data in sk_receive_queue has not yet been processed by the BPF verdict program, and may be redirected to other sockets or dropped. Including it would create semantic ambiguity since this data may never be readable by the user. Unix and VSOCK sockets have similar issues, but fixing them is outside the scope of this patch as it would require more intrusive changes. Previous work by John Fastabend made some efforts towards FIONREAD support: commit `e5c6de5fa0` ("bpf, sockmap: Incorrectly handling copied_seq") Although the current patch is based on the previous work by John Fastabend, it is acceptable for our Fixes tag to point to the same commit. FD1:read() -- FD1->copied_seq++ \| [read data] \| [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other \| [enqueue data] \| \| \| ingress to FD1 v ^ ... \| [sockmap] FD2 native stack Fixes: `04919bed94` ("tcp: Introduce tcp_read_skb()") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/r/20260124113314.113584-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-27 09:11:30 -08:00
Jiayuan Chen	b40cc5adaa	bpf, sockmap: Fix incorrect copied_seq calculation A socket using sockmap has its own independent receive queue: ingress_msg. This queue may contain data from its own protocol stack or from other sockets. The issue is that when reading from ingress_msg, we update tp->copied_seq by default. However, if the data is not from its own protocol stack, tcp->rcv_nxt is not increased. Later, if we convert this socket to a native socket, reading from this socket may fail because copied_seq might be significantly larger than rcv_nxt. This fix also addresses the syzkaller-reported bug referenced in the Closes tag. This patch marks the skmsg objects in ingress_msg. When reading, we update copied_seq only if the data is from its own protocol stack. FD1:read() -- FD1->copied_seq++ \| [read data] \| [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other \| [enqueue data] \| \| \| ingress to FD1 v ^ ... \| [sockmap] FD2 native stack Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 Fixes: `04919bed94` ("tcp: Introduce tcp_read_skb()") Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2026-01-27 09:11:30 -08:00
Eric Dumazet	a18056a6c1	tcp: move sk_forced_mem_schedule() to tcp.c TCP fast path can (auto)inline this helper, instead of (auto)inling it from tcp_send_fin(). No change of overall code size, but tcp_sendmsg() is faster. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/1 up/down: 141/-140 (1) Function old new delta tcp_stream_alloc_skb 216 357 +141 tcp_send_fin 688 548 -140 Total: Before=22236729, After=22236730, chg +0.00% BTW, we might change tcp_send_fin() to use tcp_stream_alloc_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260123111605.4089200-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2026-01-27 14:58:13 +01:00

1 2 3 4 5 ...

13372 Commits