Compare commits

...

33 Commits

Author SHA1 Message Date
Paolo Abeni
db472c34a7 Merge tag 'nf-26-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter for net

This is v3, I kept back an ipset fix and another to tigthen the xtables
interface to reject invalid combinations with the NFPROTO_ARP family.
They need a bit more discussion. I fixed the issues reported by AI on
patch 9 (add #ifdef to access ct zone, update nf_conntrack_broadcast
and patch 10 (use better Fixes: tag). Thanks!

The following patchset contains Netfilter fixes for *net*.

Note that most bugs fixed here stem from 2.6 days, the large PR is not
due to an increase in regressions.

1) Fix incorrect reject of set updates with nf_tables pipapo set
   avx2 backend.  This comes with a regression test in patch 2.
   From Florian Westphal.

2) nfnetlink_log needs to zero padding to prevent infoleak to userspace,
   from Weiming Shi.

3) xtables ip6t_rt module never validated that addrnr length is within the
   allowed array boundary. Reject bogus values.  From Ren Wei.

4) Fix high memory usage in rbtree set backend that was unwanted side-effect
   of the recently added binary search blob. From Pablo Neira Ayuso.

5) Patches 5 to 10, also from Pablo, address long-standing RCU safety bugs
   in conntracks handling of expectations: We can never safely defer
   a conntrack extension area without holding a reference. Yet expectation
   handling does so in multiple places.  Fix this by avoiding the need to
   look into the master conntrack to begin with and by extending locked
   sections in a few places.

11) Fix use of uninitialized rtp_addr in the sip conntrack helper,
    also from Weiming Shi.

12) Add stricter netlink policy checks in ctnetlink, from David Carlier.
    This avoids undefined behaviour when userspace provides huge wscale
    value.

netfilter pull request 26-03-26

* tag 'nf-26-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: ctnetlink: use netlink policy range checks
  netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
  netfilter: nf_conntrack_expect: skip expectations in other netns via proc
  netfilter: nf_conntrack_expect: store netns and zone in expectation
  netfilter: ctnetlink: ensure safe access to master conntrack
  netfilter: nf_conntrack_expect: use expect->helper
  netfilter: nf_conntrack_expect: honor expectation helper field
  netfilter: nft_set_rbtree: revisit array resize logic
  netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
  netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
  selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
  netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
====================

Link: https://patch.msgid.link/20260326125153.685915-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-26 15:38:14 +01:00
Paolo Abeni
deec4f7b41 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
For ice:
Michal corrects call to alloc_etherdev_mqs() to provide maximum number
of queues supported rather than currently allocated number of queues.

Petr Oros fixes issues related to some ethtool operations in switchdev
mode.

For iavf:
Kohei Enju corrects number of reported queues for ethtool statistics to
absolute max as using current number could race and cause out-of-bounds
issues.

For idpf:
Josh NULLs cdev_info pointer after freeing to prevent possible subsequent
improper access. He also defers setting of refillqs value until after
allocation to prevent possible NULL pointer dereference.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  idpf: only assign num refillqs if allocation was successful
  idpf: clear stale cdev_info ptr
  iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
  ice: use ice_update_eth_stats() for representor stats
  ice: fix inverted ready check for VF representors
  ice: set max queues in alloc_etherdev_mqs()
====================

Link: https://patch.msgid.link/20260323205843.624704-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-26 15:14:52 +01:00
Paolo Valerio
72d96e4e24 net: macb: use the current queue number for stats
There's a potential mismatch between the memory reserved for statistics
and the amount of memory written.

gem_get_sset_count() correctly computes the number of stats based on the
active queues, whereas gem_get_ethtool_stats() indiscriminately copies
data using the maximum number of queues, and in the case the number of
active queues is less than MACB_MAX_QUEUES, this results in a OOB write
as observed in the KASAN splat.

==================================================================
BUG: KASAN: vmalloc-out-of-bounds in gem_get_ethtool_stats+0x54/0x78
  [macb]
Write of size 760 at addr ffff80008080b000 by task ethtool/1027

CPU: [...]
Tainted: [E]=UNSIGNED_MODULE
Hardware name: raspberrypi rpi/rpi, BIOS 2025.10 10/01/2025
Call trace:
 show_stack+0x20/0x38 (C)
 dump_stack_lvl+0x80/0xf8
 print_report+0x384/0x5e0
 kasan_report+0xa0/0xf0
 kasan_check_range+0xe8/0x190
 __asan_memcpy+0x54/0x98
 gem_get_ethtool_stats+0x54/0x78 [macb
   926c13f3af83b0c6fe64badb21ec87d5e93fcf65]
 dev_ethtool+0x1220/0x38c0
 dev_ioctl+0x4ac/0xca8
 sock_do_ioctl+0x170/0x1d8
 sock_ioctl+0x484/0x5d8
 __arm64_sys_ioctl+0x12c/0x1b8
 invoke_syscall+0xd4/0x258
 el0_svc_common.constprop.0+0xb4/0x240
 do_el0_svc+0x48/0x68
 el0_svc+0x40/0xf8
 el0t_64_sync_handler+0xa0/0xe8
 el0t_64_sync+0x1b0/0x1b8

The buggy address belongs to a 1-page vmalloc region starting at
  0xffff80008080b000 allocated at dev_ethtool+0x11f0/0x38c0
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000
  index:0xffff00000a333000 pfn:0xa333
flags: 0x7fffc000000000(node=0|zone=0|lastcpupid=0x1ffff)
raw: 007fffc000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff00000a333000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff80008080b080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff80008080b100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff80008080b180: 00 00 00 00 00 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
                                  ^
 ffff80008080b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
 ffff80008080b280: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
==================================================================

Fix it by making sure the copied size only considers the active number of
queues.

Fixes: 512286bbd4 ("net: macb: Added some queue statistics")
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260323191634.2185840-1-pvalerio@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-26 13:48:21 +01:00
Paolo Abeni
aa637b2cf3 Merge tag 'for-net-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - L2CAP: Fix deadlock in l2cap_conn_del()
 - L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
 - L2CAP: Fix send LE flow credits in ACL link
 - btintel: serialize btintel_hw_error() with hci_req_sync_lock
 - btusb: clamp SCO altsetting table indices

* tag 'for-net-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: btusb: clamp SCO altsetting table indices
  Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
  Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
  Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
  Bluetooth: L2CAP: Fix send LE flow credits in ACL link
====================

Link: https://patch.msgid.link/20260325194358.618892-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-26 13:46:55 +01:00
David Carlier
8f15b5071b netfilter: ctnetlink: use netlink policy range checks
Replace manual range and mask validations with netlink policy
annotations in ctnetlink code paths, so that the netlink core rejects
invalid values early and can generate extack errors.

- CTA_PROTOINFO_TCP_STATE: reject values > TCP_CONNTRACK_SYN_SENT2 at
  policy level, removing the manual >= TCP_CONNTRACK_MAX check.
- CTA_PROTOINFO_TCP_WSCALE_ORIGINAL/REPLY: reject values > TCP_MAX_WSCALE
  (14). The normal TCP option parsing path already clamps to this value,
  but the ctnetlink path accepted 0-255, causing undefined behavior when
  used as a u32 shift count.
- CTA_FILTER_ORIG_FLAGS/REPLY_FLAGS: use NLA_POLICY_MASK with
  CTA_FILTER_F_ALL, removing the manual mask checks.
- CTA_EXPECT_FLAGS: use NLA_POLICY_MASK with NF_CT_EXPECT_MASK, adding
  a new mask define grouping all valid expect flags.

Extracted from a broader nf-next patch by Florian Westphal, scoped to
ctnetlink for the fixes tree.

Fixes: c8e2078cfe ("[NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling")
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:28:17 +01:00
Weiming Shi
6a2b724460 netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
process_sdp() declares union nf_inet_addr rtp_addr on the stack and
passes it to the nf_nat_sip sdp_session hook after walking the SDP
media descriptions. However rtp_addr is only initialized inside the
media loop when a recognized media type with a non-zero port is found.

If the SDP body contains no m= lines, only inactive media sections
(m=audio 0 ...) or only unrecognized media types, rtp_addr is never
assigned. Despite that, the function still calls hooks->sdp_session()
with &rtp_addr, causing nf_nat_sdp_session() to format the stale stack
value as an IP address and rewrite the SDP session owner and connection
lines with it.

With CONFIG_INIT_STACK_ALL_ZERO (default on most distributions) this
results in the session-level o= and c= addresses being rewritten to
0.0.0.0 for inactive SDP sessions. Without stack auto-init the
rewritten address is whatever happened to be on the stack.

Fix this by pre-initializing rtp_addr from the session-level connection
address (caddr) when available, and tracking via a have_rtp_addr flag
whether any valid address was established. Skip the sdp_session hook
entirely when no valid address exists.

Fixes: 4ab9e64e5e ("[NETFILTER]: nf_nat_sip: split up SDP mangling")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:28:17 +01:00
Pablo Neira Ayuso
3db5647984 netfilter: nf_conntrack_expect: skip expectations in other netns via proc
Skip expectations that do not reside in this netns.

Similar to e77e6ff502 ("netfilter: conntrack: do not dump other netns's
conntrack entries via proc").

Fixes: 9b03f38d04 ("netfilter: netns nf_conntrack: per-netns expectations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:28:03 +01:00
Pablo Neira Ayuso
02a3231b6d netfilter: nf_conntrack_expect: store netns and zone in expectation
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under
rcu_read_lock() but they dereference the master conntrack via
exp->master.

Since the expectation does not hold a reference on the master conntrack,
this could be dying conntrack or different recycled conntrack than the
real master due to SLAB_TYPESAFE_RCU.

Store the netns, the master_tuple and the zone in struct
nf_conntrack_expect as a safety measure.

This patch is required by the follow up fix not to dump expectations
that do not belong to this netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:24:40 +01:00
Pablo Neira Ayuso
bffcaad9af netfilter: ctnetlink: ensure safe access to master conntrack
Holding reference on the expectation is not sufficient, the master
conntrack object can just go away, making exp->master invalid.

To access exp->master safely:

- Grab the nf_conntrack_expect_lock, this gets serialized with
  clean_from_lists() which also holds this lock when the master
  conntrack goes away.

- Hold reference on master conntrack via nf_conntrack_find_get().
  Not so easy since the master tuple to look up for the master conntrack
  is not available in the existing problematic paths.

This patch goes for extending the nf_conntrack_expect_lock section
to address this issue for simplicity, in the cases that are described
below this is just slightly extending the lock section.

The add expectation command already holds a reference to the master
conntrack from ctnetlink_create_expect().

However, the delete expectation command needs to grab the spinlock
before looking up for the expectation. Expand the existing spinlock
section to address this to cover the expectation lookup. Note that,
the nf_ct_expect_iterate_net() calls already grabs the spinlock while
iterating over the expectation table, which is correct.

The get expectation command needs to grab the spinlock to ensure master
conntrack does not go away. This also expands the existing spinlock
section to cover the expectation lookup too. I needed to move the
netlink skb allocation out of the spinlock to keep it GFP_KERNEL.

For the expectation events, the IPEXP_DESTROY event is already delivered
under the spinlock, just move the delivery of IPEXP_NEW under the
spinlock too because the master conntrack event cache is reached through
exp->master.

While at it, add lockdep notations to help identify what codepaths need
to grab the spinlock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:18:32 +01:00
Pablo Neira Ayuso
f017941060 netfilter: nf_conntrack_expect: use expect->helper
Use expect->helper in ctnetlink and /proc to dump the helper name.
Using nfct_help() without holding a reference to the master conntrack
is unsafe.

Use exp->master->helper in ctnetlink path if userspace does not provide
an explicit helper when creating an expectation to retain the existing
behaviour. The ctnetlink expectation path holds the reference on the
master conntrack and nf_conntrack_expect lock and the nfnetlink glue
path refers to the master ct that is attached to the skb.

Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:18:31 +01:00
Pablo Neira Ayuso
9c42bc9db9 netfilter: nf_conntrack_expect: honor expectation helper field
The expectation helper field is mostly unused. As a result, the
netfilter codebase relies on accessing the helper through exp->master.

Always set on the expectation helper field so it can be used to reach
the helper.

nf_ct_expect_init() is called from packet path where the skb owns
the ct object, therefore accessing exp->master for the newly created
expectation is safe. This saves a lot of updates in all callsites
to pass the ct object as parameter to nf_ct_expect_init().

This is a preparation patches for follow up fixes.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:18:31 +01:00
Pablo Neira Ayuso
fafdd92b9e netfilter: nft_set_rbtree: revisit array resize logic
Chris Arges reports high memory consumption with thousands of
containers, this patch revisits the array allocation logic.

For anonymous sets, start by 16 slots (which takes 256 bytes on x86_64).
Expand it by x2 until threshold of 512 slots is reached, over that
threshold, expand it by x1.5.

For non-anonymous set, start by 1024 slots in the array (which takes 16
Kbytes initially on x86_64). Expand it by x1.5.

Use set->ndeact to subtract deactivated elements when calculating the
number of the slots in the array, otherwise the array size array gets
increased artifically. Add special case shrink logic to deal with flush
set too.

The shrink logic is skipped by anonymous sets.

Use check_add_overflow() to calculate the new array size.

Add a WARN_ON_ONCE check to make sure elements fit into the new array
size.

Reported-by: Chris Arges <carges@cloudflare.com>
Fixes: 7e43e0a114 ("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:18:31 +01:00
9d3f027327 netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
Reject rt match rules whose addrnr exceeds IP6T_RT_HOPS.

rt_mt6() expects addrnr to stay within the bounds of rtinfo->addrs[].
Validate addrnr during rule installation so malformed rules are rejected
before the match logic can use an out-of-range value.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Yuhang Zheng <z1652074432@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:18:31 +01:00
Weiming Shi
52025ebaa2 netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
__build_packet_message() manually constructs the NFULA_PAYLOAD netlink
attribute using skb_put() and skb_copy_bits(), bypassing the standard
nla_reserve()/nla_put() helpers. While nla_total_size(data_len) bytes
are allocated (including NLA alignment padding), only data_len bytes
of actual packet data are copied. The trailing nla_padlen(data_len)
bytes (1-3 when data_len is not 4-byte aligned) are never initialized,
leaking stale heap contents to userspace via the NFLOG netlink socket.

Replace the manual attribute construction with nla_reserve(), which
handles the tailroom check, header setup, and padding zeroing via
__nla_reserve(). The subsequent skb_copy_bits() fills in the payload
data on top of the properly initialized attribute.

Fixes: df6fb868d6 ("[NETFILTER]: nfnetlink: convert to generic netlink attribute functions")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-26 13:15:46 +01:00
Chuck Lever
84a8335d83 tls: Purge async_hold in tls_decrypt_async_wait()
The async_hold queue pins encrypted input skbs while
the AEAD engine references their scatterlist data. Once
tls_decrypt_async_wait() returns, every AEAD operation
has completed and the engine no longer references those
skbs, so they can be freed unconditionally.

A subsequent patch adds batch async decryption to
tls_sw_read_sock(), introducing a new call site that
must drain pending AEAD operations and release held
skbs. Move __skb_queue_purge(&ctx->async_hold) into
tls_decrypt_async_wait() so the purge is centralized
and every caller -- recvmsg's drain path, the -EBUSY
fallback in tls_do_decryption(), and the new read_sock
batch path -- releases held skbs on synchronization
without each site managing the purge independently.

This fixes a leak when tls_strp_msg_hold() fails part-way through,
after having added some cloned skbs to the async_hold
queue. tls_decrypt_sg() will then call tls_decrypt_async_wait() to
process all pending decrypts, and drop back to synchronous mode, but
tls_sw_recvmsg() only flushes the async_hold queue when one record has
been processed in "fully-async" mode, which may not be the case here.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: b8a6ff84ab ("tls: wait for pending async decryptions if tls_strp_msg_hold fails")
Link: https://patch.msgid.link/20260324-tls-read-sock-v5-1-5408befe5774@oracle.com
[pabeni@redhat.com: added leak comment]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-03-26 09:55:53 +01:00
Florian Westphal
6caefcd949 selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
This test will fail without
the preceding commit ("netfilter: nft_set_pipapo_avx2: fix match retart if found element is expired"):

  reject overlapping range on add       0s                              [ OK ]
  reload with flush                 /dev/stdin:59:32-52: Error: Could not process rule: File exists
add element inet filter test { 10.0.0.29 . 10.0.2.29 }

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-25 21:40:47 +01:00
Florian Westphal
d3c0037ffe netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
New test case fails unexpectedly when avx2 matching functions are used.

The test first loads a ranomly generated pipapo set
with 'ipv4 . port' key, i.e.  nft -f foo.

This works.  Then, it reloads the set after a flush:
(echo flush set t s; cat foo) | nft -f -

This is expected to work, because its the same set after all and it was
already loaded once.

But with avx2, this fails: nft reports a clashing element.

The reported clash is of following form:

    We successfully re-inserted
      a . b
      c . d

Then we try to insert a . d

avx2 finds the already existing a . d, which (due to 'flush set') is marked
as invalid in the new generation.  It skips the element and moves to next.

Due to incorrect masking, the skip-step finds the next matching
element *only considering the first field*,

i.e. we return the already reinserted "a . b", even though the
last field is different and the entry should not have been matched.

No such error is reported for the generic c implementation (no avx2) or when
the last field has to use the 'nft_pipapo_avx2_lookup_slow' fallback.

Bisection points to
7711f4bb4b ("netfilter: nft_set_pipapo: fix range overlap detection")
but that fix merely uncovers this bug.

Before this commit, the wrong element is returned, but erronously
reported as a full, identical duplicate.

The root-cause is too early return in the avx2 match functions.
When we process the last field, we should continue to process data
until the entire input size has been consumed to make sure no stale
bits remain in the map.

Link: https://lore.kernel.org/netfilter-devel/20260321152506.037f68c0@elisabeth/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-03-25 21:38:27 +01:00
Pengpeng Hou
129fa608b6 Bluetooth: btusb: clamp SCO altsetting table indices
btusb_work() maps the number of active SCO links to USB alternate
settings through a three-entry lookup table when CVSD traffic uses
transparent voice settings. The lookup currently indexes alts[] with
data->sco_num - 1 without first constraining sco_num to the number of
available table entries.

While the table only defines alternate settings for up to three SCO
links, data->sco_num comes from hci_conn_num() and is used directly.
Cap the lookup to the last table entry before indexing it so the
driver keeps selecting the highest supported alternate setting without
reading past alts[].

Fixes: baac6276c0 ("Bluetooth: btusb: handle mSBC audio over USB Endpoints")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-03-25 15:32:55 -04:00
Hyunwoo Kim
25f420a0d4 Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
l2cap_config_req() processes CONFIG_REQ for channels in BT_CONNECTED
state to support L2CAP reconfiguration (e.g. MTU changes). However,
since both CONF_INPUT_DONE and CONF_OUTPUT_DONE are already set from
the initial configuration, the reconfiguration path falls through to
l2cap_ertm_init(), which re-initializes tx_q, srej_q, srej_list, and
retrans_list without freeing the previous allocations and sets
chan->sdu to NULL without freeing the existing skb. This leaks all
previously allocated ERTM resources.

Additionally, l2cap_parse_conf_req() does not validate the minimum
value of remote_mps derived from the RFC max_pdu_size option. A zero
value propagates to l2cap_segment_sdu() where pdu_len becomes zero,
causing the while loop to never terminate since len is never
decremented, exhausting all available memory.

Fix the double-init by skipping l2cap_ertm_init() and
l2cap_chan_ready() when the channel is already in BT_CONNECTED state,
while still allowing the reconfiguration parameters to be updated
through l2cap_parse_conf_req(). Also add a pdu_len zero check in
l2cap_segment_sdu() as a safeguard.

Fixes: 96298f6401 ("Bluetooth: L2CAP: handle l2cap config request during open state")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-03-25 15:32:32 -04:00
Hyunwoo Kim
00fdebbbc5 Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
l2cap_conn_del() calls cancel_delayed_work_sync() for both info_timer
and id_addr_timer while holding conn->lock. However, the work functions
l2cap_info_timeout() and l2cap_conn_update_id_addr() both acquire
conn->lock, creating a potential AB-BA deadlock if the work is already
executing when l2cap_conn_del() takes the lock.

Move the work cancellations before acquiring conn->lock and use
disable_delayed_work_sync() to additionally prevent the works from
being rearmed after cancellation, consistent with the pattern used in
hci_conn_del().

Fixes: ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in hci_chan_del")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-03-25 15:32:09 -04:00
Cen Zhang
94d8e6fe5d Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
btintel_hw_error() issues two __hci_cmd_sync() calls (HCI_OP_RESET
and Intel exception-info retrieval) without holding
hci_req_sync_lock().  This lets it race against
hci_dev_do_close() -> btintel_shutdown_combined(), which also runs
__hci_cmd_sync() under the same lock.  When both paths manipulate
hdev->req_status/req_rsp concurrently, the close path may free the
response skb first, and the still-running hw_error path hits a
slab-use-after-free in kfree_skb().

Wrap the whole recovery sequence in hci_req_sync_lock/unlock so it
is serialized with every other synchronous HCI command issuer.

Below is the data race report and the kasan report:

  BUG: data-race in __hci_cmd_sync_sk / btintel_shutdown_combined

  read of hdev->req_rsp at net/bluetooth/hci_sync.c:199
  by task kworker/u17:1/83:
   __hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
   __hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
   btintel_hw_error+0x114/0x670 drivers/bluetooth/btintel.c:254
   hci_error_reset+0x348/0xa30 net/bluetooth/hci_core.c:1030

  write/free by task ioctl/22580:
   btintel_shutdown_combined+0xd0/0x360
    drivers/bluetooth/btintel.c:3648
   hci_dev_close_sync+0x9ae/0x2c10 net/bluetooth/hci_sync.c:5246
   hci_dev_do_close+0x232/0x460 net/bluetooth/hci_core.c:526

  BUG: KASAN: slab-use-after-free in
   sk_skb_reason_drop+0x43/0x380 net/core/skbuff.c:1202
  Read of size 4 at addr ffff888144a738dc
  by task kworker/u17:1/83:
   __hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
   __hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
   btintel_hw_error+0x186/0x670 drivers/bluetooth/btintel.c:260

Fixes: 973bb97e5a ("Bluetooth: btintel: Add generic function for handling hardware errors")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-03-25 15:31:48 -04:00
Zhang Chen
f39f905e55 Bluetooth: L2CAP: Fix send LE flow credits in ACL link
When the L2CAP channel mode is L2CAP_MODE_ERTM/L2CAP_MODE_STREAMING,
l2cap_publish_rx_avail will be called and le flow credits will be sent in
l2cap_chan_rx_avail, even though the link type is ACL.

The logs in question as follows:
> ACL Data RX: Handle 129 flags 0x02 dlen 12
      L2CAP: Unknown (0x16) ident 4 len 4
        40 00 ed 05
< ACL Data TX: Handle 129 flags 0x00 dlen 10
      L2CAP: Command Reject (0x01) ident 4 len 2
        Reason: Command not understood (0x0000)

Bluetooth: Unknown BR/EDR signaling command 0x16
Bluetooth: Wrong link type (-22)

Fixes: ce60b9231b ("Bluetooth: compute LE flow credits based on recvbuf space")
Signed-off-by: Zhang Chen <zhangchen01@kylinos.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-03-25 15:24:02 -04:00
Guangshuo Li
c4ea7d8907 net: mana: fix use-after-free in add_adev() error path
If auxiliary_device_add() fails, add_adev() jumps to add_fail and calls
auxiliary_device_uninit(adev).

The auxiliary device has its release callback set to adev_release(),
which frees the containing struct mana_adev. Since adev is embedded in
struct mana_adev, the subsequent fall-through to init_fail and access
to adev->id may result in a use-after-free.

Fix this by saving the allocated auxiliary device id in a local
variable before calling auxiliary_device_add(), and use that saved id
in the cleanup path after auxiliary_device_uninit().

Fixes: a69839d432 ("net: mana: Add support for auxiliary device")
Cc: stable@vger.kernel.org
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260323165730.945365-1-lgs201920130244@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24 21:07:58 -07:00
Jonas Köppeler
815980fe6d net_sched: codel: fix stale state for empty flows in fq_codel
When codel_dequeue() finds an empty queue, it resets vars->dropping
but does not reset vars->first_above_time.  The reference CoDel
algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:

  dodeque_result codel_queue_t::dodeque(time_t now) {
      ...
      if (r.p == NULL) {
          first_above_time = 0;   // <-- Linux omits this
      }
      ...
  }

Note that codel_should_drop() does reset first_above_time when called
with a NULL skb, but codel_dequeue() returns early before ever calling
codel_should_drop() in the empty-queue case.  The post-drop code paths
do reach codel_should_drop(NULL) and correctly reset the timer, so a
dropped packet breaks the cycle -- but the next delivered packet
re-arms first_above_time and the cycle repeats.

For sparse flows such as ICMP ping (one packet every 200ms-1s), the
first packet arms first_above_time, the flow goes empty, and the
second packet arrives after the interval has elapsed and gets dropped.
The pattern repeats, producing sustained loss on flows that are not
actually congested.

Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
target), ping at 5 pps under UDP flood:

  Before fix:  26% ping packet loss
  After fix:    0% ping packet loss

Fix by resetting first_above_time to zero in the empty-queue path
of codel_dequeue(), matching the reference algorithm.

Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: d068ca2ae2 ("codel: split into multiple files")
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/all/20260318134826.1281205-7-hawk@kernel.org/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260323174920.253526-1-hawk@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24 20:57:57 -07:00
Sabrina Dubroca
09474055f2 rtnetlink: fix leak of SRCU struct in rtnl_link_register
Commit 6b57ff21a3 ("rtnetlink: Protect link_ops by mutex.") swapped
the EEXIST check with the init_srcu_struct, but didn't add cleanup of
the SRCU struct we just allocated in case of error.

Fixes: 6b57ff21a3 ("rtnetlink: Protect link_ops by mutex.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/e77fe499f9a58c547b33b5212b3596dad417cec6.1774025341.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24 20:56:02 -07:00
Thangaraj Samynathan
7139970787 net: lan743x: fix duplex configuration in mac_link_up
The driver does not explicitly configure the MAC duplex mode when
bringing the link up. As a result, the MAC may retain a stale duplex
setting from a previous link state, leading to duplex mismatches with
the link partner and degraded network performance.

Update lan743x_phylink_mac_link_up() to set or clear the MAC_CR_DPX_
bit according to the negotiated duplex mode.

This ensures the MAC configuration is consistent with the phylink
resolved state.

Fixes: a5f199a8d8 ("net: lan743x: Migrate phylib to phylink")
Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260323065345.144915-1-thangaraj.s@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24 20:48:25 -07:00
xietangxin
ba8bda9a08 virtio_net: Fix UAF on dst_ops when IFF_XMIT_DST_RELEASE is cleared and napi_tx is false
A UAF issue occurs when the virtio_net driver is configured with napi_tx=N
and the device's IFF_XMIT_DST_RELEASE flag is cleared
(e.g., during the configuration of tc route filter rules).

When IFF_XMIT_DST_RELEASE is removed from the net_device, the network stack
expects the driver to hold the reference to skb->dst until the packet
is fully transmitted and freed. In virtio_net with napi_tx=N,
skbs may remain in the virtio transmit ring for an extended period.

If the network namespace is destroyed while these skbs are still pending,
the corresponding dst_ops structure has freed. When a subsequent packet
is transmitted, free_old_xmit() is triggered to clean up old skbs.
It then calls dst_release() on the skb associated with the stale dst_entry.
Since the dst_ops (referenced by the dst_entry) has already been freed,
a UAF kernel paging request occurs.

fix it by adds skb_dst_drop(skb) in start_xmit to explicitly release
the dst reference before the skb is queued in virtio_net.

Call Trace:
 Unable to handle kernel paging request at virtual address ffff80007e150000
 CPU: 2 UID: 0 PID: 6236 Comm: ping Kdump: loaded Not tainted 7.0.0-rc1+ #6 PREEMPT
  ...
  percpu_counter_add_batch+0x3c/0x158 lib/percpu_counter.c:98 (P)
  dst_release+0xe0/0x110  net/core/dst.c:177
  skb_release_head_state+0xe8/0x108 net/core/skbuff.c:1177
  sk_skb_reason_drop+0x54/0x2d8 net/core/skbuff.c:1255
  dev_kfree_skb_any_reason+0x64/0x78 net/core/dev.c:3469
  napi_consume_skb+0x1c4/0x3a0 net/core/skbuff.c:1527
  __free_old_xmit+0x164/0x230  drivers/net/virtio_net.c:611 [virtio_net]
  free_old_xmit drivers/net/virtio_net.c:1081 [virtio_net]
  start_xmit+0x7c/0x530 drivers/net/virtio_net.c:3329 [virtio_net]
  ...

Reproduction Steps:
NETDEV="enp3s0"

config_qdisc_route_filter() {
    tc qdisc del dev $NETDEV root
    tc qdisc add dev $NETDEV root handle 1: prio
    tc filter add dev $NETDEV parent 1:0 \
	protocol ip prio 100 route to 100 flowid 1:1
    ip route add 192.168.1.100/32 dev $NETDEV realm 100
}

test_ns() {
    ip netns add testns
    ip link set $NETDEV netns testns
    ip netns exec testns ifconfig $NETDEV  10.0.32.46/24
    ip netns exec testns ping -c 1 10.0.32.1
    ip netns del testns
}

config_qdisc_route_filter

test_ns
sleep 2
test_ns

Fixes: f2fc6a5458 ("[NETNS][IPV6] route6 - move ip6_dst_ops inside the network namespace")
Cc: stable@vger.kernel.org
Signed-off-by: xietangxin <xietangxin@yeah.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: 0287587884 ("net: better IFF_XMIT_DST_RELEASE support")
Link: https://patch.msgid.link/20260312025406.15641-1-xietangxin@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-03-24 20:46:08 -07:00
Joshua Hay
b5e5797e3c idpf: only assign num refillqs if allocation was successful
As reported by AI review [1], if the refillqs allocation fails, refillqs
will be NULL but num_refillqs will be non-zero. The release function
will then dereference refillqs since it thinks the refillqs are present,
resulting in a NULL ptr dereference.

Only assign the num refillqs if the allocation was successful. This will
prevent the release function from entering the loop and accessing
refillqs.

[1] https://lore.kernel.org/netdev/20260227035625.2632753-1-kuba@kernel.org/

Fixes: 95af467d9a ("idpf: configure resources for RX queues")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 13:29:51 -07:00
Joshua Hay
1eb0db7e39 idpf: clear stale cdev_info ptr
Deinit calls idpf_idc_deinit_core_aux_device to free the cdev_info
memory, but leaves the adapter->cdev_info field with a stale pointer
value. This will bypass subsequent "if (!cdev_info)" checks if cdev_info
is not reallocated. For example, if idc_init fails after a reset,
cdev_info will already have been freed during the reset handling, but it
will not have been reallocated. The next reset or rmmod will result in a
crash.

[  +0.000008] BUG: kernel NULL pointer dereference, address: 00000000000000d0
[  +0.000033] #PF: supervisor read access in kernel mode
[  +0.000020] #PF: error_code(0x0000) - not-present page
[  +0.000017] PGD 2097dfa067 P4D 0
[  +0.000017] Oops: Oops: 0000 [#1] SMP NOPTI
...
[  +0.000018] RIP: 0010:device_del+0x3e/0x3d0
[  +0.000010] Call Trace:
[  +0.000010]  <TASK>
[  +0.000012]  idpf_idc_deinit_core_aux_device+0x36/0x70 [idpf]
[  +0.000034]  idpf_vc_core_deinit+0x3e/0x180 [idpf]
[  +0.000035]  idpf_remove+0x40/0x1d0 [idpf]
[  +0.000035]  pci_device_remove+0x42/0xb0
[  +0.000020]  device_release_driver_internal+0x19c/0x200
[  +0.000024]  driver_detach+0x48/0x90
[  +0.000018]  bus_remove_driver+0x6d/0x100
[  +0.000023]  pci_unregister_driver+0x2e/0xb0
[  +0.000022]  __do_sys_delete_module.isra.0+0x18c/0x2b0
[  +0.000025]  ? kmem_cache_free+0x2c2/0x390
[  +0.000023]  do_syscall_64+0x107/0x7d0
[  +0.000023]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Pass the adapter struct into idpf_idc_deinit_core_aux_device instead and
clear the cdev_info ptr.

Fixes: f4312e6bfa ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 13:29:50 -07:00
Kohei Enju
fecacfc95f iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
iavf incorrectly uses real_num_tx_queues for ETH_SS_STATS. Since the
value could change in runtime, we should use num_tx_queues instead.

Moreover iavf_get_ethtool_stats() uses num_active_queues while
iavf_get_sset_count() and iavf_get_stat_strings() use
real_num_tx_queues, which triggers out-of-bounds writes when we do
"ethtool -L" and "ethtool -S" simultaneously [1].

For example when we change channels from 1 to 8, Thread 3 could be
scheduled before Thread 2, and out-of-bounds writes could be triggered
in Thread 3:

Thread 1 (ethtool -L)       Thread 2 (work)        Thread 3 (ethtool -S)
iavf_set_channels()
...
iavf_alloc_queues()
-> num_active_queues = 8
iavf_schedule_finish_config()
                                                   iavf_get_sset_count()
                                                   real_num_tx_queues: 1
                                                   -> buffer for 1 queue
                                                   iavf_get_ethtool_stats()
                                                   num_active_queues: 8
                                                   -> out-of-bounds!
                            iavf_finish_config()
                            -> real_num_tx_queues = 8

Use immutable num_tx_queues in all related functions to avoid the issue.

[1]
 BUG: KASAN: vmalloc-out-of-bounds in iavf_add_one_ethtool_stat+0x200/0x270
 Write of size 8 at addr ffffc900031c9080 by task ethtool/5800

 CPU: 1 UID: 0 PID: 5800 Comm: ethtool Not tainted 6.19.0-enjuk-08403-g8137e3db7f1c #241 PREEMPT(full)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x6f/0xb0
  print_report+0x170/0x4f3
  kasan_report+0xe1/0x180
  iavf_add_one_ethtool_stat+0x200/0x270
  iavf_get_ethtool_stats+0x14c/0x2e0
  __dev_ethtool+0x3d0c/0x5830
  dev_ethtool+0x12d/0x270
  dev_ioctl+0x53c/0xe30
  sock_do_ioctl+0x1a9/0x270
  sock_ioctl+0x3d4/0x5e0
  __x64_sys_ioctl+0x137/0x1c0
  do_syscall_64+0xf3/0x690
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
 RIP: 0033:0x7f7da0e6e36d
 ...
  </TASK>

 The buggy address belongs to a 1-page vmalloc region starting at 0xffffc900031c9000 allocated at __dev_ethtool+0x3cc9/0x5830
 The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000
 index:0xffff88813a013de0 pfn:0x13a013
 flags: 0x200000000000000(node=0|zone=2)
 raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
 raw: ffff88813a013de0 0000000000000000 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffffc900031c8f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  ffffc900031c9000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffffc900031c9080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
                    ^
  ffffc900031c9100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
  ffffc900031c9180: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8

Fixes: 64430f70ba ("iavf: Fix displaying queue statistics shown by ethtool")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 13:29:50 -07:00
Petr Oros
2526e440df ice: use ice_update_eth_stats() for representor stats
ice_repr_get_stats64() and __ice_get_ethtool_stats() call
ice_update_vsi_stats() on the VF's src_vsi. This always returns early
because ICE_VSI_DOWN is permanently set for VF VSIs - ice_up() is never
called on them since queues are managed by iavf through virtchnl.

In __ice_get_ethtool_stats() the original code called
ice_update_vsi_stats() for all VSIs including representors, iterated
over ice_gstrings_vsi_stats[] to populate the data, and then bailed out
with an early return before the per-queue ring stats section. That early
return was necessary because representor VSIs have no rings on the PF
side - the rings belong to the VF driver (iavf), so accessing per-queue
stats would be invalid.

Move the representor handling to the top of __ice_get_ethtool_stats()
and call ice_update_eth_stats() directly to read the hardware GLV_*
counters. This matches ice_get_vf_stats() which already uses
ice_update_eth_stats() for the same VF VSI in legacy mode. Apply the
same fix to ice_repr_get_stats64().

Note that ice_gstrings_vsi_stats[] contains five software ring counters
(rx_buf_failed, rx_page_failed, tx_linearize, tx_busy, tx_restart) that
are always zero for representors since the PF never processes packets on
VF rings. This is pre-existing behavior unchanged by this patch.

Fixes: 7aae80cef7 ("ice: add port representor ethtool ops and stats")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 13:29:07 -07:00
Petr Oros
ad85de0fc0 ice: fix inverted ready check for VF representors
Commit 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
refactored the VF readiness check into a generic repr->ops.ready()
callback but implemented ice_repr_ready_vf() with inverted logic:

  return !ice_check_vf_ready_for_cfg(repr->vf);

ice_check_vf_ready_for_cfg() returns 0 on success, so the negation
makes ready() return non-zero when the VF is ready. All callers treat
non-zero as "not ready, skip", causing ndo_get_stats64, get_drvinfo,
get_strings and get_ethtool_stats to always bail out in switchdev mode.

Remove the erroneous negation. The SF variant ice_repr_ready_sf() is
already correct (returns !active, i.e. non-zero when not active).

Fixes: 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 12:54:23 -07:00
Michal Swiatkowski
c7fcd269e1 ice: set max queues in alloc_etherdev_mqs()
When allocating netdevice using alloc_etherdev_mqs() the maximum
supported queues number should be passed. The vsi->alloc_txq/rxq is
storing current number of queues, not the maximum ones.

Use the same function for getting max Tx and Rx queues which is used
during ethtool -l call to set maximum number of queues during netdev
allocation.

Reproduction steps:
$ethtool -l $pf # says current 16, max 64
$ethtool -S $pf # fine
$ethtool -L $pf combined 40 # crash

[491187.472594] Call Trace:
[491187.472829]  <TASK>
[491187.473067]  netif_set_xps_queue+0x26/0x40
[491187.473305]  ice_vsi_cfg_txq+0x265/0x3d0 [ice]
[491187.473619]  ice_vsi_cfg_lan_txqs+0x68/0xa0 [ice]
[491187.473918]  ice_vsi_cfg_lan+0x2b/0xa0 [ice]
[491187.474202]  ice_vsi_open+0x71/0x170 [ice]
[491187.474484]  ice_vsi_recfg_qs+0x17f/0x230 [ice]
[491187.474759]  ? dev_get_min_mp_channel_count+0xab/0xd0
[491187.474987]  ice_set_channels+0x185/0x3d0 [ice]
[491187.475278]  ethnl_set_channels+0x26f/0x340

Fixes: ee13aa1a2c ("ice: use netif_get_num_default_rss_queues()")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-03-23 12:54:23 -07:00
35 changed files with 403 additions and 168 deletions

View File

@@ -251,11 +251,13 @@ void btintel_hw_error(struct hci_dev *hdev, u8 code)
bt_dev_err(hdev, "Hardware error 0x%2.2x", code);
hci_req_sync_lock(hdev);
skb = __hci_cmd_sync(hdev, HCI_OP_RESET, 0, NULL, HCI_INIT_TIMEOUT);
if (IS_ERR(skb)) {
bt_dev_err(hdev, "Reset after hardware error failed (%ld)",
PTR_ERR(skb));
return;
goto unlock;
}
kfree_skb(skb);
@@ -263,18 +265,21 @@ void btintel_hw_error(struct hci_dev *hdev, u8 code)
if (IS_ERR(skb)) {
bt_dev_err(hdev, "Retrieving Intel exception info failed (%ld)",
PTR_ERR(skb));
return;
goto unlock;
}
if (skb->len != 13) {
bt_dev_err(hdev, "Exception info size mismatch");
kfree_skb(skb);
return;
goto unlock;
}
bt_dev_err(hdev, "Exception info %s", (char *)(skb->data + 1));
kfree_skb(skb);
unlock:
hci_req_sync_unlock(hdev);
}
EXPORT_SYMBOL_GPL(btintel_hw_error);

View File

@@ -2376,8 +2376,11 @@ static void btusb_work(struct work_struct *work)
if (data->air_mode == HCI_NOTIFY_ENABLE_SCO_CVSD) {
if (hdev->voice_setting & 0x0020) {
static const int alts[3] = { 2, 4, 5 };
unsigned int sco_idx;
new_alts = alts[data->sco_num - 1];
sco_idx = min_t(unsigned int, data->sco_num - 1,
ARRAY_SIZE(alts) - 1);
new_alts = alts[sco_idx];
} else {
new_alts = data->sco_num;
}

View File

@@ -3224,7 +3224,7 @@ static void gem_get_ethtool_stats(struct net_device *dev,
spin_lock_irq(&bp->stats_lock);
gem_update_stats(bp);
memcpy(data, &bp->ethtool_stats, sizeof(u64)
* (GEM_STATS_LEN + QUEUE_STATS_LEN * MACB_MAX_QUEUES));
* (GEM_STATS_LEN + QUEUE_STATS_LEN * bp->num_queues));
spin_unlock_irq(&bp->stats_lock);
}

View File

@@ -313,14 +313,13 @@ static int iavf_get_sset_count(struct net_device *netdev, int sset)
{
/* Report the maximum number queues, even if not every queue is
* currently configured. Since allocation of queues is in pairs,
* use netdev->real_num_tx_queues * 2. The real_num_tx_queues is set
* at device creation and never changes.
* use netdev->num_tx_queues * 2. The num_tx_queues is set at
* device creation and never changes.
*/
if (sset == ETH_SS_STATS)
return IAVF_STATS_LEN +
(IAVF_QUEUE_STATS_LEN * 2 *
netdev->real_num_tx_queues);
(IAVF_QUEUE_STATS_LEN * 2 * netdev->num_tx_queues);
else
return -EINVAL;
}
@@ -345,19 +344,19 @@ static void iavf_get_ethtool_stats(struct net_device *netdev,
iavf_add_ethtool_stats(&data, adapter, iavf_gstrings_stats);
rcu_read_lock();
/* As num_active_queues describe both tx and rx queues, we can use
* it to iterate over rings' stats.
/* Use num_tx_queues to report stats for the maximum number of queues.
* Queues beyond num_active_queues will report zero.
*/
for (i = 0; i < adapter->num_active_queues; i++) {
struct iavf_ring *ring;
for (i = 0; i < netdev->num_tx_queues; i++) {
struct iavf_ring *tx_ring = NULL, *rx_ring = NULL;
/* Tx rings stats */
ring = &adapter->tx_rings[i];
iavf_add_queue_stats(&data, ring);
if (i < adapter->num_active_queues) {
tx_ring = &adapter->tx_rings[i];
rx_ring = &adapter->rx_rings[i];
}
/* Rx rings stats */
ring = &adapter->rx_rings[i];
iavf_add_queue_stats(&data, ring);
iavf_add_queue_stats(&data, tx_ring);
iavf_add_queue_stats(&data, rx_ring);
}
rcu_read_unlock();
}
@@ -376,9 +375,9 @@ static void iavf_get_stat_strings(struct net_device *netdev, u8 *data)
iavf_add_stat_strings(&data, iavf_gstrings_stats);
/* Queues are always allocated in pairs, so we just use
* real_num_tx_queues for both Tx and Rx queues.
* num_tx_queues for both Tx and Rx queues.
*/
for (i = 0; i < netdev->real_num_tx_queues; i++) {
for (i = 0; i < netdev->num_tx_queues; i++) {
iavf_add_stat_strings(&data, iavf_gstrings_queue_stats,
"tx", i);
iavf_add_stat_strings(&data, iavf_gstrings_queue_stats,

View File

@@ -839,6 +839,28 @@ static inline void ice_tx_xsk_pool(struct ice_vsi *vsi, u16 qid)
WRITE_ONCE(ring->xsk_pool, ice_get_xp_from_qid(vsi, qid));
}
/**
* ice_get_max_txq - return the maximum number of Tx queues for in a PF
* @pf: PF structure
*
* Return: maximum number of Tx queues
*/
static inline int ice_get_max_txq(struct ice_pf *pf)
{
return min(num_online_cpus(), pf->hw.func_caps.common_cap.num_txq);
}
/**
* ice_get_max_rxq - return the maximum number of Rx queues for in a PF
* @pf: PF structure
*
* Return: maximum number of Rx queues
*/
static inline int ice_get_max_rxq(struct ice_pf *pf)
{
return min(num_online_cpus(), pf->hw.func_caps.common_cap.num_rxq);
}
/**
* ice_get_main_vsi - Get the PF VSI
* @pf: PF instance

View File

@@ -1930,6 +1930,17 @@ __ice_get_ethtool_stats(struct net_device *netdev,
int i = 0;
char *p;
if (ice_is_port_repr_netdev(netdev)) {
ice_update_eth_stats(vsi);
for (j = 0; j < ICE_VSI_STATS_LEN; j++) {
p = (char *)vsi + ice_gstrings_vsi_stats[j].stat_offset;
data[i++] = (ice_gstrings_vsi_stats[j].sizeof_stat ==
sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
}
return;
}
ice_update_pf_stats(pf);
ice_update_vsi_stats(vsi);
@@ -1939,9 +1950,6 @@ __ice_get_ethtool_stats(struct net_device *netdev,
sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
}
if (ice_is_port_repr_netdev(netdev))
return;
/* populate per queue stats */
rcu_read_lock();
@@ -3773,24 +3781,6 @@ ice_get_ts_info(struct net_device *dev, struct kernel_ethtool_ts_info *info)
return 0;
}
/**
* ice_get_max_txq - return the maximum number of Tx queues for in a PF
* @pf: PF structure
*/
static int ice_get_max_txq(struct ice_pf *pf)
{
return min(num_online_cpus(), pf->hw.func_caps.common_cap.num_txq);
}
/**
* ice_get_max_rxq - return the maximum number of Rx queues for in a PF
* @pf: PF structure
*/
static int ice_get_max_rxq(struct ice_pf *pf)
{
return min(num_online_cpus(), pf->hw.func_caps.common_cap.num_rxq);
}
/**
* ice_get_combined_cnt - return the current number of combined channels
* @vsi: PF VSI pointer

View File

@@ -4699,8 +4699,8 @@ static int ice_cfg_netdev(struct ice_vsi *vsi)
struct net_device *netdev;
u8 mac_addr[ETH_ALEN];
netdev = alloc_etherdev_mqs(sizeof(*np), vsi->alloc_txq,
vsi->alloc_rxq);
netdev = alloc_etherdev_mqs(sizeof(*np), ice_get_max_txq(vsi->back),
ice_get_max_rxq(vsi->back));
if (!netdev)
return -ENOMEM;

View File

@@ -2,6 +2,7 @@
/* Copyright (C) 2019-2021, Intel Corporation. */
#include "ice.h"
#include "ice_lib.h"
#include "ice_eswitch.h"
#include "devlink/devlink.h"
#include "devlink/port.h"
@@ -67,7 +68,7 @@ ice_repr_get_stats64(struct net_device *netdev, struct rtnl_link_stats64 *stats)
return;
vsi = repr->src_vsi;
ice_update_vsi_stats(vsi);
ice_update_eth_stats(vsi);
eth_stats = &vsi->eth_stats;
stats->tx_packets = eth_stats->tx_unicast + eth_stats->tx_broadcast +
@@ -315,7 +316,7 @@ ice_repr_reg_netdev(struct net_device *netdev, const struct net_device_ops *ops)
static int ice_repr_ready_vf(struct ice_repr *repr)
{
return !ice_check_vf_ready_for_cfg(repr->vf);
return ice_check_vf_ready_for_cfg(repr->vf);
}
static int ice_repr_ready_sf(struct ice_repr *repr)

View File

@@ -1066,7 +1066,7 @@ bool idpf_vport_set_hsplit(const struct idpf_vport *vport, u8 val);
int idpf_idc_init(struct idpf_adapter *adapter);
int idpf_idc_init_aux_core_dev(struct idpf_adapter *adapter,
enum iidc_function_type ftype);
void idpf_idc_deinit_core_aux_device(struct iidc_rdma_core_dev_info *cdev_info);
void idpf_idc_deinit_core_aux_device(struct idpf_adapter *adapter);
void idpf_idc_deinit_vport_aux_device(struct iidc_rdma_vport_dev_info *vdev_info);
void idpf_idc_issue_reset_event(struct iidc_rdma_core_dev_info *cdev_info);
void idpf_idc_vdev_mtu_event(struct iidc_rdma_vport_dev_info *vdev_info,

View File

@@ -470,10 +470,11 @@ err_privd_alloc:
/**
* idpf_idc_deinit_core_aux_device - de-initialize Auxiliary Device(s)
* @cdev_info: IDC core device info pointer
* @adapter: driver private data structure
*/
void idpf_idc_deinit_core_aux_device(struct iidc_rdma_core_dev_info *cdev_info)
void idpf_idc_deinit_core_aux_device(struct idpf_adapter *adapter)
{
struct iidc_rdma_core_dev_info *cdev_info = adapter->cdev_info;
struct iidc_rdma_priv_dev_info *privd;
if (!cdev_info)
@@ -485,6 +486,7 @@ void idpf_idc_deinit_core_aux_device(struct iidc_rdma_core_dev_info *cdev_info)
kfree(privd->mapped_mem_regions);
kfree(privd);
kfree(cdev_info);
adapter->cdev_info = NULL;
}
/**

View File

@@ -1860,13 +1860,13 @@ static int idpf_rxq_group_alloc(struct idpf_vport *vport,
idpf_queue_assign(HSPLIT_EN, q, hs);
idpf_queue_assign(RSC_EN, q, rsc);
bufq_set->num_refillqs = num_rxq;
bufq_set->refillqs = kcalloc(num_rxq, swq_size,
GFP_KERNEL);
if (!bufq_set->refillqs) {
err = -ENOMEM;
goto err_alloc;
}
bufq_set->num_refillqs = num_rxq;
for (unsigned int k = 0; k < bufq_set->num_refillqs; k++) {
struct idpf_sw_queue *refillq =
&bufq_set->refillqs[k];

View File

@@ -3668,7 +3668,7 @@ void idpf_vc_core_deinit(struct idpf_adapter *adapter)
idpf_ptp_release(adapter);
idpf_deinit_task(adapter);
idpf_idc_deinit_core_aux_device(adapter->cdev_info);
idpf_idc_deinit_core_aux_device(adapter);
idpf_rel_rx_pt_lkup(adapter);
idpf_intr_rel(adapter);

View File

@@ -3053,6 +3053,11 @@ static void lan743x_phylink_mac_link_up(struct phylink_config *config,
else if (speed == SPEED_100)
mac_cr |= MAC_CR_CFG_L_;
if (duplex == DUPLEX_FULL)
mac_cr |= MAC_CR_DPX_;
else
mac_cr &= ~MAC_CR_DPX_;
lan743x_csr_write(adapter, MAC_CR, mac_cr);
lan743x_ptp_update_latency(adapter, speed);

View File

@@ -3425,6 +3425,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
struct auxiliary_device *adev;
struct mana_adev *madev;
int ret;
int id;
madev = kzalloc_obj(*madev);
if (!madev)
@@ -3434,7 +3435,8 @@ static int add_adev(struct gdma_dev *gd, const char *name)
ret = mana_adev_idx_alloc();
if (ret < 0)
goto idx_fail;
adev->id = ret;
id = ret;
adev->id = id;
adev->name = name;
adev->dev.parent = gd->gdma_context->dev;
@@ -3460,7 +3462,7 @@ add_fail:
auxiliary_device_uninit(adev);
init_fail:
mana_adev_idx_free(adev->id);
mana_adev_idx_free(id);
idx_fail:
kfree(madev);

View File

@@ -3355,6 +3355,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
/* Don't wait up for transmitted skbs to be freed. */
if (!use_napi) {
skb_orphan(skb);
skb_dst_drop(skb);
nf_reset_ct(skb);
}

View File

@@ -158,6 +158,7 @@ static struct sk_buff *codel_dequeue(void *ctx,
bool drop;
if (!skb) {
vars->first_above_time = 0;
vars->dropping = false;
return skb;
}

View File

@@ -83,6 +83,11 @@ void nf_conntrack_lock(spinlock_t *lock);
extern spinlock_t nf_conntrack_expect_lock;
static inline void lockdep_nfct_expect_lock_held(void)
{
lockdep_assert_held(&nf_conntrack_expect_lock);
}
/* ctnetlink code shared by both ctnetlink and nf_conntrack_bpf */
static inline void __nf_ct_set_timeout(struct nf_conn *ct, u64 timeout)

View File

@@ -22,10 +22,16 @@ struct nf_conntrack_expect {
/* Hash member */
struct hlist_node hnode;
/* Network namespace */
possible_net_t net;
/* We expect this tuple, with the following mask */
struct nf_conntrack_tuple tuple;
struct nf_conntrack_tuple_mask mask;
#ifdef CONFIG_NF_CONNTRACK_ZONES
struct nf_conntrack_zone zone;
#endif
/* Usage count. */
refcount_t use;
@@ -40,7 +46,7 @@ struct nf_conntrack_expect {
struct nf_conntrack_expect *this);
/* Helper to assign to new connection */
struct nf_conntrack_helper *helper;
struct nf_conntrack_helper __rcu *helper;
/* The conntrack of the master connection */
struct nf_conn *master;
@@ -62,7 +68,17 @@ struct nf_conntrack_expect {
static inline struct net *nf_ct_exp_net(struct nf_conntrack_expect *exp)
{
return nf_ct_net(exp->master);
return read_pnet(&exp->net);
}
static inline bool nf_ct_exp_zone_equal_any(const struct nf_conntrack_expect *a,
const struct nf_conntrack_zone *b)
{
#ifdef CONFIG_NF_CONNTRACK_ZONES
return a->zone.id == b->id;
#else
return true;
#endif
}
#define NF_CT_EXP_POLICY_NAME_LEN 16

View File

@@ -159,5 +159,9 @@ enum ip_conntrack_expect_events {
#define NF_CT_EXPECT_INACTIVE 0x2
#define NF_CT_EXPECT_USERSPACE 0x4
#ifdef __KERNEL__
#define NF_CT_EXPECT_MASK (NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE | \
NF_CT_EXPECT_USERSPACE)
#endif
#endif /* _UAPI_NF_CONNTRACK_COMMON_H */

View File

@@ -1771,6 +1771,9 @@ static void l2cap_conn_del(struct hci_conn *hcon, int err)
BT_DBG("hcon %p conn %p, err %d", hcon, conn, err);
disable_delayed_work_sync(&conn->info_timer);
disable_delayed_work_sync(&conn->id_addr_timer);
mutex_lock(&conn->lock);
kfree_skb(conn->rx_skb);
@@ -1786,8 +1789,6 @@ static void l2cap_conn_del(struct hci_conn *hcon, int err)
ida_destroy(&conn->tx_ida);
cancel_delayed_work_sync(&conn->id_addr_timer);
l2cap_unregister_all_users(conn);
/* Force the connection to be immediately dropped */
@@ -1806,9 +1807,6 @@ static void l2cap_conn_del(struct hci_conn *hcon, int err)
l2cap_chan_put(chan);
}
if (conn->info_state & L2CAP_INFO_FEAT_MASK_REQ_SENT)
cancel_delayed_work_sync(&conn->info_timer);
hci_chan_del(conn->hchan);
conn->hchan = NULL;
@@ -2400,6 +2398,9 @@ static int l2cap_segment_sdu(struct l2cap_chan *chan,
/* Remote device may have requested smaller PDUs */
pdu_len = min_t(size_t, pdu_len, chan->remote_mps);
if (!pdu_len)
return -EINVAL;
if (len <= pdu_len) {
sar = L2CAP_SAR_UNSEGMENTED;
sdu_len = 0;
@@ -4335,14 +4336,16 @@ static inline int l2cap_config_req(struct l2cap_conn *conn,
if (test_bit(CONF_INPUT_DONE, &chan->conf_state)) {
set_default_fcs(chan);
if (chan->mode == L2CAP_MODE_ERTM ||
chan->mode == L2CAP_MODE_STREAMING)
err = l2cap_ertm_init(chan);
if (chan->state != BT_CONNECTED) {
if (chan->mode == L2CAP_MODE_ERTM ||
chan->mode == L2CAP_MODE_STREAMING)
err = l2cap_ertm_init(chan);
if (err < 0)
l2cap_send_disconn_req(chan, -err);
else
l2cap_chan_ready(chan);
if (err < 0)
l2cap_send_disconn_req(chan, -err);
else
l2cap_chan_ready(chan);
}
goto unlock;
}
@@ -6630,6 +6633,10 @@ static void l2cap_chan_le_send_credits(struct l2cap_chan *chan)
struct l2cap_le_credits pkt;
u16 return_credits = l2cap_le_rx_credits(chan);
if (chan->mode != L2CAP_MODE_LE_FLOWCTL &&
chan->mode != L2CAP_MODE_EXT_FLOWCTL)
return;
if (chan->rx_credits >= return_credits)
return;

View File

@@ -629,6 +629,9 @@ int rtnl_link_register(struct rtnl_link_ops *ops)
unlock:
mutex_unlock(&link_ops_mutex);
if (err)
cleanup_srcu_struct(&ops->srcu);
return err;
}
EXPORT_SYMBOL_GPL(rtnl_link_register);

View File

@@ -157,6 +157,10 @@ static int rt_mt6_check(const struct xt_mtchk_param *par)
pr_debug("unknown flags %X\n", rtinfo->invflags);
return -EINVAL;
}
if (rtinfo->addrnr > IP6T_RT_HOPS) {
pr_debug("too many addresses specified\n");
return -EINVAL;
}
if ((rtinfo->flags & (IP6T_RT_RES | IP6T_RT_FST_MASK)) &&
(!(rtinfo->flags & IP6T_RT_TYP) ||
(rtinfo->rt_type != 0) ||

View File

@@ -21,6 +21,7 @@ int nf_conntrack_broadcast_help(struct sk_buff *skb,
unsigned int timeout)
{
const struct nf_conntrack_helper *helper;
struct net *net = read_pnet(&ct->ct_net);
struct nf_conntrack_expect *exp;
struct iphdr *iph = ip_hdr(skb);
struct rtable *rt = skb_rtable(skb);
@@ -70,8 +71,11 @@ int nf_conntrack_broadcast_help(struct sk_buff *skb,
exp->expectfn = NULL;
exp->flags = NF_CT_EXPECT_PERMANENT;
exp->class = NF_CT_EXPECT_CLASS_DEFAULT;
exp->helper = NULL;
rcu_assign_pointer(exp->helper, helper);
write_pnet(&exp->net, net);
#ifdef CONFIG_NF_CONNTRACK_ZONES
exp->zone = ct->zone;
#endif
nf_ct_expect_related(exp, 0);
nf_ct_expect_put(exp);

View File

@@ -247,6 +247,8 @@ void nf_ct_expect_event_report(enum ip_conntrack_expect_events event,
struct nf_ct_event_notifier *notify;
struct nf_conntrack_ecache *e;
lockdep_nfct_expect_lock_held();
rcu_read_lock();
notify = rcu_dereference(net->ct.nf_conntrack_event_cb);
if (!notify)

View File

@@ -51,6 +51,7 @@ void nf_ct_unlink_expect_report(struct nf_conntrack_expect *exp,
struct net *net = nf_ct_exp_net(exp);
struct nf_conntrack_net *cnet;
lockdep_nfct_expect_lock_held();
WARN_ON(!master_help);
WARN_ON(timer_pending(&exp->timeout));
@@ -112,12 +113,14 @@ nf_ct_exp_equal(const struct nf_conntrack_tuple *tuple,
const struct net *net)
{
return nf_ct_tuple_mask_cmp(tuple, &i->tuple, &i->mask) &&
net_eq(net, nf_ct_net(i->master)) &&
nf_ct_zone_equal_any(i->master, zone);
net_eq(net, read_pnet(&i->net)) &&
nf_ct_exp_zone_equal_any(i, zone);
}
bool nf_ct_remove_expect(struct nf_conntrack_expect *exp)
{
lockdep_nfct_expect_lock_held();
if (timer_delete(&exp->timeout)) {
nf_ct_unlink_expect(exp);
nf_ct_expect_put(exp);
@@ -177,6 +180,8 @@ nf_ct_find_expectation(struct net *net,
struct nf_conntrack_expect *i, *exp = NULL;
unsigned int h;
lockdep_nfct_expect_lock_held();
if (!cnet->expect_count)
return NULL;
@@ -309,12 +314,20 @@ struct nf_conntrack_expect *nf_ct_expect_alloc(struct nf_conn *me)
}
EXPORT_SYMBOL_GPL(nf_ct_expect_alloc);
/* This function can only be used from packet path, where accessing
* master's helper is safe, because the packet holds a reference on
* the conntrack object. Never use it from control plane.
*/
void nf_ct_expect_init(struct nf_conntrack_expect *exp, unsigned int class,
u_int8_t family,
const union nf_inet_addr *saddr,
const union nf_inet_addr *daddr,
u_int8_t proto, const __be16 *src, const __be16 *dst)
{
struct nf_conntrack_helper *helper = NULL;
struct nf_conn *ct = exp->master;
struct net *net = read_pnet(&ct->ct_net);
struct nf_conn_help *help;
int len;
if (family == AF_INET)
@@ -325,7 +338,16 @@ void nf_ct_expect_init(struct nf_conntrack_expect *exp, unsigned int class,
exp->flags = 0;
exp->class = class;
exp->expectfn = NULL;
exp->helper = NULL;
help = nfct_help(ct);
if (help)
helper = rcu_dereference(help->helper);
rcu_assign_pointer(exp->helper, helper);
write_pnet(&exp->net, net);
#ifdef CONFIG_NF_CONNTRACK_ZONES
exp->zone = ct->zone;
#endif
exp->tuple.src.l3num = family;
exp->tuple.dst.protonum = proto;
@@ -442,6 +464,8 @@ static inline int __nf_ct_expect_check(struct nf_conntrack_expect *expect,
unsigned int h;
int ret = 0;
lockdep_nfct_expect_lock_held();
if (!master_help) {
ret = -ESHUTDOWN;
goto out;
@@ -498,8 +522,9 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect *expect,
nf_ct_expect_insert(expect);
spin_unlock_bh(&nf_conntrack_expect_lock);
nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
spin_unlock_bh(&nf_conntrack_expect_lock);
return 0;
out:
spin_unlock_bh(&nf_conntrack_expect_lock);
@@ -627,11 +652,15 @@ static int exp_seq_show(struct seq_file *s, void *v)
{
struct nf_conntrack_expect *expect;
struct nf_conntrack_helper *helper;
struct net *net = seq_file_net(s);
struct hlist_node *n = v;
char *delim = "";
expect = hlist_entry(n, struct nf_conntrack_expect, hnode);
if (!net_eq(nf_ct_exp_net(expect), net))
return 0;
if (expect->timeout.function)
seq_printf(s, "%ld ", timer_pending(&expect->timeout)
? (long)(expect->timeout.expires - jiffies)/HZ : 0);
@@ -654,7 +683,7 @@ static int exp_seq_show(struct seq_file *s, void *v)
if (expect->flags & NF_CT_EXPECT_USERSPACE)
seq_printf(s, "%sUSERSPACE", delim);
helper = rcu_dereference(nfct_help(expect->master)->helper);
helper = rcu_dereference(expect->helper);
if (helper) {
seq_printf(s, "%s%s", expect->flags ? " " : "", helper->name);
if (helper->expect_policy[expect->class].name[0])

View File

@@ -643,7 +643,7 @@ static int expect_h245(struct sk_buff *skb, struct nf_conn *ct,
&ct->tuplehash[!dir].tuple.src.u3,
&ct->tuplehash[!dir].tuple.dst.u3,
IPPROTO_TCP, NULL, &port);
exp->helper = &nf_conntrack_helper_h245;
rcu_assign_pointer(exp->helper, &nf_conntrack_helper_h245);
nathook = rcu_dereference(nfct_h323_nat_hook);
if (memcmp(&ct->tuplehash[dir].tuple.src.u3,
@@ -767,7 +767,7 @@ static int expect_callforwarding(struct sk_buff *skb,
nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT, nf_ct_l3num(ct),
&ct->tuplehash[!dir].tuple.src.u3, &addr,
IPPROTO_TCP, NULL, &port);
exp->helper = nf_conntrack_helper_q931;
rcu_assign_pointer(exp->helper, nf_conntrack_helper_q931);
nathook = rcu_dereference(nfct_h323_nat_hook);
if (memcmp(&ct->tuplehash[dir].tuple.src.u3,
@@ -1234,7 +1234,7 @@ static int expect_q931(struct sk_buff *skb, struct nf_conn *ct,
&ct->tuplehash[!dir].tuple.src.u3 : NULL,
&ct->tuplehash[!dir].tuple.dst.u3,
IPPROTO_TCP, NULL, &port);
exp->helper = nf_conntrack_helper_q931;
rcu_assign_pointer(exp->helper, nf_conntrack_helper_q931);
exp->flags = NF_CT_EXPECT_PERMANENT; /* Accept multiple calls */
nathook = rcu_dereference(nfct_h323_nat_hook);
@@ -1306,7 +1306,7 @@ static int process_gcf(struct sk_buff *skb, struct nf_conn *ct,
nf_ct_expect_init(exp, NF_CT_EXPECT_CLASS_DEFAULT, nf_ct_l3num(ct),
&ct->tuplehash[!dir].tuple.src.u3, &addr,
IPPROTO_UDP, NULL, &port);
exp->helper = nf_conntrack_helper_ras;
rcu_assign_pointer(exp->helper, nf_conntrack_helper_ras);
if (nf_ct_expect_related(exp, 0) == 0) {
pr_debug("nf_ct_ras: expect RAS ");
@@ -1523,7 +1523,7 @@ static int process_acf(struct sk_buff *skb, struct nf_conn *ct,
&ct->tuplehash[!dir].tuple.src.u3, &addr,
IPPROTO_TCP, NULL, &port);
exp->flags = NF_CT_EXPECT_PERMANENT;
exp->helper = nf_conntrack_helper_q931;
rcu_assign_pointer(exp->helper, nf_conntrack_helper_q931);
if (nf_ct_expect_related(exp, 0) == 0) {
pr_debug("nf_ct_ras: expect Q.931 ");
@@ -1577,7 +1577,7 @@ static int process_lcf(struct sk_buff *skb, struct nf_conn *ct,
&ct->tuplehash[!dir].tuple.src.u3, &addr,
IPPROTO_TCP, NULL, &port);
exp->flags = NF_CT_EXPECT_PERMANENT;
exp->helper = nf_conntrack_helper_q931;
rcu_assign_pointer(exp->helper, nf_conntrack_helper_q931);
if (nf_ct_expect_related(exp, 0) == 0) {
pr_debug("nf_ct_ras: expect Q.931 ");

View File

@@ -395,14 +395,10 @@ EXPORT_SYMBOL_GPL(nf_conntrack_helper_register);
static bool expect_iter_me(struct nf_conntrack_expect *exp, void *data)
{
struct nf_conn_help *help = nfct_help(exp->master);
const struct nf_conntrack_helper *me = data;
const struct nf_conntrack_helper *this;
if (exp->helper == me)
return true;
this = rcu_dereference_protected(help->helper,
this = rcu_dereference_protected(exp->helper,
lockdep_is_held(&nf_conntrack_expect_lock));
return this == me;
}
@@ -421,6 +417,11 @@ void nf_conntrack_helper_unregister(struct nf_conntrack_helper *me)
nf_ct_expect_iterate_destroy(expect_iter_me, NULL);
nf_ct_iterate_destroy(unhelp, me);
/* nf_ct_iterate_destroy() does an unconditional synchronize_rcu() as
* last step, this ensures rcu readers of exp->helper are done.
* No need for another synchronize_rcu() here.
*/
}
EXPORT_SYMBOL_GPL(nf_conntrack_helper_unregister);

View File

@@ -910,8 +910,8 @@ struct ctnetlink_filter {
};
static const struct nla_policy cta_filter_nla_policy[CTA_FILTER_MAX + 1] = {
[CTA_FILTER_ORIG_FLAGS] = { .type = NLA_U32 },
[CTA_FILTER_REPLY_FLAGS] = { .type = NLA_U32 },
[CTA_FILTER_ORIG_FLAGS] = NLA_POLICY_MASK(NLA_U32, CTA_FILTER_F_ALL),
[CTA_FILTER_REPLY_FLAGS] = NLA_POLICY_MASK(NLA_U32, CTA_FILTER_F_ALL),
};
static int ctnetlink_parse_filter(const struct nlattr *attr,
@@ -925,17 +925,11 @@ static int ctnetlink_parse_filter(const struct nlattr *attr,
if (ret)
return ret;
if (tb[CTA_FILTER_ORIG_FLAGS]) {
if (tb[CTA_FILTER_ORIG_FLAGS])
filter->orig_flags = nla_get_u32(tb[CTA_FILTER_ORIG_FLAGS]);
if (filter->orig_flags & ~CTA_FILTER_F_ALL)
return -EOPNOTSUPP;
}
if (tb[CTA_FILTER_REPLY_FLAGS]) {
if (tb[CTA_FILTER_REPLY_FLAGS])
filter->reply_flags = nla_get_u32(tb[CTA_FILTER_REPLY_FLAGS]);
if (filter->reply_flags & ~CTA_FILTER_F_ALL)
return -EOPNOTSUPP;
}
return 0;
}
@@ -2634,7 +2628,7 @@ static const struct nla_policy exp_nla_policy[CTA_EXPECT_MAX+1] = {
[CTA_EXPECT_HELP_NAME] = { .type = NLA_NUL_STRING,
.len = NF_CT_HELPER_NAME_LEN - 1 },
[CTA_EXPECT_ZONE] = { .type = NLA_U16 },
[CTA_EXPECT_FLAGS] = { .type = NLA_U32 },
[CTA_EXPECT_FLAGS] = NLA_POLICY_MASK(NLA_BE32, NF_CT_EXPECT_MASK),
[CTA_EXPECT_CLASS] = { .type = NLA_U32 },
[CTA_EXPECT_NAT] = { .type = NLA_NESTED },
[CTA_EXPECT_FN] = { .type = NLA_NUL_STRING },
@@ -3012,7 +3006,7 @@ ctnetlink_exp_dump_expect(struct sk_buff *skb,
{
struct nf_conn *master = exp->master;
long timeout = ((long)exp->timeout.expires - (long)jiffies) / HZ;
struct nf_conn_help *help;
struct nf_conntrack_helper *helper;
#if IS_ENABLED(CONFIG_NF_NAT)
struct nlattr *nest_parms;
struct nf_conntrack_tuple nat_tuple = {};
@@ -3057,15 +3051,12 @@ ctnetlink_exp_dump_expect(struct sk_buff *skb,
nla_put_be32(skb, CTA_EXPECT_FLAGS, htonl(exp->flags)) ||
nla_put_be32(skb, CTA_EXPECT_CLASS, htonl(exp->class)))
goto nla_put_failure;
help = nfct_help(master);
if (help) {
struct nf_conntrack_helper *helper;
helper = rcu_dereference(help->helper);
if (helper &&
nla_put_string(skb, CTA_EXPECT_HELP_NAME, helper->name))
goto nla_put_failure;
}
helper = rcu_dereference(exp->helper);
if (helper &&
nla_put_string(skb, CTA_EXPECT_HELP_NAME, helper->name))
goto nla_put_failure;
expfn = nf_ct_helper_expectfn_find_by_symbol(exp->expectfn);
if (expfn != NULL &&
nla_put_string(skb, CTA_EXPECT_FN, expfn->name))
@@ -3358,31 +3349,37 @@ static int ctnetlink_get_expect(struct sk_buff *skb,
if (err < 0)
return err;
skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
if (!skb2)
return -ENOMEM;
spin_lock_bh(&nf_conntrack_expect_lock);
exp = nf_ct_expect_find_get(info->net, &zone, &tuple);
if (!exp)
if (!exp) {
spin_unlock_bh(&nf_conntrack_expect_lock);
kfree_skb(skb2);
return -ENOENT;
}
if (cda[CTA_EXPECT_ID]) {
__be32 id = nla_get_be32(cda[CTA_EXPECT_ID]);
if (id != nf_expect_get_id(exp)) {
nf_ct_expect_put(exp);
spin_unlock_bh(&nf_conntrack_expect_lock);
kfree_skb(skb2);
return -ENOENT;
}
}
skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
if (!skb2) {
nf_ct_expect_put(exp);
return -ENOMEM;
}
rcu_read_lock();
err = ctnetlink_exp_fill_info(skb2, NETLINK_CB(skb).portid,
info->nlh->nlmsg_seq, IPCTNL_MSG_EXP_NEW,
exp);
rcu_read_unlock();
nf_ct_expect_put(exp);
spin_unlock_bh(&nf_conntrack_expect_lock);
if (err <= 0) {
kfree_skb(skb2);
return -ENOMEM;
@@ -3394,12 +3391,9 @@ static int ctnetlink_get_expect(struct sk_buff *skb,
static bool expect_iter_name(struct nf_conntrack_expect *exp, void *data)
{
struct nf_conntrack_helper *helper;
const struct nf_conn_help *m_help;
const char *name = data;
m_help = nfct_help(exp->master);
helper = rcu_dereference(m_help->helper);
helper = rcu_dereference(exp->helper);
if (!helper)
return false;
@@ -3432,22 +3426,26 @@ static int ctnetlink_del_expect(struct sk_buff *skb,
if (err < 0)
return err;
spin_lock_bh(&nf_conntrack_expect_lock);
/* bump usage count to 2 */
exp = nf_ct_expect_find_get(info->net, &zone, &tuple);
if (!exp)
if (!exp) {
spin_unlock_bh(&nf_conntrack_expect_lock);
return -ENOENT;
}
if (cda[CTA_EXPECT_ID]) {
__be32 id = nla_get_be32(cda[CTA_EXPECT_ID]);
if (id != nf_expect_get_id(exp)) {
nf_ct_expect_put(exp);
spin_unlock_bh(&nf_conntrack_expect_lock);
return -ENOENT;
}
}
/* after list removal, usage count == 1 */
spin_lock_bh(&nf_conntrack_expect_lock);
if (timer_delete(&exp->timeout)) {
nf_ct_unlink_expect_report(exp, NETLINK_CB(skb).portid,
nlmsg_report(info->nlh));
@@ -3534,9 +3532,10 @@ ctnetlink_alloc_expect(const struct nlattr * const cda[], struct nf_conn *ct,
struct nf_conntrack_tuple *tuple,
struct nf_conntrack_tuple *mask)
{
u_int32_t class = 0;
struct net *net = read_pnet(&ct->ct_net);
struct nf_conntrack_expect *exp;
struct nf_conn_help *help;
u32 class = 0;
int err;
help = nfct_help(ct);
@@ -3573,7 +3572,13 @@ ctnetlink_alloc_expect(const struct nlattr * const cda[], struct nf_conn *ct,
exp->class = class;
exp->master = ct;
exp->helper = helper;
write_pnet(&exp->net, net);
#ifdef CONFIG_NF_CONNTRACK_ZONES
exp->zone = ct->zone;
#endif
if (!helper)
helper = rcu_dereference(help->helper);
rcu_assign_pointer(exp->helper, helper);
exp->tuple = *tuple;
exp->mask.src.u3 = mask->src.u3;
exp->mask.src.u.all = mask->src.u.all;

View File

@@ -1385,9 +1385,9 @@ nla_put_failure:
}
static const struct nla_policy tcp_nla_policy[CTA_PROTOINFO_TCP_MAX+1] = {
[CTA_PROTOINFO_TCP_STATE] = { .type = NLA_U8 },
[CTA_PROTOINFO_TCP_WSCALE_ORIGINAL] = { .type = NLA_U8 },
[CTA_PROTOINFO_TCP_WSCALE_REPLY] = { .type = NLA_U8 },
[CTA_PROTOINFO_TCP_STATE] = NLA_POLICY_MAX(NLA_U8, TCP_CONNTRACK_SYN_SENT2),
[CTA_PROTOINFO_TCP_WSCALE_ORIGINAL] = NLA_POLICY_MAX(NLA_U8, TCP_MAX_WSCALE),
[CTA_PROTOINFO_TCP_WSCALE_REPLY] = NLA_POLICY_MAX(NLA_U8, TCP_MAX_WSCALE),
[CTA_PROTOINFO_TCP_FLAGS_ORIGINAL] = { .len = sizeof(struct nf_ct_tcp_flags) },
[CTA_PROTOINFO_TCP_FLAGS_REPLY] = { .len = sizeof(struct nf_ct_tcp_flags) },
};
@@ -1414,10 +1414,6 @@ static int nlattr_to_tcp(struct nlattr *cda[], struct nf_conn *ct)
if (err < 0)
return err;
if (tb[CTA_PROTOINFO_TCP_STATE] &&
nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]) >= TCP_CONNTRACK_MAX)
return -EINVAL;
spin_lock_bh(&ct->lock);
if (tb[CTA_PROTOINFO_TCP_STATE])
ct->proto.tcp.state = nla_get_u8(tb[CTA_PROTOINFO_TCP_STATE]);

View File

@@ -924,7 +924,7 @@ static int set_expected_rtp_rtcp(struct sk_buff *skb, unsigned int protoff,
exp = __nf_ct_expect_find(net, nf_ct_zone(ct), &tuple);
if (!exp || exp->master == ct ||
nfct_help(exp->master)->helper != nfct_help(ct)->helper ||
exp->helper != nfct_help(ct)->helper ||
exp->class != class)
break;
#if IS_ENABLED(CONFIG_NF_NAT)
@@ -1040,6 +1040,7 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
unsigned int port;
const struct sdp_media_type *t;
int ret = NF_ACCEPT;
bool have_rtp_addr = false;
hooks = rcu_dereference(nf_nat_sip_hooks);
@@ -1056,8 +1057,11 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
caddr_len = 0;
if (ct_sip_parse_sdp_addr(ct, *dptr, sdpoff, *datalen,
SDP_HDR_CONNECTION, SDP_HDR_MEDIA,
&matchoff, &matchlen, &caddr) > 0)
&matchoff, &matchlen, &caddr) > 0) {
caddr_len = matchlen;
memcpy(&rtp_addr, &caddr, sizeof(rtp_addr));
have_rtp_addr = true;
}
mediaoff = sdpoff;
for (i = 0; i < ARRAY_SIZE(sdp_media_types); ) {
@@ -1091,9 +1095,11 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
&matchoff, &matchlen, &maddr) > 0) {
maddr_len = matchlen;
memcpy(&rtp_addr, &maddr, sizeof(rtp_addr));
} else if (caddr_len)
have_rtp_addr = true;
} else if (caddr_len) {
memcpy(&rtp_addr, &caddr, sizeof(rtp_addr));
else {
have_rtp_addr = true;
} else {
nf_ct_helper_log(skb, ct, "cannot parse SDP message");
return NF_DROP;
}
@@ -1125,7 +1131,7 @@ static int process_sdp(struct sk_buff *skb, unsigned int protoff,
/* Update session connection and owner addresses */
hooks = rcu_dereference(nf_nat_sip_hooks);
if (hooks && ct->status & IPS_NAT_MASK)
if (hooks && ct->status & IPS_NAT_MASK && have_rtp_addr)
ret = hooks->sdp_session(skb, protoff, dataoff,
dptr, datalen, sdpoff,
&rtp_addr);
@@ -1297,7 +1303,7 @@ static int process_register_request(struct sk_buff *skb, unsigned int protoff,
nf_ct_expect_init(exp, SIP_EXPECT_SIGNALLING, nf_ct_l3num(ct),
saddr, &daddr, proto, NULL, &port);
exp->timeout.expires = sip_timeout * HZ;
exp->helper = helper;
rcu_assign_pointer(exp->helper, helper);
exp->flags = NF_CT_EXPECT_PERMANENT | NF_CT_EXPECT_INACTIVE;
hooks = rcu_dereference(nf_nat_sip_hooks);

View File

@@ -647,15 +647,11 @@ __build_packet_message(struct nfnl_log_net *log,
if (data_len) {
struct nlattr *nla;
int size = nla_attr_size(data_len);
if (skb_tailroom(inst->skb) < nla_total_size(data_len))
nla = nla_reserve(inst->skb, NFULA_PAYLOAD, data_len);
if (!nla)
goto nla_put_failure;
nla = skb_put(inst->skb, nla_total_size(data_len));
nla->nla_type = NFULA_PAYLOAD;
nla->nla_len = size;
if (skb_copy_bits(skb, 0, nla_data(nla), data_len))
BUG();
}

View File

@@ -242,7 +242,7 @@ static int nft_pipapo_avx2_lookup_4b_2(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -319,7 +319,7 @@ static int nft_pipapo_avx2_lookup_4b_4(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -414,7 +414,7 @@ static int nft_pipapo_avx2_lookup_4b_8(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -505,7 +505,7 @@ static int nft_pipapo_avx2_lookup_4b_12(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -641,7 +641,7 @@ static int nft_pipapo_avx2_lookup_4b_32(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -699,7 +699,7 @@ static int nft_pipapo_avx2_lookup_8b_1(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -764,7 +764,7 @@ static int nft_pipapo_avx2_lookup_8b_2(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -839,7 +839,7 @@ static int nft_pipapo_avx2_lookup_8b_4(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -925,7 +925,7 @@ static int nft_pipapo_avx2_lookup_8b_6(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;
@@ -1019,7 +1019,7 @@ static int nft_pipapo_avx2_lookup_8b_16(unsigned long *map, unsigned long *fill,
b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
if (last)
return b;
ret = b;
if (unlikely(ret == -1))
ret = b / XSAVE_YMM_SIZE;

View File

@@ -572,14 +572,12 @@ static struct nft_array *nft_array_alloc(u32 max_intervals)
return array;
}
#define NFT_ARRAY_EXTRA_SIZE 10240
/* Similar to nft_rbtree_{u,k}size to hide details to userspace, but consider
* packed representation coming from userspace for anonymous sets too.
*/
static u32 nft_array_elems(const struct nft_set *set)
{
u32 nelems = atomic_read(&set->nelems);
u32 nelems = atomic_read(&set->nelems) - set->ndeact;
/* Adjacent intervals are represented with a single start element in
* anonymous sets, use the current element counter as is.
@@ -595,27 +593,87 @@ static u32 nft_array_elems(const struct nft_set *set)
return (nelems / 2) + 2;
}
static int nft_array_may_resize(const struct nft_set *set)
#define NFT_ARRAY_INITIAL_SIZE 1024
#define NFT_ARRAY_INITIAL_ANON_SIZE 16
#define NFT_ARRAY_INITIAL_ANON_THRESH (8192U / sizeof(struct nft_array_interval))
static int nft_array_may_resize(const struct nft_set *set, bool flush)
{
u32 nelems = nft_array_elems(set), new_max_intervals;
u32 initial_intervals, max_intervals, new_max_intervals, delta;
u32 shrinked_max_intervals, nelems = nft_array_elems(set);
struct nft_rbtree *priv = nft_set_priv(set);
struct nft_array *array;
if (!priv->array_next) {
array = nft_array_alloc(nelems + NFT_ARRAY_EXTRA_SIZE);
if (nft_set_is_anonymous(set))
initial_intervals = NFT_ARRAY_INITIAL_ANON_SIZE;
else
initial_intervals = NFT_ARRAY_INITIAL_SIZE;
if (priv->array_next) {
max_intervals = priv->array_next->max_intervals;
new_max_intervals = priv->array_next->max_intervals;
} else {
if (priv->array) {
max_intervals = priv->array->max_intervals;
new_max_intervals = priv->array->max_intervals;
} else {
max_intervals = 0;
new_max_intervals = initial_intervals;
}
}
if (nft_set_is_anonymous(set))
goto maybe_grow;
if (flush) {
/* Set flush just started, nelems still report elements.*/
nelems = 0;
new_max_intervals = NFT_ARRAY_INITIAL_SIZE;
goto realloc_array;
}
if (check_add_overflow(new_max_intervals, new_max_intervals,
&shrinked_max_intervals))
return -EOVERFLOW;
shrinked_max_intervals = DIV_ROUND_UP(shrinked_max_intervals, 3);
if (shrinked_max_intervals > NFT_ARRAY_INITIAL_SIZE &&
nelems < shrinked_max_intervals) {
new_max_intervals = shrinked_max_intervals;
goto realloc_array;
}
maybe_grow:
if (nelems > new_max_intervals) {
if (nft_set_is_anonymous(set) &&
new_max_intervals < NFT_ARRAY_INITIAL_ANON_THRESH) {
new_max_intervals <<= 1;
} else {
delta = new_max_intervals >> 1;
if (check_add_overflow(new_max_intervals, delta,
&new_max_intervals))
return -EOVERFLOW;
}
}
realloc_array:
if (WARN_ON_ONCE(nelems > new_max_intervals))
return -ENOMEM;
if (priv->array_next) {
if (max_intervals == new_max_intervals)
return 0;
if (nft_array_intervals_alloc(priv->array_next, new_max_intervals) < 0)
return -ENOMEM;
} else {
array = nft_array_alloc(new_max_intervals);
if (!array)
return -ENOMEM;
priv->array_next = array;
}
if (nelems < priv->array_next->max_intervals)
return 0;
new_max_intervals = priv->array_next->max_intervals + NFT_ARRAY_EXTRA_SIZE;
if (nft_array_intervals_alloc(priv->array_next, new_max_intervals) < 0)
return -ENOMEM;
return 0;
}
@@ -630,7 +688,7 @@ static int nft_rbtree_insert(const struct net *net, const struct nft_set *set,
nft_rbtree_maybe_reset_start_cookie(priv, tstamp);
if (nft_array_may_resize(set) < 0)
if (nft_array_may_resize(set, false) < 0)
return -ENOMEM;
do {
@@ -741,7 +799,7 @@ nft_rbtree_deactivate(const struct net *net, const struct nft_set *set,
nft_rbtree_interval_null(set, this))
priv->start_rbe_cookie = 0;
if (nft_array_may_resize(set) < 0)
if (nft_array_may_resize(set, false) < 0)
return NULL;
while (parent != NULL) {
@@ -811,7 +869,7 @@ static void nft_rbtree_walk(const struct nft_ctx *ctx,
switch (iter->type) {
case NFT_ITER_UPDATE_CLONE:
if (nft_array_may_resize(set) < 0) {
if (nft_array_may_resize(set, true) < 0) {
iter->err = -ENOMEM;
break;
}

View File

@@ -246,6 +246,7 @@ static int tls_decrypt_async_wait(struct tls_sw_context_rx *ctx)
crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
atomic_inc(&ctx->decrypt_pending);
__skb_queue_purge(&ctx->async_hold);
return ctx->async_wait.err;
}
@@ -2225,7 +2226,6 @@ recv_end:
/* Wait for all previously submitted records to be decrypted */
ret = tls_decrypt_async_wait(ctx);
__skb_queue_purge(&ctx->async_hold);
if (ret) {
if (err >= 0 || err == -EINPROGRESS)

View File

@@ -29,7 +29,8 @@ TYPES="net_port port_net net6_port port_proto net6_port_mac net6_port_mac_proto
net6_port_net6_port net_port_mac_proto_net"
# Reported bugs, also described by TYPE_ variables below
BUGS="flush_remove_add reload net_port_proto_match avx2_mismatch doublecreate insert_overlap"
BUGS="flush_remove_add reload net_port_proto_match avx2_mismatch doublecreate
insert_overlap load_flush_load4 load_flush_load8"
# List of possible paths to pktgen script from kernel tree for performance tests
PKTGEN_SCRIPT_PATHS="
@@ -432,6 +433,30 @@ race_repeat 0
perf_duration 0
"
TYPE_load_flush_load4="
display reload with flush, 4bit groups
type_spec ipv4_addr . ipv4_addr
chain_spec ip saddr . ip daddr
dst addr4
proto icmp
race_repeat 0
perf_duration 0
"
TYPE_load_flush_load8="
display reload with flush, 8bit groups
type_spec ipv4_addr . ipv4_addr
chain_spec ip saddr . ip daddr
dst addr4
proto icmp
race_repeat 0
perf_duration 0
"
# Set template for all tests, types and rules are filled in depending on test
set_template='
flush ruleset
@@ -1997,6 +2022,49 @@ test_bug_insert_overlap()
return 0
}
test_bug_load_flush_load4()
{
local i
setup veth send_"${proto}" set || return ${ksft_skip}
for i in $(seq 0 255); do
local addelem="add element inet filter test"
local j
for j in $(seq 0 20); do
echo "$addelem { 10.$j.0.$i . 10.$j.1.$i }"
echo "$addelem { 10.$j.0.$i . 10.$j.2.$i }"
done
done > "$tmp"
nft -f "$tmp" || return 1
( echo "flush set inet filter test";cat "$tmp") | nft -f -
[ $? -eq 0 ] || return 1
return 0
}
test_bug_load_flush_load8()
{
local i
setup veth send_"${proto}" set || return ${ksft_skip}
for i in $(seq 1 100); do
echo "add element inet filter test { 10.0.0.$i . 10.0.1.$i }"
echo "add element inet filter test { 10.0.0.$i . 10.0.2.$i }"
done > "$tmp"
nft -f "$tmp" || return 1
( echo "flush set inet filter test";cat "$tmp") | nft -f -
[ $? -eq 0 ] || return 1
return 0
}
test_reported_issues() {
eval test_bug_"${subtest}"
}