Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net. Note that most
of the bugs fixed here are >5 years old. The large PR is not due to an
increase in regressions.
1) Flowtable hardware offload support in IPv6 can lead to out-of-bounds
when populating the rule action array when combined with double-tagged
vlan. Bump the maximum number of actions from 16 to 24 and check that
such limit is never reached, otherwise bail out. This bugs stems from
the original flowtable hardware offload support.
2) nfnetlink_log does not include the netlink header size of the trailing
NLMSG_DONE message when calculating the skb size. From Florian Westphal.
3) Reject names in xt_cgroup and xt_rateest extensions which are not
nul-terminated. Also from Florian.
4) Use nla_strcmp in ipset lookup by set name, since IPSET_ATTR_NAME and
IPSET_ATTR_NAMEREF are of NLA_STRING type. From Florian Westphal.
5) When unregistering conntrack helpers, pass the helper that is going
away so the expectation cleanup is done accordingly, otherwise UaF is
possible when accessing expectation that refer to the helper that is
gone. From Qi Tang.
6) Zero expectation NAT fields to address leaking kernel memory through
the expectation netlink dump when unset. Also from Qi Tang.
7) Use the master conntrack helper when creating expectations via
ctnetlink, ignore the suggested helper through CTA_EXPECT_HELP_NAME.
This allows to address a possible read of kernel memory off the
expectation object boundary.
8) Fix incorrect release of the hash bucket logic in ipset when the
bucket is empty, leading to shrinking the hash bucket to size 0
which deals to out-of-bound write in next element additions.
From Yifan Wu.
9) Allow the use of x_tables extensions that explicitly declare
NFPROTO_ARP support only. This is to avoid an incorrect hook number
validation due to non-overlapping arp and inet hook number
definitions.
10) Reject immediate NF_QUEUE verdict in nf_tables. The userspace
nft tool always uses the nft_queue expression for queueing.
This ensures this verdict cannot be used for the arp family,
which does supported this.
* tag 'nf-26-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: reject immediate NF_QUEUE verdict
netfilter: x_tables: restrict xt_check_match/xt_check_target extensions for NFPROTO_ARP
netfilter: ipset: drop logically empty buckets in mtype_del
netfilter: ctnetlink: ignore explicit helper on new expectations
netfilter: ctnetlink: zero expect NAT fields when CTA_EXPECT_NAT absent
netfilter: nf_conntrack_helper: pass helper to expect cleanup
netfilter: ipset: use nla_strcmp for IPSET_ATTR_NAME attr
netfilter: x_tables: ensure names are nul-terminated
netfilter: nfnetlink_log: account for netlink header size
netfilter: flowtable: strictly check for maximum number of actions
====================
Link: https://patch.msgid.link/20260401103646.1015423-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_ib_get_mr() extracts the rds_ib_connection from conn->c_transport_data
and passes it to rds_ib_reg_frmr() for FRWR memory registration. On a
fresh outgoing connection, ic is allocated in rds_ib_conn_alloc() with
i_cm_id = NULL because the connection worker has not yet called
rds_ib_conn_path_connect() to create the rdma_cm_id. When sendmsg() with
RDS_CMSG_RDMA_MAP is called on such a connection, the sendmsg path parses
the control message before any connection establishment, allowing
rds_ib_post_reg_frmr() to dereference ic->i_cm_id->qp and crash the
kernel.
The existing guard in rds_ib_reg_frmr() only checks for !ic (added in
commit 9e630bcb77), which does not catch this case since ic is allocated
early and is always non-NULL once the connection object exists.
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:rds_ib_post_reg_frmr+0x50e/0x920
Call Trace:
rds_ib_post_reg_frmr (net/rds/ib_frmr.c:167)
rds_ib_map_frmr (net/rds/ib_frmr.c:252)
rds_ib_reg_frmr (net/rds/ib_frmr.c:430)
rds_ib_get_mr (net/rds/ib_rdma.c:615)
__rds_rdma_map (net/rds/rdma.c:295)
rds_cmsg_rdma_map (net/rds/rdma.c:860)
rds_sendmsg (net/rds/send.c:1363)
____sys_sendmsg
do_syscall_64
Add a check in rds_ib_get_mr() that verifies ic, i_cm_id, and qp are all
non-NULL before proceeding with FRMR registration, mirroring the guard
already present in rds_ib_post_inv(). Return -ENODEV when the connection
is not ready, which the existing error handling in rds_cmsg_send() converts
to -EAGAIN for userspace retry and triggers rds_conn_connect_if_down() to
start the connection worker.
Fixes: 1659185fb4 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260330163237.2752440-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib6_metric_set() may be called concurrently from softirq context without
holding the FIB table lock. A typical path is:
ndisc_router_discovery()
spin_unlock_bh(&table->tb6_lock) <- lock released
fib6_metric_set(rt, RTAX_HOPLIMIT, ...) <- lockless call
When two CPUs process Router Advertisement packets for the same router
simultaneously, they can both arrive at fib6_metric_set() with the same
fib6_info pointer whose fib6_metrics still points to dst_default_metrics.
if (f6i->fib6_metrics == &dst_default_metrics) { /* both CPUs: true */
struct dst_metrics *p = kzalloc_obj(*p, GFP_ATOMIC);
refcount_set(&p->refcnt, 1);
f6i->fib6_metrics = p; /* CPU1 overwrites CPU0's p -> p0 leaked */
}
The dst_metrics allocated by the losing CPU has refcnt=1 but no pointer
to it anywhere in memory, producing a kmemleak report:
unreferenced object 0xff1100025aca1400 (size 96):
comm "softirq", pid 0, jiffies 4299271239
backtrace:
kmalloc_trace+0x28a/0x380
fib6_metric_set+0xcd/0x180
ndisc_router_discovery+0x12dc/0x24b0
icmpv6_rcv+0xc16/0x1360
Fix this by:
- Set val for p->metrics before published via cmpxchg() so the metrics
value is ready before the pointer becomes visible to other CPUs.
- Replace the plain pointer store with cmpxchg() and free the allocation
safely when competition failed.
- Add READ_ONCE()/WRITE_ONCE() for metrics[] setting in the non-default
metrics path to prevent compiler-based data races.
Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Reported-by: Fei Liu <feliu@redhat.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260331-b4-fib6_metric_set-kmemleak-v3-1-88d27f4d8825@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hci_le_big_create_sync() uses DEFINE_FLEX to allocate a
struct hci_cp_le_big_create_sync on the stack with room for 0x11 (17)
BIS entries. However, conn->num_bis can hold up to HCI_MAX_ISO_BIS (31)
entries — validated against ISO_MAX_NUM_BIS (0x1f) in the caller
hci_conn_big_create_sync(). When conn->num_bis is between 18 and 31,
the memcpy that copies conn->bis into cp->bis writes up to 14 bytes
past the stack buffer, corrupting adjacent stack memory.
This is trivially reproducible: binding an ISO socket with
bc_num_bis = ISO_MAX_NUM_BIS (31) and calling listen() will
eventually trigger hci_le_big_create_sync() from the HCI command
sync worker, causing a KASAN-detectable stack-out-of-bounds write:
BUG: KASAN: stack-out-of-bounds in hci_le_big_create_sync+0x256/0x3b0
Write of size 31 at addr ffffc90000487b48 by task kworker/u9:0/71
Fix this by changing the DEFINE_FLEX count from the incorrect 0x11 to
HCI_MAX_ISO_BIS, which matches the maximum number of BIS entries that
conn->bis can actually carry.
Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Cc: stable@vger.kernel.org
Signed-off-by: hkbinbin <hkbinbinbin@gmail.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The legacy responder path in smp_random() currently labels the stored
STK as authenticated whenever pending_sec_level is BT_SECURITY_HIGH.
That reflects what the local service requested, not what the pairing
flow actually achieved.
For Just Works/Confirm legacy pairing, SMP_FLAG_MITM_AUTH stays clear
and the resulting STK should remain unauthenticated even if the local
side requested HIGH security. Use the established MITM state when
storing the responder STK so the key metadata matches the pairing result.
This also keeps the legacy path aligned with the Secure Connections code,
which already treats JUST_WORKS/JUST_CFM as unauthenticated.
Fixes: fff3490f47 ("Bluetooth: Fix setting correct authentication information for SMP STK")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
smp_cmd_pairing_req() currently builds the pairing response from the
initiator auth_req before enforcing the local BT_SECURITY_HIGH
requirement. If the initiator omits SMP_AUTH_MITM, the response can
also omit it even though the local side still requires MITM.
tk_request() then sees an auth value without SMP_AUTH_MITM and may
select JUST_CFM, making method selection inconsistent with the pairing
policy the responder already enforces.
When the local side requires HIGH security, first verify that MITM can
be achieved from the IO capabilities and then force SMP_AUTH_MITM in the
response in both rsp.auth_req and auth. This keeps the responder auth bits
and later method selection aligned.
Fixes: 2b64d153a0 ("Bluetooth: Add MITM mechanism to LE-SMP")
Cc: stable@vger.kernel.org
Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
mesh_send() currently bounds MGMT_OP_MESH_SEND by total command
length, but it never verifies that the bytes supplied for the
flexible adv_data[] array actually match the embedded adv_data_len
field. MGMT_MESH_SEND_SIZE only covers the fixed header, so a
truncated command can still pass the existing 20..50 byte range
check and later drive the async mesh send path past the end of the
queued command buffer.
Keep rejecting zero-length and oversized advertising payloads, but
validate adv_data_len explicitly and require the command length to
exactly match the flexible array size before queueing the request.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
hci_le_remote_conn_param_req_evt, otherwise it's possible it is freed
concurrently.
Extend the hci_dev_lock critical section to cover all conn usage.
Fixes: 95118dd4ed ("Bluetooth: hci_event: Use of a function table to handle LE subevents")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
set_cig_params_sync, otherwise it's possible it is freed concurrently.
Take hdev lock to prevent hci_conn from being deleted or modified
concurrently. Just RCU lock is not suitable here, as we also want to
avoid "tearing" in the configuration.
Fixes: a091289218 ("Bluetooth: hci_conn: Fix hci_le_set_cig_params")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Load Long Term Keys stores the user-provided enc_size and later uses
it to size fixed-size stack operations when replying to LE LTK
requests. An enc_size larger than the 16-byte key buffer can therefore
overflow the reply stack buffer.
Reject oversized enc_size values while validating the management LTK
record so invalid keys never reach the stored key state.
Fixes: 346af67b8d ("Bluetooth: Add MGMT handlers for dealing with SMP LTK's")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during
initialization") fixed a race for hci commands sent during initialization.
However, there is still a race that happens if an hci event from one of
these commands is received before HCI_UART_REGISTERED has been set at
the end of hci_uart_register_dev(). The event will be ignored which
causes the command to fail with a timeout in the log:
"Bluetooth: hci0: command 0x1003 tx timeout"
This is because the hci event receive path (hci_uart_tty_receive ->
h4_recv) requires HCI_UART_REGISTERED to be set in h4_recv(), while the
hci command transmit path (hci_uart_send_frame -> h4_enqueue) only
requires HCI_UART_PROTO_INIT to be set in hci_uart_send_frame().
The check for HCI_UART_REGISTERED was originally added in commit
c257820291 ("Bluetooth: Fix H4 crash from incoming UART packets")
to fix a crash caused by hu->hdev being null dereferenced. That can no
longer happen: once HCI_UART_PROTO_INIT is set in hci_uart_register_dev()
all pointers (hu, hu->priv and hu->hdev) are valid, and
hci_uart_tty_receive() already calls h4_recv() on HCI_UART_PROTO_INIT
or HCI_UART_PROTO_READY.
Remove the check for HCI_UART_REGISTERED in h4_recv() to fix the race
condition.
Fixes: 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during initialization")
Signed-off-by: Jonathan Rissanen <jonathan.rissanen@axis.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue_once() returns with error, the destroy callback
will not be called.
Fix leaking references / memory on these failures.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_queue_once() needs to indicate whether a queue item was
added, so caller can know if callbacks are called, so it can avoid
leaking resources.
Change the function to return -EEXIST if queue item already exists.
Modify all callsites to handle that.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_store_wake_reason() is called from hci_event_packet() immediately
after stripping the HCI event header but before hci_event_func()
enforces the per-event minimum payload length from hci_ev_table.
This means a short HCI event frame can reach bacpy() before any bounds
check runs.
Rather than duplicating skb parsing and per-event length checks inside
hci_store_wake_reason(), move wake-address storage into the individual
event handlers after their existing event-length validation has
succeeded. Convert hci_store_wake_reason() into a small helper that only
stores an already-validated bdaddr while the caller holds hci_dev_lock().
Use the same helper after hci_event_func() with a NULL address to
preserve the existing unexpected-wake fallback semantics when no
validated event handler records a wake address.
Annotate the helper with __must_hold(&hdev->lock) and add
lockdep_assert_held(&hdev->lock) so future call paths keep the lock
contract explicit.
Call the helper from hci_conn_request_evt(), hci_conn_complete_evt(),
hci_sync_conn_complete_evt(), le_conn_complete_evt(),
hci_le_adv_report_evt(), hci_le_ext_adv_report_evt(),
hci_le_direct_adv_report_evt(), hci_le_pa_sync_established_evt(), and
hci_le_past_received_evt().
Fixes: 2f20216c1d ("Bluetooth: Emit controller suspend and resume events")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sco_sock_connect() checks sk_state and sk_type without holding
the socket lock. Two concurrent connect() syscalls on the same
socket can both pass the check and enter sco_connect(), leading
to use-after-free.
The buggy scenario involves three participants and was confirmed
with additional logging instrumentation:
Thread A (connect): HCI disconnect: Thread B (connect):
sco_sock_connect(sk) sco_sock_connect(sk)
sk_state==BT_OPEN sk_state==BT_OPEN
(pass, no lock) (pass, no lock)
sco_connect(sk): sco_connect(sk):
hci_dev_lock hci_dev_lock
hci_connect_sco <- blocked
-> hcon1
sco_conn_add->conn1
lock_sock(sk)
sco_chan_add:
conn1->sk = sk
sk->conn = conn1
sk_state=BT_CONNECT
release_sock
hci_dev_unlock
hci_dev_lock
sco_conn_del:
lock_sock(sk)
sco_chan_del:
sk->conn=NULL
conn1->sk=NULL
sk_state=
BT_CLOSED
SOCK_ZAPPED
release_sock
hci_dev_unlock
(unblocked)
hci_connect_sco
-> hcon2
sco_conn_add
-> conn2
lock_sock(sk)
sco_chan_add:
sk->conn=conn2
sk_state=
BT_CONNECT
// zombie sk!
release_sock
hci_dev_unlock
Thread B revives a BT_CLOSED + SOCK_ZAPPED socket back to
BT_CONNECT. Subsequent cleanup triggers double sock_put() and
use-after-free. Meanwhile conn1 is leaked as it was orphaned
when sco_conn_del() cleared the association.
Fix this by:
- Moving lock_sock() before the sk_state/sk_type checks in
sco_sock_connect() to serialize concurrent connect attempts
- Fixing the sk_type != SOCK_SEQPACKET check to actually
return the error instead of just assigning it
- Adding a state re-check in sco_connect() after lock_sock()
to catch state changes during the window between the locks
- Adding sco_pi(sk)->conn check in sco_chan_add() to prevent
double-attach of a socket to multiple connections
- Adding hci_conn_drop() on sco_chan_add failure to prevent
HCI connection leaks
Fixes: 9a8ec9e8eb ("Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_run() may run the work immediately if called from existing
sync work (otherwise it queues a new sync work). In this case it fails
to call the destroy() function.
On immediate run, make it behave same way as if item was queued
successfully: call destroy, and return 0.
The only callsite is hci_abort_conn() via hci_cmd_sync_run_once(), and
this changes its return value. However, its return value is not used
except as the return value for hci_disconnect(), and nothing uses the
return value of hci_disconnect(). Hence there should be no behavior
change anywhere.
Fixes: c898f6d7b0 ("Bluetooth: hci_sync: Introduce hci_cmd_sync_run/hci_cmd_sync_run_once")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
nft_queue is always used from userspace nftables to deliver the NF_QUEUE
verdict. Immediately emitting an NF_QUEUE verdict is never used by the
userspace nft tools, so reject immediate NF_QUEUE verdicts.
The arp family does not provide queue support, but such an immediate
verdict is still reachable. Globally reject NF_QUEUE immediate verdicts
to address this issue.
Fixes: f342de4e2f ("netfilter: nf_tables: reject QUEUE/DROP verdict parameters")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Weiming Shi says:
xt_match and xt_target structs registered with NFPROTO_UNSPEC can be
loaded by any protocol family through nft_compat. When such a
match/target sets .hooks to restrict which hooks it may run on, the
bitmask uses NF_INET_* constants. This is only correct for families
whose hook layout matches NF_INET_*: IPv4, IPv6, INET, and bridge
all share the same five hooks (PRE_ROUTING ... POST_ROUTING).
ARP only has three hooks (IN=0, OUT=1, FORWARD=2) with different
semantics. Because NF_ARP_OUT == 1 == NF_INET_LOCAL_IN, the .hooks
validation silently passes for the wrong reasons, allowing matches to
run on ARP chains where the hook assumptions (e.g. state->in being
set on input hooks) do not hold. This leads to NULL pointer
dereferences; xt_devgroup is one concrete example:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000044: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000220-0x0000000000000227]
RIP: 0010:devgroup_mt+0xff/0x350
Call Trace:
<TASK>
nft_match_eval (net/netfilter/nft_compat.c:407)
nft_do_chain (net/netfilter/nf_tables_core.c:285)
nft_do_chain_arp (net/netfilter/nft_chain_filter.c:61)
nf_hook_slow (net/netfilter/core.c:623)
arp_xmit (net/ipv4/arp.c:666)
</TASK>
Kernel panic - not syncing: Fatal exception in interrupt
Fix it by restricting arptables to NFPROTO_ARP extensions only.
Note that arptables-legacy only supports:
- arpt_CLASSIFY
- arpt_mangle
- arpt_MARK
that provide explicit NFPROTO_ARP match/target declarations.
Fixes: 9291747f11 ("netfilter: xtables: add device group match")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mtype_del() counts empty slots below n->pos in k, but it only drops the
bucket when both n->pos and k are zero. This misses buckets whose live
entries have all been removed while n->pos still points past deleted slots.
Treat a bucket as empty when all positions below n->pos are unused and
release it directly instead of shrinking it further.
Fixes: 8af1c6fbd9 ("netfilter: ipset: Fix forceadd evaluation path")
Cc: stable@vger.kernel.org
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <dstsmallbird@foxmail.com>
Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the existing master conntrack helper, anything else is not really
supported and it just makes validation more complicated, so just ignore
what helper userspace suggests for this expectation.
This was uncovered when validating CTA_EXPECT_CLASS via different helper
provided by userspace than the existing master conntrack helper:
BUG: KASAN: slab-out-of-bounds in nf_ct_expect_related_report+0x2479/0x27c0
Read of size 4 at addr ffff8880043fe408 by task poc/102
Call Trace:
nf_ct_expect_related_report+0x2479/0x27c0
ctnetlink_create_expect+0x22b/0x3b0
ctnetlink_new_expect+0x4bd/0x5c0
nfnetlink_rcv_msg+0x67a/0x950
netlink_rcv_skb+0x120/0x350
Allowing to read kernel memory bytes off the expectation boundary.
CTA_EXPECT_HELP_NAME is still used to offer the helper name to userspace
via netlink dump.
Fixes: bd07793705 ("netfilter: nfnetlink_queue: allow to attach expectations to conntracks")
Reported-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ctnetlink_alloc_expect() allocates expectations from a non-zeroing
slab cache via nf_ct_expect_alloc(). When CTA_EXPECT_NAT is not
present in the netlink message, saved_addr and saved_proto are
never initialized. Stale data from a previous slab occupant can
then be dumped to userspace by ctnetlink_exp_dump_expect(), which
checks these fields to decide whether to emit CTA_EXPECT_NAT.
The safe sibling nf_ct_expect_init(), used by the packet path,
explicitly zeroes these fields.
Zero saved_addr, saved_proto and dir in the else branch, guarded
by IS_ENABLED(CONFIG_NF_NAT) since these fields only exist when
NAT is enabled.
Confirmed by priming the expect slab with NAT-bearing expectations,
freeing them, creating a new expectation without CTA_EXPECT_NAT,
and observing that the ctnetlink dump emits a spurious
CTA_EXPECT_NAT containing stale data from the prior allocation.
Fixes: 076a0ca026 ("netfilter: ctnetlink: add NAT support for expectations")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_helper_unregister() calls nf_ct_expect_iterate_destroy()
to remove expectations belonging to the helper being unregistered.
However, it passes NULL instead of the helper pointer as the data
argument, so expect_iter_me() never matches any expectation and all
of them survive the cleanup.
After unregister returns, nfnl_cthelper_del() frees the helper
object immediately. Subsequent expectation dumps or packet-driven
init_conntrack() calls then dereference the freed exp->helper,
causing a use-after-free.
Pass the actual helper pointer so expectations referencing it are
properly destroyed before the helper object is freed.
BUG: KASAN: slab-use-after-free in string+0x38f/0x430
Read of size 1 at addr ffff888003b14d20 by task poc/103
Call Trace:
string+0x38f/0x430
vsnprintf+0x3cc/0x1170
seq_printf+0x17a/0x240
exp_seq_show+0x2e5/0x560
seq_read_iter+0x419/0x1280
proc_reg_read+0x1ac/0x270
vfs_read+0x179/0x930
ksys_read+0xef/0x1c0
Freed by task 103:
The buggy address is located 32 bytes inside of
freed 192-byte region [ffff888003b14d00, ffff888003b14dc0)
Fixes: ac7b848390 ("netfilter: expect: add and use nf_ct_expect_iterate helpers")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPSET_ATTR_NAME and IPSET_ATTR_NAMEREF are of NLA_STRING type, they
cannot be treated like a c-string.
They either have to be switched to NLA_NUL_STRING, or the compare
operations need to use the nla functions.
Fixes: f830837f0e ("netfilter: ipset: list:set set type support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject names that lack a \0 character before feeding them
to functions that expect c-strings.
Fixes tag is the most recent commit that needs this change.
Fixes: c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a followup to an old bug fix: NLMSG_DONE needs to account
for the netlink header size, not just the attribute size.
This can result in a WARN splat + drop of the netlink message,
but other than this there are no ill effects.
Fixes: 9dfa1dfe4d ("netfilter: nf_log: account for size of NLMSG_DONE attribute")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The maximum number of flowtable hardware offload actions in IPv6 is:
* ethernet mangling (4 payload actions, 2 for each ethernet address)
* SNAT (4 payload actions)
* DNAT (4 payload actions)
* Double VLAN (4 vlan actions, 2 for popping vlan, and 2 for pushing)
for QinQ.
* Redirect (1 action)
Which makes 17, while the maximum is 16. But act_ct supports for tunnels
actions too. Note that payload action operates at 32-bit word level, so
mangling an IPv6 address takes 4 payload actions.
Update flow_action_entry_next() calls to check for the maximum number of
supported actions.
While at it, rise the maximum number of actions per flow from 16 to 24
so this works fine with IPv6 setups.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
stmmac_vlan_restore() unconditionally calls stmmac_vlan_update() when
NETIF_F_VLAN_FEATURES is set. On platforms where priv->hw->vlan (or
->update_vlan_hash) is not provided, stmmac_update_vlan_hash() returns
-EINVAL via stmmac_do_void_callback(), resulting in a spurious
"Failed to restore VLANs" error even when no VLAN filtering is in use.
Remove not needed comment.
Remove not used return value from stmmac_vlan_restore().
Tested on Orange Pi Zero 3.
Fixes: bd7ad51253 ("net: stmmac: Fix VLAN HW state restore")
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Link: https://patch.msgid.link/20260328-vlan-restore-error-v4-1-f88624c530dc@mmpsystems.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ftgmac100_alloc_rings() allocates rx_skbs, tx_skbs, rxdes, txdes, and
rx_scratch in stages. On intermediate failures it returned -ENOMEM
directly, leaking resources allocated earlier in the function.
Rework the failure path to use staged local unwind labels and free
allocated resources in reverse order before returning -ENOMEM. This
matches common netdev allocation cleanup style.
Fixes: d72e01a043 ("ftgmac100: Use a scratch buffer for failed RX allocations")
Cc: stable@vger.kernel.org
Signed-off-by: Yufan Chen <yufan.chen@linux.dev>
Link: https://patch.msgid.link/20260328163257.60836-1-yufan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`ip6fl_seq_show()` walks the global flowlabel hash under the seq-file
RCU read-side lock and prints `fl->opt->opt_nflen` when an option block
is present.
Exclusive flowlabels currently free `fl->opt` as soon as `fl->users`
drops to zero in `fl_release()`. However, the surrounding
`struct ip6_flowlabel` remains visible in the global hash table until
later garbage collection removes it and `fl_free_rcu()` finally tears it
down.
A concurrent `/proc/net/ip6_flowlabel` reader can therefore race that
early `kfree()` and dereference freed option state, triggering a crash
in `ip6fl_seq_show()`.
Fix this by keeping `fl->opt` alive until `fl_free_rcu()`. That matches
the lifetime already required for the enclosing flowlabel while readers
can still reach it under RCU.
Fixes: d3aedd5ebd ("ipv6 flowlabel: Convert hash list to RCU.")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/07351f0ec47bcee289576f39f9354f4a64add6e4.1774855883.git.zcliangcn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
br_mrp_start_test() and br_mrp_start_in_test() accept the user-supplied
interval value from netlink without validation. When interval is 0,
usecs_to_jiffies(0) yields 0, causing the delayed work
(br_mrp_test_work_expired / br_mrp_in_test_work_expired) to reschedule
itself with zero delay. This creates a tight loop on system_percpu_wq
that allocates and transmits MRP test frames at maximum rate, exhausting
all system memory and causing a kernel panic via OOM deadlock.
The same zero-interval issue applies to br_mrp_start_in_test_parse()
for interconnect test frames.
Use NLA_POLICY_MIN(NLA_U32, 1) in the nla_policy tables for both
IFLA_BRIDGE_MRP_START_TEST_INTERVAL and
IFLA_BRIDGE_MRP_START_IN_TEST_INTERVAL, so zero is rejected at the
netlink attribute parsing layer before the value ever reaches the
workqueue scheduling code. This is consistent with how other bridge
subsystems (br_fdb, br_mst) enforce range constraints on netlink
attributes.
Fixes: 20f6a05ef6 ("bridge: mrp: Rework the MRP netlink interface")
Fixes: 7ab1748e4c ("bridge: mrp: Extend MRP netlink interface for configuring MRP interconnect")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260328063000.1845376-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Suraj Gupta says:
====================
Correct BD length masks and BQL accounting for multi-BD TX packets
This patch series fixes two issues in the Xilinx AXI Ethernet driver:
1. Corrects the BD length masks to match the AXIDMA IP spec.
2. Fixes BQL accounting for multi-BD TX packets.
====================
Link: https://patch.msgid.link/20260327073238.134948-1-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a TX packet spans multiple buffer descriptors (scatter-gather),
axienet_free_tx_chain sums the per-BD actual length from descriptor
status into a caller-provided accumulator. That sum is reset on each
NAPI poll. If the BDs for a single packet complete across different
polls, the earlier bytes are lost and never credited to BQL. This
causes BQL to think bytes are permanently in-flight, eventually
stalling the TX queue.
The SKB pointer is stored only on the last BD of a packet. When that
BD completes, use skb->len for the byte count instead of summing
per-BD status lengths. This matches netdev_sent_queue(), which debits
skb->len, and naturally survives across polls because no partial
packet contributes to the accumulator.
Fixes: c900e49d58 ("net: xilinx: axienet: Implement BQL")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-3-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The XAXIDMA_BD_CTRL_LENGTH_MASK and XAXIDMA_BD_STS_ACTUAL_LEN_MASK
macros were defined as 0x007FFFFF (23 bits), but the AXI DMA IP
product guide (PG021) specifies the buffer length field as bits 25:0
(26 bits). Update both masks to match the IP documentation.
In practice this had no functional impact, since Ethernet frames are
far smaller than 2^23 bytes and the extra bits were always zero, but
the masks should still reflect the hardware specification.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-2-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pn532_receive_buf() appends every incoming byte to dev->recv_skb and
only resets the buffer after pn532_uart_rx_is_frame() recognizes a
complete frame. A continuous stream of bytes without a valid PN532 frame
header therefore keeps growing the skb until skb_put_u8() hits the tail
limit.
Drop the accumulated partial frame once the fixed receive buffer is full
so malformed UART traffic cannot grow the skb past
PN532_UART_SKB_BUFF_LEN.
Fixes: c656aa4c27 ("nfc: pn533: add UART phy driver")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260326142033.82297-1-pengpeng@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bond_xmit_broadcast() reuses the original skb for the last slave
(determined by bond_is_last_slave()) and clones it for others.
Concurrent slave enslave/release can mutate the slave list during
RCU-protected iteration, changing which slave is "last" mid-loop.
This causes the original skb to be double-consumed (double-freed).
Replace the racy bond_is_last_slave() check with a simple index
comparison (i + 1 == slaves_count) against the pre-snapshot slave
count taken via READ_ONCE() before the loop. This preserves the
zero-copy optimization for the last slave while making the "last"
determination stable against concurrent list mutations.
The UAF can trigger the following crash:
==================================================================
BUG: KASAN: slab-use-after-free in skb_clone
Read of size 8 at addr ffff888100ef8d40 by task exploit/147
CPU: 1 UID: 0 PID: 147 Comm: exploit Not tainted 7.0.0-rc3+ #4 PREEMPTLAZY
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
kasan_report (mm/kasan/report.c:597)
skb_clone (include/linux/skbuff.h:1724 include/linux/skbuff.h:1792 include/linux/skbuff.h:3396 net/core/skbuff.c:2108)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5334)
bond_start_xmit (drivers/net/bonding/bond_main.c:5567 drivers/net/bonding/bond_main.c:5593)
dev_hard_start_xmit (include/linux/netdevice.h:5325 include/linux/netdevice.h:5334 net/core/dev.c:3871 net/core/dev.c:3887)
__dev_queue_xmit (include/linux/netdevice.h:3601 net/core/dev.c:4838)
ip6_finish_output2 (include/net/neighbour.h:540 include/net/neighbour.h:554 net/ipv6/ip6_output.c:136)
ip6_finish_output (net/ipv6/ip6_output.c:208 net/ipv6/ip6_output.c:219)
ip6_output (net/ipv6/ip6_output.c:250)
ip6_send_skb (net/ipv6/ip6_output.c:1985)
udp_v6_send_skb (net/ipv6/udp.c:1442)
udpv6_sendmsg (net/ipv6/udp.c:1733)
__sys_sendto (net/socket.c:730 net/socket.c:742 net/socket.c:2206)
__x64_sys_sendto (net/socket.c:2209)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Allocated by task 147:
Freed by task 147:
The buggy address belongs to the object at ffff888100ef8c80
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 192 bytes inside of
freed 224-byte region [ffff888100ef8c80, ffff888100ef8d60)
Memory state around the buggy address:
ffff888100ef8c00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888100ef8c80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888100ef8d00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888100ef8d80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888100ef8e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 4e5bd03ae3 ("net: bonding: fix bond_xmit_broadcast return value error bug")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260326075553.3960562-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bnxt_hwrm_func_backing_store_qcaps_v2() stores resp->type from the
firmware response in ctxm->type and later uses that value to index
fixed backing-store metadata arrays such as ctx_arr[] and
bnxt_bstore_to_trace[].
ctxm->type is fixed by the current backing-store query type and matches
the array index of ctx->ctx_arr. Set ctxm->type from the current loop
variable instead of depending on resp->type.
Also update the loop to advance type from next_valid_type in the for
statement, which keeps the control flow simpler for non-valid and
unchanged entries.
Fixes: 6a4d0774f0 ("bnxt_en: Add support for new backing store query firmware API")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260328234357.43669-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netfilter flowtable can theoretically try to offload flower rules as soon
as a net_device is registered while all the other ones are not
registered or initialized, triggering a possible NULL pointer dereferencing
of qdma pointer in airoha_ppe_set_cpu_port routine. Moreover, if
register_netdev() fails for a particular net_device, there is a small
race if Netfilter tries to offload flowtable rules before all the
net_devices are properly unregistered in airoha_probe() error patch,
triggering a NULL pointer dereferencing in airoha_ppe_set_cpu_port
routine. In order to avoid any possible race, delay offloading until
all net_devices are registered in the networking subsystem.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260329-airoha-regiser-race-fix-v2-1-f4ebb139277b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building netlink messages, tc_chain_fill_node() never initializes
the tcm_info field of struct tcmsg. Since the allocation is not zeroed,
kernel heap memory is leaked to userspace through this 4-byte field.
The fix simply zeroes tcm_info alongside the other fields that are
already initialized.
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260328211436.1010152-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang says:
====================
net: enetc: add more checks to enetc_set_rxfh()
ENETC only supports Toeplitz algorithm, and VFs do not support setting
the RSS key, but enetc_set_rxfh() does not check these constraints and
silently accepts unsupported configurations. This may mislead users or
tools into believing that the requested RSS settings have been
successfully applied. So add checks to reject unsupported hash functions
and RSS key updates on VFs, and return "-EOPNOTSUPP" to user space.
====================
Link: https://patch.msgid.link/20260326075233.3628047-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VFs do not have privilege to configure the RSS key because the registers
are owned by the PF. Currently, if VF attempts to configure the RSS key,
enetc_set_rxfh() simply skips the configuration and does not generate a
warning, which may mislead users into thinking the feature is supported.
To improve this situation, add a check to reject RSS key configuration
on VFs.
Fixes: d382563f54 ("enetc: Add RFS and RSS support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20260326075233.3628047-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 8110633db4 ("net: sfp-bus: allow SFP quirks to override
Autoneg and pause bits") we moved the setting of Autoneg and pause bits
before the call to SFP quirk when parsing SFP module support.
Since the quirk for Ubiquiti U-Fiber Instant SFP module zeroes the
support bits and sets 1000baseX_Full only, the above mentioned commit
changed the overall computed support from
1000baseX_Full, Autoneg, Pause, Asym_Pause
to just
1000baseX_Full.
This broke the SFP module for mvneta, which requires Autoneg for
1000baseX since commit c762b7fac1 ("net: mvneta: deny disabling
autoneg for 802.3z modes").
Fix this by setting back the Autoneg, Pause and Asym_Pause bits in the
quirk.
Fixes: 8110633db4 ("net: sfp-bus: allow SFP quirks to override Autoneg and pause bits")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260326122038.2489589-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a regression test for the divide-by-zero in rtsc_min() triggered
when m2sm() converts a large m1 value (e.g. 32gbit) to a u64 scaled
slope reaching 2^32. rtsc_min() stores the difference of two such u64
values (sm1 - sm2) in a u32 variable `dsm`, truncating 2^32 to zero
and causing a divide-by-zero oops in the concave-curve intersection
path. The test configures an HFSC class with m1=32gbit d=1ms m2=0bit,
sends a packet to activate the class, waits for it to drain and go
idle, then sends another packet to trigger reactivation through
rtsc_min().
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-2-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
m2sm() converts a u32 slope to a u64 scaled value. For large inputs
(e.g. m1=4000000000), the result can reach 2^32. rtsc_min() stores
the difference of two such u64 values in a u32 variable `dsm` and
uses it as a divisor. When the difference is exactly 2^32 the
truncation yields zero, causing a divide-by-zero oops in the
concave-curve intersection path:
Oops: divide error: 0000
RIP: 0010:rtsc_min (net/sched/sch_hfsc.c:601)
Call Trace:
init_ed (net/sched/sch_hfsc.c:629)
hfsc_enqueue (net/sched/sch_hfsc.c:1569)
[...]
Widen `dsm` to u64 and replace do_div() with div64_u64() so the full
difference is preserved.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>