Jon Hunter says:
====================
net: stmmac: Fix Tegra234 MGBE clock
The name of the PTP ref clock for the Tegra234 MGBE ethernet controller
does not match the generic name in the stmmac platform driver. Despite
this basic ethernet is functional on the Tegra234 platforms that use
this driver and as far as I know, we have not tested PTP support with
this driver. Hence, the risk of breaking any functionality is low.
The previous attempt to fix this in the stmmac platform driver, by
supporting the Tegra234 PTP clock name, was rejected [0]. The preference
from the netdev maintainers is to fix this in the DT binding for
Tegra234.
This series fixes this by correcting the device-tree binding to align
with the generic name for the PTP clock. I understand that this is
breaking the ABI for this device, which we should never do, but this
is a last resort for getting this fixed. I am open to any better ideas
to fix this. Please note that we still maintain backward compatibility
in the driver to allow older device-trees to work, but we don't
advertise this via the binding, because I did not see any value in doing
so.
====================
Link: https://patch.msgid.link/20260401102941.17466-1-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The PTP clock for the Tegra234 MGBE device is incorrectly named
'ptp-ref' and should be 'ptp_ref'. This is causing the following
warning to be observed on Tegra234 platforms that use this device:
ERR KERN tegra-mgbe 6800000.ethernet eth0: Invalid PTP clock rate
WARNING KERN tegra-mgbe 6800000.ethernet eth0: PTP init failed
Although this constitutes an ABI breakage in the binding for this
device, PTP support has clearly never worked and so fix this now
so we can correct the device-tree for this device. Note that the
MGBE driver still supports the legacy 'ptp-ref' clock name and so
older/existing device-trees will still work, but given that this
is not the correct name, there is no point to advertise this in the
binding.
Fixes: 189c2e5c76 ("dt-bindings: net: Add Tegra234 MGBE")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260401102941.17466-3-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 030ce919e1 ("net: stmmac: make sure that ptp_rate is not
0 before configuring timestamping") was added the following error is
observed on Tegra234:
ERR KERN tegra-mgbe 6800000.ethernet eth0: Invalid PTP clock rate
WARNING KERN tegra-mgbe 6800000.ethernet eth0: PTP init failed
It turns out that the Tegra234 device-tree binding defines the PTP ref
clock name as 'ptp-ref' and not 'ptp_ref' and the above commit now
exposes this and that the PTP clock is not configured correctly.
In order to update device-tree to use the correct 'ptp_ref' name, update
the Tegra MGBE driver to use 'ptp_ref' by default and fallback to using
'ptp-ref' if this clock name is present.
Fixes: d8ca113724 ("net: stmmac: tegra: Add MGBE support")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260401102941.17466-2-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
s3fwrn82_uart_read() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already
deliver a complete frame before allocating a fresh receive buffer.
If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().
Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.
Fixes: 3f52c2cb7e ("nfc: s3fwrn5: Support a UART interface")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260402042148.65236-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_stub->ipv6_dev_find() may return ERR_PTR(-EAFNOSUPPORT) when the
IPv6 stack is not active (CONFIG_IPV6=m and not loaded), and passing
this error pointer to dev_hold() will cause a kernel crash with
null-ptr-deref.
Instead, silently discard the request. RFC 8335 does not appear to
define a specific response for the case where an IPv6 interface
identifier is syntactically valid but the implementation cannot perform
the lookup at runtime, and silently dropping the request may safer than
misreporting "No Such Interface".
Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Signed-off-by: Yiqi Sun <sunyiqixm@gmail.com>
Link: https://patch.msgid.link/20260402070419.2291578-1-sunyiqixm@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When querying a nexthop object via RTM_GETNEXTHOP, the kernel currently
allocates a fixed-size skb using NLMSG_GOODSIZE. While sufficient for
single nexthops and small Equal-Cost Multi-Path groups, this fixed
allocation fails for large nexthop groups like 512 nexthops.
This results in the following warning splat:
WARNING: net/ipv4/nexthop.c:3395 at rtm_get_nexthop+0x176/0x1c0, CPU#20: rep/4608
[...]
RIP: 0010:rtm_get_nexthop (net/ipv4/nexthop.c:3395)
[...]
Call Trace:
<TASK>
rtnetlink_rcv_msg (net/core/rtnetlink.c:6989)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
____sys_sendmsg (net/socket.c:721 net/socket.c:736 net/socket.c:2585)
___sys_sendmsg (net/socket.c:2641)
__sys_sendmsg (net/socket.c:2671)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Fix this by allocating the size dynamically using nh_nlmsg_size() and
using nlmsg_new(), this is consistent with nexthop_notify() behavior. In
addition, adjust nh_nlmsg_size_grp() so it calculates the size needed
based on flags passed. While at it, also add the size of NHA_FDB for
nexthop group size calculation as it was missing too.
This cannot be reproduced via iproute2 as the group size is currently
limited and the command fails as follows:
addattr_l ERROR: message exceeded bound of 1048
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Closes: https://lore.kernel.org/netdev/CAL_bE8Li2h4KO+AQFXW4S6Yb_u5X4oSKnkywW+LPFjuErhqELA@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260402072613.25262-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently NHA_HW_STATS_ENABLE is included twice everytime a dump of
nexthop group is performed with NHA_OP_FLAG_DUMP_STATS. As all the stats
querying were moved to nla_put_nh_group_stats(), leave only that
instance of the attribute querying.
Fixes: 5072ae00ae ("net: nexthop: Expose nexthop group HW stats to user space")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260402072613.25262-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qca_tty_receive() consumes each input byte before checking whether a
completed frame needs a fresh receive skb. When the current byte completes
a frame, the driver delivers that frame and then allocates a new skb for
the next one.
If that allocation fails, the current code returns i even though data[i]
has already been consumed and may already have completed the delivered
frame. Since serdev interprets the return value as the number of accepted
bytes, this under-reports progress by one byte and can replay the final
byte of the completed frame into a fresh parser state on the next call.
Return i + 1 in that failure path so the accepted-byte count matches the
actual receive-state progress.
Fixes: dfc768fbe6 ("net: qualcomm: add QCA7000 UART driver")
Cc: stable@vger.kernel.org
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402071207.4036-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The GRP_ACK_MSG handler in tipc_group_proto_rcv() currently decrements
bc_ackers on every inbound group ACK, even when the same member has
already acknowledged the current broadcast round.
Because bc_ackers is a u16, a duplicate ACK received after the last
legitimate ACK wraps the counter to 65535. Once wrapped,
tipc_group_bc_cong() keeps reporting congestion and later group
broadcasts on the affected socket stay blocked until the group is
recreated.
Fix this by ignoring duplicate or stale ACKs before touching bc_acked or
bc_ackers. This makes repeated GRP_ACK_MSG handling idempotent and
prevents the underflow path.
Fixes: 2f487712b8 ("tipc: guarantee that group broadcast doesn't bypass group unicast")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/41a4833f368641218e444fdcff822039.security@1seal.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_newlink() lacks a CAP_NET_ADMIN capability check on the peer
network namespace when creating paired devices (veth, vxcan,
netkit). This allows an unprivileged user with a user namespace
to create interfaces in arbitrary network namespaces, including
init_net.
Add a netlink_ns_capable() check for CAP_NET_ADMIN in the peer
namespace before allowing device creation to proceed.
Fixes: 81adee47df ("net: Support specifying the network namespace upon device creation.")
Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260402181432.4126920-1-nickgarlis@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_BRIDGE_VLAN_FILTERING is not set, br_vlan_group() and
nbp_vlan_group() return NULL (br_private.h stub definitions). The
BR_BOOLOPT_FDB_LOCAL_VLAN_0 toggle code is compiled unconditionally and
reaches br_fdb_delete_locals_per_vlan_port() and
br_fdb_insert_locals_per_vlan_port(), where the NULL vlan group pointer
is dereferenced via list_for_each_entry(v, &vg->vlan_list, vlist).
The observed crash is in the delete path, triggered when creating a
bridge with IFLA_BR_MULTI_BOOLOPT containing BR_BOOLOPT_FDB_LOCAL_VLAN_0
via RTM_NEWLINK. The insert helper has the same bug pattern.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
RIP: 0010:br_fdb_delete_locals_per_vlan+0x2b9/0x310
Call Trace:
br_fdb_toggle_local_vlan_0+0x452/0x4c0
br_toggle_fdb_local_vlan_0+0x31/0x80 net/bridge/br.c:276
br_boolopt_toggle net/bridge/br.c:313
br_boolopt_multi_toggle net/bridge/br.c:364
br_changelink net/bridge/br_netlink.c:1542
br_dev_newlink net/bridge/br_netlink.c:1575
Add NULL checks for the vlan group pointer in both helpers, returning
early when there are no VLANs to iterate. This matches the existing
pattern used by other bridge FDB functions such as br_fdb_add() and
br_fdb_delete().
Fixes: 21446c06b4 ("net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0")
Signed-off-by: Zijing Yin <yzjaurora@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402140153.3925663-1-yzjaurora@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an error occurs on the subsequents buffers belonging to the
non-linear part of the skb (e.g. due to an error in the payload length
reported by the NIC or if we consumed all the available fragments for
the skb), the page_pool fragment will not be linked to the skb so it will
not return to the pool in the airoha_qdma_rx_process() error path. Fix the
memory leak partially reverting commit 'd6d2b0e1538d ("net: airoha: Fix
page recycling in airoha_qdma_rx_process()")' and always running
page_pool_put_full_page routine in the airoha_qdma_rx_process() error
path.
Fixes: d6d2b0e153 ("net: airoha: Fix page recycling in airoha_qdma_rx_process()")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402-airoha_qdma_rx_process-mem-leak-fix-v1-1-b5706f402d3c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_FIXED_PHY is in a loadable module, the fec driver cannot be
built-in any more:
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe':
fec_main.c:(.text+0xc4f367): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_close':
fec_main.c:(.text+0xc59591): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe.cold':
Select the fixed phy support on all targets to make this build
correctly, not just on coldfire.
Notat that Essentially the stub helpers in include/linux/phy_fixed.h
cannot be used correctly because of this build time dependency,
and we could just remove them to hit the build failure more often
when a driver uses them without the 'select FIXED_PHY'.
Fixes: dc86b621e1 ("net: fec: register a fixed phy using fixed_phy_register_100fd if needed")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402141048.2713445-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcf_csum_act() walks nested VLAN headers directly from skb->data when an
skb still carries in-payload VLAN tags. The current code reads
vlan->h_vlan_encapsulated_proto and then pulls VLAN_HLEN bytes without
first ensuring that the full VLAN header is present in the linear area.
If only part of an inner VLAN header is linearized, accessing
h_vlan_encapsulated_proto reads past the linear area, and the following
skb_pull(VLAN_HLEN) may violate skb invariants.
Fix this by requiring pskb_may_pull(skb, VLAN_HLEN) before accessing and
pulling each nested VLAN header. If the header still is not fully
available, drop the packet through the existing error path.
Fixes: 2ecba2d1e4 ("net: sched: act_csum: Fix csum calc for tagged packets")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/22df2fcb49f410203eafa5d97963dd36089f4ecf.1774892775.git.caoruide123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The jumbo_frm() chain-mode implementation unconditionally computes
len = nopaged_len - bmax;
where nopaged_len = skb_headlen(skb) (linear bytes only) and bmax is
BUF_SIZE_8KiB or BUF_SIZE_2KiB. However, the caller stmmac_xmit()
decides to invoke jumbo_frm() based on skb->len (total length including
page fragments):
is_jumbo = stmmac_is_jumbo_frm(priv, skb->len, enh_desc);
When a packet has a small linear portion (nopaged_len <= bmax) but a
large total length due to page fragments (skb->len > bmax), the
subtraction wraps as an unsigned integer, producing a huge len value
(~0xFFFFxxxx). This causes the while (len != 0) loop to execute
hundreds of thousands of iterations, passing skb->data + bmax * i
pointers far beyond the skb buffer to dma_map_single(). On IOMMU-less
SoCs (the typical deployment for stmmac), this maps arbitrary kernel
memory to the DMA engine, constituting a kernel memory disclosure and
potential memory corruption from hardware.
Fix this by introducing a buf_len local variable clamped to
min(nopaged_len, bmax). Computing len = nopaged_len - buf_len is then
always safe: it is zero when the linear portion fits within a single
descriptor, causing the while (len != 0) loop to be skipped naturally,
and the fragment loop in stmmac_xmit() handles page fragments afterward.
Fixes: 286a837217 ("stmmac: add CHAINED descriptor mode support (V4)")
Cc: stable@vger.kernel.org
Signed-off-by: Tyllis Xu <LivelyCarpet87@gmail.com>
Link: https://patch.msgid.link/20260401044708.1386919-1-LivelyCarpet87@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When dma_map_single() fails in tse_start_xmit(), the function returns
NETDEV_TX_OK without freeing the skb. Since NETDEV_TX_OK tells the
stack the packet was consumed, the skb is never freed, leaking memory
on every DMA mapping failure.
Add dev_kfree_skb_any() before returning to properly free the skb.
Fixes: bbd2190ce9 ("Altera TSE: Add main and header file for Altera Ethernet Driver")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260401211218.279185-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"With fixes from wireless, bluetooth and netfilter included we're back
to each PR carrying 30%+ more fixes than in previous era.
The good news is that so far none of the "extra" fixes are themselves
causing real regressions. Not sure how much comfort that is.
Current release - fix to a fix:
- netdevsim: fix build if SKB_EXTENSIONS=n
- eth: stmmac: skip VLAN restore when VLAN hash ops are missing
Previous releases - regressions:
- wifi: iwlwifi: mvm: don't send a 6E related command when
not supported
Previous releases - always broken:
- some info leak fixes
- add missing clearing of skb->cb[] on ICMP paths from tunnels
- ipv6:
- flowlabel: defer exclusive option free until RCU teardown
- avoid overflows in ip6_datagram_send_ctl()
- mpls: add seqcount to protect platform_labels from OOB access
- bridge: improve safety of parsing ND options
- bluetooth: fix leaks, overflows and races in hci_sync
- netfilter: add more input validation, some to address bugs directly
some to prevent exploits from cooking up broken configurations
- wifi:
- ath: avoid poor performance due to stopping the wrong
aggregation session
- virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
- eth:
- fec: fix the PTP periodic output sysfs interface
- enetc: safely reinitialize TX BD ring when it has unsent frames"
* tag 'net-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
eth: fbnic: Increase FBNIC_QUEUE_SIZE_MIN to 64
ipv6: avoid overflows in ip6_datagram_send_ctl()
net: hsr: fix VLAN add unwind on slave errors
net: hsr: serialize seq_blocks merge across nodes
vsock: initialize child_ns_mode_locked in vsock_net_init()
selftests/tc-testing: add tests for cls_fw and cls_flow on shared blocks
net/sched: cls_flow: fix NULL pointer dereference on shared blocks
net/sched: cls_fw: fix NULL pointer dereference on shared blocks
net/x25: Fix overflow when accumulating packets
net/x25: Fix potential double free of skb
bnxt_en: Restore default stat ctxs for ULP when resource is available
bnxt_en: Don't assume XDP is never enabled in bnxt_init_dflt_ring_mode()
bnxt_en: Refactor some basic ring setup and adjustment logic
net/mlx5: Fix switchdev mode rollback in case of failure
net/mlx5: Avoid "No data available" when FW version queries fail
net/mlx5: lag: Check for LAG device before creating debugfs
net: macb: properly unregister fixed rate clocks
net: macb: fix clk handling on PCI glue driver removal
virtio_net: clamp rss_max_key_size to NETDEV_RSS_KEY_LEN
net/sched: sch_netem: fix out-of-bounds access in packet corruption
...
Pull iommu fixes from Joerg Roedel:
- IOMMU-PT related compile breakage in for AMD driver
- IOTLB flushing behavior when unmapped region is larger than requested
due to page-sizes
- Fix IOTLB flush behavior with empty gathers
* tag 'iommu-fixes-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommupt/amdv1: mark amdv1pt_install_leaf_entry as __always_inline
iommupt: Fix short gather if the unmap goes into a large mapping
iommu: Do not call drivers for empty gathers
Pull sound fixes from Takashi Iwai:
"People have been so busy for hunting and we're still getting more
changes than wished for, but it doesn't look too scary; almost all
changes are device-specific small fixes.
I guess it's rather a casual bump, and no more Easter eggs are left
for 7.0 (hopefully)...
- Fixes for the recent regression on ctxfi driver
- Fix missing INIT_LIST_HEAD() for ASoC card_aux_list
- Usual HD- and USB-audio, and ASoC AMD quirk updates
- ASoC fixes for AMD and Intel"
* tag 'sound-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
ASoC: amd: ps: Fix missing leading zeros in subsystem_device SSID log
ALSA: usb-audio: Exclude Scarlett 2i2 1st Gen (8016) from SKIP_IFACE_SETUP
ALSA: hda/realtek: add quirk for Acer Swift SFG14-73
ALSA: hda/realtek: Add quirk for Lenovo Yoga Pro 7 14IMH9
ASoC: Intel: boards: fix unmet dependency on PINCTRL
ASoC: Intel: ehl_rt5660: Use the correct rtd->dev device in hw_params
ALSA: ctxfi: Don't enumerate SPDIF1 at DAIO initialization
ALSA: hda/realtek: Add quirk for Lenovo Yoga Slim 7 14AKP10
ALSA: hda/realtek: add quirk for HP Laptop 15-fc0xxx
ASoC: ep93xx: Fix unchecked clk_prepare_enable() and add rollback on failure
ASoC: soc-core: call missing INIT_LIST_HEAD() for card_aux_list
ALSA: hda/realtek: Add quirk for Samsung Book2 Pro 360 (NP950QED)
ASoC: amd: yc: Add DMI entry for HP Laptop 15-fc0xxx
ASoC: amd: yc: Add DMI quirk for ASUS Vivobook Pro 16X OLED M7601RM
ALSA: hda/realtek: Add quirk for ASUS ROG Strix SCAR 15
ALSA: usb-audio: Exclude Scarlett Solo 1st Gen from SKIP_IFACE_SETUP
ALSA: caiaq: fix stack out-of-bounds read in init_card
ALSA: ctxfi: Check the error for index mapping
ALSA: ctxfi: Fix missing SPDIFI1 index handling
ALSA: hda/realtek: add quirk for HP Victus 15-fb0xxx
...
Pull auxdisplay fixes from Andy Shevchenko:
- Fix NULL dereference in linedisp_release()
- Fix ht16k33 DT bindings to avoid warnings
- Handle errors in I²C transfers in lcd2s driver
* tag 'auxdisplay-v7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
auxdisplay: line-display: fix NULL dereference in linedisp_release
auxdisplay: lcd2s: add error handling for i2c transfers
dt-bindings: auxdisplay: ht16k33: Use unevaluatedProperties to fix common property warning
On systems with 64K pages, RX queues will be wedged if users set the
descriptor count to the current minimum (16). Fbnic fragments large
pages into 4K chunks, and scales down the ring size accordingly. With
64K pages and 16 descriptors, the ring size mask is 0 and will never
be filled.
32 descriptors is another special case that wedges the RX rings.
Internally, the rings track pages for the head/tail pointers, not page
fragments. So with 32 descriptors, there's only 1 usable page as one
ring slot is kept empty to disambiguate between an empty/full ring.
As a result, the head pointer never advances and the HW stalls after
consuming 16 page fragments.
Fixes: 0cb4c0a137 ("eth: fbnic: Implement Rx queue alloc/start/stop/free")
Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Link: https://patch.msgid.link/20260401162848.2335350-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yiming Qian reported :
<quote>
I believe I found a locally triggerable kernel bug in the IPv6 sendmsg
ancillary-data path that can panic the kernel via `skb_under_panic()`
(local DoS).
The core issue is a mismatch between:
- a 16-bit length accumulator (`struct ipv6_txoptions::opt_flen`, type
`__u16`) and
- a pointer to the *last* provided destination-options header (`opt->dst1opt`)
when multiple `IPV6_DSTOPTS` control messages (cmsgs) are provided.
- `include/net/ipv6.h`:
- `struct ipv6_txoptions::opt_flen` is `__u16` (wrap possible).
(lines 291-307, especially 298)
- `net/ipv6/datagram.c:ip6_datagram_send_ctl()`:
- Accepts repeated `IPV6_DSTOPTS` and accumulates into `opt_flen`
without rejecting duplicates. (lines 909-933)
- `net/ipv6/ip6_output.c:__ip6_append_data()`:
- Uses `opt->opt_flen + opt->opt_nflen` to compute header
sizes/headroom decisions. (lines 1448-1466, especially 1463-1465)
- `net/ipv6/ip6_output.c:__ip6_make_skb()`:
- Calls `ipv6_push_frag_opts()` if `opt->opt_flen` is non-zero.
(lines 1930-1934)
- `net/ipv6/exthdrs.c:ipv6_push_frag_opts()` / `ipv6_push_exthdr()`:
- Push size comes from `ipv6_optlen(opt->dst1opt)` (based on the
pointed-to header). (lines 1179-1185 and 1206-1211)
1. `opt_flen` is a 16-bit accumulator:
- `include/net/ipv6.h:298` defines `__u16 opt_flen; /* after fragment hdr */`.
2. `ip6_datagram_send_ctl()` accepts *repeated* `IPV6_DSTOPTS` cmsgs
and increments `opt_flen` each time:
- In `net/ipv6/datagram.c:909-933`, for `IPV6_DSTOPTS`:
- It computes `len = ((hdr->hdrlen + 1) << 3);`
- It checks `CAP_NET_RAW` using `ns_capable(net->user_ns,
CAP_NET_RAW)`. (line 922)
- Then it does:
- `opt->opt_flen += len;` (line 927)
- `opt->dst1opt = hdr;` (line 928)
There is no duplicate rejection here (unlike the legacy
`IPV6_2292DSTOPTS` path which rejects duplicates at
`net/ipv6/datagram.c:901-904`).
If enough large `IPV6_DSTOPTS` cmsgs are provided, `opt_flen` wraps
while `dst1opt` still points to a large (2048-byte)
destination-options header.
In the attached PoC (`poc.c`):
- 32 cmsgs with `hdrlen=255` => `len = (255+1)*8 = 2048`
- 1 cmsg with `hdrlen=0` => `len = 8`
- Total increment: `32*2048 + 8 = 65544`, so `(__u16)opt_flen == 8`
- The last cmsg is 2048 bytes, so `dst1opt` points to a 2048-byte header.
3. The transmit path sizes headers using the wrapped `opt_flen`:
- In `net/ipv6/ip6_output.c:1463-1465`:
- `headersize = sizeof(struct ipv6hdr) + (opt ? opt->opt_flen +
opt->opt_nflen : 0) + ...;`
With wrapped `opt_flen`, `headersize`/headroom decisions underestimate
what will be pushed later.
4. When building the final skb, the actual push length comes from
`dst1opt` and is not limited by wrapped `opt_flen`:
- In `net/ipv6/ip6_output.c:1930-1934`:
- `if (opt->opt_flen) proto = ipv6_push_frag_opts(skb, opt, proto);`
- In `net/ipv6/exthdrs.c:1206-1211`, `ipv6_push_frag_opts()` pushes
`dst1opt` via `ipv6_push_exthdr()`.
- In `net/ipv6/exthdrs.c:1179-1184`, `ipv6_push_exthdr()` does:
- `skb_push(skb, ipv6_optlen(opt));`
- `memcpy(h, opt, ipv6_optlen(opt));`
With insufficient headroom, `skb_push()` underflows and triggers
`skb_under_panic()` -> `BUG()`:
- `net/core/skbuff.c:2669-2675` (`skb_push()` calls `skb_under_panic()`)
- `net/core/skbuff.c:207-214` (`skb_panic()` ends in `BUG()`)
- The `IPV6_DSTOPTS` cmsg path requires `CAP_NET_RAW` in the target
netns user namespace (`ns_capable(net->user_ns, CAP_NET_RAW)`).
- Root (or any task with `CAP_NET_RAW`) can trigger this without user
namespaces.
- An unprivileged `uid=1000` user can trigger this if unprivileged
user namespaces are enabled and it can create a userns+netns to obtain
namespaced `CAP_NET_RAW` (the attached PoC does this).
- Local denial of service: kernel BUG/panic (system crash).
- Reproducible with a small userspace PoC.
</quote>
This patch does not reject duplicated options, as this might break
some user applications.
Instead, it makes sure to adjust opt_flen and opt_nflen to correctly
reflect the size of the current option headers, preventing the overflows
and the potential for panics.
This applies to IPV6_DSTOPTS, IPV6_HOPOPTS, and IPV6_RTHDR.
Specifically:
When a new IPV6_DSTOPTS is processed, the length of the old opt->dst1opt
is subtracted from opt->opt_flen before adding the new length.
When a new IPV6_HOPOPTS is processed, the length of the old opt->dst0opt
is subtracted from opt->opt_nflen.
When a new Routing Header (IPV6_RTHDR or IPV6_2292RTHDR) is processed,
the length of the old opt->srcrt is subtracted from opt->opt_nflen.
In the special case within IPV6_2292RTHDR handling where dst1opt is moved
to dst0opt, the length of the old opt->dst0opt is subtracted from
opt->opt_nflen before the new one is added.
Fixes: 333fad5364 ("[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542).")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Closes: https://lore.kernel.org/netdev/CAL_bE8JNzawgr5OX5m+3jnQDHry2XxhQT5=jThW1zDPtUikRYA@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260401154721.3740056-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luka Gejak says:
====================
net: hsr: fixes for PRP duplication and VLAN unwind
This series addresses two logic bugs in the HSR/PRP implementation
identified during a protocol audit. These are targeted for the 'net'
tree as they fix potential memory corruption and state inconsistency.
The primary change resolves a race condition in the node merging path by
implementing address-based lock ordering. This ensures that concurrent
mutations of sequence blocks do not lead to state corruption or
deadlocks.
An additional fix corrects asymmetric VLAN error unwinding by
implementing a centralized unwind path on slave errors.
====================
Link: https://patch.msgid.link/20260401092243.52121-1-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When vlan_vid_add() fails for a secondary slave, the error path calls
vlan_vid_del() on the failing port instead of the peer slave that had
already succeeded. This results in asymmetric VLAN state across the HSR
pair.
Fix this by switching to a centralized unwind path that removes the VID
from any slave device that was already programmed.
Fixes: 1a8a63a530 ("net: hsr: Add VLAN CTAG filter support")
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
Link: https://patch.msgid.link/20260401092243.52121-3-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During node merging, hsr_handle_sup_frame() walks node_curr->seq_blocks
to update node_real without holding node_curr->seq_out_lock. This
allows concurrent mutations from duplicate registration paths, risking
inconsistent state or XArray/bitmap corruption.
Fix this by locking both nodes' seq_out_lock during the merge.
To prevent ABBA deadlocks, locks are acquired in order of memory
address.
Reviewed-by: Felix Maurer <fmaurer@redhat.com>
Fixes: 415e636751 ("hsr: Implement more robust duplicate discard for PRP")
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
Link: https://patch.msgid.link/20260401092243.52121-2-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The `child_ns_mode_locked` field lives in `struct net`, which persists
across vsock module reloads. When the module is unloaded and reloaded,
`vsock_net_init()` resets `mode` and `child_ns_mode` back to their
default values, but does not reset `child_ns_mode_locked`.
The stale lock from the previous module load causes subsequent writes
to `child_ns_mode` to silently fail: `vsock_net_set_child_mode()` sees
the old lock, skips updating the actual value, and returns success
when the requested mode matches the stale lock. The sysctl handler
reports no error, but `child_ns_mode` remains unchanged.
Steps to reproduce:
$ modprobe vsock
$ echo local > /proc/sys/net/vsock/child_ns_mode
$ cat /proc/sys/net/vsock/child_ns_mode
local
$ modprobe -r vsock
$ modprobe vsock
$ echo local > /proc/sys/net/vsock/child_ns_mode
$ cat /proc/sys/net/vsock/child_ns_mode
global <--- expected "local"
Fix this by initializing `child_ns_mode_locked` to 0 (unlocked) in
`vsock_net_init()`, so the write-once mechanism works correctly after
module reload.
Fixes: 102eab95f0 ("vsock: lock down child_ns_mode as write-once")
Reported-by: Jin Liu <jinl@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260401092153.28462-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Regression tests for the shared-block NULL derefs fixed in the previous
two patches:
- fw: attempt to attach an empty fw filter to a shared block and
verify the configuration is rejected with EINVAL.
- flow: create a flow filter on a shared block without a baseclass
and verify the configuration is rejected with EINVAL.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-3-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
flow_change() calls tcf_block_q() and dereferences q->handle to derive
a default baseclass. Shared blocks leave block->q NULL, causing a NULL
deref when a flow filter without a fully qualified baseclass is created
on a shared block.
Check tcf_block_shared() before accessing block->q and return -EINVAL
for shared blocks. This avoids the null-deref shown below:
=======================================================================
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:flow_change (net/sched/cls_flow.c:508)
Call Trace:
tc_new_tfilter (net/sched/cls_api.c:2432)
rtnetlink_rcv_msg (net/core/rtnetlink.c:6980)
[...]
=======================================================================
Fixes: 1abf272022 ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-2-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The old-method path in fw_classify() calls tcf_block_q() and
dereferences q->handle. Shared blocks leave block->q NULL, causing a
NULL deref when an empty cls_fw filter is attached to a shared block
and a packet with a nonzero major skb mark is classified.
Reject the configuration in fw_change() when the old method (no
TCA_OPTIONS) is used on a shared block, since fw_classify()'s
old-method path needs block->q which is NULL for shared blocks.
The fixed null-ptr-deref calling stack:
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:fw_classify (net/sched/cls_fw.c:81)
Call Trace:
tcf_classify (./include/net/tc_wrapper.h:197 net/sched/cls_api.c:1764 net/sched/cls_api.c:1860)
tc_run (net/core/dev.c:4401)
__dev_queue_xmit (net/core/dev.c:4535 net/core/dev.c:4790)
Fixes: 1abf272022 ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Martin Schiller says:
====================
net/x25: Fix overflow and double free
This patch set includes 2 fixes:
The first removes a potential double free of received skb
The second fixes an overflow when accumulating packets with the more-bit
set.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
====================
Link: https://patch.msgid.link/20260331-x25_fraglen-v4-0-3e69f18464b4@dev.tdt.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When alloc_skb fails in x25_queue_rx_frame it calls kfree_skb(skb) at
line 48 and returns 1 (error).
This error propagates back through the call chain:
x25_queue_rx_frame returns 1
|
v
x25_state3_machine receives the return value 1 and takes the else
branch at line 278, setting queued=0 and returning 0
|
v
x25_process_rx_frame returns queued=0
|
v
x25_backlog_rcv at line 452 sees queued=0 and calls kfree_skb(skb)
again
This would free the same skb twice. Looking at x25_backlog_rcv:
net/x25/x25_in.c:x25_backlog_rcv() {
...
queued = x25_process_rx_frame(sk, skb);
...
if (!queued)
kfree_skb(skb);
}
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Link: https://patch.msgid.link/20260331-x25_fraglen-v4-1-3e69f18464b4@dev.tdt.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ASoC: Fixes for v7.0
Another smallish batch of fixes and quirks, these days it's AMD that is
getting all the DMI entries added. We've got one core fix for a missing
list initialisation with auxiliary devices, otherwise it's all fairly
small things.
Michael Chan says:
====================
bnxt_en: Bug fixes
The first patch is a refactor patch needed by the second patch to
fix XDP ring initialization during FW reset. The third patch
fixes an issue related to stats context reservation for RoCE.
====================
Link: https://patch.msgid.link/20260331065138.948205-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During resource reservation, if the L2 driver does not have enough
MSIX vectors to provide to the RoCE driver, it sets the stat ctxs for
ULP also to 0 so that we don't have to reserve it unnecessarily.
However, subsequently the user may reduce L2 rings thereby freeing up
some resources that the L2 driver can now earmark for RoCE. In this
case, the driver should restore the default ULP stat ctxs to make
sure that all RoCE resources are ready for use.
The RoCE driver may fail to initialize in this scenario without this
fix.
Fixes: d630624ebd ("bnxt_en: Utilize ulp client resources if RoCE is not registered")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260331065138.948205-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The original code made the assumption that when we set up the initial
default ring mode, we must be just loading the driver and XDP cannot
be enabled yet. This is not true when the FW goes through a resource
or capability change. Resource reservations will be cancelled and
reinitialized with XDP already enabled. devlink reload with XDP enabled
will also have the same issue. This scenario will cause the ring
arithmetic to be all wrong in the bnxt_init_dflt_ring_mode() path
causing failure:
bnxt_en 0000:a1:00.0 ens2f0np0: bnxt_setup_int_mode err: ffffffea
bnxt_en 0000:a1:00.0 ens2f0np0: bnxt_request_irq err: ffffffea
bnxt_en 0000:a1:00.0 ens2f0np0: nic open fail (rc: ffffffea)
Fix it by properly accounting for XDP in the bnxt_init_dflt_ring_mode()
path by using the refactored helper functions in the previous patch.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Fixes: ec5d31e3c1 ("bnxt_en: Handle firmware reset status during IF_UP.")
Fixes: 228ea8c187 ("bnxt_en: implement devlink dev reload driver_reinit")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260331065138.948205-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Avoid printing the misleading "kernel answers: No data available" devlink
output when querying firmware or pending firmware version fails
(e.g. MLX5 fw state errors / flash failures).
FW can fail on loading the pending flash image and get its version due
to various reasons, examples:
mlxfw: Firmware flash failed: key not applicable, err (7)
mlx5_fw_image_pending: can't read pending fw version while fw state is 1
and the resulting:
$ devlink dev info
kernel answers: No data available
Instead, just report 0 or 0xfff.. versions in case of failure to indicate
a problem, and let other information be shown.
after the fix:
$ devlink dev info
pci/0000:00:06.0:
driver mlx5_core
serial_number xxx...
board.serial_number MT2225300179
versions:
fixed:
fw.psid MT_0000000436
running:
fw.version 22.41.0188
fw 22.41.0188
stored:
fw.version 255.255.65535
fw 255.255.65535
Fixes: 9c86b07e30 ("net/mlx5: Added fw version query command")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260330194015.53585-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__mlx5_lag_dev_add_mdev() may return 0 (success) even when an error
occurs that is handled gracefully. Consequently, the initialization
flow proceeds to call mlx5_ldev_add_debugfs() even when there is no
valid LAG context.
mlx5_ldev_add_debugfs() blindly created the debugfs directory and
attributes. This exposed interfaces (like the members file) that rely on
a valid ldev pointer, leading to potential NULL pointer dereferences if
accessed when ldev is NULL.
Add a check to verify that mlx5_lag_dev(dev) returns a valid pointer
before attempting to create the debugfs entries.
Fixes: 7f46a0b732 ("net/mlx5: Lag, add debugfs to query hardware lag state")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260330194015.53585-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
platform_device_unregister() may still want to use the registered clks
during runtime resume callback.
Note that there is a commit d82d5303c4 ("net: macb: fix use after free
on rmmod") that addressed the similar problem of clk vs platform device
unregistration but just moved the bug to another place.
Save the pointers to clks into local variables for reuse after platform
device is unregistered.
BUG: KASAN: use-after-free in clk_prepare+0x5a/0x60
Read of size 8 at addr ffff888104f85e00 by task modprobe/597
CPU: 2 PID: 597 Comm: modprobe Not tainted 6.1.164+ #114
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xba
print_report+0x17f/0x496
kasan_report+0xd9/0x180
clk_prepare+0x5a/0x60
macb_runtime_resume+0x13d/0x410 [macb]
pm_generic_runtime_resume+0x97/0xd0
__rpm_callback+0xc8/0x4d0
rpm_callback+0xf6/0x230
rpm_resume+0xeeb/0x1a70
__pm_runtime_resume+0xb4/0x170
bus_remove_device+0x2e3/0x4b0
device_del+0x5b3/0xdc0
platform_device_del+0x4e/0x280
platform_device_unregister+0x11/0x50
pci_device_remove+0xae/0x210
device_remove+0xcb/0x180
device_release_driver_internal+0x529/0x770
driver_detach+0xd4/0x1a0
bus_remove_driver+0x135/0x260
driver_unregister+0x72/0xb0
pci_unregister_driver+0x26/0x220
__do_sys_delete_module+0x32e/0x550
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
</TASK>
Allocated by task 519:
kasan_save_stack+0x2c/0x50
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x8e/0x90
__clk_register+0x458/0x2890
clk_hw_register+0x1a/0x60
__clk_hw_register_fixed_rate+0x255/0x410
clk_register_fixed_rate+0x3c/0xa0
macb_probe+0x1d8/0x42e [macb_pci]
local_pci_probe+0xd7/0x190
pci_device_probe+0x252/0x600
really_probe+0x255/0x7f0
__driver_probe_device+0x1ee/0x330
driver_probe_device+0x4c/0x1f0
__driver_attach+0x1df/0x4e0
bus_for_each_dev+0x15d/0x1f0
bus_add_driver+0x486/0x5e0
driver_register+0x23a/0x3d0
do_one_initcall+0xfd/0x4d0
do_init_module+0x18b/0x5a0
load_module+0x5663/0x7950
__do_sys_finit_module+0x101/0x180
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Freed by task 597:
kasan_save_stack+0x2c/0x50
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0x106/0x180
__kmem_cache_free+0xbc/0x320
clk_unregister+0x6de/0x8d0
macb_remove+0x73/0xc0 [macb_pci]
pci_device_remove+0xae/0x210
device_remove+0xcb/0x180
device_release_driver_internal+0x529/0x770
driver_detach+0xd4/0x1a0
bus_remove_driver+0x135/0x260
driver_unregister+0x72/0xb0
pci_unregister_driver+0x26/0x220
__do_sys_delete_module+0x32e/0x550
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Fixes: d82d5303c4 ("net: macb: fix use after free on rmmod")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Link: https://patch.msgid.link/20260330184542.626619-1-pchelkin@ispras.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rss_max_key_size in the virtio spec is the maximum key size supported by
the device, not a mandatory size the driver must use. Also the value 40
is a spec minimum, not a spec maximum.
The current code rejects RSS and can fail probe when the device reports a
larger rss_max_key_size than the driver buffer limit. Instead, clamp the
effective key length to min(device rss_max_key_size, NETDEV_RSS_KEY_LEN)
and keep RSS enabled.
This keeps probe working on devices that advertise larger maximum key sizes
while respecting the netdev RSS key buffer size limit.
Fixes: 3f7d9c1964 ("virtio_net: Add hash_key_length check")
Cc: stable@vger.kernel.org
Signed-off-by: Srujana Challa <schalla@marvell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260326142344.1171317-1-schalla@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In netem_enqueue(), the packet corruption logic uses
get_random_u32_below(skb_headlen(skb)) to select an index for
modifying skb->data. When an AF_PACKET TX_RING sends fully non-linear
packets over an IPIP tunnel, skb_headlen(skb) evaluates to 0.
Passing 0 to get_random_u32_below() takes the variable-ceil slow path
which returns an unconstrained 32-bit random integer. Using this
unconstrained value as an offset into skb->data results in an
out-of-bounds memory access.
Fix this by verifying skb_headlen(skb) is non-zero before attempting
to corrupt the linear data area. Fully non-linear packets will silently
bypass the corruption logic.
Fixes: c865e5d99e ("[PKT_SCHED] netem: packet corruption option")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yuhang Zheng <z1652074432@gmail.com>
Signed-off-by: Yucheng Lu <kanolyc@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/45435c0935df877853a81e6d06205ac738ec65fa.1774941614.git.kanolyc@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net. Note that most
of the bugs fixed here are >5 years old. The large PR is not due to an
increase in regressions.
1) Flowtable hardware offload support in IPv6 can lead to out-of-bounds
when populating the rule action array when combined with double-tagged
vlan. Bump the maximum number of actions from 16 to 24 and check that
such limit is never reached, otherwise bail out. This bugs stems from
the original flowtable hardware offload support.
2) nfnetlink_log does not include the netlink header size of the trailing
NLMSG_DONE message when calculating the skb size. From Florian Westphal.
3) Reject names in xt_cgroup and xt_rateest extensions which are not
nul-terminated. Also from Florian.
4) Use nla_strcmp in ipset lookup by set name, since IPSET_ATTR_NAME and
IPSET_ATTR_NAMEREF are of NLA_STRING type. From Florian Westphal.
5) When unregistering conntrack helpers, pass the helper that is going
away so the expectation cleanup is done accordingly, otherwise UaF is
possible when accessing expectation that refer to the helper that is
gone. From Qi Tang.
6) Zero expectation NAT fields to address leaking kernel memory through
the expectation netlink dump when unset. Also from Qi Tang.
7) Use the master conntrack helper when creating expectations via
ctnetlink, ignore the suggested helper through CTA_EXPECT_HELP_NAME.
This allows to address a possible read of kernel memory off the
expectation object boundary.
8) Fix incorrect release of the hash bucket logic in ipset when the
bucket is empty, leading to shrinking the hash bucket to size 0
which deals to out-of-bound write in next element additions.
From Yifan Wu.
9) Allow the use of x_tables extensions that explicitly declare
NFPROTO_ARP support only. This is to avoid an incorrect hook number
validation due to non-overlapping arp and inet hook number
definitions.
10) Reject immediate NF_QUEUE verdict in nf_tables. The userspace
nft tool always uses the nft_queue expression for queueing.
This ensures this verdict cannot be used for the arp family,
which does supported this.
* tag 'nf-26-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: reject immediate NF_QUEUE verdict
netfilter: x_tables: restrict xt_check_match/xt_check_target extensions for NFPROTO_ARP
netfilter: ipset: drop logically empty buckets in mtype_del
netfilter: ctnetlink: ignore explicit helper on new expectations
netfilter: ctnetlink: zero expect NAT fields when CTA_EXPECT_NAT absent
netfilter: nf_conntrack_helper: pass helper to expect cleanup
netfilter: ipset: use nla_strcmp for IPSET_ATTR_NAME attr
netfilter: x_tables: ensure names are nul-terminated
netfilter: nfnetlink_log: account for netlink header size
netfilter: flowtable: strictly check for maximum number of actions
====================
Link: https://patch.msgid.link/20260401103646.1015423-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_ib_get_mr() extracts the rds_ib_connection from conn->c_transport_data
and passes it to rds_ib_reg_frmr() for FRWR memory registration. On a
fresh outgoing connection, ic is allocated in rds_ib_conn_alloc() with
i_cm_id = NULL because the connection worker has not yet called
rds_ib_conn_path_connect() to create the rdma_cm_id. When sendmsg() with
RDS_CMSG_RDMA_MAP is called on such a connection, the sendmsg path parses
the control message before any connection establishment, allowing
rds_ib_post_reg_frmr() to dereference ic->i_cm_id->qp and crash the
kernel.
The existing guard in rds_ib_reg_frmr() only checks for !ic (added in
commit 9e630bcb77), which does not catch this case since ic is allocated
early and is always non-NULL once the connection object exists.
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:rds_ib_post_reg_frmr+0x50e/0x920
Call Trace:
rds_ib_post_reg_frmr (net/rds/ib_frmr.c:167)
rds_ib_map_frmr (net/rds/ib_frmr.c:252)
rds_ib_reg_frmr (net/rds/ib_frmr.c:430)
rds_ib_get_mr (net/rds/ib_rdma.c:615)
__rds_rdma_map (net/rds/rdma.c:295)
rds_cmsg_rdma_map (net/rds/rdma.c:860)
rds_sendmsg (net/rds/send.c:1363)
____sys_sendmsg
do_syscall_64
Add a check in rds_ib_get_mr() that verifies ic, i_cm_id, and qp are all
non-NULL before proceeding with FRMR registration, mirroring the guard
already present in rds_ib_post_inv(). Return -ENODEV when the connection
is not ready, which the existing error handling in rds_cmsg_send() converts
to -EAGAIN for userspace retry and triggers rds_conn_connect_if_down() to
start the connection worker.
Fixes: 1659185fb4 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260330163237.2752440-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib6_metric_set() may be called concurrently from softirq context without
holding the FIB table lock. A typical path is:
ndisc_router_discovery()
spin_unlock_bh(&table->tb6_lock) <- lock released
fib6_metric_set(rt, RTAX_HOPLIMIT, ...) <- lockless call
When two CPUs process Router Advertisement packets for the same router
simultaneously, they can both arrive at fib6_metric_set() with the same
fib6_info pointer whose fib6_metrics still points to dst_default_metrics.
if (f6i->fib6_metrics == &dst_default_metrics) { /* both CPUs: true */
struct dst_metrics *p = kzalloc_obj(*p, GFP_ATOMIC);
refcount_set(&p->refcnt, 1);
f6i->fib6_metrics = p; /* CPU1 overwrites CPU0's p -> p0 leaked */
}
The dst_metrics allocated by the losing CPU has refcnt=1 but no pointer
to it anywhere in memory, producing a kmemleak report:
unreferenced object 0xff1100025aca1400 (size 96):
comm "softirq", pid 0, jiffies 4299271239
backtrace:
kmalloc_trace+0x28a/0x380
fib6_metric_set+0xcd/0x180
ndisc_router_discovery+0x12dc/0x24b0
icmpv6_rcv+0xc16/0x1360
Fix this by:
- Set val for p->metrics before published via cmpxchg() so the metrics
value is ready before the pointer becomes visible to other CPUs.
- Replace the plain pointer store with cmpxchg() and free the allocation
safely when competition failed.
- Add READ_ONCE()/WRITE_ONCE() for metrics[] setting in the non-default
metrics path to prevent compiler-based data races.
Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Reported-by: Fei Liu <feliu@redhat.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260331-b4-fib6_metric_set-kmemleak-v3-1-88d27f4d8825@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hci_le_big_create_sync() uses DEFINE_FLEX to allocate a
struct hci_cp_le_big_create_sync on the stack with room for 0x11 (17)
BIS entries. However, conn->num_bis can hold up to HCI_MAX_ISO_BIS (31)
entries — validated against ISO_MAX_NUM_BIS (0x1f) in the caller
hci_conn_big_create_sync(). When conn->num_bis is between 18 and 31,
the memcpy that copies conn->bis into cp->bis writes up to 14 bytes
past the stack buffer, corrupting adjacent stack memory.
This is trivially reproducible: binding an ISO socket with
bc_num_bis = ISO_MAX_NUM_BIS (31) and calling listen() will
eventually trigger hci_le_big_create_sync() from the HCI command
sync worker, causing a KASAN-detectable stack-out-of-bounds write:
BUG: KASAN: stack-out-of-bounds in hci_le_big_create_sync+0x256/0x3b0
Write of size 31 at addr ffffc90000487b48 by task kworker/u9:0/71
Fix this by changing the DEFINE_FLEX count from the incorrect 0x11 to
HCI_MAX_ISO_BIS, which matches the maximum number of BIS entries that
conn->bis can actually carry.
Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Cc: stable@vger.kernel.org
Signed-off-by: hkbinbin <hkbinbinbin@gmail.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The legacy responder path in smp_random() currently labels the stored
STK as authenticated whenever pending_sec_level is BT_SECURITY_HIGH.
That reflects what the local service requested, not what the pairing
flow actually achieved.
For Just Works/Confirm legacy pairing, SMP_FLAG_MITM_AUTH stays clear
and the resulting STK should remain unauthenticated even if the local
side requested HIGH security. Use the established MITM state when
storing the responder STK so the key metadata matches the pairing result.
This also keeps the legacy path aligned with the Secure Connections code,
which already treats JUST_WORKS/JUST_CFM as unauthenticated.
Fixes: fff3490f47 ("Bluetooth: Fix setting correct authentication information for SMP STK")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
smp_cmd_pairing_req() currently builds the pairing response from the
initiator auth_req before enforcing the local BT_SECURITY_HIGH
requirement. If the initiator omits SMP_AUTH_MITM, the response can
also omit it even though the local side still requires MITM.
tk_request() then sees an auth value without SMP_AUTH_MITM and may
select JUST_CFM, making method selection inconsistent with the pairing
policy the responder already enforces.
When the local side requires HIGH security, first verify that MITM can
be achieved from the IO capabilities and then force SMP_AUTH_MITM in the
response in both rsp.auth_req and auth. This keeps the responder auth bits
and later method selection aligned.
Fixes: 2b64d153a0 ("Bluetooth: Add MITM mechanism to LE-SMP")
Cc: stable@vger.kernel.org
Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
mesh_send() currently bounds MGMT_OP_MESH_SEND by total command
length, but it never verifies that the bytes supplied for the
flexible adv_data[] array actually match the embedded adv_data_len
field. MGMT_MESH_SEND_SIZE only covers the fixed header, so a
truncated command can still pass the existing 20..50 byte range
check and later drive the async mesh send path past the end of the
queued command buffer.
Keep rejecting zero-length and oversized advertising payloads, but
validate adv_data_len explicitly and require the command length to
exactly match the flexible array size before queueing the request.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
hci_le_remote_conn_param_req_evt, otherwise it's possible it is freed
concurrently.
Extend the hci_dev_lock critical section to cover all conn usage.
Fixes: 95118dd4ed ("Bluetooth: hci_event: Use of a function table to handle LE subevents")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
set_cig_params_sync, otherwise it's possible it is freed concurrently.
Take hdev lock to prevent hci_conn from being deleted or modified
concurrently. Just RCU lock is not suitable here, as we also want to
avoid "tearing" in the configuration.
Fixes: a091289218 ("Bluetooth: hci_conn: Fix hci_le_set_cig_params")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Load Long Term Keys stores the user-provided enc_size and later uses
it to size fixed-size stack operations when replying to LE LTK
requests. An enc_size larger than the 16-byte key buffer can therefore
overflow the reply stack buffer.
Reject oversized enc_size values while validating the management LTK
record so invalid keys never reach the stored key state.
Fixes: 346af67b8d ("Bluetooth: Add MGMT handlers for dealing with SMP LTK's")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during
initialization") fixed a race for hci commands sent during initialization.
However, there is still a race that happens if an hci event from one of
these commands is received before HCI_UART_REGISTERED has been set at
the end of hci_uart_register_dev(). The event will be ignored which
causes the command to fail with a timeout in the log:
"Bluetooth: hci0: command 0x1003 tx timeout"
This is because the hci event receive path (hci_uart_tty_receive ->
h4_recv) requires HCI_UART_REGISTERED to be set in h4_recv(), while the
hci command transmit path (hci_uart_send_frame -> h4_enqueue) only
requires HCI_UART_PROTO_INIT to be set in hci_uart_send_frame().
The check for HCI_UART_REGISTERED was originally added in commit
c257820291 ("Bluetooth: Fix H4 crash from incoming UART packets")
to fix a crash caused by hu->hdev being null dereferenced. That can no
longer happen: once HCI_UART_PROTO_INIT is set in hci_uart_register_dev()
all pointers (hu, hu->priv and hu->hdev) are valid, and
hci_uart_tty_receive() already calls h4_recv() on HCI_UART_PROTO_INIT
or HCI_UART_PROTO_READY.
Remove the check for HCI_UART_REGISTERED in h4_recv() to fix the race
condition.
Fixes: 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during initialization")
Signed-off-by: Jonathan Rissanen <jonathan.rissanen@axis.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue_once() returns with error, the destroy callback
will not be called.
Fix leaking references / memory on these failures.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_queue_once() needs to indicate whether a queue item was
added, so caller can know if callbacks are called, so it can avoid
leaking resources.
Change the function to return -EEXIST if queue item already exists.
Modify all callsites to handle that.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_store_wake_reason() is called from hci_event_packet() immediately
after stripping the HCI event header but before hci_event_func()
enforces the per-event minimum payload length from hci_ev_table.
This means a short HCI event frame can reach bacpy() before any bounds
check runs.
Rather than duplicating skb parsing and per-event length checks inside
hci_store_wake_reason(), move wake-address storage into the individual
event handlers after their existing event-length validation has
succeeded. Convert hci_store_wake_reason() into a small helper that only
stores an already-validated bdaddr while the caller holds hci_dev_lock().
Use the same helper after hci_event_func() with a NULL address to
preserve the existing unexpected-wake fallback semantics when no
validated event handler records a wake address.
Annotate the helper with __must_hold(&hdev->lock) and add
lockdep_assert_held(&hdev->lock) so future call paths keep the lock
contract explicit.
Call the helper from hci_conn_request_evt(), hci_conn_complete_evt(),
hci_sync_conn_complete_evt(), le_conn_complete_evt(),
hci_le_adv_report_evt(), hci_le_ext_adv_report_evt(),
hci_le_direct_adv_report_evt(), hci_le_pa_sync_established_evt(), and
hci_le_past_received_evt().
Fixes: 2f20216c1d ("Bluetooth: Emit controller suspend and resume events")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sco_sock_connect() checks sk_state and sk_type without holding
the socket lock. Two concurrent connect() syscalls on the same
socket can both pass the check and enter sco_connect(), leading
to use-after-free.
The buggy scenario involves three participants and was confirmed
with additional logging instrumentation:
Thread A (connect): HCI disconnect: Thread B (connect):
sco_sock_connect(sk) sco_sock_connect(sk)
sk_state==BT_OPEN sk_state==BT_OPEN
(pass, no lock) (pass, no lock)
sco_connect(sk): sco_connect(sk):
hci_dev_lock hci_dev_lock
hci_connect_sco <- blocked
-> hcon1
sco_conn_add->conn1
lock_sock(sk)
sco_chan_add:
conn1->sk = sk
sk->conn = conn1
sk_state=BT_CONNECT
release_sock
hci_dev_unlock
hci_dev_lock
sco_conn_del:
lock_sock(sk)
sco_chan_del:
sk->conn=NULL
conn1->sk=NULL
sk_state=
BT_CLOSED
SOCK_ZAPPED
release_sock
hci_dev_unlock
(unblocked)
hci_connect_sco
-> hcon2
sco_conn_add
-> conn2
lock_sock(sk)
sco_chan_add:
sk->conn=conn2
sk_state=
BT_CONNECT
// zombie sk!
release_sock
hci_dev_unlock
Thread B revives a BT_CLOSED + SOCK_ZAPPED socket back to
BT_CONNECT. Subsequent cleanup triggers double sock_put() and
use-after-free. Meanwhile conn1 is leaked as it was orphaned
when sco_conn_del() cleared the association.
Fix this by:
- Moving lock_sock() before the sk_state/sk_type checks in
sco_sock_connect() to serialize concurrent connect attempts
- Fixing the sk_type != SOCK_SEQPACKET check to actually
return the error instead of just assigning it
- Adding a state re-check in sco_connect() after lock_sock()
to catch state changes during the window between the locks
- Adding sco_pi(sk)->conn check in sco_chan_add() to prevent
double-attach of a socket to multiple connections
- Adding hci_conn_drop() on sco_chan_add failure to prevent
HCI connection leaks
Fixes: 9a8ec9e8eb ("Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_run() may run the work immediately if called from existing
sync work (otherwise it queues a new sync work). In this case it fails
to call the destroy() function.
On immediate run, make it behave same way as if item was queued
successfully: call destroy, and return 0.
The only callsite is hci_abort_conn() via hci_cmd_sync_run_once(), and
this changes its return value. However, its return value is not used
except as the return value for hci_disconnect(), and nothing uses the
return value of hci_disconnect(). Hence there should be no behavior
change anywhere.
Fixes: c898f6d7b0 ("Bluetooth: hci_sync: Introduce hci_cmd_sync_run/hci_cmd_sync_run_once")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
nft_queue is always used from userspace nftables to deliver the NF_QUEUE
verdict. Immediately emitting an NF_QUEUE verdict is never used by the
userspace nft tools, so reject immediate NF_QUEUE verdicts.
The arp family does not provide queue support, but such an immediate
verdict is still reachable. Globally reject NF_QUEUE immediate verdicts
to address this issue.
Fixes: f342de4e2f ("netfilter: nf_tables: reject QUEUE/DROP verdict parameters")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Weiming Shi says:
xt_match and xt_target structs registered with NFPROTO_UNSPEC can be
loaded by any protocol family through nft_compat. When such a
match/target sets .hooks to restrict which hooks it may run on, the
bitmask uses NF_INET_* constants. This is only correct for families
whose hook layout matches NF_INET_*: IPv4, IPv6, INET, and bridge
all share the same five hooks (PRE_ROUTING ... POST_ROUTING).
ARP only has three hooks (IN=0, OUT=1, FORWARD=2) with different
semantics. Because NF_ARP_OUT == 1 == NF_INET_LOCAL_IN, the .hooks
validation silently passes for the wrong reasons, allowing matches to
run on ARP chains where the hook assumptions (e.g. state->in being
set on input hooks) do not hold. This leads to NULL pointer
dereferences; xt_devgroup is one concrete example:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000044: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000220-0x0000000000000227]
RIP: 0010:devgroup_mt+0xff/0x350
Call Trace:
<TASK>
nft_match_eval (net/netfilter/nft_compat.c:407)
nft_do_chain (net/netfilter/nf_tables_core.c:285)
nft_do_chain_arp (net/netfilter/nft_chain_filter.c:61)
nf_hook_slow (net/netfilter/core.c:623)
arp_xmit (net/ipv4/arp.c:666)
</TASK>
Kernel panic - not syncing: Fatal exception in interrupt
Fix it by restricting arptables to NFPROTO_ARP extensions only.
Note that arptables-legacy only supports:
- arpt_CLASSIFY
- arpt_mangle
- arpt_MARK
that provide explicit NFPROTO_ARP match/target declarations.
Fixes: 9291747f11 ("netfilter: xtables: add device group match")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mtype_del() counts empty slots below n->pos in k, but it only drops the
bucket when both n->pos and k are zero. This misses buckets whose live
entries have all been removed while n->pos still points past deleted slots.
Treat a bucket as empty when all positions below n->pos are unused and
release it directly instead of shrinking it further.
Fixes: 8af1c6fbd9 ("netfilter: ipset: Fix forceadd evaluation path")
Cc: stable@vger.kernel.org
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <dstsmallbird@foxmail.com>
Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the existing master conntrack helper, anything else is not really
supported and it just makes validation more complicated, so just ignore
what helper userspace suggests for this expectation.
This was uncovered when validating CTA_EXPECT_CLASS via different helper
provided by userspace than the existing master conntrack helper:
BUG: KASAN: slab-out-of-bounds in nf_ct_expect_related_report+0x2479/0x27c0
Read of size 4 at addr ffff8880043fe408 by task poc/102
Call Trace:
nf_ct_expect_related_report+0x2479/0x27c0
ctnetlink_create_expect+0x22b/0x3b0
ctnetlink_new_expect+0x4bd/0x5c0
nfnetlink_rcv_msg+0x67a/0x950
netlink_rcv_skb+0x120/0x350
Allowing to read kernel memory bytes off the expectation boundary.
CTA_EXPECT_HELP_NAME is still used to offer the helper name to userspace
via netlink dump.
Fixes: bd07793705 ("netfilter: nfnetlink_queue: allow to attach expectations to conntracks")
Reported-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ctnetlink_alloc_expect() allocates expectations from a non-zeroing
slab cache via nf_ct_expect_alloc(). When CTA_EXPECT_NAT is not
present in the netlink message, saved_addr and saved_proto are
never initialized. Stale data from a previous slab occupant can
then be dumped to userspace by ctnetlink_exp_dump_expect(), which
checks these fields to decide whether to emit CTA_EXPECT_NAT.
The safe sibling nf_ct_expect_init(), used by the packet path,
explicitly zeroes these fields.
Zero saved_addr, saved_proto and dir in the else branch, guarded
by IS_ENABLED(CONFIG_NF_NAT) since these fields only exist when
NAT is enabled.
Confirmed by priming the expect slab with NAT-bearing expectations,
freeing them, creating a new expectation without CTA_EXPECT_NAT,
and observing that the ctnetlink dump emits a spurious
CTA_EXPECT_NAT containing stale data from the prior allocation.
Fixes: 076a0ca026 ("netfilter: ctnetlink: add NAT support for expectations")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_helper_unregister() calls nf_ct_expect_iterate_destroy()
to remove expectations belonging to the helper being unregistered.
However, it passes NULL instead of the helper pointer as the data
argument, so expect_iter_me() never matches any expectation and all
of them survive the cleanup.
After unregister returns, nfnl_cthelper_del() frees the helper
object immediately. Subsequent expectation dumps or packet-driven
init_conntrack() calls then dereference the freed exp->helper,
causing a use-after-free.
Pass the actual helper pointer so expectations referencing it are
properly destroyed before the helper object is freed.
BUG: KASAN: slab-use-after-free in string+0x38f/0x430
Read of size 1 at addr ffff888003b14d20 by task poc/103
Call Trace:
string+0x38f/0x430
vsnprintf+0x3cc/0x1170
seq_printf+0x17a/0x240
exp_seq_show+0x2e5/0x560
seq_read_iter+0x419/0x1280
proc_reg_read+0x1ac/0x270
vfs_read+0x179/0x930
ksys_read+0xef/0x1c0
Freed by task 103:
The buggy address is located 32 bytes inside of
freed 192-byte region [ffff888003b14d00, ffff888003b14dc0)
Fixes: ac7b848390 ("netfilter: expect: add and use nf_ct_expect_iterate helpers")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPSET_ATTR_NAME and IPSET_ATTR_NAMEREF are of NLA_STRING type, they
cannot be treated like a c-string.
They either have to be switched to NLA_NUL_STRING, or the compare
operations need to use the nla functions.
Fixes: f830837f0e ("netfilter: ipset: list:set set type support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject names that lack a \0 character before feeding them
to functions that expect c-strings.
Fixes tag is the most recent commit that needs this change.
Fixes: c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a followup to an old bug fix: NLMSG_DONE needs to account
for the netlink header size, not just the attribute size.
This can result in a WARN splat + drop of the netlink message,
but other than this there are no ill effects.
Fixes: 9dfa1dfe4d ("netfilter: nf_log: account for size of NLMSG_DONE attribute")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The maximum number of flowtable hardware offload actions in IPv6 is:
* ethernet mangling (4 payload actions, 2 for each ethernet address)
* SNAT (4 payload actions)
* DNAT (4 payload actions)
* Double VLAN (4 vlan actions, 2 for popping vlan, and 2 for pushing)
for QinQ.
* Redirect (1 action)
Which makes 17, while the maximum is 16. But act_ct supports for tunnels
actions too. Note that payload action operates at 32-bit word level, so
mangling an IPv6 address takes 4 payload actions.
Update flow_action_entry_next() calls to check for the maximum number of
supported actions.
While at it, rise the maximum number of actions per flow from 16 to 24
so this works fine with IPv6 setups.
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
stmmac_vlan_restore() unconditionally calls stmmac_vlan_update() when
NETIF_F_VLAN_FEATURES is set. On platforms where priv->hw->vlan (or
->update_vlan_hash) is not provided, stmmac_update_vlan_hash() returns
-EINVAL via stmmac_do_void_callback(), resulting in a spurious
"Failed to restore VLANs" error even when no VLAN filtering is in use.
Remove not needed comment.
Remove not used return value from stmmac_vlan_restore().
Tested on Orange Pi Zero 3.
Fixes: bd7ad51253 ("net: stmmac: Fix VLAN HW state restore")
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Link: https://patch.msgid.link/20260328-vlan-restore-error-v4-1-f88624c530dc@mmpsystems.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ftgmac100_alloc_rings() allocates rx_skbs, tx_skbs, rxdes, txdes, and
rx_scratch in stages. On intermediate failures it returned -ENOMEM
directly, leaking resources allocated earlier in the function.
Rework the failure path to use staged local unwind labels and free
allocated resources in reverse order before returning -ENOMEM. This
matches common netdev allocation cleanup style.
Fixes: d72e01a043 ("ftgmac100: Use a scratch buffer for failed RX allocations")
Cc: stable@vger.kernel.org
Signed-off-by: Yufan Chen <yufan.chen@linux.dev>
Link: https://patch.msgid.link/20260328163257.60836-1-yufan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`ip6fl_seq_show()` walks the global flowlabel hash under the seq-file
RCU read-side lock and prints `fl->opt->opt_nflen` when an option block
is present.
Exclusive flowlabels currently free `fl->opt` as soon as `fl->users`
drops to zero in `fl_release()`. However, the surrounding
`struct ip6_flowlabel` remains visible in the global hash table until
later garbage collection removes it and `fl_free_rcu()` finally tears it
down.
A concurrent `/proc/net/ip6_flowlabel` reader can therefore race that
early `kfree()` and dereference freed option state, triggering a crash
in `ip6fl_seq_show()`.
Fix this by keeping `fl->opt` alive until `fl_free_rcu()`. That matches
the lifetime already required for the enclosing flowlabel while readers
can still reach it under RCU.
Fixes: d3aedd5ebd ("ipv6 flowlabel: Convert hash list to RCU.")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/07351f0ec47bcee289576f39f9354f4a64add6e4.1774855883.git.zcliangcn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull sched_ext fixes from Tejun Heo:
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other
in hardirq context form a cycle. Move the wait to a balance callback
which can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where
the waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when
per-node idle tracking is disabled.
* tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test
sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback
sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
Pull workqueue fix from Tejun Heo:
- Fix false positive stall reports on weakly ordered architectures
where the lockless worklist/timestamp check in the watchdog can
observe stale values due to memory reordering.
Recheck under pool->lock to confirm.
* tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Better describe stall check
workqueue: Fix false positive stall reports
Pull cgroup fixes from Tejun Heo:
- Fix cgroup rmdir racing with dying tasks.
Deferred task cgroup unlink introduced a window where cgroup.procs
is empty but the cgroup is still populated, causing rmdir to fail
with -EBUSY and selftest failures.
Make rmdir wait for dying tasks to fully leave and fix selftests to
not depend on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies.
When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be
migrated to an ancestor without a security_task_setscheduler() check
that would block the migration.
* tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Skip security check for hotplug induced v1 task migration
cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach()
cgroup: Fix cgroup_drain_dying() testing the wrong condition
selftests/cgroup: Don't require synchronous populated update on task exit
cgroup: Wait for dying tasks to leave on rmdir
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the
cpuset hotplug handler will schedule a work function to migrate tasks
in that cpuset with no CPU to its ancestor to enable those tasks to
continue running.
If a strict security policy is in place, however, the task migration
may fail when security_task_setscheduler() call in cpuset_can_attach()
returns a -EACCES error. That will mean that those tasks will have
no CPU to run on. The system administrators will have to explicitly
intervene to either add CPUs to that cpuset or move the tasks elsewhere
if they are aware of it.
This problem was found by a reported test failure in the LTP's
cpuset_hotplug_test.sh. Fix this problem by treating this special case as
an exception to skip the setsched security check in cpuset_can_attach()
when a v1 cpuset with tasks have no CPU left.
With that patch applied, the cpuset_hotplug_test.sh test can be run
successfully without failure.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Centralize the check required to run security_task_setscheduler() in
the task iteration loop of cpuset_can_attach() outside of the loop as
it has no dependency on the characteristics of the tasks themselves.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull udf fix from Jan Kara:
"Fix for a race in UDF that can lead to memory corruption"
* tag 'fs_for_v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix race between file type conversion and writeback
mpage: Provide variant of mpage_writepages() with own optional folio handler
The Lenovo Yoga Pro 7 14IMH9 (DMI: 83E2) shares PCI SSID 17aa:3847
with the Legion 7 16ACHG6, but has a different codec subsystem ID
(17aa:38cf). The existing SND_PCI_QUIRK for 17aa:3847 applies
ALC287_FIXUP_LEGION_16ACHG6, which attempts to initialize an external
I2C amplifier (CLSA0100) that is not present on the Yoga Pro 7 14IMH9.
As a result, pin 0x17 (bass speakers) is connected to DAC 0x06 which
has no volume control, making hardware volume adjustment completely
non-functional. Audio is either silent or at maximum volume regardless
of the slider position.
Add a HDA_CODEC_QUIRK entry using the codec subsystem ID (17aa:38cf)
to correctly identify the Yoga Pro 7 14IMH9 and apply
ALC287_FIXUP_YOGA9_14IMH9_BASS_SPK_PIN, which redirects pin 0x17 to
DAC 0x02 and restores proper volume control. The existing Legion entry
is preserved unchanged.
This follows the same pattern used for 17aa:386e, where Legion Y9000X
and Yoga Pro 7 14ARP8 share a PCI SSID but are distinguished via
HDA_CODEC_QUIRK.
Link: https://github.com/nomad4tech/lenovo-yoga-pro-7-linux
Tested-by: Alexander Savenko <alex.sav4387@gmail.com>
Signed-off-by: Alexander Savenko <alex.sav4387@gmail.com>
Link: https://patch.msgid.link/20260331082929.44890-1-alex.sav4387@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
br_mrp_start_test() and br_mrp_start_in_test() accept the user-supplied
interval value from netlink without validation. When interval is 0,
usecs_to_jiffies(0) yields 0, causing the delayed work
(br_mrp_test_work_expired / br_mrp_in_test_work_expired) to reschedule
itself with zero delay. This creates a tight loop on system_percpu_wq
that allocates and transmits MRP test frames at maximum rate, exhausting
all system memory and causing a kernel panic via OOM deadlock.
The same zero-interval issue applies to br_mrp_start_in_test_parse()
for interconnect test frames.
Use NLA_POLICY_MIN(NLA_U32, 1) in the nla_policy tables for both
IFLA_BRIDGE_MRP_START_TEST_INTERVAL and
IFLA_BRIDGE_MRP_START_IN_TEST_INTERVAL, so zero is rejected at the
netlink attribute parsing layer before the value ever reaches the
workqueue scheduling code. This is consistent with how other bridge
subsystems (br_fdb, br_mst) enforce range constraints on netlink
attributes.
Fixes: 20f6a05ef6 ("bridge: mrp: Rework the MRP netlink interface")
Fixes: 7ab1748e4c ("bridge: mrp: Extend MRP netlink interface for configuring MRP interconnect")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260328063000.1845376-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit c073f07576 ("ASoC: Intel: sof_sdw: select PINCTRL_CS42L43 and SPI_CS42L43")
Currently, SND_SOC_INTEL_SOUNDWIRE_SOF_MACH selects PINCTRL_CS42L43
without also selecting or depending on PINCTRL, despite PINCTRL_CS42L43
depending on PINCTRL.
See the following Kbuild warning:
WARNING: unmet direct dependencies detected for PINCTRL_CS42L43
Depends on [n]: PINCTRL [=n] && MFD_CS42L43 [=m]
Selected by [m]:
- SND_SOC_INTEL_SOUNDWIRE_SOF_MACH [=m] && SOUND [=y] && SND [=m] && SND_SOC [=m] && SND_SOC_INTEL_MACH [=y] && (SND_SOC_SOF_INTEL_COMMON [=m] || !SND_SOC_SOF_INTEL_COMMON [=m]) && SND_SOC_SOF_INTEL_SOUNDWIRE [=m] && I2C [=y] && SPI_MASTER [=y] && ACPI [=y] && (MFD_INTEL_LPSS [=n] || COMPILE_TEST [=y]) && (SND_SOC_INTEL_USER_FRIENDLY_LONG_NAMES [=n] || COMPILE_TEST [=y]) && SOUNDWIRE [=m]
In response to v1 of this patch [1], Arnd pointed out that there is
no compile-time dependency sof_sdw and the PINCTRL_CS42L43 driver.
After testing, I can confirm that the kernel compiled with
SND_SOC_INTEL_SOUNDWIRE_SOF_MACH enabled and PINCTRL_CS42L43 disabled.
This unmet dependency was detected by kconfirm, a static analysis
tool for Kconfig.
Link: https://lore.kernel.org/all/b8aecc71-1fed-4f52-9f6c-263fbe56d493@app.fastmail.com/ [1]
Fixes: c073f07576 ("ASoC: Intel: sof_sdw: select PINCTRL_CS42L43 and SPI_CS42L43")
Signed-off-by: Julian Braha <julianbraha@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260325001522.1727678-1-julianbraha@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In rt5660_hw_params(), the error path for snd_soc_dai_set_sysclk()
correctly uses rtd->dev as the logging device, but the error path for
snd_soc_dai_set_pll() uses codec_dai->dev instead.
These two devices are distinct:
- rtd->dev is the platform device of the PCM runtime (the Intel HDA/SSP
controller, e.g. 0000:00:1f.3), which owns the machine driver callback.
- codec_dai->dev is the I2C device of the rt5660 codec itself
(i2c-10EC5660:00).
Since hw_params is a machine driver operation and both calls are made
within the same function from the machine driver's context, all error
messages should be attributed to rtd->dev. Using codec_dai->dev for one
of them would suggest the error originates inside the codec driver,
which is misleading.
Align the PLL error log with the sysclk one to use rtd->dev, matching
the convention used by all other Intel board drivers in this directory.
Signed-off-by: Sachin Mokashi <sachin.mokashi@intel.com>
Link: https://patch.msgid.link/20260327131439.1330373-1-sachin.mokashi@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Suraj Gupta says:
====================
Correct BD length masks and BQL accounting for multi-BD TX packets
This patch series fixes two issues in the Xilinx AXI Ethernet driver:
1. Corrects the BD length masks to match the AXIDMA IP spec.
2. Fixes BQL accounting for multi-BD TX packets.
====================
Link: https://patch.msgid.link/20260327073238.134948-1-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a TX packet spans multiple buffer descriptors (scatter-gather),
axienet_free_tx_chain sums the per-BD actual length from descriptor
status into a caller-provided accumulator. That sum is reset on each
NAPI poll. If the BDs for a single packet complete across different
polls, the earlier bytes are lost and never credited to BQL. This
causes BQL to think bytes are permanently in-flight, eventually
stalling the TX queue.
The SKB pointer is stored only on the last BD of a packet. When that
BD completes, use skb->len for the byte count instead of summing
per-BD status lengths. This matches netdev_sent_queue(), which debits
skb->len, and naturally survives across polls because no partial
packet contributes to the accumulator.
Fixes: c900e49d58 ("net: xilinx: axienet: Implement BQL")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-3-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The XAXIDMA_BD_CTRL_LENGTH_MASK and XAXIDMA_BD_STS_ACTUAL_LEN_MASK
macros were defined as 0x007FFFFF (23 bits), but the AXI DMA IP
product guide (PG021) specifies the buffer length field as bits 25:0
(26 bits). Update both masks to match the IP documentation.
In practice this had no functional impact, since Ethernet frames are
far smaller than 2^23 bytes and the extra bits were always zero, but
the masks should still reflect the hardware specification.
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-2-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pn532_receive_buf() appends every incoming byte to dev->recv_skb and
only resets the buffer after pn532_uart_rx_is_frame() recognizes a
complete frame. A continuous stream of bytes without a valid PN532 frame
header therefore keeps growing the skb until skb_put_u8() hits the tail
limit.
Drop the accumulated partial frame once the fixed receive buffer is full
so malformed UART traffic cannot grow the skb past
PN532_UART_SKB_BUFF_LEN.
Fixes: c656aa4c27 ("nfc: pn533: add UART phy driver")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260326142033.82297-1-pengpeng@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bond_xmit_broadcast() reuses the original skb for the last slave
(determined by bond_is_last_slave()) and clones it for others.
Concurrent slave enslave/release can mutate the slave list during
RCU-protected iteration, changing which slave is "last" mid-loop.
This causes the original skb to be double-consumed (double-freed).
Replace the racy bond_is_last_slave() check with a simple index
comparison (i + 1 == slaves_count) against the pre-snapshot slave
count taken via READ_ONCE() before the loop. This preserves the
zero-copy optimization for the last slave while making the "last"
determination stable against concurrent list mutations.
The UAF can trigger the following crash:
==================================================================
BUG: KASAN: slab-use-after-free in skb_clone
Read of size 8 at addr ffff888100ef8d40 by task exploit/147
CPU: 1 UID: 0 PID: 147 Comm: exploit Not tainted 7.0.0-rc3+ #4 PREEMPTLAZY
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
kasan_report (mm/kasan/report.c:597)
skb_clone (include/linux/skbuff.h:1724 include/linux/skbuff.h:1792 include/linux/skbuff.h:3396 net/core/skbuff.c:2108)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5334)
bond_start_xmit (drivers/net/bonding/bond_main.c:5567 drivers/net/bonding/bond_main.c:5593)
dev_hard_start_xmit (include/linux/netdevice.h:5325 include/linux/netdevice.h:5334 net/core/dev.c:3871 net/core/dev.c:3887)
__dev_queue_xmit (include/linux/netdevice.h:3601 net/core/dev.c:4838)
ip6_finish_output2 (include/net/neighbour.h:540 include/net/neighbour.h:554 net/ipv6/ip6_output.c:136)
ip6_finish_output (net/ipv6/ip6_output.c:208 net/ipv6/ip6_output.c:219)
ip6_output (net/ipv6/ip6_output.c:250)
ip6_send_skb (net/ipv6/ip6_output.c:1985)
udp_v6_send_skb (net/ipv6/udp.c:1442)
udpv6_sendmsg (net/ipv6/udp.c:1733)
__sys_sendto (net/socket.c:730 net/socket.c:742 net/socket.c:2206)
__x64_sys_sendto (net/socket.c:2209)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Allocated by task 147:
Freed by task 147:
The buggy address belongs to the object at ffff888100ef8c80
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 192 bytes inside of
freed 224-byte region [ffff888100ef8c80, ffff888100ef8d60)
Memory state around the buggy address:
ffff888100ef8c00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888100ef8c80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888100ef8d00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888100ef8d80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888100ef8e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 4e5bd03ae3 ("net: bonding: fix bond_xmit_broadcast return value error bug")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260326075553.3960562-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The recent refactoring of xfi driver changed the assignment of
atc->daios[] at atc_get_resources(); now it loops over all enum
DAIOTYP entries while it looped formerly only a part of them.
The problem is that the last entry, SPDIF1, is a special type that
is used only for hw20k1 CTSB073X model (as a replacement of SPDIFIO),
and there is no corresponding definition for hw20k2. Due to the lack
of the info, it caused a kernel crash on hw20k2, which was already
worked around by the commit b045ab3dff ("ALSA: ctxfi: Fix missing
SPDIFI1 index handling").
This patch addresses the root cause of the regression above properly,
simply by skipping the incorrect SPDIF1 type in the parser loop.
For making the change clearer, the code is slightly arranged, too.
Fixes: a2dbaeb5c6 ("ALSA: ctxfi: Refactor resource alloc for sparse mappings")
Cc: <stable@vger.kernel.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1259925
Link: https://patch.msgid.link/20260331081227.216134-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
bnxt_hwrm_func_backing_store_qcaps_v2() stores resp->type from the
firmware response in ctxm->type and later uses that value to index
fixed backing-store metadata arrays such as ctx_arr[] and
bnxt_bstore_to_trace[].
ctxm->type is fixed by the current backing-store query type and matches
the array index of ctx->ctx_arr. Set ctxm->type from the current loop
variable instead of depending on resp->type.
Also update the loop to advance type from next_valid_type in the for
statement, which keeps the control flow simpler for non-valid and
unchanged entries.
Fixes: 6a4d0774f0 ("bnxt_en: Add support for new backing store query firmware API")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260328234357.43669-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netfilter flowtable can theoretically try to offload flower rules as soon
as a net_device is registered while all the other ones are not
registered or initialized, triggering a possible NULL pointer dereferencing
of qdma pointer in airoha_ppe_set_cpu_port routine. Moreover, if
register_netdev() fails for a particular net_device, there is a small
race if Netfilter tries to offload flowtable rules before all the
net_devices are properly unregistered in airoha_probe() error patch,
triggering a NULL pointer dereferencing in airoha_ppe_set_cpu_port
routine. In order to avoid any possible race, delay offloading until
all net_devices are registered in the networking subsystem.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260329-airoha-regiser-race-fix-v2-1-f4ebb139277b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building netlink messages, tc_chain_fill_node() never initializes
the tcm_info field of struct tcmsg. Since the allocation is not zeroed,
kernel heap memory is leaked to userspace through this 4-byte field.
The fix simply zeroes tcm_info alongside the other fields that are
already initialized.
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260328211436.1010152-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull crypto library fix from Eric Biggers:
"Fix missing zeroization of the ChaCha state"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: chacha: Zeroize permuted_state before it leaves scope
Pull rtla build fix from Steven Rostedt:
- Fix build failure when libbpf does not exist
RTLA supports building without BPF libraries, but a recent change
added a libbpf.h include outside of the BPF protection which caused
build failures when libbpf was not installed.
* tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix build without libbpf header
ep93xx_i2s_enable() calls clk_prepare_enable() on three clocks in
sequence (mclk, sclk, lrclk) without checking the return value of any
of them. If an intermediate enable fails, the clocks that were already
enabled are never rolled back, leaking them until the next disable cycle
— which may never come if the stream never started cleanly.
Change ep93xx_i2s_enable() from void to int. Add error checking after
each clk_prepare_enable() call and unwind already-enabled clocks on
failure. Propagate the error through ep93xx_i2s_startup() and
ep93xx_i2s_resume(), both of which already return int.
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: f4ff6b56bc ("ASoC: cirrus: i2s: Prepare clock before using it")
Link: https://patch.msgid.link/20260324210909.45494-1-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Add a test that creates a 3-CPU kick_wait cycle (A->B->C->A). A BPF
scheduler kicks the next CPU in the ring with SCX_KICK_WAIT on every
enqueue while userspace workers generate continuous scheduling churn via
sched_yield(). Without the preceding fix, this hangs the machine within seconds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using
smp_cond_load_acquire() until the target CPU's kick_sync advances. Because
the irq_work runs in hardirq context, the waiting CPU cannot reschedule and
its own kick_sync never advances. If multiple CPUs form a wait cycle, all
CPUs deadlock.
Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to
force the CPU through do_pick_task_scx(), which queues a balance callback
to perform the wait. The balance callback drops the rq lock and enables
IRQs following the sched_core_balance() pattern, so the CPU can process
IPIs while waiting. The local CPU's kick_sync is advanced on entry to
do_pick_task_scx() and continuously during the wait, ensuring any CPU that
starts waiting for us sees the advancement and cannot form cyclic
dependencies.
Fixes: 90e55164da ("sched_ext: Implement SCX_KICK_WAIT")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Christian Loehle <christian.loehle@arm.com>
Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Component has "card_aux_list" which is added/deled in bind/unbind aux dev
function (A), and used in for_each_card_auxs() loop (B).
static void soc_unbind_aux_dev(...)
{
...
for_each_card_auxs_safe(...) {
...
(A) list_del(&component->card_aux_list);
} ^^^^^^^^^^^^^
}
static int soc_bind_aux_dev(...)
{
...
for_each_card_pre_auxs(...) {
...
(A) list_add(&component->card_aux_list, ...);
} ^^^^^^^^^^^^^
...
}
#define for_each_card_auxs(card, component) \
(B) list_for_each_entry(component, ..., card_aux_list)
^^^^^^^^^^^^^
But it has been used without calling INIT_LIST_HEAD().
> git grep card_aux_list sound/soc
sound/soc/soc-core.c: list_del(&component->card_aux_list);
sound/soc/soc-core.c: list_add(&component->card_aux_list, ...);
call missing INIT_LIST_HEAD() for it.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/87341mxa8l.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The loop creates a whitespace-stripped copy of the card shortname
where `len < sizeof(card->id)` is used for the bounds check. Since
sizeof(card->id) is 16 and the local id buffer is also 16 bytes,
writing 16 non-space characters fills the entire buffer,
overwriting the terminating nullbyte.
When this non-null-terminated string is later passed to
snd_card_set_id() -> copy_valid_id_string(), the function scans
forward with `while (*nid && ...)` and reads past the end of the
stack buffer, reading the contents of the stack.
A USB device with a product name containing many non-ASCII, non-space
characters (e.g. multibyte UTF-8) will reliably trigger this as follows:
BUG: KASAN: stack-out-of-bounds in copy_valid_id_string
sound/core/init.c:696 [inline]
BUG: KASAN: stack-out-of-bounds in snd_card_set_id_no_lock+0x698/0x74c
sound/core/init.c:718
The off-by-one has been present since commit bafeee5b1f ("ALSA:
snd_usb_caiaq: give better shortname") from June 2009 (v2.6.31-rc1),
which first introduced this whitespace-stripping loop. The original
code never accounted for the null terminator when bounding the copy.
Fix this by changing the loop bound to `sizeof(card->id) - 1`,
ensuring at least one byte remains as the null terminator.
Fixes: bafeee5b1f ("ALSA: snd_usb_caiaq: give better shortname")
Cc: stable@vger.kernel.org
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: Berk Cem Goksel <berkcgoksel@gmail.com>
Signed-off-by: Berk Cem Goksel <berkcgoksel@gmail.com>
Link: https://patch.msgid.link/20260329133825.581585-1-berkcgoksel@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pull vfs fixes from Christian Brauner:
- Fix netfs_limit_iter() hitting BUG() when an ITER_KVEC iterator
reaches it via core dump writes to 9P filesystems. Add ITER_KVEC
handling following the same pattern as the existing ITER_BVEC code.
- Fix a NULL pointer dereference in the netfs unbuffered write retry
path when the filesystem (e.g., 9P) doesn't set the prepare_write
operation.
- Clear I_DIRTY_TIME in sync_lazytime for filesystems implementing
->sync_lazytime. Without this the flag stays set and may cause
additional unnecessary calls during inode deactivation.
- Increase tmpfs size in mount_setattr selftests. A recent commit
bumped the ext4 image size to 2 GB but didn't adjust the tmpfs
backing store, so mkfs.ext4 fails with ENOSPC writing metadata.
- Fix an invalid folio access in iomap when i_blkbits matches the folio
size but differs from the I/O granularity. The cur_folio pointer
would not get invalidated and iomap_read_end() would still be called
on it despite the IO helper owning it.
- Fix hash_name() docstring.
- Fix read abandonment during netfs retry where the subreq variable
used for abandonment could be uninitialized on the first pass or
point to a deleted subrequest on later passes.
- Don't block sync for filesystems with no data integrity guarantees.
Add a SB_I_NO_DATA_INTEGRITY superblock flag replacing the per-inode
AS_NO_DATA_INTEGRITY mapping flag so sync kicks off writeback but
doesn't wait for flusher threads. This fixes a suspend-to-RAM hang on
fuse-overlayfs where the flusher thread blocks when the fuse daemon
is frozen.
- Fix a lockdep splat in iomap when reads fail. iomap_read_end_io()
invokes fserror_report() which calls igrab() taking i_lock in hardirq
context while i_lock is normally held with interrupts enabled. Kick
failed read handling to a workqueue.
- Remove the redundant netfs_io_stream::front member and use
stream->subrequests.next instead, fixing a potential issue in the
direct write code path.
* tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the handling of stream->front by removing it
iomap: fix lockdep complaint when reads fail
writeback: don't block sync for filesystems with no data integrity guarantees
netfs: Fix read abandonment during retry
vfs: fix docstring of hash_name()
iomap: fix invalid folio access when i_blkbits differs from I/O granularity
selftests/mount_setattr: increase tmpfs size for idmapped mount tests
fs: clear I_DIRTY_TIME in sync_lazytime
netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retry
netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators
Pull phy fixes from Vinod Koul:
- Qualcomm PCS table fix for ufs phy
- TI device node reference fix
- Common prop kconfig fix
- lynx CDR lock workaround for lanes disabled
- usb disconnect function fix of k1 driver
* tag 'phy-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom: qmp-ufs: Fix SM8650 PCS table for Gear 4
phy: ti: j721e-wiz: Fix device node reference leak in wiz_get_lane_phy_types()
phy: k1-usb: add disconnect function support
phy: lynx-28g: skip CDR lock workaround for lanes disabled in the device tree
phy: make PHY_COMMON_PROPS Kconfig symbol conditionally user-selectable
Pull dmaengine fixes from Vinod Koul:
"A bunch of driver fixes with idxd ones being the biggest:
- Xilinx regmap init error handling, dma_device directions, residue
calculation, and reset related timeout fixes
- Renesas CHCTRL updates and driver list fixes
- DW HDMA cycle bits and MSI data programming fix
- IDXD pile of fixes for memeory leak and FLR fixes"
* tag 'dmaengine-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (21 commits)
dmaengine: xilinx_dma: Fix reset related timeout with two-channel AXIDMA
dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction
dmaengine: xilinx: xilinx_dma: Fix residue calculation for cyclic DMA
dmaengine: xilinx: xilinx_dma: Fix dma_device directions
dmaengine: sh: rz-dmac: Move CHCTRL updates under spinlock
dmaengine: sh: rz-dmac: Protect the driver specific lists
dmaengine: idxd: fix possible wrong descriptor completion in llist_abort_desc()
dmaengine: xilinx: xdma: Fix regmap init error handling
dmaengine: dw-edma: Fix multiple times setting of the CYCLE_STATE and CYCLE_BIT bits for HDMA.
dmaengine: idxd: Fix leaking event log memory
dmaengine: idxd: Fix freeing the allocated ida too late
dmaengine: idxd: Fix memory leak when a wq is reset
dmaengine: idxd: Fix not releasing workqueue on .release()
dmaengine: idxd: Wait for submitted operations on .device_synchronize()
dmaengine: idxd: Flush all pending descriptors
dmaengine: idxd: Flush kernel workqueues on Function Level Reset
dmaengine: idxd: Fix possible invalid memory access after FLR
dmaengine: idxd: Fix crash when the event log is disabled
dmaengine: idxd: Fix lockdep warnings when calling idxd_device_config()
dmaengine: dw-edma: fix MSI data programming for multi-IRQ case
...
Pull i2c fixes from Wolfram Sang:
- designware: fix resume-probe race causing NULL-deref in amdisp
- imx: fix timeout on repeated reads and extra clock at end
- MAINTAINERS: drop outdated I2C website
* tag 'i2c-for-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: drop outdated I2C website
i2c: designware: amdisp: Fix resume-probe race condition issue
i2c: imx: ensure no clock is generated after last read
i2c: imx: fix i2c issue when reading multiple messages
Pull kvm fixes from Paolo Bonzini:
"s390:
- Lots of small and not-so-small fixes for the newly rewritten gmap,
mostly affecting the handling of nested guests.
x86:
- Fix an issue with shadow paging, which causes KVM to install an
MMIO PTE in the shadow page tables without first zapping a non-MMIO
SPTE if KVM didn't see the write that modified the shadowed guest
PTE.
While commit a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs
directly in mmu_set_spte()") was right about it being impossible to
miss such a write if it was coming from the guest, it failed to
account for writes to guest memory that are outside the scope of
KVM: if userspace modifies the guest PTE, and then the guest hits a
relevant page fault, KVM will get confused"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE
KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE
KVM: s390: Fix KVM_S390_VCPU_FAULT ioctl
KVM: s390: vsie: Fix guest page tables protection
KVM: s390: vsie: Fix unshadowing while shadowing
KVM: s390: vsie: Fix refcount overflow for shadow gmaps
KVM: s390: vsie: Fix nested guest memory shadowing
KVM: s390: Correctly handle guest mappings without struct page
KVM: s390: Fix gmap_link()
KVM: s390: vsie: Fix check for pre-existing shadow mapping
KVM: s390: Remove non-atomic dat_crstep_xchg()
KVM: s390: vsie: Fix dat_split_ste()
Pull xen fix from Juergen Gross:
"A single fix for a very rare bug introduced in rc5"
* tag 'for-linus-7.0a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: unregister xenstore notifier on module exit
Pull x86 fixes from Ingo Molnar:
- Fix an early boot crash in AMD SEV-SNP guests, caused by incorrect
FSGSBASE init ordering (Nikunj A Dadhania)
- Remove X86_CR4_FRED from the CR4 pinned bits mask, to fix a race
window during the bootup of SEV-{ES,SNP} or TDX guests, which can
crash them if they trigger exceptions in that window (Borislav
Petkov)
- Fix early boot failures on SEV-ES/SNP guests, due to incorrect early
GHCB access (Nikunj A Dadhania)
- Add clarifying comment to the CRn pinning logic, to avoid future
confusion & bugs (Peter Zijlstra)
* tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add comment clarifying CRn pinning
x86/fred: Fix early boot failures on SEV-ES/SNP guests
x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits mask
x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()
Pull timer fix from Ingo Molnar:
"Fix an argument order bug in the alarm timer forwarding logic, which
may cause missed expirations or incorrect overrun accounting"
* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix argument order in alarm_timer_forward()
Pull futex fixes from Ingo Molnar:
- Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
futex flags and potential UaF access (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path, which
triggered a warning (and potential misbehavior) in stress-testing
(Davidlohr Bueso)
* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Clear stale exiting pointer in futex_lock_pi() retry path
futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
futex: Require sys_futex_requeue() to have identical flags
Pull overlayfs fixes from Amir Goldstein:
- Fix regression in 'xino' feature detection
I clumsily introduced this regression myself when working on another
subsystem (fsnotify). Both the regression and the fix have almost no
visible impact on users except for some kmsg prints.
- Fix to performance regression in v6.12.
This regression was reported by Google COS developers.
It is not uncommon these days for the year-old mature LTS to get
adopted by distros and get exposed to many new workloads. We made a
sub-smart move of making a behavior change in v6.12 which could
impact performance, without making it opt-in. Fixing this mistake
retroactively, to be picked by LTS.
* tag 'ovl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: make fsync after metadata copy-up opt-in mount option
ovl: fix wrong detection of 32bit inode numbers
Pull ext4 fixes from Ted Ts'o:
- Update the MAINTAINERS file to add reviewers for the ext4 file system
- Add a test issue an ext4 warning (not a WARN_ON) if there are still
dirty pages attached to an evicted inode.
- Fix a number of Syzkaller issues
- Fix memory leaks on error paths
- Replace some BUG and WARN with EFSCORRUPTED reporting
- Fix a potential crash when disabling discard via remount followed by
an immediate unmount. (Found by Sashiko)
- Fix a corner case which could lead to allocating blocks for an
indirect-mapped inode block numbers > 2**32
- Fix a race when reallocating a freed inode that could result in a
deadlock
- Fix a user-after-free in update_super_work when racing with umount
- Fix build issues when trying to build ext4's kunit tests as a module
- Fix a bug where ext4_split_extent_zeroout() could fail to pass back
an error from ext4_ext_dirty()
- Avoid allocating blocks from a corrupted block group in
ext4_mb_find_by_goal()
- Fix a percpu_counters list corruption BUG triggered by an ext4
extents kunit
- Fix a potetial crash caused by the fast commit flush path potentially
accessing the jinode structure before it is fully initialized
- Fix fsync(2) in no-journal mode to make sure the dirtied inode is
write to storage
- Fix a bug when in no-journal mode, when ext4 tries to avoid using
recently deleted inodes, if lazy itable initialization is enabled,
can lead to an unitialized inode getting skipped and triggering an
e2fsck complaint
- Fix journal credit calculation when setting an xattr when both the
encryption and ea_inode feeatures are enabled
- Fix corner cases which could result in stale xarray tags after
writeback
- Fix generic/475 failures caused by ENOSPC errors while creating a
symlink when the system crashes resulting to a file system
inconsistency when replaying the fast commit journal
* tag 'ext4_for_linus-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (27 commits)
ext4: always drain queued discard work in ext4_mb_release()
ext4: handle wraparound when searching for blocks for indirect mapped blocks
ext4: skip split extent recovery on corruption
ext4: fix iloc.bh leak in ext4_fc_replay_inode() error paths
ext4: fix deadlock on inode reallocation
ext4: fix use-after-free in update_super_work when racing with umount
ext4: fix the might_sleep() warnings in kvfree()
ext4: reject mount if bigalloc with s_first_data_block != 0
ext4: fix extents-test.c is not compiled when EXT4_KUNIT_TESTS=M
ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M
ext4: introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper
jbd2: gracefully abort on checkpointing state corruptions
ext4: avoid infinite loops caused by residual data
ext4: validate p_idx bounds in ext4_ext_correct_indexes
ext4: test if inode's all dirty pages are submitted to disk
ext4: minor fix for ext4_split_extent_zeroout()
ext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal()
ext4: kunit: extents-test: lix percpu_counters list corruption
ext4: publish jinode after initialization
ext4: replace BUG_ON with proper error handling in ext4_read_inline_folio
...
Pull btrfs fixes from David Sterba:
"A few more fixes. There's one that stands out in size as it fixes an
edge case in fsync.
- fix issue on fsync where file with zero size appears as a non-zero
after log replay
- in zlib compression, handle a crash when data alignment causes
folio reference issues
- fix possible crash with enabled tracepoints on a overlayfs mount
- handle device stats update error
- on zoned filesystems, fix kobject leak on sub-block groups
- fix super block offset in an error message in validation"
* tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lost error when running device stats on multiple devices fs
btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()
btrfs: zlib: handle page aligned compressed size correctly
btrfs: fix leak of kobject name for sub-group space_info
btrfs: fix zero size inode with non-zero size after log replay
btrfs: fix super block offset in error message in btrfs_validate_super()
Pull misc fixes from Andrew Morton:
"10 hotfixes. 8 are cc:stable. 9 are for MM.
There's a 3-patch series of DAMON fixes from Josh Law and SeongJae
Park. The rest are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mseal: update VMA end correctly on merge
bug: avoid format attribute warning for clang as well
mm/pagewalk: fix race between concurrent split and refault
mm/memory: fix PMD/PUD checks in follow_pfnmap_start()
mm/damon/sysfs: check contexts->nr in repeat_call_fn
mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0]
mm/damon/sysfs: fix param_ctx leak on damon_sysfs_new_test_ctx() failure
mm/swap: fix swap cache memcg accounting
MAINTAINERS, mailmap: update email address for Harry Yoo
mm/huge_memory: fix folio isn't locked in softleaf_to_folio()
As stated on the website: "This wiki has been archived and the content
is no longer updated." No need to reference it.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull tracing fixes from Steven Rostedt:
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be called by a osnoise thread and the CPU
shutdown logic of osnoise can wait for this thread to finish. But
cpus_read_lock() can also be taken while holding the interface_lock.
This produces a circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent
this circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Drain deferred trigger frees if kthread creation fails
tracing: Fix potential deadlock in cpu hotplug with osnoise
Pull s390 fixes from Vasily Gorbik:
- Add array_index_nospec() to syscall dispatch table lookup to prevent
limited speculative out-of-bounds access with user-controlled syscall
number
- Mark array_index_mask_nospec() __always_inline since GCC may emit an
out-of-line call instead of the inline data dependency sequence the
mitigation relies on
- Clear r12 on kernel entry to prevent potential speculative use of
user value in system_call, ext/io/mcck interrupt handlers
* tag 's390-7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/entry: Scrub r12 register on kernel entry
s390/syscalls: Add spectre boundary for syscall dispatch table
s390/barrier: Make array_index_mask_nospec() __always_inline
Fuzzying/stressing futexes triggered:
WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524
When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.
After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().
CPU0 CPU1 CPU2
futex_lock_pi(uaddr)
// acquires the PI futex
exit()
futex_cleanup_begin()
futex_state = EXITING;
futex_lock_pi(uaddr)
futex_lock_pi_atomic()
attach_to_pi_owner()
// observes EXITING
*exiting = owner; // takes ref
return -EBUSY
wait_for_owner_exiting(-EBUSY, owner)
put_task_struct(); // drops ref
// exiting still points to owner
goto retry;
futex_lock_pi_atomic()
lock_pi_update_atomic()
cmpxchg(uaddr)
*uaddr ^= WAITERS // whatever
// value changed
return -EAGAIN;
wait_for_owner_exiting(-EAGAIN, exiting) // stale
WARN_ON_ONCE(exiting)
Fix this by resetting upon retry, essentially aligning it with requeue_pi.
Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
Boot-time trigger registration can fail before the trigger-data cleanup
kthread exists. Deferring those frees until late init is fine, but the
post-boot fallback must still drain the deferred list if kthread
creation never succeeds.
Otherwise, boot-deferred nodes can accumulate on
trigger_data_free_list, later frees fall back to synchronously freeing
only the current object, and the older queued entries are leaked
forever.
To trigger this, add the following to the kernel command line:
trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon
The second traceon trigger will fail and be freed. This triggers a NULL
pointer dereference and crashes the kernel.
Keep the deferred boot-time behavior, but when kthread creation fails,
drain the whole queued list synchronously. Do the same in the late-init
drain path so queued entries are not stranded there either.
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com
Fixes: 61d445af0a ("tracing: Add bulk garbage collection of freeing event_trigger_data")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This motherboard uses USB audio instead, causing this driver to complain
about "no codecs found!".
Add it to the denylist to silence the warning.
The first attempt only matched on the PCI device, but this caused issues
for some laptops, so DMI match against the board as well.
Signed-off-by: Stuart Hayhurst <stuart.a.hayhurst@gmail.com>
Link: https://patch.msgid.link/20260327155737.21818-2-stuart.a.hayhurst@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Wei Fang says:
====================
net: enetc: add more checks to enetc_set_rxfh()
ENETC only supports Toeplitz algorithm, and VFs do not support setting
the RSS key, but enetc_set_rxfh() does not check these constraints and
silently accepts unsupported configurations. This may mislead users or
tools into believing that the requested RSS settings have been
successfully applied. So add checks to reject unsupported hash functions
and RSS key updates on VFs, and return "-EOPNOTSUPP" to user space.
====================
Link: https://patch.msgid.link/20260326075233.3628047-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VFs do not have privilege to configure the RSS key because the registers
are owned by the PF. Currently, if VF attempts to configure the RSS key,
enetc_set_rxfh() simply skips the configuration and does not generate a
warning, which may mislead users into thinking the feature is supported.
To improve this situation, add a check to reject RSS key configuration
on VFs.
Fixes: d382563f54 ("enetc: Add RFS and RSS support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20260326075233.3628047-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 8110633db4 ("net: sfp-bus: allow SFP quirks to override
Autoneg and pause bits") we moved the setting of Autoneg and pause bits
before the call to SFP quirk when parsing SFP module support.
Since the quirk for Ubiquiti U-Fiber Instant SFP module zeroes the
support bits and sets 1000baseX_Full only, the above mentioned commit
changed the overall computed support from
1000baseX_Full, Autoneg, Pause, Asym_Pause
to just
1000baseX_Full.
This broke the SFP module for mvneta, which requires Autoneg for
1000baseX since commit c762b7fac1 ("net: mvneta: deny disabling
autoneg for 802.3z modes").
Fix this by setting back the Autoneg, Pause and Asym_Pause bits in the
quirk.
Fixes: 8110633db4 ("net: sfp-bus: allow SFP quirks to override Autoneg and pause bits")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260326122038.2489589-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
follow_pfnmap_start() suffers from two problems:
(1) We are not re-fetching the pmd/pud after taking the PTL
Therefore, we are not properly stabilizing what the lock actually
protects. If there is concurrent zapping, we would indicate to the
caller that we found an entry, however, that entry might already have
been invalidated, or contain a different PFN after taking the lock.
Properly use pmdp_get() / pudp_get() after taking the lock.
(2) pmd_leaf() / pud_leaf() are not well defined on non-present entries
pmd_leaf()/pud_leaf() could wrongly trigger on non-present entries.
There is no real guarantee that pmd_leaf()/pud_leaf() returns something
reasonable on non-present entries. Most architectures indeed either
perform a present check or make it work by smart use of flags.
However, for example loongarch checks the _PAGE_HUGE flag in pmd_leaf(),
and always sets the _PAGE_HUGE flag in __swp_entry_to_pmd(). Whereby
pmd_trans_huge() explicitly checks pmd_present(), pmd_leaf() does not do
that.
Let's check pmd_present()/pud_present() before assuming "the is a present
PMD leaf" when spotting pmd_leaf()/pud_leaf(), like other page table
handling code that traverses user page tables does.
Given that non-present PMD entries are likely rare in VM_IO|VM_PFNMAP, (1)
is likely more relevant than (2). It is questionable how often (1) would
actually trigger, but let's CC stable to be sure.
This was found by code inspection.
Link: https://lkml.kernel.org/r/20260323-follow_pfnmap_fix-v1-1-5b0ec10872b3@kernel.org
Fixes: 6da8e9634b ("mm: new follow_pfnmap API")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Multiple sysfs command paths dereference contexts_arr[0] without first
verifying that kdamond->contexts->nr == 1. A user can set nr_contexts to
0 via sysfs while DAMON is running, causing NULL pointer dereferences.
In more detail, the issue can be triggered by privileged users like
below.
First, start DAMON and make contexts directory empty
(kdamond->contexts->nr == 0).
# damo start
# cd /sys/kernel/mm/damon/admin/kdamonds/0
# echo 0 > contexts/nr_contexts
Then, each of below commands will cause the NULL pointer dereference.
# echo update_schemes_stats > state
# echo update_schemes_tried_regions > state
# echo update_schemes_tried_bytes > state
# echo update_schemes_effective_quotas > state
# echo update_tuned_intervals > state
Guard all commands (except OFF) at the entry point of
damon_sysfs_handle_cmd().
Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org
Fixes: 0ac32b8aff ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On arm64 server, we found folio that get from migration entry isn't locked
in softleaf_to_folio(). This issue triggers when mTHP splitting and
zap_nonpresent_ptes() races, and the root cause is lack of memory barrier
in softleaf_to_folio(). The race is as follows:
CPU0 CPU1
deferred_split_scan() zap_nonpresent_ptes()
lock folio
split_folio()
unmap_folio()
change ptes to migration entries
__split_folio_to_order() softleaf_to_folio()
set flags(including PG_locked) for tail pages folio = pfn_folio(softleaf_to_pfn(entry))
smp_wmb() VM_WARN_ON_ONCE(!folio_test_locked(folio))
prep_compound_page() for tail pages
In __split_folio_to_order(), smp_wmb() guarantees page flags of tail pages
are visible before the tail page becomes non-compound. smp_wmb() should
be paired with smp_rmb() in softleaf_to_folio(), which is missed. As a
result, if zap_nonpresent_ptes() accesses migration entry that stores tail
pfn, softleaf_to_folio() may see the updated compound_head of tail page
before page->flags.
This issue will trigger VM_WARN_ON_ONCE() in pfn_swap_entry_folio()
because of the race between folio split and zap_nonpresent_ptes()
leading to a folio incorrectly undergoing modification without a folio
lock being held.
This is a BUG_ON() before commit 93976a2034 ("mm: eliminate further
swapops predicates"), which in merged in v6.19-rc1.
To fix it, add missing smp_rmb() if the softleaf entry is migration entry
in softleaf_to_folio() and softleaf_to_page().
[tujinjiang@huawei.com: update function name and comments]
Link: https://lkml.kernel.org/r/20260321075214.3305564-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20260319012541.4158561-1-tujinjiang@huawei.com
Fixes: e9b61f1985 ("thp: reintroduce split_huge_page()")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a regression test for the divide-by-zero in rtsc_min() triggered
when m2sm() converts a large m1 value (e.g. 32gbit) to a u64 scaled
slope reaching 2^32. rtsc_min() stores the difference of two such u64
values (sm1 - sm2) in a u32 variable `dsm`, truncating 2^32 to zero
and causing a divide-by-zero oops in the concave-curve intersection
path. The test configures an HFSC class with m1=32gbit d=1ms m2=0bit,
sends a packet to activate the class, waits for it to drain and go
idle, then sends another packet to trigger reactivation through
rtsc_min().
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-2-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
m2sm() converts a u32 slope to a u64 scaled value. For large inputs
(e.g. m1=4000000000), the result can reach 2^32. rtsc_min() stores
the difference of two such u64 values in a u32 variable `dsm` and
uses it as a divisor. When the difference is exactly 2^32 the
truncation yields zero, causing a divide-by-zero oops in the
concave-curve intersection path:
Oops: divide error: 0000
RIP: 0010:rtsc_min (net/sched/sch_hfsc.c:601)
Call Trace:
init_ed (net/sched/sch_hfsc.c:629)
hfsc_enqueue (net/sched/sch_hfsc.c:1569)
[...]
Widen `dsm` to u64 and replace do_div() with div64_u64() so the full
difference is preserved.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While reviewing recent ext4 patch[1], Sashiko raised the following
concern[2]:
> If the filesystem is initially mounted with the discard option,
> deleting files will populate sbi->s_discard_list and queue
> s_discard_work. If it is then remounted with nodiscard, the
> EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is
> neither cancelled nor flushed.
[1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/
[2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev
The concern was valid, but it had nothing to do with the patch[1].
One of the problems with Sashiko in its current (early) form is that
it will detect pre-existing issues and report it as a problem with the
patch that it is reviewing.
In practice, it would be hard to hit deliberately (unless you are a
malicious syzkaller fuzzer), since it would involve mounting the file
system with -o discard, and then deleting a large number of files,
remounting the file system with -o nodiscard, and then immediately
unmounting the file system before the queued discard work has a change
to drain on its own.
Fix it because it's a real bug, and to avoid Sashiko from raising this
concern when analyzing future patches to mballoc.c.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: 55cdd0af2b ("ext4: get discard out of jbd2 commit kthread contex")
Cc: stable@kernel.org
Commit 4865c768b5 ("ext4: always allocate blocks only from groups
inode can use") restricts what blocks will be allocated for indirect
block based files to block numbers that fit within 32-bit block
numbers.
However, when using a review bot running on the latest Gemini LLM to
check this commit when backporting into an LTS based kernel, it raised
this concern:
If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal
group was populated via stream allocation from s_mb_last_groups),
then start will be >= ngroups.
Does this allow allocating blocks beyond the 32-bit limit for
indirect block mapped files? The commit message mentions that
ext4_mb_scan_groups_linear() takes care to not select unsupported
groups. However, its loop uses group = *start, and the very first
iteration will call ext4_mb_scan_group() with this unsupported
group because next_linear_group() is only called at the end of the
iteration.
After reviewing the code paths involved and considering the LLM
review, I determined that this can happen when there is a file system
where some files/directories are extent-mapped and others are
indirect-block mapped. To address this, add a safety clamp in
ext4_mb_scan_groups().
Fixes: 4865c768b5 ("ext4: always allocate blocks only from groups inode can use")
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
ext4_split_extent_at() retries after ext4_ext_insert_extent() fails by
refinding the original extent and restoring its length. That recovery is
only safe for transient resource failures such as -ENOSPC, -EDQUOT, and
-ENOMEM.
When ext4_ext_insert_extent() fails because the extent tree is already
corrupted, ext4_find_extent() can return a leaf path without p_ext.
ext4_split_extent_at() then dereferences path[depth].p_ext while trying to
fix up the original extent length, causing a NULL pointer dereference while
handling a pre-existing filesystem corruption.
Do not enter the recovery path for corruption errors, and validate p_ext
after refinding the extent before touching it. This keeps the recovery path
limited to cases it can actually repair and turns the syzbot-triggered crash
into a proper corruption report.
Fixes: 716b9c23b8 ("ext4: refactor split and convert extents")
Reported-by: syzbot+1ffa5d865557e51cb604@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1ffa5d865557e51cb604
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: hongao <hongao@uniontech.com>
Link: https://patch.msgid.link/EF77870F23FF9C90+20260324015815.35248-1-hongao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
During code review, Joseph found that ext4_fc_replay_inode() calls
ext4_get_fc_inode_loc() to get the inode location, which holds a
reference to iloc.bh that must be released via brelse().
However, several error paths jump to the 'out' label without
releasing iloc.bh:
- ext4_handle_dirty_metadata() failure
- sync_dirty_buffer() failure
- ext4_mark_inode_used() failure
- ext4_iget() failure
Fix this by introducing an 'out_brelse' label placed just before
the existing 'out' label to ensure iloc.bh is always released.
Additionally, make ext4_fc_replay_inode() propagate errors
properly instead of always returning 0.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Currently there is a race in ext4 when reallocating freed inode
resulting in a deadlock:
Task1 Task2
ext4_evict_inode()
handle = ext4_journal_start();
...
if (IS_SYNC(inode))
handle->h_sync = 1;
ext4_free_inode()
ext4_new_inode()
handle = ext4_journal_start()
finds the bit in inode bitmap
already clear
insert_inode_locked()
waits for inode to be
removed from the hash.
ext4_journal_stop(handle)
jbd2_journal_stop(handle)
jbd2_log_wait_commit(journal, tid);
- deadlocks waiting for transaction handle Task2 holds
Fix the problem by removing inode from the hash already in
ext4_clear_inode() by which time all IO for the inode is done so reuse
is already fine but we are still before possibly blocking on transaction
commit.
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Link: https://lore.kernel.org/all/abNvb2PcrKj1FBeC@ly-workstation
Fixes: 88ec797c46 ("fs: make insert_inode_locked() wait for inode destruction")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260320090428.24899-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Commit b98535d091 ("ext4: fix bug_on in start_this_handle during umount
filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work
to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups
reads during unmount. However, this introduced a use-after-free because
update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which
accesses the kobject's kernfs_node after it has been freed by kobject_del()
in ext4_unregister_sysfs():
update_super_work ext4_put_super
----------------- --------------
ext4_unregister_sysfs(sb)
kobject_del(&sbi->s_kobj)
__kobject_del()
sysfs_remove_dir()
kobj->sd = NULL
sysfs_put(sd)
kernfs_put() // RCU free
ext4_notify_error_sysfs(sbi)
sysfs_notify(&sbi->s_kobj)
kn = kobj->sd // stale pointer
kernfs_get(kn) // UAF on freed kernfs_node
ext4_journal_destroy()
flush_work(&sbi->s_sb_upd_work)
Instead of reordering the teardown sequence, fix this by making
ext4_notify_error_sysfs() detect that sysfs has already been torn down
by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call
in that case. A dedicated mutex (s_error_notify_mutex) serializes
ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs()
to prevent TOCTOU races where the kobject could be deleted between the
state_in_sysfs check and the sysfs_notify() call.
Fixes: b98535d091 ("ext4: fix bug_on in start_this_handle during umount filesystem")
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Yang Yang says:
====================
bridge/vxlan: harden ND option parsing paths
This series hardens ND option parsing in bridge and vxlan paths.
Patch 1 linearizes the request skb in br_nd_send() before walking ND
options. Patch 2 adds explicit ND option length validation in
br_nd_send(). Patch 3 adds matching ND option length validation in
vxlan_na_create().
====================
Link: https://patch.msgid.link/20260326034441.2037420-1-n05ec@lzu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch targets two internal state machine invariants in checkpoint.c
residing inside functions that natively return integer error codes.
- In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely
corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE
and a graceful journal abort, returning -EFSCORRUPTED.
- In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for
an unexpected buffer_jwrite state. If the warning triggers, we
explicitly drop the just-taken get_bh() reference and call __flush_batch()
to safely clean up any previously queued buffers in the j_chkpt_bhs array,
preventing a memory leak before returning -EFSCORRUPTED.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
On the mkdir/mknod path, when mapping logical blocks to physical blocks,
if inserting a new extent into the extent tree fails (in this example,
because the file system disabled the huge file feature when marking the
inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to
reclaim the physical block without deleting the corresponding data in
the extent tree. This causes subsequent mkdir operations to reference
the previously reclaimed physical block number again, even though this
physical block is already being used by the xattr block. Therefore, a
situation arises where both the directory and xattr are using the same
buffer head block in memory simultaneously.
The above causes ext4_xattr_block_set() to enter an infinite loop about
"inserted" and cannot release the inode lock, ultimately leading to the
143s blocking problem mentioned in [1].
If the metadata is corrupted, then trying to remove some extent space
can do even more harm. Also in case EXT4_GET_BLOCKS_DELALLOC_RESERVE
was passed, remove space wrongly update quota information.
Jan Kara suggests distinguishing between two cases:
1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully
consistent and we must maintain its consistency including all the
accounting. However these errors can happen only early before we've
inserted the extent into the extent tree. So current code works correctly
for this case.
2) Some other error - this means metadata is corrupted. We should strive to
do as few modifications as possible to limit damage. So I'd just skip
freeing of allocated blocks.
[1]
INFO: task syz.0.17:5995 blocked for more than 143 seconds.
Call Trace:
inode_lock_nested include/linux/fs.h:1073 [inline]
__start_dirop fs/namei.c:2923 [inline]
start_dirop fs/namei.c:2934 [inline]
Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The commit aa373cf550 ("writeback: stop background/kupdate works from
livelocking other works") introduced an issue where unmounting a filesystem
in a multi-logical-partition scenario could lead to batch file data loss.
This problem was not fixed until the commit d92109891f ("fs/writeback:
bail out if there is no more inodes for IO and queued once"). It took
considerable time to identify the root cause. Additionally, in actual
production environments, we frequently encountered file data loss after
normal system reboots. Therefore, we are adding a check in the inode
release flow to verify whether all dirty pages have been flushed to disk,
in order to determine whether the data loss is caused by a logic issue in
the filesystem code.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260303012242.3206465-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There's issue as follows:
...
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): error count since last fsck: 1
EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760
EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760
...
According to the log analysis, blocks are always requested from the
corrupted block group. This may happen as follows:
ext4_mb_find_by_goal
ext4_mb_load_buddy
ext4_mb_load_buddy_gfp
ext4_mb_init_cache
ext4_read_block_bitmap_nowait
ext4_wait_block_bitmap
ext4_validate_block_bitmap
if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
return -EFSCORRUPTED; // There's no logs.
if (err)
return err; // Will return error
ext4_lock_group(ac->ac_sb, group);
if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable
goto out;
After commit 9008a58e5d ("ext4: make the bitmap read routines return
real error codes") merged, Commit 163a203ddb ("ext4: mark block group
as corrupt on block bitmap error") is no real solution for allocating
blocks from corrupted block groups. This is because if
'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then
'ext4_mb_load_buddy()' may return an error. This means that the block
allocation will fail.
Therefore, check block group if corrupted when ext4_mb_load_buddy()
returns error.
Fixes: 163a203ddb ("ext4: mark block group as corrupt on block bitmap error")
Fixes: 9008a58e5d ("ext4: make the bitmap read routines return real error codes")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
commit 82f80e2e3b ("ext4: add extent status cache support to kunit tests"),
added ext4_es_register_shrinker() in extents_kunit_init() function but
failed to add the unregister shrinker routine in extents_kunit_exit().
This could cause the following percpu_counters list corruption bug.
ok 1 split unwrit extent to 2 extents and convert 1st half writ
slab kmalloc-4k start c0000002007ff000 pointer offset 1448 size 4096
list_add corruption. next->prev should be prev (c000000004bc9e60), but was 0000000000000000. (next=c0000002007ff5a8).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:29!
cpu 0x2: Vector: 700 (Program Check) at [c000000241927a30]
pc: c000000000f26ed0: __list_add_valid_or_report+0x120/0x164
lr: c000000000f26ecc: __list_add_valid_or_report+0x11c/0x164
sp: c000000241927cd0
msr: 800000000282b033
current = 0xc000000241215200
paca = 0xc0000003fffff300 irqmask: 0x03 irq_happened: 0x09
pid = 258, comm = kunit_try_catch
kernel BUG at lib/list_debug.c:29!
enter ? for help
__percpu_counter_init_many+0x148/0x184
ext4_es_register_shrinker+0x74/0x23c
extents_kunit_init+0x100/0x308
kunit_try_run_case+0x78/0x1f8
kunit_generic_run_threadfn_adapter+0x40/0x70
kthread+0x190/0x1a0
start_kernel_thread+0x14/0x18
2:mon>
This happens because:
extents_kunit_init(test N):
ext4_es_register_shrinker(sbi)
percpu_counters_init() x 4; // this adds 4 list nodes to global percpu_counters list
list_add(&fbc->list, &percpu_counters);
shrinker_register();
extents_kunit_exit(test N):
kfree(sbi); // frees sbi w/o removing those 4 list nodes.
// So, those list node now becomes dangling pointers
extents_kunit_init(test N+1):
kzalloc_obj(ext4_sb_info) // allocator returns same page, but zeroed.
ext4_es_register_shrinker(sbi)
percpu_counters_init()
list_add(&fbc->list, &percpu_counters);
__list_add_valid(new, prev, next);
next->prev != prev // list corruption bug detected, since next->prev = NULL
Fixes: 82f80e2e3b ("ext4: add extent status cache support to kunit tests")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5bb9041471dab8ce870c191c19cbe4df57473be8.1772381213.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Replace BUG_ON() with proper error handling when inline data size
exceeds PAGE_SIZE. This prevents kernel panic and allows the system to
continue running while properly reporting the filesystem corruption.
The error is logged via ext4_error_inode(), the buffer head is released
to prevent memory leak, and -EFSCORRUPTED is returned to indicate
filesystem corruption.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When inode metadata is changed, we sometimes just call
ext4_mark_inode_dirty() to track modified metadata. This copies inode
metadata into block buffer which is enough when we are journalling
metadata. However when we are running in nojournal mode we currently
fail to write the dirtied inode buffer during fsync(2) because the inode
is not marked as dirty. Use explicit ext4_write_inode() call to make
sure the inode table buffer is written to the disk. This is a band aid
solution but proper solution requires a much larger rewrite including
changes in metadata bh tracking infrastructure.
Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com>
Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
recently_deleted() checks whether inode has been used in the near past.
However this can give false positive result when inode table is not
initialized yet and we are in fact comparing to random garbage (or stale
itable block of a filesystem before mkfs). Ultimately this results in
uninitialized inodes being skipped during inode allocation and possibly
they are never initialized and thus e2fsck complains. Verify if the
inode has been initialized before checking for dtime.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Dimitri Daskalakis says:
====================
Fix page fragment handling when PAGE_SIZE > 4K
FBNIC operates on fixed size descriptors (4K). When the OS supports pages
larger than 4K, we fragment the page across multiple descriptors.
While performance testing, I found several issues with our page fragment
handling, resulting in low throughput and potential RX stalls.
====================
Link: https://patch.msgid.link/20260324195123.3486219-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt
are activated simultaneously, leading to ENOSPC when creating an encrypted
file.
Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle
is specified, i.e., when the function is called in the control flow of
creating a new inode. This aligns the number of jbd2 credits set_handle
checks for with the number allocated for creating a new inode.
ext4_set_context must not be called with a non-null handle (fs_data) if
fscrypt context xattr is not guaranteed to not exist yet. The only other
usage of this function currently is when handling the ioctl
FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL.
Fixes: c1a5d5f6ab ("ext4: improve journal credit handling in set xattr paths")
Co-developed-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Simon Weber <simon.weber.39@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
FBNIC supports fixed size buffers of 4K. When PAGE_SIZE > 4K, we
fragment the page across multiple descriptors (FBNIC_BD_FRAG_COUNT).
When refilling the BDQ, the correct number of entries are populated,
but tail was only incremented by one. So on a system with 64K pages,
HW would get one descriptor refilled for every 16 we populate.
Additionally, we program the ring size in the HW when enabling the BDQ.
This was not accounting for page fragments, so on systems with 64K pages,
the HW used 1/16th of the ring.
Fixes: 0cb4c0a137 ("eth: fbnic: Implement Rx queue alloc/start/stop/free")
Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Link: https://patch.msgid.link/20260324195123.3486219-2-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a check in ext4_setattr() to convert files from inline data storage
to extent-based storage when truncate() grows the file size beyond the
inline capacity. This prevents the filesystem from entering an
inconsistent state where the inline data flag is set but the file size
exceeds what can be stored inline.
Without this fix, the following sequence causes a kernel BUG_ON():
1. Mount filesystem with inode that has inline flag set and small size
2. truncate(file, 50MB) - grows size but inline flag remains set
3. sendfile() attempts to write data
4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity)
The crash occurs because ext4_write_inline_data() expects inline storage
to accommodate the write, but the actual inline capacity (~60 bytes for
i_block + ~96 bytes for xattrs) is far smaller than the file size and
write request.
The fix checks if the new size from setattr exceeds the inode's actual
inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to
extent-based storage before proceeding with the size change.
This addresses the root cause by ensuring the inline data flag and file
size remain consistent during truncate operations.
Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f
Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There are cases where ext4_bio_write_page() gets called for a page which
has no buffers to submit. This happens e.g. when the part of the file is
actually a hole, when we cannot allocate blocks due to being called from
jbd2, or in data=journal mode when checkpointing writes the buffers
earlier. In these cases we just return from ext4_bio_write_page()
however if the page didn't need redirtying, we will leave stale DIRTY
and/or TOWRITE tags in xarray because those get cleared only in
__folio_start_writeback(). As a result we can leave these tags set in
mappings even after a final sync on filesystem that's getting remounted
read-only or that's being frozen. Various assertions can then get upset
when writeback is started on such filesystems (Gerald reported assertion
in ext4_journal_check_start() firing).
Fix the problem by cycling the page through writeback state even if we
decide nothing needs to be written for it so that xarray tags get
properly updated. This is slightly silly (we could update the xarray
tags directly) but I don't think a special helper messing with xarray
tags is really worth it in this relatively rare corner case.
Reported-by: Gerald Yang <gerald.yang@canonical.com>
Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com
Fixes: dff4ac75ee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Oskar Kjos reported the following problem.
ip4ip6_err() calls icmp_send() on a cloned skb whose cb[] was written
by the IPv6 receive path as struct inet6_skb_parm. icmp_send() passes
IPCB(skb2) to __ip_options_echo(), which interprets that cb[] region
as struct inet_skb_parm (IPv4). The layouts differ: inet6_skb_parm.nhoff
at offset 14 overlaps inet_skb_parm.opt.rr, producing a non-zero rr
value. __ip_options_echo() then reads optlen from attacker-controlled
packet data at sptr[rr+1] and copies that many bytes into dopt->__data,
a fixed 40-byte stack buffer (IP_OPTIONS_DATA_FIXED_SIZE).
To fix this we clear skb2->cb[], as suggested by Oskar Kjos.
Also add minimal IPv4 header validation (version == 4, ihl >= 5).
Fixes: c4d3efafcc ("[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.")
Reported-by: Oskar Kjos <oskar.kjos@hotmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260326155138.2429480-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sashiko AI-review observed:
In ip6_err_gen_icmpv6_unreach(), the skb is an outer IPv4 ICMP error packet
where its cb contains an IPv4 inet_skb_parm. When skb is cloned into skb2
and passed to icmp6_send(), it uses IP6CB(skb2).
IP6CB interprets the IPv4 inet_skb_parm as an inet6_skb_parm. The cipso
offset in inet_skb_parm.opt directly overlaps with dsthao in inet6_skb_parm
at offset 18.
If an attacker sends a forged ICMPv4 error with a CIPSO IP option, dsthao
would be a non-zero offset. Inside icmp6_send(), mip6_addr_swap() is called
and uses ipv6_find_tlv(skb, opt->dsthao, IPV6_TLV_HAO).
This would scan the inner, attacker-controlled IPv6 packet starting at that
offset, potentially returning a fake TLV without checking if the remaining
packet length can hold the full 18-byte struct ipv6_destopt_hao.
Could mip6_addr_swap() then perform a 16-byte swap that extends past the end
of the packet data into skb_shared_info?
Should the cb array also be cleared in ip6_err_gen_icmpv6_unreach() and
ip6ip6_err() to prevent this?
This patch implements the first suggestion.
I am not sure if ip6ip6_err() needs to be changed.
A separate patch would be better anyway.
Fixes: ca15a078bd ("sit: generate icmpv6 error when receiving icmpv4 error")
Reported-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://sashiko.dev/#/patchset/20260326155138.2429480-1-edumazet%40google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oskar Kjos <oskar.kjos@hotmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260326202608.2976021-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit '5f920d5d6083 ("ext4: verify fast symlink length")' causes the
generic/475 test to fail during orphan cleanup of zero-length symlinks.
generic/475 84s ... _check_generic_filesystem: filesystem on /dev/vde is inconsistent
The fsck reports are provided below:
Deleted inode 9686 has zero dtime.
Deleted inode 158230 has zero dtime.
...
Inode bitmap differences: -9686 -158230
Orphan file (inode 12) block 13 is not clean.
Failed to initialize orphan file.
In ext4_symlink(), a newly created symlink can be added to the orphan
list due to ENOSPC. Its data has not been initialized, and its size is
zero. Therefore, we need to disregard the length check of the symbolic
link when cleaning up orphan inodes. Instead, we should ensure that the
nlink count is zero.
Fixes: 5f920d5d60 ("ext4: verify fast symlink length")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260131091156.1733648-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Pull hwmon fixes from Guenter Roeck:
- PMBus driver fixes:
- Add mutex protection for regulator operations
- Fix reading from "write-only" attributes
- Mark lowest/average/highest/rated attributes as read-only
- isl68137: Add mutex protection for AVS enable sysfs attributes
- ina233: Fix error handling and sign extension when reading shunt voltage
- adm1177: Fix sysfs ABI violation and current unit conversion
- peci: Fix off-by-one in cputemp_is_visible(), and crit_hyst returning
delta instead of absolute temperature
* tag 'hwmon-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (pmbus/core) Protect regulator operations with mutex
hwmon: (pmbus) Introduce the concept of "write-only" attributes
hwmon: (pmbus) Mark lowest/average/highest/rated attributes as read-only
hwmon: (adm1177) fix sysfs ABI violation and current unit conversion
hwmon: (peci/cputemp) Fix off-by-one in cputemp_is_visible()
hwmon: (peci/cputemp) Fix crit_hyst returning delta instead of absolute temperature
hwmon: (pmbus/isl68137) Add mutex protection for AVS enable sysfs attributes
hwmon: (pmbus/ina233) Fix error handling and sign extension in shunt voltage read
Pull SCSI fixes from James Bottomley:
"Driver (and enclosure) only fixes. Most are obvious. The big change is
in the tcm_loop driver to add command draining to error handling (the
lack of which was causing hangs with the potential for double use
crashes)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: file: Use kzalloc_flex for aio_cmd
scsi: scsi_transport_sas: Fix the maximum channel scanning issue
scsi: target: tcm_loop: Drain commands in target_reset handler
scsi: ibmvfc: Fix OOB access in ibmvfc_discover_targets_done()
scsi: ses: Handle positive SCSI error from ses_recv_diag()
Pull drm fixes from Dave Airlie:
"Weekly fixes, still a bit busy, but the usual suspects amdgpu and
i915/xe have a bunch of small fixes, and otherwise it's just a few
minor driver fixes.
loognsoon:
- update MAINTAINERS
shmem:
- fault handler fix
syncobj:
- fix GFP flags
amdgpu:
- DSC fix
- Module parameter parsing fix
- PASID reuse fix
- drm_edid leak fix
- SMU 13.x fixes
- SMU 14.x fix
- Fence fix in amdgpu_amdkfd_submit_ib()
- LVDS fixes
- GPU page fault fix for non-4K pages
amdkfd:
- Ordering fix in kfd_ioctl_create_process()
i915/display:
- DP tunnel error handling fix
- Spurious GMBUS timeout fix
- Unlink NV12 planes earlier
- Order OP vs. timeout correctly in __wait_for()
xe:
- Fix UAF in SRIOV migration restore
- Updates to HW W/a
- VMBind remap fix
ivpu:
- poweroff fix
mediatek:
- fix register ordering"
* tag 'drm-fixes-2026-03-28-1' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
MAINTAINERS: Update GPU driver maintainer information
drm/xe: always keep track of remap prev/next
drm/syncobj: Fix xa_alloc allocation flags
drm/amd/display: Fix DCE LVDS handling
drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
drm/amd/pm: disable OD_FAN_CURVE if temp or pwm range invalid for smu v14
drm/amdkfd: Fix NULL pointer check order in kfd_ioctl_create_process
drm/amd/display: check if ext_caps is valid in BL setup
drm/amdgpu: Fix fence put before wait in amdgpu_amdkfd_submit_ib
drm/xe: Implement recent spec updates to Wa_16025250150
accel/ivpu: Add disable clock relinquish workaround for NVL-A0
drm/i915/dp_tunnel: Fix error handling when clearing stream BW in atomic state
drm/amd/pm: disable OD_FAN_CURVE if temp or pwm range invalid for smu v13
drm/amd/pm: Return -EOPNOTSUPP for unsupported OD_MCLK on smu_v13_0_6
drm/amd/pm: Skip redundant UCLK restore in smu_v13_0_6
drm/amd/display: Fix drm_edid leak in amdgpu_dm
drm/amdgpu: prevent immediate PASID reuse case
drm/amdgpu: fix strsep() corrupting lockup_timeout on multi-GPU (v3)
drm/amd/display: Do not skip unrelated mode changes in DSC validation
drm/xe/pf: Fix use-after-free in migration restore
...
Before commit f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP"),
all entry handlers loaded r12 with the current task pointer
(lg %r12,__LC_CURRENT) for use by the BPENTER/BPEXIT macros. That
commit removed TIF_ISOLATE_BP, dropping both the branch prediction
macros and the r12 load, but did not add r12 to the register clearing
sequence.
Add the missing xgr %r12,%r12 to make the register scrub consistent
across all entry points.
Fixes: f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP")
Cc: stable@kernel.org
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pull spi fixes from Mark Brown:
"There are two core fixes here. One is from Johan dealing with an issue
introduced by a devm_ API usage update causing things to be freed
earlier than they had earlier when we fail to register a device,
another from Danilo avoids unlocked acccess to data by converting to
use a driver core API.
We also have a few relatively minor driver specific fixes"
* tag 'spi-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-fsl-lpspi: fix teardown order issue (UAF)
spi: fix use-after-free on managed registration failure
spi: use generic driver_override infrastructure
spi: meson-spicc: Fix double-put in remove path
spi: sn-f-ospi: Use devm_mutex_init() to simplify code
spi: sn-f-ospi: Fix resource leak in f_ospi_probe()
Pull regulator fix from Mark Brown:
"A fix from Alice for the rust bindings, they didn't handle the stub
implementation of the C API used when CONFIG_REGULATOR is disabled
leading to undefined behaviour"
* tag 'regulator-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
rust: regulator: do not assume that regulator_get() returns non-null
Pull regmap fix from Mark Brown:
"A fix from Andy Shevchenko for an issue with caching of page selector
registers which are located inside the page they are switching"
* tag 'regmap-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: Synchronize cache for the page selector
Pull tsm fix from Dan Williams:
- Fix a VMM controlled buffer length used to emit TDX attestation
reports
* tag 'tsm-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm:
virt: tdx-guest: Fix handling of host controlled 'quote' buffer length
Pull VFIO fix from Alex Williamson:
- Fix double-free and reference count underflow if dma-buf file
allocation fails (Alex Williamson)
* tag 'vfio-v7.0-rc6' of https://github.com/awilliam/linux-vfio:
vfio/pci: Fix double free in dma-buf feature
Pull EFI fix from Ard Biesheuvel:
"Fix a potential buffer overrun issue introduced by the previous fix
for EFI boot services region reservations on x86"
* tag 'efi-fixes-for-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: efi_unmap_boot_services: fix calculation of ranges_to_free size
Pull LoongArch fixes from Huacai Chen:
"Fix missing NULL checks for kstrdup(), workaround LS2K/LS7A GPU
DMA hang bug, emit GNU_EH_FRAME for vDSO correctly, and fix some
KVM-related bugs"
* tag 'loongarch-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix base address calculation in kvm_eiointc_regs_access()
LoongArch: KVM: Handle the case that EIOINTC's coremap is empty
LoongArch: KVM: Make kvm_get_vcpu_by_cpuid() more robust
LoongArch: vDSO: Emit GNU_EH_FRAME correctly
LoongArch: Workaround LS2K/LS7A GPU DMA hang bug
LoongArch: Fix missing NULL checks for kstrdup()
Pull io_uring fixes from Jens Axboe:
"Just two small fixes, both fixing regressions added in the fdinfo code
in 6.19 with the SQE mixed size support"
* tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
io_uring/fdinfo: fix SQE_MIXED SQE displaying
Adjust KVM's sanity check against overwriting a shadow-present SPTE with a
another SPTE with a different target PFN to only apply to direct MMUs,
i.e. only to MMUs without shadowed gPTEs. While it's impossible for KVM
to overwrite a shadow-present SPTE in response to a guest write, writes
from outside the scope of KVM, e.g. from host userspace, aren't detected
by KVM's write tracking and so can break KVM's shadow paging rules.
------------[ cut here ]------------
pfn != spte_to_pfn(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm]
Call Trace:
<TASK>
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 11d4517511 ("KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the
existing SPTE (if it's shadow-present). While commit a54aa15c6b was
right about it being impossible to convert a shadow-present SPTE to an
MMIO SPTE due to a _guest_ write, it failed to account for writes to guest
memory that are outside the scope of KVM.
E.g. if host userspace modifies a shadowed gPTE to switch from a memslot
to emulted MMIO and then the guest hits a relevant page fault, KVM will
install the MMIO SPTE without first zapping the shadow-present SPTE.
------------[ cut here ]------------
is_shadow_present_pte(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm]
Call Trace:
<TASK>
mmu_set_spte+0x237/0x440 [kvm]
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x47fa3f
</TASK>
---[ end trace 0000000000000000 ]---
Reported-by: Alexander Bulekov <bkov@amazon.com>
Debugged-by: Alexander Bulekov <bkov@amazon.com>
Suggested-by: Fred Griffoul <fgriffo@amazon.co.uk>
Fixes: a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: s390: More memory management fixes
Lots of small and not-so-small fixes for the newly rewritten gmap,
mostly affecting the handling of nested guests.
Since the ChaCha permutation is invertible, the local variable
'permuted_state' is sufficient to compute the original 'state', and thus
the key, even after the permutation has been done.
While the kernel is quite inconsistent about zeroizing secrets on the
stack (and some prominent userspace crypto libraries don't bother at all
since it's not guaranteed to work anyway), the kernel does try to do it
as a best practice, especially in cases involving the RNG.
Thus, explicitly zeroize 'permuted_state' before it goes out of scope.
Fixes: c08d0e6473 ("crypto: chacha20 - Add a generic ChaCha20 stream cipher implementation")
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260326032920.39408-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull rdma fixes from Jason Gunthorpe:
- Quite a few irdma bug fixes, several user triggerable
- Fix a 0 SMAC header in ionic
- Tolerate FW errors for RAAS in bng_re
- Don't UAF in efa when printing error events
- Better handle pool exhaustion in the new bvec paths
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/irdma: Harden depth calculation functions
RDMA/irdma: Return EINVAL for invalid arp index error
RDMA/irdma: Fix deadlock during netdev reset with active connections
RDMA/irdma: Remove reset check from irdma_modify_qp_to_err()
RDMA/irdma: Clean up unnecessary dereference of event->cm_node
RDMA/irdma: Remove a NOP wait_event() in irdma_modify_qp_roce()
RDMA/irdma: Update ibqp state to error if QP is already in error state
RDMA/irdma: Initialize free_qp completion before using it
RDMA/efa: Fix possible deadlock
RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
RDMA/rw: Fall back to direct SGE on MR pool exhaustion
RDMA/efa: Fix use of completion ctx after free
RDMA/bng_re: Fix silent failure in HWRM version query
RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init()
RDMA/irdma: Fix double free related to rereg_user_mr
Pull pci fixes from Bjorn Helgaas:
- Remove power-off from pwrctrl drivers since this is now done directly
by the PCI controller drivers (Chen-Yu Tsai)
- Fix pwrctrl device node leak (Felix Gu)
- Document a TLP header decoder for AER log messages (Lukas Wunner)
* tag 'pci-v7.0-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Documentation: PCI: Document PCIe TLP Header decoder for AER messages
PCI/pwrctrl: Fix pci_pwrctrl_is_required() device node leak
PCI/pwrctrl: Do not power off on pwrctrl device removal
Pull sound fixes from Takashi Iwai:
"This became slightly big partly due to my time off in the last week.
But all changes are about device-specific fixes, so it should be
safely applicable.
ASoC:
- Fix double free in sma1307
- Fix uninitialized variables in simple-card-utils/imx-card
- Address clock leaks and error propagation in ADAU1372
- Add DMI quirks and ACP/SDW support for ASUS
- Fix Intel CATPT DMA mask
- Fix SOF topology parsing
- Fix DT bindings for RK3576 SPDIF, STM32 SAI and WCD934x
HD-audio:
- Quirks for Lenovo, ASUS, and various HP models, as well as
a speaker pop fix on Star Labs StarFighter
- Revert MSI X870E Tomahawk denylist again
USB-Audio:
- Fix distorted audio on Focusrite Scarlett 2i2/2i4 1st Gen
- Add iface reset quirk for AB17X
- Update Qualcomm USB audio Kconfig dependencies and license
Misc:
- Fix minor compile warnings for firewire and asihpi drivers"
* tag 'sound-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (35 commits)
Revert "ALSA: hda/intel: Add MSI X870E Tomahawk to denylist"
ALSA: usb-audio: Add iface reset and delay quirk for AB17X USB Audio
ALSA: hda/realtek: add HP Laptop 15-fd0xxx mute LED quirk
ALSA: usb-audio: Exclude Scarlett 2i4 1st Gen from SKIP_IFACE_SETUP
ALSA: hda/realtek: Add mute LED quirk for HP Pavilion 15-eg0xxx
ALSA: hda/realtek - Fixed Speaker Mute LED for HP EliteBoard G1a platform
ASoC: SOF: ipc4-topology: Allow bytes controls without initial payload
ASoC: adau1372: Fix clock leak on PLL lock failure
ASoC: adau1372: Fix unchecked clk_prepare_enable() return value
ASoC: SDCA: fix finding wrong entity
ASoC: SDCA: remove the max count of initialization table
ASoC: codecs: wcd934x: fix typo in dt parsing
ASoC: dt-bindings: stm32: Fix incorrect compatible string in stm32h7-sai match
ASoC: Intel: catpt: Fix the device initialization
ASoC: amd: acp: add ASUS HN7306EA quirk for legacy SDW machine
ASoC: SOF: topology: reject invalid vendor array size in token parser
ASoC: tas2781: Add null check for calibration data
ALSA: asihpi: avoid write overflow check warning
ASoC: fsl: imx-card: initialize playback_only and capture_only
ASoC: simple-card-utils: Check value of is_playback_only and is_capture_only
...
Pull media fixes from Mauro Carvalho Chehab:
- uvcvideo may cause OOPS when out of memory
- remove a deadlock in the ccs driver
* tag 'media/v7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: ccs: Avoid deadlock in ccs_init_state()
media: uvcvideo: Fix bug in error path of uvc_alloc_urb_buffers
Pull sysctl fix from Joel Granados:
"Fix uninitialized variable error when writing to a sysctl bitmap
Removed the possibility of returning an unjustified -EINVAL when
writing to a sysctl bitmap"
* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: fix uninitialized variable in proc_do_large_bitmap
Pull xfs fixes from Carlos Maiolino:
"This includes a few important bug fixes, and some code refactoring
that was necessary for one of the fixes"
* tag 'xfs-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove file_path tracepoint data
xfs: don't irele after failing to iget in xfs_attri_recover_work
xfs: remove redundant validation in xlog_recover_attri_commit_pass2
xfs: fix ri_total validation in xlog_recover_attri_commit_pass2
xfs: close crash window in attr dabtree inactivation
xfs: factor out xfs_attr3_leaf_init
xfs: factor out xfs_attr3_node_entry_remove
xfs: only assert new size for datafork during truncate extents
xfs: annotate struct xfs_attr_list_context with __counted_by_ptr
xfs: cleanup buftarg handling in XFS_IOC_VERIFY_MEDIA
xfs: scrub: unlock dquot before early return in quota scrub
xfs: refactor xfsaild_push loop into helper
xfs: save ailp before dropping the AIL lock in push callbacks
xfs: avoid dereferencing log items after push callbacks
xfs: stop reclaim before pushing AIL during unmount
Pull smb server fixes from Steve French:
- Fix out of bounds write
- Fix for better calculating max output buffers
- Fix memory leaks in SMB2/SMB3 lock
- Fix use after free
- Multichannel fix
* tag 'v7.0-rc5-ksmbd-srv-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix potencial OOB in get_file_all_info() for compound requests
ksmbd: replace hardcoded hdr2_len with offsetof() in smb2_calc_max_out_buf_len()
ksmbd: fix memory leaks and NULL deref in smb2_lock()
ksmbd: fix use-after-free and NULL deref in smb_grant_oplock()
ksmbd: do not expire session on binding failure
udf_setsize() can race with udf_writepages() as follows:
udf_setsize() udf_writepages()
if (iinfo->i_alloc_type ==
ICBTAG_FLAG_AD_IN_ICB)
err = udf_expand_file_adinicb(inode);
err = udf_extend_file(inode, newsize);
udf_adinicb_writepages()
memcpy_from_file_folio() - crash
because inode size is too big.
Fix the problem by checking the file type under folio lock in
udf_handle_page_wb() handler called from __mpage_writepages() which
properly serializes with udf_expand_file_adinicb().
Reported-by: Jianzhou Zhao <luckd0g@163.com>
Link: https://lore.kernel.org/all/f622c01.67ac.19cdbdd777d.Coremail.luckd0g@163.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326140635.15895-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Some filesystems need to treat some folios specially (for example for
inodes with inline data). Doing the handling in their .writepages method
in a race-free manner results in duplicating some of the writeback
internals. So provide generalized version of mpage_writepages() that
allows filesystem to provide a handler called for each folio which can
handle the folio in a special way.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326140635.15895-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Identified resume-probe race condition in kernel v7.0 with the commit
38fa29b01a ("i2c: designware: Combine the init functions"),but this
issue existed from the beginning though not detected.
The amdisp i2c device requires ISP to be in power-on state for probe
to succeed. To meet this requirement, this device is added to genpd
to control ISP power using runtime PM. The pm_runtime_get_sync() called
before i2c_dw_probe() triggers PM resume, which powers on ISP and also
invokes the amdisp i2c runtime resume before the probe completes resulting
in this race condition and a NULL dereferencing issue in v7.0
Fix this race condition by using the genpd APIs directly during probe:
- Call dev_pm_genpd_resume() to Power ON ISP before probe
- Call dev_pm_genpd_suspend() to Power OFF ISP after probe
- Set the device to suspended state with pm_runtime_set_suspended()
- Enable runtime PM only after the device is fully initialized
Fixes: d6263c468a ("i2c: amd-isp: Add ISP i2c-designware driver")
Co-developed-by: Bin Du <bin.du@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Cc: <stable@vger.kernel.org> # v6.16+
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260320201302.3490570-1-pratap.nirujogi@amd.com
When reading from the I2DR register, right after releasing the bus by
clearing MSTA and MTX, the I2C controller might still generate an
additional clock cycle which can cause devices to misbehave. Ensure to
only read from I2DR after the bus is not busy anymore. Because this
requires polling, the read of the last byte is moved outside of the
interrupt handler.
An example for such a failing transfer is this:
i2ctransfer -y -a 0 w1@0x00 0x02 r1
Error: Sending messages failed: Connection timed out
It does not happen with every device because not all devices react to
the additional clock cycle.
Fixes: 5f5c2d4579 ("i2c: imx: prevent rescheduling in non dma mode")
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260218150940.131354-3-eichest@gmail.com
When reading multiple messages, meaning a repeated start is required,
polling the bus busy bit must be avoided. This must only be done for
the last message. Otherwise, the driver will timeout.
Here an example of such a sequence that fails with an error:
i2ctransfer -y -a 0 w1@0x00 0x02 r1 w1@0x00 0x02 r1
Error: Sending messages failed: Connection timed out
Fixes: 5f5c2d4579 ("i2c: imx: prevent rescheduling in non dma mode")
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260218150940.131354-2-eichest@gmail.com
emac_dispatch_skb_zc() allocates a new skb via napi_alloc_skb() but
never copies the packet data from the XDP buffer into it. The skb is
passed up the stack containing uninitialized heap memory instead of
the actual received packet, leaking kernel heap contents to userspace.
Copy the received packet data from the XDP buffer into the skb using
skb_copy_to_linear_data().
Additionally, remove the skb_mark_for_recycle() call since the skb is
backed by the NAPI page frag allocator, not page_pool. Marking a
non-page_pool skb for recycle causes the free path to return pages to
a page_pool that does not own them, corrupting page_pool state.
The non-ZC path (emac_rx_packet) does not have these issues because it
uses napi_build_skb() to wrap the existing page_pool page directly,
requiring no copy, and correctly marks for recycle since the page comes
from page_pool_dev_alloc_pages().
Fixes: 7a64bb388d ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When driver signals carrier up via netif_carrier_on() its internal
link_up state isn't updated immediately. This leads to inconsistent
speed/duplex in /proc/net/bonding/bondX where the speed and duplex
is shown as unknown while ethtool shows correct values. Fix this by
using netif_carrier_ok() for link checking in get_ksettings function.
Fixes: 84421b99ce ("tg3: Update link_up flag for phylib devices")
Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ioam6_fill_trace_data() stores the schema contribution to the trace
length in a u8. With bit 22 enabled and the largest schema payload,
sclen becomes 1 + 1020 / 4, wraps from 256 to 0, and bypasses the
remaining-space check. __ioam6_fill_trace_data() then positions the
write cursor without reserving the schema area but still copies the
4-byte schema header and the full schema payload, overrunning the trace
buffer.
Keep sclen in an unsigned int so the remaining-space check and the write
cursor calculation both see the full schema length.
Fixes: 8c6f6fa677 ("ipv6: ioam: IOAM Generic Netlink API")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d6899fb69 ("ovl: fsync after metadata copy-up") was done to
fix durability of overlayfs copy up on an upper filesystem which does
not enforce ordering on storing of metadata changes (e.g. ubifs).
In an earlier revision of the regressing commit by Lei Lv, the metadata
fsync behavior was opt-in via a new "fsync=strict" mount option.
We were hoping that the opt-in mount option could be avoided, so the
change was only made to depend on metacopy=off, in the hope of not
hurting performance of metadata heavy workloads, which are more likely
to be using metacopy=on.
This hope was proven wrong by a performance regression report from Google
COS workload after upgrade to kernel 6.12.
This is an adaptation of Lei's original "fsync=strict" mount option
to the existing upstream code.
The new mount option is mutually exclusive with the "volatile" mount
option, so the latter is now an alias to the "fsync=volatile" mount
option.
Reported-by: Chenglong Tang <chenglongtang@google.com>
Closes: https://lore.kernel.org/linux-unionfs/CAOdxtTadAFH01Vui1FvWfcmQ8jH1O45owTzUcpYbNvBxnLeM7Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/
Fixes: 7d6899fb69 ("ovl: fsync after metadata copy-up")
Depends: 50e638beb6 ("ovl: Use str_on_off() helper in ovl_show_options()")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Fei Lv <feilv@asrmicro.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Setting up the interface when suspended/resumeing fail on this card.
Adding a reset and delay quirk will eliminate this problem.
usb 1-1: new full-speed USB device number 2 using xhci-hcd
usb 1-1: New USB device found, idVendor=001f, idProduct=0b23
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-1: Product: AB17X USB Audio
usb 1-1: Manufacturer: Generic
usb 1-1: SerialNumber: 20241228172028
Signed-off-by: Lianqin Hu <hulianqin@vivo.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/PUZPR06MB6224CA59AD2B26054120B276D249A@PUZPR06MB6224.apcprd06.prod.outlook.com
The HP Pavilion 15-eg0xxx with subsystem ID 0x103c87cb uses a Realtek
ALC287 codec with a mute LED wired to GPIO pin 4 (mask 0x10). The
existing ALC287_FIXUP_HP_GPIO_LED fixup already handles this correctly,
but the subsystem ID was missing from the quirk table.
GPIO pin confirmed via manual hda-verb testing:
hda-verb SET_GPIO_MASK 0x10
hda-verb SET_GPIO_DIRECTION 0x10
hda-verb SET_GPIO_DATA 0x10
Signed-off-by: César Montoya <sprit152009@gmail.com>
Link: https://patch.msgid.link/20260321153603.12771-1-sprit152009@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
On the HP EliteBoard G1a platform (models without a headphone jack).
the speaker mute LED failed to function. The Sysfs ctl-led info showed
empty values because the standard LED registration couldn't correctly
bind to the master switch.
Adding this patch will fix and enable the speaker mute LED feature.
Tested-by: Chris Chiu <chris.chiu@canonical.com>
Signed-off-by: Kailang Yang <kailang@realtek.com>
Link: https://lore.kernel.org/279e929e884849df84687dbd67f20037@realtek.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
ASoC: Fixes for v7.0
This is two week's worth of fixes and quirks so it's a bit larger than
you might expect, there's nothing too exciting individually and nothing
in core code.
linedisp_release() currently retrieves the enclosing struct linedisp via
to_linedisp(). That lookup depends on the attachment list, but the
attachment may already have been removed before put_device() invokes the
release callback. This can happen in linedisp_unregister(), and can also
be reached from some linedisp_register() error paths.
In that case, to_linedisp() returns NULL and linedisp_release()
dereferences it while freeing the display resources.
The struct device released here is the embedded linedisp->dev used by
linedisp_register(), so retrieve the enclosing object directly with
container_of() instead.
Fixes: 66c9380948 ("auxdisplay: linedisp: encapsulate container_of usage within to_linedisp")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
After enabling CONFIG_GCOV_KERNEL and CONFIG_GCOV_PROFILE_ALL, following
build failure is observed under GCC 14.2.1:
In function 'amdv1pt_install_leaf_entry',
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:650:3,
inlined from '__map_single_page0' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1,
inlined from 'pt_descend' at drivers/iommu/generic_pt/fmt/../pt_iter.h:391:9,
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:657:10,
inlined from '__map_single_page1.constprop' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1:
././include/linux/compiler_types.h:706:45: error: call to '__compiletime_assert_71' declared with attribute error: FIELD_PREP: value too large for the field
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|
......
drivers/iommu/generic_pt/fmt/amdv1.h:220:26: note: in expansion of macro 'FIELD_PREP'
220 | FIELD_PREP(AMDV1PT_FMT_OA,
| ^~~~~~~~~~
In the path '__do_map_single_page()', level 0 always invokes
'pt_install_leaf_entry(&pts, map->oa, PAGE_SHIFT, …)'. At runtime that
lands in the 'if (oasz_lg2 == isz_lg2)' arm of 'amdv1pt_install_leaf_entry()';
the contiguous-only 'else' block is unreachable for 4 KiB pages.
With CONFIG_GCOV_KERNEL + CONFIG_GCOV_PROFILE_ALL, the extra
instrumentation changes GCC's inlining so that the "dead" 'else' branch
still gets instantiated. The compiler constant-folds the contiguous OA
expression, runs the 'FIELD_PREP()' compile-time check, and produces:
FIELD_PREP: value too large for the field
gcov-enabled builds therefore fail even though the code path never executes.
Fix this by marking amdv1pt_install_leaf_entry as __always_inline.
Fixes: dcd6a011a8 ("iommupt: Add map_pages op")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
unmap has the odd behavior that it can unmap more than requested if the
ending point lands within the middle of a large or contiguous IOPTE.
In this case the gather should flush everything unmapped which can be
larger than what was requested to be unmapped. The gather was only
flushing the range requested to be unmapped, not extending to the extra
range, resulting in a short invalidation if the caller hits this special
condition.
This was found by the new invalidation/gather test I am adding in
preparation for ARMv8. Claude deduced the root cause.
As far as I remember nothing relies on unmapping a large entry, so this is
likely not a triggerable bug.
Cc: stable@vger.kernel.org
Fixes: 7c53f4238a ("iommupt: Add unmap_pages op")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
An empty gather is coded with start=U64_MAX, end=0 and several drivers go
on to convert that to a size with:
end - start + 1
Which gives 2 for an empty gather. This then causes Weird Stuff to
happen (for example an UBSAN splat in VT-d) that is hopefully harmless,
but maybe not.
Prevent drivers from being called right in iommu_iotlb_sync().
Auditing shows that AMD, Intel, Mediatek and RSIC-V drivers all do things
on these empty gathers.
Further, there are several callers that can trigger empty gathers,
especially in unusual conditions. For example iommu_map_nosync() will call
a 0 size unmap on some error paths. Also in VFIO, iommupt and other
places.
Cc: stable@vger.kernel.org
Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Closes: https://lore.kernel.org/r/11145826.aFP6jjVeTY@jkrzyszt-mobl2.ger.corp.intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When processing Router Advertisements with user options the kernel
builds an RTM_NEWNDUSEROPT netlink message. The nduseroptmsg struct
has three padding fields that are never zeroed and can leak kernel data
The fix is simple, just zeroes the padding fields.
Fixes: 31910575a9 ("[IPv6]: Export userland ND options through netlink (RDNSS support)")
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324224925.2437775-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang says:
====================
net: enetc: safely reinitialize TX BD ring when it has unsent frames
Currently the driver does not reset the producer index register (PIR) and
consumer index register (CIR) when initializing a TX BD ring. The driver
only reads the PIR and CIR and initializes the software indexes. If the
TX BD ring is reinitialized when it still contains unsent frames, its PIR
and CIR will not be equal after the reinitialization. However, the BDs
between CIR and PIR have been freed and become invalid and this can lead
to a hardware malfunction, causing the TX BD ring will not work properly.
Since the PIR and CIR are sofeware-configurable on ENETC v4. Therefore,
the driver must reset them if they are not equal when reinitializing
the TX BD ring.
However, resetting the PIR and CIR alone is insufficient, it cannot
completely solve the problem. When a link-down event occurs while the TX
BD ring is transmitting frames, subsequent reinitialization of the TX BD
ring may cause it to malfunction. Because enetc4_pl_mac_link_down() only
clears PMa_COMMAND_CONFIG[TX_EN] to disable MAC transmit data path. It
doesn't set PORT[TXDIS] to 1 to flush the TX BD ring. Therefore, it is
not safe to reinitialize the TX BD ring at this point.
To safely reinitialize the TX BD ring after a link-down event, we checked
with the NETC IP team, a proper Ethernet MAC graceful stop is necessary.
Therefore, add the Ethernet MAC graceful stop to the link-down event
handler enetc4_pl_mac_link_down(). Note that this patch set is not
applicable to ENETC v1 (LS1028A).
====================
Link: https://patch.msgid.link/20260324062121.2745033-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For ENETC v4, the PIR and CIR will be reset if they are not equal when
reinitializing the TX BD ring. However, resetting the PIR and CIR alone
is insufficient. When a link-down event occurs while the TX BD ring is
transmitting frames, subsequent reinitialization of the TX BD ring may
cause it to malfunction. For example, the below steps can reproduce the
problem.
1. Unplug the cable when the TX BD ring is busy transmitting frames.
2. Disable the network interface (ifconfig eth0 down).
3. Re-enable the network interface (ifconfig eth0 up).
4. Plug in the cable, the TX BD ring may fail to transmit packets.
When the link-down event occurs, enetc4_pl_mac_link_down() only clears
PMa_COMMAND_CONFIG[TX_EN] to disable MAC transmit data path. It doesn't
set PORT[TXDIS] to 1 to flush the TX BD ring. Therefore, reinitializing
the TX BD ring at this point is unsafe. To safely reinitialize the TX BD
ring after a link-down event, we checked with the NETC IP team, a proper
Ethernet MAC graceful stop is necessary. Therefore, add the Ethernet MAC
graceful stop to the link-down event handler enetc4_pl_mac_link_down().
Fixes: 99100d0d99 ("net: enetc: add preliminary support for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324062121.2745033-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the driver does not reset the producer index register (PIR) and
consumer index register (CIR) when initializing a TX BD ring. The driver
only reads the PIR and CIR and initializes the software indexes. If the
TX BD ring is reinitialized when it still contains unsent frames, its PIR
and CIR will not be equal after the reinitialization. However, the BDs
between CIR and PIR have been freed and become invalid and this can lead
to a hardware malfunction, causing the TX BD ring will not work properly.
For ENETC v4, it supports software to set the PIR and CIR, so the driver
can reset these two registers if they are not equal when reinitializing
the TX BD ring. Therefore, add this solution for ENETC v4. Note that this
patch does not work for ENETC v1 because it does not support software to
set the PIR and CIR.
Fixes: 99100d0d99 ("net: enetc: add preliminary support for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324062121.2745033-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the PPS channel configuration was implemented, the channel
index for the periodic outputs was configured as the hardware
channel number.
The sysfs interface uses a logical channel index, and rejects numbers
greater than `n_per_out` (see period_store() in ptp_sysfs.c).
That property was left at 1, since the driver implements channel
selection, not simultaneous operation of multiple PTP hardware timer
channels.
A second check in fec_ptp_enable() returns -EOPNOTSUPP when the two
channel numbers disagree, making channels 1..3 unusable from sysfs.
Fix by removing this redundant check in the FEC PTP driver.
Fixes: 566c2d8388 ("net: fec: make PPS channel configurable")
Signed-off-by: Buday Csaba <buday.csaba@prolan.hu>
Link: https://patch.msgid.link/8ec2afe88423c2231f9cf8044d212ce57846670e.1774359059.git.buday.csaba@prolan.hu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__skb_ext_put() is not declared if SKB_EXTENSIONS is not enabled, which
causes a build error:
drivers/net/netdevsim/netdev.c: In function 'nsim_forward_skb':
drivers/net/netdevsim/netdev.c:114:25: error: implicit declaration of function '__skb_ext_put'; did you mean 'skb_ext_put'? [-Werror=implicit-function-declaration]
114 | __skb_ext_put(psp_ext);
| ^~~~~~~~~~~~~
| skb_ext_put
cc1: some warnings being treated as errors
Add a stub to fix the build.
Fixes: 7d9351435e ("netdevsim: drop PSP ext ref on forward failure")
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Link: https://patch.msgid.link/20260324140857.783-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte
SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second
half of the SQE would be past the end of the sq_sqes array. The current
check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented
when a 128-byte SQE is encountered, not on every iteration. The actual
array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask
(the last slot) while the wrap check passes.
Fix by checking sq_idx directly. Keep the sq_head increment so the loop
still skips the second half of the 128-byte SQE on the next iteration.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the gmac0 is disabled, the precheck for a valid ingress device will
cause a NULL pointer deref and crash the system. This happens because
eth->netdev[0] will be NULL but the code will directly try to access
netdev_ops.
Instead of just checking for the first net_device, it must be checked if
any of the mtk_eth net_devices is matching the netdev_ops of the ingress
device.
Cc: stable@vger.kernel.org
Fixes: 73cfd947db ("net: ethernet: mtk_eth_soc: ppe: prevent ppe update for non-mtk devices")
Signed-off-by: Sven Eckelmann (Plasma Cloud) <se@simonwunderlich.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324-wed-crash-gmac0-disabled-v1-1-3bc388aee565@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MANA passes rxq->alloc_size to napi_build_skb() for all RX buffers.
It is correct for fragment-backed RX buffers, where alloc_size matches
the actual backing allocation used for each packet buffer. However, in
the non-fragment RX path mana allocates a full page, or a higher-order
page, per RX buffer. In that case alloc_size only reflects the usable
packet area and not the actual backing memory.
This causes napi_build_skb() to underestimate the skb backing allocation
in the single-buffer RX path, so skb->truesize is derived from a value
smaller than the real RX buffer allocation.
Fix this by updating alloc_size in the non-fragment RX path to the
actual backing allocation size before it is passed to napi_build_skb().
Fixes: 730ff06d3f ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/acLUhLpLum6qrD/N@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RCU-protected codepaths (mpls_forward, mpls_dump_routes) can have
an inconsistent view of platform_labels vs platform_label in case of a
concurrent resize (resize_platform_label_table, under
platform_mutex). This can lead to OOB accesses.
This patch adds a seqcount, so that we get a consistent snapshot.
Note that mpls_label_ok is also susceptible to this, so the check
against RTA_DST in rtm_to_route_config, done outside platform_mutex,
is not sufficient. This value gets passed to mpls_label_ok once more
in both mpls_route_add and mpls_route_del, so there is no issue, but
that additional check must not be removed.
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Fixes: dde1b38e87 ("mpls: Convert mpls_dump_routes() to RCU.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cd8fca15e3eb7e212b094064cd83652e20fd9d31.1774284088.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Couple more fixes:
- virt_wifi: remove SET_NETDEV_DEV to avoid UAF on teardown
- iwlwifi:
- fix (some) devices that don't have 6 GHz (WiFi6E)
- fix potential OOB read of firmware notification
- set WiFi generation for firmware to avoid packet drops
- fix multi-link scan timing
- wilc1000: fix integer overflow
- ath11k/ath12k: fix TID during A-MPDU session teardown
- wl1251: don't trust firmware TX status response index
* tag 'wireless-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
wifi: iwlwifi: mvm: fix potential out-of-bounds read in iwl_mvm_nd_match_info_handler()
wifi: wl1251: validate packet IDs before indexing tx_frames
wifi: wilc1000: fix u8 overflow in SSID scan buffer size calculation
wifi: ath12k: Pass the correct value of each TID during a stop AMPDU session
wifi: ath11k: Pass the correct value of each TID during a stop AMPDU session
wifi: iwlwifi: mld: correctly set wifi generation data
wifi: iwlwifi: mvm: don't send a 6E related command when not supported
wifi: iwlwifi: mld: Fix MLO scan timing
====================
Link: https://patch.msgid.link/20260326093329.77815-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull smb client fix from Steve French:
- Fix rebuild of mapping table
* tag 'v7.0-rc5-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb/client: ensure smb2_mapping_table rebuild on cmd changes
Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one in the core and one in the
conservative governor, and two issues related to system sleep:
- Restore the cpufreq core behavior changed inadvertently during the
6.19 development cycle to call cpufreq_frequency_table_cpuinfo()
for cpufreq policies getting re-initialized which ensures that
policy->max and policy->cpuinfo_max_freq will be valid going
forward (Viresh Kumar)
- Adjust the cached requested frequency in the conservative cpufreq
governor on policy limits changes to prevent it from becoming stale
in some cases (Viresh Kumar)
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)"
* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
cpufreq: conservative: Reset requested_freq on limits change
cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
Pull thermal control fix from Rafael Wysocki:
"This prevents the int340x thermal driver from taking the power slider
offset parameter into account incorrectly in some cases (Srinivas
Pandruvada)"
* tag 'thermal-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: soc_slider: Set offset only for balanced mode
Pull ACPI support fix from Rafael Wysocki:
"Prevent use-after-free from occurring on reduced-hardware ACPI
platforms when -EPROBE_DEFER is returned by ec_install_handlers()
during ACPI EC driver initialization (Weiming Shi)"
* tag 'acpi-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: EC: clean up handlers on probe failure in acpi_ec_setup()
Pull Landlock fixes from Mickaël Salaün:
"This mainly fixes Landlock TSYNC issues related to interrupts and
unexpected task exit.
Other fixes touch documentation and sample, and a new test extends
coverage"
* tag 'landlock-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
landlock: Expand restrict flags example for ABI version 8
selftests/landlock: Test tsync interruption and cancellation paths
landlock: Clean up interrupted thread logic in TSYNC
landlock: Serialize TSYNC thread restriction
samples/landlock: Bump ABI version to 8
landlock: Improve TSYNC types
landlock: Fully release unused TSYNC work entries
landlock: Fix formatting
Merge fixes related to system sleep for 7.0-rc6:
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)
* pm-sleep:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth, CAN, IPsec and Netfilter.
Notably, this includes the fix for the Bluetooth regression that you
were notified about. I'm not aware of any other pending regressions.
Current release - regressions:
- bluetooth:
- fix stack-out-of-bounds read in l2cap_ecred_conn_req
- fix regressions caused by reusing ident
- netfilter: revisit array resize logic
- eth: ice: set max queues in alloc_etherdev_mqs()
Previous releases - regressions:
- core: correctly handle tunneled traffic on IPV6_CSUM GSO fallback
- bluetooth:
- fix dangling pointer on mgmt_add_adv_patterns_monitor_complete
- fix deadlock in l2cap_conn_del()
- sched: codel: fix stale state for empty flows in fq_codel
- ipv6: remove permanent routes from tb6_gc_hlist when all exceptions expire.
- xfrm: fix skb_put() panic on non-linear skb during reassembly
- openvswitch:
- avoid releasing netdev before teardown completes
- validate MPLS set/set_masked payload length
- eth: iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
Previous releases - always broken:
- bluetooth: fix null-ptr-deref on l2cap_sock_ready_cb
- udp: fix wildcard bind conflict check when using hash2
- netfilter: fix use of uninitialized rtp_addr in process_sdp
- tls: Purge async_hold in tls_decrypt_async_wait()
- xfrm:
- prevent policy_hthresh.work from racing with netns teardown
- fix skb leak with espintcp and async crypto
- smc: fix double-free of smc_spd_priv when tee() duplicates splice pipe buffer
- can:
- add missing error handling to call can_ctrlmode_changelink()
- fix OOB heap access in cgw_csum_crc8_rel()
- eth:
- mana: fix use-after-free in add_adev() error path
- virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN
- bcmasp: fix double free of WoL irq"
* tag 'net-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (90 commits)
net: macb: use the current queue number for stats
netfilter: ctnetlink: use netlink policy range checks
netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
netfilter: nf_conntrack_expect: skip expectations in other netns via proc
netfilter: nf_conntrack_expect: store netns and zone in expectation
netfilter: ctnetlink: ensure safe access to master conntrack
netfilter: nf_conntrack_expect: use expect->helper
netfilter: nf_conntrack_expect: honor expectation helper field
netfilter: nft_set_rbtree: revisit array resize logic
netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
tls: Purge async_hold in tls_decrypt_async_wait()
selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
Bluetooth: btusb: clamp SCO altsetting table indices
Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
Bluetooth: L2CAP: Fix send LE flow credits in ACL link
net: mana: fix use-after-free in add_adev() error path
...
Pull pin control fixes from Linus Walleij:
- Implement .get_direction() in the spmi-gpio gpio_chip
Recent changes makes this start to print warnings and it's not nice,
let's just fix it
- Clamp the return value of gpio_get() in the Renesas RZA1 driver
- Add the GPIO_GENERIC dependency to the STM32 HDP driver
- Modify the Mediatek driver to accept devices that do not use external
interrupts (EINT) at all
- Fix flag propagation in the Sunxi driver, so that we can fix an issue
with uninitialized pins in a follow-up patch using said flags
* tag 'pinctrl-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: sunxi: fix gpiochip_lock_as_irq() failure when pinmux is unknown
pinctrl: sunxi: pass down flags to pinctrl routines
pinctrl: mediatek: common: Fix probe failure for devices without EINT
pinctrl: stm32: fix HDP driver dependency on GPIO_GENERIC
pinctrl: renesas: rza1: Normalize return value of gpio_get()
pinctrl: qcom: spmi-gpio: implement .get_direction()
pinctrl: renesas: rzt2h: Fix invalid wait context
pinctrl: renesas: rzt2h: Fix device node leak in rzt2h_gpio_register()
Pull dma-mapping fixes from Marek Szyprowski:
"A set of fixes for DMA-mapping subsystem, which resolve false-
positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"
* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: add missing `inline` for `dma_free_attrs`
mm/hmm: Indicate that HMM requires DMA coherency
RDMA/umem: Tell DMA mapping that UMEM requires coherency
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
dma-mapping: Introduce DMA require coherency attribute
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-debug: Allow multiple invocations of overlapping entries
dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().
This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.
[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732] __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777] __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)
Fix by adding rcu to __mpol_put().
Fixes: c042c50521 ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
Nicholas reported that his LLM found it was possible to create a UaF
when sys_futex_requeue() is used with different flags. The initial
motivation for allowing different flags was the variable sized futex,
but since that hasn't been merged (yet), simply mandate the flags are
identical, as is the case for the old style sys_futex() requeue
operations.
Fixes: 0f4b5f9722 ("futex: Add sys_futex_requeue()")
Reported-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
A previous commit changed the behaviour of the KVM_S390_VCPU_FAULT
ioctl. The current (wrong) implementation will trigger a guest
addressing exception if the requested address lies outside of a
memslot, unless the VM is UCONTROL.
Restore the previous behaviour by open coding the fault-in logic.
Fixes: 3762e905ec ("KVM: s390: use __kvm_faultin_pfn()")
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing, the guest page tables are write-protected, in order to
trap changes and properly unshadow the shadow mapping for the nested
guest. Already shadowed levels are skipped, so that only the needed
levels are write protected.
Currently the levels that get write protected are exactly one level too
deep: the last level (nested guest memory) gets protected in the wrong
way, and will be protected again correctly a few lines afterwards; most
importantly, the highest non-shadowed level does *not* get write
protected.
Moreover, if the nested guest is running in a real address space, there
are no DAT tables to shadow.
Write protect the correct levels, so that all the levels that need to
be protected are protected, and avoid double protecting the last level;
skip attempting to shadow the DAT tables when the nested guest is
running in a real address space.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If shadowing causes the shadow gmap to get unshadowed, exit early to
prevent an attempt to dereference the parent pointer, which at this
point is NULL.
Opportunistically add some more checks to prevent NULL parents.
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Fixes: e5f98a6899 ("KVM: s390: Add some helper functions needed for vSIE")
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In most cases gmap_put() was not called when it should have.
Add the missing gmap_put() in vsie_run().
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fix _do_shadow_pte() to use the correct pointer (guest pte instead of
nested guest) to set up the new pte.
Add a check to return -EOPNOTSUPP if the mapping for the nested guest
is writeable but the same page in the guest is only read-only.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Introduce a new special softbit for large pages, like already presend
for normal pages, and use it to mark guest mappings that do not have
struct pages.
Whenever a leaf DAT entry becomes dirty, check the special softbit and
only call SetPageDirty() if there is an actual struct page.
Move the logic to mark pages dirty inside _gmap_ptep_xchg() and
_gmap_crstep_xchg_atomic(), to avoid needlessly duplicating the code.
Fixes: 5a74e3d934 ("KVM: s390: KVM-specific bitfields and helper functions")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
The slow path of the fault handler ultimately called gmap_link(), which
assumed the fault was a major fault, and blindly called dat_link().
In case of minor faults, things were not always handled properly; in
particular the prefix and vsie marker bits were ignored.
Move dat_link() into gmap.c, renaming it accordingly. Once moved, the
new _gmap_link() function will be able to correctly honour the prefix
and vsie markers.
This will cause spurious unshadows in some uncommon cases.
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing a nested guest, a check is performed and no shadowing is
attempted if the nested guest is already shadowed.
The existing check was incomplete; fix it by also checking whether the
leaf DAT table entry in the existing shadow gmap has the same protection
as the one specified in the guest DAT entry.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In practice dat_crstep_xchg() is racy and hard to use correctly. Simply
remove it and replace its uses with dat_crstep_xchg_atomic().
This solves some actual races that lead to system hangs / crashes.
Opportunistically fix an alignment issue in _gmap_crstep_xchg_atomic().
Fixes: 589071eaaa ("KVM: s390: KVM page table management functions: clear and replace")
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If the guest misbehaves and puts the page tables for its nested guest
inside the memory of the nested guest itself, and the guest and nested
guest are being mapped with large pages, the shadow mapping will
lose synchronization with the actual mapping, since this will cause the
large page with the vsie notification bit to be split, but the
vsie notification bit will not be propagated to the resulting small
pages.
Fix this by propagating the vsie_notif bit from large pages to normal
pages when splitting a large page.
Fixes: 2db149a0a6 ("KVM: s390: KVM page table management functions: walks")
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Pablo Neira Ayuso says:
====================
Netfilter for net
This is v3, I kept back an ipset fix and another to tigthen the xtables
interface to reject invalid combinations with the NFPROTO_ARP family.
They need a bit more discussion. I fixed the issues reported by AI on
patch 9 (add #ifdef to access ct zone, update nf_conntrack_broadcast
and patch 10 (use better Fixes: tag). Thanks!
The following patchset contains Netfilter fixes for *net*.
Note that most bugs fixed here stem from 2.6 days, the large PR is not
due to an increase in regressions.
1) Fix incorrect reject of set updates with nf_tables pipapo set
avx2 backend. This comes with a regression test in patch 2.
From Florian Westphal.
2) nfnetlink_log needs to zero padding to prevent infoleak to userspace,
from Weiming Shi.
3) xtables ip6t_rt module never validated that addrnr length is within the
allowed array boundary. Reject bogus values. From Ren Wei.
4) Fix high memory usage in rbtree set backend that was unwanted side-effect
of the recently added binary search blob. From Pablo Neira Ayuso.
5) Patches 5 to 10, also from Pablo, address long-standing RCU safety bugs
in conntracks handling of expectations: We can never safely defer
a conntrack extension area without holding a reference. Yet expectation
handling does so in multiple places. Fix this by avoiding the need to
look into the master conntrack to begin with and by extending locked
sections in a few places.
11) Fix use of uninitialized rtp_addr in the sip conntrack helper,
also from Weiming Shi.
12) Add stricter netlink policy checks in ctnetlink, from David Carlier.
This avoids undefined behaviour when userspace provides huge wscale
value.
netfilter pull request 26-03-26
* tag 'nf-26-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ctnetlink: use netlink policy range checks
netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
netfilter: nf_conntrack_expect: skip expectations in other netns via proc
netfilter: nf_conntrack_expect: store netns and zone in expectation
netfilter: ctnetlink: ensure safe access to master conntrack
netfilter: nf_conntrack_expect: use expect->helper
netfilter: nf_conntrack_expect: honor expectation helper field
netfilter: nft_set_rbtree: revisit array resize logic
netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
====================
Link: https://patch.msgid.link/20260326125153.685915-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it). However, there's a tracepoint
which looks at it. Further, stream->front is actually redundant with
stream->subrequests.next.
Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.
Fixes: a0b4c7a491 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/4158599.1774426817@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Tony Nguyen says:
====================
For ice:
Michal corrects call to alloc_etherdev_mqs() to provide maximum number
of queues supported rather than currently allocated number of queues.
Petr Oros fixes issues related to some ethtool operations in switchdev
mode.
For iavf:
Kohei Enju corrects number of reported queues for ethtool statistics to
absolute max as using current number could race and cause out-of-bounds
issues.
For idpf:
Josh NULLs cdev_info pointer after freeing to prevent possible subsequent
improper access. He also defers setting of refillqs value until after
allocation to prevent possible NULL pointer dereference.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: only assign num refillqs if allocation was successful
idpf: clear stale cdev_info ptr
iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
ice: use ice_update_eth_stats() for representor stats
ice: fix inverted ready check for VF representors
ice: set max queues in alloc_etherdev_mqs()
====================
Link: https://patch.msgid.link/20260323205843.624704-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The xfile/xmbuf shmem file descriptions are no longer as detailed as
they were when online fsck was first merged, because moving to static
strings in commit 60382993a2 ("xfs: get rid of the
xchk_xfile_*_descr calls") removed a memory allocation and hence a
source of failure.
However this makes encoding the description in the tracepoints sort of a
waste of memory. David Laight also points out that file_path doesn't
zero the whole buffer which causes exposure of stale trace bytes, and
Steven Rostedt wonders why we're not using a dynamic array for the file
path.
I don't think this is worth fixing, so let's just rip it out.
Cc: rostedt@goodmis.org
Cc: david.laight.linux@gmail.com
Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/
Cc: stable@vger.kernel.org # v6.11
Fixes: 19ebc8f84e ("xfs: fix file_path handling in tracepoints")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xlog_recovery_iget* never set @ip to a valid pointer if they return
an error, so this irele will walk off a dangling pointer. Fix that.
Cc: stable@vger.kernel.org # v6.10
Fixes: ae673f534a ("xfs: record inode generation in xattr update log intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When displaying pending SQEs for a MIXED ring, each 128-byte SQE
increments sq_head to skip the second slot, but the loop counter is not
adjusted. This can cause the loop to read past sq_tail by one entry for
each 128-byte SQE encountered, displaying SQEs that haven't been made
consumable yet by the application.
Match the kernel's own consumption logic in io_init_req() which
decrements what's left when consuming the extra slot.
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a potential mismatch between the memory reserved for statistics
and the amount of memory written.
gem_get_sset_count() correctly computes the number of stats based on the
active queues, whereas gem_get_ethtool_stats() indiscriminately copies
data using the maximum number of queues, and in the case the number of
active queues is less than MACB_MAX_QUEUES, this results in a OOB write
as observed in the KASAN splat.
==================================================================
BUG: KASAN: vmalloc-out-of-bounds in gem_get_ethtool_stats+0x54/0x78
[macb]
Write of size 760 at addr ffff80008080b000 by task ethtool/1027
CPU: [...]
Tainted: [E]=UNSIGNED_MODULE
Hardware name: raspberrypi rpi/rpi, BIOS 2025.10 10/01/2025
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x80/0xf8
print_report+0x384/0x5e0
kasan_report+0xa0/0xf0
kasan_check_range+0xe8/0x190
__asan_memcpy+0x54/0x98
gem_get_ethtool_stats+0x54/0x78 [macb
926c13f3af83b0c6fe64badb21ec87d5e93fcf65]
dev_ethtool+0x1220/0x38c0
dev_ioctl+0x4ac/0xca8
sock_do_ioctl+0x170/0x1d8
sock_ioctl+0x484/0x5d8
__arm64_sys_ioctl+0x12c/0x1b8
invoke_syscall+0xd4/0x258
el0_svc_common.constprop.0+0xb4/0x240
do_el0_svc+0x48/0x68
el0_svc+0x40/0xf8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1b0/0x1b8
The buggy address belongs to a 1-page vmalloc region starting at
0xffff80008080b000 allocated at dev_ethtool+0x11f0/0x38c0
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff00000a333000 pfn:0xa333
flags: 0x7fffc000000000(node=0|zone=0|lastcpupid=0x1ffff)
raw: 007fffc000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff00000a333000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff80008080b080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff80008080b100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff80008080b180: 00 00 00 00 00 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
^
ffff80008080b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
ffff80008080b280: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
==================================================================
Fix it by making sure the copied size only considers the active number of
queues.
Fixes: 512286bbd4 ("net: macb: Added some queue statistics")
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260323191634.2185840-1-pvalerio@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- L2CAP: Fix deadlock in l2cap_conn_del()
- L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
- L2CAP: Fix send LE flow credits in ACL link
- btintel: serialize btintel_hw_error() with hci_req_sync_lock
- btusb: clamp SCO altsetting table indices
* tag 'for-net-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btusb: clamp SCO altsetting table indices
Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
Bluetooth: L2CAP: Fix send LE flow credits in ACL link
====================
Link: https://patch.msgid.link/20260325194358.618892-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The error path through vfio_pci_core_feature_dma_buf() ignores its
own advice to only use dma_buf_put() after dma_buf_export(), instead
falling through the entire unwind chain. In the unlikely event that
we encounter file descriptor exhaustion, this can result in an
unbalanced refcount on the vfio device and double free of allocated
objects.
Avoid this by moving the "put" directly into the error path and return
the errno rather than entering the unwind chain.
Reported-by: Renato Marziano <renato@marziano.top>
Fixes: 5d74781ebc ("vfio/pci: Add dma-buf export support for MMIO regions")
Cc: stable@vger.kernel.org
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Link: https://lore.kernel.org/r/20260323215659.2108191-3-alex.williamson@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex@shazbot.org>
Replace manual range and mask validations with netlink policy
annotations in ctnetlink code paths, so that the netlink core rejects
invalid values early and can generate extack errors.
- CTA_PROTOINFO_TCP_STATE: reject values > TCP_CONNTRACK_SYN_SENT2 at
policy level, removing the manual >= TCP_CONNTRACK_MAX check.
- CTA_PROTOINFO_TCP_WSCALE_ORIGINAL/REPLY: reject values > TCP_MAX_WSCALE
(14). The normal TCP option parsing path already clamps to this value,
but the ctnetlink path accepted 0-255, causing undefined behavior when
used as a u32 shift count.
- CTA_FILTER_ORIG_FLAGS/REPLY_FLAGS: use NLA_POLICY_MASK with
CTA_FILTER_F_ALL, removing the manual mask checks.
- CTA_EXPECT_FLAGS: use NLA_POLICY_MASK with NF_CT_EXPECT_MASK, adding
a new mask define grouping all valid expect flags.
Extracted from a broader nf-next patch by Florian Westphal, scoped to
ctnetlink for the fixes tree.
Fixes: c8e2078cfe ("[NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling")
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
process_sdp() declares union nf_inet_addr rtp_addr on the stack and
passes it to the nf_nat_sip sdp_session hook after walking the SDP
media descriptions. However rtp_addr is only initialized inside the
media loop when a recognized media type with a non-zero port is found.
If the SDP body contains no m= lines, only inactive media sections
(m=audio 0 ...) or only unrecognized media types, rtp_addr is never
assigned. Despite that, the function still calls hooks->sdp_session()
with &rtp_addr, causing nf_nat_sdp_session() to format the stale stack
value as an IP address and rewrite the SDP session owner and connection
lines with it.
With CONFIG_INIT_STACK_ALL_ZERO (default on most distributions) this
results in the session-level o= and c= addresses being rewritten to
0.0.0.0 for inactive SDP sessions. Without stack auto-init the
rewritten address is whatever happened to be on the stack.
Fix this by pre-initializing rtp_addr from the session-level connection
address (caddr) when available, and tracking via a have_rtp_addr flag
whether any valid address was established. Skip the sdp_session hook
entirely when no valid address exists.
Fixes: 4ab9e64e5e ("[NETFILTER]: nf_nat_sip: split up SDP mangling")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Skip expectations that do not reside in this netns.
Similar to e77e6ff502 ("netfilter: conntrack: do not dump other netns's
conntrack entries via proc").
Fixes: 9b03f38d04 ("netfilter: netns nf_conntrack: per-netns expectations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a teardown order issue in the driver. The SPI controller is
registered using devm_spi_register_controller(), which delays
unregistration of the SPI controller until after the fsl_lpspi_remove()
function returns.
As the fsl_lpspi_remove() function synchronously tears down the DMA
channels, a running SPI transfer triggers the following NULL pointer
dereference due to use after free:
| fsl_lpspi 42550000.spi: I/O Error in DMA RX
| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[...]
| Call trace:
| fsl_lpspi_dma_transfer+0x260/0x340 [spi_fsl_lpspi]
| fsl_lpspi_transfer_one+0x198/0x448 [spi_fsl_lpspi]
| spi_transfer_one_message+0x49c/0x7c8
| __spi_pump_transfer_message+0x120/0x420
| __spi_sync+0x2c4/0x520
| spi_sync+0x34/0x60
| spidev_message+0x20c/0x378 [spidev]
| spidev_ioctl+0x398/0x750 [spidev]
[...]
Switch from devm_spi_register_controller() to spi_register_controller() in
fsl_lpspi_probe() and add the corresponding spi_unregister_controller() in
fsl_lpspi_remove().
Fixes: 5314987de5 ("spi: imx: add lpspi bus driver")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://patch.msgid.link/20260319-spi-fsl-lpspi-fixes-v1-1-b433e435b2d8@pengutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under
rcu_read_lock() but they dereference the master conntrack via
exp->master.
Since the expectation does not hold a reference on the master conntrack,
this could be dying conntrack or different recycled conntrack than the
real master due to SLAB_TYPESAFE_RCU.
Store the netns, the master_tuple and the zone in struct
nf_conntrack_expect as a safety measure.
This patch is required by the follow up fix not to dump expectations
that do not belong to this netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Holding reference on the expectation is not sufficient, the master
conntrack object can just go away, making exp->master invalid.
To access exp->master safely:
- Grab the nf_conntrack_expect_lock, this gets serialized with
clean_from_lists() which also holds this lock when the master
conntrack goes away.
- Hold reference on master conntrack via nf_conntrack_find_get().
Not so easy since the master tuple to look up for the master conntrack
is not available in the existing problematic paths.
This patch goes for extending the nf_conntrack_expect_lock section
to address this issue for simplicity, in the cases that are described
below this is just slightly extending the lock section.
The add expectation command already holds a reference to the master
conntrack from ctnetlink_create_expect().
However, the delete expectation command needs to grab the spinlock
before looking up for the expectation. Expand the existing spinlock
section to address this to cover the expectation lookup. Note that,
the nf_ct_expect_iterate_net() calls already grabs the spinlock while
iterating over the expectation table, which is correct.
The get expectation command needs to grab the spinlock to ensure master
conntrack does not go away. This also expands the existing spinlock
section to cover the expectation lookup too. I needed to move the
netlink skb allocation out of the spinlock to keep it GFP_KERNEL.
For the expectation events, the IPEXP_DESTROY event is already delivered
under the spinlock, just move the delivery of IPEXP_NEW under the
spinlock too because the master conntrack event cache is reached through
exp->master.
While at it, add lockdep notations to help identify what codepaths need
to grab the spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use expect->helper in ctnetlink and /proc to dump the helper name.
Using nfct_help() without holding a reference to the master conntrack
is unsafe.
Use exp->master->helper in ctnetlink path if userspace does not provide
an explicit helper when creating an expectation to retain the existing
behaviour. The ctnetlink expectation path holds the reference on the
master conntrack and nf_conntrack_expect lock and the nfnetlink glue
path refers to the master ct that is attached to the skb.
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The expectation helper field is mostly unused. As a result, the
netfilter codebase relies on accessing the helper through exp->master.
Always set on the expectation helper field so it can be used to reach
the helper.
nf_ct_expect_init() is called from packet path where the skb owns
the ct object, therefore accessing exp->master for the newly created
expectation is safe. This saves a lot of updates in all callsites
to pass the ct object as parameter to nf_ct_expect_init().
This is a preparation patches for follow up fixes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chris Arges reports high memory consumption with thousands of
containers, this patch revisits the array allocation logic.
For anonymous sets, start by 16 slots (which takes 256 bytes on x86_64).
Expand it by x2 until threshold of 512 slots is reached, over that
threshold, expand it by x1.5.
For non-anonymous set, start by 1024 slots in the array (which takes 16
Kbytes initially on x86_64). Expand it by x1.5.
Use set->ndeact to subtract deactivated elements when calculating the
number of the slots in the array, otherwise the array size array gets
increased artifically. Add special case shrink logic to deal with flush
set too.
The shrink logic is skipped by anonymous sets.
Use check_add_overflow() to calculate the new array size.
Add a WARN_ON_ONCE check to make sure elements fit into the new array
size.
Reported-by: Chris Arges <carges@cloudflare.com>
Fixes: 7e43e0a114 ("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject rt match rules whose addrnr exceeds IP6T_RT_HOPS.
rt_mt6() expects addrnr to stay within the bounds of rtinfo->addrs[].
Validate addrnr during rule installation so malformed rules are rejected
before the match logic can use an out-of-range value.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Yuhang Zheng <z1652074432@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
__build_packet_message() manually constructs the NFULA_PAYLOAD netlink
attribute using skb_put() and skb_copy_bits(), bypassing the standard
nla_reserve()/nla_put() helpers. While nla_total_size(data_len) bytes
are allocated (including NLA alignment padding), only data_len bytes
of actual packet data are copied. The trailing nla_padlen(data_len)
bytes (1-3 when data_len is not 4-byte aligned) are never initialized,
leaking stale heap contents to userspace via the NFLOG netlink socket.
Replace the manual attribute construction with nla_reserve(), which
handles the tailroom check, header setup, and padding zeroing via
__nla_reserve(). The subsequent skb_copy_bits() fills in the payload
data on top of the properly initialized attribute.
Fixes: df6fb868d6 ("[NETFILTER]: nfnetlink: convert to generic netlink attribute functions")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The sub-device state lock has been already acquired when ccs_init_state()
is called. Do not try to acquire it again.
Reported-by: David Heidelberg <david@ixit.cz>
Fixes: a88883d120 ("media: ccs: Rely on sub-device state locking")
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
The SPI API is asymmetric and the controller is freed as part of
deregistration (unless it has been allocated using
devm_spi_alloc_host/target()).
A recent change converting the managed registration function to use
devm_add_action_or_reset() inadvertently introduced a (mostly
theoretical) regression where a non-devres managed controller could be
freed as part of failed registration. This in turn would lead to
use-after-free in controller driver error paths.
Fix this by taking another reference before calling
devm_add_action_or_reset() and not releasing it on errors for
non-devres allocated controllers.
An alternative would be a partial revert of the offending commit, but
it is better to handle this explicitly until the API has been fixed
(e.g. see 5e844cc37a ("spi: Introduce device-managed SPI controller
allocation")).
Fixes: b6376dbed8 ("spi: Simplify devm_spi_*_controller()")
Reported-by: Felix Gu <ustc.gu@gmail.com>
Link: https://lore.kernel.org/all/20260324145548.139952-1-ustc.gu@gmail.com/
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260325145319.1132072-1-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Jihed Chaibi <jihed.chaibi.dev@gmail.com> says:
adau1372_set_power() had two related error handling issues in its enable
path: clk_prepare_enable() was called but its return value discarded, and
adau1372_enable_pll() was a void function that silently swallowed lock
failures, leaving mclk enabled and adau1372->enabled set to true despite
the device being in a broken state.
Patch 1 fixes the unchecked clk_prepare_enable() by making
adau1372_set_power() return int and propagating the error.
Patch 2 converts adau1372_enable_pll() to return int and adds a full
unwind in adau1372_set_power() if PLL lock fails, reversing the regcache,
GPIO power-down, and clock state.
adau1372_enable_pll() was a void function that logged a dev_err() on
PLL lock timeout but did not propagate the error. As a result,
adau1372_set_power() would continue with adau1372->enabled set to true
despite the PLL being unlocked, and the mclk left enabled with no
corresponding disable on the error path.
Convert adau1372_enable_pll() to return int, using -ETIMEDOUT on lock
timeout and propagating regmap errors directly. In adau1372_set_power(),
check the return value and unwind in reverse order: restore regcache to
cache-only mode, reassert GPIO power-down, and disable the clock before
returning the error.
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: 6cd4c6459e ("ASoC: Add ADAU1372 audio CODEC support")
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Link: https://patch.msgid.link/20260325210704.76847-3-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
adau1372_set_power() calls clk_prepare_enable() but discards the return
value. If the clock enable fails, the driver proceeds to access registers
on unpowered hardware, potentially causing silent corruption.
Make adau1372_set_power() return int and propagate the error from
clk_prepare_enable(). Update adau1372_set_bias_level() to return the
error directly for the STANDBY and OFF cases.
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: 6cd4c6459e ("ASoC: Add ADAU1372 audio CODEC support")
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Link: https://patch.msgid.link/20260325210704.76847-2-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The async_hold queue pins encrypted input skbs while
the AEAD engine references their scatterlist data. Once
tls_decrypt_async_wait() returns, every AEAD operation
has completed and the engine no longer references those
skbs, so they can be freed unconditionally.
A subsequent patch adds batch async decryption to
tls_sw_read_sock(), introducing a new call site that
must drain pending AEAD operations and release held
skbs. Move __skb_queue_purge(&ctx->async_hold) into
tls_decrypt_async_wait() so the purge is centralized
and every caller -- recvmsg's drain path, the -EBUSY
fallback in tls_do_decryption(), and the new read_sock
batch path -- releases held skbs on synchronization
without each site managing the purge independently.
This fixes a leak when tls_strp_msg_hold() fails part-way through,
after having added some cloned skbs to the async_hold
queue. tls_decrypt_sg() will then call tls_decrypt_async_wait() to
process all pending decrypts, and drop back to synchronous mode, but
tls_sw_recvmsg() only flushes the async_hold queue when one record has
been processed in "fully-async" mode, which may not be the case here.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: b8a6ff84ab ("tls: wait for pending async decryptions if tls_strp_msg_hold fails")
Link: https://patch.msgid.link/20260324-tls-read-sock-v5-1-5408befe5774@oracle.com
[pabeni@redhat.com: added leak comment]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
proc_do_large_bitmap() does not initialize variable c, which is expected
to be set to a trailing character by proc_get_long().
However, proc_get_long() only sets c when the input buffer contains a
trailing character after the parsed value.
If c is not initialized it may happen to contain a '-'. If this is the
case proc_do_large_bitmap() expects to be able to parse a second part of
the input buffer. If there is no second part an unjustified -EINVAL will
be returned.
Initialize c to 0 to prevent returning -EINVAL on valid input.
Fixes: 9f977fb7ae ("sysctl: add proc_do_large_bitmap")
Signed-off-by: Marc Buerg <buermarc@googlemail.com>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Commit 453b8fb68f ("xen/privcmd: restrict usage in
unprivileged domU") added a xenstore notifier to defer setting the
restriction target until Xenstore is ready.
XEN_PRIVCMD can be built as a module, but privcmd_exit() leaves that
notifier behind. Balance the notifier lifecycle by unregistering it on
module exit.
This is harmless even if xenstore was already ready at registration
time and the notifier was never queued on the chain.
Fixes: 453b8fb68f ("xen/privcmd: restrict usage in unprivileged domU")
Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260325120246.252899-1-zhaoguohan@kylinos.cn>
In function kvm_eiointc_regs_access(), the register base address is
caculated from array base address plus offset, the offset is absolute
value from the base address. The data type of array base address is
u64, it should be converted into the "void *" type and then plus the
offset.
Cc: <stable@vger.kernel.org>
Fixes: d3e43a1f34 ("LoongArch: KVM: Use 64-bit register definition for EIOINTC").
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
EIOINTC's coremap in eiointc_update_sw_coremap() can be empty, currently
we get a cpuid with -1 in this case, but we actually need 0 because it's
similar as the case that cpuid >= 4.
This fix an out-of-bounds access to kvm_arch::phyid_map::phys_map[].
Cc: <stable@vger.kernel.org>
Fixes: 3956a52bc0 ("LoongArch: KVM: Add EIOINTC read and write functions")
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
With -fno-asynchronous-unwind-tables and --no-eh-frame-hdr (the default
of the linker), the GNU_EH_FRAME segment (specified by vdso.lds.S) is
empty. This is not valid, as the current DWARF specification mandates
the first byte of the EH frame to be the version number 1. It causes
some unwinders to complain, for example the ClickHouse query profiler
spams the log with messages:
clickhouse-server[365854]: libunwind: unsupported .eh_frame_hdr
version: 127 at 7ffffffb0000
Here "127" is just the byte located at the p_vaddr (0, i.e. the
beginning of the vDSO) of the empty GNU_EH_FRAME segment. Cross-
checking with /proc/365854/maps has also proven 7ffffffb0000 is the
start of vDSO in the process VM image.
In LoongArch the -fno-asynchronous-unwind-tables option seems just a
MIPS legacy, and MIPS only uses this option to satisfy the MIPS-specific
"genvdso" program, per the commit cfd75c2db1 ("MIPS: VDSO: Explicitly
use -fno-asynchronous-unwind-tables"). IIRC it indicates some inherent
limitation of the MIPS ELF ABI and has nothing to do with LoongArch. So
we can simply flip it over to -fasynchronous-unwind-tables and pass
--eh-frame-hdr for linking the vDSO, allowing the profilers to unwind the
stack for statistics even if the sample point is taken when the PC is in
the vDSO.
However simply adjusting the options above would exploit an issue: when
the libgcc unwinder saw the invalid GNU_EH_FRAME segment, it silently
falled back to a machine-specific routine to match the code pattern of
rt_sigreturn() and extract the registers saved in the sigframe if the
code pattern is matched. As unwinding from signal handlers is vital for
libgcc to support pthread cancellation etc., the fall-back routine had
been silently keeping the LoongArch Linux systems functioning since
Linux 5.19. But when we start to emit GNU_EH_FRAME with the correct
format, fall-back routine will no longer be used and libgcc will fail
to unwind the sigframe, and unwinding from signal handlers will no
longer work, causing dozens of glibc test failures. To make it possible
to unwind from signal handlers again, it's necessary to code the unwind
info in __vdso_rt_sigreturn via .cfi_* directives.
The offsets in the .cfi_* directives depend on the layout of struct
sigframe, notably the offset of sigcontext in the sigframe. To use the
offset in the assembly file, factor out struct sigframe into a header to
allow asm-offsets.c to output the offset for assembly.
To work around a long-term issue in the libgcc unwinder (the pc is
unconditionally substracted by 1: doing so is technically incorrect for
a signal frame), a nop instruction is included with the two real
instructions in __vdso_rt_sigreturn in the same FDE PC range. The same
hack has been used on x86 for a long time.
Cc: stable@vger.kernel.org
Fixes: c6b99bed6b ("LoongArch: Add VDSO and VSYSCALL support")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
1. Hardware limitation: GPU, DC and VPU are typically PCI device 06.0,
06.1 and 06.2. They share some hardware resources, so when configure the
PCI 06.0 device BAR1, DMA memory access cannot be performed through this
BAR, otherwise it will cause hardware abnormalities.
2. In typical scenarios of reboot or S3/S4, DC access to memory through
BAR is not prohibited, resulting in GPU DMA hangs.
3. Workaround method: When configuring the 06.0 device BAR1, turn off
the memory access of DC, GPU and VPU (via DC's CRTC registers).
Cc: stable@vger.kernel.org
Signed-off-by: Qianhai Wu <wuqianhai@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
1. Replace "of_find_node_by_path("/")" with "of_root" to avoid multiple
calls to "of_node_put()".
2. Fix a potential kernel oops during early boot when memory allocation
fails while parsing CPU model from device tree.
Cc: stable@vger.kernel.org
Signed-off-by: Li Jun <lijun01@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Pull erofs fixes from Gao Xiang:
- Mark I/Os as failed when encountering short reads on file-backed
mounts
- Label GFP_NOIO in the BIO completion when the completion is in the
process context, and directly call into the decompression to avoid
deadlocks
- Improve Kconfig descriptions to better highlight the overall efforts
- Fix .fadvise() for page cache sharing
* tag 'erofs-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix .fadvise() for page cache sharing
erofs: update the Kconfig description
erofs: add GFP_NOIO in the bio completion if needed
erofs: set fileio bio failed in short read case
Pull RCU fixes from Boqun Feng:
"Fix a regression introduced by commit c27cea4416 ("rcu: Re-implement
RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with
preemption disabled or scheduler locks held, so call_srcu() must work
in all such contexts.
Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding
scheduler lock acquisition in call_srcu() by deferring to an irq_work
(similar to call_rcu_tasks_generic()), for both tree SRCU and tiny
SRCU.
Also fix a follow-on lockdep splat caused by srcu_node allocation
under the newly introduced raw spinlock by deferring the allocation to
grace-period worker context"
* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
srcu: Use irq_work to start GP in tiny SRCU
rcu: Use an intermediate irq_work to start process_srcu()
srcu: Push srcu_node allocation to GP when non-preemptible
srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
When a compound request consists of QUERY_DIRECTORY + QUERY_INFO
(FILE_ALL_INFORMATION) and the first command consumes nearly the entire
max_trans_size, get_file_all_info() would blindly call smbConvertToUTF16()
with PATH_MAX, causing out-of-bounds write beyond the response buffer.
In get_file_all_info(), there was a missing validation check for
the client-provided OutputBufferLength before copying the filename into
FileName field of the smb2_file_all_info structure.
If the filename length exceeds the available buffer space, it could lead to
potential buffer overflows or memory corruption during smbConvertToUTF16
conversion. This calculating the actual free buffer size using
smb2_calc_max_out_buf_len() and returning -EINVAL if the buffer is
insufficient and updating smbConvertToUTF16 to use the actual filename
length (clamped by PATH_MAX) to ensure a safe copy operation.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Pull hardening fixes from Kees Cook:
- fix required Clang version for CC_HAS_COUNTED_BY_PTR (Nathan
Chancellor)
- update Coccinelle script used for kmalloc_obj
* tag 'hardening-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
init/Kconfig: Require a release version of clang-22 for CC_HAS_COUNTED_BY_PTR
coccinelle: kmalloc_obj: Remove default GFP_KERNEL arg
Pull x86 platform driver fixes from Ilpo Järvinen:
"Fixes and New HW Support. The trivial drop of unused gz_chain_head is
not exactly fixes material but it allows other work to avoid problems
so I decided to take it in along with the fixes.
- amd/hsmp: Fix typo in error message
- asus-armoury: Add support for G614FP, GA503QM, GZ302EAC, and GZ302EAC
- asus-nb-wmi: Add DMI quirk for ASUS ROG Flow Z13-KJP GZ302EAC
- hp-wmi: Support for Omen 16-k0xxx, 16-wf1xxx, 16-xf0xxx
- intel-hid: Disable wakeup_mode during hibernation
- ISST:
- Check HWP support before MSR access
- Correct locked bit width
- lenovo: wmi-gamezone: Drop unused gz_chain_head
- olpc-xo175-ec: Fix overflow error message"
* tag 'platform-drivers-x86-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: ISST: Correct locked bit width
platform/x86: intel-hid: disable wakeup_mode during hibernation
platform/x86: asus-armoury: add support for GZ302EA and GZ302EAC
platform/x86: asus-nb-wmi: add DMI quirk for ASUS ROG Flow Z13-KJP GZ302EAC
platform/x86/amd/hsmp: Fix typo in error message
platform/olpc: olpc-xo175-ec: Fix overflow error message to print inlen
platform/x86: lenovo: wmi-gamezone: Drop gz_chain_head
platform/x86: ISST: Check HWP support before MSR access
platform/x86: hp-wmi: Add support for Omen 16-k0xxx (8A4D)
platform/x86: hp-wmi: Add support for Omen 16-wf1xxx (8C76)
platform/x86: hp-wmi: Add Omen 16-xf0xxx (8BCA) support
platform/x86: asus-armoury: add support for G614FP
platform/x86: asus-armoury: add support for GA503QM
MAINTAINERS: change email address of Denis Benato
This test will fail without
the preceding commit ("netfilter: nft_set_pipapo_avx2: fix match retart if found element is expired"):
reject overlapping range on add 0s [ OK ]
reload with flush /dev/stdin:59:32-52: Error: Could not process rule: File exists
add element inet filter test { 10.0.0.29 . 10.0.2.29 }
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
New test case fails unexpectedly when avx2 matching functions are used.
The test first loads a ranomly generated pipapo set
with 'ipv4 . port' key, i.e. nft -f foo.
This works. Then, it reloads the set after a flush:
(echo flush set t s; cat foo) | nft -f -
This is expected to work, because its the same set after all and it was
already loaded once.
But with avx2, this fails: nft reports a clashing element.
The reported clash is of following form:
We successfully re-inserted
a . b
c . d
Then we try to insert a . d
avx2 finds the already existing a . d, which (due to 'flush set') is marked
as invalid in the new generation. It skips the element and moves to next.
Due to incorrect masking, the skip-step finds the next matching
element *only considering the first field*,
i.e. we return the already reinserted "a . b", even though the
last field is different and the entry should not have been matched.
No such error is reported for the generic c implementation (no avx2) or when
the last field has to use the 'nft_pipapo_avx2_lookup_slow' fallback.
Bisection points to
7711f4bb4b ("netfilter: nft_set_pipapo: fix range overlap detection")
but that fix merely uncovers this bug.
Before this commit, the wrong element is returned, but erronously
reported as a full, identical duplicate.
The root-cause is too early return in the avx2 match functions.
When we process the last field, we should continue to process data
until the entire input size has been consumed to make sure no stale
bits remain in the map.
Link: https://lore.kernel.org/netfilter-devel/20260321152506.037f68c0@elisabeth/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The regulator operations pmbus_regulator_get_voltage(),
pmbus_regulator_set_voltage(), and pmbus_regulator_list_voltage()
access PMBus registers and shared data but were not protected by
the update_lock mutex. This could lead to race conditions.
However, adding mutex protection directly to these functions causes
a deadlock because pmbus_regulator_notify() (which calls
regulator_notifier_call_chain()) is often called with the mutex
already held (e.g., from pmbus_fault_handler()). If a regulator
callback then calls one of the now-protected voltage functions,
it will attempt to acquire the same mutex.
Rework pmbus_regulator_notify() to utilize a worker function to
send notifications outside of the mutex protection. Events are
stored as atomics in a per-page bitmask and processed by the worker.
Initialize the worker and its associated data during regulator
registration, and ensure it is cancelled on device removal using
devm_add_action_or_reset().
While at it, remove the unnecessary include of linux/of.h.
Cc: Sanman Pradhan <psanman@juniper.net>
Fixes: ddbb4db4ce ("hwmon: (pmbus) Add regulator support")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Attributes intended to clear sensor history are intended to be writeable
only. Reading those attributes today results in reporting more or less
random values. To avoid ABI surprises, have those attributes explicitly
return 0 when reading.
Fixes: 787c095eda ("hwmon: (pmbus/core) Add support for rated attributes")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Writing those attributes is not supported, so mark them as read-only.
Prior to this change, attempts to write into these attributes returned
an error.
Mark boolean fields in struct pmbus_limit_attr and in struct
pmbus_sensor_attr as bit fields to reduce configuration data size.
The data is scanned only while probing, so performance is not a concern.
Fixes: 6f183d33a0 ("hwmon: (pmbus) Add support for peak attributes")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Currently we execute `SET_NETDEV_DEV(dev, &priv->lowerdev->dev)` for
the virt_wifi net devices. However, unregistering a virt_wifi device in
netdev_run_todo() can happen together with the device referenced by
SET_NETDEV_DEV().
It can result in use-after-free during the ethtool operations performed
on a virt_wifi device that is currently being unregistered. Such a net
device can have the `dev.parent` field pointing to the freed memory,
but ethnl_ops_begin() calls `pm_runtime_get_sync(dev->dev.parent)`.
Let's remove SET_NETDEV_DEV for virt_wifi to avoid bugs like this:
==================================================================
BUG: KASAN: slab-use-after-free in __pm_runtime_resume+0xe2/0xf0
Read of size 2 at addr ffff88810cfc46f8 by task pm/606
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x70
print_report+0x170/0x4f3
? __pfx__raw_spin_lock_irqsave+0x10/0x10
kasan_report+0xda/0x110
? __pm_runtime_resume+0xe2/0xf0
? __pm_runtime_resume+0xe2/0xf0
__pm_runtime_resume+0xe2/0xf0
ethnl_ops_begin+0x49/0x270
ethnl_set_features+0x23c/0xab0
? __pfx_ethnl_set_features+0x10/0x10
? kvm_sched_clock_read+0x11/0x20
? local_clock_noinstr+0xf/0xf0
? local_clock+0x10/0x30
? kasan_save_track+0x25/0x60
? __kasan_kmalloc+0x7f/0x90
? genl_family_rcv_msg_attrs_parse.isra.0+0x150/0x2c0
genl_family_rcv_msg_doit+0x1e7/0x2c0
? __pfx_genl_family_rcv_msg_doit+0x10/0x10
? __pfx_cred_has_capability.isra.0+0x10/0x10
? stack_trace_save+0x8e/0xc0
genl_rcv_msg+0x411/0x660
? __pfx_genl_rcv_msg+0x10/0x10
? __pfx_ethnl_set_features+0x10/0x10
netlink_rcv_skb+0x121/0x380
? __pfx_genl_rcv_msg+0x10/0x10
? __pfx_netlink_rcv_skb+0x10/0x10
? __pfx_down_read+0x10/0x10
genl_rcv+0x23/0x30
netlink_unicast+0x60f/0x830
? __pfx_netlink_unicast+0x10/0x10
? __pfx___alloc_skb+0x10/0x10
netlink_sendmsg+0x6ea/0xbc0
? __pfx_netlink_sendmsg+0x10/0x10
? __futex_queue+0x10b/0x1f0
____sys_sendmsg+0x7a2/0x950
? copy_msghdr_from_user+0x26b/0x430
? __pfx_____sys_sendmsg+0x10/0x10
? __pfx_copy_msghdr_from_user+0x10/0x10
___sys_sendmsg+0xf8/0x180
? __pfx____sys_sendmsg+0x10/0x10
? __pfx_futex_wait+0x10/0x10
? fdget+0x2e4/0x4a0
__sys_sendmsg+0x11f/0x1c0
? __pfx___sys_sendmsg+0x10/0x10
do_syscall_64+0xe2/0x570
? exc_page_fault+0x66/0xb0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
This fix may be combined with another one in the ethtool subsystem:
https://lore.kernel.org/all/20260322075917.254874-1-alex.popov@linux.com/T/#u
Fixes: d43c65b05b ("ethtool: runtime-resume netdev parent in ethnl_ops_begin")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260324224607.374327-1-alex.popov@linux.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
btusb_work() maps the number of active SCO links to USB alternate
settings through a three-entry lookup table when CVSD traffic uses
transparent voice settings. The lookup currently indexes alts[] with
data->sco_num - 1 without first constraining sco_num to the number of
available table entries.
While the table only defines alternate settings for up to three SCO
links, data->sco_num comes from hci_conn_num() and is used directly.
Cap the lookup to the last table entry before indexing it so the
driver keeps selecting the highest supported alternate setting without
reading past alts[].
Fixes: baac6276c0 ("Bluetooth: btusb: handle mSBC audio over USB Endpoints")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_config_req() processes CONFIG_REQ for channels in BT_CONNECTED
state to support L2CAP reconfiguration (e.g. MTU changes). However,
since both CONF_INPUT_DONE and CONF_OUTPUT_DONE are already set from
the initial configuration, the reconfiguration path falls through to
l2cap_ertm_init(), which re-initializes tx_q, srej_q, srej_list, and
retrans_list without freeing the previous allocations and sets
chan->sdu to NULL without freeing the existing skb. This leaks all
previously allocated ERTM resources.
Additionally, l2cap_parse_conf_req() does not validate the minimum
value of remote_mps derived from the RFC max_pdu_size option. A zero
value propagates to l2cap_segment_sdu() where pdu_len becomes zero,
causing the while loop to never terminate since len is never
decremented, exhausting all available memory.
Fix the double-init by skipping l2cap_ertm_init() and
l2cap_chan_ready() when the channel is already in BT_CONNECTED state,
while still allowing the reconfiguration parameters to be updated
through l2cap_parse_conf_req(). Also add a pdu_len zero check in
l2cap_segment_sdu() as a safeguard.
Fixes: 96298f6401 ("Bluetooth: L2CAP: handle l2cap config request during open state")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_conn_del() calls cancel_delayed_work_sync() for both info_timer
and id_addr_timer while holding conn->lock. However, the work functions
l2cap_info_timeout() and l2cap_conn_update_id_addr() both acquire
conn->lock, creating a potential AB-BA deadlock if the work is already
executing when l2cap_conn_del() takes the lock.
Move the work cancellations before acquiring conn->lock and use
disable_delayed_work_sync() to additionally prevent the works from
being rearmed after cancellation, consistent with the pattern used in
hci_conn_del().
Fixes: ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in hci_chan_del")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
btintel_hw_error() issues two __hci_cmd_sync() calls (HCI_OP_RESET
and Intel exception-info retrieval) without holding
hci_req_sync_lock(). This lets it race against
hci_dev_do_close() -> btintel_shutdown_combined(), which also runs
__hci_cmd_sync() under the same lock. When both paths manipulate
hdev->req_status/req_rsp concurrently, the close path may free the
response skb first, and the still-running hw_error path hits a
slab-use-after-free in kfree_skb().
Wrap the whole recovery sequence in hci_req_sync_lock/unlock so it
is serialized with every other synchronous HCI command issuer.
Below is the data race report and the kasan report:
BUG: data-race in __hci_cmd_sync_sk / btintel_shutdown_combined
read of hdev->req_rsp at net/bluetooth/hci_sync.c:199
by task kworker/u17:1/83:
__hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
__hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
btintel_hw_error+0x114/0x670 drivers/bluetooth/btintel.c:254
hci_error_reset+0x348/0xa30 net/bluetooth/hci_core.c:1030
write/free by task ioctl/22580:
btintel_shutdown_combined+0xd0/0x360
drivers/bluetooth/btintel.c:3648
hci_dev_close_sync+0x9ae/0x2c10 net/bluetooth/hci_sync.c:5246
hci_dev_do_close+0x232/0x460 net/bluetooth/hci_core.c:526
BUG: KASAN: slab-use-after-free in
sk_skb_reason_drop+0x43/0x380 net/core/skbuff.c:1202
Read of size 4 at addr ffff888144a738dc
by task kworker/u17:1/83:
__hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
__hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
btintel_hw_error+0x186/0x670 drivers/bluetooth/btintel.c:260
Fixes: 973bb97e5a ("Bluetooth: btintel: Add generic function for handling hardware errors")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When the L2CAP channel mode is L2CAP_MODE_ERTM/L2CAP_MODE_STREAMING,
l2cap_publish_rx_avail will be called and le flow credits will be sent in
l2cap_chan_rx_avail, even though the link type is ACL.
The logs in question as follows:
> ACL Data RX: Handle 129 flags 0x02 dlen 12
L2CAP: Unknown (0x16) ident 4 len 4
40 00 ed 05
< ACL Data TX: Handle 129 flags 0x00 dlen 10
L2CAP: Command Reject (0x01) ident 4 len 2
Reason: Command not understood (0x0000)
Bluetooth: Unknown BR/EDR signaling command 0x16
Bluetooth: Wrong link type (-22)
Fixes: ce60b9231b ("Bluetooth: compute LE flow credits based on recvbuf space")
Signed-off-by: Zhang Chen <zhangchen01@kylinos.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(),
which acquires the workqueue pool->lock.
This causes a lockdep splat when call_srcu() is called with a scheduler
lock held, due to:
call_srcu() [holding pi_lock]
srcu_gp_start_if_needed()
schedule_work() -> pool->lock
workqueue_init() / create_worker() [holding pool->lock]
wake_up_process() -> try_to_wake_up() -> pi_lock
Also add irq_work_sync() to cleanup_srcu_struct() to prevent a
use-after-free if a queued irq_work fires after cleanup begins.
Tested with rcutorture SRCU-T and no lockdep warnings.
[ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work
to start process_srcu()" ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Since commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However as BPF instrument can
happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler lock because
otherwise it could cause deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().
[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]
Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
parameters specify initialization-time allocation of the srcu_node
tree for statically allocated srcu_struct structures (for example, in
DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
init_srcu_struct_nodes() will attempt to dynamically allocate this tree
at the first run-time update-side use of this srcu_struct structure,
but while holding a raw spinlock. Because the memory allocator can
acquire non-raw spinlocks, this can result in lockdep splats.
This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
when the first run-time update-side use of this srcu_struct structure
happens before srcu_init() is called. The actual allocation then takes
place from workqueue context at the ends of upcoming SRCU grace periods.
[boqun: Adjust the sha1 of the Fixes tag]
Fixes: 175b45ed34 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire
to avoid unnecessary real-time latency and the absence of any reason to
use raw spinlocks. However, the recent use of SRCU in tracing as the
underlying implementation of RCU Tasks Trace means that call_srcu()
is invoked from preemption-disabled regions of code, which in turn
requires that any locks acquired by call_srcu() or its callees must be
raw spinlocks.
This commit therefore converts SRCU's spinlocks to raw spinlocks.
[boqun: Add Fixes tag]
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Try to be more explicit why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers which
delay anything. Obviously, they would primary delay operations
on the related worker pools.
Explain why it is enough to prevent the false positive by re-checking
the timestamp under the pool->lock.
Finally, make it clear what would be the alternative solution in
__queue_work() which is a hotter path.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
I and Qianhai are GPU R&D engineers at Loongson, specializing
in kernel driver development. We understand that the current
Loongson GPU driver lacks dedicated maintenance resources
because of some reasons.
As Loongson GPU driver developers, we have both the capability
and the responsibility to continuously maintain the Loongson
GPU driver, ensuring minimal impact on its users. After internal
discussions, our team has decided to recommend me and Qianhai
to take over the maintenance responsibilities, and recommend
Huacai, Mingcong and Ruoyao to help to review.
And We'll continue to maintain it for current supported chips
and drive future updates according to chip support plan.
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patch.msgid.link/20260320101012.22714-1-lvjianmin@loongson.cn
The adm1177 driver exposes the current alert threshold through
hwmon_curr_max_alarm. This violates the hwmon sysfs ABI, where
*_alarm attributes are read-only status flags and writable thresholds
must use currN_max.
The driver also stores the threshold internally in microamps, while
currN_max is defined in milliamps. Convert the threshold accordingly
on both the read and write paths.
Widen the cached threshold and related calculations to 64 bits so
that small shunt resistor values do not cause truncation or overflow.
Also use 64-bit arithmetic for the mA/uA conversions, clamp writes
to the range the hardware can represent, and propagate failures from
adm1177_write_alert_thr() instead of silently ignoring them.
Update the hwmon documentation to reflect the attribute rename and
the correct units returned by the driver.
Fixes: 09b08ac9e8 ("hwmon: (adm1177) Add ADM1177 Hot Swap Controller and Digital Power Monitor driver")
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Acked-by: Nuno Sá <nuno.sa@analog.com>
Link: https://lore.kernel.org/r/20260325051246.28262-1-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
During 3D workload, user is reporting hitting:
[ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925
[ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy)
[ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe]
[ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282
[ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000
[ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000
[ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380
[ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380
[ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000
[ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0
[ 413.362088] PKRU: 55555554
[ 413.362089] Call Trace:
[ 413.362092] <TASK>
[ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe]
Which seems to hint that the vma we are re-inserting for the ops unwind
is either invalid or overlapping with something already inserted in the
vm. It shouldn't be invalid since this is a re-insertion, so must have
worked before. Leaving the likely culprit as something already placed
where we want to insert the vma.
Following from that, for the case where we do something like a rebind in
the middle of a vma, and one or both mapped ends are already compatible,
we skip doing the rebind of those vma and set next/prev to NULL. As well
as then adjust the original unmap va range, to avoid unmapping the ends.
However, if we trigger the unwind path, we end up with three va, with
the two ends never being removed and the original va range in the middle
still being the shrunken size.
If this occurs, one failure mode is when another unwind op needs to
interact with that range, which can happen with a vector of binds. For
example, if we need to re-insert something in place of the original va.
In this case the va is still the shrunken version, so when removing it
and then doing a re-insert it can overlap with the ends, which were
never removed, triggering a warning like above, plus leaving the vm in a
bad state.
With that, we need two things here:
1) Stop nuking the prev/next tracking for the skip cases. Instead
relying on checking for skip prev/next, where needed. That way on the
unwind path, we now correctly remove both ends.
2) Undo the unmap va shrinkage, on the unwind path. With the two ends
now removed the unmap va should expand back to the original size again,
before re-insertion.
v2:
- Update the explanation in the commit message, based on an actual IGT of
triggering this issue, rather than conjecture.
- Also undo the unmap shrinkage, for the skip case. With the two ends
now removed, the original unmap va range should expand back to the
original range.
v3:
- Track the old start/range separately. vma_size/start() uses the va
info directly.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602
Fixes: 8f33b4f054 ("drm/xe: Avoid doing rebinds")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260318100208.78097-2-matthew.auld@intel.com
(cherry picked from commit aec6969f75afbf4e01fd5fb5850ed3e9c27043ac)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Under an UML build for an upcoming series [1], I got `-Wstatic-in-inline`
for `dma_free_attrs`:
BINDGEN rust/bindings/bindings_generated.rs - due to target missing
In file included from rust/helpers/helpers.c:59:
rust/helpers/dma.c:17:2: warning: static function 'dma_free_attrs' is used in an inline function with external linkage [-Wstatic-in-inline]
17 | dma_free_attrs(dev, size, cpu_addr, dma_handle, attrs);
| ^
rust/helpers/dma.c:12:1: note: use 'static' to give inline function 'rust_helper_dma_free_attrs' internal linkage
12 | __rust_helper void rust_helper_dma_free_attrs(struct device *dev, size_t size,
| ^
| static
The issue is that `dma_free_attrs` was not marked `inline` when it was
introduced alongside the rest of the stubs.
Thus mark it.
Fixes: ed6ccf10f2 ("dma-mapping: properly stub out the DMA API for !CONFIG_HAS_DMA")
Closes: https://lore.kernel.org/rust-for-linux/20260322194616.89847-1-ojeda@kernel.org/ [1]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260325015548.70912-1-ojeda@kernel.org
If auxiliary_device_add() fails, add_adev() jumps to add_fail and calls
auxiliary_device_uninit(adev).
The auxiliary device has its release callback set to adev_release(),
which frees the containing struct mana_adev. Since adev is embedded in
struct mana_adev, the subsequent fall-through to init_fail and access
to adev->id may result in a use-after-free.
Fix this by saving the allocated auxiliary device id in a local
variable before calling auxiliary_device_add(), and use that saved id
in the cleanup path after auxiliary_device_uninit().
Fixes: a69839d432 ("net: mana: Add support for auxiliary device")
Cc: stable@vger.kernel.org
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260323165730.945365-1-lgs201920130244@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When codel_dequeue() finds an empty queue, it resets vars->dropping
but does not reset vars->first_above_time. The reference CoDel
algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:
dodeque_result codel_queue_t::dodeque(time_t now) {
...
if (r.p == NULL) {
first_above_time = 0; // <-- Linux omits this
}
...
}
Note that codel_should_drop() does reset first_above_time when called
with a NULL skb, but codel_dequeue() returns early before ever calling
codel_should_drop() in the empty-queue case. The post-drop code paths
do reach codel_should_drop(NULL) and correctly reset the timer, so a
dropped packet breaks the cycle -- but the next delivered packet
re-arms first_above_time and the cycle repeats.
For sparse flows such as ICMP ping (one packet every 200ms-1s), the
first packet arms first_above_time, the flow goes empty, and the
second packet arrives after the interval has elapsed and gets dropped.
The pattern repeats, producing sustained loss on flows that are not
actually congested.
Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
target), ping at 5 pps under UDP flood:
Before fix: 26% ping packet loss
After fix: 0% ping packet loss
Fix by resetting first_above_time to zero in the empty-queue path
of codel_dequeue(), matching the reference algorithm.
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: d068ca2ae2 ("codel: split into multiple files")
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/all/20260318134826.1281205-7-hawk@kernel.org/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260323174920.253526-1-hawk@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver does not explicitly configure the MAC duplex mode when
bringing the link up. As a result, the MAC may retain a stale duplex
setting from a previous link state, leading to duplex mismatches with
the link partner and degraded network performance.
Update lan743x_phylink_mac_link_up() to set or clear the MAC_CR_DPX_
bit according to the negotiated duplex mode.
This ensures the MAC configuration is consistent with the phylink
resolved state.
Fixes: a5f199a8d8 ("net: lan743x: Migrate phylib to phylink")
Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260323065345.144915-1-thangaraj.s@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A UAF issue occurs when the virtio_net driver is configured with napi_tx=N
and the device's IFF_XMIT_DST_RELEASE flag is cleared
(e.g., during the configuration of tc route filter rules).
When IFF_XMIT_DST_RELEASE is removed from the net_device, the network stack
expects the driver to hold the reference to skb->dst until the packet
is fully transmitted and freed. In virtio_net with napi_tx=N,
skbs may remain in the virtio transmit ring for an extended period.
If the network namespace is destroyed while these skbs are still pending,
the corresponding dst_ops structure has freed. When a subsequent packet
is transmitted, free_old_xmit() is triggered to clean up old skbs.
It then calls dst_release() on the skb associated with the stale dst_entry.
Since the dst_ops (referenced by the dst_entry) has already been freed,
a UAF kernel paging request occurs.
fix it by adds skb_dst_drop(skb) in start_xmit to explicitly release
the dst reference before the skb is queued in virtio_net.
Call Trace:
Unable to handle kernel paging request at virtual address ffff80007e150000
CPU: 2 UID: 0 PID: 6236 Comm: ping Kdump: loaded Not tainted 7.0.0-rc1+ #6 PREEMPT
...
percpu_counter_add_batch+0x3c/0x158 lib/percpu_counter.c:98 (P)
dst_release+0xe0/0x110 net/core/dst.c:177
skb_release_head_state+0xe8/0x108 net/core/skbuff.c:1177
sk_skb_reason_drop+0x54/0x2d8 net/core/skbuff.c:1255
dev_kfree_skb_any_reason+0x64/0x78 net/core/dev.c:3469
napi_consume_skb+0x1c4/0x3a0 net/core/skbuff.c:1527
__free_old_xmit+0x164/0x230 drivers/net/virtio_net.c:611 [virtio_net]
free_old_xmit drivers/net/virtio_net.c:1081 [virtio_net]
start_xmit+0x7c/0x530 drivers/net/virtio_net.c:3329 [virtio_net]
...
Reproduction Steps:
NETDEV="enp3s0"
config_qdisc_route_filter() {
tc qdisc del dev $NETDEV root
tc qdisc add dev $NETDEV root handle 1: prio
tc filter add dev $NETDEV parent 1:0 \
protocol ip prio 100 route to 100 flowid 1:1
ip route add 192.168.1.100/32 dev $NETDEV realm 100
}
test_ns() {
ip netns add testns
ip link set $NETDEV netns testns
ip netns exec testns ifconfig $NETDEV 10.0.32.46/24
ip netns exec testns ping -c 1 10.0.32.1
ip netns del testns
}
config_qdisc_route_filter
test_ns
sleep 2
test_ns
Fixes: f2fc6a5458 ("[NETNS][IPV6] route6 - move ip6_dst_ops inside the network namespace")
Cc: stable@vger.kernel.org
Signed-off-by: xietangxin <xietangxin@yeah.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: 0287587884 ("net: better IFF_XMIT_DST_RELEASE support")
Link: https://patch.msgid.link/20260312025406.15641-1-xietangxin@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, .fadvise() doesn't work well if page cache sharing is on
since shared inodes belong to a pseudo fs generated with init_pseudo(),
and sb->s_bdi is the default one &noop_backing_dev_info.
Then, generic_fadvise() will just behave as a no-op if sb->s_bdi is
&noop_backing_dev_info, but as the bdev fs (the bdev fs changes
inode_to_bdi() instead), it's actually NOT a pure memfs.
Let's generate a real bdi for erofs_ishare_mnt instead.
Fixes: d86d7817c0 ("erofs: implement .fadvise for page cache share")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull Kbuild fixes from Nathan Chancellor:
"This mostly addresses some issues with the awk conversion in
scripts/kconfig/merge_config.sh.
- Fix typo to ensure .builtin-dtbs.S is properly cleaned
- Fix '==' bashism in scripts/kconfig/merge_config.sh
- Fix awk error in scripts/kconfig/merge_config.sh when base
configuration is empty
- Fix inconsistent indentation in scripts/kconfig/merge_config.sh"
* tag 'kbuild-fixes-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
scripts: kconfig: merge_config.sh: fix indentation
scripts: kconfig: merge_config.sh: pass output file as awk variable
scripts: kconfig: merge_config.sh: fix unexpected operator warning
kbuild: Delete .builtin-dtbs.S when running make clean
alarm_timer_forward() passes arguments to alarm_forward() in the wrong
order:
alarm_forward(alarm, timr->it_interval, now);
However, alarm_forward() is defined as:
u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);
and uses the second argument as the current time:
delta = ktime_sub(now, alarm->node.expires);
Passing the interval as "now" results in incorrect delta computation,
which can lead to missed expirations or incorrect overrun accounting.
This issue has been present since the introduction of
alarm_timer_forward().
Fix this by swapping the arguments.
Fixes: e7561f1633 ("alarmtimer: Implement forward callback")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
test_cgcore_populated (test_core) and test_cgkill_{simple,tree,forkbomb}
(test_kill) check cgroup.events "populated 0" immediately after reaping
child tasks with waitpid(). This used to work because cgroup_task_exit() in
do_exit() unlinked tasks from css_sets before exit_notify() woke up
waitpid().
d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done
switching out") moved the unlink to cgroup_task_dead() in
finish_task_switch(), which runs after exit_notify(). The populated counter
is now decremented after the parent's waitpid() can return, so there is no
longer a synchronous ordering guarantee. On PREEMPT_RT, where
cgroup_task_dead() is further deferred through lazy irq_work, the race
window is even larger.
The synchronous populated transition was never part of the cgroup interface
contract - it was an implementation artifact. Use cg_read_strcmp_wait() which
retries for up to 1 second, matching what these tests actually need to
verify: that the cgroup eventually becomes unpopulated after all tasks exit.
Fixes: d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: cgroups@vger.kernel.org
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
v2: cgroup_is_populated() true to false transition happens under css_set_lock
not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
missed wakeups (Sebastian).
Fixes: a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Clear the pending exception state from a vcpu coming out of reset,
as it could otherwise affect the first instruction executed in the
guest
- Fix pointer arithmetic in address translation emulation, so that
the Hardware Access bit is set on the correct PTE instead of some
other location
s390:
- Fix deadlock in new memory management
- Properly handle kernel faults on donated memory
- Fix bounds checking for irq routing, with selftest
- Fix invalid machine checks and log all of them"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Fix the descriptor address in __kvm_at_swap_desc()
KVM: s390: vsie: Avoid injecting machine check on signal
KVM: s390: log machine checks more aggressively
KVM: s390: selftests: Add IRQ routing address offset tests
KVM: s390: Limit adapter indicator access to mapped page
s390/mm: Add missing secure storage access fixups for donated memory
KVM: arm64: Discard PC update state on vcpu reset
KVM: s390: Fix a deadlock
Add LANDLOCK_RESTRICT_SELF_TSYNC to the backwards compatibility example
for restrict flags. This introduces completeness, similar to that of
the ruleset attributes example. However, as the new example can impact
enforcement in certain cases, an appropriate warning is also included.
Additionally, I modified the two comments of the example to make them
more consistent with the ruleset attributes example's.
Signed-off-by: Panagiotis "Ivory" Vasilopoulos <git@n0toose.net>
Co-developed-by: Dan Cojocaru <dan@dcdev.ro>
Signed-off-by: Dan Cojocaru <dan@dcdev.ro>
Reviewed-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20260304-landlock-docs-add-tsync-example-v4-1-819a276f05c5@n0toose.net
[mic: Update date, improve comments consistency, fix newline issue]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Pull Compute Express Link (CXL) fixes from Dave Jiang:
- Adjust the startup priority of cxl_pmem to be higher than that of
cxl_acpi
- Use proper endpoint validity check upon sanitize
- Avoid incorrect DVSEC fallback when HDM decoders are enabled
- Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch
- Fix leakage in __construct_region()
- Fix use after free of parent_port in cxl_detach_ep()
* tag 'cxl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Adjust the startup priority of cxl_pmem to be higher than that of cxl_acpi
cxl/mbox: Use proper endpoint validity check upon sanitize
cxl/hdm: Avoid incorrect DVSEC fallback when HDM decoders are enabled
cxl/acpi: Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch
cxl/region: Fix leakage in __construct_region()
cxl/port: Fix use after free of parent_port in cxl_detach_ep()
During a GPU page fault, the driver restores the SVM range and then maps it
into the GPU page tables. The current implementation passes a GPU-page-size
(4K-based) PFN to svm_range_restore_pages() to restore the range.
SVM ranges are tracked using system-page-size PFNs. On systems where the
system page size is larger than 4K, using GPU-page-size PFNs to restore the
range causes two problems:
Range lookup fails:
Because the restore function receives PFNs in GPU (4K) units, the SVM
range lookup does not find the existing range. This will result in a
duplicate SVM range being created.
VMA lookup failure:
The restore function also tries to locate the VMA for the faulting address.
It converts the GPU-page-size PFN into an address using the system page
size, which results in an incorrect address on non-4K page-size systems.
As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
removed".
This patch passes the system-page-size PFN to svm_range_restore_pages() so
that the SVM range is restored correctly on non-4K page systems.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 074fe395fb13247b057f60004c7ebcca9f38ef46)
Forcibly disable the OD_FAN_CURVE feature when temperature or PWM range is invalid,
otherwise PMFW will reject this configuration on smu v14.0.2/14.0.3.
example:
$ sudo cat /sys/bus/pci/devices/<BDF>/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 0C 0%
1: 0C 0%
2: 0C 0%
3: 0C 0%
4: 0C 0%
OD_RANGE:
FAN_CURVE(hotspot temp): 0C 0C
FAN_CURVE(fan speed): 0% 0%
$ echo "0 50 40" | sudo tee fan_curve
kernel log:
[ 969.761627] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
[ 1010.897800] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ab4905d466b60f170d85e19ca2a5d2b159aeb780)
Cc: stable@vger.kernel.org
In kfd_ioctl_create_process(), the pointer 'p' is used before checking
if it is NULL.
The code accesses p->context_id before validating 'p'. This can lead
to a possible NULL pointer dereference.
Move the NULL check before using 'p' so that the pointer is validated
before access.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_chardev.c:3177 kfd_ioctl_create_process() warn: variable dereferenced before check 'p' (see line 3174)
Fixes: cc6b66d661 ("amdkfd: introduce new ioctl AMDKFD_IOC_CREATE_PROCESS")
Cc: Zhu Lingshan <lingshan.zhu@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 19d4149b22f57094bfc4b86b742381b3ca394ead)
amdgpu_amdkfd_submit_ib() submits a GPU job and gets a fence
from amdgpu_ib_schedule(). This fence is used to wait for job
completion.
Currently, the code drops the fence reference using dma_fence_put()
before calling dma_fence_wait().
If dma_fence_put() releases the last reference, the fence may be
freed before dma_fence_wait() is called. This can lead to a
use-after-free.
Fix this by waiting on the fence first and releasing the reference
only after dma_fence_wait() completes.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c:697 amdgpu_amdkfd_submit_ib() warn: passing freed memory 'f' (line 696)
Fixes: 9ae55f030d ("drm/amdgpu: Follow up change to previous drm scheduler change.")
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8b9e5259adc385b61a6590a13b82ae0ac2bd3482)
When ec_install_handlers() returns -EPROBE_DEFER on reduced-hardware
platforms, it has already started the EC and installed the address
space handler with the struct acpi_ec pointer as handler context.
However, acpi_ec_setup() propagates the error without any cleanup.
The caller acpi_ec_add() then frees the struct acpi_ec for non-boot
instances, leaving a dangling handler context in ACPICA.
Any subsequent AML evaluation that accesses an EC OpRegion field
dispatches into acpi_ec_space_handler() with the freed pointer,
causing a use-after-free:
BUG: KASAN: slab-use-after-free in mutex_lock (kernel/locking/mutex.c:289)
Write of size 8 at addr ffff88800721de38 by task init/1
Call Trace:
<TASK>
mutex_lock (kernel/locking/mutex.c:289)
acpi_ec_space_handler (drivers/acpi/ec.c:1362)
acpi_ev_address_space_dispatch (drivers/acpi/acpica/evregion.c:293)
acpi_ex_access_region (drivers/acpi/acpica/exfldio.c:246)
acpi_ex_field_datum_io (drivers/acpi/acpica/exfldio.c:509)
acpi_ex_extract_from_field (drivers/acpi/acpica/exfldio.c:700)
acpi_ex_read_data_from_field (drivers/acpi/acpica/exfield.c:327)
acpi_ex_resolve_node_to_value (drivers/acpi/acpica/exresolv.c:392)
</TASK>
Allocated by task 1:
acpi_ec_alloc (drivers/acpi/ec.c:1424)
acpi_ec_add (drivers/acpi/ec.c:1692)
Freed by task 1:
kfree (mm/slub.c:6876)
acpi_ec_add (drivers/acpi/ec.c:1751)
The bug triggers on reduced-hardware EC platforms (ec->gpe < 0)
when the GPIO IRQ provider defers probing. Once the stale handler
exists, any unprivileged sysfs read that causes AML to touch an
EC OpRegion (battery, thermal, backlight) exercises the dangling
pointer.
Fix this by calling ec_remove_handlers() in the error path of
acpi_ec_setup() before clearing first_ec. ec_remove_handlers()
checks each EC_FLAGS_* bit before acting, so it is safe to call
regardless of how far ec_install_handlers() progressed:
-ENODEV (handler not installed): only calls acpi_ec_stop()
-EPROBE_DEFER (handler installed): removes handler, stops EC
Fixes: 03e9a0e057 ("ACPI: EC: Consolidate event handler installation code")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260324165458.1337233-2-bestswngs@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
KVM/arm64 fixes for 7.0, take #4
- Clear the pending exception state from a vcpu coming out of
reset, as it could otherwise affect the first instruction
executed in the guest.
- Fix the address translation emulation icode to set the Hardware
Access bit on the correct PTE instead of some other location.
Pull MM fixes from Andrew Morton:
"6 hotfixes. 2 are cc:stable. All are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-23-17-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/damon/stat: monitor all System RAM resources
mm/zswap: add missing kunmap_local()
mailmap: update email address for Muhammad Usama Anjum
zram: do not slot_free() written-back slots
mm/damon/core: avoid use of half-online-committed context
mm/rmap: clear vma->anon_vma on error
Refine the description to better highlight its features and use cases.
In addition, add instructions for building it as a module and clarify
the compression option.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix parsing 'overwrite' in command line event definitions in
big-endian machines by writing correct union member
- Fix finding default metric in 'perf stat'
- Fix relative paths for including headers in 'perf kvm stat'
- Sync header copies with the kernel sources: msr-index.h, kvm,
build_bug.h
* tag 'perf-tools-fixes-for-v7.0-2-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
tools headers: Synchronize linux/build_bug.h with the kernel sources
tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
perf kvm stat: Fix relative paths for including headers
perf parse-events: Fix big-endian 'overwrite' by writing correct union member
perf metricgroup: Fix metricgroup__has_metric_or_groups()
tools headers: Skip arm64 cputype.h check
Pull media fixes from Mauro Carvalho Chehab:
- rkvdec: fix stack usage with clang and improve handling missing
short/long term RPS
- synopsys: fix a Kconfig issue and an out-of-bounds check
- verisilicon: Fix kernel panic due to __initconst misuse
- media core: serialize REINIT and REQBUFS with req_queue_mutex
* tag 'media/v7.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: verisilicon: Fix kernel panic due to __initconst misuse
media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl()
media: rkvdec: reduce excessive stack usage in assemble_hw_pps()
media: rkvdec: Improve handling missing short/long term RPS
media: mc, v4l2: serialize REINIT and REQBUFS with req_queue_mutex
media: synopsys: csi2rx: add missing kconfig dependency
media: synopsys: csi2rx: fix out-of-bounds check for formats array
The implicit FILEID_INO32_GEN encoder was changed to be explicit,
so we need to fix the detection.
When mounting overlayfs with upperdir and lowerdir on different ext4
filesystems, the expected kmsg log is:
overlayfs: "xino" feature enabled using 32 upper inode bits.
But instead, since the regressing commit, the kmsg log was:
overlayfs: "xino" feature enabled using 2 upper inode bits.
Fixes: e21fc2038c ("exportfs: make ->encode_fh() a mandatory method for NFS export")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
The memcpy function assumes the dynamic array notif->matches is at least
as large as the number of bytes to copy. Otherwise, results->matches may
contain unwanted data. To guarantee safety, extend the validation in one
of the checks to ensure sufficient packet length.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable@vger.kernel.org
Fixes: 5ac54afd4d ("wifi: iwlwifi: mvm: Add handling for scan offload match info notification")
Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru>
Link: https://patch.msgid.link/20260207150335.1013646-1-a.velichayshiy@ispras.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Also note that we do not enable the driver_override feature of struct
bus_type, as SPI - in contrast to most other buses - passes "" to
sysfs_emit() when the driver_override pointer is NULL. Thus, printing
"\n" instead of "(null)\n".
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 5039563e7c ("spi: Add driver_override SPI device attribute")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260324005919.2408620-12-dakr@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
cputemp_is_visible() validates the channel index against
CPUTEMP_CHANNEL_NUMS, but currently uses '>' instead of '>='.
As a result, channel == CPUTEMP_CHANNEL_NUMS is not rejected even though
valid indices are 0 .. CPUTEMP_CHANNEL_NUMS - 1.
Fix the bounds check by using '>=' so invalid channel indices are
rejected before indexing the core bitmap.
Fixes: bf3608f338 ("hwmon: peci: Add cputemp driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260323002352.93417-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The hwmon sysfs ABI expects tempN_crit_hyst to report the temperature at
which the critical condition clears, not the hysteresis delta from the
critical limit.
The peci cputemp driver currently returns tjmax - tcontrol for
crit_hyst_type, which is the hysteresis margin rather than the
corresponding absolute temperature.
Return tcontrol directly, and update the documentation accordingly.
Fixes: bf3608f338 ("hwmon: peci: Add cputemp driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260323002352.93417-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The custom avs0_enable and avs1_enable sysfs attributes access PMBus
registers through the exported API helpers (pmbus_read_byte_data,
pmbus_read_word_data, pmbus_write_word_data, pmbus_update_byte_data)
without holding the PMBus update_lock mutex. These exported helpers do
not acquire the mutex internally, unlike the core's internal callers
which hold the lock before invoking them.
The store callback is especially vulnerable: it performs a multi-step
read-modify-write sequence (read VOUT_COMMAND, write VOUT_COMMAND, then
update OPERATION) where concurrent access from another thread could
interleave and corrupt the register state.
Add pmbus_lock_interruptible()/pmbus_unlock() around both the show and
store callbacks to serialize PMBus register access with the rest of the
driver.
Fixes: 038a9c3d1e ("hwmon: (pmbus/isl68137) Add driver for Intersil ISL68137 PWM Controller")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260319173055.125271-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
ina233_read_word_data() reads MFR_READ_VSHUNT via pmbus_read_word_data()
but has two issues:
1. The return value is not checked for errors before being used in
arithmetic. A negative error code from a failed I2C transaction is
passed directly to DIV_ROUND_CLOSEST(), producing garbage data.
2. MFR_READ_VSHUNT is a 16-bit two's complement value. Negative shunt
voltages (values with bit 15 set) are treated as large positive
values since pmbus_read_word_data() returns them zero-extended in an
int. This leads to incorrect scaling in the VIN coefficient
conversion.
Fix both issues by adding an error check, casting to s16 for proper
sign extension, and clamping the result to a valid non-negative range.
The clamp is necessary because read_word_data callbacks must return
non-negative values on success (negative values indicate errors to the
pmbus core).
Fixes: b64b6cb163 ("hwmon: Add driver for TI INA233 Current and Power Monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260319173055.125271-2-sanman.pradhan@hpe.com
[groeck: Fixed clamp to avoid losing the sign bit]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Jeff Johnson says:
==================
ath.git update for v7.0-rc6
For both ath11k and ath12k use the correct TID when stopping an AMPDU
session.
==================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Miri Korenblit says:
====================
wifi: iwlwifi: fixes - 2026-03-24
- Fix MLO scan timing (record the scan start in FW)
- don't send a 6E related command when not supported
- correctly set wifi generation data
====================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
wl1251_tx_packet_cb() uses the firmware completion ID directly to index
the fixed 16-entry wl->tx_frames[] array. The ID is a raw u8 from the
completion block, and the callback does not currently verify that it
fits the array before dereferencing it.
Reject completion IDs that fall outside wl->tx_frames[] and keep the
existing NULL check in the same guard. This keeps the fix local to the
trust boundary and avoids touching the rest of the completion flow.
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260323080845.40033-1-pengpeng@iscas.ac.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The variable valuesize is declared as u8 but accumulates the total
length of all SSIDs to scan. Each SSID contributes up to 33 bytes
(IEEE80211_MAX_SSID_LEN + 1), and with WILC_MAX_NUM_PROBED_SSID (10)
SSIDs the total can reach 330, which wraps around to 74 when stored
in a u8.
This causes kmalloc to allocate only 75 bytes while the subsequent
memcpy writes up to 331 bytes into the buffer, resulting in a 256-byte
heap buffer overflow.
Widen valuesize from u8 to u32 to accommodate the full range.
Fixes: c5c77ba18e ("staging: wilc1000: Add SDIO/SPI 802.11 driver")
Cc: stable@vger.kernel.org
Signed-off-by: Yasuaki Torimaru <yasuakitorimaru@gmail.com>
Link: https://patch.msgid.link/20260324100624.983458-1-yasuakitorimaru@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2026-03-23
1) Add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi.
From Sabrina Dubroca.
2) Fix the condition on x->pcpu_num in xfrm_sa_len by using the
proper check. From Sabrina Dubroca.
3) Call xdo_dev_state_delete during state update to properly cleanup
the xdo device state. From Sabrina Dubroca.
4) Fix a potential skb leak in espintcp when async crypto is used.
From Sabrina Dubroca.
5) Validate inner IPv4 header length in IPTFS payload to avoid
parsing malformed packets. From Roshan Kumar.
6) Fix skb_put() panic on non-linear skb during IPTFS reassembly.
From Fernando Fernandez Mancera.
7) Silence various sparse warnings related to RCU, state, and policy
handling. From Sabrina Dubroca.
8) Fix work re-schedule race after cancel in xfrm_nat_keepalive_net_fini().
From Hyunwoo Kim.
9) Prevent policy_hthresh.work from racing with netns teardown by using
a proper cleanup mechanism. From Minwoo Ra.
10) Validate that the family of the source and destination addresses match
in pfkey_send_migrate(). From Eric Dumazet.
11) Only publish mode_data after the clone is setup in the IPTFS receive path.
This prevents leaving x->mode_data pointing at freed memory on error.
From Paul Moses.
Please pull or let me know if there are problems.
ipsec-2026-03-23
* tag 'ipsec-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: iptfs: only publish mode_data after clone setup
af_key: validate families in pfkey_send_migrate()
xfrm: prevent policy_hthresh.work from racing with netns teardown
xfrm: Fix work re-schedule after cancel in xfrm_nat_keepalive_net_fini()
xfrm: avoid RCU warnings around the per-netns netlink socket
xfrm: add rcu_access_pointer to silence sparse warning for xfrm_input_afinfo
xfrm: policy: silence sparse warning in xfrm_policy_unregister_afinfo
xfrm: policy: fix sparse warnings in xfrm_policy_{init,fini}
xfrm: state: silence sparse warnings during netns exit
xfrm: remove rcu/state_hold from xfrm_state_lookup_spi_proto
xfrm: state: add xfrm_state_deref_prot to state_by* walk under lock
xfrm: state: fix sparse warnings around XFRM_STATE_INSERT
xfrm: state: fix sparse warnings in xfrm_state_init
xfrm: state: fix sparse warnings on xfrm_state_hold_rcu
xfrm: iptfs: fix skb_put() panic on non-linear skb during reassembly
xfrm: iptfs: validate inner IPv4 header length in IPTFS payload
esp: fix skb leak with espintcp and async crypto
xfrm: call xdo_dev_state_delete during state update
xfrm: fix the condition on x->pcpu_num in xfrm_sa_len
xfrm: add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi
====================
Link: https://patch.msgid.link/20260323083440.2741292-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With traffic ongoing for data TID [TID 0], an DELBA request to
stop AMPDU for the BA session was received on management TID [TID 4].
The corresponding TID number was incorrectly passed to stop the BA session,
resulting in the BA session for data TIDs being stopped and the BA size
being reduced to 1, causing an overall dip in TCP throughput.
Fix this issue by passing the correct argument from
ath12k_dp_rx_ampdu_stop() to ath12k_dp_arch_peer_rx_tid_reo_update()
during an AMPDU stop session. Instead of passing peer->dp_peer->rx_tid,
which is the base address of the array, corresponding to TID 0, pass
the value of &peer->dp_peer->rx_tid[params->tid]. With this, the
different TID numbers are accounted for.
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.5-01651-QCAHKSWPL_SILICONZ-1
Fixes: d889913205 ("wifi: ath12k: driver for Qualcomm Wi-Fi 7 devices")
Signed-off-by: Reshma Immaculate Rajkumar <reshma.rajkumar@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260227110123.3726354-1-reshma.rajkumar@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
During ongoing traffic, a request to stop an AMPDU session
for one TID could incorrectly affect other active sessions.
This can happen because an incorrect TID reference would be
passed when updating the BA session state, causing the wrong
session to be stopped. As a result, the affected session would
be reduced to a minimal BA size, leading to a noticeable
throughput degradation.
Fix this issue by passing the correct argument from
ath11k_dp_rx_ampdu_stop() to ath11k_peer_rx_tid_reo_update()
during a stop AMPDU session. Instead of passing peer->tx_tid, which
is the base address of the array, corresponding to TID 0; pass
the value of &peer->rx_tid[params->tid], where the different TID numbers
are accounted for.
Tested-on: QCN9074 hw1.0 PCI WLAN.HK.2.9.0.1-02146-QCAHKSWPL_SILICONZ-1
Fixes: d5c65159f2 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
Signed-off-by: Reshma Immaculate Rajkumar <reshma.rajkumar@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260319065608.2408179-1-reshma.rajkumar@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
The hardware teams noticed that the originally documented workaround
steps for Wa_16025250150 may not be sufficient to fully avoid a hardware
issue. The workaround documentation has been augmented to suggest
programming one additional register; make the corresponding change in
the driver.
Fixes: 7654d51f1f ("drm/xe/xe2hpg: Add Wa_16025250150")
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://patch.msgid.link/20260319-wa_16025250150_part2-v1-1-46b1de1a31b2@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit a31566762d4075646a8a2214586158b681e94305)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The Rust `Regulator` abstraction uses `NonNull` to wrap the underlying
`struct regulator` pointer. When `CONFIG_REGULATOR` is disabled, the C
stub for `regulator_get` returns `NULL`. `from_err_ptr` does not treat
`NULL` as an error, so it was passed to `NonNull::new_unchecked`,
causing undefined behavior.
Fix this by using a raw pointer `*mut bindings::regulator` instead of
`NonNull`. This allows `inner` to be `NULL` when `CONFIG_REGULATOR` is
disabled, and leverages the C stubs which are designed to handle `NULL`
or are no-ops.
Fixes: 9b614ceada ("rust: regulator: add a bare minimum regulator abstraction")
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Closes: https://lore.kernel.org/r/20260322193830.89324-1-ojeda@kernel.org
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Link: https://patch.msgid.link/20260324-regulator-fix-v1-1-a5244afa3c15@google.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In each MAC context, the firmware expects the wifi generation
data, i.e. whether or not HE/EHT (and in the future UHR) is
enabled on that MAC.
However, this is currently handled wrong in two ways:
- EHT is only enabled when the interface is also an MLD, but
we currently allow (despite the spec) connecting with EHT
but without MLO.
- when HE or EHT are used by TDLS peers, the firmware needs
to have them enabled regardless of the AP
Fix this by iterating setting up the data depending on the
interface type:
- for AP, just set it according to the BSS configuration
- for monitor, set it according to HW capabilities
- otherwise, particularly for client, iterate all stations
and then their links on the interface in question and set
according to their capabilities, this handles the AP and
TDLS peers. Re-calculate this whenever a TDLS station is
marked associated or removed so that it's kept updated,
for the AP it's already updated on assoc/disassoc.
Fixes: d1e879ec60 ("wifi: iwlwifi: add iwlmld sub-driver")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260319110722.404713b22177.Ic972b5e557d011a5438f8f97c1e793cc829e2ea9@changeid
Link: https://patch.msgid.link/20260324093333.2953495-1-miriam.rachel.korenblit@intel.com
Calculate MLO scan start time based on actual
scan start notification from firmware instead of recording
time when scan command is sent.
Currently, MLO scan start time was captured immediately
after sending the scan command to firmware. However, the
actual scan start time may differ due to the FW being busy
with a previous scan.
In that case, the link selection code will think that the MLO
scan is too old, and will warn.
To fix it, Implement start scan notification handling to
capture the precise moment when firmware begins the scan
operation.
Fixes: 9324731b99 ("wifi: iwlwifi: mld: avoid selecting bad links")
Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260324113316.4c56b8bac533.I6e656d8cc30bb82c96aabadedd62bd67f4c46bf9@changeid
NETIF_F_IPV6_CSUM only advertises support for checksum offload of
packets without IPv6 extension headers. Packets with extension
headers must fall back onto software checksumming. Since TSO
depends on checksum offload, those must revert to GSO.
The below commit introduces that fallback. It always checks
network header length. For tunneled packets, the inner header length
must be checked instead. Extend the check accordingly.
A special case is tunneled packets without inner IP protocol. Such as
RFC 6951 SCTP in UDP. Those are not standard IPv6 followed by
transport header either, so also must revert to the software GSO path.
Cc: stable@vger.kernel.org
Fixes: 864e339697 ("net: gso: Forbid IPv6 TSO with extensions on devices with only IPV6_CSUM")
Reported-by: Tangxin Xie <xietangxin@yeah.net>
Closes: https://lore.kernel.org/netdev/0414e7e2-9a1c-4d7c-a99d-b9039cf68f40@yeah.net/
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260320190148.2409107-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Marc Kleine-Budde says:
====================
pull-request: can 2026-03-23
this is a pull request of 5 patches for net/main.
The first patch is by me and adds missing error handling to the CAN
netlink device configuration code.
Wenyuan Li contributes a patch for the mcp251x drier to add missing
error handling for power enabling in th open and resume functions.
Oliver Hartkopp's patch adds missing atomic access in hot path for the
CAN procfs statistics.
A series by Ali Norouzi and Oliver Hartkopp fix a can-Out-of-Bounds
Heap R/W in the can-gw protocol and a UAF in the CAN isotp protocol.
linux-can-fixes-for-7.0-20260323
* tag 'linux-can-fixes-for-7.0-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: isotp: fix tx.buf use-after-free in isotp_sendmsg()
can: gw: fix OOB heap access in cgw_csum_crc8_rel()
can: statistics: add missing atomic access in hot path
can: mcp251x: add error handling for power enable in open and resume
can: netlink: can_changelink(): add missing error handling to call can_ctrlmode_changelink()
====================
Link: https://patch.msgid.link/20260323103224.218099-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
cppi5_hdesc_get_psdata() returns a pointer into the CPPI descriptor.
In both emac_rx_packet() and emac_rx_packet_zc(), the descriptor is
freed via k3_cppi_desc_pool_free() before the psdata pointer is used
by emac_rx_timestamp(), which dereferences psdata[0] and psdata[1].
This constitutes a use-after-free on every received packet that goes
through the timestamp path.
Defer the descriptor free until after all accesses through the psdata
pointer are complete. For emac_rx_packet(), move the free into the
requeue label so both early-exit and success paths free the descriptor
after all accesses are done. For emac_rx_packet_zc(), move the free to
the end of the loop body after emac_dispatch_skb_zc() (which calls
emac_rx_timestamp()) has returned.
Fixes: 46eeb90f03 ("net: ti: icssg-prueth: Use page_pool API for RX buffer allocation")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260320174439.41080-1-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Similar to commit 950803f725 ("bonding: fix type confusion in
bond_setup_by_slave()") team has the same class of header_ops type
confusion.
For non-Ethernet ports, team_setup_by_port() copies port_dev->header_ops
directly. When the team device later calls dev_hard_header() or
dev_parse_header(), these callbacks can run with the team net_device
instead of the real lower device, so netdev_priv(dev) is interpreted as
the wrong private type and can crash.
The syzbot report shows a crash in bond_header_create(), but the root
cause is in team: the topology is gre -> bond -> team, and team calls
the inherited header_ops with its own net_device instead of the lower
device, so bond_header_create() receives a team device and interprets
netdev_priv() as bonding private data, causing a type confusion crash.
Fix this by introducing team header_ops wrappers for create/parse,
selecting a team port under RCU, and calling the lower device callbacks
with port->dev, so each callback always sees the correct net_device
context.
Also pass the selected lower device to the lower parse callback, so
recursion is bounded in stacked non-Ethernet topologies and parse
callbacks always run with the correct device context.
Fixes: 1d76efe157 ("team: add support for non-ethernet devices")
Reported-by: syzbot+3d8bc31c45e11450f24c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69b46af7.050a0220.36eb34.000e.GAE@google.com/T/
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260320072139.134249-2-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Xuan Zhuo says:
====================
virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN
The commit be50da3e9d ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.
This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, the number of
bytes from the start of the packet to the beginning of the
transport-layer payload.
However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This path set fixes that issue.
As discussed in [0], this version checks the VIRTIO_NET_F_GUEST_HDRLEN is
negotiated.
[0]: http://lore.kernel.org/all/20251029030913.20423-1-xuanzhuo@linux.alibaba.com
v10: fix http://lore.kernel.org/all/202603122214.8Anoxrmq-lkp@intel.com
====================
Link: https://patch.msgid.link/20260320021818.111741-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The commit a2fb4bc4e2 ("net: implement virtio helpers to handle UDP
GSO tunneling.") introduces support for the UDP GSO tunnel feature in
virtio-net.
The virtio spec says:
If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or
VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for
all the headers up to and including the inner transport.
The commit did not update the hdr_len to include the inner transport.
I observed that the "hdr_len" is 116 for this packet:
17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898)
192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] VXLAN, flags [I] (0x08), vni 1
fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848)
192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796
116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(innner tcp)
Fixes: a2fb4bc4e2 ("net: implement virtio helpers to handle UDP GSO tunneling.")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260320021818.111741-3-xuanzhuo@linux.alibaba.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The commit be50da3e9d ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.
This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, the number of
bytes from the start of the packet to the beginning of the
transport-layer payload.
However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This commit fixes that issue.
Fixes: be50da3e9d ("net: virtio_net: implement exact header length guest feature")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260320021818.111741-2-xuanzhuo@linux.alibaba.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Zorro Lang reported the following lockdep splat:
"While running fstests xfs/556 on kernel 7.0.0-rc4+ (HEAD=04a9f1766954),
a lockdep warning was triggered indicating an inconsistent lock state
for sb->s_type->i_lock_key.
"The deadlock might occur because iomap_read_end_io (called from a
hardware interrupt completion path) invokes fserror_report, which then
calls igrab. igrab attempts to acquire the i_lock spinlock. However,
the i_lock is frequently acquired in process context with interrupts
enabled. If an interrupt occurs while a process holds the i_lock, and
that interrupt handler calls fserror_report, the system deadlocks.
"I hit this warning several times by running xfs/556 (mostly) or
generic/648 on xfs. More details refer to below console log."
along with this dmesg, for which I've cleaned up the stacktraces:
run fstests xfs/556 at 2026-03-18 20:05:30
XFS (sda3): Mounting V5 Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
device-mapper: table: 253:0: adding target device (start sect 209 len 1) caused an alignment inconsistency
device-mapper: table: 253:0: adding target device (start sect 210 len 62914350) caused an alignment inconsistency
buffer_io_error: 6 callbacks suppressed
Buffer I/O error on dev dm-0, logical block 209, async page read
Buffer I/O error on dev dm-0, logical block 209, async page read
XFS (dm-0): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
================================
WARNING: inconsistent lock state
7.0.0-rc4+ #1 Tainted: G S W
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
od/2368602 [HC1[1]:SC0[0]:HE0:SE1] takes:
ff1100069f2b4a98 (&sb->s_type->i_lock_key#31){?.+.}-{3:3}, at: igrab+0x28/0x1a0
{HARDIRQ-ON-W} state was registered at:
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
unlock_new_inode+0x66/0x2a0
xfs_iget+0x67b/0x7b0 [xfs]
xfs_mountfs+0xde4/0x1c80 [xfs]
xfs_fs_fill_super+0xe86/0x17a0 [xfs]
get_tree_bdev_flags+0x312/0x590
vfs_get_tree+0x8d/0x2f0
vfs_cmd_create+0xb2/0x240
__do_sys_fsconfig+0x3d8/0x9a0
do_syscall_64+0x13a/0x1520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
irq event stamp: 3118
hardirqs last enabled at (3117): [<ffffffffb54e4ad8>] _raw_spin_unlock_irq+0x28/0x50
hardirqs last disabled at (3118): [<ffffffffb54b84c9>] common_interrupt+0x19/0xe0
softirqs last enabled at (3040): [<ffffffffb290ca28>] handle_softirqs+0x6b8/0x950
softirqs last disabled at (3023): [<ffffffffb290ce4d>] __irq_exit_rcu+0xfd/0x250
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sb->s_type->i_lock_key#31);
<Interrupt>
lock(&sb->s_type->i_lock_key#31);
*** DEADLOCK ***
1 lock held by od/2368602:
#0: ff1100069f2b4b58 (&sb->s_type->i_mutex_key#19){++++}-{4:4}, at: xfs_ilock+0x324/0x4b0 [xfs]
stack backtrace:
CPU: 15 UID: 0 PID: 2368602 Comm: od Kdump: loaded Tainted: G S W 7.0.0-rc4+ #1 PREEMPT(full)
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
print_usage_bug.part.0+0x230/0x2c0
mark_lock_irq+0x3ce/0x5b0
mark_lock+0x1cb/0x3d0
mark_usage+0x109/0x120
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
igrab+0x28/0x1a0
fserror_report+0x127/0x2d0
iomap_finish_folio_read+0x13c/0x280
iomap_read_end_io+0x10e/0x2c0
clone_endio+0x37e/0x780 [dm_mod]
blk_update_request+0x448/0xf00
scsi_end_request+0x74/0x750
scsi_io_completion+0xe9/0x7c0
_scsih_io_done+0x6ba/0x1ca0 [mpt3sas]
_base_process_reply_queue+0x249/0x15b0 [mpt3sas]
_base_interrupt+0x95/0xe0 [mpt3sas]
__handle_irq_event_percpu+0x1f0/0x780
handle_irq_event+0xa9/0x1c0
handle_edge_irq+0x2ef/0x8a0
__common_interrupt+0xa0/0x170
common_interrupt+0xb7/0xe0
</IRQ>
<TASK>
asm_common_interrupt+0x26/0x40
RIP: 0010:_raw_spin_unlock_irq+0x2e/0x50
Code: 0f 1f 44 00 00 53 48 8b 74 24 08 48 89 fb 48 83 c7 18 e8 b5 73 5e fd 48 89 df e8 ed e2 5e fd e8 08 78 8f fd fb bf 01 00 00 00 <e8> 8d 56 4d fd 65 8b 05 46 d5 1d 03 85 c0 74 06 5b c3 cc cc cc cc
RSP: 0018:ffa0000027d07538 EFLAGS: 00000206
RAX: 0000000000000c2d RBX: ffffffffb6614bc8 RCX: 0000000000000080
RDX: 0000000000000000 RSI: ffffffffb6306a01 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffffb75efc67 R11: 0000000000000001 R12: ff1100015ada0000
R13: 0000000000000083 R14: 0000000000000002 R15: ffffffffb6614c10
folio_wait_bit_common+0x407/0x780
filemap_update_page+0x8e7/0xbd0
filemap_get_pages+0x904/0xc50
filemap_read+0x320/0xc20
xfs_file_buffered_read+0x2aa/0x380 [xfs]
xfs_file_read_iter+0x263/0x4a0 [xfs]
vfs_read+0x6cb/0xb70
ksys_read+0xf9/0x1d0
do_syscall_64+0x13a/0x1520
Zorro's diagnosis makes sense, so the solution is to kick the failed
read handling to a workqueue much like we added for writeback ioends in
commit 294f54f849 ("fserror: fix lockdep complaint when igrabbing
inode").
Cc: Zorro Lang <zlang@redhat.com>
Link: https://lore.kernel.org/linux-xfs/20260319194303.efw4wcu7c4idhthz@doltdoltdolt/
Fixes: a9d573ee88 ("iomap: report file I/O errors to the VFS")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://patch.msgid.link/20260323210017.GL6223@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Clearing the DP tunnel stream BW in the atomic state involves getting
the tunnel group state, which can fail. Handle the error accordingly.
This fixes at least one issue where drm_dp_tunnel_atomic_set_stream_bw()
failed to get the tunnel group state returning -EDEADLK, which wasn't
handled. This lead to the ctx->contended warn later in modeset_lock()
while taking a WW mutex for another object in the same atomic state, and
thus within the same already contended WW context.
Moving intel_crtc_state_alloc() later would avoid freeing saved_state on
the error path; this stable patch leaves that simplification for a
follow-up.
Cc: Uma Shankar <uma.shankar@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Fixes: a4efae87ec ("drm/i915/dp: Compute DP tunnel BW during encoder state computation")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7617
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patch.msgid.link/20260320092900.13210-1-imre.deak@intel.com
(cherry picked from commit fb69d0076e687421188bc8103ab0e8e5825b1df1)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Pull xen fixes from Juergen Gross:
"Restrict the xen privcmd driver in unprivileged domU to only allow
hypercalls to target domain when using secure boot"
* tag 'xsa482-7.0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add boot control for restricted usage in domU
xen/privcmd: restrict usage in unprivileged domU
Currently, enetc_get_ringparam() only provides rx_pending and tx_pending,
but 'ethtool --show-ring' no longer displays these fields. Because the
ringparam retrieval path has moved to the new netlink interface, where
rings_fill_reply() emits the *x_pending only if the *x_max_pending values
are non-zero. So rx_max_pending and tx_max_pending to are added to
enetc_get_ringparam() to fix the issue.
Note that the maximum tx/rx ring size of hardware is 64K, but we haven't
added set_ringparam() to make the ring size configurable. To avoid users
mistakenly believing that the ring size can be increased, so set
the *x_max_pending to priv->*x_bd_count.
Fixes: e4a1717b67 ("ethtool: provide ring sizes with RINGS_GET request")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260320094222.706339-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When binding a udp_sock to a local address and port, UDP uses
two hashes (udptable->hash and udptable->hash2) for collision
detection. The current code switches to "hash2" when
hslot->count > 10.
"hash2" is keyed by local address and local port.
"hash" is keyed by local port only.
The issue can be shown in the following bind sequence (pseudo code):
bind(fd1, "[fd00::1]:8888")
bind(fd2, "[fd00::2]:8888")
bind(fd3, "[fd00::3]:8888")
bind(fd4, "[fd00::4]:8888")
bind(fd5, "[fd00::5]:8888")
bind(fd6, "[fd00::6]:8888")
bind(fd7, "[fd00::7]:8888")
bind(fd8, "[fd00::8]:8888")
bind(fd9, "[fd00::9]:8888")
bind(fd10, "[fd00::10]:8888")
/* Correctly return -EADDRINUSE because "hash" is used
* instead of "hash2". udp_lib_lport_inuse() detects the
* conflict.
*/
bind(fail_fd, "[::]:8888")
/* After one more socket is bound to "[fd00::11]:8888",
* hslot->count exceeds 10 and "hash2" is used instead.
*/
bind(fd11, "[fd00::11]:8888")
bind(fail_fd, "[::]:8888") /* succeeds unexpectedly */
The same issue applies to the IPv4 wildcard address "0.0.0.0"
and the IPv4-mapped wildcard address "::ffff:0.0.0.0". For
example, if there are existing sockets bound to
"192.168.1.[1-11]:8888", then binding "0.0.0.0:8888" or
"[::ffff:0.0.0.0]:8888" can also miss the conflict when
hslot->count > 10.
TCP inet_csk_get_port() already has the correct check in
inet_use_bhash2_on_bind(). Rename it to
inet_use_hash2_on_bind() and move it to inet_hashtables.h
so udp.c can reuse it in this fix.
Fixes: 30fff9231f ("udp: bind() optimisation")
Reported-by: Andrew Onyshchuk <oandrew@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260319181817.1901357-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_FIXED_PHY=m but CONFIG_B44=y, the kernel fails to link:
ld.lld: error: undefined symbol: fixed_phy_unregister
>>> referenced by b44.c
>>> drivers/net/ethernet/broadcom/b44.o:(b44_remove_one) in archive vmlinux.a
ld.lld: error: undefined symbol: fixed_phy_register_100fd
>>> referenced by b44.c
>>> drivers/net/ethernet/broadcom/b44.o:(b44_register_phy_one) in archive vmlinux.a
The fixed phy support is small enough that just always enabling it
for b44 is the simplest solution, and it avoids adding ugly #ifdef
checks.
Fixes: 10d2f15afb ("net: b44: register a fixed phy using fixed_phy_register_100fd if needed")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260320154927.674555-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since 0417adf367 ("ppp: fix race conditions in ppp_fill_forward_path")
dev_fill_forward_path() should be called with RCU read lock held. This
fix was applied to net, while the Airoha flowtable commit was applied to
net-next, so it hadn't been an issue until net was merged into net-next.
Fixes: a8bdd935d1 ("net: airoha: Add wlan flowtable TX offload")
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260320094315.525126-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`packet_release()` has a race window where `NETDEV_UP` can re-register a
socket into a fanout group's `arr[]` array. The re-registration is not
cleaned up by `fanout_release()`, leaving a dangling pointer in the fanout
array.
`packet_release()` does NOT zero `po->num` in its `bind_lock` section.
After releasing `bind_lock`, `po->num` is still non-zero and `po->ifindex`
still matches the bound device. A concurrent `packet_notifier(NETDEV_UP)`
that already found the socket in `sklist` can re-register the hook.
For fanout sockets, this re-registration calls `__fanout_link(sk, po)`
which adds the socket back into `f->arr[]` and increments `f->num_members`,
but does NOT increment `f->sk_ref`.
The fix sets `po->num` to zero in `packet_release` while `bind_lock` is
held to prevent NETDEV_UP from linking, preventing the race window.
This bug was found following an additional audit with Claude Code based
on CVE-2025-38617.
Fixes: ce06b03e60 ("packet: Add helpers to register/unregister ->prot_hook")
Link: https://blog.calif.io/p/a-race-within-a-race-exploiting-cve
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260319200610.25101-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima says:
====================
ipv6: Fix two GC issues with permanent routes.
Patch 1 fixes the unbounded growth of tb6_gc_hlist due to
permanent routes whose exception routes have all expired.
Patch 2 fixes an issue where exception routes tied to
permanent routes are not properly aged.
Patch 3 is a selftest for the issue fixed by Patch 2.
====================
Link: https://patch.msgid.link/20260320072317.2561779-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Without the prior commit, IPv6 GC cannot track exceptions tied
to permanent routes if they were originally added as temporary
routes.
Let's add a test case for the issue.
1. Add temporary routes
2. Create exceptions for the temporary routes
3. Promote the routes to permanent routes
4. Check if GC can find and purge the exceptions
A few notes:
+ At step 4, unlike other test cases, we cannot wait for
$GC_WAIT_TIME. While the exceptions are always iterable via
netlink (since it traverses the entire fib tree instead of
tb6_gc_hlist), rt6_nh_dump_exceptions() skips expired entries.
If we waited for the expiration time, we would be unable to
distinguish whether the exceptions were truly purged by GC or
just hidden due to being expired.
+ For the same reason, at step 2, we use ICMPv6 redirect message
instead of Packet Too Big message. This is because MTU exceptions
always have RTF_EXPIRES, and rt6_age_examine_exception() does not
respect the period specified by net.ipv6.route.flush=1.
+ We add a neighbour entry for the redirect target with NTF_ROUTER.
Without this, the exceptions would be removed at step 3 when the
fib6_may_remove_gc_list() is called.
Without the fix, the exceptions remain even after GC is triggered
by sysctl -wq net.ipv6.route.flush=1.
FAIL: Expected 0 routes, got 5
TEST: ipv6 route garbage collection (promote to permanent routes) [FAIL]
With the fix, GC purges the exceptions properly.
TEST: ipv6 route garbage collection (promote to permanent routes) [ OK ]
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit mechanically put fib6_remove_gc_list()
just after every fib6_clean_expires() call.
When a temporary route is promoted to a permanent route,
there may already be exception routes tied to it.
If fib6_remove_gc_list() removes the route from tb6_gc_hlist,
such exception routes will no longer be aged.
Let's replace fib6_remove_gc_list() with a new helper
fib6_may_remove_gc_list() and use fib6_age_exceptions() there.
Note that net->ipv6 is only compiled when CONFIG_IPV6 is
enabled, so fib6_{add,remove,may_remove}_gc_list() are guarded.
Fixes: 5eb902b8e7 ("net/ipv6: Remove expired routes with a separated list of routes.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5eb902b8e7 ("net/ipv6: Remove expired routes with a
separated list of routes.") introduced a per-table GC list and
changed GC to iterate over that list instead of traversing
the entire route table.
However, it forgot to add permanent routes to tb6_gc_hlist
when exception routes are added.
Commit cfe82469a0 ("ipv6: add exception routes to GC list
in rt6_insert_exception") fixed that issue but introduced
another one.
Even after all exception routes expire, the permanent routes
remain in tb6_gc_hlist, potentially negating the performance
benefits intended by the initial change.
Let's count gc_args->more before and after rt6_age_exceptions()
and remove the permanent route when the delta is 0.
Note that the next patch will reuse fib6_age_exceptions().
Fixes: cfe82469a0 ("ipv6: add exception routes to GC list in rt6_insert_exception")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The prefix/header of a TLP that caused an error may be recorded in the AER
Capability and emitted to the kernel log in raw hex format. Document the
existence and usage of tlp-tool, which decodes the TLP Header into
human-readable form.
The TLP Header hints at the root cause of an error, yet is often ignored
because of its seeming opaqueness. Instead, PCIe errors are frequently
worked around by a change in the kernel without fully understanding the
actual source of the problem. With more documentation on available tools
we'll hopefully come up with better solutions.
There are also wireshark dissectors for TLPs, but it seems they expect a
complete TLP, not just the header, and they cannot grok the hex format
emitted by the kernel directly. tlp-tool appears to be the most cut and
dried solution out there.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Maciej Grochowski <mx2pg@pm.me>
Link: https://patch.msgid.link/bf826c41b4c1d255c7dcb16e266b52f774d944ed.1774246067.git.lukas@wunner.de
Deinit calls idpf_idc_deinit_core_aux_device to free the cdev_info
memory, but leaves the adapter->cdev_info field with a stale pointer
value. This will bypass subsequent "if (!cdev_info)" checks if cdev_info
is not reallocated. For example, if idc_init fails after a reset,
cdev_info will already have been freed during the reset handling, but it
will not have been reallocated. The next reset or rmmod will result in a
crash.
[ +0.000008] BUG: kernel NULL pointer dereference, address: 00000000000000d0
[ +0.000033] #PF: supervisor read access in kernel mode
[ +0.000020] #PF: error_code(0x0000) - not-present page
[ +0.000017] PGD 2097dfa067 P4D 0
[ +0.000017] Oops: Oops: 0000 [#1] SMP NOPTI
...
[ +0.000018] RIP: 0010:device_del+0x3e/0x3d0
[ +0.000010] Call Trace:
[ +0.000010] <TASK>
[ +0.000012] idpf_idc_deinit_core_aux_device+0x36/0x70 [idpf]
[ +0.000034] idpf_vc_core_deinit+0x3e/0x180 [idpf]
[ +0.000035] idpf_remove+0x40/0x1d0 [idpf]
[ +0.000035] pci_device_remove+0x42/0xb0
[ +0.000020] device_release_driver_internal+0x19c/0x200
[ +0.000024] driver_detach+0x48/0x90
[ +0.000018] bus_remove_driver+0x6d/0x100
[ +0.000023] pci_unregister_driver+0x2e/0xb0
[ +0.000022] __do_sys_delete_module.isra.0+0x18c/0x2b0
[ +0.000025] ? kmem_cache_free+0x2c2/0x390
[ +0.000023] do_syscall_64+0x107/0x7d0
[ +0.000023] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Pass the adapter struct into idpf_idc_deinit_core_aux_device instead and
clear the cdev_info ptr.
Fixes: f4312e6bfa ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice_repr_get_stats64() and __ice_get_ethtool_stats() call
ice_update_vsi_stats() on the VF's src_vsi. This always returns early
because ICE_VSI_DOWN is permanently set for VF VSIs - ice_up() is never
called on them since queues are managed by iavf through virtchnl.
In __ice_get_ethtool_stats() the original code called
ice_update_vsi_stats() for all VSIs including representors, iterated
over ice_gstrings_vsi_stats[] to populate the data, and then bailed out
with an early return before the per-queue ring stats section. That early
return was necessary because representor VSIs have no rings on the PF
side - the rings belong to the VF driver (iavf), so accessing per-queue
stats would be invalid.
Move the representor handling to the top of __ice_get_ethtool_stats()
and call ice_update_eth_stats() directly to read the hardware GLV_*
counters. This matches ice_get_vf_stats() which already uses
ice_update_eth_stats() for the same VF VSI in legacy mode. Apply the
same fix to ice_repr_get_stats64().
Note that ice_gstrings_vsi_stats[] contains five software ring counters
(rx_buf_failed, rx_page_failed, tx_linearize, tx_busy, tx_restart) that
are always zero for representors since the PF never processes packets on
VF rings. This is pre-existing behavior unchanged by this patch.
Fixes: 7aae80cef7 ("ice: add port representor ethtool ops and stats")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
With the move to explicit pwrctrl power on/off APIs, the caller, i.e., the
PCI controller driver, should manage the power state. The pwrctrl drivers
should not try to clean up or power off when they are removed, as this
might end up disabling an already disabled regulator, causing a big
warning. This can be triggered if a PCI controller driver's .remove()
callback calls pci_pwrctrl_destroy_devices() after
pci_pwrctrl_power_off_devices().
Drop the devm cleanup parts that turn off regulators from the pwrctrl
drivers.
Fixes: b921aa3f8d ("PCI/pwrctrl: Switch to pwrctrl create, power on/off, destroy APIs")
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260226092234.3859740-1-wenst@chromium.org
Commit 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
refactored the VF readiness check into a generic repr->ops.ready()
callback but implemented ice_repr_ready_vf() with inverted logic:
return !ice_check_vf_ready_for_cfg(repr->vf);
ice_check_vf_ready_for_cfg() returns 0 on success, so the negation
makes ready() return non-zero when the VF is ready. All callers treat
non-zero as "not ready, skip", causing ndo_get_stats64, get_drvinfo,
get_strings and get_ethtool_stats to always bail out in switchdev mode.
Remove the erroneous negation. The SF variant ice_repr_ready_sf() is
already correct (returns !active, i.e. non-zero when not active).
Fixes: 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When allocating netdevice using alloc_etherdev_mqs() the maximum
supported queues number should be passed. The vsi->alloc_txq/rxq is
storing current number of queues, not the maximum ones.
Use the same function for getting max Tx and Rx queues which is used
during ethtool -l call to set maximum number of queues during netdev
allocation.
Reproduction steps:
$ethtool -l $pf # says current 16, max 64
$ethtool -S $pf # fine
$ethtool -L $pf combined 40 # crash
[491187.472594] Call Trace:
[491187.472829] <TASK>
[491187.473067] netif_set_xps_queue+0x26/0x40
[491187.473305] ice_vsi_cfg_txq+0x265/0x3d0 [ice]
[491187.473619] ice_vsi_cfg_lan_txqs+0x68/0xa0 [ice]
[491187.473918] ice_vsi_cfg_lan+0x2b/0xa0 [ice]
[491187.474202] ice_vsi_open+0x71/0x170 [ice]
[491187.474484] ice_vsi_recfg_qs+0x17f/0x230 [ice]
[491187.474759] ? dev_get_min_mp_channel_count+0xab/0xd0
[491187.474987] ice_set_channels+0x185/0x3d0 [ice]
[491187.475278] ethnl_set_channels+0x26f/0x340
Fixes: ee13aa1a2c ("ice: use netif_get_num_default_rss_queues()")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.
Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.
Fixes: 733f4fbbc1 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If overlay is used on top of btrfs, dentry->d_sb translates to overlay's
super block and fsid assignment will lead to a crash.
Use file_inode(file)->i_sb to always get btrfs_sb.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit 3d74a7556f ("btrfs: zlib: introduce zlib_compress_bio()
helper"), there are some reports about different crashes in zlib
compression path. One of the symptoms is list corruption like the
following:
list_del corruption. next->prev should be fffffbb340204a08, but was ffff8d6517cb7de0. (next=fffffbb3402d62c8)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:65!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 21436 Comm: kworker/u16:7 Not tainted 7.0.0-rc2-jcg+ #1 PREEMPT
Hardware name: LENOVO 10VGS02P00/3130, BIOS M1XKT57A 02/10/2022
Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
RIP: 0010:__list_del_entry_valid_or_report+0xec/0xf0
Call Trace:
<TASK>
btrfs_alloc_compr_folio+0xae/0xc0 [btrfs]
zlib_compress_bio+0x39d/0x6a0 [btrfs]
btrfs_compress_bio+0x2e3/0x3d0 [btrfs]
compress_file_range+0x2b0/0x660 [btrfs]
btrfs_work_helper+0xdb/0x3e0 [btrfs]
process_one_work+0x192/0x3d0
worker_thread+0x19a/0x310
kthread+0xdf/0x120
ret_from_fork+0x22e/0x310
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
Other symptoms include VM_BUG_ON() during folio_put() but it's rarer.
David Sterba firstly reported this during his CI runs but unfortunately
I'm unable to hit it.
Meanwhile zstd/lzo doesn't seem to have the same problem.
[CAUSE]
During zlib_compress_bio() every time the output buffer is full, we
queue the full folio into the compressed bio, and allocate a new folio
as the output folio.
After the input has finished, we loop through zlib_deflate() with
Z_FINISH to flush all output.
And when that is done, we still need to check if the last folio has any
content, and if so we still need to queue that part into the compressed
bio.
The problem is in the final folio handling, if the final folio is full
(for x86_64 the folio size is 4K), the length to queue is calculated by
u32 cur_len = offset_in_folio(out_folio, workspace->strm.total_out);
But since total_out is 4K aligned, the resulted @cur_len will be 0, then
we hit the bio_add_folio(), which has a quirk that if bio_add_folio()
got an length 0, it will still queue the folio into the bio, but return
false.
In that case we go to out: tag, which calls btrfs_free_compr_folio() to
release @out_folio, which may put the out folio into the btrfs global
pool list.
On the other hand, that @out_folio is already added to the
compressed bio, and will later be released again by
cleanup_compressed_bio(), which results double release.
And if this time we still need to put the folio into the btrfs global
pool list, it will result a list corruption because it's already in the
list.
[FIX]
Instead of offset_inside_folio(), directly use the difference between
strm.total_out and bi_size.
So that if the last folio is completely full, we can still properly
queue the full folio other than queueing zero byte.
Fixes: 3d74a7556f ("btrfs: zlib: introduce zlib_compress_bio() helper")
Reported-by: David Sterba <dsterba@suse.com>
Reported-by: Jean-Christophe Guillain <jean-christophe@guillain.net>
Reported-by: syzbot+3c4d8371d65230f852a2@syzkaller.appspotmail.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221176
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When create_space_info_sub_group() allocates elements of
space_info->sub_group[], kobject_init_and_add() is called for each
element via btrfs_sysfs_add_space_info_type(). However, when
check_removing_space_info() frees these elements, it does not call
btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is
not called and the associated kobj->name objects are leaked.
This memory leak is reproduced by running the blktests test case
zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak
feature reports the following error:
unreferenced object 0xffff888112877d40 (size 16):
comm "mount", pid 1244, jiffies 4294996972
hex dump (first 16 bytes):
64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc......
backtrace (crc 53ffde4d):
__kmalloc_node_track_caller_noprof+0x619/0x870
kstrdup+0x42/0xc0
kobject_set_name_vargs+0x44/0x110
kobject_init_and_add+0xcf/0x150
btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs]
create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs]
create_space_info+0x211/0x320 [btrfs]
btrfs_init_space_info+0x15a/0x1b0 [btrfs]
open_ctree+0x33c7/0x4a50 [btrfs]
btrfs_get_tree.cold+0x9f/0x1ee [btrfs]
vfs_get_tree+0x87/0x2f0
vfs_cmd_create+0xbd/0x280
__do_sys_fsconfig+0x3df/0x990
do_syscall_64+0x136/0x1540
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid the leak, call btrfs_sysfs_remove_space_info() instead of
kfree() for the elements.
Fixes: f92ee31e03 ("btrfs: introduce btrfs_space_info sub-group")
Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging that an inode exists, as part of logging a new name or
logging new dir entries for a directory, we always set the generation of
the logged inode item to 0. This is to signal during log replay (in
overwrite_item()), that we should not set the i_size since we only logged
that an inode exists, so the i_size of the inode in the subvolume tree
must be preserved (as when we log new names or that an inode exists, we
don't log extents).
This works fine except when we have already logged an inode in full mode
or it's the first time we are logging an inode created in a past
transaction, that inode has a new i_size of 0 and then we log a new name
for the inode (due to a new hardlink or a rename), in which case we log
an i_size of 0 for the inode and a generation of 0, which causes the log
replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when
we first log the inode, when we fsync file foo, we log its current i_size
of 0, and then when we create a hard link we log again the inode in exists
mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we
add to the log tree, so during log replay overwrite_item() sees that the
generation is 0 and i_size is 0 so we skip updating the inode's i_size
from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real
generation of the inode if it was logged in the current transaction with
the i_size we logged before. Also if an inode created in a previous
transaction is logged in exists mode only, make sure we log the i_size
stored in the inode item located from the commit root, so that if we log
multiple times that the inode exists we get the correct i_size.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the superblock offset mismatch error message in
btrfs_validate_super(): we changed it so that it considers all the
superblocks, but the message still assumes we're only looking at the
first one.
The change from %u to %llu is because we're changing from a constant to
a u64.
Fixes: 069ec957c3 ("btrfs: Refactor btrfs_check_super_valid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Forcibly disable the OD_FAN_CURVE feature when temperature or PWM range is invalid,
otherwise PMFW will reject this configuration on smu v13.0.x
example:
$ sudo cat /sys/bus/pci/devices/<BDF>/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 0C 0%
1: 0C 0%
2: 0C 0%
3: 0C 0%
4: 0C 0%
OD_RANGE:
FAN_CURVE(hotspot temp): 0C 0C
FAN_CURVE(fan speed): 0% 0%
$ echo "0 50 40" | sudo tee fan_curve
kernel log:
[ 756.442527] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
[ 777.345800] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
Closes: https://github.com/ROCm/amdgpu/issues/208
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 470891606c5a97b1d0d937e0aa67a3bed9fcb056)
Cc: stable@vger.kernel.org
When SET_UCLK_MAX capability is absent, return -EOPNOTSUPP from
smu_v13_0_6_emit_clk_levels() for OD_MCLK instead of 0. This makes
unsupported OD_MCLK reporting consistent with other clock types
and allows callers to skip the entry cleanly.
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d82e0a72d9189e8acd353988e1a57f85ce479e37)
Cc: stable@vger.kernel.org
Only reapply UCLK soft limits during PP_OD_RESTORE_DEFAULT when the
current max differs from the DPM table max. This avoids redundant
SMC updates and prevents -EINVAL on restore when no change is needed.
Fixes: b7a9003445 ("drm/amd/pm: Allow setting max UCLK on SMU v13.0.6")
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 17f11bbbc76c8e83c8474ea708316b1e3631d927)
[WHAT]
When a sink is connected, aconnector->drm_edid was overwritten without
freeing the previous allocation, causing a memory leak on resume.
[HOW]
Free the previous drm_edid before updating it.
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Chuanyu Tseng <chuanyu.tseng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 52024a94e7111366141cfc5d888b2ef011f879e5)
Cc: stable@vger.kernel.org
PASID resue could cause interrupt issue when process
immediately runs into hw state left by previous
process exited with the same PASID, it's possible that
page faults are still pending in the IH ring buffer when
the process exits and frees up its PASID. To prevent the
case, it uses idr cyclic allocator same as kernel pid's.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8f1de51f49be692de137c8525106e0fce2d1912d)
Cc: stable@vger.kernel.org
amdgpu_device_get_job_timeout_settings() passes a pointer directly
to the global amdgpu_lockup_timeout[] buffer into strsep().
strsep() destructively replaces delimiter characters with '\0'
in-place.
On multi-GPU systems, this function is called once per device.
When a multi-value setting like "0,0,0,-1" is used, the first
GPU's call transforms the global buffer into "0\00\00\0-1". The
second GPU then sees only "0" (terminated at the first '\0'),
parses a single value, hits the single-value fallthrough
(index == 1), and applies timeout=0 to all rings — causing
immediate false job timeouts.
Fix this by copying into a stack-local array before calling
strsep(), so the global module parameter buffer remains intact
across calls. The buffer is AMDGPU_MAX_TIMEOUT_PARAM_LENGTH
(256) bytes, which is safe for the stack.
v2: wrap commit message to 72 columns, add Assisted-by tag.
v3: use stack array with strscpy() instead of kstrdup()/kfree()
to avoid unnecessary heap allocation (Christian).
This patch was developed with assistance from Claude (claude-opus-4-6).
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 94d79f51efecb74be1d88dde66bdc8bfcca17935)
Cc: stable@vger.kernel.org
Starting with commit 17ce8a6907 ("drm/amd/display: Add dsc pre-validation in
atomic check"), amdgpu resets the CRTC state mode_changed flag to false when
recomputing the DSC configuration results in no timing change for a particular
stream.
However, this is incorrect in scenarios where a change in MST/DSC configuration
happens in the same KMS commit as another (unrelated) mode change. For example,
the integrated panel of a laptop may be configured differently (e.g., HDR
enabled/disabled) depending on whether external screens are attached. In this
case, plugging in external DP-MST screens may result in the mode_changed flag
being dropped incorrectly for the integrated panel if its DSC configuration
did not change during precomputation in pre_validate_dsc().
At this point, however, dm_update_crtc_state() has already created new streams
for CRTCs with DSC-independent mode changes. In turn,
amdgpu_dm_commit_streams() will never release the old stream, resulting in a
memory leak. amdgpu_dm_atomic_commit_tail() will never acquire a reference to
the new stream either, which manifests as a use-after-free when the stream gets
disabled later on:
BUG: KASAN: use-after-free in dc_stream_release+0x25/0x90 [amdgpu]
Write of size 4 at addr ffff88813d836524 by task kworker/9:9/29977
Workqueue: events drm_mode_rmfb_work_fn
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
print_address_description.constprop.0+0x88/0x320
? dc_stream_release+0x25/0x90 [amdgpu]
print_report+0xfc/0x1ff
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x225/0x4e0
? dc_stream_release+0x25/0x90 [amdgpu]
kasan_report+0xe1/0x180
? dc_stream_release+0x25/0x90 [amdgpu]
kasan_check_range+0x125/0x200
dc_stream_release+0x25/0x90 [amdgpu]
dc_state_destruct+0x14d/0x5c0 [amdgpu]
dc_state_release.part.0+0x4e/0x130 [amdgpu]
dm_atomic_destroy_state+0x3f/0x70 [amdgpu]
drm_atomic_state_default_clear+0x8ee/0xf30
? drm_mode_object_put.part.0+0xb1/0x130
__drm_atomic_state_free+0x15c/0x2d0
atomic_remove_fb+0x67e/0x980
Since there is no reliable way of figuring out whether a CRTC has unrelated
mode changes pending at the time of DSC validation, remember the value of the
mode_changed flag from before the point where a CRTC was marked as potentially
affected by a change in DSC configuration. Reset the mode_changed flag to this
earlier value instead in pre_validate_dsc().
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/5004
Fixes: 17ce8a6907 ("drm/amd/display: Add dsc pre-validation in atomic check")
Signed-off-by: Yussuf Khalil <dev@pp3345.net>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit cc7c7121ae082b7b82891baa7280f1ff2608f22b)
DAMON_STAT usage document (Documentation/admin-guide/mm/damon/stat.rst)
says it monitors the system's entire physical memory. But, it is
monitoring only the biggest System RAM resource of the system. When there
are multiple System RAM resources, this results in monitoring only an
unexpectedly small fraction of the physical memory. For example, suppose
the system has a 500 GiB System RAM, 10 MiB non-System RAM, and 500 GiB
System RAM resources in order on the physical address space. DAMON_STAT
will monitor only the first 500 GiB System RAM. This situation is
particularly common on NUMA systems.
Select a physical address range that covers all System RAM areas of the
system, to fix this issue and make it work as documented.
[sj@kernel.org: return error if monitoring target region is invalid]
Link: https://lkml.kernel.org/r/20260317053631.87907-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260316235118.873-1-sj@kernel.org
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit e2c3b6b21c ("mm: zswap: use SG list decompression APIs from
zsmalloc") updated zswap_decompress() to use the scatterwalk API to copy
data for uncompressed pages.
In doing so, it mapped kernel memory locally for 32-bit kernels using
kmap_local_folio(), however it never unmapped this memory.
This resulted in the linked syzbot report where a BUG_ON() is triggered
due to leaking the kmap slot.
This patch fixes the issue by explicitly unmapping the established kmap.
Also, add flush_dcache_folio() after the kunmap_local() call
I had assumed that a new folio here combined with the flush that is done at
the point of setting the PTE would suffice, but it doesn't seem that's
actually the case, as update_mmu_cache() will in many archtectures only
actually flush entries where a dcache flush was done on a range previously.
I had also wondered whether kunmap_local() might suffice, but it doesn't
seem to be the case.
Some arches do seem to actually dcache flush on unmap, parisc does it if
CONFIG_HIGHMEM is not set by setting ARCH_HAS_FLUSH_ON_KUNMAP and calling
kunmap_flush_on_unmap() from __kunmap_local(), otherwise non-CONFIG_HIGHMEM
callers do nothing here.
Otherwise arch_kmap_local_pre_unmap() is called which does:
* sparc - flush_cache_all()
* arm - if VIVT, __cpuc_flush_dcache_area()
* otherwise - nothing
Also arch_kmap_local_post_unmap() is called which does:
* arm - local_flush_tlb_kernel_page()
* csky - kmap_flush_tlb()
* microblaze, ppc - local_flush_tlb_page()
* mips - local_flush_tlb_one()
* sparc - flush_tlb_all() (again)
* x86 - arch_flush_lazy_mmu_mode()
* otherwise - nothing
But this is only if it's high memory, and doesn't cover all architectures,
so is presumably intended to handle other cache consistency concerns.
In any case, VIPT is problematic here whether low or high memory (in spite
of what the documentation claims, see [0] - 'the kernel did write to a page
that is in the page cache page and / or in high memory'), because dirty
cache lines may exist at the set indexed by the kernel direct mapping,
which won't exist in the set indexed by any subsequent userland mapping,
meaning userland might read stale data from L2 cache.
Even if the documentation is correct and low memory is fine not to be
flushed here, we can't be sure as to whether the memory is low or high
(kmap_local_folio() will be a no-op if low), and this call should be
harmless if it is low.
VIVT would require more work if the memory were shared and already mapped,
but this isn't the case here, and would anyway be handled by the dcache
flush call.
In any case, we definitely need this flush as far as I can tell.
And we should probably consider updating the documentation unless it turns
out there's somehow dcache synchronisation that happens for low
memory/64-bit kernels elsewhere?
[ljs@kernel.org: add flush_dcache_folio() after the kunmap_local() call]
Link: https://lkml.kernel.org/r/13e09a99-181f-45ac-a18d-057faf94bccb@lucifer.local
Link: https://lkml.kernel.org/r/20260316140122.339697-1-ljs@kernel.org
Link: https://docs.kernel.org/core-api/cachetlb.html [0]
Fixes: e2c3b6b21c ("mm: zswap: use SG list decompression APIs from zsmalloc")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reported-by: syzbot+fe426bef95363177631d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69b75e2c.050a0220.12d28.015a.GAE@google.com
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In f_ospi_probe(), when num_cs validation fails, it returns without
calling spi_controller_put() on the SPI controller, which causes a
resource leak.
Use devm_spi_alloc_host() instead of spi_alloc_host() to ensure the
SPI controller is properly freed when probe fails.
Fixes: 1b74dd64c8 ("spi: Add Socionext F_OSPI SPI flash controller driver")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260319-sn-f-v1-1-33a6738d2da8@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Fixes kernel hang during boot due to inability to set up IRQ on AXP313a.
The issue is caused by gpiochip_lock_as_irq() which is failing when gpio
is in uninitialized state.
Solution is to set pinmux to GPIO INPUT in
sunxi_pinctrl_irq_request_resources() if it wasn't initialized
earlier.
Tested on Orange Pi Zero 3.
Fixes: 01e10d0272 ("pinctrl: sunxi: Implement gpiochip::get_direction()")
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Chen-Yu Tsai <wens@kernel.org>
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Recent changes in the Allwinner pinctrl/GPIO IP made us add some quirks,
which the new SoCs (A523 family) need to use. We now have a comfortable
"flags" field on the per-SoC setup side, to tag those quirks we need, but
were translating those flag bits into specific fields for runtime use, in
the init routine.
Now the newest Allwinner GPIO IP adds even more quirks and exceptions,
some of a boolean nature.
To avoid inventing various new boolean flags for the runtime struct
sunxi_pinctrl, let's just directly pass on the flags variable used by the
setup code, so runtime can check for those various quirk bits directly.
Rename the "variant" member to "flags", and directly copy the value from
the setup code into there. Move the variant masking from the init
routine to the functions which actually use the "variant" value.
This mostly paves the way for the new A733 IP generation, which needs
more quirks to be checked at runtime.
Reviewed-by: Chen-Yu Tsai <wens@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Signed-off-by: Linus Walleij <linusw@kernel.org>
FRED-enabled SEV-(ES,SNP) guests fail to boot due to the following issues
in the early boot sequence:
* FRED does not have a #VC exception handler in the dispatch logic
* Early FRED #VC exceptions attempt to use uninitialized per-CPU GHCBs
instead of boot_ghcb
Add X86_TRAP_VC case to fred_hwexc() with a new exc_vmm_communication()
function that provides the unified entry point FRED requires, dispatching
to existing user/kernel handlers based on privilege level. The function is
already declared via DECLARE_IDTENTRY_VC().
Fix early GHCB access by falling back to boot_ghcb in
__sev_{get,put}_ghcb() when per-CPU GHCBs are not yet initialized.
Fixes: 14619d912b ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org> # 6.12+
Link: https://patch.msgid.link/20260318075654.1792916-4-nikunj@amd.com
The current rule for smb2_mapping_table.c uses `$(call cmd,...)`, which
fails to track command line modifications in the Makefile (e.g., modifying
the command to `perl -d` or `perl -w` for debug will not trigger a rebuild)
and does not generate the required .cmd file for Kbuild.
Fix this by transitioning to the standard `$(call if_changed,...)` macro.
This includes adding the `FORCE` prerequisite and appending the output
file to the `targets` variable so Kbuild can track it properly.
As a result, Kbuild now automatically handles the cleaning of the
generated file, allowing us to safely drop the redundant `clean-files`
assignment.
Fixes: c527e13a7a ("cifs: Autogenerate SMB2 error mapping table")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
Commit in Fixes added the FRED CR4 bit to the CR4 pinned bits mask so
that whenever something else modifies CR4, that bit remains set. Which
in itself is a perfectly fine idea.
However, there's an issue when during boot FRED is initialized: first on
the BSP and later on the APs. Thus, there's a window in time when
exceptions cannot be handled.
This becomes particularly nasty when running as SEV-{ES,SNP} or TDX
guests which, when they manage to trigger exceptions during that short
window described above, triple fault due to FRED MSRs not being set up
yet.
See Link tag below for a much more detailed explanation of the
situation.
So, as a result, the commit in that Link URL tried to address this
shortcoming by temporarily disabling CR4 pinning when an AP is not
online yet.
However, that is a problem in itself because in this case, an attack on
the kernel needs to only modify the online bit - a single bit in RW
memory - and then disable CR4 pinning and then disable SM*P, leading to
more and worse things to happen to the system.
So, instead, remove the FRED bit from the CR4 pinning mask, thus
obviating the need to temporarily disable CR4 pinning.
If someone manages to disable FRED when poking at CR4, then
idt_invalidate() would make sure the system would crash'n'burn on the
first exception triggered, which is a much better outcome security-wise.
Fixes: ff45746fbf ("x86/cpu: Add X86_CR4_FRED macro")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org> # 6.12+
Link: https://lore.kernel.org/r/177385987098.1647592.3381141860481415647.tip-bot2@tip-bot2
Commit 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0.
Some hibernation paths call pm_restore_gfp_mask() defensively where
the GFP mask may or may not be restricted depending on the execution
path. For example, the uswsusp interface invokes it in
SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release().
Before the stacking change this was a silent no-op; it now triggers
a spurious WARNING.
Remove the WARN_ON() wrapper from the !saved_gfp_count check while
retaining the check itself, so that defensive calls remain harmless
without producing false warnings.
Fixes: 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The gz_chain_head variable has been unused since the driver's initial
addition to the tree. Its use was eliminated between v3 and v4 during
development but due to the reference of gz_chain_head's wait_list
member, the compiler could not warn that it was unused.
After a (tip) commit ("locking/rwsem: Remove the list_head from struct
rw_semaphore"), which removed a reference to the variable passed to
__RWSEM_INITIALIZER(), certain configurations show an unused variable
warning from the Lenovo wmi-gamezone driver:
drivers/platform/x86/lenovo/wmi-gamezone.c:34:31: warning: 'gz_chain_head' defined but not used [-Wunused-variable]
34 | static BLOCKING_NOTIFIER_HEAD(gz_chain_head);
| ^~~~~~~~~~~~~
include/linux/notifier.h:119:39: note: in definition of macro 'BLOCKING_NOTIFIER_HEAD'
119 | struct blocking_notifier_head name = \
| ^~~~
Remove the variable to prevent the warning from showing up.
Fixes: 22024ac536 ("platform/x86: Add Lenovo Gamezone WMI Driver")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Link: https://patch.msgid.link/20260313-lenovo-wmi-gamezone-remove-gz_chain_head-v1-1-ce5231f0c6fa@kernel.org
[ij: reorganized the changelog]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
On some systems, HWP can be explicitly disabled in the BIOS settings
When HWP is disabled by firmware, the HWP CPUID bit is not set, and
attempting to read MSR_PM_ENABLE will result in a General Protection
(GP) fault.
unchecked MSR access error: RDMSR from 0x770 at rIP: 0xffffffffc33db92e (disable_dynamic_sst_features+0xe/0x50 [isst_tpmi_core])
Call Trace:
<TASK>
? ex_handler_msr+0xf6/0x150
? fixup_exception+0x1ad/0x340
? gp_try_fixup_and_notify+0x1e/0xb0
? exc_general_protection+0xc9/0x390
? terminate_walk+0x64/0x100
? asm_exc_general_protection+0x22/0x30
? disable_dynamic_sst_features+0xe/0x50 [isst_tpmi_core]
isst_if_def_ioctl+0xece/0x1050 [isst_tpmi_core]
? ioctl_has_perm.constprop.42+0xe0/0x130
isst_if_def_ioctl+0x10d/0x1a0 [isst_if_common]
__se_sys_ioctl+0x86/0xc0
do_syscall_64+0x8a/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f36eaef54a7
Add a check for X86_FEATURE_HWP before accessing the MSR. If HWP is
not available, return true safely.
Fixes: 12a7d2cb81 ("platform/x86: ISST: Add SST-CP support via TPMI")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20260303074635.2218-1-lirongqing@baidu.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The HP Omen 16-xf0xxx board 8BCA uses the same Victus-S fan and
thermal WMI path as other recently supported Omen/Victus boards,
but it requires Omen v1 thermal profile parameters for correct
platform profile behavior.
Add board 8BCA to victus_s_thermal_profile_boards[] and map it
to omen_v1_thermal_params.
Validated on HP Omen 16-xf0xxx (board 8BCA):
- /sys/firmware/acpi/platform_profile exposes
low-power/balanced/performance
- fan RPM reporting works (fan1_input/fan2_input)
- manual fan control works through hp-wmi hwmon (pwm1/pwm1_enable)
Signed-off-by: Raed <thisisraed@outlook.com>
Link: https://patch.msgid.link/20260311131338.965249-1-youaretalkingtoraed@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Commit 005e8dddd4 ("PM: hibernate: don't store zero pages in the
image file") added an optimization to skip zero-filled pages in the
hibernation image. On restore, zero pages are handled internally by
snapshot_write_next() in a loop that processes them without returning
to the caller.
With the userspace restore interface, writing the last non-zero page
to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At
this point there are no more calls to snapshot_write_next() so any
trailing zero pages are not processed, snapshot_image_loaded() fails
because handle->cur is smaller than expected, the ioctl returns -EPERM
and the image is not restored.
The in-kernel restore path is not affected by this because the loop in
load_image() in swap.c calls snapshot_write_next() until it returns 0.
It is this final call that drains any trailing zero pages.
Fixed by calling snapshot_write_next() in snapshot_write_finalize(),
giving the kernel the chance to drain any trailing zero pages.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move FSGSBASE enablement from identify_cpu() to cpu_init_exception_handling()
to ensure it is enabled before any exceptions can occur on both boot and
secondary CPUs.
== Background ==
Exception entry code (paranoid_entry()) uses ALTERNATIVE patching based on
X86_FEATURE_FSGSBASE to decide whether to use RDGSBASE/WRGSBASE instructions
or the slower RDMSR/SWAPGS sequence for saving/restoring GSBASE.
On boot CPU, ALTERNATIVE patching happens after enabling FSGSBASE in CR4.
When the feature is available, the code is permanently patched to use
RDGSBASE/WRGSBASE, which require CR4.FSGSBASE=1 to execute without triggering
== Boot Sequence ==
Boot CPU (with CR pinning enabled):
trap_init()
cpu_init() <- Uses unpatched code (RDMSR/SWAPGS)
x2apic_setup()
...
arch_cpu_finalize_init()
identify_boot_cpu()
identify_cpu()
cr4_set_bits(X86_CR4_FSGSBASE) # Enables the feature
# This becomes part of cr4_pinned_bits
...
alternative_instructions() <- Patches code to use RDGSBASE/WRGSBASE
Secondary CPUs (with CR pinning enabled):
start_secondary()
cr4_init() <- Code already patched, CR4.FSGSBASE=1
set implicitly via cr4_pinned_bits
cpu_init() <- exceptions work because FSGSBASE is
already enabled
Secondary CPU (with CR pinning disabled):
start_secondary()
cr4_init() <- Code already patched, CR4.FSGSBASE=0
cpu_init()
x2apic_setup()
rdmsrq(MSR_IA32_APICBASE) <- Triggers #VC in SNP guests
exc_vmm_communication()
paranoid_entry() <- Uses RDGSBASE with CR4.FSGSBASE=0
(patched code)
...
ap_starting()
identify_secondary_cpu()
identify_cpu()
cr4_set_bits(X86_CR4_FSGSBASE) <- Enables the feature, which is
too late
== CR Pinning ==
Currently, for secondary CPUs, CR4.FSGSBASE is set implicitly through
CR-pinning: the boot CPU sets it during identify_cpu(), it becomes part of
cr4_pinned_bits, and cr4_init() applies those pinned bits to secondary CPUs.
This works but creates an undocumented dependency between cr4_init() and the
pinning mechanism.
== Problem ==
Secondary CPUs boot after alternatives have been applied globally. They
execute already-patched paranoid_entry() code that uses RDGSBASE/WRGSBASE
instructions, which require CR4.FSGSBASE=1. Upcoming changes to CR pinning
behavior will break the implicit dependency, causing secondary CPUs to
generate #UD.
This issue manifests itself on AMD SEV-SNP guests, where the rdmsrq() in
x2apic_setup() triggers a #VC exception early during cpu_init(). The #VC
handler (exc_vmm_communication()) executes the patched paranoid_entry() path.
Without CR4.FSGSBASE enabled, RDGSBASE instructions trigger #UD.
== Fix ==
Enable FSGSBASE explicitly in cpu_init_exception_handling() before loading
exception handlers. This makes the dependency explicit and ensures both
boot and secondary CPUs have FSGSBASE enabled before paranoid_entry()
executes.
Fixes: c82965f9e5 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
Reported-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/20260318075654.1792916-2-nikunj@amd.com
Remove the redundant post-parse validation switch. By the time that
block is reached, xfs_attri_validate() has already guaranteed all name
lengths are non-zero via xfs_attri_validate_namelen(), and
xfs_attri_validate_name_iovec() has already returned -EFSCORRUPTED for
NULL names. For the REMOVE case, attr_value and value_len are
structurally guaranteed to be NULL/zero because the parsing loop only
populates them when value_len != 0. All checks in that switch are
therefore dead code.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The ri_total checks for SET/REPLACE operations are hardcoded to 3,
but xfs_attri_item_size() only emits a value iovec when value_len > 0,
so ri_total is 2 when value_len == 0.
For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by
xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and
is never zero, so their hardcoded checks remain correct.
This problem may cause log recovery failures. The following script can be
used to reproduce the problem:
#!/bin/bash
mkfs.xfs -f /dev/sda
mount /dev/sda /mnt/test/
touch /mnt/test/file
for i in {1..200}; do
attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null
done
echo 1 > /sys/fs/xfs/debug/larp
echo 1 > /sys/fs/xfs/sda/errortag/larp
attr -s "user.zero" -V "" /mnt/test/file
echo 0 > /sys/fs/xfs/sda/errortag/larp
umount /mnt/test
mount /dev/sda /mnt/test/ # mount failed
Fix this by deriving the expected count dynamically as "2 + !!value_len"
for SET/REPLACE operations.
Cc: stable@vger.kernel.org # v6.9
Fixes: ad206ae50e ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When inactivating an inode with node-format extended attributes,
xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
xfs_trans_binval(), but intentionally does not remove the corresponding
entries from their parent node blocks. The implicit assumption is that
xfs_attr_inactive() will truncate the entire attr fork to zero extents
afterwards, so log recovery will never reach the root node and follow
those stale pointers.
However, if a log shutdown occurs after the leaf/node block cancellations
commit but before the attr bmap truncation commits, this assumption
breaks. Recovery replays the attr bmap intact (the inode still has
attr fork extents), but suppresses replay of all cancelled leaf/node
blocks, maybe leaving them as stale data on disk. On the next mount,
xlog_recover_process_iunlinks() retries inactivation and attempts to
read the root node via the attr bmap. If the root node was not replayed,
reading the unreplayed root block triggers a metadata verification
failure immediately; if it was replayed, following its child pointers
to unreplayed child blocks triggers the same failure:
XFS (pmem0): Metadata corruption detected at
xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
XFS (pmem0): Unmount and run xfs_repair
XFS (pmem0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Fix this in two places:
In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
child block, immediately remove the entry that references it from the
parent node in the same transaction. This eliminates the window where
the parent holds a pointer to a cancelled block. Once all children are
removed, the now-empty root node is converted to a leaf block within the
same transaction. This node-to-leaf conversion is necessary for crash
safety. If the system shutdown after the empty node is written to the
log but before the second-phase bmap truncation commits, log recovery
will attempt to verify the root block on disk. xfs_da3_node_verify()
does not permit a node block with count == 0; such a block will fail
verification and trigger a metadata corruption shutdown. on the other
hand, leaf blocks are allowed to have this transient state.
In xfs_attr_inactive(), split the attr fork truncation into two explicit
phases. First, truncate all extents beyond the root block (the child
extents whose parent references have already been removed above).
Second, invalidate the root block and truncate the attr bmap to zero in
a single transaction. The two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery can
follow it to the root block, so the root block invalidation must commit
together with the bmap-to-zero truncation.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor out wrapper xfs_attr3_leaf_init function, which exported for
external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor out wrapper xfs_attr3_node_entry_remove function, which
exported for external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The assertion functions properly because we currently only truncate the
attr to a zero size. Any other new size of the attr is not preempted.
Make this assertion is specific to the datafork, preparing for
subsequent patches to truncate the attribute to a non-zero size.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
unlink_nv12_plane() will clobber parts of the plane state
potentially already set up by plane_atomic_check(), so we
must make sure not to call the two in the wrong order.
The problem happens when a plane previously selected as
a Y plane is now configured as a normal plane by user space.
plane_atomic_check() will first compute the proper plane
state based on the userspace request, and unlink_nv12_plane()
later clears some of the state.
This used to work on account of unlink_nv12_plane() skipping
the state clearing based on the plane visibility. But I removed
that check, thinking it was an impossible situation. Now when
that situation happens unlink_nv12_plane() will just WARN
and proceed to clobber the state.
Rather than reverting to the old way of doing things, I think
it's more clear if we unlink the NV12 planes before we even
compute the new plane state.
Cc: stable@vger.kernel.org
Reported-by: Khaled Almahallawy <khaled.almahallawy@intel.com>
Closes: https://lore.kernel.org/intel-gfx/20260212004852.1920270-1-khaled.almahallawy@intel.com/
Tested-by: Khaled Almahallawy <khaled.almahallawy@intel.com>
Fixes: 6a01df2f1b ("drm/i915: Remove pointless visible check in unlink_nv12_plane()")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260316163953.12905-2-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
(cherry picked from commit 017ecd04985573eeeb0745fa2c23896fb22ee0cc)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Put the barrier() before the OP so that anything we read out in
OP and check in COND will actually be read out after the timeout
has been evaluated.
Currently the only place where we use OP is __intel_wait_for_register(),
but the use there is precisely susceptible to this reordering, assuming
the ktime_*() stuff itself doesn't act as a sufficient barrier:
__intel_wait_for_register(...)
{
...
ret = __wait_for(reg_value = intel_uncore_read_notrace(...),
(reg_value & mask) == value, ...);
...
}
Cc: stable@vger.kernel.org
Fixes: 1c3c1dc66a ("drm/i915: Add compiler barrier to wait_for")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260313110740.24620-1-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit a464bace0482aa9a83e9aa7beefbaf44cd58e6cf)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
After this commit (e2b76ab8b5 "ksmbd: add support for read compound"),
response buffer management was changed to use dynamic iov array.
In the new design, smb2_calc_max_out_buf_len() expects the second
argument (hdr2_len) to be the offset of ->Buffer field in the
response structure, not a hardcoded magic number.
Fix the remaining call sites to use the correct offsetof() value.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb2_lock() has three error handling issues after list_del() detaches
smb_lock from lock_list at no_check_cl:
1) If vfs_lock_file() returns an unexpected error in the non-UNLOCK
path, goto out leaks smb_lock and its flock because the out:
handler only iterates lock_list and rollback_list, neither of
which contains the detached smb_lock.
2) If vfs_lock_file() returns -ENOENT in the UNLOCK path, goto out
leaks smb_lock and flock for the same reason. The error code
returned to the dispatcher is also stale.
3) In the rollback path, smb_flock_init() can return NULL on
allocation failure. The result is dereferenced unconditionally,
causing a kernel NULL pointer dereference. Add a NULL check to
prevent the crash and clean up the bookkeeping; the VFS lock
itself cannot be rolled back without the allocation and will be
released at file or connection teardown.
Fix cases 1 and 2 by hoisting the locks_free_lock()/kfree() to before
the if(!rc) check in the UNLOCK branch so all exit paths share one
free site, and by freeing smb_lock and flock before goto out in the
non-UNLOCK branch. Propagate the correct error code in both cases.
Fix case 3 by wrapping the VFS unlock in an if(rlock) guard and adding
a NULL check for locks_free_lock(rlock) in the shared cleanup.
Found via call-graph analysis using sqry.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Suggested-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb_grant_oplock() has two issues in the oplock publication sequence:
1) opinfo is linked into ci->m_op_list (via opinfo_add) before
add_lease_global_list() is called. If add_lease_global_list()
fails (kmalloc returns NULL), the error path frees the opinfo
via __free_opinfo() while it is still linked in ci->m_op_list.
Concurrent m_op_list readers (opinfo_get_list, or direct iteration
in smb_break_all_levII_oplock) dereference the freed node.
2) opinfo->o_fp is assigned after add_lease_global_list() publishes
the opinfo on the global lease list. A concurrent
find_same_lease_key() can walk the lease list and dereference
opinfo->o_fp->f_ci while o_fp is still NULL.
Fix by restructuring the publication sequence to eliminate post-publish
failure:
- Set opinfo->o_fp before any list publication (fixes NULL deref).
- Preallocate lease_table via alloc_lease_table() before opinfo_add()
so add_lease_global_list() becomes infallible after publication.
- Keep the original m_op_list publication order (opinfo_add before
lease list) so concurrent opens via same_client_has_lease() and
opinfo_get_list() still see the in-flight grant.
- Use opinfo_put() instead of __free_opinfo() on err_out so that
the RCU-deferred free path is used.
This also requires splitting add_lease_global_list() to take a
preallocated lease_table and changing its return type from int to void,
since it can no longer fail.
Fixes: 1dfd062caa ("ksmbd: fix use-after-free by using call_rcu() for oplock_info")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a multichannel session binding request fails (e.g. wrong password),
the error path unconditionally sets sess->state = SMB2_SESSION_EXPIRED.
However, during binding, sess points to the target session looked up via
ksmbd_session_lookup_slowpath() -- which belongs to another connection's
user. This allows a remote attacker to invalidate any active session by
simply sending a binding request with a wrong password (DoS).
Fix this by skipping session expiration when the failed request was
a binding attempt, since the session does not belong to the current
connection. The reference taken by ksmbd_session_lookup_slowpath() is
still correctly released via ksmbd_user_session_put().
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
To pick up the changes in:
6ffd853b0b ("build_bug.h: correct function parameters names in kernel-doc")
That just add some comments, addressing this perf tools build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/linux/build_bug.h include/linux/build_bug.h
Please take a look at tools/include/uapi/README for further info on this
synchronization process.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
e2ffe85b6d ("KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM")
That just rebuilds kvm-stat.c on x86, no change in functionality.
This silences these perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
da142f3d37 ("KVM: Remove subtle "struct kvm_stats_desc" pseudo-overlay")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the 'perf trace' ioctl syscall argument beautifiers.
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes from these csets:
9073428bb2 ("x86/sev: Allow IBPB-on-Entry feature for SNP guests")
That cause no changes to tooling as it doesn't include a new MSR to be
captured by the tools/perf/trace/beauty/tracepoints/x86_msr.sh script.
Just silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull bpf fixes from Alexei Starovoitov:
- Fix how linked registers track zero extension of subregisters (Daniel
Borkmann)
- Fix unsound scalar fork for OR instructions (Daniel Wade)
- Fix exception exit lock check for subprogs (Ihor Solodrai)
- Fix undefined behavior in interpreter for SDIV/SMOD instructions
(Jenny Guanni Qu)
- Release module's BTF when module is unloaded (Kumar Kartikeya
Dwivedi)
- Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)
- Reset register ID for END instructions to prevent incorrect value
tracking (Yazhou Tang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
selftests/bpf: Add tests for bpf_throw lock leak from subprogs
bpf: Fix exception exit lock checking for subprogs
bpf: Release module BTF IDR before module unload
selftests/bpf: Fix pkg-config call on static builds
bpf: Fix constant blinding for PROBE_MEM32 stores
selftests/bpf: Add test for BPF_END register ID reset
bpf: Reset register ID for BPF_END value tracking
Pull tracing fixes from Steven Rostedt:
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that it was always done for the current task
and recording the pid would be redundant. This turned out to be
incorrect and there are a few corner case where this is not true and
caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses a seq lock type of logic where it uses
a per-cpu temporary buffer and disables migration, then enables
preemption, does the copy from user space, disables preemption,
enables migration and checks if there was any schedule switches while
preemption was enabled. If there was a context switch, then it is
considered that the per-cpu buffer could be corrupted and it tries
again. There's a protection check that tests if it takes a hundred
tries, it issues a warning and exits out to prevent a live lock.
This was triggered because the task was selected by the load balancer
to be migrated to another CPU, every time preemption is enabled the
migration task would schedule in try to migrate the task but can't
because migration is disabled and let it run again. This caused the
scheduler to schedule out the task every time it enabled preemption
and made the loop never exit (until the 100 iteration test
triggered).
Fix this by enabling and disabling preemption and keeping migration
enabled if the reading from user space needs to be done again. This
will let the migration thread migrate the task and the copy from user
space will likely pass on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a
copy of a write to the trace_marker file of the top level instance.
This is managed by a link list protected by RCU. When an instance is
removed, a check is made if the option is set, and if so
synchronized_rcu() is called.
The problem is that an iteration is made to reset all the flags to
what they were when the instance was created (to perform clean ups)
was done before the check of the copy_trace_marker option and that
option was cleared, so the synchronize_rcu() was never called.
Move the clearing of all the flags after the check of
copy_trace_marker to do synchronize_rcu() so that the option is still
set if it was before and the synchronization is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The
validator was updating cpu_buffer->head_page (the first sub-buffer of
the per-cpu buffer) and not the "head_page" variable that was
iterating the sub-buffers. This was causing the first sub-buffer to
be assigned the entries for each sub-buffer and not the sub-buffer
that was supposed to be updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary
callback to all the callback functions of the ftrace ops and not just
the functions represented by the passed in hash. This causes an
unnecessary slow down of the functions of the ftrace_ops that is not
being modified. Only update the functions that are going to be
modified to call the ftrace loop function so that the update can be
made on those functions.
* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
tracing: Fix trace_marker copy link list updates
tracing: Fix failure to read user space from system call trace events
tracing: Revert "tracing: Remove pid in task_rename tracing output"
Pull i2c fixes from Wolfram Sang:
- fix broken I2C communication on Armada 3700 with recovery
- fix device_node reference leak in probe (fsi)
- fix NULL-deref when serial string is missing (cp2615)
* tag 'i2c-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pxa: defer reset on Armada 3700 when recovery is used
i2c: fsi: Fix a potential leak in fsi_i2c_probe()
i2c: cp2615: fix serial string NULL-deref at probe
Pull x86 fixes from Ingo Molnar:
- Improve Qemu MCE-injection behavior by only using AMD SMCA MSRs if
the feature bit is set
- Fix the relative path of gettimeofday.c inclusion in vclock_gettime.c
- Fix a boot crash on UV clusters when a socket is marked as
'deconfigured' which are mapped to the SOCK_EMPTY node ID by
the UV firmware, while Linux APIs expect NUMA_NO_NODE.
The difference being (0xffff [unsigned short ~0]) vs [int -1]
* tag 'x86-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Handle deconfigured sockets
x86/entry/vdso: Fix path of included gettimeofday.c
x86/mce/amd: Check SMCA feature bit before accessing SMCA MSRs
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova lake, to fix a snoop
information parsing bug
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
Pull objtool fixes from Ingo Molnar:
"Fix three more livepatching related build environment bugs, and a
false positive warning with Clang jump tables"
* tag 'objtool-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix Clang jump table detection
livepatch/klp-build: Fix inconsistent kernel version
objtool/klp: fix mkstemp() failure with long paths
objtool/klp: fix data alignment in __clone_symbol()
Pull locking fix from Ingo Molnar:
"Fix a sparse build error regression in <linux/local_lock_internal.h>
caused by the locking context-analysis changes"
* tag 'locking-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
include/linux/local_lock_internal.h: Make this header file again compatible with sparse
Pull irq fix from Ingo Molnar:
"Fix a mailbox channel leak in the riscv-rpmi-sysmsi irqchip driver"
* tag 'irq-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-rpmi-sysmsi: Fix mailbox channel leak in rpmi_sysmsi_probe()
The call to mipi_dsi_host_register triggers a callback to mtk_dsi_bind,
which uses dev_get_drvdata to retrieve the mtk_dsi struct, so this
structure needs to be stored inside the driver data before invoking it.
As drvdata is currently uninitialized it leads to a crash when
registering the DSI DRM encoder right after acquiring
the mode_config.idr_mutex, blocking all subsequent DRM operations.
Fixes the following crash during mediatek-drm probe (tested on Xiaomi
Smart Clock x04g):
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000040
[...]
Modules linked in: mediatek_drm(+) drm_display_helper cec drm_client_lib
drm_dma_helper drm_kms_helper panel_simple
[...]
Call trace:
drm_mode_object_add+0x58/0x98 (P)
__drm_encoder_init+0x48/0x140
drm_encoder_init+0x6c/0xa0
drm_simple_encoder_init+0x20/0x34 [drm_kms_helper]
mtk_dsi_bind+0x34/0x13c [mediatek_drm]
component_bind_all+0x120/0x280
mtk_drm_bind+0x284/0x67c [mediatek_drm]
try_to_bring_up_aggregate_device+0x23c/0x320
__component_add+0xa4/0x198
component_add+0x14/0x20
mtk_dsi_host_attach+0x78/0x100 [mediatek_drm]
mipi_dsi_attach+0x2c/0x50
panel_simple_dsi_probe+0x4c/0x9c [panel_simple]
mipi_dsi_drv_probe+0x1c/0x28
really_probe+0xc0/0x3dc
__driver_probe_device+0x80/0x160
driver_probe_device+0x40/0x120
__device_attach_driver+0xbc/0x17c
bus_for_each_drv+0x88/0xf0
__device_attach+0x9c/0x1cc
device_initial_probe+0x54/0x60
bus_probe_device+0x34/0xa0
device_add+0x5b0/0x800
mipi_dsi_device_register_full+0xdc/0x16c
mipi_dsi_host_register+0xc4/0x17c
mtk_dsi_probe+0x10c/0x260 [mediatek_drm]
platform_probe+0x5c/0xa4
really_probe+0xc0/0x3dc
__driver_probe_device+0x80/0x160
driver_probe_device+0x40/0x120
__driver_attach+0xc8/0x1f8
bus_for_each_dev+0x7c/0xe0
driver_attach+0x24/0x30
bus_add_driver+0x11c/0x240
driver_register+0x68/0x130
__platform_register_drivers+0x64/0x160
mtk_drm_init+0x24/0x1000 [mediatek_drm]
do_one_initcall+0x60/0x1d0
do_init_module+0x54/0x240
load_module+0x1838/0x1dc0
init_module_from_file+0xd8/0xf0
__arm64_sys_finit_module+0x1b4/0x428
invoke_syscall.constprop.0+0x48/0xc8
do_el0_svc+0x3c/0xb8
el0_svc+0x34/0xe8
el0t_64_sync_handler+0xa0/0xe4
el0t_64_sync+0x198/0x19c
Code: 52800022 941004ab 2a0003f3 37f80040 (29005a80)
Fixes: e4732b590a ("drm/mediatek: dsi: Register DSI host after acquiring clocks and PHY")
Signed-off-by: Luca Leonardo Scorcia <l.scorcia@gmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20260225094047.76780-1-l.scorcia@gmail.com/
Signed-off-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it.
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
slot_free() basically completely resets the slots by clearing all of
its flags and attributes. While zram_writeback_complete() restores
some of flags back (those that are necessary for async read
decompression) we still lose a lot of slot's metadata. For example,
slot's ac-time, or ZRAM_INCOMPRESSIBLE.
More importantly, restoring flags/attrs requires extra attention as
some of the flags are directly affecting zram device stats. And the
original code did not pay that attention. Namely ZRAM_HUGE slots
handling in zram_writeback_complete(). The call to slot_free() would
decrement ->huge_pages, however when zram_writeback_complete() restored
the slot's ZRAM_HUGE flag, it would not get reflected in an incremented
->huge_pages. So when the slot would finally get freed, slot_free()
would decrement ->huge_pages again, leading to underflow.
Fix this by open-coding the required memory free and stats updates in
zram_writeback_complete(), rather than calling the destructive
slot_free(). Since we now preserve the ZRAM_HUGE flag on written-back
slots (for the deferred decompression path), we also update slot_free()
to skip decrementing ->huge_pages if ZRAM_WB is set.
Link: https://lkml.kernel.org/r/20260320023143.2372879-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20260319034912.1894770-1-senozhatsky@chromium.org
Fixes: d38fab605c ("zram: introduce compressed data writeback")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
One major usage of damon_call() is online DAMON parameters update. It is
done by calling damon_commit_ctx() inside the damon_call() callback
function. damon_commit_ctx() can fail for two reasons: 1) invalid
parameters and 2) internal memory allocation failures. In case of
failures, the damon_ctx that attempted to be updated (commit destination)
can be partially updated (or, corrupted from a perspective), and therefore
shouldn't be used anymore. The function only ensures the damon_ctx object
can safely deallocated using damon_destroy_ctx().
The API callers are, however, calling damon_commit_ctx() only after
asserting the parameters are valid, to avoid damon_commit_ctx() fails due
to invalid input parameters. But it can still theoretically fail if the
internal memory allocation fails. In the case, DAMON may run with the
partially updated damon_ctx. This can result in unexpected behaviors
including even NULL pointer dereference in case of damos_commit_dests()
failure [1]. Such allocation failure is arguably too small to fail, so
the real world impact would be rare. But, given the bad consequence, this
needs to be fixed.
Avoid such partially-committed (maybe-corrupted) damon_ctx use by saving
the damon_commit_ctx() failure on the damon_ctx object. For this,
introduce damon_ctx->maybe_corrupted field. damon_commit_ctx() sets it
when it is failed. kdamond_call() checks if the field is set after each
damon_call_control->fn() is executed. If it is set, ignore remaining
callback requests and return. All kdamond_call() callers including
kdamond_fn() also check the maybe_corrupted field right after
kdamond_call() invocations. If the field is set, break the kdamond_fn()
main loop so that DAMON sill doesn't use the context that might be
corrupted.
[sj@kernel.org: let kdamond_call() with cancel regardless of maybe_corrupted]
Link: https://lkml.kernel.org/r/20260320031553.2479-1-sj@kernel.org
Link: https://sashiko.dev/#/patchset/20260319145218.86197-1-sj%40kernel.org
Link: https://lkml.kernel.org/r/20260319145218.86197-1-sj@kernel.org
Link: https://lore.kernel.org/20260319043309.97966-1-sj@kernel.org [1]
Fixes: 3301f1861d ("mm/damon/sysfs: handle commit command using damon_call()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 542eda1a83 ("mm/rmap: improve anon_vma_clone(),
unlink_anon_vmas() comments, add asserts") alters the way errors are
handled, but overlooked one important aspect of clean up.
When a VMA encounters an error state in anon_vma_clone() (that is, on
attempted allocation of anon_vma_chain objects), it cleans up partially
established state in cleanup_partial_anon_vmas(), before returning an
error.
However, this occurs prior to anon_vma->num_active_vmas being incremented,
and it also fails to clear the VMA's vma->anon_vma field, which remains in
place.
This is immediately an inconsistent state, because
anon_vma->num_active_vmas is supposed to track the number of VMAs whose
vma->anon_vma field references that anon_vma, and now that count is
off-by-negative-1 for each VMA for which this error state has occurred.
When VMAs are unlinked from this anon_vma, unlink_anon_vmas() will
eventually underflow anon_vma->num_active_vmas, which will trigger a
warning.
This will always eventually happen, as we unlink anon_vma's at process
teardown.
It could also cause maybe_reuse_anon_vma() to incorrectly permit the reuse
of an anon_vma which has active VMAs attached, which will lead to a
persistently invalid state.
The solution is to clear the VMA's anon_vma field when we clean up partial
state, as the fact we are doing so indicates clearly that the VMA is not
correctly integrated into the anon_vma tree and thus this field is
invalid.
Link: https://lkml.kernel.org/r/20260318122632.63404-1-ljs@kernel.org
Fixes: 542eda1a83 ("mm/rmap: improve anon_vma_clone(), unlink_anon_vmas() comments, add asserts")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-mm/20260302151547.2389070-1-sashal@kernel.org/
Reported-by: Jiakai Xu <jiakaipeanut@gmail.com>
Closes: https://lore.kernel.org/linux-mm/CAFb8wJvRhatRD-9DVmr5v5pixTMPEr3UKjYBJjCd09OfH55CKg@mail.gmail.com/
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Tested-by: Jiakai Xu <jiakaipeanut@gmail.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin (Microsoft) <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed
with cpu_to_node(), while node (for prev_cpu) was computed with
scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled,
idle_cpumask(waker_node) is called with a real node ID even though
per-node idle tracking is disabled, resulting in undefined behavior.
Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring
both variables are computed consistently.
Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull driver core fixes from Danilo Krummrich:
- Generalize driver_override in the driver core, providing a common
sysfs implementation and concurrency-safe accessors for bus
implementations
- Do not use driver_override as IRQ name in the hwmon axi-fan driver
- Remove an unnecessary driver_override check in sh platform_early
- Migrate the platform bus to use the generic driver_override
infrastructure, fixing a UAF condition caused by accessing the
driver_override field without proper locking in the platform_match()
callback
* tag 'driver-core-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
driver core: platform: use generic driver_override infrastructure
sh: platform_early: remove pdev->driver_override check
hwmon: axi-fan: don't use driver_override as IRQ name
docs: driver-model: document driver_override
driver core: generalize driver_override in struct device
The modify logic registers temporary ftrace_ops object (tmp_ops) to trigger
the slow path for all direct callers to be able to safely modify attached
addresses.
At the moment we use ops->func_hash for tmp_ops filter, which represents all
the systems attachments. It's faster to use just the passed hash filter, which
contains only the modified sites and is always a subset of the ops->func_hash.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instances buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations uses an RCU protected list traversal.
When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.
The issue is after the flags are called, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that its removed from the list.
But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.
Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.
Also use the flag for checking the state in update_marker_trace() instead
of looking at if the list is empty.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enabled and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next loop should succeed to read user space without being scheduled out.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add multiple test cases for linked register tracking with alu32 ops:
- Add a test that checks sync_linked_regs() regarding reg->id (the linked
target register) for BPF_ADD_CONST32 rather than known_reg->id (the
branch register).
- Add a test case for linked register tracking that exposes the cross-type
sync_linked_regs() bug. One register uses alu32 (w7 += 1, BPF_ADD_CONST32)
and another uses alu64 (r8 += 2, BPF_ADD_CONST64), both linked to the
same base register.
- Add a test case that exercises regsafe() path pruning when two execution
paths reach the same program point with linked registers carrying
different ADD_CONST flags (BPF_ADD_CONST32 from alu32 vs BPF_ADD_CONST64
from alu64). This particular test passes with and without the fix since
the pruning will fail due to different ranges, but it would still be
useful to carry this one as a regression test for the unreachable div
by zero.
With the fix applied all the tests pass:
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_linked_scalars
[...]
./test_progs -t verifier_linked_scalars
#602/1 verifier_linked_scalars/scalars: find linked scalars:OK
#602/2 verifier_linked_scalars/sync_linked_regs_preserves_id:OK
#602/3 verifier_linked_scalars/scalars_neg:OK
#602/4 verifier_linked_scalars/scalars_neg_sub:OK
#602/5 verifier_linked_scalars/scalars_neg_alu32_add:OK
#602/6 verifier_linked_scalars/scalars_neg_alu32_sub:OK
#602/7 verifier_linked_scalars/scalars_pos:OK
#602/8 verifier_linked_scalars/scalars_sub_neg_imm:OK
#602/9 verifier_linked_scalars/scalars_double_add:OK
#602/10 verifier_linked_scalars/scalars_sync_delta_overflow:OK
#602/11 verifier_linked_scalars/scalars_sync_delta_overflow_large_range:OK
#602/12 verifier_linked_scalars/scalars_alu32_big_offset:OK
#602/13 verifier_linked_scalars/scalars_alu32_basic:OK
#602/14 verifier_linked_scalars/scalars_alu32_wrap:OK
#602/15 verifier_linked_scalars/scalars_alu32_zext_linked_reg:OK
#602/16 verifier_linked_scalars/scalars_alu32_alu64_cross_type:OK
#602/17 verifier_linked_scalars/scalars_alu32_alu64_regsafe_pruning:OK
#602/18 verifier_linked_scalars/alu32_negative_offset:OK
#602/19 verifier_linked_scalars/spurious_precision_marks:OK
#602 verifier_linked_scalars:OK
Summary: 1/19 PASSED, 0 SKIPPED, 0 FAILED
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).
Example case with reg:
1. r6 = bpf_get_prandom_u32()
2. r7 = r6 (linked, same id)
3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]
Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.
The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.
Example case with known_reg (exercised also by scalars_alu32_wrap):
1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)
Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.
Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.
Example:
r6 = bpf_get_prandom_u32() // fully unknown
if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX]
r7 = r6
w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32
r8 = r6
r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64
if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]
At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:
r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000
Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.
Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).
Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.
Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Wade says:
====================
bpf: Fix unsound scalar forking for BPF_OR
maybe_fork_scalars() unconditionally sets the pushed path dst register
to 0 for both BPF_AND and BPF_OR. For AND this is correct (0 & K == 0),
but for OR it is wrong (0 | K == K, not 0). This causes the verifier to
track an incorrect value on the pushed path, leading to a verifier/runtime
divergence that allows out-of-bounds map value access.
v4: Use block comment style for multi-line comments in selftests (Amery Hung)
Add Reviewed-by/Acked-by tags
v3: Use single-line comment style in selftests (Alexei Starovoitov)
v2: Use push_stack(env, env->insn_idx, ...) to re-execute the insn
on the pushed path (Eduard Zingerman)
====================
Link: https://patch.msgid.link/20260314021521.128361-1-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add three test cases to verifier_bounds.c to verify that
maybe_fork_scalars() correctly tracks register values for BPF_OR
operations with constant source operands:
1. or_scalar_fork_rejects_oob: After ARSH 63 + OR 8, the pushed
path should have dst = 8. With value_size = 8, accessing
map_value + 8 is out of bounds and must be rejected.
2. and_scalar_fork_still_works: Regression test ensuring AND
forking continues to work. ARSH 63 + AND 4 produces pushed
dst = 0 and current dst = 4, both within value_size = 8.
3. or_scalar_fork_allows_inbounds: After ARSH 63 + OR 4, the
pushed path has dst = 4, which is within value_size = 8
and should be accepted.
These tests exercise the fix in the previous patch, which makes the
pushed path re-execute the ALU instruction so it computes the correct
result for BPF_OR.
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-3-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the
source operand is a constant. When dst has signed range [-1, 0], it
forks the verifier state: the pushed path gets dst = 0, the current
path gets dst = -1.
For BPF_AND this is correct: 0 & K == 0.
For BPF_OR this is wrong: 0 | K == K, not 0.
The pushed path therefore tracks dst as 0 when the runtime value is K,
producing an exploitable verifier/runtime divergence that allows
out-of-bounds map access.
Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to
push_stack(), so the pushed path re-executes the ALU instruction with
dst = 0 and naturally computes the correct result for any opcode.
Fixes: bffacdb80b ("bpf: Recognize special arithmetic shift in the verifier")
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jenny Guanni Qu says:
====================
bpf: Fix abs(INT_MIN) undefined behavior in interpreter sdiv/smod
The BPF interpreter's signed 32-bit division and modulo handlers use
abs() on s32 operands, which is undefined for S32_MIN. This causes
the interpreter to compute wrong results, creating a mismatch with
the verifier's range tracking.
For example, INT_MIN / 2 returns 0x40000000 instead of the correct
0xC0000000. The verifier tracks the correct range, so a crafted BPF
program can exploit the mismatch for out-of-bounds map value access
(confirmed by KASAN).
Patch 1 introduces abs_s32() which handles S32_MIN correctly and
replaces all 8 abs((s32)...) call sites. s32 is the only affected
case -- the s64 handlers do not use abs().
Patch 2 adds selftests covering sdiv32 and smod32 with INT_MIN
dividend to prevent regression.
Changes since v4:
- Renamed __safe_abs32() to abs_s32() and dropped inline keyword
per Alexei Starovoitov's feedback
Changes since v3:
- Fixed stray blank line deletion in the file header
- Improved comment per Yonghong Song's suggestion
- Added JIT vs interpreter context to selftest commit message
Changes since v2:
- Simplified to use -(u32)x per Mykyta Yatsenko's suggestion
Changes since v1:
- Moved helper above kerneldoc comment block to fix build warnings
====================
Link: https://patch.msgid.link/20260311011116.2108005-1-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add tests to verify that signed 32-bit division and modulo operations
produce correct results when the dividend is INT_MIN (0x80000000).
The bug fixed in the previous commit only affects the BPF interpreter
path. When JIT is enabled (the default on most architectures), the
native CPU division instruction produces the correct result and these
tests pass regardless. With bpf_jit_enable=0, the interpreter is used
and without the previous fix, INT_MIN / 2 incorrectly returns
0x40000000 instead of 0xC0000000 due to abs(S32_MIN) undefined
behavior, causing these tests to fail.
Test cases:
- SDIV32 INT_MIN / 2 = -1073741824 (imm and reg divisor)
- SMOD32 INT_MIN % 2 = 0 (positive and negative divisor)
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-3-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BPF interpreter's signed 32-bit division and modulo handlers use
the kernel abs() macro on s32 operands. The abs() macro documentation
(include/linux/math.h) explicitly states the result is undefined when
the input is the type minimum. When DST contains S32_MIN (0x80000000),
abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged
on arm64/x86. This value is then sign-extended to u64 as
0xFFFFFFFF80000000, causing do_div() to compute the wrong result.
The verifier's abstract interpretation (scalar32_min_max_sdiv) computes
the mathematically correct result for range tracking, creating a
verifier/interpreter mismatch that can be exploited for out-of-bounds
map value access.
Introduce abs_s32() which handles S32_MIN correctly by casting to u32
before negating, avoiding signed overflow entirely. Replace all 8
abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers.
s32 is the only affected case -- the s64 division/modulo handlers do
not use abs().
Fixes: ec0e2da95f ("bpf: Support new signed div/mod instructions.")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add test cases to ensure the verifier correctly rejects bpf_throw from
subprogs when RCU, preempt, or IRQ locks are held:
* reject_subprog_rcu_lock_throw: subprog acquires bpf_rcu_read_lock and
then calls bpf_throw
* reject_subprog_throw_preempt_lock: always-throwing subprog called while
caller holds bpf_preempt_disable
* reject_subprog_throw_irq_lock: always-throwing subprog called while
caller holds bpf_local_irq_save
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() to skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.
At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as the result.
Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.
Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. Test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.
The spin_lock case is not affected because they are already checked [1]
at the call site in do_check_insn() before bpf_throw can run.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098
Assisted-by: Claude:claude-opus-4-6
Fixes: f18b03faba ("bpf: Implement BPF exceptions")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
i2c-fixes for v7.0-rc5
pxa: fix broken I2C communication on Armada 3700 with recovery
fsi: fix device_node reference leak in probe
cp2615: fix NULL-deref when serial string is missing
Pull hwmon fixes from Guenter Roeck:
- max6639: Fix pulses-per-revolution implementation
- Several PMBus drivers: Add missing error checks
* tag 'hwmon-for-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (max6639) Fix pulses-per-revolution implementation
hwmon: (pmbus/isl68137) Fix unchecked return value and use sysfs_emit()
hwmon: (pmbus/ina233) Add error check for pmbus_read_word_data() return value
hwmon: (pmbus/mp2869) Check pmbus_read_byte_data() before using its return value
hwmon: (pmbus/mp2975) Add error check for pmbus_read_word_data() return value
hwmon: (pmbus/hac300s) Add error check for pmbus_read_word_data() return value
Pull bootconfig fixes from Masami Hiramatsu:
- Check error code of xbc_init_node() in override value path in
xbc_parse_kv()
- Fix fd leak in load_xbc_file() on fstat failure
* tag 'bootconfig-fixes-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/bootconfig: fix fd leak in load_xbc_file() on fstat failure
lib/bootconfig: check xbc_init_node() return in override path
Pull btrfs fixes from David Sterba:
"Another batch of fixes for problems that have been identified by tools
analyzing code or by fuzzing. Most of them are short, two patches fix
the same thing in many places so the diffs are bigger.
- handle potential NULL pointer errors after attempting to read
extent and checksum trees
- prevent ENOSPC when creating many qgroups by ioctls in the same
transaction
- encoded write ioctl fixes (with 64K page and 4K block size):
- fix unexpected bio length
- do not let compressed bios and pages interfere with page cache
- compression fixes on setups with 64K page and 4K block size: fix
folio length assertions (zstd and lzo)
- remap tree fixes:
- make sure to hold block group reference while moving it
- handle early exit when moving block group to unused list
- handle deleted subvolumes with inconsistent state of deletion
progress"
* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reject root items with drop_progress and zero drop_level
btrfs: check block group before marking it unused in balance_remap_chunks()
btrfs: hold block group reference during entire move_existing_remap()
btrfs: fix an incorrect ASSERT() condition inside lzo_decompress_bio()
btrfs: fix an incorrect ASSERT() condition inside zstd_decompress_bio()
btrfs: do not touch page cache for encoded writes
btrfs: fix a bug that makes encoded write bio larger than expected
btrfs: reserve enough transaction items for qgroup ioctls
btrfs: check for NULL root after calls to btrfs_csum_root()
btrfs: check for NULL root after calls to btrfs_extent_root()
Validate host controlled value `quote_buf->out_len` that determines how
many bytes of the quote are copied out to guest userspace. In TDX
environments with remote attestation, quotes are not considered private,
and can be forwarded to an attestation server.
Catch scenarios where the host specifies a response length larger than
the guest's allocation, or otherwise races modifying the response while
the guest consumes it.
This prevents contents beyond the pages allocated for `quote_buf`
(up to TSM_REPORT_OUTBLOB_MAX) from being read out to guest userspace,
and possibly forwarded in attestation requests.
Recall that some deployments want per-container configs-tsm-report
interfaces, so the leak may cross container protection boundaries, not
just local root.
Fixes: f4738f56d1 ("virt: tdx-guest: Add Quote generation support using TSM_REPORTS")
Cc: stable@vger.kernel.org
Signed-off-by: Zubin Mithra <zsm@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Sabrina Dubroca says:
====================
rtnetlink: add missing attributes in if_nlmsg_size
Once again we have some attributes added by rtnl_fill_ifinfo() that
aren't counted in if_nlmsg_size().
====================
Link: https://patch.msgid.link/cover.1773919462.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_rx_splice() allocates one smc_spd_priv per pipe_buffer and stores
the pointer in pipe_buffer.private. The pipe_buf_operations for these
buffers used .get = generic_pipe_buf_get, which only increments the page
reference count when tee(2) duplicates a pipe buffer. The smc_spd_priv
pointer itself was not handled, so after tee() both the original and the
cloned pipe_buffer share the same smc_spd_priv *.
When both pipes are subsequently released, smc_rx_pipe_buf_release() is
called twice against the same object:
1st call: kfree(priv) sock_put(sk) smc_rx_update_cons() [correct]
2nd call: kfree(priv) sock_put(sk) smc_rx_update_cons() [UAF]
KASAN reports a slab-use-after-free in smc_rx_pipe_buf_release(), which
then escalates to a NULL-pointer dereference and kernel panic via
smc_rx_update_consumer() when it chases the freed priv->smc pointer:
BUG: KASAN: slab-use-after-free in smc_rx_pipe_buf_release+0x78/0x2a0
Read of size 8 at addr ffff888004a45740 by task smc_splice_tee_/74
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_report+0xce/0x650
kasan_report+0xc6/0x100
smc_rx_pipe_buf_release+0x78/0x2a0
free_pipe_info+0xd4/0x130
pipe_release+0x142/0x160
__fput+0x1c6/0x490
__x64_sys_close+0x4f/0x90
do_syscall_64+0xa6/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:smc_rx_update_consumer+0x8d/0x350
Call Trace:
<TASK>
smc_rx_pipe_buf_release+0x121/0x2a0
free_pipe_info+0xd4/0x130
pipe_release+0x142/0x160
__fput+0x1c6/0x490
__x64_sys_close+0x4f/0x90
do_syscall_64+0xa6/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Kernel panic - not syncing: Fatal exception
Beyond the memory-safety problem, duplicating an SMC splice buffer is
semantically questionable: smc_rx_update_cons() would advance the
consumer cursor twice for the same data, corrupting receive-window
accounting. A refcount on smc_spd_priv could fix the double-free, but
the cursor-accounting issue would still need to be addressed separately.
The .get callback is invoked by both tee(2) and splice_pipe_to_pipe()
for partial transfers; both will now return -EFAULT. Users who need
to duplicate SMC socket data must use a copy-based read path.
Fixes: 9014db202c ("smc: add support for splice()")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Link: https://patch.msgid.link/20260318064847.23341-1-tpluszz77@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ovs_netdev_tunnel_destroy() may run after NETDEV_UNREGISTER already
detached the device. Dropping the netdev reference in destroy can race
with concurrent readers that still observe vport->dev.
Do not release vport->dev in ovs_netdev_tunnel_destroy(). Instead, let
vport_netdev_free() drop the reference from the RCU callback, matching
the non-tunnel destroy path and avoiding additional synchronization
under RTNL.
Fixes: a9020fde67 ("openvswitch: Move tunnel destroy function to oppenvswitch module.")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Tested-by: Ao Zhou <n05ec@lzu.edu.cn>
Co-developed-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yang Yang <n05ec@lzu.edu.cn>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260319074241.3405262-1-n05ec@lzu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kevin Hao says:
====================
net: macb: Fix two lock warnings when WOL is used
This patch series addresses two lock warnings that occur when using WOL as a
wakeup source on my AMD ZynqMP board.
====================
Link: https://patch.msgid.link/20260318-macb-irq-v2-0-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Access to net_device::ip_ptr and its associated members must be
protected by an RCU lock. Since we are modifying this piece of code,
let's also move it to execute only when WAKE_ARP is enabled.
To minimize the duration of the RCU lock, a local variable is used to
temporarily store the IP address. This change resolves the following
RCU check warning:
WARNING: suspicious RCU usage
7.0.0-rc3-next-20260310-yocto-standard+ #122 Not tainted
-----------------------------
drivers/net/ethernet/cadence/macb_main.c:5944 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
5 locks held by rtcwake/518:
#0: ffff000803ab1408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008090bf088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
stack backtrace:
CPU: 3 UID: 0 PID: 518 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #122 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
lockdep_rcu_suspicious+0x134/0x1d8
macb_suspend+0xd8/0x4c0
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 0cb8de39a7 ("net: macb: Add ARP support to WOL")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://patch.msgid.link/20260318-macb-irq-v2-2-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The devm_free_irq() and devm_request_irq() functions should not be
executed in an atomic context.
During device suspend, all userspace processes and most kernel threads
are frozen. Additionally, we flush all tx/rx status, disable all macb
interrupts, and halt rx operations. Therefore, it is safe to split the
region protected by bp->lock into two independent sections, allowing
devm_free_irq() and devm_request_irq() to run in a non-atomic context.
This modification resolves the following lockdep warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 501, name: rtcwake
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
7 locks held by rtcwake/501:
#0: ffff0008038c3408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008049a5e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
#5: ffff800081d00458 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
#6: ffff0008031fb9e0 (&bp->lock){-.-.}-{3:3}, at: macb_suspend+0x144/0x558
irq event stamp: 8682
hardirqs last enabled at (8681): [<ffff8000813c7d7c>] _raw_spin_unlock_irqrestore+0x44/0x88
hardirqs last disabled at (8682): [<ffff8000813c7b58>] _raw_spin_lock_irqsave+0x38/0x98
softirqs last enabled at (7322): [<ffff8000800f1b4c>] handle_softirqs+0x52c/0x588
softirqs last disabled at (7317): [<ffff800080010310>] __do_softirq+0x20/0x2c
CPU: 1 UID: 0 PID: 501 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #125 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
__might_resched+0x200/0x218
__might_sleep+0x38/0x98
__mutex_lock_common+0x7c/0x1378
mutex_lock_nested+0x38/0x50
free_irq+0x68/0x2b0
devm_irq_release+0x24/0x38
devres_release+0x40/0x80
devm_free_irq+0x48/0x88
macb_suspend+0x298/0x558
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 558e35ccfe ("net: macb: WoL support for GEM type of Ethernet controller")
Cc: stable@vger.kernel.org
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Link: https://patch.msgid.link/20260318-macb-irq-v2-1-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull drm fixes from Dave Airlie:
"Regular weekly pull request, from sunny San Diego. Usual suspects in
xe/i915/amdgpu with small fixes all over, then some minor fixes across
a few other drivers. It's probably a bit on the heavy side, but most
of the fix seem well contained,
core:
- drm_dev_unplug UAF fix
pagemap:
- lock handling fix
xe:
- A number of teardown fixes
- Skip over non-leaf PTE for PRL generation
- Fix an uninitialized variable
- Fix a missing runtime PM reference
i915/display:
- Fix#15771: Screen corruption and stuttering on P14s w/ 3K display
- Fix for PSR entry setup frames count on rejected commit
- Fix OOPS if firmware is not loaded and suspend is attempted
- Fix unlikely NULL deref due to DC6 on probe
amdgpu:
- Fix gamma 2.2 colorop TFs
- BO list fix
- LTO fix
- DC FP fix
- DisplayID handling fix
- DCN 2.01 fix
- MMHUB boundary fixes
- ISP fix
- TLB fence fix
- Hainan pm fix
radeon:
- Hainan pm fix
vmwgfx:
- memory leak fix
- doc warning fix
imagination:
- deadlock fix
- interrupt handling fixes
dw-hdmi-qp:
- multi channel audio fix"
* tag 'drm-fixes-2026-03-21' of https://gitlab.freedesktop.org/drm/kernel: (40 commits)
drm/xe: Fix missing runtime PM reference in ccs_mode_store
drm/xe: Open-code GGTT MMIO access protection
drm/xe/lrc: Fix uninitialized new_ts when capturing context timestamp
drm/xe/oa: Allow reading after disabling OA stream
drm/xe: Skip over non leaf pte for PRL generation
drm/xe/guc: Ensure CT state transitions via STOP before DISABLED
drm/xe: Trigger queue cleanup if not in wedged mode 2
drm/xe: Forcefully tear down exec queues in GuC submit fini
drm/xe: Always kill exec queues in xe_guc_submit_pause_abort
drm/xe/guc: Fail immediately on GuC load error
drm/i915/gt: Check set_default_submission() before deferencing
drm/radeon: apply state adjust rules to some additional HAINAN vairants
drm/amdgpu: apply state adjust rules to some additional HAINAN vairants
drm/amdgpu: rework how we handle TLB fences
drm/bridge: dw-hdmi-qp: fix multi-channel audio output
drm: Fix use-after-free on framebuffers and property blobs when calling drm_dev_unplug
drm/amdgpu: Fix ISP segfault issue in kernel v7.0
drm/amdgpu/gmc9.0: add bounds checking for cid
drm/amdgpu/mmhub4.2.0: add bounds checking for cid
drm/amdgpu/mmhub4.1.0: add bounds checking for cid
...
The valid range for the pulses-per-revolution devicetree property is
1..4. The current code checks for a range of 1..5. Fix it.
Declare the variable used to retrieve pulses per revolution from
devicetree as u32 (unsigned) to match the of_property_read_u32() API.
The current code uses a postfix decrement when writing the pulses per
resolution into the chip. This has no effect since the value is evaluated
before it is decremented. Fix it by decrementing before evaluating the
value.
Fixes: 7506ebcd66 ("hwmon: (max6639) : Configure based on DT property")
Cc: Naresh Solanki <naresh.solanki@9elements.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Pull execve fixes from Kees Cook:
- binfmt_elf_fdpic: fix AUXV size calculation (Andrei Vagin)
- fs/tests: exec: Remove bad test vector
* tag 'execve-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
fs/tests: exec: Remove bad test vector
binfmt_elf_fdpic: fix AUXV size calculation for ELF_HWCAP3 and ELF_HWCAP4
Pull tty/serial fixes from Greg KH:
"Here are some small tty/vt and serial driver fixes for 7.0-rc5.
Included in here are:
- 8250 driver fixes for reported problems
- serial core lockup fix
- uartlite driver bugfix
- vt save/restore bugfix
All of these have been in linux-next for over a week with no reported
problems"
* tag 'tty-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
vt: save/restore unicode screen buffer for alternate screen
serial: 8250_dw: Ensure BUSY is deasserted
serial: 8250: Add late synchronize_irq() to shutdown to handle DW UART BUSY
serial: 8250_dw: Rework IIR_NO_INT handling to stop interrupt storm
serial: 8250_dw: Rework dw8250_handle_irq() locking and IIR handling
serial: 8250: Add serial8250_handle_irq_locked()
serial: 8250_dw: Avoid unnecessary LCR writes
serial: 8250: Protect LCR write in shutdown
serial: 8250_pci: add support for the AX99100
serial: core: fix infinite loop in handle_tx() for PORT_UNKNOWN
serial: uartlite: fix PM runtime usage count underflow on probe
serial: 8250: always disable IRQ during THRE test
serial: 8250: Fix TX deadlock when using DMA
Commit 150a04d817 ("compiler_types.h: Attributes: Add __counted_by_ptr
macro") used Clang 22.0.0 as a minimum supported version for
__counted_by_ptr, which made sense while 22.0.0 was the version of
LLVM's main branch to allow developers to easily test and develop uses
of __counted_by_ptr in their code. However, __counted_by_ptr requires a
change [1] merged towards the end of the 22 development cycle to avoid
errors when applied to void pointers.
In file included from fs/xfs/xfs_attr_inactive.c:18:
fs/xfs/libxfs/xfs_attr.h:59:2: error: 'counted_by' cannot be applied to a pointer with pointee of unknown size because 'void' is an incomplete type
59 | void *buffer __counted_by_ptr(bufsize);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is disruptive for deployed prerelease clang-22 builds (such as
Android LLVM) or when bisecting between llvmorg-21-init and the fix.
Require a released version of clang-22 (i.e., 21.1.0 or newer) to
enabled __counted_by_ptr to ensure all fixes needed for proper support
are present.
Fixes: 150a04d817 ("compiler_types.h: Attributes: Add __counted_by_ptr macro")
Link: f29955a594 [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260318-counted_by_ptr-release-clang-22-v1-1-e017da246df0@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
When a socket is deconfigured, it's mapped to SOCK_EMPTY (0xffff). This causes
a panic while allocating UV hub info structures.
Fix this by using NUMA_NO_NODE, allowing UV hub info structures to be
allocated on valid nodes.
Fixes: 8a50c58519 ("x86/platform/uv: UV support for sub-NUMA clustering")
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/ab2BmGL0ehVkkjKk@hpe.com
Pull io_uring fixes from Jens Axboe:
- A bit of a work-around for AF_UNIX recv multishot, as the in-kernel
implementation doesn't properly signal EOF. We'll likely rework this
one going forward, but the fix is sufficient for now
- Two fixes for incrementally consumed buffers, for non-pollable files
and for 0 byte reads
* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/kbuf: propagate BUF_MORE through early buffer commit path
io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
io_uring/poll: fix multishot recv missing EOF on wakeup race
Pull spi fixes from Mark Brown:
"There's a couple of core fixes here from Johan, fixing a race
condition and an error handling path, plus a bunch of driver specific
fixups.
The Qualcomm issues could be nasty if you ran into them, especially
the DMA ordering one"
* tag 'spi-fix-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: geni-qcom: Check DMA interrupts early in ISR
spi: fix statistics allocation
spi: fix use-after-free on controller registration failure
spi: geni-qcom: Fix CPHA and CPOL mode change detection
spi: axiado: Fix double-free in ax_spi_probe()
spi: amlogic-spisg: Fix memory leak in aml_spisg_probe()
spi: amlogic: spifc-a4: Remove redundant clock cleanup
Pull regulator fix from Mark Brown:
"Just one fix here from Hugo Villeneuve, the documentation for some of
the regulator DT properties had been cut'n'pasted so that if anyone
actually read it they'd be informed that those properties had
completely incorrect meanings"
* tag 'regulator-fix-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: dt-bindings: fix typos in regulator-uv-* descriptions
Pull pmdomain fixes from Ulf Hansson:
- bcm: increase ASB control timeout for bcm2835
- mediatek: fix power domain count
* tag 'pmdomain-v7.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: bcm: bcm2835-power: Increase ASB control timeout
pmdomain: mediatek: Fix power domain count
Pull ata fixes from Niklas Cassel:
- ADATA SU680 SSDs are causing command timeouts when LPM is enabled.
Enable the ATA_QUIRK_NOLPM quirk to prevent LPM from being enabled
on these devices (Damien)
- When receiving a REPORT SUPPORTED OPERATION CODES command with an
invalid REPORTING OPTIONS format, sense data should have the field
pointer set to byte 2 (the location of the REPORTING OPTIONS field)
instead of incorrectly pointing to byte 1 (Damien)
* tag 'ata-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-scsi: report correct sense field pointer in ata_scsiop_maint_in()
ata: libata-core: disable LPM on ADATA SU680 SSD
Pull MTD fixes from Miquel Raynal:
- In SPI NOR, there was an issue with the RDCR capability, leading to
several platforms no longer capable of using it for wrong reasons
(the follow-up commit renames the helper to avoid future confusion)
- NAND controller drivers needed to be improved to fix some timings, a
locking schenario and avoid certain operations during panic writes
- The Spear600 DT binding conversion was done partially, leading to
several warnings which have individually been fixed
- Tudor gets replaced by Takahiro for the SPI NOR maintainance
- Plus two more misc fixes
* tag 'mtd/fixes-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: pl353: make sure optimal timings are applied
mtd: spi-nor: Rename spi_nor_spimem_check_op()
mtd: spi-nor: Fix RDCR controller capability core check
mtd: rawnand: brcmnand: skip DMA during panic write
mtd: rawnand: serialize lock/unlock against other NAND operations
dt-bindings: mtd: st,spear600-smi: Fix example
dt-bindings: mtd: st,spear600-smi: #address/size-cells is mandatory
dt-bindings: mtd: st,spear600-smi: Fix description
mtd: rawnand: cadence: Fix error check for dma_alloc_coherent() in cadence_nand_init()
mtd: Avoid boot crash in RedBoot partition table parser
MAINTAINERS: add Takahiro Kuwano as SPI NOR reviewer
MAINTAINERS: remove Tudor Ambarus as SPI NOR maintainer
Pull iommu fixes from Joerg Roedel:
"Intel VT-d:
- Abort all pending requests on dev_tlb_inv timeout to avoid
hardlockup
- Limit IOPF handling to PRI-capable device to avoid SVA attach
failure
AMD-Vi:
- Make sure identity domain is not used when SNP is active
Core fixes:
- Handle mapping IOVA 0x0 correctly
- Fix crash in SVA code
- Kernel-doc fix in IO-PGTable code"
* tag 'iommu-fixes-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/amd: Block identity domain when SNP enabled
iommu/sva: Fix crash in iommu_sva_unbind_device()
iommu/io-pgtable: fix all kernel-doc warnings in io-pgtable.h
iommu: Fix mapping check for 0x0 to avoid re-mapping it
iommu/vt-d: Only handle IOPF for SVA when PRI is supported
iommu/vt-d: Fix intel iommu iotlb sync hardlockup and retry
Pull arm64 fixes from Will Deacon:
"There's a small crop of fixes for the MPAM resctrl driver, a fix for
SCS/PAC patching with the AMDGPU driver and a page-table fix for
realms running with 52-bit physical addresses:
- Fix DWARF parsing for SCS/PAC patching to work with very large
modules (such as the amdgpu driver)
- Fixes to the mpam resctrl driver
- Fix broken handling of 52-bit physical addresses when sharing
memory from within a realm"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: realm: Fix PTE_NS_SHARED for 52bit PA support
arm_mpam: Force __iomem casts
arm_mpam: Disable preemption when making accesses to fake MSC in kunit test
arm_mpam: Fix null pointer dereference when restoring bandwidth counters
arm64/scs: Fix handling of advance_loc4
Pull Hyper-V fixes from Wei Liu:
- Fix ARM64 MSHV support (Anirudh Rayabharam)
- Fix MSHV driver memory handling issues (Stanislav Kinsburskii)
- Update maintainers for Hyper-V DRM driver (Saurabh Sengar)
- Misc clean up in MSHV crashdump code (Ard Biesheuvel, Uros Bizjak)
- Minor improvements to MSHV code (Mukesh R, Wei Liu)
- Revert not yet released MSHV scrub partition hypercall (Wei Liu)
* tag 'hyperv-fixes-signed-20260319' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: Fix error handling in mshv_region_pin
MAINTAINERS: Update maintainers for Hyper-V DRM driver
mshv: Fix use-after-free in mshv_map_user_memory error path
mshv: pass struct mshv_user_mem_region by reference
x86/hyperv: Use any general-purpose register when saving %cr2 and %cr8
x86/hyperv: Use current_stack_pointer to avoid asm() in hv_hvcrash_ctxt_save()
x86/hyperv: Save segment registers directly to memory in hv_hvcrash_ctxt_save()
x86/hyperv: Use __naked attribute to fix stackless C function
Revert "mshv: expose the scrub partition hypercall"
mshv: add arm64 support for doorbell & intercept SINTs
mshv: refactor synic init and cleanup
x86/hyperv: print out reserved vectors in hexadecimal
Pull smb client fixes from Steve French:
- Fix reporting of i_blocks
- Fix Kerberos mounts with different usernames to same server
- Trivial comment cleanup
* tag 'v7.0-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix generic/694 due to wrong ->i_blocks
cifs: smb1: fix comment typo
smb: client: fix krb5 mount with username option
Pull smb server fixes from Steve French:
- Three use after free fixes (in close, in compounded ops, and in tree
disconnect)
- Multichannel fix
- return proper volume identifier (superblock uuid if available) in
FS_OBJECT_ID queries
* tag 'v7.0-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix use-after-free in durable v2 replay of active file handles
ksmbd: fix use-after-free of share_conf in compound request
ksmbd: use volume UUID in FS_OBJECT_ID_INFORMATION
ksmbd: unset conn->binding on failed binding request
ksmbd: fix share_conf UAF in tree_conn disconnect
ranges_to_free array should have enough room to store the entire EFI
memmap plus an extra element for NULL entry.
The calculation of this array size wrongly adds 1 to the overall size
instead of adding 1 to the number of elements.
Add parentheses to properly size the array.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: a4b0bf6a40 ("x86/efi: defer freeing of boot services memory")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot
guarantee data persistence on sync (eg fuse). For superblocks with this
flag set, sync kicks off writeback of dirty inodes but does not wait
for the flusher threads to complete the writeback.
This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in
commit f9a49aa302 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings
in wait_sb_inodes()"). The flag belongs at the superblock level because
data integrity is a filesystem-wide property, not a per-inode one.
Having this flag at the superblock level also allows us to skip having
to iterate every dirty inode in wait_sb_inodes() only to skip each inode
individually.
Prior to this commit, mappings with no data integrity guarantees skipped
waiting on writeback completion but still waited on the flusher threads
to finish initiating the writeback. Waiting on the flusher threads is
unnecessary. This commit kicks off writeback but does not wait on the
flusher threads. This change properly addresses a recent report [1] for
a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting
on the flusher threads to finish:
Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
<TASK>
__schedule+0x457/0x1720
schedule+0x27/0xd0
wb_wait_for_completion+0x97/0xe0
sync_inodes_sb+0xf8/0x2e0
__iterate_supers+0xdc/0x160
ksys_sync+0x43/0xb0
pm_fs_sync_work_fn+0x17/0xa0
process_one_work+0x193/0x350
worker_thread+0x1a1/0x310
kthread+0xfc/0x240
ret_from_fork+0x243/0x280
ret_from_fork_asm+0x1a/0x30
</TASK>
On fuse this is problematic because there are paths that may cause the
flusher thread to block (eg if systemd freezes the user session cgroups
first, which freezes the fuse daemon, before invoking the kernel
suspend. The kernel suspend triggers ->write_node() which on fuse issues
a synchronous setattr request, which cannot be processed since the
daemon is frozen. Or if the daemon is buggy and cannot properly complete
writeback, initiating writeback on a dirty folio already under writeback
leads to writeback_get_folio() -> folio_prepare_writeback() ->
unconditional wait on writeback to finish, which will cause a hang).
This commit restores fuse to its prior behavior before tmp folios were
removed, where sync was essentially a no-op.
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1
Fixes: 0c58a97f91 ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John <therealgraysky@proton.me>
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
sof_parse_token_sets() accepts array->size values that can be invalid
for a vendor tuple array header. In particular, a zero size does not
advance the parser state and can lead to non-progress parsing on
malformed topology data.
Validate array->size against the minimum header size and reject values
smaller than sizeof(*array) before parsing. This preserves behavior for
valid topologies and hardens malformed-input handling.
Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Acked-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260319-sof-topology-array-size-fix-v1-1-f9191b16b1b7@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
When running in an unprivileged domU under Xen, the privcmd driver
is restricted to allow only hypercalls against a target domain, for
which the current domU is acting as a device model.
Add a boot parameter "unrestricted" to allow all hypercalls (the
hypervisor will still refuse destructive hypercalls affecting other
guests).
Make this new parameter effective only in case the domU wasn't started
using secure boot, as otherwise hypercalls targeting the domU itself
might result in violating the secure boot functionality.
This is achieved by adding another lockdown reason, which can be
tested to not being set when applying the "unrestricted" option.
This is part of XSA-482
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- new patch
HMM is fundamentally about allowing a sophisticated device to perform DMA
directly to a process’s memory while the CPU accesses that same memory at
the same time. It is similar to SVA but does not rely on IOMMU support.
Because the entire model depends on concurrent access to shared memory, it
fails as a uAPI if SWIOTLB substitutes the memory or if the CPU caches are
not coherent with DMA.
Until now, there has been no reliable way to report this, and various
approximations have been used:
int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
size_t nr_entries, size_t dma_entry_size)
{
<...>
/*
* The HMM API violates our normal DMA buffer ownership rules and can't
* transfer buffer ownership. The dma_addressing_limited() check is a
* best approximation to ensure no swiotlb buffering happens.
*/
dma_need_sync = !dev->dma_skip_sync;
if (dma_need_sync || dma_addressing_limited(dev))
return -EOPNOTSUPP;
So let's mark mapped buffers with DMA_ATTR_REQUIRE_COHERENT attribute
to prevent silent data corruption if someone tries to use hmm in a system
with swiotlb or incoherent DMA
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-8-1dde90a7f08b@nvidia.com
The Xen privcmd driver allows to issue arbitrary hypercalls from
user space processes. This is normally no problem, as access is
usually limited to root and the hypervisor will deny any hypercalls
affecting other domains.
In case the guest is booted using secure boot, however, the privcmd
driver would be enabling a root user process to modify e.g. kernel
memory contents, thus breaking the secure boot feature.
The only known case where an unprivileged domU is really needing to
use the privcmd driver is the case when it is acting as the device
model for another guest. In this case all hypercalls issued via the
privcmd driver will target that other guest.
Fortunately the privcmd driver can already be locked down to allow
only hypercalls targeting a specific domain, but this mode can be
activated from user land only today.
The target domain can be obtained from Xenstore, so when not running
in dom0 restrict the privcmd driver to that target domain from the
beginning, resolving the potential problem of breaking secure boot.
This is XSA-482
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Fixes: 1c5de1939c ("xen: add privcmd driver")
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- defer reading from Xenstore if Xenstore isn't ready yet (Jan Beulich)
- wait in open() if target domain isn't known yet
- issue message in case no target domain found (Jan Beulich)
Commit 4ab7bb9763 ("ata: libata-scsi: Refactor ata_scsiop_maint_in()")
modified ata_scsiop_maint_in() to directly call
ata_scsi_set_invalid_field() to set the field pointer of the sense data
of a failed MAINTENANCE IN command. However, in the case of an invalid
command format, the sense data field incorrectly indicates byte 1 of
the CDB. Fix this to indicate byte 2 of the command.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 4ab7bb9763 ("ata: libata-scsi: Refactor ata_scsiop_maint_in()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
After commit 37c4e72b06 ("scsi: Fix sas_user_scan() to handle wildcard
and multi-channel scans"), if the device supports multiple channels (0 to
shost->max_channel), user_scan() invokes updated sas_user_scan() to perform
the scan behavior for a specific transfer. However, when the user
specifies shost->max_channel, it will return -EINVAL, which is not
expected.
Fix and support specifying the scan shost->max_channel for scanning.
Fixes: 37c4e72b06 ("scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans")
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260317063147.2182562-1-liyihang9@huawei.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
tcm_loop_target_reset() violates the SCSI EH contract: it returns SUCCESS
without draining any in-flight commands. The SCSI EH documentation
(scsi_eh.rst) requires that when a reset handler returns SUCCESS the driver
has made lower layers "forget about timed out scmds" and is ready for new
commands. Every other SCSI LLD (virtio_scsi, mpt3sas, ipr, scsi_debug,
mpi3mr) enforces this by draining or completing outstanding commands before
returning SUCCESS.
Because tcm_loop_target_reset() doesn't drain, the SCSI EH reuses in-flight
scsi_cmnd structures for recovery commands (e.g. TUR) while the target core
still has async completion work queued for the old se_cmd. The memset in
queuecommand zeroes se_lun and lun_ref_active, causing
transport_lun_remove_cmd() to skip its percpu_ref_put(). The leaked LUN
reference prevents transport_clear_lun_ref() from completing, hanging
configfs LUN unlink forever in D-state:
INFO: task rm:264 blocked for more than 122 seconds.
rm D 0 264 258 0x00004000
Call Trace:
__schedule+0x3d0/0x8e0
schedule+0x36/0xf0
transport_clear_lun_ref+0x78/0x90 [target_core_mod]
core_tpg_remove_lun+0x28/0xb0 [target_core_mod]
target_fabric_port_unlink+0x50/0x60 [target_core_mod]
configfs_unlink+0x156/0x1f0 [configfs]
vfs_unlink+0x109/0x290
do_unlinkat+0x1d5/0x2d0
Fix this by making tcm_loop_target_reset() actually drain commands:
1. Issue TMR_LUN_RESET via tcm_loop_issue_tmr() to drain all commands that
the target core knows about (those not yet CMD_T_COMPLETE).
2. Use blk_mq_tagset_busy_iter() to iterate all started requests and
flush_work() on each se_cmd — this drains any deferred completion work
for commands that already had CMD_T_COMPLETE set before the TMR (which
the TMR skips via __target_check_io_state()). This is the same pattern
used by mpi3mr, scsi_debug, and libsas to drain outstanding commands
during reset.
Fixes: e0eb5d38b7 ("scsi: target: tcm_loop: Use block cmd allocator for se_cmds")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://patch.msgid.link/27011aa34c8f6b1b94d2e3cf5655b6d037f53428.1773706803.git.josef@toxicpanda.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A malicious or compromised VIO server can return a num_written value in the
discover targets MAD response that exceeds max_targets. This value is
stored directly in vhost->num_targets without validation, and is then used
as the loop bound in ibmvfc_alloc_targets() to index into disc_buf[], which
is only allocated for max_targets entries. Indices at or beyond max_targets
access kernel memory outside the DMA-coherent allocation. The
out-of-bounds data is subsequently embedded in Implicit Logout and PLOGI
MADs that are sent back to the VIO server, leaking kernel memory.
Fix by clamping num_written to max_targets before storing it.
Fixes: 072b91f9c6 ("[SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tyllis Xu <LivelyCarpet87@gmail.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Acked-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Link: https://patch.msgid.link/20260314170151.548614-1-LivelyCarpet87@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
nci_close_device() flushes rx_wq and tx_wq while holding req_lock.
This causes a circular locking dependency because nci_rx_work()
running on rx_wq can end up taking req_lock too:
nci_rx_work -> nci_rx_data_packet -> nci_data_exchange_complete
-> __sk_destruct -> rawsock_destruct -> nfc_deactivate_target
-> nci_deactivate_target -> nci_request -> mutex_lock(&ndev->req_lock)
Move the flush of rx_wq after req_lock has been released.
This should safe (I think) because NCI_UP has already been cleared
and the transport is closed, so the work will see it and return
-ENETDOWN.
NIPA has been hitting this running the nci selftest with a debug
kernel on roughly 4% of the runs.
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reviewed-by: Ian Ray <ian.ray@gehealthcare.com>
Link: https://patch.msgid.link/20260317193334.988609-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull parisc fix from Helge Deller:
"Fix for the cacheflush() syscall which had D/I caches mixed up"
* tag 'parisc-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Flush correct cache in cacheflush() syscall
Pull pci fixes from Bjorn Helgaas:
- Create pwrctrl devices only for DT nodes below a PCI controller that
describe PCI devices and are related to a power supply; this prevents
waiting indefinitely for pwrctrl drivers that will never probe
(Manivannan Sadhasivam)
- Restore endpoint BAR mapping on subrange setup failure to make
selftest reliable (Koichiro Den)
* tag 'pci-v7.0-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: endpoint: pci-epf-test: Roll back BAR mapping when subrange setup fails
PCI/pwrctrl: Create pwrctrl devices only for PCI device nodes
PCI/pwrctrl: Ensure that remote endpoint node parent has supply requirement
The I2C communication is completely broken on the Armada 3700 platform
since commit 0b01392c18 ("i2c: pxa: move to generic GPIO recovery").
For example, on the Methode uDPU board, probing of the two onboard
temperature sensors fails ...
[ 7.271713] i2c i2c-0: using pinctrl states for GPIO recovery
[ 7.277503] i2c i2c-0: PXA I2C adapter
[ 7.282199] i2c i2c-1: using pinctrl states for GPIO recovery
[ 7.288241] i2c i2c-1: PXA I2C adapter
[ 7.292947] sfp sfp-eth1: Host maximum power 3.0W
[ 7.299614] sfp sfp-eth0: Host maximum power 3.0W
[ 7.308178] lm75 1-0048: supply vs not found, using dummy regulator
[ 32.489631] lm75 1-0048: probe with driver lm75 failed with error -121
[ 32.496833] lm75 1-0049: supply vs not found, using dummy regulator
[ 82.890614] lm75 1-0049: probe with driver lm75 failed with error -121
... and accessing the plugged-in SFP modules also does not work:
[ 511.298537] sfp sfp-eth1: please wait, module slow to respond
[ 536.488530] sfp sfp-eth0: please wait, module slow to respond
...
[ 1065.688536] sfp sfp-eth1: failed to read EEPROM: -EREMOTEIO
[ 1090.888532] sfp sfp-eth0: failed to read EEPROM: -EREMOTEIO
After a discussion [1], there was an attempt to fix the problem by
reverting the offending change by commit 7b211c7671 ("Revert "i2c:
pxa: move to generic GPIO recovery""), but that only helped to fix
the issue in the 6.1.y stable tree. The reason behind the partial succes
is that there was another change in commit 20cb3fce4d ("i2c: Set i2c
pinctrl recovery info from it's device pinctrl") in the 6.3-rc1 cycle
which broke things further.
The cause of the problem is the same in case of both offending commits
mentioned above. Namely, the I2C core code changes the pinctrl state to
GPIO while running the recovery initialization code. Although the PXA
specific initialization also does this, but the key difference is that
it happens before the controller is getting enabled in i2c_pxa_reset(),
whereas in the case of the generic initialization it happens after that.
Change the code to reset the controller only before the first transfer
instead of before registering the controller. This ensures that the
controller is not enabled at the time when the generic recovery code
performs the pinctrl state changes, thus avoids the problem described
above.
As the result this change restores the original behaviour, which in
turn makes the I2C communication to work again as it can be seen from
the following log:
[ 7.363250] i2c i2c-0: using pinctrl states for GPIO recovery
[ 7.369041] i2c i2c-0: PXA I2C adapter
[ 7.373673] i2c i2c-1: using pinctrl states for GPIO recovery
[ 7.379742] i2c i2c-1: PXA I2C adapter
[ 7.384506] sfp sfp-eth1: Host maximum power 3.0W
[ 7.393013] sfp sfp-eth0: Host maximum power 3.0W
[ 7.399266] lm75 1-0048: supply vs not found, using dummy regulator
[ 7.407257] hwmon hwmon0: temp1_input not attached to any thermal zone
[ 7.413863] lm75 1-0048: hwmon0: sensor 'tmp75c'
[ 7.418746] lm75 1-0049: supply vs not found, using dummy regulator
[ 7.426371] hwmon hwmon1: temp1_input not attached to any thermal zone
[ 7.432972] lm75 1-0049: hwmon1: sensor 'tmp75c'
[ 7.755092] sfp sfp-eth1: module MENTECHOPTO POS22-LDCC-KR rev 1.0 sn MNC208U90009 dc 200828
[ 7.764997] mvneta d0040000.ethernet eth1: unsupported SFP module: no common interface modes
[ 7.785362] sfp sfp-eth0: module Mikrotik S-RJ01 rev 1.0 sn 61B103C55C58 dc 201022
[ 7.803426] hwmon hwmon2: temp1_input not attached to any thermal zone
Link: https://lore.kernel.org/r/20230926160255.330417-1-robert.marko@sartura.hr#1
Cc: stable@vger.kernel.org # 6.3+
Fixes: 20cb3fce4d ("i2c: Set i2c pinctrl recovery info from it's device pinctrl")
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260226-i2c-pxa-fix-i2c-communication-v4-1-797a091dae87@gmail.com
The use of IONIC_CMD_LIF_SETATTR in the MAC address update path causes
the ionic firmware to update the LIF's identity in its persistent state.
Since the firmware state is maintained across host warm boots and driver
reloads, any MAC change on the Physical Function (PF) becomes "sticky.
This is problematic because it causes ethtool -P to report the
user-configured MAC as the permanent factory address, which breaks
system management tools that rely on a stable hardware identity.
While Virtual Functions (VFs) need this hardware-level programming to
properly handle MAC assignments in guest environments, the PF should
maintain standard transient behavior. This patch gates the
ionic_program_mac call using is_virtfn so that PF MAC changes remain
local to the netdev filters and do not overwrite the firmware's
permanent identity block.
Fixes: 19058be7c4 ("ionic: VF initial random MAC address if no assigned mac")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260317170806.35390-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During the cxl_acpi probe process, it checks whether the cxl_nvb device
and driver have been attached. Currently, the startup priority of the
cxl_pmem driver is lower than that of the cxl_acpi driver. At this point,
the cxl_nvb driver has not yet been registered on the cxl_bus, causing
the attachment check to fail. This results in a failure to add the root
nvdimm bridge, leading to a cxl_acpi probe failure and ultimately
affecting the subsequent loading of cxl drivers. As a consequence, only
one mem device object exists on the cxl_bus, while the cxl_port device
objects and decoder device objects are missing.
The solution is to raise the startup priority of cxl_pmem to be higher
than that of cxl_acpi, ensuring that the cxl_pmem driver is registered
before the aforementioned attachment check occurs.
Co-developed-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
Fixes: e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko")
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260319074535.1709250-1-cuichao1753@phytium.com.cn
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.
Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().
Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_F_BUF_MORE accordingy.
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add tsync_interrupt test to exercise the signal interruption path in
landlock_restrict_sibling_threads(). When a signal interrupts
wait_for_completion_interruptible() while the calling thread waits for
sibling threads to finish credential preparation, the kernel:
1. Sets ERESTARTNOINTR to request a transparent syscall restart.
2. Calls cancel_tsync_works() to opportunistically dequeue task works
that have not started running yet.
3. Breaks out of the preparation loop, then unblocks remaining
task works via complete_all() and waits for them to finish.
4. Returns the error, causing abort_creds() in the syscall handler.
Specifically, cancel_tsync_works() in its entirety, the ERESTARTNOINTR
error branch in landlock_restrict_sibling_threads(), and the
abort_creds() error branch in the landlock_restrict_self() syscall
handler are timing-dependent and not exercised by the existing tsync
tests, making code coverage measurements non-deterministic.
The test spawns a signaler thread that rapidly sends SIGUSR1 to the
calling thread while it performs landlock_restrict_self() with
LANDLOCK_RESTRICT_SELF_TSYNC. Since ERESTARTNOINTR causes a
transparent restart, userspace always sees the syscall succeed.
This is a best-effort coverage test: the interruption path is exercised
when the signal lands during the preparation wait, which depends on
thread scheduling. The test creates enough idle sibling threads (200)
to ensure multiple serialized waves of credential preparation even on
machines with many cores (e.g., 64), widening the window for the
signaler. Deterministic coverage would require wrapping the wait call
with ALLOW_ERROR_INJECTION() and using CONFIG_FAIL_FUNCTION.
Test coverage for security/landlock was 90.2% of 2105 lines according to
LLVM 21, and it is now 91.1% of 2105 lines with this new test.
Cc: Günther Noack <gnoack@google.com>
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Tingmao Wang <m@maowtm.org>
Cc: Yihan Ding <dingyihan@uniontech.com>
Link: https://lore.kernel.org/r/20260310190416.1913908-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Some pinctrl devices like mt6397 or mt6392 don't support EINT at all, but
the mtk_eint_init function is always called and returns -ENODEV, which
then bubbles up and causes probe failure.
To address this only call mtk_eint_init if EINT pins are present.
Tested on Xiaomi Mi Smart Clock x04g (mt6392).
Fixes: e46df235b4 ("pinctrl: mediatek: refactor EINT related code for all MediaTek pinctrl can fit")
Signed-off-by: Luca Leonardo Scorcia <l.scorcia@gmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
This attempt to fix regressions caused by reusing ident which apparently
is not handled well on certain stacks causing the stack to not respond to
requests, so instead of simple returning the first unallocated id this
stores the last used tx_ident and then attempt to use the next until all
available ids are exausted and then cycle starting over to 1.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221120
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221177
Fixes: 6c3ea155e5 ("Bluetooth: L2CAP: Fix not tracking outstanding TX ident")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Christian Eggers <ceggers@arri.de>
Smatch reports:
drivers/bluetooth/hci_ll.c:587 download_firmware() warn:
'fw' from request_firmware() not released on lines: 544.
In download_firmware(), if request_firmware() succeeds but the returned
firmware content is invalid (no data or zero size), the function returns
without releasing the firmware, resulting in a resource leak.
Fix this by calling release_firmware() before returning when
request_firmware() succeeded but the firmware content is invalid.
Fixes: 371805522f ("bluetooth: hci_uart: add LL protocol serdev driver support")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
__hci_cmd_sync_sk() sets hdev->req_status under hdev->req_lock:
hdev->req_status = HCI_REQ_PEND;
However, several other functions read or write hdev->req_status without
holding any lock:
- hci_send_cmd_sync() reads req_status in hci_cmd_work (workqueue)
- hci_cmd_sync_complete() reads/writes from HCI event completion
- hci_cmd_sync_cancel() / hci_cmd_sync_cancel_sync() read/write
- hci_abort_conn() reads in connection abort path
Since __hci_cmd_sync_sk() runs on hdev->req_workqueue while
hci_send_cmd_sync() runs on hdev->workqueue, these are different
workqueues that can execute concurrently on different CPUs. The plain
C accesses constitute a data race.
Add READ_ONCE()/WRITE_ONCE() annotations on all concurrent accesses
to hdev->req_status to prevent potential compiler optimizations that
could affect correctness (e.g., load fusing in the wait_event
condition or store reordering).
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sco_recv_frame() reads conn->sk under sco_conn_lock() but immediately
releases the lock without holding a reference to the socket. A concurrent
close() can free the socket between the lock release and the subsequent
sk->sk_state access, resulting in a use-after-free.
Other functions in the same file (sco_sock_timeout(), sco_conn_del())
correctly use sco_sock_hold() to safely hold a reference under the lock.
Fix by using sco_sock_hold() to take a reference before releasing the
lock, and adding sock_put() on all exit paths.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_ecred_data_rcv() reads the SDU length field from skb->data using
get_unaligned_le16() without first verifying that skb contains at least
L2CAP_SDULEN_SIZE (2) bytes. When skb->len is less than 2, this reads
past the valid data in the skb.
The ERTM reassembly path correctly calls pskb_may_pull() before reading
the SDU length (l2cap_reassemble_sdu, L2CAP_SAR_START case). Apply the
same validation to the Enhanced Credit Based Flow Control data path.
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Syzbot reported a KASAN stack-out-of-bounds read in l2cap_build_cmd()
that is triggered by a malformed Enhanced Credit Based Connection Request.
The vulnerability stems from l2cap_ecred_conn_req(). The function allocates
a local stack buffer (`pdu`) designed to hold a maximum of 5 Source Channel
IDs (SCIDs), totaling 18 bytes. When an attacker sends a request with more
than 5 SCIDs, the function calculates `rsp_len` based on this unvalidated
`cmd_len` before checking if the number of SCIDs exceeds
L2CAP_ECRED_MAX_CID.
If the SCID count is too high, the function correctly jumps to the
`response` label to reject the packet, but `rsp_len` retains the
attacker's oversized value. Consequently, l2cap_send_cmd() is instructed
to read past the end of the 18-byte `pdu` buffer, triggering a
KASAN panic.
Fix this by moving the assignment of `rsp_len` to after the `num_scid`
boundary check. If the packet is rejected, `rsp_len` will safely
remain 0, and the error response will only read the 8-byte base header
from the stack.
Fixes: c28d2bff70 ("Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short")
Reported-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b7f3e7d9a596bf6a63e3
Tested-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Signed-off-by: Minseo Park <jacob.park.9436@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, Bluetooth and netfilter.
Nothing too exciting here, mostly fixes for corner cases.
Current release - fix to a fix:
- bonding: prevent potential infinite loop in bond_header_parse()
Current release - new code bugs:
- wifi: mac80211: check tdls flag in ieee80211_tdls_oper
Previous releases - regressions:
- af_unix: give up GC if MSG_PEEK intervened
- netfilter: conntrack: add missing netlink policy validations
- NFC: nxp-nci: allow GPIOs to sleep"
* tag 'net-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
MPTCP: fix lock class name family in pm_nl_create_listen_socket
icmp: fix NULL pointer dereference in icmp_tag_validation()
net: dsa: bcm_sf2: fix missing clk_disable_unprepare() in error paths
net: shaper: protect from late creation of hierarchy
net: shaper: protect late read accesses to the hierarchy
net: mvpp2: guard flow control update with global_tx_fc in buffer switching
nfnetlink_osf: validate individual option lengths in fingerprints
netfilter: nf_tables: release flowtable after rcu grace period on error
netfilter: bpf: defer hook memory release until rcu readers are done
net: bonding: fix NULL deref in bond_debug_rlb_hash_show
udp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=n
net/mlx5e: Fix race condition during IPSec ESN update
net/mlx5e: Prevent concurrent access to IPSec ASO context
net/mlx5: qos: Restrict RTNL area to avoid a lock cycle
ipv6: add NULL checks for idev in SRv6 paths
NFC: nxp-nci: allow GPIOs to sleep
net: macb: fix uninitialized rx_fs_lock
net: macb: fix use-after-free access to PTP clock
netdevsim: drop PSP ext ref on forward failure
wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
...
icmp_tag_validation() unconditionally dereferences the result of
rcu_dereference(inet_protos[proto]) without checking for NULL.
The inet_protos[] array is sparse -- only about 15 of 256 protocol
numbers have registered handlers. When ip_no_pmtu_disc is set to 3
(hardened PMTU mode) and the kernel receives an ICMP Fragmentation
Needed error with a quoted inner IP header containing an unregistered
protocol number, the NULL dereference causes a kernel panic in
softirq context.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:icmp_unreach (net/ipv4/icmp.c:1085 net/ipv4/icmp.c:1143)
Call Trace:
<IRQ>
icmp_rcv (net/ipv4/icmp.c:1527)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:207)
ip_local_deliver_finish (net/ipv4/ip_input.c:242)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
__netif_receive_skb_one_core (net/core/dev.c:6164)
process_backlog (net/core/dev.c:6628)
handle_softirqs (kernel/softirq.c:561)
</IRQ>
Add a NULL check before accessing icmp_strict_tag_validation. If the
protocol has no registered handler, return false since it cannot
perform strict tag validation.
Fixes: 8ed1dc44d3 ("ipv4: introduce hardened ip_no_pmtu_disc mode")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260318130558.1050247-4-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Smatch reports:
drivers/net/dsa/bcm_sf2.c:997 bcm_sf2_sw_resume() warn:
'priv->clk' from clk_prepare_enable() not released on lines: 983,990.
The clock enabled by clk_prepare_enable() in bcm_sf2_sw_resume()
is not released if bcm_sf2_sw_rst() or bcm_sf2_cfp_resume() fails.
Add the missing clk_disable_unprepare() calls in the error paths
to properly release the clock resource.
Fixes: e9ec5c3bd2 ("net: dsa: bcm_sf2: request and handle clocks")
Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Link: https://patch.msgid.link/20260318084212.1287-1-mohd.abd.6602@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
isotp_sendmsg() uses only cmpxchg() on so->tx.state to serialize access
to so->tx.buf. isotp_release() waits for ISOTP_IDLE via
wait_event_interruptible() and then calls kfree(so->tx.buf).
If a signal interrupts the wait_event_interruptible() inside close()
while tx.state is ISOTP_SENDING, the loop exits early and release
proceeds to force ISOTP_SHUTDOWN and continues to kfree(so->tx.buf)
while sendmsg may still be reading so->tx.buf for the final CAN frame
in isotp_fill_dataframe().
The so->tx.buf can be allocated once when the standard tx.buf length needs
to be extended. Move the kfree() of this potentially extended tx.buf to
sk_destruct time when either isotp_sendmsg() and isotp_release() are done.
Fixes: 96d1c81e6a ("can: isotp: add module parameter for maximum pdu size")
Cc: stable@vger.kernel.org
Reported-by: Ali Norouzi <ali.norouzi@keysight.com>
Co-developed-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260319-fix-can-gw-and-can-isotp-v2-2-c45d52c6d2d8@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
cgw_csum_crc8_rel() correctly computes bounds-safe indices via calc_idx():
int from = calc_idx(crc8->from_idx, cf->len);
int to = calc_idx(crc8->to_idx, cf->len);
int res = calc_idx(crc8->result_idx, cf->len);
if (from < 0 || to < 0 || res < 0)
return;
However, the loop and the result write then use the raw s8 fields directly
instead of the computed variables:
for (i = crc8->from_idx; ...) /* BUG: raw negative index */
cf->data[crc8->result_idx] = ...; /* BUG: raw negative index */
With from_idx = to_idx = result_idx = -64 on a 64-byte CAN FD frame,
calc_idx(-64, 64) = 0 so the guard passes, but the loop iterates with
i = -64, reading cf->data[-64], and the write goes to cf->data[-64].
This write might end up to 56 (7.0-rc) or 40 (<= 6.19) bytes before the
start of the canfd_frame on the heap.
The companion function cgw_csum_xor_rel() uses `from`/`to`/`res`
correctly throughout; fix cgw_csum_crc8_rel() to match.
Confirmed with KASAN on linux-7.0-rc2:
BUG: KASAN: slab-out-of-bounds in cgw_csum_crc8_rel+0x515/0x5b0
Read of size 1 at addr ffff8880076619c8 by task poc_cgw_oob/62
To configure the can-gw crc8 checksums CAP_NET_ADMIN is needed.
Fixes: 456a8a646b ("can: gw: add support for CAN FD frames")
Cc: stable@vger.kernel.org
Reported-by: Ali Norouzi <ali.norouzi@keysight.com>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260319-fix-can-gw-and-can-isotp-v2-1-c45d52c6d2d8@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
GGTT MMIO access is currently protected by hotplug (drm_dev_enter),
which works correctly when the driver loads successfully and is later
unbound or unloaded. However, if driver load fails, this protection is
insufficient because drm_dev_unplug() is never called.
Additionally, devm release functions cannot guarantee that all BOs with
GGTT mappings are destroyed before the GGTT MMIO region is removed, as
some BOs may be freed asynchronously by worker threads.
To address this, introduce an open-coded flag, protected by the GGTT
lock, that guards GGTT MMIO access. The flag is cleared during the
dev_fini_ggtt devm release function to ensure MMIO access is disabled
once teardown begins.
Cc: stable@vger.kernel.org
Fixes: 919bb54e98 ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260310225039.1320161-8-zhanjun.dong@intel.com
(cherry picked from commit 4f3a998a173b4325c2efd90bdadc6ccd3ad9a431)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Pull power management fixes from Rafael Wysocki:
"These fix an idle loop issue exposed by recent changes and a race
condition related to device removal in the runtime PM core code:
- Consolidate the handling of two special cases in the idle loop that
occur when only one CPU idle state is present (Rafael Wysocki)
- Fix a race condition related to device removal in the runtime PM
core code that may cause a stale device object pointer to be
dereferenced (Bart Van Assche)"
* tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: runtime: Fix a race condition related to device removal
sched: idle: Consolidate the handling of two special cases
The HDP driver uses the generic GPIO chip API, but this configuration
may not be enabled.
Ensure it is enabled by selecting the appropriate option.
Fixes: 4bcff9c05b ("pinctrl: stm32: use new generic GPIO chip API")
Signed-off-by: Amelie Delaunay <amelie.delaunay@foss.st.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull ACPI support fixes from Rafael Wysocki:
"These fix an MFD child automatic modprobe issue introduced recently,
an ACPI processor driver issue introduced by a previous fix and an
ACPICA issue causing confusing messages regarding _DSM arguments to be
printed:
- Update the format of the last argument of _DSM to avoid printing
confusing error messages in some cases (Saket Dumbre)
- Fix MFD child automatic modprobe issue by removing a stale check
from acpi_companion_match() (Pratap Nirujogi)
- Prevent possible use-after-free in acpi_processor_errata_piix4()
from occurring by rearranging the code to print debug messages
while holding references to relevant device objects (Rafael
Wysocki)"
* tag 'acpi-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: bus: Fix MFD child automatic modprobe issue
ACPI: processor: Fix previous acpi_processor_errata_piix4() fix
ACPICA: Update the format of Arg3 of _DSM
Florian Westphal says:
====================
netfilter: updates for net
The following patchset contains Netfilter fixes for *net*:
1) Fix UaF when netfilter bpf link goes away while nfnetlink dumps
current hook list, we have to wait until rcu readers are gone.
2) Fix UaF when flowtable fails to register all devices, similar
bug as 1). From Pablo Neira Ayuso.
3) nfnetlink_osf fails to properly validate option length fields.
From Weiming Shi.
netfilter pull request nf-26-03-19
* tag 'nf-26-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
nfnetlink_osf: validate individual option lengths in fingerprints
netfilter: nf_tables: release flowtable after rcu grace period on error
netfilter: bpf: defer hook memory release until rcu readers are done
====================
Link: https://patch.msgid.link/20260319093834.19933-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge an ACPICA fix and a core ACPI support code fix for 7.0-rc5:
- Update the format of the last argument of _DSM to avoid printing
confusing error messages in some cases (Saket Dumbre)
- Fix MFD child automatic modprobe issue by removing a stale check
from acpi_companion_match() (Pratap Nirujogi)
* acpica:
ACPICA: Update the format of Arg3 of _DSM
* acpi-bus:
ACPI: bus: Fix MFD child automatic modprobe issue
Merge a fix for a race condition related to device removal (Bart Van
Assche) for 7.0-rc5.
* pm-runtime:
PM: runtime: Fix a race condition related to device removal
Add missing error handling for mcp251x_power_enable() calls in both
mcp251x_open() and mcp251x_can_resume() functions.
In mcp251x_open(), if power enable fails, jump to error path to close
candev without attempting to disable power again.
In mcp251x_can_resume(), properly check return values of power enable calls
for both power and transceiver regulators. If any fails, return the error
code to the PM framework and log the failure.
This ensures the driver properly handles power control failures and
maintains correct device state.
Signed-off-by: Wenyuan Li <2063309626@qq.com>
Link: https://patch.msgid.link/tencent_F3EFC5D7738AC548857B91657715E2D3AA06@qq.com
[mkl: fix patch description]
[mkl: mcp251x_can_resume(): replace goto by return]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Getting engine specific CTX TIMESTAMP register can fail. In that case,
if the context is active, new_ts is uninitialized. Fix that case by
initializing new_ts to the last value that was sampled in SW -
lrc->ctx_timestamp.
Flagged by static analysis.
v2: Fix new_ts initialization (Ashutosh)
Fixes: bb63e7257e ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260312125308.3126607-2-umesh.nerlige.ramappa@intel.com
(cherry picked from commit 466e75d48038af252187855058a7a9312db9d2f8)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
The check using xe_child->base.children was insufficient in determining
if a pte was a leaf node. So explicitly skip over every non-leaf pt and
conditionally abort if there is a scenario where a non-leaf pt is
interleaved between leaf pt, which results in the page walker skipping
over some leaf pt.
Note that the behavior being targeted for abort is
PD[0] = 2M PTE
PD[1] = PT -> 512 4K PTEs
PD[2] = 2M PTE
results in abort, page walker won't descend PD[1].
With new abort, ensuring valid PRL before handling a second abort.
v2:
- Revert to previous assert.
- Revised non-leaf handling for interleaf child pt and leaf pte.
- Update comments to specifications. (Stuart)
- Remove unnecessary XE_PTE_PS64. (Matthew B)
v3:
- Modify secondary abort to only check non-leaf PTEs. (Matthew B)
Fixes: b912138df2 ("drm/xe: Create page reclaim list on unbind")
Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20260305171546.67691-6-brian3.nguyen@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 1d123587525db86cc8f0d2beb35d9e33ca3ade83)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
In GuC submit fini, forcefully tear down any exec queues by disabling
CTs, stopping the scheduler (which cleans up lost G2H), killing all
remaining queues, and resuming scheduling to allow any remaining cleanup
actions to complete and signal any remaining fences.
Split guc_submit_fini into device related and software only part. Using
device-managed and drm-managed action guarantees the correct ordering of
cleanup.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260310225039.1320161-3-zhanjun.dong@intel.com
(cherry picked from commit a6ab444a111a59924bd9d0c1e0613a75a0a40b89)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
By using the same variable for both the return of poll_timeout_us and
the return of the polled function guc_wait_ucode, the return value of
the latter is overwritten and lost after exiting the polling loop. Since
guc_wait_ucode returns -1 on GuC load failure, we lose that information
and always continue as if the GuC had been loaded correctly.
This is fixed by simply using 2 separate variables.
Fixes: a4916b4da4 ("drm/xe/guc: Refactor GuC load to use poll_timeout_us()")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patch.msgid.link/20260303001732.2540493-2-daniele.ceraolospurio@intel.com
(cherry picked from commit c85ec5c5753a46b5c2aea1292536487be9470ffe)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
We look up a netdev during prep of Netlink ops (pre- callbacks)
and take a ref to it. Then later in the body of the callback
we take its lock or RCU which are the actual protections.
The netdev may get unregistered in between the time we take
the ref and the time we lock it. We may allocate the hierarchy
after flush has already run, which would lead to a leak.
Take the instance lock in pre- already, this saves us from the race
and removes the need for dedicated lock/unlock callbacks completely.
After all, if there's any chance of write happening concurrently
with the flush - we're back to leaking the hierarchy.
We may take the lock for devices which don't support shapers but
we're only dealing with SET operations here, not taking the lock
would be optimizing for an error case.
Fixes: 93954b40f6 ("net-shapers: implement NL set and delete operations")
Link: https://lore.kernel.org/20260309173450.538026-1-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260317161014.779569-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We look up a netdev during prep of Netlink ops (pre- callbacks)
and take a ref to it. Then later in the body of the callback
we take its lock or RCU which are the actual protections.
This is not proper, a conversion from a ref to a locked netdev
must include a liveness check (a check if the netdev hasn't been
unregistered already). Fix the read cases (those under RCU).
Writes needs a separate change to protect from creating the
hierarchy after flush has already run.
Fixes: 4b623f9f0f ("net-shapers: implement NL get operation")
Reported-by: Paul Moses <p@1g4.org>
Link: https://lore.kernel.org/20260309173450.538026-1-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260317161014.779569-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With LPA/LPA2, the top bits of the PFN (Bits[51:48]) end up in the lower bits
of the PTE. So, simply creating a mask of the "top IPA bit" doesn't work well
for these configurations to set the "top" bit at the output of Stage1
translation.
Fix this by using the __phys_to_pte_val() to do the right thing for all
configurations.
Tested using, kvmtool, placing the memory at a higher address (-m <size>@<Addr>).
e.g:
# lkvm run --realm -c 4 -m 512M@@128T -k Image --console serial
sh-5.0# dmesg | grep "LPA2\|RSI"
[ 0.000000] RME: Using RSI version 1.0
[ 0.000000] CPU features: detected: 52-bit Virtual Addressing (LPA2)
[ 0.777354] CPU features: detected: 52-bit Virtual Addressing for KVM (LPA2)
Fixes: 3993069549 ("arm64: realm: Query IPA size from the RMM")
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry. The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from, but
it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).
Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).
Also clear the subreq pointer after discarding superfluous retryable
subrequests to cause an oops if we do try to access it.
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/3775287.1773848338@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
mvpp2_bm_switch_buffers() unconditionally calls
mvpp2_bm_pool_update_priv_fc() when switching between per-cpu and
shared buffer pool modes. This function programs CM3 flow control
registers via mvpp2_cm3_read()/mvpp2_cm3_write(), which dereference
priv->cm3_base without any NULL check.
When the CM3 SRAM resource is not present in the device tree (the
third reg entry added by commit 60523583b0 ("dts: marvell: add CM3
SRAM memory to cp11x ethernet device tree")), priv->cm3_base remains
NULL and priv->global_tx_fc is false. Any operation that triggers
mvpp2_bm_switch_buffers(), for example an MTU change that crosses
the jumbo frame threshold, will crash:
Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
Mem abort info:
ESR = 0x0000000096000006
EC = 0x25: DABT (current EL), IL = 32 bits
pc : readl+0x0/0x18
lr : mvpp2_cm3_read.isra.0+0x14/0x20
Call trace:
readl+0x0/0x18
mvpp2_bm_pool_update_fc+0x40/0x12c
mvpp2_bm_pool_update_priv_fc+0x94/0xd8
mvpp2_bm_switch_buffers.isra.0+0x80/0x1c0
mvpp2_change_mtu+0x140/0x380
__dev_set_mtu+0x1c/0x38
dev_set_mtu_ext+0x78/0x118
dev_set_mtu+0x48/0xa8
dev_ifsioc+0x21c/0x43c
dev_ioctl+0x2d8/0x42c
sock_ioctl+0x314/0x378
Every other flow control call site in the driver already guards
hardware access with either priv->global_tx_fc or port->tx_fc.
mvpp2_bm_switch_buffers() is the only place that omits this check.
Add the missing priv->global_tx_fc guard to both the disable and
re-enable calls in mvpp2_bm_switch_buffers(), consistent with the
rest of the driver.
Fixes: 3a616b92a9 ("net: mvpp2: Add TX flow control support for jumbo frames")
Signed-off-by: Muhammad Hammad Ijaz <mhijaz@amazon.com>
Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com>
Link: https://patch.msgid.link/20260316193157.65748-1-mhijaz@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
nfnl_osf_add_callback() validates opt_num bounds and string
NUL-termination but does not check individual option length fields.
A zero-length option causes nf_osf_match_one() to enter the option
matching loop even when foptsize sums to zero, which matches packets
with no TCP options where ctx->optp is NULL:
Oops: general protection fault
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
nf_osf_match (net/netfilter/nfnetlink_osf.c:227)
xt_osf_match_packet (net/netfilter/xt_osf.c:32)
ipt_do_table (net/ipv4/netfilter/ip_tables.c:293)
nf_hook_slow (net/netfilter/core.c:623)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
Additionally, an MSS option (kind=2) with length < 4 causes
out-of-bounds reads when nf_osf_match_one() unconditionally accesses
optp[2] and optp[3] for MSS value extraction. While RFC 9293
section 3.2 specifies that the MSS option is always exactly 4
bytes (Kind=2, Length=4), the check uses "< 4" rather than
"!= 4" because lengths greater than 4 do not cause memory
safety issues -- the buffer is guaranteed to be at least
foptsize bytes by the ctx->optsize == foptsize check.
Reject fingerprints where any option has zero length, or where an MSS
option has length less than 4, at add time rather than trusting these
values in the packet matching hot path.
Fixes: 11eeef41d5 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Call synchronize_rcu() after unregistering the hooks from error path,
since a hook that already refers to this flowtable can be already
registered, exposing this flowtable to packet path and nfnetlink_hook
control plane.
This error path is rare, it should only happen by reaching the maximum
number hooks or by failing to set up to hardware offload, just call
synchronize_rcu().
There is a check for already used device hooks by different flowtable
that could result in EEXIST at this late stage. The hook parser can be
updated to perform this check earlier to this error path really becomes
rarely exercised.
Uncovered by KASAN reported as use-after-free from nfnetlink_hook path
when dumping hooks.
Fixes: 3b49e2e94e ("netfilter: nf_tables: add flow table netlink frontend")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Yiming Qian reports UaF when concurrent process is dumping hooks via
nfnetlink_hooks:
BUG: KASAN: slab-use-after-free in nfnl_hook_dump_one.isra.0+0xe71/0x10f0
Read of size 8 at addr ffff888003edbf88 by task poc/79
Call Trace:
<TASK>
nfnl_hook_dump_one.isra.0+0xe71/0x10f0
netlink_dump+0x554/0x12b0
nfnl_hook_get+0x176/0x230
[..]
Defer release until after concurrent readers have completed.
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Signed-off-by: Florian Westphal <fw@strlen.de>
Johannes Berg says:
====================
Just a few updates:
- cfg80211:
- guarantee pmsr work is cancelled
- mac80211:
- reject TDLS operations on non-TDLS stations
- fix crash in AP_VLAN bandwidth change
- fix leak or double-free on some TX preparation
failures
- remove keys needed for beacons _after_ stopping
those
- fix debugfs static branch race
- avoid underflow in inactive time
- fix another NULL dereference in mesh on invalid
frames
- ti/wlcore: avoid infinite realloc loop
* tag 'wireless-2026-03-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
wifi: wlcore: Return -ENOMEM instead of -EAGAIN if there is not enough headroom
wifi: mac80211: fix NULL deref in mesh_matches_local()
wifi: mac80211: check tdls flag in ieee80211_tdls_oper
wifi: cfg80211: cancel pmsr_free_wk in cfg80211_pmsr_wdev_down
wifi: mac80211: Fix static_branch_dec() underflow for aql_disable.
mac80211: fix crash in ieee80211_chan_bw_change for AP_VLAN stations
wifi: mac80211: use jiffies_delta_to_msecs() for sta_info inactive times
wifi: mac80211: remove keys after disabling beaconing
wifi: mac80211_hwsim: fully initialise PMSR capabilities
====================
Link: https://patch.msgid.link/20260318172515.381148-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_IPV6 is disabled, the udp_sock_create6() function returns 0
(success) without actually creating a socket. Callers such as
fou_create() then proceed to dereference the uninitialized socket
pointer, resulting in a NULL pointer dereference.
The captured NULL deref crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:fou_nl_add_doit (net/ipv4/fou_core.c:590 net/ipv4/fou_core.c:764)
[...]
Call Trace:
<TASK>
genl_family_rcv_msg_doit.constprop.0 (net/netlink/genetlink.c:1114)
genl_rcv_msg (net/netlink/genetlink.c:1194 net/netlink/genetlink.c:1209)
[...]
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
genl_rcv (net/netlink/genetlink.c:1219)
netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
__sock_sendmsg (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1))
__sys_sendto (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:2183 (discriminator 1))
__x64_sys_sendto (net/socket.c:2213 (discriminator 1) net/socket.c:2209 (discriminator 1) net/socket.c:2209 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (net/arch/x86/entry/entry_64.S:130)
This patch makes udp_sock_create6 return -EPFNOSUPPORT instead, so
callers correctly take their error paths. There is only one caller of
the vulnerable function and only privileged users can trigger it.
Fixes: fd384412e1 ("udp_tunnel: Seperate ipv6 functions into its own file.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260317010241.1893893-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In IPSec full offload mode, the device reports an ESN (Extended
Sequence Number) wrap event to the driver. The driver validates this
event by querying the IPSec ASO and checking that the esn_event_arm
field is 0x0, which indicates an event has occurred. After handling
the event, the driver must re-arm the context by setting esn_event_arm
back to 0x1.
A race condition exists in this handling path. After validating the
event, the driver calls mlx5_accel_esp_modify_xfrm() to update the
kernel's xfrm state. This function temporarily releases and
re-acquires the xfrm state lock.
So, need to acknowledge the event first by setting esn_event_arm to
0x1. This prevents the driver from reprocessing the same ESN update if
the hardware sends events for other reason. Since the next ESN update
only occurs after nearly 2^31 packets are received, there's no risk of
missing an update, as it will happen long after this handling has
finished.
Processing the event twice causes the ESN high-order bits (esn_msb) to
be incremented incorrectly. The driver then programs the hardware with
this invalid ESN state, which leads to anti-replay failures and a
complete halt of IPSec traffic.
Fix this by re-arming the ESN event immediately after it is validated,
before calling mlx5_accel_esp_modify_xfrm(). This ensures that any
spurious, duplicate events are correctly ignored, closing the race
window.
Fixes: fef0667893 ("net/mlx5e: Fix ESN update kernel panic")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The query or updating IPSec offload object is through Access ASO WQE.
The driver uses a single mlx5e_ipsec_aso struct for each PF, which
contains a shared DMA-mapped context for all ASO operations.
A race condition exists because the ASO spinlock is released before
the hardware has finished processing WQE. If a second operation is
initiated immediately after, it overwrites the shared context in the
DMA area.
When the first operation's completion is processed later, it reads
this corrupted context, leading to unexpected behavior and incorrect
results.
This commit fixes the race by introducing a private context within
each IPSec offload object. The shared ASO context is now copied to
this private context while the ASO spinlock is held. Subsequent
processing uses this saved, per-object context, ensuring its integrity
is maintained.
Fixes: 1ed78fc033 ("net/mlx5e: Update IPsec soft and hard limits")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A lock dependency cycle exists where:
1. mlx5_ib_roce_init -> mlx5_core_uplink_netdev_event_replay ->
mlx5_blocking_notifier_call_chain (takes notifier_rwsem) ->
mlx5e_mdev_notifier_event -> mlx5_netdev_notifier_register ->
register_netdevice_notifier_dev_net (takes rtnl)
=> notifier_rwsem -> rtnl
2. mlx5e_probe -> _mlx5e_probe ->
mlx5_core_uplink_netdev_set (takes uplink_netdev_lock) ->
mlx5_blocking_notifier_call_chain (takes notifier_rwsem)
=> uplink_netdev_lock -> notifier_rwsem
3: devlink_nl_rate_set_doit -> devlink_nl_rate_set ->
mlx5_esw_devlink_rate_leaf_tx_max_set -> esw_qos_devlink_rate_to_mbps ->
mlx5_esw_qos_max_link_speed_get (takes rtnl) ->
mlx5_esw_qos_lag_link_speed_get_locked ->
mlx5_uplink_netdev_get (takes uplink_netdev_lock)
=> rtnl -> uplink_netdev_lock
=> BOOM! (lock cycle)
Fix that by restricting the rtnl-protected section to just the necessary
part, the call to netdev_master_upper_dev_get and speed querying, so
that the last lock dependency is avoided and the cycle doesn't close.
This is safe because mlx5_uplink_netdev_get uses netdev_hold to keep the
uplink netdev alive while its master device is queried.
Use this opportunity to rename the ambiguously-named "hold_rtnl_lock"
argument to "take_rtnl" and remove the "_locked" suffix from
mlx5_esw_qos_lag_link_speed_get_locked.
Fixes: 6b4be64fd9 ("net/mlx5e: Harden uplink netdev access against device unbind")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-03-17 (igc, iavf, libie)
Kohei Enju adds use of helper function to add missing update of
skb->tail when padding is needed for igc.
Zdenek Bouska clears stale XSK timestamps when taking down Tx rings on
igc.
Petr Oros changes handling of iavf VLAN filter handling when an added
VLAN is also on the delete list to which can race and cause the VLAN
filter to not be added.
Michal frees cmd_buf for libie firmware logging to stop memory leaks.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
libie: prevent memleak in fwlog code
iavf: fix VLAN filter lost on add/delete race
igc: fix page fault in XDP TX timestamps handling
igc: fix missing update of skb->tail in igc_xmit_frame()
====================
Link: https://patch.msgid.link/20260317211906.115505-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gregory reported in [0] that the global_map_resize test when run in
repeatedly ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.
Logically, once a module is unloaded, it's associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long its reference counts are alive, but it should no longer be
discoverable.
To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.
Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.
To avoid a special case for btf_is_module() case, set btf->id to zero to
make btf_free_id() idempotent, such that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR, the idr_remove() should be a noop.
Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.
[0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com
Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Reported-by: Gregory Bell <grbell@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__in6_dev_get() can return NULL when the device has no IPv6 configuration
(e.g. MTU < IPV6_MIN_MTU or after NETDEV_UNREGISTER).
Add NULL checks for idev returned by __in6_dev_get() in both
seg6_hmac_validate_skb() and ipv6_srh_rcv() to prevent potential NULL
pointer dereferences.
Fixes: 1ababeba4a ("ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)")
Fixes: bf355b8d2c ("ipv6: sr: add core files for SR HMAC support")
Signed-off-by: Minhong He <heminhong@kylinos.cn>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20260316073301.106643-1-heminhong@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If hardware doesn't support RX Flow Filters, rx_fs_lock spinlock is not
initialized leading to the following assertion splat triggerable via
set_rxnfc callback.
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 1 PID: 949 Comm: syz.0.6 Not tainted 6.1.164+ #113
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x8d/0xba lib/dump_stack.c:106
assign_lock_key kernel/locking/lockdep.c:974 [inline]
register_lock_class+0x141b/0x17f0 kernel/locking/lockdep.c:1287
__lock_acquire+0x74f/0x6c40 kernel/locking/lockdep.c:4928
lock_acquire kernel/locking/lockdep.c:5662 [inline]
lock_acquire+0x190/0x4b0 kernel/locking/lockdep.c:5627
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x33/0x50 kernel/locking/spinlock.c:162
gem_del_flow_filter drivers/net/ethernet/cadence/macb_main.c:3562 [inline]
gem_set_rxnfc+0x533/0xac0 drivers/net/ethernet/cadence/macb_main.c:3667
ethtool_set_rxnfc+0x18c/0x280 net/ethtool/ioctl.c:961
__dev_ethtool net/ethtool/ioctl.c:2956 [inline]
dev_ethtool+0x229c/0x6290 net/ethtool/ioctl.c:3095
dev_ioctl+0x637/0x1070 net/core/dev_ioctl.c:510
sock_do_ioctl+0x20d/0x2c0 net/socket.c:1215
sock_ioctl+0x577/0x6d0 net/socket.c:1320
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x18c/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:46 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
A more straightforward solution would be to always initialize rx_fs_lock,
just like rx_fs_list. However, in this case the driver set_rxnfc callback
would return with a rather confusing error code, e.g. -EINVAL. So deny
set_rxnfc attempts directly if the RX filtering feature is not supported
by hardware.
Fixes: ae8223de3d ("net: macb: Added support for RX filtering")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Link: https://patch.msgid.link/20260316103826.74506-2-pchelkin@ispras.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nsim_do_psp() takes an extra reference to the PSP skb extension so the
extension survives __dev_forward_skb(). That forward path scrubs the skb
and drops attached skb extensions before nsim_psp_handle_ext() can
reattach the PSP metadata.
If __dev_forward_skb() fails in nsim_forward_skb(), the function returns
before nsim_psp_handle_ext() can attach that extension to the skb, leaving
the extra reference leaked.
Drop the saved PSP extension reference before returning from the
forward-failure path. Guard the put because plain or non-decapsulated
traffic can also fail forwarding without ever taking the extra PSP
reference.
Fixes: f857478d62 ("netdevsim: a basic test PSP implementation")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260317061431.1482716-1-atwellwea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ':=' override path in xbc_parse_kv() calls xbc_init_node() to
re-initialize an existing value node but does not check the return
value. If xbc_init_node() fails (data offset out of range), parsing
silently continues with stale node data.
Add the missing error check to match the xbc_add_node() call path
which already checks for failure.
In practice, a bootconfig using ':=' to override a value near the
32KB data limit could silently retain the old value, meaning a
security-relevant boot parameter override (e.g., a trace filter or
debug setting) would not take effect as intended.
Link: https://lore.kernel.org/all/20260318155847.78065-2-objecting@objecting.org/
Fixes: e5efaeb8a8 ("bootconfig: Support mixing a value and subkeys under a key")
Signed-off-by: Josh Law <objecting@objecting.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull crypto library fixes from Eric Biggers:
- Disable the "padlock" SHA-1 and SHA-256 driver on Zhaoxin
processors, since it does not compute hash values correctly
- Make a generated file be removed by 'make clean'
- Fix excessive stack usage in some of the arm64 AES code
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: powerpc: Add powerpc/aesp8-ppc.S to clean-files
crypto: padlock-sha - Disable for Zhaoxin processor
crypto: arm64/aes-neonbs - Move key expansion off the stack
People do effort to inject MCEs into guests in order to simulate/test
handling of hardware errors. The real use case behind it is testing the
handling of SIGBUS which the memory failure code sends to the process.
If that process is QEMU, instead of killing the whole guest, the MCE can
be injected into the guest kernel so that latter can attempt proper
handling and kill the user *process* in the guest, instead, which
caused the MCE. The assumption being here that the whole injection flow
can supply enough information that the guest kernel can pinpoint the
right process. But that's a different topic...
Regardless of virtualization or not, access to SMCA-specific registers
like MCA_DESTAT should only be done after having checked the smca
feature bit. And there are AMD machines like Bulldozer (the one before
Zen1) which do support deferred errors but are not SMCA machines.
Therefore, properly check the feature bit before accessing related MSRs.
[ bp: Rewrite commit message. ]
Fixes: 7cb735d7c0 ("x86/mce: Unify AMD DFR handler with MCA Polling")
Signed-off-by: William Roche <william.roche@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20260218163025.1316501-1-william.roche@oracle.com
Pull nfsd fixes from Chuck Lever:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
* tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix heap overflow in NFSv4.0 LOCK replay cache
sunrpc: fix cache_request leak in cache_release
NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
NFSD: Defer sub-object cleanup in export put callbacks
isl68137_avs_enable_show_page() uses the return value of
pmbus_read_byte_data() without checking for errors. If the I2C transaction
fails, a negative error code is passed through bitwise operations,
producing incorrect output.
Add an error check to propagate the return value if it is negative.
Additionally, modernize the callback by replacing sprintf()
with sysfs_emit().
Fixes: 038a9c3d1e ("hwmon: (pmbus/isl68137) Add driver for Intersil ISL68137 PWM Controller")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260318193952.47908-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
To pick up some extra files that need to be sync'ed with the kernel
sources to try and reduce the number of PRs.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The bcm2835_asb_control() function uses a tight polling loop to wait
for the ASB bridge to acknowledge a request. During intensive workloads,
this handshake intermittently fails for V3D's master ASB on BCM2711,
resulting in "Failed to disable ASB master for v3d" errors during
runtime PM suspend. As a consequence, the failed power-off leaves V3D in
a broken state, leading to bus faults or system hangs on later accesses.
As the timeout is insufficient in some scenarios, increase the polling
timeout from 1us to 5us, which is still negligible in the context of a
power domain transition. Also, replace the open-coded ktime_get_ns()/
cpu_relax() polling loop with readl_poll_timeout_atomic().
Cc: stable@vger.kernel.org
Fixes: 670c672608 ("soc: bcm: bcm2835-pm: Add support for power domains under a new binding.")
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Timings of the nand are adjusted by pl35x_nfc_setup_interface() but
actually applied by the pl35x_nand_select_target() function.
If there is only one nand chip, the pl35x_nand_select_target() will only
apply the timings once since the test at its beginning will always be true
after the first call to this function. As a result, the hardware will
keep using the default timings set at boot to detect the nand chip, not
the optimal ones.
With this patch, we program directly the new timings when
pl35x_nfc_setup_interface() is called.
Fixes: 08d8c62164 ("mtd: rawnand: pl353: Add support for the ARM PL353 SMC NAND controller")
Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
Cc: stable@vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
This helper really is just a little helper for internal purposes, and is
I/O operation oriented, despite its name. It has already been misused
in commit 5008c3ec3f ("mtd: spi-nor: core: Check read CR support"), so
rename it to clarify its purpose: it is only useful for reads and page
programs.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Commit 5008c3ec3f ("mtd: spi-nor: core: Check read CR support") adds a
controller check to make sure the core will not use CR reads on
controllers not supporting them. The approach is valid but the fix is
incorrect. Unfortunately, the author could not catch it, because the
expected behavior was met. The patch indeed drops the RDCR capability,
but it does it for all controllers!
The issue comes from the use of spi_nor_spimem_check_op() which is an
internal helper dedicated to check read/write operations only, despite
its generic name.
This helper looks for the biggest number of address bytes that can be
used for a page operation and tries 4 then 3. It then calls the usual
spi-mem helpers to do the checks. These will always fail because there
is now an inconsistency: the address cycles are forced to 4 (then 3)
bytes, but the bus width during the address cycles rightfully remains
0. There is a non-zero address length but a zero address bus width,
which is an invalid combination.
The correct check in this case is to directly call spi_mem_supports_op()
which doesn't messes up with the operation content.
Fixes: 5008c3ec3f ("mtd: spi-nor: core: Check read CR support")
Cc: stable@vger.kernel.org
Acked-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Acked-by: Takahiro Kuwano <takahiro.kuwano@infineon.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
The bio completion path in the process context (e.g. dm-verity)
will directly call into decompression rather than trigger another
workqueue context for minimal scheduling latencies, which can
then call vm_map_ram() with GFP_KERNEL.
Due to insufficient memory, vm_map_ram() may generate memory
swapping I/O, which can cause submit_bio_wait to deadlock
in some scenarios.
Trimmed down the call stack, as follows:
f2fs_submit_read_io
submit_bio //bio_list is initialized.
mmc_blk_mq_recovery
z_erofs_endio
vm_map_ram
__pte_alloc_kernel
__alloc_pages_direct_reclaim
shrink_folio_list
__swap_writepage
submit_bio_wait //bio_list is non-NULL, hang!!!
Use memalloc_noio_{save,restore}() to wrap up this path.
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The current error handling has two issues:
First, pin_user_pages_fast() can return a short pin count (less than
requested but greater than zero) when it cannot pin all requested pages.
This is treated as success, leading to partially pinned regions being
used, which causes memory corruption.
Second, when an error occurs mid-loop, already pinned pages from the
current batch are not properly accounted for before calling
mshv_region_invalidate_pages(), causing a page reference leak.
Treat short pins as errors and fix partial batch accounting before
cleanup.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
clang-22 rightfully warns that the memcpy() in adapter_prepare() copies
between different structures, crossing the boundary of nested
structures inside it:
In file included from sound/pci/asihpi/hpimsgx.c:13:
In file included from include/linux/string.h:386:
include/linux/fortify-string.h:569:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
569 | __write_overflow_field(p_size_field, size);
The two structures seem to refer to the same layout, despite the
separate definitions, so the code is in fact correct.
Avoid the warning by copying the two inner structures separately.
I see the same pattern happens in other functions in the same file,
so there is a chance that this may come back in the future, but
this instance is the only one that I saw in practice, hitting it
multiple times per day in randconfig build.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260318124016.3488566-1-arnd@kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pull SoC fixes from Arnd Bergmann:
"The firmware drivers for ARM SCMI, FF-A and the Tee subsystem, as
well as the reset controller and cache controller subsystem all see
small bugfixes for reference ounting errors, ABI correctness, and
NULL pointer dereferences.
Similarly, there are multiple reference counting fixes in drivers/soc/
for vendor specific drivers (rockchips, microchip), while the
freescale drivers get a fix for a race condition and error handling.
The devicetree fixes for Rockchips and NXP got held up, so for
the moment there is only Renesas fixing problesm with SD card
initialization, a boot hang on one board and incorrect descriptions
for interrupts and clock registers on some SoCs. The Microchip
polarfire gets a dts fix for a boot time warning.
A defconfig fix avoids a warning about a conflicting assignment"
* tag 'soc-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (21 commits)
ARM: multi_v7_defconfig: Drop duplicate CONFIG_TI_PRUSS=m
firmware: arm_scmi: Spelling s/mulit/multi/, s/currenly/currently/
firmware: arm_scmi: Fix NULL dereference on notify error path
firmware: arm_scpi: Fix device_node reference leak in probe path
firmware: arm_ffa: Remove vm_id argument in ffa_rxtx_unmap()
arm64: dts: renesas: r8a78000: Fix out-of-range SPI interrupt numbers
arm64: dts: renesas: rzg3s-smarc-som: Set bypass for Versa3 PLL2
arm64: dts: renesas: r9a09g087: Fix CPG register region sizes
arm64: dts: renesas: r9a09g077: Fix CPG register region sizes
arm64: dts: renesas: r9a09g057: Remove wdt{0,2,3} nodes
arm64: dts: renesas: rzv2-evk-cn15-sd: Add ramp delay for SD0 regulator
arm64: dts: renesas: rzt2h-n2h-evk: Add ramp delay for SD0 card regulator
tee: shm: Remove refcounting of kernel pages
reset: rzg2l-usbphy-ctrl: Check pwrrdy is valid before using it
soc: fsl: cpm1: qmc: Fix error check for devm_ioremap_resource() in qmc_qe_init_resources()
soc: fsl: qbman: fix race condition in qman_destroy_fq
soc: rockchip: grf: Add missing of_node_put() when returning
cache: ax45mp: Fix device node reference leak in ax45mp_cache_init()
cache: starfive: fix device node leak in starlink_cache_init()
riscv: dts: microchip: add can resets to mpfs
...
Pull crypto fix from Herbert Xu:
- Remove duplicate snp_leak_pages call in ccp
* tag 'v7.0-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: ccp - Fix leaking the same page twice
Pull LoongArch fixes from Huacai Chen:
- only use SC.Q when supported by the assembler to fix a build failure
- fix calling smp_processor_id() in preemptible code
- make a BPF helper arch_protect_bpf_trampoline() return 0 to fix a
kernel memory access failure
- fix a typo issue in kvm_vm_init_features()
* tag 'loongarch-fixes-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix typo issue in kvm_vm_init_features()
LoongArch: BPF: Make arch_protect_bpf_trampoline() return 0
LoongArch: No need to flush icache if text copy failed
LoongArch: Check return values for set_memory_{rw,rox}
LoongArch: Give more information if kmem access failed
LoongArch: Fix calling smp_processor_id() in preemptible code
LoongArch: Only use SC.Q when supported by the assembler
Shengjiu Wang <shengjiu.wang@nxp.com> says:
Check value of is_playback_only and is_capture_only in
graph_util_parse_link_direction() and initialize playback_only and
capture_only in imx-card.c
The audio-graph-card2 gets the value of 'playback-only' and
'capture_only' property in below sequence, if there is 'playback_only' or
'capture_only' property in port_cpu and port_codec nodes, but no these
properties in ep_cpu and ep_codec nodes, the value of playback_only and
capture_only will be flushed to zero in the end.
graph_util_parse_link_direction(lnk, &playback_only, &capture_only);
graph_util_parse_link_direction(ports_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(ports_codec, &playback_only, &capture_only);
graph_util_parse_link_direction(port_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(port_codec, &playback_only, &capture_only);
graph_util_parse_link_direction(ep_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(ep_codec, &playback_only, &capture_only);
So check the value of is_playback_only and is_capture_only in
graph_util_parse_link_direction() function, if they are true, then rewrite
the values, and no need to check the np variable as
of_property_read_bool() will ignore if it was NULL.
Fixes: 3cc393d223 ("ASoC: simple-card-utils: Fix pointer check in graph_util_parse_link_direction")
Fixes: 22a507d768 ("ASoC: simple-card-utils: Check device node before overwrite direction")
Suggested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Acked-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Link: https://patch.msgid.link/20260318102850.2794029-2-shengjiu.wang@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Arm SCMI fixes for v7.0
Few fixes to:
1. Address a NULL dereference in the SCMI notify error path by ensurin
__scmi_event_handler_get_ops() consistently returns an ERR_PTR on
failure, as expected by callers.
2. Fix a device_node reference leak in the SCPI probe path by introducing
scope-based cleanup for acquired DT nodes.
3. Correct minor spelling errors.
* tag 'scmi-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Spelling s/mulit/multi/, s/currenly/currently/
firmware: arm_scmi: Fix NULL dereference on notify error path
firmware: arm_scpi: Fix device_node reference leak in probe path
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Arm FF-A fix for v7.0
Fix removing the vm_id argument from ffa_rxtx_unmap(), as the FF-A
specification mandates this field be zero in all contexts except a
non-secure physical FF-A instance, where the ID is inherently 0.
* tag 'ffa-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Remove vm_id argument in ffa_rxtx_unmap()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
TEE shared memory update for 7.0
Remove refcounting of kernel pages in register_shm_helper() to support
slab allocations.
* tag 'tee-fix-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee:
tee: shm: Remove refcounting of kernel pages
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
MFD child devices sharing parent's ACPI Companion fails to probe as
acpi_companion_match() returns incompatible ACPI Companion handle for
binding with the check for pnp.type.backlight added recently. Remove this
pnp.type.backlight check in acpi_companion_match() to fix the automatic
modprobe issue.
Fixes: 7a7a7ed5f8bdb ("ACPI: scan: Register platform devices for backlight device objects")
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Link: https://patch.msgid.link/20260318034842.1216536-1-pratap.nirujogi@amd.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The deeply nested loop in rkvdec_init_v4l2_vp9_count_tbl() needs a lot
of registers, so when the clang register allocator runs out, it ends up
spilling countless temporaries to the stack:
drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than]
Marking this function as noinline_for_stack keeps it out of
rkvdec_vp9_start(), giving the compiler more room for optimization.
The resulting code is good enough that both the total stack usage
and the loop get enough better to stay under the warning limit,
though it's still slow, and would need a larger rework if this
function ends up being called in a fast path.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
The rkvdec_pps had a large set of bitfields, all of which
as misaligned. This causes clang-21 and likely other versions to
produce absolutely awful object code and a warning about very
large stack usage, on targets without unaligned access:
drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than]
Part of the problem here is how all the bitfield accesses are
inlined into a function that already has large structures on
the stack.
Mark set_field_order_cnt() as noinline_for_stack, and split out
the following accesses in assemble_hw_pps() into another noinline
function, both of which now using around 800 bytes of stack in the
same configuration.
There is clearly still something wrong with clang here, but
splitting it into multiple functions reduces the risk of stack
overflow.
Fixes: fde2490757 ("media: rkvdec: Add H264 support for the VDPU383 variant")
Link: https://godbolt.org/z/acP1eKeq9
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
The values of ext_sps_st_rps and ext_sps_lt_rps in struct rkvdec_hevc_run
are not initialized when the respective controls are not set by userspace.
When this is the case, set them to NULL so the rkvdec_hevc_run_preamble
function that parses controls does not access garbage data which leads to
a panic on unaccessible memory.
Fixes: c9a59dc2ac ("media: rkvdec: Add HEVC support for the VDPU381 variant")
Reported-by: Christian Hewitt <christianshewitt@gmail.com>
Suggested-by: Jonas Karlman <jonas@kwiboo.se>
Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com>
Tested-by: Christian Hewitt <christianshewitt@gmail.com>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
MEDIA_REQUEST_IOC_REINIT can run concurrently with VIDIOC_REQBUFS(0)
queue teardown paths. This can race request object cleanup against vb2
queue cancellation and lead to use-after-free reports.
We already serialize request queueing against STREAMON/OFF with
req_queue_mutex. Extend that serialization to REQBUFS, and also take
the same mutex in media_request_ioctl_reinit() so REINIT is in the
same exclusion domain.
This keeps request cleanup and queue cancellation from running in
parallel for request-capable devices.
Fixes: 6093d3002e ("media: vb2: keep a reference to the request until dqbuf")
Cc: stable@vger.kernel.org
Signed-off-by: Yuchan Nam <entropy1110@gmail.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
An issue was exposed where OS can pass in U32_MAX for SQ/RQ/SRQ size.
This can cause integer overflow and truncation of SQ/RQ/SRQ depth
returning a success when it should have failed.
Harden the functions to do all depth calculations and boundary
checking in u64 sizes.
Fixes: 563e1feb5f ("RDMA/irdma: Add SRQ support")
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When rdma_connect() fails due to an invalid arp index, user space rdma core
reports ENOMEM which is confusing. Modify irdma_make_cm_node() to return the
correct error code.
Fixes: 146b9756f1 ("RDMA/irdma: Add connection manager")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Resolve deadlock that occurs when user executes netdev reset while RDMA
applications (e.g., rping) are active. The netdev reset causes ice
driver to remove irdma auxiliary driver, triggering device_delete and
subsequent client removal. During client removal, uverbs_client waits
for QP reference count to reach zero while cma_client holds the final
reference, creating circular dependency and indefinite wait in iWARP
mode. Skip QP reference count wait during device reset to prevent
deadlock.
Fixes: c8f304d75f ("RDMA/irdma: Prevent QP use after free")
Signed-off-by: Anil Samal <anil.samal@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
During reset, irdma_modify_qp() to error should be called to disconnect
the QP. Without this fix, if not preceded by irdma_modify_qp() to error, the
API call irdma_destroy_qp() gets stuck waiting for the QP refcount to go
to zero, because the cm_node associated with this QP isn't disconnected.
Fixes: 915cc7ac0f ("RDMA/irdma: Add miscellaneous utility definitions")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The cm_node is available and the usage of cm_node and event->cm_node
seems arbitrary. Clean up unnecessary dereference of event->cm_node.
Fixes: 146b9756f1 ("RDMA/irdma: Add connection manager")
Signed-off-by: Ivan Barrera <ivan.d.barrera@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove a NOP wait_event() in irdma_modify_qp_roce() which is relevant
for iWARP and likely a copy and paste artifact for RoCEv2. The wait event
is for sending a reset on a TCP connection, after the reset has been
requested in irdma_modify_qp(), which occurs only in iWarp mode.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In irdma_modify_qp() update ibqp state to error if the irdma QP is already
in error state, otherwise the ibqp state which is visible to the consumer
app remains stale.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In irdma_create_qp, if ib_copy_to_udata fails, it will call
irdma_destroy_qp to clean up which will attempt to wait on
the free_qp completion, which is not initialized yet. Fix this
by initializing the completion before the ib_copy_to_udata call.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit aa35dd5cbc ("iomap: fix invalid folio access after
folio_end_read()") partially addressed invalid folio access for folios
without an ifs attached, but it did not handle the case where
1 << inode->i_blkbits matches the folio size but is different from the
granularity used for the IO, which means IO can be submitted for less
than the full folio for the !ifs case.
In this case, the condition:
if (*bytes_submitted == folio_len)
ctx->cur_folio = NULL;
in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and
iomap_read_end() will still be called on the folio even though the IO
helper owns it and will finish the read on it.
Fix this by unconditionally invalidating ctx->cur_folio for the !ifs
case.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/
Fixes: b2f35ac414 ("iomap: add caller-provided callbacks for read and readahead")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add the `__counted_by_ptr` attribute to the `buffer` field of `struct
xfs_attr_list_context`. This field is used to point to a buffer of
size `bufsize`.
The `buffer` field is assigned in:
1. `xfs_ioc_attr_list` in `fs/xfs/xfs_handle.c`
2. `xfs_xattr_list` in `fs/xfs/xfs_xattr.c`
3. `xfs_getparents` in `fs/xfs/xfs_handle.c` (implicitly initialized to NULL)
In `xfs_ioc_attr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned, although
no access happens between them.
In `xfs_xattr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned.
In `xfs_getparents`, `buffer` is NULL (from zero initialization) and
remains NULL. `bufsize` is set to a non-zero value, but since `buffer`
is NULL, no access occurs.
In all cases, the pointer `buffer` is not accessed before `bufsize` is set.
This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running xfstests.
Signed-off-by: Bill Wendling <morbo@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The newly added XFS_IOC_VERIFY_MEDIA is a bit unusual in how it handles
buftarg fields. Update it to be more in line with other XFS code:
- use btp->bt_dev instead of btp->bt_bdev->bd_dev to retrieve the device
number for tracing
- use btp->bt_logical_sectorsize instead of
bdev_logical_block_size(btp->bt_bdev) to retrieve the logical sector
size
- compare the buftarg and not the bdev to see if there is a separate
log buftarg
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xchk_quota_item can return early after calling xchk_fblock_process_error.
When that helper returns false, the function returned immediately without
dropping dq->q_qlock, which can leave the dquot lock held and risk lock
leaks or deadlocks in later quota operations.
Fix this by unlocking dq->q_qlock before the early return.
Signed-off-by: hongao <hongao@uniontech.com>
Fixes: 7d1f0e167a ("xfs: check the ondisk space mapping behind a dquot")
Cc: <stable@vger.kernel.org> # v6.8
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor the loop body of xfsaild_push() into a separate
xfsaild_process_logitem() helper to improve readability.
This is a pure code movement with no functional change.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock
is dropped to perform buffer IO. Once the cluster buffer no longer
protects the log item from reclaim, the log item may be freed by
background reclaim or the dquot shrinker. The subsequent spin_lock()
call dereferences lip->li_ailp, which is a use-after-free.
Fix this by saving the ailp pointer in a local variable while the AIL
lock is held and the log item is guaranteed to be valid.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
After xfsaild_push_item() calls iop_push(), the log item may have been
freed if the AIL lock was dropped during the push. Background inode
reclaim or the dquot shrinker can free the log item while the AIL lock
is not held, and the tracepoints in the switch statement dereference
the log item after iop_push() returns.
Fix this by capturing the log item type, flags, and LSN before calling
xfsaild_push_item(), and introducing a new xfs_ail_push_class trace
event class that takes these pre-captured values and the ailp pointer
instead of the log item pointer.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The unmount sequence in xfs_unmount_flush_inodes() pushed the AIL while
background reclaim and inodegc are still running. This is broken
independently of any use-after-free issues - background reclaim and
inodegc should not be running while the AIL is being pushed during
unmount, as inodegc can dirty and insert inodes into the AIL during the
flush, and background reclaim can race to abort and free dirty inodes.
Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background
reclaim before pushing the AIL. Stop inodegc before cancelling
m_reclaim_work because the inodegc worker can re-queue m_reclaim_work
via xfs_inodegc_set_reclaimable.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
ieee80211_tx_prepare_skb() has three error paths, but only two of them
free the skb. The first error path (ieee80211_tx_prepare() returning
TX_DROP) does not free it, while invoke_tx_handlers() failure and the
fragmentation check both do.
Add kfree_skb() to the first error path so all three are consistent,
and remove the now-redundant frees in callers (ath9k, mt76,
mac80211_hwsim) to avoid double-free.
Document the skb ownership guarantee in the function's kdoc.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20260314065455.2462900-1-nbd@nbd.name
Fixes: 06be6b149f ("mac80211: add ieee80211_tx_prepare_skb() helper function")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since upstream commit e75665dd09 ("wifi: wlcore: ensure skb headroom
before skb_push"), wl1271_tx_allocate() and with it
wl1271_prepare_tx_frame() returns -EAGAIN if pskb_expand_head() fails.
However, in wlcore_tx_work_locked(), a return value of -EAGAIN from
wl1271_prepare_tx_frame() is interpreted as the aggregation buffer being
full. This causes the code to flush the buffer, put the skb back at the
head of the queue, and immediately retry the same skb in a tight while
loop.
Because wlcore_tx_work_locked() holds wl->mutex, and the retry happens
immediately with GFP_ATOMIC, this will result in an infinite loop and a
CPU soft lockup. Return -ENOMEM instead so the packet is dropped and
the loop terminates.
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Assisted-by: Gemini:gemini-3.1-pro
Fixes: e75665dd09 ("wifi: wlcore: ensure skb headroom before skb_push")
Cc: Peter Astrand <astrand@lysator.liu.se>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: https://patch.msgid.link/20260318064636.3065925-1-linux@roeck-us.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mesh_matches_local() unconditionally dereferences ie->mesh_config to
compare mesh configuration parameters. When called from
mesh_rx_csa_frame(), the parsed action-frame elements may not contain a
Mesh Configuration IE, leaving ie->mesh_config NULL and triggering a
kernel NULL pointer dereference.
The other two callers are already safe:
- ieee80211_mesh_rx_bcn_presp() checks !elems->mesh_config before
calling mesh_matches_local()
- mesh_plink_get_event() is only reached through
mesh_process_plink_frame(), which checks !elems->mesh_config, too
mesh_rx_csa_frame() is the only caller that passes raw parsed elements
to mesh_matches_local() without guarding mesh_config. An adjacent
attacker can exploit this by sending a crafted CSA action frame that
includes a valid Mesh ID IE but omits the Mesh Configuration IE,
crashing the kernel.
The captured crash log:
Oops: general protection fault, probably for non-canonical address ...
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
Workqueue: events_unbound cfg80211_wiphy_work
[...]
Call Trace:
<TASK>
? __pfx_mesh_matches_local (net/mac80211/mesh.c:65)
ieee80211_mesh_rx_queued_mgmt (net/mac80211/mesh.c:1686)
[...]
ieee80211_iface_work (net/mac80211/iface.c:1754 net/mac80211/iface.c:1802)
[...]
cfg80211_wiphy_work (net/wireless/core.c:426)
process_one_work (net/kernel/workqueue.c:3280)
? assign_work (net/kernel/workqueue.c:1219)
worker_thread (net/kernel/workqueue.c:3352)
? __pfx_worker_thread (net/kernel/workqueue.c:3385)
kthread (net/kernel/kthread.c:436)
[...]
ret_from_fork_asm (net/arch/x86/entry/entry_64.S:255)
</TASK>
This patch adds a NULL check for ie->mesh_config at the top of
mesh_matches_local() to return false early when the Mesh Configuration
IE is absent.
Fixes: 2e3c873682 ("mac80211: support functions for mesh")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260318034244.2595020-1-xmei5@asu.edu
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Focusrite Scarlett 2i2 1st Gen (1235:8006) produces
distorted/silent audio when QUIRK_FLAG_SKIP_IFACE_SETUP is active, as
that flag causes the feedback format to be detected as 17.15 instead
of 16.16.
Add a DEVICE_FLG entry for this device before the Focusrite VENDOR_FLG
entry so that it gets no quirk flags, overriding the vendor-wide
SKIP_IFACE_SETUP. This device doesn't have the internal mixer, Air, or
Safe modes that the quirk was designed to protect.
Fixes: 38c322068a ("ALSA: usb-audio: Add QUIRK_FLAG_SKIP_IFACE_SETUP")
Reported-by: pairomaniac [https://github.com/geoffreybennett/linux-fcp/issues/54]
Tested-by: pairomaniac [https://github.com/geoffreybennett/linux-fcp/issues/54]
Signed-off-by: Geoffrey D. Bennett <g@b4.vu>
Link: https://patch.msgid.link/abmsTjKmQMKbhYtK@m.b4.vu
Signed-off-by: Takashi Iwai <tiwai@suse.de>
smb2_get_ksmbd_tcon() reuses work->tcon in compound requests without
validating tcon->t_state. ksmbd_tree_conn_lookup() checks t_state ==
TREE_CONNECTED on the initial lookup path, but the compound reuse path
bypasses this check entirely.
If a prior command in the compound (SMB2_TREE_DISCONNECT) sets t_state
to TREE_DISCONNECTED and frees share_conf via ksmbd_share_config_put(),
subsequent commands dereference the freed share_conf through
work->tcon->share_conf.
KASAN report:
[ 4.144653] ==================================================================
[ 4.145059] BUG: KASAN: slab-use-after-free in smb2_write+0xc74/0xe70
[ 4.145415] Read of size 4 at addr ffff88810430c194 by task kworker/1:1/44
[ 4.145772]
[ 4.145867] CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted 7.0.0-rc3+ #60 PREEMPTLAZY
[ 4.145871] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 4.145875] Workqueue: ksmbd-io handle_ksmbd_work
[ 4.145888] Call Trace:
[ 4.145892] <TASK>
[ 4.145894] dump_stack_lvl+0x64/0x80
[ 4.145910] print_report+0xce/0x660
[ 4.145919] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 4.145928] ? smb2_write+0xc74/0xe70
[ 4.145931] kasan_report+0xce/0x100
[ 4.145934] ? smb2_write+0xc74/0xe70
[ 4.145937] smb2_write+0xc74/0xe70
[ 4.145939] ? __pfx_smb2_write+0x10/0x10
[ 4.145942] ? _raw_spin_unlock+0xe/0x30
[ 4.145945] ? ksmbd_smb2_check_message+0xeb2/0x24c0
[ 4.145948] ? smb2_tree_disconnect+0x31c/0x480
[ 4.145951] handle_ksmbd_work+0x40f/0x1080
[ 4.145953] process_one_work+0x5fa/0xef0
[ 4.145962] ? assign_work+0x122/0x3e0
[ 4.145964] worker_thread+0x54b/0xf70
[ 4.145967] ? __pfx_worker_thread+0x10/0x10
[ 4.145970] kthread+0x346/0x470
[ 4.145976] ? recalc_sigpending+0x19b/0x230
[ 4.145980] ? __pfx_kthread+0x10/0x10
[ 4.145984] ret_from_fork+0x4fb/0x6c0
[ 4.145992] ? __pfx_ret_from_fork+0x10/0x10
[ 4.145995] ? __switch_to+0x36c/0xbe0
[ 4.145999] ? __pfx_kthread+0x10/0x10
[ 4.146003] ret_from_fork_asm+0x1a/0x30
[ 4.146013] </TASK>
[ 4.146014]
[ 4.149858] Allocated by task 44:
[ 4.149953] kasan_save_stack+0x33/0x60
[ 4.150061] kasan_save_track+0x14/0x30
[ 4.150169] __kasan_kmalloc+0x8f/0xa0
[ 4.150274] ksmbd_share_config_get+0x1dd/0xdd0
[ 4.150401] ksmbd_tree_conn_connect+0x7e/0x600
[ 4.150529] smb2_tree_connect+0x2e6/0x1000
[ 4.150645] handle_ksmbd_work+0x40f/0x1080
[ 4.150761] process_one_work+0x5fa/0xef0
[ 4.150873] worker_thread+0x54b/0xf70
[ 4.150978] kthread+0x346/0x470
[ 4.151071] ret_from_fork+0x4fb/0x6c0
[ 4.151176] ret_from_fork_asm+0x1a/0x30
[ 4.151286]
[ 4.151332] Freed by task 44:
[ 4.151418] kasan_save_stack+0x33/0x60
[ 4.151526] kasan_save_track+0x14/0x30
[ 4.151634] kasan_save_free_info+0x3b/0x60
[ 4.151751] __kasan_slab_free+0x43/0x70
[ 4.151861] kfree+0x1ca/0x430
[ 4.151952] __ksmbd_tree_conn_disconnect+0xc8/0x190
[ 4.152088] smb2_tree_disconnect+0x1cd/0x480
[ 4.152211] handle_ksmbd_work+0x40f/0x1080
[ 4.152326] process_one_work+0x5fa/0xef0
[ 4.152438] worker_thread+0x54b/0xf70
[ 4.152545] kthread+0x346/0x470
[ 4.152638] ret_from_fork+0x4fb/0x6c0
[ 4.152743] ret_from_fork_asm+0x1a/0x30
[ 4.152853]
[ 4.152900] The buggy address belongs to the object at ffff88810430c180
[ 4.152900] which belongs to the cache kmalloc-96 of size 96
[ 4.153226] The buggy address is located 20 bytes inside of
[ 4.153226] freed 96-byte region [ffff88810430c180, ffff88810430c1e0)
[ 4.153549]
[ 4.153596] The buggy address belongs to the physical page:
[ 4.153750] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88810430ce80 pfn:0x10430c
[ 4.154000] flags: 0x100000000000200(workingset|node=0|zone=2)
[ 4.154160] page_type: f5(slab)
[ 4.154251] raw: 0100000000000200 ffff888100041280 ffff888100040110 ffff888100040110
[ 4.154461] raw: ffff88810430ce80 0000000800200009 00000000f5000000 0000000000000000
[ 4.154668] page dumped because: kasan: bad access detected
[ 4.154820]
[ 4.154866] Memory state around the buggy address:
[ 4.155002] ffff88810430c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155196] ffff88810430c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155391] >ffff88810430c180: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4.155587] ^
[ 4.155693] ffff88810430c200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155891] ffff88810430c280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.156087] ==================================================================
Add the same t_state validation to the compound reuse path, consistent
with ksmbd_tree_conn_lookup().
Fixes: 5005bcb421 ("ksmbd: validate session id and tree id in the compound request")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use sb->s_uuid for a proper volume identifier as the primary choice.
For filesystems that do not provide a UUID, fall back to stfs.f_fsid
obtained from vfs_statfs().
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a multichannel SMB2_SESSION_SETUP request with
SMB2_SESSION_REQ_FLAG_BINDING fails ksmbd sets conn->binding = true
but never clears it on the error path. This leaves the connection in
a binding state where all subsequent ksmbd_session_lookup_all() calls
fall back to the global sessions table. This fix it by clearing
conn->binding = false in the error path.
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
__ksmbd_tree_conn_disconnect() drops the share_conf reference before
checking tree_conn->refcount. When someone uses SMB3 multichannel and
binds two connections to one session, a SESSION_LOGOFF on connection A
calls ksmbd_conn_wait_idle(conn) which only drains connection A's
request counter, not connection B's. This means there's a race condition:
requests already dispatched on connection B hold tree_conn references via
work->tcon. The disconnect path frees share_conf while those requests
are still walking work->tcon->share_conf, causing a use-after-free.
This fix combines the share_conf put with the tree_conn free so it
only happens when the last reference is dropped.
Fixes: b39a1833cc ("ksmbd: fix use-after-free in ksmbd_tree_connect_put under concurrency")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The ASYNC_EVENT_CMPL_EVENT_ID_DBG_BUF_PRODUCER handler in
bnxt_async_event_process() uses a firmware-supplied 'type' field
directly as an index into bp->bs_trace[] without bounds validation.
The 'type' field is a 16-bit value extracted from DMA-mapped completion
ring memory that the NIC writes directly to host RAM. A malicious or
compromised NIC can supply any value from 0 to 65535, causing an
out-of-bounds access into kernel heap memory.
The bnxt_bs_trace_check_wrap() call then dereferences bs_trace->magic_byte
and writes to bs_trace->last_offset and bs_trace->wrapped, leading to
kernel memory corruption or a crash.
Fix by adding a bounds check and defining BNXT_TRACE_MAX as
DBG_LOG_BUFFER_FLUSH_REQ_TYPE_ERR_QPC_TRACE + 1 to cover all currently
defined firmware trace types (0x0 through 0xc).
Fixes: 84fcd9449f ("bnxt_en: Manage the FW trace context memory")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/SYBPR01MB7881A253A1C9775D277F30E9AF42A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ina233_read_word_data() uses the return value of pmbus_read_word_data()
directly in a DIV_ROUND_CLOSEST() computation without first checking for
errors. If the underlying I2C transaction fails, a negative error code is
used in the arithmetic, producing a garbage sensor value instead of
propagating the error.
Add the missing error check before using the return value.
Fixes: b64b6cb163 ("hwmon: Add driver for TI INA233 Current and Power Monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317174553.385567-1-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
In mp2869_read_byte_data() and mp2869_read_word_data(), the return value
of pmbus_read_byte_data() for PMBUS_STATUS_MFR_SPECIFIC is used directly
inside FIELD_GET() macro arguments without error checking. If the I2C
transaction fails, a negative error code is passed to FIELD_GET() and
FIELD_PREP(), silently corrupting the status register bits being
constructed.
Extract the nested pmbus_read_byte_data() calls into a separate variable
and check for errors before use. This also eliminates a redundant duplicate
read of the same register in the PMBUS_STATUS_TEMPERATURE case.
Fixes: a3a2923aaf ("hwmon: add MP2869,MP29608,MP29612 and MP29816 series driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-4-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
mp2973_read_word_data() XORs the return value of pmbus_read_word_data()
with PB_STATUS_POWER_GOOD_N without first checking for errors. If the I2C
transaction fails, a negative error code is XORed with the constant,
producing a corrupted value that is returned as valid status data instead
of propagating the error.
Add the missing error check before modifying the return value.
Fixes: acda945afb ("hwmon: (pmbus/mp2975) Fix PGOOD in READ_STATUS_WORD")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Pull libnvdimm fix from Ira Weiny:
- Fix old potential use after free bug
* tag 'libnvdimm-fixes-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm/bus: Fix potential use after free in asynchronous initialization
hac300s_read_word_data() passes the return value of pmbus_read_word_data()
directly to FIELD_GET() without checking for errors. If the I2C transaction
fails, a negative error code is sign-extended and passed to FIELD_GET(),
which silently produces garbage data instead of propagating the error.
Add the missing error check before using the return value in
the FIELD_GET() macro.
Fixes: 669cf162f7 ("hwmon: Add support for HiTRON HAC300S PSU")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Pull kunit fix from Shuah Khan:
- Add documentation for --list_suites feature
* tag 'linux_kselftest-kunit-fixes-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Add documentation of --list_suites
All cmd_buf buffers are allocated and need to be freed after usage.
Add an error unwinding path that properly frees these buffers.
The memory leak happens whenever fwlog configuration is changed. For
example:
$echo 256K > /sys/kernel/debug/ixgbe/0000\:32\:00.0/fwlog/log_size
Fixes: 96a9a9341c ("ice: configure FW logging")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Pull HID fixes from Jiri Kosina:
- various fixes dealing with (intentionally) broken devices in HID
core, logitech-hidpp and multitouch drivers (Lee Jones)
- fix for OOB in wacom driver (Benoît Sevens)
- fix for potentialy HID-bpf-induced buffer overflow in () (Benjamin
Tissoires)
- various other small fixes and device ID / quirk additions
* tag 'hid-for-linus-2026031701' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: multitouch: Check to ensure report responses match the request
HID: logitech-hidpp: Prevent use-after-free on force feedback initialisation failure
HID: bpf: prevent buffer overflow in hid_hw_request
selftests/hid: fix compilation when bpf_wq and hid_device are not exported
HID: core: Mitigate potential OOB by removing bogus memset()
HID: intel-thc-hid: Set HID_PHYS with PCI BDF
HID: appletb-kbd: add .resume method in PM
HID: logitech-hidpp: Enable MX Master 4 over bluetooth
HID: input: Add HID_BATTERY_QUIRK_DYNAMIC for Elan touchscreens
HID: input: Drop Asus UX550* touchscreen ignore battery quirks
HID: asus: add xg mobile 2022 external hardware support
HID: wacom: fix out-of-bounds read in wacom_intuos_bt_irq
When iavf_add_vlan() finds an existing filter in IAVF_VLAN_REMOVE
state, it transitions the filter to IAVF_VLAN_ACTIVE assuming the
pending delete can simply be cancelled. However, there is no guarantee
that iavf_del_vlans() has not already processed the delete AQ request
and removed the filter from the PF. In that case the filter remains in
the driver's list as IAVF_VLAN_ACTIVE but is no longer programmed on
the NIC. Since iavf_add_vlans() only picks up filters in
IAVF_VLAN_ADD state, the filter is never re-added, and spoof checking
drops all traffic for that VLAN.
CPU0 CPU1 Workqueue
---- ---- ---------
iavf_del_vlan(vlan 100)
f->state = REMOVE
schedule AQ_DEL_VLAN
iavf_add_vlan(vlan 100)
f->state = ACTIVE
iavf_del_vlans()
f is ACTIVE, skip
iavf_add_vlans()
f is ACTIVE, skip
Filter is ACTIVE in driver but absent from NIC.
Transition to IAVF_VLAN_ADD instead and schedule
IAVF_FLAG_AQ_ADD_VLAN_FILTER so iavf_add_vlans() re-programs the
filter. A duplicate add is idempotent on the PF.
Fixes: 0c0da0e951 ("iavf: refactor VLAN filter states")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
If an XDP application that requested TX timestamping is shutting down
while the link of the interface in use is still up the following kernel
splat is reported:
[ 883.803618] [ T1554] BUG: unable to handle page fault for address: ffffcfb6200fd008
...
[ 883.803650] [ T1554] Call Trace:
[ 883.803652] [ T1554] <TASK>
[ 883.803654] [ T1554] igc_ptp_tx_tstamp_event+0xdf/0x160 [igc]
[ 883.803660] [ T1554] igc_tsync_interrupt+0x2d5/0x300 [igc]
...
During shutdown of the TX ring the xsk_meta pointers are left behind, so
that the IRQ handler is trying to touch them.
This issue is now being fixed by cleaning up the stale xsk meta data on
TX shutdown. TX timestamps on other queues remain unaffected.
Fixes: 15fd021bc4 ("igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet")
Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
igc_xmit_frame() misses updating skb->tail when the packet size is
shorter than the minimum one.
Use skb_put_padto() in alignment with other Intel Ethernet drivers.
Fixes: 0507ef8a03 ("igc: Add transmit and receive fastpath and interrupt handlers")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The "Read backward ring buffer" test crashes on big-endian (e.g. s390x)
due to a NULL dereference when the backward mmap path isn't enabled.
Reproducer:
# ./perf test -F 'Read backward ring buffer'
Segmentation fault (core dumped)
# uname -m
s390x
#
Root cause:
get_config_terms() stores into evsel_config_term::val.val (u64) while later
code reads boolean fields such as evsel_config_term::val.overwrite.
On big-endian the 1-byte boolean is left-aligned, so writing
evsel_config_term::val.val = 1 is read back as
evsel_config_term::val.overwrite = 0,
leaving backward mmap disabled and a NULL map being used.
Store values in the union member that matches the term type, e.g.:
/* for OVERWRITE */
new_term->val.overwrite = 1; /* not new_term->val.val = 1 */
to fix this. Improve add_config_term() and add two more parameters for
string and value. Function add_config_term() now creates a complete node
element of type evsel_config_term and handles all evsel_config_term::val
union members.
Impact:
Enables backward mmap on big-endian and prevents the crash.
No change on little-endian.
Output after:
# ./perf test -Fv 44
--- start ---
Using CPUID IBM,9175,705,ME1,3.8,002f
mmap size 1052672B
mmap size 8192B
---- end ----
44: Read backward ring buffer : Ok
#
Fixes: 159ca97cd9 ("perf parse-events: Refactor get_config_terms() to remove macros")
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Use metricgroup__for_each_metric() rather than
pmu_metrics_table__for_each_metric() that combines the
default metric table with, a potentially empty, CPUID table.
Fixes: cee275edcd ("perf metricgroup: Don't early exit if no CPUID table exists")
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.
Check specifically for HUP in the poll loop, and ensure that another
loop is done to check for status if more than a single poll activation
is pending. This ensures we don't lose the shutdown event.
Cc: stable@vger.kernel.org
Fixes: dbc2564cfe ("io_uring: let fast poll support multishot")
Reported-by: Francis Brosseau <francis@malagauche.com>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit 507fd01d53 ("drivers: move the early platform device support to
arch/sh") platform_match() was copied over to the sh platform_early
code, accidentally including the driver_override check.
This check does not make sense for platform_early, as sysfs is not even
available in first place at this point in the boot process, hence remove
the check.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 507fd01d53 ("drivers: move the early platform device support to arch/sh")
Link: https://lore.kernel.org/all/DH4M3DJ4P58T.1BGVAVXN71Z09@kernel.org/
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Currently, there are 12 busses (including platform and PCI) that
duplicate the driver_override logic for their individual devices.
All of them seem to be prone to the bug described in [1].
While this could be solved for every bus individually using a separate
lock, solving this in the driver-core generically results in less (and
cleaner) changes overall.
Thus, move driver_override to struct device, provide corresponding
accessors for busses and handle locking with a separate lock internally.
In particular, add device_set_driver_override(),
device_has_driver_override(), device_match_driver_override() and
generalize the sysfs store() and show() callbacks via a driver_override
feature flag in struct bus_type.
Until all busses have migrated, keep driver_set_override() in place.
Note that we can't use the device lock for the reasons described in [2].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220789 [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [2]
Tested-by: Gui-Dong Han <hanguidong02@gmail.com>
Co-developed-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260303115720.48783-2-dakr@kernel.org
[ Use dev->bus instead of sp->bus for consistency; fix commit message to
refer to the struct bus_type's driver_override feature flag. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
In the error path for efa_com_alloc_comp_ctx() the semaphore assigned to
&aq->avail_cmds is not released.
Detected by Smatch:
drivers/infiniband/hw/efa/efa_com.c:662 efa_com_cmd_exec() warn:
inconsistent returns '&aq->avail_cmds'
Add release for &aq->avail_cmds in efa_com_alloc_comp_ctx() error path.
Fixes: ef3b06742c ("RDMA/efa: Fix use of completion ctx after free")
Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260314045730.1143862-1-ethantidmore06@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When IOVA-based DMA mapping is unavailable (e.g., IOMMU
passthrough mode), rdma_rw_ctx_init_bvec() falls back to
checking rdma_rw_io_needs_mr() with the raw bvec count.
Unlike the scatterlist path in rdma_rw_ctx_init(), which
passes a post-DMA-mapping entry count that reflects
coalescing of physically contiguous pages, the bvec path
passes the pre-mapping page count. This overstates the
number of DMA entries, causing every multi-bvec RDMA READ
to consume an MR from the QP's pool.
Under NFS WRITE workloads the server performs RDMA READs
to pull data from the client. With the inflated MR demand,
the pool is rapidly exhausted, ib_mr_pool_get() returns
NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma
treats this as a DMA mapping failure, closes the connection,
and the client reconnects -- producing a cycle of 71% RPC
retransmissions and ~100 reconnections per test run. RDMA
WRITEs (NFS READ direction) are unaffected because
DMA_TO_DEVICE never triggers the max_sgl_rd check.
Remove the rdma_rw_io_needs_mr() gate from the bvec path
entirely, so that bvec RDMA operations always use the
map_wrs path (direct WR posting without MR allocation).
The bvec caller has no post-DMA-coalescing segment count
available -- xdr_buf and svc_rqst hold pages as individual
pointers, and physical contiguity is discovered only during
DMA mapping -- so the raw page count cannot serve as a
reliable input to rdma_rw_io_needs_mr(). iWARP devices,
which require MRs unconditionally, are handled by an
earlier check in rdma_rw_ctx_init_bvec() and are unaffected.
Fixes: bea28ac14c ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-3-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. A 1 MB transfer
yields 256 DMA entries. If that count exceeds the device's
max_sgl_rd threshold (an optimization hint from mlx5 firmware),
rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.
Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
Fixes: 00bd1439f4 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add NULL pointer checks for dev->type before accessing
dev->type->name in ISP genpd add/remove functions to
prevent kernel crashes.
This regression was introduced in v7.0 as the wakeup sources
are registered using physical device instead of ACPI device.
This led to adding wakeup source device as the first child of
AMDGPU device without initializing dev-type variable, and
resulted in segfault when accessed it in the amdgpu isp driver.
Fixes: 057edc58aa ("ACPI: PM: Register wakeup sources under physical devices")
Suggested-by: Bin Du <Bin.Du@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c51632d1ed7ac5aed2d40dbc0718d75342c12c6a)
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Cc: Benjamin Cheng <benjamin.cheng@amd.com>
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e14d468304832bcc4a082d95849bc0a41b18ddea)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dea5f235baf3786bfd4fd920b03c19285fdc3d9f)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 04f063d85090f5dd0c671010ce88ee49d9dcc8ed)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f14f27bbe2a3ed7af32d5f6eaf3f417139f45253)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1441f52c7f6ae6553664aa9e3e4562f6fc2fe8ea)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5f76083183363c4528a4aaa593f5d38c28fe7d7b)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 89cd90375c19fb45138990b70e9f4ba4806f05c4)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e064cef4b53552602bb6ac90399c18f662f3cacd)
Cc: stable@vger.kernel.org
The ASICREV_IS_BEIGE_GOBY_P check always took precedence, because it includes all chip revisions upto NV_UNKNOWN.
Fixes: 54b822b3ea ("drm/amd/display: Use dce_version instead of chip_id")
Signed-off-by: Andy Nguyen <theofficialflow1996@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9c7be0efa6f0daa949a5f3e3fdf9ea090b0713cb)
parse_edid_displayid_vrr() searches the EDID extension blocks for a
DisplayID extension before parsing the dynamic video timing range.
The code previously checked whether edid_ext was NULL after the search
loop. However, edid_ext is assigned during each iteration of the loop,
so it will never be NULL once the loop has executed. If no DisplayID
extension is found, edid_ext ends up pointing to the last extension
block, and the NULL check does not correctly detect the failure case.
Instead, check whether the loop completed without finding a matching
DisplayID block by testing "i == edid->extensions". This ensures the
function exits early when no DisplayID extension is present and avoids
parsing an unrelated EDID extension block.
Also simplify the EDID validation check using "!edid ||
!edid->extensions".
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:13079 parse_edid_displayid_vrr() warn: variable dereferenced before check 'edid_ext' (see line 13075)
Fixes: a638b837d0 ("drm/amd/display: Fix refresh rate range for some panel")
Cc: Roman Li <roman.li@amd.com>
Cc: Alex Hung <alex.hung@amd.com>
Cc: Jerry Zuo <jerry.zuo@amd.com>
Cc: Sun peng Li <sunpeng.li@amd.com>
Cc: Tom Chung <chiahsuan.chung@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 91c7e6342e98c846b259c57273436fdea4c043f2)
[Why]
The dcn32_override_min_req_memclk function is in dcn32_fpu.c, which is
compiled with CC_FLAGS_FPU into FP instructions. So when we call it we
must use DC_FP_{START,END} to save and restore the FP context, and
prepare the FP unit on architectures like LoongArch where the FP unit
isn't always on.
Reported-by: LiarOnce <liaronce@hotmail.com>
Fixes: ee7be8f3de ("drm/amd/display: Limit DCN32 8 channel or less parts to DPM1 for FPO")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 25bb1d54ba3983c064361033a8ec15474fece37e)
Cc: stable@vger.kernel.org
Commit e1b385726f ("drm/amd/display: Add additional checks for PSP
footer size") introduced a use of an uninitialized stack variable
in dm_dmub_sw_init() (region_params.bss_data_size).
Interestingly, this seems to cause no issue on normal kernels. But when
full LTO is enabled, it causes the compiler to "optimize" out huge
swaths of amdgpu initialization code, and the driver is unusable:
amdgpu 0000:03:00.0: [drm] Loading DMUB firmware via PSP: version=0x07002F00
amdgpu 0000:03:00.0: sw_init of IP block <dm> failed 5
amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: Fatal error during GPU init
It surprises me that neither gcc nor clang emit a warning about this: I
only found it by bisecting the LTO breakage.
Fix by using the bss_data_size field from fw_meta_info_params, as was
presumably intended.
Fixes: e1b385726f ("drm/amd/display: Add additional checks for PSP footer size")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b7f1402f6ad24cc6b9a01fa09ebd1c6559d787d0)
Userspace can pass an arbitrary number of BO list entries via the
bo_number field. Although the previous multiplication overflow check
prevents out-of-bounds allocation, a large number of entries could still
cause excessive memory allocation (up to potentially gigabytes) and
unnecessarily long list processing times.
Introduce a hard limit of 128k entries per BO list, which is more than
sufficient for any realistic use case (e.g., a single list containing all
buffers in a large scene). This prevents memory exhaustion attacks and
ensures predictable performance.
Return -EINVAL if the requested entry count exceeds the limit
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 688b87d39e0aa8135105b40dc167d74b5ada5332)
Cc: stable@vger.kernel.org
The mount_setattr_idmapped fixture mounts a 2 MB tmpfs at /mnt and then
creates a 2 GB sparse ext4 image at /mnt/C/ext4.img. While ftruncate()
succeeds (sparse file), mkfs.ext4 needs to write actual metadata blocks
(inode tables, journal, bitmaps) which easily exceeds the 2 MB tmpfs
limit, causing ENOSPC and failing the fixture setup for all
mount_setattr_idmapped tests.
This was introduced by commit d37d4720c3 ("selftests/mount_settattr:
ensure that ext4 filesystem can be created") which increased the image
size from 2 MB to 2 GB but didn't adjust the tmpfs size.
Bump the tmpfs size to 256 MB which is sufficient for the ext4 metadata.
Fixes: d37d4720c3 ("selftests/mount_settattr: ensure that ext4 filesystem can be created")
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is an additional safety layer to ensure no accesses to the GPU
registers can be made while it is powered off.
While we can disable IRQ generation from GPU, META firmware, MIPS
firmware and for safety events, we cannot do the same for the RISC-V
firmware.
To keep a unified approach, once the firmware has completed its power
off sequence, disable IRQs for the while GPU at the kernel level
instead.
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
Reviewed-by: Matt Coster <matt.coster@imgtec.com>
Link: https://patch.msgid.link/20260310-drain-irqs-before-suspend-v1-2-bf4f9ed68e75@imgtec.com
Signed-off-by: Matt Coster <matt.coster@imgtec.com>
The runtime PM suspend callback doesn't know whether the IRQ handler is
in progress on a different CPU core and doesn't wait for it to finish.
Depending on timing, the IRQ handler could be running while the GPU is
suspended, leading to kernel crashes when trying to access GPU
registers. See example signature below.
In a power off sequence initiated by the runtime PM suspend callback,
wait for any IRQ handlers in progress on other CPU cores to finish, by
calling synchronize_irq().
At the same time, remove the runtime PM resume/put calls in the threaded
IRQ handler. On top of not being the right approach to begin with, and
being at the wrong place as they should have wrapped all GPU register
accesses, the driver would hit a deadlock between synchronize_irq()
being called from a runtime PM suspend callback, holding the device
power lock, and the resume callback requiring the same.
Example crash signature on a TI AM68 SK platform:
[ 337.241218] SError Interrupt on CPU0, code 0x00000000bf000000 -- SError
[ 337.241239] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G M 6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
[ 337.241246] Tainted: [M]=MACHINE_CHECK
[ 337.241249] Hardware name: Texas Instruments AM68 SK (DT)
[ 337.241252] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 337.241256] pc : pvr_riscv_irq_pending+0xc/0x24
[ 337.241277] lr : pvr_device_irq_thread_handler+0x64/0x310
[ 337.241282] sp : ffff800085b0bd30
[ 337.241284] x29: ffff800085b0bd50 x28: ffff0008070d9eab x27: ffff800083a5ce10
[ 337.241291] x26: ffff000806e48f80 x25: ffff0008070d9eac x24: 0000000000000000
[ 337.241296] x23: ffff0008068e9bf0 x22: ffff0008068e9bd0 x21: ffff800085b0bd30
[ 337.241301] x20: ffff0008070d9e00 x19: ffff0008068e9000 x18: 0000000000000001
[ 337.241305] x17: 637365645f656c70 x16: 0000000000000000 x15: ffff000b7df9ff40
[ 337.241310] x14: 0000a585fe3c0d0e x13: 000000999704f060 x12: 000000000002771a
[ 337.241314] x11: 00000000000000c0 x10: 0000000000000af0 x9 : ffff800085b0bd00
[ 337.241318] x8 : ffff0008071175d0 x7 : 000000000000b955 x6 : 0000000000000003
[ 337.241323] x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000
[ 337.241327] x2 : ffff800080e39d20 x1 : ffff800080e3fc48 x0 : 0000000000000000
[ 337.241333] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 337.241337] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G M 6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
[ 337.241342] Tainted: [M]=MACHINE_CHECK
[ 337.241343] Hardware name: Texas Instruments AM68 SK (DT)
[ 337.241345] Call trace:
[ 337.241348] show_stack+0x18/0x24 (C)
[ 337.241357] dump_stack_lvl+0x60/0x80
[ 337.241364] dump_stack+0x18/0x24
[ 337.241368] vpanic+0x124/0x2ec
[ 337.241373] abort+0x0/0x4
[ 337.241377] add_taint+0x0/0xbc
[ 337.241384] arm64_serror_panic+0x70/0x80
[ 337.241389] do_serror+0x3c/0x74
[ 337.241392] el1h_64_error_handler+0x30/0x48
[ 337.241400] el1h_64_error+0x6c/0x70
[ 337.241404] pvr_riscv_irq_pending+0xc/0x24 (P)
[ 337.241410] irq_thread_fn+0x2c/0xb0
[ 337.241416] irq_thread+0x170/0x334
[ 337.241421] kthread+0x12c/0x210
[ 337.241428] ret_from_fork+0x10/0x20
[ 337.241434] SMP: stopping secondary CPUs
[ 337.241451] Kernel Offset: disabled
[ 337.241453] CPU features: 0x040000,02002800,20002001,0400421b
[ 337.241456] Memory Limit: none
[ 337.457921] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
Fixes: cc1aeedb98 ("drm/imagination: Implement firmware infrastructure and META FW support")
Fixes: 96822d38ff ("drm/imagination: Handle Rogue safety event IRQs")
Cc: stable@vger.kernel.org # see patch description, needs adjustments for < 6.16
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
Reviewed-by: Matt Coster <matt.coster@imgtec.com>
Link: https://patch.msgid.link/20260310-drain-irqs-before-suspend-v1-1-bf4f9ed68e75@imgtec.com
Signed-off-by: Matt Coster <matt.coster@imgtec.com>
For file systems implementing ->sync_lazytime, I_DIRTY_TIME fails to get
cleared in sync_lazytime, and might cause additional calls to
sync_lazytime during inode deactivation. Use the same pattern as in
__mark_inode_dirty to clear the flag under the inode lock.
Fixes: 5cf06ea56e ("fs: add a ->sync_lazytime method")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260317134409.1691317-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Previously, commit 8388f7df93 ("iommu/amd: Do not support
IOMMU_DOMAIN_IDENTITY after SNP is enabled") prevented users from
changing the IOMMU domain to identity if SNP was enabled.
This resulted in an error when writing to sysfs:
# echo "identity" > /sys/kernel/iommu_groups/50/type
-bash: echo: write error: Cannot allocate memory
However, commit 4402f2627d ("iommu/amd: Implement global identity
domain") changed the flow of the code, skipping the SNP guard and
allowing users to change the IOMMU domain to identity after a machine
has booted.
Once the user does that, they will probably try to bind and the
device/driver will start to do DMA which will trigger errors:
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0: probe with driver bnxt_en failed with error -16
To prevent this from happening, create an attach wrapper for
identity_domain_ops which returns EINVAL if amd_iommu_snp_en is true.
With this commit applied:
# echo "identity" > /sys/kernel/iommu_groups/62/type
-bash: echo: write error: Invalid argument
Fixes: 4402f2627d ("iommu/amd: Implement global identity domain")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
domain->mm->iommu_mm can be freed by iommu_domain_free():
iommu_domain_free()
mmdrop()
__mmdrop()
mm_pasid_drop()
After iommu_domain_free() returns, accessing domain->mm->iommu_mm may
dereference a freed mm structure, leading to a crash.
Fix this by moving the code that accesses domain->mm->iommu_mm to before
the call to iommu_domain_free().
Fixes: e37d5a2d60 ("iommu/sva: invalidate stale IOTLB entries for kernel address space")
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Avoid kernel-doc warnings in io-pgtable.h:
- use the correct struct member names or kernel-doc format
- add a missing struct member description
- add a missing function return comment section
Warning: include/linux/io-pgtable.h:187 struct member 'coherent_walk' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_lpae_s1_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_lpae_s2_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_v7s_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_mali_lpae_cfg'
not described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'apple_dart_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'amd' not described
in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:223 struct member
'read_and_clear_dirty' not described in 'io_pgtable_ops'
Warning: include/linux/io-pgtable.h:237 No description found for return
value of 'alloc_io_pgtable_ops'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
syzbot reports "task hung in rpm_resume"
This is caused by aqc111_suspend calling
the PM variant of its write_cmd routine.
The simplified call trace looks like this:
rpm_suspend()
usb_suspend_both() - here udev->dev.power.runtime_status == RPM_SUSPENDING
aqc111_suspend() - called for the usb device interface
aqc111_write32_cmd()
usb_autopm_get_interface()
pm_runtime_resume_and_get()
rpm_resume() - here we call rpm_resume() on our parent
rpm_resume() - Here we wait for a status change that will never happen.
At this point we block another task which holds
rtnl_lock and locks up the whole networking stack.
Fix this by replacing the write_cmd calls with their _nopm variants
Reported-by: syzbot+48dc1e8dfc92faf1124c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=48dc1e8dfc92faf1124c
Fixes: e58ba4544c ("net: usb: aqc111: Add support for wake on LAN by MAGIC packet")
Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com>
Link: https://patch.msgid.link/20260313141643.1181386-1-zlatistiv@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 789a5913b2 ("iommu/amd: Use the generic iommu page table")
introduces the shared iommu page table for AMD IOMMU. Some bioses
contain an identity mapping for address 0x0, which is not parsed
properly (e.g., certain Strix Halo devices). This causes the DMA
components of the device to fail to initialize (e.g., the NVMe SSD
controller), leading to a failed post.
Specifically, on the GPD Win 5, the NVME and SSD GPU fail to mount,
making collecting errors difficult. While debugging, it was found that
a -EADDRINUSE error was emitted and its source was traced to
iommu_iova_to_phys(). After adding some debug prints, it was found that
phys_addr becomes 0, which causes the code to try to re-map the 0
address and fail, causing a cascade leading to a failed post. This is
because the GPD Win 5 contains a 0x0-0x1 identity mapping for DMA
devices, causing it to be repeated for each device.
The cause of this failure is the following check in
iommu_create_device_direct_mappings(), where address aliasing is handled
via the following check:
```
phys_addr = iommu_iova_to_phys(domain, addr);
if (!phys_addr) {
map_size += pg_size;
continue;
}
````
Obviously, the iommu_iova_to_phys() signature is faulty and aliases
unmapped and 0 together, causing the allocation code to try to
re-allocate the 0 address per device. However, it has too many
instantiations to fix. Therefore, use a ternary so that when addr
is 0, the check is done for address 1 instead.
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Fixes: 789a5913b2 ("iommu/amd: Use the generic iommu page table")
Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In intel_svm_set_dev_pasid(), the driver unconditionally manages the IOPF
handling during a domain transition. However, commit a86fb77173
("iommu/vt-d: Allow SVA with device-specific IOPF") introduced support for
SVA on devices that handle page faults internally without utilizing the
PCI PRI. On such devices, the IOMMU-side IOPF infrastructure is not
required. Calling iopf_for_domain_replace() on these devices is incorrect
and can lead to unexpected failures during PASID attachment or unwinding.
Add a check for info->pri_supported to ensure that the IOPF queue logic
is only invoked for devices that actually rely on the IOMMU's PRI-based
fault handling.
Fixes: 17fce9d233 ("iommu/vt-d: Put iopf enablement in domain attach path")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20260310075520.295104-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
During the qi_check_fault process after an IOMMU ITE event, requests at
odd-numbered positions in the queue are set to QI_ABORT, only satisfying
single-request submissions. However, qi_submit_sync now supports multiple
simultaneous submissions, and can't guarantee that the wait_desc will be
at an odd-numbered position. Therefore, if an item times out, IOMMU can't
re-initiate the request, resulting in an infinite polling wait.
This modifies the process by setting the status of all requests already
fetched by IOMMU and recorded as QI_IN_USE status (including wait_desc
requests) to QI_ABORT, thus enabling multiple requests to be resubmitted.
Fixes: 8a1d824625 ("iommu/vt-d: Multiple descriptors per qi_submit_sync()")
Cc: stable@vger.kernel.org
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://lore.kernel.org/r/20260306101516.3885775-1-guanghuifeng@linux.alibaba.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes: 8a1d824625 ("iommu/vt-d: Multiple descriptors per qi_submit_sync()")
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Fix a use-after-free in the clsact qdisc upon init/destroy rollback asymmetry.
The latter is achieved by first fully initializing a clsact instance, and
then in a second step having a replacement failure for the new clsact qdisc
instance. clsact_init() initializes ingress first and then takes care of the
egress part. This can fail midway, for example, via tcf_block_get_ext(). Upon
failure, the kernel will trigger the clsact_destroy() callback.
Commit 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry") details the
way how the transition is happening. If tcf_block_get_ext on the q->ingress_block
ends up failing, we took the tcx_miniq_inc reference count on the ingress
side, but not yet on the egress side. clsact_destroy() tests whether the
{ingress,egress}_entry was non-NULL. However, even in midway failure on the
replacement, both are in fact non-NULL with a valid egress_entry from the
previous clsact instance.
What we really need to test for is whether the qdisc instance-specific ingress
or egress side previously got initialized. This adds a small helper for checking
the miniq initialization called mini_qdisc_pair_inited, and utilizes that upon
clsact_destroy() in order to fix the use-after-free scenario. Convert the
ingress_destroy() side as well so both are consistent to each other.
Fixes: 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260313065531.98639-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A single AXIDMA controller can have one or two channels. When it has two
channels, the reset for both are tied together: resetting one channel
resets the other as well. This creates a problem where resetting one
channel will reset the registers for both channels, including clearing
interrupt enable bits for the other channel, which can then lead to
timeouts as the driver is waiting for an interrupt which never comes.
The driver currently has a probe-time work around for this: when a
channel is created, the driver also resets and enables the
interrupts. With two channels the reset for the second channel will
clear the interrupt enables for the first one. The work around in the
driver is just to manually enable the interrupts again in
xilinx_dma_alloc_chan_resources().
This workaround only addresses the probe-time issue. When channels are
reset at runtime (e.g., in xilinx_dma_terminate_all() or during error
recovery), there's no corresponding mechanism to restore the other
channel's interrupt enables. This leads to one channel having its
interrupts disabled while the driver expects them to work, causing
timeouts and DMA failures.
A proper fix is a complicated matter, as we should not reset the other
channel when it's operating normally. So, perhaps, there should be some
kind of synchronization for a common reset, which is not trivial to
implement. To add to the complexity, the driver also supports other DMA
types, like VDMA, CDMA and MCDMA, which don't have a shared reset.
However, when the two-channel AXIDMA is used in the (assumably) normal
use case, providing DMA for a single memory-to-memory device, the common
reset is a bit smaller issue: when something bad happens on one channel,
or when one channel is terminated, the assumption is that we also want
to terminate the other channel. And thus resetting both at the same time
is "ok".
With that line of thinking we can implement a bit better work around
than just the current probe time work around: let's enable the
AXIDMA interrupts at xilinx_dma_start_transfer() instead.
This ensures interrupts are enabled whenever a transfer starts,
regardless of any prior resets that may have cleared them.
This approach is also more logical: enable interrupts only when needed
for a transfer, rather than at resource allocation time, and, I think,
all the other DMA types should also use this model, but I'm reluctant to
do such changes as I cannot test them.
The reset function still enables interrupts even though it's not needed
for AXIDMA anymore, but it's common code for all DMA types (VDMA, CDMA,
MCDMA), so leave it unchanged to avoid affecting other variants.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Fixes: c0bba3a99f ("dmaengine: vdma: Add Support for Xilinx AXI Direct Memory Access Engine")
Link: https://patch.msgid.link/20260311-xilinx-dma-fix-v2-1-a725abb66e3c@ideasonboard.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The segment .control and .status fields both contain top bits which are
not part of the buffer size, the buffer size is located only in the bottom
max_buffer_len bits. To avoid interference from those top bits, mask out
the size using max_buffer_len first, and only then subtract the values.
Fixes: a575d0b4e6 ("dmaengine: xilinx_dma: Introduce xilinx_dma_get_residue")
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260316222530.163815-1-marex@nabladev.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The cyclic DMA calculation is currently entirely broken and reports
residue only for the first segment. The problem is twofold.
First, when the first descriptor finishes, it is moved from active_list
to done_list, but it is never returned back into the active_list. The
xilinx_dma_tx_status() expects the descriptor to be in the active_list
to report any meaningful residue information, which never happens after
the first descriptor finishes. Fix this up in xilinx_dma_start_transfer()
and if the descriptor is cyclic, lift it from done_list and place it back
into active_list list.
Second, the segment .status fields of the descriptor remain dirty. Once
the DMA did one pass on the descriptor, the .status fields are populated
with data by the DMA, but the .status fields are not cleared before reuse
during the next cyclic DMA round. The xilinx_dma_get_residue() recognizes
that as if the descriptor was complete and had 0 residue, which is bogus.
Reinitialize the status field before placing the descriptor back into the
active_list.
Fixes: c0bba3a99f ("dmaengine: vdma: Add Support for Xilinx AXI Direct Memory Access Engine")
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260316221943.160375-1-marex@nabladev.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
iptfs_clone_state() stores x->mode_data before allocating the reorder
window. If that allocation fails, the code frees the cloned state and
returns -ENOMEM, leaving x->mode_data pointing at freed memory.
The xfrm clone unwind later runs destroy_state() through x->mode_data,
so the failed clone path tears down IPTFS state that clone_state()
already freed.
Keep the cloned IPTFS state private until all allocations succeed so
failed clones leave x->mode_data unset. The destroy path already
handles a NULL mode_data pointer.
Fixes: 6be02e3e4f ("xfrm: iptfs: handle reordering of received packets")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item, which can be triggered:
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
btrfs_fill_super fs/btrfs/super.c:987 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
vfs_get_tree+0x9a/0x370 fs/super.c:1758
fc_mount fs/namespace.c:1199 [inline]
do_new_mount_fc fs/namespace.c:3642 [inline]
do_new_mount fs/namespace.c:3718 [inline]
path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
do_mount fs/namespace.c:4041 [inline]
__do_sys_mount fs/namespace.c:4229 [inline]
__se_sys_mount fs/namespace.c:4206 [inline]
__x64_sys_mount+0x282/0x320 fs/namespace.c:4206
...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---
The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.
[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.
Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.
When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().
[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.
After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a potential use-after-free in move_existing_remap(): we're calling
btrfs_put_block_group() on dest_bg, then passing it to
btrfs_add_block_group_free_space() a few lines later.
Fix this by getting the BG at the start of the function and putting it
near the end. This also means we're not doing a lookup twice for the
same thing.
Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125123908.2096548-1-clm@meta.com/
Fixes: bbea42dfb9 ("btrfs: move existing remaps before relocating block group")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it
crashes with the following ASSERT() triggered:
BTRFS info (device dm-3): use lzo compression, level 1
assertion failed: folio_size(fi.folio) == sectorsize :: 0, in lzo.c:450
------------[ cut here ]------------
kernel BUG at lzo.c:450!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 329 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : lzo_decompress_bio+0x61c/0x630 [btrfs]
lr : lzo_decompress_bio+0x61c/0x630 [btrfs]
Call trace:
lzo_decompress_bio+0x61c/0x630 [btrfs] (P)
end_bbio_compressed_read+0x2a8/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
Code: 912a2021 b0000e00 91246000 940244e9 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 37cc07cab7 ("btrfs: lzo: use folio_iter to handle
lzo_decompress_bio()") added the ASSERT() to make sure the folio size
matches the fs block size.
But the check is completely wrong, the original intention is to make
sure for bs > ps cases, we always got a large folio that covers a full fs
block.
However for bs < ps cases, a folio can never be smaller than page size,
and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never
be smaller than PAGE_SIZE, and still cover bs > ps cases.
Fixes: 37cc07cab7 ("btrfs: lzo: use folio_iter to handle lzo_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it
crashes with the following ASSERT() triggered:
assertion failed: folio_size(fi.folio) == blocksize :: 0, in fs/btrfs/zstd.c:603
------------[ cut here ]------------
kernel BUG at fs/btrfs/zstd.c:603!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 1183 Comm: kworker/u35:4 Not tainted 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : zstd_decompress_bio+0x4f0/0x508 [btrfs]
lr : zstd_decompress_bio+0x4f0/0x508 [btrfs]
Call trace:
zstd_decompress_bio+0x4f0/0x508 [btrfs] (P)
end_bbio_compressed_read+0x260/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 1914b94231 ("btrfs: zstd: use folio_iter to handle
zstd_decompress_bio()") added the ASSERT() to make sure the folio size
matches the fs block size.
But the check is completely wrong, the original intention is to make
sure for bs > ps cases, we always got a large folio that covers a full fs
block.
However for bs < ps cases, a folio can never be smaller than page size,
and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never
be smaller than PAGE_SIZE, and still cover bs > ps cases.
Fixes: 1914b94231 ("btrfs: zstd: use folio_iter to handle zstd_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284, the following ASSERT() will be triggered with
64K page size and 4K fs block size:
assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476
------------[ cut here ]------------
kernel BUG at subpage.c:476!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 2313 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
lr : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
Call trace:
btrfs_subpage_clear_writeback+0x148/0x160 [btrfs] (P)
btrfs_folio_clamp_clear_writeback+0xb4/0xd0 [btrfs]
end_compressed_writeback+0xe0/0x1e0 [btrfs]
end_bbio_compressed_write+0x1e8/0x218 [btrfs]
btrfs_bio_end_io+0x108/0x258 [btrfs]
simple_end_io_work+0x68/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
The offending bio is from an encoded write, where the compressed data is
directly written as a data extent, without touching the page cache.
However the encoded write still utilizes the regular buffered write path
for compressed data, by setting the compressed_bio::writeback flag.
When that flag is set, at end_bbio_compressed_write() btrfs will go
clearing the writeback flag of the folios in the page cache.
However for bs < ps cases, the subpage helper has one extra check to make
sure the folio has a writeback flag set in the first place.
But since it's an encoded write, we never go through page
cache, thus the folio has no writeback flag and triggers the ASSERT().
[FIX]
Do not set compressed_bio::writeback flag for encoded writes, and change
the ASSERT() in btrfs_submit_compressed_write() to make sure that flag
is not set.
Fixes: e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, the
following ASSERT() can be triggered:
assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
------------[ cut here ]------------
kernel BUG at inode.c:9991!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 5 UID: 0 PID: 6787 Comm: btrfs Tainted: G OE 6.19.0-rc8-custom+ #1 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
pc : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
lr : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
Call trace:
btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs] (P)
btrfs_do_write_iter+0x1d8/0x208 [btrfs]
btrfs_ioctl_encoded_write+0x3c8/0x6d0 [btrfs]
btrfs_ioctl+0xeb0/0x2b60 [btrfs]
__arm64_sys_ioctl+0xac/0x110
invoke_syscall.constprop.0+0x64/0xe8
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x1b8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1a4/0x1a8
Code: 91180021 90001080 9111a000 94039d54 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
After commit e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage
for encoded writes"), the encoded write is changed to copy the content
from the iov into a folio, and queue the folio into the compressed bio.
However we always queue the full folio into the compressed bio, which
can make the compressed bio larger than the on-disk extent, if the folio
size is larger than the fs block size.
Although we have an ASSERT() to catch such problem, for kernels without
CONFIG_BTRFS_ASSERT, such larger than expected bio will just be
submitted, possibly overwrite the next data extent, causing data
corruption.
[FIX]
Instead of blindly queuing the full folio into the compressed bio, only
queue the rounded up range, which is the old behavior before that
offending commit.
This also means we no longer need to zero the tailing range until the
folio end (but still to the block boundary), as such range will not be
submitted anyway.
And since we're here, add a final ASSERT() into
btrfs_submit_compressed_write() as the last safety net for kernels with
btrfs assertions enabled
Fixes: e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently our qgroup ioctls don't reserve any space, they just do a
transaction join, which does not reserve any space, neither for the quota
tree updates nor for the delayed refs generated when updating the quota
tree. The quota root uses the global block reserve, which is fine most of
the time since we don't expect a lot of updates to the quota root, or to
be too close to -ENOSPC such that other critical metadata updates need to
resort to the global reserve.
However this is not optimal, as not reserving proper space may result in a
transaction abort due to not reserving space for delayed refs and then
abusing the use of the global block reserve.
For example, the following reproducer (which is unlikely to model any
real world use case, but just to illustrate the problem), triggers such a
transaction abort due to -ENOSPC when running delayed refs:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
# Limit device to 1G so that it's much faster to reproduce the issue.
mkfs.btrfs -f -b 1G $DEV
mount -o commit=600 $DEV $MNT
fallocate -l 800M $MNT/filler
btrfs quota enable $MNT
for ((i = 1; i <= 400000; i++)); do
btrfs qgroup create 1/$i $MNT
done
umount $MNT
When running this, we can see in dmesg/syslog that a transaction abort
happened:
[436.490] BTRFS error (device nullb0): failed to run delayed ref for logical 30408704 num_bytes 16384 type 176 action 1 ref_mod 1: -28
[436.493] ------------[ cut here ]------------
[436.494] BTRFS: Transaction aborted (error -28)
[436.495] WARNING: fs/btrfs/extent-tree.c:2247 at btrfs_run_delayed_refs+0xd9/0x110 [btrfs], CPU#4: umount/2495372
[436.497] Modules linked in: btrfs loop (...)
[436.508] CPU: 4 UID: 0 PID: 2495372 Comm: umount Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[436.510] Tainted: [W]=WARN
[436.511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[436.513] RIP: 0010:btrfs_run_delayed_refs+0xdf/0x110 [btrfs]
[436.514] Code: 0f 82 ea (...)
[436.518] RSP: 0018:ffffd511850b7d78 EFLAGS: 00010292
[436.519] RAX: 00000000ffffffe4 RBX: ffff8f120dad37e0 RCX: 0000000002040001
[436.520] RDX: 0000000000000002 RSI: 00000000ffffffe4 RDI: ffffffffc090fd80
[436.522] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffc04d1867
[436.523] R10: ffff8f18dc1fffa8 R11: 0000000000000003 R12: ffff8f173aa89400
[436.524] R13: 0000000000000000 R14: ffff8f173aa89400 R15: 0000000000000000
[436.526] FS: 00007fe59045d840(0000) GS:ffff8f192e22e000(0000) knlGS:0000000000000000
[436.527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436.528] CR2: 00007fe5905ff2b0 CR3: 000000060710a002 CR4: 0000000000370ef0
[436.530] Call Trace:
[436.530] <TASK>
[436.530] btrfs_commit_transaction+0x73/0xc00 [btrfs]
[436.531] ? btrfs_attach_transaction_barrier+0x1e/0x70 [btrfs]
[436.532] sync_filesystem+0x7a/0x90
[436.533] generic_shutdown_super+0x28/0x180
[436.533] kill_anon_super+0x12/0x40
[436.534] btrfs_kill_super+0x12/0x20 [btrfs]
[436.534] deactivate_locked_super+0x2f/0xb0
[436.534] cleanup_mnt+0xea/0x180
[436.535] task_work_run+0x58/0xa0
[436.535] exit_to_user_mode_loop+0xed/0x480
[436.536] ? __x64_sys_umount+0x68/0x80
[436.536] do_syscall_64+0x2a5/0xf20
[436.537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[436.537] RIP: 0033:0x7fe5906b6217
[436.538] Code: 0d 00 f7 (...)
[436.540] RSP: 002b:00007ffcd87a61f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[436.541] RAX: 0000000000000000 RBX: 00005618b9ecadc8 RCX: 00007fe5906b6217
[436.541] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005618b9ecb100
[436.542] RBP: 0000000000000000 R08: 00007ffcd87a4fe0 R09: 00000000ffffffff
[436.544] R10: 0000000000000103 R11: 0000000000000246 R12: 00007fe59081626c
[436.544] R13: 00005618b9ecb100 R14: 0000000000000000 R15: 00005618b9ecacc0
[436.545] </TASK>
[436.545] ---[ end trace 0000000000000000 ]---
Fix this by changing the qgroup ioctls to use start transaction instead of
joining so that proper space is reserved for the delayed refs generated
for the updates to the quota root. This way we don't get any transaction
abort.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both rz_dmac_disable_hw() and rz_dmac_irq_handle_channel() update the
CHCTRL register. To avoid concurrency issues when configuring
functionalities exposed by this registers, take the virtual channel lock.
All other CHCTRL updates were already protected by the same lock.
Previously, rz_dmac_disable_hw() disabled and re-enabled local IRQs, before
accessing CHCTRL registers but this does not ensure race-free access.
Remove the local IRQ disable/enable code as well.
Fixes: 5000d37042 ("dmaengine: sh: Add DMAC driver for RZ/G2L SoC")
Cc: stable@vger.kernel.org
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Link: https://patch.msgid.link/20260316133252.240348-3-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The driver lists (ld_free, ld_queue) are used in
rz_dmac_free_chan_resources(), rz_dmac_terminate_all(),
rz_dmac_issue_pending(), and rz_dmac_irq_handler_thread(), all under
the virtual channel lock. Take the same lock in rz_dmac_prep_slave_sg()
and rz_dmac_prep_dma_memcpy() as well to avoid concurrency issues, since
these functions also check whether the lists are empty and update or
remove list entries.
Fixes: 5000d37042 ("dmaengine: sh: Add DMAC driver for RZ/G2L SoC")
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Link: https://patch.msgid.link/20260316133252.240348-2-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
It is possible for a malicious (or clumsy) device to respond to a
specific report's feature request using a completely different report
ID. This can cause confusion in the HID core resulting in nasty
side-effects such as OOB writes.
Add a check to ensure that the report ID in the response, matches the
one that was requested. If it doesn't, omit reporting the raw event and
return early.
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
btrfs_extent_root() can return a NULL pointer in case the root we are
looking for is not in the rb tree that tracks roots. So add checks to
every caller that is missing such check to log a message and return
an error. The same applies to callers of btrfs_block_group_root(),
since it calls btrfs_extent_root().
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PSR entry_setup_frames is currently computed directly into struct
intel_dp:intel_psr:entry_setup_frames. This causes a problem if mode change
gets rejected after PSR compute config: Psr_entry_setup_frames computed for
this rejected state is in intel_dp:intel_psr:entry_setup_frame. Fix this by
computing it into intel_crtc_state and copy the value into
intel_dp:intel_psr:entry_setup_frames on PSR enable.
Fixes: 2b981d57e4 ("drm/i915/display: Support PSR entry VSC packet to be transmitted one frame earlier")
Cc: Mika Kahola <mika.kahola@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Link: https://patch.msgid.link/20260312083710.1593781-3-jouni.hogander@intel.com
(cherry picked from commit 8c229b4aa00262c13787982e998c61c0783285e0)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Similar to commit d8dc872046 ("ALSA: firewire-lib: fix uninitialized
local variable"), the local variable `curr_cycle_time` in
process_rx_packets() is declared without initialization.
When the tracepoint event is not probed, the variable may appear to be
used without being initialized. In practice the value is only relevant
when the tracepoint is enabled, however initializing it avoids potential
use of an uninitialized value and improves code safety.
Initialize `curr_cycle_time` to zero.
Fixes: fef4e61b0b ("ALSA: firewire-lib: extend tracepoints event including CYCLE_TIME of 1394 OHCI")
Cc: stable@vger.kernel.org
Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru>
Link: https://patch.msgid.link/20260316191824.83249-1-sdl@nppct.ru
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Fixing a missing of_node_put() call.
* tag 'v7.0-rockchip-drvfixes1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
soc: rockchip: grf: Add missing of_node_put() when returning
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
On some systems (e.g. iMac 20,1 with BCM57766), the tg3 driver reads
a default placeholder mac address (00:10:18:00:00:00) from the
mailbox. The correct value on those systems are stored in the
'local-mac-address' property.
This patch, detect the default value and tries to retrieve
the correct address from the device_get_mac_address
function instead.
The patch has been tested on two different systems:
- iMac 20,1 (BCM57766) model which use the local-mac-address property
- iMac 13,2 (BCM57766) model which can use the mailbox,
NVRAM or MAC control registers
Tested-by: Rishon Jonathan R <mithicalaviator85@gmail.com>
Co-developed-by: Vincent MORVAN <vinc@42.fr>
Signed-off-by: Vincent MORVAN <vinc@42.fr>
Signed-off-by: Paul SAGE <paul.sage@42.fr>
Signed-off-by: Atharva Tiwari <atharvatiwarilinuxdev@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260314215432.3589-1-atharvatiwarilinuxdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tobgaertner says:
====================
net: usb: cdc_ncm: add ndpoffset to NDP nframes bounds check
The nframes bounds check in cdc_ncm_rx_verify_ndp16() and
cdc_ncm_rx_verify_ndp32() does not account for ndpoffset,
allowing out-of-bounds reads when the NDP is placed near the
end of the NTB.
====================
Link: https://patch.msgid.link/20260314054640.2895026-1-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The same bounds-check bug fixed for NDP16 in the previous patch also
exists in cdc_ncm_rx_verify_ndp32(). The DPE array size is validated
against the total skb length without accounting for ndpoffset, allowing
out-of-bounds reads when the NDP32 is placed near the end of the NTB.
Add ndpoffset to the nframes bounds check and use struct_size_t() to
express the NDP-plus-DPE-array size more clearly.
Compile-tested only.
Fixes: 0fa81b304a ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
Signed-off-by: Tobi Gaertner <tob.gaertner@me.com>
Link: https://patch.msgid.link/20260314054640.2895026-3-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cdc_ncm_rx_verify_ndp16() validates that the NDP header and its DPE
entries fit within the skb. The first check correctly accounts for
ndpoffset:
if ((ndpoffset + sizeof(struct usb_cdc_ncm_ndp16)) > skb_in->len)
but the second check omits it:
if ((sizeof(struct usb_cdc_ncm_ndp16) +
ret * (sizeof(struct usb_cdc_ncm_dpe16))) > skb_in->len)
This validates the DPE array size against the total skb length as if
the NDP were at offset 0, rather than at ndpoffset. When the NDP is
placed near the end of the NTB (large wNdpIndex), the DPE entries can
extend past the skb data buffer even though the check passes.
cdc_ncm_rx_fixup() then reads out-of-bounds memory when iterating
the DPE array.
Add ndpoffset to the nframes bounds check and use struct_size_t() to
express the NDP-plus-DPE-array size more clearly.
Fixes: ff06ab13a4 ("net: cdc_ncm: splitting rx_fixup for code reuse")
Signed-off-by: Tobi Gaertner <tob.gaertner@me.com>
Link: https://patch.msgid.link/20260314054640.2895026-2-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not run airoha_dev_stop routine explicitly in airoha_remove()
since ndo_stop() callback is already executed by unregister_netdev() in
__dev_close_many routine if necessary and, doing so, we will end up causing
an underflow in the qdma users atomic counters. Rely on networking subsystem
to stop the device removing the airoha_eth module.
Fixes: 23020f0493 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260313-airoha-remove-ndo_stop-remove-net-v2-1-67542c3ceeca@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported a panic in smc_tcp_syn_recv_sock() [1].
smc_tcp_syn_recv_sock() is called in the TCP receive path
(softirq) via icsk_af_ops->syn_recv_sock on the clcsock (TCP
listening socket). It reads sk_user_data to get the smc_sock
pointer. However, when the SMC listen socket is being closed
concurrently, smc_close_active() sets clcsock->sk_user_data
to NULL under sk_callback_lock, and then the smc_sock itself
can be freed via sock_put() in smc_release().
This leads to two issues:
1) NULL pointer dereference: sk_user_data is NULL when
accessed.
2) Use-after-free: sk_user_data is read as non-NULL, but the
smc_sock is freed before its fields (e.g., queued_smc_hs,
ori_af_ops) are accessed.
The race window looks like this (the syzkaller crash [1]
triggers via the SYN cookie path: tcp_get_cookie_sock() ->
smc_tcp_syn_recv_sock(), but the normal tcp_check_req() path
has the same race):
CPU A (softirq) CPU B (process ctx)
tcp_v4_rcv()
TCP_NEW_SYN_RECV:
sk = req->rsk_listener
sock_hold(sk)
/* No lock on listener */
smc_close_active():
write_lock_bh(cb_lock)
sk_user_data = NULL
write_unlock_bh(cb_lock)
...
smc_clcsock_release()
sock_put(smc->sk) x2
-> smc_sock freed!
tcp_check_req()
smc_tcp_syn_recv_sock():
smc = user_data(sk)
-> NULL or dangling
smc->queued_smc_hs
-> crash!
Note that the clcsock and smc_sock are two independent objects
with separate refcounts. TCP stack holds a reference on the
clcsock, which keeps it alive, but this does NOT prevent the
smc_sock from being freed.
Fix this by using RCU and refcount_inc_not_zero() to safely
access smc_sock. Since smc_tcp_syn_recv_sock() is called in
the TCP three-way handshake path, taking read_lock_bh on
sk_callback_lock is too heavy and would not survive a SYN
flood attack. Using rcu_read_lock() is much more lightweight.
- Set SOCK_RCU_FREE on the SMC listen socket so that
smc_sock freeing is deferred until after the RCU grace
period. This guarantees the memory is still valid when
accessed inside rcu_read_lock().
- Use rcu_read_lock() to protect reading sk_user_data.
- Use refcount_inc_not_zero(&smc->sk.sk_refcnt) to pin the
smc_sock. If the refcount has already reached zero (close
path completed), it returns false and we bail out safely.
Note: smc_hs_congested() has a similar lockless read of
sk_user_data without rcu_read_lock(), but it only checks for
NULL and accesses the global smc_hs_wq, never dereferencing
any smc_sock field, so it is not affected.
Reproducer was verified with mdelay injection and smc_run,
the issue no longer occurs with this patch applied.
[1] https://syzkaller.appspot.com/bug?extid=827ae2bfb3a3529333e9
Fixes: 8270d9c210 ("net/smc: Limit backlog connections")
Reported-by: syzbot+827ae2bfb3a3529333e9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67eaf9b8.050a0220.3c3d88.004a.GAE@google.com/T/
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260312092909.48325-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For file-backed mount, IO requests are handled by vfs_iocb_iter_read().
However, it can be interrupted by SIGKILL, returning the number of
bytes actually copied. Unused folios in bio are unexpectedly marked
as uptodate.
vfs_read
filemap_read
filemap_get_pages
filemap_readahead
erofs_fileio_readahead
erofs_fileio_rq_submit
vfs_iocb_iter_read
filemap_read
filemap_get_pages <= detect signal
erofs_fileio_ki_complete <= set all folios uptodate
This patch addresses this by setting short read bio with an error
directly.
Fixes: bc804a8d7e ("erofs: handle end of filesystem properly for file-backed mounts")
Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Yunlei He <heyunlei@xiaomi.com>
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The file contains a spelling error in a source comment (resposne).
Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.
Replace 'resposne' with 'response' in the affected comment. This is a
comment-only cleanup and does not change behavior.
[v2: Removed Fixes: and Cc: to stable tags.]
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Check the global CXL_HDM_DECODER_ENABLE bit instead of looping over
per-decoder COMMITTED bits to determine whether to fall back to DVSEC
range emulation. When the HDM decoder capability is globally enabled,
ignore DVSEC range registers regardless of individual decoder commit
state.
should_emulate_decoders() currently loops over per-decoder COMMITTED
bits, which leads to an incorrect DVSEC fallback when those bits are
zero. One way to trigger this is to destroy a region and bounce the
memdev:
cxl disable-region region0
cxl destroy-region region0
cxl disable-memdev mem0
cxl enable-memdev mem0
Region teardown zeroes the HDM decoder registers including the committed
bits. The subsequent memdev re-probe finds uncommitted decoders and falls
back to DVSEC emulation, even though HDM remains globally enabled.
Observed failures:
should_emulate_decoders: cxl_port endpoint6: decoder6.0: committed: 0 base: 0x0_00000000 size: 0x0_00000000
devm_cxl_setup_hdm: cxl_port endpoint6: Fallback map 1 range register
..
devm_cxl_add_region: cxl_acpi ACPI0017:00: decoder0.0: created region0
__construct_region: cxl_pci 0000:e1:00.0: mem1:decoder6.0:
__construct_region region0 res: [mem 0x850000000-0x284fffffff flags 0x200] iw: 1 ig: 4096
cxl region0: pci0000:e0:port1 cxl_port_setup_targets expected iw: 1 ig: 4096 ..
cxl region0: pci0000:e0:port1 cxl_port_setup_targets got iw: 1 ig: 256 state: disabled ..
cxl_port endpoint6: failed to attach decoder6.0 to region0: -6
..
devm_cxl_add_region: cxl_acpi ACPI0017:00: decoder0.0: created region4
alloc_hpa: cxl region4: HPA allocation error (-34) ..
Fixes: 52cc48ad2a ("cxl/hdm: Limit emulation to the number of range registers")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260316201950.224567-1-Smita.KoralahalliChannabasappa@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
If OF graph is used in the PCI device node, the pwrctrl core creates a
pwrctrl device even if the remote endpoint doesn't have power supply
requirements. Since the device doesn't have any power supply requirements,
there was no pwrctrl driver to probe, leading to PCI controller driver
probe deferral as it waits for all pwrctrl drivers to probe before starting
bus scan.
This issue happens with Qcom ath12k devices with WSI interface attached to
the Qcom IPQ platforms.
Fix this issue by checking for the existence of at least one power supply
property in the remote endpoint parent node. To consolidate all the checks,
create a new helper pci_pwrctrl_is_required() and move all the checks
there.
Fixes: 9db826206f ("PCI/pwrctrl: Create pwrctrl device if graph port is found")
Reported-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Reviewed-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260223-pwrctrl-fixes-7-0-v2-1-97566dfb1809@oss.qualcomm.com
The NFSv4.0 replay cache uses a fixed 112-byte inline buffer
(rp_ibuf[NFSD4_REPLAY_ISIZE]) to store encoded operation responses.
This size was calculated based on OPEN responses and does not account
for LOCK denied responses, which include the conflicting lock owner as
a variable-length field up to 1024 bytes (NFS4_OPAQUE_LIMIT).
When a LOCK operation is denied due to a conflict with an existing lock
that has a large owner, nfsd4_encode_operation() copies the full encoded
response into the undersized replay buffer via read_bytes_from_xdr_buf()
with no bounds check. This results in a slab-out-of-bounds write of up
to 944 bytes past the end of the buffer, corrupting adjacent heap memory.
This can be triggered remotely by an unauthenticated attacker with two
cooperating NFSv4.0 clients: one sets a lock with a large owner string,
then the other requests a conflicting lock to provoke the denial.
We could fix this by increasing NFSD4_REPLAY_ISIZE to allow for a full
opaque, but that would increase the size of every stateowner, when most
lockowners are not that large.
Instead, fix this by checking the encoded response length against
NFSD4_REPLAY_ISIZE before copying into the replay buffer. If the
response is too large, set rp_buflen to 0 to skip caching the replay
payload. The status is still cached, and the client already received the
correct response on the original request.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Reported-by: Nicholas Carlini <npc@anthropic.com>
Tested-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The elf_create_file() function fails with EINVAL when the build directory
path is long enough to truncate the "XXXXXX" suffix in the 256-byte
tmp_name buffer.
Simplify the code to remove the unnecessary dirname()/basename() split
and concatenation. Instead, allocate the exact number of bytes needed for
the path.
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Link: https://patch.msgid.link/20260310203751.1479229-3-joe.lawrence@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Commit 356e4b2f5b ("objtool: Fix data alignment in elf_add_data()")
corrected the alignment of data within a section (honoring the section's
sh_addralign). Apply the same alignment when klp-diff mode clones a
symbol, adjusting the new symbol's offset for the output section's
sh_addralign.
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Link: https://patch.msgid.link/20260310203751.1479229-2-joe.lawrence@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.
The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().
The second one is when a cpuidle driver is present, but there is only
one idle state in its table. In that case, the scheduler tick is never
stopped at all.
Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic. Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes: ed98c34919 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull misc fixes from Andrew Morton:
"6 hotfixes. 4 are cc:stable. 3 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update email address for Ignat Korchagin
mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP
mm/rmap: fix incorrect pte restoration for lazyfree folios
mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()
build_bug.h: correct function parameters names in kernel-doc
crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
The controller per-cpu statistics is not allocated until after the
controller has been registered with driver core, which leaves a window
where accessing the sysfs attributes can trigger a NULL-pointer
dereference.
Fix this by moving the statistics allocation to controller allocation
while tying its lifetime to that of the controller (rather than using
implicit devres).
Fixes: 6598b91b5a ("spi: spi.c: Convert statistics to per-cpu u64_stats_t")
Cc: stable@vger.kernel.org # 6.0
Cc: David Jander <david@protonic.nl>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260312151817.32100-3-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
The initial StarFighter quirk fixed the runtime suspend pop by muting
speakers in the shutup callback before power-down. Further hardware
validation showed that the speaker path is controlled directly by LINE2
EAPD on NID 0x1b together with GPIO2 for the external amplifier.
Replace the shutup-delay workaround with explicit sequencing of those
controls at playback start and stop:
- assert LINE2 EAPD and drive GPIO2 high on PREPARE
- deassert LINE2 EAPD and drive GPIO2 low on CLEANUP
This avoids the runtime suspend pop without a sleep, and also fixes pops
around G3 entry and display-manager start that the original workaround
did not cover.
Fixes: 1cb3c20688 ("ALSA: hda/realtek: Fix speaker pop on Star Labs StarFighter")
Tested-by: Sean Rhodes <sean@starlabs.systems>
Signed-off-by: Sean Rhodes <sean@starlabs.systems>
Link: https://patch.msgid.link/20260315201127.33744-1-sean@starlabs.systems
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The recent XFER_TO_GUEST_WORK change resulted in a situation, where the
vsie code would interpret a signal during work as a machine check during
SIE as both use the EINTR return code.
The exit_reason of the sie64a function has nothing to do with the
kvm_run exit_reason. Rename it and define a specific code for machine
checks instead of abusing -EINTR.
rename exit_reason into sie_return to avoid the naming conflict
and change the code flow in vsie.c to have a separate variable for rc
and sie_return.
Fixes: 2bd1337a12 ("KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions")
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
KVM will reinject machine checks that happen during guest activity.
From a host perspective this machine check is no longer visible
and even for the guest, the guest might decide to only kill a
userspace program or even ignore the machine check.
As this can be a disruptive event nevertheless, we should log this
not only in the VM debug event (that gets lost after guest shutdown)
but also on the global KVM event as well as syslog.
Consolidate the logging and log with loglevel 2 and higher.
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
There are special cases where secure storage access exceptions happen
in a kernel context for pages that don't have the PG_arch_1 bit
set. That bit is set for non-exported guest secure storage (memory)
but is absent on storage donated to the Ultravisor since the kernel
isn't allowed to export donated pages.
Prior to this patch we would try to export the page by calling
arch_make_folio_accessible() which would instantly return since the
arch bit is absent signifying that the page was already exported and
no further action is necessary. This leads to secure storage access
exception loops which can never be resolved.
With this patch we unconditionally try to export and if that fails we
fixup.
Fixes: 084ea4d611 ("s390/mm: add (non)secure page access exceptions handlers")
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Pull btrfs fixes from David Sterba:
- fix logging of new dentries when logging parent directory and there
are conflicting inodes (e.g. deleted directory)
- avoid taking big device lock for zone setup, this is not necessary
during mount
- tune message verbosity when auto-reclaiming zones when low on space
- fix slightly misleading message of root item check
* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: fix misleading root drop_level error message
btrfs: log new dentries when logging parent dir of a conflicting inode
btrfs: don't take device_list_mutex when querying zone info
btrfs: pass 'verbose' parameter to btrfs_relocate_block_group
Fix 45+ kernel-doc warnings in vmwgfx_drv.h:
- spell a struct name correctly
- don't have structs between kernel-doc and its struct
- end description of struct members with ':'
- start all kernel-doc lines with " *"
- mark private struct member and enum value with "private:"
- add kernel-doc for enum vmw_dma_map_mode
- add missing struct member comments
- add missing function parameter comments
- convert "/**" to "/*" for non-kernel-doc comments
- add missing "Returns:" comments for several functions
- correct a function parameter name
to eliminate kernel-doc warnings (examples):
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:128 struct vmw_bo; error:
Cannot parse struct or union!
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:151 struct member 'used_prio'
not described in 'vmw_resource'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:151 struct member 'mob_node'
not described in 'vmw_resource'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:199 bad line: SM4 device.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:270 struct member 'private'
not described in 'vmw_res_cache_entry'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:280 Enum value
'vmw_dma_alloc_coherent' not described in enum 'vmw_dma_map_mode'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:280 Enum value
'vmw_dma_map_bind' not described in enum 'vmw_dma_map_mode'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:295 struct member 'addrs'
not described in 'vmw_sg_table'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:295 struct member 'mode'
not described in 'vmw_sg_table'
vmwgfx_drv.h:309: warning: Excess struct member 'num_regions' description
in 'vmw_sg_table'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:402 struct member 'filp'
not described in 'vmw_sw_context'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:732 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:742 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:762 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:887 No description found for
return value of 'vmw_fifo_caps'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:901 No description found for
return value of 'vmw_is_cursor_bypass3_enabled'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:906 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:961 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:996 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1082 cannot understand
function prototype: 'const struct dma_buf_ops vmw_prime_dmabuf_ops;'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1303 struct member 'do_cpy'
not described in 'vmw_diff_cpy'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1385 function parameter 'fmt'
not described in 'VMW_DEBUG_KMS'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1389 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1426 function parameter 'vmw'
not described in 'vmw_fifo_mem_read'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1426 No description found for
return value of 'vmw_fifo_mem_read'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1441 function parameter
'fifo_reg' not described in 'vmw_fifo_mem_write'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://patch.msgid.link/20260219215548.470810-1-rdunlap@infradead.org
Presently, if the force feedback initialisation fails when probing the
Logitech G920 Driving Force Racing Wheel for Xbox One, an error number
will be returned and propagated before the userspace infrastructure
(sysfs and /dev/input) has been torn down. If userspace ignores the
errors and continues to use its references to these dangling entities, a
UAF will promptly follow.
We have 2 options; continue to return the error, but ensure that all of
the infrastructure is torn down accordingly or continue to treat this
condition as a warning by emitting the message but returning success.
It is thought that the original author's intention was to emit the
warning but keep the device functional, less the force feedback feature,
so let's go with that.
Signed-off-by: Lee Jones <lee@kernel.org>
Reviewed-by: Günther Noack <gnoack@google.com>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
right now the returned value is considered to be always valid. However,
when playing with HID-BPF, the return value can be arbitrary big,
because it's the return value of dispatch_hid_bpf_raw_requests(), which
calls the struct_ops and we have no guarantees that the value makes
sense.
Fixes: 8bd0488b5e ("HID: bpf: add HID-BPF hooks for hid_hw_raw_requests")
Cc: stable@vger.kernel.org
Acked-by: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
The memset() in hid_report_raw_event() has the good intention of
clearing out bogus data by zeroing the area from the end of the incoming
data string to the assumed end of the buffer. However, as we have
previously seen, doing so can easily result in OOB reads and writes in
the subsequent thread of execution.
The current suggestion from one of the HID maintainers is to remove the
memset() and simply return if the incoming event buffer size is not
large enough to fill the associated report.
Suggested-by Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Lee Jones <lee@kernel.org>
[bentiss: changed the return value]
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
When 1-bit bus width is used with HS200/HS400 capabilities set,
mmc_select_hs200() returns 0 without actually switching. This
causes mmc_select_timing() to skip mmc_select_hs(), leaving eMMC
in legacy mode (26MHz) instead of High Speed SDR (52MHz).
Per JEDEC eMMC spec section 5.3.2, 1-bit mode supports High Speed
SDR. Drop incompatible HS200/HS400/UHS/DDR caps early so timing
selection falls through to mmc_select_hs() correctly.
Fixes: f2119df6b7 ("mmc: sd: add support for signal voltage switch procedure")
Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
setup_fifo_params computes mode_changed from spi->mode flags but tests
it against SE_SPI_CPHA and SE_SPI_CPOL, which are register offsets,
not SPI mode bits. This causes CPHA and CPOL updates to be skipped
on mode switches, leaving the controller with stale clock phase
and polarity settings.
Fix this by using SPI_CPHA and SPI_CPOL to detect mode changes before
updating the corresponding registers.
Fixes: 781c3e71c9 ("spi: spi-geni-qcom: rework setup_fifo_params")
Signed-off-by: Maramaina Naresh <naresh.maramaina@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://patch.msgid.link/20260316-spi-geni-cpha-cpol-fix-v1-1-4cb44c176b79@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Currently HID_PHYS is empty, which means userspace tools (e.g. fwupd)
that depend on it for distinguishing the devices, are unable to do so.
Other drivers like i2c-hid, usbhid, surface-hid, all populate it.
With this change it's set to, for example: HID_PHYS=0000:00:10.0
Each function has just a single HID device, as far as I can tell, so
there is no need to add a suffix.
Tested with fwupd 2.1.1, can avoid https://github.com/fwupd/fwupd/pull/9995
Cc: Even Xu <even.xu@intel.com>
Cc: Xinpeng Sun <xinpeng.sun@intel.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Benjamin Tissoires <bentiss@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Daniel Schaefer <git@danielschaefer.me>
Reviewed-by: Even Xu <even.xu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
A XFRM_MSG_NEWSPDINFO request can queue the per-net work item
policy_hthresh.work onto the system workqueue.
The queued callback, xfrm_hash_rebuild(), retrieves the enclosing
struct net via container_of(). If the net namespace is torn down
before that work runs, the associated struct net may already have
been freed, and xfrm_hash_rebuild() may then dereference stale memory.
xfrm_policy_fini() already flushes policy_hash_work during teardown,
but it does not synchronize policy_hthresh.work.
Synchronize policy_hthresh.work in xfrm_policy_fini() as well, so the
queued work cannot outlive the net namespace teardown and access a
freed struct net.
Fixes: 880a6fab8f ("xfrm: configure policy hash table thresholds by netlink")
Signed-off-by: Minwoo Ra <raminwo0202@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
intel_dmc_update_dc6_allowed_count() oopses when DMC hasn't been
initialized, and dmc is thus NULL.
That would be the case when the call path is
intel_power_domains_init_hw() -> {skl,bxt,icl}_display_core_init() ->
gen9_set_dc_state() -> intel_dmc_update_dc6_allowed_count(), as
intel_power_domains_init_hw() is called *before* intel_dmc_init().
However, gen9_set_dc_state() calls intel_dmc_update_dc6_allowed_count()
conditionally, depending on the current and target DC states. At probe,
the target is disabled, but if DC6 is enabled, the function is called,
and an oops follows. Apparently it's quite unlikely that DC6 is enabled
at probe, as we haven't seen this failure mode before.
It is also strange to have DC6 enabled at boot, since that would require
the DMC firmware (loaded by BIOS); the BIOS loading the DMC firmware and
the driver stopping / reprogramming the firmware is a poorly specified
sequence and as such unlikely an intentional BIOS behaviour. It's more
likely that BIOS is leaving an unintentionally enabled DC6 HW state
behind (without actually loading the required DMC firmware for this).
The tracking of the DC6 allowed counter only works if starting /
stopping the counter depends on the _SW_ DC6 state vs. the current _HW_
DC6 state (since stopping the counter requires the DC5 counter captured
when the counter was started). Thus, using the HW DC6 state is incorrect
and it also leads to the above oops. Fix both issues by using the SW DC6
state for the tracking.
This is v2 of the fix originally sent by Jani, updated based on the
first Link: discussion below.
Link: https://lore.kernel.org/all/3626411dc9e556452c432d0919821b76d9991217@intel.com
Link: https://lore.kernel.org/all/20260228130946.50919-2-ltao@redhat.com
Fixes: 88c1f9a4d3 ("drm/i915/dmc: Create debugfs entry for dc6 counter")
Cc: Mohammed Thasleem <mohammed.thasleem@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: <stable@vger.kernel.org> # v6.16+
Tested-by: Tao Liu <ltao@redhat.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patch.msgid.link/20260309164803.1918158-1-imre.deak@intel.com
(cherry picked from commit 2344b93af8eb5da5d496b4e0529d35f0f559eaf0)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Most of VM feature detections are integer OR operations, and integer
assignment operation will clear previous integer OR operation. So here
change all integer assignment operations to integer OR operations.
Fixes: 82db90bf46 ("LoongArch: KVM: Move feature detection in kvm_vm_init_features()")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Occasionally there exist "text_copy_cb: operation failed" when executing
the bpf selftests, the reason is copy_to_kernel_nofault() failed and the
ecode of ESTAT register is 0x4 (PME: Page Modification Exception) due to
the pte is not writeable. The root cause is that there is another place
to set the pte entry as readonly which is in the generic weak version of
arch_protect_bpf_trampoline().
There are two ways to fix this race condition issue: the direct way is
to modify the generic weak arch_protect_bpf_trampoline() to add a mutex
lock for set_memory_rox(), but the other simple and proper way is to
just make arch_protect_bpf_trampoline() return 0 in the arch-specific
code because LoongArch has already use the BPF prog pack allocator for
trampoline.
Here are the trimmed kernel log messages:
copy_to_kernel_nofault: memory access failed, ecode 0x4
copy_to_kernel_nofault: the caller is text_copy_cb+0x50/0xa0
text_copy_cb: operation failed
------------[ cut here ]------------
bpf_prog_pack bug: missing bpf_arch_text_invalidate?
WARNING: kernel/bpf/core.c:1008 at bpf_prog_pack_free+0x200/0x228
...
Call Trace:
[<9000000000248914>] show_stack+0x64/0x188
[<9000000000241308>] dump_stack_lvl+0x6c/0x9c
[<90000000002705bc>] __warn+0x9c/0x200
[<9000000001c428c0>] __report_bug+0xa8/0x1c0
[<9000000001c42b5c>] report_bug+0x64/0x120
[<9000000001c7dcd0>] do_bp+0x270/0x3c0
[<9000000000246f40>] handle_bp+0x120/0x1c0
[<900000000047b030>] bpf_prog_pack_free+0x200/0x228
[<900000000047b2ec>] bpf_jit_binary_pack_free+0x24/0x60
[<900000000026989c>] bpf_jit_free+0x54/0xb0
[<900000000029e10c>] process_one_work+0x184/0x610
[<900000000029ef8c>] worker_thread+0x24c/0x388
[<90000000002a902c>] kthread+0x13c/0x170
[<9000000001c7dfe8>] ret_from_kernel_thread+0x28/0x1c0
[<9000000000246624>] ret_from_kernel_thread_asm+0xc/0x88
---[ end trace 0000000000000000 ]---
Here is a simple shell script to reproduce:
#!/bin/bash
for ((i=1; i<=1000; i++))
do
echo "Under testing $i ..."
dmesg -c > /dev/null
./test_progs -t fentry_attach_stress > /dev/null
dmesg -t | grep "text_copy_cb: operation failed"
if [ $? -eq 0 ]; then
break
fi
done
Cc: stable@vger.kernel.org
Fixes: 4ab17e762b ("LoongArch: BPF: Use BPF prog pack allocator")
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
set_memory_rw() and set_memory_rox() may fail, so we should check the
return values and return immediately in larch_insn_text_copy().
Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
If memory access such as copy_{from, to}_kernel_nofault() failed, its
users do not know what happened, so it is very useful to print the
exception code for such cases. Furthermore, it is better to print the
caller function to know where is the entry.
Here are the low level call chains:
copy_from_kernel_nofault()
copy_from_kernel_nofault_loop()
__get_kernel_nofault()
copy_to_kernel_nofault()
copy_to_kernel_nofault_loop()
__put_kernel_nofault()
Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Fix the warning:
BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1
caller is larch_insn_text_copy+0x40/0xf0
Simply changing it to raw_smp_processor_id() is not enough: if preempt
and CPU hotplug happens after raw_smp_processor_id() but before calling
stop_machine(), the CPU where raw_smp_processor_id() has run may become
offline when stop_machine() and no CPU will run copy_to_kernel_nofault()
in text_copy_cb(). Thus guard the larch_insn_text_copy() calls with
cpus_read_lock() and change stop_machine() to stop_machine_cpuslocked()
to prevent this.
I've considered moving the locks inside larch_insn_text_copy() but
doing so seems not an easy hack. In bpf_arch_text_poke() obviously the
memcpy() call must be guarded by text_mutex, so we have to leave the
acquire of text_mutex out of larch_insn_text_copy(). But in the entire
kernel the acquire of mutexes is always after cpus_read_lock(), so we
cannot put cpus_read_lock() into larch_insn_text_copy() while leaving
the text_mutex acquire out (or we risk a deadlock due to inconsistent
lock acquire order). So let's fix the bug first and leave the posssible
refactor as future work.
Fixes: 9fbd18cf4c ("LoongArch: BPF: Add dynamic code modification support")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
The 128-bit atomic cmpxchg implementation uses the SC.Q instruction.
Older versions of GNU AS do not support that instruction, erroring out:
ERROR:root:{standard input}: Assembler messages:
{standard input}:4831: Error: no match insn: sc.q $t0,$t1,$r14
{standard input}:6407: Error: no match insn: sc.q $t0,$t1,$r23
{standard input}:10856: Error: no match insn: sc.q $t0,$t1,$r14
make[4]: *** [../scripts/Makefile.build:289: mm/slub.o] Error 1
(Binutils 2.41)
So test support for SC.Q in Kconfig and disable the atomics if the
instruction is not available.
Fixes: f0e4b1b6e2 ("LoongArch: Add 128-bit atomic cmpxchg support")
Closes: https://lore.kernel.org/lkml/20260216082834-edc51c46-7b7a-4295-8ea5-4d9a3ca2224f@linutronix.de/
Reviewed-by: Xi Ruoyao <xry111@xry111.site>
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Customer reported that some of their krb5 mounts were failing against
a single server as the client was trying to mount the shares with
wrong credentials. It turned out the client was reusing SMB session
from first mount to try mounting the other shares, even though a
different username= option had been specified to the other mounts.
By using username mount option along with sec=krb5 to search for
principals from keytab is supported by cifs.upcall(8) since
cifs-utils-4.8. So fix this by matching username mount option in
match_session() even with Kerberos.
For example, the second mount below should fail with -ENOKEY as there
is no 'foobar' principal in keytab (/etc/krb5.keytab). The client
ends up reusing SMB session from first mount to perform the second
one, which is wrong.
```
$ ktutil
ktutil: add_entry -password -p testuser -k 1 -e aes256-cts
Password for testuser@ZELDA.TEST:
ktutil: write_kt /etc/krb5.keytab
ktutil: quit
$ klist -ke
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- ----------------------------------------------------------------
1 testuser@ZELDA.TEST (aes256-cts-hmac-sha1-96)
$ mount.cifs //w22-root2/scratch /mnt/1 -o sec=krb5,username=testuser
$ mount.cifs //w22-root2/scratch /mnt/2 -o sec=krb5,username=foobar
$ mount -t cifs | grep -Po 'username=\K\w+'
testuser
testuser
```
Reported-by: Oscar Santos <ossantos@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
ctlr is allocated using devm_spi_alloc_host(), which automatically
handles reference counting via the devm framework.
Calling spi_controller_put() manually in the probe error path is
redundant and results in a double-free.
Fixes: e75a6b00ad ("spi: axiado: Add driver for Axiado SPI DB controller")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260302-axiado-v1-1-1132819f1cb7@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Add an ACP70 SoundWire machine entry for ASUS PX13
(HN7306EA/HN7306EAC) with rt721 and two TAS2783 amps on link1.
Describe rt721 with jack/DMIC endpoints on this platform and add
explicit left/right TAS2783 speaker endpoint mapping via name prefixes.
Signed-off-by: Hasun Park <hasunpark@gmail.com>
Link: https://patch.msgid.link/20260308151654.29059-3-hasunpark@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Some ASUS ProArt PX13 systems expose ACP ACPI config flags that can
select a non-working fallback path.
Add a DMI override in snd_amd_acp_find_config() for ACP70+ boards and
return 0 so ACP ACPI flag-based selection is skipped on this platform.
This keeps machine driver selection on the intended SoundWire path.
Signed-off-by: Hasun Park <hasunpark@gmail.com>
Link: https://patch.msgid.link/20260308151654.29059-2-hasunpark@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
A previous change added NULL checks and cleanup for allocation
failures in sma1307_setting_loaded().
However, the cleanup for mode_set entries is wrong. Those entries are
allocated with devm_kzalloc(), so they are device-managed resources and
must not be freed with kfree(). Manually freeing them in the error path
can lead to a double free when devres later releases the same memory.
Drop the manual kfree() loop and let devres handle the cleanup.
Fixes: 0ec6bd1670 ("ASoC: sma1307: Add NULL check in sma1307_setting_loaded()")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260313040611.391479-1-lgs201920130244@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In aml_spisg_probe(), ctlr is allocated by
spi_alloc_target()/spi_alloc_host(), but fails to call
spi_controller_put() in several error paths. This leads
to a memory leak whenever the driver fails to probe after
the initial allocation.
Convert to use devm_spi_alloc_host()/devm_spi_alloc_target()
to fix the memory leak.
Fixes: cef9991e04 ("spi: Add Amlogic SPISG driver")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260308-spisg-v1-1-2cace5cafc24@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The driver uses devm_clk_get_enabled() which enables the clock and
registers a callback to automatically disable it when the device
is unbound.
Remove the redundant aml_sfc_disable_clk() call in the error paths
and remove callback.
Fixes: 4670db6f32 ("spi: amlogic: add driver for Amlogic SPI Flash Controller")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260308-spifc-a4-1-v1-1-77e286c26832@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Our vcpu reset suffers from a particularly interesting flaw, as it
does not correctly deal with state that will have an effect on the
execution flow out of reset.
Take the following completely random example, never seen in the wild
and that never resulted in a couple of sleepless nights: /s
- vcpu-A issues a PSCI_CPU_OFF using the SMC conduit
- SMC being a trapped instruction (as opposed to HVC which is always
normally executed), we annotate the vcpu as needing to skip the
next instruction, which is the SMC itself
- vcpu-A is now safely off
- vcpu-B issues a PSCI_CPU_ON for vcpu-A, providing a starting PC
- vcpu-A gets reset, get the new PC, and is sent on its merry way
- right at the point of entering the guest, we notice that a PC
increment is pending (remember the earlier SMC?)
- vcpu-A skips its first instruction...
What could possibly go wrong?
Well, I'm glad you asked. For pKVM as a NV guest, that first instruction
is extremely significant, as it indicates whether the CPU is booting
or resuming. Having skipped that instruction, nothing makes any sense
anymore, and CPU hotplugging fails.
This is all caused by the decoupling of PC update from the handling
of an exception that triggers such update, making it non-obvious
what affects what when.
Fix this train wreck by discarding all the PC-affecting state on
vcpu reset.
Fixes: f5e3068061 ("KVM: arm64: Move __adjust_pc out of line")
Cc: stable@vger.kernel.org
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260312140850.822968-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
The assembly flush instructions were swapped for I- and D-cache flags:
SYSCALL_DEFINE3(cacheflush, ...)
{
if (cache & DCACHE) {
"fic ...\n"
}
if (cache & ICACHE && error == 0) {
"fdc ...\n"
}
Fix it by using fdc for DCACHE, and fic for ICACHE flushing.
Reported-by: Felix Lechner <felix.lechner@lease-up.com>
Fixes: c6d96328fe ("parisc: Add cacheflush() syscall")
Cc: <stable@vger.kernel.org> # v6.5+
Signed-off-by: Helge Deller <deller@gmx.de>
Kevin Hao says:
====================
net: macb: Fix Ethernet malfunction on AMD Versal board after suspend
On Versal boards, the tx/rx queue pointer registers are cleared after suspend,
which causes Ethernet malfunction. This patch series addresses this issue by
reinitializing the tx/rx queue pointer registers and the rx ring.
====================
Link: https://patch.msgid.link/20260312-macb-versal-v1-0-467647173fa4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On certain platforms, such as AMD Versal boards, the tx/rx queue pointer
registers are cleared after suspend, and the rx queue pointer register
is also disabled during suspend if WOL is enabled. Previously, we assumed
that these registers would be restored by macb_mac_link_up(). However,
in commit bf9cf80cab, macb_init_buffers() was moved from
macb_mac_link_up() to macb_open(). Therefore, we should call
macb_init_buffers() to reinitialize the tx/rx queue pointer registers
during resume.
Due to the reset of these two registers, we also need to adjust the
tx/rx rings accordingly. The tx ring will be handled by
gem_shuffle_tx_rings() in macb_mac_link_up(), so we only need to
initialize the rx ring here.
Fixes: bf9cf80cab ("net: macb: Fix tx/rx malfunction after phy link down and up")
Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Tested-by: Quanyang Wang <quanyang.wang@windriver.com>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260312-macb-versal-v1-2-467647173fa4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page recycling was removed from the XDP_DROP path in emac_run_xdp() to
avoid conflicts with AF_XDP zero-copy mode, which uses xsk_buff_free()
instead.
However, this causes a memory leak when running XDP programs that drop
packets in non-zero-copy mode (standard page pool mode). The pages are
never returned to the page pool, leading to OOM conditions.
Fix this by handling cleanup in the caller, emac_rx_packet().
When emac_run_xdp() returns ICSSG_XDP_CONSUMED for XDP_DROP, the
caller now recycles the page back to the page pool. The zero-copy
path, emac_rx_packet_zc() already handles cleanup correctly with
xsk_buff_free().
Fixes: 7a64bb388d ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260311095441.1691636-1-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For Zhaoxin processors, the XSHA1 instruction requires the total memory
allocated at %rdi register must be 32 bytes, while the XSHA1 and
XSHA256 instruction doesn't perform any operation when %ecx is zero.
Due to these requirements, the current padlock-sha driver does not work
correctly with Zhaoxin processors. It cannot pass the self-tests and
therefore does not activate the driver on Zhaoxin processors. This issue
has been reported in Debian [1]. The self-tests fail with the
following messages [2]:
alg: shash: sha1-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha1 using sha1-padlock-nano failed (rc=-22)
alg: shash: sha256-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha256 using sha256-padlock-nano failed (rc=-22)
Disable the padlock-sha driver on Zhaoxin processors with the CPU family
0x07 and newer. Following the suggestion in [3], support for PHE will be
added to lib/crypto/ instead.
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1113996
[2] https://linux-hardware.org/?probe=271fabb7a4&log=dmesg
[3] https://lore.kernel.org/linux-crypto/aUI4CGp6kK7mxgEr@gondor.apana.org.au/
Fixes: 63dc06cd12 ("crypto: padlock-sha - Use API partial block handling")
Cc: stable@vger.kernel.org
Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
Link: https://lore.kernel.org/r/20260313080150.9393-2-AlanSong-oc@zhaoxin.com
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
A potential race condition exists in mana_hwc_destroy_channel() where
hwc->caller_ctx is freed before the HWC's Completion Queue (CQ) and
Event Queue (EQ) are destroyed. This allows an in-flight CQ interrupt
handler to dereference freed memory, leading to a use-after-free or
NULL pointer dereference in mana_hwc_handle_resp().
mana_smc_teardown_hwc() signals the hardware to stop but does not
synchronize against IRQ handlers already executing on other CPUs. The
IRQ synchronization only happens in mana_hwc_destroy_cq() via
mana_gd_destroy_eq() -> mana_gd_deregister_irq(). Since this runs
after kfree(hwc->caller_ctx), a concurrent mana_hwc_rx_event_handler()
can dereference freed caller_ctx (and rxq->msg_buf) in
mana_hwc_handle_resp().
Fix this by reordering teardown to reverse-of-creation order: destroy
the TX/RX work queues and CQ/EQ before freeing hwc->caller_ctx. This
ensures all in-flight interrupt handlers complete before the memory they
access is freed.
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/abHA3AjNtqa1nx9k@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net
This is a much earlier pull request than usual, due to the large
backlog. We are aware of several unfixed issues, in particular
in ctnetlink, patches are being worked on.
The following patchset contains Netfilter fixes for *net*:
1) fix a use-after-free in ctnetlink, from Hyunwoo Kim, broken
since v3.10.
2) add missing netlink range checks in ctnetlink, broken since v2.6
days.
3) fix content length truncation in sip conntrack helper,
from Lukas Johannes Möller. Broken since 2.6.34.
4) Revert a recent patch to add stronger checks for overlapping ranges
in nf_tables rbtree set type.
Patch is correct, but several nftables version have a bug (now fixed)
that trigger the checks incorrectly.
5) Reset mac header before the vlan push to avoid warning splat (and
make things functional). From Eric Woudstra.
6) Add missing bounds check in H323 conntrack helper, broken since this
helper was added 20 years ago, from Jenny Guanni Qu.
7) Fix a memory leak in the dynamic set infrastructure, from Pablo Neira
Ayuso. Broken since v5.11.
8+9) a few spots failed to purge skbs queued to userspace via nfqueue,
this causes RCU escape / use-after-free. Also from Pablo. broken
since v3.4 added the CT target to xtables.
10) Fix undefined behaviour in xt_time, use u32 for a shift-by-31
operation, not s32, from Jenny Guanni Qu.
11) H323 conntrack helper lacks a check for length variable becoming
negative after decrement, causes major out-of-bounds read due to
cast to unsigned size later, also from Jenny.
Both issues exist since 2.6 days.
* tag 'nf-26-03-13' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_h323: check for zero length in DecodeQ931()
netfilter: xt_time: use unsigned int for monthday bit shift
netfilter: xt_CT: drop pending enqueued packets on template removal
netfilter: nft_ct: drop pending enqueued packets on removal
nf_tables: nft_dynset: fix possible stateful expression memleak in error path
netfilter: nf_conntrack_h323: fix OOB read in decode_int() CONS case
netfilter: nf_flow_table_ip: reset mac header before vlan push
netfilter: revert nft_set_rbtree: validate open interval overlap
netfilter: nf_conntrack_sip: fix Content-Length u32 truncation in sip_help_tcp()
netfilter: conntrack: add missing netlink policy validations
netfilter: ctnetlink: fix use-after-free in ctnetlink_dump_exp_ct()
====================
Link: https://patch.msgid.link/20260313150614.21177-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: Fix hci_le_create_conn_sync
- MGMT: Fix list corruption and UAF in command complete handlers
- L2CAP: Disconnect if received packet's SDU exceeds IMTU
- L2CAP: Disconnect if sum of payload sizes exceed SDU
- L2CAP: Fix accepting multiple L2CAP_ECRED_CONN_REQ
- L2CAP: Fix type confusion in l2cap_ecred_reconf_rsp()
- L2CAP: Validate L2CAP_INFO_RSP payload length before access
- L2CAP: Fix use-after-free in l2cap_unregister_user
- ISO: Fix defer tests being unstable
- HIDP: Fix possible UAF
- SMP: make SM/PER/KDU/BI-04-C happy
- qca: fix ROM version reading on WCN3998 chips
* tag 'for-net-2026-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: qca: fix ROM version reading on WCN3998 chips
Bluetooth: L2CAP: Validate L2CAP_INFO_RSP payload length before access
Bluetooth: L2CAP: Fix type confusion in l2cap_ecred_reconf_rsp()
Bluetooth: L2CAP: Fix accepting multiple L2CAP_ECRED_CONN_REQ
Bluetooth: L2CAP: Fix use-after-free in l2cap_unregister_user
Bluetooth: HIDP: Fix possible UAF
Bluetooth: MGMT: Fix list corruption and UAF in command complete handlers
Bluetooth: hci_sync: Fix hci_le_create_conn_sync
Bluetooth: ISO: Fix defer tests being unstable
Bluetooth: SMP: make SM/PER/KDU/BI-04-C happy
Bluetooth: LE L2CAP: Disconnect if sum of payload sizes exceed SDU
Bluetooth: LE L2CAP: Disconnect if received packet's SDU exceeds IMTU
====================
Link: https://patch.msgid.link/20260312200655.1215688-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a reader's file descriptor is closed while in the middle of reading
a cache_request (rp->offset != 0), cache_release() decrements the
request's readers count but never checks whether it should free the
request.
In cache_read(), when readers drops to 0 and CACHE_PENDING is clear, the
cache_request is removed from the queue and freed along with its buffer
and cache_head reference. cache_release() lacks this cleanup.
The only other path that frees requests with readers == 0 is
cache_dequeue(), but it runs only when CACHE_PENDING transitions from
set to clear. If that transition already happened while readers was
still non-zero, cache_dequeue() will have skipped the request, and no
subsequent call will clean it up.
Add the same cleanup logic from cache_read() to cache_release(): after
decrementing readers, check if it reached 0 with CACHE_PENDING clear,
and if so, dequeue and free the cache_request.
Reported-by: NeilBrown <neilb@ownmail.net>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The /proc/fs/nfs/exports proc entry is created at module init
and persists for the module's lifetime. exports_proc_open()
captures the caller's current network namespace and stores
its svc_export_cache in seq->private, but takes no reference
on the namespace. If the namespace is subsequently torn down
(e.g. container destruction after the opener does setns() to a
different namespace), nfsd_net_exit() calls nfsd_export_shutdown()
which frees the cache. Subsequent reads on the still-open fd
dereference the freed cache_detail, walking a freed hash table.
Hold a reference on the struct net for the lifetime of the open
file descriptor. This prevents nfsd_net_exit() from running --
and thus prevents nfsd_export_shutdown() from freeing the cache
-- while any exports fd is open. cache_detail already stores
its net pointer (cd->net, set by cache_create_net()), so
exports_release() can retrieve it without additional per-file
storage.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: 96d851c4d2 ("nfsd: use proper net while reading "exports" file")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_export_put() calls path_put() and auth_domain_put() immediately
when the last reference drops, before the RCU grace period. RCU
readers in e_show() and c_show() access both ex_path (via
seq_path/d_path) and ex_client->name (via seq_escape) without
holding a reference. If cache_clean removes the entry and drops the
last reference concurrently, the sub-objects are freed while still
in use, producing a NULL pointer dereference in d_path.
Commit 2530766492 ("nfsd: fix UAF when access ex_uuid or
ex_stats") moved kfree of ex_uuid and ex_stats into the
call_rcu callback, but left path_put() and auth_domain_put() running
before the grace period because both may sleep and call_rcu
callbacks execute in softirq context.
Replace call_rcu/kfree_rcu with queue_rcu_work(), which defers the
callback until after the RCU grace period and executes it in process
context where sleeping is permitted. This allows path_put() and
auth_domain_put() to be moved into the deferred callback alongside
the other resource releases. Apply the same fix to expkey_put(),
which has the identical pattern with ek_path and ek_client.
A dedicated workqueue scopes the shutdown drain to only NFSD
export release work items; flushing the shared
system_unbound_wq would stall on unrelated work from other
subsystems. nfsd_export_shutdown() uses rcu_barrier() followed
by flush_workqueue() to ensure all deferred release callbacks
complete before the export caches are destroyed.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: c224edca7a ("nfsd: no need get cache ref when protected by rcu")
Fixes: 1b10f0b603 ("SUNRPC: no need get cache ref when protected by rcu")
Cc: stable@vger.kernel.org
Reviwed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A race condition exists between lec_atm_close() setting priv->lecd
to NULL and concurrent access to priv->lecd in send_to_lecd(),
lec_handle_bridge(), and lec_atm_send(). When the socket is freed
via RCU while another thread is still using it, a use-after-free
occurs in sock_def_readable() when accessing the socket's wait queue.
The root cause is that lec_atm_close() clears priv->lecd without
any synchronization, while callers dereference priv->lecd without
any protection against concurrent teardown.
Fix this by converting priv->lecd to an RCU-protected pointer:
- Mark priv->lecd as __rcu in lec.h
- Use rcu_assign_pointer() in lec_atm_close() and lecd_attach()
for safe pointer assignment
- Use rcu_access_pointer() for NULL checks that do not dereference
the pointer in lec_start_xmit(), lec_push(), send_to_lecd() and
lecd_attach()
- Use rcu_read_lock/rcu_dereference/rcu_read_unlock in send_to_lecd(),
lec_handle_bridge() and lec_atm_send() to safely access lecd
- Use rcu_assign_pointer() followed by synchronize_rcu() in
lec_atm_close() to ensure all readers have completed before
proceeding. This is safe since lec_atm_close() is called from
vcc_release() which holds lock_sock(), a sleeping lock.
- Remove the manual sk_receive_queue drain from lec_atm_close()
since vcc_destroy_socket() already drains it after lec_atm_close()
returns.
v2: Switch from spinlock + sock_hold/put approach to RCU to properly
fix the race. The v1 spinlock approach had two issues pointed out
by Eric Dumazet:
1. priv->lecd was still accessed directly after releasing the
lock instead of using a local copy.
2. The spinlock did not prevent packets being queued after
lec_atm_close() drains sk_receive_queue since timer and
workqueue paths bypass netif_stop_queue().
Note: Syzbot patch testing was attempted but the test VM terminated
unexpectedly with "Connection to localhost closed by remote host",
likely due to a QEMU AHCI emulation issue unrelated to this fix.
Compile testing with "make W=1 net/atm/lec.o" passes cleanly.
Reported-by: syzbot+f50072212ab792c86925@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f50072212ab792c86925
Link: https://lore.kernel.org/all/20260309093614.502094-1-kartikey406@gmail.com/T/ [v1]
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260309155908.508768-1-kartikey406@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When OGM aggregation state is toggled at runtime, an existing forwarded
packet may have been allocated with only packet_len bytes, while a later
packet can still be selected for aggregation. Appending in this case can
hit skb_put overflow conditions.
Reject aggregation when the target skb tailroom cannot accommodate the new
packet. The caller then falls back to creating a new forward packet
instead of appending.
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Cc: stable@vger.kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
Signed-off-by: Yang Yang <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Commit 551120148b ("crypto: ccp - Fix a case where SNP_SHUTDOWN is
missed") fixed a case where SNP is left in INIT state if page reclaim
fails. It removes the transition to the INIT state for this command and
adjusts the page state management.
While doing this, it added a call to snp_leak_pages() after a call to
snp_reclaim_pages() failed. Since snp_reclaim_pages() already calls
snp_leak_pages() internally on the pages it fails to reclaim, calling
it again leaks the exact same page twice.
Fix by removing the extra call to snp_leak_pages().
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Assisted-by: Gemini:gemini-3.1-pro
Fixes: 551120148b ("crypto: ccp - Fix a case where SNP_SHUTDOWN is missed")
Cc: Tycho Andersen (AMD) <tycho@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add myself, Dexuan, and Long as maintainers. Deepak is stepping down
from these responsibilities.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
In the error path of mshv_map_user_memory(), calling vfree() directly on
the region leaves the MMU notifier registered. When userspace later unmaps
the memory, the notifier fires and accesses the freed region, causing a
use-after-free and potential kernel panic.
Replace vfree() with mshv_partition_put() to properly unregister
the MMU notifier before freeing the region.
Fixes: b9a66cd5cc ("mshv: Add support for movable memory regions")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
In DecodeQ931(), the UserUserIE code path reads a 16-bit length from
the packet, then decrements it by 1 to skip the protocol discriminator
byte before passing it to DecodeH323_UserInformation(). If the encoded
length is 0, the decrement wraps to -1, which is then passed as a
large value to the decoder, leading to an out-of-bounds read.
Add a check to ensure len is positive after the decrement.
Fixes: 5e35941d99 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Tested-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The monthday field can be up to 31, and shifting a signed integer 1
by 31 positions (1 << 31) is undefined behavior in C, as the result
overflows a 32-bit signed int. Use 1U to ensure well-defined behavior
for all valid monthday values.
Change the weekday shift to 1U as well for consistency.
Fixes: ee4411a1b1 ("[NETFILTER]: x_tables: add xt_time match")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Tested-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Templates refer to objects that can go away while packets are sitting in
nfqueue refer to:
- helper, this can be an issue on module removal.
- timeout policy, nfnetlink_cttimeout might remove it.
The use of templates with zone and event cache filter are safe, since
this just copies values.
Flush these enqueued packets in case the template rule gets removed.
Fixes: 24de58f465 ("netfilter: xt_CT: allow to attach timeout policy + glue code")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Packets sitting in nfqueue might hold a reference to:
- templates that specify the conntrack zone, because a percpu area is
used and module removal is possible.
- conntrack timeout policies and helper, where object removal leave
a stale reference.
Since these objects can just go away, drop enqueued packets to avoid
stale reference to them.
If there is a need for finer grain removal, this logic can be revisited
to make selective packet drop upon dependencies.
Fixes: 7e0b2b57f0 ("netfilter: nft_ct: add ct timeout support")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
If cloning the second stateful expression in the element via GFP_ATOMIC
fails, then the first stateful expression remains in place without being
released.
unreferenced object (percpu) 0x607b97e9cab8 (size 16):
comm "softirq", pid 0, jiffies 4294931867
hex dump (first 16 bytes on cpu 3):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
backtrace (crc 0):
pcpu_alloc_noprof+0x453/0xd80
nft_counter_clone+0x9c/0x190 [nf_tables]
nft_expr_clone+0x8f/0x1b0 [nf_tables]
nft_dynset_new+0x2cb/0x5f0 [nf_tables]
nft_rhash_update+0x236/0x11c0 [nf_tables]
nft_dynset_eval+0x11f/0x670 [nf_tables]
nft_do_chain+0x253/0x1700 [nf_tables]
nft_do_chain_ipv4+0x18d/0x270 [nf_tables]
nf_hook_slow+0xaa/0x1e0
ip_local_deliver+0x209/0x330
Fixes: 563125a73a ("netfilter: nftables: generalize set extension to support for several expressions")
Reported-by: Gurpreet Shergill <giki.shergill@proton.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
In decode_int(), the CONS case calls get_bits(bs, 2) to read a length
value, then calls get_uint(bs, len) without checking that len bytes
remain in the buffer. The existing boundary check only validates the
2 bits for get_bits(), not the subsequent 1-4 bytes that get_uint()
reads. This allows a malformed H.323/RAS packet to cause a 1-4 byte
slab-out-of-bounds read.
Add a boundary check for len bytes after get_bits() and before
get_uint().
Fixes: 5e35941d99 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
With double vlan tagged packets in the fastpath, getting the error:
skb_vlan_push got skb with skb->data not at mac header (offset 18)
Call skb_reset_mac_header() before calling skb_vlan_push().
Fixes: c653d5a78f ("netfilter: flowtable: inline vlan encapsulation in xmit path")
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
This reverts commit 648946966a ("netfilter: nft_set_rbtree: validate
open interval overlap").
There have been reports of nft failing to laod valid rulesets after this
patch was merged into -stable.
I can reproduce several such problem with recent nft versions, including
nft 1.1.6 which is widely shipped by distributions.
We currently have little choice here.
This commit can be resurrected at some point once the nftables fix that
triggers the false overlap positive has appeared in common distros
(see e83e32c8d1cd ("mnl: restore create element command with large batches" in
nftables.git).
Fixes: 648946966a ("netfilter: nft_set_rbtree: validate open interval overlap")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
sip_help_tcp() parses the SIP Content-Length header with
simple_strtoul(), which returns unsigned long, but stores the result in
unsigned int clen. On 64-bit systems, values exceeding UINT_MAX are
silently truncated before computing the SIP message boundary.
For example, Content-Length 4294967328 (2^32 + 32) is truncated to 32,
causing the parser to miscalculate where the current message ends. The
loop then treats trailing data in the TCP segment as a second SIP
message and processes it through the SDP parser.
Fix this by changing clen to unsigned long to match the return type of
simple_strtoul(), and reject Content-Length values that exceed the
remaining TCP payload length.
Fixes: f5b321bd37 ("netfilter: nf_conntrack_sip: add TCP support")
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Florian Westphal <fw@strlen.de>
Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.
These attributes are used by the kernel without any validation.
Extend the netlink policies accordingly.
Quoting the reporter:
nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
value directly to ct->proto.sctp.state without checking that it is
within the valid range. [..]
and: ... with exp->dir = 100, the access at
ct->master->tuplehash[100] reads 5600 bytes past the start of a
320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
UBSAN.
Fixes: 076a0ca026 ("netfilter: ctnetlink: add NAT support for expectations")
Fixes: a258860e01 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
ctnetlink_dump_exp_ct() stores a conntrack pointer in cb->data for the
netlink dump callback ctnetlink_exp_ct_dump_table(), but drops the
conntrack reference immediately after netlink_dump_start(). When the
dump spans multiple rounds, the second recvmsg() triggers the dump
callback which dereferences the now-freed conntrack via nfct_help(ct),
leading to a use-after-free on ct->ext.
The bug is that the netlink_dump_control has no .start or .done
callbacks to manage the conntrack reference across dump rounds. Other
dump functions in the same file (e.g. ctnetlink_get_conntrack) properly
use .start/.done callbacks for this purpose.
Fix this by adding .start and .done callbacks that hold and release the
conntrack reference for the duration of the dump, and move the
nfct_help() call after the cb->args[0] early-return check in the dump
callback to avoid dereferencing ct->ext unnecessarily.
BUG: KASAN: slab-use-after-free in ctnetlink_exp_ct_dump_table+0x4f/0x2e0
Read of size 8 at addr ffff88810597ebf0 by task ctnetlink_poc/133
CPU: 1 UID: 0 PID: 133 Comm: ctnetlink_poc Not tainted 7.0.0-rc2+ #3 PREEMPTLAZY
Call Trace:
<TASK>
ctnetlink_exp_ct_dump_table+0x4f/0x2e0
netlink_dump+0x333/0x880
netlink_recvmsg+0x3e2/0x4b0
? aa_sk_perm+0x184/0x450
sock_recvmsg+0xde/0xf0
Allocated by task 133:
kmem_cache_alloc_noprof+0x134/0x440
__nf_conntrack_alloc+0xa8/0x2b0
ctnetlink_create_conntrack+0xa1/0x900
ctnetlink_new_conntrack+0x3cf/0x7d0
nfnetlink_rcv_msg+0x48e/0x510
netlink_rcv_skb+0xc9/0x1f0
nfnetlink_rcv+0xdb/0x220
netlink_unicast+0x3ec/0x590
netlink_sendmsg+0x397/0x690
__sys_sendmsg+0xf4/0x180
Freed by task 0:
slab_free_after_rcu_debug+0xad/0x1e0
rcu_core+0x5c3/0x9c0
Fixes: e844a92843 ("netfilter: ctnetlink: allow to dump expectation per master conntrack")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Code allocates standard kernel memory to pass to the MPAM, which expects
__iomem. The code is safe, because __iomem accessors should work fine
on kernel mapped memory, however leads to sparse warnings:
test_mpam_devices.c:327:42: warning: incorrect type in initializer (different address spaces)
test_mpam_devices.c:327:42: expected char [noderef] __iomem *buf
test_mpam_devices.c:327:42: got void *
test_mpam_devices.c:342:24: warning: cast removes address space '__iomem' of expression
Cast the pointer to memory via __force to silence them.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512160133.eAzPdJv2-lkp@intel.com/
Acked-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Accesses to MSC must be made from a cpu that is affine to that MSC and the
driver checks this in __mpam_write_reg() using smp_processor_id(). A fake
in-memory MSC is used for testing. When using that, it doesn't matter which
cpu we access it from but calling smp_processor_id() from a preemptible
context gives warnings when running with CONFIG_DEBUG_PREEMPT.
Add a test helper that wraps mpam_reset_msc_bitmap() with preemption
disabled to ensure all (fake) MSC accesses are made with preemption
disabled.
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
When an MSC supporting memory bandwidth monitoring is brought offline and
then online, mpam_restore_mbwu_state() calls __ris_msmon_read() via ipi to
restore the configuration of the bandwidth counters. It doesn't care about
the value read, mbwu_arg.val, and doesn't set it leading to a null pointer
dereference when __ris_msmon_read() adds to it. This results in a kernel
oops with a call trace such as:
Call trace:
__ris_msmon_read+0x19c/0x64c (P)
mpam_restore_mbwu_state+0xa0/0xe8
smp_call_on_cpu_callback+0x1c/0x38
process_one_work+0x154/0x4b4
worker_thread+0x188/0x310
kthread+0x11c/0x130
ret_from_fork+0x10/0x20
Provide a local variable for val to avoid __ris_msmon_read() dereferencing
a null pointer when adding to val.
Fixes: 41e8a14950 ("arm_mpam: Track bandwidth counter state for power management")
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
DW_CFA_advance_loc4 is defined but no handler is implemented. Its
CFA opcode defaults to EDYNSCS_INVALID_CFA_OPCODE triggering an
error which wrongfully prevents modules from loading.
Link: https://bugs.gentoo.org/971060
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
Signed-off-by: Will Deacon <will@kernel.org>
If we log the parent directory of a conflicting inode, we are not logging
the new dentries of the directory, so when we finish we have the parent
directory's inode marked as logged but we did not log its new dentries.
As a consequence if the parent directory is explicitly fsynced later and
it does not have any new changes since we logged it, the fsync is a no-op
and after a power failure the new dentries are missing.
Example scenario:
$ mkdir foo
$ sync
$rmdir foo
$ mkdir dir1
$ mkdir dir2
# A file with the same name and parent as the directory we just deleted
# and was persisted in a past transaction. So the deleted directory's
# inode is a conflicting inode of this new file's inode.
$ touch foo
$ ln foo dir2/link
# The fsync on dir2 will log the parent directory (".") because the
# conflicting inode (deleted directory) does not exists anymore, but it
# it does not log its new dentries (dir1).
$ xfs_io -c "fsync" dir2
# This fsync on the parent directory is no-op, since the previous fsync
# logged it (but without logging its new dentries).
$ xfs_io -c "fsync" .
<power failure>
# After log replay dir1 is missing.
Fix this by ensuring we log new dir dentries whenever we log the parent
directory of a no longer existing conflicting inode.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/182055fa-e9ce-4089-9f5f-4b8a23e8dd91@gmail.com/
Fixes: a3baaf0d78 ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function `btrfs_relocate_chunk()` always passes verbose=true to
`btrfs_relocate_block_group()` instead of the `verbose` parameter passed
into it by it's callers.
While user initiated rebalancing should be logged in the Kernel's log
buffer. This causes excessive log spamming from automatic rebalancing,
e.g. on zoned filesystems running low on usable space.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After cancel_delayed_work_sync() is called from
xfrm_nat_keepalive_net_fini(), xfrm_state_fini() flushes remaining
states via __xfrm_state_delete(), which calls
xfrm_nat_keepalive_state_updated() to re-schedule nat_keepalive_work.
The following is a simple race scenario:
cpu0 cpu1
cleanup_net() [Round 1]
ops_undo_list()
xfrm_net_exit()
xfrm_nat_keepalive_net_fini()
cancel_delayed_work_sync(nat_keepalive_work);
xfrm_state_fini()
xfrm_state_flush()
xfrm_state_delete(x)
__xfrm_state_delete(x)
xfrm_nat_keepalive_state_updated(x)
schedule_delayed_work(nat_keepalive_work);
rcu_barrier();
net_complete_free();
net_passive_dec(net);
llist_add(&net->defer_free_list, &defer_free_list);
cleanup_net() [Round 2]
rcu_barrier();
net_complete_free()
kmem_cache_free(net_cachep, net);
nat_keepalive_work()
// on freed net
To prevent this, cancel_delayed_work_sync() is replaced with
disable_delayed_work_sync().
Fixes: f531d13bdf ("xfrm: support sending NAT keepalives in ESP in UDP states")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The lcd2s_print() and lcd2s_gotoxy() functions currently ignore the
return value of lcd2s_i2c_master_send(), which can fail. This can lead
to silent data loss or incorrect cursor positioning.
Add proper error checking: if the number of bytes sent does not match
the expected length, return -EIO; otherwise propagate any error code
from the I2C transfer.
Signed-off-by: Wang Jun <1742789905@qq.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
The alternate screen support added by commit 23743ba647 ("vt: add
support for smput/rmput escape codes") only saves and restores the
regular screen buffer (vc_origin), but completely ignores the corresponding
unicode screen buffer (vc_uni_lines) creating a messed-up display.
Add vc_saved_uni_lines to save the unicode screen buffer when entering
the alternate screen, and restore it when leaving. Also ensure proper
cleanup in reset_terminal() and vc_deallocate().
Fixes: 23743ba647 ("vt: add support for smput/rmput escape codes")
Cc: stable <stable@kernel.org>
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Link: https://patch.msgid.link/5o2p6qp3-91pq-0p17-or02-1oors4417ns7@onlyvoer.pbz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sabrina Dubroca says:
====================
xfrm: fix most sparse warnings
This series fixes most of the sparse warnings currently reported about
RCU pointers for files under net/xfrm. There's no actual bug in the
current code, we only need to use the correct helpers in each context.
====================
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Blamed commits forgot that vxlan/geneve use udp_tunnel[6]_xmit_skb() which
call iptunnel_xmit_stats().
iptunnel_xmit_stats() was assuming tunnels were only using
NETDEV_PCPU_STAT_TSTATS.
@syncp offset in pcpu_sw_netstats and pcpu_dstats is different.
32bit kernels would either have corruptions or freezes if the syncp
sequence was overwritten.
This patch also moves pcpu_stat_type closer to dev->{t,d}stats to avoid
a potential cache line miss since iptunnel_xmit_stats() needs to read it.
Fixes: 6fa6de3022 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8 ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260311123110.1471930-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a peer MEP is being deleted, cancel_delayed_work_sync() is called
on ccm_rx_dwork before freeing. However, br_cfm_frame_rx() runs in
softirq context under rcu_read_lock (without RTNL) and can re-schedule
ccm_rx_dwork via ccm_rx_timer_start() between cancel_delayed_work_sync()
returning and kfree_rcu() being called.
The following is a simple race scenario:
cpu0 cpu1
mep_delete_implementation()
cancel_delayed_work_sync(ccm_rx_dwork);
br_cfm_frame_rx()
// peer_mep still in hlist
if (peer_mep->ccm_defect)
ccm_rx_timer_start()
queue_delayed_work(ccm_rx_dwork)
hlist_del_rcu(&peer_mep->head);
kfree_rcu(peer_mep, rcu);
ccm_rx_work_expired()
// on freed peer_mep
To prevent this, cancel_delayed_work_sync() is replaced with
disable_delayed_work_sync() in both peer MEP deletion paths, so
that subsequent queue_delayed_work() calls from br_cfm_frame_rx()
are silently rejected.
The cc_peer_disable() helper retains cancel_delayed_work_sync()
because it is also used for the CC enable/disable toggle path where
the work must remain re-schedulable.
Fixes: dc32cbb3db ("bridge: cfm: Kernel space implementation of CFM. CCM frame RX added.")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/abBgYT5K_FI9rD1a@v4bel
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Igor Ushakov reported that GC purged the receive queue of
an alive socket due to a race with MSG_PEEK with a nice repro.
This is the exact same issue previously fixed by commit
cbcf01128d ("af_unix: fix garbage collect vs MSG_PEEK").
After GC was replaced with the current algorithm, the cited
commit removed the locking dance in unix_peek_fds() and
reintroduced the same issue.
The problem is that MSG_PEEK bumps a file refcount without
interacting with GC.
Consider an SCC containing sk-A and sk-B, where sk-A is
close()d but can be recv()ed via sk-B.
The bad thing happens if sk-A is recv()ed with MSG_PEEK from
sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
for sk-A and sk-B.
GC thread User thread
--------- -----------
unix_vertex_dead(sk-A)
-> true <------.
\
`------ recv(sk-B, MSG_PEEK)
invalidate !! -> sk-A's file refcount : 1 -> 2
close(sk-B)
-> sk-B's file refcount : 2 -> 1
unix_vertex_dead(sk-B)
-> true
Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
recvq. GC thinks sk-A is dead because the file refcount is the
same as the number of its inflight fds.
However, sk-A's file refcount is bumped silently by MSG_PEEK,
which invalidates the previous evaluation.
At this moment, sk-B's file refcount is 2; one by the open fd,
and one by the inflight fd in sk-A. The subsequent close()
releases one refcount by the former.
Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
One option is to restore the locking dance in unix_peek_fds(),
but we can resolve this more elegantly thanks to the new algorithm.
The point is that the issue does not occur without the subsequent
close() and we actually do not need to synchronise MSG_PEEK with
the dead SCC detection.
When the issue occurs, close() and GC touch the same file refcount.
If GC sees the refcount being decremented by close(), it can just
give up garbage-collecting the SCC.
Therefore, we only need to signal the race during MSG_PEEK with
a proper memory barrier to make it visible to the GC.
Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
it defer the SCC to the next run.
This way no locking is needed on the MSG_PEEK side, and we can
avoid imposing a penalty on every MSG_PEEK unnecessarily.
Note that we can retry within unix_scc_dead() if MSG_PEEK is
detected, but we do not do so to avoid hung task splat from
abusive MSG_PEEK calls.
Fixes: 118f457da9 ("af_unix: Remove lock dance in unix_peek_fds().")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260311054043.1231316-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
l2cap_information_rsp() checks that cmd_len covers the fixed
l2cap_info_rsp header (type + result, 4 bytes) but then reads
rsp->data without verifying that the payload is present:
- L2CAP_IT_FEAT_MASK calls get_unaligned_le32(rsp->data), which reads
4 bytes past the header (needs cmd_len >= 8).
- L2CAP_IT_FIXED_CHAN reads rsp->data[0], 1 byte past the header
(needs cmd_len >= 5).
A truncated L2CAP_INFO_RSP with result == L2CAP_IR_SUCCESS triggers an
out-of-bounds read of adjacent skb data.
Guard each data access with the required payload length check. If the
payload is too short, skip the read and let the state machine complete
with safe defaults (feat_mask and remote_fixed_chan remain zero from
kzalloc), so the info timer cleanup and l2cap_conn_start() still run
and the connection is not stalled.
Fixes: 4e8402a3f8 ("[Bluetooth] Retrieve L2CAP features mask on connection setup")
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_ecred_reconf_rsp() casts the incoming data to struct
l2cap_ecred_conn_rsp (the ECRED *connection* response, 8 bytes with
result at offset 6) instead of struct l2cap_ecred_reconf_rsp (2 bytes
with result at offset 0).
This causes two problems:
- The sizeof(*rsp) length check requires 8 bytes instead of the
correct 2, so valid L2CAP_ECRED_RECONF_RSP packets are rejected
with -EPROTO.
- rsp->result reads from offset 6 instead of offset 0, returning
wrong data when the packet is large enough to pass the check.
Fix by using the correct type. Also pass the already byte-swapped
result variable to BT_DBG instead of the raw __le16 field.
Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
After commit ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in
hci_chan_del"), l2cap_conn_del() uses conn->lock to protect access to
conn->users. However, l2cap_register_user() and l2cap_unregister_user()
don't use conn->lock, creating a race condition where these functions can
access conn->users and conn->hchan concurrently with l2cap_conn_del().
This can lead to use-after-free and list corruption bugs, as reported
by syzbot.
Fix this by changing l2cap_register_user() and l2cap_unregister_user()
to use conn->lock instead of hci_dev_lock(), ensuring consistent locking
for the l2cap_conn structure.
Reported-by: syzbot+14b6d57fb728e27ce23c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=14b6d57fb728e27ce23c
Fixes: ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in hci_chan_del")
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs") introduced
mgmt_pending_valid(), which not only validates the pending command but
also unlinks it from the pending list if it is valid. This change in
semantics requires updates to several completion handlers to avoid list
corruption and memory safety issues.
This patch addresses two left-over issues from the aforementioned rework:
1. In mgmt_add_adv_patterns_monitor_complete(), mgmt_pending_remove()
is replaced with mgmt_pending_free() in the success path. Since
mgmt_pending_valid() already unlinks the command at the beginning of
the function, calling mgmt_pending_remove() leads to a double list_del()
and subsequent list corruption/kernel panic.
2. In set_mesh_complete(), the use of mgmt_pending_foreach() in the error
path is removed. Since the current command is already unlinked by
mgmt_pending_valid(), this foreach loop would incorrectly target other
pending mesh commands, potentially freeing them while they are still being
processed concurrently (leading to UAFs). The redundant mgmt_cmd_status()
is also simplified to use cmd->opcode directly.
Fixes: 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Wang Tao <wangtao554@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
While introducing hci_le_create_conn_sync the functionality
of hci_connect_le was ported to hci_le_create_conn_sync including
the disable of the scan before starting the connection.
When this code was run non synchronously the immediate call that was
setting the flag HCI_LE_SCAN_INTERRUPTED had an impact. Since the
completion handler for the LE_SCAN_DISABLE was not immediately called.
In the completion handler of the LE_SCAN_DISABLE event, this flag is
checked to set the state of the hdev to DISCOVERY_STOPPED.
With the synchronised approach the later setting of the
HCI_LE_SCAN_INTERRUPTED flag has not the same effect. The completion
handler would immediately fire in the LE_SCAN_DISABLE call, check for
the flag, which is then not yet set and do nothing.
To fix this issue and make the function call work as before, we move the
setting of the flag HCI_LE_SCAN_INTERRUPTED before disabling the scan.
Fixes: 8e8b92ee60 ("Bluetooth: hci_sync: Add hci_le_create_conn_sync")
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
iso-tester defer tests seem to fail with hci_conn_hash_lookup_cig
being unable to resolve a cig in set_cig_params_sync due a race
where it is run immediatelly before hci_bind_cis is able to set
the QoS settings into the hci_conn object.
So this moves the assigning of the QoS settings to be done directly
by hci_le_set_cig_params to prevent that from happening again.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The last test step ("Test with Invalid public key X and Y, all set to
0") expects to get an "DHKEY check failed" instead of "unspecified".
Fixes: 6d19628f53 ("Bluetooth: SMP: Fail if remote and local public keys are identical")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Core 6.0, Vol 3, Part A, 3.4.3:
"... If the sum of the payload sizes for the K-frames exceeds the
specified SDU length, the receiver shall disconnect the channel."
This fixes L2CAP/LE/CFC/BV-27-C (running together with 'l2test -r -P
0x0027 -V le_public').
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Core 6.0, Vol 3, Part A, 3.4.3:
"If the SDU length field value exceeds the receiver's MTU, the receiver
shall disconnect the channel..."
This fixes L2CAP/LE/CFC/BV-26-C (running together with 'l2test -r -P
0x0027 -V le_public -I 100').
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
DW UART cannot write to LCR, DLL, and DLH while BUSY is asserted.
Existance of BUSY depends on uart_16550_compatible, if UART HW is
configured with it those registers can always be written.
There currently is dw8250_force_idle() which attempts to achieve
non-BUSY state by disabling FIFO, however, the solution is unreliable
when Rx keeps getting more and more characters.
Create a sequence of operations that ensures UART cannot keep BUSY
asserted indefinitely. The new sequence relies on enabling loopback mode
temporarily to prevent incoming Rx characters keeping UART BUSY.
Ensure no Tx in ongoing while the UART is switches into the loopback
mode (requires exporting serial8250_fifo_wait_for_lsr_thre() and adding
DMA Tx pause/resume functions).
According to tests performed by Adriana Nicolae <adriana@arista.com>,
simply disabling FIFO or clearing FIFOs only once does not always
ensure BUSY is deasserted but up to two tries may be needed. This could
be related to ongoing Rx of a character (a guess, not known for sure).
Therefore, retry FIFO clearing a few times (retry limit 4 is arbitrary
number but using, e.g., p->fifosize seems overly large). Tests
performed by others did not exhibit similar challenge but it does not
seem harmful to leave the FIFO clearing loop in place for all DW UARTs
with BUSY functionality.
Use the new dw8250_idle_enter/exit() to do divisor writes and LCR
writes. In case of plain LCR writes, opportunistically try to update
LCR first and only invoke dw8250_idle_enter() if the write did not
succeed (it has been observed that in practice most LCR writes do
succeed without complications).
This issue was first reported by qianfan Zhao who put lots of debugging
effort into understanding the solution space.
Fixes: c49436b657 ("serial: 8250_dw: Improve unwritable LCR workaround")
Fixes: 7d4008ebb1 ("tty: add a DesignWare 8250 driver")
Cc: stable <stable@kernel.org>
Reported-by: qianfan Zhao <qianfanguijin@163.com>
Link: https://lore.kernel.org/linux-serial/289bb78a-7509-1c5c-2923-a04ed3b6487d@163.com/
Reported-by: Adriana Nicolae <adriana@arista.com>
Link: https://lore.kernel.org/linux-serial/20250819182322.3451959-1-adriana@arista.com/
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-8-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When DW UART is !uart_16550_compatible, it can indicate BUSY at any
point (when under constant Rx pressure) unless a complex sequence of
steps is performed. Any LCR write can run a foul with the condition
that prevents writing LCR while the UART is BUSY, which triggers
BUSY_DETECT interrupt that seems unmaskable using IER bits.
Normal flow is that dw8250_handle_irq() handles BUSY_DETECT condition
by reading USR register. This BUSY feature, however, breaks the
assumptions made in serial8250_do_shutdown(), which runs
synchronize_irq() after clearing IER and assumes no interrupts can
occur after that point but then proceeds to update LCR, which on DW
UART can trigger an interrupt.
If serial8250_do_shutdown() releases the interrupt handler before the
handler has run and processed the BUSY_DETECT condition by read the USR
register, the IRQ is not deasserted resulting in interrupt storm that
triggers "irq x: nobody cared" warning leading to disabling the IRQ.
Add late synchronize_irq() into serial8250_do_shutdown() to ensure
BUSY_DETECT from DW UART is handled before port's interrupt handler is
released. Alternative would be to add DW UART specific shutdown
function but it would mostly duplicate the generic code and the extra
synchronize_irq() seems pretty harmless in serial8250_do_shutdown().
Fixes: 7d4008ebb1 ("tty: add a DesignWare 8250 driver")
Cc: stable <stable@kernel.org>
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-7-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
INTC10EE UART can end up into an interrupt storm where it reports
IIR_NO_INT (0x1). If the storm happens during active UART operation, it
is promptly stopped by IIR value change due to Rx or Tx events.
However, when there is no activity, either due to idle serial line or
due to specific circumstances such as during shutdown that writes
IER=0, there is nothing to stop the storm.
During shutdown the storm is particularly problematic because
serial8250_do_shutdown() calls synchronize_irq() that will hang in
waiting for the storm to finish which never happens.
This problem can also result in triggering a warning:
irq 45: nobody cared (try booting with the "irqpoll" option)
[...snip...]
handlers:
serial8250_interrupt
Disabling IRQ #45
Normal means to reset interrupt status by reading LSR, MSR, USR, or RX
register do not result in the UART deasserting the IRQ.
Add a quirk to INTC10EE UARTs to enable Tx interrupts if UART's Tx is
currently empty and inactive. Rework IIR_NO_INT to keep track of the
number of consecutive IIR_NO_INT, and on fourth one perform the quirk.
Enabling Tx interrupts should change IIR value from IIR_NO_INT to
IIR_THRI which has been observed to stop the storm.
Fixes: e92fad0249 ("serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART")
Cc: stable <stable@kernel.org>
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-6-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dw8250_handle_irq() takes port's lock multiple times with no good
reason to release it in between and calls serial8250_handle_irq()
that also takes port's lock.
Take port's lock only once in dw8250_handle_irq() and use
serial8250_handle_irq_locked() to avoid releasing port's lock in
between.
As IIR_NO_INT check in serial8250_handle_irq() was outside of port's
lock, it has to be done already in dw8250_handle_irq().
DW UART can, in addition to IIR_NO_INT, report BUSY_DETECT (0x7) which
collided with the IIR_NO_INT (0x1) check in serial8250_handle_irq()
(because & is used instead of ==) meaning that no other work is done by
serial8250_handle_irq() during an BUSY_DETECT interrupt.
This allows reorganizing code in dw8250_handle_irq() to do both
IIR_NO_INT and BUSY_DETECT handling right at the start simplifying
the logic.
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Cc: stable <stable@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-5-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8250_port exports serial8250_handle_irq() to HW specific 8250 drivers.
It takes port's lock within but a HW specific 8250 driver may want to
take port's lock itself, do something, and then call the generic
handler in 8250_port but to do that, the caller has to release port's
lock for no good reason.
Introduce serial8250_handle_irq_locked() which a HW specific driver can
call while already holding port's lock.
As this is new export, put it straight into a namespace (where all 8250
exports should eventually be moved).
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Cc: stable <stable@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-4-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
uart_write_room() and uart_write() behave inconsistently when
xmit_buf is NULL (which happens for PORT_UNKNOWN ports that were
never properly initialized):
- uart_write_room() returns kfifo_avail() which can be > 0
- uart_write() checks xmit_buf and returns 0 if NULL
This inconsistency causes an infinite loop in drivers that rely on
tty_write_room() to determine if they can write:
while (tty_write_room(tty) > 0) {
written = tty->ops->write(...);
// written is always 0, loop never exits
}
For example, caif_serial's handle_tx() enters an infinite loop when
used with PORT_UNKNOWN serial ports, causing system hangs.
Fix by making uart_write_room() also check xmit_buf and return 0 if
it's NULL, consistent with uart_write().
Reproducer: https://gist.github.com/mrpre/d9a694cc0e19828ee3bc3b37983fde13
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Cc: stable <stable@kernel.org>
Link: https://patch.msgid.link/20260204074327.226165-1-jiayuan.chen@linux.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ulite_probe() calls pm_runtime_put_autosuspend() at the end of probe
without holding a corresponding PM runtime reference for non-console
ports.
During ulite_assign(), uart_add_one_port() triggers uart_configure_port()
which calls ulite_pm() via uart_change_pm(). For non-console ports, the
UART core performs a balanced get/put cycle:
uart_change_pm(ON) -> ulite_pm() -> pm_runtime_get_sync() +1
uart_change_pm(OFF) -> ulite_pm() -> pm_runtime_put_autosuspend() -1
This leaves no spare reference for the pm_runtime_put_autosuspend() at
the end of probe. The PM runtime core prevents the count from actually
going below zero, and instead triggers a
"Runtime PM usage count underflow!" warning.
For console ports the bug is masked: the UART core skips the
uart_change_pm(OFF) call, so the UART core's unbalanced get happens to
pair with probe's trailing put.
Add pm_runtime_get_noresume() before pm_runtime_enable() to take an
explicit probe-owned reference that the trailing
pm_runtime_put_autosuspend() can release. This ensures a correct usage
count regardless of whether the port is a console.
Fixes: 5bbe10a694 ("tty: serial: uartlite: Add runtime pm support")
Cc: stable <stable@kernel.org>
Signed-off-by: Maciej Andrzejewski ICEYE <maciej.andrzejewski@m-works.net>
Link: https://patch.msgid.link/20260305123746.4152800-1-maciej.andrzejewski@m-works.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 039d492637 ("serial: 8250: Toggle IER bits on only after irq
has been set up") moved IRQ setup before the THRE test, in combination
with commit 205d300aea ("serial: 8250: change lock order in
serial8250_do_startup()") the interrupt handler can run during the
test and race with its IIR reads. This can produce wrong THRE test
results and cause spurious registration of the
serial8250_backup_timeout timer. Unconditionally disable the IRQ for
the short duration of the test and re-enable it afterwards to avoid
the race.
Fixes: 039d492637 ("serial: 8250: Toggle IER bits on only after irq has been set up")
Depends-on: 205d300aea ("serial: 8250: change lock order in serial8250_do_startup()")
Cc: stable <stable@kernel.org>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Alban Bedel <alban.bedel@lht.dlh.de>
Tested-by: Maximilian Lueer <maximilian.lueer@lht.dlh.de>
Link: https://patch.msgid.link/20260224121639.579404-1-alban.bedel@lht.dlh.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
`dmaengine_terminate_async` does not guarantee that the
`__dma_tx_complete` callback will run. The callback is currently the
only place where `dma->tx_running` gets cleared. If the transaction is
canceled and the callback never runs, then `dma->tx_running` will never
get cleared and we will never schedule new TX DMA transactions again.
This change makes it so we clear `dma->tx_running` after we terminate
the DMA transaction. This is "safe" because `serial8250_tx_dma_flush`
is holding the UART port lock. The first thing the callback does is also
grab the UART port lock, so access to `dma->tx_running` is serialized.
Fixes: 9e512eaaf8 ("serial: 8250: Fix fifo underflow on flush")
Cc: stable <stable@kernel.org>
Signed-off-by: Raul E Rangel <rrangel@google.com>
Link: https://patch.msgid.link/20260209135815.1.I16366ecb0f62f3c96fe3dd5763fcf6f3c2b4d8cd@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The wrong value of the number of domains is wrong which leads to
failures when trying to enumerate nested power domains.
PM: genpd_xlate_onecell: invalid domain index 0
PM: genpd_xlate_onecell: invalid domain index 1
PM: genpd_xlate_onecell: invalid domain index 3
PM: genpd_xlate_onecell: invalid domain index 4
PM: genpd_xlate_onecell: invalid domain index 5
PM: genpd_xlate_onecell: invalid domain index 13
PM: genpd_xlate_onecell: invalid domain index 14
Attempts to use these power domains fail, so fix this by
using the correct value of calculated power domains.
Signed-off-by: Adam Ford <aford173@gmail.com>
Fixes: 88914db077 ("pmdomain: mediatek: Add support for Hardware Voter power domains")
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Upon resuming from suspend, the Touch Bar driver was missing a resume
method in order to restore the original mode the Touch Bar was on before
suspending. It is the same as the reset_resume method.
[jkosina@suse.com: rebased on top of the pm_ptr() conversion]
Cc: stable@vger.kernel.org
Signed-off-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The Logitech MX Master 4 can be connected over bluetooth or through a
Logitech Bolt receiver. This change adds support for non-standard HID
features, such as high resolution scrolling when the mouse is connected
over bluetooth.
Because no Logitech Bolt receiver driver exists yet those features
won't be available when the mouse is connected through the receiver.
Signed-off-by: Adrian Freund <adrian@freund.io>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
When running the command:
'perf record -e "{instructions,instructions:p}" -j any,counter sleep 1',
a "shift-out-of-bounds" warning is reported on CWF.
UBSAN: shift-out-of-bounds in /kbuild/src/consumer/arch/x86/events/intel/lbr.c:970:15
shift exponent 64 is too large for 64-bit type 'long long unsigned int'
......
intel_pmu_lbr_counters_reorder.isra.0.cold+0x2a/0xa7
intel_pmu_lbr_save_brstack+0xc0/0x4c0
setup_arch_pebs_sample_data+0x114b/0x2400
The warning occurs because the second "instructions:p" event, which
involves branch counters sampling, is incorrectly programmed to fixed
counter 0 instead of the general-purpose (GP) counters 0-3 that support
branch counters sampling. Currently only GP counters 0-3 support branch
counters sampling on CWF, any event involving branch counters sampling
should be programed on GP counters 0-3. Since the counter index of fixed
counter 0 is 32, it leads to the "src" value in below code is right
shifted 64 bits and trigger the "shift-out-of-bounds" warning.
cnt = (src >> (order[j] * LBR_INFO_BR_CNTR_BITS)) & LBR_INFO_BR_CNTR_MASK;
The root cause is the loss of the branch counters constraint for the
new event in the branch counters sampling event group. Since it isn't
yet part of the sibling list. This results in the second
"instructions:p" event being programmed on fixed counter 0 incorrectly
instead of the appropriate GP counters 0-3.
To address this, we apply the missing branch counters constraint for
the last event in the group. Additionally, we introduce a new function,
`intel_set_branch_counter_constr()`, to apply the branch counters
constraint and avoid code duplication.
Fixes: 3374491619 ("perf/x86/intel: Support branch counters logging")
Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260228053320.140406-2-dapeng1.mi@linux.intel.com
Cc: stable@vger.kernel.org
Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access
when group_sched_in() fails and needs to roll back.
This *should* be handled by the transaction callbacks, but he found that when
the group leader is a software event, the transaction handlers of the wrong PMU
are used. Despite the move_group case in perf_event_open() and group_sched_in()
using pmu_ctx->pmu.
Turns out, inherit uses event->pmu to clone the events, effectively undoing the
move_group case for all inherited contexts. Fix this by also making inherit use
pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context.
Similarly, __perf_event_read() should use equally use pmu_ctx->pmu for the
group case.
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net
Both Mi Dapeng and Ian Rogers noted that not everything that sets HES_STOPPED
is required to EF_UPDATE. Specifically the 'step 1' loop of rescheduling
explicitly does EF_UPDATE to ensure the counter value is read.
However, then 'step 2' simply leaves the new counter uninitialized when
HES_STOPPED, even though, as noted above, the thing that stopped them might not
be aware it needs to EF_RELOAD -- since it didn't EF_UPDATE on stop.
One such location that is affected is throttling, throttle does pmu->stop(, 0);
and unthrottle does pmu->start(, 0); possibly restarting an uninitialized counter.
Fixes: a4eaf7f146 ("perf: Rework the PMU methods")
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260311204035.GX606826@noisy.programming.kicks-ass.net
A production AMD EPYC system crashed with a NULL pointer dereference
in the PMU NMI handler:
BUG: kernel NULL pointer dereference, address: 0000000000000198
RIP: x86_perf_event_update+0xc/0xa0
Call Trace:
<NMI>
amd_pmu_v2_handle_irq+0x1a6/0x390
perf_event_nmi_handler+0x24/0x40
The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0,
corresponding to the `if (unlikely(!hwc->event_base))` check in
x86_perf_event_update() where hwc = &event->hw and event is NULL.
drgn inspection of the vmcore on CPU 106 showed a mismatch between
cpuc->active_mask and cpuc->events[]:
active_mask: 0x1e (bits 1, 2, 3, 4)
events[1]: 0xff1100136cbd4f38 (valid)
events[2]: 0x0 (NULL, but active_mask bit 2 set)
events[3]: 0xff1100076fd2cf38 (valid)
events[4]: 0xff1100079e990a90 (valid)
The event that should occupy events[2] was found in event_list[2]
with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run
(which clears hw.state and sets active_mask) but events[2] was
never populated.
Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH),
showing it was stopped when the PMU rescheduled events, confirming the
throttle-then-reschedule sequence occurred.
The root cause is commit 7e772a93eb ("perf/x86: Fix NULL event access
and potential PEBS record loss") which moved the cpuc->events[idx]
assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(),
after the PERF_HES_ARCH check. This broke any path that calls
pmu->start() without going through x86_pmu_enable() -- specifically
the unthrottle path:
perf_adjust_freq_unthr_events()
-> perf_event_unthrottle_group()
-> perf_event_unthrottle()
-> event->pmu->start(event, 0)
-> x86_pmu_start() // sets active_mask but not events[]
The race sequence is:
1. A group of perf events overflows, triggering group throttle via
perf_event_throttle_group(). All events are stopped: active_mask
bits cleared, events[] preserved (x86_pmu_stop no longer clears
events[] after commit 7e772a93eb).
2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs
due to other scheduling activity. Stopped events that need to
move counters get PERF_HES_ARCH set and events[old_idx] cleared.
In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events
to be skipped -- events[new_idx] is never set.
3. The timer tick unthrottles the group via pmu->start(). Since
commit 7e772a93eb removed the events[] assignment from
x86_pmu_start(), active_mask[new_idx] is set but events[new_idx]
remains NULL.
4. A PMC overflow NMI fires. The handler iterates active counters,
finds active_mask[2] set, reads events[2] which is NULL, and
crashes dereferencing it.
Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to
before the PERF_HES_ARCH check, so that events[] is populated even
for events that are not immediately started. This ensures the
unthrottle path via pmu->start() always finds a valid event pointer.
Fixes: 7e772a93eb ("perf/x86: Fix NULL event access and potential PEBS record loss")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310-perf-v2-1-4a3156fce43c@debian.org
There are two versions of the __this_cpu_local_lock() definitions in
include/linux/local_lock_internal.h: one version that relies on the
Clang overloading functionality and another version that does not.
Select the latter version when using sparse. This patch fixes the
following errors reported by sparse:
include/linux/local_lock_internal.h:331:40: sparse: sparse: multiple definitions for function '__this_cpu_local_lock'
include/linux/local_lock_internal.h:325:37: sparse: the previous one is here
Closes: https://lore.kernel.org/oe-kbuild-all/202603062334.wgI5htP0-lkp@intel.com/
Fixes: d3febf16de ("locking/local_lock: Support Clang's context analysis")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260311231455.1961413-1-bvanassche@acm.org
net->xfrm.nlsk is used in 2 types of contexts:
- fully under RCU, with rcu_read_lock + rcu_dereference and a NULL check
- in the netlink handlers, with requests coming from a userspace socket
In the 2nd case, net->xfrm.nlsk is guaranteed to stay non-NULL and the
object is alive, since we can't enter the netns destruction path while
the user socket holds a reference on the netns.
After adding the __rcu annotation to netns_xfrm.nlsk (which silences
sparse warnings in the RCU users and __net_init code), we need to tell
sparse that the 2nd case is safe. Add a helper for that.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In xfrm_policy_init:
add rcu_assign_pointer to fix warning:
net/xfrm/xfrm_policy.c:4238:29: warning: incorrect type in assignment (different address spaces)
net/xfrm/xfrm_policy.c:4238:29: expected struct hlist_head [noderef] __rcu *table
net/xfrm/xfrm_policy.c:4238:29: got struct hlist_head *
add rcu_dereference_protected to silence warning:
net/xfrm/xfrm_policy.c:4265:36: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4265:36: expected struct hlist_head *n
net/xfrm/xfrm_policy.c:4265:36: got struct hlist_head [noderef] __rcu *table
The netns is being created, no concurrent access is possible yet.
In xfrm_policy_fini, net is going away, there shouldn't be any
concurrent changes to the hashtables, so we can use
rcu_dereference_protected to silence warnings:
net/xfrm/xfrm_policy.c:4291:17: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4291:17: expected struct hlist_head const *h
net/xfrm/xfrm_policy.c:4291:17: got struct hlist_head [noderef] __rcu *table
net/xfrm/xfrm_policy.c:4292:36: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4292:36: expected struct hlist_head *n
net/xfrm/xfrm_policy.c:4292:36: got struct hlist_head [noderef] __rcu *table
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Silence sparse warnings in xfrm_state_fini:
net/xfrm/xfrm_state.c:3327:9: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:3327:9: expected struct hlist_head const *h
net/xfrm/xfrm_state.c:3327:9: got struct hlist_head [noderef] __rcu *state_byseq
Add xfrm_state_deref_netexit() to wrap those calls. The netns is going
away, we don't have to worry about the state_by* pointers being
changed behind our backs.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm_state_lookup_spi_proto is called under xfrm_state_lock by
xfrm_alloc_spi, no need to take a reference on the state and pretend
to be under RCU.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We're under xfrm_state_lock for all those walks, we can use
xfrm_state_deref_prot to silence sparse warnings such as:
net/xfrm/xfrm_state.c:933:17: warning: dereference of noderef expression
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Use rcu_assign_pointer, and tmp variables for freeing on the error
path without accessing net->xfrm.state_by*.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In all callers, x is not an __rcu pointer. We can drop the annotation to
avoid sparse warnings:
net/xfrm/xfrm_state.c:58:39: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:58:39: expected struct refcount_struct [usertype] *r
net/xfrm/xfrm_state.c:58:39: got struct refcount_struct [noderef] __rcu *
net/xfrm/xfrm_state.c:1166:42: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:1166:42: expected struct xfrm_state [noderef] __rcu *x
net/xfrm/xfrm_state.c:1166:42: got struct xfrm_state *[assigned] x
(repeated for each caller)
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
For unstated reasons, function mshv_partition_ioctl_set_memory passes
struct mshv_user_mem_region by value instead of by reference. Change
it to pass by reference.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
hv_hvcrash_ctxt_save() in arch/x86/hyperv/hv_crash.c currently saves
segment registers via a general-purpose register (%eax). Update the
code to save segment registers (cs, ss, ds, es, fs, gs) directly to
the crash context memory using movw. This avoids unnecessary use of
a general-purpose register, making the code simpler and more efficient.
The size of the corresponding object file improves as follows:
text data bss dec hex filename
4167 176 200 4543 11bf hv_crash-old.o
4151 176 200 4527 11af hv_crash-new.o
No functional change occurs to the saved context contents; this is
purely a code-quality improvement.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
hv_crash_c_entry() is a C function that is entered without a stack,
and this is only allowed for functions that have the __naked attribute,
which informs the compiler that it must not emit the usual prologue and
epilogue or emit any other kind of instrumentation that relies on a
stack frame.
So split up the function, and set the __naked attribute on the initial
part that sets up the stack, GDT, IDT and other pieces that are needed
for ordinary C execution. Given that function calls are not permitted
either, use the existing long return coded in an asm() block to call the
second part of the function, which is an ordinary function that is
permitted to call other functions as usual.
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> # asm parts, not hv parts
Reviewed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: linux-hyperv@vger.kernel.org
Fixes: 94212d3461 ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
This reverts commit 36d6cbb621.
Calling this as a passthrough hypercall leaves the VM in an inconsistent
state. Revert before it is released.
Signed-off-by: Wei Liu <wei.liu@kernel.org>
When oops_panic_write is set, the driver disables interrupts and
switches to PIO polling mode but still falls through into the DMA
path. DMA cannot be used reliably in panic context, so make the
DMA path an else branch to ensure only PIO is used during panic
writes.
Fixes: c1ac2dc34b ("mtd: rawnand: brcmnand: When oops in progress use pio and interrupt polling")
Signed-off-by: Kamal Dasu <kamal.dasu@broadcom.com>
Reviewed-by: William Zhang <william.zhang@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
nand_lock() and nand_unlock() call into chip->ops.lock_area/unlock_area
without holding the NAND device lock. On controllers that implement
SET_FEATURES via multiple low-level PIO commands, these can race with
concurrent UBI/UBIFS background erase/write operations that hold the
device lock, resulting in cmd_pending conflicts on the NAND controller.
Add nand_get_device()/nand_release_device() around the lock/unlock
operations to serialize them against all other NAND controller access.
Fixes: 92270086b7 ("mtd: rawnand: Add support for manufacturer specific lock/unlock operation")
Signed-off-by: Kamal Dasu <kamal.dasu@broadcom.com>
Reviewed-by: William Zhang <william.zhang@broadcom.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Example is wrong, the reg property of the flash is always matching the
node name.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
These properties must be set because they overwrite the default values,
especially #size-cells which is 0 for most controllers and is 'const: 1'
here.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
The description mixes two nodes. There is the controller, and there is
the flash. Describe the flash (which itself can be considered an mtd
device, unlike the top level controller), and move the st,smi-fast-mode
property inside, as this property is flash specific and should not live
in the parent controller node.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Elan touchscreens have a HID-battery device for the stylus which is always
there even if there is no stylus.
This is causing upower to report an empty battery for the stylus and some
desktop-environments will show a notification about this, which is quite
annoying.
Because of this the HID-battery is being ignored on all Elan I2c and USB
touchscreens, but this causes there to be no battery reporting for
the stylus at all.
This adds a new HID_BATTERY_QUIRK_DYNAMIC and uses these for the Elan
touchscreens.
This new quirks causes the present value of the battery to start at 0,
which will make userspace ignore it and only sets present to 1 after
receiving a battery input report which only happens when the stylus
gets in range.
Reported-by: ggrundik@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221118
Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Drop the Asus UX550* touchscreen ignore battery quirks, there is a blanket
HID_BATTERY_QUIRK_IGNORE for all USB_VENDOR_ID_ELAN USB touchscreens now,
so these are just a duplicate of those.
Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The refactoring commit 5fa9b82cbc ("scripts: kconfig:
merge_config.sh: refactor from shell/sed/grep to awk") passes
$TMP_FILE.new as ARGV[3] to awk, using it as both an output destination
and an input file argument. When the base file is empty, nothing is
written to ARGV[3] during processing, so awk fails trying to open it
for reading:
awk: cmd. line:52: fatal: cannot open file
`./.tmp.config.grcQin34jb.new' for reading: No such file or directory
mv: cannot stat './.tmp.config.grcQin34jb.new': No such file or directory
Pass the output path via -v outfile instead and drop the FILENAME ==
ARGV[3] { nextfile }.
Fixes: 5fa9b82cbc ("scripts: kconfig: merge_config.sh: refactor from shell/sed/grep to awk")
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://patch.msgid.link/20260310-fixes-merge-config-v1-1-beaeeaded6bd@samsung.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Commit 60fbb14396 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") return false unconditionally after
split_huge_pmd_locked(). This may fail try_to_migrate() early when
TTU_SPLIT_HUGE_PMD is specified.
The reason is the above commit adjusted try_to_migrate_one() to, when a
PMD-mapped THP entry is found, and TTU_SPLIT_HUGE_PMD is specified (for
example, via unmap_folio()), return false unconditionally. This breaks
the rmap walk and fail try_to_migrate() early, if this PMD-mapped THP is
mapped in multiple processes.
The user sensible impact of this bug could be:
* On memory pressure, shrink_folio_list() may split partially mapped
folio with split_folio_to_list(). Then free unmapped pages without IO.
If failed, it may not be reclaimed.
* On memory failure, memory_failure() would call try_to_split_thp_page()
to split folio contains the bad page. If succeed, the PG_has_hwpoisoned
bit is only set in the after-split folio contains @split_at. By doing
so, we limit bad memory. If failed to split, the whole folios is not
usable.
One way to reproduce:
Create an anonymous THP range and fork 512 children, so we have a
THP shared mapped in 513 processes. Then trigger folio split with
/sys/kernel/debug/split_huge_pages debugfs to split the THP folio to
order 0.
Without the above commit, we can successfully split to order 0. With the
above commit, the folio is still a large folio.
And currently there are two core users of TTU_SPLIT_HUGE_PMD:
* try_to_unmap_one()
* try_to_migrate_one()
try_to_unmap_one() would restart the rmap walk, so only
try_to_migrate_one() is affected.
We can't simply revert commit 60fbb14396 ("mm/huge_memory: adjust
try_to_migrate_one() and split_huge_pmd_locked()"), since it removed some
duplicated check covered by page_vma_mapped_walk().
This patch fixes this by restart page_vma_mapped_walk() after
split_huge_pmd_locked(). Since we cannot simply return "true" to fix the
problem, as that would affect another case:
When invoking folio_try_share_anon_rmap_pmd() from
split_huge_pmd_locked(), the latter can fail and leave a large folio
mapped through PTEs, in which case we ought to return true from
try_to_migrate_one(). This might result in unnecessary walking of the
rmap but is relatively harmless.
Link: https://lkml.kernel.org/r/20260305015006.27343-1-richard.weiyang@gmail.com
Fixes: 60fbb14396 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Gavin Guo <gavinguo@igalia.com>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch. If the
batch has a mix of writable and non-writable bits, we may end up setting
the entire batch writable. Fix this by respecting writable bit during
batching.
Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
lost, preserve it on pte restoration by respecting the bit during
batching, to make the fix consistent w.r.t both writable bit and
soft-dirty bit.
I was able to write the below reproducer and crash the kernel.
Explanation of reproducer (set 64K mTHP to always):
Fault in a 64K large folio. Split the VMA at mid-point with
MADV_DONTFORK. fork() - parent points to the folio with 8 writable ptes
and 8 non-writable ptes. Merge the VMAs with MADV_DOFORK so that
folio_unmap_pte_batch() can determine all the 16 ptes as a batch. Do
MADV_FREE on the range to mark the folio as lazyfree. Write to the memory
to dirty the pte, eventually rmap will dirty the folio. Then trigger
reclaim, we will hit the pte restoration path, and the kernel will crash
with the trace given below.
The BUG happens at:
BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);
The code path is asking for anonymous page to be mapped writable into the
pagetable. The BUG_ON() firing implies that such a writable page has been
mapped into the pagetables of more than one process, which breaks
anonymous memory/CoW semantics.
[ 21.134473] kernel BUG at mm/page_table_check.c:118!
[ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 21.135917] Modules linked in:
[ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
[ 21.136858] Hardware name: linux,dummy-virt (DT)
[ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 21.137308] pc : page_table_check_set+0x28c/0x2a8
[ 21.137607] lr : page_table_check_set+0x134/0x2a8
[ 21.137885] sp : ffff80008a3b3340
[ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
[ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
[ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
[ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
[ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
[ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
[ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
[ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
[ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
[ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
[ 21.141991] Call trace:
[ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
[ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
[ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
[ 21.142766] contpte_set_ptes+0xe8/0x140
[ 21.142907] try_to_unmap_one+0x10c4/0x10d0
[ 21.143177] rmap_walk_anon+0x100/0x250
[ 21.143315] try_to_unmap+0xa0/0xc8
[ 21.143441] shrink_folio_list+0x59c/0x18a8
[ 21.143759] shrink_lruvec+0x664/0xbf0
[ 21.144043] shrink_node+0x218/0x878
[ 21.144285] __node_reclaim.constprop.0+0x98/0x338
[ 21.144763] user_proactive_reclaim+0x2a4/0x340
[ 21.145056] reclaim_store+0x3c/0x60
[ 21.145216] dev_attr_store+0x20/0x40
[ 21.145585] sysfs_kf_write+0x84/0xa8
[ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
[ 21.145994] vfs_write+0x2b8/0x368
[ 21.146119] ksys_write+0x70/0x110
[ 21.146240] __arm64_sys_write+0x24/0x38
[ 21.146380] invoke_syscall+0x50/0x120
[ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
[ 21.146679] do_el0_svc+0x28/0x40
[ 21.146798] el0_svc+0x34/0x110
[ 21.146926] el0t_64_sync_handler+0xa0/0xe8
[ 21.147074] el0t_64_sync+0x198/0x1a0
[ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
[ 21.147440] ---[ end trace 0000000000000000 ]---
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/wait.h>
#include <sched.h>
#include <fcntl.h>
void write_to_reclaim() {
const char *path = "/sys/devices/system/node/node0/reclaim";
const char *value = "409600000000";
int fd = open(path, O_WRONLY);
if (fd == -1) {
perror("open");
exit(EXIT_FAILURE);
}
if (write(fd, value, sizeof("409600000000") - 1) == -1) {
perror("write");
close(fd);
exit(EXIT_FAILURE);
}
printf("Successfully wrote %s to %s\n", value, path);
close(fd);
}
int main()
{
char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if ((unsigned long)ptr != (1UL << 30)) {
perror("mmap");
return 1;
}
/* a 64K folio gets faulted in */
memset(ptr, 0, 1UL << 16);
/* 32K half will not be shared into child */
if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
perror("madvise madv dontfork");
return 1;
}
pid_t pid = fork();
if (pid < 0) {
perror("fork");
return 1;
} else if (pid == 0) {
sleep(15);
} else {
/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
perror("madvise madv fork");
return 1;
}
if (madvise(ptr, (1UL << 16), MADV_FREE)) {
perror("madvise madv free");
return 1;
}
/* dirty the large folio */
(*ptr) += 10;
write_to_reclaim();
// sleep(10);
waitpid(pid, NULL, 0);
}
}
Link: https://lkml.kernel.org/r/20260303061528.2429162-1-dev.jain@arm.com
Fixes: 354dffd295 ("mm: support batched unmap for lazyfree large folios during reclamation")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
move_pages_huge_pmd() handles UFFDIO_MOVE for both normal THPs and huge
zero pages. For the huge zero page path, src_folio is explicitly set to
NULL, and is used as a sentinel to skip folio operations like lock and
rmap.
In the huge zero page branch, src_folio is NULL, so folio_mk_pmd(NULL,
pgprot) passes NULL through folio_pfn() and page_to_pfn(). With
SPARSEMEM_VMEMMAP this silently produces a bogus PFN, installing a PMD
pointing to non-existent physical memory. On other memory models it is a
NULL dereference.
Use page_folio(src_page) to obtain the valid huge zero folio from the
page, which was obtained from pmd_page() and remains valid throughout.
After commit d82d09e482 ("mm/huge_memory: mark PMD mappings of the huge
zero folio special"), moved huge zero PMDs must remain special so
vm_normal_page_pmd() continues to treat them as special mappings.
move_pages_huge_pmd() currently reconstructs the destination PMD in the
huge zero page branch, which drops PMD state such as pmd_special() on
architectures with CONFIG_ARCH_HAS_PTE_SPECIAL. As a result,
vm_normal_page_pmd() can treat the moved huge zero PMD as a normal page
and corrupt its refcount.
Instead of reconstructing the PMD from the folio, derive the destination
entry from src_pmdval after pmdp_huge_clear_flush(), then handle the PMD
metadata the same way move_huge_pmd() does for moved entries by marking it
soft-dirty and clearing uffd-wp.
Link: https://lkml.kernel.org/r/a1e787dd-b911-474d-8570-f37685357d86@lucifer.local
Fixes: e3981db444 ("mm: add folio_mk_pmd()")
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the correct function (or macro) names to avoid kernel-doc warnings:
Warning: include/linux/build_bug.h:38 function parameter 'cond' not
described in 'BUILD_BUG_ON_MSG'
Warning: include/linux/build_bug.h:38 function parameter 'msg' not
described in 'BUILD_BUG_ON_MSG'
Warning: include/linux/build_bug.h:76 function parameter 'expr' not
described in 'static_assert'
Link: https://lkml.kernel.org/r/20260302005144.3467019-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For commit b0dcdcb9ae ("resolve_btfids: Fix linker flags detection"),
I suggested setting HOSTPKG_CONFIG to $PKG_CONFIG when compiling
resolve_btfids, but I forgot the quotes around that variable.
As a result, when running vmtest.sh with static linking, it fails as
follows:
$ LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh
[...]
make: unrecognized option '--static'
Usage: make [options] [target] ...
[...]
This worked when I tested it because HOSTPKG_CONFIG didn't have a
default value in the resolve_btfids Makefile, but once it does, the
quotes aren't preserved and it fails on the next make call.
Fixes: b0dcdcb9ae ("resolve_btfids: Fix linker flags detection")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/abADBwn_ykblpABE@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.
The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.
Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).
The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.
Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Sachin Kumar <xcyfun@protonmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add a test case to ensure that BPF_END operations correctly break
register's scalar ID ties.
The test creates a scenario where r1 is a copy of r0, r0 undergoes a
byte swap, and then r0 is checked against a constant.
- Without the fix in the verifier, the bounds learned from r0 are
incorrectly propagated to r1, making the verifier believe r1 is
bounded and wrongly allowing subsequent pointer arithmetic.
- With the fix, r1 remains an unbounded scalar, and the verifier
correctly rejects the arithmetic operation between the frame pointer
and the unbounded register.
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304083228.142016-3-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
In landlock_restrict_sibling_threads(), when the calling thread is
interrupted while waiting for sibling threads to prepare, it executes
a recovery path.
Previously, this path included a wait_for_completion() call on
all_prepared to prevent a Use-After-Free of the local shared_ctx.
However, this wait is redundant. Exiting the main do-while loop
already leads to a bottom cleanup section that unconditionally waits
for all_finished. Therefore, replacing the wait with a simple break
is safe, prevents UAF, and correctly unblocks the remaining task_works.
Clean up the error path by breaking the loop and updating the
surrounding comments to accurately reflect the state machine.
Suggested-by: Günther Noack <gnoack3000@gmail.com>
Signed-off-by: Yihan Ding <dingyihan@uniontech.com>
Tested-by: Günther Noack <gnoack3000@gmail.com>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260306021651.744723-3-dingyihan@uniontech.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
syzbot found a deadlock in landlock_restrict_sibling_threads().
When multiple threads concurrently call landlock_restrict_self() with
sibling thread restriction enabled, they can deadlock by mutually
queueing task_works on each other and then blocking in kernel space
(waiting for the other to finish).
Fix this by serializing the TSYNC operations within the same process
using the exec_update_lock. This prevents concurrent invocations
from deadlocking.
We use down_write_trylock() and restart the syscall if the lock
cannot be acquired immediately. This ensures that if a thread fails
to get the lock, it will return to userspace, allowing it to process
any pending TSYNC task_works from the lock holder, and then
transparently restart the syscall.
Fixes: 42fc7e6543 ("landlock: Multithreading support for landlock_restrict_self()")
Reported-by: syzbot+7ea2f5e9dfd468201817@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7ea2f5e9dfd468201817
Suggested-by: Günther Noack <gnoack3000@gmail.com>
Suggested-by: Tingmao Wang <m@maowtm.org>
Tested-by: Justin Suess <utilityemal77@gmail.com>
Signed-off-by: Yihan Ding <dingyihan@uniontech.com>
Tested-by: Günther Noack <gnoack3000@gmail.com>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260306021651.744723-2-dingyihan@uniontech.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
XG mobile station 2022 has a different PID than the 2023 model: add it
that model to hid-asus.
Signed-off-by: Denis Benato <denis.benato@linux.dev>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
GPIO controller driver should typically implement the .get_direction()
callback as GPIOLIB internals may try to use it to determine the state
of a pin. Since introduction of shared proxy, it prints a warning splat
when using a shared spmi gpio.
The implementation is not easy because the controller supports enabling
the input and output logic at the same time, so we aligns on the
behaviour of the .get() operation and return -EINVAL in other
situations.
Fixes: eadff30244 ("pinctrl: Qualcomm SPMI PMIC GPIO pin controller driver")
Fixes: d7b5f5cc5e ("pinctrl: qcom: spmi-gpio: Add support for GPIO LV/MV subtype")
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
aesbs_setkey() and aesbs_cbc_ctr_setkey() allocate struct crypto_aes_ctx
on the stack. On arm64, the kernel-mode NEON context is also stored on
the stack, causing the combined frame size to exceed 1024 bytes and
triggering -Wframe-larger-than= warnings.
Allocate struct crypto_aes_ctx on the heap instead and use
kfree_sensitive() to ensure the key material is zeroed on free.
Use a goto-based cleanup path to ensure kfree_sensitive() is always
called.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Fixes: 4fa617cc68 ("arm64/fpsimd: Allocate kernel mode FP/SIMD buffers on the stack")
Link: https://lore.kernel.org/r/20260306064254.2079274-1-yphbchou0911@gmail.com
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Fix a warning for:
$ ./scripts/kconfig/merge_config.sh .config extra.config
Using .config as base
Merging extra.config
./scripts/kconfig/merge_config.sh: 384: [: false: unexpected operator
The shellcheck report is also attached:
if [ "$STRICT" == "true" ] && [ "$STRICT_MODE_VIOLATED" == "true" ]; then
^-- SC3014 (warning): In POSIX sh, == in place of = is undefined.
^-- SC3014 (warning): In POSIX sh, == in place of = is undefined.
Fixes: dfc97e1c5d ("scripts: kconfig: merge_config.sh: use awk in checks too")
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: Mikko Rapeli <mikko.rapeli@linaro.org>
Link: https://patch.msgid.link/20260309121505.40454-1-o451686892@gmail.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
The wacom_intuos_bt_irq() function processes Bluetooth HID reports
without sufficient bounds checking. A maliciously crafted short report
can trigger an out-of-bounds read when copying data into the wacom
structure.
Specifically, report 0x03 requires at least 22 bytes to safely read
the processed data and battery status, while report 0x04 (which
falls through to 0x03) requires 32 bytes.
Add explicit length checks for these report IDs and log a warning if
a short report is received.
Signed-off-by: Benoît Sevens <bsevens@google.com>
Reviewed-by: Jason Gerecke <jason.gerecke@wacom.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Commit 60f3ada4174f ("kunit: Add --list_suites to show suites") introduced
the --list_suites option to kunit.py, but the update to the corresponding
run_wrapper documentation was omitted.
Add the missing description for --list_suites to keep the documentation in
sync with the tool's supported arguments.
Fixes: 60f3ada4174f ("kunit: Add --list_suites to show suites")
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Dingisoul with KASAN reports a use after free if device_add() fails in
nd_async_device_register().
Commit b6eae0f61d ("libnvdimm: Hold reference on parent while
scheduling async init") correctly added a reference on the parent device
to be held until asynchronous initialization was complete. However, if
device_add() results in an allocation failure the ref count of the
device drops to 0 prior to the parent pointer being accessed. Thus
resulting in use after free.
The bug bot AI correctly identified the fix. Save a reference to the
parent pointer to be used to drop the parent reference regardless of the
outcome of device_add().
Reported-by: Dingisoul <dingiso.kernel@gmail.com>
Closes: http://lore.kernel.org/8855544b-be9e-4153-aa55-0bc328b13733@gmail.com
Fixes: b6eae0f61d ("libnvdimm: Hold reference on parent while scheduling async init")
Cc: stable@vger.kernel.org
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260306-fix-uaf-async-init-v1-1-a28fd7526723@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
In iptfs_reassem_cont(), IP-TFS attempts to append data to the new inner
packet 'newskb' that is being reassembled. First a zero-copy approach is
tried if it succeeds then newskb becomes non-linear.
When a subsequent fragment in the same datagram does not meet the
fast-path conditions, a memory copy is performed. It calls skb_put() to
append the data and as newskb is non-linear it triggers
SKB_LINEAR_ASSERT check.
Oops: invalid opcode: 0000 [#1] SMP NOPTI
[...]
RIP: 0010:skb_put+0x3c/0x40
[...]
Call Trace:
<IRQ>
iptfs_reassem_cont+0x1ab/0x5e0 [xfrm_iptfs]
iptfs_input_ordered+0x2af/0x380 [xfrm_iptfs]
iptfs_input+0x122/0x3e0 [xfrm_iptfs]
xfrm_input+0x91e/0x1a50
xfrm4_esp_rcv+0x3a/0x110
ip_protocol_deliver_rcu+0x1d7/0x1f0
ip_local_deliver_finish+0xbe/0x1e0
__netif_receive_skb_core.constprop.0+0xb56/0x1120
__netif_receive_skb_list_core+0x133/0x2b0
netif_receive_skb_list_internal+0x1ff/0x3f0
napi_complete_done+0x81/0x220
virtnet_poll+0x9d6/0x116e [virtio_net]
__napi_poll.constprop.0+0x2b/0x270
net_rx_action+0x162/0x360
handle_softirqs+0xdc/0x510
__irq_exit_rcu+0xe7/0x110
irq_exit_rcu+0xe/0x20
common_interrupt+0x85/0xa0
</IRQ>
<TASK>
Fix this by checking if the skb is non-linear. If it is, linearize it by
calling skb_linearize(). As the initial allocation of newskb originally
reserved enough tailroom for the entire reassembled packet we do not
need to check if we have enough tailroom or extend it.
Fixes: 5f2b6a9095 ("xfrm: iptfs: add skb-fragment sharing code")
Reported-by: Hao Long <me@imlonghao.com>
Closes: https://lore.kernel.org/netdev/DGRCO9SL0T5U.JTINSHJQ9KPK@imlonghao.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
At the end of this function, d is the traversal cursor of flist, but the
code completes found instead. This can lead to issues such as NULL pointer
dereferences, double completion, or descriptor leaks.
Fix this by completing d instead of found in the final
list_for_each_entry_safe() loop.
Fixes: aa8d18becc ("dmaengine: idxd: add callback support for iaa crypto")
Signed-off-by: Tuo Li <islituo@gmail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260106032428.162445-1-islituo@gmail.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking if it is NULL.
Filesystems such as 9P do not set the prepare_write operation, so
stream->prepare_write remains NULL. When get_user_pages() fails with
-EFAULT and the subrequest is flagged for retry, this results in a NULL
pointer dereference at fs/netfs/direct_write.c:189.
Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles iterator reset,
IN_PROGRESS flag, stats update and reissue internally.
Fixes: a0b4c7a491 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260307043947.347092-1-kartikey406@gmail.com
Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
Others have submitted this issue (https://lore.kernel.org/dmaengine/
20240722030405.3385-1-zhengdongxiong@gxmicro.cn/),
but it has not been fixed yet. Therefore, more supplementary information
is provided here.
As mentioned in the "PCS-CCS-CB-TCB" Producer-Consumer Synchronization of
"DesignWare Cores PCI Express Controller Databook, version 6.00a":
1. The Consumer CYCLE_STATE (CCS) bit in the register only needs to be
initialized once; the value will update automatically to be
~CYCLE_BIT (CB) in the next chunk.
2. The Consumer CYCLE_BIT bit in the register is loaded from the LL
element and tested against CCS. When CB = CCS, the data transfer is
executed. Otherwise not.
The current logic sets customer (HDMA) CS and CB bits to 1 in each chunk
while setting the producer (software) CB of odd chunks to 0 and even
chunks to 1 in the linked list. This is leading to a mismatch between
the producer CB and consumer CS bits.
This issue can be reproduced by setting the transmission data size to
exceed one chunk. By the way, in the EDMA using the same "PCS-CCS-CB-TCB"
mechanism, the CS bit is only initialized once and this issue was not
found. Refer to
drivers/dma/dw-edma/dw-edma-v0-core.c:dw_edma_v0_core_start.
So fix this issue by initializing the CYCLE_STATE and CYCLE_BIT bits
only once.
Fixes: e74c39573d ("dmaengine: dw-edma: Add support for native HDMA")
Signed-off-by: LUO Haowen <luo-hw@foxmail.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/tencent_CB11AA9F3920C1911AF7477A9BD8EFE0AD05@qq.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
On admin queue completion handling, if the admin command completed with
error we print data from the completion context. The issue is that we
already freed the completion context in polling/interrupts handler which
means we print data from context in an unknown state (it might be
already used again).
Change the admin submission flow so alloc/dealloc of the context will be
symmetric and dealloc will be called after any potential use of the
context.
Fixes: 68fb9f3e31 ("RDMA/efa: Remove redundant NULL pointer check of CQE")
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Link: https://patch.msgid.link/20260308165350.18219-1-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Since commit b5daf93b80 ("firmware: arm_scmi: Avoid notifier
registration for unsupported events") the call chains leading to the helper
__scmi_event_handler_get_ops expect an ERR_PTR to be returned on failure to
get an handler for the requested event key, while the current helper can
still return a NULL when no handler could be found or created.
Fix by forcing an ERR_PTR return value when the handler reference is NULL.
Fixes: b5daf93b80 ("firmware: arm_scmi: Avoid notifier registration for unsupported events")
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-Id: <20260305131011.541444-1-cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
A device_node reference obtained from the device tree is not released
on all error paths in the arm_scpi probe path. Specifically, a node
returned by of_parse_phandle() could be leaked when the probe failed
after the node was acquired. The probe function returns early and
the shmem reference is not released.
Use __free(device_node) scope-based cleanup to automatically release
the reference when the variable goes out of scope.
Fixes: ed7ecb8839 ("firmware: arm_scpi: Add compatibility checks for shmem node")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Message-Id: <20260121-arm_scpi_2-v2-1-702d7fa84acb@gmail.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
According to the FF-A specification (DEN0077, v1.1, §13.7), when
FFA_RXTX_UNMAP is invoked from any instance other than non-secure
physical, the w1 register must be zero (MBZ). If a non-zero value is
supplied in this context, the SPMC must return FFA_INVALID_PARAMETER.
The Arm FF-A driver operates exclusively as a guest or non-secure
physical instance where the partition ID is always zero and is not
invoked from a hypervisor context where w1 carries a VM ID. In this
execution model, the partition ID observed by the driver is always zero,
and passing a VM ID is unnecessary and potentially invalid.
Remove the vm_id parameter from ffa_rxtx_unmap() and ensure that the
SMC call is issued with w1 implicitly zeroed, as required by the
specification. This prevents invalid parameter errors and aligns the
implementation with the defined FF-A ABI behavior.
Fixes: 3bbfe98710 ("firmware: arm_ffa: Add initial Arm FFA driver support")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Message-Id: <20260304120953.847671-1-yeoreum.yun@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
Commit e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to
cxl_pmem.ko") moves devm_cxl_add_nvdimm_bridge() into the cxl_pmem file,
which has independent config compile options for built-in or module. The
call from cxl_acpi_probe() is guarded by IS_ENABLED(CONFIG_CXL_PMEM),
which evaluates to true for both =y and =m.
When CONFIG_CXL_PMEM=m, a built-in cxl_acpi attempts to reference a
symbol exported by a module, which fails to link. CXL_PMEM cannot simply
be promoted to =y in this configuration because it depends on LIBNVDIMM,
which may itself be =m.
Add a Kconfig dependency to prevent CXL_ACPI from being built-in when
CXL_PMEM is a module. This contrains CXL_ACPI to =m when CXL_PMEM=m,
while still allowing CXL_ACPI to be freely configured when CXL_PMEM is
either built-in or disabled.
[ dj: Fix up commit reference formatting. ]
Fixes: e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260305204057.1516948-1-kbusch@meta.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The rzt2h_gpio_get_direction() function is called from
gpiod_get_direction(), which ends up being used within the __setup_irq()
call stack when requesting an interrupt.
__setup_irq() holds a raw_spinlock_t with IRQs disabled, which creates
an atomic context. spinlock_t cannot be used within atomic context
when PREEMPT_RT is enabled, since it may become a sleeping lock.
An "[ BUG: Invalid wait context ]" splat is observed when running with
CONFIG_PROVE_LOCKING enabled, describing exactly the aforementioned call
stack.
__setup_irq() needs to hold a raw_spinlock_t with IRQs disabled to
serialize access against a concurrent hard interrupt.
Switch to raw_spinlock_t to fix this.
Fixes: 829dde3369 ("pinctrl: renesas: rzt2h: Add GPIO IRQ chip to handle interrupts")
Signed-off-by: Cosmin Tanislav <cosmin-gabriel.tanislav.xa@renesas.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260205103930.666051-1-cosmin-gabriel.tanislav.xa@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
When calling of_parse_phandle_with_fixed_args(), the caller is
responsible for calling of_node_put() to release the device node
reference.
In rzt2h_gpio_register(), the driver fails to call of_node_put() to
release the reference in of_args.np, which causes a memory leak.
Add the missing of_node_put() call to fix the leak.
Fixes: 34d4d09307 ("pinctrl: renesas: Add support for RZ/T2H")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260127-rzt2h-v1-1-86472e7421b8@gmail.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
The default settings for the Versa3 device on the Renesas RZ/G3S SMARC
SoM board have PLL2 disabled. PLL2 was later enabled together with audio
support, as it is required to support both 44.1 kHz and 48 kHz audio.
With PLL2 enabled, it was observed that Linux occasionally either hangs
during boot (the last log message being related to the I2C probe) or
randomly crashes. This was mainly reproducible on cold boots. During
debugging, it was also noticed that the Unicode replacement character (�)
sometimes appears on the serial console. Further investigation traced this
to the configuration applied through the Versa3 register at offset 0x1c,
which controls PLL enablement.
The appearance of the Unicode replacement character suggested an issue
with the SoC reference clock. The RZ/G3S reference clock is provided by
the Versa3 clock generator (REF output).
After checking with the Renesas Versa3 hardware team, it was found that
this is related to the PLL2 lock bit being set through the
renesas,settings DT property.
The PLL lock bit must be set to avoid unstable clock output from the PLL.
However, due to the Versa3 hardware design, when a PLL lock bit is set,
all outputs (including the REF clock) are temporarily disabled until the
configured PLLs become stable.
As an alternative, the bypass bit can be used. This does not interrupt the
PLL2 output or any other Versa3 outputs, but it may result in temporary
instability on PLL2 output while the configuration is applied. Since PLL2
feeds only the audio path and audio is not used during early boot, this is
acceptable and does not affect system boot.
Drop the PLL2 lock bit and set the bypass bit instead.
This has been tested with more than 1000 cold boots.
Fixes: a94253232b ("arm64: dts: renesas: rzg3s-smarc-som: Add versa3 clock generator node")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260302135703.162601-1-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
The HW user manual for the Renesas RZ/V2H(P) SoC (a.k.a r9a09g057)
states that only WDT1 is supposed to be accessed by the CA55 cores.
WDT0 is supposed to be used by the CM33 core, WDT2 is supposed
to be used by the CR8 core 0, and WDT3 is supposed to be used
by the CR8 core 1.
Remove wdt{0,2,3} from the SoC specific device tree to make it
compliant with the specification from the HW manual.
This change is harmless as there are currently no users of the
wdt{0,2,3} device tree nodes, only the wdt1 node is actually used.
Fixes: 095105496e ("arm64: dts: renesas: r9a09g057: Add WDT0-WDT3 nodes")
Signed-off-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260203124247.7320-3-fabrizio.castro.jz@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Set an appropriate ramp delay for the SD0 I/O voltage regulator in the
CN15 SD overlay to make UHS-I voltage switching reliable during card
initialization.
This issue was observed on the RZ/V2H EVK, while the same UHS-I cards
worked on the RZ/V2N EVK without problems. Adding the ramp delay makes
the behavior consistent and avoids SD init timeouts.
Before this change SD0 could fail with:
mmc0: error -110 whilst initialising SD card
With the delay in place UHS-I cards enumerate correctly:
mmc0: new UHS-I speed SDR104 SDXC card at address aaaa
mmcblk0: mmc0:aaaa SR64G 59.5 GiB
mmcblk0: p1
Fixes: 3d6c2bc762 ("arm64: dts: renesas: Add CN15 eMMC and SD overlays for RZ/V2H and RZ/V2N EVKs")
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260123225957.1007089-5-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Add a ramp delay of 60 uV/us to the vqmmc_sdhi0 voltage regulator to
fix UHS-I SD card detection failures.
Measurements on CN78 pin 4 showed the actual voltage ramp time to be
21.86ms when switching between 3.3V and 1.8V. A 25ms ramp delay has
been configured to provide adequate margin. The calculation is based
on the voltage delta of 1.5V (3.3V - 1.8V):
1500000 uV / 60 uV/us = 25000 us (25ms)
Prior to this patch, UHS-I cards failed to initialize with:
mmc0: error -110 whilst initialising SD card
After this patch, UHS-I cards are properly detected on SD0:
mmc0: new UHS-I speed SDR104 SDXC card at address aaaa
mmcblk0: mmc0:aaaa SR64G 59.5 GiB
Fixes: d065453e5e ("arm64: dts: renesas: rzt2h-rzn2h-evk: Enable SD card slot")
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260123225957.1007089-2-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
When the nl80211 socket that originated a PMSR request is
closed, cfg80211_release_pmsr() sets the request's nl_portid
to zero and schedules pmsr_free_wk to process the abort
asynchronously. If the interface is concurrently torn down
before that work runs, cfg80211_pmsr_wdev_down() calls
cfg80211_pmsr_process_abort() directly. However, the already-
scheduled pmsr_free_wk work item remains pending and may run
after the interface has been removed from the driver. This
could cause the driver's abort_pmsr callback to operate on a
torn-down interface, leading to undefined behavior and
potential crashes.
Cancel pmsr_free_wk synchronously in cfg80211_pmsr_wdev_down()
before calling cfg80211_pmsr_process_abort(). This ensures any
pending or in-progress work is drained before interface teardown
proceeds, preventing the work from invoking the driver abort
callback after the interface is gone.
Fixes: 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator API")
Signed-off-by: Peddolla Harshavardhan Reddy <peddolla.reddy@oss.qualcomm.com>
Link: https://patch.msgid.link/20260305160712.1263829-3-peddolla.reddy@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some scenarios, a deadlock can happen, involving _do_shadow_pte().
Convert all usages of pgste_get_lock() to pgste_get_trylock() in
_do_shadow_pte() and return -EAGAIN. All callers can already deal with
-EAGAIN being returned.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
ieee80211_chan_bw_change() iterates all stations and accesses
link->reserved.oper via sta->sdata->link[link_id]. For stations on
AP_VLAN interfaces (e.g. 4addr WDS clients), sta->sdata points to
the VLAN sdata, whose link never participates in chanctx reservations.
This leaves link->reserved.oper zero-initialized with chan == NULL,
causing a NULL pointer dereference in __ieee80211_sta_cap_rx_bw()
when accessing chandef->chan->band during CSA.
Resolve the VLAN sdata to its parent AP sdata using get_bss_sdata()
before accessing link data.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20260305170812.2904208-1-nbd@nbd.name
[also change sta->sdata in ARRAY_SIZE even if it doesn't matter]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If the firmware version query fails, the driver currently ignores the
error and continues initializing. This leaves the device in a bad state.
Fix this by making bng_re_query_hwrm_version() return the error code and
update the driver to check for this error and stop the setup process
safely if it happens.
Fixes: 745065770c ("RDMA/bng_re: Register and get the resources from bnge driver")
Signed-off-by: Kamal Heib <kheib@redhat.com>
Link: https://patch.msgid.link/20260303043645.425724-1-kheib@redhat.com
Reviewed-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reset controller fixes for v7.0
* Fix NULL pointer dereference in reset-rzg2l-usbphy-ctrl driver for
renesas,rzg2l-usbphy-ctrl devices without pwrrdy control.
* tag 'reset-fixes-for-v7.0' of https://git.pengutronix.de/git/pza/linux:
reset: rzg2l-usbphy-ctrl: Check pwrrdy is valid before using it
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
FSL SOC Fixes for 7.0
- Fix a race condition in Freescale Queue and Buffer Manager.
- Fix a trivial error verification in CPM1
* tag 'soc_fsl-7.0-2' of https://git.kernel.org/pub/scm/linux/kernel/git/chleroy/linux:
soc: fsl: cpm1: qmc: Fix error check for devm_ioremap_resource() in qmc_qe_init_resources()
soc: fsl: qbman: fix race condition in qman_destroy_fq
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
RISC-V soc fixes for v7.0-rc1
drivers:
Fix leaks in probe/init function teardown code in three drivers.
microchip:
Fix a warning introduced by a recent binding change, that made resets
required on Polarfire SoC's CAN IP.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-soc-fixes-for-v7.0-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
cache: ax45mp: Fix device node reference leak in ax45mp_cache_init()
cache: starfive: fix device node leak in starlink_cache_init()
riscv: dts: microchip: add can resets to mpfs
soc: microchip: mpfs: Fix memory leak in mpfs_sys_controller_probe()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
If task_work_add() failed, ctx->task is put but the tsync_works struct
is not reset to its previous state. The first consequence is that the
kernel allocates memory for dying threads, which could lead to
user-accounted memory exhaustion (not very useful nor specific to this
case). The second consequence is that task_work_cancel(), called by
cancel_tsync_works(), can dereference a NULL task pointer.
Fix this issues by keeping a consistent works->size wrt the added task
work. This is done in a new tsync_works_trim() helper which also cleans
up the shared_ctx and work fields.
As a safeguard, add a pointer check to cancel_tsync_works() and update
tsync_works_release() accordingly.
Cc: Jann Horn <jannh@google.com>
Reviewed-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20260217122341.2359582-1-mic@digikod.net
[mic: Replace memset() with compound literal]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
The GL9750 SD host controller has intermittent data corruption during
DMA write operations. The GM_BURST register's R_OSRC_Lmt field
(bits 17:16), which limits outstanding DMA read requests from system
memory, is not being cleared during initialization. The Windows driver
sets R_OSRC_Lmt to zero, limiting requests to the smallest unit.
Clear R_OSRC_Lmt to match the Windows driver behavior. This eliminates
write corruption verified with f3write/f3read tests while maintaining
DMA performance.
Cc: stable@vger.kernel.org
Fixes: e51df6ce66 ("mmc: host: sdhci-pci: Add Genesys Logic GL975x support")
Closes: https://lore.kernel.org/linux-mmc/33d12807-5c72-41ce-8679-57aa11831fad@linux.dev/
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Reviewed-by: Ben Chuang <ben.chuang@genesyslogic.com.tw>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
ionic_build_hdr() populated the Ethernet source MAC (hdr->eth.smac_h) by
passing the header’s storage directly to rdma_read_gid_l2_fields().
However, ib_ud_header_init() is called after that and re-initializes the
UD header, which wipes the previously written smac_h. As a result, packets
are emitted with an zero source MAC address on the wire.
Correct the source MAC by reading the GID-derived smac into a temporary
buffer and copy it after ib_ud_header_init() completes.
Fixes: e8521822c7 ("RDMA/ionic: Register device ops for control path")
Cc: stable@vger.kernel.org # 6.18
Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com>
Link: https://patch.msgid.link/20260227061809.2979990-1-abhijit.gangurde@amd.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
If IB_MR_REREG_TRANS is set during rereg_user_mr, the
umem will be released and a new one will be allocated
in irdma_rereg_mr_trans. If any step of irdma_rereg_mr_trans
fails after the new umem is allocated, it releases the umem,
but does not set iwmr->region to NULL. The problem is that
this failure is propagated to the user, who will then call
ibv_dereg_mr (as they should). Then, the dereg_mr path will
see a non-NULL umem and attempt to call ib_umem_release again.
Fix this by setting iwmr->region to NULL after ib_umem_release.
Fixed: 5ac388db27 ("RDMA/irdma: Add support to re-register a memory region")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260227152743.1183388-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
cxl_detach_ep() is called during bottom-up removal when all CXL memory
devices beneath a switch port have been removed. For each port in the
hierarchy it locks both the port and its parent, removes the endpoint,
and if the port is now empty, marks it dead and unregisters the port
by calling delete_switch_port(). There are two places during this work
where the parent_port may be used after freeing:
First, a concurrent detach may have already processed a port by the
time a second worker finds it via bus_find_device(). Without pinning
parent_port, it may already be freed when we discover port->dead and
attempt to unlock the parent_port. In a production kernel that's a
silent memory corruption, with lock debug, it looks like this:
[]DEBUG_LOCKS_WARN_ON(__owner_task(owner) != get_current())
[]WARNING: kernel/locking/mutex.c:949 at __mutex_unlock_slowpath+0x1ee/0x310
[]Call Trace:
[]mutex_unlock+0xd/0x20
[]cxl_detach_ep+0x180/0x400 [cxl_core]
[]devm_action_release+0x10/0x20
[]devres_release_all+0xa8/0xe0
[]device_unbind_cleanup+0xd/0xa0
[]really_probe+0x1a6/0x3e0
Second, delete_switch_port() releases three devm actions registered
against parent_port. The last of those is unregister_port() and it
calls device_unregister() on the child port, which can cascade. If
parent_port is now also empty the device core may unregister and free
it too. So by the time delete_switch_port() returns, parent_port may
be free, and the subsequent device_unlock(&parent_port->dev) operates
on freed memory. The kernel log looks same as above, with a different
offset in cxl_detach_ep().
Both of these issues stem from the absence of a lifetime guarantee
between a child port and its parent port.
Establish a lifetime rule for ports: child ports hold a reference to
their parent device until release. Take the reference when the port
is allocated and drop it when released. This ensures the parent is
valid for the full lifetime of the child and eliminates the use after
free window in cxl_detach_ep().
This is easily reproduced with a reload of cxl_acpi in QEMU with CXL
devices present.
Fixes: 2345df5424 ("cxl/memdev: Fix endpoint port removal")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260226184439.1732841-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Earlier TEE subsystem assumed to refcount all the memory pages to be
shared with TEE implementation to be refcounted. However, the slab
allocations within the kernel don't allow refcounting kernel pages.
It is rather better to trust the kernel clients to not free pages while
being shared with TEE implementation. Hence, remove refcounting of kernel
pages from register_shm_helper() API.
Fixes: b9c0e49abf ("mm: decline to manipulate the refcount on a slab page")
Reported-by: Marco Felsch <m.felsch@pengutronix.de>
Reported-by: Sven Püschel <s.pueschel@pengutronix.de>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
Signed-off-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
Tested-by: Sven Püschel <s.pueschel@pengutronix.de>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
Change additionalProperties to unevaluatedProperties because it refs to
/schemas/input/matrix-keymap.yaml.
Fix below CHECK_DTBS warnings:
arch/arm/boot/dts/nxp/imx/imx6dl-victgo.dtb: keypad@70 (holtek,ht16k33): 'keypad,num-columns', 'keypad,num-rows' do not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/auxdisplay/holtek,ht16k33.yaml#
Fixes: f12b457c6b ("dt-bindings: auxdisplay: ht16k33: Convert to json-schema")
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Add validation of the inner IPv4 packet tot_len and ihl fields parsed
from decrypted IPTFS payloads in __input_process_payload(). A crafted
ESP packet containing an inner IPv4 header with tot_len=0 causes an
infinite loop: iplen=0 leads to capturelen=min(0, remaining)=0, so the
data offset never advances and the while(data < tail) loop never
terminates, spinning forever in softirq context.
Reject inner IPv4 packets where tot_len < ihl*4 or ihl*4 < sizeof(struct
iphdr), which catches both the tot_len=0 case and malformed ihl values.
The normal IP stack performs this validation in ip_rcv_core(), but IPTFS
extracts and processes inner packets before they reach that layer.
Reported-by: Roshan Kumar <roshaen09@gmail.com>
Fixes: 6c82d24336 ("xfrm: iptfs: add basic receive packet (tunnel egress) handling")
Cc: stable@vger.kernel.org
Signed-off-by: Roshan Kumar <roshaen09@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The serdes device_node is obtained using of_get_child_by_name(),
which increments the reference count. However, it is never put,
leading to a reference leak.
Add the missing of_node_put() calls to ensure the reference count is
properly balanced.
Fixes: 7ae14cf581 ("phy: ti: j721e-wiz: Implement DisplayPort mode to the wiz driver")
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20260212-wiz-v2-1-6e8bd4cc7a4a@gmail.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The blamed commit introduced support for specifying individual lanes as
OF nodes in the device, and these can have status = "disabled".
When that happens, for_each_available_child_of_node() skips them and
lynx_28g_probe_lane() -> devm_phy_create() is not called, so lane->phy
will be NULL. Yet it will be dereferenced in lynx_28g_cdr_lock_check(),
resulting in a crash.
This used to be well handled in v3 of that patch:
https://lore.kernel.org/linux-phy/20250926180505.760089-14-vladimir.oltean@nxp.com/
but until v5 was merged, the logic to support per-lane OF nodes was
split into a separate change, and the per-SoC compatible strings patch
was deferred to a "part 2" set. The splitting was done improperly, and
that handling of NULL lane->phy pointers was not integrated into the
proper commit.
Fixes: 7df7d58abb ("phy: lynx-28g: support individual lanes as OF PHY providers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260226182853.1103616-1-vladimir.oltean@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Geert reports that enabling CONFIG_KUNIT_ALL_TESTS shouldn't enable
features that aren't enabled without it. That isn't what "*all* tests"
means, but as the prompt puts it, "All KUnit tests with satisfied
dependencies".
The impact is that enabling CONFIG_KUNIT_ALL_TESTS brings features which
cannot be disabled as built-in into the kernel.
Keep the pattern where consumer drivers have to "select PHY_COMMON_PROPS",
but if KUNIT_ALL_TESTS is enabled, also make PHY_COMMON_PROPS user
selectable, so it can be turned off.
Modify PHY_COMMON_PROPS_TEST to depend on PHY_COMMON_PROPS rather than
select it.
Fixes: e7556b59ba ("phy: add phy_get_rx_polarity() and phy_get_tx_polarity()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Closes: https://lore.kernel.org/linux-phy/CAMuHMdUBaoYKNj52gn8DQeZFZ42Cvm6xT6fvo0-_twNv1k3Jhg@mail.gmail.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260226153315.3530378-1-vladimir.oltean@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.
On arm64, the hypervisor exposes a synthetic register that can be read
to find the INTID that should be used for SINTs. This INTID is in the
PPI range.
To better unify the code paths, introduce mshv_sint_vector_init() that
either reads the synthetic register and obtains the INTID (arm64) or
just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Rename mshv_synic_init() to mshv_synic_cpu_init() and
mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
these functions handle per-cpu synic setup and teardown.
Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
Move all the synic related setup from mshv_parent_partition_init.
Move the reboot notifier to mshv_synic.c because it currently only
operates on the synic cpuhp state.
Move out synic_pages from the global mshv_root since its use is now
completely local to mshv_synic.c.
This is in preparation for adding more stuff to mshv_synic_init().
No functional change.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Fix wrong variable used for error checking after dma_alloc_coherent()
call. The function checks cdns_ctrl->dma_cdma_desc instead of
cdns_ctrl->cdma_desc, which could lead to incorrect error handling.
Fixes: ec4ba01e89 ("mtd: rawnand: Add new Cadence NAND driver to MTD subsystem")
Cc: stable@vger.kernel.org
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Given CONFIG_FORTIFY_SOURCE=y and a recent compiler,
commit 439a1bcac6 ("fortify: Use __builtin_dynamic_object_size() when
available") produces the warning below and an oops.
Searching for RedBoot partition table in 50000000.flash at offset 0x7e0000
------------[ cut here ]------------
WARNING: lib/string_helpers.c:1035 at 0xc029e04c, CPU#0: swapper/0/1
memcmp: detected buffer overflow: 15 byte read of buffer size 14
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0 #1 NONE
As Kees said, "'names' is pointing to the final 'namelen' many bytes
of the allocation ... 'namelen' could be basically any length at all.
This fortify warning looks legit to me -- this code used to be reading
beyond the end of the allocation."
Since the size of the dynamic allocation is calculated with strlen()
we can use strcmp() instead of memcmp() and remain within bounds.
Cc: Kees Cook <kees@kernel.org>
Cc: stable@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/all/202602151911.AD092DFFCD@keescook/
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Kees Cook <kees@kernel.org>
Signed-off-by: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
I have not been actively involved in SPI NOR development recently and
would like to step down to focus on my current day-to-day work.
The subsystem remains in good hands with Pratyush and Michael.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Acked-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
During the device remove process, the device is reset, causing the
configuration registers to go back to their default state, which is
zero. As the driver is checking if the event log support was enabled
before deallocating, it will fail if a reset happened before.
Do not check if the support was enabled, the check for 'idxd->evl'
being valid (only allocated if the HW capability is available) is
enough.
Fixes: 244da66cda ("dmaengine: idxd: setup event log configuration")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://patch.msgid.link/20260121-idxd-fix-flr-on-kernel-queues-v3-v3-10-7ed70658a9d1@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When using MSI (not MSI-X) with multiple IRQs, the MSI data value
must be unique per vector to ensure correct interrupt delivery.
Currently, the driver fails to increment the MSI data per vector,
causing interrupts to be misrouted.
Fix this by caching the base MSI data and adjusting each vector's
data accordingly during IRQ setup.
Fixes: e63d79d1ff04 ("dmaengine: dw-edma: Add Synopsys DesignWare eDMA IP core driver")
Signed-off-by: Shenghui Shi <brody.shi@m2semi.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260209103726.414-1-brody.shi@m2semi.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Configure only the requested channel when a fixed channel is specified
to avoid modifying other channels unintentionally.
Fix parameter configuration when a fixed DMA channel is requested on
i.MX9 AON domain and i.MX8QM/QXP/DXL platforms. When a client requests
a fixed channel (e.g., channel 6), the driver traverses channels 0-5
and may unintentionally modify their configuration if they are unused.
This leads to issues such as setting the `is_multi_fifo` flag unexpectedly,
causing memcpy tests to fail when using the dmatest tool.
Only affect edma memcpy test when the channel is fixed.
Fixes: 72f5801a4e ("dmaengine: fsl-edma: integrate v3 support")
Signed-off-by: Joy Zou <joy.zou@nxp.com>
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20250917-b4-edma-chanconf-v1-1-886486e02e91@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When the TX queue for espintcp is full, esp_output_tail_tcp will
return an error and not free the skb, because with synchronous crypto,
the common xfrm output code will drop the packet for us.
With async crypto (esp_output_done), we need to drop the skb when
esp_output_tail_tcp returns an error.
Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When we update an SA, we construct a new state and call
xdo_dev_state_add, but never insert it. The existing state is updated,
then we immediately destroy the new state. Since we haven't added it,
we don't go through the standard state delete code, and we're skipping
removing it from the device (but xdo_dev_state_free will get called
when we destroy the temporary state).
This is similar to commit c5d4d7d831 ("xfrm: Fix deletion of
offloaded SAs on failure.").
Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
pcpu_num = 0 is a valid value. The marker for "unset pcpu_num" which
makes copy_to_user_state_extra not add the XFRMA_SA_PCPU attribute is
UINT_MAX.
Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We're returning an error caused by invalid user input without setting
an extack. Add one.
Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Commit 4e6e8c2b75 ("binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4") added
support for AT_HWCAP3 and AT_HWCAP4, but it missed updating the AUX
vector size calculation in create_elf_fdpic_tables() and
AT_VECTOR_SIZE_BASE in include/linux/auxvec.h.
Similar to the fix for AT_HWCAP2 in commit c6a09e342f ("binfmt_elf_fdpic:
fix AUXV size calculation when ELF_HWCAP2 is defined"), this omission
leads to a mismatch between the reserved space and the actual number of
AUX entries, eventually triggering a kernel BUG_ON(csp != sp).
Fix this by incrementing nitems when ELF_HWCAP3 or ELF_HWCAP4 are
defined and updating AT_VECTOR_SIZE_BASE.
Cc: Mark Brown <broonie@kernel.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Fixes: 4e6e8c2b75 ("binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4")
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20260217180108.1420024-2-avagin@google.com
Signed-off-by: Kees Cook <kees@kernel.org>
The pwrrdy regmap_filed is allocated in rzg2l_usbphy_ctrl_pwrrdy_init()
only if the driver data is set to RZG2L_USBPHY_CTRL_PWRRDY. Check that
pwrrdy is valid before using it to avoid "Unable to handle kernel NULL
pointer dereference at virtual address" errors.
Fixes: c5b7cd9ade ("reset: rzg2l-usbphy-ctrl: Add suspend/resume support")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
When QMAN_FQ_FLAG_DYNAMIC_FQID is set, there's a race condition between
fq_table[fq->idx] state and freeing/allocating from the pool and
WARN_ON(fq_table[fq->idx]) in qman_create_fq() gets triggered.
Indeed, we can have:
Thread A Thread B
qman_destroy_fq() qman_create_fq()
qman_release_fqid()
qman_shutdown_fq()
gen_pool_free()
-- At this point, the fqid is available again --
qman_alloc_fqid()
-- so, we can get the just-freed fqid in thread B --
fq->fqid = fqid;
fq->idx = fqid * 2;
WARN_ON(fq_table[fq->idx]);
fq_table[fq->idx] = fq;
fq_table[fq->idx] = NULL;
And adding some logs between qman_release_fqid() and
fq_table[fq->idx] = NULL makes the WARN_ON() trigger a lot more.
To prevent that, ensure that fq_table[fq->idx] is set to NULL before
gen_pool_free() is called by using smp_wmb().
Fixes: c535e923bb ("soc/fsl: Introduce DPAA 1.x QMan device driver")
Signed-off-by: Richard Genoud <richard.genoud@bootlin.com>
Tested-by: CHAMPSEIX Thomas <thomas.champseix@alstomgroup.com>
Link: https://lore.kernel.org/r/20251223072549.397625-1-richard.genoud@bootlin.com
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
In ax45mp_cache_init(), of_find_matching_node() returns a device node
with an incremented reference count that must be released with
of_node_put(). The current code fails to call of_node_put() which
causes a reference leak.
Use the __free(device_node) attribute to ensure automatic cleanup when
the variable goes out of scope.
Fixes: d34599bcd2 ("cache: Add L2 cache management for Andes AX45MP RISC-V core")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
of_find_matching_node() returns a device_node with refcount incremented.
Use __free(device_node) attribute to automatically call of_node_put()
when the variable goes out of scope, preventing the refcount leak.
Fixes: cabff60ca7 ("cache: Add StarFive StarLink cache management")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
The can IP on PolarFire SoC requires the use of the blocks reset
during normal operation, and the property is therefore required by the
binding, causing a warning on the m100pfsevp board where it is default
enabled:
mpfs-m100pfsevp.dtb: can@2010c000 (microchip,mpfs-can): 'resets' is a required property
Add the reset to both can nodes.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
In mpfs_sys_controller_probe(), if of_get_mtd_device_by_node() fails,
the function returns immediately without freeing the allocated memory
for sys_controller, leading to a memory leak.
Fix this by jumping to the out_free label to ensure the memory is
properly freed.
Also, consolidate the error handling for the mbox_request_channel()
failure case to use the same label.
Fixes: 742aa6c563 ("soc: microchip: mpfs: enable access to the system controller's flash")
Co-developed-by: Jianhao Xu <jianhao.xu@seu.edu.cn>
Signed-off-by: Jianhao Xu <jianhao.xu@seu.edu.cn>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
2026-01-15 19:15:19 +00:00
843 changed files with 10888 additions and 4297 deletions
@@ -109,9 +109,6 @@ static int h4_recv(struct hci_uart *hu, const void *data, int count)
{
structh4_struct*h4=hu->priv;
if(!test_bit(HCI_UART_REGISTERED,&hu->flags))
return-EUNATCH;
h4->rx_skb=h4_recv_buf(hu,h4->rx_skb,data,count,
h4_recv_pkts,ARRAY_SIZE(h4_recv_pkts));
if(IS_ERR(h4->rx_skb)){
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.