Jon Hunter says:
====================
net: stmmac: Fix Tegra234 MGBE clock
The name of the PTP ref clock for the Tegra234 MGBE ethernet controller
does not match the generic name in the stmmac platform driver. Despite
this, basic ethernet is functional on the Tegra234 platforms that use
this driver and as far as I know, we have not tested PTP support with
this driver. Hence, the risk of breaking any functionality is low.
The previous attempt to fix this in the stmmac platform driver, by
supporting the Tegra234 PTP clock name, was rejected [0]. The preference
from the netdev maintainers is to fix this in the DT binding for
Tegra234.
This series fixes this by correcting the device-tree binding to align
with the generic name for the PTP clock. I understand that this is
breaking the ABI for this device, which we should never do, but this
is a last resort for getting this fixed. I am open to any better ideas
to fix this. Please note that we still maintain backward compatibility
in the driver to allow older device-trees to work, but we don't
advertise this via the binding, because I did not see any value in doing
so.
====================
Link: https://patch.msgid.link/20260401102941.17466-1-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The PTP clock for the Tegra234 MGBE device is incorrectly named
'ptp-ref' and should be 'ptp_ref'. This is causing the following
warning to be observed on Tegra234 platforms that use this device:
ERR KERN tegra-mgbe 6800000.ethernet eth0: Invalid PTP clock rate
WARNING KERN tegra-mgbe 6800000.ethernet eth0: PTP init failed
Although this constitutes an ABI breakage in the binding for this
device, PTP support has clearly never worked and so fix this now
so we can correct the device-tree for this device. Note that the
MGBE driver still supports the legacy 'ptp-ref' clock name and so
older/existing device-trees will still work, but given that this
is not the correct name, there is no point in advertising this in the
binding.
Fixes: 189c2e5c76 ("dt-bindings: net: Add Tegra234 MGBE")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260401102941.17466-3-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since commit 030ce919e1 ("net: stmmac: make sure that ptp_rate is not
0 before configuring timestamping") was added, the following error is
observed on Tegra234:
ERR KERN tegra-mgbe 6800000.ethernet eth0: Invalid PTP clock rate
WARNING KERN tegra-mgbe 6800000.ethernet eth0: PTP init failed
It turns out that the Tegra234 device-tree binding defines the PTP ref
clock name as 'ptp-ref' and not 'ptp_ref' and the above commit now
exposes this and that the PTP clock is not configured correctly.
In order to update device-tree to use the correct 'ptp_ref' name, update
the Tegra MGBE driver to use 'ptp_ref' by default and fall back to using
'ptp-ref' if this clock name is present.
Fixes: d8ca113724 ("net: stmmac: tegra: Add MGBE support")
Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260401102941.17466-2-jonathanh@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
s3fwrn82_uart_read() reports the number of accepted bytes to the serdev
core. The current code consumes bytes into recv_skb and may already
deliver a complete frame before allocating a fresh receive buffer.
If that alloc_skb() fails, the callback returns 0 even though it has
already consumed bytes, and it leaves recv_skb as NULL for the next
receive callback. That breaks the receive_buf() accounting contract and
can also lead to a NULL dereference on the next skb_put_u8().
Allocate the receive skb lazily before consuming the next byte instead.
If allocation fails, return the number of bytes already accepted.
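The lazy-allocation idea can be illustrated with a small user-space sketch.
This is not the driver code: struct rx_state, rx_alloc(), rx_receive(),
FRAME_LEN and the fail_alloc hook are hypothetical names, and a fixed-size
frame stands in for the real NCI framing. It only shows that allocating
before byte i is consumed lets the callback return an accurate count.

```c
#include <stddef.h>
#include <stdlib.h>

#define FRAME_LEN 4            /* hypothetical fixed frame size */

struct rx_state {
	unsigned char *buf;    /* stands in for recv_skb */
	size_t used;           /* bytes stored in buf */
	size_t frames;         /* complete frames delivered so far */
	int fail_alloc;        /* test hook: make the next allocation fail */
};

static unsigned char *rx_alloc(const struct rx_state *st)
{
	return st->fail_alloc ? NULL : malloc(FRAME_LEN);
}

/* Returns the number of bytes accepted, like a serdev receive callback. */
size_t rx_receive(struct rx_state *st, const unsigned char *data, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* Allocate lazily, before byte i is consumed. */
		if (!st->buf) {
			st->buf = rx_alloc(st);
			if (!st->buf)
				return i;  /* bytes 0..i-1 were accepted */
			st->used = 0;
		}

		st->buf[st->used++] = data[i];

		if (st->used == FRAME_LEN) {
			/* Deliver the frame; the next buffer is allocated on demand. */
			st->frames++;
			free(st->buf);
			st->buf = NULL;
		}
	}
	return i;
}
```

With this shape there is no window in which the buffer pointer is NULL while
a byte is being stored, and a failed allocation never hides bytes that were
already consumed.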
Fixes: 3f52c2cb7e ("nfc: s3fwrn5: Support a UART interface")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260402042148.65236-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv6_stub->ipv6_dev_find() may return ERR_PTR(-EAFNOSUPPORT) when the
IPv6 stack is not active (CONFIG_IPV6=m and not loaded), and passing
this error pointer to dev_hold() will cause a kernel crash with
null-ptr-deref.
Instead, silently discard the request. RFC 8335 does not appear to
define a specific response for the case where an IPv6 interface
identifier is syntactically valid but the implementation cannot perform
the lookup at runtime, and silently dropping the request may be safer than
misreporting "No Such Interface".
Fixes: d329ea5bd8 ("icmp: add response to RFC 8335 PROBE messages")
Signed-off-by: Yiqi Sun <sunyiqixm@gmail.com>
Link: https://patch.msgid.link/20260402070419.2291578-1-sunyiqixm@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When querying a nexthop object via RTM_GETNEXTHOP, the kernel currently
allocates a fixed-size skb using NLMSG_GOODSIZE. While sufficient for
single nexthops and small Equal-Cost Multi-Path groups, this fixed
allocation fails for large nexthop groups like 512 nexthops.
This results in the following warning splat:
WARNING: net/ipv4/nexthop.c:3395 at rtm_get_nexthop+0x176/0x1c0, CPU#20: rep/4608
[...]
RIP: 0010:rtm_get_nexthop (net/ipv4/nexthop.c:3395)
[...]
Call Trace:
<TASK>
rtnetlink_rcv_msg (net/core/rtnetlink.c:6989)
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
____sys_sendmsg (net/socket.c:721 net/socket.c:736 net/socket.c:2585)
___sys_sendmsg (net/socket.c:2641)
__sys_sendmsg (net/socket.c:2671)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Fix this by allocating the size dynamically using nh_nlmsg_size() and
nlmsg_new(), which is consistent with nexthop_notify() behavior. In
addition, adjust nh_nlmsg_size_grp() so it calculates the size needed
based on flags passed. While at it, also add the size of NHA_FDB for
nexthop group size calculation as it was missing too.
This cannot be reproduced via iproute2 as the group size is currently
limited and the command fails as follows:
addattr_l ERROR: message exceeded bound of 1048
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Closes: https://lore.kernel.org/netdev/CAL_bE8Li2h4KO+AQFXW4S6Yb_u5X4oSKnkywW+LPFjuErhqELA@mail.gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260402072613.25262-2-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently NHA_HW_STATS_ENABLE is included twice every time a dump of
nexthop group is performed with NHA_OP_FLAG_DUMP_STATS. As all the stats
querying were moved to nla_put_nh_group_stats(), leave only that
instance of the attribute querying.
Fixes: 5072ae00ae ("net: nexthop: Expose nexthop group HW stats to user space")
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260402072613.25262-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
qca_tty_receive() consumes each input byte before checking whether a
completed frame needs a fresh receive skb. When the current byte completes
a frame, the driver delivers that frame and then allocates a new skb for
the next one.
If that allocation fails, the current code returns i even though data[i]
has already been consumed and may already have completed the delivered
frame. Since serdev interprets the return value as the number of accepted
bytes, this under-reports progress by one byte and can replay the final
byte of the completed frame into a fresh parser state on the next call.
Return i + 1 in that failure path so the accepted-byte count matches the
actual receive-state progress.
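A minimal user-space sketch of the accounting, under stated assumptions:
struct eager_rx, eager_receive(), FRAME_LEN and fail_alloc are hypothetical,
a fixed-size frame replaces the real QCA framing state machine, and the
entry check that allocates a missing buffer is part of the sketch rather
than a claim about the driver.

```c
#include <stddef.h>
#include <stdlib.h>

#define FRAME_LEN 4            /* hypothetical fixed frame size */

struct eager_rx {
	unsigned char *buf;    /* stands in for the receive skb */
	size_t used;
	size_t frames;         /* frames delivered */
	int fail_alloc;        /* test hook: make allocations fail */
};

/* Returns the number of bytes accepted by the parser. */
size_t eager_receive(struct eager_rx *st, const unsigned char *data, size_t count)
{
	size_t i;

	if (!st->buf) {
		st->buf = st->fail_alloc ? NULL : malloc(FRAME_LEN);
		if (!st->buf)
			return 0;      /* nothing consumed yet */
		st->used = 0;
	}

	for (i = 0; i < count; i++) {
		st->buf[st->used++] = data[i];  /* byte i is consumed here */
		if (st->used < FRAME_LEN)
			continue;

		st->frames++;                   /* frame delivered */
		free(st->buf);
		st->used = 0;
		st->buf = st->fail_alloc ? NULL : malloc(FRAME_LEN);
		if (!st->buf)
			return i + 1;           /* byte i already counted */
	}
	return i;
}
```

Returning i in the failure branch would make the caller resend the last byte
of a frame that has already been delivered.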
Fixes: dfc768fbe6 ("net: qualcomm: add QCA7000 UART driver")
Cc: stable@vger.kernel.org
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402071207.4036-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The GRP_ACK_MSG handler in tipc_group_proto_rcv() currently decrements
bc_ackers on every inbound group ACK, even when the same member has
already acknowledged the current broadcast round.
Because bc_ackers is a u16, a duplicate ACK received after the last
legitimate ACK wraps the counter to 65535. Once wrapped,
tipc_group_bc_cong() keeps reporting congestion and later group
broadcasts on the affected socket stay blocked until the group is
recreated.
Fix this by ignoring duplicate or stale ACKs before touching bc_acked or
bc_ackers. This makes repeated GRP_ACK_MSG handling idempotent and
prevents the underflow path.
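The wrap and the idempotent handling can be sketched in plain C. The struct
layouts, seq_less(), naive_dec() and group_ack() are hypothetical; the sketch
assumes ACKs carry a 16-bit sequence number that can be compared with the
member's last acknowledged value, which may differ from the real TIPC logic.

```c
#include <stdbool.h>
#include <stdint.h>

struct member { uint16_t bc_acked; };   /* last sequence number acked */
struct group  { uint16_t bc_ackers; };  /* members still expected to ack */

/* What an unconditional decrement does to a u16 counter. */
uint16_t naive_dec(uint16_t v)
{
	return (uint16_t)(v - 1);
}

/* True if a is strictly before b in 16-bit sequence space. */
static bool seq_less(uint16_t a, uint16_t b)
{
	return ((uint16_t)(a - b) & 0x8000u) != 0;
}

/* Returns true only if the ACK acknowledged something new. */
bool group_ack(struct group *g, struct member *m, uint16_t acked)
{
	/* Duplicate or stale ACK: leave both counters untouched. */
	if (!seq_less(m->bc_acked, acked))
		return false;

	m->bc_acked = acked;
	g->bc_ackers = naive_dec(g->bc_ackers);
	return true;
}
```

Calling group_ack() twice with the same sequence number changes the state
only once, so a repeated ACK cannot push bc_ackers below zero.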
Fixes: 2f487712b8 ("tipc: guarantee that group broadcast doesn't bypass group unicast")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/41a4833f368641218e444fdcff822039.security@1seal.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rtnl_newlink() lacks a CAP_NET_ADMIN capability check on the peer
network namespace when creating paired devices (veth, vxcan,
netkit). This allows an unprivileged user with a user namespace
to create interfaces in arbitrary network namespaces, including
init_net.
Add a netlink_ns_capable() check for CAP_NET_ADMIN in the peer
namespace before allowing device creation to proceed.
Fixes: 81adee47df ("net: Support specifying the network namespace upon device creation.")
Signed-off-by: Nikolaos Gkarlis <nickgarlis@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260402181432.4126920-1-nickgarlis@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_BRIDGE_VLAN_FILTERING is not set, br_vlan_group() and
nbp_vlan_group() return NULL (br_private.h stub definitions). The
BR_BOOLOPT_FDB_LOCAL_VLAN_0 toggle code is compiled unconditionally and
reaches br_fdb_delete_locals_per_vlan_port() and
br_fdb_insert_locals_per_vlan_port(), where the NULL vlan group pointer
is dereferenced via list_for_each_entry(v, &vg->vlan_list, vlist).
The observed crash is in the delete path, triggered when creating a
bridge with IFLA_BR_MULTI_BOOLOPT containing BR_BOOLOPT_FDB_LOCAL_VLAN_0
via RTM_NEWLINK. The insert helper has the same bug pattern.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] KASAN NOPTI
KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
RIP: 0010:br_fdb_delete_locals_per_vlan+0x2b9/0x310
Call Trace:
br_fdb_toggle_local_vlan_0+0x452/0x4c0
br_toggle_fdb_local_vlan_0+0x31/0x80 net/bridge/br.c:276
br_boolopt_toggle net/bridge/br.c:313
br_boolopt_multi_toggle net/bridge/br.c:364
br_changelink net/bridge/br_netlink.c:1542
br_dev_newlink net/bridge/br_netlink.c:1575
Add NULL checks for the vlan group pointer in both helpers, returning
early when there are no VLANs to iterate. This matches the existing
pattern used by other bridge FDB functions such as br_fdb_add() and
br_fdb_delete().
Fixes: 21446c06b4 ("net: bridge: Introduce UAPI for BR_BOOLOPT_FDB_LOCAL_VLAN_0")
Signed-off-by: Zijing Yin <yzjaurora@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20260402140153.3925663-1-yzjaurora@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an error occurs on the subsequent buffers belonging to the
non-linear part of the skb (e.g. due to an error in the payload length
reported by the NIC or if we consumed all the available fragments for
the skb), the page_pool fragment will not be linked to the skb so it will
not return to the pool in the airoha_qdma_rx_process() error path. Fix the
memory leak by partially reverting commit 'd6d2b0e1538d ("net: airoha: Fix
page recycling in airoha_qdma_rx_process()")' and always running
page_pool_put_full_page routine in the airoha_qdma_rx_process() error
path.
Fixes: d6d2b0e153 ("net: airoha: Fix page recycling in airoha_qdma_rx_process()")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402-airoha_qdma_rx_process-mem-leak-fix-v1-1-b5706f402d3c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_FIXED_PHY is in a loadable module, the fec driver cannot be
built-in any more:
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe':
fec_main.c:(.text+0xc4f367): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_close':
fec_main.c:(.text+0xc59591): undefined reference to `fixed_phy_unregister'
x86_64-linux-ld: vmlinux.o: in function `fec_enet_mii_probe.cold':
Select the fixed phy support on all targets to make this build
correctly, not just on coldfire.
Note that essentially the stub helpers in include/linux/phy_fixed.h
cannot be used correctly because of this build time dependency,
and we could just remove them to hit the build failure more often
when a driver uses them without the 'select FIXED_PHY'.
Fixes: dc86b621e1 ("net: fec: register a fixed phy using fixed_phy_register_100fd if needed")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260402141048.2713445-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcf_csum_act() walks nested VLAN headers directly from skb->data when an
skb still carries in-payload VLAN tags. The current code reads
vlan->h_vlan_encapsulated_proto and then pulls VLAN_HLEN bytes without
first ensuring that the full VLAN header is present in the linear area.
If only part of an inner VLAN header is linearized, accessing
h_vlan_encapsulated_proto reads past the linear area, and the following
skb_pull(VLAN_HLEN) may violate skb invariants.
Fix this by requiring pskb_may_pull(skb, VLAN_HLEN) before accessing and
pulling each nested VLAN header. If the header still is not fully
available, drop the packet through the existing error path.
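The check-before-access rule can be shown with a simplified model. struct
lin_pkt and pull_vlan_hdr() are hypothetical, and the model only tracks how
many bytes are in the linear area; unlike pskb_may_pull() it does not try to
pull more data in from fragments.

```c
#include <stddef.h>
#include <stdint.h>

#define VLAN_HLEN 4   /* 2 bytes TCI + 2 bytes encapsulated protocol */

struct lin_pkt {
	const uint8_t *data;   /* start of the linear area */
	size_t linear_len;     /* bytes readable through data */
};

/*
 * Reads the encapsulated protocol and strips one VLAN header.
 * Returns 0 on success, -1 if the header is not fully linear.
 */
int pull_vlan_hdr(struct lin_pkt *p, uint16_t *encap_proto)
{
	/* Verify the whole header is present before touching it. */
	if (p->linear_len < VLAN_HLEN)
		return -1;

	/* Network byte order: bytes 2 and 3 hold the inner protocol. */
	*encap_proto = (uint16_t)((p->data[2] << 8) | p->data[3]);

	p->data += VLAN_HLEN;
	p->linear_len -= VLAN_HLEN;
	return 0;
}
```

A caller walking nested tags would repeat this call per header and drop the
packet on the first -1.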
Fixes: 2ecba2d1e4 ("net: sched: act_csum: Fix csum calc for tagged packets")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/22df2fcb49f410203eafa5d97963dd36089f4ecf.1774892775.git.caoruide123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The jumbo_frm() chain-mode implementation unconditionally computes
len = nopaged_len - bmax;
where nopaged_len = skb_headlen(skb) (linear bytes only) and bmax is
BUF_SIZE_8KiB or BUF_SIZE_2KiB. However, the caller stmmac_xmit()
decides to invoke jumbo_frm() based on skb->len (total length including
page fragments):
is_jumbo = stmmac_is_jumbo_frm(priv, skb->len, enh_desc);
When a packet has a small linear portion (nopaged_len <= bmax) but a
large total length due to page fragments (skb->len > bmax), the
subtraction wraps as an unsigned integer, producing a huge len value
(~0xFFFFxxxx). This causes the while (len != 0) loop to execute
hundreds of thousands of iterations, passing skb->data + bmax * i
pointers far beyond the skb buffer to dma_map_single(). On IOMMU-less
SoCs (the typical deployment for stmmac), this maps arbitrary kernel
memory to the DMA engine, constituting a kernel memory disclosure and
potential memory corruption from hardware.
Fix this by introducing a buf_len local variable clamped to
min(nopaged_len, bmax). Computing len = nopaged_len - buf_len is then
always safe: it is zero when the linear portion fits within a single
descriptor, causing the while (len != 0) loop to be skipped naturally,
and the fragment loop in stmmac_xmit() handles page fragments afterward.
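The arithmetic can be checked in isolation. The two helpers below are
hypothetical stand-ins for the length computation only, assuming 32-bit
unsigned lengths; they do not reproduce the descriptor handling.

```c
#include <stdint.h>

/* Original computation: wraps when nopaged_len < bmax. */
uint32_t jumbo_len_unclamped(uint32_t nopaged_len, uint32_t bmax)
{
	return nopaged_len - bmax;
}

/* Clamped computation described above: never exceeds nopaged_len. */
uint32_t jumbo_len_clamped(uint32_t nopaged_len, uint32_t bmax)
{
	uint32_t buf_len = nopaged_len < bmax ? nopaged_len : bmax;

	return nopaged_len - buf_len;
}
```

For a 100-byte linear area and bmax of 8192, the unclamped form yields
0xFFFFE064 while the clamped form yields 0, so the loop over the remaining
linear bytes is skipped.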
Fixes: 286a837217 ("stmmac: add CHAINED descriptor mode support (V4)")
Cc: stable@vger.kernel.org
Signed-off-by: Tyllis Xu <LivelyCarpet87@gmail.com>
Link: https://patch.msgid.link/20260401044708.1386919-1-LivelyCarpet87@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When dma_map_single() fails in tse_start_xmit(), the function returns
NETDEV_TX_OK without freeing the skb. Since NETDEV_TX_OK tells the
stack the packet was consumed, the skb is never freed, leaking memory
on every DMA mapping failure.
Add dev_kfree_skb_any() before returning to properly free the skb.
Fixes: bbd2190ce9 ("Altera TSE: Add main and header file for Altera Ethernet Driver")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
Link: https://patch.msgid.link/20260401211218.279185-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull networking fixes from Jakub Kicinski:
"With fixes from wireless, bluetooth and netfilter included we're back
to each PR carrying 30%+ more fixes than in previous era.
The good news is that so far none of the "extra" fixes are themselves
causing real regressions. Not sure how much comfort that is.
Current release - fix to a fix:
- netdevsim: fix build if SKB_EXTENSIONS=n
- eth: stmmac: skip VLAN restore when VLAN hash ops are missing
Previous releases - regressions:
- wifi: iwlwifi: mvm: don't send a 6E related command when
not supported
Previous releases - always broken:
- some info leak fixes
- add missing clearing of skb->cb[] on ICMP paths from tunnels
- ipv6:
- flowlabel: defer exclusive option free until RCU teardown
- avoid overflows in ip6_datagram_send_ctl()
- mpls: add seqcount to protect platform_labels from OOB access
- bridge: improve safety of parsing ND options
- bluetooth: fix leaks, overflows and races in hci_sync
- netfilter: add more input validation, some to address bugs directly
some to prevent exploits from cooking up broken configurations
- wifi:
- ath: avoid poor performance due to stopping the wrong
aggregation session
- virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
- eth:
- fec: fix the PTP periodic output sysfs interface
- enetc: safely reinitialize TX BD ring when it has unsent frames"
* tag 'net-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (95 commits)
eth: fbnic: Increase FBNIC_QUEUE_SIZE_MIN to 64
ipv6: avoid overflows in ip6_datagram_send_ctl()
net: hsr: fix VLAN add unwind on slave errors
net: hsr: serialize seq_blocks merge across nodes
vsock: initialize child_ns_mode_locked in vsock_net_init()
selftests/tc-testing: add tests for cls_fw and cls_flow on shared blocks
net/sched: cls_flow: fix NULL pointer dereference on shared blocks
net/sched: cls_fw: fix NULL pointer dereference on shared blocks
net/x25: Fix overflow when accumulating packets
net/x25: Fix potential double free of skb
bnxt_en: Restore default stat ctxs for ULP when resource is available
bnxt_en: Don't assume XDP is never enabled in bnxt_init_dflt_ring_mode()
bnxt_en: Refactor some basic ring setup and adjustment logic
net/mlx5: Fix switchdev mode rollback in case of failure
net/mlx5: Avoid "No data available" when FW version queries fail
net/mlx5: lag: Check for LAG device before creating debugfs
net: macb: properly unregister fixed rate clocks
net: macb: fix clk handling on PCI glue driver removal
virtio_net: clamp rss_max_key_size to NETDEV_RSS_KEY_LEN
net/sched: sch_netem: fix out-of-bounds access in packet corruption
...
Pull iommu fixes from Joerg Roedel:
- IOMMU-PT related compile breakage in the AMD driver
- IOTLB flushing behavior when unmapped region is larger than requested
due to page-sizes
- Fix IOTLB flush behavior with empty gathers
* tag 'iommu-fixes-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommupt/amdv1: mark amdv1pt_install_leaf_entry as __always_inline
iommupt: Fix short gather if the unmap goes into a large mapping
iommu: Do not call drivers for empty gathers
Pull sound fixes from Takashi Iwai:
"People have been so busy for hunting and we're still getting more
changes than wished for, but it doesn't look too scary; almost all
changes are device-specific small fixes.
I guess it's rather a casual bump, and no more Easter eggs are left
for 7.0 (hopefully)...
- Fixes for the recent regression on ctxfi driver
- Fix missing INIT_LIST_HEAD() for ASoC card_aux_list
- Usual HD- and USB-audio, and ASoC AMD quirk updates
- ASoC fixes for AMD and Intel"
* tag 'sound-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
ASoC: amd: ps: Fix missing leading zeros in subsystem_device SSID log
ALSA: usb-audio: Exclude Scarlett 2i2 1st Gen (8016) from SKIP_IFACE_SETUP
ALSA: hda/realtek: add quirk for Acer Swift SFG14-73
ALSA: hda/realtek: Add quirk for Lenovo Yoga Pro 7 14IMH9
ASoC: Intel: boards: fix unmet dependency on PINCTRL
ASoC: Intel: ehl_rt5660: Use the correct rtd->dev device in hw_params
ALSA: ctxfi: Don't enumerate SPDIF1 at DAIO initialization
ALSA: hda/realtek: Add quirk for Lenovo Yoga Slim 7 14AKP10
ALSA: hda/realtek: add quirk for HP Laptop 15-fc0xxx
ASoC: ep93xx: Fix unchecked clk_prepare_enable() and add rollback on failure
ASoC: soc-core: call missing INIT_LIST_HEAD() for card_aux_list
ALSA: hda/realtek: Add quirk for Samsung Book2 Pro 360 (NP950QED)
ASoC: amd: yc: Add DMI entry for HP Laptop 15-fc0xxx
ASoC: amd: yc: Add DMI quirk for ASUS Vivobook Pro 16X OLED M7601RM
ALSA: hda/realtek: Add quirk for ASUS ROG Strix SCAR 15
ALSA: usb-audio: Exclude Scarlett Solo 1st Gen from SKIP_IFACE_SETUP
ALSA: caiaq: fix stack out-of-bounds read in init_card
ALSA: ctxfi: Check the error for index mapping
ALSA: ctxfi: Fix missing SPDIFI1 index handling
ALSA: hda/realtek: add quirk for HP Victus 15-fb0xxx
...
Pull auxdisplay fixes from Andy Shevchenko:
- Fix NULL dereference in linedisp_release()
- Fix ht16k33 DT bindings to avoid warnings
- Handle errors in I²C transfers in lcd2s driver
* tag 'auxdisplay-v7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/andy/linux-auxdisplay:
auxdisplay: line-display: fix NULL dereference in linedisp_release
auxdisplay: lcd2s: add error handling for i2c transfers
dt-bindings: auxdisplay: ht16k33: Use unevaluatedProperties to fix common property warning
On systems with 64K pages, RX queues will be wedged if users set the
descriptor count to the current minimum (16). Fbnic fragments large
pages into 4K chunks, and scales down the ring size accordingly. With
64K pages and 16 descriptors, the ring size mask is 0 and will never
be filled.
32 descriptors is another special case that wedges the RX rings.
Internally, the rings track pages for the head/tail pointers, not page
fragments. So with 32 descriptors, there's only 1 usable page as one
ring slot is kept empty to disambiguate between an empty/full ring.
As a result, the head pointer never advances and the HW stalls after
consuming 16 page fragments.
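The numbers above can be reproduced with a rough model. This is only a
sketch: ring_pages(), ring_size_mask() and usable_pages() are hypothetical
helpers, and the model assumes one descriptor per 4K fragment and a
power-of-two descriptor count no smaller than the fragments per page. The
driver's real computation may differ.

```c
#include <stdint.h>

#define FRAG_SIZE 4096u   /* 4K chunks, as described above */

/* Ring slots (pages) needed to back `descs` descriptors. */
uint32_t ring_pages(uint32_t descs, uint32_t page_size)
{
	return descs / (page_size / FRAG_SIZE);
}

/* Index mask for a power-of-two ring of pages. */
uint32_t ring_size_mask(uint32_t descs, uint32_t page_size)
{
	return ring_pages(descs, page_size) - 1;
}

/* One slot stays empty to tell a full ring from an empty one. */
uint32_t usable_pages(uint32_t descs, uint32_t page_size)
{
	return ring_size_mask(descs, page_size);
}
```

Under this model, 64K pages give a mask of 0 for 16 descriptors, one usable
page for 32 descriptors, and three usable pages for 64 descriptors.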
Fixes: 0cb4c0a137 ("eth: fbnic: Implement Rx queue alloc/start/stop/free")
Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Link: https://patch.msgid.link/20260401162848.2335350-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yiming Qian reported:
<quote>
I believe I found a locally triggerable kernel bug in the IPv6 sendmsg
ancillary-data path that can panic the kernel via `skb_under_panic()`
(local DoS).
The core issue is a mismatch between:
- a 16-bit length accumulator (`struct ipv6_txoptions::opt_flen`, type
`__u16`) and
- a pointer to the *last* provided destination-options header (`opt->dst1opt`)
when multiple `IPV6_DSTOPTS` control messages (cmsgs) are provided.
- `include/net/ipv6.h`:
- `struct ipv6_txoptions::opt_flen` is `__u16` (wrap possible).
(lines 291-307, especially 298)
- `net/ipv6/datagram.c:ip6_datagram_send_ctl()`:
- Accepts repeated `IPV6_DSTOPTS` and accumulates into `opt_flen`
without rejecting duplicates. (lines 909-933)
- `net/ipv6/ip6_output.c:__ip6_append_data()`:
- Uses `opt->opt_flen + opt->opt_nflen` to compute header
sizes/headroom decisions. (lines 1448-1466, especially 1463-1465)
- `net/ipv6/ip6_output.c:__ip6_make_skb()`:
- Calls `ipv6_push_frag_opts()` if `opt->opt_flen` is non-zero.
(lines 1930-1934)
- `net/ipv6/exthdrs.c:ipv6_push_frag_opts()` / `ipv6_push_exthdr()`:
- Push size comes from `ipv6_optlen(opt->dst1opt)` (based on the
pointed-to header). (lines 1179-1185 and 1206-1211)
1. `opt_flen` is a 16-bit accumulator:
- `include/net/ipv6.h:298` defines `__u16 opt_flen; /* after fragment hdr */`.
2. `ip6_datagram_send_ctl()` accepts *repeated* `IPV6_DSTOPTS` cmsgs
and increments `opt_flen` each time:
- In `net/ipv6/datagram.c:909-933`, for `IPV6_DSTOPTS`:
- It computes `len = ((hdr->hdrlen + 1) << 3);`
- It checks `CAP_NET_RAW` using `ns_capable(net->user_ns,
CAP_NET_RAW)`. (line 922)
- Then it does:
- `opt->opt_flen += len;` (line 927)
- `opt->dst1opt = hdr;` (line 928)
There is no duplicate rejection here (unlike the legacy
`IPV6_2292DSTOPTS` path which rejects duplicates at
`net/ipv6/datagram.c:901-904`).
If enough large `IPV6_DSTOPTS` cmsgs are provided, `opt_flen` wraps
while `dst1opt` still points to a large (2048-byte)
destination-options header.
In the attached PoC (`poc.c`):
- 32 cmsgs with `hdrlen=255` => `len = (255+1)*8 = 2048`
- 1 cmsg with `hdrlen=0` => `len = 8`
- Total increment: `32*2048 + 8 = 65544`, so `(__u16)opt_flen == 8`
- The last cmsg is 2048 bytes, so `dst1opt` points to a 2048-byte header.
3. The transmit path sizes headers using the wrapped `opt_flen`:
- In `net/ipv6/ip6_output.c:1463-1465`:
- `headersize = sizeof(struct ipv6hdr) + (opt ? opt->opt_flen +
opt->opt_nflen : 0) + ...;`
With wrapped `opt_flen`, `headersize`/headroom decisions underestimate
what will be pushed later.
4. When building the final skb, the actual push length comes from
`dst1opt` and is not limited by wrapped `opt_flen`:
- In `net/ipv6/ip6_output.c:1930-1934`:
- `if (opt->opt_flen) proto = ipv6_push_frag_opts(skb, opt, proto);`
- In `net/ipv6/exthdrs.c:1206-1211`, `ipv6_push_frag_opts()` pushes
`dst1opt` via `ipv6_push_exthdr()`.
- In `net/ipv6/exthdrs.c:1179-1184`, `ipv6_push_exthdr()` does:
- `skb_push(skb, ipv6_optlen(opt));`
- `memcpy(h, opt, ipv6_optlen(opt));`
With insufficient headroom, `skb_push()` underflows and triggers
`skb_under_panic()` -> `BUG()`:
- `net/core/skbuff.c:2669-2675` (`skb_push()` calls `skb_under_panic()`)
- `net/core/skbuff.c:207-214` (`skb_panic()` ends in `BUG()`)
- The `IPV6_DSTOPTS` cmsg path requires `CAP_NET_RAW` in the target
netns user namespace (`ns_capable(net->user_ns, CAP_NET_RAW)`).
- Root (or any task with `CAP_NET_RAW`) can trigger this without user
namespaces.
- An unprivileged `uid=1000` user can trigger this if unprivileged
user namespaces are enabled and it can create a userns+netns to obtain
namespaced `CAP_NET_RAW` (the attached PoC does this).
- Local denial of service: kernel BUG/panic (system crash).
- Reproducible with a small userspace PoC.
</quote>
This patch does not reject duplicated options, as this might break
some user applications.
Instead, it makes sure to adjust opt_flen and opt_nflen to correctly
reflect the size of the current option headers, preventing the overflows
and the potential for panics.
This applies to IPV6_DSTOPTS, IPV6_HOPOPTS, and IPV6_RTHDR.
Specifically:
When a new IPV6_DSTOPTS is processed, the length of the old opt->dst1opt
is subtracted from opt->opt_flen before adding the new length.
When a new IPV6_HOPOPTS is processed, the length of the old opt->dst0opt
is subtracted from opt->opt_nflen.
When a new Routing Header (IPV6_RTHDR or IPV6_2292RTHDR) is processed,
the length of the old opt->srcrt is subtracted from opt->opt_nflen.
In the special case within IPV6_2292RTHDR handling where dst1opt is moved
to dst0opt, the length of the old opt->dst0opt is subtracted from
opt->opt_nflen before the new one is added.
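The difference between accumulating and replacing can be modelled with two
16-bit fields. struct txopts and both helpers are hypothetical and track
only lengths, not the option headers themselves.

```c
#include <stdint.h>

struct txopts {
	uint16_t opt_flen;     /* 16-bit accumulator, as in the report */
	uint16_t dst1opt_len;  /* length of the header dst1opt points to, 0 if none */
};

/* Accumulate-only behaviour described in the report. */
void add_dstopts_accumulate(struct txopts *o, uint16_t len)
{
	o->opt_flen = (uint16_t)(o->opt_flen + len);
	o->dst1opt_len = len;
}

/* Replace semantics: drop the old header's length, then add the new one. */
void add_dstopts_replace(struct txopts *o, uint16_t len)
{
	o->opt_flen = (uint16_t)(o->opt_flen - o->dst1opt_len);
	o->opt_flen = (uint16_t)(o->opt_flen + len);
	o->dst1opt_len = len;
}
```

Feeding one 8-byte header followed by 32 headers of 2048 bytes leaves
opt_flen at 8 with the first helper and at 2048 with the second, which
matches the size of the header that would actually be pushed.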
Fixes: 333fad5364 ("[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542).")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Closes: https://lore.kernel.org/netdev/CAL_bE8JNzawgr5OX5m+3jnQDHry2XxhQT5=jThW1zDPtUikRYA@mail.gmail.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260401154721.3740056-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Luka Gejak says:
====================
net: hsr: fixes for PRP duplication and VLAN unwind
This series addresses two logic bugs in the HSR/PRP implementation
identified during a protocol audit. These are targeted for the 'net'
tree as they fix potential memory corruption and state inconsistency.
The primary change resolves a race condition in the node merging path by
implementing address-based lock ordering. This ensures that concurrent
mutations of sequence blocks do not lead to state corruption or
deadlocks.
An additional fix corrects asymmetric VLAN error unwinding by
implementing a centralized unwind path on slave errors.
====================
Link: https://patch.msgid.link/20260401092243.52121-1-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When vlan_vid_add() fails for a secondary slave, the error path calls
vlan_vid_del() on the failing port instead of the peer slave that had
already succeeded. This results in asymmetric VLAN state across the HSR
pair.
Fix this by switching to a centralized unwind path that removes the VID
from any slave device that was already programmed.
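A centralized unwind of this kind can be sketched generically. struct port,
port_add(), port_del() and add_vid_all() are hypothetical and model only
whether a VID is programmed on each port.

```c
#include <stdbool.h>
#include <stddef.h>

struct port {
	bool has_vid;    /* VID currently programmed on this port */
	bool fail_add;   /* test hook: make the add fail */
};

static int port_add(struct port *p)
{
	if (p->fail_add)
		return -1;
	p->has_vid = true;
	return 0;
}

static void port_del(struct port *p)
{
	p->has_vid = false;
}

/* Programs the VID on all ports, or on none of them. */
int add_vid_all(struct port *ports, size_t n)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = port_add(&ports[i]);
		if (err)
			goto unwind;
	}
	return 0;

unwind:
	/* Only ports [0, i) succeeded; the failing port has nothing to undo. */
	while (i-- > 0)
		port_del(&ports[i]);
	return err;
}
```

The single unwind label removes the VID from exactly the ports that were
programmed, so both ports end in the same state on failure.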
Fixes: 1a8a63a530 ("net: hsr: Add VLAN CTAG filter support")
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
Link: https://patch.msgid.link/20260401092243.52121-3-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During node merging, hsr_handle_sup_frame() walks node_curr->seq_blocks
to update node_real without holding node_curr->seq_out_lock. This
allows concurrent mutations from duplicate registration paths, risking
inconsistent state or XArray/bitmap corruption.
Fix this by locking both nodes' seq_out_lock during the merge.
To prevent ABBA deadlocks, locks are acquired in order of memory
address.
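Address-based lock ordering can be sketched in user space with pthread
mutexes. struct hnode, lock_order(), lock_pair() and unlock_pair() are
hypothetical names; the kernel code uses its own spinlock primitives.

```c
#include <pthread.h>
#include <stdint.h>

struct hnode {
	pthread_mutex_t lock;   /* stands in for seq_out_lock */
	int seq;
};

/* Pick the acquisition order by address so every caller agrees on it. */
void lock_order(struct hnode *a, struct hnode *b,
		struct hnode **first, struct hnode **second)
{
	if ((uintptr_t)a <= (uintptr_t)b) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}

void lock_pair(struct hnode *a, struct hnode *b)
{
	struct hnode *first, *second;

	lock_order(a, b, &first, &second);
	pthread_mutex_lock(&first->lock);
	if (second != first)            /* same node: lock only once */
		pthread_mutex_lock(&second->lock);
}

void unlock_pair(struct hnode *a, struct hnode *b)
{
	struct hnode *first, *second;

	lock_order(a, b, &first, &second);
	if (second != first)
		pthread_mutex_unlock(&second->lock);
	pthread_mutex_unlock(&first->lock);
}
```

Because lock_pair(x, y) and lock_pair(y, x) take the mutexes in the same
order, two threads merging the same pair of nodes cannot each hold one lock
while waiting for the other.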
Reviewed-by: Felix Maurer <fmaurer@redhat.com>
Fixes: 415e636751 ("hsr: Implement more robust duplicate discard for PRP")
Signed-off-by: Luka Gejak <luka.gejak@linux.dev>
Link: https://patch.msgid.link/20260401092243.52121-2-luka.gejak@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The `child_ns_mode_locked` field lives in `struct net`, which persists
across vsock module reloads. When the module is unloaded and reloaded,
`vsock_net_init()` resets `mode` and `child_ns_mode` back to their
default values, but does not reset `child_ns_mode_locked`.
The stale lock from the previous module load causes subsequent writes
to `child_ns_mode` to silently fail: `vsock_net_set_child_mode()` sees
the old lock, skips updating the actual value, and returns success
when the requested mode matches the stale lock. The sysctl handler
reports no error, but `child_ns_mode` remains unchanged.
Steps to reproduce:
$ modprobe vsock
$ echo local > /proc/sys/net/vsock/child_ns_mode
$ cat /proc/sys/net/vsock/child_ns_mode
local
$ modprobe -r vsock
$ modprobe vsock
$ echo local > /proc/sys/net/vsock/child_ns_mode
$ cat /proc/sys/net/vsock/child_ns_mode
global <--- expected "local"
Fix this by initializing `child_ns_mode_locked` to 0 (unlocked) in
`vsock_net_init()`, so the write-once mechanism works correctly after
module reload.
Fixes: 102eab95f0 ("vsock: lock down child_ns_mode as write-once")
Reported-by: Jin Liu <jinl@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com>
Link: https://patch.msgid.link/20260401092153.28462-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Regression tests for the shared-block NULL derefs fixed in the previous
two patches:
- fw: attempt to attach an empty fw filter to a shared block and
verify the configuration is rejected with EINVAL.
- flow: create a flow filter on a shared block without a baseclass
and verify the configuration is rejected with EINVAL.
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-3-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
flow_change() calls tcf_block_q() and dereferences q->handle to derive
a default baseclass. Shared blocks leave block->q NULL, causing a NULL
deref when a flow filter without a fully qualified baseclass is created
on a shared block.
Check tcf_block_shared() before accessing block->q and return -EINVAL
for shared blocks. This avoids the null-deref shown below:
=======================================================================
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:flow_change (net/sched/cls_flow.c:508)
Call Trace:
tc_new_tfilter (net/sched/cls_api.c:2432)
rtnetlink_rcv_msg (net/core/rtnetlink.c:6980)
[...]
=======================================================================
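The shape of the check can be sketched as follows. This is a standalone model, assuming simplified stand-ins (struct tc_block, struct tc_qdisc, derive_baseclass) rather than the real cls_flow structures; only the major/minor handle split mirrors the kernel's 16/16-bit layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the qdisc and the filter block. */
struct tc_qdisc { unsigned int handle; };
struct tc_block { struct tc_qdisc *q; bool shared; };

/* Complete a baseclass that lacks a major number from the qdisc handle.
 * A shared block has no single owning qdisc, so refuse it up front
 * instead of dereferencing a NULL q. */
static int derive_baseclass(const struct tc_block *block,
			    unsigned int *baseclass)
{
	if ((*baseclass & 0xFFFF0000u) == 0) {
		if (block->shared || !block->q)
			return -EINVAL;
		*baseclass = (block->q->handle & 0xFFFF0000u) |
			     (*baseclass & 0x0000FFFFu);
	}
	return 0;
}
```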
Fixes: 1abf272022 ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-2-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The old-method path in fw_classify() calls tcf_block_q() and
dereferences q->handle. Shared blocks leave block->q NULL, causing a
NULL deref when an empty cls_fw filter is attached to a shared block
and a packet with a nonzero major skb mark is classified.
Reject the configuration in fw_change() when the old method (no
TCA_OPTIONS) is used on a shared block, since fw_classify()'s
old-method path needs block->q which is NULL for shared blocks.
Call stack of the fixed null-ptr-deref:
KASAN: null-ptr-deref in range [0x0000000000000038-0x000000000000003f]
RIP: 0010:fw_classify (net/sched/cls_fw.c:81)
Call Trace:
tcf_classify (./include/net/tc_wrapper.h:197 net/sched/cls_api.c:1764 net/sched/cls_api.c:1860)
tc_run (net/core/dev.c:4401)
__dev_queue_xmit (net/core/dev.c:4535 net/core/dev.c:4790)
Fixes: 1abf272022 ("net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260331050217.504278-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Martin Schiller says:
====================
net/x25: Fix overflow and double free
This patch set includes 2 fixes:
The first removes a potential double free of the received skb.
The second fixes an overflow when accumulating packets with the more-bit
set.
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
====================
Link: https://patch.msgid.link/20260331-x25_fraglen-v4-0-3e69f18464b4@dev.tdt.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When alloc_skb fails in x25_queue_rx_frame it calls kfree_skb(skb) at
line 48 and returns 1 (error).
This error propagates back through the call chain:
x25_queue_rx_frame returns 1
|
v
x25_state3_machine receives the return value 1 and takes the else
branch at line 278, setting queued=0 and returning 0
|
v
x25_process_rx_frame returns queued=0
|
v
x25_backlog_rcv at line 452 sees queued=0 and calls kfree_skb(skb)
again
This would free the same skb twice. Looking at x25_backlog_rcv:
net/x25/x25_in.c:x25_backlog_rcv() {
...
queued = x25_process_rx_frame(sk, skb);
...
if (!queued)
kfree_skb(skb);
}
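The ownership rule behind this chain can be illustrated with a small model. It is a sketch under the assumption that the buffer stays owned by the caller whenever the callee reports "not queued"; struct buf and the function names are hypothetical, and a counter stands in for kfree_skb().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical buffer that counts how often it was released. */
struct buf { int free_count; };

static void buf_free(struct buf *b)
{
	b->free_count++;
}

/* Returns true when the buffer was queued (ownership taken).
 * On failure the buffer is left alone: the caller still owns it. */
static bool queue_rx(struct buf *b, bool alloc_fails)
{
	(void)b;
	if (alloc_fails)
		return false;	/* no free here, or the caller frees twice */
	return true;
}

/* Caller side, mirroring the "if (!queued) free" pattern. */
static void backlog_rcv(struct buf *b, bool alloc_fails)
{
	if (!queue_rx(b, alloc_fails))
		buf_free(b);
}
```

With a single owner on the error path, the buffer is released exactly once regardless of where the allocation fails.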
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Link: https://patch.msgid.link/20260331-x25_fraglen-v4-1-3e69f18464b4@dev.tdt.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
ASoC: Fixes for v7.0
Another smallish batch of fixes and quirks, these days it's AMD that is
getting all the DMI entries added. We've got one core fix for a missing
list initialisation with auxiliary devices, otherwise it's all fairly
small things.
Michael Chan says:
====================
bnxt_en: Bug fixes
The first patch is a refactor patch needed by the second patch to
fix XDP ring initialization during FW reset. The third patch
fixes an issue related to stats context reservation for RoCE.
====================
Link: https://patch.msgid.link/20260331065138.948205-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During resource reservation, if the L2 driver does not have enough
MSIX vectors to provide to the RoCE driver, it also sets the stat ctxs
for ULP to 0 so that we don't have to reserve them unnecessarily.
However, subsequently the user may reduce L2 rings thereby freeing up
some resources that the L2 driver can now earmark for RoCE. In this
case, the driver should restore the default ULP stat ctxs to make
sure that all RoCE resources are ready for use.
The RoCE driver may fail to initialize in this scenario without this
fix.
Fixes: d630624ebd ("bnxt_en: Utilize ulp client resources if RoCE is not registered")
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260331065138.948205-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The original code made the assumption that when we set up the initial
default ring mode, we must be just loading the driver and XDP cannot
be enabled yet. This is not true when the FW goes through a resource
or capability change. Resource reservations will be cancelled and
reinitialized with XDP already enabled. devlink reload with XDP enabled
will also have the same issue. This scenario will cause the ring
arithmetic to be all wrong in the bnxt_init_dflt_ring_mode() path
causing failure:
bnxt_en 0000:a1:00.0 ens2f0np0: bnxt_setup_int_mode err: ffffffea
bnxt_en 0000:a1:00.0 ens2f0np0: bnxt_request_irq err: ffffffea
bnxt_en 0000:a1:00.0 ens2f0np0: nic open fail (rc: ffffffea)
Fix it by properly accounting for XDP in the bnxt_init_dflt_ring_mode()
path by using the refactored helper functions in the previous patch.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Fixes: ec5d31e3c1 ("bnxt_en: Handle firmware reset status during IF_UP.")
Fixes: 228ea8c187 ("bnxt_en: implement devlink dev reload driver_reinit")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260331065138.948205-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Avoid printing the misleading "kernel answers: No data available" devlink
output when querying firmware or pending firmware version fails
(e.g. MLX5 fw state errors / flash failures).
FW can fail to load the pending flash image and to report its version
for various reasons, for example:
mlxfw: Firmware flash failed: key not applicable, err (7)
mlx5_fw_image_pending: can't read pending fw version while fw state is 1
and the resulting:
$ devlink dev info
kernel answers: No data available
Instead, just report 0 or 0xfff.. versions in case of failure to indicate
a problem, and let other information be shown.
After the fix:
$ devlink dev info
pci/0000:00:06.0:
driver mlx5_core
serial_number xxx...
board.serial_number MT2225300179
versions:
fixed:
fw.psid MT_0000000436
running:
fw.version 22.41.0188
fw 22.41.0188
stored:
fw.version 255.255.65535
fw 255.255.65535
Fixes: 9c86b07e30 ("net/mlx5: Added fw version query command")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260330194015.53585-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__mlx5_lag_dev_add_mdev() may return 0 (success) even when an error
occurs that is handled gracefully. Consequently, the initialization
flow proceeds to call mlx5_ldev_add_debugfs() even when there is no
valid LAG context.
mlx5_ldev_add_debugfs() blindly created the debugfs directory and
attributes. This exposed interfaces (like the members file) that rely on
a valid ldev pointer, leading to potential NULL pointer dereferences if
accessed when ldev is NULL.
Add a check to verify that mlx5_lag_dev(dev) returns a valid pointer
before attempting to create the debugfs entries.
Fixes: 7f46a0b732 ("net/mlx5: Lag, add debugfs to query hardware lag state")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260330194015.53585-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
platform_device_unregister() may still want to use the registered clks
during the runtime resume callback.
Note that commit d82d5303c4 ("net: macb: fix use after free on rmmod")
addressed a similar problem of clk vs platform device unregistration,
but it just moved the bug to another place.
Save the pointers to the clks in local variables so they can be
unregistered after the platform device is unregistered.
BUG: KASAN: use-after-free in clk_prepare+0x5a/0x60
Read of size 8 at addr ffff888104f85e00 by task modprobe/597
CPU: 2 PID: 597 Comm: modprobe Not tainted 6.1.164+ #114
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x8d/0xba
print_report+0x17f/0x496
kasan_report+0xd9/0x180
clk_prepare+0x5a/0x60
macb_runtime_resume+0x13d/0x410 [macb]
pm_generic_runtime_resume+0x97/0xd0
__rpm_callback+0xc8/0x4d0
rpm_callback+0xf6/0x230
rpm_resume+0xeeb/0x1a70
__pm_runtime_resume+0xb4/0x170
bus_remove_device+0x2e3/0x4b0
device_del+0x5b3/0xdc0
platform_device_del+0x4e/0x280
platform_device_unregister+0x11/0x50
pci_device_remove+0xae/0x210
device_remove+0xcb/0x180
device_release_driver_internal+0x529/0x770
driver_detach+0xd4/0x1a0
bus_remove_driver+0x135/0x260
driver_unregister+0x72/0xb0
pci_unregister_driver+0x26/0x220
__do_sys_delete_module+0x32e/0x550
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
</TASK>
Allocated by task 519:
kasan_save_stack+0x2c/0x50
kasan_set_track+0x21/0x30
__kasan_kmalloc+0x8e/0x90
__clk_register+0x458/0x2890
clk_hw_register+0x1a/0x60
__clk_hw_register_fixed_rate+0x255/0x410
clk_register_fixed_rate+0x3c/0xa0
macb_probe+0x1d8/0x42e [macb_pci]
local_pci_probe+0xd7/0x190
pci_device_probe+0x252/0x600
really_probe+0x255/0x7f0
__driver_probe_device+0x1ee/0x330
driver_probe_device+0x4c/0x1f0
__driver_attach+0x1df/0x4e0
bus_for_each_dev+0x15d/0x1f0
bus_add_driver+0x486/0x5e0
driver_register+0x23a/0x3d0
do_one_initcall+0xfd/0x4d0
do_init_module+0x18b/0x5a0
load_module+0x5663/0x7950
__do_sys_finit_module+0x101/0x180
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Freed by task 597:
kasan_save_stack+0x2c/0x50
kasan_set_track+0x21/0x30
kasan_save_free_info+0x2a/0x50
__kasan_slab_free+0x106/0x180
__kmem_cache_free+0xbc/0x320
clk_unregister+0x6de/0x8d0
macb_remove+0x73/0xc0 [macb_pci]
pci_device_remove+0xae/0x210
device_remove+0xcb/0x180
device_release_driver_internal+0x529/0x770
driver_detach+0xd4/0x1a0
bus_remove_driver+0x135/0x260
driver_unregister+0x72/0xb0
pci_unregister_driver+0x26/0x220
__do_sys_delete_module+0x32e/0x550
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Fixes: d82d5303c4 ("net: macb: fix use after free on rmmod")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Link: https://patch.msgid.link/20260330184542.626619-1-pchelkin@ispras.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rss_max_key_size in the virtio spec is the maximum key size supported by
the device, not a mandatory size the driver must use. Also the value 40
is a spec minimum, not a spec maximum.
The current code rejects RSS and can fail probe when the device reports a
larger rss_max_key_size than the driver buffer limit. Instead, clamp the
effective key length to min(device rss_max_key_size, NETDEV_RSS_KEY_LEN)
and keep RSS enabled.
This keeps probe working on devices that advertise larger maximum key sizes
while respecting the netdev RSS key buffer size limit.
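The clamp itself is a one-liner. A minimal sketch, assuming a hypothetical DRIVER_KEY_BUF_LEN in place of NETDEV_RSS_KEY_LEN (the value used here is illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver-side key buffer limit. */
#define DRIVER_KEY_BUF_LEN 52

/* Use as much of the key as both sides support instead of rejecting
 * devices that advertise a larger maximum. */
static size_t effective_key_len(size_t device_max_key_size)
{
	return device_max_key_size < DRIVER_KEY_BUF_LEN ?
	       device_max_key_size : DRIVER_KEY_BUF_LEN;
}
```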
Fixes: 3f7d9c1964 ("virtio_net: Add hash_key_length check")
Cc: stable@vger.kernel.org
Signed-off-by: Srujana Challa <schalla@marvell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://patch.msgid.link/20260326142344.1171317-1-schalla@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In netem_enqueue(), the packet corruption logic uses
get_random_u32_below(skb_headlen(skb)) to select an index for
modifying skb->data. When an AF_PACKET TX_RING sends fully non-linear
packets over an IPIP tunnel, skb_headlen(skb) evaluates to 0.
Passing 0 to get_random_u32_below() takes the variable-ceil slow path
which returns an unconstrained 32-bit random integer. Using this
unconstrained value as an offset into skb->data results in an
out-of-bounds memory access.
Fix this by verifying skb_headlen(skb) is non-zero before attempting
to corrupt the linear data area. Fully non-linear packets will silently
bypass the corruption logic.
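The guard can be sketched in userspace as follows. This is an illustration only: rand() stands in for the kernel's random helper, and corrupt_linear() is a hypothetical name, not the netem function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Flip one random bit in the linear area. Returns 1 if a byte was
 * modified, 0 if there was no linear data to corrupt. */
static int corrupt_linear(unsigned char *data, size_t headlen)
{
	size_t idx;

	if (headlen == 0)
		return 0;	/* nothing linear: skip instead of indexing */

	idx = (size_t)rand() % headlen;
	data[idx] ^= (unsigned char)(1u << (rand() % 8));
	return 1;
}
```

The zero-length case has to be handled before the index is drawn, because a bounded random helper has no valid result for an empty range.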
Fixes: c865e5d99e ("[PKT_SCHED] netem: packet corruption option")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yuhang Zheng <z1652074432@gmail.com>
Signed-off-by: Yucheng Lu <kanolyc@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://patch.msgid.link/45435c0935df877853a81e6d06205ac738ec65fa.1774941614.git.kanolyc@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net. Note that most
of the bugs fixed here are >5 years old. The large PR is not due to an
increase in regressions.
1) Flowtable hardware offload support in IPv6 can lead to out-of-bounds
access when populating the rule action array when combined with
double-tagged vlan. Bump the maximum number of actions from 16 to 24 and
check that this limit is never reached, otherwise bail out. This bug
stems from the original flowtable hardware offload support.
2) nfnetlink_log does not include the netlink header size of the trailing
NLMSG_DONE message when calculating the skb size. From Florian Westphal.
3) Reject names in xt_cgroup and xt_rateest extensions which are not
nul-terminated. Also from Florian.
4) Use nla_strcmp in ipset lookup by set name, since IPSET_ATTR_NAME and
IPSET_ATTR_NAMEREF are of NLA_STRING type. From Florian Westphal.
5) When unregistering conntrack helpers, pass the helper that is going
away so the expectation cleanup is done accordingly, otherwise UaF is
possible when accessing expectations that refer to the helper that is
gone. From Qi Tang.
6) Zero expectation NAT fields to address leaking kernel memory through
the expectation netlink dump when unset. Also from Qi Tang.
7) Use the master conntrack helper when creating expectations via
ctnetlink, ignore the suggested helper through CTA_EXPECT_HELP_NAME.
This addresses a possible read of kernel memory beyond the
expectation object boundary.
8) Fix incorrect release of the hash bucket in ipset when the
bucket is empty, leading to shrinking the hash bucket to size 0
which leads to out-of-bounds writes on subsequent element additions.
From Yifan Wu.
9) Allow the use of x_tables extensions that explicitly declare
NFPROTO_ARP support only. This is to avoid an incorrect hook number
validation due to non-overlapping arp and inet hook number
definitions.
10) Reject immediate NF_QUEUE verdict in nf_tables. The userspace
nft tool always uses the nft_queue expression for queueing.
This ensures this verdict cannot be used for the arp family,
which does not support this.
* tag 'nf-26-04-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: reject immediate NF_QUEUE verdict
netfilter: x_tables: restrict xt_check_match/xt_check_target extensions for NFPROTO_ARP
netfilter: ipset: drop logically empty buckets in mtype_del
netfilter: ctnetlink: ignore explicit helper on new expectations
netfilter: ctnetlink: zero expect NAT fields when CTA_EXPECT_NAT absent
netfilter: nf_conntrack_helper: pass helper to expect cleanup
netfilter: ipset: use nla_strcmp for IPSET_ATTR_NAME attr
netfilter: x_tables: ensure names are nul-terminated
netfilter: nfnetlink_log: account for netlink header size
netfilter: flowtable: strictly check for maximum number of actions
====================
Link: https://patch.msgid.link/20260401103646.1015423-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rds_ib_get_mr() extracts the rds_ib_connection from conn->c_transport_data
and passes it to rds_ib_reg_frmr() for FRWR memory registration. On a
fresh outgoing connection, ic is allocated in rds_ib_conn_alloc() with
i_cm_id = NULL because the connection worker has not yet called
rds_ib_conn_path_connect() to create the rdma_cm_id. When sendmsg() with
RDS_CMSG_RDMA_MAP is called on such a connection, the sendmsg path parses
the control message before any connection establishment, allowing
rds_ib_post_reg_frmr() to dereference ic->i_cm_id->qp and crash the
kernel.
The existing guard in rds_ib_reg_frmr() only checks for !ic (added in
commit 9e630bcb77), which does not catch this case since ic is allocated
early and is always non-NULL once the connection object exists.
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:rds_ib_post_reg_frmr+0x50e/0x920
Call Trace:
rds_ib_post_reg_frmr (net/rds/ib_frmr.c:167)
rds_ib_map_frmr (net/rds/ib_frmr.c:252)
rds_ib_reg_frmr (net/rds/ib_frmr.c:430)
rds_ib_get_mr (net/rds/ib_rdma.c:615)
__rds_rdma_map (net/rds/rdma.c:295)
rds_cmsg_rdma_map (net/rds/rdma.c:860)
rds_sendmsg (net/rds/send.c:1363)
____sys_sendmsg
do_syscall_64
Add a check in rds_ib_get_mr() that verifies ic, i_cm_id, and qp are all
non-NULL before proceeding with FRMR registration, mirroring the guard
already present in rds_ib_post_inv(). Return -ENODEV when the connection
is not ready, which the existing error handling in rds_cmsg_send() converts
to -EAGAIN for userspace retry and triggers rds_conn_connect_if_down() to
start the connection worker.
Fixes: 1659185fb4 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Reviewed-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260330163237.2752440-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
fib6_metric_set() may be called concurrently from softirq context without
holding the FIB table lock. A typical path is:
ndisc_router_discovery()
spin_unlock_bh(&table->tb6_lock) <- lock released
fib6_metric_set(rt, RTAX_HOPLIMIT, ...) <- lockless call
When two CPUs process Router Advertisement packets for the same router
simultaneously, they can both arrive at fib6_metric_set() with the same
fib6_info pointer whose fib6_metrics still points to dst_default_metrics.
if (f6i->fib6_metrics == &dst_default_metrics) { /* both CPUs: true */
struct dst_metrics *p = kzalloc_obj(*p, GFP_ATOMIC);
refcount_set(&p->refcnt, 1);
f6i->fib6_metrics = p; /* CPU1 overwrites CPU0's p -> p0 leaked */
}
The dst_metrics allocated by the losing CPU has refcnt=1 but no pointer
to it anywhere in memory, producing a kmemleak report:
unreferenced object 0xff1100025aca1400 (size 96):
comm "softirq", pid 0, jiffies 4299271239
backtrace:
kmalloc_trace+0x28a/0x380
fib6_metric_set+0xcd/0x180
ndisc_router_discovery+0x12dc/0x24b0
icmpv6_rcv+0xc16/0x1360
Fix this by:
- Setting the value in p->metrics before the pointer is published via
cmpxchg(), so the metrics value is ready before the pointer becomes
visible to other CPUs.
- Replacing the plain pointer store with cmpxchg() and freeing the
allocation safely when the race is lost.
- Adding READ_ONCE()/WRITE_ONCE() for metrics[] accesses in the
non-default metrics path to prevent compiler-based data races.
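The publish-or-free pattern can be sketched with C11 atomics. This is a userspace analogue, not the fib6 code: atomic_compare_exchange_strong() stands in for cmpxchg(), relaxed atomic stores stand in for WRITE_ONCE(), and struct route, struct metrics, and NMETRICS are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NMETRICS 4

struct metrics { atomic_uint vals[NMETRICS]; };

/* Shared default object that must never be written through a route. */
static struct metrics default_metrics;

struct route { _Atomic(struct metrics *) metrics; };

static int route_metric_set(struct route *rt, int idx, unsigned int val)
{
	struct metrics *cur = atomic_load(&rt->metrics);

	if (cur == &default_metrics) {
		struct metrics *expected = &default_metrics;
		struct metrics *p = calloc(1, sizeof(*p));

		if (!p)
			return -1;
		/* Fill in the value before the pointer becomes visible. */
		atomic_store_explicit(&p->vals[idx], val, memory_order_relaxed);
		if (atomic_compare_exchange_strong(&rt->metrics, &expected, p))
			return 0;
		/* Lost the race: drop our copy and use the winner's object. */
		free(p);
		cur = expected;
	}
	atomic_store_explicit(&cur->vals[idx], val, memory_order_relaxed);
	return 0;
}
```

Only one allocation can ever be installed, so the loser's object is freed immediately rather than being overwritten and leaked.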
Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Reported-by: Fei Liu <feliu@redhat.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260331-b4-fib6_metric_set-kmemleak-v3-1-88d27f4d8825@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hci_le_big_create_sync() uses DEFINE_FLEX to allocate a
struct hci_cp_le_big_create_sync on the stack with room for 0x11 (17)
BIS entries. However, conn->num_bis can hold up to HCI_MAX_ISO_BIS (31)
entries — validated against ISO_MAX_NUM_BIS (0x1f) in the caller
hci_conn_big_create_sync(). When conn->num_bis is between 18 and 31,
the memcpy that copies conn->bis into cp->bis writes up to 14 bytes
past the stack buffer, corrupting adjacent stack memory.
This is trivially reproducible: binding an ISO socket with
bc_num_bis = ISO_MAX_NUM_BIS (31) and calling listen() will
eventually trigger hci_le_big_create_sync() from the HCI command
sync worker, causing a KASAN-detectable stack-out-of-bounds write:
BUG: KASAN: stack-out-of-bounds in hci_le_big_create_sync+0x256/0x3b0
Write of size 31 at addr ffffc90000487b48 by task kworker/u9:0/71
Fix this by changing the DEFINE_FLEX count from the incorrect 0x11 to
HCI_MAX_ISO_BIS, which matches the maximum number of BIS entries that
conn->bis can actually carry.
Fixes: 42ecf19471 ("Bluetooth: ISO: Do not emit LE BIG Create Sync if previous is pending")
Cc: stable@vger.kernel.org
Signed-off-by: hkbinbin <hkbinbinbin@gmail.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The legacy responder path in smp_random() currently labels the stored
STK as authenticated whenever pending_sec_level is BT_SECURITY_HIGH.
That reflects what the local service requested, not what the pairing
flow actually achieved.
For Just Works/Confirm legacy pairing, SMP_FLAG_MITM_AUTH stays clear
and the resulting STK should remain unauthenticated even if the local
side requested HIGH security. Use the established MITM state when
storing the responder STK so the key metadata matches the pairing result.
This also keeps the legacy path aligned with the Secure Connections code,
which already treats JUST_WORKS/JUST_CFM as unauthenticated.
Fixes: fff3490f47 ("Bluetooth: Fix setting correct authentication information for SMP STK")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
smp_cmd_pairing_req() currently builds the pairing response from the
initiator auth_req before enforcing the local BT_SECURITY_HIGH
requirement. If the initiator omits SMP_AUTH_MITM, the response can
also omit it even though the local side still requires MITM.
tk_request() then sees an auth value without SMP_AUTH_MITM and may
select JUST_CFM, making method selection inconsistent with the pairing
policy the responder already enforces.
When the local side requires HIGH security, first verify that MITM can
be achieved from the IO capabilities and then force SMP_AUTH_MITM in the
response in both rsp.auth_req and auth. This keeps the responder auth bits
and later method selection aligned.
Fixes: 2b64d153a0 ("Bluetooth: Add MITM mechanism to LE-SMP")
Cc: stable@vger.kernel.org
Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
mesh_send() currently bounds MGMT_OP_MESH_SEND by total command
length, but it never verifies that the bytes supplied for the
flexible adv_data[] array actually match the embedded adv_data_len
field. MGMT_MESH_SEND_SIZE only covers the fixed header, so a
truncated command can still pass the existing 20..50 byte range
check and later drive the async mesh send path past the end of the
queued command buffer.
Keep rejecting zero-length and oversized advertising payloads, but
validate adv_data_len explicitly and require the command length to
exactly match the flexible array size before queueing the request.
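The length check can be sketched as below. This is a standalone model under stated assumptions: struct mesh_send_cmd only approximates the 19-byte fixed header, and MAX_ADV_DATA (31) is inferred from the 20..50 byte range mentioned above rather than taken from the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ADV_DATA 31	/* assumed cap: 50 - 19 */

/* Hypothetical layout: 19 fixed bytes followed by the flexible payload. */
struct mesh_send_cmd {
	uint8_t fixed[18];
	uint8_t adv_data_len;
	uint8_t adv_data[];
};

static int mesh_send_validate(const struct mesh_send_cmd *cmd, size_t len)
{
	const size_t hdr = offsetof(struct mesh_send_cmd, adv_data);

	if (len <= hdr)
		return -EINVAL;	/* no payload bytes at all */
	if (cmd->adv_data_len == 0 || cmd->adv_data_len > MAX_ADV_DATA)
		return -EINVAL;
	/* The bytes supplied must match what the header claims. */
	if (len != hdr + cmd->adv_data_len)
		return -EINVAL;
	return 0;
}
```

An exact match, rather than an upper bound, is what rules out a truncated command whose adv_data_len points past the end of the buffer.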
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
hci_le_remote_conn_param_req_evt, otherwise the hci_conn may be freed
concurrently.
Extend the hci_dev_lock critical section to cover all conn usage.
Fixes: 95118dd4ed ("Bluetooth: hci_event: Use of a function table to handle LE subevents")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_conn lookup and field access must be covered by hdev lock in
set_cig_params_sync, otherwise the hci_conn may be freed concurrently.
Take hdev lock to prevent hci_conn from being deleted or modified
concurrently. The RCU lock alone is not sufficient here, as we also
want to avoid "tearing" in the configuration.
Fixes: a091289218 ("Bluetooth: hci_conn: Fix hci_le_set_cig_params")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Load Long Term Keys stores the user-provided enc_size and later uses
it to size fixed-size stack operations when replying to LE LTK
requests. An enc_size larger than the 16-byte key buffer can therefore
overflow the reply stack buffer.
Reject oversized enc_size values while validating the management LTK
record so invalid keys never reach the stored key state.
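The validation and the later fixed-size copy can be sketched together. This is an illustrative model: struct ltk_record and the helper names are hypothetical, and only the 16-byte key size comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LTK_SIZE 16

/* Hypothetical stored key record with a user-provided enc_size. */
struct ltk_record {
	uint8_t enc_size;
	uint8_t val[LTK_SIZE];
};

/* Reject records whose enc_size cannot fit the key buffer. */
static bool ltk_is_valid(const struct ltk_record *key)
{
	return key->enc_size <= LTK_SIZE;
}

/* Build a zero-padded reply key; refuses unvalidated sizes so the
 * memcpy below can never run past the 16-byte buffers. */
static bool ltk_build_reply(const struct ltk_record *key,
			    uint8_t out[LTK_SIZE])
{
	if (!ltk_is_valid(key))
		return false;
	memset(out, 0, LTK_SIZE);
	memcpy(out, key->val, key->enc_size);
	return true;
}
```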
Fixes: 346af67b8d ("Bluetooth: Add MGMT handlers for dealing with SMP LTK's")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during
initialization") fixed a race for hci commands sent during initialization.
However, there is still a race that happens if an hci event from one of
these commands is received before HCI_UART_REGISTERED has been set at
the end of hci_uart_register_dev(). The event will be ignored which
causes the command to fail with a timeout in the log:
"Bluetooth: hci0: command 0x1003 tx timeout"
This is because the hci event receive path (hci_uart_tty_receive ->
h4_recv) requires HCI_UART_REGISTERED to be set in h4_recv(), while the
hci command transmit path (hci_uart_send_frame -> h4_enqueue) only
requires HCI_UART_PROTO_INIT to be set in hci_uart_send_frame().
The check for HCI_UART_REGISTERED was originally added in commit
c257820291 ("Bluetooth: Fix H4 crash from incoming UART packets")
to fix a crash caused by hu->hdev being null dereferenced. That can no
longer happen: once HCI_UART_PROTO_INIT is set in hci_uart_register_dev()
all pointers (hu, hu->priv and hu->hdev) are valid, and
hci_uart_tty_receive() already calls h4_recv() on HCI_UART_PROTO_INIT
or HCI_UART_PROTO_READY.
Remove the check for HCI_UART_REGISTERED in h4_recv() to fix the race
condition.
Fixes: 5df5dafc17 ("Bluetooth: hci_uart: Fix another race during initialization")
Signed-off-by: Jonathan Rissanen <jonathan.rissanen@axis.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue_once() returns an error, the destroy callback
will not be called.
Fix the reference and memory leaks on these failure paths.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_queue_once() needs to indicate whether a queue item was
added, so that the caller knows whether the callbacks will be called
and can avoid leaking resources.
Change the function to return -EEXIST if queue item already exists.
Modify all callsites to handle that.
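The caller-side contract can be sketched with a toy queue. This is a simplified model, not the hci_sync implementation: entries are deduplicated by a hypothetical integer id, and free() stands in for the destroy callback.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define QUEUE_MAX 8

/* Hypothetical queue that holds at most one entry per id. */
struct sync_queue {
	int ids[QUEUE_MAX];
	void *data[QUEUE_MAX];
	int count;
};

/* Returns 0 if the item was added, -EEXIST if an equal item is already
 * queued. On any error the queue takes no ownership of data. */
static int queue_once(struct sync_queue *q, int id, void *data)
{
	for (int i = 0; i < q->count; i++)
		if (q->ids[i] == id)
			return -EEXIST;
	if (q->count == QUEUE_MAX)
		return -ENOMEM;
	q->ids[q->count] = id;
	q->data[q->count] = data;
	q->count++;
	return 0;
}

/* Caller: release the payload itself whenever nothing was queued,
 * because no destroy callback will ever run for it. */
static int submit(struct sync_queue *q, int id)
{
	int *data = malloc(sizeof(*data));
	int err;

	if (!data)
		return -ENOMEM;
	*data = id;
	err = queue_once(q, id, data);
	if (err)
		free(data);
	return err;
}
```

A caller that considers an already-queued item acceptable can still map -EEXIST to success after cleaning up its own payload.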
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_store_wake_reason() is called from hci_event_packet() immediately
after stripping the HCI event header but before hci_event_func()
enforces the per-event minimum payload length from hci_ev_table.
This means a short HCI event frame can reach bacpy() before any bounds
check runs.
Rather than duplicating skb parsing and per-event length checks inside
hci_store_wake_reason(), move wake-address storage into the individual
event handlers after their existing event-length validation has
succeeded. Convert hci_store_wake_reason() into a small helper that only
stores an already-validated bdaddr while the caller holds hci_dev_lock().
Use the same helper after hci_event_func() with a NULL address to
preserve the existing unexpected-wake fallback semantics when no
validated event handler records a wake address.
Annotate the helper with __must_hold(&hdev->lock) and add
lockdep_assert_held(&hdev->lock) so future call paths keep the lock
contract explicit.
Call the helper from hci_conn_request_evt(), hci_conn_complete_evt(),
hci_sync_conn_complete_evt(), le_conn_complete_evt(),
hci_le_adv_report_evt(), hci_le_ext_adv_report_evt(),
hci_le_direct_adv_report_evt(), hci_le_pa_sync_established_evt(), and
hci_le_past_received_evt().
Fixes: 2f20216c1d ("Bluetooth: Emit controller suspend and resume events")
Cc: stable@vger.kernel.org
Signed-off-by: Oleh Konko <security@1seal.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sco_sock_connect() checks sk_state and sk_type without holding
the socket lock. Two concurrent connect() syscalls on the same
socket can both pass the check and enter sco_connect(), leading
to use-after-free.
The buggy scenario involves three participants and was confirmed
with additional logging instrumentation:
Thread A (connect):     HCI disconnect:       Thread B (connect):
sco_sock_connect(sk)                          sco_sock_connect(sk)
sk_state==BT_OPEN                             sk_state==BT_OPEN
(pass, no lock)                               (pass, no lock)
sco_connect(sk):                              sco_connect(sk):
  hci_dev_lock                                  hci_dev_lock
  hci_connect_sco                                 <- blocked
    -> hcon1
  sco_conn_add->conn1
  lock_sock(sk)
  sco_chan_add:
    conn1->sk = sk
    sk->conn = conn1
  sk_state=BT_CONNECT
  release_sock
  hci_dev_unlock
                        hci_dev_lock
                        sco_conn_del:
                          lock_sock(sk)
                          sco_chan_del:
                            sk->conn=NULL
                            conn1->sk=NULL
                            sk_state=
                              BT_CLOSED
                            SOCK_ZAPPED
                          release_sock
                        hci_dev_unlock
                                                  (unblocked)
                                                hci_connect_sco
                                                  -> hcon2
                                                sco_conn_add
                                                  -> conn2
                                                lock_sock(sk)
                                                sco_chan_add:
                                                  sk->conn=conn2
                                                sk_state=
                                                  BT_CONNECT
                                                // zombie sk!
                                                release_sock
                                                hci_dev_unlock
Thread B revives a BT_CLOSED + SOCK_ZAPPED socket back to
BT_CONNECT. Subsequent cleanup triggers double sock_put() and
use-after-free. Meanwhile conn1 is leaked as it was orphaned
when sco_conn_del() cleared the association.
Fix this by:
- Moving lock_sock() before the sk_state/sk_type checks in
sco_sock_connect() to serialize concurrent connect attempts
- Fixing the sk_type != SOCK_SEQPACKET check to actually
return the error instead of just assigning it
- Adding a state re-check in sco_connect() after lock_sock()
to catch state changes during the window between the locks
- Adding sco_pi(sk)->conn check in sco_chan_add() to prevent
double-attach of a socket to multiple connections
- Adding hci_conn_drop() on sco_chan_add failure to prevent
HCI connection leaks
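The first two points, checking state only under the lock and actually returning the error, can be sketched in userspace. This is a minimal model assuming a pthread mutex in place of lock_sock(); struct sock_model, the state and type constants, and the error codes are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

enum { ST_OPEN, ST_CONNECT, ST_CLOSED };
enum { TYPE_SEQPACKET, TYPE_OTHER };

/* Hypothetical socket model: a lock plus the two checked fields. */
struct sock_model {
	pthread_mutex_t lock;
	int state;
	int type;
};

/* The checks and the state transition happen inside one critical
 * section, so two concurrent callers cannot both pass the check. */
static int sock_connect(struct sock_model *sk)
{
	int err = 0;

	pthread_mutex_lock(&sk->lock);
	if (sk->state != ST_OPEN)
		err = -EBUSY;
	else if (sk->type != TYPE_SEQPACKET)
		err = -EINVAL;	/* returned, not merely assigned */
	else
		sk->state = ST_CONNECT;
	pthread_mutex_unlock(&sk->lock);
	return err;
}
```

The second caller now observes the state written by the first one and fails cleanly instead of attaching the socket to a second connection.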
Fixes: 9a8ec9e8eb ("Bluetooth: SCO: Fix possible circular locking dependency on sco_connect_cfm")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
hci_cmd_sync_run() may run the work immediately if called from existing
sync work (otherwise it queues a new sync work). In this case it fails
to call the destroy() function.
On immediate run, make it behave the same way as if the item was queued
successfully: call destroy, and return 0.
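A minimal userspace sketch of the intended contract, assuming a
hypothetical run_or_queue() helper with a one-slot queue (none of these
names are the hci_sync API): the immediate path runs the work, hands the
result to destroy(), and returns 0 just like the queued path.

```c
#include <assert.h>
#include <stddef.h>

typedef int  (*sync_fn)(void *data);
typedef void (*destroy_fn)(void *data, int err);

/* Hypothetical one-slot queue standing in for the real work list. */
struct queued_item { sync_fn func; void *data; destroy_fn destroy; int used; };
static struct queued_item pending;

static int run_or_queue(int in_sync_work, sync_fn func, void *data,
			destroy_fn destroy)
{
	if (in_sync_work) {
		int err = func(data);      /* run immediately */

		if (destroy)
			destroy(data, err); /* same cleanup as the queued path */
		return 0;
	}
	pending = (struct queued_item){ func, data, destroy, 1 };
	return 0;
}

/* Demo callbacks that only count how often they were invoked. */
static int demo_ran, demo_destroyed, demo_last_err;
static int demo_work(void *data) { (void)data; demo_ran++; return -5; }
static void demo_destroy(void *data, int err)
{
	(void)data;
	demo_destroyed++;
	demo_last_err = err;
}

/* Immediate run: returns the number of destroy() calls observed. */
static int demo_immediate_run(void)
{
	demo_ran = demo_destroyed = 0;
	int ret = run_or_queue(1, demo_work, NULL, demo_destroy);

	return ret == 0 ? demo_destroyed : -1;
}
```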
The only callsite is hci_abort_conn() via hci_cmd_sync_run_once(), and
this changes its return value. However, its return value is not used
except as the return value for hci_disconnect(), and nothing uses the
return value of hci_disconnect(). Hence there should be no behavior
change anywhere.
Fixes: c898f6d7b0 ("Bluetooth: hci_sync: Introduce hci_cmd_sync_run/hci_cmd_sync_run_once")
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
nft_queue is always used from userspace nftables to deliver the NF_QUEUE
verdict. Immediately emitting an NF_QUEUE verdict is never used by the
userspace nft tools, so reject immediate NF_QUEUE verdicts.
The arp family does not provide queue support, but such an immediate
verdict is still reachable. Globally reject NF_QUEUE immediate verdicts
to address this issue.
Fixes: f342de4e2f ("netfilter: nf_tables: reject QUEUE/DROP verdict parameters")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Weiming Shi says:
xt_match and xt_target structs registered with NFPROTO_UNSPEC can be
loaded by any protocol family through nft_compat. When such a
match/target sets .hooks to restrict which hooks it may run on, the
bitmask uses NF_INET_* constants. This is only correct for families
whose hook layout matches NF_INET_*: IPv4, IPv6, INET, and bridge
all share the same five hooks (PRE_ROUTING ... POST_ROUTING).
ARP only has three hooks (IN=0, OUT=1, FORWARD=2) with different
semantics. Because NF_ARP_OUT == 1 == NF_INET_LOCAL_IN, the .hooks
validation silently passes for the wrong reasons, allowing matches to
run on ARP chains where the hook assumptions (e.g. state->in being
set on input hooks) do not hold. This leads to NULL pointer
dereferences; xt_devgroup is one concrete example:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000044: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000220-0x0000000000000227]
RIP: 0010:devgroup_mt+0xff/0x350
Call Trace:
<TASK>
nft_match_eval (net/netfilter/nft_compat.c:407)
nft_do_chain (net/netfilter/nf_tables_core.c:285)
nft_do_chain_arp (net/netfilter/nft_chain_filter.c:61)
nf_hook_slow (net/netfilter/core.c:623)
arp_xmit (net/ipv4/arp.c:666)
</TASK>
Kernel panic - not syncing: Fatal exception in interrupt
Fix it by restricting arptables to NFPROTO_ARP extensions only.
Note that arptables-legacy only supports:
- arpt_CLASSIFY
- arpt_mangle
- arpt_MARK
that provide explicit NFPROTO_ARP match/target declarations.
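The hook-number collision can be illustrated in userspace. The hook
values below match the ones quoted above; input_only_hooks and
hook_allowed() are hypothetical stand-ins for an extension's .hooks mask
and the family-agnostic mask test:

```c
#include <assert.h>

/* NF_INET_* hook numbers (PRE_ROUTING ... POST_ROUTING). */
#define NF_INET_PRE_ROUTING  0
#define NF_INET_LOCAL_IN     1
#define NF_INET_FORWARD      2
#define NF_INET_LOCAL_OUT    3
#define NF_INET_POST_ROUTING 4

/* ARP hook numbers as listed in the message above. */
#define NF_ARP_IN      0
#define NF_ARP_OUT     1
#define NF_ARP_FORWARD 2

/* Hypothetical extension that only wants hooks where an input device exists. */
static const unsigned int input_only_hooks =
	(1u << NF_INET_PRE_ROUTING) |
	(1u << NF_INET_LOCAL_IN) |
	(1u << NF_INET_FORWARD);

/* Mask test that knows nothing about the protocol family. */
static int hook_allowed(unsigned int hook, unsigned int mask)
{
	return (mask & (1u << hook)) != 0;
}
```

Because hook_allowed(NF_ARP_OUT, input_only_hooks) is true, an output
hook passes a check that was written for input hooks only.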
Fixes: 9291747f11 ("netfilter: xtables: add device group match")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
mtype_del() counts empty slots below n->pos in k, but it only drops the
bucket when both n->pos and k are zero. This misses buckets whose live
entries have all been removed while n->pos still points past deleted slots.
Treat a bucket as empty when all positions below n->pos are unused and
release it directly instead of shrinking it further.
Fixes: 8af1c6fbd9 ("netfilter: ipset: Fix forceadd evaluation path")
Cc: stable@vger.kernel.org
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <dstsmallbird@foxmail.com>
Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use the existing master conntrack helper. Anything else is not really
supported and just makes validation more complicated, so ignore
whatever helper userspace suggests for this expectation.
This was uncovered when validating CTA_EXPECT_CLASS with a helper
provided by userspace that differs from the existing master conntrack helper:
BUG: KASAN: slab-out-of-bounds in nf_ct_expect_related_report+0x2479/0x27c0
Read of size 4 at addr ffff8880043fe408 by task poc/102
Call Trace:
nf_ct_expect_related_report+0x2479/0x27c0
ctnetlink_create_expect+0x22b/0x3b0
ctnetlink_new_expect+0x4bd/0x5c0
nfnetlink_rcv_msg+0x67a/0x950
netlink_rcv_skb+0x120/0x350
This allows reading kernel memory bytes past the expectation boundary.
CTA_EXPECT_HELP_NAME is still used to offer the helper name to userspace
via netlink dump.
Fixes: bd07793705 ("netfilter: nfnetlink_queue: allow to attach expectations to conntracks")
Reported-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ctnetlink_alloc_expect() allocates expectations from a non-zeroing
slab cache via nf_ct_expect_alloc(). When CTA_EXPECT_NAT is not
present in the netlink message, saved_addr and saved_proto are
never initialized. Stale data from a previous slab occupant can
then be dumped to userspace by ctnetlink_exp_dump_expect(), which
checks these fields to decide whether to emit CTA_EXPECT_NAT.
The safe sibling nf_ct_expect_init(), used by the packet path,
explicitly zeroes these fields.
Zero saved_addr, saved_proto and dir in the else branch, guarded
by IS_ENABLED(CONFIG_NF_NAT) since these fields only exist when
NAT is enabled.
Confirmed by priming the expect slab with NAT-bearing expectations,
freeing them, creating a new expectation without CTA_EXPECT_NAT,
and observing that the ctnetlink dump emits a spurious
CTA_EXPECT_NAT containing stale data from the prior allocation.
Fixes: 076a0ca026 ("netfilter: ctnetlink: add NAT support for expectations")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
nf_conntrack_helper_unregister() calls nf_ct_expect_iterate_destroy()
to remove expectations belonging to the helper being unregistered.
However, it passes NULL instead of the helper pointer as the data
argument, so expect_iter_me() never matches any expectation and all
of them survive the cleanup.
After unregister returns, nfnl_cthelper_del() frees the helper
object immediately. Subsequent expectation dumps or packet-driven
init_conntrack() calls then dereference the freed exp->helper,
causing a use-after-free.
Pass the actual helper pointer so expectations referencing it are
properly destroyed before the helper object is freed.
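The difference between passing NULL and passing the helper can be shown
with a small userspace model. The struct layouts and function names
below are simplified assumptions that only mirror the shape of the
iterate-and-match pattern:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct helper { const char *name; };
struct expect { const struct helper *helper; bool alive; };

/* Predicate in the style of expect_iter_me(): match on the helper pointer. */
static bool expect_uses_helper(const struct expect *exp, const void *data)
{
	return exp->helper == data;
}

/* Destroy every live expectation for which iter() returns true. */
static size_t expect_iterate_destroy(struct expect *tbl, size_t n,
		bool (*iter)(const struct expect *, const void *),
		const void *data)
{
	size_t destroyed = 0;

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].alive && iter(&tbl[i], data)) {
			tbl[i].alive = false;
			destroyed++;
		}
	}
	return destroyed;
}

/* Returns how many expectations were destroyed for the chosen data argument. */
static size_t demo_destroyed_with(int pass_helper)
{
	static const struct helper h = { "demo" }, other = { "other" };
	struct expect tbl[3] = { { &h, true }, { &other, true }, { &h, true } };

	return expect_iterate_destroy(tbl, 3, expect_uses_helper,
				      pass_helper ? &h : NULL);
}
```

With NULL as data nothing matches, so every expectation survives; with
the helper pointer both expectations that reference it are removed.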
BUG: KASAN: slab-use-after-free in string+0x38f/0x430
Read of size 1 at addr ffff888003b14d20 by task poc/103
Call Trace:
string+0x38f/0x430
vsnprintf+0x3cc/0x1170
seq_printf+0x17a/0x240
exp_seq_show+0x2e5/0x560
seq_read_iter+0x419/0x1280
proc_reg_read+0x1ac/0x270
vfs_read+0x179/0x930
ksys_read+0xef/0x1c0
Freed by task 103:
The buggy address is located 32 bytes inside of
freed 192-byte region [ffff888003b14d00, ffff888003b14dc0)
Fixes: ac7b848390 ("netfilter: expect: add and use nf_ct_expect_iterate helpers")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Reviewed-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
IPSET_ATTR_NAME and IPSET_ATTR_NAMEREF are of NLA_STRING type, so they
cannot be treated like a c-string.
They either have to be switched to NLA_NUL_STRING, or the compare
operations need to use the nla functions.
Fixes: f830837f0e ("netfilter: ipset: list:set set type support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject names that lack a \0 character before feeding them
to functions that expect c-strings.
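A minimal sketch of such a check in plain C, assuming a hypothetical
check_name_terminated() helper and an illustrative error code: the name
is accepted only if a \0 occurs within the bytes that were received.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Returns 0 if buf[0..len) contains a terminating NUL, -EINVAL otherwise. */
static int check_name_terminated(const char *buf, size_t len)
{
	return memchr(buf, '\0', len) ? 0 : -EINVAL;
}

/* Four payload bytes and no terminator: must be rejected. */
static int demo_unterminated(void)
{
	const char raw[4] = { 'a', 'b', 'c', 'd' };

	return check_name_terminated(raw, sizeof(raw));
}

/* Same buffer size, but the last byte is the terminator. */
static int demo_terminated(void)
{
	const char raw[4] = { 'a', 'b', 'c', '\0' };

	return check_name_terminated(raw, sizeof(raw));
}
```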
Fixes tag is the most recent commit that needs this change.
Fixes: c38c4597e4 ("netfilter: implement xt_cgroup cgroup2 path match")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This is a followup to an old bug fix: NLMSG_DONE needs to account
for the netlink header size, not just the attribute size.
This can result in a WARN splat + drop of the netlink message,
but other than this there are no ill effects.
Fixes: 9dfa1dfe4d ("netfilter: nf_log: account for size of NLMSG_DONE attribute")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The maximum number of flowtable hardware offload actions in IPv6 is:
* ethernet mangling (4 payload actions, 2 for each ethernet address)
* SNAT (4 payload actions)
* DNAT (4 payload actions)
* Double VLAN (4 vlan actions, 2 for popping vlan, and 2 for pushing)
for QinQ.
* Redirect (1 action)
This adds up to 17, while the maximum is 16. In addition, act_ct supports
tunnel actions too. Note that a payload action operates at 32-bit word
level, so mangling an IPv6 address takes 4 payload actions.
Update flow_action_entry_next() calls to check for the maximum number of
supported actions.
While at it, raise the maximum number of actions per flow from 16 to 24
so this works fine with IPv6 setups.
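The arithmetic and the bounds check can be sketched as follows. The
flow_rule_model type and action_entry_next() are hypothetical userspace
stand-ins, not the flow offload API:

```c
#include <assert.h>
#include <stddef.h>

struct flow_rule_model {
	unsigned int num_entries;
	unsigned int max_entries;
};

/* Reserve the next action slot, or return -1 when the rule is full. */
static int action_entry_next(struct flow_rule_model *r)
{
	if (r->num_entries >= r->max_entries)
		return -1;
	return (int)r->num_entries++;
}

/* Worst case from the list above: ethernet + SNAT + DNAT + VLAN + redirect. */
enum { IPV6_WORST_CASE = 4 + 4 + 4 + 4 + 1 };

/* Returns how many of the worst-case actions fit under a given limit. */
static unsigned int fill_worst_case(unsigned int max_entries)
{
	struct flow_rule_model r = { 0, max_entries };

	for (int i = 0; i < IPV6_WORST_CASE; i++)
		if (action_entry_next(&r) < 0)
			break;  /* caller must fail the offload, not overrun */
	return r.num_entries;
}
```

With a limit of 16 the last action is refused instead of being written
past the array; with a limit of 24 all 17 actions fit.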
Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
stmmac_vlan_restore() unconditionally calls stmmac_vlan_update() when
NETIF_F_VLAN_FEATURES is set. On platforms where priv->hw->vlan (or
->update_vlan_hash) is not provided, stmmac_update_vlan_hash() returns
-EINVAL via stmmac_do_void_callback(), resulting in a spurious
"Failed to restore VLANs" error even when no VLAN filtering is in use.
Remove the unneeded comment.
Remove the unused return value from stmmac_vlan_restore().
Tested on Orange Pi Zero 3.
Fixes: bd7ad51253 ("net: stmmac: Fix VLAN HW state restore")
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Link: https://patch.msgid.link/20260328-vlan-restore-error-v4-1-f88624c530dc@mmpsystems.pl
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ftgmac100_alloc_rings() allocates rx_skbs, tx_skbs, rxdes, txdes, and
rx_scratch in stages. On intermediate failures it returned -ENOMEM
directly, leaking resources allocated earlier in the function.
Rework the failure path to use staged local unwind labels and free
allocated resources in reverse order before returning -ENOMEM. This
matches common netdev allocation cleanup style.
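The staged unwind pattern looks roughly like this in userspace. The
allocator wrappers, the three-member struct rings and the label names
are assumptions made for the sketch; the real function allocates more
resources than shown here:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs;   /* outstanding allocations, for the demo */
static int alloc_seq;     /* running allocation counter */
static int fail_at = -1;  /* allocation index forced to fail, -1 for none */

static void *demo_alloc(size_t n)
{
	void *p;

	if (alloc_seq++ == fail_at)
		return NULL;
	p = malloc(n);
	if (p)
		live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		free(p);
		live_allocs--;
	}
}

struct rings { void *rx_skbs, *tx_skbs, *rxdes; };

static int alloc_rings(struct rings *r)
{
	r->rx_skbs = demo_alloc(64);
	if (!r->rx_skbs)
		goto err;
	r->tx_skbs = demo_alloc(64);
	if (!r->tx_skbs)
		goto err_free_rx_skbs;
	r->rxdes = demo_alloc(64);
	if (!r->rxdes)
		goto err_free_tx_skbs;
	return 0;

	/* Unwind in reverse order of allocation. */
err_free_tx_skbs:
	demo_free(r->tx_skbs);
	r->tx_skbs = NULL;
err_free_rx_skbs:
	demo_free(r->rx_skbs);
	r->rx_skbs = NULL;
err:
	return -ENOMEM;
}

/* Force allocation number n to fail and report how many allocations leaked. */
static int leaked_when_failing_at(int n)
{
	struct rings r = { NULL, NULL, NULL };

	live_allocs = 0;
	alloc_seq = 0;
	fail_at = n;
	if (alloc_rings(&r) == 0) {
		demo_free(r.rxdes);
		demo_free(r.tx_skbs);
		demo_free(r.rx_skbs);
	}
	return live_allocs;
}
```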
Fixes: d72e01a043 ("ftgmac100: Use a scratch buffer for failed RX allocations")
Cc: stable@vger.kernel.org
Signed-off-by: Yufan Chen <yufan.chen@linux.dev>
Link: https://patch.msgid.link/20260328163257.60836-1-yufan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`ip6fl_seq_show()` walks the global flowlabel hash under the seq-file
RCU read-side lock and prints `fl->opt->opt_nflen` when an option block
is present.
Exclusive flowlabels currently free `fl->opt` as soon as `fl->users`
drops to zero in `fl_release()`. However, the surrounding
`struct ip6_flowlabel` remains visible in the global hash table until
later garbage collection removes it and `fl_free_rcu()` finally tears it
down.
A concurrent `/proc/net/ip6_flowlabel` reader can therefore race that
early `kfree()` and dereference freed option state, triggering a crash
in `ip6fl_seq_show()`.
Fix this by keeping `fl->opt` alive until `fl_free_rcu()`. That matches
the lifetime already required for the enclosing flowlabel while readers
can still reach it under RCU.
Fixes: d3aedd5ebd ("ipv6 flowlabel: Convert hash list to RCU.")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ren Wei <enjou1224z@gmail.com>
Signed-off-by: Zhengchuan Liang <zcliangcn@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/07351f0ec47bcee289576f39f9354f4a64add6e4.1774855883.git.zcliangcn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull sched_ext fixes from Tejun Heo:
- Fix SCX_KICK_WAIT deadlock where multiple CPUs waiting for each other
in hardirq context form a cycle. Move the wait to a balance callback
which can drop the rq lock and process IPIs.
- Fix inconsistent NUMA node lookup in scx_select_cpu_dfl() where
the waker_node used cpu_to_node() while prev_cpu used
scx_cpu_node_if_enabled(), leading to undefined behavior when
per-node idle tracking is disabled.
* tag 'sched_ext-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Add cyclic SCX_KICK_WAIT stress test
sched_ext: Fix SCX_KICK_WAIT deadlock by deferring wait to balance callback
sched_ext: Fix inconsistent NUMA node lookup in scx_select_cpu_dfl()
Pull workqueue fix from Tejun Heo:
- Fix false positive stall reports on weakly ordered architectures
where the lockless worklist/timestamp check in the watchdog can
observe stale values due to memory reordering.
Recheck under pool->lock to confirm.
* tag 'wq-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Better describe stall check
workqueue: Fix false positive stall reports
Pull cgroup fixes from Tejun Heo:
- Fix cgroup rmdir racing with dying tasks.
Deferred task cgroup unlink introduced a window where cgroup.procs
is empty but the cgroup is still populated, causing rmdir to fail
with -EBUSY and selftest failures.
Make rmdir wait for dying tasks to fully leave and fix selftests to
not depend on synchronous populated updates.
- Fix cpuset v1 task migration failure from empty cpusets under strict
security policies.
When CPU hotplug removes the last CPU from a v1 cpuset, tasks must be
migrated to an ancestor without a security_task_setscheduler() check
that would block the migration.
* tag 'cgroup-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Skip security check for hotplug induced v1 task migration
cgroup/cpuset: Simplify setsched decision check in task iteration loop of cpuset_can_attach()
cgroup: Fix cgroup_drain_dying() testing the wrong condition
selftests/cgroup: Don't require synchronous populated update on task exit
cgroup: Wait for dying tasks to leave on rmdir
When a CPU hot removal causes a v1 cpuset to lose all its CPUs, the
cpuset hotplug handler will schedule a work function to migrate tasks
in that cpuset with no CPU to its ancestor to enable those tasks to
continue running.
If a strict security policy is in place, however, the task migration
may fail when security_task_setscheduler() call in cpuset_can_attach()
returns a -EACCES error. That will mean that those tasks will have
no CPU to run on. The system administrators will have to explicitly
intervene to either add CPUs to that cpuset or move the tasks elsewhere
if they are aware of it.
This problem was found by a reported test failure in the LTP's
cpuset_hotplug_test.sh. Fix this problem by treating this special case as
an exception to skip the setsched security check in cpuset_can_attach()
when a v1 cpuset with tasks has no CPU left.
With that patch applied, the cpuset_hotplug_test.sh test can be run
successfully without failure.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Centralize the check required to run security_task_setscheduler() in
the task iteration loop of cpuset_can_attach() outside of the loop as
it has no dependency on the characteristics of the tasks themselves.
There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull udf fix from Jan Kara:
"Fix for a race in UDF that can lead to memory corruption"
* tag 'fs_for_v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix race between file type conversion and writeback
mpage: Provide variant of mpage_writepages() with own optional folio handler
The Lenovo Yoga Pro 7 14IMH9 (DMI: 83E2) shares PCI SSID 17aa:3847
with the Legion 7 16ACHG6, but has a different codec subsystem ID
(17aa:38cf). The existing SND_PCI_QUIRK for 17aa:3847 applies
ALC287_FIXUP_LEGION_16ACHG6, which attempts to initialize an external
I2C amplifier (CLSA0100) that is not present on the Yoga Pro 7 14IMH9.
As a result, pin 0x17 (bass speakers) is connected to DAC 0x06 which
has no volume control, making hardware volume adjustment completely
non-functional. Audio is either silent or at maximum volume regardless
of the slider position.
Add a HDA_CODEC_QUIRK entry using the codec subsystem ID (17aa:38cf)
to correctly identify the Yoga Pro 7 14IMH9 and apply
ALC287_FIXUP_YOGA9_14IMH9_BASS_SPK_PIN, which redirects pin 0x17 to
DAC 0x02 and restores proper volume control. The existing Legion entry
is preserved unchanged.
This follows the same pattern used for 17aa:386e, where Legion Y9000X
and Yoga Pro 7 14ARP8 share a PCI SSID but are distinguished via
HDA_CODEC_QUIRK.
Link: https://github.com/nomad4tech/lenovo-yoga-pro-7-linux
Tested-by: Alexander Savenko <alex.sav4387@gmail.com>
Signed-off-by: Alexander Savenko <alex.sav4387@gmail.com>
Link: https://patch.msgid.link/20260331082929.44890-1-alex.sav4387@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
br_mrp_start_test() and br_mrp_start_in_test() accept the user-supplied
interval value from netlink without validation. When interval is 0,
usecs_to_jiffies(0) yields 0, causing the delayed work
(br_mrp_test_work_expired / br_mrp_in_test_work_expired) to reschedule
itself with zero delay. This creates a tight loop on system_percpu_wq
that allocates and transmits MRP test frames at maximum rate, exhausting
all system memory and causing a kernel panic via OOM deadlock.
The same zero-interval issue applies to br_mrp_start_in_test_parse()
for interconnect test frames.
Use NLA_POLICY_MIN(NLA_U32, 1) in the nla_policy tables for both
IFLA_BRIDGE_MRP_START_TEST_INTERVAL and
IFLA_BRIDGE_MRP_START_IN_TEST_INTERVAL, so zero is rejected at the
netlink attribute parsing layer before the value ever reaches the
workqueue scheduling code. This is consistent with how other bridge
subsystems (br_fdb, br_mst) enforce range constraints on netlink
attributes.
Fixes: 20f6a05ef6 ("bridge: mrp: Rework the MRP netlink interface")
Fixes: 7ab1748e4c ("bridge: mrp: Extend MRP netlink interface for configuring MRP interconnect")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260328063000.1845376-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This reverts commit c073f07576 ("ASoC: Intel: sof_sdw: select PINCTRL_CS42L43 and SPI_CS42L43")
Currently, SND_SOC_INTEL_SOUNDWIRE_SOF_MACH selects PINCTRL_CS42L43
without also selecting or depending on PINCTRL, despite PINCTRL_CS42L43
depending on PINCTRL.
See the following Kbuild warning:
WARNING: unmet direct dependencies detected for PINCTRL_CS42L43
Depends on [n]: PINCTRL [=n] && MFD_CS42L43 [=m]
Selected by [m]:
- SND_SOC_INTEL_SOUNDWIRE_SOF_MACH [=m] && SOUND [=y] && SND [=m] && SND_SOC [=m] && SND_SOC_INTEL_MACH [=y] && (SND_SOC_SOF_INTEL_COMMON [=m] || !SND_SOC_SOF_INTEL_COMMON [=m]) && SND_SOC_SOF_INTEL_SOUNDWIRE [=m] && I2C [=y] && SPI_MASTER [=y] && ACPI [=y] && (MFD_INTEL_LPSS [=n] || COMPILE_TEST [=y]) && (SND_SOC_INTEL_USER_FRIENDLY_LONG_NAMES [=n] || COMPILE_TEST [=y]) && SOUNDWIRE [=m]
In response to v1 of this patch [1], Arnd pointed out that there is
no compile-time dependency between sof_sdw and the PINCTRL_CS42L43 driver.
After testing, I can confirm that the kernel compiled with
SND_SOC_INTEL_SOUNDWIRE_SOF_MACH enabled and PINCTRL_CS42L43 disabled.
This unmet dependency was detected by kconfirm, a static analysis
tool for Kconfig.
Link: https://lore.kernel.org/all/b8aecc71-1fed-4f52-9f6c-263fbe56d493@app.fastmail.com/ [1]
Fixes: c073f07576 ("ASoC: Intel: sof_sdw: select PINCTRL_CS42L43 and SPI_CS42L43")
Signed-off-by: Julian Braha <julianbraha@gmail.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260325001522.1727678-1-julianbraha@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In rt5660_hw_params(), the error path for snd_soc_dai_set_sysclk()
correctly uses rtd->dev as the logging device, but the error path for
snd_soc_dai_set_pll() uses codec_dai->dev instead.
These two devices are distinct:
- rtd->dev is the platform device of the PCM runtime (the Intel HDA/SSP
controller, e.g. 0000:00:1f.3), which owns the machine driver callback.
- codec_dai->dev is the I2C device of the rt5660 codec itself
(i2c-10EC5660:00).
Since hw_params is a machine driver operation and both calls are made
within the same function from the machine driver's context, all error
messages should be attributed to rtd->dev. Using codec_dai->dev for one
of them would suggest the error originates inside the codec driver,
which is misleading.
Align the PLL error log with the sysclk one to use rtd->dev, matching
the convention used by all other Intel board drivers in this directory.
Signed-off-by: Sachin Mokashi <sachin.mokashi@intel.com>
Link: https://patch.msgid.link/20260327131439.1330373-1-sachin.mokashi@intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Suraj Gupta says:
====================
Correct BD length masks and BQL accounting for multi-BD TX packets
This patch series fixes two issues in the Xilinx AXI Ethernet driver:
1. Corrects the BD length masks to match the AXIDMA IP spec.
2. Fixes BQL accounting for multi-BD TX packets.
====================
Link: https://patch.msgid.link/20260327073238.134948-1-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When a TX packet spans multiple buffer descriptors (scatter-gather),
axienet_free_tx_chain sums the per-BD actual length from descriptor
status into a caller-provided accumulator. That sum is reset on each
NAPI poll. If the BDs for a single packet complete across different
polls, the earlier bytes are lost and never credited to BQL. This
causes BQL to think bytes are permanently in-flight, eventually
stalling the TX queue.
The SKB pointer is stored only on the last BD of a packet. When that
BD completes, use skb->len for the byte count instead of summing
per-BD status lengths. This matches netdev_sent_queue(), which debits
skb->len, and naturally survives across polls because no partial
packet contributes to the accumulator.
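A simplified model of the two accounting schemes, assuming a packet of
three BDs (100 + 200 + 300 bytes, skb->len of 600) whose first two BDs
complete in one poll and the last one in the next. The bd_model struct,
poll_credit() and the rule that a poll credits nothing unless a packet
completed are modelling assumptions, not driver code:

```c
#include <assert.h>
#include <stddef.h>

struct bd_model {
	unsigned int status_len;  /* per-BD actual length from descriptor status */
	unsigned int skb_len;     /* skb->len, only meaningful on the last BD */
	int has_skb;              /* the SKB pointer lives on the last BD only */
};

/* One poll over BDs [first, first + count); returns bytes credited to BQL. */
static unsigned int poll_credit(const struct bd_model *ring, size_t first,
				size_t count, int use_skb_len)
{
	unsigned int bytes = 0, packets = 0;

	for (size_t i = first; i < first + count; i++) {
		if (use_skb_len) {
			if (ring[i].has_skb)
				bytes += ring[i].skb_len;  /* new scheme */
		} else {
			bytes += ring[i].status_len;       /* old scheme */
		}
		if (ring[i].has_skb)
			packets++;
	}
	/* Modelling assumption: nothing is credited unless a packet completed. */
	return packets ? bytes : 0;
}

/* Total credit after the two polls described in the lead-in. */
static unsigned int demo_credit(int use_skb_len)
{
	const struct bd_model ring[3] = {
		{ 100, 0, 0 }, { 200, 0, 0 }, { 300, 600, 1 },
	};

	return poll_credit(ring, 0, 2, use_skb_len) +
	       poll_credit(ring, 2, 1, use_skb_len);
}
```

In this model the per-BD sum credits only 300 of the 600 queued bytes,
while the skb->len variant credits all 600.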
Fixes: c900e49d58 ("net: xilinx: axienet: Implement BQL")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-3-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The XAXIDMA_BD_CTRL_LENGTH_MASK and XAXIDMA_BD_STS_ACTUAL_LEN_MASK
macros were defined as 0x007FFFFF (23 bits), but the AXI DMA IP
product guide (PG021) specifies the buffer length field as bits 25:0
(26 bits). Update both masks to match the IP documentation.
In practice this had no functional impact, since Ethernet frames are
far smaller than 2^23 bytes and the extra bits were always zero, but
the masks should still reflect the hardware specification.
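The bit widths can be checked with a few lines of C. GENMASK32() is a
userspace stand-in for the kernel's GENMASK(), and OLD_LEN_MASK,
NEW_LEN_MASK and mask_width() are names invented for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK(h, l) on 32-bit values. */
#define GENMASK32(h, l) \
	(((~UINT32_C(0)) >> (31 - (h))) & ((~UINT32_C(0)) << (l)))

#define OLD_LEN_MASK UINT32_C(0x007FFFFF)  /* bits 22:0 */
#define NEW_LEN_MASK GENMASK32(25, 0)      /* bits 25:0 */

/* Number of set bits in the mask. */
static unsigned int mask_width(uint32_t mask)
{
	unsigned int n = 0;

	for (; mask; mask >>= 1)
		n += mask & 1u;
	return n;
}
```

Bits 25:0 correspond to 0x03FFFFFF, three bits wider than the old value.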
Fixes: 8a3b7a252d ("drivers/net/ethernet/xilinx: added Xilinx AXI Ethernet driver")
Signed-off-by: Suraj Gupta <suraj.gupta2@amd.com>
Reviewed-by: Sean Anderson <sean.anderson@linux.dev>
Link: https://patch.msgid.link/20260327073238.134948-2-suraj.gupta2@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pn532_receive_buf() appends every incoming byte to dev->recv_skb and
only resets the buffer after pn532_uart_rx_is_frame() recognizes a
complete frame. A continuous stream of bytes without a valid PN532 frame
header therefore keeps growing the skb until skb_put_u8() hits the tail
limit.
Drop the accumulated partial frame once the fixed receive buffer is full
so malformed UART traffic cannot grow the skb past
PN532_UART_SKB_BUFF_LEN.
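The bounded-append idea can be sketched with a fixed array in place of
the skb. RX_BUF_LEN, struct rx_buf and rx_push() are hypothetical names,
and frame detection is left out on purpose:

```c
#include <assert.h>
#include <stddef.h>

#define RX_BUF_LEN 8  /* stands in for PN532_UART_SKB_BUFF_LEN */

struct rx_buf {
	unsigned char data[RX_BUF_LEN];
	size_t len;
};

/* Append one byte; if the buffer is already full, drop the partial frame. */
static void rx_push(struct rx_buf *b, unsigned char byte)
{
	if (b->len == RX_BUF_LEN)
		b->len = 0;  /* no valid frame seen: start over */
	b->data[b->len++] = byte;
}

/* Feed nbytes of non-frame data and return the resulting buffer length. */
static size_t demo_flood(size_t nbytes)
{
	struct rx_buf b = { { 0 }, 0 };

	for (size_t i = 0; i < nbytes; i++)
		rx_push(&b, 0xFF);
	return b.len;
}
```

However long the malformed stream is, the stored length never exceeds
the fixed buffer size.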
Fixes: c656aa4c27 ("nfc: pn533: add UART phy driver")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260326142033.82297-1-pengpeng@iscas.ac.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
bond_xmit_broadcast() reuses the original skb for the last slave
(determined by bond_is_last_slave()) and clones it for others.
Concurrent slave enslave/release can mutate the slave list during
RCU-protected iteration, changing which slave is "last" mid-loop.
This causes the original skb to be double-consumed (double-freed).
Replace the racy bond_is_last_slave() check with a simple index
comparison (i + 1 == slaves_count) against the pre-snapshot slave
count taken via READ_ONCE() before the loop. This preserves the
zero-copy optimization for the last slave while making the "last"
determination stable against concurrent list mutations.
The UAF can trigger the following crash:
==================================================================
BUG: KASAN: slab-use-after-free in skb_clone
Read of size 8 at addr ffff888100ef8d40 by task exploit/147
CPU: 1 UID: 0 PID: 147 Comm: exploit Not tainted 7.0.0-rc3+ #4 PREEMPTLAZY
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:123)
print_report (mm/kasan/report.c:379 mm/kasan/report.c:482)
kasan_report (mm/kasan/report.c:597)
skb_clone (include/linux/skbuff.h:1724 include/linux/skbuff.h:1792 include/linux/skbuff.h:3396 net/core/skbuff.c:2108)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5334)
bond_start_xmit (drivers/net/bonding/bond_main.c:5567 drivers/net/bonding/bond_main.c:5593)
dev_hard_start_xmit (include/linux/netdevice.h:5325 include/linux/netdevice.h:5334 net/core/dev.c:3871 net/core/dev.c:3887)
__dev_queue_xmit (include/linux/netdevice.h:3601 net/core/dev.c:4838)
ip6_finish_output2 (include/net/neighbour.h:540 include/net/neighbour.h:554 net/ipv6/ip6_output.c:136)
ip6_finish_output (net/ipv6/ip6_output.c:208 net/ipv6/ip6_output.c:219)
ip6_output (net/ipv6/ip6_output.c:250)
ip6_send_skb (net/ipv6/ip6_output.c:1985)
udp_v6_send_skb (net/ipv6/udp.c:1442)
udpv6_sendmsg (net/ipv6/udp.c:1733)
__sys_sendto (net/socket.c:730 net/socket.c:742 net/socket.c:2206)
__x64_sys_sendto (net/socket.c:2209)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
</TASK>
Allocated by task 147:
Freed by task 147:
The buggy address belongs to the object at ffff888100ef8c80
which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 192 bytes inside of
freed 224-byte region [ffff888100ef8c80, ffff888100ef8d60)
Memory state around the buggy address:
ffff888100ef8c00: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
ffff888100ef8c80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888100ef8d00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
^
ffff888100ef8d80: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
ffff888100ef8e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 4e5bd03ae3 ("net: bonding: fix bond_xmit_broadcast return value error bug")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260326075553.3960562-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The recent refactoring of the xfi driver changed the assignment of
atc->daios[] in atc_get_resources(); it now loops over all enum
DAIOTYP entries, whereas it formerly looped over only a subset of them.
The problem is that the last entry, SPDIF1, is a special type that
is used only for hw20k1 CTSB073X model (as a replacement of SPDIFIO),
and there is no corresponding definition for hw20k2. Due to the lack
of the info, it caused a kernel crash on hw20k2, which was already
worked around by the commit b045ab3dff ("ALSA: ctxfi: Fix missing
SPDIFI1 index handling").
This patch addresses the root cause of the regression above properly,
simply by skipping the incorrect SPDIF1 type in the parser loop.
To make the change clearer, the code is slightly rearranged, too.
Fixes: a2dbaeb5c6 ("ALSA: ctxfi: Refactor resource alloc for sparse mappings")
Cc: <stable@vger.kernel.org>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1259925
Link: https://patch.msgid.link/20260331081227.216134-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
bnxt_hwrm_func_backing_store_qcaps_v2() stores resp->type from the
firmware response in ctxm->type and later uses that value to index
fixed backing-store metadata arrays such as ctx_arr[] and
bnxt_bstore_to_trace[].
ctxm->type is fixed by the current backing-store query type and matches
the array index of ctx->ctx_arr. Set ctxm->type from the current loop
variable instead of depending on resp->type.
Also update the loop to advance type from next_valid_type in the for
statement, which keeps the control flow simpler for non-valid and
unchanged entries.
Fixes: 6a4d0774f0 ("bnxt_en: Add support for new backing store query firmware API")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Tested-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260328234357.43669-1-pengpeng@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Netfilter flowtable can theoretically try to offload flower rules as soon
as a net_device is registered while all the other ones are not
registered or initialized, triggering a possible NULL pointer dereference
of the qdma pointer in the airoha_ppe_set_cpu_port routine. Moreover, if
register_netdev() fails for a particular net_device, there is a small
race if Netfilter tries to offload flowtable rules before all the
net_devices are properly unregistered in the airoha_probe() error path,
triggering a NULL pointer dereference in the airoha_ppe_set_cpu_port
routine. In order to avoid any possible race, delay offloading until
all net_devices are registered in the networking subsystem.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260329-airoha-regiser-race-fix-v2-1-f4ebb139277b@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building netlink messages, tc_chain_fill_node() never initializes
the tcm_info field of struct tcmsg. Since the allocation is not zeroed,
kernel heap memory is leaked to userspace through this 4-byte field.
The fix simply zeroes tcm_info alongside the other fields that are
already initialized.
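The leak and its fix can be modelled in userspace by filling a buffer
with a poison pattern before the header is written. struct tcmsg_model
is a local copy of the field layout for illustration, and fill_node()
is a hypothetical simplification of the real function:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local model of the tcmsg field order; not the UAPI header. */
struct tcmsg_model {
	unsigned char  tcm_family;
	unsigned char  tcm__pad1;
	unsigned short tcm__pad2;
	int            tcm_ifindex;
	uint32_t       tcm_handle;
	uint32_t       tcm_parent;
	uint32_t       tcm_info;
};

/* Write every field explicitly, including tcm_info. */
static void fill_node(struct tcmsg_model *tcm, int ifindex, uint32_t parent)
{
	tcm->tcm_family = 0;  /* AF_UNSPEC */
	tcm->tcm__pad1 = 0;
	tcm->tcm__pad2 = 0;
	tcm->tcm_handle = 0;
	tcm->tcm_ifindex = ifindex;
	tcm->tcm_parent = parent;
	tcm->tcm_info = 0;    /* the field the message says was left unset */
}

/* Returns tcm_info after filling a buffer that held stale bytes. */
static uint32_t demo_info_after_fill(void)
{
	struct tcmsg_model tcm;

	memset(&tcm, 0xAA, sizeof(tcm));  /* simulate non-zeroed heap memory */
	fill_node(&tcm, 1, 0);
	return tcm.tcm_info;
}
```

Without the tcm_info assignment the function would return the poison
value 0xAAAAAAAA, which is the userspace analogue of the reported leak.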
Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260328211436.1010152-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull crypto library fix from Eric Biggers:
"Fix missing zeroization of the ChaCha state"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: chacha: Zeroize permuted_state before it leaves scope
Pull rtla build fix from Steven Rostedt:
- Fix build failure when libbpf does not exist
RTLA supports building without BPF libraries, but a recent change
added a libbpf.h include outside of the BPF protection which caused
build failures when libbpf was not installed.
* tag 'trace-rtla-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rtla: Fix build without libbpf header
ep93xx_i2s_enable() calls clk_prepare_enable() on three clocks in
sequence (mclk, sclk, lrclk) without checking the return value of any
of them. If an intermediate enable fails, the clocks that were already
enabled are never rolled back, leaking them until the next disable cycle
— which may never come if the stream never started cleanly.
Change ep93xx_i2s_enable() from void to int. Add error checking after
each clk_prepare_enable() call and unwind already-enabled clocks on
failure. Propagate the error through ep93xx_i2s_startup() and
ep93xx_i2s_resume(), both of which already return int.
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: f4ff6b56bc ("ASoC: cirrus: i2s: Prepare clock before using it")
Link: https://patch.msgid.link/20260324210909.45494-1-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Add a test that creates a 3-CPU kick_wait cycle (A->B->C->A). A BPF
scheduler kicks the next CPU in the ring with SCX_KICK_WAIT on every
enqueue while userspace workers generate continuous scheduling churn via
sched_yield(). Without the preceding fix, this hangs the machine within seconds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
SCX_KICK_WAIT busy-waits in kick_cpus_irq_workfn() using
smp_cond_load_acquire() until the target CPU's kick_sync advances. Because
the irq_work runs in hardirq context, the waiting CPU cannot reschedule and
its own kick_sync never advances. If multiple CPUs form a wait cycle, all
CPUs deadlock.
Replace the busy-wait in kick_cpus_irq_workfn() with resched_curr() to
force the CPU through do_pick_task_scx(), which queues a balance callback
to perform the wait. The balance callback drops the rq lock and enables
IRQs following the sched_core_balance() pattern, so the CPU can process
IPIs while waiting. The local CPU's kick_sync is advanced on entry to
do_pick_task_scx() and continuously during the wait, ensuring any CPU that
starts waiting for us sees the advancement and cannot form cyclic
dependencies.
Fixes: 90e55164da ("sched_ext: Implement SCX_KICK_WAIT")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Christian Loehle <christian.loehle@arm.com>
Link: https://lore.kernel.org/r/20260316100249.1651641-1-christian.loehle@arm.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
A component has "card_aux_list", which is added and deleted in the
bind/unbind aux dev functions (A), and used in the for_each_card_auxs()
loop (B).
static void soc_unbind_aux_dev(...)
{
    ...
    for_each_card_auxs_safe(...) {
        ...
(A)         list_del(&component->card_aux_list);
    }                            ^^^^^^^^^^^^^
}
static int soc_bind_aux_dev(...)
{
    ...
    for_each_card_pre_auxs(...) {
        ...
(A)         list_add(&component->card_aux_list, ...);
    }                            ^^^^^^^^^^^^^
    ...
}
#define for_each_card_auxs(card, component) \
(B)     list_for_each_entry(component, ..., card_aux_list)
                                            ^^^^^^^^^^^^^
But it has been used without calling INIT_LIST_HEAD().
> git grep card_aux_list sound/soc
sound/soc/soc-core.c: list_del(&component->card_aux_list);
sound/soc/soc-core.c: list_add(&component->card_aux_list, ...);
Add the missing INIT_LIST_HEAD() call for it.
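Why the initialization matters can be sketched with a minimal
doubly-linked list in userspace. The demo_* helpers below are
hypothetical simplifications of the kernel list API (no pointer
poisoning, no debug checks); they only show that a self-linked head
makes delete and empty checks well defined.

```c
/* Simplified stand-in for struct list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* Point the head at itself, which is what an empty list looks like. */
void demo_init_list_head(struct demo_list_head *h)
{
	h->next = h;
	h->prev = h;
}

int demo_list_empty(const struct demo_list_head *h)
{
	return h->next == h;
}

/* Insert entry right after head. */
void demo_list_add(struct demo_list_head *entry, struct demo_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry and leave it self-linked (this sketch does not poison). */
void demo_list_del(struct demo_list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = entry;
	entry->prev = entry;
}
```

With an initialized entry, demo_list_del() on something that was never
added only touches the entry itself. Without initialization, the same
call would follow whatever pointers happened to be in memory.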
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/87341mxa8l.wl-kuninori.morimoto.gx@renesas.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The loop creates a whitespace-stripped copy of the card shortname
where `len < sizeof(card->id)` is used for the bounds check. Since
sizeof(card->id) is 16 and the local id buffer is also 16 bytes,
writing 16 non-space characters fills the entire buffer,
overwriting the terminating null byte.
When this non-null-terminated string is later passed to
snd_card_set_id() -> copy_valid_id_string(), the function scans
forward with `while (*nid && ...)` and reads past the end of the
stack buffer, reading the contents of the stack.
A USB device with a product name containing many non-ASCII, non-space
characters (e.g. multibyte UTF-8) will reliably trigger this as follows:
BUG: KASAN: stack-out-of-bounds in copy_valid_id_string
sound/core/init.c:696 [inline]
BUG: KASAN: stack-out-of-bounds in snd_card_set_id_no_lock+0x698/0x74c
sound/core/init.c:718
The off-by-one has been present since commit bafeee5b1f ("ALSA:
snd_usb_caiaq: give better shortname") from June 2009 (v2.6.31-rc1),
which first introduced this whitespace-stripping loop. The original
code never accounted for the null terminator when bounding the copy.
Fix this by changing the loop bound to `sizeof(card->id) - 1`,
ensuring at least one byte remains as the null terminator.
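A minimal userspace sketch of the corrected bound follows.
DEMO_ID_SIZE and demo_strip_spaces() are hypothetical names chosen for
this illustration; only the 16-byte size and the `- 1` bound come from
the description above.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_ID_SIZE 16	/* mirrors the 16-byte id buffer described above */

/*
 * Copy src into id, dropping spaces. The bound is DEMO_ID_SIZE - 1 so
 * the last byte is always left for the terminator.
 */
size_t demo_strip_spaces(char id[DEMO_ID_SIZE], const char *src)
{
	size_t len = 0;

	memset(id, 0, DEMO_ID_SIZE);
	for (; *src && len < DEMO_ID_SIZE - 1; src++) {
		if (*src != ' ')
			id[len++] = *src;
	}
	return len;
}
```

With a bound of DEMO_ID_SIZE instead, a source of 16 or more non-space
characters would write id[15] as well and leave the buffer unterminated.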
Fixes: bafeee5b1f ("ALSA: snd_usb_caiaq: give better shortname")
Cc: stable@vger.kernel.org
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Reported-by: Berk Cem Goksel <berkcgoksel@gmail.com>
Signed-off-by: Berk Cem Goksel <berkcgoksel@gmail.com>
Link: https://patch.msgid.link/20260329133825.581585-1-berkcgoksel@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pull vfs fixes from Christian Brauner:
- Fix netfs_limit_iter() hitting BUG() when an ITER_KVEC iterator
reaches it via core dump writes to 9P filesystems. Add ITER_KVEC
handling following the same pattern as the existing ITER_BVEC code.
- Fix a NULL pointer dereference in the netfs unbuffered write retry
path when the filesystem (e.g., 9P) doesn't set the prepare_write
operation.
- Clear I_DIRTY_TIME in sync_lazytime for filesystems implementing
->sync_lazytime. Without this the flag stays set and may cause
additional unnecessary calls during inode deactivation.
- Increase tmpfs size in mount_setattr selftests. A recent commit
bumped the ext4 image size to 2 GB but didn't adjust the tmpfs
backing store, so mkfs.ext4 fails with ENOSPC writing metadata.
- Fix an invalid folio access in iomap when i_blkbits matches the folio
size but differs from the I/O granularity. The cur_folio pointer
would not get invalidated and iomap_read_end() would still be called
on it despite the IO helper owning it.
- Fix hash_name() docstring.
- Fix read abandonment during netfs retry where the subreq variable
used for abandonment could be uninitialized on the first pass or
point to a deleted subrequest on later passes.
- Don't block sync for filesystems with no data integrity guarantees.
Add a SB_I_NO_DATA_INTEGRITY superblock flag replacing the per-inode
AS_NO_DATA_INTEGRITY mapping flag so sync kicks off writeback but
doesn't wait for flusher threads. This fixes a suspend-to-RAM hang on
fuse-overlayfs where the flusher thread blocks when the fuse daemon
is frozen.
- Fix a lockdep splat in iomap when reads fail. iomap_read_end_io()
invokes fserror_report() which calls igrab() taking i_lock in hardirq
context while i_lock is normally held with interrupts enabled. Kick
failed read handling to a workqueue.
- Remove the redundant netfs_io_stream::front member and use
stream->subrequests.next instead, fixing a potential issue in the
direct write code path.
* tag 'vfs-7.0-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the handling of stream->front by removing it
iomap: fix lockdep complaint when reads fail
writeback: don't block sync for filesystems with no data integrity guarantees
netfs: Fix read abandonment during retry
vfs: fix docstring of hash_name()
iomap: fix invalid folio access when i_blkbits differs from I/O granularity
selftests/mount_setattr: increase tmpfs size for idmapped mount tests
fs: clear I_DIRTY_TIME in sync_lazytime
netfs: Fix NULL pointer dereference in netfs_unbuffered_write() on retry
netfs: Fix kernel BUG in netfs_limit_iter() for ITER_KVEC iterators
Pull phy fixes from Vinod Koul:
- Qualcomm PCS table fix for ufs phy
- TI device node reference fix
- Common prop kconfig fix
- lynx CDR lock workaround for lanes disabled
- usb disconnect function fix of k1 driver
* tag 'phy-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy:
phy: qcom: qmp-ufs: Fix SM8650 PCS table for Gear 4
phy: ti: j721e-wiz: Fix device node reference leak in wiz_get_lane_phy_types()
phy: k1-usb: add disconnect function support
phy: lynx-28g: skip CDR lock workaround for lanes disabled in the device tree
phy: make PHY_COMMON_PROPS Kconfig symbol conditionally user-selectable
Pull dmaengine fixes from Vinod Koul:
"A bunch of driver fixes with idxd ones being the biggest:
- Xilinx regmap init error handling, dma_device directions, residue
calculation, and reset related timeout fixes
- Renesas CHCTRL updates and driver list fixes
- DW HDMA cycle bits and MSI data programming fix
- IDXD pile of fixes for memory leak and FLR fixes"
* tag 'dmaengine-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (21 commits)
dmaengine: xilinx_dma: Fix reset related timeout with two-channel AXIDMA
dmaengine: xilinx: xilinx_dma: Fix unmasked residue subtraction
dmaengine: xilinx: xilinx_dma: Fix residue calculation for cyclic DMA
dmaengine: xilinx: xilinx_dma: Fix dma_device directions
dmaengine: sh: rz-dmac: Move CHCTRL updates under spinlock
dmaengine: sh: rz-dmac: Protect the driver specific lists
dmaengine: idxd: fix possible wrong descriptor completion in llist_abort_desc()
dmaengine: xilinx: xdma: Fix regmap init error handling
dmaengine: dw-edma: Fix multiple times setting of the CYCLE_STATE and CYCLE_BIT bits for HDMA.
dmaengine: idxd: Fix leaking event log memory
dmaengine: idxd: Fix freeing the allocated ida too late
dmaengine: idxd: Fix memory leak when a wq is reset
dmaengine: idxd: Fix not releasing workqueue on .release()
dmaengine: idxd: Wait for submitted operations on .device_synchronize()
dmaengine: idxd: Flush all pending descriptors
dmaengine: idxd: Flush kernel workqueues on Function Level Reset
dmaengine: idxd: Fix possible invalid memory access after FLR
dmaengine: idxd: Fix crash when the event log is disabled
dmaengine: idxd: Fix lockdep warnings when calling idxd_device_config()
dmaengine: dw-edma: fix MSI data programming for multi-IRQ case
...
Pull i2c fixes from Wolfram Sang:
- designware: fix resume-probe race causing NULL-deref in amdisp
- imx: fix timeout on repeated reads and extra clock at end
- MAINTAINERS: drop outdated I2C website
* tag 'i2c-for-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: drop outdated I2C website
i2c: designware: amdisp: Fix resume-probe race condition issue
i2c: imx: ensure no clock is generated after last read
i2c: imx: fix i2c issue when reading multiple messages
Pull kvm fixes from Paolo Bonzini:
"s390:
- Lots of small and not-so-small fixes for the newly rewritten gmap,
mostly affecting the handling of nested guests.
x86:
- Fix an issue with shadow paging, which causes KVM to install an
MMIO PTE in the shadow page tables without first zapping a non-MMIO
SPTE if KVM didn't see the write that modified the shadowed guest
PTE.
While commit a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs
directly in mmu_set_spte()") was right about it being impossible to
miss such a write if it was coming from the guest, it failed to
account for writes to guest memory that are outside the scope of
KVM: if userspace modifies the guest PTE, and then the guest hits a
relevant page fault, KVM will get confused"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86/mmu: Only WARN in direct MMUs when overwriting shadow-present SPTE
KVM: x86/mmu: Drop/zap existing present SPTE even when creating an MMIO SPTE
KVM: s390: Fix KVM_S390_VCPU_FAULT ioctl
KVM: s390: vsie: Fix guest page tables protection
KVM: s390: vsie: Fix unshadowing while shadowing
KVM: s390: vsie: Fix refcount overflow for shadow gmaps
KVM: s390: vsie: Fix nested guest memory shadowing
KVM: s390: Correctly handle guest mappings without struct page
KVM: s390: Fix gmap_link()
KVM: s390: vsie: Fix check for pre-existing shadow mapping
KVM: s390: Remove non-atomic dat_crstep_xchg()
KVM: s390: vsie: Fix dat_split_ste()
Pull xen fix from Juergen Gross:
"A single fix for a very rare bug introduced in rc5"
* tag 'for-linus-7.0a-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: unregister xenstore notifier on module exit
Pull x86 fixes from Ingo Molnar:
- Fix an early boot crash in AMD SEV-SNP guests, caused by incorrect
FSGSBASE init ordering (Nikunj A Dadhania)
- Remove X86_CR4_FRED from the CR4 pinned bits mask, to fix a race
window during the bootup of SEV-{ES,SNP} or TDX guests, which can
crash them if they trigger exceptions in that window (Borislav
Petkov)
- Fix early boot failures on SEV-ES/SNP guests, due to incorrect early
GHCB access (Nikunj A Dadhania)
- Add clarifying comment to the CRn pinning logic, to avoid future
confusion & bugs (Peter Zijlstra)
* tag 'x86-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add comment clarifying CRn pinning
x86/fred: Fix early boot failures on SEV-ES/SNP guests
x86/cpu: Remove X86_CR4_FRED from the CR4 pinned bits mask
x86/cpu: Enable FSGSBASE early in cpu_init_exception_handling()
Pull timer fix from Ingo Molnar:
"Fix an argument order bug in the alarm timer forwarding logic, which
may cause missed expirations or incorrect overrun accounting"
* tag 'timers-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Fix argument order in alarm_timer_forward()
Pull futex fixes from Ingo Molnar:
- Tighten up the sys_futex_requeue() ABI a bit, to disallow dissimilar
futex flags and potential UaF access (Peter Zijlstra)
- Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
(Hao-Yu Yang)
- Clear stale exiting pointer in futex_lock_pi() retry path, which
triggered a warning (and potential misbehavior) in stress-testing
(Davidlohr Bueso)
* tag 'locking-urgent-2026-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Clear stale exiting pointer in futex_lock_pi() retry path
futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()
futex: Require sys_futex_requeue() to have identical flags
Pull overlayfs fixes from Amir Goldstein:
- Fix regression in 'xino' feature detection
I clumsily introduced this regression myself when working on another
subsystem (fsnotify). Both the regression and the fix have almost no
visible impact on users except for some kmsg prints.
- Fix to performance regression in v6.12.
This regression was reported by Google COS developers.
It is not uncommon these days for the year-old mature LTS to get
adopted by distros and get exposed to many new workloads. We made a
sub-smart move of making a behavior change in v6.12 which could
impact performance, without making it opt-in. Fixing this mistake
retroactively, to be picked by LTS.
* tag 'ovl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
ovl: make fsync after metadata copy-up opt-in mount option
ovl: fix wrong detection of 32bit inode numbers
Pull ext4 fixes from Ted Ts'o:
- Update the MAINTAINERS file to add reviewers for the ext4 file system
- Add a test to issue an ext4 warning (not a WARN_ON) if there are still
dirty pages attached to an evicted inode.
- Fix a number of Syzkaller issues
- Fix memory leaks on error paths
- Replace some BUG and WARN with EFSCORRUPTED reporting
- Fix a potential crash when disabling discard via remount followed by
an immediate unmount. (Found by Sashiko)
- Fix a corner case which could lead to allocating blocks with block
numbers > 2**32 for an indirect-mapped inode
- Fix a race when reallocating a freed inode that could result in a
deadlock
- Fix a use-after-free in update_super_work when racing with umount
- Fix build issues when trying to build ext4's kunit tests as a module
- Fix a bug where ext4_split_extent_zeroout() could fail to pass back
an error from ext4_ext_dirty()
- Avoid allocating blocks from a corrupted block group in
ext4_mb_find_by_goal()
- Fix a percpu_counters list corruption BUG triggered by an ext4
extents kunit
- Fix a potential crash caused by the fast commit flush path potentially
accessing the jinode structure before it is fully initialized
- Fix fsync(2) in no-journal mode to make sure the dirtied inode is
written to storage
- Fix a bug in no-journal mode where, when ext4 tries to avoid using
recently deleted inodes and lazy itable initialization is enabled,
an uninitialized inode can get skipped, triggering an e2fsck
complaint
- Fix journal credit calculation when setting an xattr when both the
encryption and ea_inode features are enabled
- Fix corner cases which could result in stale xarray tags after
writeback
- Fix generic/475 failures caused by ENOSPC errors while creating a
symlink when the system crashes, resulting in a file system
inconsistency when replaying the fast commit journal
* tag 'ext4_for_linus-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (27 commits)
ext4: always drain queued discard work in ext4_mb_release()
ext4: handle wraparound when searching for blocks for indirect mapped blocks
ext4: skip split extent recovery on corruption
ext4: fix iloc.bh leak in ext4_fc_replay_inode() error paths
ext4: fix deadlock on inode reallocation
ext4: fix use-after-free in update_super_work when racing with umount
ext4: fix the might_sleep() warnings in kvfree()
ext4: reject mount if bigalloc with s_first_data_block != 0
ext4: fix extents-test.c is not compiled when EXT4_KUNIT_TESTS=M
ext4: fix mballoc-test.c is not compiled when EXT4_KUNIT_TESTS=M
ext4: introduce EXPORT_SYMBOL_FOR_EXT4_TEST() helper
jbd2: gracefully abort on checkpointing state corruptions
ext4: avoid infinite loops caused by residual data
ext4: validate p_idx bounds in ext4_ext_correct_indexes
ext4: test if inode's all dirty pages are submitted to disk
ext4: minor fix for ext4_split_extent_zeroout()
ext4: avoid allocate block from corrupted group in ext4_mb_find_by_goal()
ext4: kunit: extents-test: lix percpu_counters list corruption
ext4: publish jinode after initialization
ext4: replace BUG_ON with proper error handling in ext4_read_inline_folio
...
Pull btrfs fixes from David Sterba:
"A few more fixes. There's one that stands out in size as it fixes an
edge case in fsync.
- fix issue on fsync where file with zero size appears as a non-zero
after log replay
- in zlib compression, handle a crash when data alignment causes
folio reference issues
- fix possible crash with enabled tracepoints on an overlayfs mount
- handle device stats update error
- on zoned filesystems, fix kobject leak on sub-block groups
- fix super block offset in an error message in validation"
* tag 'for-7.0-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lost error when running device stats on multiple devices fs
btrfs: tracepoints: get correct superblock from dentry in event btrfs_sync_file()
btrfs: zlib: handle page aligned compressed size correctly
btrfs: fix leak of kobject name for sub-group space_info
btrfs: fix zero size inode with non-zero size after log replay
btrfs: fix super block offset in error message in btrfs_validate_super()
Pull misc fixes from Andrew Morton:
"10 hotfixes. 8 are cc:stable. 9 are for MM.
There's a 3-patch series of DAMON fixes from Josh Law and SeongJae
Park. The rest are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-28-10-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mseal: update VMA end correctly on merge
bug: avoid format attribute warning for clang as well
mm/pagewalk: fix race between concurrent split and refault
mm/memory: fix PMD/PUD checks in follow_pfnmap_start()
mm/damon/sysfs: check contexts->nr in repeat_call_fn
mm/damon/sysfs: check contexts->nr before accessing contexts_arr[0]
mm/damon/sysfs: fix param_ctx leak on damon_sysfs_new_test_ctx() failure
mm/swap: fix swap cache memcg accounting
MAINTAINERS, mailmap: update email address for Harry Yoo
mm/huge_memory: fix folio isn't locked in softleaf_to_folio()
As stated on the website: "This wiki has been archived and the content
is no longer updated." No need to reference it.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Pull tracing fixes from Steven Rostedt:
- Fix potential deadlock in osnoise and hotplug
The interface_lock can be called by an osnoise thread and the CPU
shutdown logic of osnoise can wait for this thread to finish. But
cpus_read_lock() can also be taken while holding the interface_lock.
This produces a circular lock dependency and can cause a deadlock.
Swap the ordering of cpus_read_lock() and the interface_lock to have
interface_lock taken within the cpus_read_lock() context to prevent
this circular dependency.
- Fix freeing of event triggers in early boot up
If the same trigger is added on the kernel command line, the second
one will fail to be applied and the trigger created will be freed.
This calls into the deferred logic and creates a kernel thread to do
the freeing. But the command line logic is called before kernel
threads can be created and this leads to a NULL pointer dereference.
Delay freeing event triggers until late init.
* tag 'trace-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Drain deferred trigger frees if kthread creation fails
tracing: Fix potential deadlock in cpu hotplug with osnoise
Pull s390 fixes from Vasily Gorbik:
- Add array_index_nospec() to syscall dispatch table lookup to prevent
limited speculative out-of-bounds access with user-controlled syscall
number
- Mark array_index_mask_nospec() __always_inline since GCC may emit an
out-of-line call instead of the inline data dependency sequence the
mitigation relies on
- Clear r12 on kernel entry to prevent potential speculative use of
user value in system_call, ext/io/mcck interrupt handlers
* tag 's390-7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/entry: Scrub r12 register on kernel entry
s390/syscalls: Add spectre boundary for syscall dispatch table
s390/barrier: Make array_index_mask_nospec() __always_inline
Fuzzing/stressing futexes triggered:
WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524
When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY
and stores a refcounted task pointer in 'exiting'.
After wait_for_owner_exiting() consumes that reference, the local pointer
is never reset to NULL. Upon a retry, if futex_lock_pi_atomic() returns a
different error, the bogus pointer is passed to wait_for_owner_exiting().
CPU0                       CPU1                       CPU2
futex_lock_pi(uaddr)
// acquires the PI futex
exit()
  futex_cleanup_begin()
    futex_state = EXITING;
                           futex_lock_pi(uaddr)
                             futex_lock_pi_atomic()
                               attach_to_pi_owner()
                                 // observes EXITING
                                 *exiting = owner; // takes ref
                                 return -EBUSY
                             wait_for_owner_exiting(-EBUSY, owner)
                               put_task_struct(); // drops ref
                             // exiting still points to owner
                             goto retry;
                             futex_lock_pi_atomic()
                               lock_pi_update_atomic()
                                 cmpxchg(uaddr)
                                                      *uaddr ^= WAITERS // whatever
                                 // value changed
                               return -EAGAIN;
                             wait_for_owner_exiting(-EAGAIN, exiting) // stale
                               WARN_ON_ONCE(exiting)
Fix this by resetting upon retry, essentially aligning it with requeue_pi.
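The effect of resetting the pointer at the top of the retry loop can be
modelled in userspace. This is a simplified, single-threaded sketch
under stated assumptions: demo_wait_for_owner() and demo_lock_retry()
are hypothetical, the error values are local constants, and the first
pass is hard-wired to report an exiting owner while the second fails
for an unrelated reason.

```c
#include <stddef.h>

#define DEMO_EBUSY  16
#define DEMO_EAGAIN 11

struct demo_task {
	int refs;
};

/*
 * Simplified model of the consumer: for anything other than -EBUSY a
 * non-NULL pointer is unexpected (returns 1 to model the warning);
 * for -EBUSY the reference taken earlier is dropped.
 */
static int demo_wait_for_owner(int ret, struct demo_task *exiting)
{
	if (ret != -DEMO_EBUSY)
		return exiting != NULL;
	if (exiting)
		exiting->refs--;
	return 0;
}

/* Run two passes of the retry loop and return how many warnings fired. */
int demo_lock_retry(struct demo_task *owner, int reset_on_retry)
{
	struct demo_task *exiting = NULL;
	int warnings = 0;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		int ret;

		if (reset_on_retry)
			exiting = NULL;	/* the reset this change adds */

		if (pass == 0) {
			owner->refs++;	/* takes ref */
			exiting = owner;
			ret = -DEMO_EBUSY;
		} else {
			ret = -DEMO_EAGAIN;
		}
		warnings += demo_wait_for_owner(ret, exiting);
	}
	return warnings;
}
```

Calling demo_lock_retry() with reset_on_retry set to 0 reproduces the
stale pointer reaching the consumer on the second pass; with 1 the
second pass sees NULL.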
Fixes: 3ef240eaff ("futex: Prevent exit livelock")
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net
Boot-time trigger registration can fail before the trigger-data cleanup
kthread exists. Deferring those frees until late init is fine, but the
post-boot fallback must still drain the deferred list if kthread
creation never succeeds.
Otherwise, boot-deferred nodes can accumulate on
trigger_data_free_list, later frees fall back to synchronously freeing
only the current object, and the older queued entries are leaked
forever.
To trigger this, add the following to the kernel command line:
trace_event=sched_switch trace_trigger=sched_switch.traceon,sched_switch.traceon
The second traceon trigger will fail and be freed. This triggers a NULL
pointer dereference and crashes the kernel.
Keep the deferred boot-time behavior, but when kthread creation fails,
drain the whole queued list synchronously. Do the same in the late-init
drain path so queued entries are not stranded there either.
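The fallback can be sketched with a plain singly-linked list. All names
below are hypothetical (demo_free_list is not trigger_data_free_list,
and the worker_available flag stands in for "kthread creation
succeeded"); the sketch only shows the difference between freeing the
current object and draining the whole backlog.

```c
#include <stdlib.h>

struct demo_node {
	struct demo_node *next;
};

/* Entries queued while no cleanup worker exists. */
static struct demo_node *demo_free_list;

void demo_queue(struct demo_node *n)
{
	n->next = demo_free_list;
	demo_free_list = n;
}

/* Free every queued entry and return how many were freed. */
static int demo_drain(void)
{
	int freed = 0;

	while (demo_free_list) {
		struct demo_node *n = demo_free_list;

		demo_free_list = n->next;
		free(n);
		freed++;
	}
	return freed;
}

/*
 * Free one object. With a worker, just queue it. Without one, drain
 * the backlog synchronously as well, not only the current object.
 * Returns the number of objects freed by this call.
 */
int demo_free_deferred(struct demo_node *n, int worker_available)
{
	int freed;

	if (worker_available) {
		demo_queue(n);
		return 0;
	}
	freed = demo_drain();
	free(n);
	return freed + 1;
}
```

If the fallback freed only n, anything already on demo_free_list would
stay there for as long as no worker ever appears.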
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260324221326.1395799-3-atwellwea@gmail.com
Fixes: 61d445af0a ("tracing: Add bulk garbage collection of freeing event_trigger_data")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
This motherboard uses USB audio instead, causing this driver to complain
about "no codecs found!".
Add it to the denylist to silence the warning.
The first attempt only matched on the PCI device, but this caused issues
for some laptops, so DMI match against the board as well.
Signed-off-by: Stuart Hayhurst <stuart.a.hayhurst@gmail.com>
Link: https://patch.msgid.link/20260327155737.21818-2-stuart.a.hayhurst@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Wei Fang says:
====================
net: enetc: add more checks to enetc_set_rxfh()
ENETC only supports the Toeplitz algorithm, and VFs do not support setting
the RSS key, but enetc_set_rxfh() does not check these constraints and
silently accepts unsupported configurations. This may mislead users or
tools into believing that the requested RSS settings have been
successfully applied. So add checks to reject unsupported hash functions
and RSS key updates on VFs, and return "-EOPNOTSUPP" to user space.
====================
Link: https://patch.msgid.link/20260326075233.3628047-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VFs do not have privilege to configure the RSS key because the registers
are owned by the PF. Currently, if a VF attempts to configure the RSS key,
enetc_set_rxfh() simply skips the configuration and does not generate a
warning, which may mislead users into thinking the feature is supported.
To improve this situation, add a check to reject RSS key configuration
on VFs.
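The shape of such a check can be sketched as follows. This is a
hypothetical model, not the driver function: demo_set_rxfh(), the
demo_hfunc enum and the DEMO_EOPNOTSUPP constant are all made up for
illustration, and the hash-function check is included only because the
series description mentions it alongside the VF key check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_EOPNOTSUPP 95

/* Hypothetical hash function selector for this sketch. */
enum demo_hfunc {
	DEMO_HFUNC_NO_CHANGE,
	DEMO_HFUNC_TOEPLITZ,
	DEMO_HFUNC_XOR,
};

/* Reject requests the hardware or the function type cannot honour. */
int demo_set_rxfh(bool is_vf, enum demo_hfunc hfunc, const uint8_t *key)
{
	/* Only Toeplitz (or "leave as is") is accepted. */
	if (hfunc != DEMO_HFUNC_NO_CHANGE && hfunc != DEMO_HFUNC_TOEPLITZ)
		return -DEMO_EOPNOTSUPP;

	/* A VF may not program the key; fail loudly instead of skipping. */
	if (key && is_vf)
		return -DEMO_EOPNOTSUPP;

	return 0;
}
```

Returning an error rather than silently skipping the write lets the
caller see that the requested setting was not applied.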
Fixes: d382563f54 ("enetc: Add RFS and RSS support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Clark Wang <xiaoning.wang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Link: https://patch.msgid.link/20260326075233.3628047-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In commit 8110633db4 ("net: sfp-bus: allow SFP quirks to override
Autoneg and pause bits") we moved the setting of Autoneg and pause bits
before the call to SFP quirk when parsing SFP module support.
Since the quirk for Ubiquiti U-Fiber Instant SFP module zeroes the
support bits and sets 1000baseX_Full only, the above mentioned commit
changed the overall computed support from
1000baseX_Full, Autoneg, Pause, Asym_Pause
to just
1000baseX_Full.
This broke the SFP module for mvneta, which requires Autoneg for
1000baseX since commit c762b7fac1 ("net: mvneta: deny disabling
autoneg for 802.3z modes").
Fix this by setting back the Autoneg, Pause and Asym_Pause bits in the
quirk.
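The ordering problem can be modelled with a small bitmask. The bit
names and functions below are hypothetical and unrelated to the real
link mode bitmap; the sketch only assumes what is stated above, namely
that the Autoneg and pause bits are set before the quirk runs and that
the quirk zeroes the support mask.

```c
#include <stdint.h>

enum {
	DEMO_1000BASEX_FULL = 1u << 0,
	DEMO_AUTONEG        = 1u << 1,
	DEMO_PAUSE          = 1u << 2,
	DEMO_ASYM_PAUSE     = 1u << 3,
};

/* Quirk as described before the fix: drop everything, keep the speed. */
uint32_t demo_quirk_before(uint32_t support)
{
	(void)support;
	return DEMO_1000BASEX_FULL;
}

/* Quirk after the fix: set the Autoneg and pause bits back as well. */
uint32_t demo_quirk_after(uint32_t support)
{
	(void)support;
	return DEMO_1000BASEX_FULL | DEMO_AUTONEG |
	       DEMO_PAUSE | DEMO_ASYM_PAUSE;
}

/* Parsing order after the earlier commit: generic bits first, then quirk. */
uint32_t demo_parse_support(uint32_t (*quirk)(uint32_t))
{
	uint32_t support = DEMO_AUTONEG | DEMO_PAUSE | DEMO_ASYM_PAUSE;

	return quirk(support);
}
```

Because the quirk now runs last, whatever it returns is the final
support mask, so it has to carry every bit the module needs.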
Fixes: 8110633db4 ("net: sfp-bus: allow SFP quirks to override Autoneg and pause bits")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260326122038.2489589-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
follow_pfnmap_start() suffers from two problems:
(1) We are not re-fetching the pmd/pud after taking the PTL
Therefore, we are not properly stabilizing what the lock actually
protects. If there is concurrent zapping, we would indicate to the
caller that we found an entry, however, that entry might already have
been invalidated, or contain a different PFN after taking the lock.
Properly use pmdp_get() / pudp_get() after taking the lock.
(2) pmd_leaf() / pud_leaf() are not well defined on non-present entries
pmd_leaf()/pud_leaf() could wrongly trigger on non-present entries.
There is no real guarantee that pmd_leaf()/pud_leaf() returns something
reasonable on non-present entries. Most architectures indeed either
perform a present check or make it work by smart use of flags.
However, for example loongarch checks the _PAGE_HUGE flag in pmd_leaf(),
and always sets the _PAGE_HUGE flag in __swp_entry_to_pmd(). Whereby
pmd_trans_huge() explicitly checks pmd_present(), pmd_leaf() does not do
that.
Let's check pmd_present()/pud_present() before assuming "there is a present
PMD leaf" when spotting pmd_leaf()/pud_leaf(), like other page table
handling code that traverses user page tables does.
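Point (2) can be illustrated with a toy entry encoding. The bit layout
and helper names are hypothetical and do not match any architecture;
the sketch only assumes, as described above, a leaf test that looks at
a "huge" flag alone and a non-present entry that still carries that
flag.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PRESENT (1u << 0)
#define DEMO_HUGE    (1u << 1)

/* Leaf test that checks only the huge flag, with no present check. */
bool demo_pmd_leaf(uint32_t pmd)
{
	return (pmd & DEMO_HUGE) != 0;
}

bool demo_pmd_present(uint32_t pmd)
{
	return (pmd & DEMO_PRESENT) != 0;
}

/* Check presence first, then ask whether the entry is a leaf. */
bool demo_pmd_present_leaf(uint32_t pmd)
{
	return demo_pmd_present(pmd) && demo_pmd_leaf(pmd);
}
```

For an entry with only DEMO_HUGE set, demo_pmd_leaf() reports a leaf
while demo_pmd_present_leaf() does not, which is the distinction the
added check relies on.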
Given that non-present PMD entries are likely rare in VM_IO|VM_PFNMAP, (1)
is likely more relevant than (2). It is questionable how often (1) would
actually trigger, but let's CC stable to be sure.
This was found by code inspection.
Link: https://lkml.kernel.org/r/20260323-follow_pfnmap_fix-v1-1-5b0ec10872b3@kernel.org
Fixes: 6da8e9634b ("mm: new follow_pfnmap API")
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Multiple sysfs command paths dereference contexts_arr[0] without first
verifying that kdamond->contexts->nr == 1. A user can set nr_contexts to
0 via sysfs while DAMON is running, causing NULL pointer dereferences.
In more detail, the issue can be triggered by privileged users like
below.
First, start DAMON and make contexts directory empty
(kdamond->contexts->nr == 0).
# damo start
# cd /sys/kernel/mm/damon/admin/kdamonds/0
# echo 0 > contexts/nr_contexts
Then, each of below commands will cause the NULL pointer dereference.
# echo update_schemes_stats > state
# echo update_schemes_tried_regions > state
# echo update_schemes_tried_bytes > state
# echo update_schemes_effective_quotas > state
# echo update_tuned_intervals > state
Guard all commands (except OFF) at the entry point of
damon_sysfs_handle_cmd().
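A guard of this kind can be sketched as below. Everything here is a
hypothetical model (demo_handle_cmd(), the demo_cmd enum, the struct
layout, and the choice of an -EINVAL style return are assumptions for
illustration); only the idea of checking nr once at the entry point,
with OFF exempted, comes from the description.

```c
#include <stddef.h>

#define DEMO_EINVAL 22

struct demo_ctx {
	int stat;
};

struct demo_contexts {
	int nr;
	struct demo_ctx **arr;
};

enum demo_cmd {
	DEMO_CMD_OFF,
	DEMO_CMD_UPDATE_STATS,
};

int demo_handle_cmd(enum demo_cmd cmd, struct demo_contexts *contexts)
{
	/* One check up front covers every command that reads arr[0]. */
	if (cmd != DEMO_CMD_OFF && contexts->nr != 1)
		return -DEMO_EINVAL;

	switch (cmd) {
	case DEMO_CMD_OFF:
		return 0;
	case DEMO_CMD_UPDATE_STATS:
		return contexts->arr[0]->stat;	/* safe: nr == 1 here */
	}
	return -DEMO_EINVAL;
}
```

Placing the check at the entry point avoids repeating it in each
command handler and keeps new commands covered by default.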
Link: https://lkml.kernel.org/r/20260321175427.86000-3-sj@kernel.org
Fixes: 0ac32b8aff ("mm/damon/sysfs: support DAMOS stats")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [5.18+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
On an arm64 server, we found that a folio obtained from a migration entry
was not locked in softleaf_to_folio(). This issue triggers when mTHP
splitting races with zap_nonpresent_ptes(), and the root cause is a missing
memory barrier in softleaf_to_folio(). The race is as follows:
CPU0                                                  CPU1
deferred_split_scan()                                 zap_nonpresent_ptes()
  lock folio
  split_folio()
    unmap_folio()
      change ptes to migration entries
    __split_folio_to_order()                            softleaf_to_folio()
      set flags(including PG_locked) for tail pages       folio = pfn_folio(softleaf_to_pfn(entry))
      smp_wmb()                                           VM_WARN_ON_ONCE(!folio_test_locked(folio))
      prep_compound_page() for tail pages
In __split_folio_to_order(), smp_wmb() guarantees page flags of tail pages
are visible before the tail page becomes non-compound. smp_wmb() should
be paired with smp_rmb() in softleaf_to_folio(), which is missing. As a
result, if zap_nonpresent_ptes() accesses a migration entry that stores a
tail pfn, softleaf_to_folio() may see the updated compound_head of the tail
page before page->flags.
This issue will trigger VM_WARN_ON_ONCE() in pfn_swap_entry_folio()
because of the race between folio split and zap_nonpresent_ptes()
leading to a folio incorrectly undergoing modification without a folio
lock being held.
This was a BUG_ON() before commit 93976a2034 ("mm: eliminate further
swapops predicates"), which was merged in v6.19-rc1.
To fix it, add the missing smp_rmb() if the softleaf entry is a migration
entry in softleaf_to_folio() and softleaf_to_page().
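The barrier pairing can be approximated in userspace with C11 fences.
This is an analogy under stated assumptions, not the kernel code: the
release and acquire fences play the roles of smp_wmb() and smp_rmb()
but are not identical to them, and struct demo_page with its two fields
is a hypothetical model of "flags" and "the tail page has become a
standalone folio".

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_page {
	bool locked;		/* plain data, models PG_locked */
	atomic_bool published;	/* models the tail page becoming visible */
};

/* Writer: set the flag, fence, then publish. */
void demo_split_publish(struct demo_page *p)
{
	p->locked = true;
	/* Orders the flag store before the publish store (smp_wmb() role). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->published, true, memory_order_relaxed);
}

/*
 * Reader: returns -1 if the page is not published yet, otherwise 1 or 0
 * depending on whether the locked flag is observed.
 */
int demo_reader_sees_locked(struct demo_page *p)
{
	if (!atomic_load_explicit(&p->published, memory_order_relaxed))
		return -1;
	/* Orders the publish load before the flag load (smp_rmb() role). */
	atomic_thread_fence(memory_order_acquire);
	return p->locked ? 1 : 0;
}
```

Without the reader-side fence, a weakly ordered CPU may satisfy the
flag load before the publish load, which corresponds to the reader
observing the new folio state together with stale flags.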
[tujinjiang@huawei.com: update function name and comments]
Link: https://lkml.kernel.org/r/20260321075214.3305564-1-tujinjiang@huawei.com
Link: https://lkml.kernel.org/r/20260319012541.4158561-1-tujinjiang@huawei.com
Fixes: e9b61f1985 ("thp: reintroduce split_huge_page()")
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a regression test for the divide-by-zero in rtsc_min() triggered
when m2sm() converts a large m1 value (e.g. 32gbit) to a u64 scaled
slope reaching 2^32. rtsc_min() stores the difference of two such u64
values (sm1 - sm2) in a u32 variable `dsm`, truncating 2^32 to zero
and causing a divide-by-zero oops in the concave-curve intersection
path. The test configures an HFSC class with m1=32gbit d=1ms m2=0bit,
sends a packet to activate the class, waits for it to drain and go
idle, then sends another packet to trigger reactivation through
rtsc_min().
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-2-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
m2sm() converts a u32 slope to a u64 scaled value. For large inputs
(e.g. m1=4000000000), the result can reach 2^32. rtsc_min() stores
the difference of two such u64 values in a u32 variable `dsm` and
uses it as a divisor. When the difference is exactly 2^32 the
truncation yields zero, causing a divide-by-zero oops in the
concave-curve intersection path:
Oops: divide error: 0000
RIP: 0010:rtsc_min (net/sched/sch_hfsc.c:601)
Call Trace:
init_ed (net/sched/sch_hfsc.c:629)
hfsc_enqueue (net/sched/sch_hfsc.c:1569)
[...]
Widen `dsm` to u64 and replace do_div() with div64_u64() so the full
difference is preserved.
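The truncation can be reproduced outside the kernel. The sketch below
uses hypothetical demo_* helpers (they are not the sch_hfsc.c code) to
show how a difference of exactly 2^32 becomes zero in a 32-bit divisor
and stays intact in a 64-bit one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: difference kept in 32 bits, like the old u32 dsm. */
static uint32_t demo_dsm_u32(uint64_t sm1, uint64_t sm2)
{
	return (uint32_t)(sm1 - sm2);	/* a difference of 2^32 wraps to 0 */
}

/* Hypothetical helper: difference kept in 64 bits, like the widened dsm. */
static uint64_t demo_dsm_u64(uint64_t sm1, uint64_t sm2)
{
	return sm1 - sm2;
}

/* Divide only when the divisor is non-zero; 0 on success, -1 otherwise. */
static int demo_checked_div(uint64_t num, uint64_t den, uint64_t *quot)
{
	assert(quot != NULL);
	if (den == 0)
		return -1;
	*quot = num / den;
	return 0;
}
```

With sm1 = 2^32 and sm2 = 0 the 32-bit variant yields a zero divisor,
while the 64-bit variant divides normally.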
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260326204310.1549327-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While reviewing recent ext4 patch[1], Sashiko raised the following
concern[2]:
> If the filesystem is initially mounted with the discard option,
> deleting files will populate sbi->s_discard_list and queue
> s_discard_work. If it is then remounted with nodiscard, the
> EXT4_MOUNT_DISCARD flag is cleared, but the pending s_discard_work is
> neither cancelled nor flushed.
[1] https://lore.kernel.org/r/20260319094545.19291-1-qiang.zhang@linux.dev/
[2] https://sashiko.dev/#/patchset/20260319094545.19291-1-qiang.zhang%40linux.dev
The concern was valid, but it had nothing to do with the patch[1].
One of the problems with Sashiko in its current (early) form is that
it will detect pre-existing issues and report them as problems with the
patch that it is reviewing.
In practice, it would be hard to hit deliberately (unless you are a
malicious syzkaller fuzzer), since it would involve mounting the file
system with -o discard, and then deleting a large number of files,
remounting the file system with -o nodiscard, and then immediately
unmounting the file system before the queued discard work has a chance
to drain on its own.
Fix it because it's a real bug, and to keep Sashiko from raising this
concern when analyzing future patches to mballoc.c.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: 55cdd0af2b ("ext4: get discard out of jbd2 commit kthread contex")
Cc: stable@kernel.org
Commit 4865c768b5 ("ext4: always allocate blocks only from groups
inode can use") restricts what blocks will be allocated for indirect
block based files to block numbers that fit within 32-bit block
numbers.
However, when using a review bot running on the latest Gemini LLM to
check this commit when backporting into an LTS based kernel, it raised
this concern:
If ac->ac_g_ex.fe_group is >= ngroups (for instance, if the goal
group was populated via stream allocation from s_mb_last_groups),
then start will be >= ngroups.
Does this allow allocating blocks beyond the 32-bit limit for
indirect block mapped files? The commit message mentions that
ext4_mb_scan_groups_linear() takes care to not select unsupported
groups. However, its loop uses group = *start, and the very first
iteration will call ext4_mb_scan_group() with this unsupported
group because next_linear_group() is only called at the end of the
iteration.
After reviewing the code paths involved and considering the LLM
review, I determined that this can happen when there is a file system
where some files/directories are extent-mapped and others are
indirect-block mapped. To address this, add a safety clamp in
ext4_mb_scan_groups().
Fixes: 4865c768b5 ("ext4: always allocate blocks only from groups inode can use")
Cc: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20260326045834.1175822-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
ext4_split_extent_at() retries after ext4_ext_insert_extent() fails by
refinding the original extent and restoring its length. That recovery is
only safe for transient resource failures such as -ENOSPC, -EDQUOT, and
-ENOMEM.
When ext4_ext_insert_extent() fails because the extent tree is already
corrupted, ext4_find_extent() can return a leaf path without p_ext.
ext4_split_extent_at() then dereferences path[depth].p_ext while trying to
fix up the original extent length, causing a NULL pointer dereference while
handling a pre-existing filesystem corruption.
Do not enter the recovery path for corruption errors, and validate p_ext
after refinding the extent before touching it. This keeps the recovery path
limited to cases it can actually repair and turns the syzbot-triggered crash
into a proper corruption report.
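The policy above can be sketched as a small user-space model. All demo_*
names are hypothetical, the return convention is an assumption made for
illustration, and EUCLEAN is used as the stand-in for -EFSCORRUPTED.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, used here for "fs corrupted" */
#endif

struct demo_extent {
	int len;
};

/* Only these failures leave the tree in a state that can be repaired. */
static bool demo_err_is_transient(int err)
{
	return err == -ENOSPC || err == -EDQUOT || err == -ENOMEM;
}

/*
 * Returns 1 if the original length was restored, 0 if recovery was
 * skipped because the error is not transient, and -EUCLEAN if the
 * refound path carries no extent.
 */
static int demo_try_recover(int insert_err, struct demo_extent *refound,
			    int orig_len)
{
	if (!demo_err_is_transient(insert_err))
		return 0;		/* corruption: leave the tree alone */
	if (refound == NULL)
		return -EUCLEAN;	/* validate before dereferencing */
	refound->len = orig_len;
	return 1;
}
```

The NULL check models the p_ext validation: the restore step runs only
when the refind actually produced an extent.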
Fixes: 716b9c23b8 ("ext4: refactor split and convert extents")
Reported-by: syzbot+1ffa5d865557e51cb604@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1ffa5d865557e51cb604
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: hongao <hongao@uniontech.com>
Link: https://patch.msgid.link/EF77870F23FF9C90+20260324015815.35248-1-hongao@uniontech.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
During code review, Joseph found that ext4_fc_replay_inode() calls
ext4_get_fc_inode_loc() to get the inode location, which holds a
reference to iloc.bh that must be released via brelse().
However, several error paths jump to the 'out' label without
releasing iloc.bh:
- ext4_handle_dirty_metadata() failure
- sync_dirty_buffer() failure
- ext4_mark_inode_used() failure
- ext4_iget() failure
Fix this by introducing an 'out_brelse' label placed just before
the existing 'out' label to ensure iloc.bh is always released.
Additionally, make ext4_fc_replay_inode() propagate errors
properly instead of always returning 0.
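The label layout can be illustrated with a user-space sketch. The
demo_* names and the fail_step parameter are hypothetical; only the
ordering of the 'out_brelse' and 'out' labels mirrors the description
above.

```c
#include <assert.h>
#include <errno.h>

struct demo_bh {
	int refcount;
};

static void demo_get_loc(struct demo_bh *bh)
{
	bh->refcount++;
}

static void demo_brelse(struct demo_bh *bh)
{
	assert(bh->refcount > 0);
	bh->refcount--;
}

/* fail_step: 0 = lookup fails, 1 = dirtying fails, 2 = sync fails. */
static int demo_replay_inode(struct demo_bh *bh, int fail_step)
{
	int ret = 0;

	if (fail_step == 0) {
		ret = -EIO;
		goto out;		/* no reference taken yet */
	}
	demo_get_loc(bh);		/* reference held from here on */

	if (fail_step == 1) {
		ret = -EIO;
		goto out_brelse;
	}
	if (fail_step == 2) {
		ret = -ENOMEM;
		goto out_brelse;
	}

out_brelse:
	demo_brelse(bh);		/* reached on success and on error */
out:
	return ret;			/* the error is propagated, not 0 */
}
```

Because the success path falls through 'out_brelse', the reference is
dropped exactly once on every path that took it.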
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260323060836.3452660-1-libaokun@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Currently there is a race in ext4 when reallocating a freed inode,
resulting in a deadlock:
Task1                                   Task2
ext4_evict_inode()
  handle = ext4_journal_start();
  ...
  if (IS_SYNC(inode))
    handle->h_sync = 1;
  ext4_free_inode()
                                        ext4_new_inode()
                                          handle = ext4_journal_start()
                                          finds the bit in inode bitmap
                                            already clear
                                          insert_inode_locked()
                                            waits for inode to be
                                              removed from the hash.
  ext4_journal_stop(handle)
    jbd2_journal_stop(handle)
      jbd2_log_wait_commit(journal, tid);
        - deadlocks waiting for transaction handle Task2 holds
Fix the problem by removing inode from the hash already in
ext4_clear_inode() by which time all IO for the inode is done so reuse
is already fine but we are still before possibly blocking on transaction
commit.
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Link: https://lore.kernel.org/all/abNvb2PcrKj1FBeC@ly-workstation
Fixes: 88ec797c46 ("fs: make insert_inode_locked() wait for inode destruction")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260320090428.24899-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Commit b98535d091 ("ext4: fix bug_on in start_this_handle during umount
filesystem") moved ext4_unregister_sysfs() before flushing s_sb_upd_work
to prevent new error work from being queued via /proc/fs/ext4/xx/mb_groups
reads during unmount. However, this introduced a use-after-free because
update_super_work calls ext4_notify_error_sysfs() -> sysfs_notify() which
accesses the kobject's kernfs_node after it has been freed by kobject_del()
in ext4_unregister_sysfs():
  update_super_work                ext4_put_super
  -----------------                --------------
                                   ext4_unregister_sysfs(sb)
                                     kobject_del(&sbi->s_kobj)
                                       __kobject_del()
                                         sysfs_remove_dir()
                                           kobj->sd = NULL
                                         sysfs_put(sd)
                                           kernfs_put() // RCU free
  ext4_notify_error_sysfs(sbi)
    sysfs_notify(&sbi->s_kobj)
      kn = kobj->sd // stale pointer
      kernfs_get(kn) // UAF on freed kernfs_node
                                   ext4_journal_destroy()
                                     flush_work(&sbi->s_sb_upd_work)
Instead of reordering the teardown sequence, fix this by making
ext4_notify_error_sysfs() detect that sysfs has already been torn down
by checking s_kobj.state_in_sysfs, and skipping the sysfs_notify() call
in that case. A dedicated mutex (s_error_notify_mutex) serializes
ext4_notify_error_sysfs() against kobject_del() in ext4_unregister_sysfs()
to prevent TOCTOU races where the kobject could be deleted between the
state_in_sysfs check and the sysfs_notify() call.
Fixes: b98535d091 ("ext4: fix bug_on in start_this_handle during umount filesystem")
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260319120336.157873-1-jiayuan.chen@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Yang Yang says:
====================
bridge/vxlan: harden ND option parsing paths
This series hardens ND option parsing in bridge and vxlan paths.
Patch 1 linearizes the request skb in br_nd_send() before walking ND
options. Patch 2 adds explicit ND option length validation in
br_nd_send(). Patch 3 adds matching ND option length validation in
vxlan_na_create().
====================
Link: https://patch.msgid.link/20260326034441.2037420-1-n05ec@lzu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch targets two internal state machine invariants in checkpoint.c
residing inside functions that natively return integer error codes.
- In jbd2_cleanup_journal_tail(): A blocknr of 0 indicates a severely
corrupted journal superblock. Replaced the J_ASSERT with a WARN_ON_ONCE
and a graceful journal abort, returning -EFSCORRUPTED.
- In jbd2_log_do_checkpoint(): Replaced the J_ASSERT_BH checking for
an unexpected buffer_jwrite state. If the warning triggers, we
explicitly drop the just-taken get_bh() reference and call __flush_batch()
to safely clean up any previously queued buffers in the j_chkpt_bhs array,
preventing a memory leak before returning -EFSCORRUPTED.
Signed-off-by: Milos Nikic <nikic.milos@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Baokun Li <libaokun@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260311041548.159424-1-nikic.milos@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
On the mkdir/mknod path, when mapping logical blocks to physical blocks,
if inserting a new extent into the extent tree fails (in this example,
because the file system disabled the huge file feature when marking the
inode as dirty), ext4_ext_map_blocks() only calls ext4_free_blocks() to
reclaim the physical block without deleting the corresponding data in
the extent tree. This causes subsequent mkdir operations to reference
the previously reclaimed physical block number again, even though this
physical block is already being used by the xattr block. Therefore, a
situation arises where both the directory and xattr are using the same
buffer head block in memory simultaneously.
The above causes ext4_xattr_block_set() to enter an infinite loop about
"inserted" and cannot release the inode lock, ultimately leading to the
143s blocking problem mentioned in [1].
If the metadata is corrupted, then trying to remove some extent space
can do even more harm. Also, in case EXT4_GET_BLOCKS_DELALLOC_RESERVE
was passed, removing space wrongly updates quota information.
Jan Kara suggests distinguishing between two cases:
1) The error is ENOSPC or EDQUOT - in this case the filesystem is fully
consistent and we must maintain its consistency including all the
accounting. However these errors can happen only early before we've
inserted the extent into the extent tree. So current code works correctly
for this case.
2) Some other error - this means metadata is corrupted. We should strive to
do as few modifications as possible to limit damage. So I'd just skip
freeing of allocated blocks.
[1]
INFO: task syz.0.17:5995 blocked for more than 143 seconds.
Call Trace:
inode_lock_nested include/linux/fs.h:1073 [inline]
__start_dirop fs/namei.c:2923 [inline]
start_dirop fs/namei.c:2934 [inline]
Reported-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=1659aaaaa8d9d11265d7
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Reported-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=512459401510e2a9a39f
Tested-by: syzbot+1659aaaaa8d9d11265d7@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: syzbot+512459401510e2a9a39f@syzkaller.appspotmail.com
Link: https://patch.msgid.link/tencent_43696283A68450B761D76866C6F360E36705@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The commit aa373cf550 ("writeback: stop background/kupdate works from
livelocking other works") introduced an issue where unmounting a filesystem
in a multi-logical-partition scenario could lead to batch file data loss.
This problem was not fixed until the commit d92109891f ("fs/writeback:
bail out if there is no more inodes for IO and queued once"). It took
considerable time to identify the root cause. Additionally, in actual
production environments, we frequently encountered file data loss after
normal system reboots. Therefore, we are adding a check in the inode
release flow to verify whether all dirty pages have been flushed to disk,
in order to determine whether the data loss is caused by a logic issue in
the filesystem code.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260303012242.3206465-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There's an issue as follows:
...
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 206 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2243 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): Delayed block allocation failed for inode 2239 at logical offset 0 with max blocks 1 with error 117
EXT4-fs (mmcblk0p1): This should not happen!! Data will be lost
EXT4-fs (mmcblk0p1): error count since last fsck: 1
EXT4-fs (mmcblk0p1): initial error at time 1765597433: ext4_mb_generate_buddy:760
EXT4-fs (mmcblk0p1): last error at time 1765597433: ext4_mb_generate_buddy:760
...
According to the log analysis, blocks are always requested from the
corrupted block group. This may happen as follows:
ext4_mb_find_by_goal
  ext4_mb_load_buddy
    ext4_mb_load_buddy_gfp
      ext4_mb_init_cache
        ext4_read_block_bitmap_nowait
        ext4_wait_block_bitmap
          ext4_validate_block_bitmap
            if (!grp || EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
              return -EFSCORRUPTED; // There's no logs.
  if (err)
    return err; // Will return error
  ext4_lock_group(ac->ac_sb, group);
  if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))) // Unreachable
    goto out;
After commit 9008a58e5d ("ext4: make the bitmap read routines return
real error codes") was merged, commit 163a203ddb ("ext4: mark block group
as corrupt on block bitmap error") is no longer a real solution for
avoiding allocation from corrupted block groups. This is because if
'EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)' is true, then
'ext4_mb_load_buddy()' may return an error, so the block allocation
fails.
Therefore, check whether the block group is corrupted when
ext4_mb_load_buddy() returns an error.
Fixes: 163a203ddb ("ext4: mark block group as corrupt on block bitmap error")
Fixes: 9008a58e5d ("ext4: make the bitmap read routines return real error codes")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260302134619.3145520-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Commit 82f80e2e3b ("ext4: add extent status cache support to kunit tests")
added ext4_es_register_shrinker() to the extents_kunit_init() function but
failed to add the matching unregister call to extents_kunit_exit().
This could cause the following percpu_counters list corruption bug.
ok 1 split unwrit extent to 2 extents and convert 1st half writ
slab kmalloc-4k start c0000002007ff000 pointer offset 1448 size 4096
list_add corruption. next->prev should be prev (c000000004bc9e60), but was 0000000000000000. (next=c0000002007ff5a8).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:29!
cpu 0x2: Vector: 700 (Program Check) at [c000000241927a30]
pc: c000000000f26ed0: __list_add_valid_or_report+0x120/0x164
lr: c000000000f26ecc: __list_add_valid_or_report+0x11c/0x164
sp: c000000241927cd0
msr: 800000000282b033
current = 0xc000000241215200
paca = 0xc0000003fffff300 irqmask: 0x03 irq_happened: 0x09
pid = 258, comm = kunit_try_catch
kernel BUG at lib/list_debug.c:29!
enter ? for help
__percpu_counter_init_many+0x148/0x184
ext4_es_register_shrinker+0x74/0x23c
extents_kunit_init+0x100/0x308
kunit_try_run_case+0x78/0x1f8
kunit_generic_run_threadfn_adapter+0x40/0x70
kthread+0x190/0x1a0
start_kernel_thread+0x14/0x18
2:mon>
This happens because:
extents_kunit_init(test N):
  ext4_es_register_shrinker(sbi)
    percpu_counters_init() x 4; // this adds 4 list nodes to global percpu_counters list
      list_add(&fbc->list, &percpu_counters);
    shrinker_register();
extents_kunit_exit(test N):
  kfree(sbi); // frees sbi w/o removing those 4 list nodes.
              // So those list nodes now become dangling pointers
extents_kunit_init(test N+1):
  kzalloc_obj(ext4_sb_info) // allocator returns same page, but zeroed.
  ext4_es_register_shrinker(sbi)
    percpu_counters_init()
      list_add(&fbc->list, &percpu_counters);
        __list_add_valid(new, prev, next);
          next->prev != prev // list corruption bug detected, since next->prev = NULL
Fixes: 82f80e2e3b ("ext4: add extent status cache support to kunit tests")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/5bb9041471dab8ce870c191c19cbe4df57473be8.1772381213.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Replace BUG_ON() with proper error handling when inline data size
exceeds PAGE_SIZE. This prevents kernel panic and allows the system to
continue running while properly reporting the filesystem corruption.
The error is logged via ext4_error_inode(), the buffer head is released
to prevent memory leak, and -EFSCORRUPTED is returned to indicate
filesystem corruption.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Link: https://patch.msgid.link/20260223123345.14838-2-ytohnuki@amazon.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
When inode metadata is changed, we sometimes just call
ext4_mark_inode_dirty() to track modified metadata. This copies inode
metadata into block buffer which is enough when we are journalling
metadata. However when we are running in nojournal mode we currently
fail to write the dirtied inode buffer during fsync(2) because the inode
is not marked as dirty. Use explicit ext4_write_inode() call to make
sure the inode table buffer is written to the disk. This is a band-aid
solution, but a proper solution requires a much larger rewrite including
changes in the metadata bh tracking infrastructure.
Reported-by: Free Ekanayaka <free.ekanayaka@gmail.com>
Link: https://lore.kernel.org/all/87il8nhxdm.fsf@x1.mail-host-address-is-not-set/
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
recently_deleted() checks whether inode has been used in the near past.
However this can give false positive result when inode table is not
initialized yet and we are in fact comparing to random garbage (or stale
itable block of a filesystem before mkfs). Ultimately this results in
uninitialized inodes being skipped during inode allocation and possibly
they are never initialized and thus e2fsck complains. Verify if the
inode has been initialized before checking for dtime.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20260216164848.3074-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Dimitri Daskalakis says:
====================
Fix page fragment handling when PAGE_SIZE > 4K
FBNIC operates on fixed size descriptors (4K). When the OS supports pages
larger than 4K, we fragment the page across multiple descriptors.
While performance testing, I found several issues with our page fragment
handling, resulting in low throughput and potential RX stalls.
====================
Link: https://patch.msgid.link/20260324195123.3486219-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix an issue arising when ext4 features has_journal, ea_inode, and encrypt
are activated simultaneously, leading to ENOSPC when creating an encrypted
file.
Fix by passing XATTR_CREATE flag to xattr_set_handle function if a handle
is specified, i.e., when the function is called in the control flow of
creating a new inode. This aligns the number of jbd2 credits set_handle
checks for with the number allocated for creating a new inode.
ext4_set_context must not be called with a non-null handle (fs_data)
unless the fscrypt context xattr is guaranteed not to exist yet. The only other
usage of this function currently is when handling the ioctl
FS_IOC_SET_ENCRYPTION_POLICY, which calls it with fs_data=NULL.
Fixes: c1a5d5f6ab ("ext4: improve journal credit handling in set xattr paths")
Co-developed-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Anthony Durrer <anthonydev@fastmail.com>
Signed-off-by: Simon Weber <simon.weber.39@gmail.com>
Reviewed-by: Eric Biggers <ebiggers@kernel.org>
Link: https://patch.msgid.link/20260207100148.724275-4-simon.weber.39@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
FBNIC supports fixed size buffers of 4K. When PAGE_SIZE > 4K, we
fragment the page across multiple descriptors (FBNIC_BD_FRAG_COUNT).
When refilling the BDQ, the correct number of entries are populated,
but tail was only incremented by one. So on a system with 64K pages,
HW would get one descriptor refilled for every 16 we populate.
Additionally, we program the ring size in the HW when enabling the BDQ.
This was not accounting for page fragments, so on systems with 64K pages,
the HW used 1/16th of the ring.
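The arithmetic behind both fixes can be sketched as follows. The
demo_* helpers and DEMO_BUF_SIZE are hypothetical names for
illustration and are not the fbnic driver code; the ring size is
assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BUF_SIZE 4096u	/* fixed 4K buffer per descriptor */

/* Number of 4K descriptors one page is fragmented across. */
static uint32_t demo_frag_count(uint32_t page_size)
{
	assert(page_size >= DEMO_BUF_SIZE && page_size % DEMO_BUF_SIZE == 0);
	return page_size / DEMO_BUF_SIZE;
}

/* Advance the tail by a whole page worth of descriptors, with wrap. */
static uint32_t demo_advance_tail(uint32_t tail, uint32_t page_size,
				  uint32_t ring_descs)
{
	return (tail + demo_frag_count(page_size)) & (ring_descs - 1);
}

/* Descriptor count to program for a ring that holds `pages` pages. */
static uint32_t demo_ring_descs(uint32_t pages, uint32_t page_size)
{
	return pages * demo_frag_count(page_size);
}
```

With 64K pages each page spans 16 descriptors, so advancing the tail
by one per page, or sizing the ring in pages, leaves 15 of every 16
descriptors unused.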
Fixes: 0cb4c0a137 ("eth: fbnic: Implement Rx queue alloc/start/stop/free")
Signed-off-by: Dimitri Daskalakis <daskald@meta.com>
Link: https://patch.msgid.link/20260324195123.3486219-2-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add a check in ext4_setattr() to convert files from inline data storage
to extent-based storage when truncate() grows the file size beyond the
inline capacity. This prevents the filesystem from entering an
inconsistent state where the inline data flag is set but the file size
exceeds what can be stored inline.
Without this fix, the following sequence causes a kernel BUG_ON():
1. Mount filesystem with inode that has inline flag set and small size
2. truncate(file, 50MB) - grows size but inline flag remains set
3. sendfile() attempts to write data
4. ext4_write_inline_data() hits BUG_ON(write_size > inline_capacity)
The crash occurs because ext4_write_inline_data() expects inline storage
to accommodate the write, but the actual inline capacity (~60 bytes for
i_block + ~96 bytes for xattrs) is far smaller than the file size and
write request.
The fix checks if the new size from setattr exceeds the inode's actual
inline capacity (EXT4_I(inode)->i_inline_size) and converts the file to
extent-based storage before proceeding with the size change.
This addresses the root cause by ensuring the inline data flag and file
size remain consistent during truncate operations.
Reported-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7de5fe447862fc37576f
Tested-by: syzbot+7de5fe447862fc37576f@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260207043607.1175976-1-kartikey406@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There are cases where ext4_bio_write_page() gets called for a page which
has no buffers to submit. This happens e.g. when the part of the file is
actually a hole, when we cannot allocate blocks due to being called from
jbd2, or in data=journal mode when checkpointing writes the buffers
earlier. In these cases we just return from ext4_bio_write_page()
however if the page didn't need redirtying, we will leave stale DIRTY
and/or TOWRITE tags in xarray because those get cleared only in
__folio_start_writeback(). As a result we can leave these tags set in
mappings even after a final sync on filesystem that's getting remounted
read-only or that's being frozen. Various assertions can then get upset
when writeback is started on such filesystems (Gerald reported assertion
in ext4_journal_check_start() firing).
Fix the problem by cycling the page through writeback state even if we
decide nothing needs to be written for it so that xarray tags get
properly updated. This is slightly silly (we could update the xarray
tags directly) but I don't think a special helper messing with xarray
tags is really worth it in this relatively rare corner case.
Reported-by: Gerald Yang <gerald.yang@canonical.com>
Link: https://lore.kernel.org/all/20260128074515.2028982-1-gerald.yang@canonical.com
Fixes: dff4ac75ee ("ext4: move keep_towrite handling to ext4_bio_write_page()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260205092223.21287-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Oskar Kjos reported the following problem.
ip4ip6_err() calls icmp_send() on a cloned skb whose cb[] was written
by the IPv6 receive path as struct inet6_skb_parm. icmp_send() passes
IPCB(skb2) to __ip_options_echo(), which interprets that cb[] region
as struct inet_skb_parm (IPv4). The layouts differ: inet6_skb_parm.nhoff
at offset 14 overlaps inet_skb_parm.opt.rr, producing a non-zero rr
value. __ip_options_echo() then reads optlen from attacker-controlled
packet data at sptr[rr+1] and copies that many bytes into dopt->__data,
a fixed 40-byte stack buffer (IP_OPTIONS_DATA_FIXED_SIZE).
To fix this we clear skb2->cb[], as suggested by Oskar Kjos.
Also add minimal IPv4 header validation (version == 4, ihl >= 5).
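As a rough user-space illustration of the two steps (header sanity
check, then clearing the control block), consider the sketch below.
struct demo_skb and the demo_* helpers are hypothetical and only model
the idea; the 48-byte cb size mirrors sizeof(skb->cb) and is restated
here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CB_SIZE 48	/* assumed to match sizeof(skb->cb) */

struct demo_skb {
	uint8_t cb[DEMO_CB_SIZE];	/* per-layer scratch area */
	const uint8_t *data;		/* start of the inner IPv4 header */
	size_t len;
};

/* Minimal IPv4 header validation: version == 4 and ihl >= 5. */
static int demo_ipv4_hdr_ok(const uint8_t *hdr, size_t len)
{
	if (len < 20)
		return 0;
	return (hdr[0] >> 4) == 4 && (hdr[0] & 0x0f) >= 5;
}

/* Returns 0 when the packet may be handed to the IPv4 ICMP path. */
static int demo_prepare_for_icmp(struct demo_skb *skb)
{
	if (!demo_ipv4_hdr_ok(skb->data, skb->len))
		return -1;
	/* Drop stale IPv6 state so it is not read as IPv4 options. */
	memset(skb->cb, 0, sizeof(skb->cb));
	return 0;
}
```

After the memset, the byte at offset 14 that previously held IPv6
state reads as zero, so it can no longer be mistaken for a non-zero
record-route offset.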
Fixes: c4d3efafcc ("[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.")
Reported-by: Oskar Kjos <oskar.kjos@hotmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260326155138.2429480-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sashiko AI-review observed:
In ip6_err_gen_icmpv6_unreach(), the skb is an outer IPv4 ICMP error packet
where its cb contains an IPv4 inet_skb_parm. When skb is cloned into skb2
and passed to icmp6_send(), it uses IP6CB(skb2).
IP6CB interprets the IPv4 inet_skb_parm as an inet6_skb_parm. The cipso
offset in inet_skb_parm.opt directly overlaps with dsthao in inet6_skb_parm
at offset 18.
If an attacker sends a forged ICMPv4 error with a CIPSO IP option, dsthao
would be a non-zero offset. Inside icmp6_send(), mip6_addr_swap() is called
and uses ipv6_find_tlv(skb, opt->dsthao, IPV6_TLV_HAO).
This would scan the inner, attacker-controlled IPv6 packet starting at that
offset, potentially returning a fake TLV without checking if the remaining
packet length can hold the full 18-byte struct ipv6_destopt_hao.
Could mip6_addr_swap() then perform a 16-byte swap that extends past the end
of the packet data into skb_shared_info?
Should the cb array also be cleared in ip6_err_gen_icmpv6_unreach() and
ip6ip6_err() to prevent this?
This patch implements the first suggestion.
I am not sure if ip6ip6_err() needs to be changed.
A separate patch would be better anyway.
Fixes: ca15a078bd ("sit: generate icmpv6 error when receiving icmpv4 error")
Reported-by: Ido Schimmel <idosch@nvidia.com>
Closes: https://sashiko.dev/#/patchset/20260326155138.2429480-1-edumazet%40google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Oskar Kjos <oskar.kjos@hotmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260326202608.2976021-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit '5f920d5d6083 ("ext4: verify fast symlink length")' causes the
generic/475 test to fail during orphan cleanup of zero-length symlinks.
generic/475 84s ... _check_generic_filesystem: filesystem on /dev/vde is inconsistent
The fsck reports are provided below:
Deleted inode 9686 has zero dtime.
Deleted inode 158230 has zero dtime.
...
Inode bitmap differences: -9686 -158230
Orphan file (inode 12) block 13 is not clean.
Failed to initialize orphan file.
In ext4_symlink(), a newly created symlink can be added to the orphan
list due to ENOSPC. Its data has not been initialized, and its size is
zero. Therefore, we need to disregard the length check of the symbolic
link when cleaning up orphan inodes. Instead, we should ensure that the
nlink count is zero.
Fixes: 5f920d5d60 ("ext4: verify fast symlink length")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260131091156.1733648-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Pull hwmon fixes from Guenter Roeck:
- PMBus driver fixes:
- Add mutex protection for regulator operations
- Fix reading from "write-only" attributes
- Mark lowest/average/highest/rated attributes as read-only
- isl68137: Add mutex protection for AVS enable sysfs attributes
- ina233: Fix error handling and sign extension when reading shunt voltage
- adm1177: Fix sysfs ABI violation and current unit conversion
- peci: Fix off-by-one in cputemp_is_visible(), and crit_hyst returning
delta instead of absolute temperature
* tag 'hwmon-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (pmbus/core) Protect regulator operations with mutex
hwmon: (pmbus) Introduce the concept of "write-only" attributes
hwmon: (pmbus) Mark lowest/average/highest/rated attributes as read-only
hwmon: (adm1177) fix sysfs ABI violation and current unit conversion
hwmon: (peci/cputemp) Fix off-by-one in cputemp_is_visible()
hwmon: (peci/cputemp) Fix crit_hyst returning delta instead of absolute temperature
hwmon: (pmbus/isl68137) Add mutex protection for AVS enable sysfs attributes
hwmon: (pmbus/ina233) Fix error handling and sign extension in shunt voltage read
Pull SCSI fixes from James Bottomley:
"Driver (and enclosure) only fixes. Most are obvious. The big change is
in the tcm_loop driver to add command draining to error handling (the
lack of which was causing hangs with the potential for double use
crashes)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: file: Use kzalloc_flex for aio_cmd
scsi: scsi_transport_sas: Fix the maximum channel scanning issue
scsi: target: tcm_loop: Drain commands in target_reset handler
scsi: ibmvfc: Fix OOB access in ibmvfc_discover_targets_done()
scsi: ses: Handle positive SCSI error from ses_recv_diag()
Pull drm fixes from Dave Airlie:
"Weekly fixes, still a bit busy, but the usual suspects amdgpu and
i915/xe have a bunch of small fixes, and otherwise it's just a few
minor driver fixes.
loongson:
- update MAINTAINERS
shmem:
- fault handler fix
syncobj:
- fix GFP flags
amdgpu:
- DSC fix
- Module parameter parsing fix
- PASID reuse fix
- drm_edid leak fix
- SMU 13.x fixes
- SMU 14.x fix
- Fence fix in amdgpu_amdkfd_submit_ib()
- LVDS fixes
- GPU page fault fix for non-4K pages
amdkfd:
- Ordering fix in kfd_ioctl_create_process()
i915/display:
- DP tunnel error handling fix
- Spurious GMBUS timeout fix
- Unlink NV12 planes earlier
- Order OP vs. timeout correctly in __wait_for()
xe:
- Fix UAF in SRIOV migration restore
- Updates to HW W/a
- VMBind remap fix
ivpu:
- poweroff fix
mediatek:
- fix register ordering"
* tag 'drm-fixes-2026-03-28-1' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
MAINTAINERS: Update GPU driver maintainer information
drm/xe: always keep track of remap prev/next
drm/syncobj: Fix xa_alloc allocation flags
drm/amd/display: Fix DCE LVDS handling
drm/amdgpu: Handle GPU page faults correctly on non-4K page systems
drm/amd/pm: disable OD_FAN_CURVE if temp or pwm range invalid for smu v14
drm/amdkfd: Fix NULL pointer check order in kfd_ioctl_create_process
drm/amd/display: check if ext_caps is valid in BL setup
drm/amdgpu: Fix fence put before wait in amdgpu_amdkfd_submit_ib
drm/xe: Implement recent spec updates to Wa_16025250150
accel/ivpu: Add disable clock relinquish workaround for NVL-A0
drm/i915/dp_tunnel: Fix error handling when clearing stream BW in atomic state
drm/amd/pm: disable OD_FAN_CURVE if temp or pwm range invalid for smu v13
drm/amd/pm: Return -EOPNOTSUPP for unsupported OD_MCLK on smu_v13_0_6
drm/amd/pm: Skip redundant UCLK restore in smu_v13_0_6
drm/amd/display: Fix drm_edid leak in amdgpu_dm
drm/amdgpu: prevent immediate PASID reuse case
drm/amdgpu: fix strsep() corrupting lockup_timeout on multi-GPU (v3)
drm/amd/display: Do not skip unrelated mode changes in DSC validation
drm/xe/pf: Fix use-after-free in migration restore
...
Before commit f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP"),
all entry handlers loaded r12 with the current task pointer
(lg %r12,__LC_CURRENT) for use by the BPENTER/BPEXIT macros. That
commit removed TIF_ISOLATE_BP, dropping both the branch prediction
macros and the r12 load, but did not add r12 to the register clearing
sequence.
Add the missing xgr %r12,%r12 to make the register scrub consistent
across all entry points.
Fixes: f33f2d4c7c ("s390/bp: remove TIF_ISOLATE_BP")
Cc: stable@kernel.org
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pull spi fixes from Mark Brown:
"There are two core fixes here. One is from Johan dealing with an issue
introduced by a devm_ API usage update causing things to be freed
earlier than they were previously when we fail to register a device,
another from Danilo avoids unlocked access to data by converting to
use a driver core API.
We also have a few relatively minor driver specific fixes"
* tag 'spi-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-fsl-lpspi: fix teardown order issue (UAF)
spi: fix use-after-free on managed registration failure
spi: use generic driver_override infrastructure
spi: meson-spicc: Fix double-put in remove path
spi: sn-f-ospi: Use devm_mutex_init() to simplify code
spi: sn-f-ospi: Fix resource leak in f_ospi_probe()
Pull regulator fix from Mark Brown:
"A fix from Alice for the rust bindings, they didn't handle the stub
implementation of the C API used when CONFIG_REGULATOR is disabled
leading to undefined behaviour"
* tag 'regulator-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
rust: regulator: do not assume that regulator_get() returns non-null
Pull regmap fix from Mark Brown:
"A fix from Andy Shevchenko for an issue with caching of page selector
registers which are located inside the page they are switching"
* tag 'regmap-fix-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: Synchronize cache for the page selector
Pull tsm fix from Dan Williams:
- Fix a VMM controlled buffer length used to emit TDX attestation
reports
* tag 'tsm-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm:
virt: tdx-guest: Fix handling of host controlled 'quote' buffer length
Pull VFIO fix from Alex Williamson:
- Fix double-free and reference count underflow if dma-buf file
allocation fails (Alex Williamson)
* tag 'vfio-v7.0-rc6' of https://github.com/awilliam/linux-vfio:
vfio/pci: Fix double free in dma-buf feature
Pull EFI fix from Ard Biesheuvel:
"Fix a potential buffer overrun issue introduced by the previous fix
for EFI boot services region reservations on x86"
* tag 'efi-fixes-for-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
x86/efi: efi_unmap_boot_services: fix calculation of ranges_to_free size
Pull LoongArch fixes from Huacai Chen:
"Fix missing NULL checks for kstrdup(), work around the LS2K/LS7A GPU
DMA hang bug, emit GNU_EH_FRAME for vDSO correctly, and fix some
KVM-related bugs"
* tag 'loongarch-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix base address calculation in kvm_eiointc_regs_access()
LoongArch: KVM: Handle the case that EIOINTC's coremap is empty
LoongArch: KVM: Make kvm_get_vcpu_by_cpuid() more robust
LoongArch: vDSO: Emit GNU_EH_FRAME correctly
LoongArch: Workaround LS2K/LS7A GPU DMA hang bug
LoongArch: Fix missing NULL checks for kstrdup()
Pull io_uring fixes from Jens Axboe:
"Just two small fixes, both fixing regressions added in the fdinfo code
in 6.19 with the SQE mixed size support"
* tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
io_uring/fdinfo: fix SQE_MIXED SQE displaying
Adjust KVM's sanity check against overwriting a shadow-present SPTE with
another SPTE with a different target PFN to only apply to direct MMUs,
i.e. only to MMUs without shadowed gPTEs. While it's impossible for KVM
to overwrite a shadow-present SPTE in response to a guest write, writes
from outside the scope of KVM, e.g. from host userspace, aren't detected
by KVM's write tracking and so can break KVM's shadow paging rules.
------------[ cut here ]------------
pfn != spte_to_pfn(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:3069 at mmu_set_spte+0x1e4/0x440 [kvm], CPU#0: vmx_ept_stale_r/872
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 872 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mmu_set_spte+0x1e4/0x440 [kvm]
Call Trace:
<TASK>
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 11d4517511 ("KVM: x86/mmu: Warn if PFN changes on shadow-present SPTE in shadow MMU")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
When installing an emulated MMIO SPTE, do so *after* dropping/zapping the
existing SPTE (if it's shadow-present). While commit a54aa15c6b was
right about it being impossible to convert a shadow-present SPTE to an
MMIO SPTE due to a _guest_ write, it failed to account for writes to guest
memory that are outside the scope of KVM.
E.g. if host userspace modifies a shadowed gPTE to switch from a memslot
to emulated MMIO and then the guest hits a relevant page fault, KVM will
install the MMIO SPTE without first zapping the shadow-present SPTE.
------------[ cut here ]------------
is_shadow_present_pte(*sptep)
WARNING: arch/x86/kvm/mmu/mmu.c:484 at mark_mmio_spte+0xb2/0xc0 [kvm], CPU#0: vmx_ept_stale_r/4292
Modules linked in: kvm_intel kvm irqbypass
CPU: 0 UID: 1000 PID: 4292 Comm: vmx_ept_stale_r Not tainted 7.0.0-rc2-eafebd2d2ab0-sink-vm #319 PREEMPT
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:mark_mmio_spte+0xb2/0xc0 [kvm]
Call Trace:
<TASK>
mmu_set_spte+0x237/0x440 [kvm]
ept_page_fault+0x535/0x7f0 [kvm]
kvm_mmu_do_page_fault+0xee/0x1f0 [kvm]
kvm_mmu_page_fault+0x8d/0x620 [kvm]
vmx_handle_exit+0x18c/0x5a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0xc55/0x1c20 [kvm]
kvm_vcpu_ioctl+0x2d5/0x980 [kvm]
__x64_sys_ioctl+0x8a/0xd0
do_syscall_64+0xb5/0x730
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x47fa3f
</TASK>
---[ end trace 0000000000000000 ]---
Reported-by: Alexander Bulekov <bkov@amazon.com>
Debugged-by: Alexander Bulekov <bkov@amazon.com>
Suggested-by: Fred Griffoul <fgriffo@amazon.co.uk>
Fixes: a54aa15c6b ("KVM: x86/mmu: Handle MMIO SPTEs directly in mmu_set_spte()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: s390: More memory management fixes
Lots of small and not-so-small fixes for the newly rewritten gmap,
mostly affecting the handling of nested guests.
Since the ChaCha permutation is invertible, the local variable
'permuted_state' is sufficient to compute the original 'state', and thus
the key, even after the permutation has been done.
While the kernel is quite inconsistent about zeroizing secrets on the
stack (and some prominent userspace crypto libraries don't bother at all
since it's not guaranteed to work anyway), the kernel does try to do it
as a best practice, especially in cases involving the RNG.
Thus, explicitly zeroize 'permuted_state' before it goes out of scope.
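The invertibility argument can be illustrated on the ChaCha quarter-round, the building block of the permutation. The following is a minimal userspace sketch, not the kernel implementation; the function names are hypothetical. Each step of the forward round (add, xor, rotate) has an exact inverse (subtract, xor, rotate the other way), so applying the inverse steps in reverse order recovers the input words.

```c
#include <stdint.h>

static uint32_t demo_rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

static uint32_t demo_rotr32(uint32_t v, int n)
{
	return (v >> n) | (v << (32 - n));
}

/* Forward ChaCha quarter-round on four 32-bit words (a, b, c, d). */
static void demo_quarter_round(uint32_t x[4])
{
	x[0] += x[1]; x[3] ^= x[0]; x[3] = demo_rotl32(x[3], 16);
	x[2] += x[3]; x[1] ^= x[2]; x[1] = demo_rotl32(x[1], 12);
	x[0] += x[1]; x[3] ^= x[0]; x[3] = demo_rotl32(x[3], 8);
	x[2] += x[3]; x[1] ^= x[2]; x[1] = demo_rotl32(x[1], 7);
}

/* Inverse quarter-round: undo the four steps above in reverse order. */
static void demo_quarter_round_inverse(uint32_t x[4])
{
	x[1] = demo_rotr32(x[1], 7);  x[1] ^= x[2]; x[2] -= x[3];
	x[3] = demo_rotr32(x[3], 8);  x[3] ^= x[0]; x[0] -= x[1];
	x[1] = demo_rotr32(x[1], 12); x[1] ^= x[2]; x[2] -= x[3];
	x[3] = demo_rotr32(x[3], 16); x[3] ^= x[0]; x[0] -= x[1];
}
```

Since the full permutation is a fixed sequence of such rounds, anyone holding the permuted words can walk the sequence backwards, which is why leaving them on the stack is equivalent to leaving the original state there.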
Fixes: c08d0e6473 ("crypto: chacha20 - Add a generic ChaCha20 stream cipher implementation")
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20260326032920.39408-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Pull rdma fixes from Jason Gunthorpe:
- Quite a few irdma bug fixes, several user triggerable
- Fix a 0 SMAC header in ionic
- Tolerate FW errors for RAAS in bng_re
- Don't UAF in efa when printing error events
- Better handle pool exhaustion in the new bvec paths
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/irdma: Harden depth calculation functions
RDMA/irdma: Return EINVAL for invalid arp index error
RDMA/irdma: Fix deadlock during netdev reset with active connections
RDMA/irdma: Remove reset check from irdma_modify_qp_to_err()
RDMA/irdma: Clean up unnecessary dereference of event->cm_node
RDMA/irdma: Remove a NOP wait_event() in irdma_modify_qp_roce()
RDMA/irdma: Update ibqp state to error if QP is already in error state
RDMA/irdma: Initialize free_qp completion before using it
RDMA/efa: Fix possible deadlock
RDMA/rw: Fix MR pool exhaustion in bvec RDMA READ path
RDMA/rw: Fall back to direct SGE on MR pool exhaustion
RDMA/efa: Fix use of completion ctx after free
RDMA/bng_re: Fix silent failure in HWRM version query
RDMA/ionic: Preserve and set Ethernet source MAC after ib_ud_header_init()
RDMA/irdma: Fix double free related to rereg_user_mr
Pull pci fixes from Bjorn Helgaas:
- Remove power-off from pwrctrl drivers since this is now done directly
by the PCI controller drivers (Chen-Yu Tsai)
- Fix pwrctrl device node leak (Felix Gu)
- Document a TLP header decoder for AER log messages (Lukas Wunner)
* tag 'pci-v7.0-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Documentation: PCI: Document PCIe TLP Header decoder for AER messages
PCI/pwrctrl: Fix pci_pwrctrl_is_required() device node leak
PCI/pwrctrl: Do not power off on pwrctrl device removal
Pull sound fixes from Takashi Iwai:
"This became slightly big partly due to my time off in the last week.
But all changes are about device-specific fixes, so it should be
safely applicable.
ASoC:
- Fix double free in sma1307
- Fix uninitialized variables in simple-card-utils/imx-card
- Address clock leaks and error propagation in ADAU1372
- Add DMI quirks and ACP/SDW support for ASUS
- Fix Intel CATPT DMA mask
- Fix SOF topology parsing
- Fix DT bindings for RK3576 SPDIF, STM32 SAI and WCD934x
HD-audio:
- Quirks for Lenovo, ASUS, and various HP models, as well as
a speaker pop fix on Star Labs StarFighter
- Revert MSI X870E Tomahawk denylist again
USB-Audio:
- Fix distorted audio on Focusrite Scarlett 2i2/2i4 1st Gen
- Add iface reset quirk for AB17X
- Update Qualcomm USB audio Kconfig dependencies and license
Misc:
- Fix minor compile warnings for firewire and asihpi drivers"
* tag 'sound-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (35 commits)
Revert "ALSA: hda/intel: Add MSI X870E Tomahawk to denylist"
ALSA: usb-audio: Add iface reset and delay quirk for AB17X USB Audio
ALSA: hda/realtek: add HP Laptop 15-fd0xxx mute LED quirk
ALSA: usb-audio: Exclude Scarlett 2i4 1st Gen from SKIP_IFACE_SETUP
ALSA: hda/realtek: Add mute LED quirk for HP Pavilion 15-eg0xxx
ALSA: hda/realtek - Fixed Speaker Mute LED for HP EliteBoard G1a platform
ASoC: SOF: ipc4-topology: Allow bytes controls without initial payload
ASoC: adau1372: Fix clock leak on PLL lock failure
ASoC: adau1372: Fix unchecked clk_prepare_enable() return value
ASoC: SDCA: fix finding wrong entity
ASoC: SDCA: remove the max count of initialization table
ASoC: codecs: wcd934x: fix typo in dt parsing
ASoC: dt-bindings: stm32: Fix incorrect compatible string in stm32h7-sai match
ASoC: Intel: catpt: Fix the device initialization
ASoC: amd: acp: add ASUS HN7306EA quirk for legacy SDW machine
ASoC: SOF: topology: reject invalid vendor array size in token parser
ASoC: tas2781: Add null check for calibration data
ALSA: asihpi: avoid write overflow check warning
ASoC: fsl: imx-card: initialize playback_only and capture_only
ASoC: simple-card-utils: Check value of is_playback_only and is_capture_only
...
Pull media fixes from Mauro Carvalho Chehab:
- uvcvideo may cause OOPS when out of memory
- remove a deadlock in the ccs driver
* tag 'media/v7.0-6' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: ccs: Avoid deadlock in ccs_init_state()
media: uvcvideo: Fix bug in error path of uvc_alloc_urb_buffers
Pull sysctl fix from Joel Granados:
"Fix uninitialized variable error when writing to a sysctl bitmap
Removed the possibility of returning an unjustified -EINVAL when
writing to a sysctl bitmap"
* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: fix uninitialized variable in proc_do_large_bitmap
Pull xfs fixes from Carlos Maiolino:
"This includes a few important bug fixes, and some code refactoring
that was necessary for one of the fixes"
* tag 'xfs-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: remove file_path tracepoint data
xfs: don't irele after failing to iget in xfs_attri_recover_work
xfs: remove redundant validation in xlog_recover_attri_commit_pass2
xfs: fix ri_total validation in xlog_recover_attri_commit_pass2
xfs: close crash window in attr dabtree inactivation
xfs: factor out xfs_attr3_leaf_init
xfs: factor out xfs_attr3_node_entry_remove
xfs: only assert new size for datafork during truncate extents
xfs: annotate struct xfs_attr_list_context with __counted_by_ptr
xfs: cleanup buftarg handling in XFS_IOC_VERIFY_MEDIA
xfs: scrub: unlock dquot before early return in quota scrub
xfs: refactor xfsaild_push loop into helper
xfs: save ailp before dropping the AIL lock in push callbacks
xfs: avoid dereferencing log items after push callbacks
xfs: stop reclaim before pushing AIL during unmount
Pull smb server fixes from Steve French:
- Fix out of bounds write
- Fix for better calculating max output buffers
- Fix memory leaks in SMB2/SMB3 lock
- Fix use after free
- Multichannel fix
* tag 'v7.0-rc5-ksmbd-srv-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix potencial OOB in get_file_all_info() for compound requests
ksmbd: replace hardcoded hdr2_len with offsetof() in smb2_calc_max_out_buf_len()
ksmbd: fix memory leaks and NULL deref in smb2_lock()
ksmbd: fix use-after-free and NULL deref in smb_grant_oplock()
ksmbd: do not expire session on binding failure
udf_setsize() can race with udf_writepages() as follows:
udf_setsize()                              udf_writepages()
                                             if (iinfo->i_alloc_type ==
                                                 ICBTAG_FLAG_AD_IN_ICB)
  err = udf_expand_file_adinicb(inode);
  err = udf_extend_file(inode, newsize);
                                               udf_adinicb_writepages()
                                                 memcpy_from_file_folio() - crash
                                                 because inode size is too big.
Fix the problem by checking the file type under folio lock in
udf_handle_page_wb() handler called from __mpage_writepages() which
properly serializes with udf_expand_file_adinicb().
Reported-by: Jianzhou Zhao <luckd0g@163.com>
Link: https://lore.kernel.org/all/f622c01.67ac.19cdbdd777d.Coremail.luckd0g@163.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326140635.15895-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Some filesystems need to treat some folios specially (for example for
inodes with inline data). Doing the handling in their .writepages method
in a race-free manner results in duplicating some of the writeback
internals. So provide a generalized version of mpage_writepages() that
allows a filesystem to provide a handler called for each folio which can
handle the folio in a special way.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260326140635.15895-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
A resume-probe race condition was identified in kernel v7.0 with commit
38fa29b01a ("i2c: designware: Combine the init functions"), but the
issue has existed from the beginning and simply went undetected.
The amdisp i2c device requires ISP to be in power-on state for probe
to succeed. To meet this requirement, this device is added to genpd
to control ISP power using runtime PM. The pm_runtime_get_sync() called
before i2c_dw_probe() triggers PM resume, which powers on ISP and also
invokes the amdisp i2c runtime resume before the probe completes, resulting
in this race condition and a NULL dereference in v7.0.
Fix this race condition by using the genpd APIs directly during probe:
- Call dev_pm_genpd_resume() to Power ON ISP before probe
- Call dev_pm_genpd_suspend() to Power OFF ISP after probe
- Set the device to suspended state with pm_runtime_set_suspended()
- Enable runtime PM only after the device is fully initialized
Fixes: d6263c468a ("i2c: amd-isp: Add ISP i2c-designware driver")
Co-developed-by: Bin Du <bin.du@amd.com>
Signed-off-by: Bin Du <bin.du@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Cc: <stable@vger.kernel.org> # v6.16+
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260320201302.3490570-1-pratap.nirujogi@amd.com
When reading from the I2DR register, right after releasing the bus by
clearing MSTA and MTX, the I2C controller might still generate an
additional clock cycle which can cause devices to misbehave. Ensure that
I2DR is only read after the bus is no longer busy. Because this
requires polling, the read of the last byte is moved outside of the
interrupt handler.
An example for such a failing transfer is this:
i2ctransfer -y -a 0 w1@0x00 0x02 r1
Error: Sending messages failed: Connection timed out
It does not happen with every device because not all devices react to
the additional clock cycle.
Fixes: 5f5c2d4579 ("i2c: imx: prevent rescheduling in non dma mode")
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260218150940.131354-3-eichest@gmail.com
When reading multiple messages, meaning a repeated start is required,
polling the bus busy bit must be avoided. This must only be done for
the last message. Otherwise, the driver will time out.
Here is an example of such a sequence that fails with an error:
i2ctransfer -y -a 0 w1@0x00 0x02 r1 w1@0x00 0x02 r1
Error: Sending messages failed: Connection timed out
Fixes: 5f5c2d4579 ("i2c: imx: prevent rescheduling in non dma mode")
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@toradex.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260218150940.131354-2-eichest@gmail.com
emac_dispatch_skb_zc() allocates a new skb via napi_alloc_skb() but
never copies the packet data from the XDP buffer into it. The skb is
passed up the stack containing uninitialized heap memory instead of
the actual received packet, leaking kernel heap contents to userspace.
Copy the received packet data from the XDP buffer into the skb using
skb_copy_to_linear_data().
Additionally, remove the skb_mark_for_recycle() call since the skb is
backed by the NAPI page frag allocator, not page_pool. Marking a
non-page_pool skb for recycle causes the free path to return pages to
a page_pool that does not own them, corrupting page_pool state.
The non-ZC path (emac_rx_packet) does not have these issues because it
uses napi_build_skb() to wrap the existing page_pool page directly,
requiring no copy, and correctly marks for recycle since the page comes
from page_pool_dev_alloc_pages().
Fixes: 7a64bb388d ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the driver signals carrier up via netif_carrier_on() its internal
link_up state isn't updated immediately. This leads to inconsistent
speed/duplex in /proc/net/bonding/bondX where the speed and duplex
is shown as unknown while ethtool shows correct values. Fix this by
using netif_carrier_ok() for link checking in get_ksettings function.
Fixes: 84421b99ce ("tg3: Update link_up flag for phylib devices")
Signed-off-by: Thomas Bogendoerfer <tbogendoerfer@suse.de>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ioam6_fill_trace_data() stores the schema contribution to the trace
length in a u8. With bit 22 enabled and the largest schema payload,
sclen becomes 1 + 1020 / 4, wraps from 256 to 0, and bypasses the
remaining-space check. __ioam6_fill_trace_data() then positions the
write cursor without reserving the schema area but still copies the
4-byte schema header and the full schema payload, overrunning the trace
buffer.
Keep sclen in an unsigned int so the remaining-space check and the write
cursor calculation both see the full schema length.
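The truncation can be reproduced in isolation. The following is a minimal userspace sketch, not the kernel code: the helper names are hypothetical, and the only values taken from the description above are the 1020-byte schema payload and the 1 + len / 4 formula (lengths counted in 4-octet units).

```c
#include <stdint.h>

/* Hypothetical helper: schema contribution stored in a u8, as in the
 * buggy path described above. 1 + 1020 / 4 = 256 does not fit. */
static uint8_t sclen_u8(unsigned int schema_len)
{
	return (uint8_t)(1 + schema_len / 4);
}

/* Same computation kept in an unsigned int, as the fix does. */
static unsigned int sclen_uint(unsigned int schema_len)
{
	return 1 + schema_len / 4;
}

/* Hypothetical remaining-space check: returns 1 if the schema "fits". */
static int schema_fits(unsigned int remaining_units, unsigned int sclen)
{
	return sclen <= remaining_units;
}
```

With a 1020-byte schema, sclen_u8() yields 0, so a check such as schema_fits(10, 0) passes even though 256 units are about to be written; sclen_uint() yields 256 and the same check fails as intended.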
Fixes: 8c6f6fa677 ("ipv6: ioam: IOAM Generic Netlink API")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 7d6899fb69 ("ovl: fsync after metadata copy-up") was done to
fix durability of overlayfs copy up on an upper filesystem which does
not enforce ordering on storing of metadata changes (e.g. ubifs).
In an earlier revision of the regressing commit by Lei Lv, the metadata
fsync behavior was opt-in via a new "fsync=strict" mount option.
We were hoping that the opt-in mount option could be avoided, so the
change was only made to depend on metacopy=off, in the hope of not
hurting performance of metadata heavy workloads, which are more likely
to be using metacopy=on.
This hope was proven wrong by a performance regression report from Google
COS workload after upgrade to kernel 6.12.
This is an adaptation of Lei's original "fsync=strict" mount option
to the existing upstream code.
The new mount option is mutually exclusive with the "volatile" mount
option, so the latter is now an alias to the "fsync=volatile" mount
option.
Reported-by: Chenglong Tang <chenglongtang@google.com>
Closes: https://lore.kernel.org/linux-unionfs/CAOdxtTadAFH01Vui1FvWfcmQ8jH1O45owTzUcpYbNvBxnLeM7Q@mail.gmail.com/
Link: https://lore.kernel.org/linux-unionfs/CAOQ4uxgKC1SgjMWre=fUb00v8rxtd6sQi-S+dxR8oDzAuiGu8g@mail.gmail.com/
Fixes: 7d6899fb69 ("ovl: fsync after metadata copy-up")
Depends: 50e638beb6 ("ovl: Use str_on_off() helper in ovl_show_options()")
Cc: stable@vger.kernel.org # v6.12+
Signed-off-by: Fei Lv <feilv@asrmicro.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Setting up the interface fails on this card when suspending/resuming.
Adding a reset and delay quirk eliminates this problem.
usb 1-1: new full-speed USB device number 2 using xhci-hcd
usb 1-1: New USB device found, idVendor=001f, idProduct=0b23
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-1: Product: AB17X USB Audio
usb 1-1: Manufacturer: Generic
usb 1-1: SerialNumber: 20241228172028
Signed-off-by: Lianqin Hu <hulianqin@vivo.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/PUZPR06MB6224CA59AD2B26054120B276D249A@PUZPR06MB6224.apcprd06.prod.outlook.com
The HP Pavilion 15-eg0xxx with subsystem ID 0x103c87cb uses a Realtek
ALC287 codec with a mute LED wired to GPIO pin 4 (mask 0x10). The
existing ALC287_FIXUP_HP_GPIO_LED fixup already handles this correctly,
but the subsystem ID was missing from the quirk table.
GPIO pin confirmed via manual hda-verb testing:
hda-verb SET_GPIO_MASK 0x10
hda-verb SET_GPIO_DIRECTION 0x10
hda-verb SET_GPIO_DATA 0x10
Signed-off-by: César Montoya <sprit152009@gmail.com>
Link: https://patch.msgid.link/20260321153603.12771-1-sprit152009@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
On the HP EliteBoard G1a platform (models without a headphone jack),
the speaker mute LED failed to function. The sysfs ctl-led info showed
empty values because the standard LED registration couldn't correctly
bind to the master switch.
Adding this patch will fix and enable the speaker mute LED feature.
Tested-by: Chris Chiu <chris.chiu@canonical.com>
Signed-off-by: Kailang Yang <kailang@realtek.com>
Link: https://lore.kernel.org/279e929e884849df84687dbd67f20037@realtek.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
ASoC: Fixes for v7.0
This is two weeks' worth of fixes and quirks so it's a bit larger than
you might expect, there's nothing too exciting individually and nothing
in core code.
linedisp_release() currently retrieves the enclosing struct linedisp via
to_linedisp(). That lookup depends on the attachment list, but the
attachment may already have been removed before put_device() invokes the
release callback. This can happen in linedisp_unregister(), and can also
be reached from some linedisp_register() error paths.
In that case, to_linedisp() returns NULL and linedisp_release()
dereferences it while freeing the display resources.
The struct device released here is the embedded linedisp->dev used by
linedisp_register(), so retrieve the enclosing object directly with
container_of() instead.
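The difference between the two lookups can be sketched in userspace. This is an illustration only: the demo_* types and the macro below are hypothetical stand-ins, not the auxdisplay structures. Because the device is embedded in the enclosing object, the enclosing pointer is recoverable by pointer arithmetic alone, with no dependency on any list membership.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_device {
	void (*release)(struct demo_device *dev);
};

struct demo_linedisp {
	int id;
	struct demo_device dev;	/* embedded, like linedisp->dev */
	int released;
};

/* Release callback: recover the enclosing object directly from the
 * embedded member instead of looking it up in a separate list. */
static void demo_linedisp_release(struct demo_device *dev)
{
	struct demo_linedisp *ld =
		demo_container_of(dev, struct demo_linedisp, dev);

	ld->released = 1;
}
```

The callback works regardless of whether the object is still registered anywhere, which is the property the release path needs.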
Fixes: 66c9380948 ("auxdisplay: linedisp: encapsulate container_of usage within to_linedisp")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
After enabling CONFIG_GCOV_KERNEL and CONFIG_GCOV_PROFILE_ALL, the following
build failure is observed under GCC 14.2.1:
In function 'amdv1pt_install_leaf_entry',
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:650:3,
inlined from '__map_single_page0' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1,
inlined from 'pt_descend' at drivers/iommu/generic_pt/fmt/../pt_iter.h:391:9,
inlined from '__do_map_single_page' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:657:10,
inlined from '__map_single_page1.constprop' at drivers/iommu/generic_pt/fmt/../iommu_pt.h:661:1:
././include/linux/compiler_types.h:706:45: error: call to '__compiletime_assert_71' declared with attribute error: FIELD_PREP: value too large for the field
706 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
|
......
drivers/iommu/generic_pt/fmt/amdv1.h:220:26: note: in expansion of macro 'FIELD_PREP'
220 | FIELD_PREP(AMDV1PT_FMT_OA,
| ^~~~~~~~~~
In the path '__do_map_single_page()', level 0 always invokes
'pt_install_leaf_entry(&pts, map->oa, PAGE_SHIFT, …)'. At runtime that
lands in the 'if (oasz_lg2 == isz_lg2)' arm of 'amdv1pt_install_leaf_entry()';
the contiguous-only 'else' block is unreachable for 4 KiB pages.
With CONFIG_GCOV_KERNEL + CONFIG_GCOV_PROFILE_ALL, the extra
instrumentation changes GCC's inlining so that the "dead" 'else' branch
still gets instantiated. The compiler constant-folds the contiguous OA
expression, runs the 'FIELD_PREP()' compile-time check, and produces:
FIELD_PREP: value too large for the field
gcov-enabled builds therefore fail even though the code path never executes.
Fix this by marking amdv1pt_install_leaf_entry as __always_inline.
Fixes: dcd6a011a8 ("iommupt: Add map_pages op")
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
unmap has the odd behavior that it can unmap more than requested if the
ending point lands within the middle of a large or contiguous IOPTE.
In this case the gather should flush everything unmapped which can be
larger than what was requested to be unmapped. The gather was only
flushing the range requested to be unmapped, not extending to the extra
range, resulting in a short invalidation if the caller hits this special
condition.
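The range arithmetic can be sketched as follows. This is a simplified userspace model under stated assumptions, not the iommupt code: the helper name is hypothetical, and the entry that contains the end of the request is assumed to be naturally aligned with a power-of-two size.

```c
#include <stdint.h>

/* Inclusive last address actually unmapped when the requested range
 * [iova, iova + len) ends inside an entry of entry_size bytes.
 * Assumes entry_size is a power of two and the entry is naturally
 * aligned (an assumption of this sketch). */
static uint64_t demo_unmapped_last(uint64_t iova, uint64_t len,
				   uint64_t entry_size)
{
	uint64_t req_last = iova + len - 1;

	/* Round up to the end of the entry containing req_last. */
	return req_last | (entry_size - 1);
}
```

For a request covering 0x1ff000..0x200fff whose tail lands in a 2 MiB entry at 0x200000, the unmapped range extends to 0x3fffff, so a flush that stops at 0x200fff is short; with a 4 KiB entry the two ends coincide.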
This was found by the new invalidation/gather test I am adding in
preparation for ARMv8. Claude deduced the root cause.
As far as I remember nothing relies on unmapping a large entry, so this is
likely not a triggerable bug.
Cc: stable@vger.kernel.org
Fixes: 7c53f4238a ("iommupt: Add unmap_pages op")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
An empty gather is coded with start=U64_MAX, end=0 and several drivers go
on to convert that to a size with:
end - start + 1
Which gives 2 for an empty gather. This then causes Weird Stuff to
happen (for example an UBSAN splat in VT-d) that is hopefully harmless,
but maybe not.
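The arithmetic can be checked with a small standalone sketch. struct gather_range and both helpers are hypothetical stand-ins, not the kernel's types; the emptiness test shown is one possible way to detect the condition, not necessarily the check added to iommu_iotlb_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the start/end fields of a gather. */
struct gather_range {
	uint64_t start;	/* first address to invalidate */
	uint64_t end;	/* last address to invalidate, inclusive */
};

/* An untouched gather: start=U64_MAX, end=0. */
static const struct gather_range EMPTY_GATHER = { UINT64_MAX, 0 };

/* The size computation several drivers perform. */
static uint64_t gather_size(const struct gather_range *g)
{
	assert(g != NULL);
	return g->end - g->start + 1;	/* unsigned, wraps modulo 2^64 */
}

/* One possible emptiness test: no valid range has start > end. */
static bool gather_is_empty(const struct gather_range *g)
{
	assert(g != NULL);
	return g->start > g->end;
}
```

For the empty gather, 0 - U64_MAX wraps to 1 and the final + 1 yields 2, so a driver would invalidate two bytes at address U64_MAX instead of nothing.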
Prevent drivers from being called with an empty gather by checking for it
right in iommu_iotlb_sync().
Auditing shows that the AMD, Intel, Mediatek and RISC-V drivers all do things
on these empty gathers.
Further, there are several callers that can trigger empty gathers,
especially in unusual conditions. For example iommu_map_nosync() will call
a 0 size unmap on some error paths. Also in VFIO, iommupt and other
places.
Cc: stable@vger.kernel.org
Reported-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Closes: https://lore.kernel.org/r/11145826.aFP6jjVeTY@jkrzyszt-mobl2.ger.corp.intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When processing Router Advertisements with user options the kernel
builds an RTM_NEWNDUSEROPT netlink message. The nduseroptmsg struct
has three padding fields that are never zeroed and can leak kernel data.
The fix is simple: zero the padding fields.
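A minimal sketch of the pattern follows. struct opt_hdr is an illustrative header with made-up field names, only loosely modelled on a message with three explicit padding fields; it is not the kernel's nduseroptmsg definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative header with explicit padding fields (names are made up). */
struct opt_hdr {
	uint8_t  family;
	uint8_t  pad1;
	uint16_t opts_len;
	int32_t  ifindex;
	uint8_t  icmp_type;
	uint8_t  icmp_code;
	uint16_t pad2;
	uint32_t pad3;
};

/*
 * Fill a header that lives in a recycled buffer. Every field is written,
 * including the padding, so no stale bytes reach the receiver.
 */
static void opt_hdr_fill(struct opt_hdr *h, uint8_t family, uint16_t opts_len,
			 int32_t ifindex, uint8_t type, uint8_t code)
{
	assert(h != NULL);
	h->family = family;
	h->pad1 = 0;
	h->opts_len = opts_len;
	h->ifindex = ifindex;
	h->icmp_type = type;
	h->icmp_code = code;
	h->pad2 = 0;
	h->pad3 = 0;
}
```

Clearing the whole struct with memset() before filling it would be an equally valid alternative; the explicit assignments mirror the "zero the padding fields" wording.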
Fixes: 31910575a9 ("[IPv6]: Export userland ND options through netlink (RDNSS support)")
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324224925.2437775-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Wei Fang says:
====================
net: enetc: safely reinitialize TX BD ring when it has unsent frames
Currently the driver does not reset the producer index register (PIR) and
consumer index register (CIR) when initializing a TX BD ring. The driver
only reads the PIR and CIR and initializes the software indexes. If the
TX BD ring is reinitialized when it still contains unsent frames, its PIR
and CIR will not be equal after the reinitialization. However, the BDs
between CIR and PIR have been freed and become invalid and this can lead
to a hardware malfunction, leaving the TX BD ring unable to work properly.
Since the PIR and CIR are software-configurable on ENETC v4, the driver
must reset them if they are not equal when reinitializing the TX BD ring.
However, resetting the PIR and CIR alone is insufficient; it cannot
completely solve the problem. When a link-down event occurs while the TX
BD ring is transmitting frames, subsequent reinitialization of the TX BD
ring may cause it to malfunction, because enetc4_pl_mac_link_down() only
clears PMa_COMMAND_CONFIG[TX_EN] to disable the MAC transmit data path. It
doesn't set PORT[TXDIS] to 1 to flush the TX BD ring. Therefore, it is
not safe to reinitialize the TX BD ring at this point.
To safely reinitialize the TX BD ring after a link-down event, a proper
Ethernet MAC graceful stop is necessary, as confirmed with the NETC IP team.
Therefore, add the Ethernet MAC graceful stop to the link-down event
handler enetc4_pl_mac_link_down(). Note that this patch set is not
applicable to ENETC v1 (LS1028A).
====================
Link: https://patch.msgid.link/20260324062121.2745033-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For ENETC v4, the PIR and CIR will be reset if they are not equal when
reinitializing the TX BD ring. However, resetting the PIR and CIR alone
is insufficient. When a link-down event occurs while the TX BD ring is
transmitting frames, subsequent reinitialization of the TX BD ring may
cause it to malfunction. For example, the below steps can reproduce the
problem.
1. Unplug the cable when the TX BD ring is busy transmitting frames.
2. Disable the network interface (ifconfig eth0 down).
3. Re-enable the network interface (ifconfig eth0 up).
4. Plug in the cable; the TX BD ring may then fail to transmit packets.
When the link-down event occurs, enetc4_pl_mac_link_down() only clears
PMa_COMMAND_CONFIG[TX_EN] to disable the MAC transmit data path. It doesn't
set PORT[TXDIS] to 1 to flush the TX BD ring. Therefore, reinitializing
the TX BD ring at this point is unsafe. To safely reinitialize the TX BD
ring after a link-down event, a proper Ethernet MAC graceful stop is
necessary, as confirmed with the NETC IP team. Therefore, add the Ethernet MAC
graceful stop to the link-down event handler enetc4_pl_mac_link_down().
Fixes: 99100d0d99 ("net: enetc: add preliminary support for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324062121.2745033-3-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the driver does not reset the producer index register (PIR) and
consumer index register (CIR) when initializing a TX BD ring. The driver
only reads the PIR and CIR and initializes the software indexes. If the
TX BD ring is reinitialized when it still contains unsent frames, its PIR
and CIR will not be equal after the reinitialization. However, the BDs
between CIR and PIR have been freed and become invalid and this can lead
to a hardware malfunction, leaving the TX BD ring unable to work properly.
ENETC v4 allows software to set the PIR and CIR, so the driver can reset
these two registers if they are not equal when reinitializing the TX BD
ring. Therefore, add this solution for ENETC v4. Note that this patch does
not work for ENETC v1 because it does not allow software to set the PIR
and CIR.
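The reinitialization rule can be modelled in a few lines. This is a toy model under stated assumptions, not the enetc driver: plain struct fields stand in for the hardware registers, the field and function names are illustrative, and the reset value of 0 is an assumption, since the text does not say which value the registers are reset to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a TX BD ring; the fields stand in for hardware registers. */
struct tx_ring_model {
	uint32_t pir;		/* producer index register */
	uint32_t cir;		/* consumer index register */
	uint32_t next_to_use;	/* software producer index */
	uint32_t next_to_clean;	/* software consumer index */
	int idx_writable;	/* nonzero if software may set PIR/CIR (v4) */
};

static void tx_ring_reinit(struct tx_ring_model *r)
{
	assert(r != NULL);

	/* BDs between CIR and PIR are stale: bring both indexes together. */
	if (r->idx_writable && r->pir != r->cir) {
		r->pir = 0;	/* assumption: reset value is 0 */
		r->cir = 0;
	}

	/* As before, the software indexes mirror the register contents. */
	r->next_to_use = r->pir;
	r->next_to_clean = r->cir;
}
```

When idx_writable is zero (the ENETC v1 case in this model) the registers are left alone, which is why the fix does not apply there.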
Fixes: 99100d0d99 ("net: enetc: add preliminary support for i.MX95 ENETC PF")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324062121.2745033-2-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the PPS channel configuration was implemented, the channel
index for the periodic outputs was configured as the hardware
channel number.
The sysfs interface uses a logical channel index, and rejects numbers
greater than `n_per_out` (see period_store() in ptp_sysfs.c).
That property was left at 1, since the driver implements channel
selection, not simultaneous operation of multiple PTP hardware timer
channels.
A second check in fec_ptp_enable() returns -EOPNOTSUPP when the two
channel numbers disagree, making channels 1..3 unusable from sysfs.
Fix by removing this redundant check in the FEC PTP driver.
Fixes: 566c2d8388 ("net: fec: make PPS channel configurable")
Signed-off-by: Buday Csaba <buday.csaba@prolan.hu>
Link: https://patch.msgid.link/8ec2afe88423c2231f9cf8044d212ce57846670e.1774359059.git.buday.csaba@prolan.hu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__skb_ext_put() is not declared if SKB_EXTENSIONS is not enabled, which
causes a build error:
drivers/net/netdevsim/netdev.c: In function 'nsim_forward_skb':
drivers/net/netdevsim/netdev.c:114:25: error: implicit declaration of function '__skb_ext_put'; did you mean 'skb_ext_put'? [-Werror=implicit-function-declaration]
114 | __skb_ext_put(psp_ext);
| ^~~~~~~~~~~~~
| skb_ext_put
cc1: some warnings being treated as errors
Add a stub to fix the build.
Fixes: 7d9351435e ("netdevsim: drop PSP ext ref on forward failure")
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Link: https://patch.msgid.link/20260324140857.783-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte
SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second
half of the SQE would be past the end of the sq_sqes array. The current
check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented
when a 128-byte SQE is encountered, not on every iteration. The actual
array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask
(the last slot) while the wrap check passes.
Fix by checking sq_idx directly. Keep the sq_head increment so the loop
still skips the second half of the 128-byte SQE on the next iteration.
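The difference between the two checks can be seen in isolation. The helpers below are hypothetical and only reproduce the index arithmetic quoted above; comparing the slot against sq_mask is an assumption about what "checking sq_idx directly" looks like, since the patch itself is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Slot actually read on iteration i: sq_idx = (i + sq_head) & sq_mask. */
static unsigned int sq_slot(unsigned int i, unsigned int sq_head,
			    unsigned int sq_mask)
{
	assert(((sq_mask + 1) & sq_mask) == 0);	/* ring size is a power of two */
	return (i + sq_head) & sq_mask;
}

/* Old test: looks only at sq_head and ignores the loop counter i. */
static bool wraps_old(unsigned int sq_head, unsigned int sq_mask)
{
	return ((sq_head + 1) & sq_mask) == 0;
}

/* Test on the slot itself: the second half would lie past the last slot. */
static bool wraps_by_idx(unsigned int i, unsigned int sq_head,
			 unsigned int sq_mask)
{
	return sq_slot(i, sq_head, sq_mask) == sq_mask;
}
```

With an 8-entry ring (sq_mask = 7), i = 7 and sq_head = 0, the slot is the last one, yet the old test sees (1 & 7) != 0 and lets the read of the second half proceed.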
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the gmac0 is disabled, the precheck for a valid ingress device will
cause a NULL pointer deref and crash the system. This happens because
eth->netdev[0] will be NULL but the code will directly try to access
netdev_ops.
Instead of checking only the first net_device, check whether any of the
mtk_eth net_devices matches the netdev_ops of the ingress device.
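A sketch of such a lookup is given below. The types are simplified stand-ins (struct ops, struct dev, struct eth) and MAX_DEVS is an arbitrary illustrative bound; none of them are the driver's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 3	/* illustrative bound, not taken from the driver */

struct ops { int unused; };
struct dev { const struct ops *ops; };
struct eth { struct dev *netdev[MAX_DEVS]; };

/* True if any populated slot uses the same ops table as the ingress device. */
static bool eth_owns_dev(const struct eth *eth, const struct dev *ingress)
{
	assert(eth != NULL && ingress != NULL);

	for (size_t i = 0; i < MAX_DEVS; i++) {
		const struct dev *d = eth->netdev[i];

		if (d == NULL)		/* disabled MAC: nothing to compare */
			continue;
		if (d->ops == ingress->ops)
			return true;
	}
	return false;
}
```

Skipping NULL slots is what avoids the crash: a disabled first MAC no longer gets dereferenced, and a device behind any other MAC is still recognised.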
Cc: stable@vger.kernel.org
Fixes: 73cfd947db ("net: ethernet: mtk_eth_soc: ppe: prevent ppe update for non-mtk devices")
Signed-off-by: Sven Eckelmann (Plasma Cloud) <se@simonwunderlich.de>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260324-wed-crash-gmac0-disabled-v1-1-3bc388aee565@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MANA passes rxq->alloc_size to napi_build_skb() for all RX buffers.
It is correct for fragment-backed RX buffers, where alloc_size matches
the actual backing allocation used for each packet buffer. However, in
the non-fragment RX path mana allocates a full page, or a higher-order
page, per RX buffer. In that case alloc_size only reflects the usable
packet area and not the actual backing memory.
This causes napi_build_skb() to underestimate the skb backing allocation
in the single-buffer RX path, so skb->truesize is derived from a value
smaller than the real RX buffer allocation.
Fix this by updating alloc_size in the non-fragment RX path to the
actual backing allocation size before it is passed to napi_build_skb().
Fixes: 730ff06d3f ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://patch.msgid.link/acLUhLpLum6qrD/N@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The RCU-protected codepaths (mpls_forward, mpls_dump_routes) can have
an inconsistent view of platform_labels vs platform_label in case of a
concurrent resize (resize_platform_label_table, under
platform_mutex). This can lead to OOB accesses.
This patch adds a seqcount, so that we get a consistent snapshot.
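The idea behind a sequence counter can be sketched in userspace C11. This is not the kernel's seqcount API: all names are hypothetical, the two fields merely stand in for platform_labels and platform_label, writers are assumed to be serialized externally (as platform_mutex does), and every access uses the default sequentially consistent atomics for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct label_table {
	atomic_uint seq;	/* odd while a resize is in progress */
	atomic_size_t nlabels;	/* stands in for platform_labels */
	_Atomic(int *) labels;	/* stands in for platform_label */
};

struct snapshot {
	size_t nlabels;
	int *labels;
};

/* Writer side: bump the counter before and after changing the pair. */
static void resize_begin(struct label_table *t) { atomic_fetch_add(&t->seq, 1); }
static void resize_end(struct label_table *t)   { atomic_fetch_add(&t->seq, 1); }

/* Reader side: returns 0 with a consistent pair, or -1 if it must retry. */
static int try_snapshot(struct label_table *t, struct snapshot *out)
{
	unsigned int s;

	assert(t != NULL && out != NULL);
	s = atomic_load(&t->seq);
	if (s & 1)
		return -1;		/* resize in progress */
	out->nlabels = atomic_load(&t->nlabels);
	out->labels = atomic_load(&t->labels);
	return atomic_load(&t->seq) == s ? 0 : -1;
}
```

A reader loops on try_snapshot() until it returns 0, so it never pairs a new size with an old array or the other way round.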
Note that mpls_label_ok is also susceptible to this, so the check
against RTA_DST in rtm_to_route_config, done outside platform_mutex,
is not sufficient. This value gets passed to mpls_label_ok once more
in both mpls_route_add and mpls_route_del, so there is no issue, but
that additional check must not be removed.
Reported-by: Yuan Tan <tanyuan98@outlook.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Fixes: 7720c01f3f ("mpls: Add a sysctl to control the size of the mpls label table")
Fixes: dde1b38e87 ("mpls: Convert mpls_dump_routes() to RCU.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/cd8fca15e3eb7e212b094064cd83652e20fd9d31.1774284088.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Johannes Berg says:
====================
Couple more fixes:
- virt_wifi: remove SET_NETDEV_DEV to avoid UAF on teardown
- iwlwifi:
- fix (some) devices that don't have 6 GHz (WiFi6E)
- fix potential OOB read of firmware notification
- set WiFi generation for firmware to avoid packet drops
- fix multi-link scan timing
- wilc1000: fix integer overflow
- ath11k/ath12k: fix TID during A-MPDU session teardown
- wl1251: don't trust firmware TX status response index
* tag 'wireless-2026-03-26' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: virt_wifi: remove SET_NETDEV_DEV to avoid use-after-free
wifi: iwlwifi: mvm: fix potential out-of-bounds read in iwl_mvm_nd_match_info_handler()
wifi: wl1251: validate packet IDs before indexing tx_frames
wifi: wilc1000: fix u8 overflow in SSID scan buffer size calculation
wifi: ath12k: Pass the correct value of each TID during a stop AMPDU session
wifi: ath11k: Pass the correct value of each TID during a stop AMPDU session
wifi: iwlwifi: mld: correctly set wifi generation data
wifi: iwlwifi: mvm: don't send a 6E related command when not supported
wifi: iwlwifi: mld: Fix MLO scan timing
====================
Link: https://patch.msgid.link/20260326093329.77815-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull smb client fix from Steve French:
- Fix rebuild of mapping table
* tag 'v7.0-rc5-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb/client: ensure smb2_mapping_table rebuild on cmd changes
Pull power management fixes from Rafael Wysocki:
"These fix two cpufreq issues, one in the core and one in the
conservative governor, and two issues related to system sleep:
- Restore the cpufreq core behavior changed inadvertently during the
6.19 development cycle to call cpufreq_frequency_table_cpuinfo()
for cpufreq policies getting re-initialized which ensures that
policy->max and policy->cpuinfo_max_freq will be valid going
forward (Viresh Kumar)
- Adjust the cached requested frequency in the conservative cpufreq
governor on policy limits changes to prevent it from becoming stale
in some cases (Viresh Kumar)
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)"
* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
cpufreq: conservative: Reset requested_freq on limits change
cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
Pull thermal control fix from Rafael Wysocki:
"This prevents the int340x thermal driver from taking the power slider
offset parameter into account incorrectly in some cases (Srinivas
Pandruvada)"
* tag 'thermal-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: soc_slider: Set offset only for balanced mode
Pull ACPI support fix from Rafael Wysocki:
"Prevent use-after-free from occurring on reduced-hardware ACPI
platforms when -EPROBE_DEFER is returned by ec_install_handlers()
during ACPI EC driver initialization (Weiming Shi)"
* tag 'acpi-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: EC: clean up handlers on probe failure in acpi_ec_setup()
Pull Landlock fixes from Mickaël Salaün:
"This mainly fixes Landlock TSYNC issues related to interrupts and
unexpected task exit.
Other fixes touch documentation and sample, and a new test extends
coverage"
* tag 'landlock-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
landlock: Expand restrict flags example for ABI version 8
selftests/landlock: Test tsync interruption and cancellation paths
landlock: Clean up interrupted thread logic in TSYNC
landlock: Serialize TSYNC thread restriction
samples/landlock: Bump ABI version to 8
landlock: Improve TSYNC types
landlock: Fully release unused TSYNC work entries
landlock: Fix formatting
Merge fixes related to system sleep for 7.0-rc6:
- Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some
code paths in which it is legitimately called without invoking
pm_restrict_gfp_mask() previously (Youngjun Park)
- Update snapshot_write_finalize() to take trailing zero pages into
account properly which prevents user space restore from failing
subsequently in some cases (Alberto Garcia)
* pm-sleep:
PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
PM: hibernate: Drain trailing zero pages on userspace restore
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth, CAN, IPsec and Netfilter.
Notably, this includes the fix for the Bluetooth regression that you
were notified about. I'm not aware of any other pending regressions.
Current release - regressions:
- bluetooth:
- fix stack-out-of-bounds read in l2cap_ecred_conn_req
- fix regressions caused by reusing ident
- netfilter: revisit array resize logic
- eth: ice: set max queues in alloc_etherdev_mqs()
Previous releases - regressions:
- core: correctly handle tunneled traffic on IPV6_CSUM GSO fallback
- bluetooth:
- fix dangling pointer on mgmt_add_adv_patterns_monitor_complete
- fix deadlock in l2cap_conn_del()
- sched: codel: fix stale state for empty flows in fq_codel
- ipv6: remove permanent routes from tb6_gc_hlist when all exceptions expire.
- xfrm: fix skb_put() panic on non-linear skb during reassembly
- openvswitch:
- avoid releasing netdev before teardown completes
- validate MPLS set/set_masked payload length
- eth: iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
Previous releases - always broken:
- bluetooth: fix null-ptr-deref on l2cap_sock_ready_cb
- udp: fix wildcard bind conflict check when using hash2
- netfilter: fix use of uninitialized rtp_addr in process_sdp
- tls: Purge async_hold in tls_decrypt_async_wait()
- xfrm:
- prevent policy_hthresh.work from racing with netns teardown
- fix skb leak with espintcp and async crypto
- smc: fix double-free of smc_spd_priv when tee() duplicates splice pipe buffer
- can:
- add missing error handling to call can_ctrlmode_changelink()
- fix OOB heap access in cgw_csum_crc8_rel()
- eth:
- mana: fix use-after-free in add_adev() error path
- virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN
- bcmasp: fix double free of WoL irq"
* tag 'net-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (90 commits)
net: macb: use the current queue number for stats
netfilter: ctnetlink: use netlink policy range checks
netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
netfilter: nf_conntrack_expect: skip expectations in other netns via proc
netfilter: nf_conntrack_expect: store netns and zone in expectation
netfilter: ctnetlink: ensure safe access to master conntrack
netfilter: nf_conntrack_expect: use expect->helper
netfilter: nf_conntrack_expect: honor expectation helper field
netfilter: nft_set_rbtree: revisit array resize logic
netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
tls: Purge async_hold in tls_decrypt_async_wait()
selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
Bluetooth: btusb: clamp SCO altsetting table indices
Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
Bluetooth: L2CAP: Fix send LE flow credits in ACL link
net: mana: fix use-after-free in add_adev() error path
...
Pull pin control fixes from Linus Walleij:
- Implement .get_direction() in the spmi-gpio gpio_chip
Recent changes make this start to print warnings and it's not nice,
let's just fix it
- Clamp the return value of gpio_get() in the Renesas RZA1 driver
- Add the GPIO_GENERIC dependency to the STM32 HDP driver
- Modify the Mediatek driver to accept devices that do not use external
interrupts (EINT) at all
- Fix flag propagation in the Sunxi driver, so that we can fix an issue
with uninitialized pins in a follow-up patch using said flags
* tag 'pinctrl-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: sunxi: fix gpiochip_lock_as_irq() failure when pinmux is unknown
pinctrl: sunxi: pass down flags to pinctrl routines
pinctrl: mediatek: common: Fix probe failure for devices without EINT
pinctrl: stm32: fix HDP driver dependency on GPIO_GENERIC
pinctrl: renesas: rza1: Normalize return value of gpio_get()
pinctrl: qcom: spmi-gpio: implement .get_direction()
pinctrl: renesas: rzt2h: Fix invalid wait context
pinctrl: renesas: rzt2h: Fix device node leak in rzt2h_gpio_register()
Pull dma-mapping fixes from Marek Szyprowski:
"A set of fixes for DMA-mapping subsystem, which resolve false-
positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida
and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"
* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: add missing `inline` for `dma_free_attrs`
mm/hmm: Indicate that HMM requires DMA coherency
RDMA/umem: Tell DMA mapping that UMEM requires coherency
iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
dma-mapping: Introduce DMA require coherency attribute
dma-mapping: Clarify valid conditions for CPU cache line overlap
dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
dma-debug: Allow multiple invocations of overlapping entries
dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
During futex_key_to_node_opt() execution, vma->vm_policy is read under
speculative mmap lock and RCU. Concurrently, mbind() may call
vma_replace_policy() which frees the old mempolicy immediately via
kmem_cache_free().
This creates a race where __futex_key_to_node() dereferences a freed
mempolicy pointer, causing a use-after-free read of mpol->mode.
[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732] __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777] __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)
Fix by adding rcu to __mpol_put().
Fixes: c042c50521 ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
Nicholas reported that his LLM found it was possible to create a UaF
when sys_futex_requeue() is used with different flags. The initial
motivation for allowing different flags was the variable sized futex,
but since that hasn't been merged (yet), simply mandate the flags are
identical, as is the case for the old style sys_futex() requeue
operations.
Fixes: 0f4b5f9722 ("futex: Add sys_futex_requeue()")
Reported-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
A previous commit changed the behaviour of the KVM_S390_VCPU_FAULT
ioctl. The current (wrong) implementation will trigger a guest
addressing exception if the requested address lies outside of a
memslot, unless the VM is UCONTROL.
Restore the previous behaviour by open coding the fault-in logic.
Fixes: 3762e905ec ("KVM: s390: use __kvm_faultin_pfn()")
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing, the guest page tables are write-protected, in order to
trap changes and properly unshadow the shadow mapping for the nested
guest. Already shadowed levels are skipped, so that only the needed
levels are write protected.
Currently the levels that get write protected are exactly one level too
deep: the last level (nested guest memory) gets protected in the wrong
way, and will be protected again correctly a few lines afterwards; most
importantly, the highest non-shadowed level does *not* get write
protected.
Moreover, if the nested guest is running in a real address space, there
are no DAT tables to shadow.
Write protect the correct levels, so that all the levels that need to
be protected are protected, and avoid double protecting the last level;
skip attempting to shadow the DAT tables when the nested guest is
running in a real address space.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If shadowing causes the shadow gmap to get unshadowed, exit early to
prevent an attempt to dereference the parent pointer, which at this
point is NULL.
Opportunistically add some more checks to prevent NULL parents.
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Fixes: e5f98a6899 ("KVM: s390: Add some helper functions needed for vSIE")
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In most cases gmap_put() was not called when it should have been.
Add the missing gmap_put() in vsie_run().
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Fix _do_shadow_pte() to use the correct pointer (guest pte instead of
nested guest) to set up the new pte.
Add a check to return -EOPNOTSUPP if the mapping for the nested guest
is writeable but the same page in the guest is only read-only.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Introduce a new special softbit for large pages, like the one already
present for normal pages, and use it to mark guest mappings that do not have
struct pages.
Whenever a leaf DAT entry becomes dirty, check the special softbit and
only call SetPageDirty() if there is an actual struct page.
Move the logic to mark pages dirty inside _gmap_ptep_xchg() and
_gmap_crstep_xchg_atomic(), to avoid needlessly duplicating the code.
Fixes: 5a74e3d934 ("KVM: s390: KVM-specific bitfields and helper functions")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
The slow path of the fault handler ultimately called gmap_link(), which
assumed the fault was a major fault, and blindly called dat_link().
In case of minor faults, things were not always handled properly; in
particular the prefix and vsie marker bits were ignored.
Move dat_link() into gmap.c, renaming it accordingly. Once moved, the
new _gmap_link() function will be able to correctly honour the prefix
and vsie markers.
This will cause spurious unshadows in some uncommon cases.
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Fixes: a2c17f9270 ("KVM: s390: New gmap code")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
When shadowing a nested guest, a check is performed and no shadowing is
attempted if the nested guest is already shadowed.
The existing check was incomplete; fix it by also checking whether the
leaf DAT table entry in the existing shadow gmap has the same protection
as the one specified in the guest DAT entry.
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
In practice dat_crstep_xchg() is racy and hard to use correctly. Simply
remove it and replace its uses with dat_crstep_xchg_atomic().
This solves some actual races that lead to system hangs / crashes.
Opportunistically fix an alignment issue in _gmap_crstep_xchg_atomic().
Fixes: 589071eaaa ("KVM: s390: KVM page table management functions: clear and replace")
Fixes: 94fd9b16cc ("KVM: s390: KVM page table management functions: lifecycle management")
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
If the guest misbehaves and puts the page tables for its nested guest
inside the memory of the nested guest itself, and the guest and nested
guest are being mapped with large pages, the shadow mapping will
lose synchronization with the actual mapping, since this will cause the
large page with the vsie notification bit to be split, but the
vsie notification bit will not be propagated to the resulting small
pages.
Fix this by propagating the vsie_notif bit from large pages to normal
pages when splitting a large page.
Fixes: 2db149a0a6 ("KVM: s390: KVM page table management functions: walks")
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: Steffen Eiden <seiden@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Pablo Neira Ayuso says:
====================
Netfilter for net
This is v3, I kept back an ipset fix and another to tighten the xtables
interface to reject invalid combinations with the NFPROTO_ARP family.
They need a bit more discussion. I fixed the issues reported by AI on
patch 9 (add #ifdef to access ct zone, update nf_conntrack_broadcast)
and patch 10 (use better Fixes: tag). Thanks!
The following patchset contains Netfilter fixes for *net*.
Note that most bugs fixed here stem from 2.6 days, the large PR is not
due to an increase in regressions.
1) Fix incorrect reject of set updates with nf_tables pipapo set
avx2 backend. This comes with a regression test in patch 2.
From Florian Westphal.
2) nfnetlink_log needs to zero padding to prevent infoleak to userspace,
from Weiming Shi.
3) xtables ip6t_rt module never validated that addrnr length is within the
allowed array boundary. Reject bogus values. From Ren Wei.
4) Fix high memory usage in rbtree set backend that was an unwanted side-effect
of the recently added binary search blob. From Pablo Neira Ayuso.
5) Patches 5 to 10, also from Pablo, address long-standing RCU safety bugs
in conntrack's handling of expectations: We can never safely defer
a conntrack extension area without holding a reference. Yet expectation
handling does so in multiple places. Fix this by avoiding the need to
look into the master conntrack to begin with and by extending locked
sections in a few places.
11) Fix use of uninitialized rtp_addr in the sip conntrack helper,
also from Weiming Shi.
12) Add stricter netlink policy checks in ctnetlink, from David Carlier.
This avoids undefined behaviour when userspace provides a huge wscale
value.
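To illustrate why an unchecked wscale is dangerous, here is a small sketch. It performs the range check inline rather than through a netlink policy, and scale_window() is a hypothetical helper; the limit of 14 is the maximum TCP window scale shift from RFC 7323.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WSCALE 14	/* largest TCP window scale shift (RFC 7323) */

/* Returns 0 and stores the scaled window, or -1 if wscale is out of range. */
static int scale_window(uint32_t win, uint8_t wscale, uint32_t *out)
{
	assert(out != NULL);

	if (wscale > MAX_WSCALE)
		return -1;	/* reject before the shift can misbehave */

	*out = win << wscale;
	return 0;
}
```

Shifting a 32-bit value by 32 or more is undefined in C, and a u8 attribute can carry values up to 255, so rejecting anything above the protocol maximum removes the problem at the input boundary.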
netfilter pull request 26-03-26
* tag 'nf-26-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ctnetlink: use netlink policy range checks
netfilter: nf_conntrack_sip: fix use of uninitialized rtp_addr in process_sdp
netfilter: nf_conntrack_expect: skip expectations in other netns via proc
netfilter: nf_conntrack_expect: store netns and zone in expectation
netfilter: ctnetlink: ensure safe access to master conntrack
netfilter: nf_conntrack_expect: use expect->helper
netfilter: nf_conntrack_expect: honor expectation helper field
netfilter: nft_set_rbtree: revisit array resize logic
netfilter: ip6t_rt: reject oversized addrnr in rt_mt6_check()
netfilter: nfnetlink_log: fix uninitialized padding leak in NFULA_PAYLOAD
selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug
netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
====================
Link: https://patch.msgid.link/20260326125153.685915-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The netfs_io_stream::front member is meant to point to the subrequest
currently being collected on a stream, but it isn't actually used this way
by direct write (which mostly ignores it). However, there's a tracepoint
which looks at it. Further, stream->front is actually redundant with
stream->subrequests.next.
Fix the potential problem in the direct code by just removing the member
and using stream->subrequests.next instead, thereby also simplifying the
code.
Fixes: a0b4c7a491 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: Paulo Alcantara <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/4158599.1774426817@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Tony Nguyen says:
====================
For ice:
Michal corrects call to alloc_etherdev_mqs() to provide maximum number
of queues supported rather than currently allocated number of queues.
Petr Oros fixes issues related to some ethtool operations in switchdev
mode.
For iavf:
Kohei Enju corrects number of reported queues for ethtool statistics to
absolute max as using current number could race and cause out-of-bounds
issues.
For idpf:
Josh NULLs cdev_info pointer after freeing to prevent possible subsequent
improper access. He also defers setting of refillqs value until after
allocation to prevent possible NULL pointer dereference.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: only assign num refillqs if allocation was successful
idpf: clear stale cdev_info ptr
iavf: fix out-of-bounds writes in iavf_get_ethtool_stats()
ice: use ice_update_eth_stats() for representor stats
ice: fix inverted ready check for VF representors
ice: set max queues in alloc_etherdev_mqs()
====================
Link: https://patch.msgid.link/20260323205843.624704-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The xfile/xmbuf shmem file descriptions are no longer as detailed as
they were when online fsck was first merged, because moving to static
strings in commit 60382993a2 ("xfs: get rid of the
xchk_xfile_*_descr calls") removed a memory allocation and hence a
source of failure.
However this makes encoding the description in the tracepoints sort of a
waste of memory. David Laight also points out that file_path doesn't
zero the whole buffer which causes exposure of stale trace bytes, and
Steven Rostedt wonders why we're not using a dynamic array for the file
path.
I don't think this is worth fixing, so let's just rip it out.
Cc: rostedt@goodmis.org
Cc: david.laight.linux@gmail.com
Link: https://lore.kernel.org/linux-xfs/20260323172204.work.979-kees@kernel.org/
Cc: stable@vger.kernel.org # v6.11
Fixes: 19ebc8f84e ("xfs: fix file_path handling in tracepoints")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xlog_recovery_iget* never set @ip to a valid pointer if they return
an error, so this irele will walk off a dangling pointer. Fix that.
Cc: stable@vger.kernel.org # v6.10
Fixes: ae673f534a ("xfs: record inode generation in xattr update log intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When displaying pending SQEs for a MIXED ring, each 128-byte SQE
increments sq_head to skip the second slot, but the loop counter is not
adjusted. This can cause the loop to read past sq_tail by one entry for
each 128-byte SQE encountered, displaying SQEs that haven't been made
consumable yet by the application.
Match the kernel's own consumption logic in io_init_req() which
decrements what's left when consuming the extra slot.
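The slot accounting described above can be illustrated with a small
userspace sketch. This is not the kernel's fdinfo code: count_pending(),
the is_128b[] array and the "fixed" switch are hypothetical names, and the
only assumption taken from the text is that a 128-byte SQE occupies two
ring slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: is_128b[i] tells whether the SQE starting at slot i
 * is a 128-byte entry that also occupies slot i + 1. "pending" is the
 * number of consumable slots. Returns how many SQEs would be displayed.
 */
static size_t count_pending(const bool *is_128b, size_t pending, bool fixed)
{
	size_t shown = 0;
	size_t head = 0;

	assert(is_128b != NULL);

	while (pending > 0) {
		bool big = is_128b[head];

		shown++;
		head++;
		pending--;
		if (big) {
			head++;			/* skip the second slot */
			if (fixed && pending > 0)
				pending--;	/* and account for it */
		}
	}
	return shown;
}
```

With one 128-byte SQE followed by two 64-byte SQEs in four consumable
slots, the corrected accounting displays three entries, while the
uncorrected loop walks one slot past the tail.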
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There's a potential mismatch between the memory reserved for statistics
and the amount of memory written.
gem_get_sset_count() correctly computes the number of stats based on the
active queues, whereas gem_get_ethtool_stats() indiscriminately copies
data using the maximum number of queues, and in the case the number of
active queues is less than MACB_MAX_QUEUES, this results in an OOB write
as observed in the KASAN splat.
==================================================================
BUG: KASAN: vmalloc-out-of-bounds in gem_get_ethtool_stats+0x54/0x78
[macb]
Write of size 760 at addr ffff80008080b000 by task ethtool/1027
CPU: [...]
Tainted: [E]=UNSIGNED_MODULE
Hardware name: raspberrypi rpi/rpi, BIOS 2025.10 10/01/2025
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x80/0xf8
print_report+0x384/0x5e0
kasan_report+0xa0/0xf0
kasan_check_range+0xe8/0x190
__asan_memcpy+0x54/0x98
gem_get_ethtool_stats+0x54/0x78 [macb
926c13f3af83b0c6fe64badb21ec87d5e93fcf65]
dev_ethtool+0x1220/0x38c0
dev_ioctl+0x4ac/0xca8
sock_do_ioctl+0x170/0x1d8
sock_ioctl+0x484/0x5d8
__arm64_sys_ioctl+0x12c/0x1b8
invoke_syscall+0xd4/0x258
el0_svc_common.constprop.0+0xb4/0x240
do_el0_svc+0x48/0x68
el0_svc+0x40/0xf8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1b0/0x1b8
The buggy address belongs to a 1-page vmalloc region starting at
0xffff80008080b000 allocated at dev_ethtool+0x11f0/0x38c0
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000
index:0xffff00000a333000 pfn:0xa333
flags: 0x7fffc000000000(node=0|zone=0|lastcpupid=0x1ffff)
raw: 007fffc000000000 0000000000000000 dead000000000122 0000000000000000
raw: ffff00000a333000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff80008080b080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff80008080b100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff80008080b180: 00 00 00 00 00 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
^
ffff80008080b200: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
ffff80008080b280: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
==================================================================
Fix it by making sure the copied size only considers the active number of
queues.
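As a rough illustration of the mismatch, the sketch below keeps the
reported count and the copied size tied to the same number of active
queues. It is a userspace model, not the macb driver: struct fake_stats,
stats_count(), stats_copy() and the BASE_STATS/QUEUE_STATS values are
hypothetical, and MAX_QUEUES merely stands in for MACB_MAX_QUEUES.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_QUEUES	8	/* stand-in for MACB_MAX_QUEUES */
#define BASE_STATS	4	/* hypothetical number of global counters */
#define QUEUE_STATS	3	/* hypothetical number of per-queue counters */

struct fake_stats {
	unsigned long long values[BASE_STATS + QUEUE_STATS * MAX_QUEUES];
};

/* Number of counters reported for the active queues (the sset count). */
static size_t stats_count(unsigned int active_queues)
{
	assert(active_queues <= MAX_QUEUES);
	return BASE_STATS + (size_t)QUEUE_STATS * active_queues;
}

/* Copy only as many counters as stats_count() promised to the caller. */
static void stats_copy(unsigned long long *dst, const struct fake_stats *src,
		       unsigned int active_queues)
{
	memcpy(dst, src->values, stats_count(active_queues) * sizeof(*dst));
}
```

Because both functions derive the size from the same helper, a buffer
sized from the reported count cannot be overrun by the copy.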
Fixes: 512286bbd4 ("net: macb: Added some queue statistics")
Signed-off-by: Paolo Valerio <pvalerio@redhat.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Link: https://patch.msgid.link/20260323191634.2185840-1-pvalerio@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- L2CAP: Fix deadlock in l2cap_conn_del()
- L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
- L2CAP: Fix send LE flow credits in ACL link
- btintel: serialize btintel_hw_error() with hci_req_sync_lock
- btusb: clamp SCO altsetting table indices
* tag 'for-net-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btusb: clamp SCO altsetting table indices
Bluetooth: L2CAP: Fix ERTM re-init and zero pdu_len infinite loop
Bluetooth: L2CAP: Fix deadlock in l2cap_conn_del()
Bluetooth: btintel: serialize btintel_hw_error() with hci_req_sync_lock
Bluetooth: L2CAP: Fix send LE flow credits in ACL link
====================
Link: https://patch.msgid.link/20260325194358.618892-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The error path through vfio_pci_core_feature_dma_buf() ignores its
own advice to only use dma_buf_put() after dma_buf_export(), instead
falling through the entire unwind chain. In the unlikely event that
we encounter file descriptor exhaustion, this can result in an
unbalanced refcount on the vfio device and double free of allocated
objects.
Avoid this by moving the "put" directly into the error path and return
the errno rather than entering the unwind chain.
Reported-by: Renato Marziano <renato@marziano.top>
Fixes: 5d74781ebc ("vfio/pci: Add dma-buf export support for MMIO regions")
Cc: stable@vger.kernel.org
Acked-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Link: https://lore.kernel.org/r/20260323215659.2108191-3-alex.williamson@nvidia.com
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex@shazbot.org>
Replace manual range and mask validations with netlink policy
annotations in ctnetlink code paths, so that the netlink core rejects
invalid values early and can generate extack errors.
- CTA_PROTOINFO_TCP_STATE: reject values > TCP_CONNTRACK_SYN_SENT2 at
policy level, removing the manual >= TCP_CONNTRACK_MAX check.
- CTA_PROTOINFO_TCP_WSCALE_ORIGINAL/REPLY: reject values > TCP_MAX_WSCALE
(14). The normal TCP option parsing path already clamps to this value,
but the ctnetlink path accepted 0-255, causing undefined behavior when
used as a u32 shift count.
- CTA_FILTER_ORIG_FLAGS/REPLY_FLAGS: use NLA_POLICY_MASK with
CTA_FILTER_F_ALL, removing the manual mask checks.
- CTA_EXPECT_FLAGS: use NLA_POLICY_MASK with NF_CT_EXPECT_MASK, adding
a new mask define grouping all valid expect flags.
Extracted from a broader nf-next patch by Florian Westphal, scoped to
ctnetlink for the fixes tree.
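The window scale case can be sketched in plain C. This is only an
illustration of why the range check matters, not the ctnetlink policy
code: wscale_valid() and scale_window() are hypothetical helpers, and
MAX_WSCALE repeats the value 14 quoted above for TCP_MAX_WSCALE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_WSCALE 14	/* value of TCP_MAX_WSCALE quoted in the text */

/* Reject a userspace-supplied window scale before it is used as a shift. */
static bool wscale_valid(uint8_t wscale)
{
	return wscale <= MAX_WSCALE;
}

/*
 * Scale a 16-bit window. Shifting a 32-bit value by 32 or more is
 * undefined in C, which is what an unchecked 0-255 input could trigger.
 */
static uint32_t scale_window(uint16_t window, uint8_t wscale)
{
	assert(wscale_valid(wscale));
	return (uint32_t)window << wscale;
}
```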
Fixes: c8e2078cfe ("[NETFILTER]: ctnetlink: add support for internal tcp connection tracking flags handling")
Signed-off-by: David Carlier <devnexen@gmail.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
process_sdp() declares union nf_inet_addr rtp_addr on the stack and
passes it to the nf_nat_sip sdp_session hook after walking the SDP
media descriptions. However rtp_addr is only initialized inside the
media loop when a recognized media type with a non-zero port is found.
If the SDP body contains no m= lines, only inactive media sections
(m=audio 0 ...) or only unrecognized media types, rtp_addr is never
assigned. Despite that, the function still calls hooks->sdp_session()
with &rtp_addr, causing nf_nat_sdp_session() to format the stale stack
value as an IP address and rewrite the SDP session owner and connection
lines with it.
With CONFIG_INIT_STACK_ALL_ZERO (default on most distributions) this
results in the session-level o= and c= addresses being rewritten to
0.0.0.0 for inactive SDP sessions. Without stack auto-init the
rewritten address is whatever happened to be on the stack.
Fix this by pre-initializing rtp_addr from the session-level connection
address (caddr) when available, and tracking via a have_rtp_addr flag
whether any valid address was established. Skip the sdp_session hook
entirely when no valid address exists.
Fixes: 4ab9e64e5e ("[NETFILTER]: nf_nat_sip: split up SDP mangling")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Skip expectations that do not reside in this netns.
Similar to e77e6ff502 ("netfilter: conntrack: do not dump other netns's
conntrack entries via proc").
Fixes: 9b03f38d04 ("netfilter: netns nf_conntrack: per-netns expectations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
There is a teardown order issue in the driver. The SPI controller is
registered using devm_spi_register_controller(), which delays
unregistration of the SPI controller until after the fsl_lpspi_remove()
function returns.
As the fsl_lpspi_remove() function synchronously tears down the DMA
channels, a running SPI transfer triggers the following NULL pointer
dereference due to use after free:
| fsl_lpspi 42550000.spi: I/O Error in DMA RX
| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[...]
| Call trace:
| fsl_lpspi_dma_transfer+0x260/0x340 [spi_fsl_lpspi]
| fsl_lpspi_transfer_one+0x198/0x448 [spi_fsl_lpspi]
| spi_transfer_one_message+0x49c/0x7c8
| __spi_pump_transfer_message+0x120/0x420
| __spi_sync+0x2c4/0x520
| spi_sync+0x34/0x60
| spidev_message+0x20c/0x378 [spidev]
| spidev_ioctl+0x398/0x750 [spidev]
[...]
Switch from devm_spi_register_controller() to spi_register_controller() in
fsl_lpspi_probe() and add the corresponding spi_unregister_controller() in
fsl_lpspi_remove().
Fixes: 5314987de5 ("spi: imx: add lpspi bus driver")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://patch.msgid.link/20260319-spi-fsl-lpspi-fixes-v1-1-b433e435b2d8@pengutronix.de
Signed-off-by: Mark Brown <broonie@kernel.org>
__nf_ct_expect_find() and nf_ct_expect_find_get() are called under
rcu_read_lock() but they dereference the master conntrack via
exp->master.
Since the expectation does not hold a reference on the master conntrack,
this could be a dying conntrack or a different, recycled conntrack than
the real master due to SLAB_TYPESAFE_RCU.
Store the netns, the master_tuple and the zone in struct
nf_conntrack_expect as a safety measure.
This patch is required by the follow up fix not to dump expectations
that do not belong to this netns.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Holding a reference on the expectation is not sufficient: the master
conntrack object can just go away, making exp->master invalid.
To access exp->master safely:
- Grab the nf_conntrack_expect_lock, this gets serialized with
clean_from_lists() which also holds this lock when the master
conntrack goes away.
- Hold a reference on the master conntrack via nf_conntrack_find_get().
This is not so easy since the master tuple needed to look up the master
conntrack is not available in the existing problematic paths.
This patch goes for extending the nf_conntrack_expect_lock section
to address this issue for simplicity, in the cases that are described
below this is just slightly extending the lock section.
The add expectation command already holds a reference to the master
conntrack from ctnetlink_create_expect().
However, the delete expectation command needs to grab the spinlock
before looking up the expectation. Expand the existing spinlock section
to cover the expectation lookup. Note that the
nf_ct_expect_iterate_net() calls already grab the spinlock while
iterating over the expectation table, which is correct.
The get expectation command needs to grab the spinlock to ensure master
conntrack does not go away. This also expands the existing spinlock
section to cover the expectation lookup too. I needed to move the
netlink skb allocation out of the spinlock to keep it GFP_KERNEL.
For the expectation events, the IPEXP_DESTROY event is already delivered
under the spinlock, just move the delivery of IPEXP_NEW under the
spinlock too because the master conntrack event cache is reached through
exp->master.
While at it, add lockdep annotations to help identify what codepaths need
to grab the spinlock.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use expect->helper in ctnetlink and /proc to dump the helper name.
Using nfct_help() without holding a reference to the master conntrack
is unsafe.
Use exp->master->helper in ctnetlink path if userspace does not provide
an explicit helper when creating an expectation to retain the existing
behaviour. The ctnetlink expectation path holds the reference on the
master conntrack and nf_conntrack_expect lock and the nfnetlink glue
path refers to the master ct that is attached to the skb.
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The expectation helper field is mostly unused. As a result, the
netfilter codebase relies on accessing the helper through exp->master.
Always set the expectation helper field so it can be used to reach
the helper.
nf_ct_expect_init() is called from packet path where the skb owns
the ct object, therefore accessing exp->master for the newly created
expectation is safe. This saves a lot of updates in all callsites
to pass the ct object as parameter to nf_ct_expect_init().
This is a preparation patch for follow-up fixes.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chris Arges reports high memory consumption with thousands of
containers. This patch revisits the array allocation logic.
For anonymous sets, start with 16 slots (which takes 256 bytes on x86_64).
Expand it by x2 until the threshold of 512 slots is reached; over that
threshold, expand it by x1.5.
For non-anonymous sets, start with 1024 slots in the array (which takes 16
Kbytes initially on x86_64). Expand it by x1.5.
Use set->ndeact to subtract deactivated elements when calculating the
number of slots in the array, otherwise the array size gets increased
artificially. Add special-case shrink logic to deal with flush set too.
The shrink logic is skipped for anonymous sets.
Use check_add_overflow() to calculate the new array size.
Add a WARN_ON_ONCE check to make sure elements fit into the new array
size.
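The growth policy above can be sketched as a single helper. This is a
userspace approximation, not the nft_set_rbtree code: next_slots() is a
hypothetical name, the kernel's check_add_overflow() is replaced by a
manual bound check, and treating exactly 512 slots as "threshold reached"
is an assumption about how the boundary is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ANON_START	16u	/* initial slots for anonymous sets */
#define ANON_DOUBLE_MAX	512u	/* stop doubling once this is reached */
#define NAMED_START	1024u	/* initial slots for non-anonymous sets */

/*
 * Hypothetical helper: compute the next array size in slots.
 * cur == 0 means no array has been allocated yet.
 * Returns false if the new size would overflow.
 */
static bool next_slots(uint32_t cur, bool anonymous, uint32_t *next)
{
	uint32_t inc;

	assert(next != NULL);

	if (cur == 0) {
		*next = anonymous ? ANON_START : NAMED_START;
		return true;
	}
	/* x2 for small anonymous sets, x1.5 otherwise */
	inc = (anonymous && cur < ANON_DOUBLE_MAX) ? cur : cur / 2;
	if (inc > UINT32_MAX - cur)
		return false;
	*next = cur + inc;
	return true;
}
```

Under these assumptions an anonymous set grows 16, 32, ..., 512, 768,
while a non-anonymous set grows 1024, 1536, and so on.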
Reported-by: Chris Arges <carges@cloudflare.com>
Fixes: 7e43e0a114 ("netfilter: nft_set_rbtree: translate rbtree to array for binary search")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reject rt match rules whose addrnr exceeds IP6T_RT_HOPS.
rt_mt6() expects addrnr to stay within the bounds of rtinfo->addrs[].
Validate addrnr during rule installation so malformed rules are rejected
before the match logic can use an out-of-range value.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Yuhang Zheng <z1652074432@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
__build_packet_message() manually constructs the NFULA_PAYLOAD netlink
attribute using skb_put() and skb_copy_bits(), bypassing the standard
nla_reserve()/nla_put() helpers. While nla_total_size(data_len) bytes
are allocated (including NLA alignment padding), only data_len bytes
of actual packet data are copied. The trailing nla_padlen(data_len)
bytes (1-3 when data_len is not 4-byte aligned) are never initialized,
leaking stale heap contents to userspace via the NFLOG netlink socket.
Replace the manual attribute construction with nla_reserve(), which
handles the tailroom check, header setup, and padding zeroing via
__nla_reserve(). The subsequent skb_copy_bits() fills in the payload
data on top of the properly initialized attribute.
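The size arithmetic behind the leak can be reproduced in userspace. The
helpers below mirror the roles of nla_attr_size(), nla_total_size() and
nla_padlen() but are local re-implementations for illustration, assuming
a 4-byte attribute header and 4-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

#define ALIGNTO		4u
#define ALIGN4(len)	(((len) + ALIGNTO - 1) & ~(ALIGNTO - 1))
#define HDRLEN		ALIGN4(4u)	/* aligned size of the attribute header */

/* Header plus payload, without trailing padding. */
static unsigned int attr_size(unsigned int payload)
{
	return HDRLEN + payload;
}

/* Space reserved in the message, including alignment padding. */
static unsigned int total_size(unsigned int payload)
{
	return ALIGN4(attr_size(payload));
}

/* Trailing bytes that must be zeroed before the message reaches userspace. */
static unsigned int padlen(unsigned int payload)
{
	unsigned int pad = total_size(payload) - attr_size(payload);

	assert(pad < ALIGNTO);
	return pad;
}
```

For a 5-byte payload, 12 bytes are reserved but only 9 are written by the
header and the copy, leaving 3 padding bytes that need explicit zeroing.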
Fixes: df6fb868d6 ("[NETFILTER]: nfnetlink: convert to generic netlink attribute functions")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The sub-device state lock has already been acquired when ccs_init_state()
is called. Do not try to acquire it again.
Reported-by: David Heidelberg <david@ixit.cz>
Fixes: a88883d120 ("media: ccs: Rely on sub-device state locking")
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil+cisco@kernel.org>
The SPI API is asymmetric and the controller is freed as part of
deregistration (unless it has been allocated using
devm_spi_alloc_host/target()).
A recent change converting the managed registration function to use
devm_add_action_or_reset() inadvertently introduced a (mostly
theoretical) regression where a non-devres managed controller could be
freed as part of failed registration. This in turn would lead to
use-after-free in controller driver error paths.
Fix this by taking another reference before calling
devm_add_action_or_reset() and not releasing it on errors for
non-devres allocated controllers.
An alternative would be a partial revert of the offending commit, but
it is better to handle this explicitly until the API has been fixed
(e.g. see 5e844cc37a ("spi: Introduce device-managed SPI controller
allocation")).
Fixes: b6376dbed8 ("spi: Simplify devm_spi_*_controller()")
Reported-by: Felix Gu <ustc.gu@gmail.com>
Link: https://lore.kernel.org/all/20260324145548.139952-1-ustc.gu@gmail.com/
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Johan Hovold <johan@kernel.org>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://patch.msgid.link/20260325145319.1132072-1-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Jihed Chaibi <jihed.chaibi.dev@gmail.com> says:
adau1372_set_power() had two related error handling issues in its enable
path: clk_prepare_enable() was called but its return value discarded, and
adau1372_enable_pll() was a void function that silently swallowed lock
failures, leaving mclk enabled and adau1372->enabled set to true despite
the device being in a broken state.
Patch 1 fixes the unchecked clk_prepare_enable() by making
adau1372_set_power() return int and propagating the error.
Patch 2 converts adau1372_enable_pll() to return int and adds a full
unwind in adau1372_set_power() if PLL lock fails, reversing the regcache,
GPIO power-down, and clock state.
adau1372_enable_pll() was a void function that logged a dev_err() on
PLL lock timeout but did not propagate the error. As a result,
adau1372_set_power() would continue with adau1372->enabled set to true
despite the PLL being unlocked, and the mclk left enabled with no
corresponding disable on the error path.
Convert adau1372_enable_pll() to return int, using -ETIMEDOUT on lock
timeout and propagating regmap errors directly. In adau1372_set_power(),
check the return value and unwind in reverse order: restore regcache to
cache-only mode, reassert GPIO power-down, and disable the clock before
returning the error.
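The unwind order can be modelled with a toy state machine. This is not
the adau1372 driver: struct fake_codec, enable_pll() and power_on() are
hypothetical, the hardware steps are reduced to boolean flags, and the
pll_locks field exists only so that the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_codec {
	bool clk_on;		/* mclk enabled */
	bool powered;		/* GPIO power-down released */
	bool cache_only;	/* regcache in cache-only mode */
	bool enabled;
	bool pll_locks;		/* test knob: whether the PLL will lock */
};

/* Report a lock timeout instead of only logging it. */
static int enable_pll(const struct fake_codec *c)
{
	return c->pll_locks ? 0 : -ETIMEDOUT;
}

static int power_on(struct fake_codec *c)
{
	int ret;

	assert(c != NULL);

	c->clk_on = true;
	c->powered = true;
	c->cache_only = false;

	ret = enable_pll(c);
	if (ret) {
		/* unwind in reverse order of the steps above */
		c->cache_only = true;
		c->powered = false;
		c->clk_on = false;
		return ret;
	}

	c->enabled = true;
	return 0;
}
```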
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: 6cd4c6459e ("ASoC: Add ADAU1372 audio CODEC support")
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Link: https://patch.msgid.link/20260325210704.76847-3-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
adau1372_set_power() calls clk_prepare_enable() but discards the return
value. If the clock enable fails, the driver proceeds to access registers
on unpowered hardware, potentially causing silent corruption.
Make adau1372_set_power() return int and propagate the error from
clk_prepare_enable(). Update adau1372_set_bias_level() to return the
error directly for the STANDBY and OFF cases.
Signed-off-by: Jihed Chaibi <jihed.chaibi.dev@gmail.com>
Fixes: 6cd4c6459e ("ASoC: Add ADAU1372 audio CODEC support")
Reviewed-by: Nuno Sá <nuno.sa@analog.com>
Link: https://patch.msgid.link/20260325210704.76847-2-jihed.chaibi.dev@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The async_hold queue pins encrypted input skbs while
the AEAD engine references their scatterlist data. Once
tls_decrypt_async_wait() returns, every AEAD operation
has completed and the engine no longer references those
skbs, so they can be freed unconditionally.
A subsequent patch adds batch async decryption to
tls_sw_read_sock(), introducing a new call site that
must drain pending AEAD operations and release held
skbs. Move __skb_queue_purge(&ctx->async_hold) into
tls_decrypt_async_wait() so the purge is centralized
and every caller -- recvmsg's drain path, the -EBUSY
fallback in tls_do_decryption(), and the new read_sock
batch path -- releases held skbs on synchronization
without each site managing the purge independently.
This fixes a leak when tls_strp_msg_hold() fails part-way through,
after having added some cloned skbs to the async_hold
queue. tls_decrypt_sg() will then call tls_decrypt_async_wait() to
process all pending decrypts, and drop back to synchronous mode, but
tls_sw_recvmsg() only flushes the async_hold queue when one record has
been processed in "fully-async" mode, which may not be the case here.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: b8a6ff84ab ("tls: wait for pending async decryptions if tls_strp_msg_hold fails")
Link: https://patch.msgid.link/20260324-tls-read-sock-v5-1-5408befe5774@oracle.com
[pabeni@redhat.com: added leak comment]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
proc_do_large_bitmap() does not initialize variable c, which is expected
to be set to a trailing character by proc_get_long().
However, proc_get_long() only sets c when the input buffer contains a
trailing character after the parsed value.
If c is not initialized it may happen to contain a '-'. If this is the
case proc_do_large_bitmap() expects to be able to parse a second part of
the input buffer. If there is no second part an unjustified -EINVAL will
be returned.
Initialize c to 0 to prevent returning -EINVAL on valid input.
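The failure mode can be reproduced with a simplified parser. The code
below is a userspace sketch, not the sysctl implementation: get_long() is
a hypothetical stand-in for proc_get_long() that, like the function
described above, only writes the trailing character when one exists, and
parse_range() is a hypothetical caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for proc_get_long(): parse a decimal number and
 * store the character following it in *tr only if such a character exists.
 */
static int get_long(const char **buf, size_t *left, unsigned long *val,
		    char *tr)
{
	unsigned long v = 0;
	size_t n = 0;

	while (n < *left && isdigit((unsigned char)(*buf)[n]))
		v = v * 10 + (unsigned long)((*buf)[n++] - '0');
	if (n == 0)
		return -1;
	if (n < *left)
		*tr = (*buf)[n];	/* left untouched when the input ends */
	*val = v;
	*buf += n;
	*left -= n;
	return 0;
}

/* Parse "a" or "a-b"; returns 0 on success, -1 on malformed input. */
static int parse_range(const char *buf, size_t left,
		       unsigned long *a, unsigned long *b)
{
	char c = 0;	/* the fix: a defined value when nothing trails */

	assert(buf != NULL);

	if (get_long(&buf, &left, a, &c))
		return -1;
	*b = *a;
	if (c == '-') {
		buf++;
		left--;
		if (get_long(&buf, &left, b, &c))
			return -1;
	}
	return 0;
}
```

Without the initializer, a stale '-' in c would send the single-value
input "5" down the range branch and make it fail.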
Fixes: 9f977fb7ae ("sysctl: add proc_do_large_bitmap")
Signed-off-by: Marc Buerg <buermarc@googlemail.com>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Commit 453b8fb68f ("xen/privcmd: restrict usage in
unprivileged domU") added a xenstore notifier to defer setting the
restriction target until Xenstore is ready.
XEN_PRIVCMD can be built as a module, but privcmd_exit() leaves that
notifier behind. Balance the notifier lifecycle by unregistering it on
module exit.
This is harmless even if xenstore was already ready at registration
time and the notifier was never queued on the chain.
Fixes: 453b8fb68f ("xen/privcmd: restrict usage in unprivileged domU")
Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260325120246.252899-1-zhaoguohan@kylinos.cn>
In kvm_eiointc_regs_access(), the register base address is calculated
as the array base address plus an offset, where the offset is an absolute
offset from the base address. The data type of the array base address is
u64, so it must be converted to "void *" before the offset is added.
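The difference between the two forms of pointer arithmetic can be shown
in a few lines. The helpers reg_addr_scaled() and reg_addr_bytes() are
hypothetical, and the sketch uses "unsigned char *" because arithmetic on
"void *" is a GNU C extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wrong: arithmetic on a uint64_t pointer scales the offset by 8. */
static void *reg_addr_scaled(uint64_t *base, size_t byte_offset)
{
	assert(base != NULL);
	return base + byte_offset;
}

/* Right: convert to a byte pointer first, then add the byte offset. */
static void *reg_addr_bytes(uint64_t *base, size_t byte_offset)
{
	assert(base != NULL);
	return (unsigned char *)base + byte_offset;
}
```

For a byte offset of 8, the first helper lands on element 8 of the array
while the second lands on element 1.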
Cc: <stable@vger.kernel.org>
Fixes: d3e43a1f34 ("LoongArch: KVM: Use 64-bit register definition for EIOINTC")
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
EIOINTC's coremap in eiointc_update_sw_coremap() can be empty. Currently
we get a cpuid of -1 in this case, but we actually need 0 because it is
similar to the case where cpuid >= 4.
This fixes an out-of-bounds access to kvm_arch::phyid_map::phys_map[].
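A minimal sketch of the clamping, assuming an 8-bit coremap and the bound
of 4 mentioned above. first_cpu() and coremap_to_cpuid() are hypothetical
helpers written for illustration; they are not the KVM code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUID 4	/* bound quoted in the text: cpuid >= 4 maps to 0 */

/* Index of the lowest set bit, or -1 when the map is empty. */
static int first_cpu(uint8_t coremap)
{
	for (int bit = 0; bit < 8; bit++)
		if (coremap & (1u << bit))
			return bit;
	return -1;
}

/* Map both "empty" (-1) and "too large" (>= 4) to CPU 0. */
static int coremap_to_cpuid(uint8_t coremap)
{
	int cpuid = first_cpu(coremap);

	if (cpuid < 0 || cpuid >= MAX_CPUID)
		cpuid = 0;
	assert(cpuid >= 0 && cpuid < MAX_CPUID);
	return cpuid;
}
```

The result is always a valid non-negative index, so it can no longer be
used to read before the start of a lookup table.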
Cc: <stable@vger.kernel.org>
Fixes: 3956a52bc0 ("LoongArch: KVM: Add EIOINTC read and write functions")
Reported-by: Aurelien Jarno <aurel32@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1131431
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
With -fno-asynchronous-unwind-tables and --no-eh-frame-hdr (the default
of the linker), the GNU_EH_FRAME segment (specified by vdso.lds.S) is
empty. This is not valid, as the current DWARF specification mandates
the first byte of the EH frame to be the version number 1. It causes
some unwinders to complain, for example the ClickHouse query profiler
spams the log with messages:
clickhouse-server[365854]: libunwind: unsupported .eh_frame_hdr
version: 127 at 7ffffffb0000
Here "127" is just the byte located at the p_vaddr (0, i.e. the
beginning of the vDSO) of the empty GNU_EH_FRAME segment. Cross-
checking with /proc/365854/maps has also proven 7ffffffb0000 is the
start of vDSO in the process VM image.
In LoongArch the -fno-asynchronous-unwind-tables option seems just a
MIPS legacy, and MIPS only uses this option to satisfy the MIPS-specific
"genvdso" program, per the commit cfd75c2db1 ("MIPS: VDSO: Explicitly
use -fno-asynchronous-unwind-tables"). IIRC it indicates some inherent
limitation of the MIPS ELF ABI and has nothing to do with LoongArch. So
we can simply flip it over to -fasynchronous-unwind-tables and pass
--eh-frame-hdr for linking the vDSO, allowing the profilers to unwind the
stack for statistics even if the sample point is taken when the PC is in
the vDSO.
However, simply adjusting the options above would expose an issue: when
the libgcc unwinder saw the invalid GNU_EH_FRAME segment, it silently
fell back to a machine-specific routine to match the code pattern of
rt_sigreturn() and extract the registers saved in the sigframe if the
code pattern is matched. As unwinding from signal handlers is vital for
libgcc to support pthread cancellation etc., the fall-back routine had
been silently keeping the LoongArch Linux systems functioning since
Linux 5.19. But when we start to emit GNU_EH_FRAME with the correct
format, the fall-back routine will no longer be used and libgcc will fail
to unwind the sigframe, and unwinding from signal handlers will no
longer work, causing dozens of glibc test failures. To make it possible
to unwind from signal handlers again, it's necessary to code the unwind
info in __vdso_rt_sigreturn via .cfi_* directives.
The offsets in the .cfi_* directives depend on the layout of struct
sigframe, notably the offset of sigcontext in the sigframe. To use the
offset in the assembly file, factor out struct sigframe into a header to
allow asm-offsets.c to output the offset for assembly.
To work around a long-term issue in the libgcc unwinder (the pc is
unconditionally decremented by 1: doing so is technically incorrect for
a signal frame), a nop instruction is included with the two real
instructions in __vdso_rt_sigreturn in the same FDE PC range. The same
hack has been used on x86 for a long time.
Cc: stable@vger.kernel.org
Fixes: c6b99bed6b ("LoongArch: Add VDSO and VSYSCALL support")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
1. Hardware limitation: GPU, DC and VPU are typically PCI devices 06.0,
06.1 and 06.2. They share some hardware resources, so when configuring
BAR1 of the PCI 06.0 device, DMA memory access cannot be performed through
this BAR, otherwise it will cause hardware abnormalities.
2. In typical reboot or S3/S4 scenarios, DC access to memory through the
BAR is not prohibited, resulting in GPU DMA hangs.
3. Workaround: When configuring BAR1 of the 06.0 device, turn off
the memory access of DC, GPU and VPU (via DC's CRTC registers).
Cc: stable@vger.kernel.org
Signed-off-by: Qianhai Wu <wuqianhai@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
1. Replace "of_find_node_by_path("/")" with "of_root" to avoid multiple
calls to "of_node_put()".
2. Fix a potential kernel oops during early boot when memory allocation
fails while parsing CPU model from device tree.
Cc: stable@vger.kernel.org
Signed-off-by: Li Jun <lijun01@kylinos.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Pull erofs fixes from Gao Xiang:
- Mark I/Os as failed when encountering short reads on file-backed
mounts
- Label GFP_NOIO in the BIO completion when the completion is in the
process context, and directly call into the decompression to avoid
deadlocks
- Improve Kconfig descriptions to better highlight the overall efforts
- Fix .fadvise() for page cache sharing
* tag 'erofs-for-7.0-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix .fadvise() for page cache sharing
erofs: update the Kconfig description
erofs: add GFP_NOIO in the bio completion if needed
erofs: set fileio bio failed in short read case
Pull RCU fixes from Boqun Feng:
"Fix a regression introduced by commit c27cea4416 ("rcu: Re-implement
RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with
preemption disabled or scheduler locks held, so call_srcu() must work
in all such contexts.
Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding
scheduler lock acquisition in call_srcu() by deferring to an irq_work
(similar to call_rcu_tasks_generic()), for both tree SRCU and tiny
SRCU.
Also fix a follow-on lockdep splat caused by srcu_node allocation
under the newly introduced raw spinlock by deferring the allocation to
grace-period worker context"
* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
srcu: Use irq_work to start GP in tiny SRCU
rcu: Use an intermediate irq_work to start process_srcu()
srcu: Push srcu_node allocation to GP when non-preemptible
srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are
dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets,
nr_populated_domain_children and nr_populated_threaded_children, but
cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether
there are children is cgroup_destroy_locked()'s concern.
This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had
no direct tasks but had a populated child, cgroup_drain_dying() would enter its
wait loop because cgroup_is_populated() was true from
nr_populated_domain_children. The task iterator found nothing to wait for, yet
the populated state never cleared because it was driven by live tasks in the
child cgroup.
Fix it by using cgroup_has_tasks() which only tests nr_populated_csets.
v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian).
v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: 1b164b876c ("cgroup: Wait for dying tasks to leave on rmdir")
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
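The difference between the two predicates in the cgroup fix above can be sketched in user-space C. This is a minimal model, not the kernel code: struct cgroup_counts and both helpers are hypothetical stand-ins that only mirror the three counters named in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three counters named in the message. */
struct cgroup_counts {
	int nr_populated_csets;             /* tasks directly in this cgroup */
	int nr_populated_domain_children;   /* populated domain children */
	int nr_populated_threaded_children; /* populated threaded children */
};

/* Models cgroup_is_populated(): true if this cgroup OR a child has tasks. */
static bool is_populated(const struct cgroup_counts *c)
{
	assert(c);
	return c->nr_populated_csets +
	       c->nr_populated_domain_children +
	       c->nr_populated_threaded_children > 0;
}

/* Models cgroup_has_tasks(): true only if this cgroup itself has tasks. */
static bool has_tasks(const struct cgroup_counts *c)
{
	assert(c);
	return c->nr_populated_csets > 0;
}
```

For a cgroup with no direct tasks and one populated child, is_populated() is true while has_tasks() is false, which is the case where the drain loop had nothing to wait for.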
When a compound request consists of QUERY_DIRECTORY + QUERY_INFO
(FILE_ALL_INFORMATION) and the first command consumes nearly the entire
max_trans_size, get_file_all_info() would blindly call smbConvertToUTF16()
with PATH_MAX, causing an out-of-bounds write beyond the response buffer.
In get_file_all_info(), there was a missing validation check for
the client-provided OutputBufferLength before copying the filename into
FileName field of the smb2_file_all_info structure.
If the filename length exceeds the available buffer space, it could lead to
potential buffer overflows or memory corruption during smbConvertToUTF16
conversion. Fix this by calculating the actual free buffer size using
smb2_calc_max_out_buf_len(), returning -EINVAL if the buffer is
insufficient, and updating smbConvertToUTF16 to use the actual filename
length (clamped by PATH_MAX) to ensure a safe copy operation.
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Reported-by: Asim Viladi Oglu Manizada <manizada@pm.me>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
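The length check described in the ksmbd fix above can be sketched as follows. This is an illustrative user-space model under stated assumptions: checked_name_bytes() and SKETCH_PATH_MAX are hypothetical names, and the 2-bytes-per-character cost for UTF-16 is a simplification rather than what the kernel conversion routine guarantees.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PATH_MAX 4096 /* stand-in for the kernel's PATH_MAX */

/*
 * Return the number of UTF-16 bytes the name would occupy, or -EINVAL
 * when the fixed part plus the name does not fit in free_len bytes.
 * Assumes 2 bytes per character, which is a simplification.
 */
static long checked_name_bytes(size_t free_len, size_t fixed_len,
			       const char *name)
{
	size_t chars;

	assert(name);
	chars = strlen(name);
	if (chars > SKETCH_PATH_MAX)
		chars = SKETCH_PATH_MAX; /* clamp, as the message describes */

	/* Written to avoid overflow in fixed_len + 2 * chars. */
	if (fixed_len > free_len || 2 * chars > free_len - fixed_len)
		return -EINVAL;

	return (long)(2 * chars);
}
```

The point of the sketch is the ordering: the free space is computed and checked first, and only the real (clamped) name length is ever handed to the copy step.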
Pull hardening fixes from Kees Cook:
- fix required Clang version for CC_HAS_COUNTED_BY_PTR (Nathan
Chancellor)
- update Coccinelle script used for kmalloc_obj
* tag 'hardening-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
init/Kconfig: Require a release version of clang-22 for CC_HAS_COUNTED_BY_PTR
coccinelle: kmalloc_obj: Remove default GFP_KERNEL arg
Pull x86 platform driver fixes from Ilpo Järvinen:
"Fixes and New HW Support. The trivial drop of unused gz_chain_head is
not exactly fixes material but it allows other work to avoid problems
so I decided to take it in along with the fixes.
- amd/hsmp: Fix typo in error message
- asus-armoury: Add support for G614FP, GA503QM, GZ302EA, and GZ302EAC
- asus-nb-wmi: Add DMI quirk for ASUS ROG Flow Z13-KJP GZ302EAC
- hp-wmi: Support for Omen 16-k0xxx, 16-wf1xxx, 16-xf0xxx
- intel-hid: Disable wakeup_mode during hibernation
- ISST:
- Check HWP support before MSR access
- Correct locked bit width
- lenovo: wmi-gamezone: Drop unused gz_chain_head
- olpc-xo175-ec: Fix overflow error message"
* tag 'platform-drivers-x86-v7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: ISST: Correct locked bit width
platform/x86: intel-hid: disable wakeup_mode during hibernation
platform/x86: asus-armoury: add support for GZ302EA and GZ302EAC
platform/x86: asus-nb-wmi: add DMI quirk for ASUS ROG Flow Z13-KJP GZ302EAC
platform/x86/amd/hsmp: Fix typo in error message
platform/olpc: olpc-xo175-ec: Fix overflow error message to print inlen
platform/x86: lenovo: wmi-gamezone: Drop gz_chain_head
platform/x86: ISST: Check HWP support before MSR access
platform/x86: hp-wmi: Add support for Omen 16-k0xxx (8A4D)
platform/x86: hp-wmi: Add support for Omen 16-wf1xxx (8C76)
platform/x86: hp-wmi: Add Omen 16-xf0xxx (8BCA) support
platform/x86: asus-armoury: add support for G614FP
platform/x86: asus-armoury: add support for GA503QM
MAINTAINERS: change email address of Denis Benato
This test will fail without
the preceding commit ("netfilter: nft_set_pipapo_avx2: fix match retart if found element is expired"):
reject overlapping range on add 0s [ OK ]
reload with flush /dev/stdin:59:32-52: Error: Could not process rule: File exists
add element inet filter test { 10.0.0.29 . 10.0.2.29 }
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
New test case fails unexpectedly when avx2 matching functions are used.
The test first loads a randomly generated pipapo set
with 'ipv4 . port' key, i.e. nft -f foo.
This works. Then, it reloads the set after a flush:
(echo flush set t s; cat foo) | nft -f -
This is expected to work, because it's the same set after all and it was
already loaded once.
But with avx2, this fails: nft reports a clashing element.
The reported clash is of following form:
We successfully re-inserted
a . b
c . d
Then we try to insert a . d
avx2 finds the already existing a . d, which (due to 'flush set') is marked
as invalid in the new generation. It skips the element and moves to next.
Due to incorrect masking, the skip-step finds the next matching
element *only considering the first field*,
i.e. we return the already reinserted "a . b", even though the
last field is different and the entry should not have been matched.
No such error is reported for the generic C implementation (no avx2) or when
the last field has to use the 'nft_pipapo_avx2_lookup_slow' fallback.
Bisection points to
7711f4bb4b ("netfilter: nft_set_pipapo: fix range overlap detection")
but that fix merely uncovers this bug.
Before this commit, the wrong element is returned, but erroneously
reported as a full, identical duplicate.
The root cause is a too-early return in the avx2 match functions.
When we process the last field, we should continue to process data
until the entire input size has been consumed to make sure no stale
bits remain in the map.
Link: https://lore.kernel.org/netfilter-devel/20260321152506.037f68c0@elisabeth/
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The regulator operations pmbus_regulator_get_voltage(),
pmbus_regulator_set_voltage(), and pmbus_regulator_list_voltage()
access PMBus registers and shared data but were not protected by
the update_lock mutex. This could lead to race conditions.
However, adding mutex protection directly to these functions causes
a deadlock because pmbus_regulator_notify() (which calls
regulator_notifier_call_chain()) is often called with the mutex
already held (e.g., from pmbus_fault_handler()). If a regulator
callback then calls one of the now-protected voltage functions,
it will attempt to acquire the same mutex.
Rework pmbus_regulator_notify() to utilize a worker function to
send notifications outside of the mutex protection. Events are
stored as atomics in a per-page bitmask and processed by the worker.
Initialize the worker and its associated data during regulator
registration, and ensure it is cancelled on device removal using
devm_add_action_or_reset().
While at it, remove the unnecessary include of linux/of.h.
Cc: Sanman Pradhan <psanman@juniper.net>
Fixes: ddbb4db4ce ("hwmon: (pmbus) Add regulator support")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
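The "record under the lock, notify from a worker" pattern described in the pmbus commit above can be sketched with C11 atomics. This is a user-space illustration only: pending_events, post_event() and drain_events() are hypothetical names, and the kernel driver uses its own atomic and workqueue APIs rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_PAGES 4 /* hypothetical number of PMBus pages */

/* One pending-event bitmask per page; zero-initialized (static storage). */
static atomic_uint pending_events[SKETCH_PAGES];

/* Called with the device lock held: only record the event, never notify. */
static void post_event(int page, unsigned int event_bits)
{
	assert(page >= 0 && page < SKETCH_PAGES);
	atomic_fetch_or(&pending_events[page], event_bits);
}

/*
 * Called from the worker, without the device lock: fetch and clear the
 * mask in one step so no event is lost or delivered twice.
 */
static unsigned int drain_events(int page)
{
	assert(page >= 0 && page < SKETCH_PAGES);
	return atomic_exchange(&pending_events[page], 0u);
}
```

Because the worker runs outside the lock, a notifier callback that calls back into a lock-taking voltage function no longer deadlocks.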
Attributes that clear sensor history are intended to be write-only.
Reading those attributes today results in reporting more or less
random values. To avoid ABI surprises, have those attributes explicitly
return 0 when reading.
Fixes: 787c095eda ("hwmon: (pmbus/core) Add support for rated attributes")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Writing those attributes is not supported, so mark them as read-only.
Prior to this change, attempts to write into these attributes returned
an error.
Mark boolean fields in struct pmbus_limit_attr and in struct
pmbus_sensor_attr as bit fields to reduce configuration data size.
The data is scanned only while probing, so performance is not a concern.
Fixes: 6f183d33a0 ("hwmon: (pmbus) Add support for peak attributes")
Reviewed-by: Sanman Pradhan <psanman@juniper.net>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Currently we execute `SET_NETDEV_DEV(dev, &priv->lowerdev->dev)` for
the virt_wifi net devices. However, unregistering a virt_wifi device in
netdev_run_todo() can happen together with the device referenced by
SET_NETDEV_DEV().
It can result in use-after-free during the ethtool operations performed
on a virt_wifi device that is currently being unregistered. Such a net
device can have the `dev.parent` field pointing to the freed memory,
but ethnl_ops_begin() calls `pm_runtime_get_sync(dev->dev.parent)`.
Let's remove SET_NETDEV_DEV for virt_wifi to avoid bugs like this:
==================================================================
BUG: KASAN: slab-use-after-free in __pm_runtime_resume+0xe2/0xf0
Read of size 2 at addr ffff88810cfc46f8 by task pm/606
Call Trace:
<TASK>
dump_stack_lvl+0x4d/0x70
print_report+0x170/0x4f3
? __pfx__raw_spin_lock_irqsave+0x10/0x10
kasan_report+0xda/0x110
? __pm_runtime_resume+0xe2/0xf0
? __pm_runtime_resume+0xe2/0xf0
__pm_runtime_resume+0xe2/0xf0
ethnl_ops_begin+0x49/0x270
ethnl_set_features+0x23c/0xab0
? __pfx_ethnl_set_features+0x10/0x10
? kvm_sched_clock_read+0x11/0x20
? local_clock_noinstr+0xf/0xf0
? local_clock+0x10/0x30
? kasan_save_track+0x25/0x60
? __kasan_kmalloc+0x7f/0x90
? genl_family_rcv_msg_attrs_parse.isra.0+0x150/0x2c0
genl_family_rcv_msg_doit+0x1e7/0x2c0
? __pfx_genl_family_rcv_msg_doit+0x10/0x10
? __pfx_cred_has_capability.isra.0+0x10/0x10
? stack_trace_save+0x8e/0xc0
genl_rcv_msg+0x411/0x660
? __pfx_genl_rcv_msg+0x10/0x10
? __pfx_ethnl_set_features+0x10/0x10
netlink_rcv_skb+0x121/0x380
? __pfx_genl_rcv_msg+0x10/0x10
? __pfx_netlink_rcv_skb+0x10/0x10
? __pfx_down_read+0x10/0x10
genl_rcv+0x23/0x30
netlink_unicast+0x60f/0x830
? __pfx_netlink_unicast+0x10/0x10
? __pfx___alloc_skb+0x10/0x10
netlink_sendmsg+0x6ea/0xbc0
? __pfx_netlink_sendmsg+0x10/0x10
? __futex_queue+0x10b/0x1f0
____sys_sendmsg+0x7a2/0x950
? copy_msghdr_from_user+0x26b/0x430
? __pfx_____sys_sendmsg+0x10/0x10
? __pfx_copy_msghdr_from_user+0x10/0x10
___sys_sendmsg+0xf8/0x180
? __pfx____sys_sendmsg+0x10/0x10
? __pfx_futex_wait+0x10/0x10
? fdget+0x2e4/0x4a0
__sys_sendmsg+0x11f/0x1c0
? __pfx___sys_sendmsg+0x10/0x10
do_syscall_64+0xe2/0x570
? exc_page_fault+0x66/0xb0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
This fix may be combined with another one in the ethtool subsystem:
https://lore.kernel.org/all/20260322075917.254874-1-alex.popov@linux.com/T/#u
Fixes: d43c65b05b ("ethtool: runtime-resume netdev parent in ethnl_ops_begin")
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Link: https://patch.msgid.link/20260324224607.374327-1-alex.popov@linux.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
btusb_work() maps the number of active SCO links to USB alternate
settings through a three-entry lookup table when CVSD traffic uses
transparent voice settings. The lookup currently indexes alts[] with
data->sco_num - 1 without first constraining sco_num to the number of
available table entries.
While the table only defines alternate settings for up to three SCO
links, data->sco_num comes from hci_conn_num() and is used directly.
Cap the lookup to the last table entry before indexing it so the
driver keeps selecting the highest supported alternate setting without
reading past alts[].
Fixes: baac6276c0 ("Bluetooth: btusb: handle mSBC audio over USB Endpoints")
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
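The capped lookup described in the btusb commit above can be sketched as below. This is a stand-alone illustration: pick_alt() is a hypothetical helper and the table values are placeholders, not a statement about the driver's actual alternate settings.

```c
#include <assert.h>
#include <stddef.h>

/* Three-entry table; the values here are illustrative placeholders. */
static const int alts[3] = { 2, 4, 5 };

/* Map a count of active SCO links (>= 1) to an alternate setting. */
static int pick_alt(unsigned int sco_num)
{
	const size_t n = sizeof(alts) / sizeof(alts[0]);
	size_t idx;

	assert(sco_num > 0);
	idx = sco_num - 1;
	if (idx >= n)
		idx = n - 1; /* cap to the last entry instead of reading past it */

	return alts[idx];
}
```

With more than three links the helper keeps returning the last entry, which matches the "highest supported alternate setting" behaviour the message describes.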
l2cap_config_req() processes CONFIG_REQ for channels in BT_CONNECTED
state to support L2CAP reconfiguration (e.g. MTU changes). However,
since both CONF_INPUT_DONE and CONF_OUTPUT_DONE are already set from
the initial configuration, the reconfiguration path falls through to
l2cap_ertm_init(), which re-initializes tx_q, srej_q, srej_list, and
retrans_list without freeing the previous allocations and sets
chan->sdu to NULL without freeing the existing skb. This leaks all
previously allocated ERTM resources.
Additionally, l2cap_parse_conf_req() does not validate the minimum
value of remote_mps derived from the RFC max_pdu_size option. A zero
value propagates to l2cap_segment_sdu() where pdu_len becomes zero,
causing the while loop to never terminate since len is never
decremented, exhausting all available memory.
Fix the double-init by skipping l2cap_ertm_init() and
l2cap_chan_ready() when the channel is already in BT_CONNECTED state,
while still allowing the reconfiguration parameters to be updated
through l2cap_parse_conf_req(). Also add a pdu_len zero check in
l2cap_segment_sdu() as a safeguard.
Fixes: 96298f6401 ("Bluetooth: L2CAP: handle l2cap config request during open state")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
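The non-terminating loop in the L2CAP commit above, and the zero check that guards it, can be modelled in a few lines. count_segments() is a hypothetical helper that only counts PDUs; it is a sketch of the loop shape, not of l2cap_segment_sdu() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Count how many PDUs are needed to carry len bytes when each PDU holds
 * at most pdu_len bytes. Returns -EINVAL for pdu_len == 0, because the
 * loop below could never make progress in that case.
 */
static long count_segments(size_t len, size_t pdu_len)
{
	long segments = 0;

	if (pdu_len == 0)
		return -EINVAL;

	while (len > 0) {
		size_t chunk = len < pdu_len ? len : pdu_len;

		assert(chunk > 0); /* guaranteed by the guard above */
		len -= chunk;
		segments++;
	}

	return segments;
}
```

Without the guard, chunk would be zero on every iteration and len would never shrink, which is the memory-exhausting loop the message describes.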
l2cap_conn_del() calls cancel_delayed_work_sync() for both info_timer
and id_addr_timer while holding conn->lock. However, the work functions
l2cap_info_timeout() and l2cap_conn_update_id_addr() both acquire
conn->lock, creating a potential AB-BA deadlock if the work is already
executing when l2cap_conn_del() takes the lock.
Move the work cancellations before acquiring conn->lock and use
disable_delayed_work_sync() to additionally prevent the works from
being rearmed after cancellation, consistent with the pattern used in
hci_conn_del().
Fixes: ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in hci_chan_del")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
btintel_hw_error() issues two __hci_cmd_sync() calls (HCI_OP_RESET
and Intel exception-info retrieval) without holding
hci_req_sync_lock(). This lets it race against
hci_dev_do_close() -> btintel_shutdown_combined(), which also runs
__hci_cmd_sync() under the same lock. When both paths manipulate
hdev->req_status/req_rsp concurrently, the close path may free the
response skb first, and the still-running hw_error path hits a
slab-use-after-free in kfree_skb().
Wrap the whole recovery sequence in hci_req_sync_lock/unlock so it
is serialized with every other synchronous HCI command issuer.
Below is the data race report and the kasan report:
BUG: data-race in __hci_cmd_sync_sk / btintel_shutdown_combined
read of hdev->req_rsp at net/bluetooth/hci_sync.c:199
by task kworker/u17:1/83:
__hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
__hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
btintel_hw_error+0x114/0x670 drivers/bluetooth/btintel.c:254
hci_error_reset+0x348/0xa30 net/bluetooth/hci_core.c:1030
write/free by task ioctl/22580:
btintel_shutdown_combined+0xd0/0x360
drivers/bluetooth/btintel.c:3648
hci_dev_close_sync+0x9ae/0x2c10 net/bluetooth/hci_sync.c:5246
hci_dev_do_close+0x232/0x460 net/bluetooth/hci_core.c:526
BUG: KASAN: slab-use-after-free in
sk_skb_reason_drop+0x43/0x380 net/core/skbuff.c:1202
Read of size 4 at addr ffff888144a738dc
by task kworker/u17:1/83:
__hci_cmd_sync_sk+0x12f2/0x1c30 net/bluetooth/hci_sync.c:200
__hci_cmd_sync+0x55/0x80 net/bluetooth/hci_sync.c:223
btintel_hw_error+0x186/0x670 drivers/bluetooth/btintel.c:260
Fixes: 973bb97e5a ("Bluetooth: btintel: Add generic function for handling hardware errors")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When the L2CAP channel mode is L2CAP_MODE_ERTM/L2CAP_MODE_STREAMING,
l2cap_publish_rx_avail will be called and LE flow credits will be sent in
l2cap_chan_rx_avail, even though the link type is ACL.
The logs in question are as follows:
> ACL Data RX: Handle 129 flags 0x02 dlen 12
L2CAP: Unknown (0x16) ident 4 len 4
40 00 ed 05
< ACL Data TX: Handle 129 flags 0x00 dlen 10
L2CAP: Command Reject (0x01) ident 4 len 2
Reason: Command not understood (0x0000)
Bluetooth: Unknown BR/EDR signaling command 0x16
Bluetooth: Wrong link type (-22)
Fixes: ce60b9231b ("Bluetooth: compute LE flow credits based on recvbuf space")
Signed-off-by: Zhang Chen <zhangchen01@kylinos.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(),
which acquires the workqueue pool->lock.
This causes a lockdep splat when call_srcu() is called with a scheduler
lock held, due to:
call_srcu() [holding pi_lock]
srcu_gp_start_if_needed()
schedule_work() -> pool->lock
workqueue_init() / create_worker() [holding pool->lock]
wake_up_process() -> try_to_wake_up() -> pi_lock
Also add irq_work_sync() to cleanup_srcu_struct() to prevent a
use-after-free if a queued irq_work fires after cleanup begins.
Tested with rcutorture SRCU-T and no lockdep warnings.
[ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work
to start process_srcu()" ]
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Since commit c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms
of SRCU-fast") we switched to SRCU in BPF. However, as BPF instrumentation
can happen basically everywhere (including where a scheduler lock is held),
call_srcu() now needs to avoid acquiring scheduler locks because
otherwise it could cause a deadlock [1]. Fix this by following what the
previous RCU Tasks Trace did: using an irq_work to delay the queuing of
the work to start process_srcu().
[boqun: Apply Joel's feedback]
[boqun: Apply Andrea's test feedback]
Reported-by: Andrea Righi <arighi@nvidia.com>
Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1]
Suggested-by: Zqiang <qiang.zhang@linux.dev>
Tested-by: Andrea Righi <arighi@nvidia.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Boqun Feng <boqun@kernel.org>
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot
parameters specify initialization-time allocation of the srcu_node
tree for statically allocated srcu_struct structures (for example, in
DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime),
init_srcu_struct_nodes() will attempt to dynamically allocate this tree
at the first run-time update-side use of this srcu_struct structure,
but while holding a raw spinlock. Because the memory allocator can
acquire non-raw spinlocks, this can result in lockdep splats.
This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used
when the first run-time update-side use of this srcu_struct structure
happens before srcu_init() is called. The actual allocation then takes
place from workqueue context at the ends of upcoming SRCU grace periods.
[boqun: Adjust the sha1 of the Fixes tag]
Fixes: 175b45ed34 ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire
to avoid unnecessary real-time latency and the absence of any reason to
use raw spinlocks. However, the recent use of SRCU in tracing as the
underlying implementation of RCU Tasks Trace means that call_srcu()
is invoked from preemption-disabled regions of code, which in turn
requires that any locks acquired by call_srcu() or its callees must be
raw spinlocks.
This commit therefore converts SRCU's spinlocks to raw spinlocks.
[boqun: Add Fixes tag]
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: c27cea4416 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Boqun Feng <boqun@kernel.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Try to be more explicit why the workqueue watchdog does not take
pool->lock by default. Spin locks are full memory barriers which
delay anything. Obviously, they would primarily delay operations
on the related worker pools.
Explain why it is enough to prevent the false positive by re-checking
the timestamp under the pool->lock.
Finally, make it clear what would be the alternative solution in
__queue_work() which is a hotter path.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Qianhai and I are GPU R&D engineers at Loongson, specializing
in kernel driver development. We understand that the current
Loongson GPU driver lacks dedicated maintenance resources
for various reasons.
As Loongson GPU driver developers, we have both the capability
and the responsibility to continuously maintain the Loongson
GPU driver, ensuring minimal impact on its users. After internal
discussions, our team has decided to recommend me and Qianhai
to take over the maintenance responsibilities, and recommend
Huacai, Mingcong and Ruoyao to help to review.
We will continue to maintain it for currently supported chips
and drive future updates according to the chip support plan.
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patch.msgid.link/20260320101012.22714-1-lvjianmin@loongson.cn
The adm1177 driver exposes the current alert threshold through
hwmon_curr_max_alarm. This violates the hwmon sysfs ABI, where
*_alarm attributes are read-only status flags and writable thresholds
must use currN_max.
The driver also stores the threshold internally in microamps, while
currN_max is defined in milliamps. Convert the threshold accordingly
on both the read and write paths.
Widen the cached threshold and related calculations to 64 bits so
that small shunt resistor values do not cause truncation or overflow.
Also use 64-bit arithmetic for the mA/uA conversions, clamp writes
to the range the hardware can represent, and propagate failures from
adm1177_write_alert_thr() instead of silently ignoring them.
Update the hwmon documentation to reflect the attribute rename and
the correct units returned by the driver.
Fixes: 09b08ac9e8 ("hwmon: (adm1177) Add ADM1177 Hot Swap Controller and Digital Power Monitor driver")
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Acked-by: Nuno Sá <nuno.sa@analog.com>
Link: https://lore.kernel.org/r/20260325051246.28262-1-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
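The unit handling described in the adm1177 commit above (threshold cached in microamps, currN_max exposed in milliamps, 64-bit math, clamped writes) can be sketched as follows. Both helpers and the max_ua parameter are hypothetical; the real hardware limit depends on the shunt resistor and is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Read path: cached threshold in microamps -> milliamps for currN_max. */
static uint64_t ua_to_ma(uint64_t ua)
{
	return ua / 1000u;
}

/*
 * Write path: milliamps from userspace -> microamps, clamped to
 * [0, max_ua]. The clamp happens before the multiply so the 64-bit
 * product cannot overflow.
 */
static uint64_t ma_to_ua_clamped(long ma, uint64_t max_ua)
{
	uint64_t ua;

	if (ma < 0)
		return 0;
	if ((uint64_t)ma > max_ua / 1000u)
		return max_ua;

	ua = (uint64_t)ma * 1000u;
	assert(ua <= max_ua);
	return ua;
}
```

Clamping rather than rejecting out-of-range writes is one possible policy; the sketch picks it because the message says writes are clamped to the representable range.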
During a 3D workload, a user reported hitting:
[ 413.361679] WARNING: drivers/gpu/drm/xe/xe_vm.c:1217 at vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe], CPU#7: vkd3d_queue/9925
[ 413.361944] CPU: 7 UID: 1000 PID: 9925 Comm: vkd3d_queue Kdump: loaded Not tainted 7.0.0-070000rc3-generic #202603090038 PREEMPT(lazy)
[ 413.361949] RIP: 0010:vm_bind_ioctl_ops_unwind+0x1e2/0x2e0 [xe]
[ 413.362074] RSP: 0018:ffffd4c25c3df930 EFLAGS: 00010282
[ 413.362077] RAX: 0000000000000000 RBX: ffff8f3ee817ed10 RCX: 0000000000000000
[ 413.362078] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 413.362079] RBP: ffffd4c25c3df980 R08: 0000000000000000 R09: 0000000000000000
[ 413.362081] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f41fbf99380
[ 413.362082] R13: ffff8f3ee817e968 R14: 00000000ffffffef R15: ffff8f43d00bd380
[ 413.362083] FS: 00000001040ff6c0(0000) GS:ffff8f4696d89000(0000) knlGS:00000000330b0000
[ 413.362085] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 413.362086] CR2: 00007ddfc4747000 CR3: 00000002e6262005 CR4: 0000000000f72ef0
[ 413.362088] PKRU: 55555554
[ 413.362089] Call Trace:
[ 413.362092] <TASK>
[ 413.362096] xe_vm_bind_ioctl+0xa9a/0xc60 [xe]
Which seems to hint that the vma we are re-inserting for the ops unwind
is either invalid or overlapping with something already inserted in the
vm. It shouldn't be invalid since this is a re-insertion, so must have
worked before. Leaving the likely culprit as something already placed
where we want to insert the vma.
Following from that, for the case where we do something like a rebind in
the middle of a vma, and one or both mapped ends are already compatible,
we skip doing the rebind of those vma and set next/prev to NULL. We also
adjust the original unmap va range, to avoid unmapping the ends.
However, if we trigger the unwind path, we end up with three va, with
the two ends never being removed and the original va range in the middle
still being the shrunken size.
If this occurs, one failure mode is when another unwind op needs to
interact with that range, which can happen with a vector of binds. For
example, if we need to re-insert something in place of the original va.
In this case the va is still the shrunken version, so when removing it
and then doing a re-insert it can overlap with the ends, which were
never removed, triggering a warning like above, plus leaving the vm in a
bad state.
With that, we need two things here:
1) Stop nuking the prev/next tracking for the skip cases. Instead
relying on checking for skip prev/next, where needed. That way on the
unwind path, we now correctly remove both ends.
2) Undo the unmap va shrinkage, on the unwind path. With the two ends
now removed the unmap va should expand back to the original size again,
before re-insertion.
v2:
- Update the explanation in the commit message, based on an actual IGT of
triggering this issue, rather than conjecture.
- Also undo the unmap shrinkage, for the skip case. With the two ends
now removed, the original unmap va range should expand back to the
original range.
v3:
- Track the old start/range separately. vma_size/start() uses the va
info directly.
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7602
Fixes: 8f33b4f054 ("drm/xe: Avoid doing rebinds")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260318100208.78097-2-matthew.auld@intel.com
(cherry picked from commit aec6969f75afbf4e01fd5fb5850ed3e9c27043ac)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Under a UML build for an upcoming series [1], I got `-Wstatic-in-inline`
for `dma_free_attrs`:
BINDGEN rust/bindings/bindings_generated.rs - due to target missing
In file included from rust/helpers/helpers.c:59:
rust/helpers/dma.c:17:2: warning: static function 'dma_free_attrs' is used in an inline function with external linkage [-Wstatic-in-inline]
17 | dma_free_attrs(dev, size, cpu_addr, dma_handle, attrs);
| ^
rust/helpers/dma.c:12:1: note: use 'static' to give inline function 'rust_helper_dma_free_attrs' internal linkage
12 | __rust_helper void rust_helper_dma_free_attrs(struct device *dev, size_t size,
| ^
| static
The issue is that `dma_free_attrs` was not marked `inline` when it was
introduced alongside the rest of the stubs.
Thus mark it.
Fixes: ed6ccf10f2 ("dma-mapping: properly stub out the DMA API for !CONFIG_HAS_DMA")
Closes: https://lore.kernel.org/rust-for-linux/20260322194616.89847-1-ojeda@kernel.org/ [1]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260325015548.70912-1-ojeda@kernel.org
If auxiliary_device_add() fails, add_adev() jumps to add_fail and calls
auxiliary_device_uninit(adev).
The auxiliary device has its release callback set to adev_release(),
which frees the containing struct mana_adev. Since adev is embedded in
struct mana_adev, the subsequent fall-through to init_fail and access
to adev->id may result in a use-after-free.
Fix this by saving the allocated auxiliary device id in a local
variable before calling auxiliary_device_add(), and use that saved id
in the cleanup path after auxiliary_device_uninit().
Fixes: a69839d432 ("net: mana: Add support for auxiliary device")
Cc: stable@vger.kernel.org
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260323165730.945365-1-lgs201920130244@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
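The "save the id before the object can be freed" pattern from the mana commit above can be sketched in user-space C. All names here (struct aux_dev, struct holder, aux_uninit(), id_release(), add_dev_failing()) are hypothetical stand-ins; the sketch simply assumes the add step fails so that the cleanup path runs.

```c
#include <assert.h>
#include <stdlib.h>

struct aux_dev { int id; };
struct holder { struct aux_dev adev; }; /* adev is embedded, as in the message */

static int last_released_id = -1; /* records which id the cleanup released */

static void id_release(int id)
{
	last_released_id = id;
}

/* Stand-in for uninit: the release callback frees the whole container. */
static void aux_uninit(struct aux_dev *adev)
{
	free((struct holder *)adev); /* valid: adev is the first member */
}

/* Models the failing add path; always returns -1 in this sketch. */
static int add_dev_failing(int id)
{
	struct holder *h = malloc(sizeof(*h));
	int saved_id;

	assert(id >= 0);
	if (!h)
		return -1;

	h->adev.id = id;
	saved_id = h->adev.id; /* copy out before anything can free h */

	/* Pretend the add step failed here. */
	aux_uninit(&h->adev);  /* h is freed; h->adev.id must not be read */
	id_release(saved_id);  /* safe: uses the local copy only */
	return -1;
}
```

Reading h->adev.id after aux_uninit() would be the use-after-free; the local copy removes that dependency.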
When codel_dequeue() finds an empty queue, it resets vars->dropping
but does not reset vars->first_above_time. The reference CoDel
algorithm (Nichols & Jacobson, ACM Queue 2012) resets both:
dodeque_result codel_queue_t::dodeque(time_t now) {
...
if (r.p == NULL) {
first_above_time = 0; // <-- Linux omits this
}
...
}
Note that codel_should_drop() does reset first_above_time when called
with a NULL skb, but codel_dequeue() returns early before ever calling
codel_should_drop() in the empty-queue case. The post-drop code paths
do reach codel_should_drop(NULL) and correctly reset the timer, so a
dropped packet breaks the cycle -- but the next delivered packet
re-arms first_above_time and the cycle repeats.
For sparse flows such as ICMP ping (one packet every 200ms-1s), the
first packet arms first_above_time, the flow goes empty, and the
second packet arrives after the interval has elapsed and gets dropped.
The pattern repeats, producing sustained loss on flows that are not
actually congested.
Test: veth pair, fq_codel, BQL disabled, 30000 iptables rules in the
consumer namespace (NAPI-64 cycle ~14ms, well above fq_codel's 5ms
target), ping at 5 pps under UDP flood:
Before fix: 26% ping packet loss
After fix: 0% ping packet loss
Fix by resetting first_above_time to zero in the empty-queue path
of codel_dequeue(), matching the reference algorithm.
Fixes: 76e3cc126b ("codel: Controlled Delay AQM")
Fixes: d068ca2ae2 ("codel: split into multiple files")
Co-developed-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reported-by: Chris Arges <carges@cloudflare.com>
Tested-by: Jonas Köppeler <j.koeppeler@tu-berlin.de>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/all/20260318134826.1281205-7-hawk@kernel.org/
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260323174920.253526-1-hawk@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
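The sparse-flow behaviour described in the CoDel commit above can be reproduced with a small model. This is a simplified sketch, not the kernel's codel implementation: the struct, the helper names, the apply_fix switch, and the constants (5 and 100 time units for target and interval) are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TARGET   5u   /* hypothetical target sojourn time */
#define SKETCH_INTERVAL 100u /* hypothetical interval */

struct codel_sketch {
	uint64_t first_above_time; /* 0 means "not armed" */
};

/* True when the sojourn time has stayed above target for a full interval. */
static bool above_too_long(struct codel_sketch *v, uint64_t now,
			   uint64_t sojourn)
{
	assert(v);
	if (sojourn < SKETCH_TARGET) {
		v->first_above_time = 0;
		return false;
	}
	if (v->first_above_time == 0) {
		v->first_above_time = now + SKETCH_INTERVAL; /* arm the timer */
		return false;
	}
	return now >= v->first_above_time;
}

/* Empty-queue path; apply_fix selects the behaviour with the reset. */
static void queue_went_empty(struct codel_sketch *v, bool apply_fix)
{
	assert(v);
	if (apply_fix)
		v->first_above_time = 0;
}
```

In this model, a packet at time 1 arms the timer, the queue empties, and a second packet at time 201 is flagged only when the reset is skipped.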
The driver does not explicitly configure the MAC duplex mode when
bringing the link up. As a result, the MAC may retain a stale duplex
setting from a previous link state, leading to duplex mismatches with
the link partner and degraded network performance.
Update lan743x_phylink_mac_link_up() to set or clear the MAC_CR_DPX_
bit according to the negotiated duplex mode.
This ensures the MAC configuration is consistent with the phylink
resolved state.
Fixes: a5f199a8d8 ("net: lan743x: Migrate phylib to phylink")
Signed-off-by: Thangaraj Samynathan <thangaraj.s@microchip.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20260323065345.144915-1-thangaraj.s@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
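The set-or-clear logic from the lan743x commit above can be sketched as a pure function on a register value. SKETCH_MAC_CR_DPX and apply_duplex() are hypothetical; in particular the bit position is a placeholder and is not taken from the hardware documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAC_CR_DPX (1u << 3) /* hypothetical bit position */

/* Return mac_cr with the duplex bit forced to match the resolved state. */
static uint32_t apply_duplex(uint32_t mac_cr, bool full_duplex)
{
	if (full_duplex)
		mac_cr |= SKETCH_MAC_CR_DPX;
	else
		mac_cr &= ~SKETCH_MAC_CR_DPX;

	/* The bit now reflects the argument, whatever the stale value was. */
	assert(((mac_cr & SKETCH_MAC_CR_DPX) != 0) == full_duplex);
	return mac_cr;
}
```

Handling both branches is the point: only setting the bit on full duplex would leave a stale value in place after a half-duplex link comes up.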
A UAF issue occurs when the virtio_net driver is configured with napi_tx=N
and the device's IFF_XMIT_DST_RELEASE flag is cleared
(e.g., during the configuration of tc route filter rules).
When IFF_XMIT_DST_RELEASE is removed from the net_device, the network stack
expects the driver to hold the reference to skb->dst until the packet
is fully transmitted and freed. In virtio_net with napi_tx=N,
skbs may remain in the virtio transmit ring for an extended period.
If the network namespace is destroyed while these skbs are still pending,
the corresponding dst_ops structure has been freed. When a subsequent packet
is transmitted, free_old_xmit() is triggered to clean up old skbs.
It then calls dst_release() on the skb associated with the stale dst_entry.
Since the dst_ops (referenced by the dst_entry) has already been freed,
a UAF kernel paging request occurs.
Fix it by adding skb_dst_drop(skb) in start_xmit() to explicitly release
the dst reference before the skb is queued in virtio_net.
Call Trace:
Unable to handle kernel paging request at virtual address ffff80007e150000
CPU: 2 UID: 0 PID: 6236 Comm: ping Kdump: loaded Not tainted 7.0.0-rc1+ #6 PREEMPT
...
percpu_counter_add_batch+0x3c/0x158 lib/percpu_counter.c:98 (P)
dst_release+0xe0/0x110 net/core/dst.c:177
skb_release_head_state+0xe8/0x108 net/core/skbuff.c:1177
sk_skb_reason_drop+0x54/0x2d8 net/core/skbuff.c:1255
dev_kfree_skb_any_reason+0x64/0x78 net/core/dev.c:3469
napi_consume_skb+0x1c4/0x3a0 net/core/skbuff.c:1527
__free_old_xmit+0x164/0x230 drivers/net/virtio_net.c:611 [virtio_net]
free_old_xmit drivers/net/virtio_net.c:1081 [virtio_net]
start_xmit+0x7c/0x530 drivers/net/virtio_net.c:3329 [virtio_net]
...
Reproduction Steps:
NETDEV="enp3s0"
config_qdisc_route_filter() {
tc qdisc del dev $NETDEV root
tc qdisc add dev $NETDEV root handle 1: prio
tc filter add dev $NETDEV parent 1:0 \
protocol ip prio 100 route to 100 flowid 1:1
ip route add 192.168.1.100/32 dev $NETDEV realm 100
}
test_ns() {
ip netns add testns
ip link set $NETDEV netns testns
ip netns exec testns ifconfig $NETDEV 10.0.32.46/24
ip netns exec testns ping -c 1 10.0.32.1
ip netns del testns
}
config_qdisc_route_filter
test_ns
sleep 2
test_ns
Fixes: f2fc6a5458 ("[NETNS][IPV6] route6 - move ip6_dst_ops inside the network namespace")
Cc: stable@vger.kernel.org
Signed-off-by: xietangxin <xietangxin@yeah.net>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Fixes: 0287587884 ("net: better IFF_XMIT_DST_RELEASE support")
Link: https://patch.msgid.link/20260312025406.15641-1-xietangxin@yeah.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, .fadvise() doesn't work well if page cache sharing is on
since shared inodes belong to a pseudo fs generated with init_pseudo(),
and sb->s_bdi is the default one &noop_backing_dev_info.
generic_fadvise() then behaves as a no-op because sb->s_bdi is
&noop_backing_dev_info. However, like the bdev fs (which changes
inode_to_bdi() instead), this pseudo fs is actually NOT a pure memfs.
Let's generate a real bdi for erofs_ishare_mnt instead.
Fixes: d86d7817c0 ("erofs: implement .fadvise for page cache share")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull Kbuild fixes from Nathan Chancellor:
"This mostly addresses some issues with the awk conversion in
scripts/kconfig/merge_config.sh.
- Fix typo to ensure .builtin-dtbs.S is properly cleaned
- Fix '==' bashism in scripts/kconfig/merge_config.sh
- Fix awk error in scripts/kconfig/merge_config.sh when base
configuration is empty
- Fix inconsistent indentation in scripts/kconfig/merge_config.sh"
* tag 'kbuild-fixes-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
scripts: kconfig: merge_config.sh: fix indentation
scripts: kconfig: merge_config.sh: pass output file as awk variable
scripts: kconfig: merge_config.sh: fix unexpected operator warning
kbuild: Delete .builtin-dtbs.S when running make clean
alarm_timer_forward() passes arguments to alarm_forward() in the wrong
order:
alarm_forward(alarm, timr->it_interval, now);
However, alarm_forward() is defined as:
u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval);
and uses the second argument as the current time:
delta = ktime_sub(now, alarm->node.expires);
Passing the interval as "now" results in incorrect delta computation,
which can lead to missed expirations or incorrect overrun accounting.
This issue has been present since the introduction of
alarm_timer_forward().
Fix this by swapping the arguments.
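The effect of the swapped arguments can be shown with a simplified user-space model. This is a sketch, not the kernel implementation: toy_ktime_t, struct toy_alarm and toy_alarm_forward() are hypothetical stand-ins that keep only the (alarm, now, interval) parameter order and the delta computation quoted above.

```c
#include <stdint.h>

typedef int64_t toy_ktime_t;	/* hypothetical stand-in for ktime_t */

struct toy_alarm {
	toy_ktime_t expires;
};

/* Same parameter order as the quoted prototype: (alarm, now, interval). */
static uint64_t toy_alarm_forward(struct toy_alarm *alarm, toy_ktime_t now,
				  toy_ktime_t interval)
{
	toy_ktime_t delta = now - alarm->expires;
	uint64_t overrun;

	if (delta < 0 || interval <= 0)
		return 0;

	/* Number of whole intervals missed, plus the one being armed now. */
	overrun = (uint64_t)(delta / interval) + 1;
	alarm->expires += (toy_ktime_t)overrun * interval;
	return overrun;
}
```

For an alarm that expired at 1000 with an interval of 1000 and a current time of 3500, the correct call reports 3 overruns and re-arms at 4000. Passing the interval as "now" instead reports 1 overrun and re-arms at 4500.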
Fixes: e7561f1633 ("alarmtimer: Implement forward callback")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
test_cgcore_populated (test_core) and test_cgkill_{simple,tree,forkbomb}
(test_kill) check cgroup.events "populated 0" immediately after reaping
child tasks with waitpid(). This used to work because cgroup_task_exit() in
do_exit() unlinked tasks from css_sets before exit_notify() woke up
waitpid().
d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done
switching out") moved the unlink to cgroup_task_dead() in
finish_task_switch(), which runs after exit_notify(). The populated counter
is now decremented after the parent's waitpid() can return, so there is no
longer a synchronous ordering guarantee. On PREEMPT_RT, where
cgroup_task_dead() is further deferred through lazy irq_work, the race
window is even larger.
The synchronous populated transition was never part of the cgroup interface
contract - it was an implementation artifact. Use cg_read_strcmp_wait() which
retries for up to 1 second, matching what these tests actually need to
verify: that the cgroup eventually becomes unpopulated after all tasks exit.
Fixes: d245698d72 ("cgroup: Defer task cgroup unlink until after the task is done switching out")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Christian Brauner <brauner@kernel.org>
Cc: cgroups@vger.kernel.org
a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING
tasks from cgroup.procs so that systemd doesn't see tasks that have already
been reaped via waitpid(). However, the populated counter (nr_populated_csets)
is only decremented when the task later passes through cgroup_task_dead() in
finish_task_switch(). This means cgroup.procs can appear empty while the
cgroup is still populated, causing rmdir to fail with -EBUSY.
Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the
cgroup is populated but all remaining tasks have PF_EXITING set (the task
iterator returns none due to the existing filter), wait for a kick from
cgroup_task_dead() and retry. The wait is brief as tasks are removed from the
cgroup's css_set between PF_EXITING assertion in do_exit() and
cgroup_task_dead() in finish_task_switch().
v2: cgroup_is_populated() true to false transition happens under css_set_lock
not cgroup_mutex, so retest under css_set_lock before sleeping to avoid
missed wakeups (Sebastian).
Fixes: a72f73c4dd ("cgroup: Don't expose dead tasks in cgroup")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Bert Karwatzki <spasswolf@web.de>
Cc: Michal Koutny <mkoutny@suse.com>
Cc: cgroups@vger.kernel.org
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Clear the pending exception state from a vcpu coming out of reset,
as it could otherwise affect the first instruction executed in the
guest
- Fix pointer arithmetic in address translation emulation, so that
the Hardware Access bit is set on the correct PTE instead of some
other location
s390:
- Fix deadlock in new memory management
- Properly handle kernel faults on donated memory
- Fix bounds checking for irq routing, with selftest
- Fix invalid machine checks and log all of them"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm64: Fix the descriptor address in __kvm_at_swap_desc()
KVM: s390: vsie: Avoid injecting machine check on signal
KVM: s390: log machine checks more aggressively
KVM: s390: selftests: Add IRQ routing address offset tests
KVM: s390: Limit adapter indicator access to mapped page
s390/mm: Add missing secure storage access fixups for donated memory
KVM: arm64: Discard PC update state on vcpu reset
KVM: s390: Fix a deadlock
Add LANDLOCK_RESTRICT_SELF_TSYNC to the backwards compatibility example
for restrict flags. This introduces completeness, similar to that of
the ruleset attributes example. However, as the new example can impact
enforcement in certain cases, an appropriate warning is also included.
Additionally, I modified the two comments of the example to make them
more consistent with the ruleset attributes example's.
Signed-off-by: Panagiotis "Ivory" Vasilopoulos <git@n0toose.net>
Co-developed-by: Dan Cojocaru <dan@dcdev.ro>
Signed-off-by: Dan Cojocaru <dan@dcdev.ro>
Reviewed-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20260304-landlock-docs-add-tsync-example-v4-1-819a276f05c5@n0toose.net
[mic: Update date, improve comments consistency, fix newline issue]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Pull Compute Express Link (CXL) fixes from Dave Jiang:
- Adjust the startup priority of cxl_pmem to be higher than that of
cxl_acpi
- Use proper endpoint validity check upon sanitize
- Avoid incorrect DVSEC fallback when HDM decoders are enabled
- Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch
- Fix leakage in __construct_region()
- Fix use after free of parent_port in cxl_detach_ep()
* tag 'cxl-fixes-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Adjust the startup priority of cxl_pmem to be higher than that of cxl_acpi
cxl/mbox: Use proper endpoint validity check upon sanitize
cxl/hdm: Avoid incorrect DVSEC fallback when HDM decoders are enabled
cxl/acpi: Fix CXL_ACPI and CXL_PMEM Kconfig tristate mismatch
cxl/region: Fix leakage in __construct_region()
cxl/port: Fix use after free of parent_port in cxl_detach_ep()
During a GPU page fault, the driver restores the SVM range and then maps it
into the GPU page tables. The current implementation passes a GPU-page-size
(4K-based) PFN to svm_range_restore_pages() to restore the range.
SVM ranges are tracked using system-page-size PFNs. On systems where the
system page size is larger than 4K, using GPU-page-size PFNs to restore the
range causes two problems:
Range lookup fails:
Because the restore function receives PFNs in GPU (4K) units, the SVM
range lookup does not find the existing range. This will result in a
duplicate SVM range being created.
VMA lookup failure:
The restore function also tries to locate the VMA for the faulting address.
It converts the GPU-page-size PFN into an address using the system page
size, which results in an incorrect address on non-4K page-size systems.
As a result, the VMA lookup fails with the message: "address 0xxxx VMA is
removed".
This patch passes the system-page-size PFN to svm_range_restore_pages() so
that the SVM range is restored correctly on non-4K page systems.
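The unit mismatch can be sketched in user space. The helper names below (toy_gpu_pfn_to_sys_pfn(), toy_pfn_to_addr()) are hypothetical and not taken from the driver; the 64K system page size used in the discussion is just an example of a non-4K configuration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GPU_PAGE_SHIFT 12u	/* 4K GPU pages, as described above */

/* Convert a PFN in 4K GPU units to a PFN in system-page units. */
static uint64_t toy_gpu_pfn_to_sys_pfn(uint64_t gpu_pfn,
				       unsigned int sys_page_shift)
{
	assert(sys_page_shift >= TOY_GPU_PAGE_SHIFT);
	return gpu_pfn >> (sys_page_shift - TOY_GPU_PAGE_SHIFT);
}

/* Turn a PFN back into an address using the given page shift. */
static uint64_t toy_pfn_to_addr(uint64_t pfn, unsigned int page_shift)
{
	return pfn << page_shift;
}
```

With 64K system pages (shift 16), the fault address 0x12340000 has GPU PFN 0x12340 but system PFN 0x1234. Treating the GPU PFN as a system PFN yields 0x123400000, which is why both the range lookup and the VMA lookup miss.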
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 074fe395fb13247b057f60004c7ebcca9f38ef46)
Forcibly disable the OD_FAN_CURVE feature when temperature or PWM range is invalid,
otherwise PMFW will reject this configuration on smu v14.0.2/14.0.3.
example:
$ sudo cat /sys/bus/pci/devices/<BDF>/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 0C 0%
1: 0C 0%
2: 0C 0%
3: 0C 0%
4: 0C 0%
OD_RANGE:
FAN_CURVE(hotspot temp): 0C 0C
FAN_CURVE(fan speed): 0% 0%
$ echo "0 50 40" | sudo tee fan_curve
kernel log:
[ 969.761627] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
[ 1010.897800] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ab4905d466b60f170d85e19ca2a5d2b159aeb780)
Cc: stable@vger.kernel.org
In kfd_ioctl_create_process(), the pointer 'p' is used before checking
if it is NULL.
The code accesses p->context_id before validating 'p'. This can lead
to a possible NULL pointer dereference.
Move the NULL check before using 'p' so that the pointer is validated
before access.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_chardev.c:3177 kfd_ioctl_create_process() warn: variable dereferenced before check 'p' (see line 3174)
Fixes: cc6b66d661 ("amdkfd: introduce new ioctl AMDKFD_IOC_CREATE_PROCESS")
Cc: Zhu Lingshan <lingshan.zhu@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 19d4149b22f57094bfc4b86b742381b3ca394ead)
amdgpu_amdkfd_submit_ib() submits a GPU job and gets a fence
from amdgpu_ib_schedule(). This fence is used to wait for job
completion.
Currently, the code drops the fence reference using dma_fence_put()
before calling dma_fence_wait().
If dma_fence_put() releases the last reference, the fence may be
freed before dma_fence_wait() is called. This can lead to a
use-after-free.
Fix this by waiting on the fence first and releasing the reference
only after dma_fence_wait() completes.
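The ordering problem can be modeled without touching freed memory. In this sketch, struct toy_fence and the toy_* helpers are hypothetical; a "released" flag stands in for the actual free so that the misuse is observable instead of being undefined behavior.

```c
#include <stdbool.h>

/* Hypothetical reference-counted object standing in for a fence. */
struct toy_fence {
	int refcount;
	bool released;	/* set when the last reference is dropped */
};

static void toy_fence_put(struct toy_fence *f)
{
	if (--f->refcount == 0)
		f->released = true;
}

/* Returns 0 on success, -1 if the fence was already released. */
static int toy_fence_wait(const struct toy_fence *f)
{
	return f->released ? -1 : 0;
}

/* Old order: drop the reference, then wait on the fence. */
static int toy_submit_buggy(struct toy_fence *f)
{
	toy_fence_put(f);
	return toy_fence_wait(f);
}

/* New order: wait first, drop the reference afterwards. */
static int toy_submit_fixed(struct toy_fence *f)
{
	int ret = toy_fence_wait(f);

	toy_fence_put(f);
	return ret;
}
```

When the caller holds the only reference, the old order waits on an already released object, while the new order waits on a live one and releases it afterwards.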
Fixes the below:
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c:697 amdgpu_amdkfd_submit_ib() warn: passing freed memory 'f' (line 696)
Fixes: 9ae55f030d ("drm/amdgpu: Follow up change to previous drm scheduler change.")
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8b9e5259adc385b61a6590a13b82ae0ac2bd3482)
When ec_install_handlers() returns -EPROBE_DEFER on reduced-hardware
platforms, it has already started the EC and installed the address
space handler with the struct acpi_ec pointer as handler context.
However, acpi_ec_setup() propagates the error without any cleanup.
The caller acpi_ec_add() then frees the struct acpi_ec for non-boot
instances, leaving a dangling handler context in ACPICA.
Any subsequent AML evaluation that accesses an EC OpRegion field
dispatches into acpi_ec_space_handler() with the freed pointer,
causing a use-after-free:
BUG: KASAN: slab-use-after-free in mutex_lock (kernel/locking/mutex.c:289)
Write of size 8 at addr ffff88800721de38 by task init/1
Call Trace:
<TASK>
mutex_lock (kernel/locking/mutex.c:289)
acpi_ec_space_handler (drivers/acpi/ec.c:1362)
acpi_ev_address_space_dispatch (drivers/acpi/acpica/evregion.c:293)
acpi_ex_access_region (drivers/acpi/acpica/exfldio.c:246)
acpi_ex_field_datum_io (drivers/acpi/acpica/exfldio.c:509)
acpi_ex_extract_from_field (drivers/acpi/acpica/exfldio.c:700)
acpi_ex_read_data_from_field (drivers/acpi/acpica/exfield.c:327)
acpi_ex_resolve_node_to_value (drivers/acpi/acpica/exresolv.c:392)
</TASK>
Allocated by task 1:
acpi_ec_alloc (drivers/acpi/ec.c:1424)
acpi_ec_add (drivers/acpi/ec.c:1692)
Freed by task 1:
kfree (mm/slub.c:6876)
acpi_ec_add (drivers/acpi/ec.c:1751)
The bug triggers on reduced-hardware EC platforms (ec->gpe < 0)
when the GPIO IRQ provider defers probing. Once the stale handler
exists, any unprivileged sysfs read that causes AML to touch an
EC OpRegion (battery, thermal, backlight) exercises the dangling
pointer.
Fix this by calling ec_remove_handlers() in the error path of
acpi_ec_setup() before clearing first_ec. ec_remove_handlers()
checks each EC_FLAGS_* bit before acting, so it is safe to call
regardless of how far ec_install_handlers() progressed:
-ENODEV (handler not installed): only calls acpi_ec_stop()
-EPROBE_DEFER (handler installed): removes handler, stops EC
Fixes: 03e9a0e057 ("ACPI: EC: Consolidate event handler installation code")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260324165458.1337233-2-bestswngs@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
KVM/arm64 fixes for 7.0, take #4
- Clear the pending exception state from a vcpu coming out of
reset, as it could otherwise affect the first instruction
executed in the guest.
- Fix the address translation emulation code to set the Hardware
Access bit on the correct PTE instead of some other location.
Pull MM fixes from Andrew Morton:
"6 hotfixes. 2 are cc:stable. All are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-23-17-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/damon/stat: monitor all System RAM resources
mm/zswap: add missing kunmap_local()
mailmap: update email address for Muhammad Usama Anjum
zram: do not slot_free() written-back slots
mm/damon/core: avoid use of half-online-committed context
mm/rmap: clear vma->anon_vma on error
Refine the description to better highlight its features and use cases.
In addition, add instructions for building it as a module and clarify
the compression option.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix parsing 'overwrite' in command line event definitions in
big-endian machines by writing correct union member
- Fix finding default metric in 'perf stat'
- Fix relative paths for including headers in 'perf kvm stat'
- Sync header copies with the kernel sources: msr-index.h, kvm,
build_bug.h
* tag 'perf-tools-fixes-for-v7.0-2-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
tools headers: Synchronize linux/build_bug.h with the kernel sources
tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools arch x86: Sync the msr-index.h copy with the kernel sources
perf kvm stat: Fix relative paths for including headers
perf parse-events: Fix big-endian 'overwrite' by writing correct union member
perf metricgroup: Fix metricgroup__has_metric_or_groups()
tools headers: Skip arm64 cputype.h check
Pull media fixes from Mauro Carvalho Chehab:
- rkvdec: fix stack usage with clang and improve handling missing
short/long term RPS
- synopsys: fix a Kconfig issue and an out-of-bounds check
- verisilicon: Fix kernel panic due to __initconst misuse
- media core: serialize REINIT and REQBUFS with req_queue_mutex
* tag 'media/v7.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: verisilicon: Fix kernel panic due to __initconst misuse
media: rkvdec: reduce stack usage in rkvdec_init_v4l2_vp9_count_tbl()
media: rkvdec: reduce excessive stack usage in assemble_hw_pps()
media: rkvdec: Improve handling missing short/long term RPS
media: mc, v4l2: serialize REINIT and REQBUFS with req_queue_mutex
media: synopsys: csi2rx: add missing kconfig dependency
media: synopsys: csi2rx: fix out-of-bounds check for formats array
The implicit FILEID_INO32_GEN encoder was changed to be explicit,
so we need to fix the detection.
When mounting overlayfs with upperdir and lowerdir on different ext4
filesystems, the expected kmsg log is:
overlayfs: "xino" feature enabled using 32 upper inode bits.
But instead, since the regressing commit, the kmsg log was:
overlayfs: "xino" feature enabled using 2 upper inode bits.
Fixes: e21fc2038c ("exportfs: make ->encode_fh() a mandatory method for NFS export")
Cc: stable@vger.kernel.org # v6.7+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
The memcpy function assumes the dynamic array notif->matches is at least
as large as the number of bytes to copy. Otherwise, results->matches may
contain unwanted data. To guarantee safety, extend the validation in one
of the checks to ensure sufficient packet length.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Cc: stable@vger.kernel.org
Fixes: 5ac54afd4d ("wifi: iwlwifi: mvm: Add handling for scan offload match info notification")
Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru>
Link: https://patch.msgid.link/20260207150335.1013646-1-a.velichayshiy@ispras.ru
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Also note that we do not enable the driver_override feature of struct
bus_type, as SPI - in contrast to most other buses - passes "" to
sysfs_emit() when the driver_override pointer is NULL, thus printing
"\n" instead of "(null)\n".
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 5039563e7c ("spi: Add driver_override SPI device attribute")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Link: https://patch.msgid.link/20260324005919.2408620-12-dakr@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
cputemp_is_visible() validates the channel index against
CPUTEMP_CHANNEL_NUMS, but currently uses '>' instead of '>='.
As a result, channel == CPUTEMP_CHANNEL_NUMS is not rejected even though
valid indices are 0 .. CPUTEMP_CHANNEL_NUMS - 1.
Fix the bounds check by using '>=' so invalid channel indices are
rejected before indexing the core bitmap.
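The off-by-one can be seen in a minimal sketch. TOY_CHANNEL_NUMS and the two helpers are hypothetical; the value 8 is an arbitrary assumption and not the driver's CPUTEMP_CHANNEL_NUMS.

```c
#include <stdbool.h>

#define TOY_CHANNEL_NUMS 8u	/* hypothetical count; valid indices are 0..7 */

/* Old check: only rejects indices strictly greater than the count. */
static bool toy_channel_valid_buggy(unsigned int channel)
{
	return !(channel > TOY_CHANNEL_NUMS);
}

/* Fixed check: rejects the count itself as well. */
static bool toy_channel_valid_fixed(unsigned int channel)
{
	return !(channel >= TOY_CHANNEL_NUMS);
}
```

Only the index equal to the count is treated differently by the two checks, which is exactly the one-past-the-end case.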
Fixes: bf3608f338 ("hwmon: peci: Add cputemp driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260323002352.93417-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The hwmon sysfs ABI expects tempN_crit_hyst to report the temperature at
which the critical condition clears, not the hysteresis delta from the
critical limit.
The peci cputemp driver currently returns tjmax - tcontrol for
crit_hyst_type, which is the hysteresis margin rather than the
corresponding absolute temperature.
Return tcontrol directly, and update the documentation accordingly.
Fixes: bf3608f338 ("hwmon: peci: Add cputemp driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260323002352.93417-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The custom avs0_enable and avs1_enable sysfs attributes access PMBus
registers through the exported API helpers (pmbus_read_byte_data,
pmbus_read_word_data, pmbus_write_word_data, pmbus_update_byte_data)
without holding the PMBus update_lock mutex. These exported helpers do
not acquire the mutex internally, unlike the core's internal callers
which hold the lock before invoking them.
The store callback is especially vulnerable: it performs a multi-step
read-modify-write sequence (read VOUT_COMMAND, write VOUT_COMMAND, then
update OPERATION) where concurrent access from another thread could
interleave and corrupt the register state.
Add pmbus_lock_interruptible()/pmbus_unlock() around both the show and
store callbacks to serialize PMBus register access with the rest of the
driver.
Fixes: 038a9c3d1e ("hwmon: (pmbus/isl68137) Add driver for Intersil ISL68137 PWM Controller")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260319173055.125271-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
ina233_read_word_data() reads MFR_READ_VSHUNT via pmbus_read_word_data()
but has two issues:
1. The return value is not checked for errors before being used in
arithmetic. A negative error code from a failed I2C transaction is
passed directly to DIV_ROUND_CLOSEST(), producing garbage data.
2. MFR_READ_VSHUNT is a 16-bit two's complement value. Negative shunt
voltages (values with bit 15 set) are treated as large positive
values since pmbus_read_word_data() returns them zero-extended in an
int. This leads to incorrect scaling in the VIN coefficient
conversion.
Fix both issues by adding an error check, casting to s16 for proper
sign extension, and clamping the result to a valid non-negative range.
The clamp is necessary because read_word_data callbacks must return
non-negative values on success (negative values indicate errors to the
pmbus core).
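The error check and the sign extension can be sketched as below. The helper names are hypothetical, and the VIN coefficient scaling and the final clamp are deliberately not modeled because their exact form is not given here.

```c
/* Interpret the low 16 bits of a zero-extended word as two's complement. */
static int toy_sign_extend16(int word)
{
	int v = word & 0xffff;

	return (v & 0x8000) ? v - 0x10000 : v;
}

/*
 * word_or_err is either a non-negative 16-bit register word or a
 * negative error code. On success, store the signed value in *out
 * and return 0; on error, return the code and leave *out untouched.
 */
static int toy_read_vshunt(int word_or_err, int *out)
{
	if (word_or_err < 0)
		return word_or_err;	/* never feed an error into arithmetic */

	*out = toy_sign_extend16(word_or_err);
	return 0;
}
```

A raw word of 0xFFFF is read as -1 rather than 65535, and a failed read is propagated instead of being scaled.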
Fixes: b64b6cb163 ("hwmon: Add driver for TI INA233 Current and Power Monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260319173055.125271-2-sanman.pradhan@hpe.com
[groeck: Fixed clamp to avoid losing the sign bit]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Jeff Johnson says:
==================
ath.git update for v7.0-rc6
For both ath11k and ath12k use the correct TID when stopping an AMPDU
session.
==================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Miri Korenblit says:
====================
wifi: iwlwifi: fixes - 2026-03-24
- Fix MLO scan timing (record the scan start in FW)
- don't send a 6E related command when not supported
- correctly set wifi generation data
====================
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
wl1251_tx_packet_cb() uses the firmware completion ID directly to index
the fixed 16-entry wl->tx_frames[] array. The ID is a raw u8 from the
completion block, and the callback does not currently verify that it
fits the array before dereferencing it.
Reject completion IDs that fall outside wl->tx_frames[] and keep the
existing NULL check in the same guard. This keeps the fix local to the
trust boundary and avoids touching the rest of the completion flow.
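A combined range and NULL guard of this kind can be sketched as follows. toy_tx_frame_lookup() and TOY_NUM_TX_FRAMES are hypothetical names; only the 16-entry size and the u8 ID come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_TX_FRAMES 16	/* fixed array size, as described above */

/* Return the frame for a completion ID, or NULL if the ID is unusable. */
static void *toy_tx_frame_lookup(void *frames[TOY_NUM_TX_FRAMES], uint8_t id)
{
	/* Range check first, so the array is never indexed out of bounds. */
	if (id >= TOY_NUM_TX_FRAMES || frames[id] == NULL)
		return NULL;

	return frames[id];
}
```

The short-circuit evaluation matters: the slot is only read once the ID is known to be inside the array.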
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
Link: https://patch.msgid.link/20260323080845.40033-1-pengpeng@iscas.ac.cn
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The variable valuesize is declared as u8 but accumulates the total
length of all SSIDs to scan. Each SSID contributes up to 33 bytes
(IEEE80211_MAX_SSID_LEN + 1), and with WILC_MAX_NUM_PROBED_SSID (10)
SSIDs the total can reach 330, which wraps around to 74 when stored
in a u8.
This causes kmalloc to allocate only 75 bytes while the subsequent
memcpy writes up to 331 bytes into the buffer, resulting in a 256-byte
heap buffer overflow.
Widen valuesize from u8 to u32 to accommodate the full range.
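The wraparound arithmetic can be reproduced in user space. The macro and function names below are hypothetical; only the values 33 and 10 are taken from the description above.

```c
#include <stdint.h>

#define TOY_SSID_ENTRY_LEN 33u	/* max SSID length + 1, per the text */
#define TOY_NUM_SSIDS 10u	/* max number of probed SSIDs, per the text */

/* Accumulate the total length in an 8-bit counter (old behavior). */
static uint8_t toy_total_len_u8(void)
{
	uint8_t total = 0;

	for (unsigned int i = 0; i < TOY_NUM_SSIDS; i++)
		total = (uint8_t)(total + TOY_SSID_ENTRY_LEN);	/* wraps mod 256 */
	return total;
}

/* Accumulate the total length in a 32-bit counter (fixed behavior). */
static uint32_t toy_total_len_u32(void)
{
	uint32_t total = 0;

	for (unsigned int i = 0; i < TOY_NUM_SSIDS; i++)
		total += TOY_SSID_ENTRY_LEN;
	return total;
}
```

The 8-bit counter ends at 74 while the real total is 330, so a buffer sized from the 8-bit value is 256 bytes too small.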
Fixes: c5c77ba18e ("staging: wilc1000: Add SDIO/SPI 802.11 driver")
Cc: stable@vger.kernel.org
Signed-off-by: Yasuaki Torimaru <yasuakitorimaru@gmail.com>
Link: https://patch.msgid.link/20260324100624.983458-1-yasuakitorimaru@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Steffen Klassert says:
====================
pull request (net): ipsec 2026-03-23
1) Add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi.
From Sabrina Dubroca.
2) Fix the condition on x->pcpu_num in xfrm_sa_len by using the
proper check. From Sabrina Dubroca.
3) Call xdo_dev_state_delete during state update to properly cleanup
the xdo device state. From Sabrina Dubroca.
4) Fix a potential skb leak in espintcp when async crypto is used.
From Sabrina Dubroca.
5) Validate inner IPv4 header length in IPTFS payload to avoid
parsing malformed packets. From Roshan Kumar.
6) Fix skb_put() panic on non-linear skb during IPTFS reassembly.
From Fernando Fernandez Mancera.
7) Silence various sparse warnings related to RCU, state, and policy
handling. From Sabrina Dubroca.
8) Fix work re-schedule race after cancel in xfrm_nat_keepalive_net_fini().
From Hyunwoo Kim.
9) Prevent policy_hthresh.work from racing with netns teardown by using
a proper cleanup mechanism. From Minwoo Ra.
10) Validate that the family of the source and destination addresses match
in pfkey_send_migrate(). From Eric Dumazet.
11) Only publish mode_data after the clone is setup in the IPTFS receive path.
This prevents leaving x->mode_data pointing at freed memory on error.
From Paul Moses.
Please pull or let me know if there are problems.
ipsec-2026-03-23
* tag 'ipsec-2026-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: iptfs: only publish mode_data after clone setup
af_key: validate families in pfkey_send_migrate()
xfrm: prevent policy_hthresh.work from racing with netns teardown
xfrm: Fix work re-schedule after cancel in xfrm_nat_keepalive_net_fini()
xfrm: avoid RCU warnings around the per-netns netlink socket
xfrm: add rcu_access_pointer to silence sparse warning for xfrm_input_afinfo
xfrm: policy: silence sparse warning in xfrm_policy_unregister_afinfo
xfrm: policy: fix sparse warnings in xfrm_policy_{init,fini}
xfrm: state: silence sparse warnings during netns exit
xfrm: remove rcu/state_hold from xfrm_state_lookup_spi_proto
xfrm: state: add xfrm_state_deref_prot to state_by* walk under lock
xfrm: state: fix sparse warnings around XFRM_STATE_INSERT
xfrm: state: fix sparse warnings in xfrm_state_init
xfrm: state: fix sparse warnings on xfrm_state_hold_rcu
xfrm: iptfs: fix skb_put() panic on non-linear skb during reassembly
xfrm: iptfs: validate inner IPv4 header length in IPTFS payload
esp: fix skb leak with espintcp and async crypto
xfrm: call xdo_dev_state_delete during state update
xfrm: fix the condition on x->pcpu_num in xfrm_sa_len
xfrm: add missing extack for XFRMA_SA_PCPU in add_acquire and allocspi
====================
Link: https://patch.msgid.link/20260323083440.2741292-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With traffic ongoing for data TID [TID 0], a DELBA request to
stop AMPDU for the BA session was received on management TID [TID 4].
The corresponding TID number was incorrectly passed to stop the BA session,
resulting in the BA session for data TIDs being stopped and the BA size
being reduced to 1, causing an overall dip in TCP throughput.
Fix this issue by passing the correct argument from
ath12k_dp_rx_ampdu_stop() to ath12k_dp_arch_peer_rx_tid_reo_update()
during an AMPDU stop session. Instead of passing peer->dp_peer->rx_tid,
which is the base address of the array, corresponding to TID 0, pass
the value of &peer->dp_peer->rx_tid[params->tid]. With this, the
different TID numbers are accounted for.
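The difference between passing the array base and passing the selected element can be sketched with hypothetical types. struct toy_peer, struct toy_rx_tid, TOY_NUM_TIDS and the toy_* helpers are illustrations only and do not mirror the driver's structures.

```c
#include <assert.h>

#define TOY_NUM_TIDS 17	/* hypothetical number of per-peer TID entries */

struct toy_rx_tid {
	int ba_win_sz;
};

struct toy_peer {
	struct toy_rx_tid rx_tid[TOY_NUM_TIDS];
};

static void toy_peer_init(struct toy_peer *peer, int ba_win_sz)
{
	for (int i = 0; i < TOY_NUM_TIDS; i++)
		peer->rx_tid[i].ba_win_sz = ba_win_sz;
}

static void toy_reo_update(struct toy_rx_tid *rx_tid, int ba_win_sz)
{
	rx_tid->ba_win_sz = ba_win_sz;
}

/* Old behavior: the array base always refers to TID 0. */
static void toy_ampdu_stop_buggy(struct toy_peer *peer, int tid)
{
	(void)tid;
	toy_reo_update(peer->rx_tid, 1);
}

/* Fixed behavior: update the entry that belongs to the requested TID. */
static void toy_ampdu_stop_fixed(struct toy_peer *peer, int tid)
{
	assert(tid >= 0 && tid < TOY_NUM_TIDS);
	toy_reo_update(&peer->rx_tid[tid], 1);
}
```

Stopping TID 4 with the old call shrinks the window of TID 0 and leaves TID 4 untouched; the fixed call does the opposite.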
Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.5-01651-QCAHKSWPL_SILICONZ-1
Fixes: d889913205 ("wifi: ath12k: driver for Qualcomm Wi-Fi 7 devices")
Signed-off-by: Reshma Immaculate Rajkumar <reshma.rajkumar@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260227110123.3726354-1-reshma.rajkumar@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
During ongoing traffic, a request to stop an AMPDU session
for one TID could incorrectly affect other active sessions.
This can happen because an incorrect TID reference would be
passed when updating the BA session state, causing the wrong
session to be stopped. As a result, the affected session would
be reduced to a minimal BA size, leading to a noticeable
throughput degradation.
Fix this issue by passing the correct argument from
ath11k_dp_rx_ampdu_stop() to ath11k_peer_rx_tid_reo_update()
during a stop AMPDU session. Instead of passing peer->rx_tid, which
is the base address of the array, corresponding to TID 0, pass
the value of &peer->rx_tid[params->tid], where the different TID numbers
are accounted for.
Tested-on: QCN9074 hw1.0 PCI WLAN.HK.2.9.0.1-02146-QCAHKSWPL_SILICONZ-1
Fixes: d5c65159f2 ("ath11k: driver for Qualcomm IEEE 802.11ax devices")
Signed-off-by: Reshma Immaculate Rajkumar <reshma.rajkumar@oss.qualcomm.com>
Reviewed-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com>
Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
Link: https://patch.msgid.link/20260319065608.2408179-1-reshma.rajkumar@oss.qualcomm.com
Signed-off-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
The hardware teams noticed that the originally documented workaround
steps for Wa_16025250150 may not be sufficient to fully avoid a hardware
issue. The workaround documentation has been augmented to suggest
programming one additional register; make the corresponding change in
the driver.
Fixes: 7654d51f1f ("drm/xe/xe2hpg: Add Wa_16025250150")
Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Link: https://patch.msgid.link/20260319-wa_16025250150_part2-v1-1-46b1de1a31b2@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit a31566762d4075646a8a2214586158b681e94305)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
The Rust `Regulator` abstraction uses `NonNull` to wrap the underlying
`struct regulator` pointer. When `CONFIG_REGULATOR` is disabled, the C
stub for `regulator_get` returns `NULL`. `from_err_ptr` does not treat
`NULL` as an error, so it was passed to `NonNull::new_unchecked`,
causing undefined behavior.
Fix this by using a raw pointer `*mut bindings::regulator` instead of
`NonNull`. This allows `inner` to be `NULL` when `CONFIG_REGULATOR` is
disabled, and leverages the C stubs which are designed to handle `NULL`
or are no-ops.
Fixes: 9b614ceada ("rust: regulator: add a bare minimum regulator abstraction")
Reported-by: Miguel Ojeda <ojeda@kernel.org>
Closes: https://lore.kernel.org/r/20260322193830.89324-1-ojeda@kernel.org
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Link: https://patch.msgid.link/20260324-regulator-fix-v1-1-a5244afa3c15@google.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In each MAC context, the firmware expects the wifi generation
data, i.e. whether or not HE/EHT (and in the future UHR) is
enabled on that MAC.
However, this is currently handled wrong in two ways:
- EHT is only enabled when the interface is also an MLD, but
we currently allow (despite the spec) connecting with EHT
but without MLO.
- when HE or EHT are used by TDLS peers, the firmware needs
to have them enabled regardless of the AP
Fix this by setting up the data depending on the
interface type:
- for AP, just set it according to the BSS configuration
- for monitor, set it according to HW capabilities
- otherwise, particularly for client, iterate all stations
and then their links on the interface in question and set
according to their capabilities, this handles the AP and
TDLS peers. Re-calculate this whenever a TDLS station is
marked associated or removed so that it's kept updated,
for the AP it's already updated on assoc/disassoc.
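The per-interface-type selection can be sketched as a userspace model.
All types and names here (vif_model, sta_model, gen_flags_model,
wifi_gen_for_vif()) are hypothetical and only mirror the three cases
listed above; they are not the iwlmld data structures.

```c
#include <stdbool.h>
#include <stddef.h>

enum vif_type_model { VIF_MODEL_AP, VIF_MODEL_MONITOR, VIF_MODEL_CLIENT };

/* Which wifi generations are enabled. */
struct gen_flags_model {
	bool he;
	bool eht;
};

/* A station (the AP or a TDLS peer) with capabilities per link. */
struct sta_model {
	const struct gen_flags_model *links;
	size_t n_links;
};

struct vif_model {
	enum vif_type_model type;
	struct gen_flags_model bss_conf; /* consulted for AP interfaces */
	struct gen_flags_model hw_caps;  /* consulted for monitor interfaces */
	const struct sta_model *stas;    /* AP station and TDLS peers */
	size_t n_stas;
};

/* Pick the generation flags according to the interface type. */
struct gen_flags_model wifi_gen_for_vif(const struct vif_model *vif)
{
	struct gen_flags_model out = { false, false };

	if (vif->type == VIF_MODEL_AP)
		return vif->bss_conf;
	if (vif->type == VIF_MODEL_MONITOR)
		return vif->hw_caps;

	/* Client: a generation is enabled if any link of any station has it. */
	for (size_t i = 0; i < vif->n_stas; i++) {
		for (size_t j = 0; j < vif->stas[i].n_links; j++) {
			out.he = out.he || vif->stas[i].links[j].he;
			out.eht = out.eht || vif->stas[i].links[j].eht;
		}
	}
	return out;
}
```

In this model a TDLS peer with EHT enables EHT on the interface even if
the AP itself only supports HE.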
Fixes: d1e879ec60 ("wifi: iwlwifi: add iwlmld sub-driver")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260319110722.404713b22177.Ic972b5e557d011a5438f8f97c1e793cc829e2ea9@changeid
Link: https://patch.msgid.link/20260324093333.2953495-1-miriam.rachel.korenblit@intel.com
Calculate MLO scan start time based on actual
scan start notification from firmware instead of recording
time when scan command is sent.
Currently, MLO scan start time was captured immediately
after sending the scan command to firmware. However, the
actual scan start time may differ due to the FW being busy
with a previous scan.
In that case, the link selection code will think that the MLO
scan is too old, and will warn.
To fix it, implement start scan notification handling to
capture the precise moment when firmware begins the scan
operation.
Fixes: 9324731b99 ("wifi: iwlwifi: mld: avoid selecting bad links")
Signed-off-by: Pagadala Yesu Anjaneyulu <pagadala.yesu.anjaneyulu@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260324113316.4c56b8bac533.I6e656d8cc30bb82c96aabadedd62bd67f4c46bf9@changeid
NETIF_F_IPV6_CSUM only advertises support for checksum offload of
packets without IPv6 extension headers. Packets with extension
headers must fall back onto software checksumming. Since TSO
depends on checksum offload, those must revert to GSO.
The below commit introduces that fallback. It always checks
network header length. For tunneled packets, the inner header length
must be checked instead. Extend the check accordingly.
A special case is tunneled packets without inner IP protocol, such as
RFC 6951 SCTP in UDP. Those are not standard IPv6 followed by
transport header either, so also must revert to the software GSO path.
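The resulting decision can be sketched as follows. This is a simplified
model that assumes the packet is an IPv6 TSO candidate on a device with
only NETIF_F_IPV6_CSUM; pkt_model and needs_software_gso() are
hypothetical names, not the kernel helpers.

```c
#include <stdbool.h>

#define IPV6_FIXED_HDR_LEN 40 /* fixed IPv6 header, no extension headers */

/* Hypothetical packet summary; not the kernel's sk_buff. */
struct pkt_model {
	bool encapsulated;   /* tunneled packet */
	bool has_inner_ipv6; /* tunnel payload starts with an IPv6 header */
	int outer_nhdr_len;  /* outer network header length in bytes */
	int inner_nhdr_len;  /* inner network header length in bytes */
};

/* Return true when the packet must take the software GSO path. */
bool needs_software_gso(const struct pkt_model *pkt)
{
	/* Plain packet: extension headers make the header longer than 40. */
	if (!pkt->encapsulated)
		return pkt->outer_nhdr_len != IPV6_FIXED_HDR_LEN;

	/* Tunnel without inner IP, e.g. SCTP in UDP: always software. */
	if (!pkt->has_inner_ipv6)
		return true;

	/* Tunnel with inner IPv6: check the inner header length instead. */
	return pkt->inner_nhdr_len != IPV6_FIXED_HDR_LEN;
}
```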
Cc: stable@vger.kernel.org
Fixes: 864e339697 ("net: gso: Forbid IPv6 TSO with extensions on devices with only IPV6_CSUM")
Reported-by: Tangxin Xie <xietangxin@yeah.net>
Closes: https://lore.kernel.org/netdev/0414e7e2-9a1c-4d7c-a99d-b9039cf68f40@yeah.net/
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260320190148.2409107-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Marc Kleine-Budde says:
====================
pull-request: can 2026-03-23
this is a pull request of 5 patches for net/main.
The first patch is by me and adds missing error handling to the CAN
netlink device configuration code.
Wenyuan Li contributes a patch for the mcp251x driver to add missing
error handling for power enabling in the open and resume functions.
Oliver Hartkopp's patch adds missing atomic access in hot path for the
CAN procfs statistics.
A series by Ali Norouzi and Oliver Hartkopp fixes an Out-of-Bounds
Heap R/W in the can-gw protocol and a UAF in the CAN isotp protocol.
linux-can-fixes-for-7.0-20260323
* tag 'linux-can-fixes-for-7.0-20260323' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: isotp: fix tx.buf use-after-free in isotp_sendmsg()
can: gw: fix OOB heap access in cgw_csum_crc8_rel()
can: statistics: add missing atomic access in hot path
can: mcp251x: add error handling for power enable in open and resume
can: netlink: can_changelink(): add missing error handling to call can_ctrlmode_changelink()
====================
Link: https://patch.msgid.link/20260323103224.218099-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
cppi5_hdesc_get_psdata() returns a pointer into the CPPI descriptor.
In both emac_rx_packet() and emac_rx_packet_zc(), the descriptor is
freed via k3_cppi_desc_pool_free() before the psdata pointer is used
by emac_rx_timestamp(), which dereferences psdata[0] and psdata[1].
This constitutes a use-after-free on every received packet that goes
through the timestamp path.
Defer the descriptor free until after all accesses through the psdata
pointer are complete. For emac_rx_packet(), move the free into the
requeue label so both early-exit and success paths free the descriptor
after all accesses are done. For emac_rx_packet_zc(), move the free to
the end of the loop body after emac_dispatch_skb_zc() (which calls
emac_rx_timestamp()) has returned.
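The required ordering can be sketched in userspace C. The types and
helpers (desc_model, desc_get_psdata(), rx_timestamp(),
rx_packet_fixed()) are hypothetical, malloc()/free() stand in for the
descriptor pool, and combining psdata[1] and psdata[0] into one 64-bit
value is an assumption made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor that embeds the protocol-specific data. */
struct desc_model {
	uint32_t psdata[2];
};

/* Like cppi5_hdesc_get_psdata(): returns a pointer INTO the descriptor. */
uint32_t *desc_get_psdata(struct desc_model *desc)
{
	return desc->psdata;
}

/* Consumer of the pointer; reads psdata[0] and psdata[1]. */
uint64_t rx_timestamp(const uint32_t *psdata)
{
	return ((uint64_t)psdata[1] << 32) | psdata[0];
}

/*
 * Fixed ordering: finish every access through psdata first, then free
 * the descriptor. Freeing before rx_timestamp() would be the
 * use-after-free described above.
 */
uint64_t rx_packet_fixed(struct desc_model *desc)
{
	uint32_t *psdata = desc_get_psdata(desc);
	uint64_t ts = rx_timestamp(psdata);

	free(desc);
	return ts;
}
```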
Fixes: 46eeb90f03 ("net: ti: icssg-prueth: Use page_pool API for RX buffer allocation")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260320174439.41080-1-devnexen@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Similar to commit 950803f725 ("bonding: fix type confusion in
bond_setup_by_slave()") team has the same class of header_ops type
confusion.
For non-Ethernet ports, team_setup_by_port() copies port_dev->header_ops
directly. When the team device later calls dev_hard_header() or
dev_parse_header(), these callbacks can run with the team net_device
instead of the real lower device, so netdev_priv(dev) is interpreted as
the wrong private type and can crash.
The syzbot report shows a crash in bond_header_create(), but the root
cause is in team: the topology is gre -> bond -> team, and team calls
the inherited header_ops with its own net_device instead of the lower
device, so bond_header_create() receives a team device and interprets
netdev_priv() as bonding private data, causing a type confusion crash.
Fix this by introducing team header_ops wrappers for create/parse,
selecting a team port under RCU, and calling the lower device callbacks
with port->dev, so each callback always sees the correct net_device
context.
Also pass the selected lower device to the lower parse callback, so
recursion is bounded in stacked non-Ethernet topologies and parse
callbacks always run with the correct device context.
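The wrapper idea can be sketched with a small userspace model. All names
(netdev_model, header_ops_model, bond_create_model(),
team_create_model()) are hypothetical; the real fix additionally selects
the port under RCU, which this sketch omits.

```c
#include <stddef.h>

struct netdev_model;

/* Hypothetical reduction of header_ops to a single callback. */
struct header_ops_model {
	int (*create)(struct netdev_model *dev);
};

enum dev_kind_model { DEV_MODEL_BOND, DEV_MODEL_TEAM };

struct netdev_model {
	enum dev_kind_model kind;
	const struct header_ops_model *header_ops;
	struct netdev_model *lower; /* selected port device, NULL if none */
};

/* Lower callback: only valid when it is handed a bond device. */
int bond_create_model(struct netdev_model *dev)
{
	return dev->kind == DEV_MODEL_BOND ? 0 : -1; /* -1: type confusion */
}

const struct header_ops_model bond_ops_model = {
	.create = bond_create_model,
};

/* Team wrapper: forward to the lower callback with the lower device. */
int team_create_model(struct netdev_model *team_dev)
{
	struct netdev_model *port_dev = team_dev->lower;

	if (!port_dev || !port_dev->header_ops || !port_dev->header_ops->create)
		return 0;
	return port_dev->header_ops->create(port_dev);
}

const struct header_ops_model team_ops_model = {
	.create = team_create_model,
};
```

A team device that merely copies bond_ops_model ends up calling
bond_create_model() with itself, while one using team_ops_model hands
the callback the bond device it expects.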
Fixes: 1d76efe157 ("team: add support for non-ethernet devices")
Reported-by: syzbot+3d8bc31c45e11450f24c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69b46af7.050a0220.36eb34.000e.GAE@google.com/T/
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260320072139.134249-2-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Xuan Zhuo says:
====================
virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN
The commit be50da3e9d ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.
This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, the number of
bytes from the start of the packet to the beginning of the
transport-layer payload.
However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This patch set fixes that issue.
As discussed in [0], this version checks that VIRTIO_NET_F_GUEST_HDRLEN is
negotiated.
[0]: http://lore.kernel.org/all/20251029030913.20423-1-xuanzhuo@linux.alibaba.com
v10: fix http://lore.kernel.org/all/202603122214.8Anoxrmq-lkp@intel.com
====================
Link: https://patch.msgid.link/20260320021818.111741-1-xuanzhuo@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The commit a2fb4bc4e2 ("net: implement virtio helpers to handle UDP
GSO tunneling.") introduces support for the UDP GSO tunnel feature in
virtio-net.
The virtio spec says:
If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or
VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for
all the headers up to and including the inner transport.
The commit did not update the hdr_len to include the inner transport.
I observed that the "hdr_len" is 116 for this packet:
17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898)
192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] VXLAN, flags [I] (0x08), vni 1
fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848)
192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796
116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(inner tcp)
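The sum above can be reproduced with a short helper. This is only an
arithmetic sketch for the VXLAN-over-IPv4 packet in the trace; the
constant and function names are hypothetical, and the 32-byte inner TCP
header is the 20-byte base header plus 12 bytes of options
(nop, nop, timestamp).

```c
/* Header sizes in bytes for the packet shown in the trace. */
enum {
	MODEL_ETH_HLEN = 14,
	MODEL_IPV4_HLEN = 20, /* no IP options */
	MODEL_UDP_HLEN = 8,
	MODEL_VXLAN_HLEN = 8,
};

/* hdr_len covers all headers up to and including the inner transport. */
int vxlan_tcp_hdr_len(int inner_tcp_hlen)
{
	int outer = MODEL_ETH_HLEN + MODEL_IPV4_HLEN + MODEL_UDP_HLEN +
		    MODEL_VXLAN_HLEN;
	int inner = MODEL_ETH_HLEN + MODEL_IPV4_HLEN + inner_tcp_hlen;

	return outer + inner;
}
```

vxlan_tcp_hdr_len(32) yields 116, matching the observed value.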
Fixes: a2fb4bc4e2 ("net: implement virtio helpers to handle UDP GSO tunneling.")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260320021818.111741-3-xuanzhuo@linux.alibaba.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The commit be50da3e9d ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.
This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, the number of
bytes from the start of the packet to the beginning of the
transport-layer payload.
However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This commit fixes that issue.
Fixes: be50da3e9d ("net: virtio_net: implement exact header length guest feature")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://patch.msgid.link/20260320021818.111741-2-xuanzhuo@linux.alibaba.com
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Zorro Lang reported the following lockdep splat:
"While running fstests xfs/556 on kernel 7.0.0-rc4+ (HEAD=04a9f1766954),
a lockdep warning was triggered indicating an inconsistent lock state
for sb->s_type->i_lock_key.
"The deadlock might occur because iomap_read_end_io (called from a
hardware interrupt completion path) invokes fserror_report, which then
calls igrab. igrab attempts to acquire the i_lock spinlock. However,
the i_lock is frequently acquired in process context with interrupts
enabled. If an interrupt occurs while a process holds the i_lock, and
that interrupt handler calls fserror_report, the system deadlocks.
"I hit this warning several times by running xfs/556 (mostly) or
generic/648 on xfs. More details refer to below console log."
along with this dmesg, for which I've cleaned up the stacktraces:
run fstests xfs/556 at 2026-03-18 20:05:30
XFS (sda3): Mounting V5 Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
device-mapper: table: 253:0: adding target device (start sect 209 len 1) caused an alignment inconsistency
device-mapper: table: 253:0: adding target device (start sect 210 len 62914350) caused an alignment inconsistency
buffer_io_error: 6 callbacks suppressed
Buffer I/O error on dev dm-0, logical block 209, async page read
Buffer I/O error on dev dm-0, logical block 209, async page read
XFS (dm-0): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
================================
WARNING: inconsistent lock state
7.0.0-rc4+ #1 Tainted: G S W
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
od/2368602 [HC1[1]:SC0[0]:HE0:SE1] takes:
ff1100069f2b4a98 (&sb->s_type->i_lock_key#31){?.+.}-{3:3}, at: igrab+0x28/0x1a0
{HARDIRQ-ON-W} state was registered at:
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
unlock_new_inode+0x66/0x2a0
xfs_iget+0x67b/0x7b0 [xfs]
xfs_mountfs+0xde4/0x1c80 [xfs]
xfs_fs_fill_super+0xe86/0x17a0 [xfs]
get_tree_bdev_flags+0x312/0x590
vfs_get_tree+0x8d/0x2f0
vfs_cmd_create+0xb2/0x240
__do_sys_fsconfig+0x3d8/0x9a0
do_syscall_64+0x13a/0x1520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
irq event stamp: 3118
hardirqs last enabled at (3117): [<ffffffffb54e4ad8>] _raw_spin_unlock_irq+0x28/0x50
hardirqs last disabled at (3118): [<ffffffffb54b84c9>] common_interrupt+0x19/0xe0
softirqs last enabled at (3040): [<ffffffffb290ca28>] handle_softirqs+0x6b8/0x950
softirqs last disabled at (3023): [<ffffffffb290ce4d>] __irq_exit_rcu+0xfd/0x250
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sb->s_type->i_lock_key#31);
<Interrupt>
lock(&sb->s_type->i_lock_key#31);
*** DEADLOCK ***
1 lock held by od/2368602:
#0: ff1100069f2b4b58 (&sb->s_type->i_mutex_key#19){++++}-{4:4}, at: xfs_ilock+0x324/0x4b0 [xfs]
stack backtrace:
CPU: 15 UID: 0 PID: 2368602 Comm: od Kdump: loaded Tainted: G S W 7.0.0-rc4+ #1 PREEMPT(full)
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
print_usage_bug.part.0+0x230/0x2c0
mark_lock_irq+0x3ce/0x5b0
mark_lock+0x1cb/0x3d0
mark_usage+0x109/0x120
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
igrab+0x28/0x1a0
fserror_report+0x127/0x2d0
iomap_finish_folio_read+0x13c/0x280
iomap_read_end_io+0x10e/0x2c0
clone_endio+0x37e/0x780 [dm_mod]
blk_update_request+0x448/0xf00
scsi_end_request+0x74/0x750
scsi_io_completion+0xe9/0x7c0
_scsih_io_done+0x6ba/0x1ca0 [mpt3sas]
_base_process_reply_queue+0x249/0x15b0 [mpt3sas]
_base_interrupt+0x95/0xe0 [mpt3sas]
__handle_irq_event_percpu+0x1f0/0x780
handle_irq_event+0xa9/0x1c0
handle_edge_irq+0x2ef/0x8a0
__common_interrupt+0xa0/0x170
common_interrupt+0xb7/0xe0
</IRQ>
<TASK>
asm_common_interrupt+0x26/0x40
RIP: 0010:_raw_spin_unlock_irq+0x2e/0x50
Code: 0f 1f 44 00 00 53 48 8b 74 24 08 48 89 fb 48 83 c7 18 e8 b5 73 5e fd 48 89 df e8 ed e2 5e fd e8 08 78 8f fd fb bf 01 00 00 00 <e8> 8d 56 4d fd 65 8b 05 46 d5 1d 03 85 c0 74 06 5b c3 cc cc cc cc
RSP: 0018:ffa0000027d07538 EFLAGS: 00000206
RAX: 0000000000000c2d RBX: ffffffffb6614bc8 RCX: 0000000000000080
RDX: 0000000000000000 RSI: ffffffffb6306a01 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffffb75efc67 R11: 0000000000000001 R12: ff1100015ada0000
R13: 0000000000000083 R14: 0000000000000002 R15: ffffffffb6614c10
folio_wait_bit_common+0x407/0x780
filemap_update_page+0x8e7/0xbd0
filemap_get_pages+0x904/0xc50
filemap_read+0x320/0xc20
xfs_file_buffered_read+0x2aa/0x380 [xfs]
xfs_file_read_iter+0x263/0x4a0 [xfs]
vfs_read+0x6cb/0xb70
ksys_read+0xf9/0x1d0
do_syscall_64+0x13a/0x1520
Zorro's diagnosis makes sense, so the solution is to kick the failed
read handling to a workqueue much like we added for writeback ioends in
commit 294f54f849 ("fserror: fix lockdep complaint when igrabbing
inode").
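The deferral pattern can be sketched as a userspace model. This is not
the iomap code: defer_queue_model, read_end_io_model() and
worker_model() are hypothetical, and a plain linked list stands in for
the workqueue.

```c
#include <stddef.h>

/* Hypothetical record of one failed read. */
struct failed_read_model {
	int id;
	struct failed_read_model *next;
};

struct defer_queue_model {
	struct failed_read_model *head;
	int reports; /* how many errors have been reported so far */
};

/*
 * "Interrupt" side: only queue the item. Nothing here takes a lock
 * that is also taken with interrupts enabled, such as i_lock.
 */
void read_end_io_model(struct defer_queue_model *q,
		       struct failed_read_model *item)
{
	item->next = q->head;
	q->head = item;
}

/*
 * "Process context" side: drain the queue and do the reporting, which
 * stands in for the fserror_report() and igrab() path.
 */
int worker_model(struct defer_queue_model *q)
{
	int handled = 0;

	while (q->head) {
		struct failed_read_model *item = q->head;

		q->head = item->next;
		item->next = NULL;
		q->reports++;
		handled++;
	}
	return handled;
}
```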
Cc: Zorro Lang <zlang@redhat.com>
Link: https://lore.kernel.org/linux-xfs/20260319194303.efw4wcu7c4idhthz@doltdoltdolt/
Fixes: a9d573ee88 ("iomap: report file I/O errors to the VFS")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://patch.msgid.link/20260323210017.GL6223@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Clearing the DP tunnel stream BW in the atomic state involves getting
the tunnel group state, which can fail. Handle the error accordingly.
This fixes at least one issue where drm_dp_tunnel_atomic_set_stream_bw()
failed to get the tunnel group state returning -EDEADLK, which wasn't
handled. This led to the ctx->contended warn later in modeset_lock()
while taking a WW mutex for another object in the same atomic state, and
thus within the same already contended WW context.
Moving intel_crtc_state_alloc() later would avoid freeing saved_state on
the error path; this stable patch leaves that simplification for a
follow-up.
Cc: Uma Shankar <uma.shankar@intel.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Fixes: a4efae87ec ("drm/i915/dp: Compute DP tunnel BW during encoder state computation")
Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/7617
Reviewed-by: Michał Grzelak <michal.grzelak@intel.com>
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patch.msgid.link/20260320092900.13210-1-imre.deak@intel.com
(cherry picked from commit fb69d0076e687421188bc8103ab0e8e5825b1df1)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Pull xen fixes from Juergen Gross:
"Restrict the xen privcmd driver in unprivileged domU to only allow
hypercalls to target domain when using secure boot"
* tag 'xsa482-7.0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/privcmd: add boot control for restricted usage in domU
xen/privcmd: restrict usage in unprivileged domU
Currently, enetc_get_ringparam() only provides rx_pending and tx_pending,
but 'ethtool --show-ring' no longer displays these fields, because the
ringparam retrieval path has moved to the new netlink interface, where
rings_fill_reply() emits the *x_pending only if the *x_max_pending values
are non-zero. So add rx_max_pending and tx_max_pending to
enetc_get_ringparam() to fix the issue.
Note that the maximum tx/rx ring size of hardware is 64K, but we haven't
added set_ringparam() to make the ring size configurable. To avoid users
mistakenly believing that the ring size can be increased, set
the *x_max_pending to priv->*x_bd_count.
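The interaction can be sketched with a small model. ringparam_model and
the helpers are hypothetical and only mirror the rule described above
(pending values are emitted only when the matching maximum is non-zero);
they are not the ethtool netlink code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the ring parameters. */
struct ringparam_model {
	uint32_t rx_max_pending;
	uint32_t rx_pending;
	uint32_t tx_max_pending;
	uint32_t tx_pending;
};

/* Reply side: rx_pending is only emitted when rx_max_pending is set. */
bool reply_has_rx_pending(const struct ringparam_model *rp)
{
	return rp->rx_max_pending != 0;
}

/* Reply side: tx_pending is only emitted when tx_max_pending is set. */
bool reply_has_tx_pending(const struct ringparam_model *rp)
{
	return rp->tx_max_pending != 0;
}

/*
 * Driver side after the fix: report the current ring size as the
 * maximum too, since the size is not configurable.
 */
void get_ringparam_fixed(struct ringparam_model *rp,
			 uint32_t rx_bd_count, uint32_t tx_bd_count)
{
	rp->rx_pending = rx_bd_count;
	rp->tx_pending = tx_bd_count;
	rp->rx_max_pending = rx_bd_count;
	rp->tx_max_pending = tx_bd_count;
}
```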
Fixes: e4a1717b67 ("ethtool: provide ring sizes with RINGS_GET request")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260320094222.706339-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When binding a udp_sock to a local address and port, UDP uses
two hashes (udptable->hash and udptable->hash2) for collision
detection. The current code switches to "hash2" when
hslot->count > 10.
"hash2" is keyed by local address and local port.
"hash" is keyed by local port only.
The issue can be shown in the following bind sequence (pseudo code):
bind(fd1, "[fd00::1]:8888")
bind(fd2, "[fd00::2]:8888")
bind(fd3, "[fd00::3]:8888")
bind(fd4, "[fd00::4]:8888")
bind(fd5, "[fd00::5]:8888")
bind(fd6, "[fd00::6]:8888")
bind(fd7, "[fd00::7]:8888")
bind(fd8, "[fd00::8]:8888")
bind(fd9, "[fd00::9]:8888")
bind(fd10, "[fd00::10]:8888")
/* Correctly return -EADDRINUSE because "hash" is used
* instead of "hash2". udp_lib_lport_inuse() detects the
* conflict.
*/
bind(fail_fd, "[::]:8888")
/* After one more socket is bound to "[fd00::11]:8888",
* hslot->count exceeds 10 and "hash2" is used instead.
*/
bind(fd11, "[fd00::11]:8888")
bind(fail_fd, "[::]:8888") /* succeeds unexpectedly */
The same issue applies to the IPv4 wildcard address "0.0.0.0"
and the IPv4-mapped wildcard address "::ffff:0.0.0.0". For
example, if there are existing sockets bound to
"192.168.1.[1-11]:8888", then binding "0.0.0.0:8888" or
"[::ffff:0.0.0.0]:8888" can also miss the conflict when
hslot->count > 10.
TCP inet_csk_get_port() already has the correct check in
inet_use_bhash2_on_bind(). Rename it to
inet_use_hash2_on_bind() and move it to inet_hashtables.h
so udp.c can reuse it in this fix.
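The hash selection can be modelled as below. This is a simplified sketch
of the decision only: bind_model and the two use_hash2_*() functions are
hypothetical, and the actual patch may combine the conditions
differently.

```c
#include <stdbool.h>

#define MODEL_HASH2_THRESHOLD 10 /* "hslot->count > 10" from the text */

/* Hypothetical summary of one bind() request. */
struct bind_model {
	bool wildcard_addr;  /* "::", "0.0.0.0" or "::ffff:0.0.0.0" */
	int port_slot_count; /* sockets already hashed on this port */
};

/* Before the fix: only the slot size decides. */
bool use_hash2_buggy(const struct bind_model *b)
{
	return b->port_slot_count > MODEL_HASH2_THRESHOLD;
}

/*
 * After the fix: a wildcard bind conflicts with every address on the
 * port, so it must use the port-only hash regardless of slot size.
 */
bool use_hash2_fixed(const struct bind_model *b)
{
	if (b->wildcard_addr)
		return false;
	return b->port_slot_count > MODEL_HASH2_THRESHOLD;
}
```

With eleven sockets already bound on the port, the buggy variant picks
"hash2" for a wildcard bind and misses the conflict, while the fixed
variant stays on "hash".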
Fixes: 30fff9231f ("udp: bind() optimisation")
Reported-by: Andrew Onyshchuk <oandrew@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260319181817.1901357-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_FIXED_PHY=m but CONFIG_B44=y, the kernel fails to link:
ld.lld: error: undefined symbol: fixed_phy_unregister
>>> referenced by b44.c
>>> drivers/net/ethernet/broadcom/b44.o:(b44_remove_one) in archive vmlinux.a
ld.lld: error: undefined symbol: fixed_phy_register_100fd
>>> referenced by b44.c
>>> drivers/net/ethernet/broadcom/b44.o:(b44_register_phy_one) in archive vmlinux.a
The fixed phy support is small enough that just always enabling it
for b44 is the simplest solution, and it avoids adding ugly #ifdef
checks.
Fixes: 10d2f15afb ("net: b44: register a fixed phy using fixed_phy_register_100fd if needed")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260320154927.674555-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since 0417adf367 ("ppp: fix race conditions in ppp_fill_forward_path")
dev_fill_forward_path() should be called with RCU read lock held. This
fix was applied to net, while the Airoha flowtable commit was applied to
net-next, so it hadn't been an issue until net was merged into net-next.
Fixes: a8bdd935d1 ("net: airoha: Add wlan flowtable TX offload")
Signed-off-by: Qingfang Deng <dqfext@gmail.com>
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://patch.msgid.link/20260320094315.525126-1-dqfext@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
`packet_release()` has a race window where `NETDEV_UP` can re-register a
socket into a fanout group's `arr[]` array. The re-registration is not
cleaned up by `fanout_release()`, leaving a dangling pointer in the fanout
array.
`packet_release()` does NOT zero `po->num` in its `bind_lock` section.
After releasing `bind_lock`, `po->num` is still non-zero and `po->ifindex`
still matches the bound device. A concurrent `packet_notifier(NETDEV_UP)`
that already found the socket in `sklist` can re-register the hook.
For fanout sockets, this re-registration calls `__fanout_link(sk, po)`
which adds the socket back into `f->arr[]` and increments `f->num_members`,
but does NOT increment `f->sk_ref`.
The fix sets `po->num` to zero in `packet_release()` while `bind_lock` is
held, so that NETDEV_UP can no longer re-link the socket, closing the
race window.
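The interleaving can be replayed sequentially in a small model. This is
only a sketch: packet_sock_model and the three helpers are hypothetical,
locking is omitted, and the notifier is simply called after release to
stand in for the concurrent NETDEV_UP.

```c
#include <stdbool.h>

/* Hypothetical reduction of the packet socket state. */
struct packet_sock_model {
	unsigned short num;  /* bound protocol, 0 means "not bound" */
	bool in_fanout_arr;  /* socket is linked into the fanout array */
};

/* Release before the fix: unlinks, but leaves num set. */
void release_buggy(struct packet_sock_model *po)
{
	po->in_fanout_arr = false;
}

/* Release after the fix: also clears num in the same locked section. */
void release_fixed(struct packet_sock_model *po)
{
	po->in_fanout_arr = false;
	po->num = 0;
}

/* NETDEV_UP handling: re-registers any socket that still looks bound. */
void notifier_netdev_up(struct packet_sock_model *po)
{
	if (po->num)
		po->in_fanout_arr = true;
}
```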
This bug was found following an additional audit with Claude Code based
on CVE-2025-38617.
Fixes: ce06b03e60 ("packet: Add helpers to register/unregister ->prot_hook")
Link: https://blog.calif.io/p/a-race-within-a-race-exploiting-cve
Signed-off-by: Yochai Eisenrich <echelonh@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260319200610.25101-1-echelonh@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kuniyuki Iwashima says:
====================
ipv6: Fix two GC issues with permanent routes.
Patch 1 fixes the unbounded growth of tb6_gc_hlist due to
permanent routes whose exception routes have all expired.
Patch 2 fixes an issue where exception routes tied to
permanent routes are not properly aged.
Patch 3 is a selftest for the issue fixed by Patch 2.
====================
Link: https://patch.msgid.link/20260320072317.2561779-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Without the prior commit, IPv6 GC cannot track exceptions tied
to permanent routes if they were originally added as temporary
routes.
Let's add a test case for the issue.
1. Add temporary routes
2. Create exceptions for the temporary routes
3. Promote the routes to permanent routes
4. Check if GC can find and purge the exceptions
A few notes:
+ At step 4, unlike other test cases, we cannot wait for
$GC_WAIT_TIME. While the exceptions are always iterable via
netlink (since it traverses the entire fib tree instead of
tb6_gc_hlist), rt6_nh_dump_exceptions() skips expired entries.
If we waited for the expiration time, we would be unable to
distinguish whether the exceptions were truly purged by GC or
just hidden due to being expired.
+ For the same reason, at step 2, we use ICMPv6 redirect message
instead of Packet Too Big message. This is because MTU exceptions
always have RTF_EXPIRES, and rt6_age_examine_exception() does not
respect the period specified by net.ipv6.route.flush=1.
+ We add a neighbour entry for the redirect target with NTF_ROUTER.
Without this, the exceptions would be removed at step 3 when
fib6_may_remove_gc_list() is called.
Without the fix, the exceptions remain even after GC is triggered
by sysctl -wq net.ipv6.route.flush=1.
FAIL: Expected 0 routes, got 5
TEST: ipv6 route garbage collection (promote to permanent routes) [FAIL]
With the fix, GC purges the exceptions properly.
TEST: ipv6 route garbage collection (promote to permanent routes) [ OK ]
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cited commit mechanically put fib6_remove_gc_list()
just after every fib6_clean_expires() call.
When a temporary route is promoted to a permanent route,
there may already be exception routes tied to it.
If fib6_remove_gc_list() removes the route from tb6_gc_hlist,
such exception routes will no longer be aged.
Let's replace fib6_remove_gc_list() with a new helper
fib6_may_remove_gc_list() and use fib6_age_exceptions() there.
Note that net->ipv6 is only compiled when CONFIG_IPV6 is
enabled, so fib6_{add,remove,may_remove}_gc_list() are guarded.
Fixes: 5eb902b8e7 ("net/ipv6: Remove expired routes with a separated list of routes.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 5eb902b8e7 ("net/ipv6: Remove expired routes with a
separated list of routes.") introduced a per-table GC list and
changed GC to iterate over that list instead of traversing
the entire route table.
However, it forgot to add permanent routes to tb6_gc_hlist
when exception routes are added.
Commit cfe82469a0 ("ipv6: add exception routes to GC list
in rt6_insert_exception") fixed that issue but introduced
another one.
Even after all exception routes expire, the permanent routes
remain in tb6_gc_hlist, potentially negating the performance
benefits intended by the initial change.
Let's count gc_args->more before and after rt6_age_exceptions()
and remove the permanent route when the delta is 0.
Note that the next patch will reuse fib6_age_exceptions().
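The before/after counting can be sketched as follows. The structures and
functions are hypothetical simplifications (for instance, expiry is a
plain integer comparison) and do not reproduce the kernel's
rt6_age_exceptions() logic.

```c
/* Hypothetical GC arguments: current time and count of live entries. */
struct gc_args_model {
	unsigned long now;
	int more;
};

/* Hypothetical exception route with an absolute expiry time. */
struct exception_model {
	unsigned long expires;
};

/* Age the exceptions: every entry that is still alive bumps "more". */
void age_exceptions_model(const struct exception_model *ex, int n,
			  struct gc_args_model *args)
{
	for (int i = 0; i < n; i++) {
		if (ex[i].expires > args->now)
			args->more++;
	}
}

/*
 * Return 1 when no exception survived, i.e. the delta of "more" is 0
 * and the permanent route can be dropped from the GC list.
 */
int permanent_route_may_leave_gc_list(const struct exception_model *ex,
				      int n, struct gc_args_model *args)
{
	int before = args->more;

	age_exceptions_model(ex, n, args);
	return args->more - before == 0;
}
```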
Fixes: cfe82469a0 ("ipv6: add exception routes to GC list in rt6_insert_exception")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260320072317.2561779-2-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The prefix/header of a TLP that caused an error may be recorded in the AER
Capability and emitted to the kernel log in raw hex format. Document the
existence and usage of tlp-tool, which decodes the TLP Header into
human-readable form.
The TLP Header hints at the root cause of an error, yet is often ignored
because of its seeming opaqueness. Instead, PCIe errors are frequently
worked around by a change in the kernel without fully understanding the
actual source of the problem. With more documentation on available tools
we'll hopefully come up with better solutions.
There are also wireshark dissectors for TLPs, but it seems they expect a
complete TLP, not just the header, and they cannot grok the hex format
emitted by the kernel directly. tlp-tool appears to be the most cut and
dried solution out there.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Maciej Grochowski <mx2pg@pm.me>
Link: https://patch.msgid.link/bf826c41b4c1d255c7dcb16e266b52f774d944ed.1774246067.git.lukas@wunner.de
Deinit calls idpf_idc_deinit_core_aux_device to free the cdev_info
memory, but leaves the adapter->cdev_info field with a stale pointer
value. This will bypass subsequent "if (!cdev_info)" checks if cdev_info
is not reallocated. For example, if idc_init fails after a reset,
cdev_info will already have been freed during the reset handling, but it
will not have been reallocated. The next reset or rmmod will result in a
crash.
[ +0.000008] BUG: kernel NULL pointer dereference, address: 00000000000000d0
[ +0.000033] #PF: supervisor read access in kernel mode
[ +0.000020] #PF: error_code(0x0000) - not-present page
[ +0.000017] PGD 2097dfa067 P4D 0
[ +0.000017] Oops: Oops: 0000 [#1] SMP NOPTI
...
[ +0.000018] RIP: 0010:device_del+0x3e/0x3d0
[ +0.000010] Call Trace:
[ +0.000010] <TASK>
[ +0.000012] idpf_idc_deinit_core_aux_device+0x36/0x70 [idpf]
[ +0.000034] idpf_vc_core_deinit+0x3e/0x180 [idpf]
[ +0.000035] idpf_remove+0x40/0x1d0 [idpf]
[ +0.000035] pci_device_remove+0x42/0xb0
[ +0.000020] device_release_driver_internal+0x19c/0x200
[ +0.000024] driver_detach+0x48/0x90
[ +0.000018] bus_remove_driver+0x6d/0x100
[ +0.000023] pci_unregister_driver+0x2e/0xb0
[ +0.000022] __do_sys_delete_module.isra.0+0x18c/0x2b0
[ +0.000025] ? kmem_cache_free+0x2c2/0x390
[ +0.000023] do_syscall_64+0x107/0x7d0
[ +0.000023] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Pass the adapter struct into idpf_idc_deinit_core_aux_device instead and
clear the cdev_info ptr.
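The pattern can be sketched in userspace as follows. This is only an illustration: struct adapter, struct cdev_info and deinit_core_aux_device() are hypothetical stand-ins, not the real idpf types or functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures; not the real idpf types. */
struct cdev_info {
	int id;
};

struct adapter {
	struct cdev_info *cdev_info;
};

/*
 * Takes the owning adapter rather than the cdev_info pointer itself, so
 * the field can be cleared once the memory is gone.
 */
static void deinit_core_aux_device(struct adapter *adapter)
{
	struct cdev_info *cdev_info = adapter->cdev_info;

	if (!cdev_info)
		return;		/* nothing allocated, or already torn down */

	free(cdev_info);
	adapter->cdev_info = NULL;	/* keep later NULL checks meaningful */
}
```

With the field cleared, a second deinit after a failed re-init becomes a harmless no-op instead of touching freed memory.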
Fixes: f4312e6bfa ("idpf: implement core RDMA auxiliary dev create, init, and destroy")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice_repr_get_stats64() and __ice_get_ethtool_stats() call
ice_update_vsi_stats() on the VF's src_vsi. This always returns early
because ICE_VSI_DOWN is permanently set for VF VSIs - ice_up() is never
called on them since queues are managed by iavf through virtchnl.
In __ice_get_ethtool_stats() the original code called
ice_update_vsi_stats() for all VSIs including representors, iterated
over ice_gstrings_vsi_stats[] to populate the data, and then bailed out
with an early return before the per-queue ring stats section. That early
return was necessary because representor VSIs have no rings on the PF
side - the rings belong to the VF driver (iavf), so accessing per-queue
stats would be invalid.
Move the representor handling to the top of __ice_get_ethtool_stats()
and call ice_update_eth_stats() directly to read the hardware GLV_*
counters. This matches ice_get_vf_stats() which already uses
ice_update_eth_stats() for the same VF VSI in legacy mode. Apply the
same fix to ice_repr_get_stats64().
Note that ice_gstrings_vsi_stats[] contains five software ring counters
(rx_buf_failed, rx_page_failed, tx_linearize, tx_busy, tx_restart) that
are always zero for representors since the PF never processes packets on
VF rings. This is pre-existing behavior unchanged by this patch.
Fixes: 7aae80cef7 ("ice: add port representor ethtool ops and stats")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
With the move to explicit pwrctrl power on/off APIs, the caller, i.e., the
PCI controller driver, should manage the power state. The pwrctrl drivers
should not try to clean up or power off when they are removed, as this
might end up disabling an already disabled regulator, causing a big
warning. This can be triggered if a PCI controller driver's .remove()
callback calls pci_pwrctrl_destroy_devices() after
pci_pwrctrl_power_off_devices().
Drop the devm cleanup parts that turn off regulators from the pwrctrl
drivers.
Fixes: b921aa3f8d ("PCI/pwrctrl: Switch to pwrctrl create, power on/off, destroy APIs")
Signed-off-by: Chen-Yu Tsai <wenst@chromium.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Link: https://patch.msgid.link/20260226092234.3859740-1-wenst@chromium.org
Commit 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
refactored the VF readiness check into a generic repr->ops.ready()
callback but implemented ice_repr_ready_vf() with inverted logic:
return !ice_check_vf_ready_for_cfg(repr->vf);
ice_check_vf_ready_for_cfg() returns 0 on success, so the negation
makes ready() return non-zero when the VF is ready. All callers treat
non-zero as "not ready, skip", causing ndo_get_stats64, get_drvinfo,
get_strings and get_ethtool_stats to always bail out in switchdev mode.
Remove the erroneous negation. The SF variant ice_repr_ready_sf() is
already correct (returns !active, i.e. non-zero when not active).
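A reduced model of the convention mismatch, assuming a hypothetical struct vf and helper names that only mimic the ones above (the -EBUSY value is illustrative too):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VF; the real struct ice_vf is far larger. */
struct vf {
	bool configured;
};

/* Same convention as described above: 0 means ready, non-zero means not. */
static int check_vf_ready_for_cfg(const struct vf *vf)
{
	return vf->configured ? 0 : -EBUSY;	/* error value is illustrative */
}

/* Buggy variant: the negation turns "ready" (0) into non-zero. */
static int repr_ready_vf_inverted(const struct vf *vf)
{
	return !check_vf_ready_for_cfg(vf);
}

/* Fixed variant: pass the result through unchanged. */
static int repr_ready_vf(const struct vf *vf)
{
	return check_vf_ready_for_cfg(vf);
}

/* Caller pattern: non-zero from ready() means "not ready, skip". */
static bool stats_would_be_collected(int (*ready)(const struct vf *),
				     const struct vf *vf)
{
	if (ready(vf))
		return false;
	return true;
}
```

In this model the inverted variant skips stats for a ready VF and collects them for one that is not ready, which is the opposite of what the callers expect.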
Fixes: 0f00a897c9 ("ice: check if SF is ready in ethtool ops")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Patryk Holda <patryk.holda@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
When allocating a netdevice with alloc_etherdev_mqs(), the maximum
supported number of queues should be passed. vsi->alloc_txq/rxq stores
the current number of queues, not the maximum.
Use the same function that the ethtool -l path uses to get the maximum
number of Tx and Rx queues when allocating the netdev.
Reproduction steps:
$ ethtool -l $pf # says current 16, max 64
$ ethtool -S $pf # fine
$ ethtool -L $pf combined 40 # crash
[491187.472594] Call Trace:
[491187.472829] <TASK>
[491187.473067] netif_set_xps_queue+0x26/0x40
[491187.473305] ice_vsi_cfg_txq+0x265/0x3d0 [ice]
[491187.473619] ice_vsi_cfg_lan_txqs+0x68/0xa0 [ice]
[491187.473918] ice_vsi_cfg_lan+0x2b/0xa0 [ice]
[491187.474202] ice_vsi_open+0x71/0x170 [ice]
[491187.474484] ice_vsi_recfg_qs+0x17f/0x230 [ice]
[491187.474759] ? dev_get_min_mp_channel_count+0xab/0xd0
[491187.474987] ice_set_channels+0x185/0x3d0 [ice]
[491187.475278] ethnl_set_channels+0x26f/0x340
Fixes: ee13aa1a2c ("ice: use netif_get_num_default_rss_queues()")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Alexander Nowlin <alexander.nowlin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Whenever we get an error updating the device stats item for a device in
btrfs_run_dev_stats() we allow the loop to go to the next device, and if
updating the stats item for the next device succeeds, we end up losing
the error we had from the previous device.
Fix this by breaking out of the loop once we get an error and make sure
it's returned to the caller. Since we are in the transaction commit path
(and in the critical section actually), returning the error will result
in a transaction abort.
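The control flow of the fix can be sketched as below. The callback type and fail_second_device() are hypothetical and exist only to exercise the loop; this is not the btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-device update callback: returns 0 or a negative errno. */
typedef int (*update_fn)(int device_index);

/*
 * Stop at the first failure and hand that error back, instead of letting
 * a later successful iteration overwrite it.
 */
static int run_dev_stats(size_t nr_devices, update_fn update)
{
	int ret = 0;

	for (size_t i = 0; i < nr_devices; i++) {
		ret = update((int)i);
		if (ret)
			break;
	}
	return ret;
}

/* Example callback: device 1 fails with -EIO, all others succeed. */
static int fail_second_device(int device_index)
{
	return device_index == 1 ? -EIO : 0;
}
```

Without the break, the successful update of device 2 would reset ret to 0 and the -EIO from device 1 would never reach the caller.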
Fixes: 733f4fbbc1 ("Btrfs: read device stats on mount, write modified ones during commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If overlay is used on top of btrfs, dentry->d_sb translates to overlay's
super block and fsid assignment will lead to a crash.
Use file_inode(file)->i_sb to always get btrfs_sb.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Since commit 3d74a7556f ("btrfs: zlib: introduce zlib_compress_bio()
helper"), there are some reports about different crashes in zlib
compression path. One of the symptoms is list corruption like the
following:
list_del corruption. next->prev should be fffffbb340204a08, but was ffff8d6517cb7de0. (next=fffffbb3402d62c8)
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:65!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 21436 Comm: kworker/u16:7 Not tainted 7.0.0-rc2-jcg+ #1 PREEMPT
Hardware name: LENOVO 10VGS02P00/3130, BIOS M1XKT57A 02/10/2022
Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
RIP: 0010:__list_del_entry_valid_or_report+0xec/0xf0
Call Trace:
<TASK>
btrfs_alloc_compr_folio+0xae/0xc0 [btrfs]
zlib_compress_bio+0x39d/0x6a0 [btrfs]
btrfs_compress_bio+0x2e3/0x3d0 [btrfs]
compress_file_range+0x2b0/0x660 [btrfs]
btrfs_work_helper+0xdb/0x3e0 [btrfs]
process_one_work+0x192/0x3d0
worker_thread+0x19a/0x310
kthread+0xdf/0x120
ret_from_fork+0x22e/0x310
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
Other symptoms include VM_BUG_ON() during folio_put() but it's rarer.
David Sterba first reported this during his CI runs, but unfortunately
I'm unable to hit it.
Meanwhile zstd/lzo don't seem to have the same problem.
[CAUSE]
During zlib_compress_bio() every time the output buffer is full, we
queue the full folio into the compressed bio, and allocate a new folio
as the output folio.
After the input has finished, we loop through zlib_deflate() with
Z_FINISH to flush all output.
And when that is done, we still need to check if the last folio has any
content, and if so we still need to queue that part into the compressed
bio.
The problem is in the final folio handling. If the final folio is full
(for x86_64 the folio size is 4K), the length to queue is calculated by
u32 cur_len = offset_in_folio(out_folio, workspace->strm.total_out);
But since total_out is 4K aligned, the resulting @cur_len will be 0. We
then hit bio_add_folio(), which has a quirk: if it is given a length of
0, it still queues the folio into the bio but returns false.
In that case we go to the out: label, which calls
btrfs_free_compr_folio() to release @out_folio, which may put the out
folio into the btrfs global pool list.
On the other hand, that @out_folio is already added to the
compressed bio, and will later be released again by
cleanup_compressed_bio(), which results in a double release.
And if this time we still need to put the folio into the btrfs global
pool list, it will result in list corruption because it's already in the
list.
[FIX]
Instead of offset_in_folio(), directly use the difference between
strm.total_out and bi_size, so that if the last folio is completely
full, we still properly queue the full folio rather than queueing zero
bytes.
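The arithmetic can be checked with a small model. It assumes a fixed 4K folio size and uses plain integers in place of the folio, z_stream and bio fields; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FOLIO_SIZE 4096u	/* assumed folio size, as in the x86_64 case */

/* Old calculation: offset of total_out within the current folio. */
static uint32_t cur_len_by_offset(uint64_t total_out)
{
	return (uint32_t)(total_out % FOLIO_SIZE);
}

/* New calculation: bytes produced minus bytes already queued in the bio. */
static uint32_t cur_len_by_difference(uint64_t total_out, uint32_t bi_size)
{
	return (uint32_t)(total_out - bi_size);
}
```

For total_out = 8192 with 4096 bytes already queued, the offset-based length is 0 while the difference is 4096, so the full last folio is queued with its real length. For a partially filled last folio (for example total_out = 5000) both calculations agree.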
Fixes: 3d74a7556f ("btrfs: zlib: introduce zlib_compress_bio() helper")
Reported-by: David Sterba <dsterba@suse.com>
Reported-by: Jean-Christophe Guillain <jean-christophe@guillain.net>
Reported-by: syzbot+3c4d8371d65230f852a2@syzkaller.appspotmail.com
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221176
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When create_space_info_sub_group() allocates elements of
space_info->sub_group[], kobject_init_and_add() is called for each
element via btrfs_sysfs_add_space_info_type(). However, when
check_removing_space_info() frees these elements, it does not call
btrfs_sysfs_remove_space_info() on them. As a result, kobject_put() is
not called and the associated kobj->name objects are leaked.
This memory leak is reproduced by running the blktests test case
zbd/009 on kernels built with CONFIG_DEBUG_KMEMLEAK. The kmemleak
feature reports the following error:
unreferenced object 0xffff888112877d40 (size 16):
comm "mount", pid 1244, jiffies 4294996972
hex dump (first 16 bytes):
64 61 74 61 2d 72 65 6c 6f 63 00 c4 c6 a7 cb 7f data-reloc......
backtrace (crc 53ffde4d):
__kmalloc_node_track_caller_noprof+0x619/0x870
kstrdup+0x42/0xc0
kobject_set_name_vargs+0x44/0x110
kobject_init_and_add+0xcf/0x150
btrfs_sysfs_add_space_info_type+0xfc/0x210 [btrfs]
create_space_info_sub_group.constprop.0+0xfb/0x1b0 [btrfs]
create_space_info+0x211/0x320 [btrfs]
btrfs_init_space_info+0x15a/0x1b0 [btrfs]
open_ctree+0x33c7/0x4a50 [btrfs]
btrfs_get_tree.cold+0x9f/0x1ee [btrfs]
vfs_get_tree+0x87/0x2f0
vfs_cmd_create+0xbd/0x280
__do_sys_fsconfig+0x3df/0x990
do_syscall_64+0x136/0x1540
entry_SYSCALL_64_after_hwframe+0x76/0x7e
To avoid the leak, call btrfs_sysfs_remove_space_info() instead of
kfree() for the elements.
Fixes: f92ee31e03 ("btrfs: introduce btrfs_space_info sub-group")
Link: https://lore.kernel.org/linux-block/b9488881-f18d-4f47-91a5-3c9bf63955a5@wdc.com/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging that an inode exists, as part of logging a new name or
logging new dir entries for a directory, we always set the generation of
the logged inode item to 0. This is to signal during log replay (in
overwrite_item()), that we should not set the i_size since we only logged
that an inode exists, so the i_size of the inode in the subvolume tree
must be preserved (as when we log new names or that an inode exists, we
don't log extents).
This works fine except when we have already logged an inode in full mode
or it's the first time we are logging an inode created in a past
transaction, that inode has a new i_size of 0 and then we log a new name
for the inode (due to a new hardlink or a rename), in which case we log
an i_size of 0 for the inode and a generation of 0, which causes the log
replay code to not update the inode's i_size to 0 (in overwrite_item()).
An example scenario:
mkdir /mnt/dir
xfs_io -f -c "pwrite 0 64K" /mnt/dir/foo
sync
xfs_io -c "truncate 0" -c "fsync" /mnt/dir/foo
ln /mnt/dir/foo /mnt/dir/bar
xfs_io -c "fsync" /mnt/dir
<power fail>
After log replay the file remains with a size of 64K. This is because when
we first log the inode, when we fsync file foo, we log its current i_size
of 0, and then when we create a hard link we log again the inode in exists
mode (LOG_INODE_EXISTS) but we set a generation of 0 for the inode item we
add to the log tree, so during log replay overwrite_item() sees that the
generation is 0 and i_size is 0 so we skip updating the inode's i_size
from 64K to 0.
Fix this by making sure at fill_inode_item() we always log the real
generation of the inode if it was logged in the current transaction with
the i_size we logged before. Also if an inode created in a previous
transaction is logged in exists mode only, make sure we log the i_size
stored in the inode item located from the commit root, so that if we log
multiple times that the inode exists we get the correct i_size.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/af8c15fa-4e41-4bb2-885c-0bc4e97532a6@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the superblock offset mismatch error message in
btrfs_validate_super(): we changed it so that it considers all the
superblocks, but the message still assumes we're only looking at the
first one.
The change from %u to %llu is because we're changing from a constant to
a u64.
Fixes: 069ec957c3 ("btrfs: Refactor btrfs_check_super_valid")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Forcibly disable the OD_FAN_CURVE feature when the temperature or PWM
range is invalid; otherwise PMFW will reject this configuration on
SMU v13.0.x.
Example:
$ sudo cat /sys/bus/pci/devices/<BDF>/gpu_od/fan_ctrl/fan_curve
OD_FAN_CURVE:
0: 0C 0%
1: 0C 0%
2: 0C 0%
3: 0C 0%
4: 0C 0%
OD_RANGE:
FAN_CURVE(hotspot temp): 0C 0C
FAN_CURVE(fan speed): 0% 0%
$ echo "0 50 40" | sudo tee fan_curve
kernel log:
[ 756.442527] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
[ 777.345800] amdgpu 0000:03:00.0: amdgpu: Fan curve temp setting(50) must be within [0, 0]!
Closes: https://github.com/ROCm/amdgpu/issues/208
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 470891606c5a97b1d0d937e0aa67a3bed9fcb056)
Cc: stable@vger.kernel.org
When SET_UCLK_MAX capability is absent, return -EOPNOTSUPP from
smu_v13_0_6_emit_clk_levels() for OD_MCLK instead of 0. This makes
unsupported OD_MCLK reporting consistent with other clock types
and allows callers to skip the entry cleanly.
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit d82e0a72d9189e8acd353988e1a57f85ce479e37)
Cc: stable@vger.kernel.org
Only reapply UCLK soft limits during PP_OD_RESTORE_DEFAULT when the
current max differs from the DPM table max. This avoids redundant
SMC updates and prevents -EINVAL on restore when no change is needed.
Fixes: b7a9003445 ("drm/amd/pm: Allow setting max UCLK on SMU v13.0.6")
Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 17f11bbbc76c8e83c8474ea708316b1e3631d927)
[WHAT]
When a sink is connected, aconnector->drm_edid was overwritten without
freeing the previous allocation, causing a memory leak on resume.
[HOW]
Free the previous drm_edid before updating it.
Reviewed-by: Roman Li <roman.li@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Chuanyu Tseng <chuanyu.tseng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 52024a94e7111366141cfc5d888b2ef011f879e5)
Cc: stable@vger.kernel.org
PASID reuse can cause interrupt issues when a new process immediately
runs into hardware state left behind by a previous process that exited
with the same PASID. It's possible that page faults are still pending in
the IH ring buffer when the process exits and frees up its PASID. To
prevent this, use the idr cyclic allocator, the same as the kernel does
for PIDs.
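The effect of cyclic allocation can be illustrated with a toy allocator. This is a sketch over a tiny fixed ID space and does not use the kernel's idr implementation; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_IDS 8	/* tiny ID space, for illustration only */

static bool in_use[NR_IDS];
static int next_hint;	/* where the next search starts */

/* Allocate the first free ID at or after the hint, wrapping around once. */
static int alloc_cyclic(void)
{
	for (int i = 0; i < NR_IDS; i++) {
		int id = (next_hint + i) % NR_IDS;

		if (!in_use[id]) {
			in_use[id] = true;
			next_hint = (id + 1) % NR_IDS;
			return id;
		}
	}
	return -1;	/* ID space exhausted */
}

static void free_id(int id)
{
	in_use[id] = false;
}
```

Because the search resumes after the most recently allocated ID, an ID that was just freed is not handed out again until the allocator has cycled through the rest of the space, which gives stale events for the old owner time to drain.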
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 8f1de51f49be692de137c8525106e0fce2d1912d)
Cc: stable@vger.kernel.org
amdgpu_device_get_job_timeout_settings() passes a pointer directly
to the global amdgpu_lockup_timeout[] buffer into strsep().
strsep() destructively replaces delimiter characters with '\0'
in-place.
On multi-GPU systems, this function is called once per device.
When a multi-value setting like "0,0,0,-1" is used, the first
GPU's call transforms the global buffer into "0\00\00\0-1". The
second GPU then sees only "0" (terminated at the first '\0'),
parses a single value, hits the single-value fallthrough
(index == 1), and applies timeout=0 to all rings — causing
immediate false job timeouts.
Fix this by copying into a stack-local array before calling
strsep(), so the global module parameter buffer remains intact
across calls. The buffer is AMDGPU_MAX_TIMEOUT_PARAM_LENGTH
(256) bytes, which is safe for the stack.
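The behaviour is easy to reproduce in userspace. The sketch below models the fixed approach: lockup_timeout[] and parse_timeouts() are hypothetical stand-ins, and snprintf() substitutes for the kernel's strscpy().

```c
#define _DEFAULT_SOURCE		/* for strsep() on glibc */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PARAM_LEN 256	/* mirrors the 256-byte size quoted above */

/* Stand-in for the global module parameter buffer. */
static char lockup_timeout[PARAM_LEN] = "0,0,0,-1";

/*
 * Parse up to max comma-separated values. strsep() only ever touches the
 * stack-local copy, so lockup_timeout[] stays intact for the next caller.
 */
static int parse_timeouts(long *out, int max)
{
	char buf[PARAM_LEN];
	char *input = buf;
	char *tok;
	int n = 0;

	/* Userspace substitute for the kernel's strscpy(). */
	snprintf(buf, sizeof(buf), "%s", lockup_timeout);

	while (n < max && (tok = strsep(&input, ",")) != NULL)
		out[n++] = strtol(tok, NULL, 10);

	return n;
}
```

Calling parse_timeouts() twice yields four values both times. Passing lockup_timeout directly to strsep() instead would leave only "0" visible to the second call.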
v2: wrap commit message to 72 columns, add Assisted-by tag.
v3: use stack array with strscpy() instead of kstrdup()/kfree()
to avoid unnecessary heap allocation (Christian).
This patch was developed with assistance from Claude (claude-opus-4-6).
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Ruijing Dong <ruijing.dong@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 94d79f51efecb74be1d88dde66bdc8bfcca17935)
Cc: stable@vger.kernel.org
Starting with commit 17ce8a6907 ("drm/amd/display: Add dsc pre-validation in
atomic check"), amdgpu resets the CRTC state mode_changed flag to false when
recomputing the DSC configuration results in no timing change for a particular
stream.
However, this is incorrect in scenarios where a change in MST/DSC configuration
happens in the same KMS commit as another (unrelated) mode change. For example,
the integrated panel of a laptop may be configured differently (e.g., HDR
enabled/disabled) depending on whether external screens are attached. In this
case, plugging in external DP-MST screens may result in the mode_changed flag
being dropped incorrectly for the integrated panel if its DSC configuration
did not change during precomputation in pre_validate_dsc().
At this point, however, dm_update_crtc_state() has already created new streams
for CRTCs with DSC-independent mode changes. In turn,
amdgpu_dm_commit_streams() will never release the old stream, resulting in a
memory leak. amdgpu_dm_atomic_commit_tail() will never acquire a reference to
the new stream either, which manifests as a use-after-free when the stream gets
disabled later on:
BUG: KASAN: use-after-free in dc_stream_release+0x25/0x90 [amdgpu]
Write of size 4 at addr ffff88813d836524 by task kworker/9:9/29977
Workqueue: events drm_mode_rmfb_work_fn
Call Trace:
<TASK>
dump_stack_lvl+0x6e/0xa0
print_address_description.constprop.0+0x88/0x320
? dc_stream_release+0x25/0x90 [amdgpu]
print_report+0xfc/0x1ff
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x225/0x4e0
? dc_stream_release+0x25/0x90 [amdgpu]
kasan_report+0xe1/0x180
? dc_stream_release+0x25/0x90 [amdgpu]
kasan_check_range+0x125/0x200
dc_stream_release+0x25/0x90 [amdgpu]
dc_state_destruct+0x14d/0x5c0 [amdgpu]
dc_state_release.part.0+0x4e/0x130 [amdgpu]
dm_atomic_destroy_state+0x3f/0x70 [amdgpu]
drm_atomic_state_default_clear+0x8ee/0xf30
? drm_mode_object_put.part.0+0xb1/0x130
__drm_atomic_state_free+0x15c/0x2d0
atomic_remove_fb+0x67e/0x980
Since there is no reliable way of figuring out whether a CRTC has unrelated
mode changes pending at the time of DSC validation, remember the value of the
mode_changed flag from before the point where a CRTC was marked as potentially
affected by a change in DSC configuration. Reset the mode_changed flag to this
earlier value instead in pre_validate_dsc().
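The idea can be reduced to the following sketch. The struct and its field names are hypothetical and much simpler than the real DRM/amdgpu state objects.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced CRTC state; field names are illustrative. */
struct crtc_state {
	bool mode_changed;
	bool mode_changed_before_dsc;	/* snapshot taken before DSC marking */
};

/* Mark a CRTC as potentially affected by a DSC change, remembering the old flag. */
static void mark_for_dsc(struct crtc_state *s)
{
	s->mode_changed_before_dsc = s->mode_changed;
	s->mode_changed = true;
}

/* DSC recomputation found no timing change: restore, don't force false. */
static void dsc_timing_unchanged(struct crtc_state *s)
{
	s->mode_changed = s->mode_changed_before_dsc;
}
```

A CRTC that already had an unrelated mode change pending keeps mode_changed set after DSC validation, while a CRTC that was only marked because of DSC goes back to false.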
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/5004
Fixes: 17ce8a6907 ("drm/amd/display: Add dsc pre-validation in atomic check")
Signed-off-by: Yussuf Khalil <dev@pp3345.net>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit cc7c7121ae082b7b82891baa7280f1ff2608f22b)
DAMON_STAT usage document (Documentation/admin-guide/mm/damon/stat.rst)
says it monitors the system's entire physical memory. But, it is
monitoring only the biggest System RAM resource of the system. When there
are multiple System RAM resources, this results in monitoring only an
unexpectedly small fraction of the physical memory. For example, suppose
the system has a 500 GiB System RAM, 10 MiB non-System RAM, and 500 GiB
System RAM resources in order on the physical address space. DAMON_STAT
will monitor only the first 500 GiB System RAM. This situation is
particularly common on NUMA systems.
Select a physical address range that covers all System RAM areas of the
system, to fix this issue and make it work as documented.
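One way to picture the selection is the sketch below. It assumes a simplified, hypothetical struct mem_resource rather than the kernel's struct resource, and is not the DAMON_STAT code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified resource record; not the kernel's struct resource. */
struct mem_resource {
	uint64_t start;		/* first byte */
	uint64_t end;		/* last byte, inclusive */
	bool is_system_ram;
};

/*
 * Find the single [start, end] range spanning every System RAM resource.
 * Returns false when there is no System RAM at all.
 */
static bool covering_range(const struct mem_resource *res, size_t n,
			   uint64_t *start, uint64_t *end)
{
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		if (!res[i].is_system_ram)
			continue;
		if (!found || res[i].start < *start)
			*start = res[i].start;
		if (!found || res[i].end > *end)
			*end = res[i].end;
		found = true;
	}
	return found;
}
```

For a layout like the example above (RAM, a small non-RAM hole, RAM), the resulting range starts at the first RAM resource and ends at the last one, so it includes the hole in between.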
[sj@kernel.org: return error if monitoring target region is invalid]
Link: https://lkml.kernel.org/r/20260317053631.87907-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20260316235118.873-1-sj@kernel.org
Fixes: 369c415e60 ("mm/damon: introduce DAMON_STAT module")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit e2c3b6b21c ("mm: zswap: use SG list decompression APIs from
zsmalloc") updated zswap_decompress() to use the scatterwalk API to copy
data for uncompressed pages.
In doing so, it mapped kernel memory locally for 32-bit kernels using
kmap_local_folio(), however it never unmapped this memory.
This resulted in the linked syzbot report where a BUG_ON() is triggered
due to leaking the kmap slot.
This patch fixes the issue by explicitly unmapping the established kmap.
Also, add flush_dcache_folio() after the kunmap_local() call.
I had assumed that a new folio here combined with the flush that is done at
the point of setting the PTE would suffice, but it doesn't seem that's
actually the case, as update_mmu_cache() will in many architectures only
actually flush entries where a dcache flush was done on a range previously.
I had also wondered whether kunmap_local() might suffice, but it doesn't
seem to be the case.
Some arches do seem to actually dcache flush on unmap, parisc does it if
CONFIG_HIGHMEM is not set by setting ARCH_HAS_FLUSH_ON_KUNMAP and calling
kunmap_flush_on_unmap() from __kunmap_local(), otherwise non-CONFIG_HIGHMEM
callers do nothing here.
Otherwise arch_kmap_local_pre_unmap() is called which does:
* sparc - flush_cache_all()
* arm - if VIVT, __cpuc_flush_dcache_area()
* otherwise - nothing
Also arch_kmap_local_post_unmap() is called which does:
* arm - local_flush_tlb_kernel_page()
* csky - kmap_flush_tlb()
* microblaze, ppc - local_flush_tlb_page()
* mips - local_flush_tlb_one()
* sparc - flush_tlb_all() (again)
* x86 - arch_flush_lazy_mmu_mode()
* otherwise - nothing
But this is only if it's high memory, and doesn't cover all architectures,
so is presumably intended to handle other cache consistency concerns.
In any case, VIPT is problematic here whether low or high memory (in spite
of what the documentation claims, see [0] - 'the kernel did write to a page
that is in the page cache page and / or in high memory'), because dirty
cache lines may exist at the set indexed by the kernel direct mapping,
which won't exist in the set indexed by any subsequent userland mapping,
meaning userland might read stale data from L2 cache.
Even if the documentation is correct and low memory is fine not to be
flushed here, we can't be sure as to whether the memory is low or high
(kmap_local_folio() will be a no-op if low), and this call should be
harmless if it is low.
VIVT would require more work if the memory were shared and already mapped,
but this isn't the case here, and would anyway be handled by the dcache
flush call.
In any case, we definitely need this flush as far as I can tell.
And we should probably consider updating the documentation unless it turns
out there's somehow dcache synchronisation that happens for low
memory/64-bit kernels elsewhere?
[ljs@kernel.org: add flush_dcache_folio() after the kunmap_local() call]
Link: https://lkml.kernel.org/r/13e09a99-181f-45ac-a18d-057faf94bccb@lucifer.local
Link: https://lkml.kernel.org/r/20260316140122.339697-1-ljs@kernel.org
Link: https://docs.kernel.org/core-api/cachetlb.html [0]
Fixes: e2c3b6b21c ("mm: zswap: use SG list decompression APIs from zsmalloc")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reported-by: syzbot+fe426bef95363177631d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69b75e2c.050a0220.12d28.015a.GAE@google.com
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In f_ospi_probe(), when num_cs validation fails, it returns without
calling spi_controller_put() on the SPI controller, which causes a
resource leak.
Use devm_spi_alloc_host() instead of spi_alloc_host() to ensure the
SPI controller is properly freed when probe fails.
Fixes: 1b74dd64c8 ("spi: Add Socionext F_OSPI SPI flash controller driver")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260319-sn-f-v1-1-33a6738d2da8@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Fix a kernel hang during boot caused by the inability to set up the IRQ
for the AXP313a.
The issue is caused by gpiochip_lock_as_irq() failing when the GPIO is
in an uninitialized state.
The solution is to set the pinmux to GPIO INPUT in
sunxi_pinctrl_irq_request_resources() if it wasn't initialized
earlier.
Tested on Orange Pi Zero 3.
Fixes: 01e10d0272 ("pinctrl: sunxi: Implement gpiochip::get_direction()")
Reviewed-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Chen-Yu Tsai <wens@kernel.org>
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Recent changes in the Allwinner pinctrl/GPIO IP made us add some quirks,
which the new SoCs (A523 family) need to use. We now have a comfortable
"flags" field on the per-SoC setup side, to tag those quirks we need, but
were translating those flag bits into specific fields for runtime use, in
the init routine.
Now the newest Allwinner GPIO IP adds even more quirks and exceptions,
some of a boolean nature.
To avoid inventing various new boolean flags for the runtime struct
sunxi_pinctrl, let's just directly pass on the flags variable used by the
setup code, so runtime can check for those various quirk bits directly.
Rename the "variant" member to "flags", and directly copy the value from
the setup code into there. Move the variant masking from the init
routine to the functions which actually use the "variant" value.
This mostly paves the way for the new A733 IP generation, which needs
more quirks to be checked at runtime.
Reviewed-by: Chen-Yu Tsai <wens@kernel.org>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Michal Piekos <michal.piekos@mmpsystems.pl>
Signed-off-by: Linus Walleij <linusw@kernel.org>
FRED-enabled SEV-(ES,SNP) guests fail to boot due to the following issues
in the early boot sequence:
* FRED does not have a #VC exception handler in the dispatch logic
* Early FRED #VC exceptions attempt to use uninitialized per-CPU GHCBs
instead of boot_ghcb
Add X86_TRAP_VC case to fred_hwexc() with a new exc_vmm_communication()
function that provides the unified entry point FRED requires, dispatching
to existing user/kernel handlers based on privilege level. The function is
already declared via DECLARE_IDTENTRY_VC().
Fix early GHCB access by falling back to boot_ghcb in
__sev_{get,put}_ghcb() when per-CPU GHCBs are not yet initialized.
Fixes: 14619d912b ("x86/fred: FRED entry/exit and dispatch code")
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: <stable@kernel.org> # 6.12+
Link: https://patch.msgid.link/20260318075654.1792916-4-nikunj@amd.com
The current rule for smb2_mapping_table.c uses `$(call cmd,...)`, which
fails to track command line modifications in the Makefile (e.g., modifying
the command to `perl -d` or `perl -w` for debug will not trigger a rebuild)
and does not generate the required .cmd file for Kbuild.
Fix this by transitioning to the standard `$(call if_changed,...)` macro.
This includes adding the `FORCE` prerequisite and appending the output
file to the `targets` variable so Kbuild can track it properly.
As a result, Kbuild now automatically handles the cleaning of the
generated file, allowing us to safely drop the redundant `clean-files`
assignment.
Fixes: c527e13a7a ("cifs: Autogenerate SMB2 error mapping table")
Signed-off-by: Huiwen He <hehuiwen@kylinos.cn>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
Commit in Fixes added the FRED CR4 bit to the CR4 pinned bits mask so
that whenever something else modifies CR4, that bit remains set. Which
in itself is a perfectly fine idea.
However, there's an issue when during boot FRED is initialized: first on
the BSP and later on the APs. Thus, there's a window in time when
exceptions cannot be handled.
This becomes particularly nasty when running as SEV-{ES,SNP} or TDX
guests which, when they manage to trigger exceptions during that short
window described above, triple fault due to FRED MSRs not being set up
yet.
See Link tag below for a much more detailed explanation of the
situation.
So, as a result, the commit in that Link URL tried to address this
shortcoming by temporarily disabling CR4 pinning when an AP is not
online yet.
However, that is a problem in itself because in this case, an attacker
only needs to modify the online bit - a single bit in RW memory - to
then disable CR4 pinning and then disable SM*P, opening the door to
much worse things happening to the system.
So, instead, remove the FRED bit from the CR4 pinning mask, thus
obviating the need to temporarily disable CR4 pinning.
If someone manages to disable FRED when poking at CR4, then
idt_invalidate() would make sure the system would crash'n'burn on the
first exception triggered, which is a much better outcome security-wise.
Fixes: ff45746fbf ("x86/cpu: Add X86_CR4_FRED macro")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org> # 6.12+
Link: https://lore.kernel.org/r/177385987098.1647592.3381141860481415647.tip-bot2@tip-bot2
Commit 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask()
stacking") introduced refcount-based GFP mask management that warns
when pm_restore_gfp_mask() is called with saved_gfp_count == 0.
Some hibernation paths call pm_restore_gfp_mask() defensively where
the GFP mask may or may not be restricted depending on the execution
path. For example, the uswsusp interface invokes it in
SNAPSHOT_CREATE_IMAGE, SNAPSHOT_UNFREEZE, and snapshot_release().
Before the stacking change this was a silent no-op; it now triggers
a spurious WARNING.
Remove the WARN_ON() wrapper from the !saved_gfp_count check while
retaining the check itself, so that defensive calls remain harmless
without producing false warnings.
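
The intended behaviour can be sketched in userspace as follows. This is a minimal illustration only: the demo_* names, the mask values and the bool return are hypothetical and are not the kernel's actual implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's GFP mask state. */
#define DEMO_GFP_ALL	0xffu
#define DEMO_GFP_IO_FS	0x03u	/* bits disallowed while restricted */

static unsigned int demo_allowed_mask = DEMO_GFP_ALL;
static unsigned int demo_saved_mask;
static unsigned int demo_saved_count;

/* Restrict the mask; nested calls only bump the count. */
static void demo_restrict_gfp_mask(void)
{
	if (demo_saved_count++)
		return;
	demo_saved_mask = demo_allowed_mask;
	demo_allowed_mask &= ~DEMO_GFP_IO_FS;
}

/*
 * Restore the mask. Returns false when nothing was restricted, which is
 * treated as a silent no-op: a plain check, with no WARN_ON() around it.
 */
static bool demo_restore_gfp_mask(void)
{
	if (!demo_saved_count)
		return false;
	if (--demo_saved_count)
		return true;	/* still restricted by an outer caller */
	demo_allowed_mask = demo_saved_mask;
	demo_saved_mask = 0;
	return true;
}
```

A defensive demo_restore_gfp_mask() call with no matching restrict leaves the mask untouched, while stacked restrict/restore pairs only lift the restriction on the last restore.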
Fixes: 35e4a69b20 ("PM: sleep: Allow pm_restrict_gfp_mask() stacking")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
[ rjw: Subject tweak ]
Link: https://patch.msgid.link/20260322120528.750178-1-youngjun.park@lge.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The gz_chain_head variable has been unused since the driver's initial
addition to the tree. Its use was eliminated between v3 and v4 during
development but due to the reference of gz_chain_head's wait_list
member, the compiler could not warn that it was unused.
After a (tip) commit ("locking/rwsem: Remove the list_head from struct
rw_semaphore"), which removed a reference to the variable passed to
__RWSEM_INITIALIZER(), certain configurations show an unused variable
warning from the Lenovo wmi-gamezone driver:
drivers/platform/x86/lenovo/wmi-gamezone.c:34:31: warning: 'gz_chain_head' defined but not used [-Wunused-variable]
34 | static BLOCKING_NOTIFIER_HEAD(gz_chain_head);
| ^~~~~~~~~~~~~
include/linux/notifier.h:119:39: note: in definition of macro 'BLOCKING_NOTIFIER_HEAD'
119 | struct blocking_notifier_head name = \
| ^~~~
Remove the variable to prevent the warning from showing up.
Fixes: 22024ac536 ("platform/x86: Add Lenovo Gamezone WMI Driver")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Mark Pearson <mpearson-lenovo@squebb.ca>
Link: https://patch.msgid.link/20260313-lenovo-wmi-gamezone-remove-gz_chain_head-v1-1-ce5231f0c6fa@kernel.org
[ij: reorganized the changelog]
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
On some systems, HWP can be explicitly disabled in the BIOS settings.
When HWP is disabled by firmware, the HWP CPUID bit is not set, and
attempting to read MSR_PM_ENABLE will result in a General Protection
(GP) fault.
unchecked MSR access error: RDMSR from 0x770 at rIP: 0xffffffffc33db92e (disable_dynamic_sst_features+0xe/0x50 [isst_tpmi_core])
Call Trace:
<TASK>
? ex_handler_msr+0xf6/0x150
? fixup_exception+0x1ad/0x340
? gp_try_fixup_and_notify+0x1e/0xb0
? exc_general_protection+0xc9/0x390
? terminate_walk+0x64/0x100
? asm_exc_general_protection+0x22/0x30
? disable_dynamic_sst_features+0xe/0x50 [isst_tpmi_core]
isst_if_def_ioctl+0xece/0x1050 [isst_tpmi_core]
? ioctl_has_perm.constprop.42+0xe0/0x130
isst_if_def_ioctl+0x10d/0x1a0 [isst_if_common]
__se_sys_ioctl+0x86/0xc0
do_syscall_64+0x8a/0x100
entry_SYSCALL_64_after_hwframe+0x78/0xe2
RIP: 0033:0x7f36eaef54a7
Add a check for X86_FEATURE_HWP before accessing the MSR. If HWP is
not available, return true safely.
Fixes: 12a7d2cb81 ("platform/x86: ISST: Add SST-CP support via TPMI")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Link: https://patch.msgid.link/20260303074635.2218-1-lirongqing@baidu.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
The HP Omen 16-xf0xxx board 8BCA uses the same Victus-S fan and
thermal WMI path as other recently supported Omen/Victus boards,
but it requires Omen v1 thermal profile parameters for correct
platform profile behavior.
Add board 8BCA to victus_s_thermal_profile_boards[] and map it
to omen_v1_thermal_params.
Validated on HP Omen 16-xf0xxx (board 8BCA):
- /sys/firmware/acpi/platform_profile exposes
low-power/balanced/performance
- fan RPM reporting works (fan1_input/fan2_input)
- manual fan control works through hp-wmi hwmon (pwm1/pwm1_enable)
Signed-off-by: Raed <thisisraed@outlook.com>
Link: https://patch.msgid.link/20260311131338.965249-1-youaretalkingtoraed@gmail.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Commit 005e8dddd4 ("PM: hibernate: don't store zero pages in the
image file") added an optimization to skip zero-filled pages in the
hibernation image. On restore, zero pages are handled internally by
snapshot_write_next() in a loop that processes them without returning
to the caller.
With the userspace restore interface, writing the last non-zero page
to /dev/snapshot is followed by the SNAPSHOT_ATOMIC_RESTORE ioctl. At
this point there are no more calls to snapshot_write_next() so any
trailing zero pages are not processed, snapshot_image_loaded() fails
because handle->cur is smaller than expected, the ioctl returns -EPERM
and the image is not restored.
The in-kernel restore path is not affected by this because the loop in
load_image() in swap.c calls snapshot_write_next() until it returns 0.
It is this final call that drains any trailing zero pages.
Fix this by calling snapshot_write_next() in snapshot_write_finalize(),
giving the kernel the chance to drain any trailing zero pages.
Fixes: 005e8dddd4 ("PM: hibernate: don't store zero pages in the image file")
Signed-off-by: Alberto Garcia <berto@igalia.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Link: https://patch.msgid.link/ef5a7c5e3e3dbd17dcb20efaa0c53a47a23498bb.1773075892.git.berto@igalia.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move FSGSBASE enablement from identify_cpu() to cpu_init_exception_handling()
to ensure it is enabled before any exceptions can occur on both boot and
secondary CPUs.
== Background ==
Exception entry code (paranoid_entry()) uses ALTERNATIVE patching based on
X86_FEATURE_FSGSBASE to decide whether to use RDGSBASE/WRGSBASE instructions
or the slower RDMSR/SWAPGS sequence for saving/restoring GSBASE.
On boot CPU, ALTERNATIVE patching happens after enabling FSGSBASE in CR4.
When the feature is available, the code is permanently patched to use
RDGSBASE/WRGSBASE, which require CR4.FSGSBASE=1 to execute without triggering #UD.
== Boot Sequence ==
Boot CPU (with CR pinning enabled):
trap_init()
cpu_init() <- Uses unpatched code (RDMSR/SWAPGS)
x2apic_setup()
...
arch_cpu_finalize_init()
identify_boot_cpu()
identify_cpu()
cr4_set_bits(X86_CR4_FSGSBASE) # Enables the feature
# This becomes part of cr4_pinned_bits
...
alternative_instructions() <- Patches code to use RDGSBASE/WRGSBASE
Secondary CPUs (with CR pinning enabled):
start_secondary()
cr4_init() <- Code already patched, CR4.FSGSBASE=1
set implicitly via cr4_pinned_bits
cpu_init() <- exceptions work because FSGSBASE is
already enabled
Secondary CPU (with CR pinning disabled):
start_secondary()
cr4_init() <- Code already patched, CR4.FSGSBASE=0
cpu_init()
x2apic_setup()
rdmsrq(MSR_IA32_APICBASE) <- Triggers #VC in SNP guests
exc_vmm_communication()
paranoid_entry() <- Uses RDGSBASE with CR4.FSGSBASE=0
(patched code)
...
ap_starting()
identify_secondary_cpu()
identify_cpu()
cr4_set_bits(X86_CR4_FSGSBASE) <- Enables the feature, which is
too late
== CR Pinning ==
Currently, for secondary CPUs, CR4.FSGSBASE is set implicitly through
CR-pinning: the boot CPU sets it during identify_cpu(), it becomes part of
cr4_pinned_bits, and cr4_init() applies those pinned bits to secondary CPUs.
This works but creates an undocumented dependency between cr4_init() and the
pinning mechanism.
== Problem ==
Secondary CPUs boot after alternatives have been applied globally. They
execute already-patched paranoid_entry() code that uses RDGSBASE/WRGSBASE
instructions, which require CR4.FSGSBASE=1. Upcoming changes to CR pinning
behavior will break the implicit dependency, causing secondary CPUs to
generate #UD.
This issue manifests itself on AMD SEV-SNP guests, where the rdmsrq() in
x2apic_setup() triggers a #VC exception early during cpu_init(). The #VC
handler (exc_vmm_communication()) executes the patched paranoid_entry() path.
Without CR4.FSGSBASE enabled, RDGSBASE instructions trigger #UD.
== Fix ==
Enable FSGSBASE explicitly in cpu_init_exception_handling() before loading
exception handlers. This makes the dependency explicit and ensures both
boot and secondary CPUs have FSGSBASE enabled before paranoid_entry()
executes.
Fixes: c82965f9e5 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
Reported-by: Borislav Petkov <bp@alien8.de>
Suggested-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Nikunj A Dadhania <nikunj@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: <stable@kernel.org>
Link: https://patch.msgid.link/20260318075654.1792916-2-nikunj@amd.com
Remove the redundant post-parse validation switch. By the time that
block is reached, xfs_attri_validate() has already guaranteed all name
lengths are non-zero via xfs_attri_validate_namelen(), and
xfs_attri_validate_name_iovec() has already returned -EFSCORRUPTED for
NULL names. For the REMOVE case, attr_value and value_len are
structurally guaranteed to be NULL/zero because the parsing loop only
populates them when value_len != 0. All checks in that switch are
therefore dead code.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The ri_total checks for SET/REPLACE operations are hardcoded to 3,
but xfs_attri_item_size() only emits a value iovec when value_len > 0,
so ri_total is 2 when value_len == 0.
For PPTR_SET/PPTR_REMOVE/PPTR_REPLACE, value_len is validated by
xfs_attri_validate() to be exactly sizeof(struct xfs_parent_rec) and
is never zero, so their hardcoded checks remain correct.
This problem may cause log recovery failures. The following script can be
used to reproduce the problem:
#!/bin/bash
mkfs.xfs -f /dev/sda
mount /dev/sda /mnt/test/
touch /mnt/test/file
for i in {1..200}; do
attr -s "user.attr_$i" -V "value_$i" /mnt/test/file > /dev/null
done
echo 1 > /sys/fs/xfs/debug/larp
echo 1 > /sys/fs/xfs/sda/errortag/larp
attr -s "user.zero" -V "" /mnt/test/file
echo 0 > /sys/fs/xfs/sda/errortag/larp
umount /mnt/test
mount /dev/sda /mnt/test/ # mount failed
Fix this by deriving the expected count dynamically as "2 + !!value_len"
for SET/REPLACE operations.
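
As a minimal sketch of that expression (the helper name is hypothetical, and the comment about which two iovecs are always present is an assumption, not taken from the code):

```c
#include <stddef.h>

/*
 * Hypothetical helper: expected iovec count for a SET/REPLACE intent.
 * Two iovecs are always expected (assumed here to be the intent header
 * and the name); a value iovec is only counted when value_len != 0.
 */
static unsigned int demo_attri_expected_iovecs(size_t value_len)
{
	return 2 + !!value_len;
}
```

The double negation collapses any non-zero length to 1, so the result is 2 for an empty value and 3 otherwise.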
Cc: stable@vger.kernel.org # v6.9
Fixes: ad206ae50e ("xfs: check opcode and iovec count match in xlog_recover_attri_commit_pass2")
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
When inactivating an inode with node-format extended attributes,
xfs_attr3_node_inactive() invalidates all child leaf/node blocks via
xfs_trans_binval(), but intentionally does not remove the corresponding
entries from their parent node blocks. The implicit assumption is that
xfs_attr_inactive() will truncate the entire attr fork to zero extents
afterwards, so log recovery will never reach the root node and follow
those stale pointers.
However, if a log shutdown occurs after the leaf/node block cancellations
commit but before the attr bmap truncation commits, this assumption
breaks. Recovery replays the attr bmap intact (the inode still has
attr fork extents), but suppresses replay of all cancelled leaf/node
blocks, possibly leaving them as stale data on disk. On the next mount,
xlog_recover_process_iunlinks() retries inactivation and attempts to
read the root node via the attr bmap. If the root node was not replayed,
reading the unreplayed root block triggers a metadata verification
failure immediately; if it was replayed, following its child pointers
to unreplayed child blocks triggers the same failure:
XFS (pmem0): Metadata corruption detected at
xfs_da3_node_read_verify+0x53/0x220, xfs_da3_node block 0x78
XFS (pmem0): Unmount and run xfs_repair
XFS (pmem0): First 128 bytes of corrupted metadata buffer:
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
XFS (pmem0): metadata I/O error in "xfs_da_read_buf+0x104/0x190" at daddr 0x78 len 8 error 117
Fix this in two places:
In xfs_attr3_node_inactive(), after calling xfs_trans_binval() on a
child block, immediately remove the entry that references it from the
parent node in the same transaction. This eliminates the window where
the parent holds a pointer to a cancelled block. Once all children are
removed, the now-empty root node is converted to a leaf block within the
same transaction. This node-to-leaf conversion is necessary for crash
safety. If the system shuts down after the empty node is written to the
log but before the second-phase bmap truncation commits, log recovery
will attempt to verify the root block on disk. xfs_da3_node_verify()
does not permit a node block with count == 0; such a block will fail
verification and trigger a metadata corruption shutdown. On the other
hand, leaf blocks are allowed to have this transient state.
In xfs_attr_inactive(), split the attr fork truncation into two explicit
phases. First, truncate all extents beyond the root block (the child
extents whose parent references have already been removed above).
Second, invalidate the root block and truncate the attr bmap to zero in
a single transaction. The two operations in the second phase must be
atomic: as long as the attr bmap has any non-zero length, recovery can
follow it to the root block, so the root block invalidation must commit
together with the bmap-to-zero truncation.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor out a wrapper function, xfs_attr3_leaf_init(), which is exported
for external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor out a wrapper function, xfs_attr3_node_entry_remove(), which is
exported for external use.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The assertion functions properly because we currently only truncate the
attr to a zero size. Any other new size of the attr is not preempted.
Make this assertion specific to the datafork, preparing for
subsequent patches to truncate the attribute to a non-zero size.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
unlink_nv12_plane() will clobber parts of the plane state
potentially already set up by plane_atomic_check(), so we
must make sure not to call the two in the wrong order.
The problem happens when a plane previously selected as
a Y plane is now configured as a normal plane by user space.
plane_atomic_check() will first compute the proper plane
state based on the userspace request, and unlink_nv12_plane()
later clears some of the state.
This used to work on account of unlink_nv12_plane() skipping
the state clearing based on the plane visibility. But I removed
that check, thinking it was an impossible situation. Now when
that situation happens unlink_nv12_plane() will just WARN
and proceed to clobber the state.
Rather than reverting to the old way of doing things, I think
it's more clear if we unlink the NV12 planes before we even
compute the new plane state.
Cc: stable@vger.kernel.org
Reported-by: Khaled Almahallawy <khaled.almahallawy@intel.com>
Closes: https://lore.kernel.org/intel-gfx/20260212004852.1920270-1-khaled.almahallawy@intel.com/
Tested-by: Khaled Almahallawy <khaled.almahallawy@intel.com>
Fixes: 6a01df2f1b ("drm/i915: Remove pointless visible check in unlink_nv12_plane()")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260316163953.12905-2-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
(cherry picked from commit 017ecd04985573eeeb0745fa2c23896fb22ee0cc)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Put the barrier() before the OP so that anything we read out in
OP and check in COND will actually be read out after the timeout
has been evaluated.
Currently the only place where we use OP is __intel_wait_for_register(),
but the use there is precisely susceptible to this reordering, assuming
the ktime_*() stuff itself doesn't act as a sufficient barrier:
__intel_wait_for_register(...)
{
...
ret = __wait_for(reg_value = intel_uncore_read_notrace(...),
(reg_value & mask) == value, ...);
...
}
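
The ordering can be illustrated with a userspace polling loop. This is a sketch under assumptions: the demo_* names, the fake register and the GCC/Clang-style compiler barrier are illustrative stand-ins, not the i915 macro itself.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdbool.h>
#include <time.h>

/* Compiler-only barrier (GCC/Clang syntax), standing in for barrier(). */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

static long long demo_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Fake register that returns an increasing value on every read. */
static unsigned int demo_reg_reads;
static unsigned int demo_read_reg(void) { return ++demo_reg_reads; }
static bool demo_reg_is_3(unsigned int v) { return v == 3; }
static bool demo_never(unsigned int v) { (void)v; return false; }

/* Poll until cond(read_op()) holds (returns 0) or the timeout hits (-1). */
static int demo_wait_for(unsigned int (*read_op)(void),
			 bool (*cond)(unsigned int), long long timeout_ns)
{
	const long long end = demo_now_ns() + timeout_ns;

	for (;;) {
		/* Evaluate the timeout first ... */
		const bool expired = demo_now_ns() > end;

		/* ... and keep the read below from being hoisted above it. */
		demo_barrier();
		unsigned int v = read_op();

		if (cond(v))
			return 0;
		if (expired)
			return -1;
	}
}
```

Because the value checked by the condition is read after the timeout has been evaluated, a timeout result always reflects a final read taken after expiry.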
Cc: stable@vger.kernel.org
Fixes: 1c3c1dc66a ("drm/i915: Add compiler barrier to wait_for")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260313110740.24620-1-ville.syrjala@linux.intel.com
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit a464bace0482aa9a83e9aa7beefbaf44cd58e6cf)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
After commit e2b76ab8b5 ("ksmbd: add support for read compound"),
response buffer management was changed to use a dynamic iov array.
In the new design, smb2_calc_max_out_buf_len() expects the second
argument (hdr2_len) to be the offset of ->Buffer field in the
response structure, not a hardcoded magic number.
Fix the remaining call sites to use the correct offsetof() value.
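
To illustrate why offsetof() is preferable to a magic number, here is a small sketch. The structure layout and the length helper are hypothetical and do not reproduce the real SMB2 response structures or smb2_calc_max_out_buf_len().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical response layout, not a real SMB2 structure. */
struct demo_rsp {
	uint16_t StructureSize;
	uint16_t OutputBufferOffset;
	uint32_t OutputBufferLength;
	uint8_t  Buffer[];	/* payload starts here */
};

/*
 * Hypothetical helper: bytes left for payload in a buffer of 'total'
 * bytes after a leading header (hdr_len) and the fixed part of the
 * response that precedes ->Buffer (hdr2_len).
 */
static size_t demo_max_out_len(size_t total, size_t hdr_len, size_t hdr2_len)
{
	if (hdr_len + hdr2_len > total)
		return 0;
	return total - hdr_len - hdr2_len;
}
```

A caller would pass offsetof(struct demo_rsp, Buffer) as hdr2_len, so the value follows the structure definition if fields are ever added or resized.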
Cc: stable@vger.kernel.org
Fixes: e2b76ab8b5 ("ksmbd: add support for read compound")
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb2_lock() has three error handling issues after list_del() detaches
smb_lock from lock_list at no_check_cl:
1) If vfs_lock_file() returns an unexpected error in the non-UNLOCK
path, goto out leaks smb_lock and its flock because the out:
handler only iterates lock_list and rollback_list, neither of
which contains the detached smb_lock.
2) If vfs_lock_file() returns -ENOENT in the UNLOCK path, goto out
leaks smb_lock and flock for the same reason. The error code
returned to the dispatcher is also stale.
3) In the rollback path, smb_flock_init() can return NULL on
allocation failure. The result is dereferenced unconditionally,
causing a kernel NULL pointer dereference. Add a NULL check to
prevent the crash and clean up the bookkeeping; the VFS lock
itself cannot be rolled back without the allocation and will be
released at file or connection teardown.
Fix cases 1 and 2 by hoisting the locks_free_lock()/kfree() to before
the if(!rc) check in the UNLOCK branch so all exit paths share one
free site, and by freeing smb_lock and flock before goto out in the
non-UNLOCK branch. Propagate the correct error code in both cases.
Fix case 3 by wrapping the VFS unlock in an if(rlock) guard and adding
a NULL check for locks_free_lock(rlock) in the shared cleanup.
Found via call-graph analysis using sqry.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Suggested-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
smb_grant_oplock() has two issues in the oplock publication sequence:
1) opinfo is linked into ci->m_op_list (via opinfo_add) before
add_lease_global_list() is called. If add_lease_global_list()
fails (kmalloc returns NULL), the error path frees the opinfo
via __free_opinfo() while it is still linked in ci->m_op_list.
Concurrent m_op_list readers (opinfo_get_list, or direct iteration
in smb_break_all_levII_oplock) dereference the freed node.
2) opinfo->o_fp is assigned after add_lease_global_list() publishes
the opinfo on the global lease list. A concurrent
find_same_lease_key() can walk the lease list and dereference
opinfo->o_fp->f_ci while o_fp is still NULL.
Fix by restructuring the publication sequence to eliminate post-publish
failure:
- Set opinfo->o_fp before any list publication (fixes NULL deref).
- Preallocate lease_table via alloc_lease_table() before opinfo_add()
so add_lease_global_list() becomes infallible after publication.
- Keep the original m_op_list publication order (opinfo_add before
lease list) so concurrent opens via same_client_has_lease() and
opinfo_get_list() still see the in-flight grant.
- Use opinfo_put() instead of __free_opinfo() on err_out so that
the RCU-deferred free path is used.
This also requires splitting add_lease_global_list() to take a
preallocated lease_table and changing its return type from int to void,
since it can no longer fail.
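
The "allocate first, publish last" pattern can be sketched as below. All demo_* names are hypothetical, the lists are simplified singly linked lists without locking or RCU, and the simulate_oom flag only exists to exercise the failure path.

```c
#include <stdbool.h>
#include <stdlib.h>

struct demo_node {
	struct demo_node *next;
	int key;
};

struct demo_list {
	struct demo_node *head;
};

/* Publishing is a pure pointer update, so it cannot fail. */
static void demo_publish(struct demo_list *list, struct demo_node *node)
{
	node->next = list->head;
	list->head = node;
}

/* Do every fallible step before anything becomes visible on a list. */
static int demo_grant(struct demo_list *ops, struct demo_list *leases,
		      int key, bool simulate_oom)
{
	struct demo_node *op = malloc(sizeof(*op));
	struct demo_node *lease = simulate_oom ? NULL : malloc(sizeof(*lease));

	if (!op || !lease) {
		/* Nothing was published yet, so a plain free is safe. */
		free(op);
		free(lease);
		return -1;
	}

	/* Fully initialise the objects before publication. */
	op->key = key;
	lease->key = key;

	demo_publish(ops, op);
	demo_publish(leases, lease);
	return 0;
}

static void demo_free_list(struct demo_list *list)
{
	while (list->head) {
		struct demo_node *n = list->head;

		list->head = n->next;
		free(n);
	}
}
```

On allocation failure neither list ever contains a node that is about to be freed, which is the property the restructured publication sequence relies on.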
Fixes: 1dfd062caa ("ksmbd: fix use-after-free by using call_rcu() for oplock_info")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a multichannel session binding request fails (e.g. wrong password),
the error path unconditionally sets sess->state = SMB2_SESSION_EXPIRED.
However, during binding, sess points to the target session looked up via
ksmbd_session_lookup_slowpath() -- which belongs to another connection's
user. This allows a remote attacker to invalidate any active session by
simply sending a binding request with a wrong password (DoS).
Fix this by skipping session expiration when the failed request was
a binding attempt, since the session does not belong to the current
connection. The reference taken by ksmbd_session_lookup_slowpath() is
still correctly released via ksmbd_user_session_put().
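
A minimal sketch of the resulting error path, using hypothetical types and names (the real code operates on ksmbd session objects and is not reproduced here):

```c
#include <stdbool.h>

enum demo_state { DEMO_SESSION_VALID, DEMO_SESSION_EXPIRED };

struct demo_session {
	enum demo_state state;
	int refs;
};

/* Hypothetical error path of a failed session setup request. */
static void demo_sess_setup_fail(struct demo_session *sess, bool is_binding)
{
	/*
	 * Only expire a session owned by the current connection. During
	 * binding, sess is the looked-up target session of another
	 * connection and must stay valid.
	 */
	if (!is_binding)
		sess->state = DEMO_SESSION_EXPIRED;

	/* The lookup reference is dropped in both cases. */
	sess->refs--;
}
```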
Cc: stable@vger.kernel.org
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
To pick up the changes in:
6ffd853b0b ("build_bug.h: correct function parameters names in kernel-doc")
That just adds some comments, addressing this perf tools build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/linux/build_bug.h include/linux/build_bug.h
Please take a look at tools/include/uapi/README for further info on this
synchronization process.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
e2ffe85b6d ("KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM")
That just rebuilds kvm-stat.c on x86, no change in functionality.
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
da142f3d37 ("KVM: Remove subtle "struct kvm_stats_desc" pseudo-overlay")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the 'perf trace' ioctl syscall argument beautifiers.
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes from this cset:
9073428bb2 ("x86/sev: Allow IBPB-on-Entry feature for SNP guests")
That causes no changes to tooling as it doesn't include a new MSR to be
captured by the tools/perf/trace/beauty/tracepoints/x86_msr.sh script.
Just silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Pull bpf fixes from Alexei Starovoitov:
- Fix how linked registers track zero extension of subregisters (Daniel
Borkmann)
- Fix unsound scalar fork for OR instructions (Daniel Wade)
- Fix exception exit lock check for subprogs (Ihor Solodrai)
- Fix undefined behavior in interpreter for SDIV/SMOD instructions
(Jenny Guanni Qu)
- Release module's BTF when module is unloaded (Kumar Kartikeya
Dwivedi)
- Fix constant blinding for PROBE_MEM32 instructions (Sachin Kumar)
- Reset register ID for END instructions to prevent incorrect value
tracking (Yazhou Tang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Add a test cases for sync_linked_regs regarding zext propagation
bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagation
selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling
bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR
selftests/bpf: Add tests for sdiv32/smod32 with INT_MIN dividend
bpf: Fix undefined behavior in interpreter sdiv/smod for INT_MIN
selftests/bpf: Add tests for bpf_throw lock leak from subprogs
bpf: Fix exception exit lock checking for subprogs
bpf: Release module BTF IDR before module unload
selftests/bpf: Fix pkg-config call on static builds
bpf: Fix constant blinding for PROBE_MEM32 stores
selftests/bpf: Add test for BPF_END register ID reset
bpf: Reset register ID for BPF_END value tracking
Pull tracing fixes from Steven Rostedt:
- Revert "tracing: Remove pid in task_rename tracing output"
A change was made to remove the pid field from the task_rename event
because it was thought that it was always done for the current task
and recording the pid would be redundant. This turned out to be
incorrect and there are a few corner cases where this is not true,
which caused some regressions in tooling.
- Fix the reading from user space for migration
The reading of user space uses a seq lock type of logic where it uses
a per-cpu temporary buffer and disables migration, then enables
preemption, does the copy from user space, disables preemption,
enables migration and checks if there were any context switches while
preemption was enabled. If there was a context switch, then it is
considered that the per-cpu buffer could be corrupted and it tries
again. As a protection, if it takes a hundred tries, it issues a
warning and exits out to prevent a live lock.
This was triggered because the task was selected by the load balancer
to be migrated to another CPU. Every time preemption was enabled, the
migration task would schedule in and try to migrate the task, but
couldn't because migration was disabled, and would let it run again.
This caused the scheduler to schedule out the task every time it
enabled preemption and made the loop never exit (until the 100
iteration test triggered).
Fix this by enabling and disabling preemption and keeping migration
enabled if the reading from user space needs to be done again. This
will let the migration thread migrate the task and the copy from user
space will likely pass on the next iteration.
- Fix trace_marker copy option freeing
The "copy_trace_marker" option allows a tracing instance to get a
copy of a write to the trace_marker file of the top level instance.
This is managed by a linked list protected by RCU. When an instance is
removed, a check is made if the option is set, and if so
synchronize_rcu() is called.
The problem is that the iteration that resets all the flags to what
they were when the instance was created (to perform clean ups) was
done before the check of the copy_trace_marker option. That cleared
the option, so synchronize_rcu() was never called.
Move the clearing of all the flags after the check of
copy_trace_marker to do synchronize_rcu() so that the option is still
set if it was before and the synchronization is performed.
- Fix entries setting when validating the persistent ring buffer
When validating the persistent ring buffer on boot up, the number of
events per sub-buffer is added to the sub-buffer meta page. The
validator was updating cpu_buffer->head_page (the first sub-buffer of
the per-cpu buffer) and not the "head_page" variable that was
iterating the sub-buffers. This was causing the first sub-buffer to
be assigned the entries for each sub-buffer and not the sub-buffer
that was supposed to be updated.
- Use "hash" value to update the direct callers
When updating the ftrace direct callers, it assigned a temporary
callback to all the callback functions of the ftrace ops and not just
the functions represented by the passed in hash. This causes an
unnecessary slow down of the functions of the ftrace_ops that are not
being modified. Only update the functions that are going to be
modified to call the ftrace loop function so that the update can be
made on those functions.
* tag 'trace-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Use hash argument for tmp_ops in update_ftrace_direct_mod
ring-buffer: Fix to update per-subbuf entries of persistent ring buffer
tracing: Fix trace_marker copy link list updates
tracing: Fix failure to read user space from system call trace events
tracing: Revert "tracing: Remove pid in task_rename tracing output"
Pull i2c fixes from Wolfram Sang:
- fix broken I2C communication on Armada 3700 with recovery
- fix device_node reference leak in probe (fsi)
- fix NULL-deref when serial string is missing (cp2615)
* tag 'i2c-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pxa: defer reset on Armada 3700 when recovery is used
i2c: fsi: Fix a potential leak in fsi_i2c_probe()
i2c: cp2615: fix serial string NULL-deref at probe
Pull x86 fixes from Ingo Molnar:
- Improve Qemu MCE-injection behavior by only using AMD SMCA MSRs if
the feature bit is set
- Fix the relative path of gettimeofday.c inclusion in vclock_gettime.c
- Fix a boot crash on UV clusters when a socket is marked as
'deconfigured'. Such sockets are mapped to the SOCK_EMPTY node ID by
the UV firmware, while Linux APIs expect NUMA_NO_NODE.
The difference being (0xffff [unsigned short ~0]) vs [int -1]
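
The mismatch between the two sentinel values can be shown with a small sketch. The DEMO_* constants mirror the values quoted above, while the translation helper is hypothetical and is not the actual UV platform code.

```c
/* Values as described above: 0xffff (unsigned short ~0) vs. int -1. */
#define DEMO_SOCK_EMPTY		((unsigned short)~0)
#define DEMO_NUMA_NO_NODE	(-1)

/* Hypothetical helper translating a firmware node ID for Linux APIs. */
static int demo_socket_to_node(unsigned short fw_node)
{
	if (fw_node == DEMO_SOCK_EMPTY)
		return DEMO_NUMA_NO_NODE;
	return fw_node;
}
```

Without such a translation, the unsigned short value converts to the int 65535 rather than -1, so a comparison against NUMA_NO_NODE never matches.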
* tag 'x86-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Handle deconfigured sockets
x86/entry/vdso: Fix path of included gettimeofday.c
x86/mce/amd: Check SMCA feature bit before accessing SMCA MSRs
Pull perf fixes from Ingo Molnar:
- Fix a PMU driver crash on AMD EPYC systems, caused by
a race condition in x86_pmu_enable()
- Fix a possible counter-initialization bug in x86_pmu_enable()
- Fix a counter inheritance bug in inherit_event() and
__perf_event_read()
- Fix an Intel PMU driver branch constraints handling bug
found by UBSAN
- Fix the Intel PMU driver's new Off-Module Response (OMR)
support code for Diamond Rapids / Nova Lake, to fix a snoop
information parsing bug
* tag 'perf-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix OMR snoop information parsing issues
perf/x86/intel: Add missing branch counters constraint apply
perf: Make sure to use pmu_ctx->pmu for groups
x86/perf: Make sure to program the counter value for stopped events on migration
perf/x86: Move event pointer setup earlier in x86_pmu_enable()
Pull objtool fixes from Ingo Molnar:
"Fix three more livepatching related build environment bugs, and a
false positive warning with Clang jump tables"
* tag 'objtool-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix Clang jump table detection
livepatch/klp-build: Fix inconsistent kernel version
objtool/klp: fix mkstemp() failure with long paths
objtool/klp: fix data alignment in __clone_symbol()
Pull locking fix from Ingo Molnar:
"Fix a sparse build error regression in <linux/local_lock_internal.h>
caused by the locking context-analysis changes"
* tag 'locking-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
include/linux/local_lock_internal.h: Make this header file again compatible with sparse
Pull irq fix from Ingo Molnar:
"Fix a mailbox channel leak in the riscv-rpmi-sysmsi irqchip driver"
* tag 'irq-urgent-2026-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-rpmi-sysmsi: Fix mailbox channel leak in rpmi_sysmsi_probe()
The call to mipi_dsi_host_register triggers a callback to mtk_dsi_bind,
which uses dev_get_drvdata to retrieve the mtk_dsi struct, so this
structure needs to be stored inside the driver data before invoking it.
As drvdata is currently uninitialized it leads to a crash when
registering the DSI DRM encoder right after acquiring
the mode_config.idr_mutex, blocking all subsequent DRM operations.
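The ordering problem can be sketched in isolation. This is a simplified
model with hypothetical types and function names, not the mediatek-drm
code: a registration call invokes a callback synchronously, and the
callback reads the driver data, so the driver data must be stored first.

```c
#include <stddef.h>

struct fake_dev { void *drvdata; };   /* stands in for struct device */
struct fake_dsi { int bound; };       /* stands in for struct mtk_dsi */

/* Callback that runs during registration and reads drvdata. */
static int fake_bind(struct fake_dev *dev)
{
    struct fake_dsi *dsi = dev->drvdata;

    if (!dsi)
        return -1;      /* the real driver would use the NULL pointer here */
    dsi->bound = 1;
    return 0;
}

/* Registration triggers the bind callback before it returns. */
static int fake_host_register(struct fake_dev *dev)
{
    return fake_bind(dev);
}

/* Buggy order: register first, store drvdata afterwards. */
static int probe_buggy(struct fake_dev *dev, struct fake_dsi *dsi)
{
    int ret = fake_host_register(dev);

    dev->drvdata = dsi;
    return ret;
}

/* Fixed order: store drvdata before anything can call back into us. */
static int probe_fixed(struct fake_dev *dev, struct fake_dsi *dsi)
{
    dev->drvdata = dsi;
    return fake_host_register(dev);
}
```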
Fixes the following crash during mediatek-drm probe (tested on Xiaomi
Smart Clock x04g):
Unable to handle kernel NULL pointer dereference at virtual address
0000000000000040
[...]
Modules linked in: mediatek_drm(+) drm_display_helper cec drm_client_lib
drm_dma_helper drm_kms_helper panel_simple
[...]
Call trace:
drm_mode_object_add+0x58/0x98 (P)
__drm_encoder_init+0x48/0x140
drm_encoder_init+0x6c/0xa0
drm_simple_encoder_init+0x20/0x34 [drm_kms_helper]
mtk_dsi_bind+0x34/0x13c [mediatek_drm]
component_bind_all+0x120/0x280
mtk_drm_bind+0x284/0x67c [mediatek_drm]
try_to_bring_up_aggregate_device+0x23c/0x320
__component_add+0xa4/0x198
component_add+0x14/0x20
mtk_dsi_host_attach+0x78/0x100 [mediatek_drm]
mipi_dsi_attach+0x2c/0x50
panel_simple_dsi_probe+0x4c/0x9c [panel_simple]
mipi_dsi_drv_probe+0x1c/0x28
really_probe+0xc0/0x3dc
__driver_probe_device+0x80/0x160
driver_probe_device+0x40/0x120
__device_attach_driver+0xbc/0x17c
bus_for_each_drv+0x88/0xf0
__device_attach+0x9c/0x1cc
device_initial_probe+0x54/0x60
bus_probe_device+0x34/0xa0
device_add+0x5b0/0x800
mipi_dsi_device_register_full+0xdc/0x16c
mipi_dsi_host_register+0xc4/0x17c
mtk_dsi_probe+0x10c/0x260 [mediatek_drm]
platform_probe+0x5c/0xa4
really_probe+0xc0/0x3dc
__driver_probe_device+0x80/0x160
driver_probe_device+0x40/0x120
__driver_attach+0xc8/0x1f8
bus_for_each_dev+0x7c/0xe0
driver_attach+0x24/0x30
bus_add_driver+0x11c/0x240
driver_register+0x68/0x130
__platform_register_drivers+0x64/0x160
mtk_drm_init+0x24/0x1000 [mediatek_drm]
do_one_initcall+0x60/0x1d0
do_init_module+0x54/0x240
load_module+0x1838/0x1dc0
init_module_from_file+0xd8/0xf0
__arm64_sys_finit_module+0x1b4/0x428
invoke_syscall.constprop.0+0x48/0xc8
do_el0_svc+0x3c/0xb8
el0_svc+0x34/0xe8
el0t_64_sync_handler+0xa0/0xe4
el0t_64_sync+0x198/0x19c
Code: 52800022 941004ab 2a0003f3 37f80040 (29005a80)
Fixes: e4732b590a ("drm/mediatek: dsi: Register DSI host after acquiring clocks and PHY")
Signed-off-by: Luca Leonardo Scorcia <l.scorcia@gmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Reviewed-by: CK Hu <ck.hu@mediatek.com>
Link: https://patchwork.kernel.org/project/dri-devel/patch/20260225094047.76780-1-l.scorcia@gmail.com/
Signed-off-by: Chun-Kuang Hu <chunkuang.hu@kernel.org>
On weakly ordered architectures (e.g., arm64), the lockless check in
wq_watchdog_timer_fn() can observe a reordering between the worklist
insertion and the last_progress_ts update. Specifically, the watchdog
can see a non-empty worklist (from a list_add) while reading a stale
last_progress_ts value, causing a false positive stall report.
This was confirmed by reading pool->last_progress_ts again after holding
pool->lock in wq_watchdog_timer_fn():
workqueue watchdog: pool 7 false positive detected!
lockless_ts=4784580465 locked_ts=4785033728
diff=453263ms worklist_empty=0
To avoid slowing down the hot path (queue_work, etc.), recheck
last_progress_ts with pool->lock held. This will eliminate the false
positive with minimal overhead.
Remove two extra empty lines in wq_watchdog_timer_fn() while we are at it.
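The "lockless check, then confirm under the lock" pattern can be sketched
as follows. This is a userspace model with hypothetical names; the
spinlock built on atomic_flag merely stands in for pool->lock, and the
possibly stale lockless timestamp is passed in as a parameter so the
effect can be shown deterministically.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct pool_model {
    atomic_flag lock;                /* stands in for pool->lock */
    unsigned long last_progress_ts;  /* only trusted with the lock held */
};

static void pool_lock(struct pool_model *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ;  /* spin */
}

static void pool_unlock(struct pool_model *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * lockless_ts is the value read without the lock and may be stale.
 * Only when it suggests a stall do we pay for the lock and re-read.
 */
static bool watchdog_confirms_stall(struct pool_model *p,
                                    unsigned long lockless_ts,
                                    unsigned long now,
                                    unsigned long thresh)
{
    bool stalled;

    if (now - lockless_ts <= thresh)
        return false;               /* fast path: no lock taken */

    pool_lock(p);
    stalled = (now - p->last_progress_ts) > thresh;
    pool_unlock(p);
    return stalled;
}
```

The lock is taken only on the rare path where a stall is suspected, so
the hot path stays untouched.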
Fixes: 82607adcf9 ("workqueue: implement lockup detector")
Cc: stable@vger.kernel.org # v4.5+
Assisted-by: claude-code:claude-opus-4-6
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
slot_free() completely resets a slot by clearing all of its flags and
attributes. While zram_writeback_complete() restores some of the flags
(those that are necessary for async read decompression), we still lose
much of the slot's metadata, for example the slot's ac-time or
ZRAM_INCOMPRESSIBLE.
More importantly, restoring flags/attrs requires extra care, as some of
the flags directly affect zram device stats, and the original code did
not take that care, namely in the handling of ZRAM_HUGE slots in
zram_writeback_complete(). The call to slot_free() would
decrement ->huge_pages, however when zram_writeback_complete() restored
the slot's ZRAM_HUGE flag, it would not get reflected in an incremented
->huge_pages. So when the slot would finally get freed, slot_free()
would decrement ->huge_pages again, leading to underflow.
Fix this by open-coding the required memory free and stats updates in
zram_writeback_complete(), rather than calling the destructive
slot_free(). Since we now preserve the ZRAM_HUGE flag on written-back
slots (for the deferred decompression path), we also update slot_free()
to skip decrementing ->huge_pages if ZRAM_WB is set.
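A minimal model of the accounting, under the assumption that the fix
behaves as described above (the names are hypothetical and this is not
the zram code): writeback drops the huge_pages count once while keeping
the flag, and the later free skips the decrement for written-back slots.

```c
enum {
    F_HUGE = 1 << 0,   /* stands in for ZRAM_HUGE */
    F_WB   = 1 << 1    /* stands in for ZRAM_WB */
};

struct slot_model  { unsigned int flags; };
struct stats_model { long huge_pages; };

/* Writeback completion: account for the freed memory, keep the flag. */
static void writeback_complete_model(struct slot_model *s,
                                     struct stats_model *st)
{
    if (s->flags & F_HUGE)
        st->huge_pages--;
    s->flags |= F_WB;
}

/* Final free: a written-back slot already gave up its huge_pages count. */
static void slot_free_model(struct slot_model *s, struct stats_model *st)
{
    if ((s->flags & F_HUGE) && !(s->flags & F_WB))
        st->huge_pages--;
    s->flags = 0;
}
```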
Link: https://lkml.kernel.org/r/20260320023143.2372879-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20260319034912.1894770-1-senozhatsky@chromium.org
Fixes: d38fab605c ("zram: introduce compressed data writeback")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Richard Chang <richardycc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
One major usage of damon_call() is online DAMON parameters update. It is
done by calling damon_commit_ctx() inside the damon_call() callback
function. damon_commit_ctx() can fail for two reasons: 1) invalid
parameters and 2) internal memory allocation failures. In case of
failures, the damon_ctx that was being updated (the commit destination)
can be left partially updated (or, from one perspective, corrupted), and
therefore shouldn't be used anymore. The function only ensures the
damon_ctx object can be safely deallocated using damon_destroy_ctx().
The API callers are, however, calling damon_commit_ctx() only after
asserting the parameters are valid, to avoid damon_commit_ctx() failing
due to invalid input parameters. But it can still theoretically fail if
the internal memory allocation fails. In that case, DAMON may run with the
partially updated damon_ctx. This can result in unexpected behaviors
including even NULL pointer dereference in case of damos_commit_dests()
failure [1]. Such allocations are arguably too small to fail, so
the real world impact would be rare. But, given the bad consequence, this
needs to be fixed.
Avoid such partially-committed (maybe-corrupted) damon_ctx use by saving
the damon_commit_ctx() failure on the damon_ctx object. For this,
introduce the damon_ctx->maybe_corrupted field. damon_commit_ctx() sets it
when it fails. kdamond_call() checks if the field is set after each
damon_call_control->fn() is executed. If it is set, ignore remaining
callback requests and return. All kdamond_call() callers including
kdamond_fn() also check the maybe_corrupted field right after
kdamond_call() invocations. If the field is set, break the kdamond_fn()
main loop so that DAMON does not keep using the context that might be
corrupted.
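The control flow can be sketched as below. This is a simplified model
with hypothetical names, not the DAMON implementation: a failing commit
marks the context, and the callback runner stops as soon as the mark is
seen.

```c
#include <stdbool.h>
#include <stddef.h>

struct ctx_model {
    bool maybe_corrupted;   /* models damon_ctx->maybe_corrupted */
    int applied;            /* number of successful commits */
};

typedef int (*call_fn)(struct ctx_model *);

/* Commit that can fail part-way; on failure the context is marked. */
static int commit_model(struct ctx_model *ctx, bool alloc_fails)
{
    if (alloc_fails) {
        ctx->maybe_corrupted = true;
        return -1;
    }
    ctx->applied++;
    return 0;
}

static int ok_cb(struct ctx_model *ctx)      { return commit_model(ctx, false); }
static int failing_cb(struct ctx_model *ctx) { return commit_model(ctx, true); }

/* Run queued callbacks; returns how many were executed. */
static size_t run_calls(struct ctx_model *ctx, call_fn *fns, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        fns[i](ctx);
        if (ctx->maybe_corrupted)
            return i + 1;   /* ignore the remaining requests */
    }
    return n;
}
```

A caller would check maybe_corrupted after run_calls() returns and leave
its main loop if the field is set.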
[sj@kernel.org: let kdamond_call() with cancel regardless of maybe_corrupted]
Link: https://lkml.kernel.org/r/20260320031553.2479-1-sj@kernel.org
Link: https://sashiko.dev/#/patchset/20260319145218.86197-1-sj%40kernel.org
Link: https://lkml.kernel.org/r/20260319145218.86197-1-sj@kernel.org
Link: https://lore.kernel.org/20260319043309.97966-1-sj@kernel.org [1]
Fixes: 3301f1861d ("mm/damon/sysfs: handle commit command using damon_call()")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 542eda1a83 ("mm/rmap: improve anon_vma_clone(),
unlink_anon_vmas() comments, add asserts") alters the way errors are
handled, but overlooked one important aspect of clean up.
When a VMA encounters an error state in anon_vma_clone() (that is, on
attempted allocation of anon_vma_chain objects), it cleans up partially
established state in cleanup_partial_anon_vmas(), before returning an
error.
However, this occurs prior to anon_vma->num_active_vmas being incremented,
and it also fails to clear the VMA's vma->anon_vma field, which remains in
place.
This is immediately an inconsistent state, because
anon_vma->num_active_vmas is supposed to track the number of VMAs whose
vma->anon_vma field references that anon_vma, and now that count is
off-by-negative-1 for each VMA for which this error state has occurred.
When VMAs are unlinked from this anon_vma, unlink_anon_vmas() will
eventually underflow anon_vma->num_active_vmas, which will trigger a
warning.
This will always eventually happen, as we unlink anon_vma's at process
teardown.
It could also cause maybe_reuse_anon_vma() to incorrectly permit the reuse
of an anon_vma which has active VMAs attached, which will lead to a
persistently invalid state.
The solution is to clear the VMA's anon_vma field when we clean up partial
state, as the fact we are doing so indicates clearly that the VMA is not
correctly integrated into the anon_vma tree and thus this field is
invalid.
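The counter imbalance can be modeled in a few lines. The types and
functions below are hypothetical simplifications, not the rmap code: the
VMA is assumed to already reference the anon_vma when the clone fails,
and the failure happens before the counter is incremented.

```c
#include <stddef.h>

struct anon_vma_model { int num_active_vmas; };
struct vma_model      { struct anon_vma_model *anon_vma; };

/*
 * Models a clone that fails before num_active_vmas is incremented.
 * clear_on_cleanup selects the fixed behavior.
 */
static int failed_clone_model(struct vma_model *vma, int clear_on_cleanup)
{
    if (clear_on_cleanup)
        vma->anon_vma = NULL;   /* the fix: drop the stale reference */
    return -1;                  /* report the allocation failure */
}

/* Models unlink: a VMA that still references the anon_vma decrements. */
static void unlink_model(struct vma_model *vma)
{
    if (!vma->anon_vma)
        return;
    vma->anon_vma->num_active_vmas--;
    vma->anon_vma = NULL;
}
```

With the stale pointer left in place, the failed VMA decrements a count
it never incremented, so the count goes negative once the legitimate VMA
is unlinked as well.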
Link: https://lkml.kernel.org/r/20260318122632.63404-1-ljs@kernel.org
Fixes: 542eda1a83 ("mm/rmap: improve anon_vma_clone(), unlink_anon_vmas() comments, add asserts")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/linux-mm/20260302151547.2389070-1-sashal@kernel.org/
Reported-by: Jiakai Xu <jiakaipeanut@gmail.com>
Closes: https://lore.kernel.org/linux-mm/CAFb8wJvRhatRD-9DVmr5v5pixTMPEr3UKjYBJjCd09OfH55CKg@mail.gmail.com/
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Tested-by: Jiakai Xu <jiakaipeanut@gmail.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Sasha Levin (Microsoft) <sashal@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the WAKE_SYNC path of scx_select_cpu_dfl(), waker_node was computed
with cpu_to_node(), while node (for prev_cpu) was computed with
scx_cpu_node_if_enabled(). When scx_builtin_idle_per_node is disabled,
idle_cpumask(waker_node) is called with a real node ID even though
per-node idle tracking is disabled, resulting in undefined behavior.
Fix by using scx_cpu_node_if_enabled() for waker_node as well, ensuring
both variables are computed consistently.
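A sketch of why the two node lookups must agree. Everything here is a
hypothetical model (including the 4-CPUs-per-node topology and the
assumption that the "if enabled" helper yields a no-node sentinel when
per-node tracking is off); it is not the sched_ext code.

```c
#include <stdbool.h>
#include <stddef.h>

#define NO_NODE_MODEL (-1)

struct idle_mask_model { unsigned long bits; };

static struct idle_mask_model global_mask;   /* always available */
static struct idle_mask_model *node_masks;   /* NULL when per-node is off */

/* Hypothetical topology: 4 CPUs per node. */
static int cpu_to_node_model(int cpu)
{
    return cpu / 4;
}

/* Returns a real node only if per-node idle tracking is enabled. */
static int cpu_node_if_enabled_model(int cpu, bool per_node_enabled)
{
    return per_node_enabled ? cpu_to_node_model(cpu) : NO_NODE_MODEL;
}

/* A real node ID indexes node_masks, which is invalid when it is NULL. */
static struct idle_mask_model *idle_mask_model(int node)
{
    return node == NO_NODE_MODEL ? &global_mask : &node_masks[node];
}
```

Passing the raw cpu_to_node_model() result to idle_mask_model() while
node_masks is NULL is the invalid access; routing both lookups through
cpu_node_if_enabled_model() selects the global mask instead.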
Fixes: 48849271e6 ("sched_ext: idle: Per-node idle cpumasks")
Cc: stable@vger.kernel.org # v6.15+
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Pull driver core fixes from Danilo Krummrich:
- Generalize driver_override in the driver core, providing a common
sysfs implementation and concurrency-safe accessors for bus
implementations
- Do not use driver_override as IRQ name in the hwmon axi-fan driver
- Remove an unnecessary driver_override check in sh platform_early
- Migrate the platform bus to use the generic driver_override
infrastructure, fixing a UAF condition caused by accessing the
driver_override field without proper locking in the platform_match()
callback
* tag 'driver-core-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
driver core: platform: use generic driver_override infrastructure
sh: platform_early: remove pdev->driver_override check
hwmon: axi-fan: don't use driver_override as IRQ name
docs: driver-model: document driver_override
driver core: generalize driver_override in struct device
The modify logic registers a temporary ftrace_ops object (tmp_ops) to
trigger the slow path for all direct callers, so that attached addresses
can be modified safely.
At the moment we use ops->func_hash for the tmp_ops filter, which represents
all the system's attachments. It's faster to use just the passed hash filter,
which contains only the modified sites and is always a subset of the
ops->func_hash.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Menglong Dong <menglong8.dong@gmail.com>
Cc: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/20260312123738.129926-1-jolsa@kernel.org
Fixes: e93672f770 ("ftrace: Add update_ftrace_direct_mod function")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
When the "copy_trace_marker" option is enabled for an instance, anything
written into /sys/kernel/tracing/trace_marker is also copied into that
instance's buffer. When the option is set, that instance's trace_array
descriptor is added to the marker_copies link list. This list is protected
by RCU, as all iterations use an RCU protected list traversal.
When the instance is deleted, all the flags that were enabled are cleared.
This also clears the copy_trace_marker flag and removes the trace_array
descriptor from the list.
The issue is that after the flags are cleared, a direct call to
update_marker_trace() is performed to clear the flag. This function
returns true if the state of the flag changed and false otherwise. If it
returns true here, synchronize_rcu() is called to make sure all readers
see that it's removed from the list.
But since the flag was already cleared, the state does not change and the
synchronization is never called, leaving a possible UAF bug.
Move the clearing of all flags below the updating of the copy_trace_marker
option which then makes sure the synchronization is performed.
Also use the flag for checking the state in update_marker_trace() instead
of checking whether the list is empty.
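The ordering issue can be reproduced with a small model. The names are
hypothetical and the "syncs" counter merely stands in for calls to
synchronize_rcu(); this is not the tracing code.

```c
#include <stdbool.h>

enum { FLAG_COPY_MARKER = 1 << 0 };

struct tr_model {
    unsigned int flags;
    bool on_list;   /* member of the marker copies list */
    int syncs;      /* stands in for synchronize_rcu() calls */
};

/* Returns true only if the flag state actually changed. */
static bool update_marker_model(struct tr_model *tr, bool enable)
{
    bool was = (tr->flags & FLAG_COPY_MARKER) != 0;

    if (was == enable)
        return false;
    if (enable) {
        tr->flags |= FLAG_COPY_MARKER;
        tr->on_list = true;
    } else {
        tr->flags &= ~FLAG_COPY_MARKER;
        tr->on_list = false;
    }
    return true;
}

/* Clears every flag; the return value of the update is not looked at. */
static void clear_all_flags_model(struct tr_model *tr)
{
    update_marker_model(tr, false);
    tr->flags = 0;
}

/* Buggy order: the flag is already clear, so no synchronization happens. */
static void remove_buggy(struct tr_model *tr)
{
    clear_all_flags_model(tr);
    if (update_marker_model(tr, false))
        tr->syncs++;
}

/* Fixed order: update the marker option first, then clear the rest. */
static void remove_fixed(struct tr_model *tr)
{
    if (update_marker_model(tr, false))
        tr->syncs++;
    clear_all_flags_model(tr);
}
```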
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260318185512.1b6c7db4@gandalf.local.home
Fixes: 7b382efd5e ("tracing: Allow the top level trace_marker to write into another instances")
Reported-by: Sasha Levin <sashal@kernel.org>
Closes: https://lore.kernel.org/all/20260225133122.237275-1-sashal@kernel.org/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The system call trace events call trace_user_fault_read() to read the user
space part of some system calls. This is done by grabbing a per-cpu
buffer, disabling migration, enabling preemption, calling
copy_from_user(), disabling preemption, enabling migration and checking if
the task was preempted while preemption was enabled. If it was, the buffer
is considered corrupted and it tries again.
There's a safety mechanism that will fail out of this loop if it fails 100
times (with a warning). That warning message was triggered in some
pi_futex stress tests. Enabling the sched_switch trace event and
traceoff_on_warning, showed the problem:
pi_mutex_hammer-1375 [006] d..21 138.981648: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981651: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981656: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981659: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981664: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981667: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981671: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981675: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981679: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981682: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981687: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981690: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981695: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981698: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981703: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981706: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981711: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981714: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981719: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981722: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981727: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981730: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
pi_mutex_hammer-1375 [006] d..21 138.981735: sched_switch: prev_comm=pi_mutex_hammer prev_pid=1375 prev_prio=95 prev_state=R+ ==> next_comm=migration/6 next_pid=47 next_prio=0
migration/6-47 [006] d..2. 138.981738: sched_switch: prev_comm=migration/6 prev_pid=47 prev_prio=0 prev_state=S ==> next_comm=pi_mutex_hammer next_pid=1375 next_prio=95
What happened was the task 1375 was flagged to be migrated. When
preemption was enabled, the migration thread woke up to migrate that task,
but failed because migration for that task was disabled. This caused the
loop to fail to exit because the task scheduled out while trying to read
user space.
Every time the task enabled preemption the migration thread would schedule
in, try to migrate the task, fail and let the task continue. But because
the loop would only enable preemption with migration disabled, it would
always fail because each time it enabled preemption to read user space,
the migration thread would try to migrate it.
To solve this, when the loop fails to read user space without being
scheduled out, enable and disable preemption with migration enabled. This
will allow the migration task to successfully migrate the task and the
next iteration should succeed in reading user space without being
scheduled out.
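The livelock and its resolution can be illustrated with a deterministic
toy model. All names are hypothetical and scheduling is reduced to a
counter; this is not the tracing or scheduler code.

```c
#include <stdbool.h>

struct task_model {
    bool migrate_pending;   /* task is flagged to be migrated */
    int switches;           /* number of times it was scheduled out */
};

/*
 * A window with preemption enabled. If a migration is pending, the
 * migration thread runs; it only succeeds when migration is enabled.
 */
static void preempt_window(struct task_model *t, bool migration_enabled)
{
    if (!t->migrate_pending)
        return;
    t->switches++;
    if (migration_enabled)
        t->migrate_pending = false;
}

/*
 * Returns the number of retries needed, or -1 after 100 failed tries.
 * apply_fix adds the extra window with migration enabled after a failure.
 */
static int read_user_model(struct task_model *t, bool apply_fix)
{
    for (int tries = 0; tries < 100; tries++) {
        int before = t->switches;

        preempt_window(t, false);   /* the user copy, migration disabled */
        if (t->switches == before)
            return tries;           /* not scheduled out: buffer is valid */
        if (apply_fix)
            preempt_window(t, true);
    }
    return -1;
}
```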
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260316130734.1858a998@gandalf.local.home
Fixes: 64cf7d058a ("tracing: Have trace_marker use per-cpu data to read user space")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Add multiple test cases for linked register tracking with alu32 ops:
- Add a test that checks sync_linked_regs() regarding reg->id (the linked
target register) for BPF_ADD_CONST32 rather than known_reg->id (the
branch register).
- Add a test case for linked register tracking that exposes the cross-type
sync_linked_regs() bug. One register uses alu32 (w7 += 1, BPF_ADD_CONST32)
and another uses alu64 (r8 += 2, BPF_ADD_CONST64), both linked to the
same base register.
- Add a test case that exercises regsafe() path pruning when two execution
paths reach the same program point with linked registers carrying
different ADD_CONST flags (BPF_ADD_CONST32 from alu32 vs BPF_ADD_CONST64
from alu64). This particular test passes with and without the fix since
the pruning will fail due to different ranges, but it would still be
useful to carry this one as a regression test for the unreachable div
by zero.
With the fix applied all the tests pass:
# LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh -- ./test_progs -t verifier_linked_scalars
[...]
./test_progs -t verifier_linked_scalars
#602/1 verifier_linked_scalars/scalars: find linked scalars:OK
#602/2 verifier_linked_scalars/sync_linked_regs_preserves_id:OK
#602/3 verifier_linked_scalars/scalars_neg:OK
#602/4 verifier_linked_scalars/scalars_neg_sub:OK
#602/5 verifier_linked_scalars/scalars_neg_alu32_add:OK
#602/6 verifier_linked_scalars/scalars_neg_alu32_sub:OK
#602/7 verifier_linked_scalars/scalars_pos:OK
#602/8 verifier_linked_scalars/scalars_sub_neg_imm:OK
#602/9 verifier_linked_scalars/scalars_double_add:OK
#602/10 verifier_linked_scalars/scalars_sync_delta_overflow:OK
#602/11 verifier_linked_scalars/scalars_sync_delta_overflow_large_range:OK
#602/12 verifier_linked_scalars/scalars_alu32_big_offset:OK
#602/13 verifier_linked_scalars/scalars_alu32_basic:OK
#602/14 verifier_linked_scalars/scalars_alu32_wrap:OK
#602/15 verifier_linked_scalars/scalars_alu32_zext_linked_reg:OK
#602/16 verifier_linked_scalars/scalars_alu32_alu64_cross_type:OK
#602/17 verifier_linked_scalars/scalars_alu32_alu64_regsafe_pruning:OK
#602/18 verifier_linked_scalars/alu32_negative_offset:OK
#602/19 verifier_linked_scalars/spurious_precision_marks:OK
#602 verifier_linked_scalars:OK
Summary: 1/19 PASSED, 0 SKIPPED, 0 FAILED
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is
checked on known_reg (the register narrowed by a conditional branch)
instead of reg (the linked target register created by an alu32 operation).
Example case with reg:
1. r6 = bpf_get_prandom_u32()
2. r7 = r6 (linked, same id)
3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU)
4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF])
5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64()
6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4]
Since known_reg above does not have BPF_ADD_CONST32 set, zext_32_to_64()
is never called on alu32-derived linked registers. This causes the verifier
to track incorrect 64-bit bounds, while the CPU correctly zero-extends the
32-bit result.
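The bounds in the example can be checked with plain integer arithmetic.
The helpers below are a hypothetical model, not verifier code, and the
zero-extension helper simply truncates each bound, which is only valid
when both bounds share the same upper 32 bits (as they do here).

```c
#include <stdint.h>

struct range64 { uint64_t min, max; };

/* Propagate "linked = known + delta" using 64-bit arithmetic. */
static struct range64 add_delta64(struct range64 r, uint64_t delta)
{
    struct range64 out = { r.min + delta, r.max + delta };
    return out;
}

/* Model of what the CPU does after a 32-bit ALU op: keep the low 32 bits. */
static struct range64 zext32_model(struct range64 r)
{
    struct range64 out = { (uint32_t)r.min, (uint32_t)r.max };
    return out;
}
```

For r6 in [0xFFFFFFFC, 0xFFFFFFFF] and a delta of 5, add_delta64() alone
gives [0x100000001, 0x100000004]; applying zext32_model() afterwards
gives [1, 4], which matches what the CPU computes for w7 += 5.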
The code checking known_reg->id was correct however (see scalars_alu32_wrap
selftest case), but the real fix needs to handle both directions - zext
propagation should be done when either register has BPF_ADD_CONST32, since
the linked relationship involves a 32-bit operation regardless of which
side has the flag.
Example case with known_reg (exercised also by scalars_alu32_wrap):
1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32)
2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't)
Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32.
Moreover, sync_linked_regs() also has a soundness issue when two linked
registers used different ALU widths: one with BPF_ADD_CONST32 and the
other with BPF_ADD_CONST64. The delta relationship between linked registers
assumes the same arithmetic width though. When one register went through
alu32 (CPU zero-extends the 32-bit result) and the other went through
alu64 (no zero-extension), the propagation produces incorrect bounds.
Example:
r6 = bpf_get_prandom_u32() // fully unknown
if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX]
r7 = r6
w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32
r8 = r6
r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64
if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF]
At the branch on r7, sync_linked_regs() runs with known_reg=r7
(BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path
computes:
r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000
Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is
called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the
CPU does NOT zero-extend it. The actual CPU value of r8 is
0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates
r8's 64-bit bounds, which is a soundness violation.
Fix sync_linked_regs() by skipping propagation when the two registers
have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64).
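The mixed-width example can be replayed concretely for r6 = 0xFFFFFFFE.
The helpers are hypothetical models of the ALU semantics and of the
flawed derivation, not verifier code.

```c
#include <stdint.h>

/* 32-bit ALU add: the result is truncated and zero-extended. */
static uint64_t alu32_add(uint64_t reg, uint32_t imm)
{
    return (uint32_t)((uint32_t)reg + imm);
}

/* 64-bit ALU add: no truncation. */
static uint64_t alu64_add(uint64_t reg, uint64_t imm)
{
    return reg + imm;
}

/*
 * Flawed derivation of r8 from r7: apply the delta difference (2 - 1)
 * and then zero-extend because r7 came from a 32-bit op.
 */
static uint64_t derive_r8_with_zext(uint64_t r7)
{
    return (uint32_t)(r7 + (2 - 1));
}
```

With r6 = 0xFFFFFFFE, alu32_add(r6, 1) is 0xFFFFFFFF and
alu64_add(r6, 2) is 0x100000000, while derive_r8_with_zext() yields 0.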
Lastly, fix regsafe() used for path pruning: the existing checks used
"& BPF_ADD_CONST" to test for offset linkage, which treated
BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent.
Fixes: 7a433e5193 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking")
Reported-by: Jenny Guanni Qu <qguanni@gmail.com>
Co-developed-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Wade says:
====================
bpf: Fix unsound scalar forking for BPF_OR
maybe_fork_scalars() unconditionally sets the pushed path dst register
to 0 for both BPF_AND and BPF_OR. For AND this is correct (0 & K == 0),
but for OR it is wrong (0 | K == K, not 0). This causes the verifier to
track an incorrect value on the pushed path, leading to a verifier/runtime
divergence that allows out-of-bounds map value access.
v4: Use block comment style for multi-line comments in selftests (Amery Hung)
Add Reviewed-by/Acked-by tags
v3: Use single-line comment style in selftests (Alexei Starovoitov)
v2: Use push_stack(env, env->insn_idx, ...) to re-execute the insn
on the pushed path (Eduard Zingerman)
====================
Link: https://patch.msgid.link/20260314021521.128361-1-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add three test cases to verifier_bounds.c to verify that
maybe_fork_scalars() correctly tracks register values for BPF_OR
operations with constant source operands:
1. or_scalar_fork_rejects_oob: After ARSH 63 + OR 8, the pushed
path should have dst = 8. With value_size = 8, accessing
map_value + 8 is out of bounds and must be rejected.
2. and_scalar_fork_still_works: Regression test ensuring AND
forking continues to work. ARSH 63 + AND 4 produces pushed
dst = 0 and current dst = 4, both within value_size = 8.
3. or_scalar_fork_allows_inbounds: After ARSH 63 + OR 4, the
pushed path has dst = 4, which is within value_size = 8
and should be accepted.
These tests exercise the fix in the previous patch, which makes the
pushed path re-execute the ALU instruction so it computes the correct
result for BPF_OR.
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-3-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
maybe_fork_scalars() is called for both BPF_AND and BPF_OR when the
source operand is a constant. When dst has signed range [-1, 0], it
forks the verifier state: the pushed path gets dst = 0, the current
path gets dst = -1.
For BPF_AND this is correct: 0 & K == 0.
For BPF_OR this is wrong: 0 | K == K, not 0.
The pushed path therefore tracks dst as 0 when the runtime value is K,
producing an exploitable verifier/runtime divergence that allows
out-of-bounds map access.
Fix this by passing env->insn_idx (instead of env->insn_idx + 1) to
push_stack(), so the pushed path re-executes the ALU instruction with
dst = 0 and naturally computes the correct result for any opcode.
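The difference between the two opcodes on the pushed path can be checked
directly. This is a hypothetical model of the fork, not verifier code:
without re-execution the pushed path assumes 0, with re-execution it
applies the opcode to dst = 0.

```c
#include <stdint.h>

enum alu_op { OP_AND, OP_OR };

/* Apply the ALU operation with constant operand k. */
static int64_t alu_apply(enum alu_op op, int64_t dst, int64_t k)
{
    return op == OP_AND ? (dst & k) : (dst | k);
}

/*
 * Value tracked for dst on the pushed path (where dst was 0 before
 * the op). reexecute selects the fixed behavior.
 */
static int64_t pushed_path_dst(enum alu_op op, int64_t k, int reexecute)
{
    return reexecute ? alu_apply(op, 0, k) : 0;
}
```

For OR with K = 8 the old behavior tracks 0 while the runtime value is 8;
re-executing the instruction yields 8, and AND still yields 0.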
Fixes: bffacdb80b ("bpf: Recognize special arithmetic shift in the verifier")
Signed-off-by: Daniel Wade <danjwade95@gmail.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260314021521.128361-2-danjwade95@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Jenny Guanni Qu says:
====================
bpf: Fix abs(INT_MIN) undefined behavior in interpreter sdiv/smod
The BPF interpreter's signed 32-bit division and modulo handlers use
abs() on s32 operands, which is undefined for S32_MIN. This causes
the interpreter to compute wrong results, creating a mismatch with
the verifier's range tracking.
For example, INT_MIN / 2 returns 0x40000000 instead of the correct
0xC0000000. The verifier tracks the correct range, so a crafted BPF
program can exploit the mismatch for out-of-bounds map value access
(confirmed by KASAN).
Patch 1 introduces abs_s32() which handles S32_MIN correctly and
replaces all 8 abs((s32)...) call sites. s32 is the only affected
case -- the s64 handlers do not use abs().
Patch 2 adds selftests covering sdiv32 and smod32 with INT_MIN
dividend to prevent regression.
Changes since v4:
- Renamed __safe_abs32() to abs_s32() and dropped inline keyword
per Alexei Starovoitov's feedback
Changes since v3:
- Fixed stray blank line deletion in the file header
- Improved comment per Yonghong Song's suggestion
- Added JIT vs interpreter context to selftest commit message
Changes since v2:
- Simplified to use -(u32)x per Mykyta Yatsenko's suggestion
Changes since v1:
- Moved helper above kerneldoc comment block to fix build warnings
====================
Link: https://patch.msgid.link/20260311011116.2108005-1-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add tests to verify that signed 32-bit division and modulo operations
produce correct results when the dividend is INT_MIN (0x80000000).
The bug fixed in the previous commit only affects the BPF interpreter
path. When JIT is enabled (the default on most architectures), the
native CPU division instruction produces the correct result and these
tests pass regardless. With bpf_jit_enable=0, the interpreter is used
and without the previous fix, INT_MIN / 2 incorrectly returns
0x40000000 instead of 0xC0000000 due to abs(S32_MIN) undefined
behavior, causing these tests to fail.
Test cases:
- SDIV32 INT_MIN / 2 = -1073741824 (imm and reg divisor)
- SMOD32 INT_MIN % 2 = 0 (positive and negative divisor)
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-3-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The BPF interpreter's signed 32-bit division and modulo handlers use
the kernel abs() macro on s32 operands. The abs() macro documentation
(include/linux/math.h) explicitly states the result is undefined when
the input is the type minimum. When DST contains S32_MIN (0x80000000),
abs((s32)DST) triggers undefined behavior and returns S32_MIN unchanged
on arm64/x86. This value is then sign-extended to u64 as
0xFFFFFFFF80000000, causing do_div() to compute the wrong result.
The verifier's abstract interpretation (scalar32_min_max_sdiv) computes
the mathematically correct result for range tracking, creating a
verifier/interpreter mismatch that can be exploited for out-of-bounds
map value access.
Introduce abs_s32() which handles S32_MIN correctly by casting to u32
before negating, avoiding signed overflow entirely. Replace all 8
abs((s32)...) call sites in the interpreter's sdiv32/smod32 handlers.
s32 is the only affected case -- the s64 division/modulo handlers do
not use abs().
Fixes: ec0e2da95f ("bpf: Support new signed div/mod instructions.")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Link: https://lore.kernel.org/r/20260311011116.2108005-2-qguanni@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
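The helper described above can be modelled in user-space C. This is a
minimal sketch, not the kernel code: abs_s32() is the name used by the
patch, while sdiv32_model() and smod32_model() are hypothetical stand-ins
for the interpreter's handlers (which use do_div() in the kernel), and a
nonzero divisor is assumed.

```c
#include <stdint.h>

/* Magnitude of a signed 32-bit value without signed overflow: the cast
 * to unsigned happens before the negation, so abs_s32(INT32_MIN) is
 * well defined and yields 0x80000000. */
static uint32_t abs_s32(int32_t x)
{
	return x >= 0 ? (uint32_t)x : -(uint32_t)x;
}

/* Hypothetical model of a signed 32-bit divide built from an unsigned
 * divide plus a sign fix-up. Assumes src != 0. */
static uint32_t sdiv32_model(int32_t dst, int32_t src)
{
	uint32_t q = abs_s32(dst) / abs_s32(src);

	return ((dst < 0) == (src < 0)) ? q : -q;
}

/* Hypothetical model of signed 32-bit modulo: the result takes the sign
 * of the dividend. Assumes src != 0. */
static uint32_t smod32_model(int32_t dst, int32_t src)
{
	uint32_t r = abs_s32(dst) % abs_s32(src);

	return dst < 0 ? -r : r;
}
```

With this model, INT_MIN / 2 evaluates to 0xC0000000 and INT_MIN % 2 to 0,
matching the values the selftests expect.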
Add test cases to ensure the verifier correctly rejects bpf_throw from
subprogs when RCU, preempt, or IRQ locks are held:
* reject_subprog_rcu_lock_throw: subprog acquires bpf_rcu_read_lock and
then calls bpf_throw
* reject_subprog_throw_preempt_lock: always-throwing subprog called while
caller holds bpf_preempt_disable
* reject_subprog_throw_irq_lock: always-throwing subprog called while
caller holds bpf_local_irq_save
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.
At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as a result.
Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.
Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. The test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.
The spin_lock case is not affected because it is already checked [1]
at the call site in do_check_insn() before bpf_throw can run.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098
Assisted-by: Claude:claude-opus-4-6
Fixes: f18b03faba ("bpf: Implement BPF exceptions")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
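The change to the check_lock decision can be sketched as a small
predicate. This is an illustrative model only: exit_checks_locks() and
its parameters are hypothetical names, not the verifier's actual code.

```c
#include <stdbool.h>

/* Hypothetical model: should lock state be validated at this exit?
 * curframe is the call depth (0 = main program); exception_exit is
 * true when the exit is reached through bpf_throw(). */
static bool exit_checks_locks(int curframe, bool exception_exit)
{
	/* Old rule: only the outermost frame was checked (!curframe).
	 * New rule: an exception exit is always checked, because the
	 * unwinder bypasses the cleanup of every intermediate frame. */
	return exception_exit || curframe == 0;
}
```

A normal exit from a subprog (curframe > 0) still skips the lock check,
since the caller is expected to release its locks after the call returns.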
i2c-fixes for v7.0-rc5
pxa: fix broken I2C communication on Armada 3700 with recovery
fsi: fix device_node reference leak in probe
cp2615: fix NULL-deref when serial string is missing
Pull hwmon fixes from Guenter Roeck:
- max6639: Fix pulses-per-revolution implementation
- Several PMBus drivers: Add missing error checks
* tag 'hwmon-for-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (max6639) Fix pulses-per-revolution implementation
hwmon: (pmbus/isl68137) Fix unchecked return value and use sysfs_emit()
hwmon: (pmbus/ina233) Add error check for pmbus_read_word_data() return value
hwmon: (pmbus/mp2869) Check pmbus_read_byte_data() before using its return value
hwmon: (pmbus/mp2975) Add error check for pmbus_read_word_data() return value
hwmon: (pmbus/hac300s) Add error check for pmbus_read_word_data() return value
Pull bootconfig fixes from Masami Hiramatsu:
- Check error code of xbc_init_node() in override value path in
xbc_parse_kv()
- Fix fd leak in load_xbc_file() on fstat failure
* tag 'bootconfig-fixes-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tools/bootconfig: fix fd leak in load_xbc_file() on fstat failure
lib/bootconfig: check xbc_init_node() return in override path
Pull btrfs fixes from David Sterba:
"Another batch of fixes for problems that have been identified by tools
analyzing code or by fuzzing. Most of them are short, two patches fix
the same thing in many places so the diffs are bigger.
- handle potential NULL pointer errors after attempting to read
extent and checksum trees
- prevent ENOSPC when creating many qgroups by ioctls in the same
transaction
- encoded write ioctl fixes (with 64K page and 4K block size):
- fix unexpected bio length
- do not let compressed bios and pages interfere with page cache
- compression fixes on setups with 64K page and 4K block size: fix
folio length assertions (zstd and lzo)
- remap tree fixes:
- make sure to hold block group reference while moving it
- handle early exit when moving block group to unused list
- handle deleted subvolumes with inconsistent state of deletion
progress"
* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: reject root items with drop_progress and zero drop_level
btrfs: check block group before marking it unused in balance_remap_chunks()
btrfs: hold block group reference during entire move_existing_remap()
btrfs: fix an incorrect ASSERT() condition inside lzo_decompress_bio()
btrfs: fix an incorrect ASSERT() condition inside zstd_decompress_bio()
btrfs: do not touch page cache for encoded writes
btrfs: fix a bug that makes encoded write bio larger than expected
btrfs: reserve enough transaction items for qgroup ioctls
btrfs: check for NULL root after calls to btrfs_csum_root()
btrfs: check for NULL root after calls to btrfs_extent_root()
Validate host controlled value `quote_buf->out_len` that determines how
many bytes of the quote are copied out to guest userspace. In TDX
environments with remote attestation, quotes are not considered private,
and can be forwarded to an attestation server.
Catch scenarios where the host specifies a response length larger than
the guest's allocation, or otherwise races modifying the response while
the guest consumes it.
This prevents contents beyond the pages allocated for `quote_buf`
(up to TSM_REPORT_OUTBLOB_MAX) from being read out to guest userspace,
and possibly forwarded in attestation requests.
Recall that some deployments want per-container configfs-tsm-report
interfaces, so the leak may cross container protection boundaries, not
just local root.
Fixes: f4738f56d1 ("virt: tdx-guest: Add Quote generation support using TSM_REPORTS")
Cc: stable@vger.kernel.org
Signed-off-by: Zubin Mithra <zsm@google.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
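The kind of validation described above can be sketched in user-space C.
This is a minimal model under stated assumptions: quote_len_checked(),
its parameters and the -1 error value are hypothetical, and
payload_capacity stands for the number of quote bytes that fit in the
guest's allocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 and stores the validated length in *len, or -1 if the
 * host-supplied length exceeds what the guest allocated. */
static int quote_len_checked(const volatile uint32_t *host_out_len,
			     size_t payload_capacity, size_t *len)
{
	/* Read the host-controlled field exactly once, so a concurrent
	 * host write cannot change it between the check and the copy. */
	uint32_t out_len = *host_out_len;

	if (out_len > payload_capacity)
		return -1;

	*len = out_len;
	return 0;
}
```

The caller would then copy exactly *len bytes and never re-read the
shared field.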
Sabrina Dubroca says:
====================
rtnetlink: add missing attributes in if_nlmsg_size
Once again we have some attributes added by rtnl_fill_ifinfo() that
aren't counted in if_nlmsg_size().
====================
Link: https://patch.msgid.link/cover.1773919462.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_rx_splice() allocates one smc_spd_priv per pipe_buffer and stores
the pointer in pipe_buffer.private. The pipe_buf_operations for these
buffers used .get = generic_pipe_buf_get, which only increments the page
reference count when tee(2) duplicates a pipe buffer. The smc_spd_priv
pointer itself was not handled, so after tee() both the original and the
cloned pipe_buffer share the same smc_spd_priv *.
When both pipes are subsequently released, smc_rx_pipe_buf_release() is
called twice against the same object:
1st call: kfree(priv) sock_put(sk) smc_rx_update_cons() [correct]
2nd call: kfree(priv) sock_put(sk) smc_rx_update_cons() [UAF]
KASAN reports a slab-use-after-free in smc_rx_pipe_buf_release(), which
then escalates to a NULL-pointer dereference and kernel panic via
smc_rx_update_consumer() when it chases the freed priv->smc pointer:
BUG: KASAN: slab-use-after-free in smc_rx_pipe_buf_release+0x78/0x2a0
Read of size 8 at addr ffff888004a45740 by task smc_splice_tee_/74
Call Trace:
<TASK>
dump_stack_lvl+0x53/0x70
print_report+0xce/0x650
kasan_report+0xc6/0x100
smc_rx_pipe_buf_release+0x78/0x2a0
free_pipe_info+0xd4/0x130
pipe_release+0x142/0x160
__fput+0x1c6/0x490
__x64_sys_close+0x4f/0x90
do_syscall_64+0xa6/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:smc_rx_update_consumer+0x8d/0x350
Call Trace:
<TASK>
smc_rx_pipe_buf_release+0x121/0x2a0
free_pipe_info+0xd4/0x130
pipe_release+0x142/0x160
__fput+0x1c6/0x490
__x64_sys_close+0x4f/0x90
do_syscall_64+0xa6/0x1a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Kernel panic - not syncing: Fatal exception
Beyond the memory-safety problem, duplicating an SMC splice buffer is
semantically questionable: smc_rx_update_cons() would advance the
consumer cursor twice for the same data, corrupting receive-window
accounting. A refcount on smc_spd_priv could fix the double-free, but
the cursor-accounting issue would still need to be addressed separately.
The .get callback is invoked by both tee(2) and splice_pipe_to_pipe()
for partial transfers; both will now return -EFAULT. Users who need
to duplicate SMC socket data must use a copy-based read path.
Fixes: 9014db202c ("smc: add support for splice()")
Signed-off-by: Qi Tang <tpluszz77@gmail.com>
Link: https://patch.msgid.link/20260318064847.23341-1-tpluszz77@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
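One way to obtain the -EFAULT behaviour described above is a .get
callback that always refuses duplication. The sketch below is a
user-space model, not the SMC code: the *_model types, no_dup_buf_get()
and dup_buf_model() are hypothetical names that only mimic the shape of
pipe_buf_operations.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct pipe_buf_model;

/* Reduced stand-in for pipe_buf_operations: only the .get hook. */
struct pipe_buf_ops_model {
	bool (*get)(struct pipe_buf_model *buf);
};

struct pipe_buf_model {
	const struct pipe_buf_ops_model *ops;
	void *private;	/* per-buffer state that must not be shared */
};

/* A .get hook that always refuses: the buffer cannot be duplicated. */
static bool no_dup_buf_get(struct pipe_buf_model *buf)
{
	(void)buf;
	return false;
}

static const struct pipe_buf_ops_model no_dup_ops = {
	.get = no_dup_buf_get,
};

/* Model of a tee()-style duplication: fails with -EFAULT when .get
 * refuses, so dst never ends up aliasing src->private. */
static int dup_buf_model(struct pipe_buf_model *src,
			 struct pipe_buf_model *dst)
{
	if (!src->ops->get(src))
		return -EFAULT;

	*dst = *src;
	return 0;
}
```

Because the copy never happens, only one buffer ever owns the private
pointer and the release path runs once per allocation.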
ovs_netdev_tunnel_destroy() may run after NETDEV_UNREGISTER already
detached the device. Dropping the netdev reference in destroy can race
with concurrent readers that still observe vport->dev.
Do not release vport->dev in ovs_netdev_tunnel_destroy(). Instead, let
vport_netdev_free() drop the reference from the RCU callback, matching
the non-tunnel destroy path and avoiding additional synchronization
under RTNL.
Fixes: a9020fde67 ("openvswitch: Move tunnel destroy function to oppenvswitch module.")
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Tested-by: Ao Zhou <n05ec@lzu.edu.cn>
Co-developed-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Yang Yang <n05ec@lzu.edu.cn>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Link: https://patch.msgid.link/20260319074241.3405262-1-n05ec@lzu.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kevin Hao says:
====================
net: macb: Fix two lock warnings when WOL is used
This patch series addresses two lock warnings that occur when using WOL as a
wakeup source on my AMD ZynqMP board.
====================
Link: https://patch.msgid.link/20260318-macb-irq-v2-0-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Access to net_device::ip_ptr and its associated members must be
protected by an RCU lock. Since we are modifying this piece of code,
let's also move it to execute only when WAKE_ARP is enabled.
To minimize the duration of the RCU lock, a local variable is used to
temporarily store the IP address. This change resolves the following
RCU check warning:
WARNING: suspicious RCU usage
7.0.0-rc3-next-20260310-yocto-standard+ #122 Not tainted
-----------------------------
drivers/net/ethernet/cadence/macb_main.c:5944 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
5 locks held by rtcwake/518:
#0: ffff000803ab1408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008090bf088 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
stack backtrace:
CPU: 3 UID: 0 PID: 518 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #122 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
lockdep_rcu_suspicious+0x134/0x1d8
macb_suspend+0xd8/0x4c0
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 0cb8de39a7 ("net: macb: Add ARP support to WOL")
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Link: https://patch.msgid.link/20260318-macb-irq-v2-2-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The devm_free_irq() and devm_request_irq() functions should not be
executed in an atomic context.
During device suspend, all userspace processes and most kernel threads
are frozen. Additionally, we flush all tx/rx status, disable all macb
interrupts, and halt rx operations. Therefore, it is safe to split the
region protected by bp->lock into two independent sections, allowing
devm_free_irq() and devm_request_irq() to run in a non-atomic context.
This modification resolves the following lockdep warning:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 501, name: rtcwake
preempt_count: 1, expected: 0
RCU nest depth: 1, expected: 0
7 locks held by rtcwake/501:
#0: ffff0008038c3408 (sb_writers#5){.+.+}-{0:0}, at: vfs_write+0xf8/0x368
#1: ffff0008049a5e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0xbc/0x1c8
#2: ffff00080098d588 (kn->active#70){.+.+}-{0:0}, at: kernfs_fop_write_iter+0xcc/0x1c8
#3: ffff800081c84888 (system_transition_mutex){+.+.}-{4:4}, at: pm_suspend+0x1ec/0x290
#4: ffff0008009ba0f8 (&dev->mutex){....}-{4:4}, at: device_suspend+0x118/0x4f0
#5: ffff800081d00458 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
#6: ffff0008031fb9e0 (&bp->lock){-.-.}-{3:3}, at: macb_suspend+0x144/0x558
irq event stamp: 8682
hardirqs last enabled at (8681): [<ffff8000813c7d7c>] _raw_spin_unlock_irqrestore+0x44/0x88
hardirqs last disabled at (8682): [<ffff8000813c7b58>] _raw_spin_lock_irqsave+0x38/0x98
softirqs last enabled at (7322): [<ffff8000800f1b4c>] handle_softirqs+0x52c/0x588
softirqs last disabled at (7317): [<ffff800080010310>] __do_softirq+0x20/0x2c
CPU: 1 UID: 0 PID: 501 Comm: rtcwake Not tainted 7.0.0-rc3-next-20260310-yocto-standard+ #125 PREEMPT
Hardware name: ZynqMP ZCU102 Rev1.1 (DT)
Call trace:
show_stack+0x24/0x38 (C)
__dump_stack+0x28/0x38
dump_stack_lvl+0x64/0x88
dump_stack+0x18/0x24
__might_resched+0x200/0x218
__might_sleep+0x38/0x98
__mutex_lock_common+0x7c/0x1378
mutex_lock_nested+0x38/0x50
free_irq+0x68/0x2b0
devm_irq_release+0x24/0x38
devres_release+0x40/0x80
devm_free_irq+0x48/0x88
macb_suspend+0x298/0x558
device_suspend+0x218/0x4f0
dpm_suspend+0x244/0x3a0
dpm_suspend_start+0x50/0x78
suspend_devices_and_enter+0xec/0x560
pm_suspend+0x194/0x290
state_store+0x110/0x158
kobj_attr_store+0x1c/0x30
sysfs_kf_write+0xa8/0xd0
kernfs_fop_write_iter+0x11c/0x1c8
vfs_write+0x248/0x368
ksys_write+0x7c/0xf8
__arm64_sys_write+0x28/0x40
invoke_syscall+0x4c/0xe8
el0_svc_common+0x98/0xf0
do_el0_svc+0x28/0x40
el0_svc+0x54/0x1e0
el0t_64_sync_handler+0x84/0x130
el0t_64_sync+0x198/0x1a0
Fixes: 558e35ccfe ("net: macb: WoL support for GEM type of Ethernet controller")
Cc: stable@vger.kernel.org
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Link: https://patch.msgid.link/20260318-macb-irq-v2-1-f1179768ab24@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
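The effect of splitting the locked region can be illustrated with a toy
user-space model. Everything here is a hypothetical stand-in:
lock_model()/unlock_model() represent taking and releasing bp->lock,
irq_op_may_sleep() represents devm_free_irq()/devm_request_irq(), and
the counter records calls that would have slept in atomic context.

```c
#include <stdbool.h>

static bool lock_held;		/* stand-in for bp->lock being held */
static int sleep_in_atomic;	/* sleeping calls made under the lock */

static void lock_model(void)   { lock_held = true; }
static void unlock_model(void) { lock_held = false; }

/* Stand-in for a function that may sleep. */
static void irq_op_may_sleep(void)
{
	if (lock_held)
		sleep_in_atomic++;
}

/* Stand-in for register accesses that do need the lock. */
static void reg_write_model(void) { }

/* Old shape: one locked region that also covers the sleeping call. */
static int suspend_single_section(void)
{
	sleep_in_atomic = 0;
	lock_model();
	reg_write_model();
	irq_op_may_sleep();
	reg_write_model();
	unlock_model();
	return sleep_in_atomic;
}

/* New shape: two locked regions, sleeping call in between. */
static int suspend_split_sections(void)
{
	sleep_in_atomic = 0;
	lock_model();
	reg_write_model();
	unlock_model();

	irq_op_may_sleep();

	lock_model();
	reg_write_model();
	unlock_model();
	return sleep_in_atomic;
}
```

The split is only safe because, as the message explains, interrupts are
already disabled and rx is halted at that point in the suspend path.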
Pull drm fixes from Dave Airlie:
"Regular weekly pull request, from sunny San Diego. Usual suspects in
xe/i915/amdgpu with small fixes all over, then some minor fixes across
a few other drivers. It's probably a bit on the heavy side, but most
of the fixes seem well contained.
core:
- drm_dev_unplug UAF fix
pagemap:
- lock handling fix
xe:
- A number of teardown fixes
- Skip over non-leaf PTE for PRL generation
- Fix an uninitialized variable
- Fix a missing runtime PM reference
i915/display:
- Fix #15771: Screen corruption and stuttering on P14s w/ 3K display
- Fix for PSR entry setup frames count on rejected commit
- Fix OOPS if firmware is not loaded and suspend is attempted
- Fix unlikely NULL deref due to DC6 on probe
amdgpu:
- Fix gamma 2.2 colorop TFs
- BO list fix
- LTO fix
- DC FP fix
- DisplayID handling fix
- DCN 2.01 fix
- MMHUB boundary fixes
- ISP fix
- TLB fence fix
- Hainan pm fix
radeon:
- Hainan pm fix
vmwgfx:
- memory leak fix
- doc warning fix
imagination:
- deadlock fix
- interrupt handling fixes
dw-hdmi-qp:
- multi channel audio fix"
* tag 'drm-fixes-2026-03-21' of https://gitlab.freedesktop.org/drm/kernel: (40 commits)
drm/xe: Fix missing runtime PM reference in ccs_mode_store
drm/xe: Open-code GGTT MMIO access protection
drm/xe/lrc: Fix uninitialized new_ts when capturing context timestamp
drm/xe/oa: Allow reading after disabling OA stream
drm/xe: Skip over non leaf pte for PRL generation
drm/xe/guc: Ensure CT state transitions via STOP before DISABLED
drm/xe: Trigger queue cleanup if not in wedged mode 2
drm/xe: Forcefully tear down exec queues in GuC submit fini
drm/xe: Always kill exec queues in xe_guc_submit_pause_abort
drm/xe/guc: Fail immediately on GuC load error
drm/i915/gt: Check set_default_submission() before deferencing
drm/radeon: apply state adjust rules to some additional HAINAN vairants
drm/amdgpu: apply state adjust rules to some additional HAINAN vairants
drm/amdgpu: rework how we handle TLB fences
drm/bridge: dw-hdmi-qp: fix multi-channel audio output
drm: Fix use-after-free on framebuffers and property blobs when calling drm_dev_unplug
drm/amdgpu: Fix ISP segfault issue in kernel v7.0
drm/amdgpu/gmc9.0: add bounds checking for cid
drm/amdgpu/mmhub4.2.0: add bounds checking for cid
drm/amdgpu/mmhub4.1.0: add bounds checking for cid
...
The valid range for the pulses-per-revolution devicetree property is
1..4. The current code checks for a range of 1..5. Fix it.
Declare the variable used to retrieve pulses per revolution from
devicetree as u32 (unsigned) to match the of_property_read_u32() API.
The current code uses a postfix decrement when writing the pulses per
revolution into the chip. This has no effect since the value is evaluated
before it is decremented. Fix it by decrementing before evaluating the
value.
Fixes: 7506ebcd66 ("hwmon: (max6639) : Configure based on DT property")
Cc: Naresh Solanki <naresh.solanki@9elements.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
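The difference between the postfix and prefix decrement, and the
corrected range check, can be shown in isolation. This is a sketch with
hypothetical function names; it does not reproduce the driver's register
layout.

```c
#include <stdint.h>

/* Buggy pattern: the postfix decrement yields the value from before the
 * decrement, so the caller still sees the undecremented ppr. */
static uint32_t ppr_field_postfix(uint32_t ppr)
{
	return ppr--;
}

/* Fixed pattern: decrement first, then use the value. */
static uint32_t ppr_field_prefix(uint32_t ppr)
{
	return --ppr;
}

/* Valid pulses-per-revolution values are 1..4.
 * Returns 0 if ppr is in range, -1 otherwise. */
static int ppr_validate(uint32_t ppr)
{
	return (ppr >= 1 && ppr <= 4) ? 0 : -1;
}
```

For ppr = 4 the postfix form passes 4 on to the chip, while the prefix
form passes 3.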
Pull execve fixes from Kees Cook:
- binfmt_elf_fdpic: fix AUXV size calculation (Andrei Vagin)
- fs/tests: exec: Remove bad test vector
* tag 'execve-v7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
fs/tests: exec: Remove bad test vector
binfmt_elf_fdpic: fix AUXV size calculation for ELF_HWCAP3 and ELF_HWCAP4
Pull tty/serial fixes from Greg KH:
"Here are some small tty/vt and serial driver fixes for 7.0-rc5.
Included in here are:
- 8250 driver fixes for reported problems
- serial core lockup fix
- uartlite driver bugfix
- vt save/restore bugfix
All of these have been in linux-next for over a week with no reported
problems"
* tag 'tty-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
vt: save/restore unicode screen buffer for alternate screen
serial: 8250_dw: Ensure BUSY is deasserted
serial: 8250: Add late synchronize_irq() to shutdown to handle DW UART BUSY
serial: 8250_dw: Rework IIR_NO_INT handling to stop interrupt storm
serial: 8250_dw: Rework dw8250_handle_irq() locking and IIR handling
serial: 8250: Add serial8250_handle_irq_locked()
serial: 8250_dw: Avoid unnecessary LCR writes
serial: 8250: Protect LCR write in shutdown
serial: 8250_pci: add support for the AX99100
serial: core: fix infinite loop in handle_tx() for PORT_UNKNOWN
serial: uartlite: fix PM runtime usage count underflow on probe
serial: 8250: always disable IRQ during THRE test
serial: 8250: Fix TX deadlock when using DMA
Commit 150a04d817 ("compiler_types.h: Attributes: Add __counted_by_ptr
macro") used Clang 22.0.0 as a minimum supported version for
__counted_by_ptr, which made sense while 22.0.0 was the version of
LLVM's main branch to allow developers to easily test and develop uses
of __counted_by_ptr in their code. However, __counted_by_ptr requires a
change [1] merged towards the end of the 22 development cycle to avoid
errors when applied to void pointers.
In file included from fs/xfs/xfs_attr_inactive.c:18:
fs/xfs/libxfs/xfs_attr.h:59:2: error: 'counted_by' cannot be applied to a pointer with pointee of unknown size because 'void' is an incomplete type
59 | void *buffer __counted_by_ptr(bufsize);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is disruptive for deployed prerelease clang-22 builds (such as
Android LLVM) or when bisecting between llvmorg-21-init and the fix.
Require a released version of clang-22 (i.e., 22.1.0 or newer) to
enable __counted_by_ptr to ensure all fixes needed for proper support
are present.
Fixes: 150a04d817 ("compiler_types.h: Attributes: Add __counted_by_ptr macro")
Link: f29955a594 [1]
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260318-counted_by_ptr-release-clang-22-v1-1-e017da246df0@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
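The version gate can be sketched as a comparison of encoded version
numbers. This is an illustrative model, assuming the first released
clang-22 is 22.1.0: VERSION_CODE() mirrors the major * 10000 +
minor * 100 + patch encoding of the kernel's CLANG_VERSION, and
clang_has_counted_by_ptr() is a hypothetical helper, not the actual
preprocessor condition in compiler_types.h.

```c
/* Encode a compiler version as major * 10000 + minor * 100 + patch. */
#define VERSION_CODE(major, minor, patch) \
	((major) * 10000 + (minor) * 100 + (patch))

/* Hypothetical predicate: is this Clang a released 22.x or newer?
 * Prerelease builds report 22.0.0 and are rejected. */
static int clang_has_counted_by_ptr(int major, int minor, int patch)
{
	return VERSION_CODE(major, minor, patch) >= VERSION_CODE(22, 1, 0);
}
```

A prerelease build reporting 22.0.0 encodes to 220000 and fails the
check, while 22.1.0 encodes to 220100 and passes.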
When a socket is deconfigured, it's mapped to SOCK_EMPTY (0xffff). This causes
a panic while allocating UV hub info structures.
Fix this by using NUMA_NO_NODE, allowing UV hub info structures to be
allocated on valid nodes.
Fixes: 8a50c58519 ("x86/platform/uv: UV support for sub-NUMA clustering")
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/ab2BmGL0ehVkkjKk@hpe.com
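The mapping described above can be sketched as follows. This is a
minimal model with hypothetical names: SOCK_EMPTY_MODEL uses the 0xffff
value given in the text, NUMA_NO_NODE_MODEL uses -1 (the value of
NUMA_NO_NODE in the kernel), and socket_to_alloc_node() is not the
function the patch touches.

```c
#define SOCK_EMPTY_MODEL	0xffff	/* deconfigured socket marker */
#define NUMA_NO_NODE_MODEL	(-1)	/* "no node preference" */

/* Translate a socket's mapped node into a node id that is safe to pass
 * to a node-aware allocator. */
static int socket_to_alloc_node(unsigned short mapped_node)
{
	if (mapped_node == SOCK_EMPTY_MODEL)
		return NUMA_NO_NODE_MODEL;

	return mapped_node;
}
```

With no node preference, the allocator is free to pick any valid node
instead of indexing node 0xffff.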
Pull io_uring fixes from Jens Axboe:
- A bit of a work-around for AF_UNIX recv multishot, as the in-kernel
implementation doesn't properly signal EOF. We'll likely rework this
one going forward, but the fix is sufficient for now
- Two fixes for incrementally consumed buffers, for non-pollable files
and for 0 byte reads
* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/kbuf: propagate BUF_MORE through early buffer commit path
io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
io_uring/poll: fix multishot recv missing EOF on wakeup race
Pull spi fixes from Mark Brown:
"There's a couple of core fixes here from Johan, fixing a race
condition and an error handling path, plus a bunch of driver specific
fixups.
The Qualcomm issues could be nasty if you ran into them, especially
the DMA ordering one"
* tag 'spi-fix-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: geni-qcom: Check DMA interrupts early in ISR
spi: fix statistics allocation
spi: fix use-after-free on controller registration failure
spi: geni-qcom: Fix CPHA and CPOL mode change detection
spi: axiado: Fix double-free in ax_spi_probe()
spi: amlogic-spisg: Fix memory leak in aml_spisg_probe()
spi: amlogic: spifc-a4: Remove redundant clock cleanup
Pull regulator fix from Mark Brown:
"Just one fix here from Hugo Villeneuve, the documentation for some of
the regulator DT properties had been cut'n'pasted so that if anyone
actually read it they'd be informed that those properties had
completely incorrect meanings"
* tag 'regulator-fix-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: dt-bindings: fix typos in regulator-uv-* descriptions
Pull pmdomain fixes from Ulf Hansson:
- bcm: increase ASB control timeout for bcm2835
- mediatek: fix power domain count
* tag 'pmdomain-v7.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
pmdomain: bcm: bcm2835-power: Increase ASB control timeout
pmdomain: mediatek: Fix power domain count
Pull ata fixes from Niklas Cassel:
- ADATA SU680 SSDs are causing command timeouts when LPM is enabled.
Enable the ATA_QUIRK_NOLPM quirk to prevent LPM from being enabled
on these devices (Damien)
- When receiving a REPORT SUPPORTED OPERATION CODES command with an
invalid REPORTING OPTIONS format, sense data should have the field
pointer set to byte 2 (the location of the REPORTING OPTIONS field)
instead of incorrectly pointing to byte 1 (Damien)
* tag 'ata-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-scsi: report correct sense field pointer in ata_scsiop_maint_in()
ata: libata-core: disable LPM on ADATA SU680 SSD
Pull MTD fixes from Miquel Raynal:
- In SPI NOR, there was an issue with the RDCR capability, leading to
several platforms no longer capable of using it for wrong reasons
(the follow-up commit renames the helper to avoid future confusion)
- NAND controller drivers needed to be improved to fix some timings, a
locking scenario and avoid certain operations during panic writes
- The Spear600 DT binding conversion was done partially, leading to
several warnings which have individually been fixed
- Tudor gets replaced by Takahiro for the SPI NOR maintenance
- Plus two more misc fixes
* tag 'mtd/fixes-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: pl353: make sure optimal timings are applied
mtd: spi-nor: Rename spi_nor_spimem_check_op()
mtd: spi-nor: Fix RDCR controller capability core check
mtd: rawnand: brcmnand: skip DMA during panic write
mtd: rawnand: serialize lock/unlock against other NAND operations
dt-bindings: mtd: st,spear600-smi: Fix example
dt-bindings: mtd: st,spear600-smi: #address/size-cells is mandatory
dt-bindings: mtd: st,spear600-smi: Fix description
mtd: rawnand: cadence: Fix error check for dma_alloc_coherent() in cadence_nand_init()
mtd: Avoid boot crash in RedBoot partition table parser
MAINTAINERS: add Takahiro Kuwano as SPI NOR reviewer
MAINTAINERS: remove Tudor Ambarus as SPI NOR maintainer
Pull iommu fixes from Joerg Roedel:
"Intel VT-d:
- Abort all pending requests on dev_tlb_inv timeout to avoid
hardlockup
- Limit IOPF handling to PRI-capable device to avoid SVA attach
failure
AMD-Vi:
- Make sure identity domain is not used when SNP is active
Core fixes:
- Handle mapping IOVA 0x0 correctly
- Fix crash in SVA code
- Kernel-doc fix in IO-PGTable code"
* tag 'iommu-fixes-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/amd: Block identity domain when SNP enabled
iommu/sva: Fix crash in iommu_sva_unbind_device()
iommu/io-pgtable: fix all kernel-doc warnings in io-pgtable.h
iommu: Fix mapping check for 0x0 to avoid re-mapping it
iommu/vt-d: Only handle IOPF for SVA when PRI is supported
iommu/vt-d: Fix intel iommu iotlb sync hardlockup and retry
Pull arm64 fixes from Will Deacon:
"There's a small crop of fixes for the MPAM resctrl driver, a fix for
SCS/PAC patching with the AMDGPU driver and a page-table fix for
realms running with 52-bit physical addresses:
- Fix DWARF parsing for SCS/PAC patching to work with very large
modules (such as the amdgpu driver)
- Fixes to the mpam resctrl driver
- Fix broken handling of 52-bit physical addresses when sharing
memory from within a realm"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: realm: Fix PTE_NS_SHARED for 52bit PA support
arm_mpam: Force __iomem casts
arm_mpam: Disable preemption when making accesses to fake MSC in kunit test
arm_mpam: Fix null pointer dereference when restoring bandwidth counters
arm64/scs: Fix handling of advance_loc4
Pull Hyper-V fixes from Wei Liu:
- Fix ARM64 MSHV support (Anirudh Rayabharam)
- Fix MSHV driver memory handling issues (Stanislav Kinsburskii)
- Update maintainers for Hyper-V DRM driver (Saurabh Sengar)
- Misc clean up in MSHV crashdump code (Ard Biesheuvel, Uros Bizjak)
- Minor improvements to MSHV code (Mukesh R, Wei Liu)
- Revert not yet released MSHV scrub partition hypercall (Wei Liu)
* tag 'hyperv-fixes-signed-20260319' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
mshv: Fix error handling in mshv_region_pin
MAINTAINERS: Update maintainers for Hyper-V DRM driver
mshv: Fix use-after-free in mshv_map_user_memory error path
mshv: pass struct mshv_user_mem_region by reference
x86/hyperv: Use any general-purpose register when saving %cr2 and %cr8
x86/hyperv: Use current_stack_pointer to avoid asm() in hv_hvcrash_ctxt_save()
x86/hyperv: Save segment registers directly to memory in hv_hvcrash_ctxt_save()
x86/hyperv: Use __naked attribute to fix stackless C function
Revert "mshv: expose the scrub partition hypercall"
mshv: add arm64 support for doorbell & intercept SINTs
mshv: refactor synic init and cleanup
x86/hyperv: print out reserved vectors in hexadecimal
Pull smb client fixes from Steve French:
- Fix reporting of i_blocks
- Fix Kerberos mounts with different usernames to same server
- Trivial comment cleanup
* tag 'v7.0-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix generic/694 due to wrong ->i_blocks
cifs: smb1: fix comment typo
smb: client: fix krb5 mount with username option
Pull smb server fixes from Steve French:
- Three use after free fixes (in close, in compounded ops, and in tree
disconnect)
- Multichannel fix
- return proper volume identifier (superblock uuid if available) in
FS_OBJECT_ID queries
* tag 'v7.0-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix use-after-free in durable v2 replay of active file handles
ksmbd: fix use-after-free of share_conf in compound request
ksmbd: use volume UUID in FS_OBJECT_ID_INFORMATION
ksmbd: unset conn->binding on failed binding request
ksmbd: fix share_conf UAF in tree_conn disconnect
ranges_to_free array should have enough room to store the entire EFI
memmap plus an extra element for NULL entry.
The calculation of this array size wrongly adds 1 to the overall size
instead of adding 1 to the number of elements.
Add parentheses to properly size the array.
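The precedence problem can be illustrated with a small userspace sketch. The struct and helper names below are hypothetical stand-ins rather than the kernel's actual definitions; only the arithmetic is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the ranges_to_free array. */
struct range_sketch {
	unsigned long long start;
	unsigned long long end;
};

/* Buggy sizing: adds one byte to the total instead of one element. */
static size_t ranges_size_buggy(size_t nr_map)
{
	return sizeof(struct range_sketch) * nr_map + 1;
}

/* Fixed sizing: room for nr_map entries plus one terminating entry. */
static size_t ranges_size_fixed(size_t nr_map)
{
	return sizeof(struct range_sketch) * (nr_map + 1);
}
```

With the buggy form the terminating entry is almost entirely outside the allocation, since only a single extra byte is reserved for it.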
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: a4b0bf6a40 ("x86/efi: defer freeing of boot services memory")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Add a SB_I_NO_DATA_INTEGRITY superblock flag for filesystems that cannot
guarantee data persistence on sync (eg fuse). For superblocks with this
flag set, sync kicks off writeback of dirty inodes but does not wait
for the flusher threads to complete the writeback.
This replaces the per-inode AS_NO_DATA_INTEGRITY mapping flag added in
commit f9a49aa302 ("fs/writeback: skip AS_NO_DATA_INTEGRITY mappings
in wait_sb_inodes()"). The flag belongs at the superblock level because
data integrity is a filesystem-wide property, not a per-inode one.
Having this flag at the superblock level also allows us to skip having
to iterate every dirty inode in wait_sb_inodes() only to skip each inode
individually.
Prior to this commit, mappings with no data integrity guarantees skipped
waiting on writeback completion but still waited on the flusher threads
to finish initiating the writeback. Waiting on the flusher threads is
unnecessary. This commit kicks off writeback but does not wait on the
flusher threads. This change properly addresses a recent report [1] for
a suspend-to-RAM hang seen on fuse-overlayfs that was caused by waiting
on the flusher threads to finish:
Workqueue: pm_fs_sync pm_fs_sync_work_fn
Call Trace:
<TASK>
__schedule+0x457/0x1720
schedule+0x27/0xd0
wb_wait_for_completion+0x97/0xe0
sync_inodes_sb+0xf8/0x2e0
__iterate_supers+0xdc/0x160
ksys_sync+0x43/0xb0
pm_fs_sync_work_fn+0x17/0xa0
process_one_work+0x193/0x350
worker_thread+0x1a1/0x310
kthread+0xfc/0x240
ret_from_fork+0x243/0x280
ret_from_fork_asm+0x1a/0x30
</TASK>
On fuse this is problematic because there are paths that may cause the
flusher thread to block (eg if systemd freezes the user session cgroups
first, which freezes the fuse daemon, before invoking the kernel
suspend. The kernel suspend triggers ->write_inode() which on fuse issues
a synchronous setattr request, which cannot be processed since the
daemon is frozen. Or if the daemon is buggy and cannot properly complete
writeback, initiating writeback on a dirty folio already under writeback
leads to writeback_get_folio() -> folio_prepare_writeback() ->
unconditional wait on writeback to finish, which will cause a hang).
This commit restores fuse to its prior behavior before tmp folios were
removed, where sync was essentially a no-op.
[1] https://lore.kernel.org/linux-fsdevel/CAJnrk1a-asuvfrbKXbEwwDSctvemF+6zfhdnuzO65Pt8HsFSRw@mail.gmail.com/T/#m632c4648e9cafc4239299887109ebd880ac6c5c1
Fixes: 0c58a97f91 ("fuse: remove tmp folio for writebacks and internal rb tree")
Reported-by: John <therealgraysky@proton.me>
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260320005145.2483161-2-joannelkoong@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
sof_parse_token_sets() accepts array->size values that can be invalid
for a vendor tuple array header. In particular, a zero size does not
advance the parser state and can lead to non-progress parsing on
malformed topology data.
Validate array->size against the minimum header size and reject values
smaller than sizeof(*array) before parsing. This preserves behavior for
valid topologies and hardens malformed-input handling.
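A minimal userspace sketch of this kind of check follows. The header layout and function name are assumptions made for illustration and do not reproduce the real topology structures; the sketch only shows why a size below the header size (including zero) must be rejected for the loop to make progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified array header; not the real topology layout. */
struct array_hdr_sketch {
	uint32_t size;      /* total size of this array in bytes, header included */
	uint32_t type;
	uint32_t num_elems;
};

/*
 * Walk consecutive arrays in buf. Returns the number of arrays parsed,
 * or -1 if a header is malformed.
 */
static int parse_arrays_sketch(const unsigned char *buf, size_t len)
{
	int count = 0;

	while (len > 0) {
		struct array_hdr_sketch hdr;

		if (len < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf, sizeof(hdr));

		/* Reject sizes that cannot hold the header (this includes zero). */
		if (hdr.size < sizeof(hdr))
			return -1;
		/* Reject sizes that run past the remaining data. */
		if (hdr.size > len)
			return -1;

		/* A valid size always advances the parser. */
		buf += hdr.size;
		len -= hdr.size;
		count++;
	}
	return count;
}
```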
Signed-off-by: Cássio Gabriel <cassiogabrielcontato@gmail.com>
Acked-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://patch.msgid.link/20260319-sof-topology-array-size-fix-v1-1-f9191b16b1b7@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
When running in an unprivileged domU under Xen, the privcmd driver
is restricted to allow only hypercalls against a target domain, for
which the current domU is acting as a device model.
Add a boot parameter "unrestricted" to allow all hypercalls (the
hypervisor will still refuse destructive hypercalls affecting other
guests).
Make this new parameter effective only in case the domU wasn't started
using secure boot, as otherwise hypercalls targeting the domU itself
might result in violating the secure boot functionality.
This is achieved by adding another lockdown reason, which can be
checked for not being set when applying the "unrestricted" option.
This is part of XSA-482
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- new patch
HMM is fundamentally about allowing a sophisticated device to perform DMA
directly to a process’s memory while the CPU accesses that same memory at
the same time. It is similar to SVA but does not rely on IOMMU support.
Because the entire model depends on concurrent access to shared memory, it
fails as a uAPI if SWIOTLB substitutes the memory or if the CPU caches are
not coherent with DMA.
Until now, there has been no reliable way to report this, and various
approximations have been used:
int hmm_dma_map_alloc(struct device *dev, struct hmm_dma_map *map,
size_t nr_entries, size_t dma_entry_size)
{
<...>
/*
* The HMM API violates our normal DMA buffer ownership rules and can't
* transfer buffer ownership. The dma_addressing_limited() check is a
* best approximation to ensure no swiotlb buffering happens.
*/
dma_need_sync = !dev->dma_skip_sync;
if (dma_need_sync || dma_addressing_limited(dev))
return -EOPNOTSUPP;
So let's mark mapped buffers with the DMA_ATTR_REQUIRE_COHERENT attribute
to prevent silent data corruption if someone tries to use HMM in a system
with SWIOTLB or incoherent DMA.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260316-dma-debug-overlap-v3-8-1dde90a7f08b@nvidia.com
The Xen privcmd driver allows user space processes to issue
arbitrary hypercalls. This is normally no problem, as access is
usually limited to root and the hypervisor will deny any hypercalls
affecting other domains.
In case the guest is booted using secure boot, however, the privcmd
driver would be enabling a root user process to modify e.g. kernel
memory contents, thus breaking the secure boot feature.
The only known case where an unprivileged domU really needs to use
the privcmd driver is when it is acting as the device
model for another guest. In this case all hypercalls issued via the
privcmd driver will target that other guest.
Fortunately the privcmd driver can already be locked down to allow
only hypercalls targeting a specific domain, but this mode can
currently be activated only from user land.
The target domain can be obtained from Xenstore, so when not running
in dom0 restrict the privcmd driver to that target domain from the
beginning, resolving the potential problem of breaking secure boot.
This is XSA-482
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Fixes: 1c5de1939c ("xen: add privcmd driver")
Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- defer reading from Xenstore if Xenstore isn't ready yet (Jan Beulich)
- wait in open() if target domain isn't known yet
- issue message in case no target domain found (Jan Beulich)
Commit 4ab7bb9763 ("ata: libata-scsi: Refactor ata_scsiop_maint_in()")
modified ata_scsiop_maint_in() to directly call
ata_scsi_set_invalid_field() to set the field pointer of the sense data
of a failed MAINTENANCE IN command. However, in the case of an invalid
command format, the sense data field incorrectly indicates byte 1 of
the CDB. Fix this to indicate byte 2 of the command.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 4ab7bb9763 ("ata: libata-scsi: Refactor ata_scsiop_maint_in()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Niklas Cassel <cassel@kernel.org>
After commit 37c4e72b06 ("scsi: Fix sas_user_scan() to handle wildcard
and multi-channel scans"), if the device supports multiple channels (0 to
shost->max_channel), user_scan() invokes updated sas_user_scan() to perform
the scan behavior for a specific transfer. However, when the user
specifies shost->max_channel, it will return -EINVAL, which is not
expected.
Fix this by allowing shost->max_channel to be specified as the channel to
scan.
Fixes: 37c4e72b06 ("scsi: Fix sas_user_scan() to handle wildcard and multi-channel scans")
Signed-off-by: Yihang Li <liyihang9@huawei.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://patch.msgid.link/20260317063147.2182562-1-liyihang9@huawei.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
tcm_loop_target_reset() violates the SCSI EH contract: it returns SUCCESS
without draining any in-flight commands. The SCSI EH documentation
(scsi_eh.rst) requires that when a reset handler returns SUCCESS the driver
has made lower layers "forget about timed out scmds" and is ready for new
commands. Every other SCSI LLD (virtio_scsi, mpt3sas, ipr, scsi_debug,
mpi3mr) enforces this by draining or completing outstanding commands before
returning SUCCESS.
Because tcm_loop_target_reset() doesn't drain, the SCSI EH reuses in-flight
scsi_cmnd structures for recovery commands (e.g. TUR) while the target core
still has async completion work queued for the old se_cmd. The memset in
queuecommand zeroes se_lun and lun_ref_active, causing
transport_lun_remove_cmd() to skip its percpu_ref_put(). The leaked LUN
reference prevents transport_clear_lun_ref() from completing, hanging
configfs LUN unlink forever in D-state:
INFO: task rm:264 blocked for more than 122 seconds.
rm D 0 264 258 0x00004000
Call Trace:
__schedule+0x3d0/0x8e0
schedule+0x36/0xf0
transport_clear_lun_ref+0x78/0x90 [target_core_mod]
core_tpg_remove_lun+0x28/0xb0 [target_core_mod]
target_fabric_port_unlink+0x50/0x60 [target_core_mod]
configfs_unlink+0x156/0x1f0 [configfs]
vfs_unlink+0x109/0x290
do_unlinkat+0x1d5/0x2d0
Fix this by making tcm_loop_target_reset() actually drain commands:
1. Issue TMR_LUN_RESET via tcm_loop_issue_tmr() to drain all commands that
the target core knows about (those not yet CMD_T_COMPLETE).
2. Use blk_mq_tagset_busy_iter() to iterate all started requests and
flush_work() on each se_cmd — this drains any deferred completion work
for commands that already had CMD_T_COMPLETE set before the TMR (which
the TMR skips via __target_check_io_state()). This is the same pattern
used by mpi3mr, scsi_debug, and libsas to drain outstanding commands
during reset.
Fixes: e0eb5d38b7 ("scsi: target: tcm_loop: Use block cmd allocator for se_cmds")
Cc: stable@vger.kernel.org
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://patch.msgid.link/27011aa34c8f6b1b94d2e3cf5655b6d037f53428.1773706803.git.josef@toxicpanda.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
A malicious or compromised VIO server can return a num_written value in the
discover targets MAD response that exceeds max_targets. This value is
stored directly in vhost->num_targets without validation, and is then used
as the loop bound in ibmvfc_alloc_targets() to index into disc_buf[], which
is only allocated for max_targets entries. Indices at or beyond max_targets
access kernel memory outside the DMA-coherent allocation. The
out-of-bounds data is subsequently embedded in Implicit Logout and PLOGI
MADs that are sent back to the VIO server, leaking kernel memory.
Fix by clamping num_written to max_targets before storing it.
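The clamping idea can be sketched in userspace as follows. The function names and the use of a plain integer array in place of the discovery buffer are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a count reported by an untrusted peer to the allocated capacity. */
static uint32_t clamp_num_targets_sketch(uint32_t num_written,
					 uint32_t max_targets)
{
	return num_written < max_targets ? num_written : max_targets;
}

/*
 * Hypothetical consumer: walks disc_buf using the clamped count, so the
 * loop bound can never exceed the max_targets entries that were allocated.
 */
static uint64_t sum_targets_sketch(const uint64_t *disc_buf,
				   uint32_t max_targets, uint32_t num_written)
{
	uint32_t n = clamp_num_targets_sketch(num_written, max_targets);
	uint64_t sum = 0;
	uint32_t i;

	for (i = 0; i < n; i++)
		sum += disc_buf[i];
	return sum;
}
```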
Fixes: 072b91f9c6 ("[SCSI] ibmvfc: IBM Power Virtual Fibre Channel Adapter Client Driver")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tyllis Xu <LivelyCarpet87@gmail.com>
Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com>
Acked-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Link: https://patch.msgid.link/20260314170151.548614-1-LivelyCarpet87@gmail.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
nci_close_device() flushes rx_wq and tx_wq while holding req_lock.
This causes a circular locking dependency because nci_rx_work()
running on rx_wq can end up taking req_lock too:
nci_rx_work -> nci_rx_data_packet -> nci_data_exchange_complete
-> __sk_destruct -> rawsock_destruct -> nfc_deactivate_target
-> nci_deactivate_target -> nci_request -> mutex_lock(&ndev->req_lock)
Move the flush of rx_wq after req_lock has been released.
This should be safe (I think) because NCI_UP has already been cleared
and the transport is closed, so the work will see it and return
-ENETDOWN.
NIPA has been hitting this running the nci selftest with a debug
kernel on roughly 4% of the runs.
Fixes: 6a2968aaf5 ("NFC: basic NCI protocol implementation")
Reviewed-by: Ian Ray <ian.ray@gehealthcare.com>
Link: https://patch.msgid.link/20260317193334.988609-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull parisc fix from Helge Deller:
"Fix for the cacheflush() syscall which had D/I caches mixed up"
* tag 'parisc-for-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Flush correct cache in cacheflush() syscall
Pull pci fixes from Bjorn Helgaas:
- Create pwrctrl devices only for DT nodes below a PCI controller that
describe PCI devices and are related to a power supply; this prevents
waiting indefinitely for pwrctrl drivers that will never probe
(Manivannan Sadhasivam)
- Restore endpoint BAR mapping on subrange setup failure to make
selftest reliable (Koichiro Den)
* tag 'pci-v7.0-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: endpoint: pci-epf-test: Roll back BAR mapping when subrange setup fails
PCI/pwrctrl: Create pwrctrl devices only for PCI device nodes
PCI/pwrctrl: Ensure that remote endpoint node parent has supply requirement
The I2C communication is completely broken on the Armada 3700 platform
since commit 0b01392c18 ("i2c: pxa: move to generic GPIO recovery").
For example, on the Methode uDPU board, probing of the two onboard
temperature sensors fails ...
[ 7.271713] i2c i2c-0: using pinctrl states for GPIO recovery
[ 7.277503] i2c i2c-0: PXA I2C adapter
[ 7.282199] i2c i2c-1: using pinctrl states for GPIO recovery
[ 7.288241] i2c i2c-1: PXA I2C adapter
[ 7.292947] sfp sfp-eth1: Host maximum power 3.0W
[ 7.299614] sfp sfp-eth0: Host maximum power 3.0W
[ 7.308178] lm75 1-0048: supply vs not found, using dummy regulator
[ 32.489631] lm75 1-0048: probe with driver lm75 failed with error -121
[ 32.496833] lm75 1-0049: supply vs not found, using dummy regulator
[ 82.890614] lm75 1-0049: probe with driver lm75 failed with error -121
... and accessing the plugged-in SFP modules also does not work:
[ 511.298537] sfp sfp-eth1: please wait, module slow to respond
[ 536.488530] sfp sfp-eth0: please wait, module slow to respond
...
[ 1065.688536] sfp sfp-eth1: failed to read EEPROM: -EREMOTEIO
[ 1090.888532] sfp sfp-eth0: failed to read EEPROM: -EREMOTEIO
After a discussion [1], there was an attempt to fix the problem by
reverting the offending change by commit 7b211c7671 ("Revert "i2c:
pxa: move to generic GPIO recovery""), but that only helped to fix
the issue in the 6.1.y stable tree. The reason behind the partial success
is that there was another change in commit 20cb3fce4d ("i2c: Set i2c
pinctrl recovery info from it's device pinctrl") in the 6.3-rc1 cycle
which broke things further.
The cause of the problem is the same in case of both offending commits
mentioned above. Namely, the I2C core code changes the pinctrl state to
GPIO while running the recovery initialization code. Although the PXA
specific initialization also does this, but the key difference is that
it happens before the controller is getting enabled in i2c_pxa_reset(),
whereas in the case of the generic initialization it happens after that.
Change the code to reset the controller only before the first transfer
instead of before registering the controller. This ensures that the
controller is not enabled at the time when the generic recovery code
performs the pinctrl state changes, and thus avoids the problem described
above.
As a result, this change restores the original behaviour, which in
turn makes I2C communication work again, as can be seen from
the following log:
[ 7.363250] i2c i2c-0: using pinctrl states for GPIO recovery
[ 7.369041] i2c i2c-0: PXA I2C adapter
[ 7.373673] i2c i2c-1: using pinctrl states for GPIO recovery
[ 7.379742] i2c i2c-1: PXA I2C adapter
[ 7.384506] sfp sfp-eth1: Host maximum power 3.0W
[ 7.393013] sfp sfp-eth0: Host maximum power 3.0W
[ 7.399266] lm75 1-0048: supply vs not found, using dummy regulator
[ 7.407257] hwmon hwmon0: temp1_input not attached to any thermal zone
[ 7.413863] lm75 1-0048: hwmon0: sensor 'tmp75c'
[ 7.418746] lm75 1-0049: supply vs not found, using dummy regulator
[ 7.426371] hwmon hwmon1: temp1_input not attached to any thermal zone
[ 7.432972] lm75 1-0049: hwmon1: sensor 'tmp75c'
[ 7.755092] sfp sfp-eth1: module MENTECHOPTO POS22-LDCC-KR rev 1.0 sn MNC208U90009 dc 200828
[ 7.764997] mvneta d0040000.ethernet eth1: unsupported SFP module: no common interface modes
[ 7.785362] sfp sfp-eth0: module Mikrotik S-RJ01 rev 1.0 sn 61B103C55C58 dc 201022
[ 7.803426] hwmon hwmon2: temp1_input not attached to any thermal zone
Link: https://lore.kernel.org/r/20230926160255.330417-1-robert.marko@sartura.hr#1
Cc: stable@vger.kernel.org # 6.3+
Fixes: 20cb3fce4d ("i2c: Set i2c pinctrl recovery info from it's device pinctrl")
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Tested-by: Robert Marko <robert.marko@sartura.hr>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260226-i2c-pxa-fix-i2c-communication-v4-1-797a091dae87@gmail.com
The use of IONIC_CMD_LIF_SETATTR in the MAC address update path causes
the ionic firmware to update the LIF's identity in its persistent state.
Since the firmware state is maintained across host warm boots and driver
reloads, any MAC change on the Physical Function (PF) becomes "sticky".
This is problematic because it causes ethtool -P to report the
user-configured MAC as the permanent factory address, which breaks
system management tools that rely on a stable hardware identity.
While Virtual Functions (VFs) need this hardware-level programming to
properly handle MAC assignments in guest environments, the PF should
maintain standard transient behavior. This patch gates the
ionic_program_mac call using is_virtfn so that PF MAC changes remain
local to the netdev filters and do not overwrite the firmware's
permanent identity block.
Fixes: 19058be7c4 ("ionic: VF initial random MAC address if no assigned mac")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260317170806.35390-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
During the cxl_acpi probe process, it checks whether the cxl_nvb device
and driver have been attached. Currently, the startup priority of the
cxl_pmem driver is lower than that of the cxl_acpi driver. At this point,
the cxl_nvb driver has not yet been registered on the cxl_bus, causing
the attachment check to fail. This results in a failure to add the root
nvdimm bridge, leading to a cxl_acpi probe failure and ultimately
affecting the subsequent loading of cxl drivers. As a consequence, only
one mem device object exists on the cxl_bus, while the cxl_port device
objects and decoder device objects are missing.
The solution is to raise the startup priority of cxl_pmem to be higher
than that of cxl_acpi, ensuring that the cxl_pmem driver is registered
before the aforementioned attachment check occurs.
Co-developed-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Wang Yinfeng <wangyinfeng@phytium.com.cn>
Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn>
Fixes: e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko")
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260319074535.1709250-1-cuichao1753@phytium.com.cn
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
When io_should_commit() returns true (eg for non-pollable files), buffer
commit happens at buffer selection time and sel->buf_list is set to
NULL. When __io_put_kbufs() generates CQE flags at completion time, it
calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence
cannot determine whether the buffer was consumed or not. This means that
IORING_CQE_F_BUF_MORE is never set for non-pollable input with
incrementally consumed buffers.
Likewise for io_buffers_select(), which always commits upfront and
discards the return value of io_kbuf_commit().
Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early
commit. Then __io_put_kbuf_ring() can check this flag and set
IORING_CQE_F_BUF_MORE accordingly.
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For a zero length transfer, io_kbuf_inc_commit() is called with !len.
Since we never enter the while loop to consume the buffers,
io_kbuf_inc_commit() ends up returning true, consuming the buffer. But
if no data was consumed, by definition it cannot have consumed the
buffer. Return false for that case.
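The rule can be sketched with a simplified userspace model. This is an assumption-laden illustration that tracks a single buffer rather than walking a buffer ring, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of one incrementally consumed buffer. */
struct inc_buf_sketch {
	uint64_t addr;
	uint32_t len;
};

/*
 * Consume len bytes from buf. Returns true only if the buffer was fully
 * used up, false if part of it remains or nothing was consumed at all.
 */
static bool inc_commit_sketch(struct inc_buf_sketch *buf, uint32_t len)
{
	uint32_t this_len;

	/* Zero-length transfer: nothing consumed, so the buffer is not done. */
	if (!len)
		return false;

	this_len = len < buf->len ? len : buf->len;
	buf->addr += this_len;
	buf->len -= this_len;
	return buf->len == 0;
}
```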
Reported-by: Martin Michaelis <code@mgjm.de>
Cc: stable@vger.kernel.org
Fixes: ae98dbf43d ("io_uring/kbuf: add support for incremental buffer consumption")
Link: https://github.com/axboe/liburing/issues/1553
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add tsync_interrupt test to exercise the signal interruption path in
landlock_restrict_sibling_threads(). When a signal interrupts
wait_for_completion_interruptible() while the calling thread waits for
sibling threads to finish credential preparation, the kernel:
1. Sets ERESTARTNOINTR to request a transparent syscall restart.
2. Calls cancel_tsync_works() to opportunistically dequeue task works
that have not started running yet.
3. Breaks out of the preparation loop, then unblocks remaining
task works via complete_all() and waits for them to finish.
4. Returns the error, causing abort_creds() in the syscall handler.
Specifically, cancel_tsync_works() in its entirety, the ERESTARTNOINTR
error branch in landlock_restrict_sibling_threads(), and the
abort_creds() error branch in the landlock_restrict_self() syscall
handler are timing-dependent and not exercised by the existing tsync
tests, making code coverage measurements non-deterministic.
The test spawns a signaler thread that rapidly sends SIGUSR1 to the
calling thread while it performs landlock_restrict_self() with
LANDLOCK_RESTRICT_SELF_TSYNC. Since ERESTARTNOINTR causes a
transparent restart, userspace always sees the syscall succeed.
This is a best-effort coverage test: the interruption path is exercised
when the signal lands during the preparation wait, which depends on
thread scheduling. The test creates enough idle sibling threads (200)
to ensure multiple serialized waves of credential preparation even on
machines with many cores (e.g., 64), widening the window for the
signaler. Deterministic coverage would require wrapping the wait call
with ALLOW_ERROR_INJECTION() and using CONFIG_FAIL_FUNCTION.
Test coverage for security/landlock was 90.2% of 2105 lines according to
LLVM 21, and it is now 91.1% of 2105 lines with this new test.
Cc: Günther Noack <gnoack@google.com>
Cc: Justin Suess <utilityemal77@gmail.com>
Cc: Tingmao Wang <m@maowtm.org>
Cc: Yihan Ding <dingyihan@uniontech.com>
Link: https://lore.kernel.org/r/20260310190416.1913908-1-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Some pinctrl devices like mt6397 or mt6392 don't support EINT at all, but
the mtk_eint_init function is always called and returns -ENODEV, which
then bubbles up and causes probe failure.
To address this, only call mtk_eint_init if EINT pins are present.
Tested on Xiaomi Mi Smart Clock x04g (mt6392).
Fixes: e46df235b4 ("pinctrl: mediatek: refactor EINT related code for all MediaTek pinctrl can fit")
Signed-off-by: Luca Leonardo Scorcia <l.scorcia@gmail.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
This attempts to fix regressions caused by reusing idents, which
apparently is not handled well by certain stacks and causes them to stop
responding to requests. Instead of simply returning the first unallocated
id, store the last used tx_ident and try the next one until all available
ids are exhausted, then cycle back to 1.
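A cycling allocator of this kind might look roughly like the userspace sketch below. The state struct, the function names, and the assumed identifier range of 1..255 are illustrative assumptions and not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IDENT_MAX_SKETCH 255 /* assumption: identifiers run from 1 to 255 */

/* Hypothetical per-connection state; not the kernel's struct l2cap_conn. */
struct ident_state_sketch {
	uint8_t last;                      /* last identifier handed out, 0 if none */
	bool in_use[IDENT_MAX_SKETCH + 1]; /* index 0 is never handed out */
};

/*
 * Return the next free identifier after the last one used, wrapping back
 * to 1 after IDENT_MAX_SKETCH. Returns 0 if every identifier is in use.
 */
static uint8_t ident_get_sketch(struct ident_state_sketch *st)
{
	unsigned int candidate = st->last;
	int i;

	for (i = 0; i < IDENT_MAX_SKETCH; i++) {
		candidate = candidate % IDENT_MAX_SKETCH + 1;
		if (!st->in_use[candidate]) {
			st->in_use[candidate] = true;
			st->last = (uint8_t)candidate;
			return (uint8_t)candidate;
		}
	}
	return 0;
}

/* Release an identifier once its request has completed. */
static void ident_put_sketch(struct ident_state_sketch *st, uint8_t ident)
{
	st->in_use[ident] = false;
}
```

Unlike a first-free search, a released identifier is not handed out again until the counter has cycled past it.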
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221120
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221177
Fixes: 6c3ea155e5 ("Bluetooth: L2CAP: Fix not tracking outstanding TX ident")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Christian Eggers <ceggers@arri.de>
Smatch reports:
drivers/bluetooth/hci_ll.c:587 download_firmware() warn:
'fw' from request_firmware() not released on lines: 544.
In download_firmware(), if request_firmware() succeeds but the returned
firmware content is invalid (no data or zero size), the function returns
without releasing the firmware, resulting in a resource leak.
Fix this by calling release_firmware() before returning when
request_firmware() succeeded but the firmware content is invalid.
Fixes: 371805522f ("bluetooth: hci_uart: add LL protocol serdev driver support")
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
__hci_cmd_sync_sk() sets hdev->req_status under hdev->req_lock:
hdev->req_status = HCI_REQ_PEND;
However, several other functions read or write hdev->req_status without
holding any lock:
- hci_send_cmd_sync() reads req_status in hci_cmd_work (workqueue)
- hci_cmd_sync_complete() reads/writes from HCI event completion
- hci_cmd_sync_cancel() / hci_cmd_sync_cancel_sync() read/write
- hci_abort_conn() reads in connection abort path
Since __hci_cmd_sync_sk() runs on hdev->req_workqueue while
hci_send_cmd_sync() runs on hdev->workqueue, these are different
workqueues that can execute concurrently on different CPUs. The plain
C accesses constitute a data race.
Add READ_ONCE()/WRITE_ONCE() annotations on all concurrent accesses
to hdev->req_status to prevent potential compiler optimizations that
could affect correctness (e.g., load fusing in the wait_event
condition or store reordering).
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
sco_recv_frame() reads conn->sk under sco_conn_lock() but immediately
releases the lock without holding a reference to the socket. A concurrent
close() can free the socket between the lock release and the subsequent
sk->sk_state access, resulting in a use-after-free.
Other functions in the same file (sco_sock_timeout(), sco_conn_del())
correctly use sco_sock_hold() to safely hold a reference under the lock.
Fix by using sco_sock_hold() to take a reference before releasing the
lock, and adding sock_put() on all exit paths.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_ecred_data_rcv() reads the SDU length field from skb->data using
get_unaligned_le16() without first verifying that skb contains at least
L2CAP_SDULEN_SIZE (2) bytes. When skb->len is less than 2, this reads
past the valid data in the skb.
The ERTM reassembly path correctly calls pskb_may_pull() before reading
the SDU length (l2cap_reassemble_sdu, L2CAP_SAR_START case). Apply the
same validation to the Enhanced Credit Based Flow Control data path.
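The length check can be illustrated with a userspace sketch operating on a plain byte buffer. The function name and constant below are hypothetical; in the kernel the check is done on the skb as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SDULEN_SIZE_SKETCH 2 /* mirrors the 2-byte L2CAP_SDULEN_SIZE */

/*
 * Read a little-endian 16-bit SDU length from the start of data.
 * Returns 0 and stores the length in *sdu_len on success, or -1 if fewer
 * than SDULEN_SIZE_SKETCH bytes are available.
 */
static int read_sdu_len_sketch(const uint8_t *data, size_t len,
			       uint16_t *sdu_len)
{
	/* Validate the available length before touching the bytes. */
	if (len < SDULEN_SIZE_SKETCH)
		return -1;

	*sdu_len = (uint16_t)(data[0] | (data[1] << 8));
	return 0;
}
```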
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Syzbot reported a KASAN stack-out-of-bounds read in l2cap_build_cmd()
that is triggered by a malformed Enhanced Credit Based Connection Request.
The vulnerability stems from l2cap_ecred_conn_req(). The function allocates
a local stack buffer (`pdu`) designed to hold a maximum of 5 Source Channel
IDs (SCIDs), totaling 18 bytes. When an attacker sends a request with more
than 5 SCIDs, the function calculates `rsp_len` based on this unvalidated
`cmd_len` before checking if the number of SCIDs exceeds
L2CAP_ECRED_MAX_CID.
If the SCID count is too high, the function correctly jumps to the
`response` label to reject the packet, but `rsp_len` retains the
attacker's oversized value. Consequently, l2cap_send_cmd() is instructed
to read past the end of the 18-byte `pdu` buffer, triggering a
KASAN panic.
Fix this by moving the assignment of `rsp_len` to after the `num_scid`
boundary check. If the packet is rejected, `rsp_len` will safely
remain 0, and the error response will only read the 8-byte base header
from the stack.
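The ordering fix can be sketched as follows. The constants and function name are hypothetical, chosen to match the sizes quoted above (an 8-byte base header plus up to 5 two-byte SCIDs, 18 bytes in total).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ECRED_MAX_CID_SKETCH 5 /* maximum number of SCIDs per request */
#define RSP_HDR_LEN_SKETCH   8 /* size of the fixed response header */

/*
 * Return the number of bytes the response would read from the local pdu
 * buffer for a request carrying num_scid channel IDs.
 */
static size_t ecred_rsp_bytes_sketch(size_t num_scid)
{
	size_t rsp_len = 0;

	/* Validate first: on rejection rsp_len stays 0. */
	if (num_scid > ECRED_MAX_CID_SKETCH)
		goto response;

	/* Only a validated count is allowed to size the response. */
	rsp_len = num_scid * sizeof(uint16_t);

response:
	return RSP_HDR_LEN_SKETCH + rsp_len;
}
```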
Fixes: c28d2bff70 ("Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short")
Reported-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b7f3e7d9a596bf6a63e3
Tested-by: syzbot+b7f3e7d9a596bf6a63e3@syzkaller.appspotmail.com
Signed-off-by: Minseo Park <jacob.park.9436@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pull networking fixes from Jakub Kicinski:
"Including fixes from wireless, Bluetooth and netfilter.
Nothing too exciting here, mostly fixes for corner cases.
Current release - fix to a fix:
- bonding: prevent potential infinite loop in bond_header_parse()
Current release - new code bugs:
- wifi: mac80211: check tdls flag in ieee80211_tdls_oper
Previous releases - regressions:
- af_unix: give up GC if MSG_PEEK intervened
- netfilter: conntrack: add missing netlink policy validations
- NFC: nxp-nci: allow GPIOs to sleep"
* tag 'net-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (78 commits)
MPTCP: fix lock class name family in pm_nl_create_listen_socket
icmp: fix NULL pointer dereference in icmp_tag_validation()
net: dsa: bcm_sf2: fix missing clk_disable_unprepare() in error paths
net: shaper: protect from late creation of hierarchy
net: shaper: protect late read accesses to the hierarchy
net: mvpp2: guard flow control update with global_tx_fc in buffer switching
nfnetlink_osf: validate individual option lengths in fingerprints
netfilter: nf_tables: release flowtable after rcu grace period on error
netfilter: bpf: defer hook memory release until rcu readers are done
net: bonding: fix NULL deref in bond_debug_rlb_hash_show
udp_tunnel: fix NULL deref caused by udp_sock_create6 when CONFIG_IPV6=n
net/mlx5e: Fix race condition during IPSec ESN update
net/mlx5e: Prevent concurrent access to IPSec ASO context
net/mlx5: qos: Restrict RTNL area to avoid a lock cycle
ipv6: add NULL checks for idev in SRv6 paths
NFC: nxp-nci: allow GPIOs to sleep
net: macb: fix uninitialized rx_fs_lock
net: macb: fix use-after-free access to PTP clock
netdevsim: drop PSP ext ref on forward failure
wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
...
icmp_tag_validation() unconditionally dereferences the result of
rcu_dereference(inet_protos[proto]) without checking for NULL.
The inet_protos[] array is sparse -- only about 15 of 256 protocol
numbers have registered handlers. When ip_no_pmtu_disc is set to 3
(hardened PMTU mode) and the kernel receives an ICMP Fragmentation
Needed error with a quoted inner IP header containing an unregistered
protocol number, the NULL dereference causes a kernel panic in
softirq context.
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:icmp_unreach (net/ipv4/icmp.c:1085 net/ipv4/icmp.c:1143)
Call Trace:
<IRQ>
icmp_rcv (net/ipv4/icmp.c:1527)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:207)
ip_local_deliver_finish (net/ipv4/ip_input.c:242)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
__netif_receive_skb_one_core (net/core/dev.c:6164)
process_backlog (net/core/dev.c:6628)
handle_softirqs (kernel/softirq.c:561)
</IRQ>
Add a NULL check before accessing icmp_strict_tag_validation. If the
protocol has no registered handler, return false since it cannot
perform strict tag validation.
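The shape of the fix can be illustrated with a small user-space sketch. It is not the kernel code: `proto_handler`, `proto_table` and `tag_validation()` are hypothetical stand-ins for `struct net_protocol`, `inet_protos[]` and `icmp_tag_validation()`, and RCU is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_protocol; only the flag matters here. */
struct proto_handler {
	bool strict_tag_validation;
};

#define MAX_PROTOS 256

/* Sparse table: most slots stay NULL, as described for inet_protos[]. */
static const struct proto_handler *proto_table[MAX_PROTOS];

/* Return false when no handler is registered instead of dereferencing NULL. */
static bool tag_validation(int proto)
{
	const struct proto_handler *h;

	if (proto < 0 || proto >= MAX_PROTOS)
		return false;

	h = proto_table[proto];
	return h ? h->strict_tag_validation : false;
}
```

An unregistered protocol number then reports "no strict validation" rather than faulting.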
Fixes: 8ed1dc44d3 ("ipv4: introduce hardened ip_no_pmtu_disc mode")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260318130558.1050247-4-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Smatch reports:
drivers/net/dsa/bcm_sf2.c:997 bcm_sf2_sw_resume() warn:
'priv->clk' from clk_prepare_enable() not released on lines: 983,990.
The clock enabled by clk_prepare_enable() in bcm_sf2_sw_resume()
is not released if bcm_sf2_sw_rst() or bcm_sf2_cfp_resume() fails.
Add the missing clk_disable_unprepare() calls in the error paths
to properly release the clock resource.
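A minimal sketch of the unwinding pattern, assuming a mock clock instead of the real clk API: `mock_clk`, `mock_resume()` and the `rst_ret`/`cfp_ret` parameters are hypothetical and only simulate the two steps that can fail after the clock has been enabled.

```c
#include <errno.h>

/* Hypothetical mock clock: the counter makes a leaked enable observable. */
struct mock_clk {
	int enable_count;
};

static int mock_clk_prepare_enable(struct mock_clk *clk)
{
	clk->enable_count++;
	return 0;
}

static void mock_clk_disable_unprepare(struct mock_clk *clk)
{
	clk->enable_count--;
}

/* rst_ret and cfp_ret simulate the results of the reset and CFP resume steps. */
static int mock_resume(struct mock_clk *clk, int rst_ret, int cfp_ret)
{
	int ret;

	ret = mock_clk_prepare_enable(clk);
	if (ret)
		return ret;

	ret = rst_ret;
	if (ret)
		goto err_clk;

	ret = cfp_ret;
	if (ret)
		goto err_clk;

	return 0;

err_clk:
	/* Every failure after the enable releases the clock again. */
	mock_clk_disable_unprepare(clk);
	return ret;
}
```

Routing both failures through one label keeps the enable and disable calls balanced on every exit.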
Fixes: e9ec5c3bd2 ("net: dsa: bcm_sf2: request and handle clocks")
Reviewed-by: Jonas Gorski <jonas.gorski@gmail.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com>
Link: https://patch.msgid.link/20260318084212.1287-1-mohd.abd.6602@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
isotp_sendmsg() uses only cmpxchg() on so->tx.state to serialize access
to so->tx.buf. isotp_release() waits for ISOTP_IDLE via
wait_event_interruptible() and then calls kfree(so->tx.buf).
If a signal interrupts the wait_event_interruptible() inside close()
while tx.state is ISOTP_SENDING, the loop exits early and release
proceeds to force ISOTP_SHUTDOWN and continues to kfree(so->tx.buf)
while sendmsg may still be reading so->tx.buf for the final CAN frame
in isotp_fill_dataframe().
The so->tx.buf can be allocated once when the standard tx.buf length needs
to be extended. Move the kfree() of this potentially extended tx.buf to
sk_destruct time, when both isotp_sendmsg() and isotp_release() are done.
Fixes: 96d1c81e6a ("can: isotp: add module parameter for maximum pdu size")
Cc: stable@vger.kernel.org
Reported-by: Ali Norouzi <ali.norouzi@keysight.com>
Co-developed-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260319-fix-can-gw-and-can-isotp-v2-2-c45d52c6d2d8@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
cgw_csum_crc8_rel() correctly computes bounds-safe indices via calc_idx():
int from = calc_idx(crc8->from_idx, cf->len);
int to = calc_idx(crc8->to_idx, cf->len);
int res = calc_idx(crc8->result_idx, cf->len);
if (from < 0 || to < 0 || res < 0)
return;
However, the loop and the result write then use the raw s8 fields directly
instead of the computed variables:
for (i = crc8->from_idx; ...) /* BUG: raw negative index */
cf->data[crc8->result_idx] = ...; /* BUG: raw negative index */
With from_idx = to_idx = result_idx = -64 on a 64-byte CAN FD frame,
calc_idx(-64, 64) = 0 so the guard passes, but the loop iterates with
i = -64, reading cf->data[-64], and the write goes to cf->data[-64].
This write might end up to 56 (7.0-rc) or 40 (<= 6.19) bytes before the
start of the canfd_frame on the heap.
The companion function cgw_csum_xor_rel() uses `from`/`to`/`res`
correctly throughout; fix cgw_csum_crc8_rel() to match.
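The difference between the raw and the computed indices can be shown with a simplified sketch. `rel_checksum()` is a hypothetical helper that uses a plain XOR instead of the CRC8 table lookup, `calc_idx()` is written here as an assumption of how the relative index is resolved, and the upper-bound check is an extra safety measure of this sketch.

```c
#include <stdint.h>

/* Map a possibly negative (end-relative) index to an absolute one. */
static int calc_idx(int idx, int len)
{
	return idx < 0 ? len + idx : idx;
}

/*
 * Hypothetical simplified checksum: XOR over data[from..to], stored at
 * data[res]. Returns 0 on success, -1 if an index falls outside the frame.
 */
static int rel_checksum(uint8_t *data, int len, int8_t from_idx,
			int8_t to_idx, int8_t result_idx)
{
	int from = calc_idx(from_idx, len);
	int to = calc_idx(to_idx, len);
	int res = calc_idx(result_idx, len);
	uint8_t sum = 0;
	int i;

	if (from < 0 || to < 0 || res < 0)
		return -1;
	if (from >= len || to >= len || res >= len)
		return -1;

	/* Only the computed indices are used from here on, never the raw s8 fields. */
	if (from <= to) {
		for (i = from; i <= to; i++)
			sum ^= data[i];
	} else {
		for (i = from; i >= to; i--)
			sum ^= data[i];
	}
	data[res] = sum;

	return 0;
}
```

With a 64-byte frame and all three raw indices set to -64, the computed indices are 0, so both the loop and the store stay inside the buffer.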
Confirmed with KASAN on linux-7.0-rc2:
BUG: KASAN: slab-out-of-bounds in cgw_csum_crc8_rel+0x515/0x5b0
Read of size 1 at addr ffff8880076619c8 by task poc_cgw_oob/62
To configure the can-gw crc8 checksums CAP_NET_ADMIN is needed.
Fixes: 456a8a646b ("can: gw: add support for CAN FD frames")
Cc: stable@vger.kernel.org
Reported-by: Ali Norouzi <ali.norouzi@keysight.com>
Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Ali Norouzi <ali.norouzi@keysight.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260319-fix-can-gw-and-can-isotp-v2-1-c45d52c6d2d8@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
GGTT MMIO access is currently protected by hotplug (drm_dev_enter),
which works correctly when the driver loads successfully and is later
unbound or unloaded. However, if driver load fails, this protection is
insufficient because drm_dev_unplug() is never called.
Additionally, devm release functions cannot guarantee that all BOs with
GGTT mappings are destroyed before the GGTT MMIO region is removed, as
some BOs may be freed asynchronously by worker threads.
To address this, introduce an open-coded flag, protected by the GGTT
lock, that guards GGTT MMIO access. The flag is cleared during the
dev_fini_ggtt devm release function to ensure MMIO access is disabled
once teardown begins.
Cc: stable@vger.kernel.org
Fixes: 919bb54e98 ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260310225039.1320161-8-zhanjun.dong@intel.com
(cherry picked from commit 4f3a998a173b4325c2efd90bdadc6ccd3ad9a431)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Pull power management fixes from Rafael Wysocki:
"These fix an idle loop issue exposed by recent changes and a race
condition related to device removal in the runtime PM core code:
- Consolidate the handling of two special cases in the idle loop that
occur when only one CPU idle state is present (Rafael Wysocki)
- Fix a race condition related to device removal in the runtime PM
core code that may cause a stale device object pointer to be
dereferenced (Bart Van Assche)"
* tag 'pm-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: runtime: Fix a race condition related to device removal
sched: idle: Consolidate the handling of two special cases
The HDP driver uses the generic GPIO chip API, but this configuration
may not be enabled.
Ensure it is enabled by selecting the appropriate option.
Fixes: 4bcff9c05b ("pinctrl: stm32: use new generic GPIO chip API")
Signed-off-by: Amelie Delaunay <amelie.delaunay@foss.st.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
Pull ACPI support fixes from Rafael Wysocki:
"These fix an MFD child automatic modprobe issue introduced recently,
an ACPI processor driver issue introduced by a previous fix and an
ACPICA issue causing confusing messages regarding _DSM arguments to be
printed:
- Update the format of the last argument of _DSM to avoid printing
confusing error messages in some cases (Saket Dumbre)
- Fix MFD child automatic modprobe issue by removing a stale check
from acpi_companion_match() (Pratap Nirujogi)
- Prevent possible use-after-free in acpi_processor_errata_piix4()
from occurring by rearranging the code to print debug messages
while holding references to relevant device objects (Rafael
Wysocki)"
* tag 'acpi-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: bus: Fix MFD child automatic modprobe issue
ACPI: processor: Fix previous acpi_processor_errata_piix4() fix
ACPICA: Update the format of Arg3 of _DSM
Florian Westphal says:
====================
netfilter: updates for net
The following patchset contains Netfilter fixes for *net*:
1) Fix UaF when netfilter bpf link goes away while nfnetlink dumps
current hook list, we have to wait until rcu readers are gone.
2) Fix UaF when flowtable fails to register all devices, similar
bug as 1). From Pablo Neira Ayuso.
3) nfnetlink_osf fails to properly validate option length fields.
From Weiming Shi.
netfilter pull request nf-26-03-19
* tag 'nf-26-03-19' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
nfnetlink_osf: validate individual option lengths in fingerprints
netfilter: nf_tables: release flowtable after rcu grace period on error
netfilter: bpf: defer hook memory release until rcu readers are done
====================
Link: https://patch.msgid.link/20260319093834.19933-1-fw@strlen.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge an ACPICA fix and a core ACPI support code fix for 7.0-rc5:
- Update the format of the last argument of _DSM to avoid printing
confusing error messages in some cases (Saket Dumbre)
- Fix MFD child automatic modprobe issue by removing a stale check
from acpi_companion_match() (Pratap Nirujogi)
* acpica:
ACPICA: Update the format of Arg3 of _DSM
* acpi-bus:
ACPI: bus: Fix MFD child automatic modprobe issue
Merge a fix for a race condition related to device removal (Bart Van
Assche) for 7.0-rc5.
* pm-runtime:
PM: runtime: Fix a race condition related to device removal
Add missing error handling for mcp251x_power_enable() calls in both
mcp251x_open() and mcp251x_can_resume() functions.
In mcp251x_open(), if power enable fails, jump to error path to close
candev without attempting to disable power again.
In mcp251x_can_resume(), properly check return values of power enable calls
for both power and transceiver regulators. If any fails, return the error
code to the PM framework and log the failure.
This ensures the driver properly handles power control failures and
maintains correct device state.
Signed-off-by: Wenyuan Li <2063309626@qq.com>
Link: https://patch.msgid.link/tencent_F3EFC5D7738AC548857B91657715E2D3AA06@qq.com
[mkl: fix patch description]
[mkl: mcp251x_can_resume(): replace goto by return]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Getting engine specific CTX TIMESTAMP register can fail. In that case,
if the context is active, new_ts is uninitialized. Fix that case by
initializing new_ts to the last value that was sampled in SW -
lrc->ctx_timestamp.
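As a rough illustration, assuming a mock register read: `mock_lrc`, `read_engine_ts()` and `update_timestamp()` are hypothetical names, and the `fail` parameter only simulates the case where the engine-specific register cannot be read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down LRC state. */
struct mock_lrc {
	uint64_t ctx_timestamp;	/* last value sampled in software */
};

/* Hypothetical register read that can fail; *out is untouched on failure. */
static int read_engine_ts(bool fail, uint64_t hw_value, uint64_t *out)
{
	if (fail)
		return -1;

	*out = hw_value;
	return 0;
}

static uint64_t update_timestamp(struct mock_lrc *lrc, bool fail,
				 uint64_t hw_value)
{
	/* Start from the last sampled value so a failed read leaves it valid. */
	uint64_t new_ts = lrc->ctx_timestamp;

	(void)read_engine_ts(fail, hw_value, &new_ts);

	lrc->ctx_timestamp = new_ts;
	return new_ts;
}
```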
Flagged by static analysis.
v2: Fix new_ts initialization (Ashutosh)
Fixes: bb63e7257e ("drm/xe: Avoid toggling schedule state to check LRC timestamp in TDR")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Link: https://patch.msgid.link/20260312125308.3126607-2-umesh.nerlige.ramappa@intel.com
(cherry picked from commit 466e75d48038af252187855058a7a9312db9d2f8)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
The check using xe_child->base.children was insufficient in determining
if a pte was a leaf node. So explicitly skip over every non-leaf pt and
conditionally abort if there is a scenario where a non-leaf pt is
interleaved between leaf pt, which results in the page walker skipping
over some leaf pt.
Note that the behavior being targeted for abort is
PD[0] = 2M PTE
PD[1] = PT -> 512 4K PTEs
PD[2] = 2M PTE
results in abort, page walker won't descend PD[1].
With the new abort, ensure the PRL is still valid before handling a second abort.
v2:
- Revert to previous assert.
- Revised non-leaf handling for interleaf child pt and leaf pte.
- Update comments to specifications. (Stuart)
- Remove unnecessary XE_PTE_PS64. (Matthew B)
v3:
- Modify secondary abort to only check non-leaf PTEs. (Matthew B)
Fixes: b912138df2 ("drm/xe: Create page reclaim list on unbind")
Signed-off-by: Brian Nguyen <brian3.nguyen@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Cc: Stuart Summers <stuart.summers@intel.com>
Link: https://patch.msgid.link/20260305171546.67691-6-brian3.nguyen@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
(cherry picked from commit 1d123587525db86cc8f0d2beb35d9e33ca3ade83)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
In GuC submit fini, forcefully tear down any exec queues by disabling
CTs, stopping the scheduler (which cleans up lost G2H), killing all
remaining queues, and resuming scheduling to allow any remaining cleanup
actions to complete and signal any remaining fences.
Split guc_submit_fini into device-related and software-only parts. Using
device-managed and drm-managed action guarantees the correct ordering of
cleanup.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: stable@vger.kernel.org
Reviewed-by: Zhanjun Dong <zhanjun.dong@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patch.msgid.link/20260310225039.1320161-3-zhanjun.dong@intel.com
(cherry picked from commit a6ab444a111a59924bd9d0c1e0613a75a0a40b89)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
By using the same variable for both the return of poll_timeout_us and
the return of the polled function guc_wait_ucode, the return value of
the latter is overwritten and lost after exiting the polling loop. Since
guc_wait_ucode returns -1 on GuC load failure, we lose that information
and always continue as if the GuC had been loaded correctly.
This is fixed by simply using 2 separate variables.
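A user-space sketch of the idea, not the driver code: `wait_for_load()` is a hypothetical bounded poll loop standing in for poll_timeout_us(), and mapping the polled function's -1 to -EIO is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical polled operation: 1 when done, 0 to keep polling, -1 on load failure. */
typedef int (*poll_op_t)(void *ctx);

/*
 * Keep the status of the polling loop separate from the result of the
 * polled function, so a failure reported by the latter is not overwritten.
 */
static int wait_for_load(poll_op_t op, void *ctx, int max_polls)
{
	int poll_ret = -ETIMEDOUT;	/* status of the polling loop itself */
	int op_ret = 0;			/* last value returned by op() */
	int i;

	for (i = 0; i < max_polls; i++) {
		op_ret = op(ctx);
		if (op_ret != 0) {
			poll_ret = 0;	/* loop finished before the deadline */
			break;
		}
	}

	if (poll_ret)
		return poll_ret;

	return op_ret < 0 ? -EIO : 0;
}

/* Example operations used to exercise the helper. */
static int op_ok(void *ctx)
{
	(void)ctx;
	return 1;
}

static int op_fail(void *ctx)
{
	(void)ctx;
	return -1;
}

static int op_never(void *ctx)
{
	(void)ctx;
	return 0;
}
```

With a single shared variable, the `op_fail` case would be reported as success because the loop status (0) would replace the -1.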
Fixes: a4916b4da4 ("drm/xe/guc: Refactor GuC load to use poll_timeout_us()")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Vinay Belgaumkar <vinay.belgaumkar@intel.com>
Link: https://patch.msgid.link/20260303001732.2540493-2-daniele.ceraolospurio@intel.com
(cherry picked from commit c85ec5c5753a46b5c2aea1292536487be9470ffe)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
We look up a netdev during prep of Netlink ops (pre- callbacks)
and take a ref to it. Then later in the body of the callback
we take its lock or RCU which are the actual protections.
The netdev may get unregistered in between the time we take
the ref and the time we lock it. We may allocate the hierarchy
after flush has already run, which would lead to a leak.
Take the instance lock in pre- already, this saves us from the race
and removes the need for dedicated lock/unlock callbacks completely.
After all, if there's any chance of write happening concurrently
with the flush - we're back to leaking the hierarchy.
We may take the lock for devices which don't support shapers but
we're only dealing with SET operations here, not taking the lock
would be optimizing for an error case.
Fixes: 93954b40f6 ("net-shapers: implement NL set and delete operations")
Link: https://lore.kernel.org/20260309173450.538026-1-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260317161014.779569-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We look up a netdev during prep of Netlink ops (pre- callbacks)
and take a ref to it. Then later in the body of the callback
we take its lock or RCU which are the actual protections.
This is not proper, a conversion from a ref to a locked netdev
must include a liveness check (a check if the netdev hasn't been
unregistered already). Fix the read cases (those under RCU).
Writes need a separate change to protect from creating the
hierarchy after flush has already run.
Fixes: 4b623f9f0f ("net-shapers: implement NL get operation")
Reported-by: Paul Moses <p@1g4.org>
Link: https://lore.kernel.org/20260309173450.538026-1-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260317161014.779569-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With LPA/LPA2, the top bits of the PFN (Bits[51:48]) end up in the lower bits
of the PTE. So, simply creating a mask of the "top IPA bit" doesn't work well
for these configurations to set the "top" bit at the output of Stage1
translation.
Fix this by using the __phys_to_pte_val() to do the right thing for all
configurations.
Tested using kvmtool, placing the memory at a higher address (-m <size>@<Addr>).
e.g:
# lkvm run --realm -c 4 -m 512M@@128T -k Image --console serial
sh-5.0# dmesg | grep "LPA2\|RSI"
[ 0.000000] RME: Using RSI version 1.0
[ 0.000000] CPU features: detected: 52-bit Virtual Addressing (LPA2)
[ 0.777354] CPU features: detected: 52-bit Virtual Addressing for KVM (LPA2)
Fixes: 3993069549 ("arm64: realm: Query IPA size from the RMM")
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Under certain circumstances, all the remaining subrequests from a read
request will get abandoned during retry. The abandonment process expects
the 'subreq' variable to be set to the place to start abandonment from, but
it doesn't always have a useful value (it will be uninitialised on the
first pass through the loop and it may point to a deleted subrequest on
later passes).
Fix the first jump to "abandon:" to set subreq to the start of the first
subrequest expected to need retry (which, in this abandonment case, turned
out unexpectedly to no longer have NEED_RETRY set).
Also clear the subreq pointer after discarding superfluous retryable
subrequests to cause an oops if we do try to access it.
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/3775287.1773848338@warthog.procyon.org.uk
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
cc: Paulo Alcantara <pc@manguebit.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
mvpp2_bm_switch_buffers() unconditionally calls
mvpp2_bm_pool_update_priv_fc() when switching between per-cpu and
shared buffer pool modes. This function programs CM3 flow control
registers via mvpp2_cm3_read()/mvpp2_cm3_write(), which dereference
priv->cm3_base without any NULL check.
When the CM3 SRAM resource is not present in the device tree (the
third reg entry added by commit 60523583b0 ("dts: marvell: add CM3
SRAM memory to cp11x ethernet device tree")), priv->cm3_base remains
NULL and priv->global_tx_fc is false. Any operation that triggers
mvpp2_bm_switch_buffers(), for example an MTU change that crosses
the jumbo frame threshold, will crash:
Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000000
Mem abort info:
ESR = 0x0000000096000006
EC = 0x25: DABT (current EL), IL = 32 bits
pc : readl+0x0/0x18
lr : mvpp2_cm3_read.isra.0+0x14/0x20
Call trace:
readl+0x0/0x18
mvpp2_bm_pool_update_fc+0x40/0x12c
mvpp2_bm_pool_update_priv_fc+0x94/0xd8
mvpp2_bm_switch_buffers.isra.0+0x80/0x1c0
mvpp2_change_mtu+0x140/0x380
__dev_set_mtu+0x1c/0x38
dev_set_mtu_ext+0x78/0x118
dev_set_mtu+0x48/0xa8
dev_ifsioc+0x21c/0x43c
dev_ioctl+0x2d8/0x42c
sock_ioctl+0x314/0x378
Every other flow control call site in the driver already guards
hardware access with either priv->global_tx_fc or port->tx_fc.
mvpp2_bm_switch_buffers() is the only place that omits this check.
Add the missing priv->global_tx_fc guard to both the disable and
re-enable calls in mvpp2_bm_switch_buffers(), consistent with the
rest of the driver.
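The guard can be sketched as follows, under the assumption of a trimmed-down private structure. `fc_priv`, `fc_update()` and `switch_buffers()` are hypothetical stand-ins and the `fc_updates` counter exists only to make the behaviour observable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down private state. */
struct fc_priv {
	bool global_tx_fc;		/* true only when the CM3 region was mapped */
	volatile uint32_t *cm3_base;	/* NULL when the resource is absent */
	int fc_updates;			/* counts register writes, for illustration */
};

static void fc_update(struct fc_priv *priv, bool enable)
{
	priv->cm3_base[0] = enable;	/* would fault if cm3_base were NULL */
	priv->fc_updates++;
}

static void switch_buffers(struct fc_priv *priv)
{
	/* Disable flow control only if it is actually in use. */
	if (priv->global_tx_fc)
		fc_update(priv, false);

	/* ... switch between per-cpu and shared buffer pools ... */

	/* Re-enable under the same condition. */
	if (priv->global_tx_fc)
		fc_update(priv, true);
}
```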
Fixes: 3a616b92a9 ("net: mvpp2: Add TX flow control support for jumbo frames")
Signed-off-by: Muhammad Hammad Ijaz <mhijaz@amazon.com>
Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com>
Link: https://patch.msgid.link/20260316193157.65748-1-mhijaz@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
nfnl_osf_add_callback() validates opt_num bounds and string
NUL-termination but does not check individual option length fields.
A zero-length option causes nf_osf_match_one() to enter the option
matching loop even when foptsize sums to zero, which matches packets
with no TCP options where ctx->optp is NULL:
Oops: general protection fault
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
nf_osf_match (net/netfilter/nfnetlink_osf.c:227)
xt_osf_match_packet (net/netfilter/xt_osf.c:32)
ipt_do_table (net/ipv4/netfilter/ip_tables.c:293)
nf_hook_slow (net/netfilter/core.c:623)
ip_local_deliver (net/ipv4/ip_input.c:262)
ip_rcv (net/ipv4/ip_input.c:573)
Additionally, an MSS option (kind=2) with length < 4 causes
out-of-bounds reads when nf_osf_match_one() unconditionally accesses
optp[2] and optp[3] for MSS value extraction. While RFC 9293
section 3.2 specifies that the MSS option is always exactly 4
bytes (Kind=2, Length=4), the check uses "< 4" rather than
"!= 4" because lengths greater than 4 do not cause memory
safety issues -- the buffer is guaranteed to be at least
foptsize bytes by the ctx->optsize == foptsize check.
Reject fingerprints where any option has zero length, or where an MSS
option has length less than 4, at add time rather than trusting these
values in the packet matching hot path.
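A minimal sketch of such an add-time check, assuming a simplified option record: `fp_opt`, `FP_MAX_OPTS` and `validate_fingerprint_opts()` are hypothetical names and do not reflect the actual nfnetlink_osf structures.

```c
#include <errno.h>
#include <stdint.h>

#define FP_MAX_OPTS	40	/* hypothetical upper bound on options */
#define OPT_KIND_MSS	2	/* TCP MSS option kind */
#define OPT_LEN_MSS	4	/* kind + length + 2-byte MSS value */

/* Hypothetical simplified fingerprint option. */
struct fp_opt {
	uint8_t kind;
	uint8_t length;
};

/* Reject malformed fingerprints once, at add time, not per packet. */
static int validate_fingerprint_opts(const struct fp_opt *opts,
				     unsigned int opt_num)
{
	unsigned int i;

	if (opt_num > FP_MAX_OPTS)
		return -EINVAL;

	for (i = 0; i < opt_num; i++) {
		/* A zero-length option would never advance the option parser. */
		if (opts[i].length == 0)
			return -EINVAL;

		/* The MSS value is read from bytes 2 and 3 of the option. */
		if (opts[i].kind == OPT_KIND_MSS &&
		    opts[i].length < OPT_LEN_MSS)
			return -EINVAL;
	}

	return 0;
}
```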
Fixes: 11eeef41d5 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Call synchronize_rcu() after unregistering the hooks from error path,
since a hook that already refers to this flowtable can be already
registered, exposing this flowtable to packet path and nfnetlink_hook
control plane.
This error path is rare, it should only happen by reaching the maximum
number of hooks or by failing to set up hardware offload, so just call
synchronize_rcu().
There is a check for device hooks already used by a different flowtable
that could result in EEXIST at this late stage. The hook parser can be
updated to perform this check earlier so that this error path really
becomes rarely exercised.
Uncovered by KASAN reported as use-after-free from nfnetlink_hook path
when dumping hooks.
Fixes: 3b49e2e94e ("netfilter: nf_tables: add flow table netlink frontend")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Yiming Qian reports UaF when a concurrent process is dumping hooks via
nfnetlink_hooks:
BUG: KASAN: slab-use-after-free in nfnl_hook_dump_one.isra.0+0xe71/0x10f0
Read of size 8 at addr ffff888003edbf88 by task poc/79
Call Trace:
<TASK>
nfnl_hook_dump_one.isra.0+0xe71/0x10f0
netlink_dump+0x554/0x12b0
nfnl_hook_get+0x176/0x230
[..]
Defer release until after concurrent readers have completed.
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Fixes: 84601d6ee6 ("bpf: add bpf_link support for BPF_NETFILTER programs")
Signed-off-by: Florian Westphal <fw@strlen.de>
Johannes Berg says:
====================
Just a few updates:
- cfg80211:
- guarantee pmsr work is cancelled
- mac80211:
- reject TDLS operations on non-TDLS stations
- fix crash in AP_VLAN bandwidth change
- fix leak or double-free on some TX preparation
failures
- remove keys needed for beacons _after_ stopping
those
- fix debugfs static branch race
- avoid underflow in inactive time
- fix another NULL dereference in mesh on invalid
frames
- ti/wlcore: avoid infinite realloc loop
* tag 'wireless-2026-03-18' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: always free skb on ieee80211_tx_prepare_skb() failure
wifi: wlcore: Return -ENOMEM instead of -EAGAIN if there is not enough headroom
wifi: mac80211: fix NULL deref in mesh_matches_local()
wifi: mac80211: check tdls flag in ieee80211_tdls_oper
wifi: cfg80211: cancel pmsr_free_wk in cfg80211_pmsr_wdev_down
wifi: mac80211: Fix static_branch_dec() underflow for aql_disable.
mac80211: fix crash in ieee80211_chan_bw_change for AP_VLAN stations
wifi: mac80211: use jiffies_delta_to_msecs() for sta_info inactive times
wifi: mac80211: remove keys after disabling beaconing
wifi: mac80211_hwsim: fully initialise PMSR capabilities
====================
Link: https://patch.msgid.link/20260318172515.381148-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When CONFIG_IPV6 is disabled, the udp_sock_create6() function returns 0
(success) without actually creating a socket. Callers such as
fou_create() then proceed to dereference the uninitialized socket
pointer, resulting in a NULL pointer dereference.
The captured NULL deref crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
RIP: 0010:fou_nl_add_doit (net/ipv4/fou_core.c:590 net/ipv4/fou_core.c:764)
[...]
Call Trace:
<TASK>
genl_family_rcv_msg_doit.constprop.0 (net/netlink/genetlink.c:1114)
genl_rcv_msg (net/netlink/genetlink.c:1194 net/netlink/genetlink.c:1209)
[...]
netlink_rcv_skb (net/netlink/af_netlink.c:2550)
genl_rcv (net/netlink/genetlink.c:1219)
netlink_unicast (net/netlink/af_netlink.c:1319 net/netlink/af_netlink.c:1344)
netlink_sendmsg (net/netlink/af_netlink.c:1894)
__sock_sendmsg (net/socket.c:727 (discriminator 1) net/socket.c:742 (discriminator 1))
__sys_sendto (./include/linux/file.h:62 (discriminator 1) ./include/linux/file.h:83 (discriminator 1) net/socket.c:2183 (discriminator 1))
__x64_sys_sendto (net/socket.c:2213 (discriminator 1) net/socket.c:2209 (discriminator 1) net/socket.c:2209 (discriminator 1))
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (net/arch/x86/entry/entry_64.S:130)
This patch makes udp_sock_create6 return -EPFNOSUPPORT instead, so
callers correctly take their error paths. There is only one caller of
the vulnerable function and only privileged users can trigger it.
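The effect on a caller can be sketched in user space. `mock_sock`, `mock_sock_create6()` and `mock_tunnel_create()` are hypothetical and merely model a compiled-out stub and a caller that dereferences the socket only after a successful create.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket object. */
struct mock_sock {
	int family;
};

/* Stub for the IPv6-disabled build: report failure, leave *sockp untouched. */
static int mock_sock_create6(struct mock_sock **sockp)
{
	(void)sockp;
	return -EPFNOSUPPORT;
}

/* Hypothetical caller: the socket is only used after a successful create. */
static int mock_tunnel_create(void)
{
	struct mock_sock *sock = NULL;
	int err;

	err = mock_sock_create6(&sock);
	if (err < 0)
		return err;	/* error path taken, sock is never dereferenced */

	return sock->family;
}
```

Had the stub returned 0, the caller would have gone on to dereference the still-NULL pointer.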
Fixes: fd384412e1 ("udp_tunnel: Seperate ipv6 functions into its own file.")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260317010241.1893893-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In IPSec full offload mode, the device reports an ESN (Extended
Sequence Number) wrap event to the driver. The driver validates this
event by querying the IPSec ASO and checking that the esn_event_arm
field is 0x0, which indicates an event has occurred. After handling
the event, the driver must re-arm the context by setting esn_event_arm
back to 0x1.
A race condition exists in this handling path. After validating the
event, the driver calls mlx5_accel_esp_modify_xfrm() to update the
kernel's xfrm state. This function temporarily releases and
re-acquires the xfrm state lock.
So, acknowledge the event first by setting esn_event_arm to
0x1. This prevents the driver from reprocessing the same ESN update if
the hardware sends events for other reasons. Since the next ESN update
only occurs after nearly 2^31 packets are received, there's no risk of
missing an update, as it will happen long after this handling has
finished.
Processing the event twice causes the ESN high-order bits (esn_msb) to
be incremented incorrectly. The driver then programs the hardware with
this invalid ESN state, which leads to anti-replay failures and a
complete halt of IPSec traffic.
Fix this by re-arming the ESN event immediately after it is validated,
before calling mlx5_accel_esp_modify_xfrm(). This ensures that any
spurious, duplicate events are correctly ignored, closing the race
window.
Fixes: fef0667893 ("net/mlx5e: Fix ESN update kernel panic")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Querying or updating an IPSec offload object is done through an Access ASO WQE.
The driver uses a single mlx5e_ipsec_aso struct for each PF, which
contains a shared DMA-mapped context for all ASO operations.
A race condition exists because the ASO spinlock is released before
the hardware has finished processing the WQE. If a second operation is
initiated immediately after, it overwrites the shared context in the
DMA area.
When the first operation's completion is processed later, it reads
this corrupted context, leading to unexpected behavior and incorrect
results.
This commit fixes the race by introducing a private context within
each IPSec offload object. The shared ASO context is now copied to
this private context while the ASO spinlock is held. Subsequent
processing uses this saved, per-object context, ensuring its integrity
is maintained.
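The snapshot-under-lock idea can be sketched as below. This is an illustrative user-space model, not the mlx5 code: `aso_ctx`, `aso`, `offload_obj` and `aso_query()` are hypothetical, a C11 `atomic_flag` stands in for the ASO spinlock, and the `hw_result` parameter simulates the device writing into the shared DMA area.

```c
#include <stdatomic.h>
#include <string.h>

/* Hypothetical context layout. */
struct aso_ctx {
	unsigned int data[4];
};

/* One shared context per function, protected by a simple spinlock. */
struct aso {
	atomic_flag lock;
	struct aso_ctx shared;
};

/* Each offload object keeps its own private copy of the last result. */
struct offload_obj {
	struct aso_ctx priv;
};

static void aso_lock(struct aso *a)
{
	while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
		;
}

static void aso_unlock(struct aso *a)
{
	atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

static void aso_query(struct aso *a, struct offload_obj *obj,
		      unsigned int hw_result)
{
	aso_lock(a);
	a->shared.data[0] = hw_result;	/* stands in for the device's DMA write */
	/* Snapshot the shared area before the lock is dropped. */
	memcpy(&obj->priv, &a->shared, sizeof(obj->priv));
	aso_unlock(a);
}
```

Later processing reads `obj->priv`, so a second query that reuses the shared area cannot change what the first one observed.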
Fixes: 1ed78fc033 ("net/mlx5e: Update IPsec soft and hard limits")
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A lock dependency cycle exists where:
1. mlx5_ib_roce_init -> mlx5_core_uplink_netdev_event_replay ->
mlx5_blocking_notifier_call_chain (takes notifier_rwsem) ->
mlx5e_mdev_notifier_event -> mlx5_netdev_notifier_register ->
register_netdevice_notifier_dev_net (takes rtnl)
=> notifier_rwsem -> rtnl
2. mlx5e_probe -> _mlx5e_probe ->
mlx5_core_uplink_netdev_set (takes uplink_netdev_lock) ->
mlx5_blocking_notifier_call_chain (takes notifier_rwsem)
=> uplink_netdev_lock -> notifier_rwsem
3. devlink_nl_rate_set_doit -> devlink_nl_rate_set ->
mlx5_esw_devlink_rate_leaf_tx_max_set -> esw_qos_devlink_rate_to_mbps ->
mlx5_esw_qos_max_link_speed_get (takes rtnl) ->
mlx5_esw_qos_lag_link_speed_get_locked ->
mlx5_uplink_netdev_get (takes uplink_netdev_lock)
=> rtnl -> uplink_netdev_lock
=> BOOM! (lock cycle)
Fix that by restricting the rtnl-protected section to just the necessary
part, the call to netdev_master_upper_dev_get and speed querying, so
that the last lock dependency is avoided and the cycle doesn't close.
This is safe because mlx5_uplink_netdev_get uses netdev_hold to keep the
uplink netdev alive while its master device is queried.
Use this opportunity to rename the ambiguously-named "hold_rtnl_lock"
argument to "take_rtnl" and remove the "_locked" suffix from
mlx5_esw_qos_lag_link_speed_get_locked.
Fixes: 6b4be64fd9 ("net/mlx5e: Harden uplink netdev access against device unbind")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260316094603.6999-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-03-17 (igc, iavf, libie)
Kohei Enju adds use of helper function to add missing update of
skb->tail when padding is needed for igc.
Zdenek Bouska clears stale XSK timestamps when taking down Tx rings on
igc.
Petr Oros changes iavf VLAN filter handling for the case where an added
VLAN is also on the delete list, which can race and cause the VLAN
filter to not be added.
Michal frees cmd_buf for libie firmware logging to stop memory leaks.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
libie: prevent memleak in fwlog code
iavf: fix VLAN filter lost on add/delete race
igc: fix page fault in XDP TX timestamps handling
igc: fix missing update of skb->tail in igc_xmit_frame()
====================
Link: https://patch.msgid.link/20260317211906.115505-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gregory reported in [0] that the global_map_resize test, when run
repeatedly, ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.
Logically, once a module is unloaded, its associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long as its reference counts are alive, but it should no longer be
discoverable.
To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.
Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.
To avoid special-casing the btf_is_module() case, set btf->id to zero to
make btf_free_id() idempotent, such that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR, the idr_remove() should be a noop.
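The idempotency idea can be illustrated with a minimal user-space sketch.
This is not the kernel code: toy_obj, toy_table, toy_alloc_id() and
toy_free_id() are hypothetical names, and a plain array stands in for the
IDR. Unlike the description above, the sketch checks for a zero ID
explicitly instead of relying on the removal being a noop. It shows that
clearing the stored ID after the first removal keeps a second call from
releasing a slot that has since been reused.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

/* Hypothetical stand-in for an object that owns an ID (0 means "no ID"). */
struct toy_obj {
	unsigned int id;
};

/* Hypothetical stand-in for the IDR: slot i holds the owner of ID i. */
static struct toy_obj *toy_table[TOY_MAX_IDS];

/* Publish obj under the lowest free non-zero ID; returns 0 if none is free. */
static unsigned int toy_alloc_id(struct toy_obj *obj)
{
	assert(obj != NULL);

	for (unsigned int i = 1; i < TOY_MAX_IDS; i++) {
		if (!toy_table[i]) {
			toy_table[i] = obj;
			obj->id = i;
			return i;
		}
	}
	return 0;
}

/* Idempotent removal: a second call sees id == 0 and touches nothing. */
static void toy_free_id(struct toy_obj *obj)
{
	assert(obj != NULL);

	if (obj->id == 0)
		return;
	toy_table[obj->id] = NULL;
	obj->id = 0;
}
```

If a second object reuses the ID after the first toy_free_id() call, a
repeated toy_free_id() on the first object leaves the new entry intact.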
Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.
[0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com
Fixes: 36e68442d1 ("bpf: Load and verify kernel module BTFs")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Reported-by: Gregory Bell <grbell@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
__in6_dev_get() can return NULL when the device has no IPv6 configuration
(e.g. MTU < IPV6_MIN_MTU or after NETDEV_UNREGISTER).
Add NULL checks for idev returned by __in6_dev_get() in both
seg6_hmac_validate_skb() and ipv6_srh_rcv() to prevent potential NULL
pointer dereferences.
Fixes: 1ababeba4a ("ipv6: implement dataplane support for rthdr type 4 (Segment Routing Header)")
Fixes: bf355b8d2c ("ipv6: sr: add core files for SR HMAC support")
Signed-off-by: Minhong He <heminhong@kylinos.cn>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Link: https://patch.msgid.link/20260316073301.106643-1-heminhong@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If hardware doesn't support RX Flow Filters, rx_fs_lock spinlock is not
initialized leading to the following assertion splat triggerable via
set_rxnfc callback.
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 1 PID: 949 Comm: syz.0.6 Not tainted 6.1.164+ #113
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x8d/0xba lib/dump_stack.c:106
assign_lock_key kernel/locking/lockdep.c:974 [inline]
register_lock_class+0x141b/0x17f0 kernel/locking/lockdep.c:1287
__lock_acquire+0x74f/0x6c40 kernel/locking/lockdep.c:4928
lock_acquire kernel/locking/lockdep.c:5662 [inline]
lock_acquire+0x190/0x4b0 kernel/locking/lockdep.c:5627
__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
_raw_spin_lock_irqsave+0x33/0x50 kernel/locking/spinlock.c:162
gem_del_flow_filter drivers/net/ethernet/cadence/macb_main.c:3562 [inline]
gem_set_rxnfc+0x533/0xac0 drivers/net/ethernet/cadence/macb_main.c:3667
ethtool_set_rxnfc+0x18c/0x280 net/ethtool/ioctl.c:961
__dev_ethtool net/ethtool/ioctl.c:2956 [inline]
dev_ethtool+0x229c/0x6290 net/ethtool/ioctl.c:3095
dev_ioctl+0x637/0x1070 net/core/dev_ioctl.c:510
sock_do_ioctl+0x20d/0x2c0 net/socket.c:1215
sock_ioctl+0x577/0x6d0 net/socket.c:1320
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:870 [inline]
__se_sys_ioctl fs/ioctl.c:856 [inline]
__x64_sys_ioctl+0x18c/0x210 fs/ioctl.c:856
do_syscall_x64 arch/x86/entry/common.c:46 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:76
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
A more straightforward solution would be to always initialize rx_fs_lock,
just like rx_fs_list. However, in this case the driver set_rxnfc callback
would return with a rather confusing error code, e.g. -EINVAL. So deny
set_rxnfc attempts directly if the RX filtering feature is not supported
by hardware.
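A minimal user-space sketch of the "deny early" approach follows. It
assumes a hypothetical toy_dev structure with a capability flag, and the
-EOPNOTSUPP return value is an assumption made for illustration, not
taken from the patch. The point is only that the capability check runs
before any state that was never initialized is touched.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model. */
struct toy_dev {
	bool has_rx_filters;	/* hypothetical capability flag */
	bool lock_initialized;	/* set only when the feature is set up */
	int nr_filters;
};

static void toy_dev_init(struct toy_dev *dev, bool has_rx_filters)
{
	assert(dev != NULL);

	dev->has_rx_filters = has_rx_filters;
	dev->lock_initialized = has_rx_filters;
	dev->nr_filters = 0;
}

static int toy_set_rxnfc(struct toy_dev *dev)
{
	assert(dev != NULL);

	/* Reject early, before the uninitialized lock would be taken. */
	if (!dev->has_rx_filters)
		return -EOPNOTSUPP;	/* assumed error code */

	assert(dev->lock_initialized);	/* stands in for taking the lock */
	dev->nr_filters++;
	return 0;
}
```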
Fixes: ae8223de3d ("net: macb: Added support for RX filtering")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Link: https://patch.msgid.link/20260316103826.74506-2-pchelkin@ispras.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
nsim_do_psp() takes an extra reference to the PSP skb extension so the
extension survives __dev_forward_skb(). That forward path scrubs the skb
and drops attached skb extensions before nsim_psp_handle_ext() can
reattach the PSP metadata.
If __dev_forward_skb() fails in nsim_forward_skb(), the function returns
before nsim_psp_handle_ext() can attach that extension to the skb, leaving
the extra reference leaked.
Drop the saved PSP extension reference before returning from the
forward-failure path. Guard the put because plain or non-decapsulated
traffic can also fail forwarding without ever taking the extra PSP
reference.
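The reference flow described above can be modelled in a few lines of
user-space C. All names here (toy_ext, toy_forward(), toy_rx()) are
hypothetical and the refcount is a plain integer. The sketch only shows
why the failure path needs a put, and why that put must be guarded for
traffic that never took the extra reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted extension object. */
struct toy_ext {
	int refcnt;
};

static void toy_ext_get(struct toy_ext *ext)
{
	ext->refcnt++;
}

static void toy_ext_put(struct toy_ext *ext)
{
	assert(ext->refcnt > 0);
	ext->refcnt--;
}

/*
 * Hypothetical forward step. Like the path described above, it always
 * drops the packet's own reference to the extension; forward_ok only
 * decides whether it reports success.
 */
static int toy_forward(struct toy_ext *attached, bool forward_ok)
{
	if (attached)
		toy_ext_put(attached);
	return forward_ok ? 0 : -1;
}

/* ext may be NULL for traffic that never takes the extra reference. */
static int toy_rx(struct toy_ext *ext, bool forward_ok)
{
	struct toy_ext *saved = NULL;

	if (ext) {
		toy_ext_get(ext);	/* extra reference to survive the scrub */
		saved = ext;
	}

	if (toy_forward(ext, forward_ok) < 0) {
		if (saved)		/* guard: plain traffic has nothing to put */
			toy_ext_put(saved);
		return -1;
	}

	/* Success: the saved reference is handed back to the packet. */
	return 0;
}
```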
Fixes: f857478d62 ("netdevsim: a basic test PSP implementation")
Signed-off-by: Wesley Atwell <atwellwea@gmail.com>
Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20260317061431.1482716-1-atwellwea@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ':=' override path in xbc_parse_kv() calls xbc_init_node() to
re-initialize an existing value node but does not check the return
value. If xbc_init_node() fails (data offset out of range), parsing
silently continues with stale node data.
Add the missing error check to match the xbc_add_node() call path
which already checks for failure.
In practice, a bootconfig using ':=' to override a value near the
32KB data limit could silently retain the old value, meaning a
security-relevant boot parameter override (e.g., a trace filter or
debug setting) would not take effect as intended.
Link: https://lore.kernel.org/all/20260318155847.78065-2-objecting@objecting.org/
Fixes: e5efaeb8a8 ("bootconfig: Support mixing a value and subkeys under a key")
Signed-off-by: Josh Law <objecting@objecting.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull crypto library fixes from Eric Biggers:
- Disable the "padlock" SHA-1 and SHA-256 driver on Zhaoxin
processors, since it does not compute hash values correctly
- Make a generated file be removed by 'make clean'
- Fix excessive stack usage in some of the arm64 AES code
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: powerpc: Add powerpc/aesp8-ppc.S to clean-files
crypto: padlock-sha - Disable for Zhaoxin processor
crypto: arm64/aes-neonbs - Move key expansion off the stack
People go to some effort to inject MCEs into guests in order to simulate/test
handling of hardware errors. The real use case behind it is testing the
handling of SIGBUS which the memory failure code sends to the process.
If that process is QEMU, instead of killing the whole guest, the MCE can
be injected into the guest kernel so that the latter can attempt proper
handling and kill the user *process* in the guest, instead, which
caused the MCE. The assumption being here that the whole injection flow
can supply enough information that the guest kernel can pinpoint the
right process. But that's a different topic...
Regardless of virtualization or not, access to SMCA-specific registers
like MCA_DESTAT should only be done after having checked the smca
feature bit. And there are AMD machines like Bulldozer (the one before
Zen1) which do support deferred errors but are not SMCA machines.
Therefore, properly check the feature bit before accessing related MSRs.
[ bp: Rewrite commit message. ]
Fixes: 7cb735d7c0 ("x86/mce: Unify AMD DFR handler with MCA Polling")
Signed-off-by: William Roche <william.roche@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20260218163025.1316501-1-william.roche@oracle.com
Pull nfsd fixes from Chuck Lever:
- Fix cache_request leak in cache_release()
- Fix heap overflow in the NFSv4.0 LOCK replay cache
- Hold net reference for the lifetime of /proc/fs/nfs/exports fd
- Defer sub-object cleanup in export "put" callbacks
* tag 'nfsd-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix heap overflow in NFSv4.0 LOCK replay cache
sunrpc: fix cache_request leak in cache_release
NFSD: Hold net reference for the lifetime of /proc/fs/nfs/exports fd
NFSD: Defer sub-object cleanup in export put callbacks
isl68137_avs_enable_show_page() uses the return value of
pmbus_read_byte_data() without checking for errors. If the I2C transaction
fails, a negative error code is passed through bitwise operations,
producing incorrect output.
Add an error check to propagate the return value if it is negative.
Additionally, modernize the callback by replacing sprintf()
with sysfs_emit().
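A small user-space sketch shows why the check matters. The function name
toy_enable_show() and the bit mask TOY_AVS_EN_BIT are hypothetical, and
snprintf() stands in for sysfs_emit(). The raw value follows the usual
convention of 0..255 on success and a negative errno on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define TOY_AVS_EN_BIT 0x01	/* hypothetical bit, for illustration only */

/*
 * Format the enable bit of a register value into buf. raw is 0..255 on
 * success or a negative errno on failure.
 */
static int toy_enable_show(int raw, char *buf, size_t len)
{
	assert(buf != NULL);

	if (raw < 0)
		return raw;	/* propagate the error instead of masking it */

	return snprintf(buf, len, "%d\n", (raw & TOY_AVS_EN_BIT) ? 1 : 0);
}
```

Without the check, the mask would be applied to the error code itself. On
Linux, where EIO is 5, -EIO & 0x01 evaluates to 1, so a failed read would
be reported as "enabled".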
Fixes: 038a9c3d1e ("hwmon: (pmbus/isl68137) Add driver for Intersil ISL68137 PWM Controller")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260318193952.47908-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
To pick up some extra files that need to be sync'ed with the kernel
sources to try and reduce the number of PRs.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The bcm2835_asb_control() function uses a tight polling loop to wait
for the ASB bridge to acknowledge a request. During intensive workloads,
this handshake intermittently fails for V3D's master ASB on BCM2711,
resulting in "Failed to disable ASB master for v3d" errors during
runtime PM suspend. As a consequence, the failed power-off leaves V3D in
a broken state, leading to bus faults or system hangs on later accesses.
As the timeout is insufficient in some scenarios, increase the polling
timeout from 1us to 5us, which is still negligible in the context of a
power domain transition. Also, replace the open-coded ktime_get_ns()/
cpu_relax() polling loop with readl_poll_timeout_atomic().
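The effect of the timeout value can be sketched with a simulated clock in
user space. Everything here is hypothetical: toy_bridge models the
bridge, each simulated register read costs 100 ns, and the ACK time is
an arbitrary simulation value rather than a measurement. The loop
follows the general poll-until-condition-or-deadline shape, including a
final re-read after the deadline so a late acknowledgement is not
misreported.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ACK 0x1u

/* Hypothetical bridge model driven by a simulated clock. */
struct toy_bridge {
	uint64_t now_ns;	/* simulated current time */
	uint64_t ack_at_ns;	/* simulated time at which ACK appears */
};

/* Each simulated register read takes 100 ns. */
static uint32_t toy_readl(struct toy_bridge *b)
{
	b->now_ns += 100;
	return b->now_ns >= b->ack_at_ns ? TOY_ACK : 0;
}

/* Poll until the ACK bit is set or timeout_ns of simulated time elapses. */
static int toy_poll_ack(struct toy_bridge *b, uint64_t timeout_ns)
{
	uint64_t deadline;

	assert(b != NULL);
	deadline = b->now_ns + timeout_ns;

	for (;;) {
		if (toy_readl(b) & TOY_ACK)
			return 0;
		if (b->now_ns > deadline)
			/* Re-read once so a late ACK is not misreported. */
			return (toy_readl(b) & TOY_ACK) ? 0 : -ETIMEDOUT;
	}
}
```

With an acknowledgement that arrives after 3000 ns of simulated time, a
1000 ns budget times out while a 5000 ns budget succeeds.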
Cc: stable@vger.kernel.org
Fixes: 670c672608 ("soc: bcm: bcm2835-pm: Add support for power domains under a new binding.")
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Stefan Wahren <wahrenst@gmx.net>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Timings of the nand are adjusted by pl35x_nfc_setup_interface() but
actually applied by the pl35x_nand_select_target() function.
If there is only one nand chip, the pl35x_nand_select_target() will only
apply the timings once since the test at its beginning will always be true
after the first call to this function. As a result, the hardware will
keep using the default timings set at boot to detect the nand chip, not
the optimal ones.
With this patch, we program the new timings directly when
pl35x_nfc_setup_interface() is called.
Fixes: 08d8c62164 ("mtd: rawnand: pl353: Add support for the ARM PL353 SMC NAND controller")
Signed-off-by: Olivier Sobrie <olivier@sobrie.be>
Cc: stable@vger.kernel.org
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
This helper really is just a little helper for internal purposes, and is
I/O operation oriented, despite its name. It has already been misused
in commit 5008c3ec3f ("mtd: spi-nor: core: Check read CR support"), so
rename it to clarify its purpose: it is only useful for reads and page
programs.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Commit 5008c3ec3f ("mtd: spi-nor: core: Check read CR support") adds a
controller check to make sure the core will not use CR reads on
controllers not supporting them. The approach is valid but the fix is
incorrect. Unfortunately, the author could not catch it, because the
expected behavior was met. The patch indeed drops the RDCR capability,
but it does it for all controllers!
The issue comes from the use of spi_nor_spimem_check_op() which is an
internal helper dedicated to check read/write operations only, despite
its generic name.
This helper looks for the biggest number of address bytes that can be
used for a page operation and tries 4 then 3. It then calls the usual
spi-mem helpers to do the checks. These will always fail because there
is now an inconsistency: the address cycles are forced to 4 (then 3)
bytes, but the bus width during the address cycles rightfully remains
0. There is a non-zero address length but a zero address bus width,
which is an invalid combination.
The correct check in this case is to directly call spi_mem_supports_op()
which doesn't mess with the operation content.
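The inconsistency can be reproduced with a reduced model of the address
phase. The structure and function names below are hypothetical and far
simpler than the real spi-mem types. The sketch only shows that forcing
an address length onto an operation without an address phase makes an
otherwise valid operation fail a coherency check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down model of the address phase of an operation. */
struct toy_addr_phase {
	unsigned int nbytes;	/* number of address bytes, 0 if none */
	unsigned int buswidth;	/* lines used for the address, 0 if none */
};

/*
 * An address phase is coherent only if length and bus width are both
 * zero or both non-zero.
 */
static bool toy_addr_phase_valid(const struct toy_addr_phase *addr)
{
	assert(addr != NULL);
	return (addr->nbytes == 0) == (addr->buswidth == 0);
}

/* Models a helper meant for reads and page programs: it forces a length. */
static bool toy_check_io_op(struct toy_addr_phase addr)
{
	addr.nbytes = 4;
	if (toy_addr_phase_valid(&addr))
		return true;
	addr.nbytes = 3;
	return toy_addr_phase_valid(&addr);
}
```

A register read with no address phase passes toy_addr_phase_valid()
directly but always fails toy_check_io_op(), which matches the behavior
described above.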
Fixes: 5008c3ec3f ("mtd: spi-nor: core: Check read CR support")
Cc: stable@vger.kernel.org
Acked-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Acked-by: Takahiro Kuwano <takahiro.kuwano@infineon.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
The bio completion path in the process context (e.g. dm-verity)
will directly call into decompression rather than trigger another
workqueue context for minimal scheduling latencies, which can
then call vm_map_ram() with GFP_KERNEL.
Due to insufficient memory, vm_map_ram() may generate memory
swapping I/O, which can cause submit_bio_wait to deadlock
in some scenarios.
The trimmed-down call stack is as follows:
f2fs_submit_read_io
submit_bio //bio_list is initialized.
mmc_blk_mq_recovery
z_erofs_endio
vm_map_ram
__pte_alloc_kernel
__alloc_pages_direct_reclaim
shrink_folio_list
__swap_writepage
submit_bio_wait //bio_list is non-NULL, hang!!!
Use memalloc_noio_{save,restore}() to wrap up this path.
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jiucheng Xu <jiucheng.xu@amlogic.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The current error handling has two issues:
First, pin_user_pages_fast() can return a short pin count (less than
requested but greater than zero) when it cannot pin all requested pages.
This is treated as success, leading to partially pinned regions being
used, which causes memory corruption.
Second, when an error occurs mid-loop, already pinned pages from the
current batch are not properly accounted for before calling
mshv_region_invalidate_pages(), causing a page reference leak.
Treat short pins as errors and fix partial batch accounting before
cleanup.
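Both points can be sketched in user space with a fake pinning backend.
toy_pin() and toy_pin_region() are hypothetical names, and the budget
parameter exists only to simulate short pins. The sketch treats a short
pin as a failure and adds the pages of the partial batch to the running
count before returning, so the caller's cleanup knows how many pages to
release.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical pinning backend: pins up to want pages and returns how
 * many it pinned. budget limits how many pages can still be pinned, to
 * simulate short pins.
 */
static long toy_pin(size_t want, size_t *budget)
{
	size_t got = want < *budget ? want : *budget;

	*budget -= got;
	return (long)got;
}

/*
 * Pin total pages in batches. On return, *pinned holds the number of
 * pages that are pinned and must be released by the caller's cleanup,
 * including those of a short final batch.
 */
static int toy_pin_region(size_t total, size_t batch, size_t *budget,
			  size_t *pinned)
{
	size_t done = 0;

	assert(batch > 0 && budget != NULL && pinned != NULL);

	while (done < total) {
		size_t want = total - done < batch ? total - done : batch;
		long ret = toy_pin(want, budget);

		if (ret > 0)
			done += (size_t)ret;	/* account the partial batch */
		if (ret < 0 || (size_t)ret != want) {
			*pinned = done;
			return -1;		/* a short pin is a failure */
		}
	}

	*pinned = done;
	return 0;
}
```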
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
clang-22 rightfully warns that the memcpy() in adapter_prepare() copies
between different structures, crossing the boundary of nested
structures inside it:
In file included from sound/pci/asihpi/hpimsgx.c:13:
In file included from include/linux/string.h:386:
include/linux/fortify-string.h:569:4: error: call to '__write_overflow_field' declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror,-Wattribute-warning]
569 | __write_overflow_field(p_size_field, size);
The two structures seem to refer to the same layout, despite the
separate definitions, so the code is in fact correct.
Avoid the warning by copying the two inner structures separately.
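The pattern can be sketched with two hypothetical message types that
share a layout. These are not the asihpi structures, and the inner types
are shared here for brevity. Each memcpy() stays within a single member,
which is what the fortified string helpers expect.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical inner structures. */
struct toy_hdr {
	unsigned short size;
	unsigned short type;
};

struct toy_body {
	unsigned int index;
	unsigned char data[8];
};

/* Two separately defined types that share the same layout. */
struct toy_msg_a {
	struct toy_hdr h;
	struct toy_body b;
};

struct toy_msg_b {
	struct toy_hdr h;
	struct toy_body b;
};

/*
 * Copy member by member so that each memcpy() stays within one field.
 * A single memcpy(&dst->h, src, sizeof(*src)) would copy the same bytes
 * but write past the end of dst->h, which is what fortified builds flag.
 */
static void toy_copy_msg(struct toy_msg_b *dst, const struct toy_msg_a *src)
{
	assert(dst != NULL && src != NULL);

	memcpy(&dst->h, &src->h, sizeof(dst->h));
	memcpy(&dst->b, &src->b, sizeof(dst->b));
}
```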
I see the same pattern happens in other functions in the same file,
so there is a chance that this may come back in the future, but
this instance is the only one that I saw in practice, hitting it
multiple times per day in randconfig build.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260318124016.3488566-1-arnd@kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pull SoC fixes from Arnd Bergmann:
"The firmware drivers for ARM SCMI, FF-A and the Tee subsystem, as
well as the reset controller and cache controller subsystem all see
small bugfixes for reference counting errors, ABI correctness, and
NULL pointer dereferences.
Similarly, there are multiple reference counting fixes in drivers/soc/
for vendor specific drivers (rockchips, microchip), while the
freescale drivers get a fix for a race condition and error handling.
The devicetree fixes for Rockchips and NXP got held up, so for
the moment there is only Renesas fixing problems with SD card
initialization, a boot hang on one board and incorrect descriptions
for interrupts and clock registers on some SoCs. The Microchip
polarfire gets a dts fix for a boot time warning.
A defconfig fix avoids a warning about a conflicting assignment"
* tag 'soc-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (21 commits)
ARM: multi_v7_defconfig: Drop duplicate CONFIG_TI_PRUSS=m
firmware: arm_scmi: Spelling s/mulit/multi/, s/currenly/currently/
firmware: arm_scmi: Fix NULL dereference on notify error path
firmware: arm_scpi: Fix device_node reference leak in probe path
firmware: arm_ffa: Remove vm_id argument in ffa_rxtx_unmap()
arm64: dts: renesas: r8a78000: Fix out-of-range SPI interrupt numbers
arm64: dts: renesas: rzg3s-smarc-som: Set bypass for Versa3 PLL2
arm64: dts: renesas: r9a09g087: Fix CPG register region sizes
arm64: dts: renesas: r9a09g077: Fix CPG register region sizes
arm64: dts: renesas: r9a09g057: Remove wdt{0,2,3} nodes
arm64: dts: renesas: rzv2-evk-cn15-sd: Add ramp delay for SD0 regulator
arm64: dts: renesas: rzt2h-n2h-evk: Add ramp delay for SD0 card regulator
tee: shm: Remove refcounting of kernel pages
reset: rzg2l-usbphy-ctrl: Check pwrrdy is valid before using it
soc: fsl: cpm1: qmc: Fix error check for devm_ioremap_resource() in qmc_qe_init_resources()
soc: fsl: qbman: fix race condition in qman_destroy_fq
soc: rockchip: grf: Add missing of_node_put() when returning
cache: ax45mp: Fix device node reference leak in ax45mp_cache_init()
cache: starfive: fix device node leak in starlink_cache_init()
riscv: dts: microchip: add can resets to mpfs
...
Pull crypto fix from Herbert Xu:
- Remove duplicate snp_leak_pages call in ccp
* tag 'v7.0-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: ccp - Fix leaking the same page twice
Pull LoongArch fixes from Huacai Chen:
- only use SC.Q when supported by the assembler to fix a build failure
- fix calling smp_processor_id() in preemptible code
- make a BPF helper arch_protect_bpf_trampoline() return 0 to fix a
kernel memory access failure
- fix a typo issue in kvm_vm_init_features()
* tag 'loongarch-fixes-7.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Fix typo issue in kvm_vm_init_features()
LoongArch: BPF: Make arch_protect_bpf_trampoline() return 0
LoongArch: No need to flush icache if text copy failed
LoongArch: Check return values for set_memory_{rw,rox}
LoongArch: Give more information if kmem access failed
LoongArch: Fix calling smp_processor_id() in preemptible code
LoongArch: Only use SC.Q when supported by the assembler
Shengjiu Wang <shengjiu.wang@nxp.com> says:
Check value of is_playback_only and is_capture_only in
graph_util_parse_link_direction() and initialize playback_only and
capture_only in imx-card.c
The audio-graph-card2 driver reads the 'playback-only' and
'capture-only' properties in the sequence below. If these properties are
present in the port_cpu and port_codec nodes but absent from the ep_cpu
and ep_codec nodes, the values of playback_only and capture_only are
reset to zero in the end.
graph_util_parse_link_direction(lnk, &playback_only, &capture_only);
graph_util_parse_link_direction(ports_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(ports_codec, &playback_only, &capture_only);
graph_util_parse_link_direction(port_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(port_codec, &playback_only, &capture_only);
graph_util_parse_link_direction(ep_cpu, &playback_only, &capture_only);
graph_util_parse_link_direction(ep_codec, &playback_only, &capture_only);
So check the value of is_playback_only and is_capture_only in
graph_util_parse_link_direction() and only overwrite the caller's values
when they are true. There is no need to check the np variable, as
of_property_read_bool() copes with a NULL node.
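The intended behavior can be sketched in user space. toy_node and
toy_read_bool() are hypothetical stand-ins for a device-tree node and a
boolean property lookup that tolerates a NULL node. A later node that
lacks the property no longer clears a value set by an earlier one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node carrying the two flags. */
struct toy_node {
	bool playback_only;
	bool capture_only;
};

/* Stand-in for a boolean property lookup that tolerates a NULL node. */
static bool toy_read_bool(const struct toy_node *np, bool capture)
{
	if (!np)
		return false;
	return capture ? np->capture_only : np->playback_only;
}

/* Only overwrite the caller's value when this node actually sets the flag. */
static void toy_parse_link_direction(const struct toy_node *np,
				     bool *playback_only, bool *capture_only)
{
	bool is_playback_only = toy_read_bool(np, false);
	bool is_capture_only = toy_read_bool(np, true);

	assert(playback_only != NULL && capture_only != NULL);

	if (is_playback_only)
		*playback_only = true;
	if (is_capture_only)
		*capture_only = true;
}
```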
Fixes: 3cc393d223 ("ASoC: simple-card-utils: Fix pointer check in graph_util_parse_link_direction")
Fixes: 22a507d768 ("ASoC: simple-card-utils: Check device node before overwrite direction")
Suggested-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Acked-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Link: https://patch.msgid.link/20260318102850.2794029-2-shengjiu.wang@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Arm SCMI fixes for v7.0
Few fixes to:
1. Address a NULL dereference in the SCMI notify error path by ensuring
__scmi_event_handler_get_ops() consistently returns an ERR_PTR on
failure, as expected by callers.
2. Fix a device_node reference leak in the SCPI probe path by introducing
scope-based cleanup for acquired DT nodes.
3. Correct minor spelling errors.
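The first item relies on the error-pointer convention, which can be
imitated in user space as follows. The toy_* helpers are hypothetical
imitations of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), and
toy_get_ops() is an invented lookup. A caller that only tests for an
encoded error is safe as long as the callee never returns NULL on
failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* User-space imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_ops {
	int id;
};

static struct toy_ops toy_known_ops = { .id = 1 };

/* Always returns a valid pointer or an encoded error, never NULL. */
static struct toy_ops *toy_get_ops(int id)
{
	if (id != toy_known_ops.id)
		return toy_err_ptr(-EINVAL);
	return &toy_known_ops;
}
```

Note that toy_is_err(NULL) is false, so a callee that returned NULL on
failure would slip past such a check and be dereferenced.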
* tag 'scmi-fixes-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_scmi: Spelling s/mulit/multi/, s/currenly/currently/
firmware: arm_scmi: Fix NULL dereference on notify error path
firmware: arm_scpi: Fix device_node reference leak in probe path
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Arm FF-A fix for v7.0
Remove the vm_id argument from ffa_rxtx_unmap(), as the FF-A
specification mandates this field be zero in all contexts except a
non-secure physical FF-A instance, where the ID is inherently 0.
* tag 'ffa-fix-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Remove vm_id argument in ffa_rxtx_unmap()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
TEE shared memory update for 7.0
Remove refcounting of kernel pages in register_shm_helper() to support
slab allocations.
* tag 'tee-fix-for-v7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jenswi/linux-tee:
tee: shm: Remove refcounting of kernel pages
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
MFD child devices sharing parent's ACPI Companion fail to probe as
acpi_companion_match() returns incompatible ACPI Companion handle for
binding with the check for pnp.type.backlight added recently. Remove this
pnp.type.backlight check in acpi_companion_match() to fix the automatic
modprobe issue.
Fixes: 7a7a7ed5f8bdb ("ACPI: scan: Register platform devices for backlight device objects")
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Link: https://patch.msgid.link/20260318034842.1216536-1-pratap.nirujogi@amd.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The deeply nested loop in rkvdec_init_v4l2_vp9_count_tbl() needs a lot
of registers, so when the clang register allocator runs out, it ends up
spilling countless temporaries to the stack:
drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than]
Marking this function as noinline_for_stack keeps it out of
rkvdec_vp9_start(), giving the compiler more room for optimization.
The resulting code is good enough that both the total stack usage
and the loop improve enough to stay under the warning limit,
though it's still slow, and would need a larger rework if this
function ends up being called in a fast path.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
The rkvdec_pps had a large set of bitfields, all of which
are misaligned. This causes clang-21 and likely other versions to
produce absolutely awful object code and a warning about very
large stack usage, on targets without unaligned access:
drivers/media/platform/rockchip/rkvdec/rkvdec-vp9.c:966:12: error: stack frame size (1472) exceeds limit (1280) in 'rkvdec_vp9_start' [-Werror,-Wframe-larger-than]
Part of the problem here is how all the bitfield accesses are
inlined into a function that already has large structures on
the stack.
Mark set_field_order_cnt() as noinline_for_stack, and split out
the following accesses in assemble_hw_pps() into another noinline
function, both of which now use around 800 bytes of stack in the
same configuration.
There is clearly still something wrong with clang here, but
splitting it into multiple functions reduces the risk of stack
overflow.
Fixes: fde2490757 ("media: rkvdec: Add H264 support for the VDPU383 variant")
Link: https://godbolt.org/z/acP1eKeq9
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
The values of ext_sps_st_rps and ext_sps_lt_rps in struct rkvdec_hevc_run
are not initialized when the respective controls are not set by userspace.
When this is the case, set them to NULL so the rkvdec_hevc_run_preamble
function that parses controls does not access garbage data which leads to
a panic on inaccessible memory.
Fixes: c9a59dc2ac ("media: rkvdec: Add HEVC support for the VDPU381 variant")
Reported-by: Christian Hewitt <christianshewitt@gmail.com>
Suggested-by: Jonas Karlman <jonas@kwiboo.se>
Signed-off-by: Detlev Casanova <detlev.casanova@collabora.com>
Tested-by: Christian Hewitt <christianshewitt@gmail.com>
Reviewed-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Nicolas Dufresne <nicolas.dufresne@collabora.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
MEDIA_REQUEST_IOC_REINIT can run concurrently with VIDIOC_REQBUFS(0)
queue teardown paths. This can race request object cleanup against vb2
queue cancellation and lead to use-after-free reports.
We already serialize request queueing against STREAMON/OFF with
req_queue_mutex. Extend that serialization to REQBUFS, and also take
the same mutex in media_request_ioctl_reinit() so REINIT is in the
same exclusion domain.
This keeps request cleanup and queue cancellation from running in
parallel for request-capable devices.
Fixes: 6093d3002e ("media: vb2: keep a reference to the request until dqbuf")
Cc: stable@vger.kernel.org
Signed-off-by: Yuchan Nam <entropy1110@gmail.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
An issue was exposed where the OS can pass in U32_MAX for SQ/RQ/SRQ size.
This can cause integer overflow and truncation of the SQ/RQ/SRQ depth,
returning success when it should have failed.
Harden the functions to do all depth calculations and boundary
checking in u64 sizes.
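The wrap-around can be demonstrated with a hypothetical depth formula.
The shift-and-add calculation below is invented for illustration and is
not the irdma formula. It shows how a 32-bit computation turns a request
of UINT32_MAX into a small depth that passes the limit check, while the
64-bit version rejects it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical depth calculation: scale the requested size by a shift
 * and add a reserved amount, then check against a maximum. Doing the
 * math in 64 bits keeps a request of UINT32_MAX from wrapping.
 */
static int toy_get_depth(uint32_t requested, uint32_t shift,
			 uint32_t reserved, uint32_t max_depth,
			 uint32_t *depth)
{
	uint64_t d;

	assert(depth != NULL && shift < 32);

	d = ((uint64_t)requested << shift) + reserved;
	if (d > max_depth)
		return -1;

	*depth = (uint32_t)d;
	return 0;
}

/* The same calculation in 32 bits, shown only to demonstrate the wrap. */
static uint32_t toy_depth_u32(uint32_t requested, uint32_t shift,
			      uint32_t reserved)
{
	return (requested << shift) + reserved;
}
```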
Fixes: 563e1feb5f ("RDMA/irdma: Add SRQ support")
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When rdma_connect() fails due to an invalid arp index, user space rdma core
reports ENOMEM which is confusing. Modify irdma_make_cm_node() to return the
correct error code.
Fixes: 146b9756f1 ("RDMA/irdma: Add connection manager")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Resolve deadlock that occurs when user executes netdev reset while RDMA
applications (e.g., rping) are active. The netdev reset causes ice
driver to remove irdma auxiliary driver, triggering device_delete and
subsequent client removal. During client removal, uverbs_client waits
for QP reference count to reach zero while cma_client holds the final
reference, creating circular dependency and indefinite wait in iWARP
mode. Skip QP reference count wait during device reset to prevent
deadlock.
Fixes: c8f304d75f ("RDMA/irdma: Prevent QP use after free")
Signed-off-by: Anil Samal <anil.samal@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
During reset, irdma_modify_qp() to error should be called to disconnect
the QP. Without this fix, if not preceded by irdma_modify_qp() to error, the
API call irdma_destroy_qp() gets stuck waiting for the QP refcount to go
to zero, because the cm_node associated with this QP isn't disconnected.
Fixes: 915cc7ac0f ("RDMA/irdma: Add miscellaneous utility definitions")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The cm_node is available and the usage of cm_node and event->cm_node
seems arbitrary. Clean up unnecessary dereference of event->cm_node.
Fixes: 146b9756f1 ("RDMA/irdma: Add connection manager")
Signed-off-by: Ivan Barrera <ivan.d.barrera@intel.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Remove a NOP wait_event() in irdma_modify_qp_roce() which is relevant
for iWARP and likely a copy and paste artifact for RoCEv2. The wait event
is for sending a reset on a TCP connection, after the reset has been
requested in irdma_modify_qp(), which occurs only in iWARP mode.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In irdma_modify_qp() update ibqp state to error if the irdma QP is already
in error state, otherwise the ibqp state which is visible to the consumer
app remains stale.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In irdma_create_qp, if ib_copy_to_udata fails, it will call
irdma_destroy_qp to clean up which will attempt to wait on
the free_qp completion, which is not initialized yet. Fix this
by initializing the completion before the ib_copy_to_udata call.
Fixes: b48c24c2d7 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Commit aa35dd5cbc ("iomap: fix invalid folio access after
folio_end_read()") partially addressed invalid folio access for folios
without an ifs attached, but it did not handle the case where
1 << inode->i_blkbits matches the folio size but is different from the
granularity used for the IO, which means IO can be submitted for less
than the full folio for the !ifs case.
In this case, the condition:
if (*bytes_submitted == folio_len)
ctx->cur_folio = NULL;
in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and
iomap_read_end() will still be called on the folio even though the IO
helper owns it and will finish the read on it.
Fix this by unconditionally invalidating ctx->cur_folio for the !ifs
case.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/
Fixes: b2f35ac414 ("iomap: add caller-provided callbacks for read and readahead")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add the `__counted_by_ptr` attribute to the `buffer` field of `struct
xfs_attr_list_context`. This field is used to point to a buffer of
size `bufsize`.
The `buffer` field is assigned in:
1. `xfs_ioc_attr_list` in `fs/xfs/xfs_handle.c`
2. `xfs_xattr_list` in `fs/xfs/xfs_xattr.c`
3. `xfs_getparents` in `fs/xfs/xfs_handle.c` (implicitly initialized to NULL)
In `xfs_ioc_attr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned, although
no access happens between them.
In `xfs_xattr_list`, `buffer` was assigned before `bufsize`. Reorder
them to ensure `bufsize` is set before `buffer` is assigned.
In `xfs_getparents`, `buffer` is NULL (from zero initialization) and
remains NULL. `bufsize` is set to a non-zero value, but since `buffer`
is NULL, no access occurs.
In all cases, the pointer `buffer` is not accessed before `bufsize` is set.
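
The ordering rule can be illustrated with a minimal userspace sketch. The
struct and helper names below (attr_list_ctx_sketch, ctx_set_buffer) are
hypothetical, and the attribute macro is stubbed out as an assumption so
the example builds outside the kernel; it is not the actual XFS code.

```c
#include <assert.h>
#include <stddef.h>

/* Fallback so the sketch builds where the kernel macro is unavailable. */
#ifndef __counted_by_ptr
#define __counted_by_ptr(member)
#endif

/* Hypothetical reduction of struct xfs_attr_list_context. */
struct attr_list_ctx_sketch {
	size_t bufsize;                          /* size of the buffer below */
	char *buffer __counted_by_ptr(bufsize);  /* bounded by bufsize */
};

/* Set the count before the pointer it bounds. */
void ctx_set_buffer(struct attr_list_ctx_sketch *ctx, char *buf, size_t len)
{
	assert(ctx != NULL);
	ctx->bufsize = len;
	ctx->buffer = buf;
}
```

With the count written first, a bounds-checking compiler never sees a
non-NULL pointer paired with a stale size.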
This patch was generated by CodeMender and reviewed by Bill Wendling.
Tested by running xfstests.
Signed-off-by: Bill Wendling <morbo@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The newly added XFS_IOC_VERIFY_MEDIA is a bit unusual in how it handles
buftarg fields. Update it to be more in line with other XFS code:
- use btp->bt_dev instead of btp->bt_bdev->bd_dev to retrieve the device
number for tracing
- use btp->bt_logical_sectorsize instead of
bdev_logical_block_size(btp->bt_bdev) to retrieve the logical sector
size
- compare the buftarg and not the bdev to see if there is a separate
log buftarg
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xchk_quota_item can return early after calling xchk_fblock_process_error.
When that helper returns false, the function returned immediately without
dropping dq->q_qlock, which can leave the dquot lock held and risk lock
leaks or deadlocks in later quota operations.
Fix this by unlocking dq->q_qlock before the early return.
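
As a rough illustration of the pattern (not the actual scrub code), the
sketch below models the lock as a counter. All names are hypothetical, and
the process_ok parameter stands in for the helper's return value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dquot; 'held' models dq->q_qlock. */
struct dquot_sketch {
	int held;
};

void dq_lock(struct dquot_sketch *dq)
{
	assert(dq->held == 0);   /* the modelled lock is not recursive */
	dq->held++;
}

void dq_unlock(struct dquot_sketch *dq)
{
	dq->held--;
}

/* process_ok models the error helper returning true or false. */
int check_item(struct dquot_sketch *dq, bool process_ok)
{
	dq_lock(dq);
	if (!process_ok) {
		dq_unlock(dq);   /* the unlock the early return was missing */
		return -1;
	}
	/* ... remaining checks would run here under the lock ... */
	dq_unlock(dq);
	return 0;
}
```

Both exits leave the counter at zero, which is the invariant the fix
restores.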
Signed-off-by: hongao <hongao@uniontech.com>
Fixes: 7d1f0e167a ("xfs: check the ondisk space mapping behind a dquot")
Cc: <stable@vger.kernel.org> # v6.8
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Factor the loop body of xfsaild_push() into a separate
xfsaild_process_logitem() helper to improve readability.
This is a pure code movement with no functional change.
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
In xfs_inode_item_push() and xfs_qm_dquot_logitem_push(), the AIL lock
is dropped to perform buffer IO. Once the cluster buffer no longer
protects the log item from reclaim, the log item may be freed by
background reclaim or the dquot shrinker. The subsequent spin_lock()
call dereferences lip->li_ailp, which is a use-after-free.
Fix this by saving the ailp pointer in a local variable while the AIL
lock is held and the log item is guaranteed to be valid.
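
A simplified userspace model of the fix, with hypothetical types and a
stand-in IO function that frees the item to mimic reclaim, might look like
this:

```c
#include <assert.h>
#include <stdlib.h>

struct ail_sketch { int locked; };
struct item_sketch { struct ail_sketch *li_ailp; };

/* Stand-in for buffer IO during which reclaim frees the item. */
void io_that_frees_item(struct item_sketch *lip)
{
	free(lip);
}

void push_item(struct item_sketch *lip)
{
	struct ail_sketch *ailp;

	assert(lip != NULL && lip->li_ailp != NULL);
	ailp = lip->li_ailp;       /* captured while the lock is held */

	ailp->locked = 0;          /* drop the lock for IO */
	io_that_frees_item(lip);   /* lip must not be touched after this */
	ailp->locked = 1;          /* retake via the saved pointer */
}
```

The relock goes through the local copy, so the freed item is never
dereferenced.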
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
After xfsaild_push_item() calls iop_push(), the log item may have been
freed if the AIL lock was dropped during the push. Background inode
reclaim or the dquot shrinker can free the log item while the AIL lock
is not held, and the tracepoints in the switch statement dereference
the log item after iop_push() returns.
Fix this by capturing the log item type, flags, and LSN before calling
xfsaild_push_item(), and introducing a new xfs_ail_push_class trace
event class that takes these pre-captured values and the ailp pointer
instead of the log item pointer.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The unmount sequence in xfs_unmount_flush_inodes() pushes the AIL while
background reclaim and inodegc are still running. This is broken
independently of any use-after-free issues - background reclaim and
inodegc should not be running while the AIL is being pushed during
unmount, as inodegc can dirty and insert inodes into the AIL during the
flush, and background reclaim can race to abort and free dirty inodes.
Reorder xfs_unmount_flush_inodes() to stop inodegc and cancel background
reclaim before pushing the AIL. Stop inodegc before cancelling
m_reclaim_work because the inodegc worker can re-queue m_reclaim_work
via xfs_inodegc_set_reclaimable.
Reported-by: syzbot+652af2b3c5569c4ab63c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=652af2b3c5569c4ab63c
Fixes: 90c60e1640 ("xfs: xfs_iflush() is no longer necessary")
Cc: stable@vger.kernel.org # v5.9
Signed-off-by: Yuto Ohnuki <ytohnuki@amazon.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
ieee80211_tx_prepare_skb() has three error paths, but only two of them
free the skb. The first error path (ieee80211_tx_prepare() returning
TX_DROP) does not free it, while invoke_tx_handlers() failure and the
fragmentation check both do.
Add kfree_skb() to the first error path so all three are consistent,
and remove the now-redundant frees in callers (ath9k, mt76,
mac80211_hwsim) to avoid double-free.
Document the skb ownership guarantee in the function's kdoc.
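
The ownership rule can be sketched in plain C. The packet type, the
fail_stage parameter, and the freed counter are hypothetical and exist
only to make the three error paths observable; this is not the mac80211
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct pkt_sketch { int len; };

/*
 * Returns true on success. On any failure the packet has already been
 * freed, so the caller must not free it again.
 */
bool prepare_pkt(struct pkt_sketch *p, int fail_stage, int *freed)
{
	assert(p != NULL && freed != NULL);

	if (fail_stage == 1)   /* prepare step: the path that used to leak */
		goto drop;
	if (fail_stage == 2)   /* handler invocation failure */
		goto drop;
	if (fail_stage == 3)   /* fragmentation check */
		goto drop;
	return true;

drop:
	free(p);
	(*freed)++;
	return false;
}
```

Routing every failure through one label makes it hard for a new error
path to forget the free.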
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20260314065455.2462900-1-nbd@nbd.name
Fixes: 06be6b149f ("mac80211: add ieee80211_tx_prepare_skb() helper function")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Since upstream commit e75665dd09 ("wifi: wlcore: ensure skb headroom
before skb_push"), wl1271_tx_allocate() and, with it,
wl1271_prepare_tx_frame() return -EAGAIN if pskb_expand_head() fails.
However, in wlcore_tx_work_locked(), a return value of -EAGAIN from
wl1271_prepare_tx_frame() is interpreted as the aggregation buffer being
full. This causes the code to flush the buffer, put the skb back at the
head of the queue, and immediately retry the same skb in a tight while
loop.
Because wlcore_tx_work_locked() holds wl->mutex, and the retry happens
immediately with GFP_ATOMIC, this will result in an infinite loop and a
CPU soft lockup. Return -ENOMEM instead so the packet is dropped and
the loop terminates.
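
To see why the error code matters, consider this reduced model of the
work loop. The function names and the max_iter cap are assumptions added
so the sketch terminates; the real loop has no such cap.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: prepare fails when headroom expansion fails. */
int prepare_frame(int alloc_ok, int err_on_alloc_fail)
{
	return alloc_ok ? 0 : err_on_alloc_fail;
}

/* Returns the number of loop iterations, capped at max_iter. */
int tx_work(int npkts, int alloc_ok, int err_on_alloc_fail, int max_iter)
{
	int iter = 0;

	assert(max_iter > 0);
	while (npkts > 0 && iter < max_iter) {
		int ret = prepare_frame(alloc_ok, err_on_alloc_fail);

		iter++;
		if (ret == -EAGAIN)
			continue;   /* "buffer full": flush, requeue, retry same packet */
		npkts--;        /* success or any other error: packet leaves the queue */
	}
	return iter;
}
```

With -ENOMEM three packets take three iterations; with -EAGAIN and a
failing allocation the loop only stops at the artificial cap.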
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Assisted-by: Gemini:gemini-3.1-pro
Fixes: e75665dd09 ("wifi: wlcore: ensure skb headroom before skb_push")
Cc: Peter Astrand <astrand@lysator.liu.se>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Link: https://patch.msgid.link/20260318064636.3065925-1-linux@roeck-us.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
mesh_matches_local() unconditionally dereferences ie->mesh_config to
compare mesh configuration parameters. When called from
mesh_rx_csa_frame(), the parsed action-frame elements may not contain a
Mesh Configuration IE, leaving ie->mesh_config NULL and triggering a
kernel NULL pointer dereference.
The other two callers are already safe:
- ieee80211_mesh_rx_bcn_presp() checks !elems->mesh_config before
calling mesh_matches_local()
- mesh_plink_get_event() is only reached through
mesh_process_plink_frame(), which checks !elems->mesh_config, too
mesh_rx_csa_frame() is the only caller that passes raw parsed elements
to mesh_matches_local() without guarding mesh_config. An adjacent
attacker can exploit this by sending a crafted CSA action frame that
includes a valid Mesh ID IE but omits the Mesh Configuration IE,
crashing the kernel.
The captured crash log:
Oops: general protection fault, probably for non-canonical address ...
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
Workqueue: events_unbound cfg80211_wiphy_work
[...]
Call Trace:
<TASK>
? __pfx_mesh_matches_local (net/mac80211/mesh.c:65)
ieee80211_mesh_rx_queued_mgmt (net/mac80211/mesh.c:1686)
[...]
ieee80211_iface_work (net/mac80211/iface.c:1754 net/mac80211/iface.c:1802)
[...]
cfg80211_wiphy_work (net/wireless/core.c:426)
process_one_work (net/kernel/workqueue.c:3280)
? assign_work (net/kernel/workqueue.c:1219)
worker_thread (net/kernel/workqueue.c:3352)
? __pfx_worker_thread (net/kernel/workqueue.c:3385)
kthread (net/kernel/kthread.c:436)
[...]
ret_from_fork_asm (net/arch/x86/entry/entry_64.S:255)
</TASK>
This patch adds a NULL check for ie->mesh_config at the top of
mesh_matches_local() to return false early when the Mesh Configuration
IE is absent.
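
The shape of the guard, reduced to a userspace sketch with hypothetical
types and a single compared field (the real function compares several
mesh parameters):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mesh_config_sketch { unsigned char path_sel_proto; };
struct elems_sketch { const struct mesh_config_sketch *mesh_config; };

bool matches_local(const struct elems_sketch *ie, unsigned char local_proto)
{
	assert(ie != NULL);

	/* A frame without the Mesh Configuration IE can never match. */
	if (!ie->mesh_config)
		return false;

	return ie->mesh_config->path_sel_proto == local_proto;
}
```

Placing the check inside the callee protects every caller, including any
added later.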
Fixes: 2e3c873682 ("mac80211: support functions for mesh")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260318034244.2595020-1-xmei5@asu.edu
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The Focusrite Scarlett 2i2 1st Gen (1235:8006) produces
distorted/silent audio when QUIRK_FLAG_SKIP_IFACE_SETUP is active, as
that flag causes the feedback format to be detected as 17.15 instead
of 16.16.
Add a DEVICE_FLG entry for this device before the Focusrite VENDOR_FLG
entry so that it gets no quirk flags, overriding the vendor-wide
SKIP_IFACE_SETUP. This device doesn't have the internal mixer, Air, or
Safe modes that the quirk was designed to protect.
Fixes: 38c322068a ("ALSA: usb-audio: Add QUIRK_FLAG_SKIP_IFACE_SETUP")
Reported-by: pairomaniac [https://github.com/geoffreybennett/linux-fcp/issues/54]
Tested-by: pairomaniac [https://github.com/geoffreybennett/linux-fcp/issues/54]
Signed-off-by: Geoffrey D. Bennett <g@b4.vu>
Link: https://patch.msgid.link/abmsTjKmQMKbhYtK@m.b4.vu
Signed-off-by: Takashi Iwai <tiwai@suse.de>
smb2_get_ksmbd_tcon() reuses work->tcon in compound requests without
validating tcon->t_state. ksmbd_tree_conn_lookup() checks t_state ==
TREE_CONNECTED on the initial lookup path, but the compound reuse path
bypasses this check entirely.
If a prior command in the compound (SMB2_TREE_DISCONNECT) sets t_state
to TREE_DISCONNECTED and frees share_conf via ksmbd_share_config_put(),
subsequent commands dereference the freed share_conf through
work->tcon->share_conf.
KASAN report:
[ 4.144653] ==================================================================
[ 4.145059] BUG: KASAN: slab-use-after-free in smb2_write+0xc74/0xe70
[ 4.145415] Read of size 4 at addr ffff88810430c194 by task kworker/1:1/44
[ 4.145772]
[ 4.145867] CPU: 1 UID: 0 PID: 44 Comm: kworker/1:1 Not tainted 7.0.0-rc3+ #60 PREEMPTLAZY
[ 4.145871] Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 4.145875] Workqueue: ksmbd-io handle_ksmbd_work
[ 4.145888] Call Trace:
[ 4.145892] <TASK>
[ 4.145894] dump_stack_lvl+0x64/0x80
[ 4.145910] print_report+0xce/0x660
[ 4.145919] ? __pfx__raw_spin_lock_irqsave+0x10/0x10
[ 4.145928] ? smb2_write+0xc74/0xe70
[ 4.145931] kasan_report+0xce/0x100
[ 4.145934] ? smb2_write+0xc74/0xe70
[ 4.145937] smb2_write+0xc74/0xe70
[ 4.145939] ? __pfx_smb2_write+0x10/0x10
[ 4.145942] ? _raw_spin_unlock+0xe/0x30
[ 4.145945] ? ksmbd_smb2_check_message+0xeb2/0x24c0
[ 4.145948] ? smb2_tree_disconnect+0x31c/0x480
[ 4.145951] handle_ksmbd_work+0x40f/0x1080
[ 4.145953] process_one_work+0x5fa/0xef0
[ 4.145962] ? assign_work+0x122/0x3e0
[ 4.145964] worker_thread+0x54b/0xf70
[ 4.145967] ? __pfx_worker_thread+0x10/0x10
[ 4.145970] kthread+0x346/0x470
[ 4.145976] ? recalc_sigpending+0x19b/0x230
[ 4.145980] ? __pfx_kthread+0x10/0x10
[ 4.145984] ret_from_fork+0x4fb/0x6c0
[ 4.145992] ? __pfx_ret_from_fork+0x10/0x10
[ 4.145995] ? __switch_to+0x36c/0xbe0
[ 4.145999] ? __pfx_kthread+0x10/0x10
[ 4.146003] ret_from_fork_asm+0x1a/0x30
[ 4.146013] </TASK>
[ 4.146014]
[ 4.149858] Allocated by task 44:
[ 4.149953] kasan_save_stack+0x33/0x60
[ 4.150061] kasan_save_track+0x14/0x30
[ 4.150169] __kasan_kmalloc+0x8f/0xa0
[ 4.150274] ksmbd_share_config_get+0x1dd/0xdd0
[ 4.150401] ksmbd_tree_conn_connect+0x7e/0x600
[ 4.150529] smb2_tree_connect+0x2e6/0x1000
[ 4.150645] handle_ksmbd_work+0x40f/0x1080
[ 4.150761] process_one_work+0x5fa/0xef0
[ 4.150873] worker_thread+0x54b/0xf70
[ 4.150978] kthread+0x346/0x470
[ 4.151071] ret_from_fork+0x4fb/0x6c0
[ 4.151176] ret_from_fork_asm+0x1a/0x30
[ 4.151286]
[ 4.151332] Freed by task 44:
[ 4.151418] kasan_save_stack+0x33/0x60
[ 4.151526] kasan_save_track+0x14/0x30
[ 4.151634] kasan_save_free_info+0x3b/0x60
[ 4.151751] __kasan_slab_free+0x43/0x70
[ 4.151861] kfree+0x1ca/0x430
[ 4.151952] __ksmbd_tree_conn_disconnect+0xc8/0x190
[ 4.152088] smb2_tree_disconnect+0x1cd/0x480
[ 4.152211] handle_ksmbd_work+0x40f/0x1080
[ 4.152326] process_one_work+0x5fa/0xef0
[ 4.152438] worker_thread+0x54b/0xf70
[ 4.152545] kthread+0x346/0x470
[ 4.152638] ret_from_fork+0x4fb/0x6c0
[ 4.152743] ret_from_fork_asm+0x1a/0x30
[ 4.152853]
[ 4.152900] The buggy address belongs to the object at ffff88810430c180
[ 4.152900] which belongs to the cache kmalloc-96 of size 96
[ 4.153226] The buggy address is located 20 bytes inside of
[ 4.153226] freed 96-byte region [ffff88810430c180, ffff88810430c1e0)
[ 4.153549]
[ 4.153596] The buggy address belongs to the physical page:
[ 4.153750] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff88810430ce80 pfn:0x10430c
[ 4.154000] flags: 0x100000000000200(workingset|node=0|zone=2)
[ 4.154160] page_type: f5(slab)
[ 4.154251] raw: 0100000000000200 ffff888100041280 ffff888100040110 ffff888100040110
[ 4.154461] raw: ffff88810430ce80 0000000800200009 00000000f5000000 0000000000000000
[ 4.154668] page dumped because: kasan: bad access detected
[ 4.154820]
[ 4.154866] Memory state around the buggy address:
[ 4.155002] ffff88810430c080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155196] ffff88810430c100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155391] >ffff88810430c180: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[ 4.155587] ^
[ 4.155693] ffff88810430c200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.155891] ffff88810430c280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 4.156087] ==================================================================
Add the same t_state validation to the compound reuse path, consistent
with ksmbd_tree_conn_lookup().
Fixes: 5005bcb421 ("ksmbd: validate session id and tree id in the compound request")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use sb->s_uuid for a proper volume identifier as the primary choice.
For filesystems that do not provide a UUID, fall back to stfs.f_fsid
obtained from vfs_statfs().
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a multichannel SMB2_SESSION_SETUP request with
SMB2_SESSION_REQ_FLAG_BINDING fails, ksmbd sets conn->binding = true
but never clears it on the error path. This leaves the connection in
a binding state where all subsequent ksmbd_session_lookup_all() calls
fall back to the global sessions table. Fix this by setting
conn->binding = false in the error path.
Cc: stable@vger.kernel.org
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
__ksmbd_tree_conn_disconnect() drops the share_conf reference before
checking tree_conn->refcount. When someone uses SMB3 multichannel and
binds two connections to one session, a SESSION_LOGOFF on connection A
calls ksmbd_conn_wait_idle(conn) which only drains connection A's
request counter, not connection B's. This means there's a race condition:
requests already dispatched on connection B hold tree_conn references via
work->tcon. The disconnect path frees share_conf while those requests
are still walking work->tcon->share_conf, causing a use-after-free.
This fix combines the share_conf put with the tree_conn free so it
only happens when the last reference is dropped.
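
A single-threaded sketch of the idea follows. The types are hypothetical,
the 'freed' flags replace real frees so the order can be observed, and a
plain int stands in for the atomic refcount the kernel would use.

```c
#include <assert.h>

struct share_conf_sketch { int freed; };

struct tree_conn_sketch {
	int refcount;                           /* not atomic in this sketch */
	struct share_conf_sketch *share_conf;
	int freed;
};

/* Drop one reference; release share_conf only together with the last one. */
void tree_conn_put(struct tree_conn_sketch *tc)
{
	assert(tc->refcount > 0);

	if (--tc->refcount > 0)
		return;                 /* other users may still read share_conf */

	tc->share_conf->freed = 1;
	tc->freed = 1;
}
```

As long as an in-flight request holds a reference, share_conf stays
valid.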
Fixes: b39a1833cc ("ksmbd: fix use-after-free in ksmbd_tree_connect_put under concurrency")
Signed-off-by: Nicholas Carlini <nicholas@carlini.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The ASYNC_EVENT_CMPL_EVENT_ID_DBG_BUF_PRODUCER handler in
bnxt_async_event_process() uses a firmware-supplied 'type' field
directly as an index into bp->bs_trace[] without bounds validation.
The 'type' field is a 16-bit value extracted from DMA-mapped completion
ring memory that the NIC writes directly to host RAM. A malicious or
compromised NIC can supply any value from 0 to 65535, causing an
out-of-bounds access into kernel heap memory.
The bnxt_bs_trace_check_wrap() call then dereferences bs_trace->magic_byte
and writes to bs_trace->last_offset and bs_trace->wrapped, leading to
kernel memory corruption or a crash.
Fix by adding a bounds check and defining BNXT_TRACE_MAX as
DBG_LOG_BUFFER_FLUSH_REQ_TYPE_ERR_QPC_TRACE + 1 to cover all currently
defined firmware trace types (0x0 through 0xc).
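
A minimal sketch of the bounds check, assuming hypothetical names
(TRACE_TYPE_LAST, TRACE_MAX, trace_lookup) and using the 0x0 through 0xc
range stated above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_TYPE_LAST 0xc                    /* highest defined trace type */
#define TRACE_MAX       (TRACE_TYPE_LAST + 1)  /* number of table slots */

struct bs_trace_sketch {
	uint32_t last_offset;
	int wrapped;
};

/* Returns the slot for a device-supplied type, or NULL if out of range. */
struct bs_trace_sketch *trace_lookup(struct bs_trace_sketch *tbl, uint16_t type)
{
	assert(tbl != NULL);

	if (type >= TRACE_MAX)
		return NULL;

	return &tbl[type];
}
```

Any value the device reports beyond the table size is rejected before
it is used as an index.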
Fixes: 84fcd9449f ("bnxt_en: Manage the FW trace context memory")
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Junrui Luo <moonafterrain@outlook.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/SYBPR01MB7881A253A1C9775D277F30E9AF42A@SYBPR01MB7881.ausprd01.prod.outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ina233_read_word_data() uses the return value of pmbus_read_word_data()
directly in a DIV_ROUND_CLOSEST() computation without first checking for
errors. If the underlying I2C transaction fails, a negative error code is
used in the arithmetic, producing a garbage sensor value instead of
propagating the error.
Add the missing error check before using the return value.
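
The pattern, sketched outside the driver: the helper name and the
simplified rounding macro are assumptions (the macro below handles only
non-negative operands), and the scaling factor is arbitrary.

```c
#include <assert.h>
#include <errno.h>

/* Simplified rounding division for non-negative operands only. */
#define DIV_ROUND_CLOSEST_POS(x, d) (((x) + (d) / 2) / (d))

/* raw_or_err models a bus read: a word value or a negative errno. */
int read_scaled(int raw_or_err, int divisor)
{
	assert(divisor > 0);

	if (raw_or_err < 0)
		return raw_or_err;   /* propagate the error untouched */

	return DIV_ROUND_CLOSEST_POS(raw_or_err, divisor);
}
```

Without the early return, the negative errno would flow into the
division and come back as a plausible-looking reading.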
Fixes: b64b6cb163 ("hwmon: Add driver for TI INA233 Current and Power Monitor")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317174553.385567-1-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
In mp2869_read_byte_data() and mp2869_read_word_data(), the return value
of pmbus_read_byte_data() for PMBUS_STATUS_MFR_SPECIFIC is used directly
inside FIELD_GET() macro arguments without error checking. If the I2C
transaction fails, a negative error code is passed to FIELD_GET() and
FIELD_PREP(), silently corrupting the status register bits being
constructed.
Extract the nested pmbus_read_byte_data() calls into a separate variable
and check for errors before use. This also eliminates a redundant duplicate
read of the same register in the PMBUS_STATUS_TEMPERATURE case.
Fixes: a3a2923aaf ("hwmon: add MP2869,MP29608,MP29612 and MP29816 series driver")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-4-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
mp2973_read_word_data() XORs the return value of pmbus_read_word_data()
with PB_STATUS_POWER_GOOD_N without first checking for errors. If the I2C
transaction fails, a negative error code is XORed with the constant,
producing a corrupted value that is returned as valid status data instead
of propagating the error.
Add the missing error check before modifying the return value.
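
A small sketch of the check-before-modify pattern. The bit position of the
constant is assumed here for illustration, and the function name is
hypothetical.

```c
#include <assert.h>

#define STATUS_PGOOD_N (1 << 11)   /* assumed bit position, for illustration */

/* raw_or_err models a status word read: 16-bit value or negative errno. */
int read_status_word(int raw_or_err)
{
	assert(raw_or_err <= 0xffff);

	if (raw_or_err < 0)
		return raw_or_err;           /* do not XOR an error code */

	return raw_or_err ^ STATUS_PGOOD_N;  /* invert the power-good bit */
}
```

The error code is returned unchanged, so callers can still tell which
failure occurred.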
Fixes: acda945afb ("hwmon: (pmbus/mp2975) Fix PGOOD in READ_STATUS_WORD")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-3-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Pull libnvdimm fix from Ira Weiny:
- Fix old potential use after free bug
* tag 'libnvdimm-fixes-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
nvdimm/bus: Fix potential use after free in asynchronous initialization
hac300s_read_word_data() passes the return value of pmbus_read_word_data()
directly to FIELD_GET() without checking for errors. If the I2C transaction
fails, a negative error code is sign-extended and passed to FIELD_GET(),
which silently produces garbage data instead of propagating the error.
Add the missing error check before using the return value in
the FIELD_GET() macro.
Fixes: 669cf162f7 ("hwmon: Add support for HiTRON HAC300S PSU")
Cc: stable@vger.kernel.org
Signed-off-by: Sanman Pradhan <psanman@juniper.net>
Link: https://lore.kernel.org/r/20260317173308.382545-2-sanman.pradhan@hpe.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Pull kunit fix from Shuah Khan:
- Add documentation for --list_suites feature
* tag 'linux_kselftest-kunit-fixes-7.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Add documentation of --list_suites
All cmd_buf buffers are allocated and need to be freed after usage.
Add an error unwinding path that properly frees these buffers.
The memory leak happens whenever fwlog configuration is changed. For
example:
$ echo 256K > /sys/kernel/debug/ixgbe/0000\:32\:00.0/fwlog/log_size
Fixes: 96a9a9341c ("ice: configure FW logging")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Pull HID fixes from Jiri Kosina:
- various fixes dealing with (intentionally) broken devices in HID
core, logitech-hidpp and multitouch drivers (Lee Jones)
- fix for OOB in wacom driver (Benoît Sevens)
- fix for potentially HID-bpf-induced buffer overflow in () (Benjamin
Tissoires)
- various other small fixes and device ID / quirk additions
* tag 'hid-for-linus-2026031701' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: multitouch: Check to ensure report responses match the request
HID: logitech-hidpp: Prevent use-after-free on force feedback initialisation failure
HID: bpf: prevent buffer overflow in hid_hw_request
selftests/hid: fix compilation when bpf_wq and hid_device are not exported
HID: core: Mitigate potential OOB by removing bogus memset()
HID: intel-thc-hid: Set HID_PHYS with PCI BDF
HID: appletb-kbd: add .resume method in PM
HID: logitech-hidpp: Enable MX Master 4 over bluetooth
HID: input: Add HID_BATTERY_QUIRK_DYNAMIC for Elan touchscreens
HID: input: Drop Asus UX550* touchscreen ignore battery quirks
HID: asus: add xg mobile 2022 external hardware support
HID: wacom: fix out-of-bounds read in wacom_intuos_bt_irq
When iavf_add_vlan() finds an existing filter in IAVF_VLAN_REMOVE
state, it transitions the filter to IAVF_VLAN_ACTIVE assuming the
pending delete can simply be cancelled. However, there is no guarantee
that iavf_del_vlans() has not already processed the delete AQ request
and removed the filter from the PF. In that case the filter remains in
the driver's list as IAVF_VLAN_ACTIVE but is no longer programmed on
the NIC. Since iavf_add_vlans() only picks up filters in
IAVF_VLAN_ADD state, the filter is never re-added, and spoof checking
drops all traffic for that VLAN.
  CPU0                       CPU1                       Workqueue
  ----                       ----                       ---------
  iavf_del_vlan(vlan 100)
    f->state = REMOVE
    schedule AQ_DEL_VLAN
                             iavf_add_vlan(vlan 100)
                               f->state = ACTIVE
                                                        iavf_del_vlans()
                                                          f is ACTIVE, skip
                                                        iavf_add_vlans()
                                                          f is ACTIVE, skip
Filter is ACTIVE in driver but absent from NIC.
Transition to IAVF_VLAN_ADD instead and schedule
IAVF_FLAG_AQ_ADD_VLAN_FILTER so iavf_add_vlans() re-programs the
filter. A duplicate add is idempotent on the PF.
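
The state transition can be modelled with a much-simplified sketch. The
enum, struct, and the requeue_add switch are hypothetical; on_nic stands
for whether the PF currently has the filter programmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum vlan_state_sketch { VLAN_ADD, VLAN_ACTIVE, VLAN_REMOVE };

struct vlan_filter_sketch {
	enum vlan_state_sketch state;
	bool on_nic;
};

/* Re-add of a filter pending removal: old (false) vs. new (true) behaviour. */
void readd_vlan(struct vlan_filter_sketch *f, bool requeue_add)
{
	assert(f != NULL);
	if (f->state == VLAN_REMOVE)
		f->state = requeue_add ? VLAN_ADD : VLAN_ACTIVE;
}

/* Workqueue pass: only filters in ADD state are sent to the PF. */
void add_vlans(struct vlan_filter_sketch *f)
{
	assert(f != NULL);
	if (f->state == VLAN_ADD) {
		f->on_nic = true;   /* adding an already-present filter changes nothing */
		f->state = VLAN_ACTIVE;
	}
}
```

Starting from a filter whose delete has already reached the NIC, only
the requeue variant ends with the filter programmed again.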
Fixes: 0c0da0e951 ("iavf: refactor VLAN filter states")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
If an XDP application that requested TX timestamping is shutting down
while the link of the interface in use is still up the following kernel
splat is reported:
[ 883.803618] [ T1554] BUG: unable to handle page fault for address: ffffcfb6200fd008
...
[ 883.803650] [ T1554] Call Trace:
[ 883.803652] [ T1554] <TASK>
[ 883.803654] [ T1554] igc_ptp_tx_tstamp_event+0xdf/0x160 [igc]
[ 883.803660] [ T1554] igc_tsync_interrupt+0x2d5/0x300 [igc]
...
During shutdown of the TX ring the xsk_meta pointers are left behind, so
the IRQ handler still tries to access them.
Fix this by cleaning up the stale xsk metadata on TX shutdown. TX
timestamps on other queues remain unaffected.
Fixes: 15fd021bc4 ("igc: Add Tx hardware timestamp request for AF_XDP zero-copy packet")
Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
igc_xmit_frame() misses updating skb->tail when the packet size is
shorter than the minimum one.
Use skb_put_padto() in alignment with other Intel Ethernet drivers.
Fixes: 0507ef8a03 ("igc: Add transmit and receive fastpath and interrupt handlers")
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The "Read backward ring buffer" test crashes on big-endian (e.g. s390x)
due to a NULL dereference when the backward mmap path isn't enabled.
Reproducer:
# ./perf test -F 'Read backward ring buffer'
Segmentation fault (core dumped)
# uname -m
s390x
#
Root cause:
get_config_terms() stores into evsel_config_term::val.val (u64) while later
code reads boolean fields such as evsel_config_term::val.overwrite.
On big-endian the 1-byte boolean is left-aligned, so writing
evsel_config_term::val.val = 1 is read back as
evsel_config_term::val.overwrite = 0,
leaving backward mmap disabled and a NULL map being used.
Store values in the union member that matches the term type, e.g.:
/* for OVERWRITE */
new_term->val.overwrite = 1; /* not new_term->val.val = 1 */
to fix this. Improve add_config_term() and add two more parameters for
string and value. Function add_config_term() now creates a complete node
element of type evsel_config_term and handles all evsel_config_term::val
union members.
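
The byte-order effect can be reproduced with a reduced union. The type and
function names are hypothetical; only the val/overwrite pairing mirrors
the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reduction of the evsel_config_term value union. */
union term_val_sketch {
	uint64_t val;
	bool overwrite;
};

/* Store through the generic 64-bit member, then read the bool. */
bool overwrite_via_val(void)
{
	union term_val_sketch v;

	memset(&v, 0, sizeof(v));
	v.val = 1;
	return v.overwrite;   /* first byte: 1 on little-endian, 0 on big-endian */
}

/* Store through the member that matches the term type. */
bool overwrite_via_member(void)
{
	union term_val_sketch v;

	memset(&v, 0, sizeof(v));
	v.overwrite = true;
	return v.overwrite;
}

bool host_is_little_endian(void)
{
	uint16_t probe = 1;
	unsigned char first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Self-check that holds on either byte order. */
void check_overwrite_store(void)
{
	assert(overwrite_via_member());
	assert(overwrite_via_val() == host_is_little_endian());
}
```

Writing the matching member gives the same result on both byte orders,
which is why the fix is a no-op on little-endian.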
Impact:
Enables backward mmap on big-endian and prevents the crash.
No change on little-endian.
Output after:
# ./perf test -Fv 44
--- start ---
Using CPUID IBM,9175,705,ME1,3.8,002f
mmap size 1052672B
mmap size 8192B
---- end ----
44: Read backward ring buffer : Ok
#
Fixes: 159ca97cd9 ("perf parse-events: Refactor get_config_terms() to remove macros")
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Jan Polensky <japo@linux.ibm.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Use metricgroup__for_each_metric(), which combines the default metric
table with a potentially empty CPUID table, rather than
pmu_metrics_table__for_each_metric().
Fixes: cee275edcd ("perf metricgroup: Don't early exit if no CPUID table exists")
Reviewed-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When a socket send and shutdown() happen back-to-back, both fire
wake-ups before the receiver's task_work has a chance to run. The first
wake gets poll ownership (poll_refs=1), and the second bumps it to 2.
When io_poll_check_events() runs, it calls io_poll_issue() which does a
recv that reads the data and returns IOU_RETRY. The loop then drains all
accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only
the first event was consumed. Since the shutdown is a persistent state
change, no further wakeups will happen, and the multishot recv can hang
forever.
Check specifically for HUP in the poll loop, and ensure that another
loop is done to check for status if more than a single poll activation
is pending. This ensures we don't lose the shutdown event.
Cc: stable@vger.kernel.org
Fixes: dbc2564cfe ("io_uring: let fast poll support multishot")
Reported-by: Francis Brosseau <francis@malagauche.com>
Link: https://github.com/axboe/liburing/issues/1549
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In commit 507fd01d53 ("drivers: move the early platform device support to
arch/sh") platform_match() was copied over to the sh platform_early
code, accidentally including the driver_override check.
This check does not make sense for platform_early, as sysfs is not even
available in the first place at this point in the boot process, hence
remove the check.
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 507fd01d53 ("drivers: move the early platform device support to arch/sh")
Link: https://lore.kernel.org/all/DH4M3DJ4P58T.1BGVAVXN71Z09@kernel.org/
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Currently, there are 12 busses (including platform and PCI) that
duplicate the driver_override logic for their individual devices.
All of them seem to be prone to the bug described in [1].
While this could be solved for every bus individually using a separate
lock, solving this in the driver-core generically results in fewer (and
cleaner) changes overall.
Thus, move driver_override to struct device, provide corresponding
accessors for busses and handle locking with a separate lock internally.
In particular, add device_set_driver_override(),
device_has_driver_override(), device_match_driver_override() and
generalize the sysfs store() and show() callbacks via a driver_override
feature flag in struct bus_type.
Until all busses have migrated, keep driver_set_override() in place.
Note that we can't use the device lock for the reasons described in [2].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220789 [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [2]
Tested-by: Gui-Dong Han <hanguidong02@gmail.com>
Co-developed-by: Gui-Dong Han <hanguidong02@gmail.com>
Signed-off-by: Gui-Dong Han <hanguidong02@gmail.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260303115720.48783-2-dakr@kernel.org
[ Use dev->bus instead of sp->bus for consistency; fix commit message to
refer to the struct bus_type's driver_override feature flag. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
In the error path for efa_com_alloc_comp_ctx() the semaphore assigned to
&aq->avail_cmds is not released.
Detected by Smatch:
drivers/infiniband/hw/efa/efa_com.c:662 efa_com_cmd_exec() warn:
inconsistent returns '&aq->avail_cmds'
Add release for &aq->avail_cmds in efa_com_alloc_comp_ctx() error path.
Fixes: ef3b06742c ("RDMA/efa: Fix use of completion ctx after free")
Signed-off-by: Ethan Tidmore <ethantidmore06@gmail.com>
Link: https://patch.msgid.link/20260314045730.1143862-1-ethantidmore06@gmail.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When IOVA-based DMA mapping is unavailable (e.g., IOMMU
passthrough mode), rdma_rw_ctx_init_bvec() falls back to
checking rdma_rw_io_needs_mr() with the raw bvec count.
Unlike the scatterlist path in rdma_rw_ctx_init(), which
passes a post-DMA-mapping entry count that reflects
coalescing of physically contiguous pages, the bvec path
passes the pre-mapping page count. This overstates the
number of DMA entries, causing every multi-bvec RDMA READ
to consume an MR from the QP's pool.
Under NFS WRITE workloads the server performs RDMA READs
to pull data from the client. With the inflated MR demand,
the pool is rapidly exhausted, ib_mr_pool_get() returns
NULL, and rdma_rw_init_one_mr() returns -EAGAIN. svcrdma
treats this as a DMA mapping failure, closes the connection,
and the client reconnects -- producing a cycle of 71% RPC
retransmissions and ~100 reconnections per test run. RDMA
WRITEs (NFS READ direction) are unaffected because
DMA_TO_DEVICE never triggers the max_sgl_rd check.
Remove the rdma_rw_io_needs_mr() gate from the bvec path
entirely, so that bvec RDMA operations always use the
map_wrs path (direct WR posting without MR allocation).
The bvec caller has no post-DMA-coalescing segment count
available -- xdr_buf and svc_rqst hold pages as individual
pointers, and physical contiguity is discovered only during
DMA mapping -- so the raw page count cannot serve as a
reliable input to rdma_rw_io_needs_mr(). iWARP devices,
which require MRs unconditionally, are handled by an
earlier check in rdma_rw_ctx_init_bvec() and are unaffected.
Fixes: bea28ac14c ("RDMA/core: add MR support for bvec-based RDMA operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-3-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When IOMMU passthrough mode is active, ib_dma_map_sgtable_attrs()
produces no coalescing: each scatterlist page maps 1:1 to a DMA
entry, so sgt.nents equals the raw page count. A 1 MB transfer
yields 256 DMA entries. If that count exceeds the device's
max_sgl_rd threshold (an optimization hint from mlx5 firmware),
rdma_rw_io_needs_mr() steers the operation into the MR
registration path. Each such operation consumes one or more MRs
from a pool sized at max_rdma_ctxs -- roughly one MR per
concurrent context. Under write-intensive workloads that issue
many concurrent RDMA READs, the pool is rapidly exhausted,
ib_mr_pool_get() returns NULL, and rdma_rw_init_one_mr() returns
-EAGAIN. Upper layer protocols treat this as a fatal DMA mapping
failure and tear down the connection.
The max_sgl_rd check is a performance optimization, not a
correctness requirement: the device can handle large SGE counts
via direct posting, just less efficiently than with MR
registration. When the MR pool cannot satisfy a request, falling
back to the direct SGE (map_wrs) path avoids the connection
reset while preserving the MR optimization for the common case
where pool resources are available.
Add a fallback in rdma_rw_ctx_init() so that -EAGAIN from
rdma_rw_init_mr_wrs() triggers direct SGE posting instead of
propagating the error. iWARP devices, which mandate MR
registration for RDMA READs, and force_mr debug mode continue
to treat -EAGAIN as terminal.
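The fallback policy can be sketched in user-space C as follows. This is a
simplified model under stated assumptions, not the rdma_rw code: struct
io_env, init_mr_path() and ctx_init() are hypothetical, and mr_required
stands for both the iWARP and force_mr cases:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum io_path { PATH_MR, PATH_DIRECT_SGE };

struct io_env {
	bool mr_required;	/* MR path is mandatory (iWARP READ, force_mr) */
	int mr_pool_free;	/* MRs left in the hypothetical pool */
};

/* Hypothetical MR setup: consumes one MR or reports pool exhaustion. */
static int init_mr_path(struct io_env *env)
{
	if (env->mr_pool_free == 0)
		return -EAGAIN;
	env->mr_pool_free--;
	return 0;
}

static int ctx_init(struct io_env *env, bool wants_mr, enum io_path *path)
{
	assert(env != NULL && path != NULL);

	if (wants_mr || env->mr_required) {
		int ret = init_mr_path(env);

		if (ret == 0) {
			*path = PATH_MR;
			return 0;
		}
		/* Only pool exhaustion on an optional MR may fall back. */
		if (ret != -EAGAIN || env->mr_required)
			return ret;
	}

	*path = PATH_DIRECT_SGE;
	return 0;
}
```

The MR path stays preferred whenever the pool can serve it; only an
exhausted pool on a device where MRs are an optimization degrades to
direct posting.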
Fixes: 00bd1439f4 ("RDMA/rw: Support threshold for registration vs scattering to local pages")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260313194201.5818-2-cel@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add NULL pointer checks for dev->type before accessing
dev->type->name in ISP genpd add/remove functions to
prevent kernel crashes.
This regression was introduced in v7.0, when wakeup sources started
being registered under the physical device instead of the ACPI device.
This led to the wakeup source device being added as the first child of
the AMDGPU device without its dev->type being initialized, and
resulted in a crash when the amdgpu isp driver dereferenced it.
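A minimal sketch of the guard, assuming simplified stand-ins for the kernel
structures. struct dev_type, struct device as defined here, dev_is_type()
and the type name used in testing are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins; the real kernel structures have many more fields. */
struct dev_type {
	const char *name;
};

struct device {
	const struct dev_type *type;	/* may legitimately be NULL */
};

/* Compare the type name only after every pointer on the way is checked. */
static bool dev_is_type(const struct device *dev, const char *name)
{
	assert(name != NULL);

	return dev && dev->type && dev->type->name &&
	       strcmp(dev->type->name, name) == 0;
}
```

A child device with no type set is simply treated as "not a match" rather
than being dereferenced.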
Fixes: 057edc58aa ("ACPI: PM: Register wakeup sources under physical devices")
Suggested-by: Bin Du <Bin.Du@amd.com>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c51632d1ed7ac5aed2d40dbc0718d75342c12c6a)
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
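As an illustration of this kind of defensive check (the table and names
below are hypothetical, not the ones touched by this patch):

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table indexed by a value read back from hardware. */
static const char *const level_names[] = { "low", "mid", "high" };

/* Returns NULL instead of reading past the table on an unexpected value. */
static const char *level_name(unsigned int hw_value)
{
	if (hw_value >= ARRAY_LEN(level_names))
		return NULL;
	return level_names[hw_value];
}
```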
Cc: Benjamin Cheng <benjamin.cheng@amd.com>
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e14d468304832bcc4a082d95849bc0a41b18ddea)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit dea5f235baf3786bfd4fd920b03c19285fdc3d9f)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 04f063d85090f5dd0c671010ce88ee49d9dcc8ed)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f14f27bbe2a3ed7af32d5f6eaf3f417139f45253)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 1441f52c7f6ae6553664aa9e3e4562f6fc2fe8ea)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 5f76083183363c4528a4aaa593f5d38c28fe7d7b)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 89cd90375c19fb45138990b70e9f4ba4806f05c4)
Cc: stable@vger.kernel.org
The value should never exceed the array size as those
are the only values the hardware is expected to return,
but add checks anyway.
Reviewed-by: Benjamin Cheng <benjamin.cheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e064cef4b53552602bb6ac90399c18f662f3cacd)
Cc: stable@vger.kernel.org
The ASICREV_IS_BEIGE_GOBY_P check always took precedence, because it includes all chip revisions up to NV_UNKNOWN.
Fixes: 54b822b3ea ("drm/amd/display: Use dce_version instead of chip_id")
Signed-off-by: Andy Nguyen <theofficialflow1996@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 9c7be0efa6f0daa949a5f3e3fdf9ea090b0713cb)
parse_edid_displayid_vrr() searches the EDID extension blocks for a
DisplayID extension before parsing the dynamic video timing range.
The code previously checked whether edid_ext was NULL after the search
loop. However, edid_ext is assigned during each iteration of the loop,
so it will never be NULL once the loop has executed. If no DisplayID
extension is found, edid_ext ends up pointing to the last extension
block, and the NULL check does not correctly detect the failure case.
Instead, check whether the loop completed without finding a matching
DisplayID block by testing "i == edid->extensions". This ensures the
function exits early when no DisplayID extension is present and avoids
parsing an unrelated EDID extension block.
Also simplify the EDID validation check using "!edid ||
!edid->extensions".
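The search pattern can be sketched in plain C as below. This is an
illustration rather than the amdgpu_dm code: find_ext() is a hypothetical
helper, and it assumes 128-byte extension blocks whose first byte is the
extension tag (0x70 for DisplayID):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN		128
#define DISPLAYID_EXT_TAG	0x70

/* Return the first extension block carrying the given tag, or NULL. */
static const uint8_t *find_ext(const uint8_t *ext_blocks, int count,
			       uint8_t tag)
{
	const uint8_t *ext = NULL;
	int i;

	assert(ext_blocks != NULL || count == 0);

	for (i = 0; i < count; i++) {
		ext = ext_blocks + (size_t)i * BLOCK_LEN;
		if (ext[0] == tag)
			break;
	}

	/*
	 * ext is non-NULL whenever count > 0, even without a match, so
	 * the loop index is the only reliable "not found" signal.
	 */
	if (i == count)
		return NULL;

	return ext;
}
```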
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:13079 parse_edid_displayid_vrr() warn: variable dereferenced before check 'edid_ext' (see line 13075)
Fixes: a638b837d0 ("drm/amd/display: Fix refresh rate range for some panel")
Cc: Roman Li <roman.li@amd.com>
Cc: Alex Hung <alex.hung@amd.com>
Cc: Jerry Zuo <jerry.zuo@amd.com>
Cc: Sun peng Li <sunpeng.li@amd.com>
Cc: Tom Chung <chiahsuan.chung@amd.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tom Chung <chiahsuan.chung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 91c7e6342e98c846b259c57273436fdea4c043f2)
[Why]
The dcn32_override_min_req_memclk function is in dcn32_fpu.c, which is
compiled with CC_FLAGS_FPU into FP instructions. So when we call it we
must use DC_FP_{START,END} to save and restore the FP context, and
prepare the FP unit on architectures like LoongArch where the FP unit
isn't always on.
Reported-by: LiarOnce <liaronce@hotmail.com>
Fixes: ee7be8f3de ("drm/amd/display: Limit DCN32 8 channel or less parts to DPM1 for FPO")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Alex Hung <alex.hung@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 25bb1d54ba3983c064361033a8ec15474fece37e)
Cc: stable@vger.kernel.org
Commit e1b385726f ("drm/amd/display: Add additional checks for PSP
footer size") introduced a use of an uninitialized stack variable
in dm_dmub_sw_init() (region_params.bss_data_size).
Interestingly, this seems to cause no issue on normal kernels. But when
full LTO is enabled, it causes the compiler to "optimize" out huge
swaths of amdgpu initialization code, and the driver is unusable:
amdgpu 0000:03:00.0: [drm] Loading DMUB firmware via PSP: version=0x07002F00
amdgpu 0000:03:00.0: sw_init of IP block <dm> failed 5
amdgpu 0000:03:00.0: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: Fatal error during GPU init
It surprises me that neither gcc nor clang emit a warning about this: I
only found it by bisecting the LTO breakage.
Fix by using the bss_data_size field from fw_meta_info_params, as was
presumably intended.
Fixes: e1b385726f ("drm/amd/display: Add additional checks for PSP footer size")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Reviewed-by: Harry Wentland <harry.wentland@amd.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b7f1402f6ad24cc6b9a01fa09ebd1c6559d787d0)
Userspace can pass an arbitrary number of BO list entries via the
bo_number field. Although the previous multiplication overflow check
prevents out-of-bounds allocation, a large number of entries could still
cause excessive memory allocation (up to potentially gigabytes) and
unnecessarily long list processing times.
Introduce a hard limit of 128k entries per BO list, which is more than
sufficient for any realistic use case (e.g., a single list containing all
buffers in a large scene). This prevents memory exhaustion attacks and
ensures predictable performance.
Return -EINVAL if the requested entry count exceeds the limit.
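A sketch of the two checks side by side, assuming "128k" means 128 * 1024
entries. MAX_LIST_ENTRIES and check_list_size() are hypothetical names, not
the amdgpu ones:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: "128k entries" is read as 128 * 1024. */
#define MAX_LIST_ENTRIES (128u * 1024u)

/* Validate a user-supplied entry count before sizing the allocation. */
static int check_list_size(uint32_t nr_entries, size_t entry_size,
			   size_t *bytes)
{
	assert(bytes != NULL);

	/* Hard cap: bounds both memory use and list processing time. */
	if (nr_entries > MAX_LIST_ENTRIES)
		return -EINVAL;

	/* Overflow check: still needed, but not sufficient on its own. */
	if (entry_size != 0 && nr_entries > SIZE_MAX / entry_size)
		return -EINVAL;

	*bytes = (size_t)nr_entries * entry_size;
	return 0;
}
```

The overflow check alone would still accept a count that multiplies out to
gigabytes; the cap is what rejects it.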
Reviewed-by: Christian König <christian.koenig@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 688b87d39e0aa8135105b40dc167d74b5ada5332)
Cc: stable@vger.kernel.org
The mount_setattr_idmapped fixture mounts a 2 MB tmpfs at /mnt and then
creates a 2 GB sparse ext4 image at /mnt/C/ext4.img. While ftruncate()
succeeds (sparse file), mkfs.ext4 needs to write actual metadata blocks
(inode tables, journal, bitmaps) which easily exceeds the 2 MB tmpfs
limit, causing ENOSPC and failing the fixture setup for all
mount_setattr_idmapped tests.
This was introduced by commit d37d4720c3 ("selftests/mount_settattr:
ensure that ext4 filesystem can be created") which increased the image
size from 2 MB to 2 GB but didn't adjust the tmpfs size.
Bump the tmpfs size to 256 MB which is sufficient for the ext4 metadata.
Fixes: d37d4720c3 ("selftests/mount_settattr: ensure that ext4 filesystem can be created")
Signed-off-by: Christian Brauner <brauner@kernel.org>
This is an additional safety layer to ensure no accesses to the GPU
registers can be made while it is powered off.
While we can disable IRQ generation from GPU, META firmware, MIPS
firmware and for safety events, we cannot do the same for the RISC-V
firmware.
To keep a unified approach, once the firmware has completed its power
off sequence, disable IRQs for the whole GPU at the kernel level
instead.
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
Reviewed-by: Matt Coster <matt.coster@imgtec.com>
Link: https://patch.msgid.link/20260310-drain-irqs-before-suspend-v1-2-bf4f9ed68e75@imgtec.com
Signed-off-by: Matt Coster <matt.coster@imgtec.com>
The runtime PM suspend callback doesn't know whether the IRQ handler is
in progress on a different CPU core and doesn't wait for it to finish.
Depending on timing, the IRQ handler could be running while the GPU is
suspended, leading to kernel crashes when trying to access GPU
registers. See example signature below.
In a power off sequence initiated by the runtime PM suspend callback,
wait for any IRQ handlers in progress on other CPU cores to finish, by
calling synchronize_irq().
At the same time, remove the runtime PM resume/put calls in the threaded
IRQ handler. On top of not being the right approach to begin with, and
being at the wrong place as they should have wrapped all GPU register
accesses, the driver would hit a deadlock between synchronize_irq()
being called from a runtime PM suspend callback, holding the device
power lock, and the resume callback requiring the same.
Example crash signature on a TI AM68 SK platform:
[ 337.241218] SError Interrupt on CPU0, code 0x00000000bf000000 -- SError
[ 337.241239] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G M 6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
[ 337.241246] Tainted: [M]=MACHINE_CHECK
[ 337.241249] Hardware name: Texas Instruments AM68 SK (DT)
[ 337.241252] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 337.241256] pc : pvr_riscv_irq_pending+0xc/0x24
[ 337.241277] lr : pvr_device_irq_thread_handler+0x64/0x310
[ 337.241282] sp : ffff800085b0bd30
[ 337.241284] x29: ffff800085b0bd50 x28: ffff0008070d9eab x27: ffff800083a5ce10
[ 337.241291] x26: ffff000806e48f80 x25: ffff0008070d9eac x24: 0000000000000000
[ 337.241296] x23: ffff0008068e9bf0 x22: ffff0008068e9bd0 x21: ffff800085b0bd30
[ 337.241301] x20: ffff0008070d9e00 x19: ffff0008068e9000 x18: 0000000000000001
[ 337.241305] x17: 637365645f656c70 x16: 0000000000000000 x15: ffff000b7df9ff40
[ 337.241310] x14: 0000a585fe3c0d0e x13: 000000999704f060 x12: 000000000002771a
[ 337.241314] x11: 00000000000000c0 x10: 0000000000000af0 x9 : ffff800085b0bd00
[ 337.241318] x8 : ffff0008071175d0 x7 : 000000000000b955 x6 : 0000000000000003
[ 337.241323] x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000
[ 337.241327] x2 : ffff800080e39d20 x1 : ffff800080e3fc48 x0 : 0000000000000000
[ 337.241333] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 337.241337] CPU: 0 UID: 0 PID: 112 Comm: irq/234-gpu Tainted: G M 6.17.7-B2C-00005-g9c7bbe4ea16c #2 PREEMPT
[ 337.241342] Tainted: [M]=MACHINE_CHECK
[ 337.241343] Hardware name: Texas Instruments AM68 SK (DT)
[ 337.241345] Call trace:
[ 337.241348] show_stack+0x18/0x24 (C)
[ 337.241357] dump_stack_lvl+0x60/0x80
[ 337.241364] dump_stack+0x18/0x24
[ 337.241368] vpanic+0x124/0x2ec
[ 337.241373] abort+0x0/0x4
[ 337.241377] add_taint+0x0/0xbc
[ 337.241384] arm64_serror_panic+0x70/0x80
[ 337.241389] do_serror+0x3c/0x74
[ 337.241392] el1h_64_error_handler+0x30/0x48
[ 337.241400] el1h_64_error+0x6c/0x70
[ 337.241404] pvr_riscv_irq_pending+0xc/0x24 (P)
[ 337.241410] irq_thread_fn+0x2c/0xb0
[ 337.241416] irq_thread+0x170/0x334
[ 337.241421] kthread+0x12c/0x210
[ 337.241428] ret_from_fork+0x10/0x20
[ 337.241434] SMP: stopping secondary CPUs
[ 337.241451] Kernel Offset: disabled
[ 337.241453] CPU features: 0x040000,02002800,20002001,0400421b
[ 337.241456] Memory Limit: none
[ 337.457921] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---
Fixes: cc1aeedb98 ("drm/imagination: Implement firmware infrastructure and META FW support")
Fixes: 96822d38ff ("drm/imagination: Handle Rogue safety event IRQs")
Cc: stable@vger.kernel.org # see patch description, needs adjustments for < 6.16
Signed-off-by: Alessio Belle <alessio.belle@imgtec.com>
Reviewed-by: Matt Coster <matt.coster@imgtec.com>
Link: https://patch.msgid.link/20260310-drain-irqs-before-suspend-v1-1-bf4f9ed68e75@imgtec.com
Signed-off-by: Matt Coster <matt.coster@imgtec.com>
For file systems implementing ->sync_lazytime, I_DIRTY_TIME fails to get
cleared in sync_lazytime, and might cause additional calls to
sync_lazytime during inode deactivation. Use the same pattern as in
__mark_inode_dirty to clear the flag under the inode lock.
Fixes: 5cf06ea56e ("fs: add a ->sync_lazytime method")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260317134409.1691317-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Previously, commit 8388f7df93 ("iommu/amd: Do not support
IOMMU_DOMAIN_IDENTITY after SNP is enabled") prevented users from
changing the IOMMU domain to identity if SNP was enabled.
This resulted in an error when writing to sysfs:
# echo "identity" > /sys/kernel/iommu_groups/50/type
-bash: echo: write error: Cannot allocate memory
However, commit 4402f2627d ("iommu/amd: Implement global identity
domain") changed the flow of the code, skipping the SNP guard and
allowing users to change the IOMMU domain to identity after a machine
has booted.
Once the user does that, they will probably try to bind a driver, and
the device/driver will start to do DMA, which will trigger errors:
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0 (unnamed net_device) (uninitialized): Error (timeout: 500015) msg {0x0 0x0} len:0
iommu ivhd3: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:43:00.0 pasid=0x00000 address=0x3737b01000 flags=0x0020]
iommu ivhd3: AMD-Vi: Control Reg : 0xc22000142148d
AMD-Vi: DTE[0]: 6000000000000003
AMD-Vi: DTE[1]: 0000000000000001
AMD-Vi: DTE[2]: 2000003088b3e013
AMD-Vi: DTE[3]: 0000000000000000
bnxt_en 0000:43:00.0: probe with driver bnxt_en failed with error -16
To prevent this from happening, create an attach wrapper for
identity_domain_ops which returns EINVAL if amd_iommu_snp_en is true.
With this commit applied:
# echo "identity" > /sys/kernel/iommu_groups/62/type
-bash: echo: write error: Invalid argument
Fixes: 4402f2627d ("iommu/amd: Implement global identity domain")
Signed-off-by: Joe Damato <joe@dama.to>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
domain->mm->iommu_mm can be freed by iommu_domain_free():
iommu_domain_free()
mmdrop()
__mmdrop()
mm_pasid_drop()
After iommu_domain_free() returns, accessing domain->mm->iommu_mm may
dereference a freed mm structure, leading to a crash.
Fix this by moving the code that accesses domain->mm->iommu_mm to before
the call to iommu_domain_free().
Fixes: e37d5a2d60 ("iommu/sva: invalidate stale IOTLB entries for kernel address space")
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Avoid kernel-doc warnings in io-pgtable.h:
- use the correct struct member names or kernel-doc format
- add a missing struct member description
- add a missing function return comment section
Warning: include/linux/io-pgtable.h:187 struct member 'coherent_walk' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_lpae_s1_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_lpae_s2_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_v7s_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'arm_mali_lpae_cfg'
not described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'apple_dart_cfg' not
described in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:187 struct member 'amd' not described
in 'io_pgtable_cfg'
Warning: include/linux/io-pgtable.h:223 struct member
'read_and_clear_dirty' not described in 'io_pgtable_ops'
Warning: include/linux/io-pgtable.h:237 No description found for return
value of 'alloc_io_pgtable_ops'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
syzbot reports "task hung in rpm_resume".
This is caused by aqc111_suspend() calling
the PM variant of its write_cmd routine.
The simplified call trace looks like this:
rpm_suspend()
usb_suspend_both() - here udev->dev.power.runtime_status == RPM_SUSPENDING
aqc111_suspend() - called for the usb device interface
aqc111_write32_cmd()
usb_autopm_get_interface()
pm_runtime_resume_and_get()
rpm_resume() - here we call rpm_resume() on our parent
rpm_resume() - Here we wait for a status change that will never happen.
At this point we block another task which holds
rtnl_lock and locks up the whole networking stack.
Fix this by replacing the write_cmd calls with their _nopm variants.
Reported-by: syzbot+48dc1e8dfc92faf1124c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=48dc1e8dfc92faf1124c
Fixes: e58ba4544c ("net: usb: aqc111: Add support for wake on LAN by MAGIC packet")
Signed-off-by: Nikola Z. Ivanov <zlatistiv@gmail.com>
Link: https://patch.msgid.link/20260313141643.1181386-1-zlatistiv@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 789a5913b2 ("iommu/amd: Use the generic iommu page table")
introduces the shared iommu page table for AMD IOMMU. Some BIOSes
contain an identity mapping for address 0x0, which is not parsed
properly (e.g., certain Strix Halo devices). This causes the DMA
components of the device to fail to initialize (e.g., the NVMe SSD
controller), leading to a failed POST.
Specifically, on the GPD Win 5, the NVMe SSD and GPU fail to mount,
making collecting errors difficult. While debugging, it was found that
a -EADDRINUSE error was emitted and its source was traced to
iommu_iova_to_phys(). After adding some debug prints, it was found that
phys_addr becomes 0, which causes the code to try to re-map the 0
address and fail, causing a cascade leading to a failed POST. This is
because the GPD Win 5 contains a 0x0-0x1 identity mapping for DMA
devices, causing it to be repeated for each device.
The cause of this failure is the following check in
iommu_create_device_direct_mappings(), which is where address aliasing
is handled:
```
phys_addr = iommu_iova_to_phys(domain, addr);
if (!phys_addr) {
	map_size += pg_size;
	continue;
}
```
Obviously, the iommu_iova_to_phys() signature is faulty and aliases
unmapped and 0 together, causing the allocation code to try to
re-allocate the 0 address per device. However, it has too many
instantiations to fix. Therefore, use a ternary so that when addr
is 0, the check is done for address 1 instead.
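The ambiguity and the workaround can be modelled in a few lines of
user-space C. This is a toy model, not the IOMMU core: struct range,
lookup_phys() and is_mapped() are hypothetical, and a single contiguous
range stands in for the page table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy page table: one contiguous mapped range (len == 0 means empty). */
struct range {
	uint64_t iova;
	uint64_t phys;
	uint64_t len;
};

/* Mimics a lookup that uses 0 as its "not mapped" return value. */
static uint64_t lookup_phys(const struct range *r, uint64_t iova)
{
	assert(r != NULL);

	if (iova < r->iova || iova - r->iova >= r->len)
		return 0;
	return r->phys + (iova - r->iova);
}

/*
 * For addr == 0, probe address 1 instead: it lies in the same page, and
 * an identity mapping of page 0 then yields 1 rather than the ambiguous 0.
 */
static bool is_mapped(const struct range *r, uint64_t addr)
{
	return lookup_phys(r, addr ? addr : 1) != 0;
}
```

The probe only disambiguates the single problematic address; every other
address goes through the lookup unchanged.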
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Fixes: 789a5913b2 ("iommu/amd: Use the generic iommu page table")
Signed-off-by: Antheas Kapenekakis <lkml@antheas.dev>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
In intel_svm_set_dev_pasid(), the driver unconditionally manages the IOPF
handling during a domain transition. However, commit a86fb77173
("iommu/vt-d: Allow SVA with device-specific IOPF") introduced support for
SVA on devices that handle page faults internally without utilizing the
PCI PRI. On such devices, the IOMMU-side IOPF infrastructure is not
required. Calling iopf_for_domain_replace() on these devices is incorrect
and can lead to unexpected failures during PASID attachment or unwinding.
Add a check for info->pri_supported to ensure that the IOPF queue logic
is only invoked for devices that actually rely on the IOMMU's PRI-based
fault handling.
Fixes: 17fce9d233 ("iommu/vt-d: Put iopf enablement in domain attach path")
Cc: stable@vger.kernel.org
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20260310075520.295104-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
During the qi_check_fault process after an IOMMU ITE event, requests at
odd-numbered positions in the queue are set to QI_ABORT, only satisfying
single-request submissions. However, qi_submit_sync now supports multiple
simultaneous submissions, and can't guarantee that the wait_desc will be
at an odd-numbered position. Therefore, if an item times out, IOMMU can't
re-initiate the request, resulting in an infinite polling wait.
This modifies the process by setting the status of all requests already
fetched by IOMMU and recorded as QI_IN_USE status (including wait_desc
requests) to QI_ABORT, thus enabling multiple requests to be resubmitted.
Fixes: 8a1d824625 ("iommu/vt-d: Multiple descriptors per qi_submit_sync()")
Cc: stable@vger.kernel.org
Signed-off-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://lore.kernel.org/r/20260306101516.3885775-1-guanghuifeng@linux.alibaba.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Fix a use-after-free in the clsact qdisc upon init/destroy rollback asymmetry.
The latter is achieved by first fully initializing a clsact instance, and
then in a second step having a replacement failure for the new clsact qdisc
instance. clsact_init() initializes ingress first and then takes care of the
egress part. This can fail midway, for example, via tcf_block_get_ext(). Upon
failure, the kernel will trigger the clsact_destroy() callback.
Commit 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry") details
how the transition happens. If tcf_block_get_ext on the q->ingress_block
ends up failing, we took the tcx_miniq_inc reference count on the ingress
side, but not yet on the egress side. clsact_destroy() tests whether the
{ingress,egress}_entry was non-NULL. However, even in midway failure on the
replacement, both are in fact non-NULL with a valid egress_entry from the
previous clsact instance.
What we really need to test for is whether the qdisc instance-specific ingress
or egress side previously got initialized. This adds a small helper for checking
the miniq initialization called mini_qdisc_pair_inited, and utilizes that upon
clsact_destroy() in order to fix the use-after-free scenario. Convert the
ingress_destroy() side as well so both are consistent with each other.
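The idea of testing instance-specific initialization rather than the shared
entry pointers can be sketched as follows. This is a simplified user-space
model: struct dev_entries, struct qdisc_inst, inst_init() and
inst_destroy() are hypothetical, and plain counters stand in for the
tcx_miniq reference counts:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum { INGRESS, EGRESS, NSIDES };

/* Shared per-device state; stays populated while any instance uses it. */
struct dev_entries {
	int refs[NSIDES];
};

struct qdisc_inst {
	struct dev_entries *dev;
	bool inited[NSIDES];	/* what THIS instance has set up */
};

/* Ingress is set up first; egress setup may fail midway. */
static int inst_init(struct qdisc_inst *q, struct dev_entries *dev,
		     bool fail_egress)
{
	assert(q != NULL && dev != NULL);

	q->dev = dev;
	q->inited[INGRESS] = false;
	q->inited[EGRESS] = false;

	dev->refs[INGRESS]++;
	q->inited[INGRESS] = true;

	if (fail_egress)
		return -ENOMEM;	/* caller runs inst_destroy() on failure */

	dev->refs[EGRESS]++;
	q->inited[EGRESS] = true;
	return 0;
}

/* Undo only what this instance did, never what a previous one did. */
static void inst_destroy(struct qdisc_inst *q)
{
	for (int i = 0; i < NSIDES; i++) {
		if (!q->inited[i])
			continue;
		q->dev->refs[i]--;
		q->inited[i] = false;
	}
}
```

When a replacement instance fails after the ingress step, its destroy path
leaves the egress reference of the still-live previous instance alone.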
Fixes: 1cb6f0bae5 ("bpf: Fix too early release of tcx_entry")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260313065531.98639-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
A single AXIDMA controller can have one or two channels. When it has two
channels, the resets for both are tied together: resetting one channel
resets the other as well. This creates a problem where resetting one
channel will reset the registers for both channels, including clearing
interrupt enable bits for the other channel, which can then lead to
timeouts as the driver is waiting for an interrupt which never comes.
The driver currently has a probe-time workaround for this: when a
channel is created, the driver also resets and enables the
interrupts. With two channels the reset for the second channel will
clear the interrupt enables for the first one. The workaround in the
driver is just to manually enable the interrupts again in
xilinx_dma_alloc_chan_resources().
This workaround only addresses the probe-time issue. When channels are
reset at runtime (e.g., in xilinx_dma_terminate_all() or during error
recovery), there's no corresponding mechanism to restore the other
channel's interrupt enables. This leads to one channel having its
interrupts disabled while the driver expects them to work, causing
timeouts and DMA failures.
A proper fix is a complicated matter, as we should not reset the other
channel when it's operating normally. So, perhaps, there should be some
kind of synchronization for a common reset, which is not trivial to
implement. To add to the complexity, the driver also supports other DMA
types, like VDMA, CDMA and MCDMA, which don't have a shared reset.
However, when the two-channel AXIDMA is used in the (presumably) normal
use case, providing DMA for a single memory-to-memory device, the common
reset is less of an issue: when something bad happens on one channel,
or when one channel is terminated, the assumption is that we also want
to terminate the other channel. And thus resetting both at the same time
is "ok".
With that line of thinking we can implement a somewhat better workaround
than the current probe-time one: let's enable the
AXIDMA interrupts at xilinx_dma_start_transfer() instead.
This ensures interrupts are enabled whenever a transfer starts,
regardless of any prior resets that may have cleared them.
This approach is also more logical: enable interrupts only when needed
for a transfer, rather than at resource allocation time, and, I think,
all the other DMA types should also use this model, but I'm reluctant to
do such changes as I cannot test them.
The reset function still enables interrupts even though it's not needed
for AXIDMA anymore, but it's common code for all DMA types (VDMA, CDMA,
MCDMA), so leave it unchanged to avoid affecting other variants.
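A toy model of why enabling interrupts at transfer start is robust against
the shared reset. struct dma_ctrl, chan_reset() and chan_start_transfer()
are hypothetical, and a bool per channel stands in for the interrupt
enable bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCHAN 2

/* Toy controller: one interrupt-enable flag per channel. */
struct dma_ctrl {
	bool irq_en[NCHAN];
};

/* Shared reset: resetting via either channel clears both enables. */
static void chan_reset(struct dma_ctrl *c)
{
	assert(c != NULL);

	for (int i = 0; i < NCHAN; i++)
		c->irq_en[i] = false;
}

/* Re-arm the channel's interrupts on every start, then kick the transfer. */
static void chan_start_transfer(struct dma_ctrl *c, int ch)
{
	assert(c != NULL && ch >= 0 && ch < NCHAN);

	c->irq_en[ch] = true;
	/* ... program descriptors and start the hardware ... */
}
```

Whatever reset happened since the previous transfer, the channel that is
about to wait for a completion interrupt has it enabled.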
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ideasonboard.com>
Fixes: c0bba3a99f ("dmaengine: vdma: Add Support for Xilinx AXI Direct Memory Access Engine")
Link: https://patch.msgid.link/20260311-xilinx-dma-fix-v2-1-a725abb66e3c@ideasonboard.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The segment .control and .status fields both contain top bits which are
not part of the buffer size, the buffer size is located only in the bottom
max_buffer_len bits. To avoid interference from those top bits, mask out
the size using max_buffer_len first, and only then subtract the values.
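A small sketch of the arithmetic, assuming the length limit is available
as a bitmask covering the size bits. seg_residue() and its len_mask
parameter are hypothetical names, and the register values used to exercise
it are made up:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Residue of one segment: requested size minus transferred size.
 * Both fields carry unrelated flag bits above the size, so each one is
 * masked before the subtraction.
 */
static uint32_t seg_residue(uint32_t control, uint32_t status,
			    uint32_t len_mask)
{
	return (control & len_mask) - (status & len_mask);
}
```

For example, with a 26-bit size field, control = 0x08001000 and
status = 0x80000400 hold sizes 0x1000 and 0x400; masking first gives a
residue of 0xC00, whereas subtracting the raw values would mix the flag
bits into the result.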
Fixes: a575d0b4e6 ("dmaengine: xilinx_dma: Introduce xilinx_dma_get_residue")
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260316222530.163815-1-marex@nabladev.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The cyclic DMA calculation is currently entirely broken and reports
residue only for the first segment. The problem is twofold.
First, when the first descriptor finishes, it is moved from active_list
to done_list, but it is never returned back into the active_list. The
xilinx_dma_tx_status() expects the descriptor to be in the active_list
to report any meaningful residue information, which never happens after
the first descriptor finishes. Fix this up in xilinx_dma_start_transfer():
if the descriptor is cyclic, lift it from done_list and place it back
into active_list.
Second, the segment .status fields of the descriptor remain dirty. Once
the DMA did one pass on the descriptor, the .status fields are populated
with data by the DMA, but the .status fields are not cleared before reuse
during the next cyclic DMA round. The xilinx_dma_get_residue() recognizes
that as if the descriptor was complete and had 0 residue, which is bogus.
Reinitialize the status field before placing the descriptor back into the
active_list.
Fixes: c0bba3a99f ("dmaengine: vdma: Add Support for Xilinx AXI Direct Memory Access Engine")
Signed-off-by: Marek Vasut <marex@nabladev.com>
Link: https://patch.msgid.link/20260316221943.160375-1-marex@nabladev.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
iptfs_clone_state() stores x->mode_data before allocating the reorder
window. If that allocation fails, the code frees the cloned state and
returns -ENOMEM, leaving x->mode_data pointing at freed memory.
The xfrm clone unwind later runs destroy_state() through x->mode_data,
so the failed clone path tears down IPTFS state that clone_state()
already freed.
Keep the cloned IPTFS state private until all allocations succeed so
failed clones leave x->mode_data unset. The destroy path already
handles a NULL mode_data pointer.
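The publish-after-success pattern can be sketched in userspace C as
follows. The structures, the window size, and the injected allocator are
hypothetical stand-ins for illustration, not the xfrm or IPTFS types.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the xfrm state and the IPTFS mode data. */
struct mode_data { int *reorder_win; };
struct state { struct mode_data *mode_data; };

/* alloc_win is injected so the failure path can be exercised. */
static int clone_state(struct state *x, const struct mode_data *orig,
                       void *(*alloc_win)(size_t))
{
    struct mode_data *xtfs = malloc(sizeof(*xtfs));

    if (!xtfs)
        return -ENOMEM;
    *xtfs = *orig;
    xtfs->reorder_win = alloc_win(4 * sizeof(int));
    if (!xtfs->reorder_win) {
        free(xtfs);        /* x->mode_data was never set: nothing dangles */
        return -ENOMEM;
    }
    x->mode_data = xtfs;   /* publish only after every allocation succeeded */
    return 0;
}

/* Allocator that always fails, used to drive the error path. */
static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}
```

Calling clone_state() with failing_alloc leaves x->mode_data NULL, so a
later teardown through that pointer has nothing to free twice.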
Fixes: 6be02e3e4f ("xfrm: iptfs: handle reordering of received packets")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
[BUG]
When recovering relocation at mount time, merge_reloc_root() and
btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against
an impossible state: a non-zero drop_progress combined with a zero
drop_level in a root_item. With corrupted on-disk metadata that state can
nevertheless be reached:
------------[ cut here ]------------
kernel BUG at fs/btrfs/relocation.c:1545!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545
Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000
Call Trace:
merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861
btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195
btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130
open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640
btrfs_fill_super fs/btrfs/super.c:987 [inline]
btrfs_get_tree_super fs/btrfs/super.c:1951 [inline]
btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline]
btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128
vfs_get_tree+0x9a/0x370 fs/super.c:1758
fc_mount fs/namespace.c:1199 [inline]
do_new_mount_fc fs/namespace.c:3642 [inline]
do_new_mount fs/namespace.c:3718 [inline]
path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
do_mount fs/namespace.c:4041 [inline]
__do_sys_mount fs/namespace.c:4229 [inline]
__se_sys_mount fs/namespace.c:4206 [inline]
__x64_sys_mount+0x282/0x320 fs/namespace.c:4206
...
RIP: 0033:0x7f969c9a8fde
Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f
---[ end trace 0000000000000000 ]---
The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic
metadata fuzzing tool that corrupts btrfs metadata at runtime.
[CAUSE]
A non-zero drop_progress.objectid means an interrupted
btrfs_drop_snapshot() left a resume point on disk, and in that case
drop_level must be greater than 0 because the checkpoint is only
saved at internal node levels.
Although this invariant is enforced when the kernel writes the root
item, it is not validated when the root item is read back from disk.
That allows on-disk corruption to provide an invalid state with
drop_progress.objectid != 0 and drop_level == 0.
When relocation recovery later processes such a root item,
merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The
same invalid metadata can also trigger the corresponding BUG_ON() in
btrfs_drop_snapshot().
[FIX]
Fix this by validating the root_item invariant in tree-checker when
reading root items from disk: if drop_progress.objectid is non-zero,
drop_level must also be non-zero. Reject such malformed metadata with
-EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot()
and triggers the BUG_ON.
After the fix, the same corruption is correctly rejected by tree-checker
and the BUG_ON is no longer triggered.
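A minimal sketch of the invariant, assuming a trimmed-down view of the
root item. The structure and function names are hypothetical and this is
not the tree-checker code itself.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, defined here so the sketch builds elsewhere */
#endif

/* Hypothetical, reduced view of the two root_item fields involved. */
struct root_item_view {
    uint64_t drop_progress_objectid;
    uint8_t drop_level;
};

/* A resume point (non-zero objectid) requires a non-zero drop_level. */
static int check_drop_state(const struct root_item_view *ri)
{
    if (ri->drop_progress_objectid != 0 && ri->drop_level == 0)
        return -EUCLEAN;
    return 0;
}
```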
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a potential use-after-free in move_existing_remap(): we're calling
btrfs_put_block_group() on dest_bg, then passing it to
btrfs_add_block_group_free_space() a few lines later.
Fix this by getting the BG at the start of the function and putting it
near the end. This also means we're not doing a lookup twice for the
same thing.
Reported-by: Chris Mason <clm@fb.com>
Link: https://lore.kernel.org/linux-btrfs/20260125123908.2096548-1-clm@meta.com/
Fixes: bbea42dfb9 ("btrfs: move existing remaps before relocating block group")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it
crashes with the following ASSERT() triggered:
BTRFS info (device dm-3): use lzo compression, level 1
assertion failed: folio_size(fi.folio) == sectorsize :: 0, in lzo.c:450
------------[ cut here ]------------
kernel BUG at lzo.c:450!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 329 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : lzo_decompress_bio+0x61c/0x630 [btrfs]
lr : lzo_decompress_bio+0x61c/0x630 [btrfs]
Call trace:
lzo_decompress_bio+0x61c/0x630 [btrfs] (P)
end_bbio_compressed_read+0x2a8/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
Code: 912a2021 b0000e00 91246000 940244e9 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 37cc07cab7 ("btrfs: lzo: use folio_iter to handle
lzo_decompress_bio()") added the ASSERT() to make sure the folio size
matches the fs block size.
But the check is wrong: the original intention was to make sure that, for
bs > ps cases, we always get a large folio that covers a full fs block.
However for bs < ps cases, a folio can never be smaller than page size,
and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never
be smaller than PAGE_SIZE, and still cover bs > ps cases.
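The difference between the two checks can be sketched as below. This
assumes the minimum folio size is the larger of the page size and the fs
block size, and that the new check is an equality test. Both are
assumptions made for illustration, and all names are hypothetical.

```c
#include <stddef.h>

/* Assumption: minimum folio size = max(page size, fs block size). */
static size_t min_folio_size(size_t page_size, size_t block_size)
{
    return block_size > page_size ? block_size : page_size;
}

/* Old check: only holds when the block size is at least the page size. */
static int old_check(size_t folio_size, size_t block_size)
{
    return folio_size == block_size;
}

/* New check: compare against the minimum folio size instead. */
static int new_check(size_t folio_size, size_t page_size, size_t block_size)
{
    return folio_size == min_folio_size(page_size, block_size);
}
```

With a 64K page and a 4K block, a 64K folio fails old_check() but passes
new_check(). With a 4K page and a 16K block, a 16K folio passes both.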
Fixes: 37cc07cab7 ("btrfs: lzo: use folio_iter to handle lzo_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, it
crashes with the following ASSERT() triggered:
assertion failed: folio_size(fi.folio) == blocksize :: 0, in fs/btrfs/zstd.c:603
------------[ cut here ]------------
kernel BUG at fs/btrfs/zstd.c:603!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 1183 Comm: kworker/u35:4 Not tainted 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : zstd_decompress_bio+0x4f0/0x508 [btrfs]
lr : zstd_decompress_bio+0x4f0/0x508 [btrfs]
Call trace:
zstd_decompress_bio+0x4f0/0x508 [btrfs] (P)
end_bbio_compressed_read+0x260/0x2c0 [btrfs]
btrfs_bio_end_io+0xc4/0x258 [btrfs]
btrfs_check_read_bio+0x424/0x7e0 [btrfs]
simple_end_io_work+0x40/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
Commit 1914b94231 ("btrfs: zstd: use folio_iter to handle
zstd_decompress_bio()") added the ASSERT() to make sure the folio size
matches the fs block size.
But the check is wrong: the original intention was to make sure that, for
bs > ps cases, we always get a large folio that covers a full fs block.
However for bs < ps cases, a folio can never be smaller than page size,
and the ASSERT() gets triggered immediately.
[FIX]
Check the folio size against @min_folio_size instead, which will never
be smaller than PAGE_SIZE, and still cover bs > ps cases.
Fixes: 1914b94231 ("btrfs: zstd: use folio_iter to handle zstd_decompress_bio()")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284, the following ASSERT() will be triggered with
64K page size and 4K fs block size:
assertion failed: folio_test_writeback(folio) :: 0, in subpage.c:476
------------[ cut here ]------------
kernel BUG at subpage.c:476!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 4 UID: 0 PID: 2313 Comm: kworker/u37:2 Tainted: G OE 6.19.0-rc8-custom+ #185 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs-endio simple_end_io_work [btrfs]
pc : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
lr : btrfs_subpage_clear_writeback+0x148/0x160 [btrfs]
Call trace:
btrfs_subpage_clear_writeback+0x148/0x160 [btrfs] (P)
btrfs_folio_clamp_clear_writeback+0xb4/0xd0 [btrfs]
end_compressed_writeback+0xe0/0x1e0 [btrfs]
end_bbio_compressed_write+0x1e8/0x218 [btrfs]
btrfs_bio_end_io+0x108/0x258 [btrfs]
simple_end_io_work+0x68/0xa8 [btrfs]
process_one_work+0x168/0x3f0
worker_thread+0x25c/0x398
kthread+0x154/0x250
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
[CAUSE]
The offending bio is from an encoded write, where the compressed data is
directly written as a data extent, without touching the page cache.
However the encoded write still utilizes the regular buffered write path
for compressed data, by setting the compressed_bio::writeback flag.
When that flag is set, end_bbio_compressed_write() clears the writeback
flag of the folios in the page cache.
However for bs < ps cases, the subpage helper has one extra check to make
sure the folio has a writeback flag set in the first place.
But since it's an encoded write, we never go through page
cache, thus the folio has no writeback flag and triggers the ASSERT().
[FIX]
Do not set compressed_bio::writeback flag for encoded writes, and change
the ASSERT() in btrfs_submit_compressed_write() to make sure that flag
is not set.
Fixes: e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/284 with 64K page size and 4K fs block size, the
following ASSERT() can be triggered:
assertion failed: cb->bbio.bio.bi_iter.bi_size == disk_num_bytes :: 0, in inode.c:9991
------------[ cut here ]------------
kernel BUG at inode.c:9991!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 5 UID: 0 PID: 6787 Comm: btrfs Tainted: G OE 6.19.0-rc8-custom+ #1 PREEMPT(voluntary)
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
pc : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
lr : btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs]
Call trace:
btrfs_do_encoded_write+0x9b0/0x9c0 [btrfs] (P)
btrfs_do_write_iter+0x1d8/0x208 [btrfs]
btrfs_ioctl_encoded_write+0x3c8/0x6d0 [btrfs]
btrfs_ioctl+0xeb0/0x2b60 [btrfs]
__arm64_sys_ioctl+0xac/0x110
invoke_syscall.constprop.0+0x64/0xe8
el0_svc_common.constprop.0+0x40/0xe8
do_el0_svc+0x24/0x38
el0_svc+0x3c/0x1b8
el0t_64_sync_handler+0xa0/0xe8
el0t_64_sync+0x1a4/0x1a8
Code: 91180021 90001080 9111a000 94039d54 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
After commit e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage
for encoded writes"), the encoded write is changed to copy the content
from the iov into a folio, and queue the folio into the compressed bio.
However we always queue the full folio into the compressed bio, which
can make the compressed bio larger than the on-disk extent, if the folio
size is larger than the fs block size.
Although we have an ASSERT() to catch this problem, on kernels without
CONFIG_BTRFS_ASSERT such a larger-than-expected bio will just be
submitted, possibly overwriting the next data extent and causing data
corruption.
[FIX]
Instead of blindly queuing the full folio into the compressed bio, only
queue the rounded up range, which is the old behavior before that
offending commit.
This also means we no longer need to zero the trailing range until the
folio end (but still to the block boundary), as such range will not be
submitted anyway.
And since we're here, add a final ASSERT() into
btrfs_submit_compressed_write() as the last safety net for kernels with
btrfs assertions enabled.
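A rough sketch of the "queue only the rounded up range" idea, assuming a
power-of-two block size. The helper names are hypothetical and do not
correspond to btrfs functions.

```c
#include <stddef.h>

/* Round len up to a multiple of block_size (must be a power of two). */
static size_t round_up_pow2(size_t len, size_t block_size)
{
    return (len + block_size - 1) & ~(block_size - 1);
}

/* Bytes of one folio to queue into the bio: the block-aligned payload,
 * capped at the folio size, rather than the whole folio. */
static size_t bytes_to_queue(size_t copied, size_t block_size,
                             size_t folio_size)
{
    size_t len = round_up_pow2(copied, block_size);

    return len < folio_size ? len : folio_size;
}
```

With 5000 bytes copied into a 64K folio on a 4K block size, only 8192
bytes are queued instead of the full 65536.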
Fixes: e1bc83f8b1 ("btrfs: get rid of compressed_folios[] usage for encoded writes")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently our qgroup ioctls don't reserve any space: they just do a
transaction join, which reserves space neither for the quota tree updates
nor for the delayed refs generated when updating the quota tree. The quota
root uses the global block reserve, which is fine most of
the time since we don't expect a lot of updates to the quota root, or to
be too close to -ENOSPC such that other critical metadata updates need to
resort to the global reserve.
However this is not optimal, as not reserving proper space may result in a
transaction abort due to not reserving space for delayed refs and then
abusing the use of the global block reserve.
For example, the following reproducer (which is unlikely to model any
real world use case, but just to illustrate the problem), triggers such a
transaction abort due to -ENOSPC when running delayed refs:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
# Limit device to 1G so that it's much faster to reproduce the issue.
mkfs.btrfs -f -b 1G $DEV
mount -o commit=600 $DEV $MNT
fallocate -l 800M $MNT/filler
btrfs quota enable $MNT
for ((i = 1; i <= 400000; i++)); do
btrfs qgroup create 1/$i $MNT
done
umount $MNT
When running this, we can see in dmesg/syslog that a transaction abort
happened:
[436.490] BTRFS error (device nullb0): failed to run delayed ref for logical 30408704 num_bytes 16384 type 176 action 1 ref_mod 1: -28
[436.493] ------------[ cut here ]------------
[436.494] BTRFS: Transaction aborted (error -28)
[436.495] WARNING: fs/btrfs/extent-tree.c:2247 at btrfs_run_delayed_refs+0xd9/0x110 [btrfs], CPU#4: umount/2495372
[436.497] Modules linked in: btrfs loop (...)
[436.508] CPU: 4 UID: 0 PID: 2495372 Comm: umount Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[436.510] Tainted: [W]=WARN
[436.511] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[436.513] RIP: 0010:btrfs_run_delayed_refs+0xdf/0x110 [btrfs]
[436.514] Code: 0f 82 ea (...)
[436.518] RSP: 0018:ffffd511850b7d78 EFLAGS: 00010292
[436.519] RAX: 00000000ffffffe4 RBX: ffff8f120dad37e0 RCX: 0000000002040001
[436.520] RDX: 0000000000000002 RSI: 00000000ffffffe4 RDI: ffffffffc090fd80
[436.522] RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffffc04d1867
[436.523] R10: ffff8f18dc1fffa8 R11: 0000000000000003 R12: ffff8f173aa89400
[436.524] R13: 0000000000000000 R14: ffff8f173aa89400 R15: 0000000000000000
[436.526] FS: 00007fe59045d840(0000) GS:ffff8f192e22e000(0000) knlGS:0000000000000000
[436.527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436.528] CR2: 00007fe5905ff2b0 CR3: 000000060710a002 CR4: 0000000000370ef0
[436.530] Call Trace:
[436.530] <TASK>
[436.530] btrfs_commit_transaction+0x73/0xc00 [btrfs]
[436.531] ? btrfs_attach_transaction_barrier+0x1e/0x70 [btrfs]
[436.532] sync_filesystem+0x7a/0x90
[436.533] generic_shutdown_super+0x28/0x180
[436.533] kill_anon_super+0x12/0x40
[436.534] btrfs_kill_super+0x12/0x20 [btrfs]
[436.534] deactivate_locked_super+0x2f/0xb0
[436.534] cleanup_mnt+0xea/0x180
[436.535] task_work_run+0x58/0xa0
[436.535] exit_to_user_mode_loop+0xed/0x480
[436.536] ? __x64_sys_umount+0x68/0x80
[436.536] do_syscall_64+0x2a5/0xf20
[436.537] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[436.537] RIP: 0033:0x7fe5906b6217
[436.538] Code: 0d 00 f7 (...)
[436.540] RSP: 002b:00007ffcd87a61f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[436.541] RAX: 0000000000000000 RBX: 00005618b9ecadc8 RCX: 00007fe5906b6217
[436.541] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005618b9ecb100
[436.542] RBP: 0000000000000000 R08: 00007ffcd87a4fe0 R09: 00000000ffffffff
[436.544] R10: 0000000000000103 R11: 0000000000000246 R12: 00007fe59081626c
[436.544] R13: 00005618b9ecb100 R14: 0000000000000000 R15: 00005618b9ecacc0
[436.545] </TASK>
[436.545] ---[ end trace 0000000000000000 ]---
Fix this by changing the qgroup ioctls to use start transaction instead of
joining so that proper space is reserved for the delayed refs generated
for the updates to the quota root. This way we don't get any transaction
abort.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both rz_dmac_disable_hw() and rz_dmac_irq_handle_channel() update the
CHCTRL register. To avoid concurrency issues when configuring
functionalities exposed by this register, take the virtual channel lock.
All other CHCTRL updates were already protected by the same lock.
Previously, rz_dmac_disable_hw() disabled and re-enabled local IRQs before
accessing CHCTRL registers, but this does not ensure race-free access.
Remove the local IRQ disable/enable code as well.
Fixes: 5000d37042 ("dmaengine: sh: Add DMAC driver for RZ/G2L SoC")
Cc: stable@vger.kernel.org
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Link: https://patch.msgid.link/20260316133252.240348-3-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The driver lists (ld_free, ld_queue) are used in
rz_dmac_free_chan_resources(), rz_dmac_terminate_all(),
rz_dmac_issue_pending(), and rz_dmac_irq_handler_thread(), all under
the virtual channel lock. Take the same lock in rz_dmac_prep_slave_sg()
and rz_dmac_prep_dma_memcpy() as well to avoid concurrency issues, since
these functions also check whether the lists are empty and update or
remove list entries.
Fixes: 5000d37042 ("dmaengine: sh: Add DMAC driver for RZ/G2L SoC")
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Link: https://patch.msgid.link/20260316133252.240348-2-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
It is possible for a malicious (or clumsy) device to respond to a
specific report's feature request using a completely different report
ID. This can cause confusion in the HID core resulting in nasty
side-effects such as OOB writes.
Add a check to ensure that the report ID in the response matches the
one that was requested. If it doesn't, omit reporting the raw event and
return early.
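The check can be sketched as below, assuming for illustration that the
report ID is carried in the first byte of the response buffer. The
function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the response may be forwarded, 0 if it must be dropped.
 * Assumption: the report ID is the first byte of the response. */
static int response_matches_request(const uint8_t *buf, size_t len,
                                    uint8_t requested_id)
{
    if (len == 0)
        return 0;
    return buf[0] == requested_id;
}
```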
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
btrfs_extent_root() can return a NULL pointer in case the root we are
looking for is not in the rb tree that tracks roots. So add checks to
every caller that is missing such check to log a message and return
an error. The same applies to callers of btrfs_block_group_root(),
since it calls btrfs_extent_root().
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PSR entry_setup_frames is currently computed directly into struct
intel_dp:intel_psr:entry_setup_frames. This causes a problem if a mode change
gets rejected after PSR compute config: the entry_setup_frames value computed
for the rejected state remains in intel_dp:intel_psr:entry_setup_frames. Fix
this by computing it into intel_crtc_state and copying the value into
intel_dp:intel_psr:entry_setup_frames on PSR enable.
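The compute-then-commit split can be illustrated with two reduced,
hypothetical structures. This is only a sketch of the pattern, not the
i915 code.

```c
/* Hypothetical, reduced versions of the two structures. */
struct crtc_state { int entry_setup_frames; };
struct psr { int entry_setup_frames; };

/* Compute phase: writes only to the candidate state, which is simply
 * discarded if the mode change is rejected. */
static void psr_compute_config(struct crtc_state *state, int frames)
{
    state->entry_setup_frames = frames;
}

/* Enable phase: runs only for an accepted state. */
static void psr_enable(struct psr *psr, const struct crtc_state *state)
{
    psr->entry_setup_frames = state->entry_setup_frames;
}
```

A value computed for a state that never reaches psr_enable() cannot leak
into the long-lived psr structure.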
Fixes: 2b981d57e4 ("drm/i915/display: Support PSR entry VSC packet to be transmitted one frame earlier")
Cc: Mika Kahola <mika.kahola@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Suraj Kandpal <suraj.kandpal@intel.com>
Link: https://patch.msgid.link/20260312083710.1593781-3-jouni.hogander@intel.com
(cherry picked from commit 8c229b4aa00262c13787982e998c61c0783285e0)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Similar to commit d8dc872046 ("ALSA: firewire-lib: fix uninitialized
local variable"), the local variable `curr_cycle_time` in
process_rx_packets() is declared without initialization.
When the tracepoint event is not probed, the variable may appear to be
used without being initialized. In practice the value is only relevant
when the tracepoint is enabled; however, initializing it avoids potential
use of an uninitialized value and improves code safety.
Initialize `curr_cycle_time` to zero.
Fixes: fef4e61b0b ("ALSA: firewire-lib: extend tracepoints event including CYCLE_TIME of 1394 OHCI")
Cc: stable@vger.kernel.org
Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru>
Link: https://patch.msgid.link/20260316191824.83249-1-sdl@nppct.ru
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Fixing a missing of_node_put() call.
* tag 'v7.0-rockchip-drvfixes1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mmind/linux-rockchip:
soc: rockchip: grf: Add missing of_node_put() when returning
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
On some systems (e.g. iMac 20,1 with BCM57766), the tg3 driver reads
a default placeholder MAC address (00:10:18:00:00:00) from the
mailbox. The correct value on those systems is stored in the
'local-mac-address' property.
This patch detects the default value and tries to retrieve
the correct address from the device_get_mac_address()
function instead.
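A userspace sketch of the detection and fallback logic. The helper names
are hypothetical, and the firmware-provided address is passed in directly
rather than looked up through any kernel API.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* The placeholder value named above: 00:10:18:00:00:00. */
static const uint8_t placeholder_mac[ETH_ALEN] = {
    0x00, 0x10, 0x18, 0x00, 0x00, 0x00
};

static int is_placeholder_mac(const uint8_t *addr)
{
    return memcmp(addr, placeholder_mac, ETH_ALEN) == 0;
}

/* Keep the mailbox value unless it is the placeholder and a
 * firmware-provided address is available (fw may be NULL). */
static const uint8_t *choose_mac(const uint8_t *mailbox, const uint8_t *fw)
{
    if (is_placeholder_mac(mailbox) && fw)
        return fw;
    return mailbox;
}
```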
The patch has been tested on two different systems:
- iMac 20,1 (BCM57766) model which uses the local-mac-address property
- iMac 13,2 (BCM57766) model which can use the mailbox,
NVRAM or MAC control registers
Tested-by: Rishon Jonathan R <mithicalaviator85@gmail.com>
Co-developed-by: Vincent MORVAN <vinc@42.fr>
Signed-off-by: Vincent MORVAN <vinc@42.fr>
Signed-off-by: Paul SAGE <paul.sage@42.fr>
Signed-off-by: Atharva Tiwari <atharvatiwarilinuxdev@gmail.com>
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20260314215432.3589-1-atharvatiwarilinuxdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tobgaertner says:
====================
net: usb: cdc_ncm: add ndpoffset to NDP nframes bounds check
The nframes bounds check in cdc_ncm_rx_verify_ndp16() and
cdc_ncm_rx_verify_ndp32() does not account for ndpoffset,
allowing out-of-bounds reads when the NDP is placed near the
end of the NTB.
====================
Link: https://patch.msgid.link/20260314054640.2895026-1-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The same bounds-check bug fixed for NDP16 in the previous patch also
exists in cdc_ncm_rx_verify_ndp32(). The DPE array size is validated
against the total skb length without accounting for ndpoffset, allowing
out-of-bounds reads when the NDP32 is placed near the end of the NTB.
Add ndpoffset to the nframes bounds check and use struct_size_t() to
express the NDP-plus-DPE-array size more clearly.
Compile-tested only.
Fixes: 0fa81b304a ("cdc_ncm: Implement the 32-bit version of NCM Transfer Block")
Signed-off-by: Tobi Gaertner <tob.gaertner@me.com>
Link: https://patch.msgid.link/20260314054640.2895026-3-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cdc_ncm_rx_verify_ndp16() validates that the NDP header and its DPE
entries fit within the skb. The first check correctly accounts for
ndpoffset:
if ((ndpoffset + sizeof(struct usb_cdc_ncm_ndp16)) > skb_in->len)
but the second check omits it:
if ((sizeof(struct usb_cdc_ncm_ndp16) +
ret * (sizeof(struct usb_cdc_ncm_dpe16))) > skb_in->len)
This validates the DPE array size against the total skb length as if
the NDP were at offset 0, rather than at ndpoffset. When the NDP is
placed near the end of the NTB (large wNdpIndex), the DPE entries can
extend past the skb data buffer even though the check passes.
cdc_ncm_rx_fixup() then reads out-of-bounds memory when iterating
the DPE array.
Add ndpoffset to the nframes bounds check and use struct_size_t() to
express the NDP-plus-DPE-array size more clearly.
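The corrected check can be sketched in plain C as below. The two
structures are simplified stand-ins for the NDP16 header and DPE16 entry,
and the function name is hypothetical. The sketch does not use
struct_size_t() and does not guard against multiplication overflow for
very large frame counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the NDP16 header and one DPE16 entry. */
struct ndp16 { uint32_t signature; uint16_t length; uint16_t next_index; };
struct dpe16 { uint16_t index; uint16_t length; };

/* Returns 1 when the header plus nframes entries, placed at ndpoffset,
 * fit inside a buffer of skb_len bytes. */
static int ndp16_fits(size_t ndpoffset, size_t nframes, size_t skb_len)
{
    size_t need = sizeof(struct ndp16) + nframes * sizeof(struct dpe16);

    /* Compare against the space left after ndpoffset, not the whole
     * buffer length. */
    return ndpoffset <= skb_len && need <= skb_len - ndpoffset;
}
```

With a 64-byte buffer, an NDP at offset 56 and two entries needs 16 bytes
but has only 8 left, so the check fails. Without the offset, the same
16 bytes would be compared against all 64 and pass.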
Fixes: ff06ab13a4 ("net: cdc_ncm: splitting rx_fixup for code reuse")
Signed-off-by: Tobi Gaertner <tob.gaertner@me.com>
Link: https://patch.msgid.link/20260314054640.2895026-2-tob.gaertner@me.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not run the airoha_dev_stop() routine explicitly in airoha_remove(),
since the ndo_stop() callback is already executed by unregister_netdev() in
the __dev_close_many() routine if necessary; doing so would cause an
underflow in the qdma users atomic counters. Rely on the networking subsystem
to stop the device when removing the airoha_eth module.
Fixes: 23020f0493 ("net: airoha: Introduce ethernet support for EN7581 SoC")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260313-airoha-remove-ndo_stop-remove-net-v2-1-67542c3ceeca@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller reported a panic in smc_tcp_syn_recv_sock() [1].
smc_tcp_syn_recv_sock() is called in the TCP receive path
(softirq) via icsk_af_ops->syn_recv_sock on the clcsock (TCP
listening socket). It reads sk_user_data to get the smc_sock
pointer. However, when the SMC listen socket is being closed
concurrently, smc_close_active() sets clcsock->sk_user_data
to NULL under sk_callback_lock, and then the smc_sock itself
can be freed via sock_put() in smc_release().
This leads to two issues:
1) NULL pointer dereference: sk_user_data is NULL when
accessed.
2) Use-after-free: sk_user_data is read as non-NULL, but the
smc_sock is freed before its fields (e.g., queued_smc_hs,
ori_af_ops) are accessed.
The race window looks like this (the syzkaller crash [1]
triggers via the SYN cookie path: tcp_get_cookie_sock() ->
smc_tcp_syn_recv_sock(), but the normal tcp_check_req() path
has the same race):
    CPU A (softirq)              CPU B (process ctx)
    tcp_v4_rcv()
      TCP_NEW_SYN_RECV:
      sk = req->rsk_listener
      sock_hold(sk)
      /* No lock on listener */
                                 smc_close_active():
                                   write_lock_bh(cb_lock)
                                   sk_user_data = NULL
                                   write_unlock_bh(cb_lock)
                                   ...
                                   smc_clcsock_release()
                                   sock_put(smc->sk) x2
                                     -> smc_sock freed!
      tcp_check_req()
        smc_tcp_syn_recv_sock():
          smc = user_data(sk)
            -> NULL or dangling
          smc->queued_smc_hs
            -> crash!
Note that the clcsock and smc_sock are two independent objects
with separate refcounts. TCP stack holds a reference on the
clcsock, which keeps it alive, but this does NOT prevent the
smc_sock from being freed.
Fix this by using RCU and refcount_inc_not_zero() to safely
access smc_sock. Since smc_tcp_syn_recv_sock() is called in
the TCP three-way handshake path, taking read_lock_bh on
sk_callback_lock is too heavy and would not survive a SYN
flood attack. Using rcu_read_lock() is much more lightweight.
- Set SOCK_RCU_FREE on the SMC listen socket so that
smc_sock freeing is deferred until after the RCU grace
period. This guarantees the memory is still valid when
accessed inside rcu_read_lock().
- Use rcu_read_lock() to protect reading sk_user_data.
- Use refcount_inc_not_zero(&smc->sk.sk_refcnt) to pin the
smc_sock. If the refcount has already reached zero (close
path completed), it returns false and we bail out safely.
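The "take a reference only if the object is still live" step can be
sketched with C11 atomics as a userspace analogue of
refcount_inc_not_zero(). The function name is hypothetical, and the
sketch does not model the RCU grace period that keeps the memory valid
while the counter is inspected.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count has not already dropped to zero. */
static bool ref_inc_not_zero(atomic_int *refcnt)
{
    int old = atomic_load(refcnt);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false;   /* object is already being torn down: bail out */
}
```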
Note: smc_hs_congested() has a similar lockless read of
sk_user_data without rcu_read_lock(), but it only checks for
NULL and accesses the global smc_hs_wq, never dereferencing
any smc_sock field, so it is not affected.
Reproducer was verified with mdelay injection and smc_run;
the issue no longer occurs with this patch applied.
[1] https://syzkaller.appspot.com/bug?extid=827ae2bfb3a3529333e9
Fixes: 8270d9c210 ("net/smc: Limit backlog connections")
Reported-by: syzbot+827ae2bfb3a3529333e9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67eaf9b8.050a0220.3c3d88.004a.GAE@google.com/T/
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260312092909.48325-1-jiayuan.chen@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For file-backed mounts, IO requests are handled by vfs_iocb_iter_read().
However, it can be interrupted by SIGKILL, returning only the number of
bytes actually copied. The unused folios in the bio are then unexpectedly
marked as uptodate.
    vfs_read
     filemap_read
      filemap_get_pages
       filemap_readahead
        erofs_fileio_readahead
         erofs_fileio_rq_submit
          vfs_iocb_iter_read
           filemap_read
            filemap_get_pages  <= detect signal
          erofs_fileio_ki_complete  <= set all folios uptodate
This patch addresses this by marking a short-read bio with an error
directly.
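A minimal sketch of the completion decision. The helper name is
hypothetical, and the use of -EIO for the short-read case is an
assumption made for illustration.

```c
#include <errno.h>

/* ret is the byte count (or negative errno) returned by the read,
 * expected is the payload size of the bio. */
static int bio_status_from_read(long ret, long expected)
{
    if (ret < 0)
        return (int)ret;   /* propagate the error as-is */
    if (ret != expected)
        return -EIO;       /* short read: fail the bio (assumed errno) */
    return 0;
}
```

Failing the whole bio means none of its folios get marked uptodate on a
short read.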
Fixes: bc804a8d7e ("erofs: handle end of filesystem properly for file-backed mounts")
Reported-by: chenguanyou <chenguanyou@xiaomi.com>
Signed-off-by: Yunlei He <heyunlei@xiaomi.com>
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The file contains a spelling error in a source comment (resposne).
Typos in comments reduce readability and make text searches less reliable
for developers and maintainers.
Replace 'resposne' with 'response' in the affected comment. This is a
comment-only cleanup and does not change behavior.
[v2: Removed Fixes: and Cc: to stable tags.]
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Check the global CXL_HDM_DECODER_ENABLE bit instead of looping over
per-decoder COMMITTED bits to determine whether to fall back to DVSEC
range emulation. When the HDM decoder capability is globally enabled,
ignore DVSEC range registers regardless of individual decoder commit
state.
should_emulate_decoders() currently loops over per-decoder COMMITTED
bits, which leads to an incorrect DVSEC fallback when those bits are
zero. One way to trigger this is to destroy a region and bounce the
memdev:
cxl disable-region region0
cxl destroy-region region0
cxl disable-memdev mem0
cxl enable-memdev mem0
Region teardown zeroes the HDM decoder registers including the committed
bits. The subsequent memdev re-probe finds uncommitted decoders and falls
back to DVSEC emulation, even though HDM remains globally enabled.
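The two decision rules can be contrasted in a small sketch. The bit
positions and function names are illustrative assumptions, and other
conditions the real function may check are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit positions, not taken from the CXL headers. */
#define GLOBAL_ENABLE (1u << 1)
#define DEC_COMMITTED (1u << 10)

/* Old rule: emulate unless some decoder is committed. */
static bool emulate_per_decoder(const uint32_t *dec_ctrl, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (dec_ctrl[i] & DEC_COMMITTED)
            return false;
    return true;
}

/* New rule: emulate only when HDM decoding is globally disabled. */
static bool emulate_global(uint32_t global_ctrl)
{
    return !(global_ctrl & GLOBAL_ENABLE);
}
```

After a region teardown that zeroes the decoder registers, the old rule
chooses emulation while the new rule does not, as long as the global
enable bit is still set.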
Observed failures:
should_emulate_decoders: cxl_port endpoint6: decoder6.0: committed: 0 base: 0x0_00000000 size: 0x0_00000000
devm_cxl_setup_hdm: cxl_port endpoint6: Fallback map 1 range register
..
devm_cxl_add_region: cxl_acpi ACPI0017:00: decoder0.0: created region0
__construct_region: cxl_pci 0000:e1:00.0: mem1:decoder6.0:
__construct_region region0 res: [mem 0x850000000-0x284fffffff flags 0x200] iw: 1 ig: 4096
cxl region0: pci0000:e0:port1 cxl_port_setup_targets expected iw: 1 ig: 4096 ..
cxl region0: pci0000:e0:port1 cxl_port_setup_targets got iw: 1 ig: 256 state: disabled ..
cxl_port endpoint6: failed to attach decoder6.0 to region0: -6
..
devm_cxl_add_region: cxl_acpi ACPI0017:00: decoder0.0: created region4
alloc_hpa: cxl region4: HPA allocation error (-34) ..
Fixes: 52cc48ad2a ("cxl/hdm: Limit emulation to the number of range registers")
Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260316201950.224567-1-Smita.KoralahalliChannabasappa@amd.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
If OF graph is used in the PCI device node, the pwrctrl core creates a
pwrctrl device even if the remote endpoint doesn't have power supply
requirements. Since the device doesn't have any power supply requirements,
there was no pwrctrl driver to probe, leading to PCI controller driver
probe deferral as it waits for all pwrctrl drivers to probe before starting
bus scan.
This issue happens with Qcom ath12k devices with WSI interface attached to
the Qcom IPQ platforms.
Fix this issue by checking for the existence of at least one power supply
property in the remote endpoint parent node. To consolidate all the checks,
create a new helper pci_pwrctrl_is_required() and move all the checks
there.
Fixes: 9db826206f ("PCI/pwrctrl: Create pwrctrl device if graph port is found")
Reported-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@oss.qualcomm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Reviewed-by: Krishna Chaitanya Chundru <krishna.chundru@oss.qualcomm.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Link: https://patch.msgid.link/20260223-pwrctrl-fixes-7-0-v2-1-97566dfb1809@oss.qualcomm.com
The NFSv4.0 replay cache uses a fixed 112-byte inline buffer
(rp_ibuf[NFSD4_REPLAY_ISIZE]) to store encoded operation responses.
This size was calculated based on OPEN responses and does not account
for LOCK denied responses, which include the conflicting lock owner as
a variable-length field up to 1024 bytes (NFS4_OPAQUE_LIMIT).
When a LOCK operation is denied due to a conflict with an existing lock
that has a large owner, nfsd4_encode_operation() copies the full encoded
response into the undersized replay buffer via read_bytes_from_xdr_buf()
with no bounds check. This results in a slab-out-of-bounds write of up
to 944 bytes past the end of the buffer, corrupting adjacent heap memory.
This can be triggered remotely by an unauthenticated attacker with two
cooperating NFSv4.0 clients: one sets a lock with a large owner string,
then the other requests a conflicting lock to provoke the denial.
We could fix this by increasing NFSD4_REPLAY_ISIZE to allow for a full
opaque, but that would increase the size of every stateowner, when most
lockowners are not that large.
Instead, fix this by checking the encoded response length against
NFSD4_REPLAY_ISIZE before copying into the replay buffer. If the
response is too large, set rp_buflen to 0 to skip caching the replay
payload. The status is still cached, and the client already received the
correct response on the original request.
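The length check can be sketched in user-space C as below. This is only an illustration under stated assumptions: struct replay and replay_store() are hypothetical names, and only the 112-byte size and the "set the length to 0 when too large" rule come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define REPLAY_ISIZE 112	/* mirrors NFSD4_REPLAY_ISIZE from the text */

/* Hypothetical stand-in for the replay cache entry. */
struct replay {
	int status;
	size_t buflen;
	unsigned char ibuf[REPLAY_ISIZE];
};

/* Always cache the status; cache the payload only if it fits. */
static void replay_store(struct replay *rp, int status,
			 const unsigned char *resp, size_t len)
{
	assert(rp && (resp || len == 0));
	rp->status = status;
	if (len <= sizeof(rp->ibuf)) {
		if (len)
			memcpy(rp->ibuf, resp, len);
		rp->buflen = len;
	} else {
		rp->buflen = 0;	/* too large: skip caching the payload */
	}
}
```

A 1024-byte denied-lock response then leaves the inline buffer untouched instead of overrunning it.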
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Reported-by: Nicholas Carlini <npc@anthropic.com>
Tested-by: Nicholas Carlini <npc@anthropic.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The elf_create_file() function fails with EINVAL when the build directory
path is long enough to truncate the "XXXXXX" suffix in the 256-byte
tmp_name buffer.
Simplify the code to remove the unnecessary dirname()/basename() split
and concatenation. Instead, allocate the exact number of bytes needed for
the path.
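As a rough sketch of the exact-size allocation, assuming a template of the form "<path>.XXXXXX" (the helper name and the exact template layout are assumptions for this example, not taken from objtool):

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "<path>.XXXXXX" in a buffer sized exactly for it. Caller frees. */
static char *make_tmp_template(const char *path)
{
	static const char suffix[] = ".XXXXXX";
	size_t len;
	char *tmpl;

	assert(path);
	len = strlen(path) + sizeof(suffix);	/* sizeof includes the NUL */
	tmpl = malloc(len);
	if (!tmpl)
		return NULL;
	snprintf(tmpl, len, "%s%s", path, suffix);
	return tmpl;
}
```

Because the buffer grows with the path, the trailing "XXXXXX" that mkstemp() requires can no longer be cut off by a fixed-size buffer.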
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Link: https://patch.msgid.link/20260310203751.1479229-3-joe.lawrence@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Commit 356e4b2f5b ("objtool: Fix data alignment in elf_add_data()")
corrected the alignment of data within a section (honoring the section's
sh_addralign). Apply the same alignment when klp-diff mode clones a
symbol, adjusting the new symbol's offset for the output section's
sh_addralign.
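The offset adjustment amounts to rounding up to the section alignment. A minimal sketch, with align_up() as a hypothetical helper name (ELF treats an sh_addralign of 0 or 1 as "no constraint"):

```c
#include <assert.h>
#include <stdint.h>

/* Round offset up to the next multiple of align (a power of two). */
static uint64_t align_up(uint64_t offset, uint64_t align)
{
	if (align <= 1)
		return offset;	/* 0 or 1 means no alignment constraint */
	assert((align & (align - 1)) == 0);
	return (offset + align - 1) & ~(align - 1);
}
```

For example, a symbol that would land at offset 13 in a section with sh_addralign of 8 is placed at offset 16 instead.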
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Link: https://patch.msgid.link/20260310203751.1479229-2-joe.lawrence@redhat.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
There are two special cases in the idle loop that are handled
inconsistently even though they are analogous.
The first one is when a cpuidle driver is absent and the default CPU
idle time power management implemented by the architecture code is used.
In that case, the scheduler tick is stopped every time before invoking
default_idle_call().
The second one is when a cpuidle driver is present, but there is only
one idle state in its table. In that case, the scheduler tick is never
stopped at all.
Since each of these approaches has its drawbacks, reconcile them with
the help of one simple heuristic. Namely, stop the tick if the CPU has
been woken up by it in the previous iteration of the idle loop, or let
it tick otherwise.
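The heuristic can be modelled with one bit of state per CPU. The sketch below is a toy model, not the idle loop itself: struct idle_ctx, enum wake_source and idle_pass() are hypothetical names, and the wakeup source is supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>

enum wake_source { WAKE_TICK, WAKE_OTHER };

/* Hypothetical per-CPU bookkeeping for this sketch. */
struct idle_ctx {
	bool last_wake_was_tick;
};

/* One idle-loop pass: returns true if the tick is stopped before idling. */
static bool idle_pass(struct idle_ctx *ctx, enum wake_source next_wake)
{
	bool stop_tick;

	assert(ctx);
	stop_tick = ctx->last_wake_was_tick;
	/* ... the CPU would enter its idle state here ... */
	ctx->last_wake_was_tick = (next_wake == WAKE_TICK);
	return stop_tick;
}
```

The tick is left running on the first pass, and stopped only on the pass that follows a tick wakeup.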
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes: ed98c34919 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Pull misc fixes from Andrew Morton:
"6 hotfixes. 4 are cc:stable. 3 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update email address for Ignat Korchagin
mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP
mm/rmap: fix incorrect pte restoration for lazyfree folios
mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()
build_bug.h: correct function parameters names in kernel-doc
crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
The controller per-cpu statistics are not allocated until after the
controller has been registered with driver core, which leaves a window
where accessing the sysfs attributes can trigger a NULL-pointer
dereference.
Fix this by moving the statistics allocation to controller allocation
while tying its lifetime to that of the controller (rather than using
implicit devres).
Fixes: 6598b91b5a ("spi: spi.c: Convert statistics to per-cpu u64_stats_t")
Cc: stable@vger.kernel.org # 6.0
Cc: David Jander <david@protonic.nl>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260312151817.32100-3-johan@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
The initial StarFighter quirk fixed the runtime suspend pop by muting
speakers in the shutup callback before power-down. Further hardware
validation showed that the speaker path is controlled directly by LINE2
EAPD on NID 0x1b together with GPIO2 for the external amplifier.
Replace the shutup-delay workaround with explicit sequencing of those
controls at playback start and stop:
- assert LINE2 EAPD and drive GPIO2 high on PREPARE
- deassert LINE2 EAPD and drive GPIO2 low on CLEANUP
This avoids the runtime suspend pop without a sleep, and also fixes pops
around G3 entry and display-manager start that the original workaround
did not cover.
Fixes: 1cb3c20688 ("ALSA: hda/realtek: Fix speaker pop on Star Labs StarFighter")
Tested-by: Sean Rhodes <sean@starlabs.systems>
Signed-off-by: Sean Rhodes <sean@starlabs.systems>
Link: https://patch.msgid.link/20260315201127.33744-1-sean@starlabs.systems
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The recent XFER_TO_GUEST_WORK change resulted in a situation, where the
vsie code would interpret a signal during work as a machine check during
SIE as both use the EINTR return code.
The exit_reason of the sie64a function has nothing to do with the
kvm_run exit_reason. Rename it and define a specific code for machine
checks instead of abusing -EINTR.
Rename exit_reason to sie_return to avoid the naming conflict
and change the code flow in vsie.c to have separate variables for rc
and sie_return.
Fixes: 2bd1337a12 ("KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions")
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
KVM will reinject machine checks that happen during guest activity.
From a host perspective this machine check is no longer visible
and even for the guest, the guest might decide to only kill a
userspace program or even ignore the machine check.
As this can be a disruptive event nevertheless, we should log this
not only in the VM debug event (that gets lost after guest shutdown)
but also on the global KVM event as well as syslog.
Consolidate the logging and log with loglevel 2 and higher.
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Acked-by: Hendrik Brueckner <brueckner@linux.ibm.com>
There are special cases where secure storage access exceptions happen
in a kernel context for pages that don't have the PG_arch_1 bit
set. That bit is set for non-exported guest secure storage (memory)
but is absent on storage donated to the Ultravisor since the kernel
isn't allowed to export donated pages.
Prior to this patch we would try to export the page by calling
arch_make_folio_accessible() which would instantly return since the
arch bit is absent signifying that the page was already exported and
no further action is necessary. This leads to secure storage access
exception loops which can never be resolved.
With this patch we unconditionally try to export and if that fails we
fixup.
Fixes: 084ea4d611 ("s390/mm: add (non)secure page access exceptions handlers")
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Pull btrfs fixes from David Sterba:
- fix logging of new dentries when logging parent directory and there
are conflicting inodes (e.g. deleted directory)
- avoid taking big device lock for zone setup, this is not necessary
during mount
- tune message verbosity when auto-reclaiming zones when low on space
- fix slightly misleading message of root item check
* tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: fix misleading root drop_level error message
btrfs: log new dentries when logging parent dir of a conflicting inode
btrfs: don't take device_list_mutex when querying zone info
btrfs: pass 'verbose' parameter to btrfs_relocate_block_group
Fix 45+ kernel-doc warnings in vmwgfx_drv.h:
- spell a struct name correctly
- don't have structs between kernel-doc and its struct
- end description of struct members with ':'
- start all kernel-doc lines with " *"
- mark private struct member and enum value with "private:"
- add kernel-doc for enum vmw_dma_map_mode
- add missing struct member comments
- add missing function parameter comments
- convert "/**" to "/*" for non-kernel-doc comments
- add missing "Returns:" comments for several functions
- correct a function parameter name
to eliminate kernel-doc warnings (examples):
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:128 struct vmw_bo; error:
Cannot parse struct or union!
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:151 struct member 'used_prio'
not described in 'vmw_resource'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:151 struct member 'mob_node'
not described in 'vmw_resource'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:199 bad line: SM4 device.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:270 struct member 'private'
not described in 'vmw_res_cache_entry'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:280 Enum value
'vmw_dma_alloc_coherent' not described in enum 'vmw_dma_map_mode'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:280 Enum value
'vmw_dma_map_bind' not described in enum 'vmw_dma_map_mode'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:295 struct member 'addrs'
not described in 'vmw_sg_table'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:295 struct member 'mode'
not described in 'vmw_sg_table'
vmwgfx_drv.h:309: warning: Excess struct member 'num_regions' description
in 'vmw_sg_table'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:402 struct member 'filp'
not described in 'vmw_sw_context'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:732 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:742 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:762 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:887 No description found for
return value of 'vmw_fifo_caps'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:901 No description found for
return value of 'vmw_is_cursor_bypass3_enabled'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:906 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:961 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:996 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1082 cannot understand
function prototype: 'const struct dma_buf_ops vmw_prime_dmabuf_ops;'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1303 struct member 'do_cpy'
not described in 'vmw_diff_cpy'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1385 function parameter 'fmt'
not described in 'VMW_DEBUG_KMS'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1389 This comment starts with
'/**', but isn't a kernel-doc comment.
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1426 function parameter 'vmw'
not described in 'vmw_fifo_mem_read'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1426 No description found for
return value of 'vmw_fifo_mem_read'
Warning: drivers/gpu/drm/vmwgfx/vmwgfx_drv.h:1441 function parameter
'fifo_reg' not described in 'vmw_fifo_mem_write'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zack Rusin <zack.rusin@broadcom.com>
Link: https://patch.msgid.link/20260219215548.470810-1-rdunlap@infradead.org
Presently, if the force feedback initialisation fails when probing the
Logitech G920 Driving Force Racing Wheel for Xbox One, an error number
will be returned and propagated before the userspace infrastructure
(sysfs and /dev/input) has been torn down. If userspace ignores the
errors and continues to use its references to these dangling entities, a
UAF will promptly follow.
We have 2 options: continue to return the error but ensure that all of
the infrastructure is torn down accordingly, or continue to treat this
condition as a warning by emitting the message but returning success.
It is thought that the original author's intention was to emit the
warning but keep the device functional, less the force feedback feature,
so let's go with that.
Signed-off-by: Lee Jones <lee@kernel.org>
Reviewed-by: Günther Noack <gnoack@google.com>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
Right now the returned value is considered to be always valid. However,
when playing with HID-BPF, the return value can be arbitrarily big,
because it's the return value of dispatch_hid_bpf_raw_requests(), which
calls the struct_ops and we have no guarantees that the value makes
sense.
Fixes: 8bd0488b5e ("HID: bpf: add HID-BPF hooks for hid_hw_raw_requests")
Cc: stable@vger.kernel.org
Acked-by: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
The memset() in hid_report_raw_event() has the good intention of
clearing out bogus data by zeroing the area from the end of the incoming
data string to the assumed end of the buffer. However, as we have
previously seen, doing so can easily result in OOB reads and writes in
the subsequent thread of execution.
The current suggestion from one of the HID maintainers is to remove the
memset() and simply return if the incoming event buffer size is not
large enough to fill the associated report.
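The suggested early return can be sketched as a plain size comparison. The helper name and the -EINVAL error code are assumptions for this example; the message above does not say which value is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reject an event that is shorter than the report it should fill. */
static int check_event_size(size_t event_size, size_t report_size)
{
	if (event_size < report_size)
		return -EINVAL;	/* assumed error code; no zero-padding */
	return 0;
}
```

With this shape, nothing past the end of the incoming data is written, because a short event is never padded out to the report size.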
Suggested-by: Benjamin Tissoires <bentiss@kernel.org>
Signed-off-by: Lee Jones <lee@kernel.org>
[bentiss: changed the return value]
Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
When 1-bit bus width is used with HS200/HS400 capabilities set,
mmc_select_hs200() returns 0 without actually switching. This
causes mmc_select_timing() to skip mmc_select_hs(), leaving eMMC
in legacy mode (26MHz) instead of High Speed SDR (52MHz).
Per JEDEC eMMC spec section 5.3.2, 1-bit mode supports High Speed
SDR. Drop incompatible HS200/HS400/UHS/DDR caps early so timing
selection falls through to mmc_select_hs() correctly.
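A minimal sketch of dropping the incompatible capabilities up front. The flag names and values are hypothetical stand-ins for the host capability bits, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability flags for this sketch. */
#define CAP_4_BIT_DATA (1u << 0)
#define CAP_8_BIT_DATA (1u << 1)
#define CAP_HS         (1u << 2)
#define CAP_DDR        (1u << 3)
#define CAP_HS200      (1u << 4)
#define CAP_HS400      (1u << 5)
#define CAP_UHS        (1u << 6)

/* On a 1-bit bus, keep only the timings that 1-bit mode can use. */
static uint32_t drop_wide_bus_caps(uint32_t caps)
{
	if (!(caps & (CAP_4_BIT_DATA | CAP_8_BIT_DATA)))
		caps &= ~(CAP_DDR | CAP_HS200 | CAP_HS400 | CAP_UHS);
	return caps;
}
```

Once the HS200/HS400 bits are gone, timing selection no longer sees a mode it cannot switch to and proceeds to High Speed SDR.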
Fixes: f2119df6b7 ("mmc: sd: add support for signal voltage switch procedure")
Signed-off-by: Luke Wang <ziniu.wang_1@nxp.com>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
setup_fifo_params computes mode_changed from spi->mode flags but tests
it against SE_SPI_CPHA and SE_SPI_CPOL, which are register offsets,
not SPI mode bits. This causes CPHA and CPOL updates to be skipped
on mode switches, leaving the controller with stale clock phase
and polarity settings.
Fix this by using SPI_CPHA and SPI_CPOL to detect mode changes before
updating the corresponding registers.
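The effect of testing against a register offset instead of a mode bit can be shown in isolation. In this sketch the SPI_CPHA value follows the usual SPI mode flag definition, while the register offset value and both helper names are assumptions used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPI_CPHA        0x01u	/* SPI mode flag: clock phase */
#define REG_SE_SPI_CPHA 0x224u	/* register offset (assumed value) */

/* Buggy test: masks the changed bits with a register offset. */
static bool cpha_changed_buggy(uint32_t old_mode, uint32_t new_mode)
{
	uint32_t mode_changed = old_mode ^ new_mode;

	return (mode_changed & REG_SE_SPI_CPHA) != 0;
}

/* Fixed test: masks the changed bits with the mode flag itself. */
static bool cpha_changed_fixed(uint32_t old_mode, uint32_t new_mode)
{
	uint32_t mode_changed = old_mode ^ new_mode;

	return (mode_changed & SPI_CPHA) != 0;
}
```

A switch from mode 0 to a mode with CPHA set goes undetected by the first helper, which is how the stale phase setting survives.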
Fixes: 781c3e71c9 ("spi: spi-geni-qcom: rework setup_fifo_params")
Signed-off-by: Maramaina Naresh <naresh.maramaina@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Link: https://patch.msgid.link/20260316-spi-geni-cpha-cpol-fix-v1-1-4cb44c176b79@oss.qualcomm.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Currently HID_PHYS is empty, which means userspace tools (e.g. fwupd)
that depend on it for distinguishing the devices, are unable to do so.
Other drivers like i2c-hid, usbhid, surface-hid, all populate it.
With this change it's set to, for example: HID_PHYS=0000:00:10.0
Each function has just a single HID device, as far as I can tell, so
there is no need to add a suffix.
Tested with fwupd 2.1.1, can avoid https://github.com/fwupd/fwupd/pull/9995
Cc: Even Xu <even.xu@intel.com>
Cc: Xinpeng Sun <xinpeng.sun@intel.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Benjamin Tissoires <bentiss@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Daniel Schaefer <git@danielschaefer.me>
Reviewed-by: Even Xu <even.xu@intel.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
An XFRM_MSG_NEWSPDINFO request can queue the per-net work item
policy_hthresh.work onto the system workqueue.
The queued callback, xfrm_hash_rebuild(), retrieves the enclosing
struct net via container_of(). If the net namespace is torn down
before that work runs, the associated struct net may already have
been freed, and xfrm_hash_rebuild() may then dereference stale memory.
xfrm_policy_fini() already flushes policy_hash_work during teardown,
but it does not synchronize policy_hthresh.work.
Synchronize policy_hthresh.work in xfrm_policy_fini() as well, so the
queued work cannot outlive the net namespace teardown and access a
freed struct net.
Fixes: 880a6fab8f ("xfrm: configure policy hash table thresholds by netlink")
Signed-off-by: Minwoo Ra <raminwo0202@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
intel_dmc_update_dc6_allowed_count() oopses when DMC hasn't been
initialized, and dmc is thus NULL.
That would be the case when the call path is
intel_power_domains_init_hw() -> {skl,bxt,icl}_display_core_init() ->
gen9_set_dc_state() -> intel_dmc_update_dc6_allowed_count(), as
intel_power_domains_init_hw() is called *before* intel_dmc_init().
However, gen9_set_dc_state() calls intel_dmc_update_dc6_allowed_count()
conditionally, depending on the current and target DC states. At probe,
the target is disabled, but if DC6 is enabled, the function is called,
and an oops follows. Apparently it's quite unlikely that DC6 is enabled
at probe, as we haven't seen this failure mode before.
It is also strange to have DC6 enabled at boot, since that would require
the DMC firmware (loaded by BIOS); the BIOS loading the DMC firmware and
the driver stopping / reprogramming the firmware is a poorly specified
sequence and as such unlikely an intentional BIOS behaviour. It's more
likely that BIOS is leaving an unintentionally enabled DC6 HW state
behind (without actually loading the required DMC firmware for this).
The tracking of the DC6 allowed counter only works if starting /
stopping the counter depends on the _SW_ DC6 state vs. the current _HW_
DC6 state (since stopping the counter requires the DC5 counter captured
when the counter was started). Thus, using the HW DC6 state is incorrect
and it also leads to the above oops. Fix both issues by using the SW DC6
state for the tracking.
This is v2 of the fix originally sent by Jani, updated based on the
first Link: discussion below.
Link: https://lore.kernel.org/all/3626411dc9e556452c432d0919821b76d9991217@intel.com
Link: https://lore.kernel.org/all/20260228130946.50919-2-ltao@redhat.com
Fixes: 88c1f9a4d3 ("drm/i915/dmc: Create debugfs entry for dc6 counter")
Cc: Mohammed Thasleem <mohammed.thasleem@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Tao Liu <ltao@redhat.com>
Cc: <stable@vger.kernel.org> # v6.16+
Tested-by: Tao Liu <ltao@redhat.com>
Reviewed-by: Jani Nikula <jani.nikula@intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patch.msgid.link/20260309164803.1918158-1-imre.deak@intel.com
(cherry picked from commit 2344b93af8eb5da5d496b4e0529d35f0f559eaf0)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Most VM feature detection is done with integer OR operations, but a
plain integer assignment clears the bits set by the preceding OR
operations. So change all integer assignments to integer OR operations.
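The failure mode is easy to reproduce outside the kernel. In this sketch the feature bits and function names are hypothetical; only the OR-versus-assignment pattern reflects the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits for this sketch. */
#define FEAT_A (1u << 0)
#define FEAT_B (1u << 1)

/* Mixed style: the assignment discards what was ORed in before it. */
static uint32_t detect_with_assignment(bool has_a, bool has_b)
{
	uint32_t features = 0;

	if (has_a)
		features |= FEAT_A;
	if (has_b)
		features = FEAT_B;	/* bug: overwrites FEAT_A */
	return features;
}

/* Consistent style: every detection step accumulates with OR. */
static uint32_t detect_with_or(bool has_a, bool has_b)
{
	uint32_t features = 0;

	if (has_a)
		features |= FEAT_A;
	if (has_b)
		features |= FEAT_B;
	return features;
}
```

With both features present, the first variant reports only FEAT_B while the second reports both.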
Fixes: 82db90bf46 ("LoongArch: KVM: Move feature detection in kvm_vm_init_features()")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Occasionally "text_copy_cb: operation failed" is reported when executing
the bpf selftests. The reason is that copy_to_kernel_nofault() fails and
the ecode of the ESTAT register is 0x4 (PME: Page Modification Exception)
because the pte is not writeable. The root cause is that another place,
the generic weak version of arch_protect_bpf_trampoline(), sets the pte
entry as readonly.
There are two ways to fix this race condition: the direct way is to
modify the generic weak arch_protect_bpf_trampoline() to add a mutex
lock for set_memory_rox(), but the simpler and proper way is to just
make arch_protect_bpf_trampoline() return 0 in the arch-specific code,
because LoongArch already uses the BPF prog pack allocator for
trampolines.
Here are the trimmed kernel log messages:
copy_to_kernel_nofault: memory access failed, ecode 0x4
copy_to_kernel_nofault: the caller is text_copy_cb+0x50/0xa0
text_copy_cb: operation failed
------------[ cut here ]------------
bpf_prog_pack bug: missing bpf_arch_text_invalidate?
WARNING: kernel/bpf/core.c:1008 at bpf_prog_pack_free+0x200/0x228
...
Call Trace:
[<9000000000248914>] show_stack+0x64/0x188
[<9000000000241308>] dump_stack_lvl+0x6c/0x9c
[<90000000002705bc>] __warn+0x9c/0x200
[<9000000001c428c0>] __report_bug+0xa8/0x1c0
[<9000000001c42b5c>] report_bug+0x64/0x120
[<9000000001c7dcd0>] do_bp+0x270/0x3c0
[<9000000000246f40>] handle_bp+0x120/0x1c0
[<900000000047b030>] bpf_prog_pack_free+0x200/0x228
[<900000000047b2ec>] bpf_jit_binary_pack_free+0x24/0x60
[<900000000026989c>] bpf_jit_free+0x54/0xb0
[<900000000029e10c>] process_one_work+0x184/0x610
[<900000000029ef8c>] worker_thread+0x24c/0x388
[<90000000002a902c>] kthread+0x13c/0x170
[<9000000001c7dfe8>] ret_from_kernel_thread+0x28/0x1c0
[<9000000000246624>] ret_from_kernel_thread_asm+0xc/0x88
---[ end trace 0000000000000000 ]---
Here is a simple shell script to reproduce:
#!/bin/bash
for ((i=1; i<=1000; i++))
do
echo "Under testing $i ..."
dmesg -c > /dev/null
./test_progs -t fentry_attach_stress > /dev/null
dmesg -t | grep "text_copy_cb: operation failed"
if [ $? -eq 0 ]; then
break
fi
done
Cc: stable@vger.kernel.org
Fixes: 4ab17e762b ("LoongArch: BPF: Use BPF prog pack allocator")
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
set_memory_rw() and set_memory_rox() may fail, so we should check the
return values and return immediately in larch_insn_text_copy().
Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
If a memory access such as copy_{from, to}_kernel_nofault() fails, its
users do not know what happened, so it is very useful to print the
exception code in such cases. Furthermore, it is better to print the
caller function to show where the access came from.
Here are the low level call chains:
copy_from_kernel_nofault()
copy_from_kernel_nofault_loop()
__get_kernel_nofault()
copy_to_kernel_nofault()
copy_to_kernel_nofault_loop()
__put_kernel_nofault()
Cc: stable@vger.kernel.org
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Fix the warning:
BUG: using smp_processor_id() in preemptible [00000000] code: systemd/1
caller is larch_insn_text_copy+0x40/0xf0
Simply changing it to raw_smp_processor_id() is not enough: if preemption
and CPU hotplug happen after raw_smp_processor_id() but before the call
to stop_machine(), the CPU on which raw_smp_processor_id() ran may be
offline by the time stop_machine() runs, and no CPU will run
copy_to_kernel_nofault() in text_copy_cb(). Thus guard the
larch_insn_text_copy() calls with cpus_read_lock() and change
stop_machine() to stop_machine_cpuslocked() to prevent this.
I've considered moving the locks inside larch_insn_text_copy() but
doing so seems not an easy hack. In bpf_arch_text_poke() obviously the
memcpy() call must be guarded by text_mutex, so we have to leave the
acquire of text_mutex out of larch_insn_text_copy(). But in the entire
kernel the acquire of mutexes is always after cpus_read_lock(), so we
cannot put cpus_read_lock() into larch_insn_text_copy() while leaving
the text_mutex acquire out (or we risk a deadlock due to inconsistent
lock acquire order). So let's fix the bug first and leave the possible
refactor as future work.
Fixes: 9fbd18cf4c ("LoongArch: BPF: Add dynamic code modification support")
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
The 128-bit atomic cmpxchg implementation uses the SC.Q instruction.
Older versions of GNU AS do not support that instruction, erroring out:
ERROR:root:{standard input}: Assembler messages:
{standard input}:4831: Error: no match insn: sc.q $t0,$t1,$r14
{standard input}:6407: Error: no match insn: sc.q $t0,$t1,$r23
{standard input}:10856: Error: no match insn: sc.q $t0,$t1,$r14
make[4]: *** [../scripts/Makefile.build:289: mm/slub.o] Error 1
(Binutils 2.41)
So test support for SC.Q in Kconfig and disable the atomics if the
instruction is not available.
Fixes: f0e4b1b6e2 ("LoongArch: Add 128-bit atomic cmpxchg support")
Closes: https://lore.kernel.org/lkml/20260216082834-edc51c46-7b7a-4295-8ea5-4d9a3ca2224f@linutronix.de/
Reviewed-by: Xi Ruoyao <xry111@xry111.site>
Acked-by: Hengqi Chen <hengqi.chen@gmail.com>
Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Customer reported that some of their krb5 mounts were failing against
a single server as the client was trying to mount the shares with
wrong credentials. It turned out the client was reusing SMB session
from first mount to try mounting the other shares, even though a
different username= option had been specified to the other mounts.
Using the username mount option along with sec=krb5 to search for
principals in the keytab has been supported by cifs.upcall(8) since
cifs-utils-4.8. So fix this by matching the username mount option in
match_session() even with Kerberos.
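The intended matching rule can be sketched as below. The structures and the helper name are simplified, hypothetical stand-ins, and the assumption that Kerberos sessions were previously matched on the credential uid alone is part of this illustration rather than a statement about the actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified views of a session and a mount request. */
struct session {
	unsigned int cred_uid;
	const char *user_name;
};

struct mount_ctx {
	unsigned int cred_uid;
	const char *username;	/* NULL if no username= option was given */
};

/* Kerberos match: same credential uid and, if given, same username. */
static bool krb5_session_matches(const struct session *ses,
				 const struct mount_ctx *ctx)
{
	assert(ses && ctx);
	if (ses->cred_uid != ctx->cred_uid)
		return false;
	if (!ctx->username)
		return true;
	return ses->user_name && strcmp(ses->user_name, ctx->username) == 0;
}
```

Under this rule a mount with username=foobar does not reuse a session established for testuser, so it goes on to request its own credentials.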
For example, the second mount below should fail with -ENOKEY as there
is no 'foobar' principal in keytab (/etc/krb5.keytab). The client
ends up reusing SMB session from first mount to perform the second
one, which is wrong.
```
$ ktutil
ktutil: add_entry -password -p testuser -k 1 -e aes256-cts
Password for testuser@ZELDA.TEST:
ktutil: write_kt /etc/krb5.keytab
ktutil: quit
$ klist -ke
Keytab name: FILE:/etc/krb5.keytab
KVNO Principal
---- ----------------------------------------------------------------
1 testuser@ZELDA.TEST (aes256-cts-hmac-sha1-96)
$ mount.cifs //w22-root2/scratch /mnt/1 -o sec=krb5,username=testuser
$ mount.cifs //w22-root2/scratch /mnt/2 -o sec=krb5,username=foobar
$ mount -t cifs | grep -Po 'username=\K\w+'
testuser
testuser
```
Reported-by: Oscar Santos <ossantos@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
ctlr is allocated using devm_spi_alloc_host(), which automatically
handles reference counting via the devm framework.
Calling spi_controller_put() manually in the probe error path is
redundant and results in a double-free.
Fixes: e75a6b00ad ("spi: axiado: Add driver for Axiado SPI DB controller")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260302-axiado-v1-1-1132819f1cb7@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Add an ACP70 SoundWire machine entry for ASUS PX13
(HN7306EA/HN7306EAC) with rt721 and two TAS2783 amps on link1.
Describe rt721 with jack/DMIC endpoints on this platform and add
explicit left/right TAS2783 speaker endpoint mapping via name prefixes.
Signed-off-by: Hasun Park <hasunpark@gmail.com>
Link: https://patch.msgid.link/20260308151654.29059-3-hasunpark@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Some ASUS ProArt PX13 systems expose ACP ACPI config flags that can
select a non-working fallback path.
Add a DMI override in snd_amd_acp_find_config() for ACP70+ boards and
return 0 so ACP ACPI flag-based selection is skipped on this platform.
This keeps machine driver selection on the intended SoundWire path.
Signed-off-by: Hasun Park <hasunpark@gmail.com>
Link: https://patch.msgid.link/20260308151654.29059-2-hasunpark@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
A previous change added NULL checks and cleanup for allocation
failures in sma1307_setting_loaded().
However, the cleanup for mode_set entries is wrong. Those entries are
allocated with devm_kzalloc(), so they are device-managed resources and
must not be freed with kfree(). Manually freeing them in the error path
can lead to a double free when devres later releases the same memory.
Drop the manual kfree() loop and let devres handle the cleanup.
Fixes: 0ec6bd1670 ("ASoC: sma1307: Add NULL check in sma1307_setting_loaded()")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260313040611.391479-1-lgs201920130244@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
In aml_spisg_probe(), ctlr is allocated by
spi_alloc_target()/spi_alloc_host(), but the driver fails to call
spi_controller_put() in several error paths. This leads
to a memory leak whenever the driver fails to probe after
the initial allocation.
Convert to use devm_spi_alloc_host()/devm_spi_alloc_target()
to fix the memory leak.
Fixes: cef9991e04 ("spi: Add Amlogic SPISG driver")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260308-spisg-v1-1-2cace5cafc24@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The driver uses devm_clk_get_enabled() which enables the clock and
registers a callback to automatically disable it when the device
is unbound.
Remove the redundant aml_sfc_disable_clk() call in the error paths
and remove callback.
Fixes: 4670db6f32 ("spi: amlogic: add driver for Amlogic SPI Flash Controller")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260308-spifc-a4-1-v1-1-77e286c26832@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Pull SCSI fixes from James Bottomley:
"The one core change is a re-roll of the tag allocation fix from the
last pull request that uses the correct goto to unroll all the
allocations. The remaining fixes are all small ones in drivers"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: hisi_sas: Fix NULL pointer exception during user_scan()
scsi: qla2xxx: Completely fix fcport double free
scsi: ufs: core: Fix SError in ufshcd_rtc_work() during UFS suspend
scsi: core: Fix error handling for scsi_alloc_sdev()
Pull probes fixes from Masami Hiramatsu:
- Avoid crash when rmmod/insmod after ftrace killed
This fixes a kernel crash caused by kprobes on the symbol in a module
which is unloaded after ftrace_kill() is called.
- Remove unneeded warnings from __arm_kprobe_ftrace()
Remove unneeded WARN messages which can be triggered if the kprobe is
using ftrace and it fails to enable the ftrace. Since kprobes
correctly handle such failure, we don't need to warn it.
* tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
kprobes: avoid crash when rmmod/insmod after ftrace killed
Pull bootconfig fixes from Masami Hiramatsu:
- fix off-by-one in xbc_verify_tree() unclosed brace error. This fixes
the wrong error position reported in the unclosed brace error message
- check bounds before writing in __xbc_open_brace(). This checks
the array index before writing to the array, so that bootconfig
can correctly support braces nested 16 levels deep
- fix snprintf truncation check in xbc_node_compose_key_after(). This
handles the return value of snprintf() correctly when
the return value == size
- Add bootconfig tests about braces. Add test cases that check the error
position for an unclosed brace and ensure that braces nested 16 levels
deep are supported correctly
* tag 'bootconfig-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
bootconfig: Add bootconfig tests about braces
lib/bootconfig: fix snprintf truncation check in xbc_node_compose_key_after()
lib/bootconfig: check bounds before writing in __xbc_open_brace()
lib/bootconfig: fix off-by-one in xbc_verify_tree() unclosed brace error
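The snprintf() truncation case mentioned above (return value == size) can be sketched in plain userspace C. This is an illustration only, not the code in xbc_node_compose_key_after(); the helper name copy_word() and its return convention are assumptions made for the example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not the bootconfig code: format `word` into `buf`
 * and report truncation. Returns the number of characters written
 * (excluding the NUL), or -1 if `word` did not fit or formatting failed.
 */
int copy_word(char *buf, size_t size, const char *word)
{
	int ret = snprintf(buf, size, "%s", word);

	if (ret < 0)
		return -1;	/* output error */

	/*
	 * snprintf() returns the length the complete string would have had,
	 * not counting the terminating NUL. ret == size therefore already
	 * means the last character was dropped to make room for the NUL,
	 * so the check must be ">=" and not ">".
	 */
	if ((size_t)ret >= size)
		return -1;

	return ret;
}
```

With a 4-byte buffer, "abc" fits (3 characters plus the NUL), while "abcd" yields a return value of 4 from snprintf() and has to be treated as truncated.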
Pull kvm fixes from Paolo Bonzini:
"Quite a large pull request, partly due to skipping last week and
therefore having material from ~all submaintainers in this one. About
a fourth of it is a new selftest, and a couple more changes are large
in number of files touched (fixing a -Wflex-array-member-not-at-end
compiler warning) or lines changed (reformatting of a table in the API
documentation, thanks rST).
But who am I kidding---it's a lot of commits and there are a lot of
bugs being fixed here, some of them on the nastier side like the
RISC-V ones.
ARM:
- Correctly handle deactivation of interrupts that were activated
from LRs. Since EOIcount only denotes deactivation of interrupts
that are not present in an LR, start EOIcount deactivation walk
*after* the last irq that made it into an LR
- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when pKVM
is already enabled -- not only this isn't possible (pKVM will
reject the call), but it is also useless: this can only happen for
a CPU that has already booted once, and the capability will not
change
- Fix a couple of low-severity bugs in our S2 fault handling path,
affecting the recently introduced LS64 handling and the even more
esoteric handling of hwpoison in a nested context
- Address yet another syzkaller finding in the vgic initialisation,
where we would end-up destroying an uninitialised vgic with nasty
consequences
- Address an annoying case of pKVM failing to boot when some of the
memblock regions that the host is faulting in are not page-aligned
- Inject some sanity in the NV stage-2 walker by checking the limits
against the advertised PA size, and correctly report the resulting
faults
PPC:
- Fix a PPC e500 build error due to a long-standing wart that was
exposed by the recent conversion to kmalloc_obj(); rip out all the
ugliness that led to the wart
RISC-V:
- Prevent speculative out-of-bounds access using array_index_nospec()
in APLIC interrupt handling, ONE_REG register access, AIA CSR
access, float register access, and PMU counter access
- Fix potential use-after-free issues in kvm_riscv_gstage_get_leaf(),
kvm_riscv_aia_aplic_has_attr(), and kvm_riscv_aia_imsic_has_attr()
- Fix potential null pointer dereference in
kvm_riscv_vcpu_aia_rmw_topei()
- Fix off-by-one array access in SBI PMU
- Skip THP support check during dirty logging
- Fix error code returned for Smstateen and Ssaia ONE_REG interface
- Check host Ssaia extension when creating AIA irqchip
x86:
- Fix cases where CPUID mitigation features were incorrectly marked
as available whenever the kernel used scattered feature words for
them
- Validate _all_ GVAs, rather than just the first GVA, when
processing a range of GVAs for Hyper-V's TLB flush hypercalls
- Fix a brown paper bug in add_atomic_switch_msr()
- Use hlist_for_each_entry_srcu() when traversing mask_notifier_list,
to fix a lockdep warning; KVM doesn't hold RCU, just irq_srcu
- Ensure AVIC VMCB fields are initialized if the VM has an in-kernel
local APIC (and AVIC is enabled at the module level)
- Update CR8 write interception when AVIC is (de)activated, to fix a
bug where the guest can run in perpetuity with the CR8 intercept
enabled
- Add a quirk to skip the consistency check on FREEZE_IN_SMM, i.e. to
allow L1 hypervisors to set FREEZE_IN_SMM. This reverts (by
default) an unintentional tightening of userspace ABI in 6.17, and
provides some amount of backwards compatibility with hypervisors
who want to freeze PMCs on VM-Entry
- Validate the VMCS/VMCB on return to a nested guest from SMM,
because either userspace or the guest could stash invalid values in
memory and trigger the processor's consistency checks
Generic:
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from
being unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end
- Document that vcpu->mutex is taken outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite
being rather unintuitive
Selftests:
- Increase the maximum number of NUMA nodes in the guest_memfd
selftest to 64 (from 8)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (43 commits)
KVM: selftests: Verify SEV+ guests can read and write EFER, CR0, CR4, and CR8
Documentation: kvm: fix formatting of the quirks table
KVM: x86: clarify leave_smm() return value
selftests: kvm: add a test that VMX validates controls on RSM
selftests: kvm: extract common functionality out of smm_test.c
KVM: SVM: check validity of VMCB controls when returning from SMM
KVM: VMX: check validity of VMCS controls when returning from SMM
KVM: SVM: Set/clear CR8 write interception when AVIC is (de)activated
KVM: SVM: Initialize AVIC VMCB fields if AVIC is enabled with in-kernel APIC
KVM: x86: Introduce KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM
KVM: x86: Fix SRCU list traversal in kvm_fire_mask_notifiers()
KVM: VMX: Fix a wrong MSR update in add_atomic_switch_msr()
KVM: x86: hyper-v: Validate all GVAs during PV TLB flush
KVM: x86: synthesize CPUID bits only if CPU capability is set
KVM: PPC: e500: Rip out "struct tlbe_ref"
KVM: PPC: e500: Fix build error due to using kmalloc_obj() with wrong type
KVM: selftests: Increase 'maxnode' for guest_memfd tests
KVM: arm64: pkvm: Don't reprobe for ICH_VTR_EL2.TDS on CPU hotplug
KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tail
KVM: arm64: Remove the redundant ISB in __kvm_at_s1e2()
...
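The RISC-V items in the KVM pull above rely on array_index_nospec() to clamp untrusted indices. As a rough userspace model of the idea (a branchless mask that is all-ones for an in-bounds index and zero otherwise), consider the sketch below. The names index_mask(), clamp_index() and read_reg() are hypothetical. The kernel helper additionally hides the index from the optimizer and has per-architecture variants, so this model provides no speculation guarantee by itself.

```c
#include <limits.h>
#include <stddef.h>

/*
 * Userspace model of a branchless bounds mask: returns ~0UL when
 * index < size and 0 otherwise. Assumes size <= LONG_MAX and that
 * right-shifting a negative long is an arithmetic shift (true for GCC
 * and Clang, implementation-defined in ISO C).
 */
unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Force an out-of-bounds index to 0 without a conditional branch. */
unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}

/*
 * Hypothetical register-file read: the architectural bounds check rejects
 * bad indices, and the clamp keeps the index in range even if the check
 * were bypassed speculatively.
 */
int read_reg(const int *regs, unsigned long nregs, unsigned long idx, int *out)
{
	if (idx >= nregs)
		return -1;
	idx = clamp_index(idx, nregs);
	*out = regs[idx];
	return 0;
}
```

The pattern is always the same: keep the ordinary bounds check, then sanitize the index once more before it is used to address memory.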
Pull powerpc fixes from Madhavan Srinivasan:
- Fix KUAP warning in VMX usercopy path
- Fix lockdep warning during PCI enumeration
- Fix to move CMA reservations to arch_mm_preinit
- Fix to check current->mm is alive before getting user callchain
Thanks to Aboorva Devarajan, Christophe Leroy (CS GROUP), Dan Horák,
Nicolin Chen, Nilay Shroff, Qiao Zhao, Ritesh Harjani (IBM), Saket Kumar
Bhaskar, Sayali Patil, Shrikanth Hegde, Venkat Rao Bagalkote, and Viktor
Malik.
* tag 'powerpc-7.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/iommu: fix lockdep warning during PCI enumeration
powerpc/selftests/copyloops: extend selftest to exercise __copy_tofrom_user_power7_vmx
powerpc: fix KUAP warning in VMX usercopy path
powerpc, perf: Check that current->mm is alive before getting user callchain
powerpc/mem: Move CMA reservations to arch_mm_preinit
Pull x86 fix from Ingo Molnar:
"Work around S2RAM hang if the firmware unexpectedly re-enables the
x2apic hardware while it was disabled by the kernel.
Force-disable it again and issue a warning into the syslog"
* tag 'x86-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Disable x2apic on resume if the kernel expects so
Pull timer fix from Ingo Molnar:
"Fix function tracer recursion bug by marking jiffies_64_to_clock_t()
notrace"
* tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/jiffies: Mark jiffies_64_to_clock_t() notrace
Pull scheduler fixes from Ingo Molnar:
"More MM-CID fixes, mostly fixing hangs/races:
- Fix CID hangs due to a race between concurrent forks
- Fix vfork()/CLONE_VM MMCID bug causing hangs
- Remove pointless preemption guard
- Fix CID task list walk performance regression on large systems
by removing the known-flaky and slow counting logic using
for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and
implementing a simple sched_mm_cid::node list instead"
* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Avoid full tasklist walks
sched/mmcid: Remove pointless preempt guard
sched/mmcid: Handle vfork()/CLONE_VM correctly
sched/mmcid: Prevent CID stalls due to concurrent forks
Pull objtool fixes from Ingo Molnar:
- Fix cross-build bug by using HOSTCFLAGS for HAVE_XXHASH test
- Fix klp bug by fixing detection of corrupt static branch/call entries
- Handle unsupported pr_debug() usage more gracefully
- Fix hypothetical klp bug by avoiding NULL pointer dereference when
printing code symbol name
- Fix data alignment bug in elf_add_data() causing mangled strings
- Fix confusing ERROR_INSN() error message
- Handle unexpected Clang RSP musical chairs causing false positive
warnings
- Fix another objtool stack overflow in validate_branch()
* tag 'objtool-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix another stack overflow in validate_branch()
objtool: Handle Clang RSP musical chairs
objtool: Fix ERROR_INSN() error message
objtool: Fix data alignment in elf_add_data()
objtool: Use HOSTCFLAGS for HAVE_XXHASH test
objtool/klp: Avoid NULL pointer dereference when printing code symbol name
objtool/klp: Disable unsupported pr_debug() usage
objtool/klp: Fix detection of corrupt static branch/call entries
Pull irq fixes from Ingo Molnar:
"Two fixes for the riscv-aplic irqchip driver:
- Fix probing dependency bug on probing failure
- Fix double register_syscore() bug"
* tag 'irq-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-aplic: Register syscore operations only once
irqchip/riscv-aplic: Do not clear ACPI dependencies on probe failure
Our vcpu reset suffers from a particularly interesting flaw, as it
does not correctly deal with state that will have an effect on the
execution flow out of reset.
Take the following completely random example, never seen in the wild
and that never resulted in a couple of sleepless nights: /s
- vcpu-A issues a PSCI_CPU_OFF using the SMC conduit
- SMC being a trapped instruction (as opposed to HVC which is always
normally executed), we annotate the vcpu as needing to skip the
next instruction, which is the SMC itself
- vcpu-A is now safely off
- vcpu-B issues a PSCI_CPU_ON for vcpu-A, providing a starting PC
- vcpu-A gets reset, gets the new PC, and is sent on its merry way
- right at the point of entering the guest, we notice that a PC
increment is pending (remember the earlier SMC?)
- vcpu-A skips its first instruction...
What could possibly go wrong?
Well, I'm glad you asked. For pKVM as a NV guest, that first instruction
is extremely significant, as it indicates whether the CPU is booting
or resuming. Having skipped that instruction, nothing makes any sense
anymore, and CPU hotplugging fails.
This is all caused by the decoupling of PC update from the handling
of an exception that triggers such update, making it non-obvious
what affects what when.
Fix this train wreck by discarding all the PC-affecting state on
vcpu reset.
Fixes: f5e3068061 ("KVM: arm64: Move __adjust_pc out of line")
Cc: stable@vger.kernel.org
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260312140850.822968-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
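The PSCI_CPU_OFF/PSCI_CPU_ON sequence described in the commit above can be replayed with a toy state machine. Everything here is a hypothetical model (struct toy_vcpu, the incr_pc_pending flag, the 4-byte instruction size), not KVM code; it only shows why a stale "skip the next instruction" request must be discarded on reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy vcpu: all names are hypothetical, not KVM's. */
struct toy_vcpu {
	uint64_t pc;
	bool incr_pc_pending;	/* deferred "skip the trapped instruction" */
	bool powered_on;
};

/* Trapped SMC carrying PSCI_CPU_OFF: defer the PC increment, power off. */
void toy_cpu_off(struct toy_vcpu *v)
{
	v->incr_pc_pending = true;
	v->powered_on = false;
}

/* Reset on PSCI_CPU_ON. `discard` selects the fixed behaviour. */
void toy_reset(struct toy_vcpu *v, uint64_t entry, bool discard)
{
	v->pc = entry;
	v->powered_on = true;
	if (discard)
		v->incr_pc_pending = false;	/* drop PC-affecting state */
}

/* Guest entry: apply any deferred PC update, return the first PC executed. */
uint64_t toy_enter_guest(struct toy_vcpu *v)
{
	if (v->incr_pc_pending) {
		v->pc += 4;	/* assumed fixed 4-byte instructions */
		v->incr_pc_pending = false;
	}
	return v->pc;
}
```

Without the discard, a vcpu reset to entry point 0x1000 starts executing at 0x1004 and skips its first instruction; with the discard it starts at 0x1000.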
The assembly flush instructions were swapped for I- and D-cache flags:
SYSCALL_DEFINE3(cacheflush, ...)
{
if (cache & DCACHE) {
"fic ...\n"
}
if (cache & ICACHE && error == 0) {
"fdc ...\n"
}
Fix it by using fdc for DCACHE, and fic for ICACHE flushing.
Reported-by: Felix Lechner <felix.lechner@lease-up.com>
Fixes: c6d96328fe ("parisc: Add cacheflush() syscall")
Cc: <stable@vger.kernel.org> # v6.5+
Signed-off-by: Helge Deller <deller@gmx.de>
Pull i3c fixes from Alexandre Belloni:
"This introduces the I3C_OR_I2C symbol which is not a fix per se but is
affecting multiple subsystems so it is included to ease
synchronization.
Apart from that, Adrian is mostly fixing the mipi-i3c-hci driver DMA
handling, and I took the opportunity to add two fixes for the dw-i3c
driver.
Subsystem:
- simplify combined i3c/i2c dependencies
Drivers:
- dw: handle I2C properly, fix possible race condition
- mipi-i3c-hci: many DMA related fixes"
* tag 'i3c/fixes-for-7.0' of git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux:
i3c: dw-i3c-master: Set SIR_REJECT in DAT on device attach and reattach
i3c: master: dw-i3c: Fix missing of_node for virtual I2C adapter
i3c: mipi-i3c-hci: Fallback to software reset when bus disable fails
i3c: mipi-i3c-hci: Fix handling of shared IRQs during early initialization
i3c: mipi-i3c-hci: Fix race in DMA error handling in interrupt context
i3c: mipi-i3c-hci: Consolidate common xfer processing logic
i3c: mipi-i3c-hci: Restart DMA ring correctly after dequeue abort
i3c: mipi-i3c-hci: Add missing TID field to no-op command descriptor
i3c: mipi-i3c-hci: Correct RING_CTRL_ABORT handling in DMA dequeue
i3c: mipi-i3c-hci: Fix race between DMA ring dequeue and interrupt handler
i3c: mipi-i3c-hci: Fix race in DMA ring dequeue
i3c: mipi-i3c-hci: Fix race in DMA ring enqueue for parallel xfers
i3c: mipi-i3c-hci: Consolidate spinlocks
i3c: mipi-i3c-hci: Factor out DMA mapping from queuing path
i3c: mipi-i3c-hci: Fix Hot-Join NACK
i3c: mipi-i3c-hci: Use ETIMEDOUT instead of ETIME for timeout errors
i3c: simplify combined i3c/i2c dependencies
Pull Rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Remap paths to avoid absolute ones starting with the upcoming Rust
1.95.0 release. This improves build reproducibility, avoids leaking
the exact path and avoids having the same path appear in two forms
The approach here avoids remapping debug information as well, in
order to avoid breaking tools that used the paths to access source
files, which was the previous attempt that needed to be reverted
- Allow 'unused_features' lint for the upcoming Rust 1.96.0 release.
While well-intentioned, we do not benefit much from the new lint
- Emit dependency information into '$(depfile)' directly to avoid a
temporary '.d' file (it was an old approach)
'kernel' crate:
- 'str' module: fix warning under '!CONFIG_BLOCK' by making
'NullTerminatedFormatter' public
- 'cpufreq' module: suppress false positive Clippy warning
'pin-init' crate:
- Remove '#[disable_initialized_field_access]' attribute which was
unsound. This means removing the support for structs with unaligned
fields (through the 'repr(packed)' attribute), for now
And document the load-bearing fact of field accessors (i.e. that
they are required for soundness)
- Replace shadowed return token by 'unsafe'-to-create token in order
to remain sound in the face of the likely upcoming Type Alias Impl
Trait (TAIT) and the next trait solver in upstream Rust"
* tag 'rust-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: kbuild: allow `unused_features`
rust: cpufreq: suppress clippy::double_parens in Policy doctest
rust: pin-init: replace shadowed return token by `unsafe`-to-create token
rust: pin-init: internal: init: document load-bearing fact of field accessors
rust: pin-init: internal: init: remove `#[disable_initialized_field_access]`
rust: build: remap path to avoid absolute path
rust: kbuild: emit dep-info into $(depfile) directly
rust: str: make NullTerminatedFormatter public
Kevin Hao says:
====================
net: macb: Fix Ethernet malfunction on AMD Versal board after suspend
On Versal boards, the tx/rx queue pointer registers are cleared after suspend,
which causes Ethernet malfunction. This patch series addresses this issue by
reinitializing the tx/rx queue pointer registers and the rx ring.
====================
Link: https://patch.msgid.link/20260312-macb-versal-v1-0-467647173fa4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On certain platforms, such as AMD Versal boards, the tx/rx queue pointer
registers are cleared after suspend, and the rx queue pointer register
is also disabled during suspend if WOL is enabled. Previously, we assumed
that these registers would be restored by macb_mac_link_up(). However,
in commit bf9cf80cab, macb_init_buffers() was moved from
macb_mac_link_up() to macb_open(). Therefore, we should call
macb_init_buffers() to reinitialize the tx/rx queue pointer registers
during resume.
Due to the reset of these two registers, we also need to adjust the
tx/rx rings accordingly. The tx ring will be handled by
gem_shuffle_tx_rings() in macb_mac_link_up(), so we only need to
initialize the rx ring here.
Fixes: bf9cf80cab ("net: macb: Fix tx/rx malfunction after phy link down and up")
Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Tested-by: Quanyang Wang <quanyang.wang@windriver.com>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260312-macb-versal-v1-2-467647173fa4@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Page recycling was removed from the XDP_DROP path in emac_run_xdp() to
avoid conflicts with AF_XDP zero-copy mode, which uses xsk_buff_free()
instead.
However, this causes a memory leak when running XDP programs that drop
packets in non-zero-copy mode (standard page pool mode). The pages are
never returned to the page pool, leading to OOM conditions.
Fix this by handling cleanup in the caller, emac_rx_packet().
When emac_run_xdp() returns ICSSG_XDP_CONSUMED for XDP_DROP, the
caller now recycles the page back to the page pool. The zero-copy
path, emac_rx_packet_zc(), already handles cleanup correctly with
xsk_buff_free().
Fixes: 7a64bb388d ("net: ti: icssg-prueth: Add AF_XDP zero copy for RX")
Signed-off-by: Meghana Malladi <m-malladi@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260311095441.1691636-1-m-malladi@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On Zhaoxin processors, the XSHA1 instruction requires the total memory
allocated at the %rdi register to be 32 bytes, and the XSHA1 and
XSHA256 instructions do not perform any operation when %ecx is zero.
Due to these requirements, the current padlock-sha driver does not work
correctly with Zhaoxin processors. It fails the self-tests and is
therefore not activated on Zhaoxin processors. This issue
has been reported in Debian [1]. The self-tests fail with the
following messages [2]:
alg: shash: sha1-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha1 using sha1-padlock-nano failed (rc=-22)
alg: shash: sha256-padlock-nano test failed (wrong result) on test vector 0, cfg="init+update+final aligned buffer"
alg: self-tests for sha256 using sha256-padlock-nano failed (rc=-22)
Disable the padlock-sha driver on Zhaoxin processors with the CPU family
0x07 and newer. Following the suggestion in [3], support for PHE will be
added to lib/crypto/ instead.
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1113996
[2] https://linux-hardware.org/?probe=271fabb7a4&log=dmesg
[3] https://lore.kernel.org/linux-crypto/aUI4CGp6kK7mxgEr@gondor.apana.org.au/
Fixes: 63dc06cd12 ("crypto: padlock-sha - Use API partial block handling")
Cc: stable@vger.kernel.org
Signed-off-by: AlanSong-oc <AlanSong-oc@zhaoxin.com>
Link: https://lore.kernel.org/r/20260313080150.9393-2-AlanSong-oc@zhaoxin.com
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
A potential race condition exists in mana_hwc_destroy_channel() where
hwc->caller_ctx is freed before the HWC's Completion Queue (CQ) and
Event Queue (EQ) are destroyed. This allows an in-flight CQ interrupt
handler to dereference freed memory, leading to a use-after-free or
NULL pointer dereference in mana_hwc_handle_resp().
mana_smc_teardown_hwc() signals the hardware to stop but does not
synchronize against IRQ handlers already executing on other CPUs. The
IRQ synchronization only happens in mana_hwc_destroy_cq() via
mana_gd_destroy_eq() -> mana_gd_deregister_irq(). Since this runs
after kfree(hwc->caller_ctx), a concurrent mana_hwc_rx_event_handler()
can dereference freed caller_ctx (and rxq->msg_buf) in
mana_hwc_handle_resp().
Fix this by reordering teardown to reverse-of-creation order: destroy
the TX/RX work queues and CQ/EQ before freeing hwc->caller_ctx. This
ensures all in-flight interrupt handlers complete before the memory they
access is freed.
Fixes: ca9c54d2d6 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/abHA3AjNtqa1nx9k@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull USB fixes from Greg KH:
"Here is a large chunk of USB driver fixes for 7.0-rc4. Included in
here are:
- usb gadget reverts due to reported issues, and then a follow-on fix
to hopefully resolve the reported overall problem
- xhci driver fixes
- dwc3 driver fixes
- usb core "killable" bulk message api addition to fix a usbtmc
driver bug where userspace could hang the driver forever
- small USB driver fixes for reported issues
- new usb device quirks
All except the last USB device quirk change have been in linux-next
with no reported issues. That one came in too late, and is 'obviously
correct' :)"
* tag 'usb-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (35 commits)
USB: ezcap401 needs USB_QUIRK_NO_BOS to function on 10gbs usb speed
usb: roles: get usb role switch from parent only for usb-b-connector
Revert "tcpm: allow looking for role_sw device in the main node"
usb: gadget: f_ncm: Fix net_device lifecycle with device_move
Revert "usb: gadget: u_ether: add gether_opts for config caching"
Revert "usb: gadget: u_ether: use <linux/hex.h> header file"
Revert "usb: gadget: u_ether: Add auto-cleanup helper for freeing net_device"
Revert "usb: gadget: f_ncm: align net_device lifecycle with bind/unbind"
Revert "usb: legacy: ncm: Fix NPE in gncm_bind"
Revert "usb: gadget: f_ncm: Fix atomic context locking issue"
usb: typec: altmode/displayport: set displayport signaling rate in configure message
usb: dwc3: pci: add support for the Intel Nova Lake -H
usb/core/quirks: Add Huawei ME906S-device to wakeup quirk
usb: gadget: uvc: fix interval_duration calculation
xhci: Fix NULL pointer dereference when reading portli debugfs files
usb: xhci: Prevent interrupt storm on host controller error (HCE)
usb: xhci: Fix memory leak in xhci_disable_slot()
usb: class: cdc-wdm: fix reordering issue in read code path
usb: renesas_usbhs: fix use-after-free in ISR during device removal
usb: cdc-acm: Restore CAP_BRK functionnality to CH343
...
Pull char / misc / IIO driver fixes from Greg KH:
"Here are some char/misc/iio/binder fixes for 7.0-rc4. Nothing major in
here, just the usual:
- lots of iio driver fixes for reported issues
- rust binder fixes for problems found
- gpib driver binding to the wrong device fix
- firmware driver fix
All of these have been in linux-next with no reported issues"
* tag 'char-misc-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (28 commits)
gpib: lpvo_usb: fix unintended binding of FTDI 8U232AM devices
firmware: stratix10-svc: Add Multi SVC clients support
rust_binder: use lock_vma_under_rcu() in use_page_slow()
rust_binder: call set_notification_done() without proc lock
rust_binder: avoid reading the written value in offsets array
rust_binder: check ownership before using vma
rust_binder: fix oneway spam detection
firmware: stratix10-rsu: Fix NULL pointer dereference when RSU is disabled
iio: imu: adis: Fix NULL pointer dereference in adis_init
iio: imu: inv_icm45600: fix regulator put warning when probe fails
iio: buffer: Fix wait_queue not being removed
iio: gyro: mpu3050-core: fix pm_runtime error handling
iio: gyro: mpu3050-i2c: fix pm_runtime error handling
iio: adc: ad7768-1: Fix ERR_PTR dereference in ad7768_fill_scale_tbl
iio: chemical: sps30_serial: fix buffer size in sps30_serial_read_meas()
iio: chemical: sps30_i2c: fix buffer size in sps30_i2c_read_meas()
iio: magnetometer: tlv493d: remove erroneous shift in X-axis data
iio: proximity: hx9023s: Protect against division by zero in set_samp_freq
iio: proximity: hx9023s: fix assignment order for __counted_by
iio: chemical: bme680: Fix measurement wait duration calculation
...
Pull staging driver fixes from Greg KH:
"Here are three small staging driver fixes for 7.0-rc4 that resolve
some reported problems. They are:
- two rtl8723bs data validation bugfixes
- sm750fb removal path bugfix
All of these have been in linux-next for many weeks with no reported
issues"
* tag 'staging-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8723bs: fix potential out-of-bounds read in rtw_restruct_wmm_ie
staging: rtl8723bs: properly validate the data in rtw_get_ie_ex()
staging: sm750fb: add missing pci_release_region on error and removal
Florian Westphal says:
====================
netfilter: updates for net
This is a much earlier pull request than usual, due to the large
backlog. We are aware of several unfixed issues, in particular
in ctnetlink, patches are being worked on.
The following patchset contains Netfilter fixes for *net*:
1) fix a use-after-free in ctnetlink, from Hyunwoo Kim, broken
since v3.10.
2) add missing netlink range checks in ctnetlink, broken since v2.6
days.
3) fix content length truncation in sip conntrack helper,
from Lukas Johannes Möller. Broken since 2.6.34.
4) Revert a recent patch to add stronger checks for overlapping ranges
in nf_tables rbtree set type.
Patch is correct, but several nftables versions have a bug (now fixed)
that triggers the checks incorrectly.
5) Reset mac header before the vlan push to avoid warning splat (and
make things functional). From Eric Woudstra.
6) Add missing bounds check in H323 conntrack helper, broken since this
helper was added 20 years ago, from Jenny Guanni Qu.
7) Fix a memory leak in the dynamic set infrastructure, from Pablo Neira
Ayuso. Broken since v5.11.
8+9) a few spots failed to purge skbs queued to userspace via nfqueue,
this causes RCU escape / use-after-free. Also from Pablo. Broken
since v3.4 added the CT target to xtables.
10) Fix undefined behaviour in xt_time, use u32 for a shift-by-31
operation, not s32, from Jenny Guanni Qu.
11) H323 conntrack helper lacks a check for length variable becoming
negative after decrement, causes major out-of-bounds read due to
cast to unsigned size later, also from Jenny.
Both issues exist since 2.6 days.
* tag 'nf-26-03-13' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_conntrack_h323: check for zero length in DecodeQ931()
netfilter: xt_time: use unsigned int for monthday bit shift
netfilter: xt_CT: drop pending enqueued packets on template removal
netfilter: nft_ct: drop pending enqueued packets on removal
nf_tables: nft_dynset: fix possible stateful expression memleak in error path
netfilter: nf_conntrack_h323: fix OOB read in decode_int() CONS case
netfilter: nf_flow_table_ip: reset mac header before vlan push
netfilter: revert nft_set_rbtree: validate open interval overlap
netfilter: nf_conntrack_sip: fix Content-Length u32 truncation in sip_help_tcp()
netfilter: conntrack: add missing netlink policy validations
netfilter: ctnetlink: fix use-after-free in ctnetlink_dump_exp_ct()
====================
Link: https://patch.msgid.link/20260313150614.21177-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
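Item 10 in the netfilter list above (use u32, not s32, for a shift by 31) comes down to a C language rule that a small userspace sketch can show. The helper names monthday_bit() and monthday_match() are made up for the illustration and are not the xt_time code.

```c
#include <stdint.h>

/* Bit for a day of the month (1..31) in a 32-bit mask; hypothetical helper. */
uint32_t monthday_bit(unsigned int mday)
{
	/*
	 * The unsigned operand matters: with a 32-bit int, (1 << 31) shifts
	 * into the sign bit of a signed int, which is undefined behaviour.
	 * (uint32_t)1 << 31 is well defined and yields 0x80000000.
	 */
	return (uint32_t)1 << mday;
}

/* Returns 1 if `mday` is selected in `mask`, 0 otherwise. */
int monthday_match(uint32_t mask, unsigned int mday)
{
	if (mday < 1 || mday > 31)
		return 0;	/* also keeps the shift count below 32 */
	return (mask & monthday_bit(mday)) != 0;
}
```

The range check serves two purposes: it rejects invalid days and guarantees the shift count stays below the width of the type, which is the other way a shift can become undefined.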
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- hci_sync: Fix hci_le_create_conn_sync
- MGMT: Fix list corruption and UAF in command complete handlers
- L2CAP: Disconnect if received packet's SDU exceeds IMTU
- L2CAP: Disconnect if sum of payload sizes exceed SDU
- L2CAP: Fix accepting multiple L2CAP_ECRED_CONN_REQ
- L2CAP: Fix type confusion in l2cap_ecred_reconf_rsp()
- L2CAP: Validate L2CAP_INFO_RSP payload length before access
- L2CAP: Fix use-after-free in l2cap_unregister_user
- ISO: Fix defer tests being unstable
- HIDP: Fix possible UAF
- SMP: make SM/PER/KDU/BI-04-C happy
- qca: fix ROM version reading on WCN3998 chips
* tag 'for-net-2026-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: qca: fix ROM version reading on WCN3998 chips
Bluetooth: L2CAP: Validate L2CAP_INFO_RSP payload length before access
Bluetooth: L2CAP: Fix type confusion in l2cap_ecred_reconf_rsp()
Bluetooth: L2CAP: Fix accepting multiple L2CAP_ECRED_CONN_REQ
Bluetooth: L2CAP: Fix use-after-free in l2cap_unregister_user
Bluetooth: HIDP: Fix possible UAF
Bluetooth: MGMT: Fix list corruption and UAF in command complete handlers
Bluetooth: hci_sync: Fix hci_le_create_conn_sync
Bluetooth: ISO: Fix defer tests being unstable
Bluetooth: SMP: make SM/PER/KDU/BI-04-C happy
Bluetooth: LE L2CAP: Disconnect if sum of payload sizes exceed SDU
Bluetooth: LE L2CAP: Disconnect if received packet's SDU exceeds IMTU
====================
Link: https://patch.msgid.link/20260312200655.1215688-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a reader's file descriptor is closed while in the middle of reading
a cache_request (rp->offset != 0), cache_release() decrements the
request's readers count but never checks whether it should free the
request.
In cache_read(), when readers drops to 0 and CACHE_PENDING is clear, the
cache_request is removed from the queue and freed along with its buffer
and cache_head reference. cache_release() lacks this cleanup.
The only other path that frees requests with readers == 0 is
cache_dequeue(), but it runs only when CACHE_PENDING transitions from
set to clear. If that transition already happened while readers was
still non-zero, cache_dequeue() will have skipped the request, and no
subsequent call will clean it up.
Add the same cleanup logic from cache_read() to cache_release(): after
decrementing readers, check if it reached 0 with CACHE_PENDING clear,
and if so, dequeue and free the cache_request.
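The cleanup rule described above can be sketched in userspace C. This is only a model: the names model_request and model_put_reader are hypothetical stand-ins rather than the kernel's definitions, locking is left out, and the pending flag stands in for the CACHE_PENDING bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's cache_request. */
struct model_request {
	int readers;   /* readers that are partway through this request */
	bool pending;  /* models the CACHE_PENDING bit */
	bool queued;   /* still linked on the queue */
	char *buf;     /* buffer owned by the request */
};

/*
 * Drop one reader. If that was the last reader and the request is no
 * longer pending, nobody else will clean it up, so dequeue it and free
 * its buffer here. Returns true when the request was torn down.
 */
static bool model_put_reader(struct model_request *rq)
{
	assert(rq->readers > 0);
	rq->readers--;

	if (rq->readers == 0 && !rq->pending) {
		rq->queued = false;
		free(rq->buf);
		rq->buf = NULL;
		return true;
	}
	return false;
}
```

In this model, a release path that only decrements readers without the second check leaves buf allocated forever once pending has already been cleared, which corresponds to the leak described above.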
Reported-by: NeilBrown <neilb@ownmail.net>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: stable@kernel.org
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The /proc/fs/nfs/exports proc entry is created at module init
and persists for the module's lifetime. exports_proc_open()
captures the caller's current network namespace and stores
its svc_export_cache in seq->private, but takes no reference
on the namespace. If the namespace is subsequently torn down
(e.g. container destruction after the opener does setns() to a
different namespace), nfsd_net_exit() calls nfsd_export_shutdown()
which frees the cache. Subsequent reads on the still-open fd
dereference the freed cache_detail, walking a freed hash table.
Hold a reference on the struct net for the lifetime of the open
file descriptor. This prevents nfsd_net_exit() from running --
and thus prevents nfsd_export_shutdown() from freeing the cache
-- while any exports fd is open. cache_detail already stores
its net pointer (cd->net, set by cache_create_net()), so
exports_release() can retrieve it without additional per-file
storage.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: 96d851c4d2 ("nfsd: use proper net while reading "exports" file")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
svc_export_put() calls path_put() and auth_domain_put() immediately
when the last reference drops, before the RCU grace period. RCU
readers in e_show() and c_show() access both ex_path (via
seq_path/d_path) and ex_client->name (via seq_escape) without
holding a reference. If cache_clean removes the entry and drops the
last reference concurrently, the sub-objects are freed while still
in use, producing a NULL pointer dereference in d_path.
Commit 2530766492 ("nfsd: fix UAF when access ex_uuid or
ex_stats") moved kfree of ex_uuid and ex_stats into the
call_rcu callback, but left path_put() and auth_domain_put() running
before the grace period because both may sleep and call_rcu
callbacks execute in softirq context.
Replace call_rcu/kfree_rcu with queue_rcu_work(), which defers the
callback until after the RCU grace period and executes it in process
context where sleeping is permitted. This allows path_put() and
auth_domain_put() to be moved into the deferred callback alongside
the other resource releases. Apply the same fix to expkey_put(),
which has the identical pattern with ek_path and ek_client.
A dedicated workqueue scopes the shutdown drain to only NFSD
export release work items; flushing the shared
system_unbound_wq would stall on unrelated work from other
subsystems. nfsd_export_shutdown() uses rcu_barrier() followed
by flush_workqueue() to ensure all deferred release callbacks
complete before the export caches are destroyed.
Reported-by: Misbah Anjum N <misanjum@linux.ibm.com>
Closes: https://lore.kernel.org/linux-nfs/dcd371d3a95815a84ba7de52cef447b8@linux.ibm.com/
Fixes: c224edca7a ("nfsd: no need get cache ref when protected by rcu")
Fixes: 1b10f0b603 ("SUNRPC: no need get cache ref when protected by rcu")
Cc: stable@vger.kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: NeilBrown <neil@brown.name>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A race condition exists between lec_atm_close() setting priv->lecd
to NULL and concurrent access to priv->lecd in send_to_lecd(),
lec_handle_bridge(), and lec_atm_send(). When the socket is freed
via RCU while another thread is still using it, a use-after-free
occurs in sock_def_readable() when accessing the socket's wait queue.
The root cause is that lec_atm_close() clears priv->lecd without
any synchronization, while callers dereference priv->lecd without
any protection against concurrent teardown.
Fix this by converting priv->lecd to an RCU-protected pointer:
- Mark priv->lecd as __rcu in lec.h
- Use rcu_assign_pointer() in lec_atm_close() and lecd_attach()
for safe pointer assignment
- Use rcu_access_pointer() for NULL checks that do not dereference
the pointer in lec_start_xmit(), lec_push(), send_to_lecd() and
lecd_attach()
- Use rcu_read_lock/rcu_dereference/rcu_read_unlock in send_to_lecd(),
lec_handle_bridge() and lec_atm_send() to safely access lecd
- Use rcu_assign_pointer() followed by synchronize_rcu() in
lec_atm_close() to ensure all readers have completed before
proceeding. This is safe since lec_atm_close() is called from
vcc_release() which holds lock_sock(), a sleeping lock.
- Remove the manual sk_receive_queue drain from lec_atm_close()
since vcc_destroy_socket() already drains it after lec_atm_close()
returns.
v2: Switch from spinlock + sock_hold/put approach to RCU to properly
fix the race. The v1 spinlock approach had two issues pointed out
by Eric Dumazet:
1. priv->lecd was still accessed directly after releasing the
lock instead of using a local copy.
2. The spinlock did not prevent packets being queued after
lec_atm_close() drains sk_receive_queue since timer and
workqueue paths bypass netif_stop_queue().
Note: Syzbot patch testing was attempted but the test VM terminated
unexpectedly with "Connection to localhost closed by remote host",
likely due to a QEMU AHCI emulation issue unrelated to this fix.
Compile testing with "make W=1 net/atm/lec.o" passes cleanly.
Reported-by: syzbot+f50072212ab792c86925@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f50072212ab792c86925
Link: https://lore.kernel.org/all/20260309093614.502094-1-kartikey406@gmail.com/T/ [v1]
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260309155908.508768-1-kartikey406@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When OGM aggregation state is toggled at runtime, an existing forwarded
packet may have been allocated with only packet_len bytes, while a later
packet can still be selected for aggregation. Appending in this case can
hit skb_put overflow conditions.
Reject aggregation when the target skb tailroom cannot accommodate the new
packet. The caller then falls back to creating a new forward packet
instead of appending.
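A minimal userspace sketch of the tailroom check follows. The model_skb structure and both helpers are hypothetical simplifications for illustration; the real code operates on struct sk_buff and its tailroom accessor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an skb data buffer. */
struct model_skb {
	size_t len;       /* bytes already used */
	size_t capacity;  /* total bytes allocated for data */
};

/* Free space left at the end of the buffer. */
static size_t model_tailroom(const struct model_skb *skb)
{
	assert(skb->len <= skb->capacity);
	return skb->capacity - skb->len;
}

/*
 * Aggregation is only allowed when the new packet fits in the remaining
 * tailroom; otherwise the caller should build a new forward packet.
 */
static bool model_can_aggregate(const struct model_skb *skb, size_t packet_len)
{
	return packet_len <= model_tailroom(skb);
}
```

A buffer allocated with exactly packet_len bytes has zero tailroom in this model, so any attempt to aggregate into it is rejected.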
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Cc: stable@vger.kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
Signed-off-by: Yang Yang <n05ec@lzu.edu.cn>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Commit 551120148b ("crypto: ccp - Fix a case where SNP_SHUTDOWN is
missed") fixed a case where SNP is left in INIT state if page reclaim
fails. It removes the transition to the INIT state for this command and
adjusts the page state management.
While doing this, it added a call to snp_leak_pages() after a call to
snp_reclaim_pages() failed. Since snp_reclaim_pages() already calls
snp_leak_pages() internally on the pages it fails to reclaim, calling
it again leaks the exact same page twice.
Fix by removing the extra call to snp_leak_pages().
The problem was found by an experimental code review agent based on
gemini-3.1-pro while reviewing backports into v6.18.y.
Assisted-by: Gemini:gemini-3.1-pro
Fixes: 551120148b ("crypto: ccp - Fix a case where SNP_SHUTDOWN is missed")
Cc: Tycho Andersen (AMD) <tycho@kernel.org>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Tycho Andersen (AMD) <tycho@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull drm fixes from Dave Airlie:
"The weekly drm fixes. This is mostly msm fixes across the functions,
with amdgpu and i915. It also has a core rust fix and changes in
nova-core to take advantage of it, and otherwise just has some minor
driver fixes, and marks loongson as orphaned.
rust:
- Fix safety issue in dma_read! and dma_write!
nova-core:
- Fix UB in DmaGspMem pointer accessors
- Fix stack overflow in GSP memory allocation
loongson:
- mark drm driver as unmaintained
msm:
- Core:
- Adjusted msm_iommu_pagetable_prealloc_allocate() allocation type
- DPU:
- Fixed blue screens on Hamoa laptops by reverting the LM
reservation
- Fixed the size of the LM block on several platforms
- Dropped usage of %pK (again)
- Fixed smatch warning on SSPP v13+ code
- Fixed INTF_6 interrupts on Lemans
- DSI:
- Fixed DSI PHY revision on Kaanapali
- Fixed pixel clock calculation for the bonded DSI mode panels
with compression enabled
- DT bindings:
- Fixed DisplayPort description on Glymur
- Fixed model name in SM8750 MDSS schema
- GPU:
- Added MODULE_DEVICE_TABLE to the GPU driver
- Fix bogus protect error on X2-85
- Fix dma_free_attrs() buffer size
- Gen8 UBWC fix for Glymur
i915:
- Avoid hang when configuring VRR [icl]
- Fix sg_table overflow with >4GB folios
- Fix PSR Selective Update handling
- Fix eDP ALPM read-out sequence
amdgpu:
- SMU13 fix
- SMU14 fix
- Fixes for bringup hw testing
- Kerneldoc fix
- GC12 idle power fix for compute workloads
- DCCG fixes
amdkfd:
- Fix missing BO unreserve in an error path
ivpu:
- drop unnecessary bootparams register setting
amdxdna:
- fix runtime/suspend resume deadlock
bridge:
- ti-sn65dsi83: fix DSI rounding and dual LVDS
gud:
- fix NULL crtc dereference on display disable"
* tag 'drm-fixes-2026-03-14' of https://gitlab.freedesktop.org/drm/kernel: (44 commits)
drm/amd: Set num IP blocks to 0 if discovery fails
drm/amdkfd: Unreserve bo if queue update failed
drm/amd/display: Check for S0i3 to be done before DCCG init on DCN21
drm/amd/display: Add missing DCCG register entries for DCN20-DCN316
gpu: nova-core: gsp: fix UB in DmaGspMem pointer accessors
drm/loongson: Mark driver as orphaned
accel/amdxdna: Fix runtime suspend deadlock when there is pending job
gpu: nova-core: fix stack overflow in GSP memory allocation
accel/ivpu: Remove boot params address setting via MMIO register
drm/i915/dp: Read ALPM caps after DPCD init
drm/i915/psr: Write DSC parameters on Selective Update in ET mode
drm/i915/dsc: Add helper for writing DSC Selective Update ET parameters
drm/i915/dsc: Add Selective Update register definitions
drm/i915/psr: Repeat Selective Update area alignment
drm/i915: Fix potential overflow of shmem scatterlist length
drm/i915/vrr: Configure VRR timings after enabling TRANS_DDI_FUNC_CTL
drm/bridge: ti-sn65dsi83: halve horizontal syncs for dual LVDS output
drm/bridge: ti-sn65dsi83: fix CHA_DSI_CLK_RANGE rounding
drm/gud: fix NULL crtc dereference on display disable
drm/sitronix/st7586: fix bad pixel data due to byte swap
...
Pull workqueue fixes from Tejun Heo:
- Improve workqueue stall diagnostics: dump all busy workers (not just
running ones), show wall-clock duration of in-flight work items, and
add a sample module for reproducing stalls
- Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id()
- Rename pool->watchdog_ts to pool->last_progress_ts and related
functions for clarity
* tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
workqueue: Add stall detector sample module
workqueue: Show all busy workers in stall diagnostics
workqueue: Show in-flight work item duration in stall diagnostics
workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
Pull cgroup fixes from Tejun Heo:
- Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks
that haven't been removed yet, fixing a systemd timeout issue on
PREEMPT_RT
- Call rebuild_sched_domains() directly in CPU hotplug instead of
deferring to a workqueue, fixing a race where online/offline CPUs
could briefly appear in stale sched domains
* tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Don't expose dead tasks in cgroup
cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
Pull sched_ext fixes from Tejun Heo:
- Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE()
annotations for lock-free accesses to module parameters and dsq->seq
- Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and
above) when passed through the int sched_class interface
- Documentation updates: scheduling class precedence, task ownership
state machine, example scheduler descriptions, config list cleanup
- Selftest fix for format specifier and buffer length in
file_write_long()
* tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
sched_ext: Documentation: Update sched-ext.rst
sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()
sched_ext: Documentation: Mention scheduling class precedence
sched_ext: Document task ownership state machine
sched_ext: Use READ_ONCE() for lock-free reads of module param variables
sched_ext/selftests: Fix format specifier and buffer length in file_write_long()
sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix stale build ID in module MMAP2 records in events synthesized for
pre-existing processes
- Fix rust cross compilation
- hashmap__new() error pointer return handling fixes
- Fix off-by-one bug in outside of functions check on the disasm code
- Update header copies of kernel headers, including prctl.h, mount.h,
fs.h, irq_vectors.h, perf_event.h, gfp_types.h, kvm.h, cpufeatures.h,
msr-index.h, also the syscall tables files that introduced the
'rseq_slice_yield' syscall
- Finish removal of ETM_OPT_* on the ARM coresight support, needed to
sync the coresight-pmu.h header with the kernel sources
- Make in-target rule robust against too long argument error
* tag 'perf-tools-fixes-for-v7.0-1-2026-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (22 commits)
perf synthetic-events: Fix stale build ID in module MMAP2 records
perf annotate loongarch: Fix off-by-one bug in outside check
perf ftrace: Fix hashmap__new() error checking
perf annotate: Fix hashmap__new() error checking
perf cs-etm: Sync coresight-pmu.h header with the kernel sources
perf cs-etm: Finish removal of ETM_OPT_*
tools headers UAPI: Update tools' copy of linux/coresight-pmu.h
tools headers: Update the syscall tables and unistd.h, to support the new 'rseq_slice_yield' syscall
perf disasm: Fix off-by-one bug in outside check
tools arch x86: Sync msr-index.h to pick MSR_{OMR_[0-3],CORE_PERF_GLOBAL_STATUS_SET}
tools headers UAPI: Sync x86's asm/kvm.h with the kernel sources
tools headers x86 cpufeatures: Sync with the kernel sources
tools headers UAPI: Sync linux/kvm.h with the kernel sources
tools headers: Update the linux/gfp_types.h copy with the kernel sources
perf beauty: Update the linux/perf_event.h copy with the kernel sources
perf beauty: Update the arch/x86/include/asm/irq_vectors.h copy with the kernel sources
perf beauty: Sync UAPI linux/fs.h with kernel sources
perf beauty: Sync linux/mount.h copy with the kernel sources
tools build: Fix rust cross compilation
perf build: Prevent "argument list too long" error
...
Pull s390 fixes from Vasily Gorbik:
- Revert IRQ entry/exit path optimization that incorrectly cleared
some PSW bits before irqentry_exit(), causing boot failures with
linux-next and HRTIMER_REARM_DEFERRED (which only uncovered the
problem)
- Fix zcrypt code to show CCA card serial numbers even when the
default crypto domain is offline by selecting any domain available,
preventing empty sysfs entries
* tag 's390-7.0-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Enable AUTOSEL_DOM for CCA serialnr sysfs attribute
s390: Revert "s390/irq/idle: Remove psw bits early"
Add myself, Dexuan, and Long as maintainers. Deepak is stepping down
from these responsibilities.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
In the error path of mshv_map_user_memory(), calling vfree() directly on
the region leaves the MMU notifier registered. When userspace later unmaps
the memory, the notifier fires and accesses the freed region, causing a
use-after-free and potential kernel panic.
Replace vfree() with mshv_partition_put() to properly unregister
the MMU notifier before freeing the region.
Fixes: b9a66cd5cc ("mshv: Add support for movable memory regions")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Pull ceph fixes from Ilya Dryomov:
"A small pile of CephFS and messenger bug fixes, all marked for stable"
* tag 'ceph-for-7.0-rc4' of https://github.com/ceph/ceph-client:
libceph: Fix potential out-of-bounds access in ceph_handle_auth_reply()
libceph: Use u32 for non-negative values in ceph_monmap_decode()
MAINTAINERS: update email address of Dongsheng Yang
libceph: reject preamble if control segment is empty
libceph: admit message frames only in CEPH_CON_S_OPEN state
libceph: prevent potential out-of-bounds reads in process_message_header()
ceph: do not skip the first folio of the next object in writeback
ceph: fix memory leaks in ceph_mdsc_build_path()
ceph: add a bunch of missing ceph_path_info initializers
ceph: fix i_nlink underrun during async unlink
Pull xfs fixes from Carlos Maiolino:
"A couple race fixes found on the new healthmon mechanism, and another
flushing dquots during filesystem shutdown"
* tag 'xfs-fixes-7.0-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix integer overflow in bmap intent sort comparator
xfs: fix undersized l_iclog_roundoff values
xfs: ensure dquot item is deleted from AIL only after log shutdown
xfs: remove redundant set null for ip->i_itemp
xfs: fix returned valued from xfs_defer_can_append
xfs: Remove redundant NULL check after __GFP_NOFAIL
xfs: fix race between healthmon unmount and read_iter
xfs: remove scratch field from struct xfs_gc_bio
Pull smb client fixes from Steve French:
- Fix reconnect when using non-default port
- Fix default retransmission behavior
- Fix open handle reuse in cifs_open
- Fix export for smb2-mapperror-test
- Fix potential corruption on write retry
- Fix potentially uninitialized superblock flags
- Fix missing O_DIRECT and O_SYNC flags on create
* tag 'v7.0-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: make default value of retrans as zero
smb: client: fix open handle lookup in cifs_open()
smb: client: fix iface port assignment in parse_server_interfaces
smb/client: only export symbol for 'smb2maperror-test' module
smb: client: fix in-place encryption corruption in SMB2_write()
smb: client: fix sbflags initialization
smb: client: fix atomic open with O_DIRECT & O_SYNC
Pull spi fixes from Mark Brown:
"A couple of device ID and quirk updates, plus a bunch of small fixes
most of which (other than the Cadence one) are unremarkable error
handling fixes"
* tag 'spi-fix-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: atcspi200: Handle invalid buswidth and fix compiler warning
spi: dt-bindings: sun6i: Allow Dual SPI and Quad SPI for newer SoCs
spi: intel-pci: Add support for Nova Lake mobile SPI flash
spi: cadence-qspi: Fix requesting of APB and AHB clocks on JH7110
spi: rockchip-sfc: Fix double-free in remove() callback
spi: atcspi200: Fix double-free in atcspi_configure_dma()
spi: amlogic: spifc-a4: Fix DMA mapping error handling
Pull regulator fixes from Mark Brown:
"A couple of small driver specific fixes for pca9450, cleaning up
logging and fixing warnings due to confusion with interrupt type"
* tag 'regulator-fix-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: pca9450: Correct probed name for PCA9452
regulator: pca9450: Correct interrupt type
Pull sound fixes from Takashi Iwai:
"There have been continuous flux but most of them are device-specific
small fixes, while we see a few core fixes at this time (minor PCM fix
for linked streams and a few ASoC core fixes for delayed work, etc)
Core:
- PCM: Fix use-after-free in linked stream drain
ASoC:
- core: Fixes for delayed works, empty DMI string handling and DT overlay
- qcom: qdsp6: Fix ADSP stop/start crash via component removal ordering
- tegra: Add support for Tegra238 audio graph card
- amd: Fix missing error checks for clock acquisition
- rt1011: Fix incorrect DAPM context retrieval helper
HD-audio:
- Add quirk for Gigabyte H610M, ASUS UM6702RC, HP 14s-dr5xxx, and
ThinkPad X390
USB-audio:
- Scarlett2: Fix NULL dereference for malformed endpoint descriptors
- Add quirk for SPACETOUCH"
* tag 'sound-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: amd: acp-mach-common: Add missing error check for clock acquisition
ASoC: detect empty DMI strings
ASoC: amd: acp3x-rt5682-max9836: Add missing error check for clock acquisition
ALSA: usb-audio: Add iface reset and delay quirk for SPACETOUCH USB Audio
ASoC: codecs: rt1011: Use component to get the dapm context in spk_mode_put
ALSA: usb-audio: Check endpoint numbers at parsing Scarlett2 mixer interfaces
ASoC: simple-card-utils: fix graph_util_is_ports0() for DT overlays
ASoC: soc-core: flush delayed work before removing DAIs and widgets
ASoC: soc-core: drop delayed_work_pending() check before flush
ASoC: tegra: Add support for Tegra238 soundcard
ALSA: hda/realtek: Add headset jack quirk for Thinkpad X390
ALSA: hda/realtek: add HP Laptop 14s-dr5xxx mute LED quirk
ALSA: hda/realtek: add quirk for ASUS UM6702RC
ALSA: pcm: fix use-after-free on linked stream runtime in snd_pcm_drain()
ALSA: hda/realtek: Add quirk for Gigabyte Technology to fix headphone
firmware: cs_dsp: Fix fragmentation regression in firmware download
ASoC: qcom: qdsp6: Fix q6apm remove ordering during ADSP stop and start
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- Fix nvme-pci IRQ race and slab-out-of-bounds access
- Fix recursive workqueue locking for target async events
- Various cleanups
- Fix a potential NULL pointer dereference in ublk on size setting
- ublk automatic partition scanning fix
- Two s390 dasd fixes
* tag 'block-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
nvme: Annotate struct nvme_dhchap_key with __counted_by
nvme-core: do not pass empty queue_limits to blk_mq_alloc_queue()
nvme-pci: Fix race bug in nvme_poll_irqdisable()
nvmet: move async event work off nvmet-wq
nvme-pci: Fix slab-out-of-bounds in nvme_dbbuf_set
s390/dasd: Copy detected format information to secondary device
s390/dasd: Move quiesce state with pprc swap
ublk: don't clear GD_SUPPRESS_PART_SCAN for unprivileged daemons
ublk: fix NULL pointer dereference in ublk_ctrl_set_size()
Pull io_uring fixes from Jens Axboe:
- Fix an inverted true/false comment on task_no_new_privs, from the
BPF filtering changes merged in this release
- Use the migration disabling way of running the BPF filters, as the
io_uring side doesn't do that already
- Fix an issue with ->rings stability under resize, both for local
task_work additions and for eventfd signaling
- Fix an issue with SQE mixed mode, where a bounds check wasn't correct
for having a 128b SQE
- Fix an issue where a legacy provided buffer group is changed to a
ring mapped one while legacy buffers from that group are in flight
* tag 'io_uring-7.0-20260312' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
io_uring/kbuf: check if target buffer list is still legacy on recycle
io_uring: fix physical SQE bounds check for SQE_MIXED 128-byte ops
io_uring/eventfd: use ctx->rings_rcu for flags checking
io_uring: ensure ctx->rings is stable for task work flags manipulation
io_uring/bpf_filter: use bpf_prog_run_pin_on_cpu() to prevent migration
io_uring/register: fix comment about task_no_new_privs
Pull slab fixes from Vlastimil Babka:
- Fix for a memory leak that can occur when already so low on memory
that we can't allocate a new slab anymore (Qing Wang)
- Fix for a case where slabobj_ext array for a slab might be allocated
from the same slab, making it permanently non-freeable (Harry Yoo)
* tag 'slab-for-7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slab: fix memory leak when refill_sheaf() fails
mm/slab: fix an incorrect check in obj_exts_alloc_size()
Pull power sequencing fix from Bartosz Golaszewski:
- fix OF-node reference leak in pwrseq-pcie-m2
* tag 'pwrseq-fixes-for-v7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
power: sequencing: pcie-m2: Fix device node reference leak in probe
In DecodeQ931(), the UserUserIE code path reads a 16-bit length from
the packet, then decrements it by 1 to skip the protocol discriminator
byte before passing it to DecodeH323_UserInformation(). If the encoded
length is 0, the decrement wraps to -1, which is then passed as a
large value to the decoder, leading to an out-of-bounds read.
Add a check to ensure len is positive after the decrement.
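The wrap-around and the added check can be illustrated with a small userspace sketch. The function model_user_user_len is hypothetical and only models the length handling described above, not the actual DecodeQ931() code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the length handling: read a 16-bit big-endian length, skip
 * the one-byte protocol discriminator, and return the remaining payload
 * length. Returns -1 for malformed input.
 */
static int model_user_user_len(const uint8_t *p, size_t sz)
{
	int len;

	assert(p != NULL);
	if (sz < 2)
		return -1;

	len = (p[0] << 8) | p[1];
	sz -= 2;
	if (sz < (size_t)len)
		return -1;

	len--;          /* skip the protocol discriminator byte */
	if (len <= 0)   /* the added check: an encoded length of 0 gives -1 */
		return -1;

	return len;
}
```

Without the final check, a length of -1 converted to an unsigned size becomes the largest representable value, which is why the decoder would read far past the end of the packet.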
Fixes: 5e35941d99 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Tested-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The monthday field can be up to 31, and shifting a signed integer 1
by 31 positions (1 << 31) is undefined behavior in C, as the result
overflows a 32-bit signed int. Use 1U to ensure well-defined behavior
for all valid monthday values.
Change the weekday shift to 1U as well for consistency.
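A short userspace sketch of the well-defined form, assuming a 32-bit unsigned int. The helper names are hypothetical and only mirror the bitmask test described above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit for a given day of the month (1..31) in a 32-bit mask. */
static uint32_t model_monthday_bit(unsigned int monthday)
{
	assert(monthday >= 1 && monthday <= 31);
	/* 1U keeps the shift in unsigned arithmetic, so bit 31 is valid. */
	return 1U << monthday;
}

/* Nonzero when the mask selects the given day of the month. */
static int model_monthday_match(uint32_t mask, unsigned int monthday)
{
	return (mask & model_monthday_bit(monthday)) != 0;
}
```

With a plain signed 1, the shift for day 31 would set the sign bit of a 32-bit int, which is the undefined case the patch avoids.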
Fixes: ee4411a1b1 ("[NETFILTER]: x_tables: add xt_time match")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Tested-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Templates refer to objects that can go away while packets are sitting in
nfqueue:
- helper, this can be an issue on module removal.
- timeout policy, nfnetlink_cttimeout might remove it.
The use of templates with zone and event cache filter is safe, since
this just copies values.
Flush these enqueued packets in case the template rule gets removed.
Fixes: 24de58f465 ("netfilter: xt_CT: allow to attach timeout policy + glue code")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Packets sitting in nfqueue might hold a reference to:
- templates that specify the conntrack zone, because a percpu area is
used and module removal is possible.
- conntrack timeout policies and helper, where object removal leaves
a stale reference.
Since these objects can just go away, drop enqueued packets to avoid
stale references to them.
If there is a need for finer-grained removal, this logic can be revisited
to drop packets selectively based on dependencies.
Fixes: 7e0b2b57f0 ("netfilter: nft_ct: add ct timeout support")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
If cloning the second stateful expression in the element via GFP_ATOMIC
fails, then the first stateful expression remains in place without being
released.
unreferenced object (percpu) 0x607b97e9cab8 (size 16):
comm "softirq", pid 0, jiffies 4294931867
hex dump (first 16 bytes on cpu 3):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
backtrace (crc 0):
pcpu_alloc_noprof+0x453/0xd80
nft_counter_clone+0x9c/0x190 [nf_tables]
nft_expr_clone+0x8f/0x1b0 [nf_tables]
nft_dynset_new+0x2cb/0x5f0 [nf_tables]
nft_rhash_update+0x236/0x11c0 [nf_tables]
nft_dynset_eval+0x11f/0x670 [nf_tables]
nft_do_chain+0x253/0x1700 [nf_tables]
nft_do_chain_ipv4+0x18d/0x270 [nf_tables]
nf_hook_slow+0xaa/0x1e0
ip_local_deliver+0x209/0x330
Fixes: 563125a73a ("netfilter: nftables: generalize set extension to support for several expressions")
Reported-by: Gurpreet Shergill <giki.shergill@proton.me>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
In decode_int(), the CONS case calls get_bits(bs, 2) to read a length
value, then calls get_uint(bs, len) without checking that len bytes
remain in the buffer. The existing boundary check only validates the
2 bits for get_bits(), not the subsequent 1-4 bytes that get_uint()
reads. This allows a malformed H.323/RAS packet to cause a 1-4 byte
slab-out-of-bounds read.
Add a boundary check for len bytes after get_bits() and before
get_uint().
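The missing check can be modelled in userspace as follows. The model_bitstr structure and model_read_uint are hypothetical simplifications that work on whole bytes only; they are not the helper's real bitstream code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, byte-aligned stand-in for the decoder's bitstream. */
struct model_bitstr {
	const uint8_t *cur;  /* next byte to read */
	const uint8_t *end;  /* one past the last valid byte */
};

/*
 * Read a big-endian unsigned integer of len bytes (1..4). The bounds
 * check runs before any byte is read, which is the step the CONS case
 * was missing. Returns 0 on success, -1 if too few bytes remain.
 */
static int model_read_uint(struct model_bitstr *bs, size_t len, uint32_t *out)
{
	uint32_t v = 0;

	assert(len >= 1 && len <= 4);
	if ((size_t)(bs->end - bs->cur) < len)
		return -1;

	while (len--)
		v = (v << 8) | *bs->cur++;

	*out = v;
	return 0;
}
```

Checking only the bits that encode the length, and not the bytes the length then refers to, is what allowed the 1-4 byte over-read.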
Fixes: 5e35941d99 ("[NETFILTER]: Add H.323 conntrack/NAT helper")
Reported-by: Klaudia Kloc <klaudia@vidocsecurity.com>
Reported-by: Dawid Moczadło <dawid@vidocsecurity.com>
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Double VLAN tagged packets in the fastpath trigger the error:
skb_vlan_push got skb with skb->data not at mac header (offset 18)
Call skb_reset_mac_header() before calling skb_vlan_push().
Fixes: c653d5a78f ("netfilter: flowtable: inline vlan encapsulation in xmit path")
Signed-off-by: Eric Woudstra <ericwouds@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
This reverts commit 648946966a ("netfilter: nft_set_rbtree: validate
open interval overlap").
There have been reports of nft failing to load valid rulesets after this
patch was merged into -stable.
I can reproduce several such problems with recent nft versions, including
nft 1.1.6 which is widely shipped by distributions.
We currently have little choice here.
This commit can be resurrected at some point once the nftables fix that
triggers the false overlap positive has appeared in common distros
(see e83e32c8d1cd ("mnl: restore create element command with large batches") in
nftables.git).
Fixes: 648946966a ("netfilter: nft_set_rbtree: validate open interval overlap")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
sip_help_tcp() parses the SIP Content-Length header with
simple_strtoul(), which returns unsigned long, but stores the result in
unsigned int clen. On 64-bit systems, values exceeding UINT_MAX are
silently truncated before computing the SIP message boundary.
For example, Content-Length 4294967328 (2^32 + 32) is truncated to 32,
causing the parser to miscalculate where the current message ends. The
loop then treats trailing data in the TCP segment as a second SIP
message and processes it through the SDP parser.
Fix this by changing clen to unsigned long to match the return type of
simple_strtoul(), and reject Content-Length values that exceed the
remaining TCP payload length.
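The truncation and the bounds check can be illustrated with a small userspace sketch. This is not the kernel code: both helper names are hypothetical, strtoull() stands in for simple_strtoul(), and a 32-bit unsigned int is assumed.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Models the old behaviour: the parsed value is stored in an unsigned int,
 * so anything above UINT_MAX wraps modulo 2^32 (assuming 32-bit int). */
static unsigned int truncated_content_length(const char *s)
{
    return (unsigned int)strtoull(s, NULL, 10);
}

/* Hypothetical fixed variant: keep the full-width value and reject any
 * Content-Length that does not fit in the remaining payload. */
static bool parse_content_length(const char *s, unsigned long remaining,
                                 unsigned long *clen_out)
{
    char *end;
    unsigned long long clen = strtoull(s, &end, 10);

    if (end == s)           /* no digits at all */
        return false;
    if (clen > remaining)   /* claims more data than the segment holds */
        return false;
    *clen_out = (unsigned long)clen;
    return true;
}
```

With the value from the example above, truncated_content_length("4294967328") yields 32, while parse_content_length() rejects the same string for any realistic payload size.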
Fixes: f5b321bd37 ("netfilter: nf_conntrack_sip: add TCP support")
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Florian Westphal <fw@strlen.de>
Hyunwoo Kim reports out-of-bounds access in sctp and ctnetlink.
These attributes are used by the kernel without any validation.
Extend the netlink policies accordingly.
Quoting the reporter:
nlattr_to_sctp() assigns the user-supplied CTA_PROTOINFO_SCTP_STATE
value directly to ct->proto.sctp.state without checking that it is
within the valid range. [..]
and: ... with exp->dir = 100, the access at
ct->master->tuplehash[100] reads 5600 bytes past the start of a
320-byte nf_conn object, causing a slab-out-of-bounds read confirmed by
UBSAN.
Fixes: 076a0ca026 ("netfilter: ctnetlink: add NAT support for expectations")
Fixes: a258860e01 ("netfilter: ctnetlink: add full support for SCTP to ctnetlink")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
ctnetlink_dump_exp_ct() stores a conntrack pointer in cb->data for the
netlink dump callback ctnetlink_exp_ct_dump_table(), but drops the
conntrack reference immediately after netlink_dump_start(). When the
dump spans multiple rounds, the second recvmsg() triggers the dump
callback which dereferences the now-freed conntrack via nfct_help(ct),
leading to a use-after-free on ct->ext.
The bug is that the netlink_dump_control has no .start or .done
callbacks to manage the conntrack reference across dump rounds. Other
dump functions in the same file (e.g. ctnetlink_get_conntrack) properly
use .start/.done callbacks for this purpose.
Fix this by adding .start and .done callbacks that hold and release the
conntrack reference for the duration of the dump, and move the
nfct_help() call after the cb->args[0] early-return check in the dump
callback to avoid dereferencing ct->ext unnecessarily.
BUG: KASAN: slab-use-after-free in ctnetlink_exp_ct_dump_table+0x4f/0x2e0
Read of size 8 at addr ffff88810597ebf0 by task ctnetlink_poc/133
CPU: 1 UID: 0 PID: 133 Comm: ctnetlink_poc Not tainted 7.0.0-rc2+ #3 PREEMPTLAZY
Call Trace:
<TASK>
ctnetlink_exp_ct_dump_table+0x4f/0x2e0
netlink_dump+0x333/0x880
netlink_recvmsg+0x3e2/0x4b0
? aa_sk_perm+0x184/0x450
sock_recvmsg+0xde/0xf0
Allocated by task 133:
kmem_cache_alloc_noprof+0x134/0x440
__nf_conntrack_alloc+0xa8/0x2b0
ctnetlink_create_conntrack+0xa1/0x900
ctnetlink_new_conntrack+0x3cf/0x7d0
nfnetlink_rcv_msg+0x48e/0x510
netlink_rcv_skb+0xc9/0x1f0
nfnetlink_rcv+0xdb/0x220
netlink_unicast+0x3ec/0x590
netlink_sendmsg+0x397/0x690
__sys_sendmsg+0xf4/0x180
Freed by task 0:
slab_free_after_rcu_debug+0xad/0x1e0
rcu_core+0x5c3/0x9c0
Fixes: e844a92843 ("netfilter: ctnetlink: allow to dump expectation per master conntrack")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
The code allocates standard kernel memory to pass to MPAM, which expects
__iomem. The code is safe, because __iomem accessors should work fine
on kernel mapped memory, but it leads to sparse warnings:
test_mpam_devices.c:327:42: warning: incorrect type in initializer (different address spaces)
test_mpam_devices.c:327:42: expected char [noderef] __iomem *buf
test_mpam_devices.c:327:42: got void *
test_mpam_devices.c:342:24: warning: cast removes address space '__iomem' of expression
Cast the pointer to memory via __force to silence them.
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512160133.eAzPdJv2-lkp@intel.com/
Acked-by: Ben Horgan <ben.horgan@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Accesses to MSC must be made from a cpu that is affine to that MSC and the
driver checks this in __mpam_write_reg() using smp_processor_id(). A fake
in-memory MSC is used for testing. When using that, it doesn't matter which
cpu we access it from but calling smp_processor_id() from a preemptible
context gives warnings when running with CONFIG_DEBUG_PREEMPT.
Add a test helper that wraps mpam_reset_msc_bitmap() with preemption
disabled to ensure all (fake) MSC accesses are made with preemption
disabled.
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
When an MSC supporting memory bandwidth monitoring is brought offline and
then online, mpam_restore_mbwu_state() calls __ris_msmon_read() via ipi to
restore the configuration of the bandwidth counters. It doesn't care about
the value read, mbwu_arg.val, and doesn't set it leading to a null pointer
dereference when __ris_msmon_read() adds to it. This results in a kernel
oops with a call trace such as:
Call trace:
__ris_msmon_read+0x19c/0x64c (P)
mpam_restore_mbwu_state+0xa0/0xe8
smp_call_on_cpu_callback+0x1c/0x38
process_one_work+0x154/0x4b4
worker_thread+0x188/0x310
kthread+0x11c/0x130
ret_from_fork+0x10/0x20
Provide a local variable for val to avoid __ris_msmon_read() dereferencing
a null pointer when adding to val.
Fixes: 41e8a14950 ("arm_mpam: Track bandwidth counter state for power management")
Signed-off-by: Ben Horgan <ben.horgan@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
DW_CFA_advance_loc4 is defined but no handler is implemented. Its
CFA opcode defaults to EDYNSCS_INVALID_CFA_OPCODE triggering an
error which wrongfully prevents modules from loading.
Link: https://bugs.gentoo.org/971060
Signed-off-by: Pepper Gray <hello@peppergray.xyz>
Signed-off-by: Will Deacon <will@kernel.org>
If we log the parent directory of a conflicting inode, we are not logging
the new dentries of the directory, so when we finish we have the parent
directory's inode marked as logged but we did not log its new dentries.
As a consequence if the parent directory is explicitly fsynced later and
it does not have any new changes since we logged it, the fsync is a no-op
and after a power failure the new dentries are missing.
Example scenario:
$ mkdir foo
$ sync
$ rmdir foo
$ mkdir dir1
$ mkdir dir2
# A file with the same name and parent as the directory we just deleted
# and was persisted in a past transaction. So the deleted directory's
# inode is a conflicting inode of this new file's inode.
$ touch foo
$ ln foo dir2/link
# The fsync on dir2 will log the parent directory (".") because the
# conflicting inode (deleted directory) does not exist anymore, but
# it does not log its new dentries (dir1).
$ xfs_io -c "fsync" dir2
# This fsync on the parent directory is no-op, since the previous fsync
# logged it (but without logging its new dentries).
$ xfs_io -c "fsync" .
<power failure>
# After log replay dir1 is missing.
Fix this by ensuring we log new dir dentries whenever we log the parent
directory of a no longer existing conflicting inode.
A test case for fstests will follow soon.
Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/182055fa-e9ce-4089-9f5f-4b8a23e8dd91@gmail.com/
Fixes: a3baaf0d78 ("Btrfs: fix fsync after succession of renames and unlink/rmdir")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function `btrfs_relocate_chunk()` always passes verbose=true to
`btrfs_relocate_block_group()` instead of the `verbose` parameter passed
into it by its callers.
While user-initiated rebalancing should be logged in the kernel's log
buffer, this causes excessive log spamming from automatic rebalancing,
e.g. on zoned filesystems running low on usable space.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After cancel_delayed_work_sync() is called from
xfrm_nat_keepalive_net_fini(), xfrm_state_fini() flushes remaining
states via __xfrm_state_delete(), which calls
xfrm_nat_keepalive_state_updated() to re-schedule nat_keepalive_work.
The following is a simple race scenario:
cpu0 cpu1
cleanup_net() [Round 1]
ops_undo_list()
xfrm_net_exit()
xfrm_nat_keepalive_net_fini()
cancel_delayed_work_sync(nat_keepalive_work);
xfrm_state_fini()
xfrm_state_flush()
xfrm_state_delete(x)
__xfrm_state_delete(x)
xfrm_nat_keepalive_state_updated(x)
schedule_delayed_work(nat_keepalive_work);
rcu_barrier();
net_complete_free();
net_passive_dec(net);
llist_add(&net->defer_free_list, &defer_free_list);
cleanup_net() [Round 2]
rcu_barrier();
net_complete_free()
kmem_cache_free(net_cachep, net);
nat_keepalive_work()
// on freed net
To prevent this, cancel_delayed_work_sync() is replaced with
disable_delayed_work_sync().
Fixes: f531d13bdf ("xfrm: support sending NAT keepalives in ESP in UDP states")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The lcd2s_print() and lcd2s_gotoxy() functions currently ignore the
return value of lcd2s_i2c_master_send(), which can fail. This can lead
to silent data loss or incorrect cursor positioning.
Add proper error checking: if the number of bytes sent does not match
the expected length, return -EIO; otherwise propagate any error code
from the I2C transfer.
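The error-handling rule described above can be sketched as a tiny standalone helper. The name check_i2c_send() is hypothetical and does not appear in the driver; it only models how a raw send return value might be mapped to a result.

```c
#include <errno.h>

/* Hypothetical helper: 'ret' is what the I2C send returned, 'expected' is
 * the number of bytes the caller asked to transfer. */
static int check_i2c_send(int ret, int expected)
{
    if (ret < 0)
        return ret;     /* propagate the transfer error unchanged */
    if (ret != expected)
        return -EIO;    /* short write: treat as an I/O error */
    return 0;
}
```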
Signed-off-by: Wang Jun <1742789905@qq.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
snprintf() returns the number of characters that would have been
written excluding the NUL terminator. Output is truncated when the
return value is >= the buffer size, not just > the buffer size.
When ret == size, the current code takes the non-truncated path,
advancing buf by ret and reducing size to 0. This is wrong because
the output was actually truncated (the last character was replaced by
NUL). Fix by using >= so the truncation path is taken correctly.
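A minimal userspace sketch of the boundary case, assuming a hypothetical helper named append_fits() rather than the bootconfig code itself:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Returns true only if 's' was written to 'buf' in full. snprintf()'s
 * return value excludes the NUL, so ret == size already means the last
 * character was dropped to make room for the terminator. */
static bool append_fits(char *buf, size_t size, const char *s)
{
    int ret = snprintf(buf, size, "%s", s);

    return ret >= 0 && (size_t)ret < size;
}
```

With a 4-byte buffer, "abc" fits (ret is 3), while "abcd" returns 4 and leaves "abc" in the buffer, which is exactly the ret == size case a plain > comparison would miss.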
Link: https://lore.kernel.org/all/20260312191143.28719-4-objecting@objecting.org/
Fixes: 76db5a27a8 ("bootconfig: Add Extra Boot Config support")
Cc: stable@vger.kernel.org
Signed-off-by: Josh Law <objecting@objecting.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The bounds check for brace_index happens after the array write.
While the current call pattern prevents an actual out-of-bounds
access (the previous call would have returned an error), the
write-before-check pattern is fragile and would become a real
out-of-bounds write if the error return were ever not propagated.
Move the bounds check before the array write so the function is
self-contained and safe regardless of caller behavior.
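The check-before-write ordering can be sketched as follows. The struct, the DEPTH_MAX limit, and push_brace() are assumptions made for illustration; only the open_brace and brace_index names come from the description above.

```c
#include <stdbool.h>

#define DEPTH_MAX 4     /* hypothetical nesting limit for this sketch */

struct brace_stack {
    int open_brace[DEPTH_MAX];
    int brace_index;    /* one past the last valid entry */
};

/* Validate the index first, then write, so the function never touches
 * open_brace[DEPTH_MAX] even if a caller ignores an earlier error. */
static bool push_brace(struct brace_stack *st, int node)
{
    if (st->brace_index >= DEPTH_MAX)
        return false;
    st->open_brace[st->brace_index++] = node;
    return true;
}
```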
Link: https://lore.kernel.org/all/20260312191143.28719-3-objecting@objecting.org/
Fixes: ead1e19ad9 ("lib/bootconfig: Fix a bug of breaking existing tree nodes")
Cc: stable@vger.kernel.org
Signed-off-by: Josh Law <objecting@objecting.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
The alternate screen support added by commit 23743ba647 ("vt: add
support for smput/rmput escape codes") only saves and restores the
regular screen buffer (vc_origin), but completely ignores the corresponding
unicode screen buffer (vc_uni_lines) creating a messed-up display.
Add vc_saved_uni_lines to save the unicode screen buffer when entering
the alternate screen, and restore it when leaving. Also ensure proper
cleanup in reset_terminal() and vc_deallocate().
Fixes: 23743ba647 ("vt: add support for smput/rmput escape codes")
Cc: stable <stable@kernel.org>
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
Link: https://patch.msgid.link/5o2p6qp3-91pq-0p17-or02-1oors4417ns7@onlyvoer.pbz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sabrina Dubroca says:
====================
xfrm: fix most sparse warnings
This series fixes most of the sparse warnings currently reported about
RCU pointers for files under net/xfrm. There's no actual bug in the
current code, we only need to use the correct helpers in each context.
====================
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Blamed commits forgot that vxlan/geneve use udp_tunnel[6]_xmit_skb() which
call iptunnel_xmit_stats().
iptunnel_xmit_stats() was assuming tunnels were only using
NETDEV_PCPU_STAT_TSTATS.
@syncp offset in pcpu_sw_netstats and pcpu_dstats is different.
32bit kernels would either have corruptions or freezes if the syncp
sequence was overwritten.
This patch also moves pcpu_stat_type closer to dev->{t,d}stats to avoid
a potential cache line miss since iptunnel_xmit_stats() needs to read it.
Fixes: 6fa6de3022 ("geneve: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Fixes: be226352e8 ("vxlan: Handle stats using NETDEV_PCPU_STAT_DSTATS.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20260311123110.1471930-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When a peer MEP is being deleted, cancel_delayed_work_sync() is called
on ccm_rx_dwork before freeing. However, br_cfm_frame_rx() runs in
softirq context under rcu_read_lock (without RTNL) and can re-schedule
ccm_rx_dwork via ccm_rx_timer_start() between cancel_delayed_work_sync()
returning and kfree_rcu() being called.
The following is a simple race scenario:
cpu0 cpu1
mep_delete_implementation()
cancel_delayed_work_sync(ccm_rx_dwork);
br_cfm_frame_rx()
// peer_mep still in hlist
if (peer_mep->ccm_defect)
ccm_rx_timer_start()
queue_delayed_work(ccm_rx_dwork)
hlist_del_rcu(&peer_mep->head);
kfree_rcu(peer_mep, rcu);
ccm_rx_work_expired()
// on freed peer_mep
To prevent this, cancel_delayed_work_sync() is replaced with
disable_delayed_work_sync() in both peer MEP deletion paths, so
that subsequent queue_delayed_work() calls from br_cfm_frame_rx()
are silently rejected.
The cc_peer_disable() helper retains cancel_delayed_work_sync()
because it is also used for the CC enable/disable toggle path where
the work must remain re-schedulable.
Fixes: dc32cbb3db ("bridge: cfm: Kernel space implementation of CFM. CCM frame RX added.")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/abBgYT5K_FI9rD1a@v4bel
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
__xbc_open_brace() pushes entries with post-increment
(open_brace[brace_index++]), so brace_index always points one past
the last valid entry. xbc_verify_tree() reads open_brace[brace_index]
to report which brace is unclosed, but this is one past the last
pushed entry and contains stale/zero data, causing the error message
to reference the wrong node.
Use open_brace[brace_index - 1] to correctly identify the unclosed
brace. brace_index is known to be > 0 here since we are inside the
if (brace_index) guard.
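The off-by-one is easy to see in a small model of a post-increment stack. The push() and last_unclosed() helpers and the DEPTH_MAX limit are hypothetical; the array and index names mirror the description above.

```c
#define DEPTH_MAX 4     /* hypothetical nesting limit for this sketch */

static int open_brace[DEPTH_MAX];
static int brace_index; /* one past the last pushed entry */

/* Post-increment push, as described above (with a bounds check). */
static int push(int node)
{
    if (brace_index >= DEPTH_MAX)
        return -1;
    open_brace[brace_index++] = node;
    return 0;
}

/* The most recently opened, still unclosed brace lives at
 * brace_index - 1; open_brace[brace_index] is stale or zero. */
static int last_unclosed(void)
{
    if (!brace_index)
        return -1;
    return open_brace[brace_index - 1];
}
```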
Link: https://lore.kernel.org/all/20260312191143.28719-2-objecting@objecting.org/
Fixes: ead1e19ad9 ("lib/bootconfig: Fix a bug of breaking existing tree nodes")
Cc: stable@vger.kernel.org
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull NVMe fixes from Keith:
"- Fix nvme-pci IRQ race and slab-out-of-bounds access (Sungwoo Kim)
- Fix recursive workqueue locking for target async events (Chaitanya)
- Various cleanups (Maurizio Lombardi, Thorsten Blum)"
* tag 'nvme-7.0-2026-03-12' of git://git.infradead.org/nvme:
nvme: Annotate struct nvme_dhchap_key with __counted_by
nvme-core: do not pass empty queue_limits to blk_mq_alloc_queue()
nvme-pci: Fix race bug in nvme_poll_irqdisable()
nvmet: move async event work off nvmet-wq
nvme-pci: Fix slab-out-of-bounds in nvme_dbbuf_set
Igor Ushakov reported that GC purged the receive queue of
an alive socket due to a race with MSG_PEEK with a nice repro.
This is the exact same issue previously fixed by commit
cbcf01128d ("af_unix: fix garbage collect vs MSG_PEEK").
After GC was replaced with the current algorithm, the cited
commit removed the locking dance in unix_peek_fds() and
reintroduced the same issue.
The problem is that MSG_PEEK bumps a file refcount without
interacting with GC.
Consider an SCC containing sk-A and sk-B, where sk-A is
close()d but can be recv()ed via sk-B.
The bad thing happens if sk-A is recv()ed with MSG_PEEK from
sk-B and sk-B is close()d while GC is checking unix_vertex_dead()
for sk-A and sk-B.
GC thread User thread
--------- -----------
unix_vertex_dead(sk-A)
-> true <------.
\
`------ recv(sk-B, MSG_PEEK)
invalidate !! -> sk-A's file refcount : 1 -> 2
close(sk-B)
-> sk-B's file refcount : 2 -> 1
unix_vertex_dead(sk-B)
-> true
Initially, sk-A's file refcount is 1 by the inflight fd in sk-B
recvq. GC thinks sk-A is dead because the file refcount is the
same as the number of its inflight fds.
However, sk-A's file refcount is bumped silently by MSG_PEEK,
which invalidates the previous evaluation.
At this moment, sk-B's file refcount is 2; one by the open fd,
and one by the inflight fd in sk-A. The subsequent close()
releases one refcount by the former.
Finally, GC incorrectly concludes that both sk-A and sk-B are dead.
One option is to restore the locking dance in unix_peek_fds(),
but we can resolve this more elegantly thanks to the new algorithm.
The point is that the issue does not occur without the subsequent
close() and we actually do not need to synchronise MSG_PEEK with
the dead SCC detection.
When the issue occurs, close() and GC touch the same file refcount.
If GC sees the refcount being decremented by close(), it can just
give up garbage-collecting the SCC.
Therefore, we only need to signal the race during MSG_PEEK with
a proper memory barrier to make it visible to the GC.
Let's use seqcount_t to notify GC when MSG_PEEK occurs and let
it defer the SCC to the next run.
This way no locking is needed on the MSG_PEEK side, and we can
avoid imposing a penalty on every MSG_PEEK unnecessarily.
Note that we can retry within unix_scc_dead() if MSG_PEEK is
detected, but we do not do so to avoid hung task splat from
abusive MSG_PEEK calls.
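The idea can be modelled in userspace with a plain atomic counter. This is a simplified sketch, not the af_unix implementation: a C11 atomic_uint stands in for seqcount_t, and every function name below is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint peek_seq;    /* stand-in for the seqcount_t */

/* MSG_PEEK side: no lock, just signal that a peek happened. */
static void note_peek(void)
{
    atomic_fetch_add(&peek_seq, 1);
}

/* GC side: sample the counter, evaluate the SCC, and give up (defer to
 * the next run) if a peek raced with the evaluation. */
static bool scc_dead(bool (*looks_dead)(void))
{
    unsigned int start = atomic_load(&peek_seq);
    bool dead = looks_dead();

    if (atomic_load(&peek_seq) != start)
        return false;   /* result may be stale, do not collect */
    return dead;
}

/* Two sample evaluations: one undisturbed, one raced by a peek. */
static bool quiet_check(void)  { return true; }
static bool racing_check(void) { note_peek(); return true; }
```

In this model scc_dead(quiet_check) reports the SCC as dead, while scc_dead(racing_check) declines to collect it even though the refcount comparison itself said "dead".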
Fixes: 118f457da9 ("af_unix: Remove lock dance in unix_peek_fds().")
Reported-by: Igor Ushakov <sysroot314@gmail.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260311054043.1231316-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull power management fixes from Rafael Wysocki:
- Make the idle loop skip the cpuidle governor .reflect() callback
after it has skipped the .select() one (Rafael Wysocki)
- Fix swapped power/energy unit labels in cpupower (Kaushlendra Kumar)
- Add support for setting EPP via systemd service and intel_pstate
turbo boost support to cpupower (Jan Kiszka, Zhang Rui)
* tag 'pm-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
sched: idle: Make skipping governor callbacks more consistent
cpupower: Add intel_pstate turbo boost support for Intel platforms
cpupower: Add support for setting EPP via systemd service
cpupower: fix swapped power/energy unit labels
Pull ACPI fixes from Rafael Wysocki:
- On some platforms, the ACPI companion object of the ACPI video bus
platform device is shared with multiple other platform devices which
leads to driver probe issues, so replace that device with an
auxiliary one (which arguably is a better match for the given use
case) and update the ACPI video bus driver accordingly (Rafael
Wysocki)
- Address sparse warnings in acpi_os_initialize() by adding __iomem to
a local variable declaration (Ben Dooks)
* tag 'acpi-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: OSL: fix __iomem type on return from acpi_os_map_generic_address()
ACPI: video: Switch over to auxiliary bus type
Pull NFS client fixes from Anna Schumaker:
- Fix NFS KConfig typos
- Decrement re_receiving on the early exit paths
- return EISDIR on nfs3_proc_create if d_alias is a dir
* tag 'nfs-for-7.0-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Fix NFS KConfig typos
xprtrdma: Decrement re_receiving on the early exit paths
nfs: return EISDIR on nfs3_proc_create if d_alias is a dir
l2cap_information_rsp() checks that cmd_len covers the fixed
l2cap_info_rsp header (type + result, 4 bytes) but then reads
rsp->data without verifying that the payload is present:
- L2CAP_IT_FEAT_MASK calls get_unaligned_le32(rsp->data), which reads
4 bytes past the header (needs cmd_len >= 8).
- L2CAP_IT_FIXED_CHAN reads rsp->data[0], 1 byte past the header
(needs cmd_len >= 5).
A truncated L2CAP_INFO_RSP with result == L2CAP_IR_SUCCESS triggers an
out-of-bounds read of adjacent skb data.
Guard each data access with the required payload length check. If the
payload is too short, skip the read and let the state machine complete
with safe defaults (feat_mask and remote_fixed_chan remain zero from
kzalloc), so the info timer cleanup and l2cap_conn_start() still run
and the connection is not stalled.
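A userspace sketch of the guarded reads, assuming the 4-byte header described above. The helper names and the INFO_RSP_HDR_LEN constant are hypothetical, and the little-endian load is open-coded instead of using get_unaligned_le32().

```c
#include <stddef.h>
#include <stdint.h>

#define INFO_RSP_HDR_LEN 4  /* type + result, per the description above */

/* Returns the feature mask, or 0 when the payload is too short. */
static uint32_t feat_mask_or_zero(const uint8_t *cmd, size_t cmd_len)
{
    const uint8_t *d = cmd + INFO_RSP_HDR_LEN;

    if (cmd_len < INFO_RSP_HDR_LEN + sizeof(uint32_t))
        return 0;   /* needs cmd_len >= 8 */
    return (uint32_t)d[0] | ((uint32_t)d[1] << 8) |
           ((uint32_t)d[2] << 16) | ((uint32_t)d[3] << 24);
}

/* Returns the fixed-channel byte, or 0 when it is absent. */
static uint8_t fixed_chan_or_zero(const uint8_t *cmd, size_t cmd_len)
{
    if (cmd_len < INFO_RSP_HDR_LEN + 1)
        return 0;   /* needs cmd_len >= 5 */
    return cmd[INFO_RSP_HDR_LEN];
}
```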
Fixes: 4e8402a3f8 ("[Bluetooth] Retrieve L2CAP features mask on connection setup")
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
l2cap_ecred_reconf_rsp() casts the incoming data to struct
l2cap_ecred_conn_rsp (the ECRED *connection* response, 8 bytes with
result at offset 6) instead of struct l2cap_ecred_reconf_rsp (2 bytes
with result at offset 0).
This causes two problems:
- The sizeof(*rsp) length check requires 8 bytes instead of the
correct 2, so valid L2CAP_ECRED_RECONF_RSP packets are rejected
with -EPROTO.
- rsp->result reads from offset 6 instead of offset 0, returning
wrong data when the packet is large enough to pass the check.
Fix by using the correct type. Also pass the already byte-swapped
result variable to BT_DBG instead of the raw __le16 field.
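The effect of the wrong cast can be sketched with two userspace structs whose sizes and offsets follow the numbers given above. The struct and function names are illustrative stand-ins, not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the connection response: 8 bytes, result at offset 6. */
struct ecred_conn_rsp {
    uint16_t mtu;
    uint16_t mps;
    uint16_t credits;
    uint16_t result;
};

/* Stand-in for the reconfigure response: 2 bytes, result at offset 0. */
struct ecred_reconf_rsp {
    uint16_t result;
};

/* Hypothetical parser using the correct layout: a 2-byte packet is
 * accepted and the little-endian result is read from offset 0. */
static int parse_reconf_result(const uint8_t *data, size_t len,
                               uint16_t *result)
{
    if (len < sizeof(struct ecred_reconf_rsp))
        return -1;
    *result = (uint16_t)(data[0] | (data[1] << 8));
    return 0;
}
```

A length check written against sizeof(struct ecred_conn_rsp) would demand 8 bytes and reject the valid 2-byte packet that parse_reconf_result() accepts.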
Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Cc: stable@vger.kernel.org
Signed-off-by: Lukas Johannes Möller <research@johannes-moeller.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
After commit ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in
hci_chan_del"), l2cap_conn_del() uses conn->lock to protect access to
conn->users. However, l2cap_register_user() and l2cap_unregister_user()
don't use conn->lock, creating a race condition where these functions can
access conn->users and conn->hchan concurrently with l2cap_conn_del().
This can lead to use-after-free and list corruption bugs, as reported
by syzbot.
Fix this by changing l2cap_register_user() and l2cap_unregister_user()
to use conn->lock instead of hci_dev_lock(), ensuring consistent locking
for the l2cap_conn structure.
Reported-by: syzbot+14b6d57fb728e27ce23c@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=14b6d57fb728e27ce23c
Fixes: ab4eedb790 ("Bluetooth: L2CAP: Fix corrupted list in hci_chan_del")
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Commit 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs") introduced
mgmt_pending_valid(), which not only validates the pending command but
also unlinks it from the pending list if it is valid. This change in
semantics requires updates to several completion handlers to avoid list
corruption and memory safety issues.
This patch addresses two left-over issues from the aforementioned rework:
1. In mgmt_add_adv_patterns_monitor_complete(), mgmt_pending_remove()
is replaced with mgmt_pending_free() in the success path. Since
mgmt_pending_valid() already unlinks the command at the beginning of
the function, calling mgmt_pending_remove() leads to a double list_del()
and subsequent list corruption/kernel panic.
2. In set_mesh_complete(), the use of mgmt_pending_foreach() in the error
path is removed. Since the current command is already unlinked by
mgmt_pending_valid(), this foreach loop would incorrectly target other
pending mesh commands, potentially freeing them while they are still being
processed concurrently (leading to UAFs). The redundant mgmt_cmd_status()
is also simplified to use cmd->opcode directly.
Fixes: 302a1f674c ("Bluetooth: MGMT: Fix possible UAFs")
Signed-off-by: Wang Tao <wangtao554@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
While introducing hci_le_create_conn_sync the functionality
of hci_connect_le was ported to hci_le_create_conn_sync including
the disable of the scan before starting the connection.
When this code was run asynchronously, the immediate call that was
setting the flag HCI_LE_SCAN_INTERRUPTED had an impact, since the
completion handler for the LE_SCAN_DISABLE was not immediately called.
In the completion handler of the LE_SCAN_DISABLE event, this flag is
checked to set the state of the hdev to DISCOVERY_STOPPED.
With the synchronised approach, the later setting of the
HCI_LE_SCAN_INTERRUPTED flag does not have the same effect. The completion
handler would immediately fire in the LE_SCAN_DISABLE call, check for
the flag, which is then not yet set, and do nothing.
To fix this issue and make the function call work as before, we move the
setting of the flag HCI_LE_SCAN_INTERRUPTED before disabling the scan.
Fixes: 8e8b92ee60 ("Bluetooth: hci_sync: Add hci_le_create_conn_sync")
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
iso-tester defer tests seem to fail with hci_conn_hash_lookup_cig
being unable to resolve a cig in set_cig_params_sync due to a race
where it is run immediately before hci_bind_cis is able to set
the QoS settings into the hci_conn object.
So this moves the assigning of the QoS settings to be done directly
by hci_le_set_cig_params to prevent that from happening again.
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The last test step ("Test with Invalid public key X and Y, all set to
0") expects to get a "DHKEY check failed" instead of "unspecified".
Fixes: 6d19628f53 ("Bluetooth: SMP: Fail if remote and local public keys are identical")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Core 6.0, Vol 3, Part A, 3.4.3:
"... If the sum of the payload sizes for the K-frames exceeds the
specified SDU length, the receiver shall disconnect the channel."
This fixes L2CAP/LE/CFC/BV-27-C (running together with 'l2test -r -P
0x0027 -V le_public').
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Core 6.0, Vol 3, Part A, 3.4.3:
"If the SDU length field value exceeds the receiver's MTU, the receiver
shall disconnect the channel..."
This fixes L2CAP/LE/CFC/BV-26-C (running together with 'l2test -r -P
0x0027 -V le_public -I 100').
Fixes: aac23bf636 ("Bluetooth: Implement LE L2CAP reassembly")
Signed-off-by: Christian Eggers <ceggers@arri.de>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pull btrfs fixes from David Sterba:
- detect possible file name hash collision earlier so it does not lead
to transaction abort
- handle b-tree leaf overflows when snapshotting a subvolume with set
received UUID, leading to transaction abort
- in zoned mode, reorder relocation block group initialization after
the transaction kthread start
- fix orphan cleanup state tracking of subvolume, this could lead to
invalid dentries under some conditions
- add locking around updates of dynamic reclaim state
- in subpage mode, add missing RCU unlock when trying to release extent
buffer
- remap tree fixes:
- add missing description strings for the newly added remap tree
- properly update search key when iterating backrefs
* tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: remove duplicated definition of btrfs_printk_in_rcu()
btrfs: remove unnecessary transaction abort in the received subvol ioctl
btrfs: abort transaction on failure to update root in the received subvol ioctl
btrfs: fix transaction abort on set received ioctl due to item overflow
btrfs: fix transaction abort when snapshotting received subvolumes
btrfs: fix transaction abort on file creation due to name hash collision
btrfs: read key again after incrementing slot in move_existing_remaps()
btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer()
btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create
btrfs: zoned: move btrfs_zoned_reserve_data_reloc_bg() after kthread start
btrfs: hold space_info->lock when clearing periodic reclaim ready
btrfs: print-tree: add remap tree definitions
Merge cpupower utility updates, including a fix and improvements of the
existing functionality, for 7.0-rc4.
* pm-tools:
cpupower: Add intel_pstate turbo boost support for Intel platforms
cpupower: Add support for setting EPP via systemd service
cpupower: fix swapped power/energy unit labels
Pull AppArmor fixes from John Johansen:
- fix race between freeing data and fs accessing it
- fix race on unreferenced rawdata dereference
- fix differential encoding verification
- fix unconfined unprivileged local user can do privileged policy management
- Fix double free of ns_name in aa_replace_profiles()
- fix missing bounds check on DEFAULT table in verify_dfa()
- fix side-effect bug in match_char() macro usage
- fix: limit the number of levels of policy namespaces
- replace recursive profile removal with iterative approach
- fix memory leak in verify_header
- validate DFA start states are in bounds in unpack_pdb
* tag 'apparmor-pr-mainline-2026-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: fix race between freeing data and fs accessing it
apparmor: fix race on rawdata dereference
apparmor: fix differential encoding verification
apparmor: fix unprivileged local user can do privileged policy management
apparmor: Fix double free of ns_name in aa_replace_profiles()
apparmor: fix missing bounds check on DEFAULT table in verify_dfa()
apparmor: fix side-effect bug in match_char() macro usage
apparmor: fix: limit the number of levels of policy namespaces
apparmor: replace recursive profile removal with iterative approach
apparmor: fix memory leak in verify_header
apparmor: validate DFA start states are in bounds in unpack_pdb
Merge an ACPI OS services layer (OSL) fix that addresses sparse warnings
in acpi_os_initialize() (Ben Dooks)
* acpi-osl:
ACPI: OSL: fix __iomem type on return from acpi_os_map_generic_address()
Add "do no harm" testing of EFER, CR0, CR4, and CR8 for SEV+ guests to
verify that the guest can read and write the registers, without hitting
e.g. a #VC on SEV-ES guests due to KVM incorrectly trying to intercept a
register.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260310211841.2552361-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's a gap between when the buffer was grabbed and when it
potentially gets recycled, where if the list is empty, someone could've
upgraded it to a ring provided type. This can happen if the request
is forced via io-wq. The legacy recycling is missing checking if the
buffer_list still exists, and if it's of the correct type. Add those
checks.
Cc: stable@vger.kernel.org
Fixes: c7fb19428d ("io_uring: add support for ring mapped supplied buffers")
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
DW UART cannot write to LCR, DLL, and DLH while BUSY is asserted.
Existence of BUSY depends on uart_16550_compatible: if the UART HW is
configured with it, those registers can always be written.
There currently is dw8250_force_idle() which attempts to achieve
non-BUSY state by disabling FIFO, however, the solution is unreliable
when Rx keeps getting more and more characters.
Create a sequence of operations that ensures UART cannot keep BUSY
asserted indefinitely. The new sequence relies on enabling loopback mode
temporarily to prevent incoming Rx characters keeping UART BUSY.
Ensure no Tx is ongoing while the UART switches into the loopback
mode (requires exporting serial8250_fifo_wait_for_lsr_thre() and adding
DMA Tx pause/resume functions).
According to tests performed by Adriana Nicolae <adriana@arista.com>,
simply disabling FIFO or clearing FIFOs only once does not always
ensure BUSY is deasserted but up to two tries may be needed. This could
be related to ongoing Rx of a character (a guess, not known for sure).
Therefore, retry FIFO clearing a few times (the retry limit of 4 is an
arbitrary number, but using, e.g., p->fifosize seems overly large). Tests
performed by others did not exhibit a similar challenge, but it does not
seem harmful to leave the FIFO clearing loop in place for all DW UARTs
with BUSY functionality.
Use the new dw8250_idle_enter/exit() to do divisor writes and LCR
writes. In case of plain LCR writes, opportunistically try to update
LCR first and only invoke dw8250_idle_enter() if the write did not
succeed (it has been observed that in practice most LCR writes do
succeed without complications).
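The opportunistic LCR write with a fallback to the idle sequence can be
sketched roughly as follows. This is a toy model under stated
assumptions, not the driver code: struct toy_uart, toy_write_lcr(),
toy_idle_enter(), toy_idle_exit() and toy_set_lcr() are hypothetical
names, and BUSY is modelled as a flag that drops after a given number
of FIFO clears.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_CLEAR_RETRIES 4    /* arbitrary limit, as noted in the text */

/* Toy model of a DW UART; not the real register interface. */
struct toy_uart {
    uint8_t lcr;
    bool busy;          /* BUSY currently asserted */
    bool loopback;      /* loopback mode enabled */
    int clears_needed;  /* FIFO clears needed until BUSY drops */
    int idle_entries;   /* how often the slow path ran */
};

/* LCR writes are silently dropped while BUSY is asserted. */
static void toy_write_lcr(struct toy_uart *u, uint8_t val)
{
    if (!u->busy)
        u->lcr = val;
}

/* Enter loopback so Rx cannot keep BUSY asserted, then clear FIFOs. */
static bool toy_idle_enter(struct toy_uart *u)
{
    u->idle_entries++;
    u->loopback = true;
    for (int i = 0; i < FIFO_CLEAR_RETRIES && u->busy; i++) {
        if (u->clears_needed > 0)
            u->clears_needed--;
        if (u->clears_needed == 0)
            u->busy = false;
    }
    return !u->busy;
}

static void toy_idle_exit(struct toy_uart *u)
{
    u->loopback = false;
}

/* Try the plain write first; only take the slow path if it did not stick. */
static bool toy_set_lcr(struct toy_uart *u, uint8_t val)
{
    toy_write_lcr(u, val);
    if (u->lcr == val)
        return true;

    toy_idle_enter(u);
    toy_write_lcr(u, val);
    toy_idle_exit(u);
    return u->lcr == val;
}
```

In the model, a write to an idle UART never enters the slow path, while
a write to a BUSY UART enters it once and succeeds as long as BUSY
drops within the retry limit.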
This issue was first reported by qianfan Zhao who put lots of debugging
effort into understanding the solution space.
Fixes: c49436b657 ("serial: 8250_dw: Improve unwritable LCR workaround")
Fixes: 7d4008ebb1 ("tty: add a DesignWare 8250 driver")
Cc: stable <stable@kernel.org>
Reported-by: qianfan Zhao <qianfanguijin@163.com>
Link: https://lore.kernel.org/linux-serial/289bb78a-7509-1c5c-2923-a04ed3b6487d@163.com/
Reported-by: Adriana Nicolae <adriana@arista.com>
Link: https://lore.kernel.org/linux-serial/20250819182322.3451959-1-adriana@arista.com/
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-8-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When DW UART is !uart_16550_compatible, it can indicate BUSY at any
point (when under constant Rx pressure) unless a complex sequence of
steps is performed. Any LCR write can run afoul of the condition
that prevents writing LCR while the UART is BUSY, which triggers
BUSY_DETECT interrupt that seems unmaskable using IER bits.
Normal flow is that dw8250_handle_irq() handles BUSY_DETECT condition
by reading USR register. This BUSY feature, however, breaks the
assumptions made in serial8250_do_shutdown(), which runs
synchronize_irq() after clearing IER and assumes no interrupts can
occur after that point but then proceeds to update LCR, which on DW
UART can trigger an interrupt.
If serial8250_do_shutdown() releases the interrupt handler before the
handler has run and processed the BUSY_DETECT condition by reading the USR
register, the IRQ is not deasserted, resulting in an interrupt storm that
triggers the "irq x: nobody cared" warning, leading to disabling the IRQ.
Add late synchronize_irq() into serial8250_do_shutdown() to ensure
BUSY_DETECT from DW UART is handled before port's interrupt handler is
released. Alternative would be to add DW UART specific shutdown
function but it would mostly duplicate the generic code and the extra
synchronize_irq() seems pretty harmless in serial8250_do_shutdown().
Fixes: 7d4008ebb1 ("tty: add a DesignWare 8250 driver")
Cc: stable <stable@kernel.org>
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-7-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
INTC10EE UART can end up in an interrupt storm where it reports
IIR_NO_INT (0x1). If the storm happens during active UART operation, it
is promptly stopped by IIR value change due to Rx or Tx events.
However, when there is no activity, either due to idle serial line or
due to specific circumstances such as during shutdown that writes
IER=0, there is nothing to stop the storm.
During shutdown the storm is particularly problematic because
serial8250_do_shutdown() calls synchronize_irq(), which hangs waiting
for the storm to finish, and that never happens.
This problem can also result in triggering a warning:
irq 45: nobody cared (try booting with the "irqpoll" option)
[...snip...]
handlers:
serial8250_interrupt
Disabling IRQ #45
Normal means to reset interrupt status by reading LSR, MSR, USR, or RX
register do not result in the UART deasserting the IRQ.
Add a quirk to INTC10EE UARTs to enable Tx interrupts if UART's Tx is
currently empty and inactive. Rework IIR_NO_INT to keep track of the
number of consecutive IIR_NO_INT, and on the fourth one perform the quirk.
Enabling Tx interrupts should change IIR value from IIR_NO_INT to
IIR_THRI which has been observed to stop the storm.
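The counting logic can be sketched as below. This is only an
illustration under assumptions: struct toy_port and toy_handle_iir()
are hypothetical, the IIR value is assumed to be already masked to its
interrupt ID bits, and resetting the counter after the fourth
occurrence is a choice made for the sketch, not something taken from
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IIR_NO_INT          0x01    /* value quoted in the text */
#define IIR_THRI            0x02    /* standard 16550 Tx-empty interrupt ID */
#define NO_INT_QUIRK_LIMIT  4       /* act on the fourth consecutive one */

/* Hypothetical per-port state for the sketch. */
struct toy_port {
    unsigned int no_int_count;  /* consecutive IIR_NO_INT seen */
    bool tx_empty_and_idle;     /* Tx currently empty and inactive */
    bool tx_irq_enabled;        /* set when the quirk enables Tx interrupts */
};

/* Returns true when the quirk was performed for this interrupt. */
static bool toy_handle_iir(struct toy_port *p, uint8_t iir)
{
    if (iir != IIR_NO_INT) {
        p->no_int_count = 0;    /* any real event breaks the streak */
        return false;
    }

    if (++p->no_int_count < NO_INT_QUIRK_LIMIT)
        return false;

    p->no_int_count = 0;
    if (!p->tx_empty_and_idle)
        return false;

    /* Expected to move IIR from IIR_NO_INT to IIR_THRI. */
    p->tx_irq_enabled = true;
    return true;
}
```

Three IIR_NO_INT readings in a row do nothing in this model; the fourth
enables Tx interrupts, and any other IIR value in between restarts the
count.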
Fixes: e92fad0249 ("serial: 8250_dw: Add ACPI ID for Granite Rapids-D UART")
Cc: stable <stable@kernel.org>
Reported-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-6-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dw8250_handle_irq() takes port's lock multiple times with no good
reason to release it in between and calls serial8250_handle_irq()
that also takes port's lock.
Take port's lock only once in dw8250_handle_irq() and use
serial8250_handle_irq_locked() to avoid releasing port's lock in
between.
As IIR_NO_INT check in serial8250_handle_irq() was outside of port's
lock, it has to be done already in dw8250_handle_irq().
DW UART can, in addition to IIR_NO_INT, report BUSY_DETECT (0x7) which
collided with the IIR_NO_INT (0x1) check in serial8250_handle_irq()
(because & is used instead of ==) meaning that no other work is done by
serial8250_handle_irq() during a BUSY_DETECT interrupt.
This allows reorganizing code in dw8250_handle_irq() to do both
IIR_NO_INT and BUSY_DETECT handling right at the start simplifying
the logic.
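The collision between the two IIR values is plain bit arithmetic and
can be shown in isolation. The helper names below are hypothetical; the
two constants are the values quoted above, and the exact-match variant
assumes the IIR value has already been masked to its interrupt ID bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IIR_NO_INT      0x01
#define IIR_BUSY_DETECT 0x07

/* Bitwise test: true for every IIR value that has bit 0 set. */
static bool no_int_bitwise(uint8_t iir)
{
    return (iir & IIR_NO_INT) != 0;
}

/* Exact comparison: true only for IIR_NO_INT itself. */
static bool no_int_exact(uint8_t iir)
{
    return iir == IIR_NO_INT;
}
```

Since 0x07 & 0x01 is non-zero, the bitwise form also treats
BUSY_DETECT as "no interrupt", while the exact comparison keeps the two
cases apart.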
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Cc: stable <stable@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-5-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
8250_port exports serial8250_handle_irq() to HW specific 8250 drivers.
It takes port's lock within but a HW specific 8250 driver may want to
take port's lock itself, do something, and then call the generic
handler in 8250_port but to do that, the caller has to release port's
lock for no good reason.
Introduce serial8250_handle_irq_locked() which a HW specific driver can
call while already holding port's lock.
As this is a new export, put it straight into a namespace (where all 8250
exports should eventually be moved).
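The locked/unlocked split follows a common pattern, sketched here with
a toy lock so it can be compiled on its own. The names (toy_port,
toy_lock(), toy_handle_irq_locked(), toy_hw_irq()) are hypothetical and
the lock is a plain counter, not a real spinlock:

```c
#include <assert.h>

/* Toy port with a non-recursive lock modelled as a depth counter. */
struct toy_port {
    int lock_depth;         /* 1 while held, 0 otherwise */
    int lock_acquisitions;  /* how many times the lock was taken */
    int handled;            /* how many interrupts were processed */
};

static void toy_lock(struct toy_port *p)
{
    assert(p->lock_depth == 0);     /* taking it twice would deadlock */
    p->lock_depth = 1;
    p->lock_acquisitions++;
}

static void toy_unlock(struct toy_port *p)
{
    assert(p->lock_depth == 1);
    p->lock_depth = 0;
}

/* Generic handler body: the caller must already hold the lock. */
static int toy_handle_irq_locked(struct toy_port *p)
{
    assert(p->lock_depth == 1);
    p->handled++;
    return 1;
}

/* Generic entry point for callers that do not hold the lock. */
static int toy_handle_irq(struct toy_port *p)
{
    int ret;

    toy_lock(p);
    ret = toy_handle_irq_locked(p);
    toy_unlock(p);
    return ret;
}

/* HW specific handler: one lock section covers both parts. */
static int toy_hw_irq(struct toy_port *p)
{
    int ret;

    toy_lock(p);
    /* HW specific work under the lock would go here. */
    ret = toy_handle_irq_locked(p);
    toy_unlock(p);
    return ret;
}
```

With the _locked variant, the HW specific handler takes the lock once
per interrupt instead of dropping it just so the generic handler can
take it again.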
Tested-by: Bandal, Shankar <shankar.bandal@intel.com>
Tested-by: Murthy, Shanth <shanth.murthy@intel.com>
Cc: stable <stable@kernel.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20260203171049.4353-4-ilpo.jarvinen@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
uart_write_room() and uart_write() behave inconsistently when
xmit_buf is NULL (which happens for PORT_UNKNOWN ports that were
never properly initialized):
- uart_write_room() returns kfifo_avail() which can be > 0
- uart_write() checks xmit_buf and returns 0 if NULL
This inconsistency causes an infinite loop in drivers that rely on
tty_write_room() to determine if they can write:
while (tty_write_room(tty) > 0) {
written = tty->ops->write(...);
// written is always 0, loop never exits
}
For example, caif_serial's handle_tx() enters an infinite loop when
used with PORT_UNKNOWN serial ports, causing system hangs.
Fix by making uart_write_room() also check xmit_buf and return 0 if
it's NULL, consistent with uart_write().
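A self-contained model of the fixed behaviour, for illustration only.
The toy_* names are hypothetical and the kfifo is replaced by a flat
buffer with a fill level; the point is that both functions now agree
when xmit_buf is NULL, so a caller loop like the one quoted above
terminates:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy port: xmit_buf == NULL models a never-initialized port. */
struct toy_port {
    char *xmit_buf;
    size_t size;
    size_t used;
};

static size_t toy_write_room(const struct toy_port *p)
{
    if (!p->xmit_buf)       /* the added check */
        return 0;
    return p->size - p->used;
}

static size_t toy_write(struct toy_port *p, const char *data, size_t len)
{
    size_t room;

    if (!p->xmit_buf)
        return 0;

    room = p->size - p->used;
    if (len > room)
        len = room;
    memcpy(p->xmit_buf + p->used, data, len);
    p->used += len;
    return len;
}

/* Caller loop in the style of the example; returns bytes written. */
static size_t toy_drain(struct toy_port *p, const char *data, size_t len)
{
    size_t done = 0;

    while (done < len && toy_write_room(p) > 0)
        done += toy_write(p, data + done, len - done);
    return done;
}
```

For a port without a buffer, toy_write_room() reports 0 and the loop is
never entered; for a port with a buffer, every iteration makes
progress.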
Reproducer: https://gist.github.com/mrpre/d9a694cc0e19828ee3bc3b37983fde13
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Cc: stable <stable@kernel.org>
Link: https://patch.msgid.link/20260204074327.226165-1-jiayuan.chen@linux.dev
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ulite_probe() calls pm_runtime_put_autosuspend() at the end of probe
without holding a corresponding PM runtime reference for non-console
ports.
During ulite_assign(), uart_add_one_port() triggers uart_configure_port()
which calls ulite_pm() via uart_change_pm(). For non-console ports, the
UART core performs a balanced get/put cycle:
uart_change_pm(ON) -> ulite_pm() -> pm_runtime_get_sync() +1
uart_change_pm(OFF) -> ulite_pm() -> pm_runtime_put_autosuspend() -1
This leaves no spare reference for the pm_runtime_put_autosuspend() at
the end of probe. The PM runtime core prevents the count from actually
going below zero, and instead triggers a
"Runtime PM usage count underflow!" warning.
For console ports the bug is masked: the UART core skips the
uart_change_pm(OFF) call, so the UART core's unbalanced get happens to
pair with probe's trailing put.
Add pm_runtime_get_noresume() before pm_runtime_enable() to take an
explicit probe-owned reference that the trailing
pm_runtime_put_autosuspend() can release. This ensures a correct usage
count regardless of whether the port is a console.
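The reference counting described above can be replayed with a small
model. This is a sketch under assumptions: toy_dev, toy_get(),
toy_put(), toy_core_add_port() and toy_probe() are hypothetical
stand-ins for the PM runtime usage counter and the call sequence in the
text, and an underflow is only counted rather than warned about.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy usage counter that records underflow attempts. */
struct toy_dev {
    int usage;
    int underflows;
};

static void toy_get(struct toy_dev *d)
{
    d->usage++;
}

static void toy_put(struct toy_dev *d)
{
    if (d->usage == 0) {    /* the count never goes below zero */
        d->underflows++;
        return;
    }
    d->usage--;
}

/* Models the UART core's power management calls while adding a port. */
static void toy_core_add_port(struct toy_dev *d, bool console)
{
    toy_get(d);             /* uart_change_pm(ON) */
    if (!console)
        toy_put(d);         /* uart_change_pm(OFF), skipped for consoles */
}

/* Models probe, with or without the probe-owned reference. */
static void toy_probe(struct toy_dev *d, bool console, bool probe_ref)
{
    if (probe_ref)
        toy_get(d);         /* stands in for pm_runtime_get_noresume() */
    toy_core_add_port(d, console);
    toy_put(d);             /* trailing put at the end of probe */
}
```

Without the probe-owned reference, the non-console case ends with one
underflow in this model; with it, neither case underflows, and the
console case ends with the one reference left over from the UART
core's get.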
Fixes: 5bbe10a694 ("tty: serial: uartlite: Add runtime pm support")
Cc: stable <stable@kernel.org>
Signed-off-by: Maciej Andrzejewski ICEYE <maciej.andrzejewski@m-works.net>
Link: https://patch.msgid.link/20260305123746.4152800-1-maciej.andrzejewski@m-works.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 039d492637 ("serial: 8250: Toggle IER bits on only after irq
has been set up") moved IRQ setup before the THRE test, in combination
with commit 205d300aea ("serial: 8250: change lock order in
serial8250_do_startup()") the interrupt handler can run during the
test and race with its IIR reads. This can produce wrong THRE test
results and cause spurious registration of the
serial8250_backup_timeout timer. Unconditionally disable the IRQ for
the short duration of the test and re-enable it afterwards to avoid
the race.
Fixes: 039d492637 ("serial: 8250: Toggle IER bits on only after irq has been set up")
Depends-on: 205d300aea ("serial: 8250: change lock order in serial8250_do_startup()")
Cc: stable <stable@kernel.org>
Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Alban Bedel <alban.bedel@lht.dlh.de>
Tested-by: Maximilian Lueer <maximilian.lueer@lht.dlh.de>
Link: https://patch.msgid.link/20260224121639.579404-1-alban.bedel@lht.dlh.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
`dmaengine_terminate_async` does not guarantee that the
`__dma_tx_complete` callback will run. The callback is currently the
only place where `dma->tx_running` gets cleared. If the transaction is
canceled and the callback never runs, then `dma->tx_running` will never
get cleared and we will never schedule new TX DMA transactions again.
This change makes it so we clear `dma->tx_running` after we terminate
the DMA transaction. This is "safe" because `serial8250_tx_dma_flush`
is holding the UART port lock. The first thing the callback does is also
grab the UART port lock, so access to `dma->tx_running` is serialized.
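The stuck-flag problem can be illustrated with a small model. The names
(toy_dma, toy_tx_start(), toy_tx_flush()) are hypothetical, locking is
left out, and the terminate call is only a comment; the sketch shows
why clearing the flag in the flush path matters when the completion
callback never runs:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy Tx DMA state. */
struct toy_dma {
    bool tx_running;    /* a transfer is considered in flight */
    int submitted;      /* number of transfers started */
};

/* Start a transfer unless one is still marked as running. */
static bool toy_tx_start(struct toy_dma *d)
{
    if (d->tx_running)
        return false;
    d->tx_running = true;
    d->submitted++;
    return true;
}

/*
 * Flush: the transfer is terminated here. The completion callback is
 * not guaranteed to run afterwards, so it cannot be relied upon to
 * clear tx_running.
 */
static void toy_tx_flush(struct toy_dma *d, bool clear_after_terminate)
{
    if (!d->tx_running)
        return;
    /* the asynchronous terminate call would go here */
    if (clear_after_terminate)
        d->tx_running = false;
}
```

With clear_after_terminate set to false, the model reproduces the old
behaviour: after a flush, no further transfer can be started.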
Fixes: 9e512eaaf8 ("serial: 8250: Fix fifo underflow on flush")
Cc: stable <stable@kernel.org>
Signed-off-by: Raul E Rangel <rrangel@google.com>
Link: https://patch.msgid.link/20260209135815.1.I16366ecb0f62f3c96fe3dd5763fcf6f3c2b4d8cd@changeid
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Starting with the upcoming Rust 1.96.0 (to be released 2026-05-28),
`rustc` introduces the new lint `unused_features` [1], which warns [2]:
warning: feature `used_with_arg` is declared but not used
--> <crate attribute>:1:93
|
1 | #![feature(asm_const,asm_goto,arbitrary_self_types,lint_reasons,offset_of_nested,raw_ref_op,used_with_arg)]
| ^^^^^^^^^^^^^
|
= note: `#[warn(unused_features)]` (part of `#[warn(unused)]`) on by default
The original goal of using `-Zcrate-attr` automatically was that there
is a consistent set of features enabled and managed globally for all
Rust kernel code (modulo exceptions like the `rust/` crates).
While we could require crates to enable features manually (even if we
still keep the `-Zallow-features=` list, i.e. removing the `-Zcrate-attr`
list), it is not really worth making all developers worry about it just
for a new lint.
The features are expected to eventually become stable anyway (most already
did), and thus having to remove features in every file that may use them
is not worth it either.
Thus just allow the new lint globally.
The lint actually existed for a long time, which is why `rustc` does
not complain about an unknown lint in the stable versions we support,
but it was "disabled" years ago [3], and now it was made to work again.
For extra context, the new implementation of the lint has already been
improved to avoid linting about features that became stable thanks to
Benno's report and the ensuing discussion [4] [5], but while that helps,
it is still the case that we may have features enabled that are not used
for one reason or another in a particular crate.
Cc: stable@vger.kernel.org # Needed in 6.12.y and later (Rust is pinned in older LTSs).
Link: https://github.com/rust-lang/rust/pull/152164 [1]
Link: https://github.com/Rust-for-Linux/pin-init/pull/114 [2]
Link: https://github.com/rust-lang/rust/issues/44232 [3]
Link: https://github.com/rust-lang/rust/issues/153523 [4]
Link: https://github.com/rust-lang/rust/pull/153610 [5]
Reviewed-by: Benno Lossin <lossin@kernel.org>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260312111014.74198-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
The number of domains is set to the wrong value, which leads to
failures when trying to enumerate nested power domains.
PM: genpd_xlate_onecell: invalid domain index 0
PM: genpd_xlate_onecell: invalid domain index 1
PM: genpd_xlate_onecell: invalid domain index 3
PM: genpd_xlate_onecell: invalid domain index 4
PM: genpd_xlate_onecell: invalid domain index 5
PM: genpd_xlate_onecell: invalid domain index 13
PM: genpd_xlate_onecell: invalid domain index 14
Attempts to use these power domains fail, so fix this by
using the correct value of calculated power domains.
Signed-off-by: Adam Ford <aford173@gmail.com>
Fixes: 88914db077 ("pmdomain: mediatek: Add support for Hardware Voter power domains")
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Cc: stable@vger.kernel.org
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
The Touch Bar driver was missing a resume method to restore the mode
the Touch Bar was in before suspending. Add one; it is the same as the
reset_resume method.
[jkosina@suse.com: rebased on top of the pm_ptr() conversion]
Cc: stable@vger.kernel.org
Signed-off-by: Aditya Garg <gargaditya08@live.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The Logitech MX Master 4 can be connected over bluetooth or through a
Logitech Bolt receiver. This change adds support for non-standard HID
features, such as high resolution scrolling when the mouse is connected
over bluetooth.
Because no Logitech Bolt receiver driver exists yet those features
won't be available when the mouse is connected through the receiver.
Signed-off-by: Adrian Freund <adrian@freund.io>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
ASoC: Fixes for v7.0
Quite a large pull request, but nothing too concerning here - everything
is fairly small. We've got a couple of smaller core fixes for races on
card teardown from Matteo Cotifava, a fix for handling dodgy DMI
information generated by u-boot, some driver specific fixes and some new
device IDs for Tegra.
When running the command:
'perf record -e "{instructions,instructions:p}" -j any,counter sleep 1',
a "shift-out-of-bounds" warning is reported on CWF.
UBSAN: shift-out-of-bounds in /kbuild/src/consumer/arch/x86/events/intel/lbr.c:970:15
shift exponent 64 is too large for 64-bit type 'long long unsigned int'
......
intel_pmu_lbr_counters_reorder.isra.0.cold+0x2a/0xa7
intel_pmu_lbr_save_brstack+0xc0/0x4c0
setup_arch_pebs_sample_data+0x114b/0x2400
The warning occurs because the second "instructions:p" event, which
involves branch counters sampling, is incorrectly programmed to fixed
counter 0 instead of the general-purpose (GP) counters 0-3 that support
branch counters sampling. Currently only GP counters 0-3 support branch
counters sampling on CWF, so any event involving branch counters sampling
should be programmed on GP counters 0-3. Since the counter index of fixed
counter 0 is 32, the "src" value in the code below is right-shifted by
64 bits, which triggers the "shift-out-of-bounds" warning.
cnt = (src >> (order[j] * LBR_INFO_BR_CNTR_BITS)) & LBR_INFO_BR_CNTR_MASK;
The root cause is the loss of the branch counters constraint for the
new event in the branch counters sampling event group, since it isn't
yet part of the sibling list. This results in the second
"instructions:p" event being programmed on fixed counter 0 incorrectly
instead of the appropriate GP counters 0-3.
To address this, we apply the missing branch counters constraint for
the last event in the group. Additionally, we introduce a new function,
`intel_set_branch_counter_constr()`, to apply the branch counters
constraint and avoid code duplication.
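The arithmetic behind the warning can be checked in isolation. In this
sketch the field width of 2 bits is inferred from the numbers above (a
counter index of 32 producing a shift of 64); the names are
hypothetical, and the bounds check exists only to keep the example free
of undefined behaviour. It is not the fix, which is the constraint
described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BR_CNTR_BITS    2   /* inferred: 32 * 2 == 64 */
#define BR_CNTR_MASK    ((1ULL << BR_CNTR_BITS) - 1)
#define NUM_BR_CNTRS    4   /* GP counters 0-3 */
#define FIXED_CTR0_IDX  32  /* counter index of fixed counter 0 */

/* Shift amount used to extract the field for a given counter index. */
static unsigned int br_cntr_shift(unsigned int idx)
{
    return idx * BR_CNTR_BITS;
}

/* Only GP counters 0-3 carry a branch counter field. */
static bool idx_has_br_cntr(unsigned int idx)
{
    return idx < NUM_BR_CNTRS;
}

/* Extract one field; refuses indexes that have no field. */
static bool extract_br_cntr(uint64_t src, unsigned int idx, uint64_t *cnt)
{
    if (!idx_has_br_cntr(idx))
        return false;
    *cnt = (src >> br_cntr_shift(idx)) & BR_CNTR_MASK;
    return true;
}
```

For index 32 the shift amount is 64, which is out of range for a 64-bit
value; for indexes 0-3 the shift stays between 0 and 6.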
Fixes: 3374491619 ("perf/x86/intel: Support branch counters logging")
Reported-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260228053320.140406-2-dapeng1.mi@linux.intel.com
Cc: stable@vger.kernel.org
Oliver reported that x86_pmu_del() ended up doing an out-of-bound memory access
when group_sched_in() fails and needs to roll back.
This *should* be handled by the transaction callbacks, but he found that when
the group leader is a software event, the transaction handlers of the wrong PMU
are used, despite the move_group case in perf_event_open() and group_sched_in()
using pmu_ctx->pmu.
Turns out, inherit uses event->pmu to clone the events, effectively undoing the
move_group case for all inherited contexts. Fix this by also making inherit use
pmu_ctx->pmu, ensuring all inherited counters end up in the same pmu context.
Similarly, __perf_event_read() should equally use pmu_ctx->pmu for the
group case.
Fixes: bd27568117 ("perf: Rewrite core context handling")
Reported-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/20260309133713.GB606826@noisy.programming.kicks-ass.net
Both Mi Dapeng and Ian Rogers noted that not everything that sets HES_STOPPED
is required to EF_UPDATE. Specifically the 'step 1' loop of rescheduling
explicitly does EF_UPDATE to ensure the counter value is read.
However, then 'step 2' simply leaves the new counter uninitialized when
HES_STOPPED, even though, as noted above, the thing that stopped them might not
be aware it needs to EF_RELOAD -- since it didn't EF_UPDATE on stop.
One such location that is affected is throttling, throttle does pmu->stop(, 0);
and unthrottle does pmu->start(, 0); possibly restarting an uninitialized counter.
Fixes: a4eaf7f146 ("perf: Rework the PMU methods")
Reported-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reported-by: Ian Rogers <irogers@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://patch.msgid.link/20260311204035.GX606826@noisy.programming.kicks-ass.net
A production AMD EPYC system crashed with a NULL pointer dereference
in the PMU NMI handler:
BUG: kernel NULL pointer dereference, address: 0000000000000198
RIP: x86_perf_event_update+0xc/0xa0
Call Trace:
<NMI>
amd_pmu_v2_handle_irq+0x1a6/0x390
perf_event_nmi_handler+0x24/0x40
The faulting instruction is `cmpq $0x0, 0x198(%rdi)` with RDI=0,
corresponding to the `if (unlikely(!hwc->event_base))` check in
x86_perf_event_update() where hwc = &event->hw and event is NULL.
drgn inspection of the vmcore on CPU 106 showed a mismatch between
cpuc->active_mask and cpuc->events[]:
active_mask: 0x1e (bits 1, 2, 3, 4)
events[1]: 0xff1100136cbd4f38 (valid)
events[2]: 0x0 (NULL, but active_mask bit 2 set)
events[3]: 0xff1100076fd2cf38 (valid)
events[4]: 0xff1100079e990a90 (valid)
The event that should occupy events[2] was found in event_list[2]
with hw.idx=2 and hw.state=0x0, confirming x86_pmu_start() had run
(which clears hw.state and sets active_mask) but events[2] was
never populated.
Another event (event_list[0]) had hw.state=0x7 (STOPPED|UPTODATE|ARCH),
showing it was stopped when the PMU rescheduled events, confirming the
throttle-then-reschedule sequence occurred.
The root cause is commit 7e772a93eb ("perf/x86: Fix NULL event access
and potential PEBS record loss") which moved the cpuc->events[idx]
assignment out of x86_pmu_start() and into step 2 of x86_pmu_enable(),
after the PERF_HES_ARCH check. This broke any path that calls
pmu->start() without going through x86_pmu_enable() -- specifically
the unthrottle path:
perf_adjust_freq_unthr_events()
-> perf_event_unthrottle_group()
-> perf_event_unthrottle()
-> event->pmu->start(event, 0)
-> x86_pmu_start() // sets active_mask but not events[]
The race sequence is:
1. A group of perf events overflows, triggering group throttle via
perf_event_throttle_group(). All events are stopped: active_mask
bits cleared, events[] preserved (x86_pmu_stop no longer clears
events[] after commit 7e772a93eb).
2. While still throttled (PERF_HES_STOPPED), x86_pmu_enable() runs
due to other scheduling activity. Stopped events that need to
move counters get PERF_HES_ARCH set and events[old_idx] cleared.
In step 2 of x86_pmu_enable(), PERF_HES_ARCH causes these events
to be skipped -- events[new_idx] is never set.
3. The timer tick unthrottles the group via pmu->start(). Since
commit 7e772a93eb removed the events[] assignment from
x86_pmu_start(), active_mask[new_idx] is set but events[new_idx]
remains NULL.
4. A PMC overflow NMI fires. The handler iterates active counters,
finds active_mask[2] set, reads events[2] which is NULL, and
crashes dereferencing it.
Move the cpuc->events[hwc->idx] assignment in x86_pmu_enable() to
before the PERF_HES_ARCH check, so that events[] is populated even
for events that are not immediately started. This ensures the
unthrottle path via pmu->start() always finds a valid event pointer.
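The effect of the reordering can be replayed with a reduced model. All
names below (toy_event, toy_cpuc, toy_start(), toy_enable_step2(),
toy_nmi_walk_ok()) are hypothetical, the state bits follow the 0x7 =
STOPPED|UPTODATE|ARCH decoding given above, and only the part of step 2
that matters here is modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HES_STOPPED 0x1
#define HES_ARCH    0x4
#define NUM_CTRS    8

struct toy_event {
    int idx;                /* assigned counter index */
    unsigned int state;
};

struct toy_cpuc {
    struct toy_event *events[NUM_CTRS];
    unsigned int active_mask;
};

/* Models the start callback: marks the counter active, nothing else. */
static void toy_start(struct toy_cpuc *c, struct toy_event *e)
{
    e->state = 0;
    c->active_mask |= 1u << e->idx;
}

/* One event in step 2 of enable, with the old or the new ordering. */
static void toy_enable_step2(struct toy_cpuc *c, struct toy_event *e,
                             bool assign_before_check)
{
    if (assign_before_check)
        c->events[e->idx] = e;      /* new: always populate events[] */

    if (e->state & HES_ARCH)
        return;                     /* stopped event: not started now */

    if (!assign_before_check)
        c->events[e->idx] = e;      /* old: only reached when starting */
    toy_start(c, e);
}

/* Models the NMI walk: false if an active counter has no event. */
static bool toy_nmi_walk_ok(const struct toy_cpuc *c)
{
    for (int i = 0; i < NUM_CTRS; i++) {
        if ((c->active_mask & (1u << i)) && c->events[i] == NULL)
            return false;
    }
    return true;
}
```

Running step 2 on a throttled event and then starting it directly, as
the unthrottle path does, leaves an active counter without an event in
the old ordering and a consistent state in the new one.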
Fixes: 7e772a93eb ("perf/x86: Fix NULL event access and potential PEBS record loss")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260310-perf-v2-1-4a3156fce43c@debian.org
There are two versions of the __this_cpu_local_lock() definitions in
include/linux/local_lock_internal.h: one version that relies on the
Clang overloading functionality and another version that does not.
Select the latter version when using sparse. This patch fixes the
following errors reported by sparse:
include/linux/local_lock_internal.h:331:40: sparse: sparse: multiple definitions for function '__this_cpu_local_lock'
include/linux/local_lock_internal.h:325:37: sparse: the previous one is here
Closes: https://lore.kernel.org/oe-kbuild-all/202603062334.wgI5htP0-lkp@intel.com/
Fixes: d3febf16de ("locking/local_lock: Support Clang's context analysis")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://patch.msgid.link/20260311231455.1961413-1-bvanassche@acm.org
We use a unit struct `__InitOk` in the closure generated by the
initializer macros as the return value. We shadow it by creating a
struct with the same name again inside of the closure, preventing early
returns of `Ok` in the initializer (before all fields have been
initialized).
In the face of Type Alias Impl Trait (TAIT) and the next trait solver,
this solution no longer works [1]. The shadowed struct can be named
through type inference. In addition, there is an RFC proposing to add
the feature of path inference to Rust, which would similarly allow this [2].
Thus remove the shadowed token and replace it with a token that is
`unsafe` to create.
The reason we initially used the shadowing solution was because an
alternative solution used a builder pattern. Gary writes [3]:
In the early builder-pattern based InitOk, having a single InitOk
type for token is unsound because one can launder an InitOk token
used for one place to another initializer. I used a branded lifetime
solution, and then you figured out that using a shadowed type would
work better because nobody could construct it at all.
The laundering issue does not apply to the approach we ended up with
today.
With this change, the example by Tim Chirananthavat in [1] no longer
compiles and results in this error:
error: cannot construct `pin_init::__internal::InitOk` with struct literal syntax due to private fields
--> src/main.rs:26:17
|
26 | InferredType {}
| ^^^^^^^^^^^^
|
= note: private field `0` that was not provided
help: you might have meant to use the `new` associated function
|
26 - InferredType {}
26 + InferredType::new()
|
Applying the suggestion of using the `::new()` function results in
another expected error:
error[E0133]: call to unsafe function `pin_init::__internal::InitOk::new` is unsafe and requires unsafe block
--> src/main.rs:26:17
|
26 | InferredType::new()
| ^^^^^^^^^^^^^^^^^^^ call to unsafe function
|
= note: consult the function's documentation for information on how to avoid undefined behavior
Reported-by: Tim Chirananthavat <theemathas@gmail.com>
Link: https://github.com/rust-lang/rust/issues/153535 [1]
Link: https://github.com/rust-lang/rfcs/pull/3444#issuecomment-4016145373 [2]
Link: https://github.com/rust-lang/rust/issues/153535#issuecomment-4017620804 [3]
Fixes: fc6c6baa1f ("rust: init: add initialization macros")
Cc: stable@vger.kernel.org
Signed-off-by: Benno Lossin <lossin@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260311105056.1425041-1-lossin@kernel.org
[ Added period as mentioned. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
net->xfrm.nlsk is used in 2 types of contexts:
- fully under RCU, with rcu_read_lock + rcu_dereference and a NULL check
- in the netlink handlers, with requests coming from a userspace socket
In the 2nd case, net->xfrm.nlsk is guaranteed to stay non-NULL and the
object is alive, since we can't enter the netns destruction path while
the user socket holds a reference on the netns.
After adding the __rcu annotation to netns_xfrm.nlsk (which silences
sparse warnings in the RCU users and __net_init code), we need to tell
sparse that the 2nd case is safe. Add a helper for that.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In xfrm_policy_init:
add rcu_assign_pointer to fix warning:
net/xfrm/xfrm_policy.c:4238:29: warning: incorrect type in assignment (different address spaces)
net/xfrm/xfrm_policy.c:4238:29: expected struct hlist_head [noderef] __rcu *table
net/xfrm/xfrm_policy.c:4238:29: got struct hlist_head *
add rcu_dereference_protected to silence warning:
net/xfrm/xfrm_policy.c:4265:36: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4265:36: expected struct hlist_head *n
net/xfrm/xfrm_policy.c:4265:36: got struct hlist_head [noderef] __rcu *table
The netns is being created, no concurrent access is possible yet.
In xfrm_policy_fini, net is going away, there shouldn't be any
concurrent changes to the hashtables, so we can use
rcu_dereference_protected to silence warnings:
net/xfrm/xfrm_policy.c:4291:17: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4291:17: expected struct hlist_head const *h
net/xfrm/xfrm_policy.c:4291:17: got struct hlist_head [noderef] __rcu *table
net/xfrm/xfrm_policy.c:4292:36: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_policy.c:4292:36: expected struct hlist_head *n
net/xfrm/xfrm_policy.c:4292:36: got struct hlist_head [noderef] __rcu *table
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Silence sparse warnings in xfrm_state_fini:
net/xfrm/xfrm_state.c:3327:9: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:3327:9: expected struct hlist_head const *h
net/xfrm/xfrm_state.c:3327:9: got struct hlist_head [noderef] __rcu *state_byseq
Add xfrm_state_deref_netexit() to wrap those calls. The netns is going
away, we don't have to worry about the state_by* pointers being
changed behind our backs.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
xfrm_state_lookup_spi_proto is called under xfrm_state_lock by
xfrm_alloc_spi, so there is no need to take a reference on the state and pretend
to be under RCU.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We're under xfrm_state_lock for all those walks, we can use
xfrm_state_deref_prot to silence sparse warnings such as:
net/xfrm/xfrm_state.c:933:17: warning: dereference of noderef expression
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Use rcu_assign_pointer, and tmp variables for freeing on the error
path without accessing net->xfrm.state_by*.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
In all callers, x is not an __rcu pointer. We can drop the annotation to
avoid sparse warnings:
net/xfrm/xfrm_state.c:58:39: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:58:39: expected struct refcount_struct [usertype] *r
net/xfrm/xfrm_state.c:58:39: got struct refcount_struct [noderef] __rcu *
net/xfrm/xfrm_state.c:1166:42: warning: incorrect type in argument 1 (different address spaces)
net/xfrm/xfrm_state.c:1166:42: expected struct xfrm_state [noderef] __rcu *x
net/xfrm/xfrm_state.c:1166:42: got struct xfrm_state *[assigned] x
(repeated for each caller)
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The new PowerPC VMX fast path (__copy_tofrom_user_power7_vmx) is not
exercised by existing copyloops selftests. This patch updates
the selftest to exercise the VMX variant, ensuring the VMX copy path
is validated.
Changes include:
- COPY_LOOP=test___copy_tofrom_user_power7_vmx with -D VMX_TEST is used
in existing selftest build targets.
- Inclusion of ../utils.c to provide get_auxv_entry() for hardware
feature detection.
- At runtime, the test skips execution if Altivec is not available.
- Copy sizes above VMX_COPY_THRESHOLD are used to ensure the VMX
path is taken.
This enables validation of the VMX fast path without affecting systems
that do not support Altivec.
Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260304122201.153049-2-sayalip@linux.ibm.com
On powerpc with PREEMPT_FULL or PREEMPT_LAZY and function tracing enabled,
KUAP warnings can be triggered from the VMX usercopy path under memory
stress workloads.
KUAP requires that no subfunctions are called once userspace access has
been enabled. The existing VMX copy implementation violates this
requirement by invoking enter_vmx_usercopy() from the assembly path after
userspace access has already been enabled. If preemption occurs
in this window, the AMR state may not be preserved correctly,
leading to unexpected userspace access state and resulting in
KUAP warnings.
Fix this by restructuring the VMX usercopy flow so that VMX selection
and VMX state management are centralized in raw_copy_tofrom_user(),
which is invoked by the raw_copy_{to,from,in}_user() wrappers.
The new flow is:
- raw_copy_{to,from,in}_user() calls raw_copy_tofrom_user()
- raw_copy_tofrom_user() decides whether to use the VMX path
based on size and CPU capability
- Call enter_vmx_usercopy() before enabling userspace access
- Enable userspace access as per the copy direction
and perform the VMX copy
- Disable userspace access as per the copy direction
- Call exit_vmx_usercopy()
- Fall back to the base copy routine if the VMX copy faults
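As a rough illustration of the ordering above, the following userspace sketch models the flow with stand-in helpers. Every name in it (copy_tofrom_user_sketch(), sk_enter_vmx(), SK_VMX_THRESHOLD and so on) and the threshold value are assumptions made for the example, not the powerpc implementation. The asserts only encode the rule that no helper runs while user access is open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are the real powerpc symbols. */
#define SK_VMX_THRESHOLD 4096	/* assumed cutoff, for illustration only */

static bool sk_user_access_open;	/* models the KUAP unlocked window */
static bool sk_vmx_active;

static bool sk_enter_vmx(void)
{
	assert(!sk_user_access_open);	/* helper must run before unlocking */
	sk_vmx_active = true;
	return true;
}

static void sk_exit_vmx(void)
{
	assert(!sk_user_access_open);	/* and only after locking again */
	sk_vmx_active = false;
}

static void sk_allow_user_access(void)   { sk_user_access_open = true; }
static void sk_prevent_user_access(void) { sk_user_access_open = false; }

/* Copy routines only copy; they return the number of bytes NOT copied. */
static size_t sk_copy_vmx(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

static size_t sk_copy_base(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

size_t copy_tofrom_user_sketch(void *dst, const void *src, size_t n,
			       bool cpu_has_vmx)
{
	size_t left;

	if (cpu_has_vmx && n > SK_VMX_THRESHOLD && sk_enter_vmx()) {
		sk_allow_user_access();
		left = sk_copy_vmx(dst, src, n);
		sk_prevent_user_access();
		sk_exit_vmx();
		if (!left)
			return 0;
		/* VMX copy faulted: fall through to the base routine */
	}

	sk_allow_user_access();
	left = sk_copy_base(dst, src, n);
	sk_prevent_user_access();
	return left;
}
```

In this model the VMX enter/exit helpers always sit outside the allow/prevent pair, which is the property the restructuring is meant to guarantee.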
With this change, the VMX assembly routines no longer perform VMX state
management or call helper functions; they only implement the
copy operations.
The previous feature-section based VMX selection inside
__copy_tofrom_user_power7() is removed, and a dedicated
__copy_tofrom_user_power7_vmx() entry point is introduced.
This ensures correct KUAP ordering, avoids subfunction calls
while KUAP is unlocked, and eliminates the warnings while preserving
the VMX fast path.
Fixes: de78a9c42a ("powerpc: Add a framework for Kernel Userspace Access Protection")
Reported-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Closes: https://lore.kernel.org/all/20260109064917.777587-2-sshegde@linux.ibm.com/
Suggested-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Co-developed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Sayali Patil <sayalip@linux.ibm.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260304122201.153049-1-sayalip@linux.ibm.com
It may happen that mm is already released, which leads to kernel panic.
This adds the NULL check for current->mm, similarly to
commit 20afc60f89 ("x86, perf: Check that current->mm is alive before getting user callchain").
I was getting this panic when running a profiling BPF program
(profile.py from bcc-tools):
[26215.051935] Kernel attempted to read user page (588) - exploit attempt? (uid: 0)
[26215.051950] BUG: Kernel NULL pointer dereference on read at 0x00000588
[26215.051952] Faulting instruction address: 0xc00000000020fac0
[26215.051957] Oops: Kernel access of bad area, sig: 11 [#1]
[...]
[26215.052049] Call Trace:
[26215.052050] [c000000061da6d30] [c00000000020fc10] perf_callchain_user_64+0x2d0/0x490 (unreliable)
[26215.052054] [c000000061da6dc0] [c00000000020f92c] perf_callchain_user+0x1c/0x30
[26215.052057] [c000000061da6de0] [c0000000005ab2a0] get_perf_callchain+0x100/0x360
[26215.052063] [c000000061da6e70] [c000000000573bc8] bpf_get_stackid+0x88/0xf0
[26215.052067] [c000000061da6ea0] [c008000000042258] bpf_prog_16d4ab9ab662f669_do_perf_event+0xf8/0x274
[...]
In addition, move storing the top-level stack entry to generic
perf_callchain_user to make sure the top-level entry is always captured,
even if current->mm is NULL.
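The intended behaviour can be sketched in plain C as follows. The types and the function name (struct sk_task, struct sk_mm, sk_callchain_user()) are hypothetical stand-ins for illustration and do not match the kernel's perf interfaces; the user stack walk is reduced to copying a prepared array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the task and its address space. */
struct sk_mm {
	const unsigned long *frames;	/* pretend user return addresses */
	size_t nr_frames;
};

struct sk_task {
	const struct sk_mm *mm;		/* NULL once the mm has been released */
};

/* Returns the number of entries stored. */
size_t sk_callchain_user(const struct sk_task *task, unsigned long ip,
			 unsigned long *entries, size_t max)
{
	size_t n = 0;

	assert(task && (entries || max == 0));

	/* The top-level entry is recorded first, whether or not mm is alive. */
	if (n < max)
		entries[n++] = ip;

	/* Without an mm there is no user stack to walk. */
	if (!task->mm)
		return n;

	for (size_t i = 0; i < task->mm->nr_frames && n < max; i++)
		entries[n++] = task->mm->frames[i];

	return n;
}
```

With a NULL mm the sketch still returns one entry (the instruction pointer) instead of dereferencing the missing address space.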
Fixes: 20002ded4d ("perf_counter: powerpc: Add callchain support")
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Tested-by: Qiao Zhao <qzhao@redhat.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
[Maddy: fixed message to avoid checkpatch format style error]
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260309144045.169427-1-vmalik@redhat.com
commit 4267739cab ("arch, mm: consolidate initialization of SPARSE memory model"),
changed the initialization order of "pageblock_order" from...
start_kernel()
- setup_arch()
  - initmem_init()
    - sparse_init()
      - set_pageblock_order(); // this sets the pageblock_order
  - xxx_cma_reserve();
to...
start_kernel()
- setup_arch()
  - xxx_cma_reserve();
- mm_core_init_early()
  - free_area_init()
    - sparse_init()
      - set_pageblock_order() // this sets the pageblock_order.
So this means, pageblock_order is not initialized before these cma
reservation function calls, hence we are seeing CMA failures like...
[ 0.000000] kvm_cma_reserve: reserving 3276 MiB for global area
[ 0.000000] cma: pageblock_order not yet initialized. Called during early boot?
[ 0.000000] cma: Failed to reserve 3276 MiB
....
[ 0.000000][ T0] cma: pageblock_order not yet initialized. Called during early boot?
[ 0.000000][ T0] cma: Failed to reserve 1024 MiB
This patch moves these CMA reservations to arch_mm_preinit() which
happens in mm_core_init() (which happens after pageblock_order is
initialized), but before the memblock moves the free memory to buddy.
Fixes: 4267739cab ("arch, mm: consolidate initialization of SPARSE memory model")
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reported-and-tested-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/linuxppc-dev/4c338a29-d190-44f3-8874-6cfa0a031f0b@linux.ibm.com/
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Dan Horák <dan@danny.cz>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/6e532cf0db5be99afbe20eed699163d5e86cd71f.1772303986.git.ritesh.list@gmail.com
Fixes for v7.0:
Core:
- Adjusted msm_iommu_pagetable_prealloc_allocate() allocation type
DPU:
- Fixed blue screens on Hamoa laptops by reverting the LM reservation
- Fixed the size of the LM block on several platforms
- Dropped usage of %pK (again)
- Fixed smatch warning on SSPP v13+ code
- Fixed INTF_6 interrupts on Lemans
DSI:
- Fixed DSI PHY revision on Kaanapali
- Fixed pixel clock calculation for the bonded DSI mode panels with
compression enabled
DT bindings:
- Fixed DisplayPort description on Glymur
- Fixed model name in SM8750 MDSS schema
GPU:
- Added MODULE_DEVICE_TABLE to the GPU driver
- Fix bogus protect error on X2-85
- Fix dma_free_attrs() buffer size
- Gen8 UBWC fix for Glymur
From: Rob Clark <rob.clark@oss.qualcomm.com>
Link: https://patch.msgid.link/CACSVV00wZ95gFDLfzJ0Ywb8rsjPSjZ1aHdwE4smnyuZ=Fg-g8Q@mail.gmail.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
For unstated reasons, function mshv_partition_ioctl_set_memory passes
struct mshv_user_mem_region by value instead of by reference. Change
it to pass by reference.
Signed-off-by: Mukesh R <mrathor@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
hv_hvcrash_ctxt_save() in arch/x86/hyperv/hv_crash.c currently saves
segment registers via a general-purpose register (%eax). Update the
code to save segment registers (cs, ss, ds, es, fs, gs) directly to
the crash context memory using movw. This avoids unnecessary use of
a general-purpose register, making the code simpler and more efficient.
The size of the corresponding object file improves as follows:
   text    data     bss     dec     hex filename
   4167     176     200    4543    11bf hv_crash-old.o
   4151     176     200    4527    11af hv_crash-new.o
No functional change occurs to the saved context contents; this is
purely a code-quality improvement.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Long Li <longli@microsoft.com>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The NIX RAS health reporter recovery routine checks nix_af_rvu_int to
decide whether to re-enable NIX_AF_RAS interrupts. This is the RVU
interrupt status field and is unrelated to RAS events, so the recovery
flow may incorrectly skip re-enabling NIX_AF_RAS interrupts.
Check nix_af_rvu_ras instead before writing NIX_AF_RAS_ENA_W1S.
Fixes: 5ed66306ea ("octeontx2-af: Add devlink health reporters for NIX")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Link: https://patch.msgid.link/20260310184824.1183651-1-alok.a.tiwari@oracle.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The "rx_filter" member of "hwtstamp_config" structure is an enum field and
does not support bitwise OR combination of multiple filter values. This
causes an error when the linuxptp application tries to match the rx filter version.
Fix this by storing the requested filter type in a new port field.
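A small sketch of why OR-ing enumerators goes wrong, and of keeping the requested value in per-port state instead. The enum, struct and function names below are local stand-ins invented for the example; they are not the uapi definitions or the driver's fields.

```c
#include <assert.h>

/* Local stand-in enum: sequential values, not bit flags. */
enum sk_rx_filter {
	SK_FILTER_NONE,		/* 0 */
	SK_FILTER_PTP_V1,	/* 1 */
	SK_FILTER_PTP_V2,	/* 2 */
	SK_FILTER_NTP_ALL,	/* 3 */
};

/* Hypothetical per-port state holding the one filter that was requested. */
struct sk_port {
	enum sk_rx_filter rx_filter;
};

/* Problem pattern: OR-ing two enumerators yields a third, unrelated one. */
enum sk_rx_filter sk_combine_buggy(enum sk_rx_filter a, enum sk_rx_filter b)
{
	return (enum sk_rx_filter)(a | b);
}

/* Remember exactly what was asked for so it can be reported back as-is. */
void sk_set_filter(struct sk_port *port, enum sk_rx_filter requested)
{
	assert(port);
	port->rx_filter = requested;
}
```

In this stand-in enum, combining the V1 and V2 values produces the NTP enumerator, so a reader of the field sees a filter nobody requested.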
Fixes: 97248adb5a ("net: ti: am65-cpsw: Update hw timestamping filter for PTPv1 RX packets")
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Link: https://patch.msgid.link/20260310160940.109822-1-c-vankar@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Westphal says:
====================
netfilter: updates for net
Due to the large volume of backlogged patches it's unlikely I will make the
2nd planned PR this week, so several legit fixes will be pushed back
to next week. Sorry for the inconvenience but I am out of ideas and
alternatives.
1) syzbot managed to add/remove devices to a flowtable, due to a bug in
the flowtable netdevice notifier this gets us a double-add and
eventually UaF when device is removed again (we only expect one
entry, duplicate remains past net_device end-of-life).
From Phil Sutter, bug added in 6.16.
2) Yiming Qian reports another nf_tables transaction handling bug:
in some cases error unwind fails to undo certain set elements,
resulting in refcount underflow and use-after-free, bug added in 6.4.
3) Jenny Guanni Qu found out-of-bounds read in pipapo set type.
While the value is never used, it still rightfully triggers KASAN
splats. Bug exists since this set type was added in 5.6.
4) a few x_tables modules contain copypastry tcp option parsing code which
can read 1 byte past the option area. This bug is ancient, fix from
David Dull.
5) nfnetlink_queue leaks kernel memory if userspace provides bad
NFQA_VLAN/NFQA_L2HDR attributes. From Hyunwoo Kim, bug stems from
4.7 days.
6) nfnetlink_cthelper has incorrect loop restart logic which may result
in reading one pointer past end of array. From 3.6 days, fix also from
Hyunwoo Kim.
7) xt_IDLETIMER v0 extension must reject working with timers added
by revision v1, else we get list corruption. Bug added in v5.7.
From Yifan Wu, Juefei Pu and Yuan Tan via Xin Lu.
* tag 'nf-26-03-10' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: xt_IDLETIMER: reject rev0 reuse of ALARM timer labels
netfilter: nfnetlink_cthelper: fix OOB read in nfnl_cthelper_dump_table()
netfilter: nfnetlink_queue: fix entry leak in bridge verdict error path
netfilter: x_tables: guard option walkers against 1-byte tail reads
netfilter: nft_set_pipapo: fix stack out-of-bounds read in pipapo_drop()
netfilter: nf_tables: always walk all pending catchall elements
netfilter: nf_tables: Fix for duplicate device in netdev hooks
====================
Link: https://patch.msgid.link/20260310132050.630-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2026-03-10 (ice, iavf, i40e, e1000e, e1000)
Nikolay Aleksandrov changes return code of RDMA related ice devlink get
parameters when irdma is not enabled to -EOPNOTSUPP as current return
of -ENODEV causes issues with devlink output.
Petr Oros resolves a couple of issues in iavf: freeing PTP resources
before reset and disable, and fixing contention issues with the netdev
lock between reset and some ethtool operations.
Alok Tiwari corrects an incorrect comparison of cloud filter values and
adjusts some passed arguments to sizeof() for consistency on i40e.
Matt Vollrath removes an incorrect decrement for DMA error on e1000 and
e1000e drivers.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
e1000/e1000e: Fix leak in DMA error cleanup
i40e: fix src IP mask checks and memcpy argument names in cloud filter
iavf: fix incorrect reset handling in callbacks
iavf: fix PTP use-after-free during reset
drivers: net: ice: fix devlink parameters get without irdma
====================
Link: https://patch.msgid.link/20260310205654.4109072-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sabrina Dubroca says:
====================
neighbour: fix update of proxy neighbour
While re-reading some "old" patches I ran into a small change of
behavior in commit dc2a27e524 ("neighbour: Update pneigh_entry in
pneigh_create().").
The old behavior was not consistent between ->protocol and ->flags,
and didn't offer a way to clear protocol, so maybe it's better to
change that (7-years-old [1]) behavior. But then we should change
non-proxy neighbours as well to keep neigh/pneigh consistent.
[1] df9b0e30d4 ("neighbor: Add protocol attribute")
====================
Link: https://patch.msgid.link/cover.1772894876.git.sd@queasysnail.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The rtl8366rb_led_group_port_mask() function always returns LED port
bit in LED group 0; the switch statement returns the same thing in all
non-default cases.
This means that the driver does not currently support configuring LEDs
in non-zero LED groups.
Fix this.
Fixes: 32d6170054 ("net: dsa: realtek: add LED drivers for rtl8366rb")
Signed-off-by: Marek Behún <kabel@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260311111237.29002-1-kabel@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A user can set conn_timeout to any value via
setsockopt(TIPC_CONN_TIMEOUT), including values less than 4. When a
SYN is rejected with TIPC_ERR_OVERLOAD and the retry path in
tipc_sk_filter_connect() executes:
delay %= (tsk->conn_timeout / 4);
If conn_timeout is in the range [0, 3], the integer division yields 0,
and the modulo operation triggers a divide-by-zero exception, causing a
kernel oops/panic.
Fix this by clamping conn_timeout to a minimum of 4 at the point of use
in tipc_sk_filter_connect().
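The arithmetic can be sketched in isolation as below. The helper name retry_delay_sketch() and its parameters are hypothetical; only the clamp-then-modulo pattern is taken from the description above.

```c
#include <assert.h>

/* Hypothetical helper modelling the retry-delay computation. */
unsigned int retry_delay_sketch(unsigned int rnd, unsigned int conn_timeout)
{
	/* Clamp at the point of use so timeout / 4 can never be 0. */
	unsigned int timeout = conn_timeout < 4 ? 4 : conn_timeout;
	unsigned int divisor = timeout / 4;

	assert(divisor != 0);
	return rnd % divisor;
}
```

For conn_timeout values 0 to 3 the divisor becomes 1, so the delay is 0 rather than a divide error.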
Oops: divide error: 0000 [#1] SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 119 Comm: poc-F144 Not tainted 7.0.0-rc2+
RIP: 0010:tipc_sk_filter_rcv (net/tipc/socket.c:2236 net/tipc/socket.c:2362)
Call Trace:
tipc_sk_backlog_rcv (include/linux/instrumented.h:82 include/linux/atomic/atomic-instrumented.h:32 include/net/sock.h:2357 net/tipc/socket.c:2406)
__release_sock (include/net/sock.h:1185 net/core/sock.c:3213)
release_sock (net/core/sock.c:3797)
tipc_connect (net/tipc/socket.c:2570)
__sys_connect (include/linux/file.h:62 include/linux/file.h:83 net/socket.c:2098)
Fixes: 6787927475 ("tipc: buffer overflow handling in listener socket")
Cc: stable@vger.kernel.org
Signed-off-by: Mehul Rao <mehulrao@gmail.com>
Reviewed-by: Tung Nguyen <tung.quang.nguyen@est.tech>
Link: https://patch.msgid.link/20260310170730.28841-1-mehulrao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called which
initializes it. If bpf_redirect_neigh() is called with explicit AF_INET6
nexthop parameters, __bpf_redirect_neigh_v6() can skip the IPv6 FIB lookup
and call bpf_out_neigh_v6() directly. bpf_out_neigh_v6() then calls
ip_neigh_gw6(), which uses ipv6_stub->nd_tbl.
BUG: kernel NULL pointer dereference, address: 0000000000000248
Oops: Oops: 0000 [#1] SMP NOPTI
RIP: 0010:skb_do_redirect+0x44f/0xf40
Call Trace:
<TASK>
? srso_alias_return_thunk+0x5/0xfbef5
? __tcf_classify.constprop.0+0x83/0x160
? srso_alias_return_thunk+0x5/0xfbef5
? tcf_classify+0x2b/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? tc_run+0xb8/0x120
? srso_alias_return_thunk+0x5/0xfbef5
__dev_queue_xmit+0x6fa/0x1000
? srso_alias_return_thunk+0x5/0xfbef5
packet_sendmsg+0x10da/0x1700
? srso_alias_return_thunk+0x5/0xfbef5
__sys_sendto+0x1f3/0x220
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x101/0xf80
? exc_page_fault+0x6e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Fix this by adding an early check in bpf_out_neigh_v6(). If IPv6 is
disabled, drop the packet before neighbor lookup.
Suggested-by: Fernando Fernandez Mancera <fmancera@suse.de>
Fixes: ba452c9e99 ("bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop")
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260307-net-nd_tbl_fixes-v4-4-e2677e85628c@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called which
initializes it. If bpf_redirect_neigh() is called from tc with an explicit
nexthop of nh_family == AF_INET6, bpf_out_neigh_v4() takes the AF_INET6
branch and calls ip_neigh_gw6(), which relies on ipv6_stub->nd_tbl.
BUG: kernel NULL pointer dereference, address: 0000000000000248
Oops: Oops: 0000 [#1] SMP NOPTI
RIP: 0010:skb_do_redirect+0xb93/0xf00
Call Trace:
<TASK>
? srso_alias_return_thunk+0x5/0xfbef5
? __tcf_classify.constprop.0+0x83/0x160
? srso_alias_return_thunk+0x5/0xfbef5
? tcf_classify+0x2b/0x50
? srso_alias_return_thunk+0x5/0xfbef5
? tc_run+0xb8/0x120
? srso_alias_return_thunk+0x5/0xfbef5
__dev_queue_xmit+0x6fa/0x1000
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? alloc_skb_with_frags+0x58/0x200
packet_sendmsg+0x10da/0x1700
? srso_alias_return_thunk+0x5/0xfbef5
__sys_sendto+0x1f3/0x220
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x101/0xf80
? exc_page_fault+0x6e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Fix this by adding an early check in the AF_INET6 branch of
bpf_out_neigh_v4(). If IPv6 is disabled, unlock RCU and drop the packet.
Suggested-by: Fernando Fernandez Mancera <fmancera@suse.de>
Fixes: ba452c9e99 ("bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop")
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://patch.msgid.link/20260307-net-nd_tbl_fixes-v4-3-e2677e85628c@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When booting with the 'ipv6.disable=1' parameter, the nd_tbl is never
initialized because inet6_init() exits before ndisc_init() is called
which initializes it. If bonding ARP/NS validation is enabled, an IPv6
NS/NA packet received on a slave can reach bond_validate_na(), which
calls bond_has_this_ip6(). That path calls ipv6_chk_addr() and can
crash in __ipv6_chk_addr_and_flags().
BUG: kernel NULL pointer dereference, address: 00000000000005d8
Oops: Oops: 0000 [#1] SMP NOPTI
RIP: 0010:__ipv6_chk_addr_and_flags+0x69/0x170
Call Trace:
<IRQ>
ipv6_chk_addr+0x1f/0x30
bond_validate_na+0x12e/0x1d0 [bonding]
? __pfx_bond_handle_frame+0x10/0x10 [bonding]
bond_rcv_validate+0x1a0/0x450 [bonding]
bond_handle_frame+0x5e/0x290 [bonding]
? srso_alias_return_thunk+0x5/0xfbef5
__netif_receive_skb_core.constprop.0+0x3e8/0xe50
? srso_alias_return_thunk+0x5/0xfbef5
? update_cfs_rq_load_avg+0x1a/0x240
? srso_alias_return_thunk+0x5/0xfbef5
? __enqueue_entity+0x5e/0x240
__netif_receive_skb_one_core+0x39/0xa0
process_backlog+0x9c/0x150
__napi_poll+0x30/0x200
? srso_alias_return_thunk+0x5/0xfbef5
net_rx_action+0x338/0x3b0
handle_softirqs+0xc9/0x2a0
do_softirq+0x42/0x60
</IRQ>
<TASK>
__local_bh_enable_ip+0x62/0x70
__dev_queue_xmit+0x2d3/0x1000
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? packet_parse_headers+0x10a/0x1a0
packet_sendmsg+0x10da/0x1700
? kick_pool+0x5f/0x140
? srso_alias_return_thunk+0x5/0xfbef5
? __queue_work+0x12d/0x4f0
__sys_sendto+0x1f3/0x220
__x64_sys_sendto+0x24/0x30
do_syscall_64+0x101/0xf80
? exc_page_fault+0x6e/0x170
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Fix this by checking ipv6_mod_enabled() before dispatching IPv6 packets to
bond_na_rcv(). If IPv6 is disabled, return early from bond_rcv_validate()
and avoid the path to ipv6_chk_addr().
Suggested-by: Fernando Fernandez Mancera <fmancera@suse.de>
Fixes: 4e24be018e ("bonding: add new parameter ns_targets")
Signed-off-by: Ricardo B. Marlière <rbm@suse.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260307-net-nd_tbl_fixes-v4-2-e2677e85628c@suse.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the retrans mount option was introduced, the default value was set
to 1. However, in light of some bugs that this has exposed recently,
we should change it to 0 and retain the old behaviour from before this
option was introduced.
Cc: <stable@vger.kernel.org>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
When looking up open handles to be re-used in cifs_open(), calling
cifs_get_{writable,readable}_path() is wrong as it only looks up
the first matching open handle, and if @file->f_flags doesn't match,
it will ignore the remaining open handles in
cifsInodeInfo::openFileList that might potentially match
@file->f_flags.
For writable and readable handles, fix this by calling
__cifs_get_writable_file() and __find_readable_file(), respectively,
with FIND_OPEN_FLAGS set.
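The difference between the two lookup strategies can be sketched over a plain array. The struct and function names here (sk_handle, sk_find_first_then_check(), sk_find_with_flags()) are hypothetical and only model the idea; they are not the cifs helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical open-handle record. */
struct sk_handle {
	int writable;
	unsigned int f_flags;
};

/* Old pattern: take the first writable handle, give up if flags differ. */
const struct sk_handle *sk_find_first_then_check(const struct sk_handle *list,
						 size_t n, unsigned int want)
{
	assert(list || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!list[i].writable)
			continue;
		return list[i].f_flags == want ? &list[i] : NULL;
	}
	return NULL;
}

/* New pattern: the flags are part of the match, so the scan continues. */
const struct sk_handle *sk_find_with_flags(const struct sk_handle *list,
					   size_t n, unsigned int want)
{
	assert(list || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (list[i].writable && list[i].f_flags == want)
			return &list[i];
	}
	return NULL;
}
```

With two writable handles whose flags differ, the first function misses the second handle while the second function finds it.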
With the patch, the following program ends up with two opens instead
of three sent over the wire.
```
#define _GNU_SOURCE
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
int main(int argc, char *argv[])
{
	int fd;

	fd = open("/mnt/1/foo", O_CREAT | O_WRONLY | O_TRUNC, 0664);
	close(fd);
	fd = open("/mnt/1/foo", O_DIRECT | O_WRONLY);
	close(fd);
	fd = open("/mnt/1/foo", O_WRONLY);
	close(fd);
	fd = open("/mnt/1/foo", O_DIRECT | O_WRONLY);
	close(fd);
	return 0;
}
```
```
$ mount.cifs //srv/share /mnt/1 -o ...
$ gcc test.c && ./a.out
```
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
parse_server_interfaces() initializes interface socket addresses with
CIFS_PORT. When the mount uses a non-default port this overwrites the
configured destination port.
Later, cifs_chan_update_iface() copies this sockaddr into server->dstaddr,
causing reconnect attempts to use the wrong port after server interface
updates.
Use the existing port from server->dstaddr instead.
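A minimal userspace sketch of the idea, limited to IPv4 for brevity. The function name sk_fill_iface() is hypothetical and the code uses the POSIX sockaddr_in type rather than the driver's structures.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/*
 * Hypothetical helper: build an interface address from an IP reported by
 * the server, taking the port from the address the mount already uses.
 */
void sk_fill_iface(struct sockaddr_in *iface, struct in_addr ip,
		   const struct sockaddr_in *dstaddr)
{
	assert(iface && dstaddr && dstaddr->sin_family == AF_INET);

	memset(iface, 0, sizeof(*iface));
	iface->sin_family = AF_INET;
	iface->sin_addr = ip;
	/* Reuse the configured port (already in network byte order)
	 * instead of hard-coding the protocol default. */
	iface->sin_port = dstaddr->sin_port;
}
```

If this address is later copied back as the destination, the reconnect keeps using the port the mount was configured with.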
Cc: stable@vger.kernel.org
Fixes: fe856be475 ("CIFS: parse and store info on iface queries")
Tested-by: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The DesignWare I3C master controller ACKs IBIs as soon as a valid
Device Address Table (DAT) entry is present. This can create a race
between device attachment (after DAA) and the point where the client
driver enables IBIs via i3c_device_enable_ibi().
Set DEV_ADDR_TABLE_SIR_REJECT in the DAT entry during
attach_i3c_dev() and reattach_i3c_dev() so that IBIs are rejected
by default. The bit is managed thereafter by the existing
dw_i3c_master_set_sir_enabled() function, which clears it in
enable_ibi() after ENEC is issued, and restores it in disable_ibi()
after DISEC.
Fixes: 1dd728f5d4 ("i3c: master: Add driver for Synopsys DesignWare IP")
Signed-off-by: Adrian Ng Ho Yin <adrianhoyin.ng@altera.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/53f5b8cbdd8af789ec38b95b02873f32f9182dd6.1770962368.git.adrianhoyin.ng@altera.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The DesignWare I3C master driver creates a virtual I2C adapter to
provide backward compatibility with I2C devices. However, the current
implementation does not associate this virtual adapter with any
Device Tree node.
Propagate the of_node from the I3C master platform device to the
virtual I2C adapter's device structure. This ensures that standard
I2C aliases are correctly resolved and bus numbering remains consistent.
Signed-off-by: Peter Yin <peteryin.openbmc@gmail.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260302075645.1492766-1-peteryin.openbmc@gmail.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Disruption of the MIPI I3C HCI controller's internal state can cause
i3c_hci_bus_disable() to fail when attempting to shut down the bus.
In the code paths where bus disable is invoked - bus clean-up and runtime
suspend - the controller does not need to remain operational afterward, so
a full controller reset is a safe recovery mechanism.
Add a fallback to issue a software reset when disabling the bus fails.
This ensures the bus is reliably halted even if the controller's state
machine is stuck or unresponsive.
The fallback is used both during bus clean-up and in the runtime suspend
path. In the latter case, ensure interrupts are quiesced after reset.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-15-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Shared interrupts may fire unexpectedly, including during periods when the
controller is not yet fully initialized. Commit b9a15012a1
("i3c: mipi-i3c-hci: Add optional Runtime PM support") addressed this issue
for the runtime-suspended state, but the same problem can also occur before
the bus is enabled for the first time.
Ensure the IRQ handler ignores interrupts until initialization is complete
by making consistent use of the existing irq_inactive flag. The flag is
now set to false immediately before enabling the bus.
To guarantee correct ordering with respect to the IRQ handler, protect
all transitions of irq_inactive with the same spinlock used inside the
handler.
Fixes: b8460480f6 ("i3c: mipi-i3c-hci: Allow for Multi-Bus Instances")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-14-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The DMA ring halts whenever a transfer encounters an error. The interrupt
handler previously attempted to detect this situation and restart the ring
if a transfer completed at the same time. However, this restart logic runs
entirely in interrupt context and is inherently racy: it interacts with
other paths manipulating the ring state, and fully serializing it within
the interrupt handler is not practical.
Move this error-recovery logic out of the interrupt handler and into the
transfer-processing path (i3c_hci_process_xfer()), where serialization and
state management are already controlled. Introduce a new optional I/O-ops
callback, handle_error(), invoked when a completed transfer reports an
error. For DMA operation, the implementation simply calls the existing
dequeue function, which safely aborts and restarts the ring when needed.
This removes the fragile ring-restart logic from the interrupt handler and
centralizes error handling where proper sequencing can be ensured.
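The optional-callback arrangement can be sketched as follows. All names (struct sk_io_ops, sk_process_xfer(), sk_dma_handle_error() and so on) are hypothetical, and the queue-and-wait step is simulated; only the shape, an ops table whose handle_error() hook may be absent and is invoked from the transfer-processing path, follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_xfer {
	bool inject_error;	/* test knob: make the transfer fail */
	int error;		/* set by the simulated completion */
};

struct sk_io_ops {
	void (*queue_and_wait)(struct sk_xfer *xfer);
	void (*handle_error)(struct sk_xfer *xfer);	/* optional */
};

static int sk_dma_recoveries;

static void sk_queue_and_wait(struct sk_xfer *xfer)
{
	xfer->error = xfer->inject_error ? -1 : 0;
}

static void sk_dma_handle_error(struct sk_xfer *xfer)
{
	(void)xfer;
	sk_dma_recoveries++;	/* stands in for "dequeue and restart the ring" */
}

const struct sk_io_ops sk_dma_ops = {
	.queue_and_wait = sk_queue_and_wait,
	.handle_error = sk_dma_handle_error,
};

const struct sk_io_ops sk_pio_ops = {
	.queue_and_wait = sk_queue_and_wait,
	.handle_error = NULL,	/* no recovery hook needed in this model */
};

int sk_process_xfer(const struct sk_io_ops *ops, struct sk_xfer *xfer)
{
	assert(ops && ops->queue_and_wait && xfer);

	ops->queue_and_wait(xfer);

	/* Recovery runs here, in process context, not in the IRQ handler. */
	if (xfer->error && ops->handle_error)
		ops->handle_error(xfer);

	return xfer->error;
}
```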
Fixes: ccdb2e0e3b ("i3c: mipi-i3c-hci: Add Intel specific quirk to ring resuming")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-13-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Several parts of the MIPI I3C HCI driver duplicate the same sequence for
queuing a transfer, waiting for completion, and handling timeouts. This
logic appears in five separate locations and will be affected by an
upcoming fix.
Refactor the repeated code into a new helper, i3c_hci_process_xfer(), and
store the timeout value in the hci_xfer structure so that callers do not
need to pass it as a separate parameter.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-12-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The DMA dequeue path attempts to restart the ring after aborting an
in-flight transfer, but the current sequence is incomplete. The controller
must be brought out of the aborted state and the ring control registers
must be programmed in the correct order: first clearing ABORT, then
re-enabling the ring and asserting RUN_STOP to resume operation.
Add the missing controller resume step and update the ring control writes
so that the ring is restarted using the proper sequence.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-11-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The internal control command descriptor used for no-op commands includes a
Transaction ID (TID) field, but the no-op command constructed in
hci_dma_dequeue_xfer() omitted it. As a result, the hardware receives a
no-op descriptor without the expected TID.
This bug has gone unnoticed because the TID is currently not validated in
the no-op completion path, but the descriptor format requires it to be
present.
Add the missing TID field when generating a no-op descriptor so that its
layout matches the defined command structure.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-10-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The logic used to abort the DMA ring contains several flaws:
1. The driver unconditionally issues a ring abort even when the ring has
already stopped.
2. The completion used to wait for abort completion is never
re-initialized, resulting in incorrect wait behavior.
3. The abort sequence unintentionally clears RING_CTRL_ENABLE, which
resets hardware ring pointers and disrupts the controller state.
4. If the ring is already stopped, the abort operation should be
considered successful without attempting further action.
Fix the abort handling by checking whether the ring is running before
issuing an abort, re-initializing the completion when needed, ensuring that
RING_CTRL_ENABLE remains asserted during abort, and treating an already
stopped ring as a successful condition.
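The four rules above can be sketched in user-space C. This is an illustration only, not the driver's code: the bit positions, struct fake_ring and fake_ring_abort() are hypothetical stand-ins, and the hardware response is simulated inline.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, chosen for illustration only. */
#define FAKE_RING_CTRL_ENABLE    (1u << 31)
#define FAKE_RING_CTRL_ABORT     (1u << 2)
#define FAKE_RING_STATUS_RUNNING (1u << 1)

struct fake_ring {
    uint32_t ctrl;        /* models the ring control register */
    uint32_t status;      /* models the ring status register */
    bool abort_done;      /* models the completion */
    unsigned int aborts;  /* number of abort requests issued */
};

/* Returns 0 on success, -1 if the simulated abort did not complete. */
static int fake_ring_abort(struct fake_ring *rh)
{
    /* An already stopped ring needs no abort and counts as success. */
    if (!(rh->status & FAKE_RING_STATUS_RUNNING))
        return 0;

    /* Re-arm the completion before every wait. */
    rh->abort_done = false;

    /* Request the abort while keeping ENABLE asserted. */
    rh->ctrl = FAKE_RING_CTRL_ENABLE | FAKE_RING_CTRL_ABORT;
    rh->aborts++;

    /* Simulated hardware: the ring stops and signals completion. */
    rh->status &= ~FAKE_RING_STATUS_RUNNING;
    rh->abort_done = true;

    return rh->abort_done ? 0 : -1;
}
```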
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-9-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The DMA ring bookkeeping in the MIPI I3C HCI driver is updated from two
contexts: the DMA ring dequeue path (hci_dma_dequeue_xfer()) and the
interrupt handler (hci_dma_xfer_done()). Both modify the ring's
in-flight transfer state - specifically rh->src_xfers[] and
xfer->ring_entry - but without any serialization. This allows the two
paths to race, potentially leading to inconsistent ring state.
Serialize access to the shared ring state by extending the existing
spinlock to cover the DMA dequeue path and the entire interrupt handler.
Since the core IRQ handler now holds this lock, remove the per-function
locking from the PIO and DMA sub-handlers.
Additionally, clear the completed entry in rh->src_xfers[] in
hci_dma_xfer_done() so it cannot be matched or completed again.
Finally, place the ring restart sequence under the same lock in
hci_dma_dequeue_xfer() to avoid concurrent enqueue or completion
operations while the ring state is being modified.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-8-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The HCI DMA dequeue path (hci_dma_dequeue_xfer()) may be invoked for
multiple transfers that time out around the same time. However, the
function is not serialized and can race with itself.
When a timeout occurs, hci_dma_dequeue_xfer() stops the ring, processes
incomplete transfers, and then restarts the ring. If another timeout
triggers a parallel call into the same function, the two instances may
interfere with each other - stopping or restarting the ring at unexpected
times.
Add a mutex so that hci_dma_dequeue_xfer() is serialized with respect to
itself.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-7-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The I3C subsystem allows multiple transfers to be queued concurrently.
However, the MIPI I3C HCI driver's DMA enqueue path, hci_dma_queue_xfer(),
lacks sufficient serialization.
In particular, the allocation of the enqueue_ptr and its subsequent update
in the RING_OPERATION1 register must be done atomically. Otherwise, for
example, it would be possible for two transfers to be allocated the same
enqueue_ptr.
Extend the use of the existing spinlock for that purpose. Keep a count of
the number of xfers enqueued so that it is easy to determine if the ring
has enough space.
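A minimal user-space sketch of the idea, assuming a pthread mutex as a stand-in for the driver's spinlock. The ring size, struct fake_ring, ring_reserve() and ring_complete() are hypothetical names used only for illustration. The space check, the slot allocation and the pointer update all happen inside one critical section, and the in-flight count makes the space check trivial.

```c
#include <pthread.h>

#define FAKE_RING_ENTRIES 8u

struct fake_ring {
    pthread_mutex_t lock;      /* stand-in for the driver's spinlock */
    unsigned int enqueue_ptr;  /* next free slot */
    unsigned int xfer_count;   /* entries currently enqueued */
};

/* Reserve n consecutive slots; returns the first slot, or -1 if full. */
static int ring_reserve(struct fake_ring *rh, unsigned int n)
{
    int first = -1;

    pthread_mutex_lock(&rh->lock);
    if (n <= FAKE_RING_ENTRIES - rh->xfer_count) {
        first = (int)rh->enqueue_ptr;
        rh->enqueue_ptr = (rh->enqueue_ptr + n) % FAKE_RING_ENTRIES;
        rh->xfer_count += n;
    }
    pthread_mutex_unlock(&rh->lock);

    return first;
}

/* Release n slots once their transfers have completed. */
static void ring_complete(struct fake_ring *rh, unsigned int n)
{
    pthread_mutex_lock(&rh->lock);
    rh->xfer_count -= n;
    pthread_mutex_unlock(&rh->lock);
}
```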
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-6-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The MIPI I3C HCI driver currently uses separate spinlocks for different
contexts (PIO vs. DMA rings). This split is unnecessary and complicates
upcoming fixes. The driver does not support concurrent PIO and DMA
operation, and it only supports a single DMA ring, so a single lock is
sufficient for all paths.
Introduce a unified spinlock in struct i3c_hci, switch both PIO and DMA
code to use it, and remove the per-context locks.
No functional change is intended in this patch.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-5-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Prepare for fixing a race in the DMA ring enqueue path when handling
parallel transfers. Move all DMA mapping out of hci_dma_queue_xfer()
and into a new helper that performs the mapping up front.
This refactoring allows the upcoming fix to extend the spinlock coverage
around the enqueue operation without performing DMA mapping under the
spinlock.
No functional change is intended in this patch.
Fixes: 9ad9a52cce ("i3c/master: introduce the mipi-i3c-hci driver")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260306072451.11131-4-adrian.hunter@intel.com
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
perf_event__synthesize_modules() allocates a single union perf_event and
reuses it across every kernel module callback.
After the first module is processed, perf_record_mmap2__read_build_id()
sets PERF_RECORD_MISC_MMAP_BUILD_ID in header.misc and writes that
module's build ID into the event.
On subsequent iterations the callback overwrites start, len, pid, and
filename for the next module but never clears the stale build ID fields
or the MMAP_BUILD_ID flag.
When perf_record_mmap2__read_build_id() runs for the second module it
sees the flag, reads the stale build ID into a dso_id, and
__dso__improve_id() permanently poisons the DSO with the wrong build ID.
Every module after the first therefore receives the first module's build
ID in its MMAP2 record.
On a system with the sunrpc and nfsd modules loaded, this causes perf
script and perf report to show [unknown] for all module symbols.
The latent bug has existed since commit d9f2ecbc5e ("perf dso:
Move build_id to dso_id") introduced the PERF_RECORD_MISC_MMAP_BUILD_ID
check in perf_record_mmap2__read_build_id().
Commit 53b00ff358 ("perf record: Make --buildid-mmap the default")
then exposed it to all users by making the MMAP2-with-build-ID path the
default. Both commits were merged in the same series.
Clear the MMAP_BUILD_ID flag and zero the build_id union before each
call to perf_record_mmap2__read_build_id() so that every module starts
with a clean slate.
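The failure mode and the fix can be modelled in a few lines of user-space C. Everything here is a simplified stand-in: struct fake_event, the flag value and fake_read_build_id() are assumptions made for illustration and do not reproduce the perf data structures.

```c
#include <string.h>

/* Assumed flag value, for illustration only. */
#define FAKE_MISC_MMAP_BUILD_ID (1u << 14)

struct fake_event {
    unsigned int misc;
    unsigned char build_id[20];
    unsigned char build_id_size;
};

/*
 * Stand-in for the build ID reader: if the flag is already set it
 * trusts whatever ID the event carries, otherwise it stores a new one.
 */
static void fake_read_build_id(struct fake_event *ev, unsigned char module_id)
{
    if (ev->misc & FAKE_MISC_MMAP_BUILD_ID)
        return;
    memset(ev->build_id, module_id, sizeof(ev->build_id));
    ev->build_id_size = sizeof(ev->build_id);
    ev->misc |= FAKE_MISC_MMAP_BUILD_ID;
}

/* Per-module callback reusing one event; reset != 0 applies the fix. */
static unsigned char synthesize_module(struct fake_event *ev,
                                       unsigned char module_id, int reset)
{
    if (reset) {
        ev->misc &= ~FAKE_MISC_MMAP_BUILD_ID;
        memset(ev->build_id, 0, sizeof(ev->build_id));
        ev->build_id_size = 0;
    }
    fake_read_build_id(ev, module_id);
    return ev->build_id[0];
}
```

Without the reset, the second module inherits the first module's ID; with it, each module gets its own.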
Fixes: d9f2ecbc5e ("perf dso: Move build_id to dso_id")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When IORING_SETUP_SQE_MIXED is used without IORING_SETUP_NO_SQARRAY,
the boundary check for 128-byte SQE operations in io_init_req()
validated the logical SQ head position rather than the physical SQE
index.
The existing check:
!(ctx->cached_sq_head & (ctx->sq_entries - 1))
ensures the logical position isn't at the end of the ring, which is
correct for NO_SQARRAY rings where physical == logical. However, when
sq_array is present, an unprivileged user can remap any logical
position to an arbitrary physical index via sq_array. Setting
sq_array[N] = sq_entries - 1 places a 128-byte operation at the last
physical SQE slot, causing the 128-byte memcpy in
io_uring_cmd_sqe_copy() to read 64 bytes past the end of the SQE
array.
Replace the cached_sq_head alignment check with a direct validation
of the physical SQE index, which correctly handles both sq_array and
NO_SQARRAY cases.
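A rough user-space sketch of validating the physical index rather than the logical head. struct fake_sq and both helpers are hypothetical; a NULL sq_array models a NO_SQARRAY ring. A 128-byte operation occupies the slot at the physical index plus the one after it, so the index must not be the last slot.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_sq {
    unsigned int sq_entries;       /* power of two */
    const unsigned int *sq_array;  /* NULL models NO_SQARRAY */
};

/* Translate a logical head position into a physical SQE index. */
static unsigned int fake_phys_index(const struct fake_sq *sq, unsigned int head)
{
    unsigned int logical = head & (sq->sq_entries - 1);

    return sq->sq_array ? sq->sq_array[logical] : logical;
}

/* A 128-byte SQE needs slots idx and idx + 1 to both exist. */
static bool fake_sqe128_in_bounds(const struct fake_sq *sq, unsigned int head)
{
    unsigned int idx = fake_phys_index(sq, head);

    return idx < sq->sq_entries - 1;
}
```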
Fixes: 1cba30bf9f ("io_uring: add support for IORING_SETUP_SQE_MIXED")
Signed-off-by: Tom Ryan <ryan36005@gmail.com>
Link: https://patch.msgid.link/20260310052003.72871-1-ryan36005@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Similarly to what commit e78f7b70e837 did for local task work additions,
use ->rings_rcu under RCU rather than dereference ->rings directly. See
that commit for more details.
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If DEFER_TASKRUN | SETUP_TASKRUN is used and task work is added while
the ring is being resized, it's possible for the OR'ing of
IORING_SQ_TASKRUN to happen in the small window of swapping into the
new rings and the old rings being freed.
Prevent this by adding a 2nd ->rings pointer, ->rings_rcu, which is
protected by RCU. The task work flags manipulation is inside RCU
already, and if the resize ring freeing is done post an RCU synchronize,
then there's no need to add locking to the fast path of task work
additions.
Note: this is only done for DEFER_TASKRUN, as that's the only setup mode
that supports ring resizing. If this ever changes, then they too need to
use the io_ctx_mark_taskrun() helper.
Link: https://lore.kernel.org/io-uring/20260309062759.482210-1-naup96721@gmail.com/
Cc: stable@vger.kernel.org
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The acp_card_rt5682_init() and acp_card_rt5682s_init() functions did not
check the return values of clk_get(). This could lead to a kernel crash
when the invalid pointers are later dereferenced by clock core
functions.
Fix this by:
1. Changing clk_get() to the device-managed devm_clk_get().
2. Adding IS_ERR() checks immediately after each clock acquisition.
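The checking pattern looks roughly like the user-space sketch below. ERR_PTR(), IS_ERR() and PTR_ERR() are local re-implementations of the kernel idiom, and fake_clk_get(), fake_card_init() and the clock names are hypothetical stand-ins rather than the driver's code.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Local models of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct fake_clk { const char *name; };

static struct fake_clk fake_wclk = { "wclk" };
static struct fake_clk fake_bclk = { "bclk" };

/* Hypothetical clock lookup: unknown names yield an error pointer. */
static struct fake_clk *fake_clk_get(const char *id)
{
    if (strcmp(id, "wclk") == 0)
        return &fake_wclk;
    if (strcmp(id, "bclk") == 0)
        return &fake_bclk;
    return ERR_PTR(-ENOENT);
}

/* Check every acquisition immediately, before the pointer is used. */
static int fake_card_init(const char *wclk_id, const char *bclk_id)
{
    struct fake_clk *wclk, *bclk;

    wclk = fake_clk_get(wclk_id);
    if (IS_ERR(wclk))
        return (int)PTR_ERR(wclk);

    bclk = fake_clk_get(bclk_id);
    if (IS_ERR(bclk))
        return (int)PTR_ERR(bclk);

    return 0;
}
```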
Fixes: 8b72562668 ("ASoC: amd: acp: Add support for RT5682-VS codec")
Fixes: d4c750f2c7 ("ASoC: amd: acp: Add generic machine driver support for ACP cards")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://patch.msgid.link/20260310044327.2582018-1-nichen@iscas.ac.cn
Signed-off-by: Mark Brown <broonie@kernel.org>
The kernel test robot reported a compile-time error regarding the
FIELD_PREP() value being too large for the TRANS_DUAL_QUAD field:
error: FIELD_PREP: value too large for the field
note: in expansion of macro 'TRANS_DUAL_QUAD'
tc |= TRANS_DUAL_QUAD(ffs(op->data.buswidth) - 1);
This occurs because TRANS_DUAL_QUAD is defined as a 2-bit field, and
GCC's static analysis cannot deduce that `ffs(op->data.buswidth) - 1`
will strictly fall within the 0~3 range. Although the SPI framework
guarantees that `op->data.buswidth` is valid at runtime (e.g., 1, 2,
4, 8), an explicit bounds check is necessary to satisfy the compiler.
To resolve the build error, introduce a safe fallback mechanism.
If an unexpected buswidth is encountered, the driver will trigger
a WARN_ON_ONCE to leave a trace and fall back to width_code = 0
(standard 1-bit SPI mode). This approach guarantees predictable
hardware behavior.
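The fallback can be sketched in user-space C as follows. lowest_set_bit() is a portable equivalent of ffs(x) - 1, and warn_count is a hypothetical stand-in for WARN_ON_ONCE() (it counts every hit rather than warning once). Neither is the driver's code.

```c
/* Number of times the fallback path was taken. */
static unsigned int warn_count;

/* Index of the lowest set bit, or -1 for 0; same as ffs(v) - 1. */
static int lowest_set_bit(unsigned int v)
{
    int pos = 0;

    if (!v)
        return -1;
    while (!(v & 1u)) {
        v >>= 1;
        pos++;
    }
    return pos;
}

/* Map a bus width to a 2-bit code, falling back to 0 (1-bit mode). */
static unsigned int buswidth_to_width_code(unsigned int buswidth)
{
    int code = lowest_set_bit(buswidth);

    if (code < 0 || code > 3) {
        warn_count++;  /* leave a trace of the unexpected width */
        code = 0;
    }
    return (unsigned int)code;
}
```

With the explicit range check in place, the value handed to a 2-bit field is provably within 0 to 3.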
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602140738.P7ZozxzI-lkp@intel.com/
Suggested-by: Pei Xiao <xiaopei01@kylinos.cn>
Signed-off-by: CL Wang <cl634@andestech.com>
Link: https://patch.msgid.link/20260303024737.1791196-1-cl634@andestech.com
Signed-off-by: Mark Brown <broonie@kernel.org>
A recently added quirk does not fit in the left column of the table,
so it all has to be reformatted and realigned.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If discovery has failed for any reason (such as no support for a block)
then there is no need to unwind all the IP blocks in fini. In this
condition there can actually be failures during the unwind too.
Reset num_ip_blocks to zero during failure path and skip the unnecessary
cleanup path.
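A toy model of why zeroing the count is sufficient, assuming the teardown walks the IP blocks by index. struct fake_adev, fake_discovery() and fake_fini() are hypothetical names and not the amdgpu code.

```c
struct fake_adev {
    int num_ip_blocks;  /* blocks registered by discovery */
    int fini_calls;     /* how many blocks were unwound */
};

/* Simulated discovery: on failure, forget any blocks already added. */
static int fake_discovery(struct fake_adev *adev, int blocks_found, int fail)
{
    adev->num_ip_blocks = blocks_found;
    if (fail) {
        adev->num_ip_blocks = 0;
        return -1;
    }
    return 0;
}

/* Teardown walks the blocks in reverse; zero blocks means no work. */
static void fake_fini(struct fake_adev *adev)
{
    int i;

    for (i = adev->num_ip_blocks - 1; i >= 0; i--)
        adev->fini_calls++;
}
```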
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit fae5984296b981c8cc3acca35b701c1f332a6cd8)
Cc: stable@vger.kernel.org
The error handling path should unreserve the BO before returning failure.
Fixes: 305cd109b7 ("drm/amdkfd: Validate user queue update")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c24afed7de9ecce341825d8ab55a43a254348b33)
[WHY]
On DCN21, dccg2_init() is called in dcn10_init_hw() before
bios_golden_init(). During S0i3 resume, BIOS sets MICROSECOND_TIME_BASE_DIV
to 0x00120464 as a marker. dccg2_init() overwrites this to 0x00120264,
causing dcn21_s0i3_golden_init_wa() to misdetect the state and skip golden
init.
Eventually during the resume sequence, a flip timeout occurs.
[HOW]
Skip DCCG on dccg2_is_s0i3_golden_init_wa_done() on DCN21.
Fixes: 4c595e7511 ("drm/amd/display: Migrate DCCG registers access from hwseq to dccg component.")
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit c61eda434336cf2c033aa35efdc9a08b31d2fdfa)
Commit 4c595e7511 ("drm/amd/display: Migrate DCCG registers access
from hwseq to dccg component.") moved register writes from hwseq to
dccg2_*() functions but did not add the registers to the DCCG register
list macros. The struct fields default to 0, so REG_WRITE() targets
MMIO offset 0, causing a GPU hang on resume (seen on DCN21/DCN30
during IGT kms_cursor_crc@cursor-suspend).
Add
- MICROSECOND_TIME_BASE_DIV
- MILLISECOND_TIME_BASE_DIV
- DCCG_GATE_DISABLE_CNTL
- DCCG_GATE_DISABLE_CNTL2
- DC_MEM_GLOBAL_PWR_REQ_CNTL
to macros in dcn20_dccg.h, dcn301_dccg.h, dcn31_dccg.h, and dcn314_dccg.h.
Fixes: 4c595e7511 ("drm/amd/display: Migrate DCCG registers access from hwseq to dccg component.")
Reported-by: Rafael Passos <rafael@rcpassos.me>
Reviewed-by: Aurabindo Pillai <aurabindo.pillai@amd.com>
Signed-off-by: Ivan Lipski <ivan.lipski@amd.com>
Signed-off-by: Alex Hung <alex.hung@amd.com>
Tested-by: Dan Wheeler <daniel.wheeler@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit e6e2b956fc814de766d3480be7018297c41d3ce0)
The return value of vmx_leave_smm() is unrelated to that of
nested_vmx_enter_non_root_mode(). Check explicitly for success
(which happens to be 0) and return 1 just like everywhere
else in vmx_leave_smm().
Likewise, in svm_leave_smm() return 0/1 instead of the 0/1/-errno
returned by enter_svm_guest_mode().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a test checking that invalid eVMCS contents are validated after an
RSM instruction is emulated.
The failure mode is simply that the RSM succeeds, because KVM virtualizes
NMIs anyway while running L2; the two pin-based execution controls used
by the test are entirely handled by KVM and not by the processor.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMCB12 is stored in guest memory and can be mangled while in SMM; it
is then reloaded by svm_leave_smm(), but it is not checked again for
validity.
Move the cached vmcb12 control and save consistency checks out of
svm_set_nested_state() and into a helper, and reuse it in
svm_leave_smm().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMCS12 is not available while in SMM. However, it can be overwritten
if userspace manages to trigger copy_enlightened_to_vmcs12() - for example
via KVM_GET_NESTED_STATE.
Because of this, the VMCS12 has to be checked for validity before it is
used to generate the VMCS02. Move the check code out of vmx_set_nested_state()
(the other "not a VMLAUNCH/VMRESUME" path that emulates a nested vmentry)
and reuse it in vmx_leave_smm().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly set/clear CR8 write interception when AVIC is (de)activated to
fix a bug where KVM leaves the interception enabled after AVIC is
activated. E.g. if KVM emulates INIT=>WFS while AVIC is deactivated, CR8
will remain intercepted in perpetuity.
On its own, the dangling CR8 intercept is "just" a performance issue, but
combined with the TPR sync bug fixed by commit d02e48830e ("KVM: SVM:
Sync TPR from LAPIC into VMCB::V_TPR even if AVIC is active"), the dangling
intercept is fatal to Windows guests as the TPR seen by hardware gets
wildly out of sync with reality.
Note, VMX isn't affected by the bug as TPR_THRESHOLD is explicitly ignored
when Virtual Interrupt Delivery is enabled, i.e. when APICv is active in
KVM's world. I.e. there's no need to trigger update_cr8_intercept(), this
is firmly an SVM implementation flaw/detail.
WARN if KVM gets a CR8 write #VMEXIT while AVIC is active, as KVM should
never enter the guest with AVIC enabled and CR8 writes intercepted.
Fixes: 3bbf3565f4 ("svm: Do not intercept CR8 when enable AVIC")
Cc: stable@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Cc: Naveen N Rao (AMD) <naveen@kernel.org>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Squash fix to avic_deactivate_vmcb. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize all per-vCPU AVIC control fields in the VMCB if AVIC is enabled
in KVM and the VM has an in-kernel local APIC, i.e. if it's _possible_ the
vCPU could activate AVIC at any point in its lifecycle. Configuring the
VMCB if and only if AVIC is active "works" purely because of optimizations
in kvm_create_lapic() to speculatively set apicv_active if AVIC is enabled
*and* to defer updates until the first KVM_RUN. In quotes because KVM
likely won't do the right thing if kvm_apicv_activated() is false, i.e. if
a vCPU is created while APICv is inhibited at the VM level for whatever
reason. E.g. if the inhibit is *removed* before KVM_REQ_APICV_UPDATE is
handled in KVM_RUN, then __kvm_vcpu_update_apicv() will elide calls to
vendor code due to seeing "apicv_active == activate".
Cleaning up the initialization code will also allow fixing a bug where KVM
incorrectly leaves CR8 interception enabled when AVIC is activated without
creating a mess with respect to whether AVIC is activated or not.
Cc: stable@vger.kernel.org
Fixes: 67034bb9dd ("KVM: SVM: Add irqchip_split() checks before enabling AVIC")
Fixes: 6c3e4422dd ("svm: Add support for dynamic APICv")
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Reviewed-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260203190711.458413-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add KVM_X86_QUIRK_VMCS12_ALLOW_FREEZE_IN_SMM to allow L1 to set
FREEZE_IN_SMM in vmcs12's GUEST_IA32_DEBUGCTL field, as permitted
prior to commit 6b1dd26544 ("KVM: VMX: Preserve host's
DEBUGCTLMSR_FREEZE_IN_SMM while running the guest"). Enable the quirk
by default for backwards compatibility (like all quirks); userspace
can disable it via KVM_CAP_DISABLE_QUIRKS2 for consistency with the
constraints on WRMSR(IA32_DEBUGCTL).
Note that the quirk only bypasses the consistency check. The vmcs02 bit is
still owned by the host, and PMCs are not frozen during virtualized SMM.
In particular, if a host administrator decides that PMCs should not be
frozen during physical SMM, then L1 has no say in the matter.
Fixes: 095686e6fc ("KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter")
Cc: stable@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://patch.msgid.link/20260205231537.1278753-1-jmattson@google.com
[sean: tag for stable@, clean-up and fix goofs in the comment and docs]
Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rename quirk. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The mask_notifier_list is protected by kvm->irq_srcu, but the traversal
in kvm_fire_mask_notifiers() incorrectly uses hlist_for_each_entry_rcu().
This leads to lockdep warnings because the standard RCU iterator expects
to be under rcu_read_lock(), not SRCU.
Replace the RCU variant with hlist_for_each_entry_srcu() and provide
the proper srcu_read_lock_held() annotation to ensure correct
synchronization and silence lockdep.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Link: https://patch.msgid.link/20260204091206.2617-1-lirongqing@baidu.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In KVM guests with Hyper-V hypercalls enabled, the hypercalls
HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX
allow a guest to request invalidation of portions of a virtual TLB.
For this, the hypercall parameter includes a list of GVAs that are supposed
to be invalidated.
Currently, only the base GVA is checked to be canonical. In reality, this
check needs to be performed for the entire range of GVAs, as checking only
the base GVA enables guests running on Intel hardware to trigger a
WARN_ONCE in the host (see Fixes commit below).
Move the check for non-canonical addresses to be performed for every GVA
of the supplied range to avoid the splat, and to be more in line with the
Hyper-V specification, since, although unlikely, a range starting with an
invalid GVA may still contain GVAs that are valid.
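A user-space sketch of a per-GVA check, under two assumptions that are not taken from this changelog: 48-bit virtual addresses, and a flush entry layout where the low 12 bits hold the number of additional pages after the base page. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE UINT64_C(4096)
#define FAKE_PAGE_MASK (~(FAKE_PAGE_SIZE - 1))

/* Canonical for 48-bit VAs: bits 63..47 are all zero or all one. */
static bool is_canonical48(uint64_t va)
{
    uint64_t top = va >> 47;

    return top == 0 || top == 0x1ffff;
}

/* Walk one entry and count flushed pages, skipping bad GVAs one by one. */
static unsigned int fake_flush_entry(uint64_t entry, unsigned int *skipped)
{
    uint64_t gva = entry & FAKE_PAGE_MASK;
    uint64_t pages = (entry & ~FAKE_PAGE_MASK) + 1;
    unsigned int flushed = 0;
    uint64_t i;

    for (i = 0; i < pages; i++) {
        if (!is_canonical48(gva + i * FAKE_PAGE_SIZE)) {
            (*skipped)++;
            continue;
        }
        flushed++;  /* a real implementation would flush this page */
    }
    return flushed;
}
```

Both directions matter: a range can start canonical and run into the hole, or start inside the hole and end on a valid address.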
Fixes: fa787ac07b ("KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush")
Signed-off-by: Manuel Andreas <manuel.andreas@tum.de>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://patch.msgid.link/00a7a31b-573b-4d92-91f8-7d7e2f88ea48@tum.de
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM incorrectly synthesizes CPUID bits for KVM-only leaves, as the
following branch in kvm_cpu_cap_init() is never taken:
if (leaf < NCAPINTS)
kvm_cpu_caps[leaf] &= kernel_cpu_caps[leaf];
This means that bits set via SYNTHESIZED_F() for KVM-only leaves are
unconditionally set. This for example can cause issues for SEV-SNP
guests running on Family 19h CPUs, as TSA_SQ_NO and TSA_L1_NO are
always enabled by KVM in 80000021[ECX]. When userspace issues a
SNP_LAUNCH_UPDATE command to update the CPUID page for the guest, SNP
firmware will explicitly reject the command if the page sets these
bits on vulnerable CPUs.
To fix this, check in SYNTHESIZED_F() that the corresponding X86
capability is set before adding it to kvm_cpu_cap_features.
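In spirit, the corrected behaviour gates each synthesized bit on the host capability. The sketch below is a plain user-space model with a hypothetical helper name and bit numbers; it does not reproduce the SYNTHESIZED_F() macro.

```c
#include <stdint.h>

/*
 * Add a synthesized bit to a KVM-only leaf only when the host
 * reports the matching capability; otherwise leave the leaf alone.
 */
static uint32_t synthesize_bit(uint32_t kvm_leaf, uint32_t host_leaf,
                               unsigned int bit)
{
    uint32_t mask = UINT32_C(1) << bit;

    if (host_leaf & mask)
        kvm_leaf |= mask;
    return kvm_leaf;
}
```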
Fixes: 31272abd59 ("KVM: SVM: Advertise TSA CPUID bits to guests")
Link: https://lore.kernel.org/all/20260208164233.30405-1-clopez@suse.de/
Signed-off-by: Carlos López <clopez@suse.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://patch.msgid.link/20260209153108.70667-2-clopez@suse.de
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a build error in kvmppc_e500_tlb_init() that was introduced by the
conversion to use kzalloc_objs(), as KVM confusingly uses the size of the
structure that is the one and only field in tlbe_priv:
arch/powerpc/kvm/e500_mmu.c:923:33: error: assignment to 'struct tlbe_priv *'
from incompatible pointer type 'struct tlbe_ref *' [-Wincompatible-pointer-types]
923 | vcpu_e500->gtlb_priv[0] = kzalloc_objs(struct tlbe_ref,
| ^
KVM has been flawed since commit 0164c0f0c4 ("KVM: PPC: e500: clear up
confusion between host and guest entries"), but the issue went unnoticed
until kmalloc_obj() came along and enforced types, as "struct tlbe_priv"
was just a wrapper of "struct tlbe_ref" (why on earth the two ever existed
separately...).
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Cc: Kees Cook <kees@kernel.org>
Reviewed-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Link: https://patch.msgid.link/20260303190339.974325-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Increase 'maxnode' when using 'get_mempolicy' syscall in guest_memfd
mmap and NUMA policy tests to fix a failure on one Intel GNR platform.
On a CXL-capable platform, the memory affinity of CXL memory regions may
not be covered by the SRAT. Since each CXL memory region is enumerated
via a CFMWS table, at early boot the kernel parses all CFMWS tables to
detect all CXL memory regions and assigns a 'faked' NUMA node for each
of them, starting from the highest NUMA node ID enumerated via the SRAT.
This increases the 'nr_node_ids'. E.g., on the aforementioned Intel GNR
platform which has 4 NUMA nodes and 18 CFMWS tables, it increases to 22.
This results in the 'get_mempolicy' syscall failure on that platform,
because currently 'maxnode' is hard-coded to 8 but the 'get_mempolicy'
syscall requires the 'maxnode' to be not smaller than the 'nr_node_ids'.
Increase the 'maxnode' to the number of bits of 'nodemask', which is
'unsigned long', to fix this.
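The arithmetic can be checked with a small user-space sketch. maxnode_accepted() is a hypothetical helper that only models the stated requirement (maxnode must not be smaller than nr_node_ids); it does not call the syscall.

```c
#include <limits.h>
#include <stdbool.h>

/* Number of bits in a single 'unsigned long' nodemask. */
#define NODEMASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Models the requirement that maxnode is not smaller than nr_node_ids. */
static bool maxnode_accepted(unsigned long maxnode, unsigned long nr_node_ids)
{
    return maxnode >= nr_node_ids;
}
```

With 4 SRAT nodes plus 18 CFMWS-derived nodes, the old value of 8 fails the check while the width of 'unsigned long' passes it.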
This may not cover all systems. Perhaps a better way is to always set
the 'nodemask' and 'maxnode' based on the actual maximum NUMA node ID on
the system, but for now just take the simple approach.
Reported-by: Yi Lai <yi1.lai@intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221014
Closes: https://lore.kernel.org/all/bug-221014-28872@https.bugzilla.kernel.org%2F
Signed-off-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260302205158.178058-1-kai.huang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM/arm64 fixes for 7.0, take #3
- Correctly handle deactivation of out-of-LRs interrupts by
starting the EOIcount deactivation walk *after* the last irq
that made it into an LR. This avoids deactivating irqs that
are in the LRs and that the vcpu hasn't deactivated yet.
- Avoid calling into the stubs to probe for ICH_VTR_EL2.TDS when
pKVM is already enabled -- not only is this not possible (pKVM
will reject the call), but it is also useless: this can only
happen for a CPU that has already booted once, and the capability
will not change.
KVM generic changes for 7.0
- Remove a subtle pseudo-overlay of kvm_stats_desc, which, aside from being
unnecessary and confusing, triggered compiler warnings due to
-Wflex-array-member-not-at-end.
- Document that vcpu->mutex is taken outside of kvm->slots_lock and
kvm->slots_arch_lock, which is intentional and desirable despite being
rather unintuitive.
KVM/arm64 fixes for 7.0, take #2
- Fix a couple of low-severity bugs in our S2 fault handling path,
affecting the recently introduced LS64 handling and the even more
esoteric handling of hwpoison in a nested context
- Address yet another syzkaller finding in the vgic initialisation,
where we would end up destroying an uninitialised vgic, with nasty
consequences
- Address an annoying case of pKVM failing to boot when some of the
memblock regions that the host is faulting in are not page-aligned
- Inject some sanity in the NV stage-2 walker by checking the limits
against the advertised PA size, and correctly report the resulting
faults
- Drop an unnecessary ISB when emulating an EL2 S1 address translation
When refill_sheaf() only partially fills a sheaf (e.g., it fills 5 objects
but needs to fill 10), it updates sheaf->size and returns -ENOMEM.
However, the callers (alloc_full_sheaf() and __pcs_replace_empty_main())
directly call free_empty_sheaf() on failure, which only does kfree(sheaf),
so the partially allocated objects in sheaf->objects[] are leaked.
Fix this by calling sheaf_flush_unused() before free_empty_sheaf() to
free the objects in sheaf->objects[]. Also add a WARN_ON() to
free_empty_sheaf() to catch any future cases where a non-empty sheaf is
being freed.
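A user-space model of the leak and the fix, using malloc() and free() in place of the slab allocator. All fake_* names, the capacity and the allocation budget are hypothetical; the counters exist only so the behaviour can be observed.

```c
#include <errno.h>
#include <stdlib.h>

#define FAKE_SHEAF_CAPACITY 10u

struct fake_sheaf {
    unsigned int size;
    void *objects[FAKE_SHEAF_CAPACITY];
};

static int live_objects;            /* objects allocated and not yet freed */
static int nonempty_free_warnings;  /* stand-in for the added WARN_ON() */

/* Fill the sheaf; 'budget' caps allocations to force a partial fill. */
static int fake_refill_sheaf(struct fake_sheaf *sheaf, unsigned int budget)
{
    while (sheaf->size < FAKE_SHEAF_CAPACITY) {
        void *obj;

        if (budget == 0)
            return -ENOMEM;
        obj = malloc(16);
        if (!obj)
            return -ENOMEM;
        budget--;
        sheaf->objects[sheaf->size++] = obj;
        live_objects++;
    }
    return 0;
}

/* Free every object still held by the sheaf. */
static void fake_sheaf_flush_unused(struct fake_sheaf *sheaf)
{
    while (sheaf->size > 0) {
        free(sheaf->objects[--sheaf->size]);
        live_objects--;
    }
}

/* Frees only the container, so warn if objects are still inside. */
static void fake_free_empty_sheaf(struct fake_sheaf *sheaf)
{
    if (sheaf->size != 0)
        nonempty_free_warnings++;
    free(sheaf);
}

/* Caller with the fix applied: flush before freeing on failure. */
static struct fake_sheaf *fake_alloc_full_sheaf(unsigned int budget)
{
    struct fake_sheaf *sheaf = calloc(1, sizeof(*sheaf));

    if (!sheaf)
        return NULL;
    if (fake_refill_sheaf(sheaf, budget)) {
        fake_sheaf_flush_unused(sheaf);
        fake_free_empty_sheaf(sheaf);
        return NULL;
    }
    return sheaf;
}
```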
Fixes: ed30c4adfc ("slab: add optimized sheaf refill from partial list")
Signed-off-by: Qing Wang <wangqing7171@gmail.com>
Link: https://patch.msgid.link/20260311093617.4155965-1-wangqing7171@gmail.com
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Hao Li <hao.li@linux.dev>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
hv_crash_c_entry() is a C function that is entered without a stack,
and this is only allowed for functions that have the __naked attribute,
which informs the compiler that it must not emit the usual prologue and
epilogue or emit any other kind of instrumentation that relies on a
stack frame.
So split up the function, and set the __naked attribute on the initial
part that sets up the stack, GDT, IDT and other pieces that are needed
for ordinary C execution. Given that function calls are not permitted
either, use the existing long return coded in an asm() block to call the
second part of the function, which is an ordinary function that is
permitted to call other functions as usual.
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> # asm parts, not hv parts
Reviewed-by: Mukesh Rathor <mrathor@linux.microsoft.com>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: linux-hyperv@vger.kernel.org
Fixes: 94212d3461 ("x86/hyperv: Implement hypervisor RAM collection into vmcore")
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
This reverts commit 36d6cbb621.
Calling this as a passthrough hypercall leaves the VM in an inconsistent
state. Revert before it is released.
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Pull remoteproc fixes from Bjorn Andersson:
- Correct the early return from the i.MX remoteproc prepare
operation, which prevented the platform-specific prepare
function from being reached
- Ensure that the Mediatek SCP clock is released during system
suspend after the recent refactoring to avoid issues with the
clock framework's prepare lock.
- Correct the type of the subsys_name_len field in the sysmon
event QMI message, as the recent introduction of big endian
support in the QMI encoder highlighted the type mismatch and
resulted in a failure to encode the message
- Roll back the devm_ioremap_resource_wc() to a devm_ioremap_wc()
in the Qualcomm WCNSS remoteproc driver, after reports that
requesting this resource fails on some platforms
* tag 'rproc-v7.0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/remoteproc/linux:
remoteproc: imx_rproc: Fix unreachable platform prepare_ops
remoteproc: mediatek: Unprepare SCP clock during system suspend
remoteproc: sysmon: Correct subsys_name_len type in QMI request
remoteproc: qcom_wcnss: Fix reserved region mapping failure
Pull powerpc fixes from Madhavan Srinivasan:
- Correct MSI allocation tracking
- Always use 64 bits PTE for powerpc/e500
- Fix inline assembly for clang build on PPC32
- Fixes for clang build issues in powerpc64/ftrace
- Fixes for powerpc64/bpf JIT and tailcall support
- Cleanup MPC83XX devicetrees
- Fix keymile vendor prefix
- Fix to use big-endian types for crash variables
Thanks to Abhishek Dubey, Christophe Leroy (CS GROUP), Hari Bathini,
Heiko Schocher, J. Neuschäfer, Mahesh Salgaonkar, Nam Cao, Nilay Shroff,
Rob Herring (Arm), Saket Kumar Bhaskar, Sourabh Jain, Stan Johnson, and
Venkat Rao Bagalkote.
* tag 'powerpc-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (23 commits)
powerpc/pseries: Correct MSI allocation tracking
powerpc: dts: mpc83xx: Add unit addresses to /memory
powerpc: dts: mpc8315erdb: Add missing #cells properties to SPI bus
powerpc: dts: mpc8315erdb: Rename LED nodes to comply with schema
powerpc: dts: mpc8315erdb: Use IRQ_TYPE_* macros
powerpc: dts: mpc8313erdb: Use IRQ_TYPE_* macros
powerpc: 83xx: km83xx: Fix keymile vendor prefix
dt-bindings: powerpc: Add Freescale/NXP MPC83xx SoCs
powerpc64/bpf: fix kfunc call support
powerpc64/bpf: fix handling of BPF stack in exception callback
powerpc64/bpf: remove BPF redzone protection in trampoline stack
powerpc64/bpf: use consistent tailcall offset in trampoline
powerpc64/bpf: fix the address returned by bpf_get_func_ip
powerpc64/bpf: do not increment tailcall count when prog is NULL
powerpc64/ftrace: workaround clang recording GEP in __patchable_function_entries
powerpc64/ftrace: fix OOL stub count with clang
powerpc64: make clang cross-build friendly
powerpc/crash: adjust the elfcorehdr size
powerpc/kexec/core: use big-endian types for crash variables
powerpc/prom_init: Fixup missing #size-cells on PowerMac media-bay nodes
...
When oops_panic_write is set, the driver disables interrupts and
switches to PIO polling mode but still falls through into the DMA
path. DMA cannot be used reliably in panic context, so make the
DMA path an else branch to ensure only PIO is used during panic
writes.
Fixes: c1ac2dc34b ("mtd: rawnand: brcmnand: When oops in progress use pio and interrupt polling")
Signed-off-by: Kamal Dasu <kamal.dasu@broadcom.com>
Reviewed-by: William Zhang <william.zhang@broadcom.com>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
nand_lock() and nand_unlock() call into chip->ops.lock_area/unlock_area
without holding the NAND device lock. On controllers that implement
SET_FEATURES via multiple low-level PIO commands, these can race with
concurrent UBI/UBIFS background erase/write operations that hold the
device lock, resulting in cmd_pending conflicts on the NAND controller.
Add nand_get_device()/nand_release_device() around the lock/unlock
operations to serialize them against all other NAND controller access.
Fixes: 92270086b7 ("mtd: rawnand: Add support for manufacturer specific lock/unlock operation")
Signed-off-by: Kamal Dasu <kamal.dasu@broadcom.com>
Reviewed-by: William Zhang <william.zhang@broadcom.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Some bootloaders like recent versions of U-Boot may install some DMI
properties with empty values rather than not populate them. This manages
to make its way through the validator and cleanup resulting in a rogue
hyphen being appended to the card longname.
Fixes: 4e01e5dbba ("ASoC: improve the DMI long card code in asoc-core")
Signed-off-by: Casey Connolly <casey.connolly@linaro.org>
Link: https://patch.msgid.link/20260306174707.283071-2-casey.connolly@linaro.org
Signed-off-by: Mark Brown <broonie@kernel.org>
The example is wrong: the reg property of the flash must always match the
node name.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
These properties must be set because they overwrite the default values,
especially #size-cells which is 0 for most controllers and is 'const: 1'
here.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
The description mixes two nodes. There is the controller, and there is
the flash. Describe the flash (which itself can be considered an mtd
device, unlike the top level controller), and move the st,smi-fast-mode
property inside, as this property is flash specific and should not live
in the parent controller node.
Fixes: 68cd8ef484 ("dt-bindings: mtd: st,spear600-smi: convert to DT schema")
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
The DmaGspMem pointer accessor methods (gsp_write_ptr, gsp_read_ptr,
cpu_read_ptr, cpu_write_ptr, advance_cpu_read_ptr,
advance_cpu_write_ptr) dereference a raw pointer to DMA memory, creating
an intermediate reference before calling volatile read/write methods.
This is undefined behavior since DMA memory can be concurrently modified
by the device.
Fix this by moving the implementations into a gsp_mem module in fw.rs
that uses the dma_read!() / dma_write!() macros, making the original
methods on DmaGspMem thin forwarding wrappers.
An alternative approach would have been to wrap the shared memory in
Opaque, but that would have required even more unsafe code.
Since the gsp_mem module lives in fw.rs (to access firmware-specific
binding field names), GspMem, Msgq and their relevant fields are
temporarily widened to pub(super). This will be reverted once IoView
projections are available.
Cc: Gary Guo <gary@garyguo.net>
Closes: https://lore.kernel.org/nouveau/DGUT14ILG35P.1UMNRKU93JUM1@kernel.org/
Fixes: 75f6b1de81 ("gpu: nova-core: gsp: Add GSP command queue bindings and handling")
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260309225408.27714-1-dakr@kernel.org
[ Use pub(super) where possible; replace bitwise-and with modulo
operator analogous to [1]. - Danilo ]
Link: https://lore.kernel.org/all/20260129-nova-core-cmdq1-v3-1-2ede85493a27@nvidia.com/ [1]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
The LPVO USB GPIB adapter apparently uses an FTDI 8U232AM with the
default PID, but this device id is already handled by the ftdi_sio
serial driver.
Stop binding to the default PID to avoid breaking existing setups with
FTDI 8U232AM.
Anyone using this driver should blacklist the ftdi_sio driver and add
the device id manually through sysfs (e.g. using udev rules).
Fixes: fce79512a9 ("staging: gpib: Add LPVO DIY USB GPIB driver")
Fixes: e6ab504633 ("staging: gpib: Destage gpib")
Cc: Dave Penkler <dpenkler@gmail.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
Link: https://patch.msgid.link/20260305151729.10501-2-johan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
usb_role_switch_is_parent() was walking up to the parent node and checking
for the "usb-role-switch" property regardless of the type of the passed
fwnode. This could cause unrelated device nodes to be probed as potential
role switch parent, leading to spurious matches and "-EPROBE_DEFER" being
returned infinitely.
So far, only a Type-B connector node has a parent node which may
present the "usb-role-switch" property and register the role switch device.
For a Type-C connector node, the parent node will always be a Type-C chip
device which will never register the role switch device. However, it may
still present a non-boolean "usb-role-switch = <&usb_controller>" property
for historical compatibility.
So restrict the helper to only operate on Type-B connectors when attempting
to get the role switch from the parent node.
Fixes: 6fadd72943 ("usb: roles: get usb-role-switch from parent")
Cc: stable <stable@kernel.org>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://patch.msgid.link/20260309074313.2809867-3-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 1366cd228b.
The fwnode_usb_role_switch_get() returns NULL only if no connection is
found, returns ERR_PTR(-EPROBE_DEFER) if connection is found but deferred
probe is needed, or a valid pointer of usb_role_switch.
When switching from a NULL check to IS_ERR_OR_NULL(), usb_role_switch_get()
returns NULL and overwrites the ERR_PTR(-EPROBE_DEFER) returned by
fwnode_usb_role_switch_get(). This causes the deferred probe indication to
be lost, preventing the USB role switch from ever being retrieved.
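The effect of the two checks can be modelled in userspace. This is only a sketch: the ERR_PTR helpers are simplified stand-ins for the kernel ones, and first_lookup()/fallback_lookup() are hypothetical placeholders for fwnode_usb_role_switch_get() and usb_role_switch_get() in the situation described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO	4095
#define EPROBE_DEFER	517	/* kernel-internal errno, not in userspace headers */

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static inline int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || IS_ERR(ptr);
}

/* Hypothetical: a connection exists, but its provider has not probed yet. */
static inline void *first_lookup(void) { return ERR_PTR(-EPROBE_DEFER); }
/* Hypothetical: the fallback finds no connection at all. */
static inline void *fallback_lookup(void) { return NULL; }

/* Reverted behaviour: the fallback runs on any error and overwrites it. */
static inline void *get_with_is_err_or_null(void)
{
	void *sw = first_lookup();

	if (IS_ERR_OR_NULL(sw))
		sw = fallback_lookup();
	return sw;
}

/* Restored behaviour: the fallback runs only when nothing was found. */
static inline void *get_with_null_check(void)
{
	void *sw = first_lookup();

	if (!sw)
		sw = fallback_lookup();
	return sw;
}
```

With the plain NULL check the caller still sees -EPROBE_DEFER and can retry later; with IS_ERR_OR_NULL() it sees NULL and concludes that no role switch exists.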
Fixes: 1366cd228b ("tcpm: allow looking for role_sw device in the main node")
Cc: stable <stable@kernel.org>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Tested-by: Arnaud Ferraris <arnaud.ferraris@collabora.com>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://patch.msgid.link/20260309074313.2809867-2-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The network device outlived its parent gadget device during
disconnection, resulting in dangling sysfs links and null pointer
dereference problems.
A prior attempt to solve this by removing SET_NETDEV_DEV entirely [1]
was reverted due to power management ordering concerns and a NO-CARRIER
regression.
A subsequent attempt to defer net_device allocation to bind [2] broke
1:1 mapping between function instance and network device, making it
impossible for configfs to report the resolved interface name. This
results in a regression where the DHCP server fails on pmOS.
Use device_move to reparent the net_device between the gadget device and
/sys/devices/virtual/ across bind/unbind cycles. This preserves the
network interface across USB reconnection, allowing the DHCP server to
retain its binding.
Introduce gether_attach_gadget()/gether_detach_gadget() helpers and use
__free(detach_gadget) macro to undo attachment on bind failure. The
bind_count ensures device_move executes only on the first bind.
[1] https://lore.kernel.org/lkml/f2a4f9847617a0929d62025748384092e5f35cce.camel@crapouillou.net/
[2] https://lore.kernel.org/linux-usb/795ea759-7eaf-4f78-81f4-01ffbf2d7961@ixit.cz/
Fixes: 40d133d7f5 ("usb: gadget: f_ncm: convert to new function interface with backward compatibility")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-7-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit e065c6a7e4.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-6-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 7a7930c0f9.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-5-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 0c0981126b.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-4-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 56a512a9b4.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-3-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit fde0634ad9.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-2-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 0d6c8144ca.
This commit is being reverted as part of a series-wide revert.
By deferring the net_device allocation to the bind() phase, a single
function instance will spawn multiple network devices if it is symlinked
to multiple USB configurations.
This causes regressions for userspace tools (like the postmarketOS DHCP
daemon) that rely on reading the interface name (e.g., "usb0") from
configfs. Currently, configfs returns the template "usb%d", causing the
userspace network setup to fail.
Crucially, because this patch breaks the 1:1 mapping between the
function instance and the network device, this naming issue cannot
simply be patched. Configfs only exposes a single 'ifname' attribute per
instance, making it impossible to accurately report the actual interface
name when multiple underlying network devices can exist for that single
instance.
All configurations tied to the same function instance are meant to share
a single network device. Revert this change to restore the 1:1 mapping
by allocating the network device at the instance level (alloc_inst).
Reported-by: David Heidelberg <david@ixit.cz>
Closes: https://lore.kernel.org/linux-usb/70b558ea-a12e-4170-9b8e-c951131249af@ixit.cz/
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable <stable@kernel.org>
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260309-f-ncm-revert-v2-1-ea2afbc7d9b2@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
dp_altmode_configure sets the signaling rate to the current
configuration's rate and then shifts the value to the Select
Configuration bitfield. On the initial configuration, dp->data.conf
is 0 to begin with, so the signaling rate field is never set, which
leads to some DisplayPort Alt Mode partners sending NAK to the
Configure message.
Set the signaling rate to the capabilities supported by both the
port and the port partner. If the cable supports DisplayPort Alt Mode,
then include its capabilities as well.
Fixes: a17fae8fc3 ("usb: typec: Add Displayport Alternate Mode 2.1 Support")
Cc: stable <stable@kernel.org>
Signed-off-by: RD Babiera <rdbabiera@google.com>
Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://patch.msgid.link/20260310204106.3939862-2-rdbabiera@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The xHCI controller reports a Host Controller Error (HCE) in UAS Storage
Device plug/unplug scenarios on Android devices. HCE is checked in
xhci_irq() function and causes an interrupt storm (since the interrupt
isn’t cleared), leading to severe system-level faults.
When the xHC controller reports HCE in the interrupt handler, the driver
only logs a warning and assumes xHC activity will stop as stated in xHCI
specification. An interrupt storm does however continue on some hosts
even after HCE, and only ceases after manually disabling xHC interrupt
and stopping the controller by calling xhci_halt().
Add xhci_halt() to xhci_irq() function where STS_HCE status is checked,
mirroring the existing error handling pattern used for STS_FATAL errors.
This only fixes the interrupt storm. Proper HCE recovery requires resetting
and re-initializing the xHC.
CC: stable@vger.kernel.org
Signed-off-by: Dayu Jiang <jiangdayu@xiaomi.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://patch.msgid.link/20260304223639.3882398-3-mathias.nyman@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xhci_alloc_command() allocates a command structure and, when the
second argument is true, also allocates a completion structure.
Currently, the error handling path in xhci_disable_slot() only frees
the command structure using kfree(), causing the completion structure
to leak.
Use xhci_free_command() instead of kfree(). xhci_free_command() correctly
frees both the command structure and the associated completion structure.
Since the command structure is allocated with zero-initialization,
command->in_ctx is NULL and will not be erroneously freed by
xhci_free_command().
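The leak can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the xHCI code: struct command, alloc_command(), free_command() and the allocation counter are hypothetical stand-ins for the driver's command structure, xhci_alloc_command() and xhci_free_command().

```c
#include <stdbool.h>
#include <stdlib.h>

struct completion { int done; };

struct command {
	void *in_ctx;			/* stays NULL after zeroed allocation */
	struct completion *completion;	/* optional second allocation */
};

/* Counts outstanding allocations so a leak becomes visible. */
static int live_allocations;

static inline void *tracked_zalloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		live_allocations++;
	return p;
}

static inline void tracked_free(void *p)
{
	if (p) {
		live_allocations--;
		free(p);
	}
}

/* Allocates the command and, if requested, its completion. */
static inline struct command *alloc_command(bool allocate_completion)
{
	struct command *cmd = tracked_zalloc(sizeof(*cmd));

	if (!cmd)
		return NULL;
	if (allocate_completion) {
		cmd->completion = tracked_zalloc(sizeof(*cmd->completion));
		if (!cmd->completion) {
			tracked_free(cmd);
			return NULL;
		}
	}
	return cmd;
}

/* Frees everything the command owns; NULL members are skipped. */
static inline void free_command(struct command *cmd)
{
	tracked_free(cmd->in_ctx);
	tracked_free(cmd->completion);
	tracked_free(cmd);
}
```

Freeing only the outer structure (the model's tracked_free(cmd)) leaves the completion allocated, whereas free_command() releases both and safely skips the NULL in_ctx.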
This bug was found using an experimental static analysis tool we are
developing. The tool is based on the LLVM framework and is specifically
designed to detect memory management issues. It is currently under
active development and not yet publicly available, but we plan to
open-source it after our research is published.
The bug was originally detected on v6.13-rc1 using our static analysis
tool, and we have verified that the issue persists in the latest mainline
kernel.
We performed build testing on x86_64 with allyesconfig using GCC=11.4.0.
Since triggering these error paths in xhci_disable_slot() requires specific
hardware conditions or abnormal state, we were unable to construct a test
case to reliably trigger these specific error paths at runtime.
Fixes: 7faac1953e ("xhci: avoid race between disable slot command and host runtime suspend")
CC: stable@vger.kernel.org
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Link: https://patch.msgid.link/20260304223639.3882398-2-mathias.nyman@linux.intel.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In usbhs_remove(), the driver frees resources (including the pipe array)
while the interrupt handler (usbhs_interrupt) is still registered. If an
interrupt fires after usbhs_pipe_remove() but before the driver is fully
unbound, the ISR may access freed memory, causing a use-after-free.
Fix this by calling devm_free_irq() before freeing resources. This ensures
the interrupt handler is both disabled and synchronized (waits for any
running ISR to complete) before usbhs_pipe_remove() is called.
Fixes: f1407d5c66 ("usb: renesas_usbhs: Add Renesas USBHS common code")
Cc: stable <stable@kernel.org>
Suggested-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
Link: https://patch.msgid.link/20260303073344.34577-1-fanwu01@zju.edu.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The CH343 USB/serial adapter is as buggy as it is popular (very).
One of its quirks is that despite being capable of signalling a
BREAK condition, it doesn't advertise it.
This used to work nonetheless until 66aad7d8d3 ("usb: cdc-acm:
return correct error code on unsupported break") applied some
reasonable restrictions, preventing breaks from being emitted on
devices that do not advertise CAP_BRK.
Add a quirk for this particular device, so that breaks can still
be produced on some of my machines attached to my console server.
Fixes: 66aad7d8d3 ("usb: cdc-acm: return correct error code on unsupported break")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable <stable@kernel.org>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Oliver Neukum <oneukum@suse.com>
Link: https://patch.msgid.link/20260301124440.1192752-1-maz@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The `check_command_size_in_blocks()` function calculates the data size
in bytes by left shifting `common->data_size_from_cmnd` by the block
size (`common->curlun->blkbits`). However, it does not validate whether
this shift operation will cause an integer overflow.
Initially, the block size is set up in `fsg_lun_open()`, and the
`common->data_size_from_cmnd` is set up in `do_scsi_command()`. During
initialization, there is no integer overflow check for the interaction
between the two variables.
So if a malicious USB host sends a SCSI READ or WRITE command
requesting a large amount of data (`common->data_size_from_cmnd`), the
left shift operation can wrap around. This results in a truncated data
size, which can bypass boundary checks and potentially lead to memory
corruption or out-of-bounds accesses.
Fix this by using the check_shl_overflow() macro to safely perform the
shift and catch any overflows.
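A userspace sketch of the idea follows. It is not the kernel's check_shl_overflow() macro; shl_overflows_u32() is a hypothetical helper that assumes a 32-bit size and, like the kernel macro, returns true when the shifted value does not fit.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: computes value << shift in 64 bits and reports
 * whether the result fits into 32 bits. Returns true on overflow; on
 * success the shifted value is stored in *out.
 */
static inline bool shl_overflows_u32(uint32_t value, unsigned int shift,
				     uint32_t *out)
{
	uint64_t wide;

	if (shift >= 32)
		return true;

	wide = (uint64_t)value << shift;
	if (wide > UINT32_MAX)
		return true;

	*out = (uint32_t)wide;
	return false;
}

/* The unchecked form: silently wraps modulo 2^32. */
static inline uint32_t shl_unchecked_u32(uint32_t value, unsigned int shift)
{
	return value << shift;
}
```

For a 512-byte block size (shift of 9), a request for 0x00800000 blocks wraps to a size of 0 in the unchecked form, while the checked helper reports the overflow so the command can be rejected.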
Fixes: 144974e7f9 ("usb: gadget: mass_storage: support multi-luns with different logic block size")
Signed-off-by: Seungjin Bae <eeodqql09@gmail.com>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://patch.msgid.link/20260228104324.1696455-2-eeodqql09@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When adding dynamic configuration for bInterval, the value was removed
from the static SuperSpeed endpoint descriptors but was not set from the
configured value in hidg_bind(). Thus at SuperSpeed the interrupt
endpoints have bInterval as zero which is not valid per the USB
specification.
Add the missing setting for SuperSpeed endpoints.
Fixes: ea34925f5b ("usb: gadget: hid: allow dynamic interval configuration via configfs")
Cc: stable <stable@kernel.org>
Signed-off-by: John Keeping <jkeeping@inmusicbrands.com>
Acked-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://patch.msgid.link/20260227111540.431521-1-jkeeping@inmusicbrands.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some USB devices incorrectly report bNumConfigurations as 0 in their
device descriptor, which causes the USB core to reject them during
enumeration.
logs:
usb 1-2: device descriptor read/64, error -71
usb 1-2: no configurations
usb 1-2: can't read configurations, error -22
However, these devices actually work correctly when
treated as having a single configuration.
Add a new quirk USB_QUIRK_FORCE_ONE_CONFIG to handle such devices.
When this quirk is set, assume the device has 1 configuration instead
of failing with -EINVAL.
This quirk is applied to the device with VID:PID 5131:2007 which
exhibits this behavior.
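The decision described above can be sketched as follows. This is an illustration only: resolve_num_configurations() is a hypothetical helper, and the bit value given to USB_QUIRK_FORCE_ONE_CONFIG here is made up for the sketch.

```c
#include <errno.h>

/* Bit position chosen arbitrarily for this sketch. */
#define USB_QUIRK_FORCE_ONE_CONFIG	(1u << 0)

/*
 * Hypothetical helper: returns the number of configurations to use, or
 * -EINVAL if the device reports none and the quirk is not set.
 */
static inline int resolve_num_configurations(unsigned int quirks, int reported)
{
	if (reported == 0) {
		if (quirks & USB_QUIRK_FORCE_ONE_CONFIG)
			return 1;	/* treat the device as single-config */
		return -EINVAL;		/* previous behaviour: reject it */
	}
	return reported;
}
```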
Signed-off-by: Jie Deng <dengjie03@kylinos.cn>
Link: https://patch.msgid.link/20260227084931.1527461-1-dengjie03@kylinos.cn
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The usb_control_msg(), usb_bulk_msg(), and usb_interrupt_msg() APIs in
usbcore allow unlimited timeout durations. And since they use
uninterruptible waits, this leaves open the possibility of hanging a
task for an indefinitely long time, with no way to kill it short of
unplugging the target device.
To prevent this sort of problem, enforce a maximum limit on the length
of these unkillable timeouts. The limit chosen here, somewhat
arbitrarily, is 60 seconds. On many systems (although not all) this
is short enough to avoid triggering the kernel's hung-task detector.
In addition, clear up the ambiguity of negative timeout values by
treating them the same as 0, i.e., using the maximum allowed timeout.
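The resulting rule can be sketched as a small clamp. The names below are hypothetical and the helper is not the usbcore implementation; only the 60 second limit and the treatment of zero and negative values come from the description above.

```c
/* Hypothetical name for the 60 second limit, in milliseconds. */
#define MAX_UNINTERRUPTIBLE_TIMEOUT_MS	60000

/*
 * Hypothetical helper: zero and negative timeouts mean "use the maximum",
 * and anything above the limit is clamped down to it.
 */
static inline int clamp_timeout_ms(int timeout_ms)
{
	if (timeout_ms <= 0 || timeout_ms > MAX_UNINTERRUPTIBLE_TIMEOUT_MS)
		return MAX_UNINTERRUPTIBLE_TIMEOUT_MS;
	return timeout_ms;
}
```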
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Link: https://lore.kernel.org/linux-usb/3acfe838-6334-4f6d-be7c-4bb01704b33d@rowland.harvard.edu/
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
CC: stable@vger.kernel.org
Link: https://patch.msgid.link/15fc9773-a007-47b0-a703-df89a8cf83dd@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Remove the error path from the usb_phy_roothub_set_mode() function.
The code is clearly wrong, because phy_set_mode() calls can't be
balanced with phy_power_off() calls.
Additionally, the usb_phy_roothub_set_mode() function is called only
from usb_add_hcd() before it powers on the PHYs, so powering off those
makes no sense anyway.
Presumably, the code is copy-pasted from the phy_power_on() function
without adjusting the error handling.
Cc: stable@vger.kernel.org # v5.1+
Fixes: b97a313483 ("usb: core: comply to PHY framework")
Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://patch.msgid.link/20260218-usb-phy-poweroff-fix-v1-1-66e6831e860e@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Elan touchscreens have a HID-battery device for the stylus which is always
there even if there is no stylus.
This is causing upower to report an empty battery for the stylus and some
desktop-environments will show a notification about this, which is quite
annoying.
Because of this the HID-battery is being ignored on all Elan I2C and USB
touchscreens, but this causes there to be no battery reporting for
the stylus at all.
This adds a new HID_BATTERY_QUIRK_DYNAMIC and uses it for the Elan
touchscreens.
This new quirk causes the present value of the battery to start at 0,
which will make userspace ignore it, and only sets present to 1 after
receiving a battery input report, which only happens when the stylus
gets in range.
Reported-by: ggrundik@gmail.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221118
Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Drop the Asus UX550* touchscreen ignore battery quirks. There is a blanket
HID_BATTERY_QUIRK_IGNORE for all USB_VENDOR_ID_ELAN USB touchscreens now,
so these are just a duplicate of those.
Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
The acp3x_5682_init() function did not check the return value of
clk_get(), which could lead to dereferencing error pointers in
rt5682_clk_enable().
Fix this by:
1. Changing clk_get() to the device-managed devm_clk_get().
2. Adding proper IS_ERR() checks for both clock acquisitions.
Fixes: 6b8e4e7db3 ("ASoC: amd: Add machine driver for Raven based platform")
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://patch.msgid.link/20260310024246.2153827-1-nichen@iscas.ac.cn
Signed-off-by: Mark Brown <broonie@kernel.org>
xfs_bmap_update_diff_items() sorts bmap intents by inode number using
a subtraction of two xfs_ino_t (uint64_t) values, with the result
truncated to int. This is incorrect when two inode numbers differ by
more than INT_MAX (2^31 - 1), which is entirely possible on large XFS
filesystems.
Fix this by replacing the subtraction with cmp_int().
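The difference is easy to reproduce in userspace. In this sketch the cmp_int() definition mirrors the three-way comparison idiom, and the two wrapper functions are hypothetical names used only for illustration. The narrowing conversion to int is implementation-defined in C; the comment describes what GCC and Clang do.

```c
#include <stdint.h>

/* Three-way comparison that yields -1, 0 or 1 without subtracting. */
#define cmp_int(l, r)	(((l) > (r)) - ((l) < (r)))

/*
 * Buggy pattern: the 64-bit difference is narrowed to int. With GCC and
 * Clang only the low 32 bits survive, so large differences can come out
 * as zero or with the wrong sign.
 */
static inline int cmp_by_subtraction(uint64_t a, uint64_t b)
{
	return (int)(a - b);
}

/* Fixed pattern: compare the full 64-bit values. */
static inline int cmp_by_cmp_int(uint64_t a, uint64_t b)
{
	return cmp_int(a, b);
}
```

Two inode numbers that differ by exactly 2^32 compare as equal with the subtraction, which breaks the sort order, while cmp_int() orders them correctly.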
Cc: <stable@vger.kernel.org> # v4.9
Fixes: 9f3afb57d5 ("xfs: implement deferred bmbt map/unmap operations")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Support for Dual SPI and Quad SPI was added to the Linux driver in
commit 0605d9fb41 ("spi: sun6i: add quirk for dual and quad SPI modes
support") and commit 25453d797d ("spi: sun6i: add dual and quad SPI
modes support for R329/D1/R528/T113s").
However the binding was never updated to allow these modes. Allow them
by adding 2 and 4 to the allowed bus widths for the newer variants.
While at it, also add 0 to the allowed bus widths. This signals that
RX or TX is not available, i.e. the MISO or MOSI pin is disconnected.
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Signed-off-by: Chen-Yu Tsai <wens@kernel.org>
Link: https://patch.msgid.link/20260302153559.3199783-2-wens@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
The pointer returned from acpi_os_map_generic_address() is
tagged with __iomem, so make rv, the variable it is assigned to,
a void __iomem * as well.
Fixes the following sparse warning:
drivers/acpi/osl.c:1686:20: warning: incorrect type in assignment (different address spaces)
drivers/acpi/osl.c:1686:20: expected void *rv
drivers/acpi/osl.c:1686:20: got void [noderef] __iomem *
Fixes: 6915564dc5 ("ACPI: OSL: Change the type of acpi_os_map_generic_address() return value")
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
[ rjw: Subject tweak, added Fixes tag ]
Link: https://patch.msgid.link/20260311105835.463030-1-ben.dooks@codethink.co.uk
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Chasing vfork()'ed tasks on a CID ownership mode switch requires a full
task list walk, which is obviously expensive on large systems.
Avoid that by keeping a list of tasks using a mm MMCID entity in mm::mm_cid
and walking this list instead. This removes the counting logic, which has
proven to be flaky, and avoids a full task list walk in the case of
vfork()'ed tasks.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.183824481@kernel.org
Matthieu and Jiri reported stalls where a task endlessly loops in
mm_get_cid() when scheduling in.
It turned out that the logic which handles vfork()'ed tasks is broken. It
is invoked when the number of tasks associated to a process is smaller than
the number of MMCID users. It then walks the task list to find the
vfork()'ed task, but accounts all the already processed tasks as well.
If that double processing brings the number of tasks still to be handled
to 0, the walk stops and the vfork()'ed task's CID is not fixed up. As a
consequence, a subsequent schedule-in fails to acquire a (transitional) CID
and the machine stalls.
Cure this by removing the accounting condition and making the fixup always
walk the full task list if it could not find the exact number of users in
the process' thread list.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Closes: https://lore.kernel.org/b24ffcb3-09d5-4e48-9070-0b69bc654281@kernel.org
Reported-by: Matthieu Baerts <matttbe@kernel.org>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202526.048657665@kernel.org
A newly forked task is accounted as MMCID user before the task is visible
in the process' thread list and the global task list. This creates the
following problem:
CPU1                                 CPU2
fork()
  sched_mm_cid_fork(tnew1)
    tnew1->mm.mm_cid_users++;
    tnew1->mm_cid.cid = getcid()
-> preemption
                                     fork()
                                       sched_mm_cid_fork(tnew2)
                                         tnew2->mm.mm_cid_users++;
                                         // Reaches the per CPU threshold
                                         mm_cid_fixup_tasks_to_cpus()
                                           for_each_other(current, p)
                                             ....
As tnew1 is not visible yet, this fails to fix up the already allocated CID
of tnew1. As a consequence a subsequent schedule in might fail to acquire a
(transitional) CID and the machine stalls.
Move the invocation of sched_mm_cid_fork() after the new task becomes
visible in the thread and the task list to prevent this.
This also makes it symmetrical vs. exit() where the task is removed as CID
user before the task is removed from the thread and task lists.
Fixes: fbd0e71dc3 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260310202525.969061974@kernel.org
Dinh writes:
firmware: stratix10-svc: add multiple svc clients support
- Add a dedicated thread for each svc client to fix a timeout issue when the svc
driver is handling multiple clients.
* tag 'stratix10_svc_fix_for_v7.0' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
firmware: stratix10-svc: Add Multi SVC clients support
The trace_clock_jiffies() function that handles the "uptime" clock for
tracing calls jiffies_64_to_clock_t(). This causes the function tracer to
constantly recurse when the tracing clock is set to "uptime". Mark it
notrace to prevent unnecessary recursion when using the "uptime" clock.
Fixes: 58d4e21e50 ("tracing: Fix wraparound problems in "uptime" trace clock")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260306212403.72270bb2@robin
This patch fixes an out-of-bounds access in ceph_handle_auth_reply()
that can be triggered by a message of type CEPH_MSG_AUTH_REPLY. In
ceph_handle_auth_reply(), the value of the payload_len field of such a
message is stored in a variable of type int. A value greater than
INT_MAX leads to an integer overflow and is interpreted as a negative
value. This leads to decrementing the pointer address by this value and
subsequently accessing it because ceph_decode_need() only checks that
the memory access does not exceed the end address of the allocation.
This patch fixes the issue by changing the data type of payload_len to
u32. Additionally, the data type of result_msg_len is changed to u32,
as it is also a variable holding a non-negative length.
Also, an additional layer of sanity checks is introduced, ensuring that
directly after reading it from the message, payload_len and
result_msg_len are not greater than the overall segment length.
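The effect of the signed length can be sketched in userspace as follows.
This is an illustration only: in_bounds_int() and in_bounds_u32() are
hypothetical helpers working on offsets, not the libceph decoding macros,
and the int conversion assumes the usual two's complement behaviour:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Upper-bound-only check with a signed length: a negative len passes. */
static bool in_bounds_int(ptrdiff_t off, int len, ptrdiff_t size)
{
	return off + len <= size;
}

/* Unsigned length: a huge wire value cannot look like a small one. */
static bool in_bounds_u32(size_t off, uint32_t len, size_t size)
{
	return off <= size && len <= size - off;
}

/* Hypothetical self-check with a wire value larger than INT_MAX. */
static void bounds_example(void)
{
	uint32_t wire_len = 0x80000010u;

	/* Interpreted as int the value is negative and slips through. */
	assert(in_bounds_int(16, (int)wire_len, 64));
	/* Kept as u32 it is rejected. */
	assert(!in_bounds_u32(16, wire_len, 64));
}
```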
BUG: KASAN: slab-out-of-bounds in ceph_handle_auth_reply+0x642/0x7a0 [libceph]
Read of size 4 at addr ffff88811404df14 by task kworker/20:1/262
CPU: 20 UID: 0 PID: 262 Comm: kworker/20:1 Not tainted 6.19.2 #5 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
<TASK>
dump_stack_lvl+0x76/0xa0
print_report+0xd1/0x620
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? kasan_complete_mode_report_info+0x72/0x210
kasan_report+0xe7/0x130
? ceph_handle_auth_reply+0x642/0x7a0 [libceph]
? ceph_handle_auth_reply+0x642/0x7a0 [libceph]
__asan_report_load_n_noabort+0xf/0x20
ceph_handle_auth_reply+0x642/0x7a0 [libceph]
mon_dispatch+0x973/0x23d0 [libceph]
? apparmor_socket_recvmsg+0x6b/0xa0
? __pfx_mon_dispatch+0x10/0x10 [libceph]
? __kasan_check_write+0x14/0x30
? mutex_unlock+0x7f/0xd0
? __pfx_mutex_unlock+0x10/0x10
? __pfx_do_recvmsg+0x10/0x10 [libceph]
ceph_con_process_message+0x1f1/0x650 [libceph]
process_message+0x1e/0x450 [libceph]
ceph_con_v2_try_read+0x2e48/0x6c80 [libceph]
? __pfx_ceph_con_v2_try_read+0x10/0x10 [libceph]
? save_fpregs_to_fpstate+0xb0/0x230
? raw_spin_rq_unlock+0x17/0xa0
? finish_task_switch.isra.0+0x13b/0x760
? __switch_to+0x385/0xda0
? __kasan_check_write+0x14/0x30
? mutex_lock+0x8d/0xe0
? __pfx_mutex_lock+0x10/0x10
ceph_con_workfn+0x248/0x10c0 [libceph]
process_one_work+0x629/0xf80
? __kasan_check_write+0x14/0x30
worker_thread+0x87f/0x1570
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? __pfx_try_to_wake_up+0x10/0x10
? kasan_print_address_stack_frame+0x1f7/0x280
? __pfx_worker_thread+0x10/0x10
kthread+0x396/0x830
? __pfx__raw_spin_lock_irq+0x10/0x10
? __pfx_kthread+0x10/0x10
? __kasan_check_write+0x14/0x30
? recalc_sigpending+0x180/0x210
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3f7/0x610
? __pfx_ret_from_fork+0x10/0x10
? __switch_to+0x385/0xda0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
[ idryomov: replace if statements with ceph_decode_need() for
payload_len and result_msg_len ]
Cc: stable@vger.kernel.org
Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch fixes unnecessary implicit conversions that change signedness
of blob_len and num_mon in ceph_monmap_decode().
Currently blob_len and num_mon are (signed) int variables. They are used
to hold values that are always non-negative and get assigned in
ceph_decode_32_safe(), which is meant to assign u32 values. Both
variables are subsequently used as unsigned values, and the value of
num_mon is further assigned to monmap->num_mon, which is of type u32.
Therefore, both variables should be of type u32. This is especially
relevant for num_mon. If the value read from the incoming message is
very large, it is interpreted as a negative value, and the check for
num_mon > CEPH_MAX_MON does not catch it. This leads to the attempt to
allocate a very large chunk of memory for monmap, which will most likely
fail. In this case, an unnecessary attempt to allocate memory is
performed, and -ENOMEM is returned instead of -EINVAL.
Cc: stable@vger.kernel.org
Signed-off-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Setting up the interface fails on this card when suspending/resuming.
Adding a reset and delay quirk eliminates this problem.
usb 1-1: New USB device found, idVendor=0666, idProduct=0880
usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
usb 1-1: Product: USB Audio
usb 1-1: Manufacturer: SPACETOUCH
usb 1-1: SerialNumber: 000000000
Signed-off-by: Lianqin Hu <hulianqin@vivo.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/TYUPR06MB6217ACC80B70BE25D87456B0D247A@TYUPR06MB6217.apcprd06.prod.outlook.com
The refactoring commit 5fa9b82cbc ("scripts: kconfig:
merge_config.sh: refactor from shell/sed/grep to awk") passes
$TMP_FILE.new as ARGV[3] to awk, using it as both an output destination
and an input file argument. When the base file is empty, nothing is
written to ARGV[3] during processing, so awk fails trying to open it
for reading:
awk: cmd. line:52: fatal: cannot open file
`./.tmp.config.grcQin34jb.new' for reading: No such file or directory
mv: cannot stat './.tmp.config.grcQin34jb.new': No such file or directory
Pass the output path via -v outfile instead and drop the FILENAME ==
ARGV[3] { nextfile }.
Fixes: 5fa9b82cbc ("scripts: kconfig: merge_config.sh: refactor from shell/sed/grep to awk")
Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
Link: https://patch.msgid.link/20260310-fixes-merge-config-v1-1-beaeeaded6bd@samsung.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Pull smb server fixes from Steve French:
- Fix potential use after free errors
- Fix refcount leak in smb2 open error path
- Prevent logging of signing or encryption keys
* tag 'v7.0-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Don't log keys in SMB3 signing and encryption key generation
smb: server: fix use-after-free in smb2_open()
ksmbd: fix use-after-free in smb_lazy_parent_lease_break_close()
ksmbd: fix use-after-free by using call_rcu() for oplock_info
ksmbd: fix use-after-free in proc_show_files due to early rcu_read_unlock
smb/server: Fix another refcount leak in smb2_open()
The bcmgenet EEE implementation is broken in several ways.
phy_support_eee() is never called, so the PHY never advertises EEE
and phylib never sets phydev->enable_tx_lpi. bcmgenet_mac_config()
checks priv->eee.eee_enabled to decide whether to enable the MAC
LPI logic, but that field is never initialised to true, so the MAC
never enters Low Power Idle even when EEE is negotiated - wasting
the power savings EEE is designed to provide. The only way to get
EEE working at all is a manual 'ethtool --set-eee eth0 eee on' after
every link-up, and even then bcmgenet_get_eee() immediately clobbers
the reported state because phy_ethtool_get_eee() overwrites
eee_enabled and tx_lpi_enabled with the uninitialised PHY eee_cfg
values. Finally, bcmgenet_mac_config() is only called on link-up,
so EEE is never disabled in hardware on link-down.
Fix all of this by removing the MAC-side EEE state tracking
(priv->eee) and aligning with the pattern used by other non-phylink
MAC drivers such as FEC.
Call phy_support_eee() in bcmgenet_mii_probe() so the PHY advertises
EEE link modes and phylib tracks negotiation state. Move the EEE
hardware control to bcmgenet_mii_setup(), which is called on every
link event, and drive it directly from phydev->enable_tx_lpi - the
flag phylib sets when EEE is negotiated and the user has not disabled
it. This enables EEE automatically once the link partner agrees and
disables it cleanly on link-down.
Make bcmgenet_get_eee() and bcmgenet_set_eee() pure passthroughs to
phy_ethtool_get_eee() and phy_ethtool_set_eee(), with the MAC
hardware register read/written for tx_lpi_timer. Drop struct
ethtool_keee eee from struct bcmgenet_priv.
Fixes: fe0d4fd928 ("net: phy: Keep track of EEE configuration")
Link: https://lore.kernel.org/netdev/d352039f-4cbb-41e6-9aeb-0b4f3941b54c@lunn.ch/
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Tested-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20260310054935.1238594-1-nb@tipi-net.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
genlmsg_reply() hands the reply skb to netlink, and
netlink_unicast() consumes it on all return paths, whether the
skb is queued successfully or freed on an error path.
net_shaper_nl_get_doit() and net_shaper_nl_cap_get_doit()
currently jump to free_msg after genlmsg_reply() fails and call
nlmsg_free(msg), which can hit the same skb twice.
Return the genlmsg_reply() error directly and keep free_msg
only for pre-reply failures.
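The ownership rule can be sketched with a small userspace mock. All names
below (struct msg, msg_release(), send_consuming(), reply_buggy(),
reply_fixed()) are hypothetical, and a counter stands in for the real free
so the double release can be observed safely:

```c
#include <assert.h>

/* Mock message: counts how often it has been released. */
struct msg {
	int releases;
};

static void msg_release(struct msg *m)
{
	m->releases++;
}

/* Mock of a consuming send: the message is released on every return path. */
static int send_consuming(struct msg *m, int err)
{
	msg_release(m);
	return err;
}

/* Buggy caller: releases again after the send has already consumed it. */
static int reply_buggy(struct msg *m, int err)
{
	int ret = send_consuming(m, err);

	if (ret)
		msg_release(m);	/* would be a double free with a real skb */
	return ret;
}

/* Fixed caller: returns the send result directly. */
static int reply_fixed(struct msg *m, int err)
{
	return send_consuming(m, err);
}
```

Calling reply_buggy() with a non-zero err leaves releases at 2, while
reply_fixed() leaves it at 1.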
Fixes: 4b623f9f0f ("net-shapers: implement NL get operation")
Fixes: 553ea9f1ef ("net: shaper: implement introspection support")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Link: https://patch.msgid.link/20260309173450.538026-2-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Normal RX/TX interrupts are enabled later, in arc_emac_open(), so probe
should not see interrupt delivery in the usual case. However, hardware may
still present stale or latched interrupt status left by firmware or the
bootloader.
If probe later unwinds after devm_request_irq() has installed the handler,
such a stale interrupt can still reach arc_emac_intr() during teardown and
race with release of the associated net_device.
Avoid that window by putting the device into a known quiescent state before
requesting the IRQ: disable all EMAC interrupt sources and clear any
pending EMAC interrupt status bits. This keeps the change hardware-focused
and minimal, while preventing spurious IRQ delivery from leftover state.
Fixes: e4f2379db6 ("ethernet/arc/arc_emac - Add new driver")
Cc: stable@vger.kernel.org
Signed-off-by: Fan Wu <fanwu01@zju.edu.cn>
Link: https://patch.msgid.link/20260309132409.584966-1-fanwu01@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While testing other changes in vng I noticed that
nl_netdev.page_pool_check flakes. This never happens in real CI.
Turns out vng may boot and get to that test in less than a second.
page_pool_detached() records the detach time in seconds, so if
vng is fast enough detach time is set to 0. Other code treats
0 as "not detached". detach_time is only used to report the state
to the user, so it's not a huge deal in practice but let's fix it.
Store the raw ktime_t (nanoseconds) instead. A nanosecond value
of 0 is practically impossible.
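A minimal sketch of why second granularity collides with the 0 sentinel,
assuming a timestamp in nanoseconds since boot. The helper names are made
up for illustration and are not the page_pool code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Old scheme: store whole seconds, 0 means "not detached". */
static uint64_t detach_time_secs(uint64_t now_ns)
{
	return now_ns / NSEC_PER_SEC;
}

/* New scheme: store the raw nanosecond value. */
static uint64_t detach_time_ns(uint64_t now_ns)
{
	return now_ns;
}

static bool looks_detached(uint64_t detach_time)
{
	return detach_time != 0;
}

/* Hypothetical self-check: a detach 400 ms after boot. */
static void detach_example(void)
{
	uint64_t now_ns = 400000000ULL;

	assert(!looks_detached(detach_time_secs(now_ns)));
	assert(looks_detached(detach_time_ns(now_ns)));
}
```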
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Fixes: 69cb4952b6 ("net: page_pool: report when page pool was destroyed")
Link: https://patch.msgid.link/20260310003907.3540019-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Quanyang observed that when using an NFS rootfs on an AMD ZynqMp board,
the rootfs may take an extended time to recover after a suspend.
Upon investigation, it was determined that the issue originates from a
problem in the macb driver.
According to the Zynq UltraScale TRM [1], when transmit is disabled,
the transmit buffer queue pointer resets to point to the address
specified by the transmit buffer queue base address register.
In the current implementation, the code merely resets `queue->tx_head`
and `queue->tx_tail` to '0'. This approach presents several issues:
- Packets already queued in the tx ring are silently lost,
leading to memory leaks since the associated skbs cannot be released.
- Concurrent write access to `queue->tx_head` and `queue->tx_tail` may
occur from `macb_tx_poll()` or `macb_start_xmit()` when these values
are reset to '0'.
- The transmission may become stuck on a packet that has already been sent
out, with its 'TX_USED' bit set, but has not yet been processed. However,
due to the manipulation of 'queue->tx_head' and 'queue->tx_tail',
`macb_tx_poll()` incorrectly assumes there are no packets to handle
because `queue->tx_head == queue->tx_tail`. This issue is only resolved
when a new packet is placed at this position. This is the root cause of
the prolonged recovery time observed for the NFS root filesystem.
To resolve this issue, shuffle the tx ring and tx skb array so that
the first unsent packet is positioned at the start of the tx ring.
Additionally, ensure that updates to `queue->tx_head` and
`queue->tx_tail` are properly protected with the appropriate lock.
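The shuffle can be pictured with a toy ring of integers. This is only a
sketch under simplifying assumptions: head and tail are free-running
counters, at most TX_RING_SIZE entries are pending, locking is omitted, and
rotate_pending() is a hypothetical helper, not the macb implementation:

```c
#include <assert.h>

#define TX_RING_SIZE 8

/*
 * Move the pending entries [tail, head) to the start of the ring and
 * clear the rest. Returns the new head; the new tail is 0.
 */
static unsigned int rotate_pending(int *ring, unsigned int head,
				   unsigned int tail)
{
	int tmp[TX_RING_SIZE];
	unsigned int i, pending = head - tail;

	/* Copy the pending entries out in order, following the wrap. */
	for (i = 0; i < pending; i++)
		tmp[i] = ring[(tail + i) % TX_RING_SIZE];

	/* Rebuild the ring with the first unsent entry at index 0. */
	for (i = 0; i < TX_RING_SIZE; i++)
		ring[i] = i < pending ? tmp[i] : 0;

	return pending;
}
```

For a ring holding pending entries at slots 6, 7, 0 and 1, the call returns
4 and the entries end up at slots 0 to 3 in their original order.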
[1] https://docs.amd.com/v/u/en-US/ug1085-zynq-ultrascale-trm
Fixes: bf9cf80cab ("net: macb: Fix tx/rx malfunction after phy link down and up")
Reported-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Kevin Hao <haokexin@gmail.com>
Cc: stable@vger.kernel.org
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260307-zynqmp-v2-1-6ef98a70e1d0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 60fbb14396 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") returns false unconditionally after
split_huge_pmd_locked(). This may make try_to_migrate() fail early when
TTU_SPLIT_HUGE_PMD is specified.
The reason is that the above commit adjusted try_to_migrate_one() to return
false unconditionally when a PMD-mapped THP entry is found and
TTU_SPLIT_HUGE_PMD is specified (for example, via unmap_folio()). This
breaks the rmap walk and makes try_to_migrate() fail early if this
PMD-mapped THP is mapped in multiple processes.
The user-visible impact of this bug could be:
* On memory pressure, shrink_folio_list() may split a partially mapped
folio with split_folio_to_list() and then free the unmapped pages without
IO. If the split fails, the folio may not be reclaimed.
* On memory failure, memory_failure() calls try_to_split_thp_page()
to split the folio that contains the bad page. If the split succeeds, the
PG_has_hwpoisoned bit is only set in the after-split folio that contains
@split_at, which limits the bad memory. If the split fails, the whole
folio is not usable.
One way to reproduce:
Create an anonymous THP range and fork 512 children, so we have a
THP shared mapped in 513 processes. Then trigger folio split with
/sys/kernel/debug/split_huge_pages debugfs to split the THP folio to
order 0.
Without the above commit, we can successfully split to order 0. With the
above commit, the folio is still a large folio.
And currently there are two core users of TTU_SPLIT_HUGE_PMD:
* try_to_unmap_one()
* try_to_migrate_one()
try_to_unmap_one() would restart the rmap walk, so only
try_to_migrate_one() is affected.
We can't simply revert commit 60fbb14396 ("mm/huge_memory: adjust
try_to_migrate_one() and split_huge_pmd_locked()"), since it removed some
duplicated check covered by page_vma_mapped_walk().
This patch fixes this by restarting page_vma_mapped_walk() after
split_huge_pmd_locked(). We cannot simply return "true" to fix the
problem, as that would affect another case:
When invoking folio_try_share_anon_rmap_pmd() from
split_huge_pmd_locked(), the latter can fail and leave a large folio
mapped through PTEs, in which case we ought to return true from
try_to_migrate_one(). This might result in unnecessary walking of the
rmap but is relatively harmless.
Link: https://lkml.kernel.org/r/20260305015006.27343-1-richard.weiyang@gmail.com
Fixes: 60fbb14396 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Gavin Guo <gavinguo@igalia.com>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We batch unmap anonymous lazyfree folios by folio_unmap_pte_batch. If the
batch has a mix of writable and non-writable bits, we may end up setting
the entire batch writable. Fix this by respecting writable bit during
batching.
Although on a successful unmap of a lazyfree folio, the soft-dirty bit is
lost, preserve it on pte restoration by respecting the bit during
batching, to make the fix consistent w.r.t both writable bit and
soft-dirty bit.
I was able to write the below reproducer and crash the kernel.
Explanation of reproducer (set 64K mTHP to always):
Fault in a 64K large folio. Split the VMA at mid-point with
MADV_DONTFORK. fork() - parent points to the folio with 8 writable ptes
and 8 non-writable ptes. Merge the VMAs with MADV_DOFORK so that
folio_unmap_pte_batch() can determine all the 16 ptes as a batch. Do
MADV_FREE on the range to mark the folio as lazyfree. Write to the memory
to dirty the pte, eventually rmap will dirty the folio. Then trigger
reclaim, we will hit the pte restoration path, and the kernel will crash
with the trace given below.
The BUG happens at:
BUG_ON(atomic_inc_return(&ptc->anon_map_count) > 1 && rw);
The code path is asking for an anonymous page to be mapped writable into the
pagetable. The BUG_ON() firing implies that such a writable page has been
mapped into the pagetables of more than one process, which breaks
anonymous memory/CoW semantics.
[ 21.134473] kernel BUG at mm/page_table_check.c:118!
[ 21.134497] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 21.135917] Modules linked in:
[ 21.136085] CPU: 1 UID: 0 PID: 1735 Comm: dup-lazyfree Not tainted 7.0.0-rc1-00116-g018018a17770 #1028 PREEMPT
[ 21.136858] Hardware name: linux,dummy-virt (DT)
[ 21.137019] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 21.137308] pc : page_table_check_set+0x28c/0x2a8
[ 21.137607] lr : page_table_check_set+0x134/0x2a8
[ 21.137885] sp : ffff80008a3b3340
[ 21.138124] x29: ffff80008a3b3340 x28: fffffdffc3d14400 x27: ffffd1a55e03d000
[ 21.138623] x26: 0040000000000040 x25: ffffd1a55f7dd000 x24: 0000000000000001
[ 21.139045] x23: 0000000000000001 x22: 0000000000000001 x21: ffffd1a55f217f30
[ 21.139629] x20: 0000000000134521 x19: 0000000000134519 x18: 005c43e000040000
[ 21.140027] x17: 0001400000000000 x16: 0001700000000000 x15: 000000000000ffff
[ 21.140578] x14: 000000000000000c x13: 005c006000000000 x12: 0000000000000020
[ 21.140828] x11: 0000000000000000 x10: 005c000000000000 x9 : ffffd1a55c079ee0
[ 21.141077] x8 : 0000000000000001 x7 : 005c03e000040000 x6 : 000000004000ffff
[ 21.141490] x5 : ffff00017fffce00 x4 : 0000000000000001 x3 : 0000000000000002
[ 21.141741] x2 : 0000000000134510 x1 : 0000000000000000 x0 : ffff0000c08228c0
[ 21.141991] Call trace:
[ 21.142093] page_table_check_set+0x28c/0x2a8 (P)
[ 21.142265] __page_table_check_ptes_set+0x144/0x1e8
[ 21.142441] __set_ptes_anysz.constprop.0+0x160/0x1a8
[ 21.142766] contpte_set_ptes+0xe8/0x140
[ 21.142907] try_to_unmap_one+0x10c4/0x10d0
[ 21.143177] rmap_walk_anon+0x100/0x250
[ 21.143315] try_to_unmap+0xa0/0xc8
[ 21.143441] shrink_folio_list+0x59c/0x18a8
[ 21.143759] shrink_lruvec+0x664/0xbf0
[ 21.144043] shrink_node+0x218/0x878
[ 21.144285] __node_reclaim.constprop.0+0x98/0x338
[ 21.144763] user_proactive_reclaim+0x2a4/0x340
[ 21.145056] reclaim_store+0x3c/0x60
[ 21.145216] dev_attr_store+0x20/0x40
[ 21.145585] sysfs_kf_write+0x84/0xa8
[ 21.145835] kernfs_fop_write_iter+0x130/0x1c8
[ 21.145994] vfs_write+0x2b8/0x368
[ 21.146119] ksys_write+0x70/0x110
[ 21.146240] __arm64_sys_write+0x24/0x38
[ 21.146380] invoke_syscall+0x50/0x120
[ 21.146513] el0_svc_common.constprop.0+0x48/0xf8
[ 21.146679] do_el0_svc+0x28/0x40
[ 21.146798] el0_svc+0x34/0x110
[ 21.146926] el0t_64_sync_handler+0xa0/0xe8
[ 21.147074] el0t_64_sync+0x198/0x1a0
[ 21.147225] Code: f9400441 b4fff241 17ffff94 d4210000 (d4210000)
[ 21.147440] ---[ end trace 0000000000000000 ]---
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/wait.h>
#include <sched.h>
#include <fcntl.h>
void write_to_reclaim() {
	const char *path = "/sys/devices/system/node/node0/reclaim";
	const char *value = "409600000000";
	int fd = open(path, O_WRONLY);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	if (write(fd, value, sizeof("409600000000") - 1) == -1) {
		perror("write");
		close(fd);
		exit(EXIT_FAILURE);
	}
	printf("Successfully wrote %s to %s\n", value, path);
	close(fd);
}
int main()
{
	char *ptr = mmap((void *)(1UL << 30), 1UL << 16, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if ((unsigned long)ptr != (1UL << 30)) {
		perror("mmap");
		return 1;
	}
	/* a 64K folio gets faulted in */
	memset(ptr, 0, 1UL << 16);
	/* 32K half will not be shared into child */
	if (madvise(ptr, 1UL << 15, MADV_DONTFORK)) {
		perror("madvise madv dontfork");
		return 1;
	}
	pid_t pid = fork();
	if (pid < 0) {
		perror("fork");
		return 1;
	} else if (pid == 0) {
		sleep(15);
	} else {
		/* merge VMAs. now first half of the 16 ptes are writable, the other half not. */
		if (madvise(ptr, 1UL << 15, MADV_DOFORK)) {
			perror("madvise madv fork");
			return 1;
		}
		if (madvise(ptr, (1UL << 16), MADV_FREE)) {
			perror("madvise madv free");
			return 1;
		}
		/* dirty the large folio */
		(*ptr) += 10;
		write_to_reclaim();
		// sleep(10);
		waitpid(pid, NULL, 0);
	}
}
Link: https://lkml.kernel.org/r/20260303061528.2429162-1-dev.jain@arm.com
Fixes: 354dffd295 ("mm: support batched unmap for lazyfree large folios during reclamation")
Signed-off-by: Dev Jain <dev.jain@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
move_pages_huge_pmd() handles UFFDIO_MOVE for both normal THPs and huge
zero pages. For the huge zero page path, src_folio is explicitly set to
NULL, and is used as a sentinel to skip folio operations like lock and
rmap.
In the huge zero page branch, src_folio is NULL, so folio_mk_pmd(NULL,
pgprot) passes NULL through folio_pfn() and page_to_pfn(). With
SPARSEMEM_VMEMMAP this silently produces a bogus PFN, installing a PMD
pointing to non-existent physical memory. On other memory models it is a
NULL dereference.
Use page_folio(src_page) to obtain the valid huge zero folio from the
page, which was obtained from pmd_page() and remains valid throughout.
After commit d82d09e482 ("mm/huge_memory: mark PMD mappings of the huge
zero folio special"), moved huge zero PMDs must remain special so
vm_normal_page_pmd() continues to treat them as special mappings.
move_pages_huge_pmd() currently reconstructs the destination PMD in the
huge zero page branch, which drops PMD state such as pmd_special() on
architectures with CONFIG_ARCH_HAS_PTE_SPECIAL. As a result,
vm_normal_page_pmd() can treat the moved huge zero PMD as a normal page
and corrupt its refcount.
Instead of reconstructing the PMD from the folio, derive the destination
entry from src_pmdval after pmdp_huge_clear_flush(), then handle the PMD
metadata the same way move_huge_pmd() does for moved entries by marking it
soft-dirty and clearing uffd-wp.
Link: https://lkml.kernel.org/r/a1e787dd-b911-474d-8570-f37685357d86@lucifer.local
Fixes: e3981db444 ("mm: add folio_mk_pmd()")
Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Use the correct function (or macro) names to avoid kernel-doc warnings:
Warning: include/linux/build_bug.h:38 function parameter 'cond' not
described in 'BUILD_BUG_ON_MSG'
Warning: include/linux/build_bug.h:38 function parameter 'msg' not
described in 'BUILD_BUG_ON_MSG'
Warning: include/linux/build_bug.h:76 function parameter 'expr' not
described in 'static_assert'
Link: https://lkml.kernel.org/r/20260302005144.3467019-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Only export the smb2_get_err_map_test, smb2_error_map_table_test and
smb2_error_map_num symbols to the 'smb2maperror-test' module.
Fixes: 7d0bf050a5 ("smb/client: make SMB2 maperror KUnit tests a separate module")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Signed-off-by: Steve French <stfrench@microsoft.com>
SMB2_write() places write payload in iov[1..n] as part of rq_iov.
smb3_init_transform_rq() pointer-shares rq_iov, so crypt_message()
encrypts iov[1] in-place, replacing the original plaintext with
ciphertext. On a replayable error, the retry sends the same iov[1]
which now contains ciphertext instead of the original data,
resulting in corruption.
The corruption is most likely to be observed when connections are
unstable, as reconnects trigger write retries that re-send the
already-encrypted data.
This affects SFU mknod, MF symlinks, etc. On kernels before
6.10 (prior to the netfs conversion), sync writes also used
this path and were similarly affected. The async write path
was not affected, as it uses rq_iter, which gets deep-copied.
Fix by moving the write payload into rq_iter via iov_iter_kvec(),
so smb3_init_transform_rq() deep-copies it before encryption.
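The retry problem can be reproduced with a toy model. Everything here is an
assumption made for illustration: toy_encrypt() is a stand-in transform
(add 1 to each byte), and send_in_place() and send_with_copy() are
hypothetical helpers, not the SMB client code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy transform standing in for encryption: add 1 to every byte. */
static void toy_encrypt(unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] += 1;
}

/* Shares the caller's buffer: the payload is overwritten with ciphertext. */
static void send_in_place(unsigned char *payload, size_t len,
			  unsigned char *wire)
{
	toy_encrypt(payload, len);
	memcpy(wire, payload, len);
}

/* Encrypts a private copy: the caller's payload stays plaintext. */
static void send_with_copy(const unsigned char *payload, size_t len,
			   unsigned char *wire)
{
	memcpy(wire, payload, len);
	toy_encrypt(wire, len);
}
```

Sending the same payload twice through send_in_place() puts different
bytes on the wire the second time, because the retry encrypts data that is
already encrypted. With send_with_copy() both attempts are identical.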
Cc: stable@vger.kernel.org #6.3+
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Acked-by: Shyam Prasad N <sprasad@microsoft.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The newly introduced variable is initialized in an #ifdef block
but used outside of it, leading to undefined behavior when
CONFIG_CIFS_ALLOW_INSECURE_LEGACY is disabled:
fs/smb/client/dir.c:417:9: error: variable 'sbflags' is uninitialized when used here [-Werror,-Wuninitialized]
417 | if (sbflags & CIFS_MOUNT_DYNPERM)
| ^~~~~~~
Move the initialization into the declaration, the same way the
other similar functions do it.
Fixes: 4fc3a433c1 ("smb: client: use atomic_t for mnt_cifs_flags")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When a user application requests O_DIRECT|O_SYNC along with O_CREAT on
open(2), the CREATE_NO_BUFFER and CREATE_WRITE_THROUGH bits were missing from
the CREATE request when performing an atomic open, potentially leading to
data integrity issues.
Fix this by setting those missing bits in CREATE request when
O_DIRECT|O_SYNC has been specified in cifs_do_create().
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Henrique Carvalho <henrique.carvalho@suse.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: linux-cifs@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
If an error is encountered while mapping TX buffers, the driver should
unmap any buffers already mapped for that skb.
Because count is incremented after a successful mapping, it will always
match the correct number of unmappings needed when dma_error is reached.
Decrementing count before the while loop in dma_error causes an
off-by-one error. If any mapping was successful before an unsuccessful
mapping, exactly one DMA mapping would leak.
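
The off-by-one can be reproduced with a small userspace model. This is only a sketch under simplifying assumptions: tx_map_leaked() and the mapped[] array are hypothetical stand-ins for the driver's *_tx_map() routine and its buffer_info ring, and ring wrap-around is ignored.

```c
#include <stdbool.h>

#define MAX_BUFS 8

/*
 * Map up to nbufs buffers; the mapping at index fail_at fails (pass a
 * negative value for no failure). Returns how many mappings are still
 * live after the error path has run, or 0 when nothing failed.
 */
static int tx_map_leaked(int nbufs, int fail_at, bool pre_decrement)
{
	bool mapped[MAX_BUFS] = { false };
	unsigned int count = 0;
	int i, leaked = 0;

	for (i = 0; i < nbufs && i < MAX_BUFS; i++) {
		if (i == fail_at)
			goto dma_error;
		mapped[i] = true;
		count++;	/* incremented only after a successful mapping */
	}
	return 0;

dma_error:
	if (pre_decrement && count)
		count--;	/* the extra decrement described above */
	while (count--) {
		i--;
		mapped[i] = false;	/* unmap, walking backwards */
	}
	for (i = 0; i < MAX_BUFS; i++)
		leaked += mapped[i];
	return leaked;
}
```

With two successful mappings followed by a failure, the plain loop unmaps both, while the pre-decremented variant leaves the first mapping live.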
In these commits, a faulty while condition caused an infinite loop in
dma_error:
Commit 03b1320dfc ("e1000e: remove use of skb_dma_map from e1000e
driver")
Commit 602c0554d7 ("e1000: remove use of skb_dma_map from e1000 driver")
Commit c1fa347f20 ("e1000/e1000e/igb/igbvf/ixgb/ixgbe: Fix tests of
unsigned in *_tx_map()") fixed the infinite loop, but introduced the
off-by-one error.
This issue may still exist in the igbvf driver, but I did not address it
in this patch.
Fixes: c1fa347f20 ("e1000/e1000e/igb/igbvf/ixgb/ixgbe: Fix tests of unsigned in *_tx_map()")
Assisted-by: Claude:claude-4.6-opus
Signed-off-by: Matt Vollrath <tactii@gmail.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Fix the following issues in the IPv4 and IPv6 cloud filter handling logic in
both the add and delete paths:
- The source-IP mask check incorrectly compares mask.src_ip[0] against
tcf.dst_ip[0]. Update it to compare against tcf.src_ip[0]. This likely
goes unnoticed because the check is in an "else if" path that only
executes when dst_ip is not set, most cloud filter use cases focus on
destination-IP matching, and the buggy condition can accidentally
evaluate true in some cases.
- memcpy() for the IPv4 source address incorrectly uses
ARRAY_SIZE(tcf.dst_ip) instead of ARRAY_SIZE(tcf.src_ip), although
both arrays are the same size.
- The IPv4 memcpy operations used ARRAY_SIZE(tcf.dst_ip) and
ARRAY_SIZE(tcf.src_ip). Update these to use sizeof(cfilter->ip.v4.dst_ip) and
sizeof(cfilter->ip.v4.src_ip) to ensure correct and explicit copy size.
- In the IPv6 delete path, memcmp() uses sizeof(src_ip6) when comparing
dst_ip6 fields. Replace this with sizeof(dst_ip6) to make the intent
explicit, even though both fields are struct in6_addr.
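
Why ARRAY_SIZE() is the wrong tool for a memcpy() length can be shown with a short sketch. The struct layouts below are assumptions made for illustration (a four-word source array and a single-word IPv4 destination); they are not copied from the driver headers.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a)		(sizeof(a) / sizeof((a)[0]))
#define FIELD(type, member)	(((type *)0)->member)

/* Hypothetical layouts, modelled on the description above. */
struct toy_tcf {
	uint32_t src_ip[4];	/* wide enough for an IPv6 address */
	uint32_t dst_ip[4];
};

struct toy_v4 {
	uint32_t dst_ip;	/* one IPv4 address */
	uint32_t src_ip;
};

/* Length the old code passed to memcpy(): an element count. */
static size_t len_by_array_size(void)
{
	return ARRAY_SIZE(FIELD(struct toy_tcf, dst_ip));
}

/* Length the fixed code passes: the destination's size in bytes. */
static size_t len_by_sizeof_dest(void)
{
	return sizeof(FIELD(struct toy_v4, dst_ip));
}

/* Size in bytes of the whole source array, for comparison. */
static size_t source_bytes(void)
{
	return sizeof(FIELD(struct toy_tcf, dst_ip));
}
```

Under these assumed layouts the element count and the destination size are both 4, so the old length was right only by coincidence; sizeof() on the destination states the intent directly and stays correct if either type changes.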
Fixes: e284fc2804 ("i40e: Add and delete cloud filter")
Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Pull misc fixes from Andrew Morton:
"15 hotfixes. 6 are cc:stable. 14 are for MM.
Singletons, with one doubleton - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-03-09-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS, mailmap: update email address for Lorenzo Stoakes
mm/mmu_notifier: clean up mmu_notifier.h kernel-doc
uaccess: correct kernel-doc parameter format
mm/huge_memory: fix a folio_split() race condition with folio_try_get()
MAINTAINERS: add co-maintainer and reviewer for SLAB ALLOCATOR
MAINTAINERS: add RELAY entry
memcg: fix slab accounting in refill_obj_stock() trylock path
mm/hugetlb.c: use __pa() instead of virt_to_phys() in early bootmem alloc code
zram: rename writeback_compressed device attr
tools/testing: fix testing/vma and testing/radix-tree build
Revert "ptdesc: remove references to folios from __pagetable_ctor() and pagetable_dtor()"
mm/cma: move put_page_testzero() out of VM_WARN_ON in cma_release()
mm/damon/core: clear walk_control on inactive context in damos_walk()
mm: memfd_luo: always dirty all folios
mm: memfd_luo: always make all folios uptodate
For commit b0dcdcb9ae ("resolve_btfids: Fix linker flags detection"),
I suggested setting HOSTPKG_CONFIG to $PKG_CONFIG when compiling
resolve_btfids, but I forgot the quotes around that variable.
As a result, when running vmtest.sh with static linking, it fails as
follows:
$ LDLIBS=-static PKG_CONFIG='pkg-config --static' ./vmtest.sh
[...]
make: unrecognized option '--static'
Usage: make [options] [target] ...
[...]
This worked when I tested it because HOSTPKG_CONFIG didn't have a
default value in the resolve_btfids Makefile, but once it does, the
quotes aren't preserved and it fails on the next make call.
Fixes: b0dcdcb9ae ("resolve_btfids: Fix linker flags detection")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/abADBwn_ykblpABE@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.
The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.
Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).
The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.
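
The transformation can be sketched in plain C. This is an illustration only, not the kernel implementation: the helper names are hypothetical, the mode values are the ones quoted above, and the class, size, and mask constants follow the usual BPF opcode layout.

```c
#include <stdint.h>

#define BPF_ST		0x02
#define BPF_STX		0x03
#define BPF_W		0x00
#define BPF_MEM		0x60
#define BPF_PROBE_MEM32	0xa0

/* Value placed in the instruction stream instead of the raw immediate. */
static uint32_t blind_imm(uint32_t imm, uint32_t rnd)
{
	return imm ^ rnd;
}

/* What the emitted mov + xor pair leaves in the scratch register. */
static uint32_t unblind_in_reg(uint32_t blinded, uint32_t rnd)
{
	uint32_t ax = blinded;	/* mov ax, imm ^ rnd */

	ax ^= rnd;		/* xor ax, rnd */
	return ax;
}

/* Rewrite an immediate store as a register store, keeping mode and size. */
static uint8_t st_to_stx(uint8_t code)
{
	return (uint8_t)((code & ~0x07) | BPF_STX);
}

/* Which store opcodes get blinded, with and without the new cases. */
static int needs_blinding(uint8_t code, int handle_probe_mem32)
{
	uint8_t cls = code & 0x07, mode = code & 0xe0;

	if (cls != BPF_ST)
		return 0;
	if (mode == BPF_MEM)
		return 1;
	return handle_probe_mem32 && mode == BPF_PROBE_MEM32;
}
```

The point of st_to_stx() is that only the class bits change, so a BPF_PROBE_MEM32 store stays a BPF_PROBE_MEM32 store after blinding.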
Fixes: 6082b6c328 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Sachin Kumar <xcyfun@protonmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
The runtime suspend callback drains the running job workqueue before
suspending the device. If a job is still executing and calls
pm_runtime_resume_and_get(), it can deadlock with the runtime suspend
path.
Fix this by moving pm_runtime_resume_and_get() from the job execution
routine to the job submission routine, ensuring the device is resumed
before the job is queued and avoiding the deadlock during runtime
suspend.
Fixes: 063db45183 ("accel/amdxdna: Enhance runtime power management")
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Lizhi Hou <lizhi.hou@amd.com>
Link: https://patch.msgid.link/20260310180058.336348-1-lizhi.hou@amd.com
Add a test case to ensure that BPF_END operations correctly break
register's scalar ID ties.
The test creates a scenario where r1 is a copy of r0, r0 undergoes a
byte swap, and then r0 is checked against a constant.
- Without the fix in the verifier, the bounds learned from r0 are
incorrectly propagated to r1, making the verifier believe r1 is
bounded and wrongly allowing subsequent pointer arithmetic.
- With the fix, r1 remains an unbounded scalar, and the verifier
correctly rejects the arithmetic operation between the frame pointer
and the unbounded register.
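
The reason the tie must be broken can be seen with ordinary integers. The sketch below is a userspace illustration, not the selftest itself; bswap64() is a hand-written swap and the function names are hypothetical.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value, as a 64-bit byte swap would. */
static uint64_t bswap64(uint64_t v)
{
	uint64_t out = 0;

	for (int i = 0; i < 8; i++) {
		out = (out << 8) | (v & 0xff);
		v >>= 8;
	}
	return out;
}

/* r0 after the swap: is it below the constant it is checked against? */
static int swapped_r0_below(uint64_t initial, uint64_t limit)
{
	return bswap64(initial) < limit;
}

/* r1 is the copy taken before the swap: does the same bound hold for it? */
static int copied_r1_below(uint64_t initial, uint64_t limit)
{
	return initial < limit;
}
```

For an initial value of 0x0100000000000000 the swapped r0 equals 1 and passes a "less than 2" check, while the untouched copy in r1 is far larger, so a bound learned from r0 says nothing about r1.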
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304083228.142016-3-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Since commit 95a8ddde36 ("irqchip/riscv-aplic: Preserve APLIC
states across suspend/resume"), when multiple NUMA nodes exist
and AIA is not configured as "none", aplic_probe() is called
multiple times. This leads to register_syscore(&aplic_syscore)
being invoked repeatedly, causing the following Oops:
list_add double add: new=ffffffffb91461f0, prev=ffffffffb91461f0, next=ffffffffb915c408.
[<ffffffffb7b5c8ca>] __list_add_valid_or_report+0x60/0xc0
[<ffffffffb7cc3236>] register_syscore+0x3e/0x70
[<ffffffffb7b8d61c>] aplic_probe+0xc6/0x112
Fix this by registering syscore operations only once, using a static
variable aplic_syscore_registered to track registration.
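
A minimal sketch of the guard, assuming a simplified model: the toy_ names are hypothetical, a counter stands in for the syscore list, and the real code may order the check and the registration differently.

```c
#include <stdbool.h>

static int toy_syscore_list_len;	/* stand-in for the syscore list */
static bool toy_syscore_registered;	/* mirrors aplic_syscore_registered */

/* Stand-in for register_syscore(); a real list rejects a second add. */
static void toy_register_syscore(void)
{
	toy_syscore_list_len++;
}

/* Probe may run once per APLIC instance; only the first call registers. */
static void toy_aplic_probe(void)
{
	if (!toy_syscore_registered) {
		toy_register_syscore();
		toy_syscore_registered = true;
	}
}

/* Run the probe n times and report how many list entries exist. */
static int probe_n_times(int n)
{
	for (int i = 0; i < n; i++)
		toy_aplic_probe();
	return toy_syscore_list_len;
}
```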
[ tglx: Trim backtrace properly ]
Fixes: 95a8ddde36 ("irqchip/riscv-aplic: Preserve APLIC states across suspend/resume")
Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260310141731145xMwLsyvXl9Gw-m6A4VRYj@zte.com.cn
aplic_probe() calls acpi_dev_clear_dependencies() unconditionally at the
end, even when the preceding setup (MSI or direct mode) has failed. This is
incorrect because if the device failed to probe, it should not be
considered as active and should not clear dependencies for other devices
waiting on it.
Fix this by returning immediately when the setup fails, skipping the ACPI
dependency cleanup. Also, explicitly return 0 on success instead of relying
on the value of 'rc' to make the success path clear.
Fixes: 5122e380c2 ("irqchip/riscv-aplic: Add ACPI support")
Signed-off-by: Jessica Liu <liu.xuemei1@zte.com.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260310141600411Fu8H8-GXOOgKISU48Tjgx@zte.com.cn
In landlock_restrict_sibling_threads(), when the calling thread is
interrupted while waiting for sibling threads to prepare, it executes
a recovery path.
Previously, this path included a wait_for_completion() call on
all_prepared to prevent a Use-After-Free of the local shared_ctx.
However, this wait is redundant. Exiting the main do-while loop
already leads to a bottom cleanup section that unconditionally waits
for all_finished. Therefore, replacing the wait with a simple break
is safe, prevents UAF, and correctly unblocks the remaining task_works.
Clean up the error path by breaking the loop and updating the
surrounding comments to accurately reflect the state machine.
Suggested-by: Günther Noack <gnoack3000@gmail.com>
Signed-off-by: Yihan Ding <dingyihan@uniontech.com>
Tested-by: Günther Noack <gnoack3000@gmail.com>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260306021651.744723-3-dingyihan@uniontech.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
syzbot found a deadlock in landlock_restrict_sibling_threads().
When multiple threads concurrently call landlock_restrict_self() with
sibling thread restriction enabled, they can deadlock by mutually
queueing task_works on each other and then blocking in kernel space
(waiting for the other to finish).
Fix this by serializing the TSYNC operations within the same process
using the exec_update_lock. This prevents concurrent invocations
from deadlocking.
We use down_write_trylock() and restart the syscall if the lock
cannot be acquired immediately. This ensures that if a thread fails
to get the lock, it will return to userspace, allowing it to process
any pending TSYNC task_works from the lock holder, and then
transparently restart the syscall.
Fixes: 42fc7e6543 ("landlock: Multithreading support for landlock_restrict_self()")
Reported-by: syzbot+7ea2f5e9dfd468201817@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7ea2f5e9dfd468201817
Suggested-by: Günther Noack <gnoack3000@gmail.com>
Suggested-by: Tingmao Wang <m@maowtm.org>
Tested-by: Justin Suess <utilityemal77@gmail.com>
Signed-off-by: Yihan Ding <dingyihan@uniontech.com>
Tested-by: Günther Noack <gnoack3000@gmail.com>
Reviewed-by: Günther Noack <gnoack3000@gmail.com>
Link: https://lore.kernel.org/r/20260306021651.744723-2-dingyihan@uniontech.com
Signed-off-by: Mickaël Salaün <mic@digikod.net>
XG mobile station 2022 has a different PID than the 2023 model: add
that model to hid-asus.
Signed-off-by: Denis Benato <denis.benato@linux.dev>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
In the current implementation, SVC client drivers such as socfpga-hwmon,
intel_fcs, stratix10-soc, stratix10-rsu each send an SMC command that
triggers a single thread in the stratix10-svc driver. Upon receiving a
callback, the initiating client driver sends a stratix10-svc-done signal,
terminating the thread without waiting for other pending SMC commands to
complete. This leads to a timeout issue in the firmware SVC mailbox service
when multiple client drivers send SMC commands concurrently.
To resolve this issue, a dedicated thread is now created per channel. The
stratix10-svc driver will support up to the number of channels defined by
SVC_NUM_CHANNEL. Thread synchronization is handled using a mutex to prevent
simultaneous issuance of SMC commands by multiple threads.
SVC_NUM_DATA_IN_FIFO is reduced from 32 to 8, since each channel now has
its own dedicated FIFO and the SDM processes commands one at a time.
8 entries per channel is sufficient while keeping the total aggregate
capacity the same (4 channels x 8 = 32 entries).
Additionally, a thread task is now validated before invoking kthread_stop
when the user aborts, ensuring safe termination.
Timeout values have also been adjusted to accommodate the increased load
from concurrent client driver activity.
Fixes: 7ca5ce8965 ("firmware: add Intel Stratix10 service layer driver")
Cc: stable@vger.kernel.org
Signed-off-by: Ang Tien Sung <tien.sung.ang@altera.com>
Signed-off-by: Fong, Yan Kei <yankei.fong@altera.com>
Signed-off-by: Muhammad Amirul Asyraf Mohamad Jamian <muhammad.amirul.asyraf.mohamad.jamian@altera.com>
Link: https://lore.kernel.org/all/20260305093151.2678-1-muhammad.amirul.asyraf.mohamad.jamian@altera.com
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Three driver callbacks schedule a reset and wait for its completion:
ndo_change_mtu(), ethtool set_ringparam(), and ethtool set_channels().
Waiting for reset in ndo_change_mtu() and set_ringparam() was added by
commit c2ed2403f1 ("iavf: Wait for reset in callbacks which trigger
it") to fix a race condition where adding an interface to bonding
immediately after MTU or ring parameter change failed because the
interface was still in __RESETTING state. The same commit also added
waiting in iavf_set_priv_flags(), which was later removed by commit
53844673d5 ("iavf: kill "legacy-rx" for good").
Waiting in set_channels() was introduced earlier by commit 4e5e6b5d9d
("iavf: Fix return of set the new channel count") to ensure the PF has
enough time to complete the VF reset when changing channel count, and to
return correct error codes to userspace.
Commit ef490bbb22 ("iavf: Add net_shaper_ops support") added
net_shaper_ops to iavf, which required reset_task to use _locked NAPI
variants (napi_enable_locked, napi_disable_locked) that need the netdev
instance lock.
Later, commit 7e4d784f58 ("net: hold netdev instance lock during
rtnetlink operations") and commit 2bcf4772e4 ("net: ethtool: try to
protect all callback with netdev instance lock") started holding the
netdev instance lock during ndo and ethtool callbacks for drivers with
net_shaper_ops.
Finally, commit 120f28a6f3 ("iavf: get rid of the crit lock")
replaced the driver's crit_lock with netdev_lock in reset_task, causing
incorrect behavior: the callback holds netdev_lock and waits for
reset_task, but reset_task needs the same lock:
Thread 1 (callback) Thread 2 (reset_task)
------------------- ---------------------
netdev_lock() [blocked on workqueue]
ndo_change_mtu() or ethtool op
iavf_schedule_reset()
iavf_wait_for_reset() iavf_reset_task()
waiting... netdev_lock() <- blocked
This does not strictly deadlock because iavf_wait_for_reset() uses
wait_event_interruptible_timeout() with a 5-second timeout. The wait
eventually times out, the callback returns an error to userspace, and
after the lock is released reset_task completes the reset. This leads to
incorrect behavior: userspace sees an error even though the configuration
change silently takes effect after the timeout.
Fix this by extracting the reset logic from iavf_reset_task() into a new
iavf_reset_step() function that expects netdev_lock to be already held.
The three callbacks now call iavf_reset_step() directly instead of
scheduling the work and waiting, performing the reset synchronously in
the caller's context which already holds netdev_lock. This eliminates
both the incorrect error reporting and the need for
iavf_wait_for_reset(), which is removed along with the now-unused
reset_waitqueue.
The workqueue-based iavf_reset_task() becomes a thin wrapper that
acquires netdev_lock and calls iavf_reset_step(), preserving its use
for PF-initiated resets.
The callbacks may block for several seconds while iavf_reset_step()
polls hardware registers, but this is acceptable since netdev_lock is a
per-device mutex and only serializes operations on the same interface.
v3:
- Remove netif_running() guard from iavf_set_channels(). Unlike
set_ringparam where descriptor counts are picked up by iavf_open()
directly, num_req_queues is only consumed during
iavf_reinit_interrupt_scheme() in the reset path. Skipping the reset
on a down device would silently discard the channel count change.
- Remove dead reset_waitqueue code (struct field, init, and all
wake_up calls) since iavf_wait_for_reset() was the only consumer.
Fixes: 120f28a6f3 ("iavf: get rid of the crit lock")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Commit 7c01dbfc8a ("iavf: periodically cache PHC time") introduced a
worker to cache PHC time, but failed to stop it during reset or disable.
This creates a race condition where `iavf_reset_task()` or
`iavf_disable_vf()` free adapter resources (AQ) while the worker is still
running. If the worker triggers `iavf_queue_ptp_cmd()` during teardown, it
accesses freed memory/locks, leading to a crash.
Fix this by calling `iavf_ptp_release()` before tearing down the adapter.
This ensures `ptp_clock_unregister()` synchronously cancels the worker and
cleans up the chardev before the backing resources are destroyed.
Fixes: 7c01dbfc8a ("iavf: periodically cache PHC time")
Signed-off-by: Petr Oros <poros@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Acked-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
If CONFIG_IRDMA isn't enabled but there are ice NICs in the system, the
driver prevents a full "devlink dev param show" dump because its rdma get
callbacks return ENODEV and stop the dump. For example:
$ devlink dev param show
pci/0000:82:00.0:
name msix_vec_per_pf_max type generic
values:
cmode driverinit value 2
name msix_vec_per_pf_min type generic
values:
cmode driverinit value 2
kernel answers: No such device
Returning EOPNOTSUPP allows the dump to continue so we can see all devices'
devlink parameters.
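
The difference between the two error codes can be modelled with a small dump loop. This is a sketch under the assumption that the dump skips parameters reporting -EOPNOTSUPP and aborts on any other error, which is what the behaviour above implies; toy_dump() and the getter names are hypothetical.

```c
#include <errno.h>
#include <stddef.h>

typedef int (*param_get_fn)(void);

static int get_ok(void)         { return 0; }
static int get_enodev(void)     { return -ENODEV; }
static int get_eopnotsupp(void) { return -EOPNOTSUPP; }

/*
 * Hypothetical dump loop: skip parameters that report -EOPNOTSUPP,
 * abort on any other error. *shown counts the parameters printed.
 */
static int toy_dump(const param_get_fn *params, size_t n, int *shown)
{
	*shown = 0;
	for (size_t i = 0; i < n; i++) {
		int err = params[i]();

		if (err == -EOPNOTSUPP)
			continue;
		if (err)
			return err;
		(*shown)++;
	}
	return 0;
}

/* Dump four parameters where the third one is the rdma getter. */
static int shown_with(param_get_fn rdma_get)
{
	param_get_fn params[] = { get_ok, get_ok, rdma_get, get_ok };
	int shown;

	toy_dump(params, 4, &shown);
	return shown;
}
```

With -ENODEV the loop stops after two parameters, matching the truncated output above; with -EOPNOTSUPP the unsupported parameter is skipped and the rest are still shown.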
Fixes: c24a65b6a2 ("iidc/ice/irdma: Update IDC to support multiple consumers")
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Add the __counted_by() compiler attribute to the flexible array member
'key' to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
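
For readers unfamiliar with the annotation, here is a userspace sketch. struct toy_key and the TOY_COUNTED_BY() macro are hypothetical; the macro expands to the compiler's counted_by attribute where available and to nothing otherwise, so the sketch builds on older toolchains as well.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Use the attribute when the compiler has it, otherwise expand to nothing. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define TOY_COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef TOY_COUNTED_BY
# define TOY_COUNTED_BY(member)
#endif

/* Hypothetical key container: len says how many bytes key[] holds. */
struct toy_key {
	size_t len;
	unsigned char key[] TOY_COUNTED_BY(len);
};

static struct toy_key *toy_key_alloc(const void *data, size_t len)
{
	struct toy_key *k = malloc(sizeof(*k) + len);

	if (!k)
		return NULL;
	k->len = len;	/* set the counter before touching key[] */
	memcpy(k->key, data, len);
	return k;
}

/* Returns 1 if the bytes stored in key[] match the input string. */
static int toy_key_roundtrip(const char *s)
{
	size_t len = strlen(s);
	struct toy_key *k = toy_key_alloc(s, len);
	int ok;

	if (!k)
		return 0;
	ok = k->len == len && memcmp(k->key, s, len) == 0;
	free(k);
	return ok;
}
```

The annotation does not change the layout; it only tells the compiler which member bounds the array so that instrumented builds can check accesses against it.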
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Keith Busch <kbusch@kernel.org>
In nvme_alloc_admin_tag_set(), an empty queue_limits struct is
currently allocated on the stack and passed by reference to
blk_mq_alloc_queue().
This is redundant because blk_mq_alloc_queue() already handles
a NULL limits pointer by internally substituting it with a default
empty queue_limits struct.
Remove the unnecessary local variable and pass a NULL value.
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
dev->online_queues is a count incremented in nvme_init_queue. Thus,
valid indices are 0 through dev->online_queues - 1.
This patch fixes the loop condition to ensure the index stays within the
valid range. Index 0 is excluded because it is the admin queue.
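
The effect of the loop condition can be checked with a short model. This is a sketch only: it assumes the faulty loop used an inclusive upper bound, which the text above implies but does not quote, and out_of_range_accesses() is a hypothetical helper.

```c
/*
 * Walk queue indices starting at 1 (index 0 is the admin queue) and
 * count accesses that fall outside the valid range [0, online_queues).
 */
static int out_of_range_accesses(unsigned int online_queues,
				 int inclusive_bound)
{
	int bad = 0;

	for (unsigned int i = 1;
	     inclusive_bound ? i <= online_queues : i < online_queues;
	     i++) {
		if (i >= online_queues)
			bad++;
	}
	return bad;
}
```

With an inclusive bound the last iteration touches index online_queues, one past the last initialized queue; with an exclusive bound every access stays in range.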
KASAN splat:
==================================================================
BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline]
BUG: KASAN: slab-out-of-bounds in nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404
Read of size 2 at addr ffff88800592a574 by task kworker/u8:5/74
CPU: 0 UID: 0 PID: 74 Comm: kworker/u8:5 Not tainted 6.19.0-dirty #10 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: nvme-reset-wq nvme_reset_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0xea/0x150 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0xce/0x5d0 mm/kasan/report.c:482
kasan_report+0xdc/0x110 mm/kasan/report.c:595
__asan_report_load2_noabort+0x18/0x20 mm/kasan/report_generic.c:379
nvme_dbbuf_free drivers/nvme/host/pci.c:377 [inline]
nvme_dbbuf_set+0x39c/0x400 drivers/nvme/host/pci.c:404
nvme_reset_work+0x36b/0x8c0 drivers/nvme/host/pci.c:3252
process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x65c/0xe60 kernel/workqueue.c:3421
kthread+0x41a/0x930 kernel/kthread.c:463
ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
</TASK>
Allocated by task 34 on cpu 1 at 4.241550s:
kasan_save_stack+0x2c/0x60 mm/kasan/common.c:57
kasan_save_track+0x1c/0x70 mm/kasan/common.c:78
kasan_save_alloc_info+0x3c/0x50 mm/kasan/generic.c:570
poison_kmalloc_redzone mm/kasan/common.c:398 [inline]
__kasan_kmalloc+0xb5/0xc0 mm/kasan/common.c:415
kasan_kmalloc include/linux/kasan.h:263 [inline]
__do_kmalloc_node mm/slub.c:5657 [inline]
__kmalloc_node_noprof+0x2bf/0x8d0 mm/slub.c:5663
kmalloc_array_node_noprof include/linux/slab.h:1075 [inline]
nvme_pci_alloc_dev drivers/nvme/host/pci.c:3479 [inline]
nvme_probe+0x2f1/0x1820 drivers/nvme/host/pci.c:3534
local_pci_probe+0xef/0x1c0 drivers/pci/pci-driver.c:324
pci_call_probe drivers/pci/pci-driver.c:392 [inline]
__pci_device_probe drivers/pci/pci-driver.c:417 [inline]
pci_device_probe+0x743/0x920 drivers/pci/pci-driver.c:451
call_driver_probe drivers/base/dd.c:583 [inline]
really_probe+0x29b/0xb70 drivers/base/dd.c:661
__driver_probe_device+0x3b0/0x4a0 drivers/base/dd.c:803
driver_probe_device+0x56/0x1f0 drivers/base/dd.c:833
__driver_attach_async_helper+0x155/0x340 drivers/base/dd.c:1159
async_run_entry_fn+0xa6/0x4b0 kernel/async.c:129
process_one_work+0x956/0x1aa0 kernel/workqueue.c:3257
process_scheduled_works kernel/workqueue.c:3340 [inline]
worker_thread+0x65c/0xe60 kernel/workqueue.c:3421
kthread+0x41a/0x930 kernel/kthread.c:463
ret_from_fork+0x6f8/0x8c0 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246
The buggy address belongs to the object at ffff88800592a000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 244 bytes to the right of
allocated 1152-byte region [ffff88800592a000, ffff88800592a480)
The buggy address belongs to the physical page:
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x5928
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0xfffffc0000040(head|node=0|zone=1|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 000fffffc0000040 ffff888001042000 0000000000000000 dead000000000001
head: 0000000000000000 0000000000080008 00000000f5000000 0000000000000000
head: 000fffffc0000003 ffffea0000164a01 00000000ffffffff 00000000ffffffff
head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88800592a400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff88800592a480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88800592a500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
ffff88800592a580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88800592a600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fixes: 0f0d2c876c ("nvme: free sq/cq dbbuf pointers when dbbuf set fails")
Acked-by: Chao Shi <cshi008@fiu.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Acked-by: Dave Tian <daveti@purdue.edu>
Signed-off-by: Sungwoo Kim <iam@sung-woo.kim>
Signed-off-by: Keith Busch <kbusch@kernel.org>
If the superblock doesn't list a log stripe unit, we set the incore log
roundoff value to 512. This leads to corrupt logs and unmountable
filesystems in generic/617 on a disk with 4k physical sectors...
XFS (sda1): Mounting V5 Filesystem ff3121ca-26e6-4b77-b742-aaff9a449e1c
XFS (sda1): Torn write (CRC failure) detected at log block 0x318e. Truncating head block from 0x3197.
XFS (sda1): failed to locate log tail
XFS (sda1): log mount/recovery failed: error -74
XFS (sda1): log mount failed
XFS (sda1): Mounting V5 Filesystem ff3121ca-26e6-4b77-b742-aaff9a449e1c
XFS (sda1): Ending clean mount
...on the current xfsprogs for-next which has a broken mkfs. xfs_info
shows this...
meta-data=/dev/sda1 isize=512 agcount=4, agsize=644992 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=1 metadir=1
data = bsize=4096 blocks=2579968, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=4096 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=268435456 extents
= zoned=0 start=0 reserved=0
...observe that the log section has sectsz=4096 sunit=0, which means
that the roundoff factor is 512, not 4096 as you'd expect. We should
fix mkfs not to generate broken filesystems, but anyone can fuzz the
ondisk superblock so we should be more cautious. I think the inadequate
logic predates commit a6a65fef5e, but that's clearly going to
require a different backport.
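
The arithmetic behind the mismatch is easy to check. The sketch below assumes that log writes are padded up to the roundoff value; the helper names and the 5000-byte example length are made up for illustration.

```c
#include <stdint.h>

/* Round len up to the next multiple of roundoff (roundoff must be non-zero). */
static uint32_t round_up_to(uint32_t len, uint32_t roundoff)
{
	return ((len + roundoff - 1) / roundoff) * roundoff;
}

/* Does a log write padded to roundoff end on a sector boundary? */
static int ends_on_sector_boundary(uint32_t len, uint32_t roundoff,
				   uint32_t sectsz)
{
	return round_up_to(len, roundoff) % sectsz == 0;
}
```

A 5000-byte write padded to a 512-byte roundoff ends at 5120 bytes, in the middle of a 4096-byte sector, whereas padding to 4096 ends at 8192 bytes, exactly on a sector boundary.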
Cc: stable@vger.kernel.org # v5.14
Fixes: a6a65fef5e ("xfs: log stripe roundoff is a property of the log")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
During online processing for a DASD device an IO operation is started to
determine the format of the device. CDL format contains specifically
sized blocks at the beginning of the disk.
For a PPRC secondary device no real IO operation is possible therefore
this IO request can not be started and this step is skipped for online
processing of secondary devices. This is generally fine since the
secondary is a copy of the primary device.
In case of an additional partition detection that is run after a swap
operation the format information is needed to properly drive partition
detection IO.
Currently the information is not passed leading to IO errors during
partition detection and a wrongly detected partition table which in turn
might lead to data corruption on the disk with the wrong partition table.
Fix by passing the format information from primary to secondary device.
Fixes: 413862caad ("s390/dasd: add copy pair swap capability")
Cc: stable@vger.kernel.org #6.1
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Acked-by: Eduard Shishkin <edward6@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://patch.msgid.link/20260310142330.4080106-3-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Quiesce and resume is a mechanism to suspend operations on DASD devices.
In the context of a controlled copy pair swap operation, the quiesce
operation is usually issued before the actual swap and a resume
afterwards.
During the swap operation, the underlying device is exchanged. Therefore,
the quiesce flag must be moved to the secondary device to ensure a
consistent quiesce state after the swap.
The secondary device itself cannot be suspended separately because there
is no separate block device representation for it.
Fixes: 413862caad ("s390/dasd: add copy pair swap capability")
Cc: stable@vger.kernel.org #6.1
Reviewed-by: Jan Hoeppner <hoeppner@linux.ibm.com>
Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
Link: https://patch.msgid.link/20260310142330.4080106-2-sth@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An incorrect device name was logged for PCA9452 because the dev_info()
ternary omitted PCA9452 and fell through to "pca9450bc". Introduce a
type_name and set it per device type so the probed message matches the
actual PMIC. While here, make the PCA9451A case explicit.
No functional changes.
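
A sketch of the two approaches, using hypothetical enum and function names; the exact shape of the original ternary is an assumption based on the description above.

```c
#include <string.h>

enum toy_pca_type {
	TOY_PCA9450A,
	TOY_PCA9450BC,
	TOY_PCA9451A,
	TOY_PCA9452,
};

/* Old shape: any type not named in the ternary falls through. */
static const char *name_by_ternary(enum toy_pca_type t)
{
	return t == TOY_PCA9450A ? "pca9450a" :
	       t == TOY_PCA9451A ? "pca9451a" : "pca9450bc";
}

/* New shape: one explicit name per device type. */
static const char *name_by_switch(enum toy_pca_type t)
{
	switch (t) {
	case TOY_PCA9450A:
		return "pca9450a";
	case TOY_PCA9450BC:
		return "pca9450bc";
	case TOY_PCA9451A:
		return "pca9451a";
	case TOY_PCA9452:
		return "pca9452";
	}
	return "unknown";
}
```

A per-type name has no catch-all branch, so a newly added type cannot silently inherit another device's name.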
Fixes: 017b76fb8e ("regulator: pca9450: Add PMIC pca9452 support")
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Link: https://patch.msgid.link/20260310-pca9450-irq-v1-2-36adf52c2c55@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Kernel warning on i.MX8MP-EVK when doing module test:
irq: type mismatch, failed to map hwirq-3 for gpio@30200000!
Per the PCA945[X] specification, the IRQ_B pin is pulled low when any unmasked
interrupt status bit changes, and it is released high once the application
processor reads the INT1 register.
So the interrupt should be configured as IRQF_TRIGGER_LOW, not
IRQF_TRIGGER_FALLING.
Fixes: 0935ff5f1f ("regulator: pca9450: add pca9450 pmic driver")
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Link: https://patch.msgid.link/20260310-pca9450-irq-v1-1-36adf52c2c55@nxp.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Marc Kleine-Budde says:
====================
pull-request: can 2026-03-10
this is a pull request of 2 patches for net/main.
Haibo Chen's patch fixes the maximum allowed bit rate error, which was
broken in v6.19.
Wenyuan Li contributes a patch for the hi311x driver that adds missing
error checking in the caller of the hi3110_power_enable() function,
hi3110_open().
linux-can-fixes-for-7.0-20260310
* tag 'linux-can-fixes-for-7.0-20260310' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: hi311x: hi3110_open(): add check for hi3110_power_enable() return value
can: dev: keep the max bitrate error at 5%
====================
Link: https://patch.msgid.link/20260310103547.2299403-1-mkl@pengutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 02c057ddef ("ACPI: video: Convert the driver to a platform one")
switched over the ACPI video bus driver from an ACPI driver to a platform
driver, but that change introduced an unwanted and unexpected side effect.
Namely, on some systems, the ACPI device object of the ACPI video bus
device is an ACPI companion of multiple platform devices and, after
adding video_device_ids[] as an acpi_match_table to the acpi_video_bus
platform driver, all of those devices started to match that driver and
its probe callback is invoked for all of them (it fails, but it leaves
a confusing message in the log). Moreover, the MODULE_DEVICE_TABLE()
of the ACPI video driver module matches all of the devices sharing the
ACPI companion with the ACPI video bus device.
To address this, make the core ACPI device enumeration code create an
auxiliary device for the ACPI video bus device object instead of a
platform device and switch over the ACPI video bus driver (once more)
to an auxiliary driver.
An auxiliary driver is generally a better match for the ACPI video bus than
a platform driver, among other things because the ACPI video bus device
does not require any resources to be allocated for it during
enumeration. It also allows the ACPI video bus driver to stop abusing
device matching based on ACPI device IDs and it allows a special case
to be dropped from acpi_create_platform_device() because that function
need not worry about the ACPI video bus device any more.
Fixes: 02c057ddef ("ACPI: video: Convert the driver to a platform one")
Reported-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Closes: https://lore.kernel.org/linux-acpi/007e3390-6b2b-457e-83c7-c794c5952018@amd.com/
Tested-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
[ rjw: Added AUXILIARY_BUS selection to CONFIG_ACPI to fix build issue ]
[ rjw: Fixed error path in acpi_create_video_bus_device() ]
Link: https://patch.msgid.link/5986516.DvuYhMxLoT@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
IDLETIMER revision 0 rules reuse existing timers by label and always call
mod_timer() on timer->timer.
If the label was created first by revision 1 with XT_IDLETIMER_ALARM,
the object uses alarm timer semantics and timer->timer is never initialized.
Reusing that object from revision 0 causes mod_timer() on an uninitialized
timer_list, triggering debugobjects warnings and possible panic when
panic_on_warn=1.
Fix this by rejecting revision 0 rule insertion when an existing timer with
the same label is of ALARM type.
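The check described above can be modelled in user space roughly as follows. This is only a sketch: the struct, the enum and rev0_checkentry_sketch() are hypothetical names, and -EINVAL is an assumed error code, not necessarily what the kernel patch returns.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a labelled timer; not the kernel's own structure. */
enum timer_kind_sketch { KIND_TIMER_LIST, KIND_ALARM };

struct idletimer_sketch {
	const char *label;
	enum timer_kind_sketch kind;
};

/*
 * Revision 0 check: reusing a label is acceptable only when the existing
 * object was set up as a plain timer. -EINVAL is an illustrative choice.
 */
static int rev0_checkentry_sketch(const struct idletimer_sketch *timers,
				  size_t count, const char *label)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(timers[i].label, label) != 0)
			continue;
		/* An ALARM object never initialized its timer_list. */
		return timers[i].kind == KIND_ALARM ? -EINVAL : 0;
	}
	return 0; /* no existing timer: a fresh one would be created */
}
```

Rejecting the rule at insertion time keeps the revision 0 path from ever touching an object whose timer_list was never set up.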
Fixes: 68983a354a ("netfilter: xtables: Add snapshot of hardidletimer target")
Co-developed-by: Yifan Wu <yifanwucs@gmail.com>
Signed-off-by: Yifan Wu <yifanwucs@gmail.com>
Co-developed-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Juefei Pu <tomapufckgml@gmail.com>
Signed-off-by: Yuan Tan <tanyuan98@outlook.com>
Signed-off-by: Xin Liu <dstsmallbird@foxmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
nfnl_cthelper_dump_table() has a 'goto restart' that jumps to a label
inside the for loop body. When the "last" helper saved in cb->args[1]
is deleted between dump rounds, every entry fails the (cur != last)
check, so cb->args[1] is never cleared. The for loop finishes with
cb->args[0] == nf_ct_helper_hsize, and the 'goto restart' jumps back
into the loop body bypassing the bounds check, causing an 8-byte
out-of-bounds read on nf_ct_helper_hash[nf_ct_helper_hsize].
The 'goto restart' block was meant to re-traverse the current bucket
when "last" is no longer found, but it was placed after the for loop
instead of inside it. Move the block into the for loop body so that
the restart only occurs while cb->args[0] is still within bounds.
BUG: KASAN: slab-out-of-bounds in nfnl_cthelper_dump_table+0x9f/0x1b0
Read of size 8 at addr ffff888104ca3000 by task poc_cthelper/131
Call Trace:
nfnl_cthelper_dump_table+0x9f/0x1b0
netlink_dump+0x333/0x880
netlink_recvmsg+0x3e2/0x4b0
sock_recvmsg+0xde/0xf0
__sys_recvfrom+0x150/0x200
__x64_sys_recvfrom+0x76/0x90
do_syscall_64+0xc3/0x6e0
Allocated by task 1:
__kvmalloc_node_noprof+0x21b/0x700
nf_ct_alloc_hashtable+0x65/0xd0
nf_conntrack_helper_init+0x21/0x60
nf_conntrack_init_start+0x18d/0x300
nf_conntrack_standalone_init+0x12/0xc0
Fixes: 12f7a50533 ("netfilter: add user-space connection tracking helper infrastructure")
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
nfqnl_recv_verdict() calls find_dequeue_entry() to remove the queue
entry from the queue data structures, taking ownership of the entry.
For PF_BRIDGE packets, it then calls nfqa_parse_bridge() to parse VLAN
attributes. If nfqa_parse_bridge() returns an error (e.g. NFQA_VLAN
present but NFQA_VLAN_TCI missing), the function returns immediately
without freeing the dequeued entry or its sk_buff.
This leaks the nf_queue_entry, its associated sk_buff, and all held
references (net_device refcounts, struct net refcount). Repeated
triggering exhausts kernel memory.
Fix this by dropping the entry via nfqnl_reinject() with NF_DROP verdict
on the error path, consistent with other error handling in this file.
Fixes: 8d45ff22f1 ("netfilter: bridge: nf queue verdict to use NFQA_VLAN and NFQA_L2HDR")
Reviewed-by: David Dull <monderasdor@gmail.com>
Signed-off-by: Hyunwoo Kim <imv4bel@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
When the last byte of options is a non-single-byte option kind, walkers
that advance with i += op[i + 1] ? : 1 can read op[i + 1] past the end
of the option area.
Add an explicit i == optlen - 1 check before dereferencing op[i + 1]
in xt_tcpudp and xt_dccp option walkers.
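A minimal user-space sketch of such a walker with the extra check is shown below. find_option_sketch() is a hypothetical helper, and treating kinds below 2 as single-byte options is an assumption made for the illustration; it is not the exact kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk an option area looking for one option kind. Kinds 0 and 1 are
 * assumed to be single-byte options; every other kind is followed by a
 * length byte, unless the kind sits in the very last byte.
 */
static bool find_option_sketch(const uint8_t *op, size_t optlen,
			       uint8_t wanted)
{
	size_t i = 0;

	while (i < optlen) {
		if (op[i] == wanted)
			return true;
		/* No length byte to read: single-byte kind or last byte. */
		if (op[i] < 2 || i == optlen - 1)
			i++;
		else
			i += op[i + 1] ? op[i + 1] : 1;
	}
	return false;
}
```

Without the i == optlen - 1 test, a buffer ending in a multi-byte option kind would make the walker read op[optlen].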
Fixes: 2e4e6a17af ("[NETFILTER] x_tables: Abstraction layer for {ip,ip6,arp}_tables")
Signed-off-by: David Dull <monderasdor@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
pipapo_drop() passes rulemap[i + 1].n to pipapo_unmap() as the
to_offset argument on every iteration, including the last one where
i == m->field_count - 1. This reads one element past the end of the
stack-allocated rulemap array (declared as rulemap[NFT_PIPAPO_MAX_FIELDS]
with NFT_PIPAPO_MAX_FIELDS == 16).
Although pipapo_unmap() returns early when is_last is true without
using the to_offset value, the argument is evaluated at the call site
before the function body executes, making this a genuine out-of-bounds
stack read confirmed by KASAN:
BUG: KASAN: stack-out-of-bounds in pipapo_drop+0x50c/0x57c [nf_tables]
Read of size 4 at addr ffff8000810e71a4
This frame has 1 object:
[32, 160) 'rulemap'
The buggy address is at offset 164 -- exactly 4 bytes past the end
of the rulemap array.
Pass 0 instead of rulemap[i + 1].n on the last iteration to avoid
the out-of-bounds read.
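The pattern can be sketched in plain C as below. The names unmap_sketch(), drop_sketch() and struct rule_map_sketch are hypothetical stand-ins, and the function bodies are reduced to what is needed to show the argument evaluation; this is not the nf_tables code.

```c
#include <stdbool.h>

#define MAX_FIELDS_SKETCH 16 /* mirrors NFT_PIPAPO_MAX_FIELDS == 16 */

struct rule_map_sketch {
	int to;
	int n;
};

/* Stand-in for pipapo_unmap(): to_offset is unused when is_last is set. */
static int unmap_sketch(int to_offset, bool is_last)
{
	return is_last ? 0 : to_offset;
}

static int drop_sketch(const struct rule_map_sketch *rulemap, int field_count)
{
	int sum = 0;

	for (int i = 0; i < field_count; i++) {
		bool last = (i == field_count - 1);

		/*
		 * Arguments are evaluated before the call, so rulemap[i + 1]
		 * must not appear unconditionally: with 16 fields it would
		 * index one element past the array on the last iteration.
		 */
		sum += unmap_sketch(last ? 0 : rulemap[i + 1].n, last);
	}
	return sum;
}
```

The callee ignoring the value does not help, because the load happens at the call site; only guarding the expression itself removes the read.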
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Jenny Guanni Qu <qguanni@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
During transaction processing we might have more than one catchall element:
1 live catchall element and 1 pending element that is coming as part of the
new batch.
If the map holding the catchall elements is also going away, it is
required to toggle all catchall elements and not just the first viable
candidate.
Otherwise, we get:
WARNING: ./include/net/netfilter/nf_tables.h:1281 at nft_data_release+0xb7/0xe0 [nf_tables], CPU#2: nft/1404
RIP: 0010:nft_data_release+0xb7/0xe0 [nf_tables]
[..]
__nft_set_elem_destroy+0x106/0x380 [nf_tables]
nf_tables_abort_release+0x348/0x8d0 [nf_tables]
nf_tables_abort+0xcf2/0x3ac0 [nf_tables]
nfnetlink_rcv_batch+0x9c9/0x20e0 [..]
Fixes: 628bd3e49c ("netfilter: nf_tables: drop map element references from preparation phase")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
These #defines have been removed from the kernel headers in favour of
the string based PMU format attributes. Usages were previously removed
from the recording side of cs-etm in Perf. Finish the removal by
removing usages from the decode side too.
It's a straight replacement of the old #defines with the new register
bit definitions. Except cs_etm__setup_timeless_decoding() which wasn't
looking at the saved metadata and was instead hard coding an access to
'attr.config'. This was vulnerable to the same issue of .config being
moved to .config2 etc that the original removal of ETM_OPT_* tried to
fix. So fix that too.
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Leo Yan <leo.yan@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mike Leach <mike.leach@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tunnel xmit functions (iptunnel_xmit, ip6tunnel_xmit) lack their own
recursion limit. When a bond device in broadcast mode has GRE tap
interfaces as slaves, and those GRE tunnels route back through the
bond, multicast/broadcast traffic triggers infinite recursion between
bond_xmit_broadcast() and ip_tunnel_xmit()/ip6_tnl_xmit(), causing
kernel stack overflow.
The existing XMIT_RECURSION_LIMIT (8) in the no-qdisc path is not
sufficient because tunnel recursion involves route lookups and full IP
output, consuming much more stack per level. Use a lower limit of 4
(IP_TUNNEL_RECURSION_LIMIT) to prevent overflow.
Add recursion detection using dev_xmit_recursion helpers directly in
iptunnel_xmit() and ip6tunnel_xmit() to cover all IPv4/IPv6 tunnel
paths including UDP encapsulated tunnels (VXLAN, Geneve, etc.).
Move dev_xmit_recursion helpers from net/core/dev.h to public header
include/linux/netdevice.h so they can be used by tunnel code.
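The guard can be pictured with the following user-space sketch. It assumes a single global counter in place of the kernel's per-CPU state, and tunnel_xmit_sketch() is a hypothetical function that always loops back into itself, as the bond/GRE setup does; the exact comparison used by the kernel helpers may differ.

```c
#include <stdbool.h>

#define TUNNEL_RECURSION_LIMIT_SKETCH 4 /* limit named in the text */

static int xmit_depth;     /* per-CPU in the kernel; a plain global here */
static int max_depth_seen; /* only for observing the sketch */

/* Hypothetical transmit path that re-enters itself on every call. */
static bool tunnel_xmit_sketch(void)
{
	bool ok;

	if (xmit_depth >= TUNNEL_RECURSION_LIMIT_SKETCH)
		return false; /* drop the packet instead of recursing */

	xmit_depth++;
	if (xmit_depth > max_depth_seen)
		max_depth_seen = xmit_depth;
	ok = tunnel_xmit_sketch();
	xmit_depth--;

	return ok;
}
```

Because the counter is decremented on the way out, a dropped packet leaves the depth at zero and later transmissions are unaffected.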
BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160
Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11
Workqueue: mld mld_ifc_work
Call Trace:
<TASK>
__build_flow_key.constprop.0 (net/ipv4/route.c:515)
ip_rt_update_pmtu (net/ipv4/route.c:1073)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84)
ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
ip_finish_output2 (net/ipv4/ip_output.c:237)
ip_output (net/ipv4/ip_output.c:438)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
ip_finish_output2 (net/ipv4/ip_output.c:237)
ip_output (net/ipv4/ip_output.c:438)
iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
dev_hard_start_xmit (net/core/dev.c:3887)
__dev_queue_xmit (net/core/dev.c:4841)
mld_sendpack
mld_ifc_work
process_one_work
worker_thread
</TASK>
Fixes: 745e20f1b6 ("net: add a recursion limit in xmit path")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260306160133.3852900-2-bestswngs@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
My easystack email will be unreachable soon, so update my email address to
the linux.dev one.
Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
While head_onwire_len() has a branch to handle ctrl_len == 0 case,
prepare_read_control() always sets up a kvec for the CRC meaning that
a non-empty control segment is effectively assumed. All frames that
clients deal with meet that assumption, so let's make it official and
treat the preamble with an empty control segment as malformed.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Similar checks are performed for all control frames, but an early check
for message frames was missing. process_message() is already set up to
terminate the loop in case the state changes while con->ops->dispatch()
handler is being executed.
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
If the message frame is (maliciously) corrupted in a way that the
length of the control segment ends up being less than the size of the
message header or a different frame is made to look like a message
frame, out-of-bounds reads may ensue in process_message_header().
Perform an explicit bounds check before decoding the message header.
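In outline, such a check looks like the sketch below. struct msg_header_sketch is a made-up layout and decode_header_sketch() a hypothetical helper; neither reflects the real messenger structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size header; not the real on-wire layout. */
struct msg_header_sketch {
	uint64_t seq;
	uint16_t type;
	uint16_t version;
};

/* Refuse to decode when the control segment is shorter than the header. */
static bool decode_header_sketch(const uint8_t *ctrl, size_t ctrl_len,
				 struct msg_header_sketch *out)
{
	if (ctrl_len < sizeof(*out))
		return false; /* malformed or mislabelled frame */

	memcpy(out, ctrl, sizeof(*out));
	return true;
}
```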
Cc: stable@vger.kernel.org
Reported-by: Raphael Zimmer <raphael.zimmer@tu-ilmenau.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Markuze <amarkuze@redhat.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Raju Rangoju says:
====================
amd-xgbe: RX adaptation and PHY handling fixes
This series fixes several issues in the amd-xgbe driver related to RX
adaptation and PHY handling in 10GBASE-KR mode, particularly when
auto-negotiation is disabled.
Patch 1 fixes link status handling during RX adaptation by correctly
reading the latched link status bit so transient link drops are
detected without losing the current state.
Patch 2 prevents CRC errors that can occur when performing RX
adaptation with auto-negotiation turned off. The driver now stops
TX/RX before re-triggering RX adaptation and only re-enables traffic
once adaptation completes and the link is confirmed up, ensuring
packets are not corrupted during the adaptation window.
Patch 3 restores the intended ordering of PHY reset relative to
phy_start(), making sure PHY settings are reset before the PHY is
started instead of afterwards.
====================
Link: https://patch.msgid.link/20260306111629.1515676-1-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When operating in 10GBASE-KR mode with auto-negotiation disabled and RX
adaptation enabled, CRC errors can occur during the RX adaptation
process. This happens because the driver continues transmitting and
receiving packets while adaptation is in progress.
Fix this by stopping TX/RX immediately when the link goes down and RX
adaptation needs to be re-triggered, and only re-enabling TX/RX after
adaptation completes and the link is confirmed up. Introduce a flag to
track whether TX/RX was disabled for adaptation so it can be restored
correctly.
This prevents packets from being transmitted or received during the RX
adaptation window and avoids CRC errors from corrupted frames.
The flag tracking the data path state is synchronized with hardware
state in xgbe_start() to prevent stale state after device restarts.
This ensures that after a restart cycle (where xgbe_stop disables
TX/RX and xgbe_start re-enables them), the flag correctly reflects
that the data path is active.
Fixes: 4f3b20bfbb ("amd-xgbe: add support for rx-adaptation")
Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com>
Link: https://patch.msgid.link/20260306111629.1515676-3-Raju.Rangoju@amd.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When resuming from s2ram, firmware may re-enable x2apic mode, which may have
been disabled by the kernel during boot either because it doesn't support IRQ
remapping or for other reasons. This causes the kernel to continue using the
xapic interface, while the hardware is in x2apic mode, which causes hangs.
This happens on defconfig + bare metal + s2ram.
Fix this in lapic_resume() by disabling x2apic if the kernel expects it to be
disabled, i.e. when x2apic_mode = 0.
The ACPI v6.6 spec, Section 16.3 [1] says firmware restores either the
pre-sleep configuration or initial boot configuration for each CPU, including
MSR state:
When executing from the power-on reset vector as a result of waking from an
S2 or S3 sleep state, the platform firmware performs only the hardware
initialization required to restore the system to either the state the
platform was in prior to the initial operating system boot, or to the
pre-sleep configuration state. In multiprocessor systems, non-boot
processors should be placed in the same state as prior to the initial
operating system boot.
(further ahead)
If this is an S2 or S3 wake, then the platform runtime firmware restores
minimum context of the system before jumping to the waking vector. This
includes:
CPU configuration. Platform runtime firmware restores the pre-sleep
configuration or initial boot configuration of each CPU (MSR, MTRR,
firmware update, SMBase, and so on). Interrupts must be disabled (for
IA-32 processors, disabled by CLI instruction).
(and other things)
So at least as per the spec, re-enablement of x2apic by the firmware is
allowed if "x2apic on" is a part of the initial boot configuration.
[1] https://uefi.org/specs/ACPI/6.6/16_Waking_and_Sleeping.html#initialization
[ bp: Massage. ]
Fixes: 6e1cb38a2a ("x64, x2apic/intr-remap: add x2apic support, including enabling interrupt-remapping")
Co-developed-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Sohil Mehta <sohil.mehta@intel.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260306-x2apic-fix-v2-1-bee99c12efa3@sony.com
mctp_flow_prepare_output() checks key->dev and may call
mctp_dev_set_key(), but it does not hold key->lock while doing so.
mctp_dev_set_key() and mctp_dev_release_key() are annotated with
__must_hold(&key->lock), so key->dev access is intended to be
serialized by key->lock. The mctp_sendmsg() transmit path reaches
mctp_flow_prepare_output() via mctp_local_output() -> mctp_dst_output()
without holding key->lock, so the check-and-set sequence is racy.
Example interleaving:
CPU0 CPU1
---- ----
mctp_flow_prepare_output(key, devA)
if (!key->dev) // sees NULL
mctp_flow_prepare_output(
key, devB)
if (!key->dev) // still NULL
mctp_dev_set_key(devB, key)
mctp_dev_hold(devB)
key->dev = devB
mctp_dev_set_key(devA, key)
mctp_dev_hold(devA)
key->dev = devA // overwrites devB
Now both devA and devB references were acquired, but only the final
key->dev value is tracked for release. One reference can be lost,
causing a resource leak as mctp_dev_release_key() would only decrease
the reference on one dev.
Fix by taking key->lock around the key->dev check and
mctp_dev_set_key() call.
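A user-space analogue of the locked check-and-set is sketched below. It assumes a pthread mutex in place of the kernel lock, and struct dev_sketch, struct key_sketch and flow_prepare_output_sketch() are hypothetical names; the refs counter stands in for mctp_dev_hold().

```c
#include <pthread.h>
#include <stddef.h>

struct dev_sketch {
	int refs;
};

struct key_sketch {
	pthread_mutex_t lock;
	struct dev_sketch *dev;
};

/* The NULL check and the assignment happen under one lock hold. */
static void flow_prepare_output_sketch(struct key_sketch *key,
				       struct dev_sketch *dev)
{
	pthread_mutex_lock(&key->lock);
	if (!key->dev) {
		dev->refs++; /* stands in for taking a device reference */
		key->dev = dev;
	}
	pthread_mutex_unlock(&key->lock);
}
```

With the lock held across both steps, a second caller observes the key->dev set by the first and takes no reference of its own, so exactly one reference is tracked for release.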
Fixes: 67737c4572 ("mctp: Pass flow data & flow release events to drivers")
Signed-off-by: Chengfeng Ye <dg573847474@gmail.com>
Link: https://patch.msgid.link/20260306031402.857224-1-dg573847474@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kernel BUG at net/core/skbuff.c:2306!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:pskb_expand_head+0xa08/0xfe0 net/core/skbuff.c:2306
RSP: 0018:ffffc90004aff760 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff88807e3c8780 RCX: ffffffff89593e0e
RDX: ffff88807b7c4900 RSI: ffffffff89594747 RDI: ffff88807b7c4900
RBP: 0000000000000820 R08: 0000000000000005 R09: 0000000000000000
R10: 00000000961a63e0 R11: 0000000000000000 R12: ffff88807e3c8780
R13: 00000000961a6560 R14: dffffc0000000000 R15: 00000000961a63e0
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe1a0ed8df0 CR3: 000000002d816000 CR4: 00000000003526f0
Call Trace:
<TASK>
ipgre_header+0xdd/0x540 net/ipv4/ip_gre.c:900
dev_hard_header include/linux/netdevice.h:3439 [inline]
packet_snd net/packet/af_packet.c:3028 [inline]
packet_sendmsg+0x3ae5/0x53c0 net/packet/af_packet.c:3108
sock_sendmsg_nosec net/socket.c:727 [inline]
__sock_sendmsg net/socket.c:742 [inline]
____sys_sendmsg+0xa54/0xc30 net/socket.c:2592
___sys_sendmsg+0x190/0x1e0 net/socket.c:2646
__sys_sendmsg+0x170/0x220 net/socket.c:2678
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe1a0e6c1a9
When a non-Ethernet device (e.g. GRE tunnel) is enslaved to a bond,
bond_setup_by_slave() directly copies the slave's header_ops to the
bond device:
bond_dev->header_ops = slave_dev->header_ops;
This causes a type confusion when dev_hard_header() is later called
on the bond device. Functions like ipgre_header() and ip6gre_header() all use
netdev_priv(dev) to access their device-specific private data. When
called with the bond device, netdev_priv() returns the bond's private
data (struct bonding) instead of the expected type (e.g. struct
ip_tunnel), leading to garbage values being read and kernel crashes.
Fix this by introducing bond_header_ops with wrapper functions that
delegate to the active slave's header_ops using the slave's own
device. This ensures netdev_priv() in the slave's header functions
always receives the correct device.
The fix is placed in the bonding driver rather than individual device
drivers, as the root cause is bond blindly inheriting header_ops from
the slave without considering that these callbacks expect a specific
netdev_priv() layout.
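The delegation idea can be sketched in user space as follows. All types and functions here (struct netdev_sketch, tunnel_header_sketch(), bond_header_sketch() and so on) are hypothetical simplifications, and the -1 error value is an arbitrary choice; the real header_ops callbacks take more arguments.

```c
#include <stddef.h>

struct netdev_sketch;

struct header_ops_sketch {
	int (*create)(struct netdev_sketch *dev);
};

struct netdev_sketch {
	const struct header_ops_sketch *header_ops;
	void *priv; /* stands in for netdev_priv(dev) */
};

struct tunnel_priv_sketch {
	int hlen;
};

struct bond_priv_sketch {
	struct netdev_sketch *active_slave;
};

/* Only valid when dev really is a tunnel device. */
static int tunnel_header_sketch(struct netdev_sketch *dev)
{
	const struct tunnel_priv_sketch *t = dev->priv;

	return t->hlen;
}

static const struct header_ops_sketch tunnel_header_ops_sketch = {
	.create = tunnel_header_sketch,
};

/* Wrapper: call the slave's callback with the slave's own device. */
static int bond_header_sketch(struct netdev_sketch *bond_dev)
{
	const struct bond_priv_sketch *bond = bond_dev->priv;
	struct netdev_sketch *slave = bond->active_slave;

	if (!slave || !slave->header_ops || !slave->header_ops->create)
		return -1;
	return slave->header_ops->create(slave);
}

static const struct header_ops_sketch bond_header_ops_sketch = {
	.create = bond_header_sketch,
};
```

Copying tunnel_header_ops_sketch onto the bond device instead would make tunnel_header_sketch() interpret a struct bond_priv_sketch as tunnel data, which is the type confusion described above.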
The type confusion can be observed by adding a printk in
ipgre_header() and running the following commands:
ip link add dummy0 type dummy
ip addr add 10.0.0.1/24 dev dummy0
ip link set dummy0 up
ip link add gre1 type gre local 10.0.0.1
ip link add bond1 type bond mode active-backup
ip link set gre1 master bond1
ip link set gre1 up
ip link set bond1 up
ip addr add fe80::1/64 dev bond1
Fixes: 1284cd3a2b ("bonding: two small fixes for IPoIB support")
Suggested-by: Jay Vosburgh <jv@jvosburgh.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Link: https://patch.msgid.link/20260306021508.222062-1-jiayuan.chen@linux.dev
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
In hi3110_open(), the return value of hi3110_power_enable() is not checked.
If power enable fails, the device may not function correctly, while the
driver still returns success.
Add a check for the return value and propagate the error accordingly.
Signed-off-by: Wenyuan Li <2063309626@qq.com>
Link: https://patch.msgid.link/tencent_B5E2E7528BB28AA8A2A56E16C49BD58B8B07@qq.com
Fixes: 57e83fb9b7 ("can: hi311x: Add Holt HI-311x CAN driver")
[mkl: adjust subject, commit message and jump label]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Commit b360a13d44 ("can: dev: print bitrate error with two decimal
digits") changed the calculation of the bit rate error from one-tenth of a
percent to one-hundredth of a percent, but forgot to adjust the scale of the
CAN_CALC_MAX_ERROR constant.
Keep the existing logic unchanged: an error is returned only when the
bitrate error exceeds 5%. Otherwise, a warning is logged.
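The unit mismatch is easy to see with a small arithmetic sketch. The helper names and MAX_ERROR_SKETCH below are hypothetical, and the formula is a plain relative-error computation chosen for illustration, not a copy of the kernel code.

```c
#include <stdint.h>

/* 5% expressed in hundredths of a percent. */
#define MAX_ERROR_SKETCH 500

/* Relative error in 1/100 %: |actual - wanted| * 10000 / wanted. */
static uint32_t bitrate_error_sketch(uint32_t wanted, uint32_t actual)
{
	uint64_t diff = wanted > actual ? wanted - actual : actual - wanted;

	return (uint32_t)(diff * 10000 / wanted);
}

/* Returns -1 only when the error exceeds 5%, 0 otherwise. */
static int check_bitrate_sketch(uint32_t wanted, uint32_t actual)
{
	return bitrate_error_sketch(wanted, actual) > MAX_ERROR_SKETCH ? -1 : 0;
}
```

If the threshold were still expressed in tenths of a percent while the error is computed in hundredths, the effective limit would be ten times stricter than the intended 5%.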
Fixes: b360a13d44 ("can: dev: print bitrate error with two decimal digits")
Signed-off-by: Haibo Chen <haibo.chen@nxp.com>
Link: https://patch.msgid.link/20260306-can-fix-v1-1-ac526cec6777@nxp.com
Cc: stable@kernel.org
[mkl: improve commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
obj_exts_alloc_size() prevents recursive allocation of slabobj_ext
array from the same cache, to avoid creating slabs that are never freed.
There is one mistake that returns the original size when memory
allocation profiling is disabled. The assumption was that
memcg-triggered slabobj_ext allocation is always served from
KMALLOC_CGROUP type. But this is wrong [1]: when the caller specifies
both __GFP_RECLAIMABLE and __GFP_ACCOUNT with SLUB_TINY enabled, the
allocation is served from normal kmalloc. This is because kmalloc_type()
prioritizes __GFP_RECLAIMABLE over __GFP_ACCOUNT, and SLUB_TINY aliases
KMALLOC_RECLAIM with KMALLOC_NORMAL.
As a result, the recursion guard is bypassed and the problematic slabs
can be created. Fix this by removing the mem_alloc_profiling_enabled()
check entirely. The remaining is_kmalloc_normal() check is still
sufficient to detect whether the cache is of KMALLOC_NORMAL type and
avoid bumping the size if it's not.
Without SLUB_TINY, no functional change intended.
With SLUB_TINY, allocations with __GFP_ACCOUNT|__GFP_RECLAIMABLE
now allocate a larger array if the sizes are equal.
Reported-by: Zw Tang <shicenci@gmail.com>
Fixes: 280ea9c315 ("mm/slab: avoid allocating slabobj_ext array from its own slab")
Closes: https://lore.kernel.org/linux-mm/CAPHJ_VKuMKSke8b11AZQw1PTSFN4n2C0gFxC6xGOG0ZLHgPmnA@mail.gmail.com [1]
Cc: stable@vger.kernel.org
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20260309072219.22653-1-harry.yoo@oracle.com
Tested-by: Zw Tang <shicenci@gmail.com>
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Hotplugging a CPU off and back on fails with pKVM, as we try to
probe for ICH_VTR_EL2.TDS. In a non-VHE setup, this is achieved
by using an EL2 stub helper. However, the stubs are out of reach
once pKVM has deprivileged the kernel. The CPU never boots.
Since pKVM doesn't allow late onlining of CPUs, we can detect
that protected mode is enforced early on, and return the current
state of the capability.
Fixes: 2a28810cbb ("KVM: arm64: GICv3: Detect and work around the lack of ICV_DIR_EL1 trapping")
Reported-by: Vincent Donnefort <vdonnefort@google.com>
Tested-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260310085433.3936742-1-maz@kernel.org
Cc: stable@vger.kernel.org
Some platforms may use PHY address 0 (I suppose the PHY may not treat
address 0 as a broadcast address or default response address).
It is possible for some boards to connect multiple PHYs to the same
ENETC MAC, for example:
- a PHY with a non-zero address connects to ENETC MAC through SGMII
interface (selected via DTS_A)
- a PHY with address 0 connects to ENETC MAC through RGMII interface
(selected via DTS_B)
For the case where the ENETC port MDIO is used to manage the PHY, when
switching from DTS_A to DTS_B via soft reboot, LaBCR[MDIO_PHYAD_PRTAD]
must be updated to 0 because the NETCMIX block is not reset during soft
reboot. However, the current driver explicitly skips configuring address
0, causing LaBCR[MDIO_PHYAD_PRTAD] to retain its old value.
Therefore, remove the special-case skip of PHY address 0 so that valid
configurations using address 0 are properly supported.
Fixes: 6633df05f3 ("net: enetc: set the external PHY address in IERB for port MDIO usage")
Fixes: 50bfd9c06f ("net: enetc: set external PHY address in IERB for i.MX94 ENETC")
Reviewed-by: Clark Wang <xiaoning.wang@nxp.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260305031211.904812-3-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The current netc_get_phy_addr() implementation falls back to PHY address
0 when the "mdio" node or the PHY child node is missing. On i.MX95, this
causes failures when a real PHY is actually assigned address 0 and is
managed through the EMDIO interface. Because bit 0 of phy_mask will be
set, imx95_enetc_mdio_phyaddr_config() returns an error, and the
netc_blk_ctrl driver probe subsequently fails. Fix this by returning
-ENODEV when neither an "mdio" node nor any PHY node is present. In that
case the ENETC port MDIO is not used to manage the PHY, so there is no
need to configure LaBCR[MDIO_PHYAD_PRTAD].
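The distinction between "address 0" and "no PHY on port MDIO" can be sketched as below. struct port_node_sketch and both functions are hypothetical and only model the device-tree lookup with booleans; prtad stands in for the LaBCR[MDIO_PHYAD_PRTAD] field.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what the device-tree lookup found. */
struct port_node_sketch {
	bool has_mdio_node;
	bool has_phy_node;
	int phy_reg; /* PHY address from the "reg" property */
};

/* Address 0 is a valid result; absence is reported as -ENODEV. */
static int get_phy_addr_sketch(const struct port_node_sketch *np)
{
	if (!np->has_mdio_node || !np->has_phy_node)
		return -ENODEV;
	return np->phy_reg;
}

/* Caller: skip the register update when port MDIO is not in use. */
static int configure_port_sketch(const struct port_node_sketch *np, int *prtad)
{
	int addr = get_phy_addr_sketch(np);

	if (addr == -ENODEV)
		return 0; /* nothing to configure */
	if (addr < 0)
		return addr;
	*prtad = addr;
	return 0;
}
```

Using a negative errno for the missing case leaves the whole non-negative range, including 0, available for real addresses.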
Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Closes: https://lore.kernel.org/all/7825188.GXAFRqVoOG@steina-w
Fixes: 6633df05f3 ("net: enetc: set the external PHY address in IERB for port MDIO usage")
Reviewed-by: Clark Wang <xiaoning.wang@nxp.com>
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20260305031211.904812-2-wei.fang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
GPIO controller drivers should typically implement the .get_direction()
callback, as GPIOLIB internals may try to use it to determine the state
of a pin. Since the introduction of the shared proxy, a warning splat is
printed when using a shared spmi gpio.
The implementation is not easy because the controller supports enabling
the input and output logic at the same time, so we align with the
behaviour of the .get() operation and return -EINVAL in other
situations.
Fixes: eadff30244 ("pinctrl: Qualcomm SPMI PMIC GPIO pin controller driver")
Fixes: d7b5f5cc5e ("pinctrl: qcom: spmi-gpio: Add support for GPIO LV/MV subtype")
Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
In xfs_qm_dqflush(), when a dquot flush fails due to corruption
(the out_abort error path), the original code removed the dquot log
item from the AIL before calling xfs_force_shutdown(). This ordering
introduces a subtle race condition that can lead to data loss after
a crash.
The AIL tracks the oldest dirty metadata in the journal. The position
of the tail item in the AIL determines the log tail LSN, which is the
oldest LSN that must be preserved for crash recovery. When an item is
removed from the AIL, the log tail can advance past the LSN of that item.
The race window is as follows: if the dquot item happens to be at
the tail of the log, removing it from the AIL allows the log tail
to advance. If a concurrent log write is sampling the tail LSN at
the same time and subsequently writes a complete checkpoint (i.e.,
one containing a commit record) to disk before the shutdown takes
effect, the journal will no longer protect the dquot's last
modification. On the next mount, log recovery will not replay the
dquot changes, even though they were never written back to disk,
resulting in silent data loss.
Fix this by calling xfs_force_shutdown() before xfs_trans_ail_delete()
in the out_abort path. Once the log is shut down, no new log writes
can complete with an updated tail LSN, making it safe to remove the
dquot item from the AIL.
Cc: stable@vger.kernel.org
Fixes: b707fffda6 ("xfs: abort consistently on dquot flush failure")
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
ip->i_itemp has already been set to NULL in xfs_inode_item_destroy(), so
there is no need to set it to NULL again in xfs_inode_free_callback().
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
For eDP, read the ALPM DPCD caps after DPCD initialization and just before
the PSR init.
v2: Move intel_alpm_init to intel_edp_init_dpcd (Jouni)
v3: Add Fixes with commit-id (Jouni)
v4: Separated the alpm dpcd read caps from alpm_init and moved to
intel_edp_init_dpcd.
v5: Read alpm_caps always for eDP irrespective of the eDP version (Jouni)
v6: replace drm_dp_dpcd_readb with drm_dp_dpcd_read_byte (Jouni)
Fixes: 15438b3259 ("drm/i915/alpm: Add compute config for lobf")
Signed-off-by: Arun R Murthy <arun.r.murthy@intel.com>
Reviewed-by: Animesh Manna <animesh.manna@intel.com>
Reviewed-by: Jouni Högander <jouni.hogander@intel.com>
Signed-off-by: Animesh Manna <animesh.manna@intel.com>
Link: https://patch.msgid.link/20260304072157.1123283-1-arun.r.murthy@intel.com
(cherry picked from commit 88442ba208dd5d3405de3f5000cf5b2c86876ae3)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
There are slice row per frame and pic height parameters in DSC that need
to be configured on every Selective Update in Early Transport mode. Use
helper provided by DSC code to configure these on Selective Update when in
Early Transport mode. Also fill crtc_state->psr2_su_area with full frame
area on full frame update for DSC calculation.
v2: move psr2_su_area under skip_sel_fetch_set_loop label
Bspec: 68927, 71709
Fixes: 467e4e061c ("drm/i915/psr: Enable psr2 early transport as possible")
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Link: https://patch.msgid.link/20260304113011.626542-5-jouni.hogander@intel.com
(cherry picked from commit 3140af2fab505a4cd47d516284529bf1585628be)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Currently we align the Selective Update area to cover the cursor fully,
if needed, only once. It may happen that the cursor falls inside the
Selective Update area after pipe alignment and is then covered only
partially. Fix this by repeating the alignment until no further alignment
is needed.
v2:
- do not unnecessarily loop if cursor was already fully covered
- rename aligned as su_area_changed
Fixes: 1bff93b8bc ("drm/i915/psr: Extend SU area to cover cursor fully if needed")
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Jouni Högander <jouni.hogander@intel.com>
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Link: https://patch.msgid.link/20260304113011.626542-2-jouni.hogander@intel.com
(cherry picked from commit 681e12440d8b110350a5709101169f319e10ccbb)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
When a scatterlist table of a GEM shmem object of size 4 GB or more is
populated with pages allocated from a folio, the unsigned int .length
attribute of a scatterlist may overflow if the total byte length of the
pages allocated to that single scatterlist reaches or crosses the 4 GB
limit. As a consequence, users of the object may unexpectedly hit a
premature end of the object's backing pages.
[278.780187] ------------[ cut here ]------------
[278.780377] WARNING: CPU: 1 PID: 2326 at drivers/gpu/drm/i915/i915_mm.c:55 remap_sg+0x199/0x1d0 [i915]
...
[278.780654] CPU: 1 UID: 0 PID: 2326 Comm: gem_mmap_offset Tainted: G S U 6.17.0-rc1-CI_DRM_16981-ged823aaa0607+ #1 PREEMPT(voluntary)
[278.780656] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
[278.780658] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024
[278.780659] RIP: 0010:remap_sg+0x199/0x1d0 [i915]
...
[278.780786] Call Trace:
[278.780787] <TASK>
[278.780788] ? __apply_to_page_range+0x3e6/0x910
[278.780795] ? __pfx_remap_sg+0x10/0x10 [i915]
[278.780906] apply_to_page_range+0x14/0x30
[278.780908] remap_io_sg+0x14d/0x260 [i915]
[278.781013] vm_fault_cpu+0xd2/0x330 [i915]
[278.781137] __do_fault+0x3a/0x1b0
[278.781140] do_fault+0x322/0x640
[278.781143] __handle_mm_fault+0x938/0xfd0
[278.781150] handle_mm_fault+0x12c/0x300
[278.781152] ? lock_mm_and_find_vma+0x4b/0x760
[278.781155] do_user_addr_fault+0x2d6/0x8e0
[278.781160] exc_page_fault+0x96/0x2c0
[278.781165] asm_exc_page_fault+0x27/0x30
...
The issue was anticipated by the author of the change that introduced it,
and the potential risk was even annotated with a comment, but it was never
addressed.
When adding folio pages to a scatterlist table, take care of byte length
of any single scatterlist not exceeding max_segment.
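The capping rule can be illustrated outside the kernel. The following is a
minimal userspace sketch, not the i915 code: count_segments(),
truncated_length() and the SKETCH_* constants are hypothetical names, and the
cap is assumed to be UINT_MAX rounded down to a 4 KiB page.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical page size and segment cap, chosen for illustration only. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_MAX_SEGMENT (UINT_MAX & ~(SKETCH_PAGE_SIZE - 1u))

/*
 * Count how many segments are needed to describe `total` contiguous bytes
 * when no single segment may exceed `max_segment` bytes, so that every
 * segment length fits in an unsigned int.
 */
unsigned int count_segments(uint64_t total, unsigned int max_segment)
{
	unsigned int nsegs = 0;

	if (max_segment == 0)
		return 0;

	while (total > 0) {
		uint64_t len = total < max_segment ? total : max_segment;

		total -= len;
		nsegs++;
	}
	return nsegs;
}

/* Without a cap, a 64-bit byte count is silently truncated. */
unsigned int truncated_length(uint64_t total)
{
	return (unsigned int)total;
}
```

With a 32-bit unsigned int, a 4 GiB run stored in one segment truncates to a
length of zero, while the capped variant splits it into two segments.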
Fixes: 0b62af28f2 ("i915: convert shmem_sg_free_table() to use a folio_batch")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14809
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v6.5+
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://lore.kernel.org/r/20260224094944.2447913-2-janusz.krzysztofik@linux.intel.com
(cherry picked from commit 06249b4e691a75694c014a61708c007fb5755f60)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Apparently ICL may hang with an MCE if we write TRANS_VRR_VMAX/FLIPLINE
before enabling TRANS_DDI_FUNC_CTL.
Personally I was only able to reproduce a hang (on a Dell XPS 7390
2-in-1) with an external display connected via a dock using a dodgy
type-C cable that made the link training fail. After the failed
link training the machine would hang. TGL seemed immune to the
problem for whatever reason.
BSpec does tell us to configure VRR after enabling TRANS_DDI_FUNC_CTL
as well. The DMC firmware also does the VRR restore in two stages:
- first stage seems to be unconditional and includes TRANS_VRR_CTL
and a few other VRR registers, among other things
- second stage is conditional on the DDI being enabled,
and includes TRANS_DDI_FUNC_CTL and TRANS_VRR_VMAX/VMIN/FLIPLINE,
among other things
So let's reorder the steps to match to avoid the hang, and
toss in an extra WARN to make sure we don't screw this up later.
BSpec: 22243
Cc: stable@vger.kernel.org
Cc: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
Reported-by: Benjamin Tissoires <bentiss@kernel.org>
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/15777
Tested-by: Benjamin Tissoires <bentiss@kernel.org>
Fixes: dda7dcd9da ("drm/i915/vrr: Use fixed timings for platforms that support VRR")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patch.msgid.link/20260303095414.4331-1-ville.syrjala@linux.intel.com
Reviewed-by: Ankit Nautiyal <ankit.k.nautiyal@intel.com>
(cherry picked from commit 93f3a267c3dd4d811b224bb9e179a10d81456a74)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Oleksij Rempel says:
====================
net: usb: lan78xx: accumulated bug fixes
This series contains a collection of standalone bug fixes for the
Microchip LAN78xx driver, addressing packet handling, TX statistics,
invalid register accesses, and a kernel warning during disconnect.
====================
Link: https://patch.msgid.link/20260305143429.530909-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not configure Latency Tolerance Messaging (LTM) on USB 2.0 hardware.
The LAN7850 is a High-Speed (USB 2.0) only device and does not support
SuperSpeed features like LTM. Currently, the driver unconditionally
attempts to configure LTM registers during initialization. On the
LAN7850, these registers do not exist, resulting in writes to invalid
or undocumented memory space.
This issue was identified during a port to the regmap API with strict
register validation enabled. While no functional issues or crashes have
been observed from these invalid writes, bypassing LTM initialization
on the LAN7850 ensures the driver strictly adheres to the hardware's
valid register map.
Fixes: 55d7de9de6 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20260305143429.530909-4-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Account for hardware auto-padding in TX byte counters to reflect actual
wire traffic.
The LAN7850 hardware automatically pads undersized frames to the minimum
Ethernet frame length (ETH_ZLEN, 60 bytes). However, the driver tracks
the network statistics based on the unpadded socket buffer length. This
results in the tx_bytes counter under-reporting the actual physical
bytes placed on the Ethernet wire for small packets (like short ARP or
ICMP requests).
Use max_t() to ensure the transmission statistics accurately account for
the hardware-generated padding.
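The accounting rule amounts to taking the larger of the buffer length and the
minimum frame length. A minimal sketch, assuming a hypothetical helper
tx_wire_bytes() and a local SKETCH_ETH_ZLEN constant mirroring the kernel's
ETH_ZLEN; the driver itself uses max_t():

```c
/* Minimum Ethernet frame length without FCS, mirroring ETH_ZLEN. */
#define SKETCH_ETH_ZLEN 60u

/* Bytes that reach the wire once the hardware has padded a short frame. */
unsigned int tx_wire_bytes(unsigned int skb_len)
{
	return skb_len > SKETCH_ETH_ZLEN ? skb_len : SKETCH_ETH_ZLEN;
}
```

A 42-byte ARP request is therefore counted as 60 bytes, while frames at or
above the minimum are counted unchanged.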
Fixes: d383216a7e ("lan78xx: Introduce Tx URB processing improvements")
Cc: stable@vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20260305143429.530909-3-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Do not drop packets with checksum errors at the USB driver level;
pass them to the network stack.
Previously, the driver dropped all packets where the 'Receive Error
Detected' (RED) bit was set, regardless of the specific error type. This
caused packets with only IP or TCP/UDP checksum errors to be dropped
before reaching the kernel, preventing the network stack from accounting
for them or performing software fallback.
Add a mask for hard hardware errors to safely drop genuinely corrupt
frames, while allowing checksum-errored frames to pass with their
ip_summed field explicitly set to CHECKSUM_NONE.
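The split between hard errors and checksum errors can be sketched as follows.
All bit names and positions below are hypothetical and do not reflect the
LAN78xx RX command word layout; only the decision logic is illustrated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bits, invented for this sketch. */
#define SK_RX_RED      (1u << 0) /* summary: some receive error detected */
#define SK_RX_FCS_ERR  (1u << 1) /* hard error: bad frame check sequence */
#define SK_RX_LEN_ERR  (1u << 2) /* hard error: bad frame length */
#define SK_RX_CSUM_ERR (1u << 3) /* soft error: IP or TCP/UDP checksum */

#define SK_RX_HARD_ERRORS (SK_RX_FCS_ERR | SK_RX_LEN_ERR)

enum sk_csum { SK_CHECKSUM_NONE, SK_CHECKSUM_HW_OK };

/* Drop only frames with a hard error; the summary bit alone is not enough. */
bool sk_rx_should_drop(uint32_t status)
{
	return (status & SK_RX_HARD_ERRORS) != 0;
}

/* A checksum-errored frame is passed up for software verification. */
enum sk_csum sk_rx_csum_state(uint32_t status)
{
	return (status & SK_RX_CSUM_ERR) ? SK_CHECKSUM_NONE : SK_CHECKSUM_HW_OK;
}
```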
Fixes: 55d7de9de6 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
Cc: stable@vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20260305143429.530909-2-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When removing a nexthop from a group, remove_nh_grp_entry() publishes
the new group via rcu_assign_pointer() then immediately frees the
removed entry's percpu stats with free_percpu(). However, the
synchronize_net() grace period in the caller remove_nexthop_from_groups()
runs after the free. RCU readers that entered before the publish still
see the old group and can dereference the freed stats via
nh_grp_entry_stats_inc() -> get_cpu_ptr(nhge->stats), causing a
use-after-free on percpu memory.
Fix by deferring the free_percpu() until after synchronize_net() in the
caller. Removed entries are chained via nh_list onto a local deferred
free list. After the grace period completes and all RCU readers have
finished, the percpu stats are safely freed.
Fixes: f4676ea74b ("net: nexthop: Add nexthop group entry stats")
Cc: stable@vger.kernel.org
Signed-off-by: Mehul Rao <mehulrao@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260306233821.196789-1-mehulrao@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With the current sfp_fixup_ignore_tx_fault() fixup we ignore the TX_FAULT
signal, but we also need to apply sfp_fixup_ignore_los() in order to be
able to communicate with the module even if the fiber isn't connected for
configuration purposes.
This is needed for all the MA5671a firmwares, excluding the FS modded
firmware.
Fixes: 2069624dac ("net: sfp: Add tx-fault workaround for Huawei MA5671A SFP ONT")
Signed-off-by: Álvaro Fernández Rojas <noltari@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260306125139.213637-1-noltari@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
graph_util_is_ports0() identifies DPCM front-end (ports@0) vs back-end
(ports@1) by calling of_get_child_by_name() to find the first "ports"
child and comparing pointers. This relies on child iteration order
matching DTS source order.
When the DPCM topology comes from a DT overlay, __of_attach_node()
inserts new children at the head of the sibling list, reversing the
order. of_get_child_by_name() then returns ports@1 instead of ports@0,
causing all front-end links to be classified as back-ends. The card
registers with no PCM devices.
Fix this by matching the unit address directly from the node name
instead of relying on sibling order.
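Matching on the name rather than on list position might look like the
userspace sketch below. sk_is_ports0() is a hypothetical helper operating on a
plain string, not the kernel's device-tree API, and treating a bare "ports"
node without a unit address as ports 0 is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Return true if `full_name` is "ports@0" (or, by assumption, bare "ports"). */
bool sk_is_ports0(const char *full_name)
{
	const char *at = strchr(full_name, '@');
	char *end;
	unsigned long addr;

	if (strncmp(full_name, "ports", 5) != 0)
		return false;
	if (!at)
		return full_name[5] == '\0'; /* assumption: no unit address means 0 */
	if (at != full_name + 5 || at[1] == '\0')
		return false;

	/* Unit addresses are hexadecimal in device-tree node names. */
	addr = strtoul(at + 1, &end, 16);
	return *end == '\0' && addr == 0;
}
```

Because the result depends only on the node's own name, it is the same whether
siblings were attached in source order or in reverse by an overlay.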
Fixes: 9293925245 ("ASoC: simple-card-utils: add asoc_graph_is_ports0()")
Signed-off-by: Sen Wang <sen@ti.com>
Acked-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Link: https://patch.msgid.link/20260309042109.2576612-1-sen@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Merge series from "Sheetal ." <sheetal@nvidia.com>:
Add Tegra238 sound card support in the Tegra audio graph card driver,
as Tegra238 requires different PLLA and PLLA_OUT0 clock rates compared
to other Tegra platforms.
AppArmor was putting the reference to i_private data on its end after
removing the original entry from the file system. However the inode
can and does live beyond that point and it is possible that some of
the fs call back functions will be invoked after the reference has
been put, which results in a race between freeing the data and
accessing it through the fs.
While the rawdata/loaddata is the most likely candidate to fail the
race, as it has the fewest references, if properly crafted it might be
possible to trigger a race for the other types stored in i_private.
Fix this by moving the put of i_private referenced data to the correct
place which is during inode eviction.
Fixes: c961ee5f21 ("apparmor: convert from securityfs to apparmorfs for policy ns files")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Maxime Bélair <maxime.belair@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
There is a race condition that leads to a use-after-free situation:
because the rawdata inodes are not refcounted, an attacker can start
open()ing one of the rawdata files, and at the same time remove the
last reference to this rawdata (by removing the corresponding profile,
for example), which frees its struct aa_loaddata; as a result, when
seq_rawdata_open() is reached, i_private is a dangling pointer and
freed memory is accessed.
The rawdata inodes weren't refcounted to avoid a circular refcount and
were supposed to be held by the profile rawdata reference. However
during profile removal there is a window where the vfs and profile
destruction race, resulting in the use after free.
Fix this by moving to a double refcount scheme, where the profile
refcount on rawdata is used to break the circular dependency. This
allows the rawdata to be freed once all inode references to the rawdata
are put.
Fixes: 5d5182cae4 ("apparmor: move to per loaddata files, instead of replicating in profiles")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Maxime Bélair <maxime.belair@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Differential encoding allows loops to be created if it is abused. To
prevent this the unpack should verify that a diff-encode chain
terminates.
Unfortunately the differential encode verification had two bugs.
1. it conflated states that had gone through check and already been
marked, with states that were currently being checked and marked.
This means that loops in the current chain being verified are treated
as a chain that has already been verified.
2. the order bailout on already checked states compared current chain
check iterators j,k instead of using the outer loop iterator i.
Meaning a step backwards in states in the current chain verification
was being mistaken for moving to an already verified state.
Move to a double mark scheme where already verified states get a
different mark, than the current chain being kept. This enables us
to also drop the backwards verification check that was the cause of
the second error as any already verified state is already marked.
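The double mark idea can be sketched on a simplified model. Here each state
either ends its chain (SK_END) or names the state it is encoded against; the
array layout, names and the SK_MAX_STATES limit are assumptions of this sketch
and not the AppArmor DFA table format.

```c
#include <stdbool.h>
#include <stddef.h>

enum sk_mark { SK_UNMARKED = 0, SK_IN_CHAIN, SK_VERIFIED };

#define SK_MAX_STATES 64
#define SK_END (-1)

/* Return true if every diff-encode chain in base[0..n) terminates. */
bool sk_chains_terminate(const int *base, size_t n)
{
	enum sk_mark mark[SK_MAX_STATES] = { SK_UNMARKED };
	size_t i;

	if (n > SK_MAX_STATES)
		return false;

	for (i = 0; i < n; i++) {
		int j;

		/* Walk the chain from i, marking states as part of this walk. */
		for (j = (int)i; j != SK_END; j = base[j]) {
			if (j < 0 || (size_t)j >= n)
				return false; /* reference outside the table */
			if (mark[j] == SK_IN_CHAIN)
				return false; /* revisited in this walk: a loop */
			if (mark[j] == SK_VERIFIED)
				break;        /* joins a chain already checked */
			mark[j] = SK_IN_CHAIN;
		}

		/* Promote this walk's marks so later walks can stop early. */
		for (j = (int)i; j != SK_END && mark[j] == SK_IN_CHAIN; j = base[j])
			mark[j] = SK_VERIFIED;
	}
	return true;
}
```

Keeping the two marks distinct is what lets a revisit within the current walk
be reported as a loop instead of being mistaken for a previously verified
chain.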
Fixes: 031dcc8f4e ("apparmor: dfa add support for state differential encoding")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
An unprivileged local user can load, replace, and remove profiles by
opening the apparmorfs interfaces, via a confused deputy attack, by
passing the opened fd to a privileged process, and getting the
privileged process to write to the interface.
This does require a privileged target that can be manipulated to do
the write for the unprivileged process, but once such access is
achieved full policy management is possible and all the possible
implications that implies: removing confinement, DoS of system or
target applications by denying all execution, by-passing the
unprivileged user namespace restriction, to exploiting kernel bugs for
a local privilege escalation.
The policy management interface can not have its permissions simply
changed from 0666 to 0600 because non-root processes need to be able
to load policy to different policy namespaces.
Instead ensure the task writing the interface has privileges that
are a subset of the task that opened the interface. This is already
done via policy for confined processes, but unconfined can delegate
access to the opened fd, by-passing the usual policy check.
Fixes: b7fd2c0340 ("apparmor: add per policy ns .load, .replace, .remove interface files")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
if ns_name is NULL after
1071 error = aa_unpack(udata, &lh, &ns_name);
and if ent->ns_name contains an ns_name in
1089 } else if (ent->ns_name) {
then ns_name is assigned the ent->ns_name
1095 ns_name = ent->ns_name;
however ent->ns_name is freed at
1262 aa_load_ent_free(ent);
and then again when freeing ns_name at
1270 kfree(ns_name);
Fix this by NULLing out ent->ns_name after it is transferred to ns_name
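The ownership transfer can be shown in a small userspace sketch. The struct
and helper names are hypothetical stand-ins, and free() stands in for kfree();
like kfree(), free(NULL) is a no-op, which is what makes the fix sufficient.

```c
#include <stdlib.h>
#include <string.h>

struct sk_load_ent {
	char *ns_name;
};

/* Move ownership of the name out of the entry; the entry no longer owns it. */
char *sk_take_ns_name(struct sk_load_ent *ent)
{
	char *name = ent->ns_name;

	ent->ns_name = NULL;
	return name;
}

/* Release whatever the entry still owns; freeing NULL does nothing. */
void sk_load_ent_free(struct sk_load_ent *ent)
{
	free(ent->ns_name);
	ent->ns_name = NULL;
}
```

After sk_take_ns_name(), freeing the entry and then freeing the returned
string releases the allocation exactly once.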
Fixes: 145a0ef21c ("apparmor: fix blob compression when ns is forced on a policy load")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Currently the number of policy namespaces is not bounded, relying instead
on the user namespace limit. However policy namespaces aren't strictly
tied to user namespaces and it is possible to create them and nest
them arbitrarily deep, which can be used to exhaust system resources.
Hard cap policy namespaces to the same depth as user namespaces.
Fixes: c88d4c7b04 ("AppArmor: core policy routines")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Reviewed-by: Ryan Lee <ryan.lee@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The profile removal code uses recursion when removing nested profiles,
which can lead to kernel stack exhaustion and system crashes.
Reproducer:
$ pf='a'; for ((i=0; i<1024; i++)); do
echo -e "profile $pf { \n }" | apparmor_parser -K -a;
pf="$pf//x";
done
$ echo -n a > /sys/kernel/security/apparmor/.remove
Replace the recursive __aa_profile_list_release() approach with an
iterative approach in __remove_profile(). The function repeatedly
finds and removes leaf profiles until the entire subtree is removed,
maintaining the same removal semantic without recursion.
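The leaf-first iteration can be sketched on a generic tree. The node layout
and function names below are hypothetical and are not the AppArmor profile
structures; the point is that stack usage stays constant regardless of
nesting depth.

```c
#include <stddef.h>

struct sk_profile {
	struct sk_profile *parent;
	struct sk_profile *first_child;
	struct sk_profile *next_sibling;
	int removed;
};

/* Unlink a leaf from its parent's child list and mark it removed. */
static void sk_remove_leaf(struct sk_profile *leaf)
{
	struct sk_profile *parent = leaf->parent;

	if (parent) {
		struct sk_profile **link = &parent->first_child;

		while (*link != leaf)
			link = &(*link)->next_sibling;
		*link = leaf->next_sibling;
	}
	leaf->parent = NULL;
	leaf->next_sibling = NULL;
	leaf->removed = 1;
}

/* Remove root and all descendants without recursion; returns nodes removed. */
unsigned int sk_remove_subtree(struct sk_profile *root)
{
	struct sk_profile *cur = root;
	unsigned int count = 0;

	while (cur) {
		struct sk_profile *up;

		/* Descend to a leaf below the current node. */
		while (cur->first_child)
			cur = cur->first_child;

		/* Remember where to continue, then remove the leaf. */
		up = (cur == root) ? NULL : cur->parent;
		sk_remove_leaf(cur);
		count++;
		cur = up;
	}
	return count;
}
```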
Fixes: c88d4c7b04 ("AppArmor: core policy routines")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
The function sets `*ns = NULL` on every call, leaking the namespace
string allocated in previous iterations when multiple profiles are
unpacked. This also breaks namespace consistency checking since *ns
is always NULL when the comparison is made.
Remove the incorrect assignment.
The caller (aa_unpack) initializes *ns to NULL once before the loop,
which is sufficient.
Fixes: dd51c84857 ("apparmor: provide base for multiple profiles to be replaced at once")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Start states are read from untrusted data and used as indexes into the
DFA state tables. The aa_dfa_next() function call in unpack_pdb() will
access dfa->tables[YYTD_ID_BASE][start], and if the start state exceeds
the number of states in the DFA, this results in an out-of-bounds read.
==================================================================
BUG: KASAN: slab-out-of-bounds in aa_dfa_next+0x2a1/0x360
Read of size 4 at addr ffff88811956fb90 by task su/1097
...
Reject policies with out-of-bounds start states during unpacking
to prevent the issue.
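The check itself is a plain index validation before the table access. A
minimal sketch, assuming a hypothetical sk_dfa_base_lookup() helper over a
flat array rather than the real DFA table structures:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look up the table entry for `start`, refusing any index that lies outside
 * the table. Returns 0 on success and -1 for an out-of-bounds state; `*out`
 * is left untouched on failure.
 */
int sk_dfa_base_lookup(const uint32_t *base, size_t state_count,
		       size_t start, uint32_t *out)
{
	if (start >= state_count)
		return -1;
	*out = base[start];
	return 0;
}
```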
Fixes: ad5ff3db53 ("AppArmor: Add ability to load extended policy")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: Massimiliano Pellizzer <massimiliano.pellizzer@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
When a sound card is unbound while a PCM stream is open, a
use-after-free can occur in snd_soc_dapm_stream_event(), called from
the close_delayed_work workqueue handler.
During unbind, snd_soc_unbind_card() flushes delayed work and then
calls soc_cleanup_card_resources(). Inside cleanup,
snd_card_disconnect_sync() releases all PCM file descriptors, and
the resulting PCM close path can call snd_soc_dapm_stream_stop()
which schedules new delayed work with a pmdown_time timer delay.
Since this happens after the flush in snd_soc_unbind_card(), the
new work is not caught. soc_remove_link_components() then frees
DAPM widgets before this work fires, leading to the use-after-free.
The existing flush in soc_free_pcm_runtime() also cannot help as it
runs after soc_remove_link_components() has already freed the widgets.
Add a flush in soc_cleanup_card_resources() after
snd_card_disconnect_sync() (after which no new PCM closes can
schedule further delayed work) and before soc_remove_link_dais()
and soc_remove_link_components() (which tear down the structures the
delayed work accesses).
Fixes: e894efef9a ("ASoC: core: add support to card rebind")
Signed-off-by: Matteo Cotifava <cotifavamatteo@gmail.com>
Link: https://patch.msgid.link/20260309215412.545628-3-cotifavamatteo@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The delayed_work_pending() check before flush_delayed_work() in
soc_free_pcm_runtime() is unnecessary and racy. flush_delayed_work()
is safe to call unconditionally - it is a no-op when no work is
pending. Remove the check.
The original check was added by commit 9c9b652034 ("ASoC: core:
only flush inited work during free") but delayed_work_pending()
followed by flush_delayed_work() has a time-of-check/time-of-use
window where work can become pending between the two calls.
Fixes: 9c9b652034 ("ASoC: core: only flush inited work during free")
Signed-off-by: Matteo Cotifava <cotifavamatteo@gmail.com>
Link: https://patch.msgid.link/20260309215412.545628-2-cotifavamatteo@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Dual LVDS output (available on the SN65DSI84) requires HSYNC_PULSE_WIDTH
and HORIZONTAL_BACK_PORCH to be divided by two with respect to the values
used for single LVDS output.
While not clearly stated in the datasheet, this is needed according to the
DSI Tuner [0] output. It also makes sense intuitively because in dual LVDS
output two pixels at a time are output and so the output clock is half of
the pixel clock.
Some dual-LVDS panels refuse to show any picture without this fix.
Divide by two HORIZONTAL_FRONT_PORCH too, even though this register is used
only for test pattern generation which is not currently implemented by this
driver.
[0] https://www.ti.com/tool/DSI-TUNER
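The adjustment can be summarised in a short sketch. The struct and function
names are hypothetical and the values are in pixel clocks; the driver's actual
register programming is not shown.

```c
/* Horizontal timings relevant to the bridge, in pixel clocks. */
struct sk_htiming {
	unsigned int hsync_len;
	unsigned int hback_porch;
	unsigned int hfront_porch;
};

/* In dual-link mode two pixels are output per clock, so halve each value. */
struct sk_htiming sk_lvds_htiming(struct sk_htiming t, int dual_link)
{
	if (dual_link) {
		t.hsync_len /= 2;
		t.hback_porch /= 2;
		t.hfront_porch /= 2;
	}
	return t;
}
```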
Fixes: ceb515ba29 ("drm/bridge: ti-sn65dsi83: Add TI SN65DSI83 and SN65DSI84 driver")
Cc: stable@vger.kernel.org
Reviewed-by: Marek Vasut <marek.vasut@mailbox.org>
Link: https://patch.msgid.link/20260226-ti-sn65dsi83-dual-lvds-fixes-and-test-pattern-v1-2-2e15f5a9a6a0@bootlin.com
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
The DSI frequency must be in the range:
(CHA_DSI_CLK_RANGE * 5 MHz) <= DSI freq < ((CHA_DSI_CLK_RANGE + 1) * 5 MHz)
So the register value should point to the lower range value, but
DIV_ROUND_UP() rounds the division to the higher range value, resulting in
an excess of 1 (unless the frequency is an exact multiple of 5 MHz).
For example for a 437100000 Hz clock CHA_DSI_CLK_RANGE should be 87 (0x57):
(87 * 5 = 435) <= 437.1 < (88 * 5 = 440)
but current code returns 88 (0x58).
Fix the computation by removing the DIV_ROUND_UP().
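The difference between the two roundings is easy to reproduce in isolation.
A minimal sketch with hypothetical helper names, taking the frequency in Hz
and using the 5 MHz step from the range formula above:

```c
#define SK_DSI_RANGE_STEP_HZ 5000000ul

/* Lower bound of the 5 MHz range containing dsi_hz (plain division). */
unsigned int sk_dsi_clk_range(unsigned long dsi_hz)
{
	return (unsigned int)(dsi_hz / SK_DSI_RANGE_STEP_HZ);
}

/* The previous behaviour: rounding up overshoots unless dsi_hz is exact. */
unsigned int sk_dsi_clk_range_round_up(unsigned long dsi_hz)
{
	return (unsigned int)((dsi_hz + SK_DSI_RANGE_STEP_HZ - 1) /
			      SK_DSI_RANGE_STEP_HZ);
}
```

For 437100000 Hz the plain division yields 87 and the rounded-up variant
yields 88; for an exact multiple such as 435000000 Hz both yield 87.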
Fixes: ceb515ba29 ("drm/bridge: ti-sn65dsi83: Add TI SN65DSI83 and SN65DSI84 driver")
Cc: stable@vger.kernel.org
Reviewed-by: Marek Vasut <marek.vasut@mailbox.org>
Link: https://patch.msgid.link/20260226-ti-sn65dsi83-dual-lvds-fixes-and-test-pattern-v1-1-2e15f5a9a6a0@bootlin.com
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
aesbs_setkey() and aesbs_cbc_ctr_setkey() allocate struct crypto_aes_ctx
on the stack. On arm64, the kernel-mode NEON context is also stored on
the stack, causing the combined frame size to exceed 1024 bytes and
triggering -Wframe-larger-than= warnings.
Allocate struct crypto_aes_ctx on the heap instead and use
kfree_sensitive() to ensure the key material is zeroed on free.
Use a goto-based cleanup path to ensure kfree_sensitive() is always
called.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Fixes: 4fa617cc68 ("arm64/fpsimd: Allocate kernel mode FP/SIMD buffers on the stack")
Link: https://lore.kernel.org/r/20260306064254.2079274-1-yphbchou0911@gmail.com
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Fix a warning for:
$ ./scripts/kconfig/merge_config.sh .config extra.config
Using .config as base
Merging extra.config
./scripts/kconfig/merge_config.sh: 384: [: false: unexpected operator
The shellcheck report is also attached:
if [ "$STRICT" == "true" ] && [ "$STRICT_MODE_VIOLATED" == "true" ]; then
^-- SC3014 (warning): In POSIX sh, == in place of = is undefined.
^-- SC3014 (warning): In POSIX sh, == in place of = is undefined.
Fixes: dfc97e1c5d ("scripts: kconfig: merge_config.sh: use awk in checks too")
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: Mikko Rapeli <mikko.rapeli@linaro.org>
Link: https://patch.msgid.link/20260309121505.40454-1-o451686892@gmail.com
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Since the caller, __io_uring_run_bpf_filters(), doesn't prevent
migration, it should use the migration disabling variant for running
the BPF program.
Fixes: d42eb05e60 ("io_uring: add support for BPF filtering for opcode restrictions")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The driver uses devm_spi_register_controller() for registration, which
automatically unregisters the controller via devm cleanup when the
device is removed. The manual call to spi_unregister_controller() in
the remove() callback can lead to a double-free.
To make sure the controller is unregistered before the DMA buffer is
unmapped, switch to spi_register_controller() in probe().
Fixes: 8011709906 ("spi: rockchip-sfc: Support pm ops")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260310-sfc-v2-1-67fab04b097f@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
The wacom_intuos_bt_irq() function processes Bluetooth HID reports
without sufficient bounds checking. A maliciously crafted short report
can trigger an out-of-bounds read when copying data into the wacom
structure.
Specifically, report 0x03 requires at least 22 bytes to safely read
the processed data and battery status, while report 0x04 (which
falls through to 0x03) requires 32 bytes.
Add explicit length checks for these report IDs and log a warning if
a short report is received.
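The length requirements can be expressed as a small validation helper. This is
a userspace sketch with hypothetical names, not the driver code; rejecting all
other report IDs is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_BT_REPORT_03_MIN_LEN 22u
#define SK_BT_REPORT_04_MIN_LEN 32u

/* Return true if the report is long enough for its ID to be parsed safely. */
bool sk_bt_report_len_ok(const unsigned char *data, size_t len)
{
	if (len == 0)
		return false;

	switch (data[0]) {
	case 0x03:
		return len >= SK_BT_REPORT_03_MIN_LEN;
	case 0x04:
		/* Falls through to the 0x03 handling, so it needs more data. */
		return len >= SK_BT_REPORT_04_MIN_LEN;
	default:
		return false; /* other report IDs are outside this sketch */
	}
}
```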
Signed-off-by: Benoît Sevens <bsevens@google.com>
Reviewed-by: Jason Gerecke <jason.gerecke@wacom.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Pull cpupower utility updates for 7.0-rc4 from Shuah Khan:
"linux-cpupower-7.0-rc4
- Adds support for setting EPP via systemd service
- Fixes swapped power/energy unit labels
- Adds intel_pstate turbo boost support for Intel platforms"
* tag 'linux-cpupower-7.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux:
cpupower: Add intel_pstate turbo boost support for Intel platforms
cpupower: Add support for setting EPP via systemd service
cpupower: fix swapped power/energy unit labels
Commit 60f3ada4174f ("kunit: Add --list_suites to show suites") introduced
the --list_suites option to kunit.py, but the update to the corresponding
run_wrapper documentation was omitted.
Add the missing description for --list_suites to keep the documentation in
sync with the tool's supported arguments.
Fixes: 60f3ada4174f ("kunit: Add --list_suites to show suites")
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
Reviewed-by: David Gow <david@davidgow.net>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
scx_enable() uses double-checked locking to lazily initialize a static
kthread_worker pointer. The fast path reads helper locklessly:
if (!READ_ONCE(helper)) { // lockless read -- no helper_mutex
The write side initializes helper under helper_mutex, but previously
used a plain assignment:
helper = kthread_run_worker(0, "scx_enable_helper");
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plain write -- KCSAN data race with READ_ONCE() above
Since READ_ONCE() on the fast path and the plain write on the
initialization path access the same variable without a common lock,
they constitute a data race. KCSAN requires that all sides of a
lock-free access use READ_ONCE()/WRITE_ONCE() consistently.
Use a temporary variable to stage the result of kthread_run_worker(),
and only WRITE_ONCE() into helper after confirming the pointer is
valid. This avoids a window where a concurrent caller on the fast path
could observe an ERR pointer via READ_ONCE(helper) before the error
check completes.
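The staging pattern can be sketched in userspace with C11 atomics and a
pthread mutex standing in for READ_ONCE()/WRITE_ONCE() and helper_mutex. All
names are hypothetical, and the creators return NULL on failure where the
kernel function would return an ERR pointer.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct sk_worker { int id; };

static _Atomic(struct sk_worker *) sk_helper;
static pthread_mutex_t sk_helper_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct sk_worker sk_the_worker = { 1 };

/* Hypothetical creators standing in for success and failure. */
struct sk_worker *sk_create_ok(void)   { return &sk_the_worker; }
struct sk_worker *sk_create_fail(void) { return NULL; }

/* Lockless read used by the fast path. */
struct sk_worker *sk_peek_helper(void)
{
	return atomic_load_explicit(&sk_helper, memory_order_acquire);
}

/* Lazily create the helper; publish it only after it is known to be valid. */
int sk_ensure_helper(struct sk_worker *(*create)(void))
{
	int ret = 0;

	if (sk_peek_helper())
		return 0;

	pthread_mutex_lock(&sk_helper_mutex);
	if (!atomic_load_explicit(&sk_helper, memory_order_relaxed)) {
		struct sk_worker *w = create(); /* staged in a temporary */

		if (w)
			atomic_store_explicit(&sk_helper, w, memory_order_release);
		else
			ret = -1; /* failure is never made visible to readers */
	}
	pthread_mutex_unlock(&sk_helper_mutex);
	return ret;
}
```

Because the shared pointer is written only on success, a concurrent reader on
the fast path sees either NULL or a usable worker, never an error value.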
Fixes: b06ccbabe2 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The actual code is right, but the comment is the wrong way around.
Fixes: ed82f35b92 ("io_uring: allow registration of per-task restrictions")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dingisoul with KASAN reports a use after free if device_add() fails in
nd_async_device_register().
Commit b6eae0f61d ("libnvdimm: Hold reference on parent while
scheduling async init") correctly added a reference on the parent device
to be held until asynchronous initialization was complete. However, if
device_add() results in an allocation failure, the ref count of the
device drops to 0 prior to the parent pointer being accessed, resulting
in a use after free.
The bug bot AI correctly identified the fix. Save a reference to the
parent pointer to be used to drop the parent reference regardless of the
outcome of device_add().
Reported-by: Dingisoul <dingiso.kernel@gmail.com>
Closes: http://lore.kernel.org/8855544b-be9e-4153-aa55-0bc328b13733@gmail.com
Fixes: b6eae0f61d ("libnvdimm: Hold reference on parent while scheduling async init")
Cc: stable@vger.kernel.org
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260306-fix-uaf-async-init-v1-1-a28fd7526723@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
In iptfs_reassem_cont(), IP-TFS attempts to append data to the new inner
packet 'newskb' that is being reassembled. First a zero-copy approach is
tried; if it succeeds, newskb becomes non-linear.
When a subsequent fragment in the same datagram does not meet the
fast-path conditions, a memory copy is performed. It calls skb_put() to
append the data and, as newskb is non-linear, it triggers the
SKB_LINEAR_ASSERT check.
Oops: invalid opcode: 0000 [#1] SMP NOPTI
[...]
RIP: 0010:skb_put+0x3c/0x40
[...]
Call Trace:
<IRQ>
iptfs_reassem_cont+0x1ab/0x5e0 [xfrm_iptfs]
iptfs_input_ordered+0x2af/0x380 [xfrm_iptfs]
iptfs_input+0x122/0x3e0 [xfrm_iptfs]
xfrm_input+0x91e/0x1a50
xfrm4_esp_rcv+0x3a/0x110
ip_protocol_deliver_rcu+0x1d7/0x1f0
ip_local_deliver_finish+0xbe/0x1e0
__netif_receive_skb_core.constprop.0+0xb56/0x1120
__netif_receive_skb_list_core+0x133/0x2b0
netif_receive_skb_list_internal+0x1ff/0x3f0
napi_complete_done+0x81/0x220
virtnet_poll+0x9d6/0x116e [virtio_net]
__napi_poll.constprop.0+0x2b/0x270
net_rx_action+0x162/0x360
handle_softirqs+0xdc/0x510
__irq_exit_rcu+0xe7/0x110
irq_exit_rcu+0xe/0x20
common_interrupt+0x85/0xa0
</IRQ>
<TASK>
Fix this by checking if the skb is non-linear. If it is, linearize it by
calling skb_linearize(). As the initial allocation of newskb originally
reserved enough tailroom for the entire reassembled packet we do not
need to check if we have enough tailroom or extend it.
Fixes: 5f2b6a9095 ("xfrm: iptfs: add skb-fragment sharing code")
Reported-by: Hao Long <me@imlonghao.com>
Closes: https://lore.kernel.org/netdev/DGRCO9SL0T5U.JTINSHJQ9KPK@imlonghao.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When UBLK_F_NO_AUTO_PART_SCAN is set, GD_SUPPRESS_PART_SCAN is cleared
unconditionally, including for unprivileged daemons. Keep it consistent
with the code block for setting GD_SUPPRESS_PART_SCAN by not clearing
it for unprivileged daemons.
In reality this isn't a problem because ioctl(BLKRRPART) requires
CAP_SYS_ADMIN, but it is more reliable to not clear the bit.
Cc: Alexander Atanasov <alex@zazolabs.com>
Fixes: 8443e2087e ("ublk: add UBLK_F_NO_AUTO_PART_SCAN feature flag")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The driver uses devm_dma_request_chan() which registers automatic cleanup
via devm_add_action_or_reset(). Calling dma_release_channel() manually on
the RX channel when TX channel request fails causes a double-free when
the devm cleanup runs.
Remove the unnecessary manual cleanup and simplify the error handling
since devm will properly release channels on probe failure or driver
detach.
Fixes: 34e3815ea4 ("spi: atcspi200: Add ATCSPI200 SPI controller driver")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260305-atcspi2000-v1-1-eafe08dcca60@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Fix three bugs in aml_sfc_dma_buffer_setup() error paths:
1. Unnecessary goto: When the first DMA mapping (sfc->daddr) fails,
nothing needs cleanup. Use direct return instead of goto.
2. Double-unmap bug: When info DMA mapping failed, the code would
unmap sfc->daddr inline, then fall through to out_map_data which
would unmap it again, causing a double-unmap.
3. Wrong unmap size: The out_map_info label used datalen instead of
infolen when unmapping sfc->iaddr, which could lead to incorrect
DMA sync behavior.
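The three points above can be sketched with a generic goto-unwind
pattern. This is a userspace illustration under stated assumptions:
fake_map() and fake_unmap() are hypothetical stand-ins for the DMA
mapping calls, and the "later step" that can fail after both mappings
is invented to exercise the second label.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping that stands in for real DMA map/unmap calls. */
static int maps, unmaps;
static size_t last_unmap_len;

static int fake_map(size_t len, int fail)
{
	(void)len;
	if (fail)
		return -1;
	maps++;
	return 0;
}

static void fake_unmap(size_t len)
{
	unmaps++;
	last_unmap_len = len;
}

/*
 * fail_at: 0 = no failure, 1 = data mapping fails,
 *          2 = info mapping fails, 3 = a later step fails.
 */
static int buffer_setup(size_t datalen, size_t infolen, int fail_at)
{
	int ret;

	ret = fake_map(datalen, fail_at == 1);
	if (ret)
		return ret;		/* nothing to undo yet */

	ret = fake_map(infolen, fail_at == 2);
	if (ret)
		goto out_unmap_data;	/* undo the data mapping exactly once */

	if (fail_at == 3) {
		ret = -1;
		goto out_unmap_info;
	}
	return 0;

out_unmap_info:
	fake_unmap(infolen);		/* info buffer uses its own length */
out_unmap_data:
	fake_unmap(datalen);
	return ret;
}
```

Each label undoes exactly one mapping and falls through to the next, so
no mapping is released twice and each is released with its own length.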
Fixes: 4670db6f32 ("spi: amlogic: add driver for Amlogic SPI Flash Controller")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260306-spifc-a4-v1-1-f22c9965f64a@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
When `ceph_process_folio_batch` encounters a folio past the end of the
current object, it should leave it in the batch so that it is picked up
in the next iteration.
Removing the folio from the batch means that it does not get written
back and remains dirty instead. This makes `fsync()` silently skip some
of the data, delays capability release, and breaks coherence with
`O_DIRECT`.
The link below contains instructions for reproducing the bug.
Cc: stable@vger.kernel.org
Fixes: ce80b76dd3 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://tracker.ceph.com/issues/75156
Signed-off-by: Hristo Venev <hristo@venev.name>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add __putname() calls to error code paths that did not free the "path"
pointer obtained by __getname(). If ownership of this pointer is not
passed to the caller via path_info.path, the function must free it
before returning.
Cc: stable@vger.kernel.org
Fixes: 3fd945a79e ("ceph: encode encrypted name in ceph_mdsc_build_path and dentry release")
Fixes: 550f7ca98e ("ceph: give up on paths longer than PATH_MAX")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
ceph_mdsc_build_path() must be called with a zero-initialized
ceph_path_info parameter, or else the following
ceph_mdsc_free_path_info() may crash.
Example crash (on Linux 6.18.12):
virt_to_cache: Object is not a Slab page!
WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6732 kmem_cache_free+0x316/0x400
[...]
Call Trace:
[...]
ceph_open+0x13d/0x3e0
do_dentry_open+0x134/0x480
vfs_open+0x2a/0xe0
path_openat+0x9a3/0x1160
[...]
cache_from_obj: Wrong slab cache. names_cache but object is from ceph_inode_info
WARNING: CPU: 184 PID: 2871736 at mm/slub.c:6746 kmem_cache_free+0x2dd/0x400
[...]
kernel BUG at mm/slub.c:634!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
RIP: 0010:__slab_free+0x1a4/0x350
Some of the ceph_mdsc_build_path() callers had initializers, but
others had not, even though they were all added by commit 15f519e9f8
("ceph: fix race condition validating r_parent before applying state").
The ones without initializer are susceptible to random crashes. (I can
imagine it could even be possible to exploit this bug to elevate
privileges.)
Unfortunately, these Ceph functions are undocumented and their semantics
can only be derived from the code. I see that ceph_mdsc_build_path()
initializes the structure only on success, but not on error.
Calling ceph_mdsc_free_path_info() after a failed
ceph_mdsc_build_path() call does not even make sense, but that's what
all callers do, and for it to be safe, the structure must be
zero-initialized. The least intrusive approach to fix this is
therefore to add initializers everywhere.
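Why the initializer matters can be shown with a small userspace sketch.
The names below (struct path_info, build_path(), free_path_info(),
open_like_caller()) are hypothetical and only mirror the calling
convention described above; they are not the Ceph implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct path_info {
	char *path;
	size_t len;
};

static int frees;

/* Fills *info only on success, as described for the real builder. */
static int build_path(struct path_info *info, int fail)
{
	if (fail)
		return -1;
	info->path = malloc(4);
	if (!info->path)
		return -1;
	memcpy(info->path, "a/b", 4);
	info->len = 3;
	return 0;
}

/* Safe on a zero-initialized struct because a NULL path is skipped. */
static void free_path_info(struct path_info *info)
{
	if (info->path) {
		free(info->path);
		frees++;
	}
	info->path = NULL;
}

static int open_like_caller(int fail)
{
	struct path_info info = {0};	/* the initializer */
	int ret = build_path(&info, fail);

	free_path_info(&info);		/* runs on both paths */
	return ret;
}
```

Without the `= {0}` initializer, the failure path would hand
uninitialized stack contents to the free helper.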
Cc: stable@vger.kernel.org
Fixes: 15f519e9f8 ("ceph: fix race condition validating r_parent before applying state")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
During async unlink, we drop the `i_nlink` counter before we receive
the completion (that will eventually update the `i_nlink`) because "we
assume that the unlink will succeed". That is not a bad idea, but it
races against deletions by other clients (or against the completion of
our own unlink) and can lead to an underrun which emits a WARNING like
this one:
WARNING: CPU: 85 PID: 25093 at fs/inode.c:407 drop_nlink+0x50/0x68
Modules linked in:
CPU: 85 UID: 3221252029 PID: 25093 Comm: php-cgi8.1 Not tainted 6.14.11-cm4all1-ampere #655
Hardware name: Supermicro ARS-110M-NR/R12SPD-A, BIOS 1.1b 10/17/2023
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : drop_nlink+0x50/0x68
lr : ceph_unlink+0x6c4/0x720
sp : ffff80012173bc90
x29: ffff80012173bc90 x28: ffff086d0a45aaf8 x27: ffff0871d0eb5680
x26: ffff087f2a64a718 x25: 0000020000000180 x24: 0000000061c88647
x23: 0000000000000002 x22: ffff07ff9236d800 x21: 0000000000001203
x20: ffff07ff9237b000 x19: ffff088b8296afc0 x18: 00000000f3c93365
x17: 0000000000070000 x16: ffff08faffcbdfe8 x15: ffff08faffcbdfec
x14: 0000000000000000 x13: 45445f65645f3037 x12: 34385f6369706f74
x11: 0000a2653104bb20 x10: ffffd85f26d73290 x9 : ffffd85f25664f94
x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 0000000000000002
x5 : 0000000000000081 x4 : 0000000000000481 x3 : 0000000000000000
x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff08727d3f91e8
Call trace:
drop_nlink+0x50/0x68 (P)
vfs_unlink+0xb0/0x2e8
do_unlinkat+0x204/0x288
__arm64_sys_unlinkat+0x3c/0x80
invoke_syscall.constprop.0+0x54/0xe8
do_el0_svc+0xa4/0xc8
el0_svc+0x18/0x58
el0t_64_sync_handler+0x104/0x130
el0t_64_sync+0x154/0x158
In ceph_unlink(), a call to ceph_mdsc_submit_request() submits the
CEPH_MDS_OP_UNLINK to the MDS, but does not wait for completion.
Meanwhile, between this call and the following drop_nlink() call, a
worker thread may process a CEPH_CAP_OP_IMPORT, CEPH_CAP_OP_GRANT or
just a CEPH_MSG_CLIENT_REPLY (the latter of which could be our own
completion). These will lead to a set_nlink() call, updating the
`i_nlink` counter to the value received from the MDS. If that new
`i_nlink` value happens to be zero, it is illegal to decrement it
further. But that is exactly what ceph_unlink() will do then.
The WARNING can be reproduced this way:
1. Force async unlink; only the async code path is affected. Having
no real clue about Ceph internals, I was unable to find out why the
MDS wouldn't give me the "Fxr" capabilities, so I patched
get_caps_for_async_unlink() to always succeed.
(Note that the WARNING dump above was found on an unpatched kernel,
without this kludge - this is not a theoretical bug.)
2. Add a sleep call after ceph_mdsc_submit_request() so the unlink
completion gets handled by a worker thread before drop_nlink() is
called. This guarantees that the `i_nlink` is already zero before
drop_nlink() runs.
The solution is to skip the counter decrement when it is already zero,
but doing so without a lock is still racy (TOCTOU). Since
ceph_fill_inode() and handle_cap_grant() both hold the
`ceph_inode_info.i_ceph_lock` spinlock while set_nlink() runs, this
seems like the proper lock to protect the `i_nlink` updates.
I found prior art in NFS and SMB (using `inode.i_lock`) and AFS (using
`afs_vnode.cb_lock`). All three have the zero check as well.
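The locked zero check can be sketched in userspace C. This is an
illustration only: a C11 atomic_flag spinlock stands in for
`ceph_inode_info.i_ceph_lock`, and struct inode_sketch and its helpers
are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

struct inode_sketch {
	atomic_flag lock;	/* stand-in for the per-inode spinlock */
	unsigned int nlink;
};

static void ino_lock(struct inode_sketch *i)
{
	while (atomic_flag_test_and_set_explicit(&i->lock,
						 memory_order_acquire))
		;
}

static void ino_unlock(struct inode_sketch *i)
{
	atomic_flag_clear_explicit(&i->lock, memory_order_release);
}

/* Models an update that stores the count received from the server. */
static void set_nlink_locked(struct inode_sketch *i, unsigned int n)
{
	ino_lock(i);
	i->nlink = n;
	ino_unlock(i);
}

/* Check and decrement under the same lock, so there is no TOCTOU gap. */
static int drop_nlink_if_nonzero(struct inode_sketch *i)
{
	int dropped = 0;

	ino_lock(i);
	if (i->nlink > 0) {
		i->nlink--;
		dropped = 1;
	}
	ino_unlock(i);
	return dropped;
}
```

Since the check and the decrement share one critical section with the
setter, the counter cannot reach zero between them.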
Cc: stable@vger.kernel.org
Fixes: 2ccb45462a ("ceph: perform asynchronous unlink if we have sufficient caps")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
At the end of this function, d is the traversal cursor of flist, but the
code completes found instead. This can lead to issues such as NULL pointer
dereferences, double completion, or descriptor leaks.
Fix this by completing d instead of found in the final
list_for_each_entry_safe() loop.
Fixes: aa8d18becc ("dmaengine: idxd: add callback support for iaa crypto")
Signed-off-by: Tuo Li <islituo@gmail.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20260106032428.162445-1-islituo@gmail.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When a write subrequest is marked NETFS_SREQ_NEED_RETRY, the retry path
in netfs_unbuffered_write() unconditionally calls stream->prepare_write()
without checking if it is NULL.
Filesystems such as 9P do not set the prepare_write operation, so
stream->prepare_write remains NULL. When get_user_pages() fails with
-EFAULT and the subrequest is flagged for retry, this results in a NULL
pointer dereference at fs/netfs/direct_write.c:189.
Fix this by mirroring the pattern already used in write_retry.c: if
stream->prepare_write is NULL, skip renegotiation and directly reissue
the subrequest via netfs_reissue_write(), which handles iterator reset,
IN_PROGRESS flag, stats update and reissue internally.
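The shape of the check is the usual guard around an optional callback.
A minimal sketch, assuming hypothetical types (struct subreq, struct
stream_ops) and a hypothetical retry_write() helper rather than the
netfs code itself:

```c
#include <assert.h>
#include <stddef.h>

struct subreq {
	int prepared;
	int issued;
};

struct stream_ops {
	void (*prepare_write)(struct subreq *);	/* optional, may be NULL */
	void (*issue_write)(struct subreq *);
};

static void prep(struct subreq *s) { s->prepared++; }
static void issue(struct subreq *s) { s->issued++; }

/* Renegotiate only when the filesystem provides the hook. */
static void retry_write(const struct stream_ops *ops, struct subreq *s)
{
	if (ops->prepare_write)
		ops->prepare_write(s);
	ops->issue_write(s);
}
```

A stream without the hook goes straight to reissue in this sketch.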
Fixes: a0b4c7a491 ("netfs: Fix unbuffered/DIO writes to dispatch subrequests in strict sequence")
Reported-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7227db0fbac9f348dba0
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Link: https://patch.msgid.link/20260307043947.347092-1-kartikey406@gmail.com
Tested-by: syzbot+7227db0fbac9f348dba0@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
The Lenovo ThinkPad X390 (ALC257 codec, subsystem ID 0x17aa2288)
does not report headset button press events. Headphone insertion is
detected (SW_HEADPHONE_INSERT), but pressing the inline microphone
button on a headset produces no input events.
Add a SND_PCI_QUIRK entry that maps this subsystem ID to
ALC285_FIXUP_THINKPAD_NO_BASS_SPK_HEADSET_JACK, which enables
headset jack button detection through alc_fixup_headset_jack()
and ThinkPad ACPI integration. This is the same fixup used by
similar ThinkPad models (P1 Gen 3, X1 Extreme Gen 3).
Signed-off-by: Uzair Mughal <contact@uzair.is-a.dev>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://patch.msgid.link/20260307012906.20093-1-contact@uzair.is-a.dev
In the drain loop, the local variable 'runtime' is reassigned to a
linked stream's runtime (runtime = s->runtime at line 2157). After
releasing the stream lock at line 2169, the code accesses
runtime->no_period_wakeup, runtime->rate, and runtime->buffer_size
(lines 2170-2178) — all referencing the linked stream's runtime without
any lock or refcount protecting its lifetime.
A concurrent close() on the linked stream's fd triggers
snd_pcm_release_substream() → snd_pcm_drop() → pcm_release_private()
→ snd_pcm_unlink() → snd_pcm_detach_substream() → kfree(runtime).
No synchronization prevents kfree(runtime) from completing while the
drain path dereferences the stale pointer.
Fix by caching the needed runtime fields (no_period_wakeup, rate,
buffer_size) into local variables while still holding the stream lock,
and using the cached values after the lock is released.
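A userspace sketch of the snapshot approach follows. It is loosely
modeled on the description above and is not the ALSA code: the struct
names are hypothetical, an atomic_flag spinlock stands in for the stream
lock, and the timeout arithmetic is a simplified assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct runtime_sketch {
	int no_period_wakeup;
	unsigned int rate;
	unsigned long buffer_size;
};

struct stream_sketch {
	atomic_flag lock;		/* stand-in for the stream lock */
	struct runtime_sketch *runtime;
};

struct drain_params {
	int no_period_wakeup;
	unsigned int rate;
	unsigned long buffer_size;
};

/* Copy the needed fields while the lock is still held. */
static struct drain_params snapshot_params(struct stream_sketch *s)
{
	struct drain_params p;

	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;
	p.no_period_wakeup = s->runtime->no_period_wakeup;
	p.rate = s->runtime->rate;
	p.buffer_size = s->runtime->buffer_size;
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
	return p;
}

/* Uses only the cached copy; -1 means "wait indefinitely" here. */
static long drain_timeout_ms(const struct drain_params *p)
{
	long tout = 100;

	if (p->no_period_wakeup)
		return -1;
	if (p->rate) {
		long t = (long)(p->buffer_size * 1100 / p->rate);

		if (t > tout)
			tout = t;
	}
	return tout;
}
```

After snapshot_params() returns, the runtime object can go away without
affecting the timeout computation.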
Fixes: f2b3614cef ("ALSA: PCM - Don't check DMA time-out too shortly")
Cc: stable@vger.kernel.org
Signed-off-by: Mehul Rao <mehulrao@gmail.com>
Link: https://patch.msgid.link/20260305193508.311096-1-mehulrao@gmail.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Others have submitted this issue
(https://lore.kernel.org/dmaengine/20240722030405.3385-1-zhengdongxiong@gxmicro.cn/),
but it has not been fixed yet. Therefore, more supplementary information
is provided here.
As mentioned in the "PCS-CCS-CB-TCB" Producer-Consumer Synchronization of
"DesignWare Cores PCI Express Controller Databook, version 6.00a":
1. The Consumer CYCLE_STATE (CCS) bit in the register only needs to be
initialized once; the value will update automatically to be
~CYCLE_BIT (CB) in the next chunk.
2. The Consumer CYCLE_BIT bit in the register is loaded from the LL
element and tested against CCS. When CB = CCS, the data transfer is
executed. Otherwise not.
The current logic sets consumer (HDMA) CS and CB bits to 1 in each chunk
while setting the producer (software) CB of odd chunks to 0 and even
chunks to 1 in the linked list. This is leading to a mismatch between
the producer CB and consumer CS bits.
This issue can be reproduced by setting the transmission data size to
exceed one chunk. By the way, in the EDMA using the same "PCS-CCS-CB-TCB"
mechanism, the CS bit is only initialized once and this issue was not
found. Refer to
drivers/dma/dw-edma/dw-edma-v0-core.c:dw_edma_v0_core_start.
So fix this issue by initializing the CYCLE_STATE and CYCLE_BIT bits
only once.
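The mismatch can be illustrated with a toy model of the handshake. This
is a simplified assumption based only on the two databook points quoted
above, not a description of the hardware or the driver; run_chunks() is
a hypothetical helper.

```c
#include <assert.h>

/*
 * Returns how many chunks execute before the consumer stalls.
 * reinit_each_chunk != 0 models resetting CCS to 1 for every chunk.
 */
static int run_chunks(int nchunks, int reinit_each_chunk)
{
	int ccs = 1;	/* consumer cycle state, initialized once */
	int cb = 1;	/* producer cycle bit of the first chunk */
	int executed = 0;

	for (int i = 0; i < nchunks; i++) {
		if (reinit_each_chunk)
			ccs = 1;	/* per-chunk reset of CCS */
		if (cb != ccs)
			break;		/* CB != CCS: transfer not executed */
		executed++;
		ccs = !cb;	/* CCS updates to ~CB for the next chunk */
		cb = !cb;	/* producer alternates CB per chunk */
	}
	return executed;
}
```

In this model a single chunk always runs, while a multi-chunk transfer
stalls at the second chunk when CCS is reset per chunk.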
Fixes: e74c39573d ("dmaengine: dw-edma: Add support for native HDMA")
Signed-off-by: LUO Haowen <luo-hw@foxmail.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/tencent_CB11AA9F3920C1911AF7477A9BD8EFE0AD05@qq.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When KSMBD_DEBUG_AUTH logging is enabled, generate_smb3signingkey() and
generate_smb3encryptionkey() log the session, signing, encryption, and
decryption key bytes. Remove the logs to avoid exposing credentials.
Fixes: e2f34481b2 ("cifsd: add server-side procedures for SMB3")
Cc: stable@vger.kernel.org
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The opinfo pointer obtained via rcu_dereference(fp->f_opinfo) is being
accessed after rcu_read_unlock() has been called. This creates a
race condition where the memory could be freed by a concurrent
writer between the unlock and the subsequent pointer dereferences
(opinfo->is_lease, etc.), leading to a use-after-free.
Fixes: 5fb282ba4f ("ksmbd: fix possible null-deref in smb_lazy_parent_lease_break_close")
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
ksmbd currently frees oplock_info immediately using kfree(), even
though it is accessed under RCU read-side critical sections in places
like opinfo_get() and proc_show_files().
Since there is no RCU grace period delay between nullifying the pointer
and freeing the memory, a reader can still access the oplock_info
structure after it has been freed. This can lead to a use-after-free
especially in opinfo_get() where atomic_inc_not_zero() is called on
already freed memory.
Fix this by switching to deferred freeing using call_rcu().
Fixes: 18b4fac5ef ("ksmbd: fix use-after-free in smb_break_all_levII_oplock()")
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
The opinfo pointer obtained via rcu_dereference(fp->f_opinfo) is
dereferenced after rcu_read_unlock(), creating a use-after-free
window. A concurrent opinfo_put() can free the opinfo between the
unlock and the subsequent access to opinfo->is_lease,
opinfo->o_lease->state, and opinfo->level.
Fix this by deferring rcu_read_unlock() until after all opinfo
field accesses are complete. The values needed (const_names, count,
level) are copied into local variables under the RCU read lock,
and the potentially-sleeping seq_printf calls happen after the
lock is released.
Found by AI-assisted code review (Claude Opus 4.6, Anthropic)
in collaboration with Ali Khaledi.
Cc: stable@vger.kernel.org
Fixes: b38f99c121 ("ksmbd: add procfs interface for runtime monitoring and statistics")
Signed-off-by: Ali Khaledi <ali.khaledi1989@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
If ksmbd_override_fsids() fails, we jump to err_out2. At that point, fp is
NULL because it hasn't been assigned dh_info.fp yet, so ksmbd_fd_put(work,
fp) will not be called. However, dh_info.fp was already inserted into the
session file table by ksmbd_reopen_durable_fd(), so it will leak in the
session file table until the session is closed.
Move fp = dh_info.fp; ahead of the ksmbd_override_fsids() check to fix the
problem.
Found by an experimental AI code review agent at Google.
Fixes: c8efcc7861 ("ksmbd: add support for durable handles v1/v2")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: ChenXiaoSong <chenxiaosong@kylinos.cn>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use vmalloc() instead of kmalloc(..., GFP_DMA) to alloc the temporary
buffer for firmware download blobs. This avoids the problem that a
heavily fragmented system cannot allocate enough physically-contiguous
memory for a large blob.
The redundant alloc buffer mechanism was removed in commit 900baa6e7b
("firmware: cs_dsp: Remove redundant download buffer allocator").
While doing that I was overly focused on the possibility of the
underlying bus requiring DMA-safe memory. So I used GFP_DMA kmalloc()s.
I failed to notice that the code I was removing used vmalloc().
This creates a regression.
Way back in 2014 the problem of fragmentation with kmalloc()s was fixed
by commit cdcd7f7287 ("ASoC: wm_adsp: Use vmalloc to allocate firmware
download buffer").
Although we don't need physically-contiguous memory, we don't know if the
bus needs some particular alignment of the buffers. Since the change in
2014, the firmware download has always used whatever alignment vmalloc()
returns. To avoid introducing a new problem, the temporary buffer is still
used, to keep the same alignment of pointers passed to regmap_raw_write().
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Fixes: 900baa6e7b ("firmware: cs_dsp: Remove redundant download buffer allocator")
Link: https://patch.msgid.link/20260304141250.1578597-1-rf@opensource.cirrus.com
Signed-off-by: Mark Brown <broonie@kernel.org>
gud_plane_atomic_update() currently handles both crtc state and
framebuffer updates - the complexity has led to a few accidental
NULL pointer dereferences.
Commit dc2d5ddb19 ("drm/gud: fix NULL fb and crtc dereferences
on USB disconnect") [1] fixed an earlier dereference but planes
can also be disabled in non-hotplug paths (e.g. display disables
via the desktop environment). The drm_dev_enter() call would not
cause an early return in those and subsequently oops on
dereferencing crtc:
BUG: kernel NULL pointer dereference, address: 00000000000005c8
CPU: 6 UID: 1000 PID: 3473 Comm: kwin_wayland Not tainted 6.18.2-200.vanilla.gud.fc42.x86_64 #1 PREEMPT(lazy)
RIP: 0010:gud_plane_atomic_update+0x148/0x470 [gud]
<TASK>
drm_atomic_helper_commit_planes+0x28e/0x310
drm_atomic_helper_commit_tail+0x2a/0x70
commit_tail+0xf1/0x150
drm_atomic_helper_commit+0x13c/0x180
drm_atomic_commit+0xb1/0xe0
? __pfx___drm_printfn_info+0x10/0x10
drm_mode_atomic_ioctl+0x70f/0x7c0
? __pfx_drm_mode_atomic_ioctl+0x10/0x10
drm_ioctl_kernel+0xae/0x100
drm_ioctl+0x2a8/0x550
? __pfx_drm_mode_atomic_ioctl+0x10/0x10
__x64_sys_ioctl+0x97/0xe0
do_syscall_64+0x7e/0x7f0
? __ct_user_enter+0x56/0xd0
? do_syscall_64+0x158/0x7f0
? __ct_user_enter+0x56/0xd0
? do_syscall_64+0x158/0x7f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Split out crtc handling from gud_plane_atomic_update() into
atomic_enable() and atomic_disable() functions to delegate
crtc state transitioning work to the DRM helpers.
To preserve the gud state commit sequence [2], switch to
the runtime PM version of drm_atomic_helper_commit_tail() which
ensures that crtcs are enabled (hence sending the
GUD_REQ_SET_CONTROLLER_ENABLE and GUD_REQ_SET_DISPLAY_ENABLE
requests) before a framebuffer update is sent.
[1] https://lore.kernel.org/all/20251231055039.44266-1-me@shenghaoyang.info/
[2] https://github.com/notro/gud/wiki/GUD-Protocol#display-state
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601142159.0v8ilfVs-lkp@intel.com/
Fixes: 73cfd166e0 ("drm/gud: Replace simple display pipe with DRM atomic helpers")
Cc: <stable@vger.kernel.org> # 6.19.x
Cc: <stable@vger.kernel.org> # 6.18.x
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Acked-by: Ruben Wauters <rubenru09@aol.com>
Signed-off-by: Ruben Wauters <rubenru09@aol.com>
Link: https://patch.msgid.link/20260222054551.80864-1-me@shenghaoyang.info
On admin queue completion handling, if the admin command completed with
an error we print data from the completion context. The issue is that we
already freed the completion context in the polling/interrupt handler,
which means we print data from a context in an unknown state (it might
already be in use again).
Change the admin submission flow so alloc/dealloc of the context will be
symmetric and dealloc will be called after any potential use of the
context.
Fixes: 68fb9f3e31 ("RDMA/efa: Remove redundant NULL pointer check of CQE")
Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Signed-off-by: Yonatan Nachum <ynachum@amazon.com>
Link: https://patch.msgid.link/20260308165350.18219-1-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Correctly set dbi->write_memory_bpw for the ST7586 driver. This driver
is for a monochrome display that has an unusual data format, so the
default value set in mipi_dbi_spi_init() is not correct simply because
this controller is non-standard.
Previously, we were using dbi->swap_bytes to make the same sort of
workaround, but it was removed in the same commit that added
dbi->write_memory_bpw, so we need to use the latter now to have the
correct behavior.
This fixes every 3 columns of pixels being swapped on the display. There
are 3 pixels per byte, so the byte swap caused this effect.
Fixes: df3fb27a74 ("drm/mipi-dbi: Make bits per word configurable for pixel transfers")
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://patch.msgid.link/20260228-drm-mipi-dbi-fix-st7586-byte-swap-v1-1-e78f6c24cd28@baylibre.com
The current `dma_read!` and `dma_write!` macros also use a custom
`addr_of!()`-based implementation for projecting pointers, which has a
soundness issue as it relies on the absence of `Deref` implementations on
types.
It also has a soundness issue where it does not protect against unaligned
fields (when `#[repr(packed)]` is used) so it can generate misaligned
accesses.
This commit migrates them to use the general pointer projection
infrastructure, which handles these cases correctly.
As part of the migration, the macro is updated to have an improved surface
syntax. The current macro has
dma_read!(a.b.c[d].e.f)
to mean `a.b.c` is a DMA coherent allocation and it should project into it
with `[d].e.f` and do a read, which is confusing as it makes the indexing
operator integral to the macro (so it will break if you have an array of
`CoherentAllocation`, for example).
This also is problematic as we would like to generalize
`CoherentAllocation` from just slices to arbitrary types.
Make the macro expect `dma_read!(path.to.dma, .path.inside.dma)` as the
canonical syntax. The index operator is no longer special and is just one
type of projection (in addition to field projection). Similarly, make
`dma_write!(path.to.dma, .path.inside.dma, value)` become the canonical
syntax for writing.
Another issue of the current macro is that it is always fallible. This
makes sense with the existing design of `CoherentAllocation`, but once we
support fixed size arrays with `CoherentAllocation`, it is desirable to
have the ability to perform infallible indexing as well, e.g. doing a `[0]`
index of `[Foo; 2]` is okay and can be checked at build-time, so forcing
fallibility is non-ideal. To capture this, the macro is changed to use
`[idx]` as infallible projection and `[idx]?` as fallible index projection
(this syntax is part of the general projection infra). A benefit of this
is that while an individual indexing operation may fail, the overall
read/write operation is not fallible.
Fixes: ad2907b4e3 ("rust: add dma coherent allocator abstraction")
Reviewed-by: Benno Lossin <lossin@kernel.org>
Signed-off-by: Gary Guo <gary@garyguo.net>
Link: https://patch.msgid.link/20260302164239.284084-4-gary@kernel.org
[ Capitalize safety comments; slightly improve wording in doc-comments.
- Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Add a generic infrastructure for performing field and index projections on
raw pointers. This will form the basis of performing I/O projections.
Pointer manipulations intentionally use the safe wrapping variants
instead of the unsafe variants, as the latter require pointers to be
inside an allocation, which is not necessarily true for I/O pointers.
This projection macro protects against rogue `Deref` implementations, which
can cause the projected pointer to be outside the bounds of the starting
pointer. This is extremely unlikely and Rust has a lint to catch this, but
it is unsound regardless. The protection works by inducing type inference
ambiguity when `Deref` is implemented.
This projection macro also stops projecting into unaligned fields (i.e.
fields of `#[repr(packed)]` structs), as misaligned pointers require
special handling. This is implemented by attempting to create a reference
to the projected field inside an `if false` block. Despite being
unreachable, Rust still checks that they're not unaligned fields.
The projection macro supports both fallible and infallible index
projections. These are described in detail inside the documentation.
Signed-off-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://patch.msgid.link/20260302164239.284084-3-gary@kernel.org
[ * Add intro-doc links where possible,
* Fix typos and slightly improve wording, e.g. "as documentation
describes" -> "as the documentation of [`Self::proj`] describes",
* Add an empty line between regular and safety comments, before
examples, and between logically independent comments,
* Capitalize various safety comments.
- Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Add a `KnownSize` trait which is used to obtain a size from a raw pointer's
metadata. This makes it possible to obtain size information on a raw slice
pointer. This is similar to Rust's `core::mem::size_of_val_raw`, which is
not yet stable.
Signed-off-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Benno Lossin <lossin@kernel.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Link: https://patch.msgid.link/20260302164239.284084-2-gary@kernel.org
[ Fix wording in doc-comment. - Danilo ]
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
Valentine reports that their guests fail to boot correctly, losing
interrupts, and indicates that the wrong interrupt gets deactivated.
What happens here is that if the maintenance interrupt is slow enough
to kick us out of the guest, extra interrupts can be activated from
the LRs. We then exit and proceed to handle EOIcount deactivations,
picking active interrupts from the AP list. But we start from the
top of the list, potentially deactivating interrupts that were in
the LRs, while EOIcount only denotes deactivation of interrupts that
are not present in an LR.
Solve this by tracking the last interrupt that made it in the LRs,
and start the EOIcount deactivation walk *after* that interrupt.
Since this only makes sense while the vcpu is loaded, stash this
in the per-CPU host state.
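The walk can be pictured with a toy model. This is a sketch under stated
assumptions and not the vgic code: the AP list is modeled as a plain
array, last_lr_idx is a hypothetical index of the last interrupt that
was placed in an LR, and deactivate_after() is an invented helper.

```c
#include <assert.h>

struct irq_sketch {
	int id;
	int active;
};

/*
 * Deactivate up to eoicount active interrupts, starting after the last
 * entry that made it into an LR. Entries at or before last_lr_idx are
 * left untouched. Returns the number of interrupts deactivated.
 */
static int deactivate_after(struct irq_sketch *list, int n,
			    int last_lr_idx, int eoicount)
{
	int done = 0;

	for (int i = last_lr_idx + 1; i < n && done < eoicount; i++) {
		if (list[i].active) {
			list[i].active = 0;
			done++;
		}
	}
	return done;
}
```

Starting the walk at index 0 instead would deactivate entries that were
in the LRs, which is the failure described above.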
Huge thanks to Valentine for doing all the detective work and
providing an initial patch.
Fixes: 3cfd59f81e ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0")
Fixes: 281c6c06e2 ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0")
Reported-by: Valentine Burley <valentine.burley@collabora.com>
Tested-by: Valentine Burley <valentine.burley@collabora.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com
Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org
Cc: stable@vger.kernel.org
The serialnr sysfs attribute for CCA cards when queried always
used the default domain for sending the request down to the card.
If for any reason exactly this default domain is disabled then
the attribute code fails to retrieve the CCA info and the sysfs
entry shows an empty string. Works as designed but the serial
number is a card attribute and thus it does not matter which
domain is used for the query. So if there are other domains on
this card available, these could be used.
So extend the code to use AUTOSEL_DOM for the domain value to
address any online domain within the card for querying the cca
info and thus show the serialnr as long as there is one domain
usable regardless of the default domain setting.
Fixes: 8f291ebf32 ("s390/zcrypt: enable card/domain autoselect on ep11 cprbs")
Suggested-by: Ingo Franzki <ifranzki@linux.ibm.com>
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Reviewed-by: Ingo Franzki <ifranzki@linux.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This reverts commit d8b5cf9c63.
Mikhail Zaslonko reported that linux-next doesn't boot anymore [2]. The
reason for this is that a recent change [1] was supposed to slightly
optimize the irq entry/exit path by removing some psw bits early in case
of an idle exit.
This however is incorrect since irqentry_exit() requires the correct old
psw state at irq entry. Otherwise the embedded regs_irqs_disabled() will
not provide the correct result.
With linux-next and HRTIMER_REARM_DEFERRED this leads to the observed boot
problems, however the commit is broken in any case.
Revert the commit which introduced this.
Thanks to Peter Zijlstra for pointing out that this is a bug in the s390
entry code.
Fixes: d8b5cf9c63 ("s390/irq/idle: Remove psw bits early") [1]
Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Closes: https://lore.kernel.org/r/af549a19-db99-4b16-8511-bf315177a13e@linux.ibm.com/ [2]
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20260306111919.362559-1-hca@linux.ibm.com
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Since commit b5daf93b80 ("firmware: arm_scmi: Avoid notifier
registration for unsupported events") the call chains leading to the helper
__scmi_event_handler_get_ops expect an ERR_PTR to be returned on failure to
get a handler for the requested event key, while the current helper can
still return a NULL when no handler could be found or created.
Fix by forcing an ERR_PTR return value when the handler reference is NULL.
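The ERR_PTR convention the callers rely on can be sketched in userspace C.
This is only an illustration, not the arm_scmi code: ERR_PTR(), IS_ERR() and
PTR_ERR() are reimplemented here as stand-ins for the kernel helpers, and
struct handler, lookup_handler(), get_handler() and the errno value are
hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct handler { int key; };	/* hypothetical handler object */

/* Hypothetical lookup that, like the old helper, may return NULL. */
static struct handler *lookup_handler(struct handler *table, size_t n, int key)
{
	for (size_t i = 0; i < n; i++)
		if (table[i].key == key)
			return &table[i];
	return NULL;
}

/* Callers only test IS_ERR(), so NULL must never escape from here. */
static struct handler *get_handler(struct handler *table, size_t n, int key)
{
	struct handler *hndl = lookup_handler(table, n, key);

	if (!hndl)
		return ERR_PTR(-ENODEV);	/* errno chosen arbitrarily */
	return hndl;
}
```

A caller that checks only IS_ERR() would treat a NULL return as a valid
handler, which is why the helper has to convert NULL into an error pointer.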
Fixes: b5daf93b80 ("firmware: arm_scmi: Avoid notifier registration for unsupported events")
Signed-off-by: Cristian Marussi <cristian.marussi@arm.com>
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Message-Id: <20260305131011.541444-1-cristian.marussi@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
A device_node reference obtained from the device tree is not released
on all error paths in the arm_scpi probe path. Specifically, a node
returned by of_parse_phandle() could be leaked when the probe failed
after the node was acquired. The probe function returns early and
the shmem reference is not released.
Use __free(device_node) scope-based cleanup to automatically release
the reference when the variable goes out of scope.
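The scope-based cleanup pattern can be sketched with the compiler's
cleanup attribute, which the kernel's __free() builds on. This is a
userspace illustration under assumptions: struct node, node_get(),
node_put(), the auto_put_node macro and probe() are hypothetical stand-ins
and do not mirror the arm_scpi probe code.

```c
/* Hypothetical stand-in for a refcounted struct device_node. */
struct node { int refcount; };

static struct node *node_get(struct node *np)
{
	if (np)
		np->refcount++;
	return np;
}

static void node_put(struct node *np)
{
	if (np)
		np->refcount--;
}

/* A cleanup handler receives a pointer to the annotated variable. */
static void node_put_cleanup(struct node **npp)
{
	node_put(*npp);
}

#define auto_put_node __attribute__((cleanup(node_put_cleanup)))

/* Every return path drops the reference, including the early ones. */
static int probe(struct node *shmem_src, int fail_early)
{
	struct node *shmem auto_put_node = node_get(shmem_src);

	if (!shmem)
		return -1;
	if (fail_early)
		return -2;	/* reference released automatically here */
	return 0;
}
```

The cleanup attribute is a GCC/Clang extension, so this sketch assumes one
of those compilers.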
Fixes: ed7ecb8839 ("firmware: arm_scpi: Add compatibility checks for shmem node")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Message-Id: <20260121-arm_scpi_2-v2-1-702d7fa84acb@gmail.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
In qla24xx_els_dcmd_iocb() sp->free is set to qla2x00_els_dcmd_sp_free().
When an error happens, this function is called by qla2x00_sp_release(),
when kref_put() releases the first and the last reference.
qla2x00_els_dcmd_sp_free() frees fcport by calling qla2x00_free_fcport().
Calling qla2x00_free_fcport() again after kref_put() frees fcport twice.
Fixes: 82f522ae0d ("scsi: qla2xxx: Fix double free of fcport")
Fixes: 4895009c4b ("scsi: qla2xxx: Prevent command send on chip reset")
Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com>
Signed-off-by: Farhat Abbas <fabbas@cloudlinux.com>
Link: https://patch.msgid.link/aYsDln9NFQQsPDgg@vova-pc
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
In __ufshcd_wl_suspend(), cancel_delayed_work_sync() is called to cancel
the UFS RTC work, but it is placed after ufshcd_vops_suspend(hba, pm_op,
POST_CHANGE). This creates a race condition where ufshcd_rtc_work() can
still be running while ufshcd_vops_suspend() is executing. When
UFSHCD_CAP_CLK_GATING is not supported, the condition
!hba->clk_gating.active_reqs is always true, causing ufshcd_update_rtc()
to be executed. Since ufshcd_vops_suspend() typically performs clock
gating operations, executing ufshcd_update_rtc() at that moment triggers
an SError. The kernel panic trace is as follows:
Kernel panic - not syncing: Asynchronous SError Interrupt
Call trace:
dump_backtrace+0xec/0x128
show_stack+0x18/0x28
dump_stack_lvl+0x40/0xa0
dump_stack+0x18/0x24
panic+0x148/0x374
nmi_panic+0x3c/0x8c
arm64_serror_panic+0x64/0x8c
do_serror+0xc4/0xc8
el1h_64_error_handler+0x34/0x4c
el1h_64_error+0x68/0x6c
el1_interrupt+0x20/0x58
el1h_64_irq_handler+0x18/0x24
el1h_64_irq+0x68/0x6c
ktime_get+0xc4/0x12c
ufshcd_mcq_sq_stop+0x4c/0xec
ufshcd_mcq_sq_cleanup+0x64/0x1dc
ufshcd_clear_cmd+0x38/0x134
ufshcd_issue_dev_cmd+0x298/0x4d0
ufshcd_exec_dev_cmd+0x1a4/0x1c4
ufshcd_query_attr+0xbc/0x19c
ufshcd_rtc_work+0x10c/0x1c8
process_scheduled_works+0x1c4/0x45c
worker_thread+0x32c/0x3e8
kthread+0x120/0x1d8
ret_from_fork+0x10/0x20
Fix this by moving cancel_delayed_work_sync() before the call to
ufshcd_vops_suspend(hba, pm_op, PRE_CHANGE), ensuring the UFS RTC work is
fully completed or cancelled at that point.
Cc: Bean Huo <beanhuo@iokpp.de>
Fixes: 6bf999e0eb ("scsi: ufs: core: Add UFS RTC support")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Wang Shuaiwei <wangshuaiwei1@xiaomi.com>
Link: https://patch.msgid.link/20260307035128.3419687-1-wangshuaiwei1@xiaomi.com
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
enqueue_task_scx() takes int enq_flags from the sched_class interface.
SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are
silently truncated when passed through activate_task(). extra_enq_flags
was added as a workaround - storing high bits in rq->scx.extra_enq_flags
and OR-ing them back in enqueue_task_scx(). However, the OR target is
still the int parameter, so the high bits are lost anyway.
The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT
which is informational to the BPF scheduler - its loss means the scheduler
doesn't know about preemption but doesn't cause incorrect behavior.
Fix by renaming the int parameter to core_enq_flags and introducing a
u64 enq_flags local that merges both sources. All downstream functions
already take u64 enq_flags.
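The truncation can be reproduced in a small userspace sketch. The flag
names and the two functions below are hypothetical; only the position of
the high flag (bit 32) follows the description above. Narrowing an
out-of-range 64-bit value to int is implementation-defined in C; GCC and
Clang drop the high bits.

```c
#include <stdint.h>

#define ENQ_WAKEUP	(1ULL << 0)	/* hypothetical low flag */
#define ENQ_PREEMPT	(1ULL << 32)	/* sits at bit 32, like SCX_ENQ_PREEMPT */

/* Old pattern: the OR target is still the int parameter. */
static uint64_t enqueue_buggy(int enq_flags, uint64_t extra_enq_flags)
{
	enq_flags |= extra_enq_flags;	/* result narrowed back to int */
	return (uint64_t)(uint32_t)enq_flags;
}

/* Fixed pattern: merge both sources into a 64-bit local. */
static uint64_t enqueue_fixed(int core_enq_flags, uint64_t extra_enq_flags)
{
	uint64_t enq_flags = (uint64_t)core_enq_flags | extra_enq_flags;

	return enq_flags;
}
```

With ENQ_WAKEUP as the core flags and ENQ_PREEMPT as the extra flags, the
first variant returns a value without ENQ_PREEMPT, while the second keeps
both bits.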
Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When kmeter.c was refactored into km83xx.c in 2011, the "keymile" vendor
prefix was changed to upper-case "Keymile". The devicetree at
arch/powerpc/boot/dts/kmeter1.dts never underwent the same change,
suggesting that this was simply a mistake.
Fixes: 93e2b95c81 ("powerpc/83xx: rename and update kmeter1")
Signed-off-by: J. Neuschäfer <j.ne@posteo.net>
Reviewed-by: Heiko Schocher <hs@nabladev.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260303-keymile-v1-1-463a11e71702@posteo.net
Commit 61688a82e0 ("powerpc/bpf: enable kfunc call") inadvertently
enabled kfunc call support for 32-bit powerpc but that support will
not be possible until ABI mismatch between 32-bit powerpc and eBPF is
handled in 32-bit powerpc JIT code. Till then, advertise support only
for 64-bit powerpc. Also, in the powerpc ABI, the caller needs to extend
the arguments properly based on signedness. The JIT code is responsible
for handling this explicitly for kfunc calls, as the verifier can't
handle the ABI needs of each architecture. But this was not taken
care of when kfunc call support was enabled for powerpc. Fix it by
handling this with bpf_jit_find_kfunc_model() and using zero_extend()
& sign_extend() helper functions.
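What "extending an argument based on signedness" means for a 64-bit
register can be sketched in plain C. The helper names and signatures below
are hypothetical and are not the JIT's zero_extend()/sign_extend() helpers,
which emit instructions rather than compute values.

```c
#include <stdint.h>

/* Clear everything above the low size_bytes bytes of a 64-bit value. */
static uint64_t zero_extend_arg(uint64_t reg, unsigned int size_bytes)
{
	if (size_bytes >= 8)
		return reg;
	return reg & ((1ULL << (8 * size_bytes)) - 1);
}

/* Replicate the sign bit of the low size_bytes bytes into the upper bits. */
static uint64_t sign_extend_arg(uint64_t reg, unsigned int size_bytes)
{
	uint64_t sign_bit;

	if (size_bytes >= 8)
		return reg;
	sign_bit = 1ULL << (8 * size_bytes - 1);
	reg = zero_extend_arg(reg, size_bytes);
	return (reg ^ sign_bit) - sign_bit;	/* well-defined unsigned math */
}
```

A 32-bit signed argument holding -2 arrives as 0xfffffffe in the low word;
after sign extension the callee sees 0xfffffffffffffffe regardless of what
the upper half of the register held before.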
Fixes: 61688a82e0 ("powerpc/bpf: enable kfunc call")
Cc: stable@vger.kernel.org
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260303181031.390073-7-hbathini@linux.ibm.com
Since bpf2bpf tailcall support is enabled for 64-bit powerpc with
kernel commit 2ed2d8f6fb ("powerpc64/bpf: Support tailcalls with
subprogs"), 'tailcalls/tailcall_bpf2bpf_hierarchy_fexit' BPF selftest
is triggering "corrupted stack end detected inside scheduler" with the
config option CONFIG_SCHED_STACK_END_CHECK enabled. A review of the
stack layout for the BPF trampoline showed that the dummy frame is
trying to protect the redzone of the BPF program. This is because the
tail call info and NVRs save area are in the redzone at the time of
the tailcall, as the current BPF program stack frame is torn down
before the tailcall. But saving this redzone in the dummy frame of the
trampoline is unnecessary for the following reasons:
1) Firstly, trampoline can be attached to BPF entry/main program
or subprog. But prologue part of the BPF entry/main program,
where the trampoline attachpoint is, is skipped during tailcall.
So, the need to protect the redzone does not arise, as the
trampoline is not even triggered in this scenario.
2) In case of subprog, the caller's stackframe is already set up
and the subprog's stackframe is yet to be set up. So, there is
nothing in the redzone to be protected.
Also, using a dummy frame in the BPF trampoline wastes critically scarce
kernel stack space, especially in tailcall sequence, for marginal
benefit in stack unwinding. So, drop setting up the dummy frame.
Instead, save return address in bpf trampoline frame and use it as
appropriate. Pruning this unnecessary stack usage mitigates the
likelihood of stack overflow in scenarios where bpf2bpf tailcalls
and fexit programs are mixed.
Reported-by: Saket Kumar Bhaskar <skb99@linux.ibm.com>
Fixes: 2ed2d8f6fb ("powerpc64/bpf: Support tailcalls with subprogs")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260303181031.390073-5-hbathini@linux.ibm.com
Ideally, the offset used to load the tail call info field and to find
the pass by reference address for tail call field should be the same.
But while setting up the tail call info in the trampoline, this was
not followed. This can be misleading and can lead to unpredictable
results if and when bpf_has_stack_frame() ends up returning true
for trampoline frame. Since commit 15513beeb6 ("powerpc64/bpf:
Moving tail_call_cnt to bottom of frame") and commit 2ed2d8f6fb
("powerpc64/bpf: Support tailcalls with subprogs") ensured tail call
field is at the bottom of the stack frame for BPF programs as well as
BPF trampoline, avoid relying on bpf_jit_stack_tailcallinfo_offset()
and bpf_has_stack_frame() for trampoline frame and always calculate
tail call field offset with reference to older frame.
Fixes: 2ed2d8f6fb ("powerpc64/bpf: Support tailcalls with subprogs")
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260303181031.390073-4-hbathini@linux.ibm.com
Support for -fpatchable-function-entry on ppc64le was added in Clang
with [1]. However, when no prefix NOPs are specified - as is the case
with CONFIG_PPC_FTRACE_OUT_OF_LINE - the first NOP is emitted at LEP,
but Clang records the Global Entry Point (GEP) unlike GCC which does
record the Local Entry Point (LEP). Issue [2] has been raised to align
Clang's behavior with GCC. As a temporary workaround to ensure ftrace
initialization works as expected with Clang, derive the LEP using
ppc_function_entry() for kernel symbols and by looking for the below
module GEP sequence for module addresses, until [2] is resolved:
ld r2, -8(r12)
add r2, r2, r12
[1] https://github.com/llvm/llvm-project/pull/151569
[2] https://github.com/llvm/llvm-project/issues/163706
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260127084926.34497-4-hbathini@linux.ibm.com
The total number of out-of-line (OOL) stubs required for function
tracing is determined using the following command:
$(OBJDUMP) -r -j __patchable_function_entries vmlinux.o
While this works correctly with GNU objdump, llvm-objdump does not
list the expected relocation records for this section. Fix this by
using the -d option and counting R_PPC64_ADDR64 relocation entries.
This works as desired with both objdump and llvm-objdump.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260127084926.34497-3-hbathini@linux.ibm.com
ARCH_USING_PATCHABLE_FUNCTION_ENTRY depends on toolchain support for
-fpatchable-function-entry option. The current script that checks
for this support only handles GCC. Rename the script and extend it
to detect support for -fpatchable-function-entry with Clang as well,
allowing clean cross-compilation with Clang toolchains.
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260127084926.34497-2-hbathini@linux.ibm.com
Even if we get a dma_mapping_error() while mapping an RX buffer, we
should still update rx_ring->head to ensure that the buffers we were
able to allocate and map are used. Fix this by breaking out to the
existing code after the loop, analogous to the existing handling for skb
allocation failure.
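The control flow can be sketched as a toy refill loop. The ring layout,
RING_SIZE, and the fail_at parameter that simulates a mapping error are all
hypothetical; the point is only that a failure breaks out to the shared
tail of the function instead of returning early.

```c
#include <stdbool.h>

#define RING_SIZE 8

struct rx_ring {
	unsigned int head;
	bool mapped[RING_SIZE];
};

/*
 * Refill up to count slots. The mapping of the fail_at-th buffer is
 * simulated to fail. Returns the number of buffers actually added.
 */
static unsigned int refill(struct rx_ring *ring, unsigned int count,
			   unsigned int fail_at)
{
	unsigned int i = ring->head;
	unsigned int done = 0;

	while (done < count) {
		if (done == fail_at)
			break;	/* mapping error: stop, but do not return */
		ring->mapped[i] = true;
		i = (i + 1) % RING_SIZE;
		done++;
	}

	/* Reached on success and on failure, so mapped buffers get used. */
	ring->head = i;
	return done;
}
```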
Fixes: bfec6d7f20 ("net: spacemit: Add K1 Ethernet MAC")
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Link: https://patch.msgid.link/20260305-k1-ethernet-more-fixes-v2-1-e4e434d65055@iscas.ac.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean says:
====================
Further SJA1105 phylink link replay fixups
While I was playing around with the subsystem knowledge in Chris Mason's
review-prompts to see what LLMs would have needed to catch the bug
behind commit bfd264fbbb ("net: dsa: sja1105: protect link replay
helpers against NULL phylink instance"), it flagged another issue
instead, which IMO is valid. This is being fixed in patch 2/2.
Patch 1/2 is preparatory reordering for that.
I haven't noticed any physical issues, it only has to do with the
soundness of the new call path introduced in January in commit
0b2edc531e ("net: dsa: sja1105: let phylink help with the replay of
link callbacks").
====================
Link: https://patch.msgid.link/20260304220900.3865120-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most errors that can occur in sja1105_static_config_reload() are fatal
(example: fail to communicate with hardware), but not all are.
For example, sja1105_static_config_upload() -> kcalloc() may fail, and
if that happens, we have called phylink_replay_link_begin() but never
phylink_replay_link_end().
Under that circumstance, all port phylink instances are left in a state
where the resolver is stopped with the PHYLINK_DISABLE_REPLAY bit set.
We have effectively disabled link management with no way to recover from
this condition.
Avoid that situation by ensuring phylink_replay_link_begin() is always
paired with phylink_replay_link_end(), regardless of whether we faced
any errors during switch reset, configuration reload and general state
reload.
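The pairing can be sketched with the usual goto-based unwinding. The
names below (struct link_state, replay_begin(), replay_end(),
config_reload()) are hypothetical stand-ins, and upload_rc simulates the
result of an intermediate step such as the configuration upload.

```c
/* Hypothetical stand-in for the per-port replay state. */
struct link_state { int replay_disabled; };

static void replay_begin(struct link_state *ls) { ls->replay_disabled = 1; }
static void replay_end(struct link_state *ls) { ls->replay_disabled = 0; }

static int config_reload(struct link_state *ls, int upload_rc)
{
	int rc;

	replay_begin(ls);

	rc = upload_rc;		/* a step that may fail, e.g. on allocation */
	if (rc < 0)
		goto out;

	/* ... further reload steps would go here ... */

out:
	replay_end(ls);		/* runs on success and on every error path */
	return rc;
}
```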
Fixes: 0b2edc531e ("net: dsa: sja1105: let phylink help with the replay of link callbacks")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260304220900.3865120-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move phylink_replay_link_end() as the last locked operation under
sja1105_static_config_reload(). The purpose is to be able to goto
this step from the error path of intermediate steps (we must call
phylink_replay_link_end()).
sja1105_reload_cbs() notably does not depend on port states or link
speeds. See commit 954ad9bf13 ("net: dsa: sja1105: fix bandwidth
discrepancy between tc-cbs software and offload") which has discussed
this issue specifically.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260304220900.3865120-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
teql_master_xmit() calls netdev_start_xmit(skb, slave) to transmit
through slave devices, but does not update skb->dev to the slave device
beforehand.
When a gretap tunnel is a TEQL slave, the transmit path reaches
iptunnel_xmit() which saves dev = skb->dev (still pointing to teql0
master) and later calls iptunnel_xmit_stats(dev, pkt_len). This
function does:
get_cpu_ptr(dev->tstats)
Since teql_master_setup() does not set dev->pcpu_stat_type to
NETDEV_PCPU_STAT_TSTATS, the core network stack never allocates tstats
for teql0, so dev->tstats is NULL. get_cpu_ptr(NULL) computes
NULL + __per_cpu_offset[cpu], resulting in a page fault.
BUG: unable to handle page fault for address: ffff8880e6659018
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 68bc067 P4D 68bc067 PUD 0
Oops: Oops: 0002 [#1] SMP KASAN PTI
RIP: 0010:iptunnel_xmit (./include/net/ip_tunnels.h:664 net/ipv4/ip_tunnel_core.c:89)
Call Trace:
<TASK>
ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
__gre_xmit (net/ipv4/ip_gre.c:478)
gre_tap_xmit (net/ipv4/ip_gre.c:779)
teql_master_xmit (net/sched/sch_teql.c:319)
dev_hard_start_xmit (net/core/dev.c:3887)
sch_direct_xmit (net/sched/sch_generic.c:347)
__dev_queue_xmit (net/core/dev.c:4802)
neigh_direct_output (net/core/neighbour.c:1660)
ip_finish_output2 (net/ipv4/ip_output.c:237)
__ip_finish_output.part.0 (net/ipv4/ip_output.c:315)
ip_mc_output (net/ipv4/ip_output.c:369)
ip_send_skb (net/ipv4/ip_output.c:1508)
udp_send_skb (net/ipv4/udp.c:1195)
udp_sendmsg (net/ipv4/udp.c:1485)
inet_sendmsg (net/ipv4/af_inet.c:859)
__sys_sendto (net/socket.c:2206)
Fix this by setting skb->dev = slave before calling
netdev_start_xmit(), so that tunnel xmit functions see the correct
slave device with properly allocated tstats.
Fixes: 039f50629b ("ip_tunnel: Move stats update to iptunnel_xmit()")
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Link: https://patch.msgid.link/20260304044216.3517851-3-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Early return paths in NCSI RX and AEN handlers fail to release
the received skb, resulting in a memory leak.
Specifically, ncsi_aen_handler() returns on invalid AEN packets
without consuming the skb. Similarly, ncsi_rcv_rsp() exits early
when failing to resolve the NCSI device, response handler, or
request, leaving the skb unfreed.
CC: stable@vger.kernel.org
Fixes: 7a82ecf4cf ("net/ncsi: NCSI AEN packet handler")
Fixes: 138635cc27 ("net/ncsi: NCSI response packet handler")
Signed-off-by: Jian Zhang <zhangjian.3032@bytedance.com>
Link: https://patch.msgid.link/20260305060656.3357250-1-zhangjian.3032@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XDP multi-buf programs can modify the layout of the XDP buffer when the
program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The
referenced commit in the fixes tag corrected the assumption in the mlx5
driver that the XDP buffer layout doesn't change during a program
execution. However, this fix introduced another issue: the dropped
fragments still need to be counted on the driver side to avoid page
fragment reference counting issues.
Such an issue can be observed with the
test_xdp_native_adjst_tail_shrnk_data selftest when using a payload of
3600 and shrinking by 256 bytes (an upcoming selftest patch): the last
fragment gets released by the XDP code but doesn't get tracked by the
driver. This results in a negative pp_ref_count during page release and
the following splat:
WARNING: include/net/page_pool/helpers.h:297 at mlx5e_page_release_fragmented.isra.0+0x4a/0x50 [mlx5_core], CPU#12: ip/3137
Modules linked in: [...]
CPU: 12 UID: 0 PID: 3137 Comm: ip Not tainted 6.19.0-rc3+ #12 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5e_page_release_fragmented.isra.0+0x4a/0x50 [mlx5_core]
[...]
Call Trace:
<TASK>
mlx5e_dealloc_rx_wqe+0xcb/0x1a0 [mlx5_core]
mlx5e_free_rx_descs+0x7f/0x110 [mlx5_core]
mlx5e_close_rq+0x50/0x60 [mlx5_core]
mlx5e_close_queues+0x36/0x2c0 [mlx5_core]
mlx5e_close_channel+0x1c/0x50 [mlx5_core]
mlx5e_close_channels+0x45/0x80 [mlx5_core]
mlx5e_safe_switch_params+0x1a5/0x230 [mlx5_core]
mlx5e_change_mtu+0xf3/0x2f0 [mlx5_core]
netif_set_mtu_ext+0xf1/0x230
do_setlink.isra.0+0x219/0x1180
rtnl_newlink+0x79f/0xb60
rtnetlink_rcv_msg+0x213/0x3a0
netlink_rcv_skb+0x48/0xf0
netlink_unicast+0x24a/0x350
netlink_sendmsg+0x1ee/0x410
__sock_sendmsg+0x38/0x60
____sys_sendmsg+0x232/0x280
___sys_sendmsg+0x78/0xb0
__sys_sendmsg+0x5f/0xb0
[...]
do_syscall_64+0x57/0xc50
This patch fixes the issue by doing page frag counting on all the
original XDP buffer fragments for all relevant XDP actions (XDP_TX,
XDP_REDIRECT and XDP_PASS). This is basically reverting to the original
counting before the commit in the fixes tag.
As frag_page is still pointing to the original tail, the nr_frags
parameter to xdp_update_skb_frags_info() needs to be calculated
in a different way to reflect the new nr_frags.
Fixes: afd5ba577c ("net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for legacy RQ")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Link: https://patch.msgid.link/20260305142634.1813208-6-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
XDP multi-buf programs can modify the layout of the XDP buffer when the
program calls bpf_xdp_pull_data() or bpf_xdp_adjust_tail(). The
referenced commit in the fixes tag corrected the assumption in the mlx5
driver that the XDP buffer layout doesn't change during a program
execution. However, this fix introduced another issue: the dropped
fragments still need to be counted on the driver side to avoid page
fragment reference counting issues.
The issue was discovered by the drivers/net/xdp.py selftest,
more specifically the test_xdp_native_tx_mb:
- The mlx5 driver allocates a page_pool page and initializes it with
a frag counter of 64 (pp_ref_count=64) and the internal frag counter
to 0.
- The test sends one packet with no payload.
- On RX (mlx5e_skb_from_cqe_mpwrq_nonlinear()), mlx5 configures the XDP
buffer with the packet data starting in the first fragment which is the
page mentioned above.
- The XDP program runs and calls bpf_xdp_pull_data() which moves the
header into the linear part of the XDP buffer. As the packet doesn't
contain more data, the program drops the tail fragment since it no
longer contains any payload (pp_ref_count=63).
- mlx5 device skips counting this fragment. Internal frag counter
remains 0.
- mlx5 releases all 64 fragments of the page but page pp_ref_count is
63 => negative reference counting error.
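The arithmetic in the steps above can be replayed in a few lines of C.
This is a simplified model, not driver code: struct frag_page, FRAG_BIAS
and the function are hypothetical, and the release step assumes the driver
returns the references it believes it never handed out (the bias minus its
internal frag counter).

```c
#define FRAG_BIAS 64	/* initial pp_ref_count from the steps above */

struct frag_page {
	long pp_ref_count;	/* page_pool side reference count */
	long frags;		/* driver's internal frag counter */
};

/* Returns the page's pp_ref_count after the driver releases the page. */
static long release_after_xdp_drop(int count_dropped_frag)
{
	struct frag_page p = { .pp_ref_count = FRAG_BIAS, .frags = 0 };

	/* The XDP code releases the tail fragment the program dropped. */
	p.pp_ref_count--;

	/* The fix: the driver still counts the fragment it handed out. */
	if (count_dropped_frag)
		p.frags++;

	/* On release, the driver drops the references it never handed out. */
	p.pp_ref_count -= FRAG_BIAS - p.frags;
	return p.pp_ref_count;
}
```

Without the counting the model ends at -1, matching the negative reference
count described above; with it, the count reaches 0.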
Resulting splat during the test:
WARNING: CPU: 0 PID: 188225 at ./include/net/page_pool/helpers.h:297 mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core]
Modules linked in: [...]
CPU: 0 UID: 0 PID: 188225 Comm: ip Not tainted 6.18.0-rc7_for_upstream_min_debug_2025_12_08_11_44 #1 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5e_page_release_fragmented.isra.0+0xbd/0xe0 [mlx5_core]
[...]
Call Trace:
<TASK>
mlx5e_free_rx_mpwqe+0x20a/0x250 [mlx5_core]
mlx5e_dealloc_rx_mpwqe+0x37/0xb0 [mlx5_core]
mlx5e_free_rx_descs+0x11a/0x170 [mlx5_core]
mlx5e_close_rq+0x78/0xa0 [mlx5_core]
mlx5e_close_queues+0x46/0x2a0 [mlx5_core]
mlx5e_close_channel+0x24/0x90 [mlx5_core]
mlx5e_close_channels+0x5d/0xf0 [mlx5_core]
mlx5e_safe_switch_params+0x2ec/0x380 [mlx5_core]
mlx5e_change_mtu+0x11d/0x490 [mlx5_core]
mlx5e_change_nic_mtu+0x19/0x30 [mlx5_core]
netif_set_mtu_ext+0xfc/0x240
do_setlink.isra.0+0x226/0x1100
rtnl_newlink+0x7a9/0xba0
rtnetlink_rcv_msg+0x220/0x3c0
netlink_rcv_skb+0x4b/0xf0
netlink_unicast+0x255/0x380
netlink_sendmsg+0x1f3/0x420
__sock_sendmsg+0x38/0x60
____sys_sendmsg+0x1e8/0x240
___sys_sendmsg+0x7c/0xb0
[...]
__sys_sendmsg+0x5f/0xb0
do_syscall_64+0x55/0xc70
The problem applies to XDP_PASS as well, which is handled in a different
code path in the driver.
This patch fixes the issue by doing page frag counting on all the
original XDP buffer fragments for all relevant XDP actions (XDP_TX,
XDP_REDIRECT and XDP_PASS). This is basically reverting to the original
counting before the commit in the fixes tag.
As frag_page is still pointing to the original tail, the nr_frags
parameter to xdp_update_skb_frags_info() needs to be calculated
in a different way to reflect the new nr_frags.
Fixes: 87bcef158a ("net/mlx5e: RX, Fix generating skb from non-linear xdp_buff for striding RQ")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Cc: Amery Hung <ameryhung@gmail.com>
Reviewed-by: Nimrod Oren <noren@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260305142634.1813208-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In case of a TX error CQE, a recovery flow is triggered,
mlx5e_reset_txqsq_cc_pc() resets dma_fifo_cc to 0 but not dma_fifo_pc,
desyncing the DMA FIFO producer and consumer.
After recovery, the producer pushes new DMA entries at the old
dma_fifo_pc, while the consumer reads from position 0.
This causes us to unmap stale DMA addresses from before the recovery.
The DMA FIFO is a purely software construct with no HW counterpart.
At the point of reset, all WQEs have been flushed so dma_fifo_cc is
already equal to dma_fifo_pc. There is no need to reset either counter,
similar to how skb_fifo pc/cc are untouched.
Remove the 'dma_fifo_cc = 0' reset.
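The desync can be modelled with a small software FIFO. Everything here is
a hypothetical sketch (FIFO size, struct layout, addresses); it only shows
why resetting the consumer counter alone makes the consumer read stale
slots.

```c
#include <stdint.h>

#define FIFO_SIZE 8			/* power of two */
#define FIFO_MASK (FIFO_SIZE - 1)

struct dma_fifo {
	uint64_t addr[FIFO_SIZE];
	uint32_t pc;	/* producer counter */
	uint32_t cc;	/* consumer counter */
};

static void fifo_push(struct dma_fifo *f, uint64_t a)
{
	f->addr[f->pc++ & FIFO_MASK] = a;
}

static uint64_t fifo_pop(struct dma_fifo *f)
{
	return f->addr[f->cc++ & FIFO_MASK];
}

/* Returns the address the consumer would unmap first after recovery. */
static uint64_t pop_after_recovery(int reset_cc)
{
	struct dma_fifo f = { .pc = 0, .cc = 0 };

	/* Three map/unmap cycles; afterwards pc == cc == 3 (all flushed). */
	for (uint64_t a = 0x1000; a <= 0x3000; a += 0x1000) {
		fifo_push(&f, a);
		(void)fifo_pop(&f);
	}

	if (reset_cc)
		f.cc = 0;	/* the reset that the fix removes */

	fifo_push(&f, 0x4000);	/* lands in slot 3, at the old pc */
	return fifo_pop(&f);
}
```

With the reset, the consumer returns the stale 0x1000 from slot 0; without
it, producer and consumer stay aligned and 0x4000 comes back.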
This fixes the following WARNING:
WARNING: CPU: 0 PID: 0 at drivers/iommu/dma-iommu.c:1240 iommu_dma_unmap_page+0x79/0x90
Modules linked in: mlx5_vdpa vringh vdpa bonding mlx5_ib mlx5_vfio_pci ipip mlx5_fwctl tunnel4 mlx5_core ib_ipoib geneve ip6_gre ip_gre gre nf_tables ip6_tunnel rdma_ucm ib_uverbs ib_umad vfio_pci vfio_pci_core act_mirred act_skbedit act_vlan vhost_net vhost tap ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress vhost_iotlb iptable_raw tunnel6 vfio_iommu_type1 vfio openvswitch nsh rpcsec_gss_krb5 auth_rpcgss oid_registry xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat xt_addrtype br_netfilter overlay zram zsmalloc rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core fuse [last unloaded: nf_tables]
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.13.0-rc5_for_upstream_min_debug_2024_12_30_21_33 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:iommu_dma_unmap_page+0x79/0x90
Code: 2b 4d 3b 21 72 26 4d 3b 61 08 73 20 49 89 d8 44 89 f9 5b 4c 89 f2 4c 89 e6 48 89 ef 5d 41 5c 41 5d 41 5e 41 5f e9 c7 ae 9e ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 66 2e 0f 1f 84 00 00 00 00
Call Trace:
<IRQ>
? __warn+0x7d/0x110
? iommu_dma_unmap_page+0x79/0x90
? report_bug+0x16d/0x180
? handle_bug+0x4f/0x90
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
? iommu_dma_unmap_page+0x79/0x90
? iommu_dma_unmap_page+0x2e/0x90
dma_unmap_page_attrs+0x10d/0x1b0
mlx5e_tx_wi_dma_unmap+0xbe/0x120 [mlx5_core]
mlx5e_poll_tx_cq+0x16d/0x690 [mlx5_core]
mlx5e_napi_poll+0x8b/0xac0 [mlx5_core]
__napi_poll+0x24/0x190
net_rx_action+0x32a/0x3b0
? mlx5_eq_comp_int+0x7e/0x270 [mlx5_core]
? notifier_call_chain+0x35/0xa0
handle_softirqs+0xc9/0x270
irq_exit_rcu+0x71/0xd0
common_interrupt+0x7f/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
Fixes: db75373c91 ("net/mlx5e: Recover Send Queue (SQ) from error state")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260305142634.1813208-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The check on mlx5_esw_host_functions_enabled(esw->dev) for adding VF
peer miss rules is incorrect. These rules match traffic from peer's VFs,
so the local device's host function status is irrelevant. Remove this
check to ensure peer VF traffic is properly handled regardless of local
host configuration.
Also fix the PF peer miss rule deletion to be symmetric with the add
path, so only attempt to delete the rule if it was actually created.
Fixes: 520369ef43 ("net/mlx5: Support disabling host PFs")
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260305142634.1813208-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
esw->work_queue executes esw_functions_changed_event_handler ->
esw_vfs_changed_event_handler and acquires the devlink lock.
.eswitch_mode_set (acquires devlink lock in devlink_nl_pre_doit) ->
mlx5_devlink_eswitch_mode_set -> mlx5_eswitch_disable_locked ->
mlx5_eswitch_event_handler_unregister -> flush_workqueue deadlocks
when esw_vfs_changed_event_handler executes.
Fix that by no longer flushing the work to avoid the deadlock, and using
a generation counter to keep track of work relevance. This avoids an old
handler manipulating an esw that has undergone one or more mode changes:
- the counter is incremented in mlx5_eswitch_event_handler_unregister.
- the counter is read and passed to the ephemeral mlx5_host_work struct.
- the work handler takes the devlink lock and bails out if the current
generation is different than the one it was scheduled to operate on.
- mlx5_eswitch_cleanup does the final draining before destroying the wq.
No longer flushing the workqueue has the side effect of maybe no longer
cancelling pending vport_change_handler work items, but that's ok since
those are disabled elsewhere:
- mlx5_eswitch_disable_locked disables the vport eq notifier.
- mlx5_esw_vport_disable disarms the HW EQ notification and marks
vport->enabled under state_lock to false to prevent pending vport
handler from doing anything.
- mlx5_eswitch_cleanup destroys the workqueue and makes sure all events
are disabled/finished.
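The generation check can be sketched as follows. The types and functions
are hypothetical simplifications, there is no locking in this model, and
vfs_updates merely stands in for the work the real handler performs; in the
kernel, the comparison happens with the devlink lock held.

```c
struct esw {
	unsigned int generation;	/* bumped on handler unregister */
	int vfs_updates;		/* stands in for the handler's work */
};

/* Ephemeral work item that remembers the generation it was queued in. */
struct host_work {
	struct esw *esw;
	unsigned int work_gen;
};

static struct host_work queue_work_item(struct esw *esw)
{
	struct host_work w = { .esw = esw, .work_gen = esw->generation };

	return w;
}

static void handler_unregister(struct esw *esw)
{
	esw->generation++;	/* invalidates pending work instead of flushing */
}

/* Returns 1 if the work ran, 0 if it was stale and bailed out. */
static int work_handler(struct host_work *w)
{
	if (w->work_gen != w->esw->generation)
		return 0;
	w->esw->vfs_updates++;
	return 1;
}
```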
Fixes: f1bc646c9a ("net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260305081019.1811100-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit c7159e960f ("usbnet: limit max_mtu based on device's hard_mtu")
capped net->max_mtu to the device's hard_mtu in usbnet_probe(). While
this correctly prevents oversized packets on standard USB network
devices, it breaks the qmi_wwan driver.
qmi_wwan relies on userspace (e.g. ModemManager) setting a large MTU on
the wwan0 interface to configure rx_urb_size via usbnet_change_mtu().
QMI modems negotiate USB transfer sizes of 16,383 or 32,767 bytes, and
the USB receive buffers must be sized accordingly. With max_mtu capped
to hard_mtu (~1500 bytes), userspace can no longer raise the MTU, the
receive buffers remain small, and download speeds drop from >300 Mbps
to ~0.8 Mbps.
Introduce a FLAG_NOMAXMTU driver flag that allows individual usbnet
drivers to opt out of the max_mtu cap. Set this flag in qmi_wwan's
driver_info structures to restore the previous behavior for QMI devices,
while keeping the safety fix in place for all other usbnet drivers.
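The opt-out can be sketched as a flag test around the cap. The bit value
given to FLAG_NOMAXMTU below and the helper function are hypothetical, and
the cap is simplified to the device's hard_mtu as described above.

```c
#define FLAG_NOMAXMTU 0x1u	/* hypothetical bit value for illustration */

/* Returns the max_mtu a probe routine would end up advertising. */
static unsigned int probe_max_mtu(unsigned int driver_flags,
				  unsigned int max_mtu,
				  unsigned int hard_mtu)
{
	/* Drivers that set the flag keep their larger max_mtu. */
	if (!(driver_flags & FLAG_NOMAXMTU) && max_mtu > hard_mtu)
		max_mtu = hard_mtu;
	return max_mtu;
}
```

A driver without the flag is still capped, while one that sets it keeps the
value userspace needs in order to raise the MTU.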
Fixes: c7159e960f ("usbnet: limit max_mtu based on device's hard_mtu")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/CAPh3n803k8JcBPV5qEzUB-oKzWkAs-D5CU7z=Vd_nLRCr5ZqQg@mail.gmail.com/
Reported-by: Koen Vandeputte <koen.vandeputte@citymesh.com>
Tested-by: Daniele Palmas <dnlplm@gmail.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Link: https://patch.msgid.link/20260304134338.1785002-1-lvivier@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu says:
====================
bond: fix 2 link state issues
This patch set fixes two bonding link state issues:
1. Broadcast mode incorrectly sets usable_slaves, causing updelay to be ignored
2. BOND_LINK_FAIL and BOND_LINK_BACK are treated as invalid states, generating
confusing error messages
Here is the reproducer:
```
ip netns add ns
ip -n ns link add bond0 type bond mode 3 miimon 100 updelay 200 downdelay 200
ip -n ns link add type veth
ip -n ns link add type veth
ip -n ns link set veth1 up
ip -n ns link set veth3 up
ip -n ns link set veth0 master bond0
ip -n ns link set veth2 master bond0
ip -n ns link set bond0 up
sleep 1
ip -n ns link set veth3 down
sleep 1
ip -n ns link set veth3 up
sleep 1
dmesg | tail
```
====================
Link: https://patch.msgid.link/20260304-b4-bond_updelay-v1-0-f72eb2e454d0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before the fixed commit, we checked slave->new_link during the commit
stage, whose values are only BOND_LINK_{NOCHANGE, UP, DOWN}. After
that commit, we started using slave->link_new_state, which can also
be BOND_LINK_{FAIL, BACK}.
For example, when we set updelay/downdelay, after a failover,
the slave->link_new_state could be set to BOND_LINK_{FAIL, BACK} in
bond_miimon_inspect(). And later in bond_miimon_commit(), it will treat
it as invalid and print an error, which would cause confusion for users.
[ 106.440254] bond0: (slave veth2): link status down for interface, disabling it in 200 ms
[ 106.440265] bond0: (slave veth2): invalid new link 1 on slave
[ 106.648276] bond0: (slave veth2): link status definitely down, disabling slave
[ 107.480271] bond0: (slave veth2): link status up, enabling it in 200 ms
[ 107.480288] bond0: (slave veth2): invalid new link 3 on slave
[ 107.688302] bond0: (slave veth2): link status definitely up, 10000 Mbps full duplex
Let's handle BOND_LINK_{FAIL, BACK} as valid link states.
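A minimal sketch of the idea, not the bonding code itself. The numeric
values follow the log lines quoted above (1 is FAIL, 3 is BACK). The type
names, the action enum, and the choice to treat the transitional states as
"nothing to commit yet" are assumptions made for illustration.

```c
/* Link states; the SKETCH_ names are illustrative only. */
enum sketch_link_state {
	SKETCH_LINK_NOCHANGE = -1,
	SKETCH_LINK_UP = 0,
	SKETCH_LINK_FAIL = 1,
	SKETCH_LINK_DOWN = 2,
	SKETCH_LINK_BACK = 3,
};

enum sketch_action {
	SKETCH_ACT_NONE,    /* nothing to commit */
	SKETCH_ACT_ENABLE,  /* slave becomes usable */
	SKETCH_ACT_DISABLE, /* slave is taken out of service */
	SKETCH_ACT_INVALID, /* would log "invalid new link" */
};

/* Decide what the commit phase does for a proposed new link state. */
enum sketch_action sketch_commit(int new_state)
{
	switch (new_state) {
	case SKETCH_LINK_NOCHANGE:
	case SKETCH_LINK_FAIL: /* transitional: downdelay still running */
	case SKETCH_LINK_BACK: /* transitional: updelay still running */
		return SKETCH_ACT_NONE;
	case SKETCH_LINK_UP:
		return SKETCH_ACT_ENABLE;
	case SKETCH_LINK_DOWN:
		return SKETCH_ACT_DISABLE;
	default:
		return SKETCH_ACT_INVALID;
	}
}
```

Only values outside the known set reach the error path, so the delay
timers no longer produce the confusing message.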
Fixes: 1899bb3251 ("bonding: fix state transition issue in link monitoring")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260304-b4-bond_updelay-v1-2-f72eb2e454d0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit e0caeb24f5 ("net: bonding: update the slave array for broadcast mode"),
broadcast mode will also set all_slaves and usable_slaves during
bond_enslave(). But if we also set updelay, during enslave, the
slave init state will be BOND_LINK_BACK. And later
bond_update_slave_arr() will alloc usable_slaves but add nothing.
This will cause bond_miimon_inspect() to have ignore_updelay
always true. So the updelay will be always ignored. e.g.
[ 6.498368] bond0: (slave veth2): link status definitely down, disabling slave
[ 7.536371] bond0: (slave veth2): link status up, enabling it in 0 ms
[ 7.536402] bond0: (slave veth2): link status definitely up, 10000 Mbps full duplex
To fix it, we could either call bond_update_slave_arr() at every
place where the link changes or, more simply, not set usable_slaves
for broadcast mode. Take the simpler option.
Fixes: e0caeb24f5 ("net: bonding: update the slave array for broadcast mode")
Reported-by: Liang Li <liali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20260304-b4-bond_updelay-v1-1-f72eb2e454d0@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Driver core holds a reference to the USB interface and its parent USB
device while the interface is bound to a driver and there is no need to
take additional references unless the structures are needed after
disconnect.
This driver takes a reference to the USB device during probe but does
not release it on probe failures.
Drop the redundant device reference to fix the leak, reduce cargo
culting, make it easier to spot drivers where an extra reference is
needed, and reduce the risk of further memory leaks.
Fixes: 0791c0327a ("net: mctp: Add MCTP USB transport driver")
Cc: stable@vger.kernel.org # 6.15
Signed-off-by: Johan Hovold <johan@kernel.org>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20260305104549.16110-1-johan@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Once a task exits it has its state set to TASK_DEAD and then it is
removed from the cgroup it belonged to. The last step happens once the
task gets out of its last schedule() invocation and is delayed on
PREEMPT_RT due to locking constraints.
As a result it is possible to receive a pid via waitpid() of a task
which is still listed in cgroup.procs for the cgroup it belonged
to. This is something that systemd does not expect and as a result it
waits for its exit until a time out occurs.
This can also be reproduced on !PREEMPT_RT kernel with a significant
delay in do_exit() after exit_notify().
Hide tasks which have PF_EXITING set from the output. The flag is set
before the parent is notified. Keeping zombies with live threads
shouldn't break anything (suggested by Tejun).
Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Cc: stable@vger.kernel.org # v6.19+
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
- Remove CONFIG_PAHOLE_HAS_BTF_TAG from required config list
- Document ext_idle.c as the built-in idle CPU selection policy
- Add descriptions for example schedulers in tools/sched_ext/
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
I found a few more paths in which cleanup fails due to a NULL version
pointer on unsupported hardware.
Add NULL checks as applicable.
Fixes: 39fc2bc4da ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f5a05f8414fc10f307eb965f303580c7778f8dd2)
Cc: stable@vger.kernel.org
Older versions of the MES firmware may cause abnormal GPU power consumption.
When performing inference tasks on the GPU (e.g., with Ollama using ROCm),
the GPU may show abnormal power consumption in idle state and incorrect GPU load information.
This issue has been fixed in firmware version 0x8b and newer.
Closes: https://github.com/ROCm/ROCm/issues/5706
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 4e22a5fe6ea6e0b057e7f246df4ac3ff8bfbc46a)
The following members of struct amdgpu_mode_info do not have valid
references in the related kernel-doc sections:
- plane_shaper_lut_property
- plane_shaper_lut_size_property
- plane_lut3d_size_property
Correct all affected comment blocks.
Fixes: f545d82479 ("drm/amd/display: add plane shaper LUT and TF driver-specific properties")
Fixes: 671994e3bf ("drm/amd/display: add plane 3D LUT driver-specific properties")
Reviewed-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit ec5708d6e547f7efe2f009073bfa98dbc4c5c2ac)
When GPU initialization fails due to an unsupported HW block,
IP blocks may have a NULL version pointer. During cleanup in
amdgpu_device_fini_hw, the code calls amdgpu_device_set_pg_state and
amdgpu_device_set_cg_state which iterate over all IP blocks and access
adev->ip_blocks[i].version without NULL checks, leading to a kernel
NULL pointer dereference.
Add NULL checks for adev->ip_blocks[i].version in both
amdgpu_device_set_cg_state and amdgpu_device_set_pg_state to prevent
dereferencing NULL pointers during GPU teardown when initialization has
failed.
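The shape of the guard can be sketched as follows. The structures and the
callback are hypothetical stand-ins, not the amdgpu definitions; only the
"skip blocks whose version pointer is NULL" step reflects the text above.

```c
#include <stddef.h>

#define SKETCH_MAX_IP_BLOCKS 4

/* Hypothetical per-block descriptor with an optional state callback. */
struct sketch_ip_version {
	int (*set_state)(int enable);
};

struct sketch_ip_block {
	/* NULL when the block was never set up (failed init) */
	const struct sketch_ip_version *version;
};

struct sketch_device {
	struct sketch_ip_block ip_blocks[SKETCH_MAX_IP_BLOCKS];
	int num_ip_blocks;
};

/* Walk all blocks, skipping uninitialised ones; returns blocks visited. */
int sketch_set_state(struct sketch_device *adev, int enable)
{
	int i, visited = 0;

	for (i = 0; i < adev->num_ip_blocks; i++) {
		if (!adev->ip_blocks[i].version)
			continue; /* nothing to dereference for this block */
		if (adev->ip_blocks[i].version->set_state)
			adev->ip_blocks[i].version->set_state(enable);
		visited++;
	}
	return visited;
}
```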
Fixes: 39fc2bc4da ("drm/amdgpu: Protect GPU register accesses in powergated state in some paths")
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit b7ac77468cda92eecae560b05f62f997a12fe2f2)
Cc: stable@vger.kernel.org
Besides deferring the call to housekeeping_update(), commit 6df415aa46
("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug
to workqueue") also defers the rebuild_sched_domains() call to
the workqueue. So a newly offlined CPU may still be in a sched domain,
or a newly onlined CPU may not show up in the sched domains, for a short
transition period. That could be a problem in some corner cases and
can be the cause of a reported test failure[1]. Fix it by calling
rebuild_sched_domains_cpuslocked() directly in hotplug as before. If
isolated partition invalidation or recreation is being done, the
housekeeping_update() call to update the housekeeping cpumasks will
still be deferred to a workqueue.
In commit 3bfe479671 ("cgroup/cpuset: Move
housekeeping_update()/rebuild_sched_domains() together"),
housekeeping_update() is called before rebuild_sched_domains() because
it needs to access the HK_TYPE_DOMAIN housekeeping cpumask. That is now
changed to use the static HK_TYPE_DOMAIN_BOOT cpumask as HK_TYPE_DOMAIN
cpumask is now changeable at run time. As a result, we can move the
rebuild_sched_domains() call before housekeeping_update() with
the slight advantage that it will be done in the same cpus_read_lock
critical section without the possibility of interference by a concurrent
cpu hot add/remove operation.
As it doesn't make sense to acquire cpuset_mutex/cpuset_top_mutex after
calling housekeeping_update() and immediately release them again, move
the cpuset_full_unlock() operation inside update_hk_sched_domains()
and rename it to cpuset_update_sd_hk_unlock() to signify that it will
release the full set of locks.
[1] https://lore.kernel.org/lkml/1a89aceb-48db-4edd-a730-b445e41221fe@nvidia.com
Fixes: 6df415aa46 ("cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Commit 0927780c90 ("sched_ext: Use READ_ONCE() for lock-free reads
of module param variables") annotated the plain reads of
scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but
missed a third site in scx_bypass():
WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC);
scx_slice_bypass_us is a module parameter writable via sysfs in
process context through set_slice_us() -> param_set_uint_minmax(),
which performs a plain store without holding bypass_lock. scx_bypass()
reads the variable under bypass_lock, but since the writer does not
take that lock, the two accesses are concurrent.
WRITE_ONCE() only applies volatile semantics to the store of
scx_slice_dfl -- the val expression containing scx_slice_bypass_us is
evaluated as a plain read, providing no protection against concurrent
writes.
Wrap the read with READ_ONCE() to complete the annotation started by
commit 0927780c90 and make the access KCSAN-clean, consistent with
the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu().
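A userspace sketch of the annotated statement. The two macros are
simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (GNU C
__typeof__ is assumed), the initial parameter value is illustrative, and
the function name is hypothetical.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define NSEC_PER_USEC 1000ULL

/* Illustrative value; in the kernel this is written via sysfs. */
unsigned int scx_slice_bypass_us = 5000;
uint64_t scx_slice_dfl;

void sketch_set_bypass_slice(void)
{
	/*
	 * WRITE_ONCE() marks only the store to scx_slice_dfl. The operand
	 * is evaluated first, so the shared parameter needs its own
	 * READ_ONCE() to be a marked access.
	 */
	WRITE_ONCE(scx_slice_dfl,
		   READ_ONCE(scx_slice_bypass_us) * NSEC_PER_USEC);
}
```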
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45 ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.
Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.
Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Mention the scheduling class precedence of fair and sched_ext to
clear up how sched_ext partial mode works.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
According to the FF-A specification (DEN0077, v1.1, §13.7), when
FFA_RXTX_UNMAP is invoked from any instance other than non-secure
physical, the w1 register must be zero (MBZ). If a non-zero value is
supplied in this context, the SPMC must return FFA_INVALID_PARAMETER.
The Arm FF-A driver operates exclusively as a guest or non-secure
physical instance where the partition ID is always zero and is not
invoked from a hypervisor context where w1 carries a VM ID. In this
execution model, the partition ID observed by the driver is always zero,
and passing a VM ID is unnecessary and potentially invalid.
Remove the vm_id parameter from ffa_rxtx_unmap() and ensure that the
SMC call is issued with w1 implicitly zeroed, as required by the
specification. This prevents invalid parameter errors and aligns the
implementation with the defined FF-A ABI behavior.
Fixes: 3bbfe98710 ("firmware: arm_ffa: Add initial Arm FFA driver support")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Message-Id: <20260304120953.847671-1-yeoreum.yun@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@kernel.org>
Previously, HAVE_XXHASH is tested by invoking HOSTCC without HOSTCFLAGS.
Consider the following scenario:
- The host machine has libxxhash installed
- We build the kernel with HOSTCFLAGS containing a --sysroot that does
not have xxhash.h (for hermetic builds)
In this case, HAVE_XXHASH is set to y, but when it builds objtool with
HOSTCFLAGS, because the --sysroot does not contain xxhash.h, the
following error is raised:
<...>/common/tools/objtool/include/objtool/checksum_types.h:12:10: fatal error: 'xxhash.h' file not found
12 | #include <xxhash.h>
| ^~~~~~~~~~
To resolve the error, we test HAVE_XXHASH by invoking HOSTCC with
HOSTCFLAGS.
Signed-off-by: HONG Yifan <elsk@google.com>
Reviewed-by: Carlos Llamas <cmllamas@google.com>
Link: https://patch.msgid.link/20260303010340.306164-1-elsk@google.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Patching a function which references a static key living in a kernel
module is unsupported due to ordering issues inherent to late module
patching:
1) Load a livepatch module which has a __jump_table entry which needs
a klp reloc to reference static key K which lives in module M.
2) The __jump_table klp reloc does *not* get resolved because module M
is not yet loaded.
3) jump_label_add_module() corrupts memory (or causes a panic) when
dereferencing the uninitialized pointer to key K.
validate_special_section_klp_reloc() intends to prevent that from ever
happening by catching it at build time. However, it incorrectly assumes
the special section entry's reloc symbol references have already been
converted from section symbols to object symbols, causing the validation
to miss corruption in extracted static branch/call table entries.
Make sure the references have been properly converted before doing the
validation.
Fixes: dd590d4d57 ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Reported-by: Song Liu <song@kernel.org>
Reviewed-and-tested-by: Song Liu <song@kernel.org>
Link: https://patch.msgid.link/124ad747b751df0df1725eff89de8332e3fb26d6.1770759954.git.jpoimboe@kernel.org
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
Commit e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to
cxl_pmem.ko") moves devm_cxl_add_nvdimm_bridge() into the cxl_pmem file,
which has independent config compile options for built-in or module. The
call from cxl_acpi_probe() is guarded by IS_ENABLED(CONFIG_CXL_PMEM),
which evaluates to true for both =y and =m.
When CONFIG_CXL_PMEM=m, a built-in cxl_acpi attempts to reference a
symbol exported by a module, which fails to link. CXL_PMEM cannot simply
be promoted to =y in this configuration because it depends on LIBNVDIMM,
which may itself be =m.
Add a Kconfig dependency to prevent CXL_ACPI from being built-in when
CXL_PMEM is a module. This constrains CXL_ACPI to =m when CXL_PMEM=m,
while still allowing CXL_ACPI to be freely configured when CXL_PMEM is
either built-in or disabled.
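The effect of such a dependency can be modelled with Kconfig's tristate
arithmetic (n = 0, m = 1, y = 2, "!" is 2 - x, "||" is max). The message
above does not quote the Kconfig line, so the expression
"CXL_PMEM || !CXL_PMEM" used here is an assumption based on a common
Kconfig idiom; the helper names are hypothetical.

```c
/* Kconfig tristate values. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Kconfig "!": 2 - value. */
enum tristate tri_not(enum tristate a)
{
	return (enum tristate)(TRI_Y - a);
}

/* Kconfig "||": maximum of both operands. */
enum tristate tri_or(enum tristate a, enum tristate b)
{
	return a > b ? a : b;
}

/* Upper bound for a symbol that "depends on CXL_PMEM || !CXL_PMEM". */
enum tristate sketch_cxl_acpi_limit(enum tristate cxl_pmem)
{
	return tri_or(cxl_pmem, tri_not(cxl_pmem));
}
```

The bound is m only when CXL_PMEM=m; for y and n it stays at y, which
matches the three cases described above.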
[ dj: Fix up commit reference formatting. ]
Fixes: e7e222ad73 ("cxl: Move devm_cxl_add_nvdimm_bridge() to cxl_pmem.ko")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20260305204057.1516948-1-kbusch@meta.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The rzt2h_gpio_get_direction() function is called from
gpiod_get_direction(), which ends up being used within the __setup_irq()
call stack when requesting an interrupt.
__setup_irq() holds a raw_spinlock_t with IRQs disabled, which creates
an atomic context. spinlock_t cannot be used within atomic context
when PREEMPT_RT is enabled, since it may become a sleeping lock.
An "[ BUG: Invalid wait context ]" splat is observed when running with
CONFIG_PROVE_LOCKING enabled, describing exactly the aforementioned call
stack.
__setup_irq() needs to hold a raw_spinlock_t with IRQs disabled to
serialize access against a concurrent hard interrupt.
Switch to raw_spinlock_t to fix this.
Fixes: 829dde3369 ("pinctrl: renesas: rzt2h: Add GPIO IRQ chip to handle interrupts")
Signed-off-by: Cosmin Tanislav <cosmin-gabriel.tanislav.xa@renesas.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260205103930.666051-1-cosmin-gabriel.tanislav.xa@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
When calling of_parse_phandle_with_fixed_args(), the caller is
responsible for calling of_node_put() to release the device node
reference.
In rzt2h_gpio_register(), the driver fails to call of_node_put() to
release the reference in of_args.np, which causes a memory leak.
Add the missing of_node_put() call to fix the leak.
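The ownership rule can be illustrated with a toy reference count. All
names below are hypothetical stand-ins for the OF helpers; the sketch only
shows that a lookup which returns a held reference needs a matching put.

```c
#include <stddef.h>

/* Toy node with a plain reference counter. */
struct sketch_node {
	int refcount;
};

/* Stand-in for a lookup that returns the node with a reference held. */
struct sketch_node *sketch_parse_phandle(struct sketch_node *target)
{
	target->refcount++;
	return target;
}

/* Stand-in for the release helper; tolerates NULL. */
void sketch_node_put(struct sketch_node *np)
{
	if (np)
		np->refcount--;
}

/* Uses the node briefly and drops the reference before returning. */
int sketch_gpio_register(struct sketch_node *target)
{
	struct sketch_node *np = sketch_parse_phandle(target);

	/* ... read whatever is needed from np ... */

	sketch_node_put(np); /* the release that was missing */
	return 0;
}
```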
Fixes: 34d4d09307 ("pinctrl: renesas: Add support for RZ/T2H")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260127-rzt2h-v1-1-86472e7421b8@gmail.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
The default settings for the Versa3 device on the Renesas RZ/G3S SMARC
SoM board have PLL2 disabled. PLL2 was later enabled together with audio
support, as it is required to support both 44.1 kHz and 48 kHz audio.
With PLL2 enabled, it was observed that Linux occasionally either hangs
during boot (the last log message being related to the I2C probe) or
randomly crashes. This was mainly reproducible on cold boots. During
debugging, it was also noticed that the Unicode replacement character (�)
sometimes appears on the serial console. Further investigation traced this
to the configuration applied through the Versa3 register at offset 0x1c,
which controls PLL enablement.
The appearance of the Unicode replacement character suggested an issue
with the SoC reference clock. The RZ/G3S reference clock is provided by
the Versa3 clock generator (REF output).
After checking with the Renesas Versa3 hardware team, it was found that
this is related to the PLL2 lock bit being set through the
renesas,settings DT property.
The PLL lock bit must be set to avoid unstable clock output from the PLL.
However, due to the Versa3 hardware design, when a PLL lock bit is set,
all outputs (including the REF clock) are temporarily disabled until the
configured PLLs become stable.
As an alternative, the bypass bit can be used. This does not interrupt the
PLL2 output or any other Versa3 outputs, but it may result in temporary
instability on PLL2 output while the configuration is applied. Since PLL2
feeds only the audio path and audio is not used during early boot, this is
acceptable and does not affect system boot.
Drop the PLL2 lock bit and set the bypass bit instead.
This has been tested with more than 1000 cold boots.
Fixes: a94253232b ("arm64: dts: renesas: rzg3s-smarc-som: Add versa3 clock generator node")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260302135703.162601-1-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
The HW user manual for the Renesas RZ/V2H(P) SoC (a.k.a. r9a09g057)
states that only WDT1 is supposed to be accessed by the CA55 cores.
WDT0 is supposed to be used by the CM33 core, WDT2 is supposed
to be used by the CR8 core 0, and WDT3 is supposed to be used
by the CR8 core 1.
Remove wdt{0,2,3} from the SoC specific device tree to make it
compliant with the specification from the HW manual.
This change is harmless as there are currently no users of the
wdt{0,2,3} device tree nodes, only the wdt1 node is actually used.
Fixes: 095105496e ("arm64: dts: renesas: r9a09g057: Add WDT0-WDT3 nodes")
Signed-off-by: Fabrizio Castro <fabrizio.castro.jz@renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260203124247.7320-3-fabrizio.castro.jz@renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Set an appropriate ramp delay for the SD0 I/O voltage regulator in the
CN15 SD overlay to make UHS-I voltage switching reliable during card
initialization.
This issue was observed on the RZ/V2H EVK, while the same UHS-I cards
worked on the RZ/V2N EVK without problems. Adding the ramp delay makes
the behavior consistent and avoids SD init timeouts.
Before this change SD0 could fail with:
mmc0: error -110 whilst initialising SD card
With the delay in place UHS-I cards enumerate correctly:
mmc0: new UHS-I speed SDR104 SDXC card at address aaaa
mmcblk0: mmc0:aaaa SR64G 59.5 GiB
mmcblk0: p1
Fixes: 3d6c2bc762 ("arm64: dts: renesas: Add CN15 eMMC and SD overlays for RZ/V2H and RZ/V2N EVKs")
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260123225957.1007089-5-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Add a ramp delay of 60 uV/us to the vqmmc_sdhi0 voltage regulator to
fix UHS-I SD card detection failures.
Measurements on CN78 pin 4 showed the actual voltage ramp time to be
21.86ms when switching between 3.3V and 1.8V. A 25ms ramp delay has
been configured to provide adequate margin. The calculation is based
on the voltage delta of 1.5V (3.3V - 1.8V):
1500000 uV / 60 uV/us = 25000 us (25ms)
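The same arithmetic as a small helper, for checking other voltage steps.
The function name is hypothetical, and rounding up is a choice made for
this sketch so the computed wait is never shorter than the slew time.

```c
#include <stdint.h>

/*
 * Time in microseconds for a regulator to slew by delta_uv microvolts
 * at ramp_uv_per_us microvolts per microsecond, rounded up.
 * Returns 0 when no ramp rate is given.
 */
uint32_t sketch_ramp_time_us(uint32_t delta_uv, uint32_t ramp_uv_per_us)
{
	if (ramp_uv_per_us == 0)
		return 0;
	return (delta_uv + ramp_uv_per_us - 1) / ramp_uv_per_us;
}
```

For the 3.3V to 1.8V switch, sketch_ramp_time_us(3300000 - 1800000, 60)
gives 25000 us, which is above the measured 21.86ms.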
Prior to this patch, UHS-I cards failed to initialize with:
mmc0: error -110 whilst initialising SD card
After this patch, UHS-I cards are properly detected on SD0:
mmc0: new UHS-I speed SDR104 SDXC card at address aaaa
mmcblk0: mmc0:aaaa SR64G 59.5 GiB
Fixes: d065453e5e ("arm64: dts: renesas: rzt2h-rzn2h-evk: Enable SD card slot")
Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20260123225957.1007089-2-prabhakar.mahadev-lad.rj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
When the nl80211 socket that originated a PMSR request is
closed, cfg80211_release_pmsr() sets the request's nl_portid
to zero and schedules pmsr_free_wk to process the abort
asynchronously. If the interface is concurrently torn down
before that work runs, cfg80211_pmsr_wdev_down() calls
cfg80211_pmsr_process_abort() directly. However, the already-
scheduled pmsr_free_wk work item remains pending and may run
after the interface has been removed from the driver. This
could cause the driver's abort_pmsr callback to operate on a
torn-down interface, leading to undefined behavior and
potential crashes.
Cancel pmsr_free_wk synchronously in cfg80211_pmsr_wdev_down()
before calling cfg80211_pmsr_process_abort(). This ensures any
pending or in-progress work is drained before interface teardown
proceeds, preventing the work from invoking the driver abort
callback after the interface is gone.
Fixes: 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator API")
Signed-off-by: Peddolla Harshavardhan Reddy <peddolla.reddy@oss.qualcomm.com>
Link: https://patch.msgid.link/20260305160712.1263829-3-peddolla.reddy@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
In some scenarios, a deadlock can happen, involving _do_shadow_pte().
Convert all usages of pgste_get_lock() to pgste_get_trylock() in
_do_shadow_pte() and return -EAGAIN if the lock cannot be taken. All
callers can already deal with -EAGAIN being returned.
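The pattern, sketched in userspace with a C11 atomic flag standing in for
the lock. The struct and function names are hypothetical and do not
mirror the s390 helpers; only "try once, back off with -EAGAIN" is taken
from the text above.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical entry guarded by a one-bit lock. */
struct sketch_pgste {
	atomic_flag locked;
	unsigned long val;
};

/* Returns true if the lock was free and is now held by the caller. */
bool sketch_trylock(struct sketch_pgste *p)
{
	return !atomic_flag_test_and_set_explicit(&p->locked,
						  memory_order_acquire);
}

void sketch_unlock(struct sketch_pgste *p)
{
	atomic_flag_clear_explicit(&p->locked, memory_order_release);
}

/* Update the entry, or report -EAGAIN instead of waiting for the lock. */
int sketch_do_shadow(struct sketch_pgste *p, unsigned long newval)
{
	if (!sketch_trylock(p))
		return -EAGAIN; /* caller retries later */

	p->val = newval;
	sketch_unlock(p);
	return 0;
}
```

Because the function never waits while another lock holder makes
progress, it cannot take part in a lock-ordering cycle.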
Fixes: e38c884df9 ("KVM: s390: Switch to new gmap")
Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
ublk_ctrl_set_size() unconditionally dereferences ub->ub_disk via
set_capacity_and_notify() without checking if it is NULL.
ub->ub_disk is NULL before UBLK_CMD_START_DEV completes (it is only
assigned in ublk_ctrl_start_dev()) and after UBLK_CMD_STOP_DEV runs
(ublk_detach_disk() sets it to NULL). Since the UBLK_CMD_UPDATE_SIZE
handler performs no state validation, a user can trigger a NULL pointer
dereference by sending UPDATE_SIZE to a device that has been added but
not yet started, or one that has been stopped.
Fix this by checking ub->ub_disk under ub->mutex before dereferencing
it, and returning -ENODEV if the disk is not available.
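A minimal userspace sketch of the check, using a pthread mutex in place of
ub->mutex. The structures and the function name are hypothetical; the
sketch only shows the test and the dereference happening under one lock.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical disk object holding only a capacity. */
struct sketch_disk {
	unsigned long long capacity;
};

struct sketch_ublk {
	pthread_mutex_t mutex;
	struct sketch_disk *disk; /* NULL before start and after stop */
};

/* Resize the disk, or fail with -ENODEV when no disk is attached. */
int sketch_set_size(struct sketch_ublk *ub, unsigned long long sectors)
{
	int ret = 0;

	pthread_mutex_lock(&ub->mutex);
	if (!ub->disk)
		ret = -ENODEV;
	else
		ub->disk->capacity = sectors;
	pthread_mutex_unlock(&ub->mutex);

	return ret;
}
```

Holding the lock across both steps matters: a stop that clears the
pointer cannot slip in between the check and the use.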
Fixes: 98b995660b ("ublk: Add UBLK_U_CMD_UPDATE_SIZE")
Cc: stable@vger.kernel.org
Signed-off-by: Mehul Rao <mehulrao@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When user_mem_abort() handles a nested stage-2 fault, it truncates
vma_pagesize to respect the guest's mapping size. However, the local
variable vma_shift is never updated to match this new size.
If the underlying host page turns out to be hardware poisoned,
kvm_send_hwpoison_signal() is called with the original, larger
vma_shift instead of the actual mapping size. This signals incorrect
poison boundaries to userspace and breaks hugepage memory poison
containment for nested VMs.
Update vma_shift to match the truncated vma_pagesize when operating
on behalf of a nested hypervisor.
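One way to keep the two values in sync is sketched below. The struct and
function names are hypothetical, the page size is assumed to be a
non-zero power of two, and the exact form of the assignment in the patch
may differ (the maintainer note below mentions it was simplified).
__builtin_ctzl is a GCC/Clang builtin.

```c
/* Hypothetical pair of values tracked while handling the fault. */
struct sketch_fault {
	unsigned long vma_pagesize; /* mapping size in bytes */
	unsigned int vma_shift;     /* log2 of vma_pagesize */
};

/*
 * Limit the mapping size to what the nested guest mapped and update the
 * shift in the same step. max_map_size must be a non-zero power of two.
 */
void sketch_truncate_for_nested(struct sketch_fault *f,
				unsigned long max_map_size)
{
	if (f->vma_pagesize > max_map_size) {
		f->vma_pagesize = max_map_size;
		/* for a power of two, the shift is the index of the set bit */
		f->vma_shift = (unsigned int)__builtin_ctzl(f->vma_pagesize);
	}
}
```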
Fixes: fd276e71d1 ("KVM: arm64: nv: Handle shadow stage 2 page faults")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260304162222.836152-3-tabba@google.com
[maz: simplified vma_shift assignment from the original patch]
Signed-off-by: Marc Zyngier <maz@kernel.org>
ieee80211_chan_bw_change() iterates all stations and accesses
link->reserved.oper via sta->sdata->link[link_id]. For stations on
AP_VLAN interfaces (e.g. 4addr WDS clients), sta->sdata points to
the VLAN sdata, whose link never participates in chanctx reservations.
This leaves link->reserved.oper zero-initialized with chan == NULL,
causing a NULL pointer dereference in __ieee80211_sta_cap_rx_bw()
when accessing chandef->chan->band during CSA.
Resolve the VLAN sdata to its parent AP sdata using get_bss_sdata()
before accessing link data.
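The lookup step can be pictured as follows. The types and the helper are
hypothetical stand-ins and not the mac80211 definitions; the point is
only that per-link data is read from the owning AP, never from the VLAN.

```c
#include <stddef.h>

enum sketch_iftype { SKETCH_IFTYPE_AP, SKETCH_IFTYPE_AP_VLAN };

/* Hypothetical interface data. */
struct sketch_sdata {
	enum sketch_iftype type;
	struct sketch_sdata *bss_parent; /* owning AP for a VLAN, else NULL */
	int has_reserved_chan;           /* set only on the AP in this sketch */
};

/* Map a VLAN interface to its parent AP; other types map to themselves. */
struct sketch_sdata *sketch_get_bss_sdata(struct sketch_sdata *sdata)
{
	if (sdata->type == SKETCH_IFTYPE_AP_VLAN && sdata->bss_parent)
		return sdata->bss_parent;
	return sdata;
}
```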
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20260305170812.2904208-1-nbd@nbd.name
[also change sta->sdata in ARRAY_SIZE even if it doesn't matter]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The KVM_DEV_RISCV_AIA_GRP_APLIC branch of aia_has_attr() was identified
to have a race condition with concurrent KVM_SET_DEVICE_ATTR ioctls,
leading to a use-after-free bug.
Upon analyzing the code, it was discovered that the
KVM_DEV_RISCV_AIA_GRP_IMSIC branch of aia_has_attr() suffers from the same
lack of synchronization. It invokes kvm_riscv_aia_imsic_has_attr() without
holding dev->kvm->lock.
While aia_has_attr() is running, a concurrent aia_set_attr() could call
aia_init() under the dev->kvm->lock. If aia_init() fails, it may trigger
kvm_riscv_vcpu_aia_imsic_cleanup(), which frees imsic_state. Without proper
locking, kvm_riscv_aia_imsic_has_attr() could attempt to access imsic_state
while it is being deallocated.
Although this specific path has not yet been reported by a fuzzer, it
is logically identical to the APLIC issue. Fix this by acquiring the
dev->kvm->lock before calling kvm_riscv_aia_imsic_has_attr(), ensuring
consistency with the locking pattern used for other AIA attribute groups.
Fixes: 5463091a51 ("RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260304080804.2281721-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
A fuzzer reported a KASAN use-after-free bug triggered by a race
between KVM_HAS_DEVICE_ATTR and KVM_SET_DEVICE_ATTR ioctls on
the AIA device. The root cause is that aia_has_attr() invokes
kvm_riscv_aia_aplic_has_attr() without holding dev->kvm->lock, while
a concurrent aia_set_attr() may call aia_init() under that lock. When
aia_init() fails after kvm_riscv_aia_aplic_init() has succeeded, it
calls kvm_riscv_aia_aplic_cleanup() in its fail_cleanup_imsics path,
which frees both aplic_state and aplic_state->irqs. The concurrent
has_attr path can then dereference the freed aplic->irqs in
aplic_read_pending():
irqd = &aplic->irqs[irq]; /* UAF here */
KASAN report:
BUG: KASAN: slab-use-after-free in aplic_read_pending
arch/riscv/kvm/aia_aplic.c:119 [inline]
BUG: KASAN: slab-use-after-free in aplic_read_pending_word
arch/riscv/kvm/aia_aplic.c:351 [inline]
BUG: KASAN: slab-use-after-free in aplic_mmio_read_offset
arch/riscv/kvm/aia_aplic.c:406
Read of size 8 at addr ff600000ba965d58 by task 9498
Call Trace:
aplic_read_pending arch/riscv/kvm/aia_aplic.c:119 [inline]
aplic_read_pending_word arch/riscv/kvm/aia_aplic.c:351 [inline]
aplic_mmio_read_offset arch/riscv/kvm/aia_aplic.c:406
kvm_riscv_aia_aplic_has_attr arch/riscv/kvm/aia_aplic.c:566
aia_has_attr arch/riscv/kvm/aia_device.c:469
allocated by task 9473:
kvm_riscv_aia_aplic_init arch/riscv/kvm/aia_aplic.c:583
aia_init arch/riscv/kvm/aia_device.c:248 [inline]
aia_set_attr arch/riscv/kvm/aia_device.c:334
freed by task 9473:
kvm_riscv_aia_aplic_cleanup arch/riscv/kvm/aia_aplic.c:644
aia_init arch/riscv/kvm/aia_device.c:292 [inline]
aia_set_attr arch/riscv/kvm/aia_device.c:334
Fix this race by acquiring dev->kvm->lock in aia_has_attr() before
calling kvm_riscv_aia_aplic_has_attr(), consistent with the locking
pattern used in aia_get_attr() and aia_set_attr().
Fixes: 289a007b98 ("RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip")
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260302132703.1721415-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
kvm_riscv_vcpu_aia_rmw_topei() assumes that the per-vCPU IMSIC state has
been initialized once AIA is reported as available and initialized at
the VM level. This assumption does not always hold.
Under fuzzed ioctl sequences, a guest may access the IMSIC TOPEI CSR
before the vCPU IMSIC state is set up. In this case,
vcpu->arch.aia_context.imsic_state is still NULL, and the TOPEI RMW path
dereferences it unconditionally, leading to a host kernel crash.
The crash manifests as:
Unable to handle kernel paging request at virtual address
dfffffff0000000e
...
kvm_riscv_vcpu_aia_imsic_rmw arch/riscv/kvm/aia_imsic.c:909
kvm_riscv_vcpu_aia_rmw_topei arch/riscv/kvm/aia.c:231
csr_insn arch/riscv/kvm/vcpu_insn.c:208
system_opcode_insn arch/riscv/kvm/vcpu_insn.c:281
kvm_riscv_vcpu_virtual_insn arch/riscv/kvm/vcpu_insn.c:355
kvm_riscv_vcpu_exit arch/riscv/kvm/vcpu_exit.c:230
kvm_arch_vcpu_ioctl_run arch/riscv/kvm/vcpu.c:1008
...
Fix this by explicitly checking whether the vCPU IMSIC state has been
initialized before handling TOPEI CSR accesses. If not, forward the CSR
emulation to user space.
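A sketch of the guard. The return codes, structures and function name are
hypothetical and do not mirror the KVM definitions; only the "check the
per-vCPU state, otherwise hand the access to user space" step comes from
the text above.

```c
#include <stddef.h>

/* Hypothetical outcome codes for this sketch. */
enum sketch_result {
	SKETCH_FORWARD_TO_USER = 0, /* let user space emulate the CSR */
	SKETCH_HANDLED = 1,         /* emulated in the kernel */
};

struct sketch_imsic {
	unsigned long topei;
};

struct sketch_vcpu {
	struct sketch_imsic *imsic_state; /* NULL until per-vCPU setup ran */
};

/* Read the TOPEI value if the IMSIC state exists, else defer. */
enum sketch_result sketch_rmw_topei(struct sketch_vcpu *vcpu,
				    unsigned long *val)
{
	if (!vcpu->imsic_state)
		return SKETCH_FORWARD_TO_USER;

	*val = vcpu->imsic_state->topei;
	return SKETCH_HANDLED;
}
```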
Fixes: db8b7e97d6 ("RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260226085119.643295-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
While fuzzing KVM on RISC-V, a use-after-free was observed in
kvm_riscv_gstage_get_leaf(), where ptep_get() dereferences a
freed gstage page table page during gfn unmap.
The crash manifests as:
use-after-free in ptep_get include/linux/pgtable.h:340 [inline]
use-after-free in kvm_riscv_gstage_get_leaf arch/riscv/kvm/gstage.c:89
Call Trace:
ptep_get include/linux/pgtable.h:340 [inline]
kvm_riscv_gstage_get_leaf+0x2ea/0x358 arch/riscv/kvm/gstage.c:89
kvm_riscv_gstage_unmap_range+0xf0/0x308 arch/riscv/kvm/gstage.c:265
kvm_unmap_gfn_range+0x168/0x1fc arch/riscv/kvm/mmu.c:256
kvm_mmu_unmap_gfn_range virt/kvm/kvm_main.c:724 [inline]
page last free pid 808 tgid 808 stack trace:
kvm_riscv_mmu_free_pgd+0x1b6/0x26a arch/riscv/kvm/mmu.c:457
kvm_arch_flush_shadow_all+0x1a/0x24 arch/riscv/kvm/mmu.c:134
kvm_flush_shadow_all virt/kvm/kvm_main.c:344 [inline]
The UAF is caused by gstage page table walks running concurrently with
gstage pgd teardown. In particular, kvm_unmap_gfn_range() can traverse
gstage page tables while kvm_arch_flush_shadow_all() frees the pgd,
leading to use-after-free of page table pages.
Fix the issue by serializing gstage unmap and pgd teardown with
kvm->mmu_lock. Holding mmu_lock ensures that gstage page tables
remain valid for the duration of unmap operations and prevents
concurrent frees.
This matches existing RISC-V KVM usage of mmu_lock to protect gstage
map/unmap operations, e.g. kvm_riscv_mmu_iounmap.
Fixes: dd82e35638 ("RISC-V: KVM: Factor-out g-stage page table management")
Signed-off-by: Jiakai Xu <xujiakai2025@iscas.ac.cn>
Signed-off-by: Jiakai Xu <jiakaiPeanut@gmail.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Link: https://lore.kernel.org/r/20260202040059.1801167-1-xujiakai2025@iscas.ac.cn
Signed-off-by: Anup Patel <anup@brainfault.org>
Gary noticed [1] that the initializer macros as well as the `[Pin]Init`
traits cannot support unaligned fields, since they use operations that
require aligned pointers. This means that any code using structs with
unaligned fields in pin-init is unsound.
By default, the `init!` macro generates references to initialized fields,
which makes the compiler check that those fields are aligned. However,
we added the `#[disable_initialized_field_access]` attribute to avoid
this behavior in commit ceca298c53 ("rust: pin-init: internal: init:
add escape hatch for referencing initialized fields"). Thus remove the
`#[disable_initialized_field_access]` attribute from `init!`, which is
the only safe way to create an initializer handling unaligned fields.
If support for in-place initializing structs with unaligned fields is
required in the future, we could figure out a solution. This is tracked
in [2].
Reported-by: Gary Guo <gary@garyguo.net>
Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/561532-pin-init/topic/initialized.20field.20accessor.20detection/with/576210658 [1]
Link: https://github.com/Rust-for-Linux/pin-init/issues/112 [2]
Fixes: ceca298c53 ("rust: pin-init: internal: init: add escape hatch for referencing initialized fields")
Signed-off-by: Benno Lossin <lossin@kernel.org>
Acked-by: Janne Grunau <j@jannau.net>
Reviewed-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260302140424.4097655-1-lossin@kernel.org
[ Adjusted tags and reworded as discussed. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
When building with an out directory (O=), absolute paths can end up in the
file name in `#[track_caller]` or the panic message. This is not desirable
as this leaks the exact path being used to build the kernel and means that
the same location can appear in two forms (relative or absolute).
This was reported by Asahi [1] and was previously worked around in [2] by
forcing every path to be absolute. Using absolute paths for everything
solves the inconsistency; however, it does not address the reproducibility
issue. So, fix this by remapping all absolute paths to srctree to relative
paths instead.
This was previously attempted in commit dbdffaf50f ("kbuild, rust: use
-fremap-path-prefix to make paths relative") but that was reverted as
remapping debug info causes some tools (e.g. objdump) to be unable to find
sources. Therefore, use `--remap-path-scope` to only remap macros but leave
debuginfo untouched. `--remap-path-scope` is only stable in Rust 1.95, so
use `rustc-option` to detect its presence. This feature has been available
as `-Zremap-path-scope` for all versions that we support; however due to
bugs in the Rust compiler, it does not work reliably until 1.94. I opted to
not enable it for 1.94 as it's just a single version that we missed.
This change can be validated by building a kernel with O=, stripping debug
info from vmlinux, and then checking whether the absolute path appears in
`strings vmlinux`, e.g. `strings vmlinux |grep \/home`.
Reported-by: Janne Grunau <j@jannau.net>
Reported-by: Asahi Lina <lina+kernel@asahilina.net>
Closes: https://rust-for-linux.zulipchat.com/#narrow/channel/288089-General/topic/Per-call-site.20data.20and.20lock.20class.20keys/near/572466559 [1]
Link: 54ab888788 [2]
Signed-off-by: Gary Guo <gary@garyguo.net>
Acked-by: Nicolas Schier <nsc@kernel.org> # kbuild
Link: https://patch.msgid.link/20260226152112.3222886-1-gary@kernel.org
[ Reworded for few typos. - Miguel ]
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
The Glymur platform has four DisplayPort controllers. The hardware
supports four streams (MST) per controller. However, on Glymur the first
three controllers only have two streams wired to the display subsystem,
while the fourth controller operates in single-stream mode.
Add a dedicated clause for the Glymur compatible to require the register
ranges for all four stream blocks, while allowing either one pixel clock
(for the single-stream controller) or two pixel clocks (for the remaining
controllers).
Update the Glymur MDSS schema example by adding the missing p2, p3,
mst2link and mst3link register blocks. Without these, the bindings
validation fails. Also replace the made-up register addresses with the
actual addresses from the first controller to match the SoC devicetree
description.
Cc: stable@vger.kernel.org # v6.19
Fixes: 8f63bf9082 ("dt-bindings: display: msm: Document the Glymur DiplayPort controller")
Fixes: 1aee577bbc ("dt-bindings: display: msm: Document the Glymur Mobile Display SubSystem")
Signed-off-by: Abel Vesa <abel.vesa@oss.qualcomm.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/708518/
Link: https://lore.kernel.org/r/20260303-glymur-fix-dp-bindings-reg-clocks-v4-1-1ebd9c7c2cee@oss.qualcomm.com
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
To get the comment changes in this commit:
171efc7009 ("x86/ibs: Fix typo in dc_l2tlb_miss comment")
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/asm/amd/ibs.h arch/x86/include/asm/amd/ibs.h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Picking up the changes from these csets:
2153b2e891 ("sparc: Add architecture support for clone3")
99d2592023 ("rseq: Implement sys_rseq_slice_yield()")
4ac286c4a8 ("s390/syscalls: Switch to generic system call table generation")
This makes 'perf trace' support it; it is now possible, for instance, to
do:
# perf trace -e rseq_slice_yield --max-stack=16
Here is an example with the 'sendmmsg' syscall:
root@x1:~# perf trace -e sendmmsg --max-stack 16 --max-events=1
0.000 ( 0.062 ms): dbus-broker/1012 sendmmsg(fd: 150, mmsg: 0x7ffef57cca50, vlen: 1, flags: DONTWAIT|NOSIGNAL) = 1
syscall_exit_to_user_mode_prepare ([kernel.kallsyms])
syscall_exit_to_user_mode_prepare ([kernel.kallsyms])
syscall_exit_to_user_mode ([kernel.kallsyms])
do_syscall_64 ([kernel.kallsyms])
entry_SYSCALL_64 ([kernel.kallsyms])
[0x117ce7] (/usr/lib64/libc.so.6 (deleted))
root@x1:~#
To do a system wide tracing of the new 'rseq_slice_yield' syscall with a
backtrace of at most 16 entries.
This addresses these perf tools build warnings:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
diff -u tools/scripts/syscall.tbl scripts/syscall.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
diff -u tools/perf/arch/arm/entry/syscalls/syscall.tbl arch/arm/tools/syscall.tbl
diff -u tools/perf/arch/sh/entry/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/sparc/entry/syscalls/syscall.tbl arch/sparc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/xtensa/entry/syscalls/syscall.tbl arch/xtensa/kernel/syscalls/syscall.tbl
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ludwig Rydberg <ludwig.rydberg@gaisler.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Add a sample module under samples/workqueue/stall_detector/ that
reproduces a workqueue stall caused by PF_WQ_WORKER misuse. The
module queues two work items on the same per-CPU pool, then clears
PF_WQ_WORKER and sleeps in wait_event_idle(), hiding from the
concurrency manager and stalling the second work item indefinitely.
This is useful for testing the workqueue watchdog stall diagnostics.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.
This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().
Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().
Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s
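One way to derive the "for 100s" figure from a jiffies-style start timestamp is sketched below. This is an assumption about the arithmetic, not the code in show_pwq(); `MOCK_HZ` and `elapsed_secs()` are hypothetical names.

```c
/* Hypothetical tick rate; in the kernel this would be the HZ constant. */
#define MOCK_HZ 250UL

/*
 * Seconds elapsed between two tick counts. The subtraction is done on
 * unsigned values, so the result stays correct across counter wraparound.
 */
static unsigned long elapsed_secs(unsigned long now, unsigned long start)
{
	return (now - start) / MOCK_HZ;
}
```

Because the start value is never reset, the result covers sleeps as well as time spent running.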
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue made progress.
Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
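A small standalone sketch of the two flag namespaces follows. The `MOCK_` names are hypothetical; only the shared value (1 << 0) comes from the text above.

```c
#include <stdbool.h>

/* Two separate namespaces whose BH flags happen to share the same bit. */
enum mock_wq_flags   { MOCK_WQ_BH   = 1 << 0 };
enum mock_pool_flags { MOCK_POOL_BH = 1 << 0 };

struct mock_pool { unsigned int flags; };

/* Pool-level check: tests the pool flag, not the workqueue flag. */
static bool pool_is_bh(const struct mock_pool *pool)
{
	return (pool->flags & MOCK_POOL_BH) != 0;
}
```

Testing against MOCK_WQ_BH would return the same result today, but only because the two values coincide.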
Fixes: 4cb1ef6460 ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Smatch reports unreachable code in imx_rproc_prepare(), where an early
return inside the reserved-memory parsing loop prevents platform
prepare_ops from being executed.
When of_reserved_mem_region_to_resource() fails, imx_rproc_prepare()
returns immediately, so the platform-specific prepare callback is never
called. As a result, prepare_ops such as imx_rproc_sm_lmm_prepare() on
i.MX95 have no chance to run.
This is problematic when Linux controls the M7 Logical Machine and is
responsible for preparing resources such as TCM. Without running the
platform prepare callback, loading the M7 ELF into TCM may fail if the
bootloader did not power up and initialize TCM.
Fix this by breaking out of the reserved-memory loop instead of
returning, allowing the platform prepare_ops to be executed as intended.
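The control-flow change can be illustrated with a minimal sketch. All names here (`parse_region()`, `platform_prepare()`, `mock_prepare()`) are hypothetical stand-ins for the driver's functions, and the failure is simulated.

```c
/* Counts how many times the platform callback ran. */
static int prepare_calls;

/* Simulated region parser: fails at index fail_at, succeeds otherwise. */
static int parse_region(int idx, int fail_at)
{
	return idx == fail_at ? -1 : 0;
}

static int platform_prepare(void)
{
	prepare_calls++;
	return 0;
}

static int mock_prepare(int nregions, int fail_at)
{
	int i;

	for (i = 0; i < nregions; i++) {
		/* A return here would skip the callback below; break does not. */
		if (parse_region(i, fail_at))
			break;
		/* ... record the region ... */
	}

	return platform_prepare();
}
```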
Fixes: edd2a99560 ("remoteproc: imx_rproc: Introduce prepare ops for imx_rproc_dcfg")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-remoteproc/aYYXAa2Fj36XG4yQ@p14s/T/#t
Signed-off-by: Peng Fan <peng.fan@nxp.com>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Link: https://lore.kernel.org/r/20260208-imx-rproc-fix-v1-1-ad74555eb9a4@nxp.com
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Prior to commit d935187cfb ("remoteproc: mediatek: Break lock
dependency to prepare_lock"), `scp->clk` was prepared and enabled only
when it needs to communicate with the SCP. The commit d935187cfb
moved the prepare operation to remoteproc's prepare(), keeping the clock
prepared as long as the SCP is running.
The power consumption due to the prolonged clock preparation can be
negligible when the system is running, as SCP is designed to be a very
power efficient processor.
However, the clock remains prepared even when the system enters system
suspend. This prevents the underlying clock controller (and potentially
the parent PLLs) from shutting down, which increases power consumption
and may block the system from entering deep sleep states.
Add suspend and resume callbacks. Unprepare the clock in suspend() if
it was active and re-prepare it in resume() to ensure the clock is
properly disabled during system suspend, while maintaining the "always
prepared" semantics while the system is active. The driver doesn't
implement .attach() callback, hence it only checks for RPROC_RUNNING.
Fixes: d935187cfb ("remoteproc: mediatek: Break lock dependency to prepare_lock")
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
Signed-off-by: Tzung-Bi Shih <tzungbi@kernel.org>
Link: https://lore.kernel.org/r/20260206033034.3031781-1-tzungbi@kernel.org
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
When a guest performs an atomic/exclusive operation on memory lacking
the required attributes, user_mem_abort() injects a data abort and
returns early. However, it fails to release the reference to the
host page acquired via __kvm_faultin_pfn().
A malicious guest could repeatedly trigger this fault, leaking host
page references and eventually causing host memory exhaustion (OOM).
Fix this by consolidating the early error returns to a new out_put_page
label that correctly calls kvm_release_page_unused().
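The shape of such a consolidated exit path is sketched below with a mock reference count. This is a simplification under stated assumptions: `mock_page`, `mock_mem_abort()` and its return values are hypothetical, and the success path here drops the reference through the same label, which the real function need not do.

```c
struct mock_page { int refcount; };

static void get_page_ref(struct mock_page *p) { p->refcount++; }
static void put_page_ref(struct mock_page *p) { p->refcount--; }

/* Returns 1 when an abort is "injected", -1 on another error, 0 on success. */
static int mock_mem_abort(struct mock_page *page, int unsupported_access,
			  int other_error)
{
	int ret = 0;

	get_page_ref(page);

	if (unsupported_access) {
		ret = 1;
		goto out_put_page;	/* early exit still drops the reference */
	}
	if (other_error) {
		ret = -1;
		goto out_put_page;
	}

	/* ... install the mapping ... */

out_put_page:
	put_page_ref(page);
	return ret;
}
```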
Fixes: 2937aeec9d ("KVM: arm64: Handle DABT caused by LS64* instructions on unsupported memory")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Yuan Yao <yaoyuan@linux.alibaba.com>
Link: https://patch.msgid.link/20260304162222.836152-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
The task ownership state machine in sched_ext is quite hard to follow
from the code alone. The interaction of ownership states, memory
ordering rules and cross-CPU "lock dancing" makes the overall model
subtle.
Extend the documentation next to scx_ops_state to provide a more
structured and self-contained description of the state transitions and
their synchronization rules.
The new reference should make the code easier to reason about and
maintain and can help future contributors understand the overall
task-ownership workflow.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
bypass_lb_cpu() reads scx_bypass_lb_intv_us and scx_slice_bypass_us
without holding any lock, in timer callback context where module
parameter writes via sysfs can happen concurrently:
min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV;
^^^^^^^^^^^^^^^^^^^^
plain read -- KCSAN data race
if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us))
^^^^^^^^^^^^^^^^^
plain read -- KCSAN data race
scx_bypass_lb_intv_us already uses READ_ONCE() in scx_bypass_lb_timerfn()
and scx_bypass() for its other lock-free read sites, leaving
bypass_lb_cpu() inconsistent. scx_slice_bypass_us has the same
lock-free access pattern in the same function.
Fix both plain reads by using READ_ONCE() to complete the concurrent
access annotation and make the code KCSAN-clean.
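A userspace approximation of the annotated reads is sketched below. `MOCK_READ_ONCE()` imitates the kernel macro with a volatile access (using the GCC/Clang `__typeof__` extension); the divisor and the default values are hypothetical.

```c
/* Approximation of READ_ONCE(): force exactly one volatile load. */
#define MOCK_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))
#define MOCK_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define MOCK_MIN_DELTA_DIV 4U	/* hypothetical divisor */

/* Tunables that another thread may update concurrently. */
static unsigned int lb_intv_us = 500000;
static unsigned int slice_bypass_us = 5000;

static unsigned int min_delta_slices(void)
{
	/* Snapshot each tunable once, then compute on the local copies. */
	unsigned int intv = MOCK_READ_ONCE(lb_intv_us);
	unsigned int slice = MOCK_READ_ONCE(slice_bypass_us);

	return MOCK_DIV_ROUND_UP(intv / MOCK_MIN_DELTA_DIV, slice);
}
```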
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
As per R_BFHQH,
" When an Address size fault is generated, the reported fault code
indicates one of the following:
If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level 0. "
Fix the reported Address size fault level as being 0 if the base address is
wrongly programmed by L1.
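The rule quoted above can be expressed as a small check. This is a sketch only: `base_exceeds_oa()` and `addr_size_fault_level()` are hypothetical helpers, and `oa_bits` is assumed to be between 1 and 63.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the base address has any bit set at or above the OA size. */
static bool base_exceeds_oa(uint64_t base, unsigned int oa_bits)
{
	return (base >> oa_bits) != 0;
}

/* Level to report for an Address size fault found at walk_level. */
static int addr_size_fault_level(uint64_t base, unsigned int oa_bits,
				 int walk_level)
{
	/* A bad translation table base is always reported at level 0. */
	return base_exceeds_oa(base, oa_bits) ? 0 : walk_level;
}
```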
Fixes: 61e30b9eef ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-3-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
check_base_s2_limits() checks the validity of SL0 and inputsize against
ia_size (inputsize again!) but the pseudocode from DDI0487 G.a
AArch64.TranslationTableWalk() says that we should check against the
implemented PA size.
We would otherwise fail to walk S2 with a valid configuration. E.g.,
granule size = 4KB, inputsize = 40 bits, initial lookup level = 0 (no
concatenation) on a system with 48 bits PA range supported is allowed by
architecture.
Fix it by obtaining PA size by kvm_get_pa_bits(). Note that
kvm_get_pa_bits() returns the fixed limit now and should eventually reflect
the per VM PARange (one day!). Given that the configured PARange should not
be greater than kvm_ipa_limit, it at least fixes the problem described
above.
While at it, inject a level 0 translation fault to guest if
check_base_s2_limits() fails, as per the pseudocode.
Fixes: 61e30b9eef ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic")
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260225173515.20490-2-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
If, for any odd reason, we cannot converge to a mapping size that is
completely contained in a memblock region, we fail to install a S2
mapping and go back to the faulting instruction. Rinse, repeat.
This happens when faulting in regions that are smaller than a page
or that do not have PAGE_SIZE-aligned boundaries (as witnessed on
an O6 board that refuses to boot in protected mode).
In this situation, fall back to using a PAGE_SIZE mapping anyway --
it isn't like we can go any lower.
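The fallback can be sketched as a search over candidate block sizes. This is an illustrative model, not the hypervisor code: `pick_map_size()` is hypothetical, and the candidate sizes assume a 4KB granule.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096ULL	/* assumes a 4KB granule */

/* Pick the largest aligned block around addr that fits in the region. */
static uint64_t pick_map_size(uint64_t addr, uint64_t reg_start,
			      uint64_t reg_end)
{
	/* Hypothetical candidate sizes, largest first. */
	static const uint64_t sizes[] = { 1ULL << 30, 1ULL << 21, MOCK_PAGE_SIZE };
	size_t i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		uint64_t start = addr & ~(sizes[i] - 1);
		uint64_t end = start + sizes[i];

		if (start >= reg_start && end <= reg_end)
			return sizes[i];
	}

	/* Nothing fits inside the region: map a single page anyway. */
	return MOCK_PAGE_SIZE;
}
```

A region smaller than a page, or one with unaligned boundaries, fails every containment check and takes the final return.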
Fixes: e728e70580 ("KVM: arm64: Adjust range correctly during host stage-2 faults")
Link: https://lore.kernel.org/r/86wlzr77cn.wl-maz@kernel.org
Cc: stable@vger.kernel.org
Cc: Quentin Perret <qperret@google.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Link: https://patch.msgid.link/20260305132751.2928138-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
If the firmware version query fails, the driver currently ignores the
error and continues initializing. This leaves the device in a bad state.
Fix this by making bng_re_query_hwrm_version() return the error code and
update the driver to check for this error and stop the setup process
safely if it happens.
Fixes: 745065770c ("RDMA/bng_re: Register and get the resources from bnge driver")
Signed-off-by: Kamal Heib <kheib@redhat.com>
Link: https://patch.msgid.link/20260303043645.425724-1-kheib@redhat.com
Reviewed-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
kzalloc() is called with __GFP_NOFAIL, so a NULL return is not expected.
Drop the redundant !map check in xfs_dabuf_map().
Also switch the nirecs-sized allocation to kcalloc().
Signed-off-by: hongao <hongao@uniontech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Use %ld (not %lu) for signed long, and pass the actual string length
returned by sprintf() to write_text() instead of sizeof(buf).
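Both points can be shown in a short standalone sketch. `mock_write_text()` and `emit_value()` are hypothetical, and snprintf() is used here in place of sprintf() so the sketch is bounds-checked.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sink that records how many bytes it was asked to write. */
static size_t last_written;

static void mock_write_text(const char *buf, size_t len)
{
	(void)buf;
	last_written = len;
}

static void emit_value(long val)
{
	char buf[32];
	/* %ld matches a signed long; %lu would misprint negative values. */
	int len = snprintf(buf, sizeof(buf), "%ld\n", val);

	/* Pass the formatted length, not sizeof(buf). */
	if (len > 0 && (size_t)len < sizeof(buf))
		mock_write_text(buf, (size_t)len);
}
```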
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reset controller fixes for v7.0
* Fix NULL pointer dereference in reset-rzg2l-usbphy-ctrl driver for
renesas,rzg2l-usbphy-ctrl devices without pwrrdy control.
* tag 'reset-fixes-for-v7.0' of https://git.pengutronix.de/git/pza/linux:
reset: rzg2l-usbphy-ctrl: Check pwrrdy is valid before using it
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
FSL SOC Fixes for 7.0
- Fix a race condition in Freescale Queue and Buffer Manager.
- Fix a trivial error verification in CPM1
* tag 'soc_fsl-7.0-2' of https://git.kernel.org/pub/scm/linux/kernel/git/chleroy/linux:
soc: fsl: cpm1: qmc: Fix error check for devm_ioremap_resource() in qmc_qe_init_resources()
soc: fsl: qbman: fix race condition in qman_destroy_fq
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
RISC-V soc fixes for v7.0-rc1
drivers:
Fix leaks in probe/init function teardown code in three drivers.
microchip:
Fix a warning introduced by a recent binding change, that made resets
required on Polarfire SoC's CAN IP.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
* tag 'riscv-soc-fixes-for-v7.0-rc1' of https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux:
cache: ax45mp: Fix device node reference leak in ax45mp_cache_init()
cache: starfive: fix device node leak in starlink_cache_init()
riscv: dts: microchip: add can resets to mpfs
soc: microchip: mpfs: Fix memory leak in mpfs_sys_controller_probe()
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
The QMI message encoder has up until recently read a single byte (as
elem_size == 1), but with the introduction of big endian support it's
become apparent that this field is expected to be a full u32 -
regardless of the size of the length in the encoded message (which is
what elem_size specifies).
The result is that the encoder now reads past the length byte and
rejects the unreasonably large length formed when including the
following 3 bytes from the subsys_name array.
Fix this by changing to the expected type.
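The over-read can be reproduced in a few lines of standalone C. The struct layouts below are hypothetical (the real message has other fields); they only model a one-byte length followed directly by a name array, and the same field widened to 32 bits.

```c
#include <stdint.h>
#include <string.h>

/* Before: a one-byte length immediately followed by the name bytes. */
struct mock_msg {
	uint8_t name_len;
	char name[15];
};

/* After: the length field has the width the encoder reads. */
struct mock_msg_fixed {
	uint32_t name_len;
	char name[15];
};

/* Models an encoder that always loads a full 32-bit length. */
static uint32_t read_len_as_u32(const struct mock_msg *m)
{
	uint32_t v;

	memcpy(&v, m, sizeof(v));	/* name_len plus name[0..2] */
	return v;
}

static uint32_t read_len_fixed(const struct mock_msg_fixed *m)
{
	uint32_t v;

	memcpy(&v, &m->name_len, sizeof(v));
	return v;
}
```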
Fixes: 1fb82ee806 ("remoteproc: qcom: Introduce sysmon")
Signed-off-by: Bjorn Andersson <bjorn.andersson@oss.qualcomm.com>
Reviewed-by: Chris Lew <christopher.lew@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260220-qmi-encode-invalid-length-v2-1-5674be35ab29@oss.qualcomm.com
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
Eliminate kernel-doc warnings in mmu_notifier.h:
- add a missing struct short description
- use the correct format for function parameters
- add missing function return comment sections
Warning: include/linux/mmu_notifier.h:236 missing initial short
description on line: * struct mmu_interval_notifier_ops
Warning: include/linux/mmu_notifier.h:325 function parameter 'interval_sub'
not described in 'mmu_interval_set_seq'
Warning: include/linux/mmu_notifier.h:325 function parameter 'cur_seq'
not described in 'mmu_interval_set_seq'
Warning: include/linux/mmu_notifier.h:346 function parameter 'interval_sub'
not described in 'mmu_interval_read_retry'
Warning: include/linux/mmu_notifier.h:346 function parameter 'seq' not
described in 'mmu_interval_read_retry'
Warning: include/linux/mmu_notifier.h:346 No description found for return
value of 'mmu_interval_read_retry'
Warning: include/linux/mmu_notifier.h:370 function parameter 'interval_sub'
not described in 'mmu_interval_check_retry'
Warning: include/linux/mmu_notifier.h:370 function parameter 'seq' not
described in 'mmu_interval_check_retry'
Warning: include/linux/mmu_notifier.h:370 No description found for return
value of 'mmu_interval_check_retry'
Link: https://lkml.kernel.org/r/20260302005222.3470783-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During a pagecache folio split, the values in the related xarray should
not be changed from the original folio at xarray split time until all
after-split folios are well formed and stored in the xarray. Current use
of xas_try_split() in __split_unmapped_folio() lets some after-split
folios show up at wrong indices in the xarray. When these misplaced
after-split folios are unfrozen, before correct folios are stored via
__xa_store(), and grabbed by folio_try_get(), they are returned to
userspace at wrong file indices, causing data corruption. More detailed
explanation is at the bottom.
The reproducer is at: https://github.com/dfinity/thp-madv-remove-test
It
1. creates a memfd,
2. forks,
3. in the child process, maps the file with large folios (via shmem code
path) and reads the mapped file continuously with 16 threads,
4. in the parent process, uses madvise(MADV_REMOVE) to punch holes in the
large folio.
Data corruption can be observed without the fix. Basically, data from a
wrong page->index is returned.
Fix it by using the original folio in xas_try_split() calls, so that
folio_try_get() can get the right after-split folios after the original
folio is unfrozen.
Uniform split, split_huge_page*(), is not affected, since it uses
xas_split_alloc() and xas_split() only once and stores the original folio
in the xarray. Change xas_split() used in uniform split branch to use the
original folio to avoid confusion.
The Fixes tag below points to the commit that introduced the code, but
folio_split() is used in a later commit 7460b470a1 ("mm/truncate: use
folio_split() in truncate operation").
More details:
For example, a folio f is split non-uniformly into f, f2, f3, f4 like
below:
+----------------+---------+----+----+
| f | f2 | f3 | f4 |
+----------------+---------+----+----+
but the xarray would look like below after __split_unmapped_folio() is
done:
+----------------+---------+----+----+
| f | f2 | f3 | f3 |
+----------------+---------+----+----+
After __split_unmapped_folio(), the code changes the xarray and unfreezes
after-split folios:
1. unfreezes f2, __xa_store(f2)
2. unfreezes f3, __xa_store(f3)
3. unfreezes f4, __xa_store(f4), which overwrites the second f3 to f4.
4. unfreezes f.
Meanwhile, a parallel filemap_get_entry() can read the second f3 from the
xarray and use folio_try_get() on it at step 2 when f3 is unfrozen. Then,
f3 is wrongly returned to user.
After the fix, the xarray looks like below after __split_unmapped_folio():
+----------------+---------+----+----+
| f | f | f | f |
+----------------+---------+----+----+
so that the race window no longer exists.
[ziy@nvidia.com: move comment, per David]
Link: https://lkml.kernel.org/r/5C9FA053-A4C6-4615-BE05-74E47A6462B3@nvidia.com
Link: https://lkml.kernel.org/r/20260302203159.3208341-1-ziy@nvidia.com
Fixes: 00527733d0 ("mm/huge_memory: add two new (not yet used) functions for folio_split()")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Bas van Dijk <bas@dfinity.org>
Closes: https://lore.kernel.org/all/CAKNNEtw5_kZomhkugedKMPOG-sxs5Q5OLumWJdiWXv+C9Yct0w@mail.gmail.com/
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
RELAYFS was originally developed by Tom Zanussi and Karim Yaghmour in
2005[1]. Jens Axboe converted it from filesystem into a generic API in
2006[2] and made it widely known through the notable I/O tracing tool
blktrace. Since then, a few users have remained scattered across
different subsystems, like the recently added wifi commit[3], which is an
example of how to communicate between users and the kernel. Last year
I've already done some maintenance and added/corrected some diagnostic
counters.
At Tencent, we internally maintain RELAY as one of the most crucial
components of our network observability platform, which was shared a bit
at LPC 2025[4][5] and hopefully will be published in a paper this year.
RELAY has proven highly efficient due to its inherent design. This design
has become the indispensable way to build a 7x24 platform monitoring
various hot paths even without any selective sampling (yes, sampling is commonly
used to avoid the overall performance degradation). One of the
recommended usages is to use its zerocopy function relay_reserve() to
transfer data in a raw format that can be recognized and parsed by the
corresponding application to userspace without introducing heavy locks and
complicated logic that appears in other types of approaches, like printk.
More details can be discovered by reading through the Documentation :)
Credits are given to all the contributors and reviewers for
RELAY/RELAYFS in the past and future! Many thanks!
[1]: commit e82894f84d ("[PATCH] relayfs")
[2]: commit b86ff981a8 ("[PATCH] relay: migrate from relayfs to a generic relay API")
[3]: commit c1bf6959dd ("wifi: ath11k: Register relayfs entries for CFR dump")
[4]: https://lpc.events/event/19/contributions/2055/
[5]: https://lpc.events/event/19/contributions/2010/
Link: https://lkml.kernel.org/r/20260301020902.56476-1-kerneljasonxing@gmail.com
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the trylock path of refill_obj_stock(), mod_objcg_mlstate() should use
the real alloc/free bytes (i.e., nr_acct) for accounting, rather than
nr_bytes.
The user-visible impact is that the NR_SLAB_RECLAIMABLE_B and
NR_SLAB_UNRECLAIMABLE_B stats can end up being incorrect.
For example, if a user allocates a 6144-byte object, then before this
fix refill_obj_stock() calls mod_objcg_mlstate(..., nr_bytes=2048), even
though it should account for 6144 bytes (i.e., nr_acct).
When the user later frees the same object with kfree(),
refill_obj_stock() calls mod_objcg_mlstate(..., nr_bytes=6144). This
ends up adding 6144 to the stats, but it should be applying -6144
(i.e., nr_acct) since the object is being freed.
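The drift described above can be modelled in a few lines of user-space
Python. This is only a simplified sketch, not the kernel code: the names
apply_stat_update() and use_nr_acct are hypothetical, and only the byte
values (2048, 6144, -6144) are taken from the example above.

```python
# Simplified model of the slab stat update in the trylock path.
# apply_stat_update() and use_nr_acct are hypothetical names for illustration.


def apply_stat_update(stat: int, nr_bytes: int, nr_acct: int, use_nr_acct: bool) -> int:
    """Return the new stat value after one refill_obj_stock()-style call."""
    delta = nr_acct if use_nr_acct else nr_bytes
    return stat + delta


def simulate(use_nr_acct: bool) -> int:
    """Allocate and then free one 6144-byte object; return the final stat."""
    stat = 0
    # Allocation: nr_bytes=2048 is passed, but 6144 bytes were really allocated.
    stat = apply_stat_update(stat, nr_bytes=2048, nr_acct=6144, use_nr_acct=use_nr_acct)
    # Free: nr_bytes=6144 is passed, but the real change is -6144.
    stat = apply_stat_update(stat, nr_bytes=6144, nr_acct=-6144, use_nr_acct=use_nr_acct)
    return stat


if __name__ == "__main__":
    print("before fix:", simulate(use_nr_acct=False))
    print("after fix: ", simulate(use_nr_acct=True))
```

In this model the stat ends up 8192 bytes too high after one alloc/free
cycle when nr_bytes is used, and returns to zero when nr_acct is used.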
Link: https://lkml.kernel.org/r/20260226115145.62903-1-hao.li@linux.dev
Fixes: 200577f69f ("memcg: objcg stock trylock without irq disabling")
Signed-off-by: Hao Li <hao.li@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vlastimil Babka <vbabka@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Architectures like powerpc check for pfn_valid() in their virt_to_phys()
implementation (when CONFIG_DEBUG_VIRTUAL is enabled) [1]. Commit
d49004c5f0 "arch, mm: consolidate initialization of nodes, zones and
memory map" changed the order of initialization between
hugetlb_bootmem_alloc() and free_area_init(). This means pfn_valid() can
now return false in the alloc_bootmem() path, since sparse_init() is not yet
done.
Since alloc_bootmem() uses memblock_alloc(.., MEMBLOCK_ALLOC_ACCESSIBLE),
these allocations are always going to happen below high_memory,
where __pa() should return valid physical addresses. Hence this patch
converts the two callers of virt_to_phys() in the alloc_bootmem() path to
__pa() to avoid this bootup warning:
------------[ cut here ]------------
WARNING: arch/powerpc/include/asm/io.h:879 at virt_to_phys+0x44/0x1b8, CPU#0: swapper/0
Modules linked in:
<...>
NIP [c000000000601584] virt_to_phys+0x44/0x1b8
LR [c000000004075de4] alloc_bootmem+0x144/0x1a8
Call Trace:
[c000000004d1fb50] [c000000004075dd4] alloc_bootmem+0x134/0x1a8
[c000000004d1fba0] [c000000004075fac] __alloc_bootmem_huge_page+0x164/0x230
[c000000004d1fbe0] [c000000004030bc4] alloc_bootmem_huge_page+0x44/0x138
[c000000004d1fc10] [c000000004076e48] hugetlb_hstate_alloc_pages+0x350/0x5ac
[c000000004d1fd30] [c0000000040782f0] hugetlb_bootmem_alloc+0x15c/0x19c
[c000000004d1fd70] [c00000000406d7b4] mm_core_init_early+0x7c/0xdf4
[c000000004d1ff30] [c000000004011d84] start_kernel+0xac/0xc58
[c000000004d1ffe0] [c00000000000e99c] start_here_common+0x1c/0x20
[1]: https://lore.kernel.org/linuxppc-dev/87tsv5h544.ritesh.list@gmail.com/
Link: https://lkml.kernel.org/r/b4a7d2c6c4c1dd81dddc904fc21f01303290a4b8.1772107852.git.riteshh@linux.ibm.com
Fixes: d49004c5f0 ("arch, mm: consolidate initialization of nodes, zones and memory map")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Build of VMA and radix-tree tests is unhappy after the conversion of
kzalloc() to kzalloc_obj() in lib/idr.c:
cc -I../shared -I. -I../../include -I../../arch/x86/include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -DNUM_VMA_FLAG_BITS=128 -DNUM_MM_FLAG_BITS=128 -c -o idr.o idr.c
idr.c: In function `ida_alloc_range':
idr.c:420:34: error: implicit declaration of function `kzalloc_obj'; did you mean `kzalloc_node'? [-Wimplicit-function-declaration]
420 | bitmap = kzalloc_obj(*bitmap, GFP_NOWAIT);
| ^~~~~~~~~~~
| kzalloc_node
idr.c:420:32: error: assignment to `struct ida_bitmap *' from `int' makes pointer from integer without a cast [-Wint-conversion]
420 | bitmap = kzalloc_obj(*bitmap, GFP_NOWAIT);
| ^
idr.c:447:40: error: assignment to `struct ida_bitmap *' from `int' makes pointer from integer without a cast [-Wint-conversion]
447 | bitmap = kzalloc_obj(*bitmap, GFP_NOWAIT);
| ^
idr.c:468:15: error: assignment to `struct ida_bitmap *' from `int' makes pointer from integer without a cast [-Wint-conversion]
468 | alloc = kzalloc_obj(*bitmap, gfp);
| ^
make: *** [<builtin>: idr.o] Error 1
Import necessary macros from include/linux to tools/include/linux to fix
the compilation.
Link: https://lkml.kernel.org/r/20260225233111.2760752-1-rppt@kernel.org
Fixes: 69050f8d6d ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damos_walk() sets ctx->walk_control to the caller-provided control
structure before checking whether the context is running. If the context
is inactive (damon_is_running() returns false), the function returns
-EINVAL without clearing ctx->walk_control. This leaves a dangling
pointer to a stack-allocated structure that will be freed when the caller
returns.
This is structurally identical to the bug fixed in commit f9132fbc2e
("mm/damon/core: remove call_control in inactive contexts") for
damon_call(), which had the same pattern of linking a control object and
returning an error without unlinking it.
The dangling walk_control pointer can cause:
1. Use-after-free if the context is later started and kdamond
dereferences ctx->walk_control (e.g., in damos_walk_cancel()
which writes to control->canceled and calls complete())
2. Permanent -EBUSY from subsequent damos_walk() calls, since the
stale pointer is non-NULL
Nonetheless, the real user impact is quite limited. The
use-after-free is impossible because there are no damos_walk() callers that
start the context later. The permanent -EBUSY can actually confuse
users, as DAMON is not running. But the symptom persists only while the
context is turned off. Turning it on again makes DAMON internally
use a newly generated damon_ctx object that doesn't have the invalid
damos_walk_control pointer, so everything will work fine again.
Fix this by clearing ctx->walk_control under walk_control_lock before
returning -EINVAL, mirroring the fix pattern from f9132fbc2e.
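The stale-pointer symptom can be illustrated with a small user-space
model. This is a sketch under stated assumptions, not the DAMON code:
DamonCtx, damos_walk_model() and the clear_on_error switch are
hypothetical stand-ins, and the success path (which in the kernel waits
for the walk to complete) is reduced to returning 0.

```python
import errno
import threading


class DamonCtx:
    """Toy stand-in for struct damon_ctx (hypothetical, user-space only)."""

    def __init__(self, running: bool = False):
        self.running = running
        self.walk_control = None
        self.walk_control_lock = threading.Lock()


def damos_walk_model(ctx: DamonCtx, control: object, clear_on_error: bool) -> int:
    """Link control to ctx; return 0, -EBUSY, or -EINVAL."""
    with ctx.walk_control_lock:
        if ctx.walk_control is not None:
            return -errno.EBUSY  # a previous request is still linked
        ctx.walk_control = control
    if not ctx.running:
        if clear_on_error:
            # The fix: unlink the control object before reporting the error.
            with ctx.walk_control_lock:
                ctx.walk_control = None
        return -errno.EINVAL
    return 0  # the real function would wait for the walk to finish here
```

With clear_on_error=False, a second call on the same inactive context
returns -EBUSY instead of -EINVAL, which matches the second symptom
listed above.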
Link: https://lkml.kernel.org/r/20260224011102.56033-1-sj@kernel.org
Fixes: bf0eaba0ff ("mm/damon/core: implement damos_walk()")
Reported-by: Raul Pazemecxas De Andrade <raul_pazemecxas@hotmail.com>
Closes: https://lore.kernel.org/CPUPR80MB8171025468965E583EF2490F956CA@CPUPR80MB8171.lamprd80.prod.outlook.com
Signed-off-by: Raul Pazemecxas De Andrade <raul_pazemecxas@hotmail.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [6.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A dirty folio is one which has been written to. A clean folio is its
opposite. Since a clean folio has no user data, it can be freed under
memory pressure.
memfd preservation with LUO saves the flag at preserve(). This is
problematic. The folio might get dirtied later. Saving it at freeze()
also doesn't work, since the dirty bit from PTE is normally synced at
unmap and there might still be mappings of the file at freeze().
To see why this is a problem, say a folio is clean at preserve, but gets
dirtied later. The serialized state of the folio will mark it as clean.
After retrieve, the next kernel will see the folio as clean and might try
to reclaim it under memory pressure. This will result in losing user
data.
Mark all folios of the file as dirty, and always set the
MEMFD_LUO_FOLIO_DIRTY flag. This comes with the side effect of making all
clean folios un-reclaimable. This is a cost that has to be paid for
participants of live update. It is not expected to be a common use case
to preserve a lot of clean folios anyway.
Since the value of pfolio->flags is a constant now, drop the flags
variable and set it directly.
Link: https://lkml.kernel.org/r/20260223173931.2221759-3-pratyush@kernel.org
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: memfd_luo: fixes for folio flag preservation".
This series contains a couple fixes for flag preservation for memfd live
update.
The first patch fixes memfd preservation when fallocate() was used to
pre-allocate some pages. For these memfds, all the writes to fallocated
pages touched after preserve were lost.
The second patch fixes dirty flag tracking. If the dirty flag is not
tracked correctly, the next kernel might incorrectly reclaim some folios
under memory pressure, losing user data. This is a theoretical bug that I
observed when reading the code, and haven't been able to reproduce it.
This patch (of 2):
When a folio is added to a shmem file via fallocate, it is not zeroed on
allocation. This is done as a performance optimization since it is
possible the folio will never end up being used at all. When the folio is
used, shmem checks for the uptodate flag, and if absent, zeroes the folio
(and sets the flag) before returning to user.
With LUO, the flags of each folio are saved at preserve time. It is
possible to have a memfd with some folios fallocated but not uptodate.
For those, the uptodate flag doesn't get saved. The folios might later
end up being used and become uptodate. They would get passed to the next
kernel via KHO correctly since they did get preserved. But they won't
have the MEMFD_LUO_FOLIO_UPTODATE flag.
This means that when the memfd is retrieved, the folios will be added to
the shmem file without the uptodate flag. They will be zeroed before
first use, losing the data in those folios.
Since we take a big performance hit in allocating, zeroing, and pinning
all folios at prepare time anyway, take some more and zero all
non-uptodate ones too.
Later when there is a stronger need to make prepare faster, this can be
optimized.
To avoid racing with another uptodate operation, take the folio lock.
Link: https://lkml.kernel.org/r/20260223173931.2221759-2-pratyush@kernel.org
Fixes: b3749f174d ("mm: memfd_luo: allow preserving memfd")
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If task_work_add() failed, ctx->task is put but the tsync_works struct
is not reset to its previous state. The first consequence is that the
kernel allocates memory for dying threads, which could lead to
user-accounted memory exhaustion (not very useful nor specific to this
case). The second consequence is that task_work_cancel(), called by
cancel_tsync_works(), can dereference a NULL task pointer.
Fix these issues by keeping a consistent works->size wrt the added task
work. This is done in a new tsync_works_trim() helper which also cleans
up the shared_ctx and work fields.
As a safeguard, add a pointer check to cancel_tsync_works() and update
tsync_works_release() accordingly.
Cc: Jann Horn <jannh@google.com>
Reviewed-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20260217122341.2359582-1-mic@digikod.net
[mic: Replace memset() with compound literal]
Signed-off-by: Mickaël Salaün <mic@digikod.net>
bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding
any lock, making dsq->seq a lock-free concurrently accessed variable.
However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain
increment without the matching WRITE_ONCE() on the write side:
dsq->seq++;
^^^^^^^^^^^
plain write -- KCSAN data race
The KCSAN documentation requires that if one accessor uses READ_ONCE()
or WRITE_ONCE() on a variable to annotate lock-free access, all other
accesses must also use the appropriate accessor. A plain write leaves
the pair incomplete and will trigger KCSAN warnings.
Fix by using WRITE_ONCE() for the write side of the update:
WRITE_ONCE(dsq->seq, dsq->seq + 1);
This is consistent with bpf_iter_scx_dsq_new() and makes the
concurrent access annotation complete and KCSAN-clean.
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The GL9750 SD host controller has intermittent data corruption during
DMA write operations. The GM_BURST register's R_OSRC_Lmt field
(bits 17:16), which limits outstanding DMA read requests from system
memory, is not being cleared during initialization. The Windows driver
sets R_OSRC_Lmt to zero, limiting requests to the smallest unit.
Clear R_OSRC_Lmt to match the Windows driver behavior. This eliminates
write corruption verified with f3write/f3read tests while maintaining
DMA performance.
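As a quick illustration of the bit manipulation involved, the following
sketch clears bits 17:16 of a 32-bit value. The constant and function
names are hypothetical and do not come from the driver; only the bit
positions are taken from the description above.

```python
# Hypothetical constants for illustration; only the bit positions (17:16)
# come from the description above.
R_OSRC_LMT_SHIFT = 16
R_OSRC_LMT_MASK = 0x3 << R_OSRC_LMT_SHIFT  # bits 17:16 -> 0x00030000


def clear_r_osrc_lmt(gm_burst: int) -> int:
    """Return the 32-bit register value with bits 17:16 cleared."""
    return gm_burst & ~R_OSRC_LMT_MASK & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(clear_r_osrc_lmt(0xFFFFFFFF)))
```

All other bits of the register value are left untouched.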
Cc: stable@vger.kernel.org
Fixes: e51df6ce66 ("mmc: host: sdhci-pci: Add Genesys Logic GL975x support")
Closes: https://lore.kernel.org/linux-mmc/33d12807-5c72-41ce-8679-57aa11831fad@linux.dev/
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Matthew Schwartz <matthew.schwartz@linux.dev>
Reviewed-by: Ben Chuang <ben.chuang@genesyslogic.com.tw>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
To pick the changes in:
6517dfbcc9 ("KVM: x86: Add x2APIC "features" to control EOI broadcast suppression")
20c3c4108d ("KVM: SEV: Add KVM_SEV_SNP_ENABLE_REQ_CERTS command")
This silences this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Khushit Shah <khushit.shah@nutanix.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes from:
f24ef0093d ("KVM: x86: Advertise MOVRS CPUID to userspace")
f49ecf5e11 ("x86/cpufeature: Replace X86_FEATURE_SYSENTER32 with X86_FEATURE_SYSFAST32")
db5e824964 ("KVM: SVM: Virtualize and advertise support for ERAPS")
This causes these perf files to be rebuilt and brings some X86_FEATURE
that may be used by:
CC /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
CC /tmp/build/perf/bench/mem-memset-x86-64-asm.o
And addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
Please see tools/include/uapi/README for further details.
Cc: Amit Shah <amit.shah@amd.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Zhao Liu <zhao1.liu@intel.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes in:
f7ab71f178 ("KVM: s390: Add explicit padding to struct kvm_s390_keyop")
0ee4ddc164 ("KVM: s390: Storage key manipulation IOCTL")
fa9893fadb ("KVM: Introduce KVM_EXIT_SNP_REQ_CERTS for SNP certificate-fetching")
f174a9ffcd ("KVM: arm64: Add exit to userspace on {LD,ST}64B* outside of memslots")
That just rebuilds perf, as these patches add just one new KVM ioctl,
but for S390, that is not being considered by tools/perf/trace/beauty/kvm_ioctl.sh
so far.
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the changes in:
f3ec502b67 ("mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT")
241b3a0963 ("mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment")
That just adds some comments, so no changes in perf tooling, just silences
this build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/linux/gfp_types.h include/linux/gfp_types.h
Please see tools/include/uapi/README.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Update it as one comment got realigned, probably in a merge, so no
changes in perf tooling, just silences this build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up the change in:
a1fab3e69d ("x86/irq: Fix comment on IRQ vector layout")
That just adds one comment, so no changes in perf tooling, just silences
this build warning:
diff -u tools/perf/trace/beauty/arch/x86/include/asm/irq_vectors.h arch/x86/include/asm/irq_vectors.h
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick up changes from:
0e6b7eae1f ("fs: add FS_XFLAG_VERITY for fs-verity files")
These are used to beautify fs syscall arguments, albeit the changes in
this update are not affecting those beautifiers.
This addresses these tools/perf build warnings:
Warning: Kernel ABI header differences:
diff -u tools/perf/trace/beauty/include/uapi/linux/fs.h include/uapi/linux/fs.h
Please see tools/include/uapi/README.
Cc: Andrey Albershteyn <aalbersh@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To pick the changes from:
9b8a0ba682 ("mount: add OPEN_TREE_NAMESPACE")
0e5032237e ("statmount: accept fd as a parameter")
That doesn't change anything in tools this time as nothing that is
harvested by the beauty scripts got changed:
$ ls -1 tools/perf/trace/beauty/*mount*sh
tools/perf/trace/beauty/fsmount.sh
tools/perf/trace/beauty/mount_flags.sh
tools/perf/trace/beauty/move_mount_flags.sh
$
This addresses this perf build warning.
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h
Please see tools/include/uapi/README for further details.
Cc: Christian Brauner <brauner@kernel.org>
Cc: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently no target is specified to compile rust code when needed, which
breaks cross compilation. E.g. for arm64:
LD /tmp/build/tests/workloads/perf-test-in.o
aarch64-linux-gnu-ld: /tmp/build/tests/workloads/code_with_type.a(code_with_type.code_with_type.d12f4324cb53c560-cgu.0.rcgu.o): Relocations in generic ELF (EM: 62)
aarch64-linux-gnu-ld: /tmp/build/tests/workloads/code_with_type.a(code_with_type.code_with_type.d12f4324cb53c560-cgu.0.rcgu.o): Relocations in generic ELF (EM: 62)
[...repeated...]
aarch64-linux-gnu-ld: /tmp/build/tests/workloads/code_with_type.a(code_with_type.code_with_type.d12f4324cb53c560-cgu.0.rcgu.o): Relocations in generic ELF (EM: 62)
aarch64-linux-gnu-ld: /tmp/build/tests/workloads/code_with_type.a(code_with_type.code_with_type.d12f4324cb53c560-cgu.0.rcgu.o): Relocations in generic ELF (EM: 62)
aarch64-linux-gnu-ld: /tmp/build/tests/workloads/code_with_type.a: error adding symbols: file in wrong format
make[5]: *** [/perf/tools/build/Makefile.build:162: /tmp/build/tests/workloads/perf-test-in.o] Error 1
make[4]: *** [/perf/tools/build/Makefile.build:156: workloads] Error 2
make[3]: *** [/perf/tools/build/Makefile.build:156: tests] Error 2
make[2]: *** [Makefile.perf:785: /tmp/build/perf-test-in.o] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile.perf:289: sub-make] Error 2
make: *** [Makefile:76: all] Error 2
Detect required target and pass it via rust_flags to the compiler.
Note that CROSS_COMPILE might be different from what rust compiler
expects, since it may omit the target vendor value, e.g.
"aarch64-linux-gnu" instead of "aarch64-unknown-linux-gnu".
Thus explicitly map supported CROSS_COMPILE values to the corresponding Rust
targets, as suggested by Miguel Ojeda.
Tested using arm64 cross-compilation example from [1].
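The mapping idea can be sketched as follows. This is a hypothetical
illustration rather than the Makefile logic of the patch: the table and
the function name are made up, and only the aarch64 pair is taken from
the text above.

```python
from typing import List, Optional

# Only the aarch64 entry is taken from the text above; a real table
# would list every supported toolchain prefix.
RUST_TARGETS = {
    "aarch64-linux-gnu": "aarch64-unknown-linux-gnu",
}


def rust_target_flags(cross_compile: str) -> Optional[List[str]]:
    """Map a CROSS_COMPILE prefix to rustc target flags, or None if unknown."""
    prefix = cross_compile.rstrip("-")  # CROSS_COMPILE usually ends with "-"
    target = RUST_TARGETS.get(prefix)
    if target is None:
        return None  # native build or unsupported prefix: pass no target
    return ["--target", target]


if __name__ == "__main__":
    print(rust_target_flags("aarch64-linux-gnu-"))
```

An explicit table avoids guessing the vendor field, at the cost of
having to add an entry for each newly supported toolchain prefix.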
Fixes: 2e05bb52a1 ("perf test workload: Add code_with_type test workload")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dmitrii Dolgov <9erthalion6@gmail.com>
Cc: Levi Zim <i@kxxt.dev>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicolas Schier <nsc@kernel.org>
Link: https://perfwiki.github.io/main/arm64-cross-compilation-dockerfile/ [1]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Due to a recent change, building perf may result in a build error when
it is trying to "prune orphans".
The file list passed to "rm" may exceed what the shell can handle.
The build will then abort with an error like this:
TEST [...]/arm64/build/linux-custom/tools/perf/pmu-events/metric_test.log
make[5]: /bin/sh: Argument list too long
make[5]: *** [pmu-events/Build:217: prune_orphans] Error 127
make[5]: *** Waiting for unfinished jobs....
make[4]: *** [Makefile.perf:773: [...]/tools/perf/pmu-events/pmu-events-in.o] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [Makefile.perf:289: sub-make] Error 2
Processing the arguments via "xargs", instead of passing the list of
files directly to "rm" via the shell, prevents this issue.
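Conceptually, the benefit comes from splitting one oversized argument
list into several smaller command invocations. The sketch below shows
that idea only; it is not how xargs is implemented, and the function
name and the byte-counting rule are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def batch_arguments(paths: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield groups of paths whose combined length stays within max_bytes.

    Each path is counted as its encoded length plus one separator byte.
    A single path longer than max_bytes is still emitted on its own.
    """
    batch: List[str] = []
    used = 0
    for path in paths:
        cost = len(path.encode()) + 1
        if batch and used + cost > max_bytes:
            yield batch
            batch, used = [], 0
        batch.append(path)
        used += cost
    if batch:
        yield batch


if __name__ == "__main__":
    files = [f"f{i}.json" for i in range(10)]
    for group in batch_arguments(files, max_bytes=20):
        print("rm -f", " ".join(group))
```

Each group would be handed to a separate "rm" invocation, so no single
command line grows with the total number of files.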
Fixes: 36a1b0061a ("perf build: Reduce pmu-events related copying and mkdirs")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Markus Mayer <mmayer@broadcom.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The command length of in-target scales with the depth of the directory
times the number of objects in the Makefile.
When there are many objects, O=[absolute_path] is set, and the
absolute_path is relatively long, it is possible that the line
"$(call if_changed,$(host)ld_multi)" will
report the error: "make[4]: /bin/sh: Argument list too long"
For example, build perf tools with O=/long/output/path
Like built-in.a and *.mod rules in scripts/Makefile.build, add
$(objprefix)/ by the shell command instead of by Make's builtin
function.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Changqing Li <changqing.li@windriver.com>
Cc: Charlie Jenkins <charlie@rivosinc.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
xfs/1879 on one of my test VMs got stuck due to the xfs_io healthmon
subcommand sleeping in wait_event_interruptible at:
xfs_healthmon_read_iter+0x558/0x5f8 [xfs]
vfs_read+0x248/0x320
ksys_read+0x78/0x120
Looking at xfs_healthmon_read_iter, in !O_NONBLOCK mode it will sleep
until the mount cookie == DETACHED_MOUNT_COOKIE, there are events
waiting to be formatted, or there are formatted events in the read
buffer that could be copied to userspace.
Poking into the running kernel, I see that there are zero events in the
list, the read buffer is empty, and the mount cookie is indeed in
DETACHED state. IOWs, xfs_healthmon_has_eventdata should have returned
true, but instead we're asleep waiting for a wakeup.
I think what happened here is that xfs_healthmon_read_iter and
xfs_healthmon_unmount were racing with each other, and _read_iter lost
the race. _unmount queued an unmount event, which woke up _read_iter.
It found, formatted, and copied the event out to userspace. That
cleared out the pending event list and emptied the read buffer. xfs_io
then called read() again, so _has_eventdata decided that we should sleep
on the empty event queue.
Next, _unmount called xfs_healthmon_detach, which set the mount cookie
to DETACHED. Unfortunately, it didn't call wake_up_all on the hm, so
the wait_event_interruptible in the _read_iter thread remains asleep.
That's why the test stalled.
Fix this by moving the wake_up_all call to xfs_healthmon_detach.
Fixes: b3a289a2a9 ("xfs: create event queuing, formatting, and discovery infrastructure")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The scratch field in struct xfs_gc_bio is unused. Remove it.
Fixes: 102f444b57 ("xfs: rework zone GC buffer management")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
In pwrseq_pcie_m2_probe(), ctx->of_node acquires an explicit reference
to the device node using of_node_get(), but there is no corresponding
of_node_put() in the driver's error handling paths or removal.
Since the ctx is tied to the lifecycle of the platform device, there
is no need to hold an additional reference to the device's own of_node.
Fixes: 52e7b5bd62 ("power: sequencing: Add the Power Sequencing driver for the PCIe M.2 connectors")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Link: https://patch.msgid.link/20260302-m2-v1-1-a6533e18aa69@gmail.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
ionic_build_hdr() populated the Ethernet source MAC (hdr->eth.smac_h) by
passing the header’s storage directly to rdma_read_gid_l2_fields().
However, ib_ud_header_init() is called after that and re-initializes the
UD header, which wipes the previously written smac_h. As a result, packets
are emitted with a zero source MAC address on the wire.
Correct the source MAC by reading the GID-derived smac into a temporary
buffer and copying it after ib_ud_header_init() completes.
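The ordering problem can be shown with a toy model. Everything here is
a hypothetical stand-in: header_init() plays the role of
ib_ud_header_init(), a dict plays the role of the UD header, and the
MAC value is made up.

```python
ZERO_MAC = bytes(6)


def header_init(hdr: dict) -> None:
    """Stand-in for ib_ud_header_init(): resets every field of the header."""
    hdr.clear()
    hdr["smac"] = ZERO_MAC


def build_hdr_buggy(gid_smac: bytes) -> dict:
    """Write the MAC first, then initialize: the MAC is lost."""
    hdr: dict = {}
    hdr["smac"] = gid_smac  # written straight into the header...
    header_init(hdr)        # ...then wiped by the re-initialization
    return hdr


def build_hdr_fixed(gid_smac: bytes) -> dict:
    """Hold the MAC in a temporary and copy it in after initialization."""
    hdr: dict = {}
    smac = gid_smac         # keep the MAC in a temporary first
    header_init(hdr)
    hdr["smac"] = smac      # copy it in once the header is initialized
    return hdr


if __name__ == "__main__":
    mac = bytes.fromhex("021122334455")  # made-up example address
    print(build_hdr_buggy(mac)["smac"].hex(), build_hdr_fixed(mac)["smac"].hex())
```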
Fixes: e8521822c7 ("RDMA/ionic: Register device ops for control path")
Cc: stable@vger.kernel.org # 6.18
Signed-off-by: Abhijit Gangurde <abhijit.gangurde@amd.com>
Link: https://patch.msgid.link/20260227061809.2979990-1-abhijit.gangurde@amd.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
If IB_MR_REREG_TRANS is set during rereg_user_mr, the
umem will be released and a new one will be allocated
in irdma_rereg_mr_trans. If any step of irdma_rereg_mr_trans
fails after the new umem is allocated, it releases the umem,
but does not set iwmr->region to NULL. The problem is that
this failure is propagated to the user, who will then call
ibv_dereg_mr (as they should). Then, the dereg_mr path will
see a non-NULL umem and attempt to call ib_umem_release again.
Fix this by setting iwmr->region to NULL after ib_umem_release.
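A minimal user-space model of the double release, assuming a umem that
may be released only once. The classes and the clear_region switch are
hypothetical and only mirror the control flow described above.

```python
class Umem:
    """Toy stand-in for a umem; release() must run at most once."""

    def __init__(self) -> None:
        self.released = False

    def release(self) -> None:
        if self.released:
            raise RuntimeError("double release of umem")
        self.released = True


class MemoryRegion:
    """Toy stand-in for the MR object that owns a umem in 'region'."""

    def __init__(self) -> None:
        self.region = Umem()

    def rereg_trans_fails(self, clear_region: bool) -> None:
        """Model a failure in the re-registration path after the new umem exists."""
        self.region.release()
        if clear_region:
            self.region = None  # the fix: forget the released umem

    def dereg(self) -> None:
        """Model dereg_mr: release the umem only if one is still attached."""
        if self.region is not None:
            self.region.release()
```

Without clearing the pointer, dereg() in this model raises on the second
release; with it, dereg() becomes a no-op for the umem.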
Fixes: 5ac388db27 ("RDMA/irdma: Add support to re-register a memory region")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Link: https://patch.msgid.link/20260227152743.1183388-1-jmoroni@google.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
With crash hotplug support enabled, additional memory is allocated to
the elfcorehdr kexec segment to accommodate resources added during
memory hotplug events. However, the kdump FDT is not updated with the
same size, which can result in elfcorehdr corruption in the kdump
kernel.
Update elf_headers_sz (the kimage member representing the size of the
elfcorehdr kexec segment) to reflect the total memory allocated for the
elfcorehdr segment instead of the elfcorehdr buffer size at the time of
kdump load. This allows of_kexec_alloc_and_setup_fdt() to reserve the
full elfcorehdr memory in the kdump FDT and prevents elfcorehdr
corruption.
Fixes: 849599b702 ("powerpc/crash: add crash memory hotplug support")
Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20260227171801.2238847-1-sourabhjain@linux.ibm.com
Use explicit word-sized big-endian types for kexec and crash related
variables. This makes the endianness unambiguous and avoids type
mismatches that trigger sparse warnings.
The change addresses sparse warnings like below (seen on both 32-bit
and 64-bit builds):
CHECK ../arch/powerpc/kexec/core.c
sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_base
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned int static [addressable] [toplevel] [usertype] crashk_size
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned long long static [addressable] [toplevel] mem_limit
sparse: got restricted __be32 [usertype]
sparse: warning: incorrect type in assignment (different base types)
sparse: expected unsigned int static [addressable] [toplevel] [usertype] kernel_end
sparse: got restricted __be32 [usertype]
No functional change intended.
Fixes: ea961a828f ("powerpc: Fix endian issues in kexec and crash dump code")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512221405.VHPKPjnp-lkp@intel.com/
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/20251224151257.28672-1-sourabhjain@linux.ibm.com
Test robot reports the following error with clang-16.0.6:
In file included from kernel/rseq.c:75:
include/linux/rseq_entry.h:141:3: error: invalid operand for instruction
unsafe_get_user(offset, &ucs->post_commit_offset, efault);
^
include/linux/uaccess.h:608:2: note: expanded from macro 'unsafe_get_user'
arch_unsafe_get_user(x, ptr, local_label); \
^
arch/powerpc/include/asm/uaccess.h:518:2: note: expanded from macro 'arch_unsafe_get_user'
__get_user_size_goto(__gu_val, __gu_addr, sizeof(*(p)), e); \
^
arch/powerpc/include/asm/uaccess.h:284:2: note: expanded from macro '__get_user_size_goto'
__get_user_size_allowed(x, ptr, size, __gus_retval); \
^
arch/powerpc/include/asm/uaccess.h:275:10: note: expanded from macro '__get_user_size_allowed'
case 8: __get_user_asm2(x, (u64 __user *)ptr, retval); break; \
^
arch/powerpc/include/asm/uaccess.h:258:4: note: expanded from macro '__get_user_asm2'
" li %1+1,0\n" \
^
<inline asm>:7:5: note: instantiated into assembly here
li 31+1,0
^
1 error generated.
On PPC32, a pair of registers is used for 64 bits vars. Usually the
lower register in the pair is the high part and the higher register is
the low part. GCC uses r3/r4 ... r11/r12 ... r14/r15 ... r30/r31.
In older kernel code, inline assembly was using %1 and %1+1 to represent
64 bits values. However, here it looks like clang uses r31 as the high part,
although r32 doesn't exist, hence the error.
Although %1+1 should work, most places now use %L1 instead of %1+1, so
let's do the same here.
With that change, the build doesn't fail anymore and a disassembly shows
clang uses the r17/r18 and r31/r14 pairs where GCC would have used r16/r17 and
r30/r31:
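The difference between the two spellings can be modelled with plain
register numbers. This is a rough sketch, not a description of how
either compiler is implemented: the function names are hypothetical,
and "%L1" is modelled simply as "the second register of the pair the
compiler chose". The register pairs are the ones quoted in this message.

```python
NUM_GPRS = 32  # PPC32 general purpose registers are r0..r31


def expand_plus_one(high_reg: int) -> int:
    """Model "%1+1": the assembler adds one to the high register number."""
    return high_reg + 1


def expand_L(pair: tuple) -> int:
    """Model "%L1": use whichever second register the compiler chose."""
    _high, low = pair
    return low


def is_valid_gpr(reg: int) -> bool:
    """True if reg names an existing general purpose register."""
    return 0 <= reg < NUM_GPRS


if __name__ == "__main__":
    for pair in [(30, 31), (31, 14)]:
        print(pair, is_valid_gpr(expand_plus_one(pair[0])), is_valid_gpr(expand_L(pair)))
```

For the r30/r31 pair both spellings name r31, but for r31/r14 the
"%1+1" form yields the nonexistent r32 while the "%L1" form yields r14.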
Disassembly of section .fixup:
00000000 <.fixup>:
0: 38 a0 ff f2 li r5,-14
4: 3a 20 00 00 li r17,0
8: 3a 40 00 00 li r18,0
c: 48 00 00 00 b c <.fixup+0xc>
c: R_PPC_REL24 .text+0xbc
10: 38 a0 ff f2 li r5,-14
14: 3b e0 00 00 li r31,0
18: 39 c0 00 00 li r14,0
1c: 48 00 00 00 b 1c <.fixup+0x1c>
1c: R_PPC_REL24 .text+0x144
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602021825.otcItxGi-lkp@intel.com/
Fixes: c20beffeec ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()")
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Acked-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/8ca3a657a650e497a96bfe7acde2f637dadab344.1770103646.git.chleroy@kernel.org
Today there are two PTE formats for e500:
- The 64 bits format, used
- On 64 bits kernel
- On 32 bits kernel with 64 bits physical addresses
- On 32 bits kernel with support of huge pages
- The 32 bits format, used in other cases
Maintaining two PTE formats means unnecessary maintenance burden
because every change needs to be implemented and tested for both
formats.
Remove the 32 bits PTE format. The memory usage increase due to
larger PTEs is minimal (approx. 0.1% of memory).
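A back-of-the-envelope check of that figure, as a small Python sketch. It
assumes 4 KiB pages and one PTE per mapped page, neither of which is stated
in this message, and ignores page-table pages that are only partly used.

```python
PAGE_SIZE = 4096     # assumed 4 KiB pages
OLD_PTE_SIZE = 4     # bytes per PTE in the 32 bits format
NEW_PTE_SIZE = 8     # bytes per PTE in the 64 bits format


def pte_overhead_fraction(page_size=PAGE_SIZE):
    """Extra page-table bytes per byte of mapped memory."""
    return (NEW_PTE_SIZE - OLD_PTE_SIZE) / page_size


print(f"{pte_overhead_fraction():.4%}")  # 0.0977%
```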
This also means that from now on huge pages are supported also
with 32 bits physical addresses.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Link: https://patch.msgid.link/04a658209ea78dcc0f3dbde6b2c29cf1939adfe9.1767721208.git.chleroy@kernel.org
If `CONFIG_BLOCK` is disabled, the following warnings are displayed
during build:
warning: struct `NullTerminatedFormatter` is never constructed
--> ../rust/kernel/str.rs:667:19
|
667 | pub(crate) struct NullTerminatedFormatter<'a> {
| ^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` (part of `#[warn(unused)]`) on by default
warning: associated function `new` is never used
--> ../rust/kernel/str.rs:673:19
|
671 | impl<'a> NullTerminatedFormatter<'a> {
| ------------------------------------ associated function in this implementation
672 | /// Create a new [`Self`] instance.
673 | pub(crate) fn new(buffer: &'a mut [u8]) -> Option<NullTerminatedFormatter<'a>> {
Fix them by making `NullTerminatedFormatter` public, as it could be
useful for drivers anyway.
Fixes: cdde7a1951 ("rust: str: introduce `NullTerminatedFormatter`")
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260224-nullterminatedformatter-v1-1-5bef7b9b3d4c@nvidia.com
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Update the X2-85 GPU's register protect count configuration with the
correct count_max value to avoid blocking the entire MMIO region from the
UMD.
Protect configurations are a bit complicated on A8xx. There are two sets of
protect registers with different counts: Global and Pipe-specific. The
last-span-unbound feature is available only on the Pipe-specific protect
registers. Due to this, we cannot use the BUILD_BUG sanity check for A8xx
protect configurations, so remove the A840 entry from there.
Fixes: 01ff3bf272 ("drm/msm/a8xx: Add support for Adreno X2-85 GPU")
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/706944/
Message-ID: <20260225-glymur-protect-fix-v1-1-0deddedf9277@oss.qualcomm.com>
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
On modern Intel platforms, the intel_pstate driver is commonly used and
it provides turbo boost control via
/sys/devices/system/cpu/intel_pstate/no_turbo.
However, cpupower doesn't handle this. It
1. shows turbo boost as "active" blindly for Intel platforms
2. controls turbo boost functionality via the generic
/sys/devices/system/cpu/cpufreq/boost sysfs interface only.
Enhance the cpupower tool to ensure the "--boost" command works
seamlessly on Intel platforms with intel_pstate driver running.
Without this patch,
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
1
$ sudo cpupower frequency-info --boost
analyzing CPU 21:
boost state support:
Supported: yes
Active: yes
$ sudo cpupower set --boost 0
Error setting turbo-boost
$ sudo cpupower set --boost 1
Error setting turbo-boost
With this patch,
$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
$ sudo cpupower set --boost 0
$ sudo cpupower frequency-info --boost
analyzing CPU 21:
boost state support:
Supported: yes
Active: no
$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
1
$ sudo cpupower set --boost 1
$ sudo cpupower frequency-info --boost
analyzing CPU 28:
boost state support:
Supported: yes
Active: yes
$ cat /sys/devices/system/cpu/intel_pstate/no_turbo
0
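The relationship between the "--boost" value and the no_turbo file in the
transcript above is an inversion. A minimal Python sketch of that mapping
follows; the helper names are made up for illustration and are not the
functions cpupower uses.

```python
def boost_active(no_turbo_text):
    """Turbo boost is active when intel_pstate/no_turbo reads 0."""
    return no_turbo_text.strip() == "0"


def no_turbo_for_boost(boost):
    """Value to write to no_turbo for a given '--boost' argument (0 or 1)."""
    return "0" if boost else "1"


print(boost_active("1\n"))    # False
print(no_turbo_for_boost(0))  # 1
```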
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
Extend the systemd service so that it can be used for tuning the Energy
Performance Preference (EPP) as well. Available options can be read from
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_available_preferences.
The desired one can then be set in cpupower-service.conf.
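A hedged Python sketch of how a chosen preference could be validated against
that sysfs file before it is applied. The helper names and the sample string
are illustrative assumptions; the real list depends on the driver and CPU,
and the service itself is not implemented this way.

```python
def parse_available_epp(text):
    """Split the space-separated sysfs contents into a list of names."""
    return text.split()


def check_epp(choice, available):
    """Return the choice if the platform offers it, else raise ValueError."""
    if choice not in available:
        raise ValueError(f"{choice!r} not in {available}")
    return choice


# Example contents only, standing in for what was read from sysfs.
sample = "default performance balance_performance balance_power power\n"
print(check_epp("balance_power", parse_available_epp(sample)))
```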
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
cxl_detach_ep() is called during bottom-up removal when all CXL memory
devices beneath a switch port have been removed. For each port in the
hierarchy it locks both the port and its parent, removes the endpoint,
and if the port is now empty, marks it dead and unregisters the port
by calling delete_switch_port(). There are two places during this work
where the parent_port may be used after freeing:
First, a concurrent detach may have already processed a port by the
time a second worker finds it via bus_find_device(). Without pinning
parent_port, it may already be freed when we discover port->dead and
attempt to unlock the parent_port. In a production kernel that's a
silent memory corruption. With lock debugging enabled, it looks like this:
[]DEBUG_LOCKS_WARN_ON(__owner_task(owner) != get_current())
[]WARNING: kernel/locking/mutex.c:949 at __mutex_unlock_slowpath+0x1ee/0x310
[]Call Trace:
[]mutex_unlock+0xd/0x20
[]cxl_detach_ep+0x180/0x400 [cxl_core]
[]devm_action_release+0x10/0x20
[]devres_release_all+0xa8/0xe0
[]device_unbind_cleanup+0xd/0xa0
[]really_probe+0x1a6/0x3e0
Second, delete_switch_port() releases three devm actions registered
against parent_port. The last of those is unregister_port() and it
calls device_unregister() on the child port, which can cascade. If
parent_port is now also empty the device core may unregister and free
it too. So by the time delete_switch_port() returns, parent_port may
be free, and the subsequent device_unlock(&parent_port->dev) operates
on freed memory. The kernel log looks the same as above, with a different
offset in cxl_detach_ep().
Both of these issues stem from the absence of a lifetime guarantee
between a child port and its parent port.
Establish a lifetime rule for ports: child ports hold a reference to
their parent device until release. Take the reference when the port
is allocated and drop it when released. This ensures the parent is
valid for the full lifetime of the child and eliminates the use after
free window in cxl_detach_ep().
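The lifetime rule can be pictured with a toy refcount model in Python. The
Device class below is a stand-in invented for this example and does not
mirror the driver core API; it only shows that a parent pinned by its child
cannot be freed before the child is released.

```python
class Device:
    """Toy refcounted object standing in for a port device."""

    def __init__(self, name, parent=None):
        self.name = name
        self.refs = 1              # reference held by the creator
        self.freed = False
        self.parent = parent
        if parent is not None:
            parent.get()           # child pins its parent at allocation

    def get(self):
        assert not self.freed, f"use after free of {self.name}"
        self.refs += 1

    def put(self):
        assert not self.freed, f"use after free of {self.name}"
        self.refs -= 1
        if self.refs == 0:
            self.freed = True
            if self.parent is not None:
                self.parent.put()  # dropped only when the child is released


parent = Device("parent_port")
child = Device("child_port", parent)
parent.put()                # the creator's reference goes away first
assert not parent.freed     # still pinned by the child
child.put()                 # releasing the child releases the parent too
assert parent.freed
```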
This is easily reproduced with a reload of cxl_acpi in QEMU with CXL
devices present.
Fixes: 2345df5424 ("cxl/memdev: Fix endpoint port removal")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260226184439.1732841-1-alison.schofield@intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
If we fail to remove an item from the uuid tree, we don't need to abort
the transaction since we have not made any changes before. So remove that
transaction abort.
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we failed to update the root we don't abort the transaction, which is
wrong since we already used the transaction to remove an item from the
uuid tree.
Fixes: dd5f9615fc ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the set received ioctl fails due to an item overflow when attempting to
add the BTRFS_UUID_KEY_RECEIVED_SUBVOL we have to abort the transaction
since we did some metadata updates before.
This means that if a user calls this ioctl with the same received UUID
field for a lot of subvolumes, we will hit the overflow, trigger the
transaction abort and turn the filesystem into RO mode. A malicious user
could exploit this, and this ioctl does not even require that a user
has admin privileges (CAP_SYS_ADMIN), only that they own the subvolume.
Fix this by doing an early check for item overflow before starting a
transaction. This is also race safe because we are holding the subvol_sem
semaphore in exclusive (write) mode.
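The check-before-modify ordering can be sketched with a toy model in Python.
UuidTree, the 8-byte id size and the size limit are assumptions made for
this illustration, not btrfs data structures.

```python
import errno


class UuidTree:
    """Toy uuid tree: each uuid maps to one item holding subvolume ids."""

    def __init__(self, max_item_size):
        self.max_item_size = max_item_size
        self.items = {}

    def would_overflow(self, uuid, id_size=8):
        used = len(self.items.get(uuid, [])) * id_size
        return used + id_size > self.max_item_size


def set_received_uuid(tree, uuid, subvol_id):
    # Early check: nothing has been modified yet, so failing here is harmless.
    if tree.would_overflow(uuid):
        return -errno.EOVERFLOW
    # The "transaction" starts here; a later failure would need an abort.
    tree.items.setdefault(uuid, []).append(subvol_id)
    return 0


tree = UuidTree(max_item_size=16)
print([set_received_uuid(tree, "u", i) == 0 for i in range(3)])
# [True, True, False]
```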
A test case for fstests will follow soon.
Fixes: dd5f9615fc ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently a user can trigger a transaction abort by snapshotting a
previously received snapshot a bunch of times until we reach a
BTRFS_UUID_KEY_RECEIVED_SUBVOL item overflow (the maximum item size we
can store in a leaf). This is very likely not common in practice, but
if it happens, it turns the filesystem into RO mode. The snapshot, send,
set_received_subvol and subvol_setflags (used by receive) ioctls don't
require CAP_SYS_ADMIN, just inode_owner_or_capable(). A malicious user
could use this to turn a filesystem into RO mode and disrupt a system.
Reproducer script:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# Create a subvolume and set it to RO so that it can be used for send.
btrfs subvolume create $MNT/sv
touch $MNT/sv/foo
btrfs property set $MNT/sv ro true
# Send and receive the subvolume into snaps/sv.
mkdir $MNT/snaps
btrfs send $MNT/sv | btrfs receive $MNT/snaps
# Now snapshot the received subvolume, which has a received_uuid, a
# lot of times to trigger the leaf overflow.
total=500
for ((i = 1; i <= $total; i++)); do
echo -ne "\rCreating snapshot $i/$total"
btrfs subvolume snapshot -r $MNT/snaps/sv $MNT/snaps/sv_$i > /dev/null
done
echo
umount $MNT
When running the test:
$ ./test.sh
(...)
Create subvolume '/mnt/sdi/sv'
At subvol /mnt/sdi/sv
At subvol sv
Creating snapshot 496/500ERROR: Could not create subvolume: Value too large for defined data type
Creating snapshot 497/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 498/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 499/500ERROR: Could not create subvolume: Read-only file system
Creating snapshot 500/500ERROR: Could not create subvolume: Read-only file system
And in dmesg/syslog:
$ dmesg
(...)
[251067.627338] BTRFS warning (device sdi): insert uuid item failed -75 (0x4628b21c4ac8d898, 0x2598bee2b1515c91) type 252!
[251067.629212] ------------[ cut here ]------------
[251067.630033] BTRFS: Transaction aborted (error -75)
[251067.630871] WARNING: fs/btrfs/transaction.c:1907 at create_pending_snapshot.cold+0x52/0x465 [btrfs], CPU#10: btrfs/615235
[251067.632851] Modules linked in: btrfs dm_zero (...)
[251067.644071] CPU: 10 UID: 0 PID: 615235 Comm: btrfs Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[251067.646165] Tainted: [W]=WARN
[251067.646733] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[251067.648735] RIP: 0010:create_pending_snapshot.cold+0x55/0x465 [btrfs]
[251067.649984] Code: f0 48 0f (...)
[251067.653313] RSP: 0018:ffffce644908fae8 EFLAGS: 00010292
[251067.653987] RAX: 00000000ffffff01 RBX: ffff8e5639e63a80 RCX: 00000000ffffffd3
[251067.655042] RDX: ffff8e53faa76b00 RSI: 00000000ffffffb5 RDI: ffffffffc0919750
[251067.656077] RBP: ffffce644908fbd8 R08: 0000000000000000 R09: ffffce644908f820
[251067.657068] R10: ffff8e5adc1fffa8 R11: 0000000000000003 R12: ffff8e53c0431bd0
[251067.658050] R13: ffff8e5414593600 R14: ffff8e55efafd000 R15: 00000000ffffffb5
[251067.659019] FS: 00007f2a4944b3c0(0000) GS:ffff8e5b27dae000(0000) knlGS:0000000000000000
[251067.660115] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[251067.660943] CR2: 00007ffc5aa57898 CR3: 00000005813a2003 CR4: 0000000000370ef0
[251067.661972] Call Trace:
[251067.662292] <TASK>
[251067.662653] create_pending_snapshots+0x97/0xc0 [btrfs]
[251067.663413] btrfs_commit_transaction+0x26e/0xc00 [btrfs]
[251067.664257] ? btrfs_qgroup_convert_reserved_meta+0x35/0x390 [btrfs]
[251067.665238] ? _raw_spin_unlock+0x15/0x30
[251067.665837] ? record_root_in_trans+0xa2/0xd0 [btrfs]
[251067.666531] btrfs_mksubvol+0x330/0x580 [btrfs]
[251067.667145] btrfs_mksnapshot+0x74/0xa0 [btrfs]
[251067.667827] __btrfs_ioctl_snap_create+0x194/0x1d0 [btrfs]
[251067.668595] btrfs_ioctl_snap_create_v2+0x107/0x130 [btrfs]
[251067.669479] btrfs_ioctl+0x1580/0x2690 [btrfs]
[251067.670093] ? count_memcg_events+0x6d/0x180
[251067.670849] ? handle_mm_fault+0x1a0/0x2a0
[251067.671652] __x64_sys_ioctl+0x92/0xe0
[251067.672406] do_syscall_64+0x50/0xf20
[251067.673129] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[251067.674096] RIP: 0033:0x7f2a495648db
[251067.674812] Code: 00 48 89 (...)
[251067.678227] RSP: 002b:00007ffc5aa57840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[251067.679691] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2a495648db
[251067.681145] RDX: 00007ffc5aa588b0 RSI: 0000000050009417 RDI: 0000000000000004
[251067.682511] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[251067.683842] R10: 000000000000000a R11: 0000000000000246 R12: 00007ffc5aa59910
[251067.685176] R13: 00007ffc5aa588b0 R14: 0000000000000004 R15: 0000000000000006
[251067.686524] </TASK>
[251067.686972] ---[ end trace 0000000000000000 ]---
[251067.687890] BTRFS: error (device sdi state A) in create_pending_snapshot:1907: errno=-75 unknown
[251067.689049] BTRFS info (device sdi state EA): forced readonly
[251067.689054] BTRFS warning (device sdi state EA): Skipping commit of aborted transaction.
[251067.690119] BTRFS: error (device sdi state EA) in cleanup_transaction:2043: errno=-75 unknown
[251067.702028] BTRFS info (device sdi state EA): last unmount of filesystem 46dc3975-30a2-4a69-a18f-418b859cccda
Fix this by ignoring -EOVERFLOW errors from btrfs_uuid_tree_add() in the
snapshot creation code when attempting to add the
BTRFS_UUID_KEY_RECEIVED_SUBVOL item. This is OK because it's not critical
and we are still able to delete the snapshot, as snapshot/subvolume
deletion ignores if a BTRFS_UUID_KEY_RECEIVED_SUBVOL is missing (see
inode.c:btrfs_delete_subvolume()). As for send/receive, we can still do
send/receive operations since it always peeks the first root ID in the
existing BTRFS_UUID_KEY_RECEIVED_SUBVOL (it could peek any since all
snapshots have the same content), and even if the key is missing, it
falls back to searching by BTRFS_UUID_KEY_SUBVOL key.
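The error handling described above can be sketched as follows in Python. The
two callables are hypothetical stand-ins passed in for the example and do
not reproduce the kernel signatures.

```python
import errno


def add_received_uuid_item(uuid_tree_add, abort_transaction):
    """Treat an overflow of the received-subvol item as non-fatal."""
    ret = uuid_tree_add()
    if ret == -errno.EOVERFLOW:
        ret = 0                    # item is not critical, carry on
    if ret < 0:
        abort_transaction(ret)     # any other failure still aborts
    return ret


aborted = []
print(add_received_uuid_item(lambda: -errno.EOVERFLOW, aborted.append))  # 0
print(aborted)                                                           # []
```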
A test case for fstests will be sent soon.
Fixes: dd5f9615fc ("Btrfs: maintain subvolume items in the UUID tree")
CC: stable@vger.kernel.org # 3.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we attempt to create several files with names that result in the same
hash, we have to pack them in the same dir item and that has a limit inherent
to the leaf size. However if we reach that limit, we trigger a transaction
abort and turn the filesystem into RO mode. This allows a malicious
user to disrupt a system, without the need to have administration
privileges/capabilities.
Reproducer:
$ cat exploit-hash-collisions.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
# Use smallest node size to make the test faster and require fewer file
# names that result in hash collision.
mkfs.btrfs -f --nodesize 4K $DEV
mount $DEV $MNT
# List of names that result in the same crc32c hash for btrfs.
declare -a names=(
'foobar'
'%a8tYkxfGMLWRGr55QSeQc4PBNH9PCLIvR6jZnkDtUUru1t@RouaUe_L:@xGkbO3nCwvLNYeK9vhE628gss:T$yZjZ5l-Nbd6CbC$M=hqE-ujhJICXyIxBvYrIU9-TDC'
'AQci3EUB%shMsg-N%frgU:02ByLs=IPJU0OpgiWit5nexSyxZDncY6WB:=zKZuk5Zy0DD$Ua78%MelgBuMqaHGyKsJUFf9s=UW80PcJmKctb46KveLSiUtNmqrMiL9-Y0I_l5Fnam04CGIg=8@U:Z'
'CvVqJpJzueKcuA$wqwePfyu7VxuWNN3ho$p0zi2H8QFYK$7YlEqOhhb%:hHgjhIjW5vnqWHKNP4'
'ET:vk@rFU4tsvMB0$C_p=xQHaYZjvoF%-BTc%wkFW8yaDAPcCYoR%x$FH5O:'
'HwTon%v7SGSP4FE08jBwwiu5aot2CFKXHTeEAa@38fUcNGOWvE@Mz6WBeDH_VooaZ6AgsXPkVGwy9l@@ZbNXabUU9csiWrrOp0MWUdfi$EZ3w9GkIqtz7I_eOsByOkBOO'
'Ij%2VlFGXSuPvxJGf5UWy6O@1svxGha%b@=%wjkq:CIgE6u7eJOjmQY5qTtxE2Rjbis9@us'
'KBkjG5%9R8K9sOG8UTnAYjxLNAvBmvV5vz3IiZaPmKuLYO03-6asI9lJ_j4@6Xo$KZicaLWJ3Pv8XEwVeUPMwbHYWwbx0pYvNlGMO9F:ZhHAwyctnGy%_eujl%WPd4U2BI7qooOSr85J-C2V$LfY'
'NcRfDfuUQ2=zP8K3CCF5dFcpfiOm6mwenShsAb_F%n6GAGC7fT2JFFn:c35X-3aYwoq7jNX5$ZJ6hI3wnZs$7KgGi7wjulffhHNUxAT0fRRLF39vJ@NvaEMxsMO'
'Oj42AQAEzRoTxa5OuSKIr=A_lwGMy132v4g3Pdq1GvUG9874YseIFQ6QU'
'Ono7avN5GjC:_6dBJ_'
'WHmN2gnmaN-9dVDy4aWo:yNGFzz8qsJyJhWEWcud7$QzN2D9R0efIWWEdu5kwWr73NZm4=@CoCDxrrZnRITr-kGtU_cfW2:%2_am'
'WiFnuTEhAG9FEC6zopQmj-A-$LDQ0T3WULz%ox3UZAPybSV6v1Z$b4L_XBi4M4BMBtJZpz93r9xafpB77r:lbwvitWRyo$odnAUYlYMmU4RvgnNd--e=I5hiEjGLETTtaScWlQp8mYsBovZwM2k'
'XKyH=OsOAF3p%uziGF_ZVr$ivrvhVgD@1u%5RtrV-gl_vqAwHkK@x7YwlxX3qT6WKKQ%PR56NrUBU2dOAOAdzr2=5nJuKPM-T-$ZpQfCL7phxQbUcb:BZOTPaFExc-qK-gDRCDW2'
'd3uUR6OFEwZr%ns1XH_@tbxA@cCPmbBRLdyh7p6V45H$P2$F%w0RqrD3M0g8aGvWpoTFMiBdOTJXjD:JF7=h9a_43xBywYAP%r$SPZi%zDg%ql-KvkdUCtF9OLaQlxmd'
'ePTpbnit%hyNm@WELlpKzNZYOzOTf8EQ$sEfkMy1VOfIUu3coyvIr13-Y7Sv5v-Ivax2Go_GQRFMU1b3362nktT9WOJf3SpT%z8sZmM3gvYQBDgmKI%%RM-G7hyrhgYflOw%z::ZRcv5O:lDCFm'
'evqk743Y@dvZAiG5J05L_ROFV@$2%rVWJ2%3nxV72-W7$e$-SK3tuSHA2mBt$qloC5jwNx33GmQUjD%akhBPu=VJ5g$xhlZiaFtTrjeeM5x7dt4cHpX0cZkmfImndYzGmvwQG:$euFYmXn$_2rA9mKZ'
'gkgUtnihWXsZQTEkrMAWIxir09k3t7jk_IK25t1:cy1XWN0GGqC%FrySdcmU7M8MuPO_ppkLw3=Dfr0UuBAL4%GFk2$Ma10V1jDRGJje%Xx9EV2ERaWKtjpwiZwh0gCSJsj5UL7CR8RtW5opCVFKGGy8Cky'
'hNgsG_8lNRik3PvphqPm0yEH3P%%fYG:kQLY=6O-61Wa6nrV_WVGR6TLB09vHOv%g4VQRP8Gzx7VXUY1qvZyS'
'isA7JVzN12xCxVPJZ_qoLm-pTBuhjjHMvV7o=F:EaClfYNyFGlsfw-Kf%uxdqW-kwk1sPl2vhbjyHU1A6$hz'
'kiJ_fgcdZFDiOptjgH5PN9-PSyLO4fbk_:u5_2tz35lV_iXiJ6cx7pwjTtKy-XGaQ5IefmpJ4N_ZqGsqCsKuqOOBgf9LkUdffHet@Wu'
'lvwtxyhE9:%Q3UxeHiViUyNzJsy:fm38pg_b6s25JvdhOAT=1s0$pG25x=LZ2rlHTszj=gN6M4zHZYr_qrB49i=pA--@WqWLIuX7o1S_SfS@2FSiUZN'
'rC24cw3UBDZ=5qJBUMs9e$=S4Y94ni%Z8639vnrGp=0Hv4z3dNFL0fBLmQ40=EYIY:Z=SLc@QLMSt2zsss2ZXrP7j4='
'uwGl2s-fFrf@GqS=DQqq2I0LJSsOmM%xzTjS:lzXguE3wChdMoHYtLRKPvfaPOZF2fER@j53evbKa7R%A7r4%YEkD=kicJe@SFiGtXHbKe4gCgPAYbnVn'
'UG37U6KKua2bgc:IHzRs7BnB6FD:2Mt5Cc5NdlsW%$1tyvnfz7S27FvNkroXwAW:mBZLA1@qa9WnDbHCDmQmfPMC9z-Eq6QT0jhhPpqyymaD:R02ghwYo%yx7SAaaq-:x33LYpei$5g8DMl3C'
'y2vjek0FE1PDJC0qpfnN:x8k2wCFZ9xiUF2ege=JnP98R%wxjKkdfEiLWvQzmnW'
'8-HCSgH5B%K7P8_jaVtQhBXpBk:pE-$P7ts58U0J@iR9YZntMPl7j$s62yAJO@_9eanFPS54b=UTw$94C-t=HLxT8n6o9P=QnIxq-f1=Ne2dvhe6WbjEQtc'
'YPPh:IFt2mtR6XWSmjHptXL_hbSYu8bMw-JP8@PNyaFkdNFsk$M=xfL6LDKCDM-mSyGA_2MBwZ8Dr4=R1D%7-mCaaKGxb990jzaagRktDTyp'
'9hD2ApKa_t_7x-a@GCG28kY:7$M@5udI1myQ$x5udtggvagmCQcq9QXWRC5hoB0o-_zHQUqZI5rMcz_kbMgvN5jr63LeYA4Cj-c6F5Ugmx6DgVf@2Jqm%MafecpgooqreJ53P-QTS'
)
# Now create files with all those names in the same parent directory.
# It should not fail since a 4K leaf has enough space for them.
for name in "${names[@]}"; do
touch $MNT/$name
done
# Now add one more file name that causes a crc32c hash collision.
# This should fail, but it should not turn the filesystem into RO mode
# (which could be exploited by malicious users) due to a transaction
# abort.
touch $MNT/'W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt'
# Check that we are able to create another file, with a name that does not cause
# a crc32c hash collision.
echo -n "hello world" > $MNT/baz
# Unmount and mount again, verify file baz exists and with the right content.
umount $MNT
mount $DEV $MNT
echo "File baz content: $(cat $MNT/baz)"
umount $MNT
When running the reproducer:
$ ./exploit-hash-collisions.sh
(...)
touch: cannot touch '/mnt/sdi/W6tIm-VK2@BGC@IBfcgg6j_p:pxp_QUqtWpGD5Ok_GmijKOJJt': Value too large for defined data type
./exploit-hash-collisions.sh: line 57: /mnt/sdi/baz: Read-only file system
cat: /mnt/sdi/baz: No such file or directory
File baz content:
And the transaction abort stack trace in dmesg/syslog:
$ dmesg
(...)
[758240.509761] ------------[ cut here ]------------
[758240.510668] BTRFS: Transaction aborted (error -75)
[758240.511577] WARNING: fs/btrfs/inode.c:6854 at btrfs_create_new_inode+0x805/0xb50 [btrfs], CPU#6: touch/888644
[758240.513513] Modules linked in: btrfs dm_zero (...)
[758240.523221] CPU: 6 UID: 0 PID: 888644 Comm: touch Tainted: G W 6.19.0-rc8-btrfs-next-225+ #1 PREEMPT(full)
[758240.524621] Tainted: [W]=WARN
[758240.525037] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[758240.526331] RIP: 0010:btrfs_create_new_inode+0x80b/0xb50 [btrfs]
[758240.527093] Code: 0f 82 cf (...)
[758240.529211] RSP: 0018:ffffce64418fbb48 EFLAGS: 00010292
[758240.529935] RAX: 00000000ffffffd3 RBX: 0000000000000000 RCX: 00000000ffffffb5
[758240.531040] RDX: 0000000d04f33e06 RSI: 00000000ffffffb5 RDI: ffffffffc0919dd0
[758240.531920] RBP: ffffce64418fbc10 R08: 0000000000000000 R09: 00000000ffffffb5
[758240.532928] R10: 0000000000000000 R11: ffff8e52c0000000 R12: ffff8e53eee7d0f0
[758240.533818] R13: ffff8e57f70932a0 R14: ffff8e5417629568 R15: 0000000000000000
[758240.534664] FS: 00007f1959a2a740(0000) GS:ffff8e5b27cae000(0000) knlGS:0000000000000000
[758240.535821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[758240.536644] CR2: 00007f1959b10ce0 CR3: 000000012a2cc005 CR4: 0000000000370ef0
[758240.537517] Call Trace:
[758240.537828] <TASK>
[758240.538099] btrfs_create_common+0xbf/0x140 [btrfs]
[758240.538760] path_openat+0x111a/0x15b0
[758240.539252] do_filp_open+0xc2/0x170
[758240.539699] ? preempt_count_add+0x47/0xa0
[758240.540200] ? __virt_addr_valid+0xe4/0x1a0
[758240.540800] ? __check_object_size+0x1b3/0x230
[758240.541661] ? alloc_fd+0x118/0x180
[758240.542315] do_sys_openat2+0x70/0xd0
[758240.543012] __x64_sys_openat+0x50/0xa0
[758240.543723] do_syscall_64+0x50/0xf20
[758240.544462] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[758240.545397] RIP: 0033:0x7f1959abc687
[758240.546019] Code: 48 89 fa (...)
[758240.548522] RSP: 002b:00007ffe16ff8690 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[758240.566278] RAX: ffffffffffffffda RBX: 00007f1959a2a740 RCX: 00007f1959abc687
[758240.567068] RDX: 0000000000000941 RSI: 00007ffe16ffa333 RDI: ffffffffffffff9c
[758240.567860] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[758240.568707] R10: 00000000000001b6 R11: 0000000000000202 R12: 0000561eec7c4b90
[758240.569712] R13: 0000561eec7c311f R14: 00007ffe16ffa333 R15: 0000000000000000
[758240.570758] </TASK>
[758240.571040] ---[ end trace 0000000000000000 ]---
[758240.571681] BTRFS: error (device sdi state A) in btrfs_create_new_inode:6854: errno=-75 unknown
[758240.572899] BTRFS info (device sdi state EA): forced readonly
Fix this by checking for a hash collision, and whether adding the new name is
still possible, early in btrfs_create_new_inode() before we do any tree updates,
so that we don't need to abort the transaction if we cannot add the new
name due to the leaf size limit.
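A toy Python model of that early check follows. The Directory class, the
per-entry header size of 30 bytes and the pluggable name_hash function are
assumptions for illustration (the real hash is crc32c and the real limit
depends on the node size).

```python
import errno

DIR_ITEM_HEADER = 30   # assumed per-entry header size in bytes


class Directory:
    """Toy directory: names with the same hash share one size-limited item."""

    def __init__(self, max_item_size, name_hash):
        self.max_item_size = max_item_size
        self.name_hash = name_hash
        self.items = {}

    def item_size(self, names):
        return sum(DIR_ITEM_HEADER + len(n.encode()) for n in names)

    def can_add(self, name):
        colliding = self.items.get(self.name_hash(name), [])
        return self.item_size(colliding + [name]) <= self.max_item_size


def create_file(directory, name):
    if not directory.can_add(name):
        return -errno.EOVERFLOW    # rejected before any tree update
    directory.items.setdefault(directory.name_hash(name), []).append(name)
    return 0


# Force every name to collide by using a constant hash.
d = Directory(max_item_size=100, name_hash=lambda name: 0)
print([create_file(d, n) == 0 for n in ("a" * 20, "b" * 20, "c")])
# [True, True, False]
```

Because the third name is refused before anything is modified, the directory
stays usable for names that hash elsewhere.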
A test case for fstests will be sent soon.
Fixes: caae78e032 ("btrfs: move common inode creation code into btrfs_create_new_inode()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Call rcu_read_lock() before exiting the loop in
try_release_subpage_extent_buffer() because there is a rcu_read_unlock()
call past the loop.
This has been detected by the Clang thread-safety analyzer.
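The balance requirement can be shown with a toy counter in Python. The loop
shape below is hypothetical and is not the body of
try_release_subpage_extent_buffer(); it only illustrates why every exit from
the loop must leave the read lock held when an unlock follows the loop.

```python
class Rcu:
    """Counts nested read-side sections and flags an unbalanced unlock."""

    def __init__(self):
        self.depth = 0

    def read_lock(self):
        self.depth += 1

    def read_unlock(self):
        assert self.depth > 0, "unbalanced rcu_read_unlock()"
        self.depth -= 1


def try_release(buffers, rcu):
    """Return True if a buffer was released; leaves rcu balanced."""
    released = False
    rcu.read_lock()
    for eb in buffers:
        rcu.read_unlock()      # lock dropped while working on eb
        if eb["releasable"]:
            released = True
            rcu.read_lock()    # re-take before leaving the loop
            break
        rcu.read_lock()
    rcu.read_unlock()          # the unlock past the loop
    return released


rcu = Rcu()
print(try_release([{"releasable": False}, {"releasable": True}], rcu))  # True
print(rcu.depth)                                                        # 0
```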
Fixes: ad580dfa38 ("btrfs: fix subpage deadlock in try_release_subpage_extent_buffer()")
CC: stable@vger.kernel.org # 6.18+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have recently observed a number of subvolumes with broken dentries.
ls-ing the parent dir looks like:
drwxrwxrwt 1 root root 16 Jan 23 16:49 .
drwxr-xr-x 1 root root 24 Jan 23 16:48 ..
d????????? ? ? ? ? ? broken_subvol
and similarly stat-ing the file fails.
In this state, deleting the subvol fails with ENOENT, but attempting to
create a new file or subvol over it errors out with EEXIST and even
aborts the fs. Which leaves us a bit stuck.
dmesg contains a single notable error message reading:
"could not do orphan cleanup -2"
2 is ENOENT and the error comes from the failure handling path of
btrfs_orphan_cleanup(), with the stack leading back up to
btrfs_lookup().
btrfs_lookup
btrfs_lookup_dentry
btrfs_orphan_cleanup // prints that message and returns -ENOENT
After some detailed inspection of the internal state, it became clear
that:
- there are no orphan items for the subvol
- the subvol is otherwise healthy looking, it is not half-deleted or
anything, there is no drop progress, etc.
- the subvol was created a while ago, and the first meaningful
btrfs_orphan_cleanup() call that sets BTRFS_ROOT_ORPHAN_CLEANUP happens
much later.
- after btrfs_orphan_cleanup() fails, btrfs_lookup_dentry() returns -ENOENT,
which results in a negative dentry for the subvolume via
d_splice_alias(NULL, dentry), leading to the observed behavior. The
bug can be mitigated by dropping the dentry cache, at which point we
can successfully delete the subvolume if we want.
i.e.,
btrfs_lookup()
btrfs_lookup_dentry()
if (!sb_rdonly(inode->vfs_inode)->vfs_inode)
btrfs_orphan_cleanup(sub_root)
test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
btrfs_search_slot() // finds orphan item for inode N
...
prints "could not do orphan cleanup -2"
if (inode == ERR_PTR(-ENOENT))
inode = NULL;
return d_splice_alias(NULL, dentry) // NEGATIVE DENTRY for valid subvolume
btrfs_orphan_cleanup() does test_and_set_bit(BTRFS_ROOT_ORPHAN_CLEANUP)
on the root when it runs, so it cannot run more than once on a given
root, so something else must run concurrently. However, the obvious
routes to deleting an orphan when nlinks goes to 0 should not be able to
run without first doing a lookup into the subvolume, which should run
btrfs_orphan_cleanup() and set the bit.
The final important observation is that create_subvol() calls
d_instantiate_new() but does not set BTRFS_ROOT_ORPHAN_CLEANUP, so if
the dentry cache gets dropped, the next lookup into the subvolume will
make a real call into btrfs_orphan_cleanup() for the first time. This
opens up the possibility of concurrently deleting the inode/orphan items
but most typical evict() paths will be holding a reference on the parent
dentry (child dentry holds parent->d_lockref.count via dget in
d_alloc(), released in __dentry_kill()) and prevent the parent from
being removed from the dentry cache.
The one exception is delayed iputs. Ordered extent creation calls
igrab() on the inode. If the file is unlinked and closed while those
refs are held, iput() in __dentry_kill() decrements i_count but does
not trigger eviction (i_count > 0). The child dentry is freed and the
subvol dentry's d_lockref.count drops to 0, making it evictable while
the inode is still alive.
Since there are two races (the race between writeback and unlink and
the race between lookup and delayed iputs), and there are too many moving
parts, the following three diagrams show the complete picture.
(Only the second and third are races)
Phase 1:
Create Subvol in dentry cache without BTRFS_ROOT_ORPHAN_CLEANUP set
btrfs_mksubvol()
lookup_one_len()
__lookup_slow()
d_alloc_parallel()
__d_alloc() // d_lockref.count = 1
create_subvol(dentry)
// doesn't touch the bit..
d_instantiate_new(dentry, inode) // dentry in cache with d_lockref.count == 1
Phase 2:
Create a delayed iput for a file in the subvol but leave the subvol in
state where its dentry can be evicted (d_lockref.count == 0)
T1 (task) T2 (writeback) T3 (OE workqueue)
write() // dirty pages
btrfs_writepages()
btrfs_run_delalloc_range()
cow_file_range()
btrfs_alloc_ordered_extent()
igrab() // i_count: 1 -> 2
btrfs_unlink_inode()
btrfs_orphan_add()
close()
__fput()
dput()
finish_dput()
__dentry_kill()
dentry_unlink_inode()
iput() // 2 -> 1
--parent->d_lockref.count // 1 -> 0; evictable
finish_ordered_fn()
btrfs_finish_ordered_io()
btrfs_put_ordered_extent()
btrfs_add_delayed_iput()
Phase 3:
Once the delayed iput is pending and the subvol dentry is evictable,
the shrinker can free it, causing the next lookup to go through
btrfs_lookup() and call btrfs_orphan_cleanup() for the first time.
If the cleaner kthread processes the delayed iput concurrently, the
two race:
T1 (shrinker) T2 (cleaner kthread) T3 (lookup)
super_cache_scan()
prune_dcache_sb()
__dentry_kill()
// subvol dentry freed
btrfs_run_delayed_iputs()
iput() // i_count -> 0
evict() // sets I_FREEING
btrfs_evict_inode()
// truncation loop
btrfs_lookup()
btrfs_lookup_dentry()
btrfs_orphan_cleanup()
// first call (bit never set)
btrfs_iget()
// blocks on I_FREEING
btrfs_orphan_del()
// inode freed
// returns -ENOENT
btrfs_del_orphan_item()
// -ENOENT
// "could not do orphan cleanup -2"
d_splice_alias(NULL, dentry)
// negative dentry for valid subvol
The most straightforward fix is to ensure the invariant that a dentry
for a subvolume can exist if and only if that subvolume has
BTRFS_ROOT_ORPHAN_CLEANUP set on its root (and is known to have no
orphans or ran btrfs_orphan_cleanup()).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_zoned_reserve_data_reloc_bg() is called on each mount of a file
system and allocates a new block-group, to assign it to be the dedicated
relocation target, if no pre-existing usable block-group for this task is
found.
If for some reason the transaction is aborted, btrfs_end_transaction()
will wake up the transaction kthread. But the transaction kthread is not
yet initialized at the time btrfs_zoned_reserve_data_reloc_bg() is
called, leading to the following NULL-pointer dereference:
RSP: 0018:ffffc9000c617c98 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 000000000000073c RCX: 0000000000000002
RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000001
RBP: 0000000000000207 R08: ffffffff8223c71d R09: 0000000000000635
R10: ffff888108588000 R11: 0000000000000003 R12: 0000000000000003
R13: 000000000000073c R14: 0000000000000000 R15: ffff888114dd6000
FS: 00007f2993745840(0000) GS:ffff8882b508d000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000073c CR3: 0000000121a82006 CR4: 0000000000770eb0
PKRU: 55555554
Call Trace:
<TASK>
try_to_wake_up (./include/linux/spinlock.h:557 kernel/sched/core.c:4106)
__btrfs_end_transaction (fs/btrfs/transaction.c:1115 (discriminator 2))
btrfs_zoned_reserve_data_reloc_bg (fs/btrfs/zoned.c:2840)
open_ctree (fs/btrfs/disk-io.c:3588)
btrfs_get_tree.cold (fs/btrfs/super.c:982 fs/btrfs/super.c:1944 fs/btrfs/super.c:2087 fs/btrfs/super.c:2121)
vfs_get_tree (fs/super.c:1752)
__do_sys_fsconfig (fs/fsopen.c:231 fs/fsopen.c:295 fs/fsopen.c:473)
do_syscall_64 (arch/x86/entry/syscall_64.c:63 (discriminator 1) arch/x86/entry/syscall_64.c:94 (discriminator 1))
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:131)
RIP: 0033:0x7f299392740e
Move the call to btrfs_zoned_reserve_data_reloc_bg() after the
transaction_kthread has been initialized to fix this problem.
Fixes: 694ce5e143 ("btrfs: zoned: reserve data_reloc block group on mount")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held,
as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was
calling it after do_reclaim_sweep() returns, at which point
space_info->lock is no longer held.
Fix this by explicitly acquiring space_info->lock before clearing the
periodic reclaim ready flag in btrfs_reclaim_sweep().
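A userspace analogy in Python, with a plain assertion standing in for
lockdep_assert_held(). The SpaceInfo class and function bodies are
simplified assumptions. Note that Lock.locked() only reports that some
thread holds the lock, which is weaker than what lockdep checks.

```python
import threading


class SpaceInfo:
    def __init__(self):
        self.lock = threading.Lock()
        self.periodic_reclaim_ready = True


def set_periodic_reclaim_ready(space_info, ready):
    # Stand-in for lockdep_assert_held(&space_info->lock).
    assert space_info.lock.locked(), "space_info.lock must be held"
    space_info.periodic_reclaim_ready = ready


def reclaim_sweep(space_info):
    # ... the sweep itself runs here, with the lock released ...
    with space_info.lock:      # take the lock explicitly before clearing
        set_periodic_reclaim_ready(space_info, False)


si = SpaceInfo()
reclaim_sweep(si)
print(si.periodic_reclaim_ready)  # False
```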
Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/
Fixes: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add the definitions for the remap tree to print-tree.c, so that we get
more useful information if a tree is dumped to dmesg.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Earlier, the TEE subsystem assumed that all memory pages shared with the
TEE implementation could be refcounted. However, the slab allocations
within the kernel don't allow refcounting kernel pages. It is better to
trust the kernel clients not to free pages while they are shared with
the TEE implementation. Hence, remove refcounting of kernel pages from
the register_shm_helper() API.
Fixes: b9c0e49abf ("mm: decline to manipulate the refcount on a slab page")
Reported-by: Marco Felsch <m.felsch@pengutronix.de>
Reported-by: Sven Püschel <s.pueschel@pengutronix.de>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Co-developed-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
Signed-off-by: Sumit Garg <sumit.garg@oss.qualcomm.com>
Tested-by: Sven Püschel <s.pueschel@pengutronix.de>
Signed-off-by: Jens Wiklander <jens.wiklander@linaro.org>
Change additionalProperties to unevaluatedProperties because the binding
references /schemas/input/matrix-keymap.yaml.
Fix the CHECK_DTBS warnings below:
arch/arm/boot/dts/nxp/imx/imx6dl-victgo.dtb: keypad@70 (holtek,ht16k33): 'keypad,num-columns', 'keypad,num-rows' do not match any of the regexes: '^pinctrl-[0-9]+$'
from schema $id: http://devicetree.org/schemas/auxdisplay/holtek,ht16k33.yaml#
Fixes: f12b457c6b ("dt-bindings: auxdisplay: ht16k33: Convert to json-schema")
Acked-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Explicitly document the ordering of vcpu->mutex being taken *outside* of
kvm->slots_lock. While somewhat unintuitive since vCPUs conceptually have
narrower scope than VMs, the scope of the owning object (vCPU versus VM)
doesn't automatically carry over to the lock. In this case, vcpu->mutex
has far broader scope than kvm->slots_lock. As Paolo put it, it's a
"don't worry about multiple ioctls at the same time" mutex that's intended
to be taken at the outer edges of KVM.
More importantly, arm64 and x86 have gained flows that take kvm->slots_lock
inside of vcpu->mutex. x86's kvm_inhibit_apic_access_page() is particularly
nasty, as slots_lock is taken quite deep within KVM_RUN, i.e. simply
swapping the ordering isn't an option.
Commit to the vcpu->mutex => kvm->slots_lock ordering, as vcpu->mutex
really is intended to be a "top-level" lock, whereas kvm->slots_lock is
"just" a helper lock.
Opportunistically document that vcpu->mutex is also taken outside of
slots_arch_lock, e.g. when allocating shadow roots on x86 (which is the
entire reason slots_arch_lock exists, as shadow roots must be allocated
while holding kvm->srcu):
kvm_mmu_new_pgd()
|
-> kvm_mmu_reload()
|
-> kvm_mmu_load()
|
-> mmu_alloc_shadow_roots()
|
-> mmu_first_shadow_root_alloc()
but also when manipulating memslots in vCPU context, e.g. when inhibiting
the APIC-access page via the aforementioned kvm_inhibit_apic_access_page():
kvm_inhibit_apic_access_page()
|
-> __x86_set_memory_region()
|
-> kvm_set_internal_memslot()
|
-> kvm_set_memory_region()
|
-> kvm_set_memslot()
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260302170239.596810-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add validation of the inner IPv4 packet tot_len and ihl fields parsed
from decrypted IPTFS payloads in __input_process_payload(). A crafted
ESP packet containing an inner IPv4 header with tot_len=0 causes an
infinite loop: iplen=0 leads to capturelen=min(0, remaining)=0, so the
data offset never advances and the while(data < tail) loop never
terminates, spinning forever in softirq context.
Reject inner IPv4 packets where tot_len < ihl*4 or ihl*4 < sizeof(struct
iphdr), which catches both the tot_len=0 case and malformed ihl values.
The normal IP stack performs this validation in ip_rcv_core(), but IPTFS
extracts and processes inner packets before they reach that layer.
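The two rejection conditions can be illustrated with a minimal userspace
sketch. This is not the kernel code: inner_ipv4_len_ok() and
IPV4_MIN_HDR_LEN are hypothetical names, and tot_len is assumed to have
been converted to host byte order already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Length of an IPv4 header without options, i.e. sizeof(struct iphdr). */
#define IPV4_MIN_HDR_LEN 20u

/*
 * Hypothetical helper, not the kernel function: decide whether the length
 * fields of an inner IPv4 header are usable.
 *
 * ihl:     header length in 32-bit words (a 4-bit field, so at most 15)
 * tot_len: total packet length in bytes, already in host byte order
 */
static bool inner_ipv4_len_ok(uint8_t ihl, uint16_t tot_len)
{
	unsigned int hdr_len;

	assert(ihl <= 15);
	hdr_len = (unsigned int)ihl * 4;

	if (hdr_len < IPV4_MIN_HDR_LEN)
		return false;	/* malformed ihl */
	if (tot_len < hdr_len)
		return false;	/* also rejects tot_len == 0 */
	return true;
}
```

With these checks a packet with ihl=5 and tot_len=0 is rejected before
the length is used to advance through the payload, so the loop cannot
stall on a zero-length capture.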
Reported-by: Roshan Kumar <roshaen09@gmail.com>
Fixes: 6c82d24336 ("xfrm: iptfs: add basic receive packet (tunnel egress) handling")
Cc: stable@vger.kernel.org
Signed-off-by: Roshan Kumar <roshaen09@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Two issues were noticed after the NFS v4.0 KConfig changes were merged
upstream. First, the text of CONFIG_NFS_V4 should not encourage people
to select it if they are unsure. Second, the new CONFIG_NFS_V4_0 option
should default to "on" instead of "off" to avoid breaking people's
setups if they are using NFS v4.0.
Reported-by: Niklas Cassel <cassel@kernel.org>
Reported-by: Geert Uytterhoeven <geert+renesas@glider.be>
Fixes: 4e02693525 ("NFS: Add a way to disable NFS v4.0 via KConfig")
Fixes: 7537db2480 ("NFS: Merge CONFIG_NFS_V4_1 with CONFIG_NFS_V4")
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
In the event that rpcrdma_post_recvs() fails to create a work request
(due to memory allocation failure, say) or otherwise exits early, we
should decrement ep->re_receiving before returning. Otherwise we will
hang in rpcrdma_xprt_drain() as re_receiving will never reach zero and
the completion will never be triggered.
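The counter balance described above can be sketched in userspace with
C11 atomics. This is only an illustration under assumptions: struct
ep_sketch and post_recvs_sketch() are hypothetical stand-ins, and the
real function does more than this around the counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the endpoint; only the counter matters here. */
struct ep_sketch {
	atomic_int receiving;	/* plays the role of ep->re_receiving */
};

/*
 * Sketch of a post-receives routine. alloc_fails simulates a work request
 * allocation failure. Every exit goes through "out", so the count taken
 * on entry is always dropped again.
 */
static bool post_recvs_sketch(struct ep_sketch *ep, bool alloc_fails)
{
	bool posted = false;

	assert(ep);
	atomic_fetch_add(&ep->receiving, 1);

	if (alloc_fails)
		goto out;	/* early exit must not skip the decrement */

	/* ... build and post the Receive work requests here ... */
	posted = true;
out:
	atomic_fetch_sub(&ep->receiving, 1);
	return posted;
}
```

Whether the call succeeds or fails, the counter returns to its value
on entry, so a waiter that needs it to reach zero is not left hanging.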
On a system with high memory pressure, this can appear as the following
hung task:
INFO: task kworker/u385:17:8393 blocked for more than 122 seconds.
Tainted: G S E 6.19.0 #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u385:17 state:D stack:0 pid:8393 tgid:8393 ppid:2 task_flags:0x4248060 flags:0x00080000
Workqueue: xprtiod xprt_autoclose [sunrpc]
Call Trace:
<TASK>
__schedule+0x48b/0x18b0
? ib_post_send_mad+0x247/0xae0 [ib_core]
schedule+0x27/0xf0
schedule_timeout+0x104/0x110
__wait_for_common+0x98/0x180
? __pfx_schedule_timeout+0x10/0x10
wait_for_completion+0x24/0x40
rpcrdma_xprt_disconnect+0x444/0x460 [rpcrdma]
xprt_rdma_close+0x12/0x40 [rpcrdma]
xprt_autoclose+0x5f/0x120 [sunrpc]
process_one_work+0x191/0x3e0
worker_thread+0x2e3/0x420
? __pfx_worker_thread+0x10/0x10
kthread+0x10d/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x273/0x2b0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
Fixes: 15788d1d10 ("xprtrdma: Do not refresh Receive Queue while it is draining")
Signed-off-by: Eric Badger <ebadger@purestorage.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
All combined i2c/i3c drivers appear to suffer from the same link
time problem when CONFIG_I3C is set to 'm':
arm-linux-gnueabi-ld: drivers/iio/magnetometer/mmc5633.o: in function `mmc5633_i3c_driver_init':
mmc5633.c:(.init.text+0x30): undefined reference to `i3c_driver_register_with_owner'
This was previously fixed every time by marking individual
drivers as 'depends on I2C; depends on I3C || !I3C', but this gets
tedious and is somewhat confusing.
Add a Kconfig symbol 'I3C_OR_I2C' to help replace those dependencies,
and use this in all the existing drivers that had already fixed it
as well as the new mmc5633 driver.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260204164216.544409-1-arnd@kernel.org
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
The serdes device_node is obtained using of_get_child_by_name(),
which increments the reference count. However, it is never put,
leading to a reference leak.
Add the missing of_node_put() calls to ensure the reference count is
properly balanced.
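The get/put pairing can be modelled with a small userspace sketch. All
names here (struct node_sketch, node_get_sketch(), node_put_sketch(),
probe_sketch()) are hypothetical and only mimic the reference counting
behaviour, not the driver's actual probe path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node; only the refcount matters. */
struct node_sketch {
	int refcount;
};

/* Models a lookup that returns the node with a reference held. */
static struct node_sketch *node_get_sketch(struct node_sketch *np)
{
	if (np)
		np->refcount++;
	return np;
}

/* Models of_node_put(): drops one reference. */
static void node_put_sketch(struct node_sketch *np)
{
	if (np) {
		assert(np->refcount > 0);
		np->refcount--;
	}
}

/* setup_fails simulates an error after the node has been looked up. */
static int probe_sketch(struct node_sketch *serdes, int setup_fails)
{
	struct node_sketch *np = node_get_sketch(serdes);
	int ret = 0;

	if (!np)
		return -1;
	if (setup_fails) {
		ret = -1;
		goto out_put;
	}
	/* ... use np ... */
out_put:
	node_put_sketch(np);	/* reached on success and on failure */
	return ret;
}
```

Routing both the success and the error path through a single put keeps
the count balanced no matter where the function bails out.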
Fixes: 7ae14cf581 ("phy: ti: j721e-wiz: Implement DisplayPort mode to the wiz driver")
Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/20260212-wiz-v2-1-6e8bd4cc7a4a@gmail.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
The blamed commit introduced support for specifying individual lanes as
OF nodes in the device, and these can have status = "disabled".
When that happens, for_each_available_child_of_node() skips them and
lynx_28g_probe_lane() -> devm_phy_create() is not called, so lane->phy
will be NULL. Yet it will be dereferenced in lynx_28g_cdr_lock_check(),
resulting in a crash.
This used to be well handled in v3 of that patch:
https://lore.kernel.org/linux-phy/20250926180505.760089-14-vladimir.oltean@nxp.com/
but until v5 was merged, the logic to support per-lane OF nodes was
split into a separate change, and the per-SoC compatible strings patch
was deferred to a "part 2" set. The splitting was done improperly, and
that handling of NULL lane->phy pointers was not integrated into the
proper commit.
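The kind of NULL handling referred to above could look roughly like the
following userspace sketch. It is an assumption about the shape of the
check, not the driver code: struct lane_sketch and check_lanes_sketch()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lane; phy stays NULL if no PHY was created. */
struct lane_sketch {
	void *phy;
};

/*
 * Walk all lanes and skip the ones without a PHY, as would be the case
 * for lanes whose OF node has status = "disabled". Returns the number
 * of lanes that were actually processed.
 */
static size_t check_lanes_sketch(const struct lane_sketch *lanes, size_t n)
{
	size_t checked = 0;

	assert(lanes || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!lanes[i].phy)
			continue;	/* nothing to dereference for this lane */
		/* ... per-lane work that dereferences lanes[i].phy ... */
		checked++;
	}
	return checked;
}
```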
Fixes: 7df7d58abb ("phy: lynx-28g: support individual lanes as OF PHY providers")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260226182853.1103616-1-vladimir.oltean@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Geert reports that enabling CONFIG_KUNIT_ALL_TESTS shouldn't enable
features that aren't enabled without it. That isn't what "*all* tests"
means, but as the prompt puts it, "All KUnit tests with satisfied
dependencies".
The impact is that enabling CONFIG_KUNIT_ALL_TESTS pulls features into
the kernel as built-in, with no way to disable them.
Keep the pattern where consumer drivers have to "select PHY_COMMON_PROPS",
but if KUNIT_ALL_TESTS is enabled, also make PHY_COMMON_PROPS user
selectable, so it can be turned off.
Modify PHY_COMMON_PROPS_TEST to depend on PHY_COMMON_PROPS rather than
select it.
Fixes: e7556b59ba ("phy: add phy_get_rx_polarity() and phy_get_tx_polarity()")
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Closes: https://lore.kernel.org/linux-phy/CAMuHMdUBaoYKNj52gn8DQeZFZ42Cvm6xT6fvo0-_twNv1k3Jhg@mail.gmail.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260226153315.3530378-1-vladimir.oltean@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Consider the following sequence of events on a death listener:
1. The remote process dies and sends a BR_DEAD_BINDER message.
2. The local process invokes the BC_CLEAR_DEATH_NOTIFICATION command.
3. The local process then invokes the BC_DEAD_BINDER_DONE command.
Then, the kernel will reply to the BC_DEAD_BINDER_DONE command with a
BR_CLEAR_DEATH_NOTIFICATION_DONE reply using push_work_if_looper().
However, this can result in a deadlock if the current thread is not a
looper. This is because dead_binder_done() still holds the proc lock
during set_notification_done(), which calls push_work_if_looper().
Normally, push_work_if_looper() takes the thread lock, which is fine to
take under the proc lock. But if the current thread is not a looper,
then it falls back to delivering the reply to the process work queue,
which involves taking the proc lock. Since the proc lock is already
held, this is a deadlock.
Fix this by releasing the proc lock during set_notification_done(). It
was not intentional that it was held during that function to begin with.
I don't think this ever happens in Android because BC_DEAD_BINDER_DONE
is only invoked in response to BR_DEAD_BINDER messages, and the kernel
always delivers BR_DEAD_BINDER to a looper. So there's no scenario where
Android userspace will call BC_DEAD_BINDER_DONE on a non-looper thread.
Cc: stable <stable@kernel.org>
Fixes: eafedbc7c0 ("rust_binder: add Rust Binder driver")
Reported-by: syzbot+c8287e65a57a89e7fb72@syzkaller.appspotmail.com
Tested-by: syzbot+c8287e65a57a89e7fb72@syzkaller.appspotmail.com
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://patch.msgid.link/20260224-binder-dead-binder-done-proc-lock-v1-1-bbe1b8a6e74a@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When sending a transaction, its offsets array is first copied into the
target proc's vma, and then the values are read back from there. This is
normally fine because the vma is a read-only mapping, so the target
process cannot change the value under us.
However, if the target process somehow gains the ability to write to its
own vma, it could change the offset before it's read back, causing the
kernel to misinterpret what the sender meant. If the sender happens to
send a payload with a specific shape, this could in the worst case lead
to the receiver being able to privilege escalate into the sender.
The intent is that gaining the ability to change the read-only vma of
your own process should not be exploitable, so remove this TOCTOU read
even though it's unexploitable without another Binder bug.
Cc: stable <stable@kernel.org>
Fixes: eafedbc7c0 ("rust_binder: add Rust Binder driver")
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://patch.msgid.link/20260218-binder-vma-check-v2-2-60f9d695a990@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When installing missing pages (or zapping them), Rust Binder will look
up the vma in the mm by address, and then call vm_insert_page (or
zap_page_range_single). However, if the vma is closed and replaced with
a different vma at the same address, this can lead to Rust Binder
installing pages into the wrong vma.
By installing the page into a writable vma, it becomes possible to write
to your own binder pages, which are normally read-only. Although you're
not supposed to be able to write to those pages, the intent behind the
design of Rust Binder is that even if you get that ability, it should not
lead to anything bad. Unfortunately, due to another bug, that is not the
case.
To fix this, store a pointer in vm_private_data and check that the vma
returned by vma_lookup() has the right vm_ops and vm_private_data before
trying to use the vma. This should ensure that Rust Binder will refuse
to interact with any other VMA. The plan is to introduce more vma
abstractions to avoid this unsafe access to vm_ops and vm_private_data,
but for now let's start with the simplest possible fix.
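The identity check can be sketched in plain C as follows. The driver
itself is written in Rust, so this is only an illustration of the idea:
struct vma_sketch, struct vm_ops_sketch and vma_is_ours() are
hypothetical names, not kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the ops table and the vma. */
struct vm_ops_sketch {
	int placeholder;
};

struct vma_sketch {
	const struct vm_ops_sketch *vm_ops;
	void *vm_private_data;
};

/*
 * Accept a looked-up vma only if it carries both our ops table and our
 * private pointer. A different vma mapped at the same address fails at
 * least one of the two comparisons.
 */
static bool vma_is_ours(const struct vma_sketch *vma,
			const struct vm_ops_sketch *our_ops,
			const void *owner)
{
	assert(our_ops && owner);
	if (!vma)
		return false;	/* lookup found nothing at that address */
	return vma->vm_ops == our_ops && vma->vm_private_data == owner;
}
```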
C Binder performs the same check in a slightly different way: it
provides a vm_ops->close that sets a boolean to true, then checks that
boolean after calling vma_lookup(), but this is more fragile
than the solution in this patch. (We probably still want to do both, but
the vm_ops->close callback will be added later as part of the follow-up
vma API changes.)
It's still possible to remap the vma so that pages appear in the right
vma, but at the wrong offset, but this is a separate issue and will be
fixed when Rust Binder gets a vm_ops->close callback.
Cc: stable <stable@kernel.org>
Fixes: eafedbc7c0 ("rust_binder: add Rust Binder driver")
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Acked-by: Danilo Krummrich <dakr@kernel.org>
Acked-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Link: https://patch.msgid.link/20260218-binder-vma-check-v2-1-60f9d695a990@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The spam detection logic in TreeRange was executed before the current
request was inserted into the tree. So the new request was not being
factored in the spam calculation. Fix this by moving the logic after
the new range has been inserted.
Also, the detection logic for ArrayRange was missing altogether which
meant large spamming transactions could get away without being detected.
Fix this by implementing an equivalent low_oneway_space() in ArrayRange.
Note that I looked into centralizing this logic in RangeAllocator but
iterating through 'state' and 'size' got a bit too complicated (for me)
and I abandoned this effort.
Cc: stable <stable@kernel.org>
Cc: Alice Ryhl <aliceryhl@google.com>
Fixes: eafedbc7c0 ("rust_binder: add Rust Binder driver")
Signed-off-by: Carlos Llamas <cmllamas@google.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://patch.msgid.link/20260210232949.3770644-1-cmllamas@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jonathan writes:
IIO: 1st set of fixes for the 7.0 cycle
Usual mixed bag of ancient bugs that have been discovered and more
recent stuff.
core
- Clean up a wait_queue if a driver is removed at exactly the wrong
moment.
adi,adf4377
- Check correct masks when waiting for reset to complete.
adi,adis
- Fix a NULL pointer dereference if ops not provided to adis_init()
bosch,bme680
- Fix typo in value used to calculate measurement wait duration.
infineon,tlv493d
- Drop incorrect shifting of some bits for x-axis
invensense,icm42600
- Fix corner case of output data rate being set to the value it already
has, which resulted in waiting forever for a flag to say the update was
completed.
- Fix a case where the buffer is turned off whilst ODR switch is in progress.
invensense,icm45600
- Interrupt 1 drive bit was inverted.
- Fix an underflow warning on regulator put if probe fails.
invensense,mpu9150
- Work around a hardware quirk where reading from irq status is not sufficient
to acknowledge an interrupt.
maxim,ds4424
- Reject -128 as a possible raw value as it's out of range with the sign
/ magnitude encoding used by this chip.
microchip,mcp4131
- Shift the wiper value only once.
rohm,bh1780
- Fix a runtime reference count issue on an error path.
sensirion,sps30
- Fix two buffer size issues caused by applying sizeof() to the wrong thing.
tyhx,hx9023s
- Ensure count used by __counted_by is set before accessing the buffer.
- Avoid a potential division by zero.
* tag 'iio-fixes-for-7.0a' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jic23/iio:
iio: imu: adis: Fix NULL pointer dereference in adis_init
iio: imu: inv_icm45600: fix regulator put warning when probe fails
iio: buffer: Fix wait_queue not being removed
iio: gyro: mpu3050-core: fix pm_runtime error handling
iio: gyro: mpu3050-i2c: fix pm_runtime error handling
iio: adc: ad7768-1: Fix ERR_PTR dereference in ad7768_fill_scale_tbl
iio: chemical: sps30_serial: fix buffer size in sps30_serial_read_meas()
iio: chemical: sps30_i2c: fix buffer size in sps30_i2c_read_meas()
iio: magnetometer: tlv493d: remove erroneous shift in X-axis data
iio: proximity: hx9023s: Protect against division by zero in set_samp_freq
iio: proximity: hx9023s: fix assignment order for __counted_by
iio: chemical: bme680: Fix measurement wait duration calculation
iio: dac: ds4424: reject -128 RAW value
iio: imu: inv_icm45600: fix INT1 drive bit inverted
iio: potentiometer: mcp4131: fix double application of wiper shift
iio: imu: inv-mpu9150: fix irq ack preventing irq storms
iio: frequency: adf4377: Fix duplicated soft reset mask
iio: light: bh1780: fix PM runtime leak on error path
iio: imu: inv_icm42600: fix odr switch when turning buffer off
iio: imu: inv_icm42600: fix odr switch to the same value
Dinh writes:
firmware: stratix10-rsu: fix NULL pointer dereference when RSU is disabled
- Fix a kernel panic that happens in the driver when the First Stage Boot Loader
has not enabled the Remote System Update (RSU).
* tag 'stratix10_rsu_fix_for_v7.0' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/dinguyen/linux:
firmware: stratix10-rsu: Fix NULL pointer dereference when RSU is disabled
On x86, the HYPERVISOR_CALLBACK_VECTOR is used to receive synthetic
interrupts (SINTs) from the hypervisor for doorbells and intercepts.
There is no such vector reserved for arm64.
On arm64, the hypervisor exposes a synthetic register that can be read
to find the INTID that should be used for SINTs. This INTID is in the
PPI range.
To better unify the code paths, introduce mshv_sint_vector_init() that
either reads the synthetic register and obtains the INTID (arm64) or
just uses HYPERVISOR_CALLBACK_VECTOR as the interrupt vector (x86).
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Rename mshv_synic_init() to mshv_synic_cpu_init() and
mshv_synic_cleanup() to mshv_synic_cpu_exit() to better reflect that
these functions handle per-cpu synic setup and teardown.
Use mshv_synic_init/cleanup() to perform init/cleanup that is not per-cpu.
Move all the synic related setup from mshv_parent_partition_init.
Move the reboot notifier to mshv_synic.c because it currently only
operates on the synic cpuhp state.
Move out synic_pages from the global mshv_root since its use is now
completely local to mshv_synic.c.
This is in preparation for adding more stuff to mshv_synic_init().
No functional change.
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
mdc800_device_read() submits download_urb and waits for completion.
If the timeout fires and the device has not responded, the function
returns without killing the URB, leaving it active.
A subsequent read() resubmits the same URB while it is still
in-flight, triggering the WARN in usb_submit_urb():
"URB submitted while active"
Check the return value of wait_event_timeout() and kill the URB if
it indicates timeout, ensuring the URB is complete before its status
is inspected or the URB is resubmitted.
Similar to
- commit 372c931319 ("USB: yurex: fix control-URB timeout handling")
- commit b98d5000c5 ("media: rc: iguanair: handle timeouts")
Signed-off-by: Ziyi Guo <n7l8m4@u.northwestern.edu>
Cc: stable <stable@kernel.org>
Link: https://patch.msgid.link/20260209151937.2247202-1-n7l8m4@u.northwestern.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fix wrong variable used for error checking after dma_alloc_coherent()
call. The function checks cdns_ctrl->dma_cdma_desc instead of
cdns_ctrl->cdma_desc, which could lead to incorrect error handling.
Fixes: ec4ba01e89 ("mtd: rawnand: Add new Cadence NAND driver to MTD subsystem")
Cc: stable@vger.kernel.org
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Given CONFIG_FORTIFY_SOURCE=y and a recent compiler,
commit 439a1bcac6 ("fortify: Use __builtin_dynamic_object_size() when
available") produces the warning below and an oops.
Searching for RedBoot partition table in 50000000.flash at offset 0x7e0000
------------[ cut here ]------------
WARNING: lib/string_helpers.c:1035 at 0xc029e04c, CPU#0: swapper/0/1
memcmp: detected buffer overflow: 15 byte read of buffer size 14
Modules linked in:
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0 #1 NONE
As Kees said, "'names' is pointing to the final 'namelen' many bytes
of the allocation ... 'namelen' could be basically any length at all.
This fortify warning looks legit to me -- this code used to be reading
beyond the end of the allocation."
Since the size of the dynamic allocation is calculated with strlen()
we can use strcmp() instead of memcmp() and remain within bounds.
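A minimal userspace sketch of the difference, under assumptions:
name_matches() is a hypothetical helper and the names used with it are
illustrative. With a buffer sized by strlen() + 1, memcmp() with a
fixed length taken from a longer literal reads past the end of the
buffer, whereas strcmp() stops at the first NUL or mismatch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: compare a stored, NUL-terminated name against a
 * wanted name. strcmp() stops at the first differing byte or at the
 * terminating NUL, so it never reads beyond an allocation that holds
 * exactly strlen(stored) + 1 bytes.
 */
static bool name_matches(const char *stored, const char *wanted)
{
	assert(stored && wanted);
	return strcmp(stored, wanted) == 0;
}
```

For example, a 14-byte buffer holding a 13-character name can be
compared safely against a 14-character literal this way, while
memcmp() with a length of 15 would be a 15-byte read of a 14-byte
buffer.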
Cc: Kees Cook <kees@kernel.org>
Cc: stable@vger.kernel.org
Cc: linux-hardening@vger.kernel.org
Link: https://lore.kernel.org/all/202602151911.AD092DFFCD@keescook/
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Kees Cook <kees@kernel.org>
Signed-off-by: Finn Thain <fthain@linux-m68k.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
I have not been actively involved in SPI NOR development recently and
would like to step down to focus on my current day-to-day work.
The subsystem remains in good hands with Pratyush and Michael.
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Acked-by: Michael Walle <mwalle@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
During the device remove process, the device is reset, causing the
configuration registers to go back to their default state, which is
zero. As the driver is checking if the event log support was enabled
before deallocating, it will fail if a reset happened before.
Do not check if the support was enabled; the check for 'idxd->evl'
being valid (only allocated if the HW capability is available) is
enough.
Fixes: 244da66cda ("dmaengine: idxd: setup event log configuration")
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://patch.msgid.link/20260121-idxd-fix-flr-on-kernel-queues-v3-v3-10-7ed70658a9d1@intel.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When using MSI (not MSI-X) with multiple IRQs, the MSI data value
must be unique per vector to ensure correct interrupt delivery.
Currently, the driver fails to increment the MSI data per vector,
causing interrupts to be misrouted.
Fix this by caching the base MSI data and adjusting each vector's
data accordingly during IRQ setup.
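As a rough illustration only, assuming each vector's data is the cached
base value plus the vector index (the commit message does not spell out
the exact adjustment), the per-vector values could be derived as below.
fill_msi_data_sketch() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive one MSI data value per vector from a
 * cached base value. Assumption: vector i uses base_data + i, so every
 * vector ends up with a distinct value.
 */
static void fill_msi_data_sketch(uint32_t base_data, uint32_t *out,
				 size_t nvec)
{
	assert(out || nvec == 0);
	for (size_t i = 0; i < nvec; i++)
		out[i] = base_data + (uint32_t)i;
}
```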
Fixes: e63d79d1ff04 ("dmaengine: dw-edma: Add Synopsys DesignWare eDMA IP core driver")
Signed-off-by: Shenghui Shi <brody.shi@m2semi.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20260209103726.414-1-brody.shi@m2semi.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Configure only the requested channel when a fixed channel is specified
to avoid modifying other channels unintentionally.
Fix parameter configuration when a fixed DMA channel is requested on
i.MX9 AON domain and i.MX8QM/QXP/DXL platforms. When a client requests
a fixed channel (e.g., channel 6), the driver traverses channels 0-5
and may unintentionally modify their configuration if they are unused.
This leads to issues such as setting the `is_multi_fifo` flag unexpectedly,
causing memcpy tests to fail when using the dmatest tool.
This only affects the eDMA memcpy test when the channel is fixed.
Fixes: 72f5801a4e ("dmaengine: fsl-edma: integrate v3 support")
Signed-off-by: Joy Zou <joy.zou@nxp.com>
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://patch.msgid.link/20250917-b4-edma-chanconf-v1-1-886486e02e91@nxp.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
When the TX queue for espintcp is full, esp_output_tail_tcp will
return an error and not free the skb, because with synchronous crypto,
the common xfrm output code will drop the packet for us.
With async crypto (esp_output_done), we need to drop the skb when
esp_output_tail_tcp returns an error.
Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When we update an SA, we construct a new state and call
xdo_dev_state_add, but never insert it. The existing state is updated,
then we immediately destroy the new state. Since we haven't added it,
we don't go through the standard state delete code, and we're skipping
removing it from the device (but xdo_dev_state_free will get called
when we destroy the temporary state).
This is similar to commit c5d4d7d831 ("xfrm: Fix deletion of
offloaded SAs on failure.").
Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
pcpu_num = 0 is a valid value. The marker for "unset pcpu_num" which
makes copy_to_user_state_extra not add the XFRMA_SA_PCPU attribute is
UINT_MAX.
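The sentinel choice can be shown with a small userspace sketch. The
names struct state_sketch, PCPU_NUM_UNSET_SKETCH and
should_emit_pcpu_attr() are hypothetical; only the use of UINT_MAX as
the "unset" marker comes from the description above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* "Unset" marker; 0 cannot be used because CPU 0 is a valid value. */
#define PCPU_NUM_UNSET_SKETCH UINT_MAX

/* Hypothetical stand-in for the state; only pcpu_num matters here. */
struct state_sketch {
	unsigned int pcpu_num;
};

/* The attribute is emitted for every value except the sentinel. */
static bool should_emit_pcpu_attr(const struct state_sketch *x)
{
	assert(x);
	return x->pcpu_num != PCPU_NUM_UNSET_SKETCH;
}
```

Testing for a non-zero pcpu_num instead would silently drop the
attribute for states bound to CPU 0.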
Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
We're returning an error caused by invalid user input without setting
an extack. Add one.
Fixes: 1ddf9916ac ("xfrm: Add support for per cpu xfrm state handling.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
When the Remote System Update (RSU) isn't enabled in the First Stage
Boot Loader (FSBL), the driver encounters a NULL pointer dereference when
executing the svc_normal_to_secure_thread() thread, resulting in a kernel
panic:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
Mem abort info:
...
Data abort info:
...
[0000000000000008] user address but active_mm is swapper
Internal error: Oops: 0000000096000004 [#1] SMP
Modules linked in:
CPU: 0 UID: 0 PID: 79 Comm: svc_smc_hvc_thr Not tainted 6.19.0-rc8-yocto-standard+ #59 PREEMPT
Hardware name: SoCFPGA Stratix 10 SoCDK (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : svc_normal_to_secure_thread+0x38c/0x990
lr : svc_normal_to_secure_thread+0x144/0x990
...
Call trace:
svc_normal_to_secure_thread+0x38c/0x990 (P)
kthread+0x150/0x210
ret_from_fork+0x10/0x20
Code: 97cfc113 f9400260 aa1403e1 f9400400 (f9400402)
---[ end trace 0000000000000000 ]---
The issue occurs because rsu_send_async_msg() fails when RSU is not enabled
in firmware, causing the channel to be freed via stratix10_svc_free_channel().
However, the probe function continues execution and registers
svc_normal_to_secure_thread(), which subsequently attempts to access the
already-freed channel, triggering the NULL pointer dereference.
Fix this by properly cleaning up the async client and returning early on
failure, preventing the thread from being used with an invalid channel.
Fixes: 15847537b6 ("firmware: stratix10-rsu: Migrate RSU driver to use stratix10 asynchronous framework.")
Cc: stable@kernel.org
Signed-off-by: Liwei Song <liwei.song@windriver.com>
Signed-off-by: Dinh Nguyen <dinguyen@kernel.org>
Commit 4e6e8c2b75 ("binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4") added
support for AT_HWCAP3 and AT_HWCAP4, but it missed updating the AUX
vector size calculation in create_elf_fdpic_tables() and
AT_VECTOR_SIZE_BASE in include/linux/auxvec.h.
Similar to the fix for AT_HWCAP2 in commit c6a09e342f ("binfmt_elf_fdpic:
fix AUXV size calculation when ELF_HWCAP2 is defined"), this omission
leads to a mismatch between the reserved space and the actual number of
AUX entries, eventually triggering a kernel BUG_ON(csp != sp).
Fix this by incrementing nitems when ELF_HWCAP3 or ELF_HWCAP4 are
defined and updating AT_VECTOR_SIZE_BASE.
Cc: Mark Brown <broonie@kernel.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io>
Fixes: 4e6e8c2b75 ("binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4")
Signed-off-by: Andrei Vagin <avagin@google.com>
Link: https://patch.msgid.link/20260217180108.1420024-2-avagin@google.com
Signed-off-by: Kees Cook <kees@kernel.org>
In the past %pK was preferable to %p as it would not leak raw pointer
values into the kernel log.
Since commit ad67b74d24 ("printk: hash addresses printed with %p")
the regular %p has been improved to avoid this issue.
Furthermore, restricted pointers ("%pK") were never meant to be used
through printk(). They can still unintentionally leak raw pointers or
acquire sleeping locks in atomic contexts.
Switch to the regular pointer formatting which is safer and
easier to reason about.
This was previously fixed in this driver in commit 1ba9fbe403
("drm/msm: Don't use %pK through printk") but an additional usage
was reintroduced in commit 39a750ff5f ("drm/msm/dpu: Add DSPP GC
driver to provide GAMMA_LUT DRM property")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Fixes: 39a750ff5f ("drm/msm/dpu: Add DSPP GC driver to provide GAMMA_LUT DRM property")
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/706229/
Link: https://lore.kernel.org/r/20260223-restricted-pointers-msm-v1-1-14c0b451e372@linutronix.de
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
In preparation for making the kmalloc family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "void **" but the returned type will be "void ***".
These are the same allocation size (pointer size), but the types do not
match. Adjust the allocation type to match the assignment.
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Patchwork: https://patchwork.freedesktop.org/patch/703588/
Link: https://lore.kernel.org/r/20260206222151.work.016-kees@kernel.org
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Commit c70b9d5fdc ("remoteproc: qcom: Use of_reserved_mem_region_*
functions for "memory-region"") switched from devm_ioremap_wc() to
devm_ioremap_resource_wc(). The difference is that devm_ioremap_resource_wc()
also requests the resource, which fails. Testing of both fixed and
dynamic reserved regions indicates that requesting the resource should
work, so I'm not sure why it doesn't work in this case. Fix the issue by
reverting back to devm_ioremap_wc().
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reported-by: André Apitzsch <git@apitzsch.eu>
Fixes: c70b9d5fdc ("remoteproc: qcom: Use of_reserved_mem_region_* functions for "memory-region"")
Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: André Apitzsch <git@apitzsch.eu> # on BQ Aquaris M5
Link: https://lore.kernel.org/r/20260128220243.3018526-1-robh@kernel.org
Signed-off-by: Bjorn Andersson <andersson@kernel.org>
The ncm_set_alt function was holding a mutex to protect against races
with configfs, which means a function that might sleep was called inside
an atomic context.
Remove the struct net_device pointer from the f_ncm_opts structure to
eliminate the contention. The connection state is now managed by a new
boolean flag to preserve the use-after-free fix from
commit 6334b8e455 ("usb: gadget: f_ncm: Fix UAF ncm object at re-bind
after usb ep transport error").
BUG: sleeping function called from invalid context
Call Trace:
dump_stack_lvl+0x83/0xc0
dump_stack+0x14/0x16
__might_resched+0x389/0x4c0
__might_sleep+0x8e/0x100
...
__mutex_lock+0x6f/0x1740
...
ncm_set_alt+0x209/0xa40
set_config+0x6b6/0xb40
composite_setup+0x734/0x2b40
...
Fixes: 56a512a9b4 ("usb: gadget: f_ncm: align net_device lifecycle with bind/unbind")
Cc: stable@kernel.org
Signed-off-by: Kuen-Han Tsai <khtsai@google.com>
Link: https://patch.msgid.link/20260221-legacy-ncm-v2-2-dfb891d76507@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The `tpg->tpg_nexus` pointer in the USB Target driver is dynamically
managed and tied to userspace configuration via ConfigFS. It can be
NULL if the USB host sends requests before the nexus is fully
established or immediately after it is dropped.
Currently, functions like `bot_submit_command()` and the data
transfer paths retrieve `tv_nexus = tpg->tpg_nexus` and immediately
dereference `tv_nexus->tvn_se_sess` without any validation. If a
malicious or misconfigured USB host sends a BOT (Bulk-Only Transport)
command during this race window, it triggers a NULL pointer
dereference, leading to a kernel panic (local DoS).
This exposes an inconsistent API usage within the module, as peer
functions like `usbg_submit_command()` and `bot_send_bad_response()`
correctly implement a NULL check for `tv_nexus` before proceeding.
Fix this by bringing consistency to the nexus handling. Add the
missing `if (!tv_nexus)` checks to the vulnerable BOT command and
request processing paths, aborting the command gracefully with an
error instead of crashing the system.
Fixes: c52661d60f ("usb-gadget: Initial merge of target module for UASP + BOT")
Cc: stable <stable@kernel.org>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Link: https://patch.msgid.link/20260219023834.17976-1-jiashengjiangcool@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If we found an alias through nfs3_do_create/nfs_add_or_obtain
/d_splice_alias which happens to be a dir dentry, we don't return
any error, and simply forget about this alias, but the original
dentry we were adding and passed as parameter remains negative.
This later causes an oops on nfs_atomic_open_v23/finish_open since we
supply a negative dentry to do_dentry_open.
This has been observed running lustre-racer, where dirs and files are
created/removed concurrently with the same name and O_EXCL is not
used to open files (frequent file redirection).
While d_splice_alias typically returns a directory alias or NULL, we
explicitly check d_is_dir() to ensure that we don't attempt to perform
file operations (like finish_open) on a directory inode, which triggers
the observed oops.
Fixes: 7c6c5249f0 ("NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly.")
Reviewed-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
The pwrrdy regmap_field is allocated in rzg2l_usbphy_ctrl_pwrrdy_init()
only if the driver data is set to RZG2L_USBPHY_CTRL_PWRRDY. Check that
pwrrdy is valid before using it to avoid "Unable to handle kernel NULL
pointer dereference at virtual address" errors.
Fixes: c5b7cd9ade ("reset: rzg2l-usbphy-ctrl: Add suspend/resume support")
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de>
hw_sm750_map() calls pci_request_region() but never releases the
region on error paths or in lynxfb_pci_remove(). This causes a
resource leak that prevents the PCI region from being mapped again
after driver removal or a failed probe. A TODO comment in the code
acknowledges this missing cleanup.
Restructure the error handling in hw_sm750_map() to properly release
the PCI region on ioremap failures, and add pci_release_region() to
lynxfb_pci_remove().
Signed-off-by: Artem Lytkin <iprintercanon@gmail.com>
Cc: stable <stable@kernel.org>
Link: https://patch.msgid.link/20260216202038.1828-1-iprintercanon@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When QMAN_FQ_FLAG_DYNAMIC_FQID is set, there's a race condition between
fq_table[fq->idx] state and freeing/allocating from the pool and
WARN_ON(fq_table[fq->idx]) in qman_create_fq() gets triggered.
Indeed, we can have:
Thread A                          Thread B
qman_destroy_fq()                 qman_create_fq()
  qman_release_fqid()
    qman_shutdown_fq()
    gen_pool_free()
  -- At this point, the fqid is available again --
                                    qman_alloc_fqid()
  -- so, we can get the just-freed fqid in thread B --
                                    fq->fqid = fqid;
                                    fq->idx = fqid * 2;
                                    WARN_ON(fq_table[fq->idx]);
                                    fq_table[fq->idx] = fq;
  fq_table[fq->idx] = NULL;
And adding some logs between qman_release_fqid() and
fq_table[fq->idx] = NULL makes the WARN_ON() trigger a lot more.
To prevent that, ensure that fq_table[fq->idx] is set to NULL before
gen_pool_free() is called by using smp_wmb().
Fixes: c535e923bb ("soc/fsl: Introduce DPAA 1.x QMan device driver")
Signed-off-by: Richard Genoud <richard.genoud@bootlin.com>
Tested-by: CHAMPSEIX Thomas <thomas.champseix@alstomgroup.com>
Link: https://lore.kernel.org/r/20251223072549.397625-1-richard.genoud@bootlin.com
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
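The ordering requirement from the QMan fix above can be sketched with C11 atomics. This is a simplified userspace model under stated assumptions: demo_table, demo_id_free and the demo_* functions are hypothetical, and a release store stands in for the role smp_wmb() plays in the kernel. The idea is that the table slot is cleared before the ID is made available again, so a creator that picks up the recycled ID never sees a stale entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NUM_IDS 8

struct demo_fq {
	unsigned int fqid;
	unsigned int idx;
};

/* Hypothetical stand-ins for fq_table[] and the FQID pool. */
static _Atomic(struct demo_fq *) demo_table[DEMO_NUM_IDS * 2];
static atomic_bool demo_id_free[DEMO_NUM_IDS];

static void demo_pool_init(void)
{
	for (unsigned int i = 0; i < DEMO_NUM_IDS; i++)
		atomic_store(&demo_id_free[i], true);
}

/* Claim the first free ID, or return -1 if none is left. */
static int demo_alloc_id(void)
{
	for (unsigned int i = 0; i < DEMO_NUM_IDS; i++) {
		bool expected = true;

		if (atomic_compare_exchange_strong(&demo_id_free[i],
						   &expected, false))
			return (int)i;
	}
	return -1;
}

static int demo_create(struct demo_fq *fq)
{
	int id = demo_alloc_id();

	assert(fq != NULL);
	if (id < 0)
		return -1;
	fq->fqid = (unsigned int)id;
	fq->idx = fq->fqid * 2;
	/* Equivalent of WARN_ON(fq_table[fq->idx]): slot must be empty. */
	if (atomic_load(&demo_table[fq->idx]) != NULL)
		return -2;
	atomic_store(&demo_table[fq->idx], fq);
	return 0;
}

static void demo_destroy(struct demo_fq *fq)
{
	/* Clear the table slot first ... */
	atomic_store_explicit(&demo_table[fq->idx], NULL,
			      memory_order_relaxed);
	/* ... and only then publish the ID as free (release ordering). */
	atomic_store_explicit(&demo_id_free[fq->fqid], true,
			      memory_order_release);
}
```

Swapping the two stores in demo_destroy() would reopen the window shown in the Thread A / Thread B diagram.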
The adis_init() function dereferences adis->ops to check if the
individual function pointers (write, read, reset) are NULL, but does
not first check if adis->ops itself is NULL.
Drivers like adis16480, adis16490, adis16545 and others do not set
custom ops and rely on adis_init() assigning the defaults. Since struct
adis is zero-initialized by devm_iio_device_alloc(), adis->ops is NULL
when adis_init() is called, causing a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : adis_init+0xc0/0x118
Call trace:
adis_init+0xc0/0x118
adis16480_probe+0xe0/0x670
Fix this by checking if adis->ops is NULL before dereferencing it,
falling through to assign the default ops in that case.
Fixes: 3b29bcee8f ("iio: imu: adis: Add custom ops struct")
Signed-off-by: Radu Sabau <radu.sabau@analog.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Reviewed-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
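A minimal sketch of the guard described in the adis fix above, assuming a hypothetical demo_ops/demo_dev pair rather than the real struct adis. The exact condition in the driver may differ; the sketch only shows the pointer being tested before its members are.

```c
#include <assert.h>
#include <stddef.h>

struct demo_ops {
	int (*read)(int reg);
	int (*write)(int reg, int val);
	int (*reset)(void);
};

struct demo_dev {
	const struct demo_ops *ops;
};

static int demo_default_read(int reg) { return reg; }
static int demo_default_write(int reg, int val) { (void)reg; (void)val; return 0; }
static int demo_default_reset(void) { return 0; }

static const struct demo_ops demo_default_ops = {
	.read = demo_default_read,
	.write = demo_default_write,
	.reset = demo_default_reset,
};

static int demo_init(struct demo_dev *dev)
{
	assert(dev != NULL);

	/* Test the pointer itself before looking at its members. */
	if (!dev->ops ||
	    (!dev->ops->read && !dev->ops->write && !dev->ops->reset))
		dev->ops = &demo_default_ops;
	else if (!dev->ops->read || !dev->ops->write || !dev->ops->reset)
		return -22;	/* partially filled ops are rejected */

	return 0;
}
```

A zero-initialized demo_dev therefore ends up with the defaults instead of faulting.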
When the driver probe fails, we encounter a regulator put warning
because the vddio regulator is not disabled before release. The issue
comes from pm_runtime not yet being set up when the core probe fails and
the vddio regulator disable callback is called.
Fix the issue by setting pm_runtime active early, before the vddio
regulator resource cleanup. This requires splitting the pm_runtime
set_active and enable steps into 2 function calls.
Fixes: 7ff021a3fa ("iio: imu: inv_icm45600: add new inv_icm45600 driver")
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
In the edge case where the IIO device is unregistered while we're
buffering, we were directly returning an error without removing the wait
queue. Instead, set 'ret' and break out of the loop.
Fixes: 9eeee3b0bf ("iio: Add output buffer support")
Signed-off-by: Nuno Sá <nuno.sa@analog.com>
Reviewed-by: David Lechner <dlechner@baylibre.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
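The control-flow change in the output buffer fix above follows a common pattern: record the error and break, so the cleanup after the loop still runs. A hedged userspace sketch, where demo_wq and demo_write are hypothetical and the counter stands in for wait-queue registration:

```c
#include <assert.h>
#include <stddef.h>

struct demo_wq {
	int waiters;	/* stand-in for entries on a wait queue */
};

static int demo_write(struct demo_wq *wq, int device_gone, int steps)
{
	int ret = 0;

	assert(wq != NULL);
	wq->waiters++;			/* models add_wait_queue() */

	for (int i = 0; i < steps; i++) {
		if (device_gone) {
			ret = -19;	/* set the error and leave the loop */
			break;		/* a bare "return" here would skip cleanup */
		}
		ret++;
	}

	wq->waiters--;			/* models remove_wait_queue() */
	return ret;
}
```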
The return value of pm_runtime_get_sync() is not checked, allowing
the driver to access hardware that may fail to resume. The device
usage count is also unconditionally incremented. Use
pm_runtime_resume_and_get() which propagates errors and avoids
incrementing the usage count on failure.
In preenable, add pm_runtime_put_autosuspend() on set_8khz_samplerate()
failure since postdisable does not run when preenable fails.
Fixes: 3904b28efb ("iio: gyro: Add driver for the MPU-3050 gyroscope")
Reviewed-by: Linus Walleij <linusw@kernel.org>
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
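The difference between the two runtime PM helpers named in the MPU-3050 fix above can be modelled in a few lines. This is a simplified sketch, not the kernel implementation: demo_dev and both functions are hypothetical, and only the usage-count behaviour on failure is modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_dev {
	int usage_count;
	bool resume_fails;
};

/* Models pm_runtime_get_sync(): the count is bumped even on failure. */
static int demo_get_sync(struct demo_dev *d)
{
	assert(d);
	d->usage_count++;
	return d->resume_fails ? -5 : 0;
}

/* Models pm_runtime_resume_and_get(): the count is dropped on failure. */
static int demo_resume_and_get(struct demo_dev *d)
{
	int ret = demo_get_sync(d);

	if (ret < 0) {
		d->usage_count--;
		return ret;
	}
	return 0;
}
```

With the second helper a caller can simply return the error, without a balancing put on the failure path.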
The return value of pm_runtime_get_sync() is not checked, and the
function always returns success. This allows I2C mux operations to
proceed even when the device fails to resume.
Use pm_runtime_resume_and_get() and propagate its return value to
properly handle resume failures.
Fixes: 3904b28efb ("iio: gyro: Add driver for the MPU-3050 gyroscope")
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
sizeof(num) evaluates to sizeof(size_t) which is 8 bytes on 64-bit,
but the buffer elements are only 4 bytes. The same function already
uses sizeof(*meas) on line 312, making the mismatch evident. Use
sizeof(*meas) consistently.
Fixes: b2e171f5a5 ("iio: sps30: add support for serial interface")
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Acked-by: Tomasz Duszynski <tduszyns@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
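The size mismatch in the sps30 fix above is easy to reproduce outside the kernel. In this sketch the function names are hypothetical and uint32_t stands in for the driver's 4-byte element type:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buggy form: sizeof(num) is the size of the counter, not of an element. */
static size_t demo_bytes_buggy(size_t num)
{
	return num * sizeof(num);
}

/* Fixed form: size is derived from what the pointer points at. */
static size_t demo_bytes_fixed(const uint32_t *meas, size_t num)
{
	assert(meas != NULL);
	return num * sizeof(*meas);
}
```

On a 64-bit build the buggy form reports twice as many bytes as the buffer holds.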
TLV493D_BX2_MAG_X_AXIS_LSB is defined as GENMASK(7, 4). FIELD_GET()
already right-shifts bits [7:4] to [3:0], so the additional >> 4
discards most of the X-axis low nibble. The Y and Z axes correctly
omit this extra shift. Remove it.
Fixes: 106511d280 ("iio: magnetometer: add support for Infineon TLV493D 3D Magentic sensor")
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
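The double shift in the TLV493D fix above can be demonstrated with userspace equivalents of GENMASK() and FIELD_GET(). The DEMO_* macros below are simplified re-implementations, and combining the low nibble with an MSB byte shifted left by 4 is an assumption of this sketch, not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified 32-bit versions of the kernel's GENMASK()/FIELD_GET(). */
#define DEMO_GENMASK(h, l)	(((~0u) << (l)) & ((~0u) >> (31 - (h))))
#define DEMO_FIELD_GET(mask, reg) \
	(((reg) & (mask)) / ((mask) & (~(mask) + 1u)))

#define DEMO_X_AXIS_LSB		DEMO_GENMASK(7, 4)

static_assert(DEMO_GENMASK(7, 4) == 0xF0u, "mask covers bits 7:4");

/* Fixed: FIELD_GET already moves bits [7:4] down to [3:0]. */
static unsigned int demo_x_fixed(uint8_t msb, uint8_t lsb_reg)
{
	return ((unsigned int)msb << 4) |
	       DEMO_FIELD_GET(DEMO_X_AXIS_LSB, (unsigned int)lsb_reg);
}

/* Buggy: the extra ">> 4" shifts the extracted nibble away. */
static unsigned int demo_x_buggy(uint8_t msb, uint8_t lsb_reg)
{
	return ((unsigned int)msb << 4) |
	       (DEMO_FIELD_GET(DEMO_X_AXIS_LSB, (unsigned int)lsb_reg) >> 4);
}
```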
Initialize fw_size before copying firmware data into the flexible
array member to match the __counted_by() annotation. This fixes the
incorrect assignment order that triggers runtime safety checks.
Fixes: e9ed97be4f ("iio: proximity: hx9023s: Added firmware file parsing functionality")
Signed-off-by: Yasin Lee <yasin.lee.x@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
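The assignment order required by the hx9023s fix above looks like this in a userspace sketch. The struct and function are hypothetical, and the __counted_by() annotation itself is left out because compiler support varies; only the ordering is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_fw {
	size_t fw_size;
	uint8_t data[];	/* in the kernel: annotated with __counted_by(fw_size) */
};

static struct demo_fw *demo_fw_load(const uint8_t *src, size_t len)
{
	struct demo_fw *fw;

	assert(src != NULL || len == 0);
	fw = malloc(sizeof(*fw) + len);
	if (!fw)
		return NULL;

	fw->fw_size = len;	/* set the counter first ... */
	if (len)
		memcpy(fw->data, src, len);	/* ... then fill the array */
	return fw;
}
```

With bounds checking enabled, the checker consults fw_size when data[] is accessed, so a copy made while the counter is still zero looks like an overflow.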
This function refers to the Bosch BME680 API as the source of the
calculation, but one of the constants does not match the Bosch
implementation. This appears to be a simple transposition of two digits,
resulting in a wait time that is too short. This can cause the following
'device measurement cycle incomplete' check to occasionally fail, returning
EBUSY to user space.
Adjust the constant to match the Bosch implementation and resolve the EBUSY
errors.
Fixes: 4241665e6e ("iio: chemical: bme680: Fix sensor data read operation")
Link: https://github.com/boschsensortec/BME68x_SensorAPI/blob/v4.4.8/bme68x.c#L521
Signed-off-by: Chris Spencer <spencercw@gmail.com>
Acked-by: Vasileios Amoiridis <vassilisamir@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The DS442x DAC uses sign-magnitude encoding, so -128 cannot be represented
in hardware (7-bit magnitude).
Previously, passing -128 resulted in a truncated value that programmed
0mA (magnitude 0) instead of the expected maximum negative current,
effectively failing silently.
Reject -128 to avoid producing the wrong current.
Fixes: d632a2bd8f ("iio: dac: ds4422/ds4424 dac driver")
Cc: stable@vger.kernel.org
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
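The range problem in the DS442x fix above follows from sign-magnitude encoding with a 7-bit magnitude. In this sketch the function names are hypothetical and the polarity of the sign bit is an assumption; the real device's convention may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_SIGN_BIT	0x80u
#define DEMO_MAG_MAX	127	/* largest value a 7-bit magnitude can hold */

/* What silent truncation to 7 bits does to the magnitude. */
static uint8_t demo_truncated_magnitude(int val)
{
	unsigned int mag = (unsigned int)(val < 0 ? -val : val);

	return (uint8_t)(mag & 0x7Fu);
}

/* Encode val as sign + magnitude, rejecting anything out of range. */
static int demo_encode(int val, uint8_t *out)
{
	assert(out != NULL);

	if (val < -DEMO_MAG_MAX || val > DEMO_MAG_MAX)
		return -34;	/* out of range, e.g. -128 */

	if (val < 0)
		*out = (uint8_t)(DEMO_SIGN_BIT | (unsigned int)(-val));
	else
		*out = (uint8_t)val;
	return 0;
}
```

A magnitude of 128 needs an eighth bit, so truncation leaves 0, which is the silent failure the commit describes.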
Drive bit must be set for open-drain mode and be cleared for push-pull
mode.
Referring to datasheet DS-000576_ICM-45605.pdf section 17.23
INT1_CONFIG2.
Fixes: 06674a72cf ("iio: imu: inv_icm45600: add buffer support in iio devices")
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Reviewed-by: Andy Shevchenko <andy@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The MCP4131 wiper address is shifted twice when preparing the SPI
command in mcp4131_write_raw().
The address is already shifted when assigned to the local variable
"address", but is then shifted again when written to data->buf[0].
This results in an incorrect command being sent to the device and
breaks wiper writes to the second channel.
Remove the second shift and use the pre-shifted address directly
when composing the SPI transfer.
Fixes: 22d199a539 ("iio: potentiometer: add driver for Microchip MCP413X/414X/415X/416X/423X/424X/425X/426X")
Signed-off-by: Lukas Schmid <lukas.schmid@netcube.li>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
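The effect of the double shift in the MCP4131 fix above can be seen with a small sketch. It assumes the address occupies the upper nibble of the command byte (a 4-bit shift) and ignores the command and data bits; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ADDR_SHIFT	4	/* assumed position of the address field */

/* Fixed: the pre-shifted address is used as is. */
static uint8_t demo_cmd_fixed(unsigned int channel)
{
	unsigned int address;

	assert(channel < 2);
	address = channel << DEMO_ADDR_SHIFT;
	return (uint8_t)address;
}

/* Buggy: shifting again pushes the address out of the byte. */
static uint8_t demo_cmd_buggy(unsigned int channel)
{
	unsigned int address;

	assert(channel < 2);
	address = channel << DEMO_ADDR_SHIFT;
	return (uint8_t)(address << DEMO_ADDR_SHIFT);
}
```

Channel 0 is unaffected because its address is zero either way, which is consistent with only the second channel breaking.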
The IRQ needs to be acked. For some odd reason, reading from the irq
status register does not reliably help, so enable acking from any register
to be on the safe side and read the irq status register. Comments in the
code indicate a known unreliability with that register.
The blamed commit was tested with mpu6050 in lg,p895 and lg,p880 according
to the Tested-bys. But with the MPU9150 in the Epson Moverio BT-200 this
leads to irq storms without properly acking the irq.
Fixes: 0a3b517c80 ("iio: imu: inv_mpu6050: fix interrupt status read for old buggy chips")
Signed-off-by: Andreas Kemnade <andreas@kemnade.info>
Acked-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The regmap_read_poll_timeout() uses ADF4377_0000_SOFT_RESET_R_MSK
twice instead of checking both SOFT_RESET_MSK (bit 0) and
SOFT_RESET_R_MSK (bit 7). This causes an incomplete reset status check.
The code first sets both SOFT_RESET and SOFT_RESET_R bits to 1 via
regmap_update_bits(), then polls for them to be cleared. Since we set
both bits before polling, we should be waiting for both to clear.
Fix by using both masks as done in regmap_update_bits() above.
Fixes: eda549e2e5 ("iio: frequency: adf4377: add support for ADF4377")
Signed-off-by: SeungJu Cheon <suunj1331@gmail.com>
Cc: Stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
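The incomplete poll condition in the ADF4377 fix above comes down to which bits are tested. A small sketch, using the bit positions given in the commit message (bit 0 and bit 7) and hypothetical names:

```c
#include <assert.h>

#define DEMO_SOFT_RESET_MSK	(1u << 0)
#define DEMO_SOFT_RESET_R_MSK	(1u << 7)

static_assert(DEMO_SOFT_RESET_MSK != DEMO_SOFT_RESET_R_MSK,
	      "the two reset bits are distinct");

/* Fixed: reset is done only when both bits have cleared. */
static int demo_reset_done(unsigned int reg)
{
	return !(reg & (DEMO_SOFT_RESET_MSK | DEMO_SOFT_RESET_R_MSK));
}

/* Buggy: the same mask twice, so bit 0 is never looked at. */
static int demo_reset_done_buggy(unsigned int reg)
{
	return !(reg & (DEMO_SOFT_RESET_R_MSK | DEMO_SOFT_RESET_R_MSK));
}
```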
Move pm_runtime_put_autosuspend() before the error check to ensure
the PM runtime reference count is always decremented after
pm_runtime_get_sync(), regardless of whether the read operation
succeeds or fails.
Fixes: 1f0477f183 ("iio: light: new driver for the ROHM BH1780")
Signed-off-by: Antoniu Miclaus <antoniu.miclaus@analog.com>
Reviewed-by: Linus Walleij <linusw@kernel.org>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The ODR switch is done in 2 steps when the FIFO is on: change the ODR
register value and acknowledge the change when reading the FIFO ODR change
flag. When we switch the ODR and turn the buffer off just afterward, we
lose the FIFO ODR change flag and the ODR switch is blocked.
Fix the issue by force applying any waiting ODR change when turning
buffer off.
Fixes: ec74ae9fd3 ("iio: imu: inv_icm42600: add accurate timestamping")
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
The ODR switch is done in 2 steps when the FIFO is on: change the ODR
register value and acknowledge the change when reading the FIFO ODR change
flag. When we switch to the same ODR value, we end up waiting for a
FIFO ODR flag that is never raised.
Fix the issue by doing nothing and exiting properly when we are
switching to the same ODR value.
Fixes: ec74ae9fd3 ("iio: imu: inv_icm42600: add accurate timestamping")
Signed-off-by: Jean-Baptiste Maneyrol <jean-baptiste.maneyrol@tdk.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
In ax45mp_cache_init(), of_find_matching_node() returns a device node
with an incremented reference count that must be released with
of_node_put(). The current code fails to call of_node_put() which
causes a reference leak.
Use the __free(device_node) attribute to ensure automatic cleanup when
the variable goes out of scope.
Fixes: d34599bcd2 ("cache: Add L2 cache management for Andes AX45MP RISC-V core")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
of_find_matching_node() returns a device_node with refcount incremented.
Use __free(device_node) attribute to automatically call of_node_put()
when the variable goes out of scope, preventing the refcount leak.
Fixes: cabff60ca7 ("cache: Add StarFive StarLink cache management")
Signed-off-by: Felix Gu <ustc.gu@gmail.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
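Both cache fixes above rely on scope-based cleanup. The kernel's __free() is built on the compiler's cleanup variable attribute, which can be tried directly in userspace with GCC or Clang. In this sketch every demo_* name is hypothetical, and a plain counter stands in for the device node refcount.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	int refcount;
};

static struct demo_node demo_the_node;

/* Models of_find_matching_node(): returns the node with a reference held. */
static struct demo_node *demo_find_node(void)
{
	demo_the_node.refcount++;
	return &demo_the_node;
}

/* Models of_node_put(). */
static void demo_node_put(struct demo_node *n)
{
	if (n)
		n->refcount--;
}

/* Called automatically when the annotated variable goes out of scope. */
static void demo_node_cleanup(struct demo_node **p)
{
	assert(p != NULL);
	demo_node_put(*p);
}

#define DEMO_FREE_NODE	__attribute__((cleanup(demo_node_cleanup)))

static int demo_init(int fail_early)
{
	struct demo_node *np DEMO_FREE_NODE = demo_find_node();

	if (!np)
		return -1;
	if (fail_early)
		return -2;	/* the reference is still dropped here */
	return 0;
}
```

Every return path drops the reference without an explicit put, which is what removes the leak.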
The CAN IP on PolarFire SoC requires the use of the block's reset
during normal operation, and the property is therefore required by the
binding, causing a warning on the m100pfsevp board where it is enabled
by default:
mpfs-m100pfsevp.dtb: can@2010c000 (microchip,mpfs-can): 'resets' is a required property
Add the reset to both can nodes.
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
In mpfs_sys_controller_probe(), if of_get_mtd_device_by_node() fails,
the function returns immediately without freeing the allocated memory
for sys_controller, leading to a memory leak.
Fix this by jumping to the out_free label to ensure the memory is
properly freed.
Also, consolidate the error handling for the mbox_request_channel()
failure case to use the same label.
Fixes: 742aa6c563 ("soc: microchip: mpfs: enable access to the system controller's flash")
Co-developed-by: Jianhao Xu <jianhao.xu@seu.edu.cn>
Signed-off-by: Jianhao Xu <jianhao.xu@seu.edu.cn>
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
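The consolidated error path from the mpfs system controller fix above is the usual goto-unwind idiom. A hedged sketch follows: the struct, the counter and the order of the two failure checks are assumptions made for illustration, not the driver's actual probe sequence.

```c
#include <assert.h>
#include <stdlib.h>

static int demo_live_allocs;	/* tracks allocations that were not freed */

struct demo_ctrl {
	int chan;
	int flash;
};

static int demo_probe(int chan_ok, int flash_ok, struct demo_ctrl **out)
{
	struct demo_ctrl *c;
	int ret;

	assert(out != NULL);
	c = calloc(1, sizeof(*c));
	if (!c)
		return -12;
	demo_live_allocs++;

	if (!chan_ok) {		/* models a mailbox channel request failing */
		ret = -2;
		goto out_free;
	}
	if (!flash_ok) {	/* models the flash lookup failing */
		ret = -3;
		goto out_free;	/* a bare "return" here would leak c */
	}

	*out = c;
	return 0;

out_free:
	free(c);
	demo_live_allocs--;
	return ret;
}
```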
Remove KVM's internal pseudo-overlay of kvm_stats_desc, which subtly
aliases the flexible name[] in the uAPI definition with a fixed-size array
of the same name. The unusual embedded structure results in compiler
warnings due to -Wflex-array-member-not-at-end, and also necessitates an
extra level of dereferencing in KVM. To avoid the "overlay", define the
uAPI structure to have a fixed-size name when building for the kernel.
Opportunistically clean up the indentation for the stats macros, and
replace spaces with tabs.
No functional change intended.
Reported-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Closes: https://lore.kernel.org/all/aPfNKRpLfhmhYqfP@kspp
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
[..]
Acked-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://patch.msgid.link/20251205232655.445294-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-08 10:40:48 -08:00
1320 changed files with 17244 additions and 8295 deletions