linux-net

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/ synced 2026-04-05 00:07:48 -04:00

Author	SHA1	Message	Date
Linus Torvalds	3ad66a34cc	Merge tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring fixes from Jens Axboe: - Fix a typo in the mock_file help text - Fix a comment regarding IORING_SETUP_TASKRUN_FLAG in the io_uring.h UAPI header - Use READ_ONCE() for reading refill queue entries - Reject SEND_VECTORIZED for fixed buffer sends, as it isn't implemented. Currently this flag is silently ignored This is in preparation for making these work, but first we need a fixup so that older kernels will correctly reject them - Ensure "0" means default for the rx page size * tag 'io_uring-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/zcrx: use READ_ONCE with user shared RQEs io_uring/mock: Fix typo in help text io_uring/net: reject SEND_VECTORIZED when unsupported io_uring: correct comment for IORING_SETUP_TASKRUN_FLAG io_uring/zcrx: don't set rx_page_size when not requested	2026-03-06 08:31:36 -08:00
Pavel Begunkov	531bb98a03	io_uring/zcrx: use READ_ONCE with user shared RQEs Refill queue entries are shared with the user space, use READ_ONCE when reading them. Fixes: `34a3e60821` ("io_uring/zcrx: implement zerocopy receive pp memory provider"); Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-03-04 06:30:39 -07:00
Jakub Kicinski	3d17d76d1f	io_uring/zcrx: don't set rx_page_size when not requested The rx_buf_len parameter was recently added to the Rx zero-copy implementation. The expectation is that when not set system will maintain previous behavior and use the default buffer size (PAGE_SIZE). This works correctly at the iouring level, but we don't preserve the same "zero means default" semantics when registering the memory provider on the netdev. mp_param.rx_page_size is unconditionally set to PAGE_SIZE. This causes __net_mp_open_rxq() to check for QCFG_RX_PAGE_SIZE support in the driver, and return -EOPNOTSUPP for drivers that don't advertise it -- even though the user never asked for large buffers. Only set mp_param.rx_page_size when rx_buf_len was explicitly provided, so that the default page size path works on all zcrx-capable drivers. mlx5 and fbnic only support 4kB pages in the current release. Fixes: `795663b4d1` ("io_uring/zcrx: implement large rx buffer support") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-27 12:32:49 -07:00
Linus Torvalds	bf4afc53b7	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2026-02-21 17:09:51 -08:00
Linus Torvalds	8934827db5	Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj conversion from Kees Cook: "This does the tree-wide conversion to kmalloc_obj() and friends using coccinelle, with a subsequent small manual cleanup of whitespace alignment that coccinelle does not handle. This uncovered a clang bug in __builtin_counted_by_ref(), so the conversion is preceded by disabling that for current versions of clang. The imminent clang 22.1 release has the fix. I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv, s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc" * tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kmalloc_obj: Clean up after treewide replacements treewide: Replace kmalloc with kmalloc_obj for non-scalar types compiler_types: Disable __builtin_counted_by_ref for Clang	2026-02-21 11:02:58 -08:00
Kees Cook	69050f8d6d	treewide: Replace kmalloc with kmalloc_obj for non-scalar types This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>	2026-02-21 01:02:28 -08:00
Kai Aizen	003049b1c4	io_uring/zcrx: fix user_ref race between scrub and refill paths The io_zcrx_put_niov_uref() function uses a non-atomic check-then-decrement pattern (atomic_read followed by separate atomic_dec) to manipulate user_refs. This is serialized against other callers by rq_lock, but io_zcrx_scrub() modifies the same counter with atomic_xchg() WITHOUT holding rq_lock. On SMP systems, the following race exists: CPU0 (refill, holds rq_lock) CPU1 (scrub, no rq_lock) put_niov_uref: atomic_read(uref) - 1 // window opens atomic_xchg(uref, 0) - 1 return_niov_freelist(niov) [PUSH #1] // window closes atomic_dec(uref) - wraps to -1 returns true return_niov(niov) return_niov_freelist(niov) [PUSH #2: DOUBLE-FREE] The same niov is pushed to the freelist twice, causing free_count to exceed nr_iovs. Subsequent freelist pushes then perform an out-of-bounds write (a u32 value) past the kvmalloc'd freelist array into the adjacent slab object. Fix this by replacing the non-atomic read-then-dec in io_zcrx_put_niov_uref() with an atomic_try_cmpxchg loop that atomically tests and decrements user_refs. This makes the operation safe against concurrent atomic_xchg from scrub without requiring scrub to acquire rq_lock. Fixes: `34a3e60821` ("io_uring/zcrx: implement zerocopy receive pp memory provider") Cc: stable@vger.kernel.org Signed-off-by: Kai Aizen <kai@snailsploit.com> [pavel: removed a warning and a comment] Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-18 10:39:48 -07:00
Linus Torvalds	7b751b01ad	Merge tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull more io_uring updates from Jens Axboe: "This is a mix of cleanups and fixes. No major fixes in here, just a bunch of little fixes. Some of them marked for stable as it fixes behavioral issues - Fix an issue with SOCKET_URING_OP_SETSOCKOPT for netlink sockets, due to a too restrictive check on it having an ioctl handler - Remove a redundant SQPOLL check in ring creation - Kill dead accounting for zero-copy send, which doesn't use ->buf or ->len post the initial setup - Fix missing clamp of the allocation hint, which could cause allocations to fall outside of the range the application asked for. Still within the allowed limits. - Fix for IORING_OP_PIPE's handling of direct descriptors - Tweak to the API for the newly added BPF filters, making them more future proof in terms of how applications deal with them - A few fixes for zcrx, fixing a few error handling conditions - Fix for zcrx request flag checking - Add support for querying the zcrx page size - Improve the NO_SQARRAY static branch inc/dec, avoiding busy conditions causing too much traffic - Various little cleanups" * tag 'io_uring-7.0-20260216' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/bpf_filter: pass in expected filter payload size io_uring/bpf_filter: move filter size and populate helper into struct io_uring/cancel: de-unionize file and user_data in struct io_cancel_data io_uring/rsrc: improve regbuf iov validation io_uring: remove unneeded io_send_zc accounting io_uring/cmd_net: fix too strict requirement on ioctl io_uring: delay sqarray static branch disablement io_uring/query: add query.h copyright notice io_uring/query: return support for custom rx page size io_uring/zcrx: check unsupported flags on import io_uring/zcrx: fix post open error handling io_uring/zcrx: fix sgtable leak on mapping failures io_uring: use the right type for creds iteration io_uring/openclose: fix io_pipe_fixed() slot tracking for specific slots io_uring/filetable: clamp alloc_hint to the configured alloc range io_uring/rsrc: replace reg buffer bit field with flags io_uring/zcrx: improve types for size calculation io_uring/tctx: avoid modifying loop variable in io_ring_add_registered_file io_uring: simplify IORING_SETUP_DEFER_TASKRUN && !SQPOLL check	2026-02-17 08:33:49 -08:00
Pavel Begunkov	7496e658a7	io_uring/zcrx: check unsupported flags on import The imoorted zcrx registration path checks for ZCRX_REG_IMPORT, as it should, but doesn't reject any unsupported flags. Fix that. Cc: stable@vger.kernel.org Fixes: `00d9148127` ("io_uring/zcrx: share an ifq between rings") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-15 14:55:29 -07:00
Pavel Begunkov	5d540e4508	io_uring/zcrx: fix post open error handling Closing a queue doesn't guarantee that all associated page pools are terminated right away, let the refcounting do the work instead of releasing the zcrx ctx directly. Cc: stable@vger.kernel.org Fixes: `e0793de24a` ("io_uring/zcrx: set pp memory provider for an rx queue") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-14 18:05:08 -07:00
Pavel Begunkov	a983aae397	io_uring/zcrx: fix sgtable leak on mapping failures In an unlikely case when io_populate_area_dma() fails, which could only happen on a PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA machine, io_zcrx_map_area() will have an initialised and not freed table. It was supposed to be cleaned up in the error path, but !is_mapped prevents that. Fixes: `439a98b972` ("io_uring/zcrx: deduplicate area mapping") Cc: stable@vger.kernel.org Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-14 18:05:00 -07:00
Linus Torvalds	041c16acba	Merge tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring large rx buffer support from Jens Axboe: "Now that the networking updates are upstream, here's the support for large buffers for zcrx. Using larger (bigger than 4K) rx buffers can increase the effiency of zcrx. For example, it's been shown that using 32K buffers can decrease CPU usage by ~30% compared to 4K buffers" * tag 'for-7.0/io_uring-zcrx-large-buffers-20260206' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: io_uring/zcrx: implement large rx buffer support	2026-02-12 15:07:50 -08:00
Pavel Begunkov	417d029dc4	io_uring/zcrx: improve types for size calculation Make sure io_import_umem() promotes the type to long before calculating the area size. While the area size is capped at 1GB by io_validate_user_buf_range() and fits into an "int", it's still too error prone. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-10 05:26:12 -07:00
Pavel Begunkov	af07330e28	io_uring/zcrx: fix rq flush locking zcrx needs to keep the rq lock for uref manipulations, for now move all zcrx_return_buffers() under the lock. Fixes: `475eb39b00` ("io_uring/zcrx: add sync refill queue flushing") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:19:43 -07:00
Pavel Begunkov	0ae91d8ab7	io_uring/zcrx: fix page array leak `d9f595b9a6` ("io_uring/zcrx: fix leaking pages on sg init fail") fixed a page leakage but didn't free the page array, release it as well. Fixes: `b84621d96e` ("io_uring/zcrx: allocate sgtable for umem areas") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-02-02 08:19:35 -07:00
Pavel Begunkov	795663b4d1	io_uring/zcrx: implement large rx buffer support There are network cards that support receive buffers larger than 4K, and that can be vastly beneficial for performance, and benchmarks for this patch showed up to 30% CPU util improvement for 32K vs 4K buffers. Allows zcrx users to specify the size in struct io_uring_zcrx_ifq_reg::rx_buf_len. If set to zero, zcrx will use a default value. zcrx will check and fail if the memory backing the area can't be split into physically contiguous chunks of the required size. It's more restrictive as it only needs dma addresses to be contig, but that's beyond this series. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: kill duplicate netdev_queues.h include] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2026-01-24 08:33:03 -07:00
David Wei	00d9148127	io_uring/zcrx: share an ifq between rings Add a way to share an ifq from a src ring that is real (i.e. bound to a HW RX queue) with other rings. This is done by passing a new flag IORING_ZCRX_IFQ_REG_IMPORT in the registration struct io_uring_zcrx_ifq_reg, alongside the fd of an exported zcrx ifq. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
David Wei	0926f94ab3	io_uring/zcrx: add io_fill_zcrx_offsets() Add a helper io_fill_zcrx_offsets() that sets the constant offsets in struct io_uring_zcrx_offsets returned to userspace. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	d7af80b213	io_uring/zcrx: export zcrx via a file Add an option to wrap a zcrx instance into a file and expose it to the user space. Currently, users can't do anything meaningful with the file, but it'll be used in a next patch to import it into another io_uring instance. It's implemented as a new op called ZCRX_CTRL_EXPORT for the IORING_REGISTER_ZCRX_CTRL registration opcode. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
David Wei	742cb2e14e	io_uring/zcrx: move io_zcrx_scrub() and dependencies up In preparation for adding zcrx ifq exporting and importing, move io_zcrx_scrub() and its dependencies up the file to be closer to io_close_queue(). Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	39c9676f78	io_uring/zcrx: count zcrx users zcrx tries to detach ifq / terminate page pools when the io_uring ctx owning it is being destroyed. There will be multiple io_uring instances attached to it in the future, so add a separate counter to track the users. Note, refs can't be reused for this purpose as it only used to prevent zcrx and rings destruction, and also used by page pools to keep it alive. Signed-off-by: David Wei <dw@davidwei.uk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	475eb39b00	io_uring/zcrx: add sync refill queue flushing Add an zcrx interface via IORING_REGISTER_ZCRX_CTRL that forces the kernel to flush / consume entries from the refill queue. Just as with the IORING_REGISTER_ZCRX_REFILL attempt, the motivation is to address cases where the refill queue becomes full, and the user can't return buffers and needs to stash them. It's still a slow path, and the user should size refill queue appropriately, but it should be helpful for handling temporary traffic spikes and other unpredictable conditions. The interface is simpler comparing to ZCRX_REFILL as it doesn't need temporary refill entry arrays and gives natural batching, whereas ZCRX_REFILL requires even more user logic to be somewhat efficient. Also, add a structure for the operation. It's not currently used but can serve for future improvements like limiting the number of buffers to process, etc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	d663976dad	io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL It'll be annoying and take enough of boilerplate code to implement new zcrx features as separate io_uring register opcode. Introduce IORING_REGISTER_ZCRX_CTRL that will multiplex such calls to zcrx. Note, there are no real users of the opcode in this patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pedro Demarchi Gomes	a0169c3a62	io_uring/zcrx: use folio_nr_pages() instead of shift operation folio_nr_pages() is a faster helper function to get the number of pages when NR_PAGES_IN_LARGE_FOLIO is enabled. Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Pavel Begunkov	f0243d2b86	io_uring/zcrx: convert to use netmem_desc Convert zcrx to struct netmem_desc, and use struct net_iov::desc to access its fields instead of struct net_iov inner union alises. zcrx only directly reads niov->pp, so with this patch it doesn't depend on the union anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Byungchul Park <byungchul@sk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 11:19:37 -07:00
Jens Axboe	ecb8490b2f	Merge branch 'io_uring-6.18' into for-6.19/io_uring Merge 6.18-rc io_uring fixes, as certain coming changes depend on some of these. * io_uring-6.18: io_uring/rsrc: don't use blk_rq_nr_phys_segments() as number of bvecs io_uring/query: return number of available queries io_uring/rw: ensure allocated iovec gets cleared for early failure io_uring: fix regbuf vector size truncation io_uring: fix types for region size calulation io_uring/zcrx: remove sync refill uapi io_uring: fix buffer auto-commit for multishot uring_cmd io_uring: correct __must_hold annotation in io_install_fixed_file io_uring zcrx: add MAINTAINERS entry io_uring: Fix code indentation error io_uring/sqpoll: be smarter on when to update the stime usage io_uring/sqpoll: switch away from getrusage() for CPU accounting io_uring: fix incorrect unlikely() usage in io_waitid_prep() Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-13 07:26:37 -07:00
David Wei	b6c5f9454e	io_uring/zcrx: call netdev_queue_get_dma_dev() under instance lock netdev ops must be called under instance lock or rtnl_lock, but io_register_zcrx_ifq() isn't doing this for netdev_queue_get_dma_dev(). Fix this by taking the instance lock using netdev_get_by_index_lock(). Extended the instance lock section to include attaching a memory provider. Could not move io_zcrx_create_area() outside, since the dmabuf codepath IORING_ZCRX_AREA_DMABUF requires ifq->dev. Fixes: `59b8b32ac8` ("io_uring/zcrx: add support for custom DMA devices") Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-11 07:53:33 -07:00
David Wei	75c299a917	io_uring/zcrx: reverse ifq refcount Add a refcount to struct io_zcrx_ifq to reverse the refcounting relationship i.e. rings now reference ifqs instead. As a result of this, remove ctx->refs that an ifq holds on a ring via the page pool memory provider. This ref ifq->refs is held by internal users of an ifq, namely rings and the page pool memory provider associated with an ifq. This is needed to keep the ifq around until the page pool is destroyed. Since ifqs now no longer hold refs to ring ctx, there isn't a need to split the cleanup of ifqs into two: io_shutdown_zcrx_ifqs() in io_ring_exit_work() while waiting for ctx->refs to drop to 0, and io_unregister_zcrx_ifqs() after. Remove io_shutdown_zcrx_ifqs(). Signed-off-by: David Wei <dw@davidwei.uk> Co-developed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
David Wei	1bd95163da	io_uring/zcrx: move io_unregister_zcrx_ifqs() down In preparation for removing the ref on ctx->refs held by an ifq and removing io_shutdown_zcrx_ifqs(), move io_unregister_zcrx_ifqs() down such that it can call io_zcrx_scrub(). Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
David Wei	5c686456a4	io_uring/zcrx: add user_struct and mm_struct to io_zcrx_ifq In preparation for removing ifq->ctx and making ifq lifetime independent of ring ctx, add user_struct and mm_struct to io_zcrx_ifq. In the ifq cleanup path, these are the only fields used from the main ring ctx to do accounting. Taking a copy in the ifq allows ifq->ctx to be removed later, including the ctx->refs held by the ifq. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
David Wei	edd706ede8	io_uring/zcrx: add io_zcrx_ifq arg to io_zcrx_free_area() Add io_zcrx_ifq arg to io_zcrx_free_area(). A QOL change to reduce line widths. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
David Wei	6ab39b392e	io_uring/rsrc: refactor io_{un}account_mem() to take {user,mm}_struct param Refactor io_{un}account_mem() to take user_struct and mm_struct directly, instead of accessing it from the ring ctx. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
David Wei	1fa7a34131	io_uring/memmap: refactor io_free_region() to take user_struct param Refactor io_free_region() to take user_struct directly, instead of accessing it from the ring ctx. Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-06 16:23:21 -07:00
Pavel Begunkov	819630bd6f	io_uring/zcrx: remove sync refill uapi There is a better way to handle the problem IORING_REGISTER_ZCRX_REFILL solves. The uapi can also be slightly adjusted to accommodate future extensions. Remove the feature for now, it'll be reworked for the next release. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-11-03 08:55:58 -07:00
Pavel Begunkov	e9a9dcb4cc	io_uring/zcrx: increment fallback loop src offset Don't forget to adjust the source offset in io_copy_page(), otherwise it'll be copying into the same location in some cases for highmem setups. Fixes: `e67645bb7f` ("io_uring/zcrx: prepare fallback for larger pages") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-10-08 07:26:14 -06:00
Pavel Begunkov	09cfd3c52e	io_uring/zcrx: fix overshooting recv limit It's reported that sometimes a zcrx request can receive more than was requested. It's caused by io_zcrx_recv_skb() adjusting desc->count for all received buffers including frag lists, but then doing recursive calls to process frag list skbs, which leads to desc->count double accounting and underflow. Reported-and-tested-by: Matthias Jasny <matthiasjasny@gmail.com> Fixes: `6699ec9a23` ("io_uring/zcrx: add a read limit to recvzc requests") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-10-08 07:26:08 -06:00
Linus Torvalds	8804d970fa	Merge tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...	2025-10-02 18:18:33 -07:00
Linus Torvalds	07fdad3a93	Merge tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - Improve drop account scalability on NUMA hosts for RAW and UDP sockets and the backlog, almost doubling the Pps capacity under DoS - Optimize the UDP RX performance under stress, reducing contention, revisiting the binary layout of the involved data structs and implementing NUMA-aware locking. This improves UDP RX performance by an additional 50%, even more under extreme conditions - Add support for PSP encryption of TCP connections; this mechanism has some similarities with IPsec and TLS, but offers superior HW offloads capabilities - Ongoing work to support Accurate ECN for TCP. AccECN allows more than one congestion notification signal per RTT and is a building block for Low Latency, Low Loss, and Scalable Throughput (L4S) - Reorganize the TCP socket binary layout for data locality, reducing the number of touched cachelines in the fastpath - Refactor skb deferral free to better scale on large multi-NUMA hosts, this improves TCP and UDP RX performances significantly on such HW - Increase the default socket memory buffer limits from 256K to 4M to better fit modern link speeds - Improve handling of setups with a large number of nexthop, making dump operating scaling linearly and avoiding unneeded synchronize_rcu() on delete - Improve bridge handling of VLAN FDB, storing a single entry per bridge instead of one entry per port; this makes the dump order of magnitude faster on large switches - Restore IP ID correctly for encapsulated packets at GSO segmentation time, allowing GRO to merge packets in more scenarios - Improve netfilter matching performance on large sets - Improve MPTCP receive path performance by leveraging recently introduced core infrastructure (skb deferral free) and adopting recent TCP autotuning changes - Allow bridges to redirect to a backup port when the bridge port is administratively down - Introduce MPTCP 'laminar' endpoint that con be used only once per connection and simplify common MPTCP setups - Add RCU safety to dst->dev, closing a lot of possible races - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing code duplication - Supports pulling data from an skb frag into the linear area of an XDP buffer Things we sprinkled into general kernel code: - Generate netlink documentation from YAML using an integrated YAML parser Driver API: - Support using IPv6 Flow Label in Rx hash computation and RSS queue selection - Introduce API for fetching the DMA device for a given queue, allowing TCP zerocopy RX on more H/W setups - Make XDP helpers compatible with unreadable memory, allowing more easily building DevMem-enabled drivers with a unified XDP/skbs datapath - Add a new dedicated ethtool callback enabling drivers to provide the number of RX rings directly, improving efficiency and clarity in RX ring queries and RSS configuration - Introduce a burst period for the health reporter, allowing better handling of multiple errors due to the same root cause - Support for DPLL phase offset exponential moving average, controlling the average smoothing factor Device drivers: - Add a new Huawei driver for 3rd gen NIC (hinic3) - Add a new SpacemiT driver for K1 ethernet MAC - Add a generic abstraction for shared memory communication devices (dibps) - Ethernet high-speed NICs: - nVidia/Mellanox: - Use multiple per-queue doorbell, to avoid MMIO contention issues - support adjacent functions, allowing them to delegate their SR-IOV VFs to sibling PFs - support RSS for IPSec offload - support exposing raw cycle counters in PTP and mlx5 - support for disabling host PFs. - Intel (100G, ice, idpf): - ice: support for SRIOV VFs over an Active-Active link aggregate - ice: support for firmware logging via debugfs - ice: support for Earliest TxTime First (ETF) hardware offload - idpf: support basic XDP functionalities and XSk - Broadcom (bnxt): - support Hyper-V VF ID - dynamic SRIOV resource allocations for RoCE - Meta (fbnic): - support queue API, zero-copy Rx and Tx - support basic XDP functionalities - devlink health support for FW crashes and OTP mem corruptions - expand hardware stats coverage to FEC, PHY, and Pause - Wangxun: - support ethtool coalesce options - support for multiple RSS contexts - Ethernet virtual: - Macsec: - replace custom netlink attribute checks with policy-level checks - Bonding: - support aggregator selection based on port priority - Microsoft vNIC: - use page pool fragments for RX buffers instead of full pages to improve memory efficiency - Ethernet NICs consumer, and embedded: - Qualcomm: support Ethernet function for IPQ9574 SoC - Airoha: implement wlan offloading via NPU - Freescale - enetc: add NETC timer PTP driver and add PTP support - fec: enable the Jumbo frame support for i.MX8QM - Renesas (R-Car S4): - support HW offloading for layer 2 switching - support for RZ/{T2H, N2H} SoCs - Cadence (macb): support TAPRIO traffic scheduling - TI: - support for Gigabit ICSS ethernet SoC (icssm-prueth) - Synopsys (stmmac): a lot of cleanups - Ethernet PHYs: - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS driver - Support bcm63268 GPHY power control - Support for Micrel lan8842 PHY and PTP - Support for Aquantia AQR412 and AQR115 - CAN: - a large CAN-XL preparation work - reorganize raw_sock and uniqframe struct to minimize memory usage - rcar_canfd: update the CAN-FD handling - WiFi: - extended Neighbor Awareness Networking (NAN) support - S1G channel representation cleanup - improve S1G support - WiFi drivers: - Intel (iwlwifi): - major refactor and cleanup - Broadcom (brcm80211): - support for AP isolation - RealTek (rtw88/89) rtw88/89: - preparation work for RTL8922DE support - MediaTek (mt76): - HW restart improvements - MLO support - Qualcomm/Atheros (ath10k): - GTK rekey fixes - Bluetooth drivers: - btusb: support for several new IDs for MT7925 - btintel: support for BlazarIW core - btintel_pcie: support for _suspend() / _resume() - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs" * tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits) net: stmmac: Add support for Allwinner A523 GMAC200 dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible Revert "Documentation: net: add flow control guide and document ethtool API" octeontx2-pf: fix bitmap leak octeontx2-vf: fix bitmap leak net/mlx5e: Use extack in set rxfh callback net/mlx5e: Introduce mlx5e_rss_params for RSS configuration net/mlx5e: Introduce mlx5e_rss_init_params net/mlx5e: Remove unused mdev param from RSS indir init net/mlx5: Improve QoS error messages with actual depth values net/mlx5e: Prevent entering switchdev mode with inconsistent netns net/mlx5: HWS, Generalize complex matchers net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs selftests/net: add tcp_port_share to .gitignore Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set" net: add NUMA awareness to skb_attempt_defer_free() net: use llist for sd->defer_list net: make softnet_data.defer_count an atomic selftests: drv-net: psp: add tests for destroying devices selftests: drv-net: psp: add test for auto-adjusting TCP MSS ...	2025-10-02 15:17:01 -07:00
David Hildenbrand	d99c57546d	io_uring/zcrx: remove nth_page() usage within folio Within a folio/compound page, nth_page() is no longer required. Given that we call folio_test_partial_kmap()+kmap_local_page(), the code would already be problematic if the pages would span multiple folios. So let's just assume that all src pages belong to a single folio/compound page and can be iterated ordinarily. The dst page is currently always a single page, so we're not actually iterating anything. Link: https://lkml.kernel.org/r/20250901150359.867252-21-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-09-21 14:22:05 -07:00
Pavel Begunkov	31bf77dcc3	io_uring/zcrx: account niov arrays to cgroup net_iov / freelist / etc. arrays can be quite long, make sure they're accounted. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:21 -06:00
Pavel Begunkov	705d2ac7b2	io_uring/zcrx: allow synchronous buffer return Returning buffers via a ring is performant and convenient, but it becomes a problem when/if the user misconfigured the ring size and it becomes full. Add a synchronous way to return buffers back to the page pool via a new register opcode. It's supposed to be a reliable slow path for refilling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:21 -06:00
Pavel Begunkov	8fd08d8dda	io_uring/zcrx: introduce io_parse_rqe() Add a helper for verifying a rqe and extracting a niov out of it. It'll be reused in following patches. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:21 -06:00
Pavel Begunkov	5a8b6e7c1d	io_uring/zcrx: don't adjust free cache space The cache should be empty when io_pp_zc_alloc_netmems() is called, that's promised by page pool and further checked, so there is no need to recalculate the available space in io_zcrx_ring_refill(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:21 -06:00
Pavel Begunkov	c95257f336	io_uring/zcrx: use guards for the refill lock Use guards for rq_lock in io_zcrx_ring_refill(), makes it a tad simpler. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:21 -06:00
Pavel Begunkov	73fa880eff	io_uring/zcrx: reduce netmem scope in refill Reduce the scope of a local var netmem in io_zcrx_ring_refill. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	20dda449c0	io_uring/zcrx: protect netdev with pp_lock Remove ifq->lock and reuse pp_lock to protect the netdev pointer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	4f602f3112	io_uring/zcrx: rename dma lock In preparation for reusing the lock for other purposes, rename it to "pp_lock". As before, it can be taken deeper inside the networking stack by page pool, and so the syscall io_uring must avoid holding it while doing queue reconfiguration or anything that can result in immediate pp init/destruction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	d8d135dfe3	io_uring/zcrx: make niov size variable Instead of using PAGE_SIZE for the niov size add a niov_shift field to ifq, and patch up all important places. Copy fallback still assumes PAGE_SIZE, so it'll be wasting some memory for now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	5d93f7bade	io_uring/zcrx: set sgt for umem area Set struct io_zcrx_mem::sgt for umem areas as well to simplify looking up the current sg table. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00
Pavel Begunkov	6c18511729	io_uring/zcrx: remove dmabuf_offset It was removed from uapi, so now it's always 0 and can be removed together with offset handling in io_populate_area_dma(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-09-16 12:37:20 -06:00

1 2 3

113 Commits