mirror of
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
synced 2026-04-18 03:23:53 -04:00
9408d8a37e6cce8803681ab816383450a056c3a9
9811 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9408d8a37e |
nvme: improved uring polling
Drivers can poll requests directly, so use that. We just need to ensure the driver's request was allocated from a polled hctx, so a special driver flag is added to struct io_uring_cmd. The allows unshared and multipath namespaces to use the same polling callback, and multipath is guaranteed to get the same queue as the command was submitted on. Previously multipath polling might check a different path and poll the wrong info. The other bonus is we don't need a bio payload in order to poll, allowing commands like 'flush' and 'write zeroes' to be submitted on the same high priority queue as read and write commands. Finally, using the request based polling skips the unnecessary bio overhead. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
84fccbba93 |
Merge tag 'spi-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"One small core feature this time around but mostly driver improvements
and additions for SPI:
- Add support for controlling the idle state of MOSI, some systems
can support this and depending on the system integration may need
it to avoid glitching in some situations
- Support for polling mode in the S3C64xx driver and DMA on the
Qualcomm QSPI driver
- Support for several Allwinner SoCs, AMD Pensando Elba, Intel Mount
Evans, Renesas RZ/V2M, and ST STM32H7"
* tag 'spi-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (66 commits)
spi: dt-bindings: atmel,at91rm9200-spi: fix broken sam9x7 compatible
spi: dt-bindings: atmel,at91rm9200-spi: add sam9x7 compatible
spi: Add support for Renesas CSI
spi: dt-bindings: Add bindings for RZ/V2M CSI
spi: sun6i: Use the new helper to derive the xfer timeout value
spi: atmel: Prevent false timeouts on long transfers
spi: dt-bindings: stm32: do not disable spi-slave property for stm32f4-f7
spi: Create a helper to derive adaptive timeouts
spi: spi-geni-qcom: correctly handle -EPROBE_DEFER from dma_request_chan()
spi: stm32: disable spi-slave property for stm32f4-f7
spi: stm32: introduction of stm32h7 SPI device mode support
spi: stm32: use dmaengine_terminate_{a}sync instead of _all
spi: stm32: renaming of spi_master into spi_controller
spi: dw: Remove misleading comment for Mount Evans SoC
spi: dt-bindings: snps,dw-apb-ssi: Add compatible for Intel Mount Evans SoC
spi: dw: Add compatible for Intel Mount Evans SoC
spi: s3c64xx: Use dev_err_probe()
spi: s3c64xx: Use the managed spi master allocation function
spi: spl022: Probe defer is no error
spi: spi-imx: fix mixing of native and gpio chipselects for imx51/imx53/imx6 variants
...
|
||
|
|
6e17c6de3d |
Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ... |
||
|
|
582c161cf3 |
Merge tag 'hardening-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook: "There are three areas of note: A bunch of strlcpy()->strscpy() conversions ended up living in my tree since they were either Acked by maintainers for me to carry, or got ignored for multiple weeks (and were trivial changes). The compiler option '-fstrict-flex-arrays=3' has been enabled globally, and has been in -next for the entire devel cycle. This changes compiler diagnostics (though mainly just -Warray-bounds which is disabled) and potential UBSAN_BOUNDS and FORTIFY _warning_ coverage. In other words, there are no new restrictions, just potentially new warnings. Any new FORTIFY warnings we've seen have been fixed (usually in their respective subsystem trees). For more details, see commit |
||
|
|
d416a46c95 |
Merge tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull execve updates from Kees Cook: - Fix a few comments for correctness and typos (Baruch Siach) - Small simplifications for binfmt (Christophe JAILLET) - Set p_align to 4 for PT_NOTE in core dump (Fangrui Song) * tag 'execve-v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: fix comment typo s/reset/regset/ elf: correct note name comment binfmt: Slightly simplify elf_fdpic_map_file() binfmt: Use struct_size() coredump, vmcore: Set p_align to 4 for PT_NOTE |
||
|
|
21953eb16c |
Merge tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore: - A SafeSetID patch to correct what appears to be a cut-n-paste typo in the code causing a UID to be printed where a GID was desired. This is coming via the LSM tree because we haven't been able to get a response from the SafeSetID maintainer (Micah Morton) in several months. Hopefully we are able to get in touch with Micah, but until we do I'm going to pick them up in the LSM tree. - A small fix to the reiserfs LSM xattr code. We're continuing to work through some issues with the reiserfs code as we try to fixup the LSM xattr handling, but in the process we're uncovering some ugly problems in reiserfs and we may just end up removing the LSM xattr support in reiserfs prior to reiserfs' removal. For better or worse, this shouldn't impact any of the reiserfs users, as we discovered that LSM xattrs on reiserfs were completely broken, meaning no one is currently using the combo of reiserfs and a file labeling LSM. - A tweak to how the cap_user_data_t struct/typedef is declared in the header file to appease the Sparse gods. - In the process of trying to sort out the SafeSetID lost-maintainer problem I realized that I needed to update the labeled networking entry to "Supported". - Minor comment/documentation and spelling fixes. * tag 'lsm-pr-20230626' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: device_cgroup: Fix kernel-doc warnings in device_cgroup SafeSetID: fix UID printed instead of GID MAINTAINERS: move labeled networking to "supported" capability: erase checker warnings about struct __user_cap_data_struct lsm: fix a number of misspellings reiserfs: Initialize sec->length in reiserfs_security_init(). capability: fix kernel-doc warnings in capability.c |
||
|
|
6a46676994 |
Merge tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Alexander Gordeev: - Fix the style of protected key API driver source: use x-mas tree for all local variable declarations - Rework protected key API driver to not use the struct pkey_protkey and pkey_clrkey anymore. Both structures have a fixed size buffer, but with the support of ECC protected key these buffers are not big enough. Use dynamic buffers internally and transparently for userspace - Add support for a new 'non CCA clear key token' with ECC clear keys supported: ECC P256, ECC P384, ECC P521, ECC ED25519 and ECC ED448. This makes it possible to derive a protected key from the ECC clear key input via PKEY_KBLOB2PROTK3 ioctl, while currently the only way to derive is via PCKMO instruction - The s390 PMU of PAI crypto and extension 1 NNPA counters use atomic_t for reference counting. Replace this with the proper data type refcount_t - Select ARCH_SUPPORTS_INT128, but limit this to clang for now, since gcc generates inefficient code, which may lead to stack overflows - Replace one-element array with flexible-array member in struct vfio_ccw_parent and refactor the rest of the code accordingly. Also, prefer struct_size() over sizeof() open- coded versions - Introduce OS_INFO_FLAGS_ENTRY pointing to a flags field and OS_INFO_FLAG_REIPL_CLEAR flag that informs a dumper whether the system memory should be cleared or not once dumped - Fix a hang when a user attempts to remove a VFIO-AP mediated device attached to a guest: add VFIO_DEVICE_GET_IRQ_INFO and VFIO_DEVICE_SET_IRQS IOCTLs and wire up the VFIO bus driver callback to request a release of the device - Fix calculation for R_390_GOTENT relocations for modules - Allow any user space process with CAP_PERFMON capability read and display the CPU Measurement facility counter sets - Rework large statically-defined per-CPU cpu_cf_events data structure and replace it with dynamically allocated structures created when a perf_event_open() system call is invoked or /dev/hwctr device is accessed * tag 's390-6.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/cpum_cf: rework PER_CPU_DEFINE of struct cpu_cf_events s390/cpum_cf: open access to hwctr device for CAP_PERFMON privileged process s390/module: fix rela calculation for R_390_GOTENT s390/vfio-ap: wire in the vfio_device_ops request callback s390/vfio-ap: realize the VFIO_DEVICE_SET_IRQS ioctl s390/vfio-ap: realize the VFIO_DEVICE_GET_IRQ_INFO ioctl s390/pkey: add support for ecc clear key s390/pkey: do not use struct pkey_protkey s390/pkey: introduce reverse x-mas trees s390/zcore: conditionally clear memory on reipl s390/ipl: add REIPL_CLEAR flag to os_info vfio/ccw: use struct_size() helper vfio/ccw: replace one-element array with flexible-array member s390: select ARCH_SUPPORTS_INT128 s390/pai_ext: replace atomic_t with refcount_t s390/pai_crypto: replace atomic_t with refcount_t |
||
|
|
bc6cb4d5bc |
Merge tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Introduce cmpxchg128() -- aka. the demise of cmpxchg_double()
The cmpxchg128() family of functions is basically & functionally the
same as cmpxchg_double(), but with a saner interface.
Instead of a 6-parameter horror that forced u128 - u64/u64-halves
layout details on the interface and exposed users to complexity,
fragility & bugs, use a natural 3-parameter interface with u128
types.
- Restructure the generated atomic headers, and add kerneldoc comments
for all of the generic atomic{,64,_long}_t operations.
The generated definitions are much cleaner now, and come with
documentation.
- Implement lock_set_cmp_fn() on lockdep, for defining an ordering when
taking multiple locks of the same type.
This gets rid of one use of lockdep_set_novalidate_class() in the
bcache code.
- Fix raw_cpu_generic_try_cmpxchg() bug due to an unintended variable
shadowing generating garbage code on Clang on certain ARM builds.
* tag 'locking-core-2023-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
locking/atomic: scripts: fix ${atomic}_dec_if_positive() kerneldoc
percpu: Fix self-assignment of __old in raw_cpu_generic_try_cmpxchg()
locking/atomic: treewide: delete arch_atomic_*() kerneldoc
locking/atomic: docs: Add atomic operations to the driver basic API documentation
locking/atomic: scripts: generate kerneldoc comments
docs: scripts: kernel-doc: accept bitwise negation like ~@var
locking/atomic: scripts: simplify raw_atomic*() definitions
locking/atomic: scripts: simplify raw_atomic_long*() definitions
locking/atomic: scripts: split pfx/name/sfx/order
locking/atomic: scripts: restructure fallback ifdeffery
locking/atomic: scripts: build raw_atomic_long*() directly
locking/atomic: treewide: use raw_atomic*_<op>()
locking/atomic: scripts: add trivial raw_atomic*_<op>()
locking/atomic: scripts: factor out order template generation
locking/atomic: scripts: remove leftover "${mult}"
locking/atomic: scripts: remove bogus order parameter
locking/atomic: xtensa: add preprocessor symbols
locking/atomic: x86: add preprocessor symbols
locking/atomic: sparc: add preprocessor symbols
locking/atomic: sh: add preprocessor symbols
...
|
||
|
|
a0433f8cae |
Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
- Various cleanups all around (Irvin, Chaitanya, Christophe)
- Better struct packing (Christophe JAILLET)
- Reduce controller error logs for optional commands (Keith)
- Support for >=64KiB block sizes (Daniel Gomez)
- Fabrics fixes and code organization (Max, Chaitanya, Daniel
Wagner)
- bcache updates via Coly:
- Fix a race at init time (Mingzhe Zou)
- Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
- use page pinning in the block layer for dio (David)
- convert old block dio code to page pinning (David, Christoph)
- cleanups for pktcdvd (Andy)
- cleanups for rnbd (Guoqing)
- use the unchecked __bio_add_page() for the initial single page
additions (Johannes)
- fix overflows in the Amiga partition handling code (Michael)
- improve mq-deadline zoned device support (Bart)
- keep passthrough requests out of the IO schedulers (Christoph, Ming)
- improve support for flush requests, making them less special to deal
with (Christoph)
- add bdev holder ops and shutdown methods (Christoph)
- fix the name_to_dev_t() situation and use cases (Christoph)
- decouple the block open flags from fmode_t (Christoph)
- ublk updates and cleanups, including adding user copy support (Ming)
- BFQ sanity checking (Bart)
- convert brd from radix to xarray (Pankaj)
- constify various structures (Thomas, Ivan)
- more fine grained persistent reservation ioctl capability checks
(Jingbo)
- misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
Jordy, Li, Min, Yu, Zhong, Waiman)
* tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
scsi/sg: don't grab scsi host module reference
ext4: Fix warning in blkdev_put()
block: don't return -EINVAL for not found names in devt_from_devname
cdrom: Fix spectre-v1 gadget
block: Improve kernel-doc headers
blk-mq: don't insert passthrough request into sw queue
bsg: make bsg_class a static const structure
ublk: make ublk_chr_class a static const structure
aoe: make aoe_class a static const structure
block/rnbd: make all 'class' structures const
block: fix the exclusive open mask in disk_scan_partitions
block: add overflow checks for Amiga partition support
block: change all __u32 annotations to __be32 in affs_hardblocks.h
block: fix signed int overflow in Amiga partition support
block: add capacity validation in bdev_add_partition()
block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
block: disallow Persistent Reservation on partitions
reiserfs: fix blkdev_put() warning from release_journal_dev()
block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
block: document the holder argument to blkdev_get_by_path
...
|
||
|
|
0aa69d53ac |
Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
"Nothing major in this release, just a bunch of cleanups and some
optimizations around networking mostly.
- clean up file request flags handling (Christoph)
- clean up request freeing and CQ locking (Pavel)
- support for using pre-registering the io_uring fd at setup time
(Josh)
- Add support for user allocated ring memory, rather than having the
kernel allocate it. Mostly for packing rings into a huge page (me)
- avoid an unnecessary double retry on receive (me)
- maintain ordering for task_work, which also improves performance
(me)
- misc cleanups/fixes (Pavel, me)"
* tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
io_uring: merge conditional unlock flush helpers
io_uring: make io_cq_unlock_post static
io_uring: inline __io_cq_unlock
io_uring: fix acquire/release annotations
io_uring: kill io_cq_unlock()
io_uring: remove IOU_F_TWQ_FORCE_NORMAL
io_uring: don't batch task put on reqs free
io_uring: move io_clean_op()
io_uring: inline io_dismantle_req()
io_uring: remove io_free_req_tw
io_uring: open code io_put_req_find_next
io_uring: add helpers to decode the fixed file file_ptr
io_uring: use io_file_from_index in io_msg_grab_file
io_uring: use io_file_from_index in __io_sync_cancel
io_uring: return REQ_F_ flags from io_file_get_flags
io_uring: remove io_req_ffs_set
io_uring: remove a confusing comment above io_file_get_flags
io_uring: remove the mode variable in io_file_get_flags
io_uring: remove __io_file_supports_nowait
io_uring: wait interruptibly for request completions on exit
...
|
||
|
|
c0a572d9d3 |
Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs mount updates from Christian Brauner:
"This contains the work to extend move_mount() to allow adding a mount
beneath the topmost mount of a mount stack.
There are two LWN articles about this. One covers the original patch
series in [1]. The other in [2] summarizes the session and roughly the
discussion between Al and me at LSFMM. The second article also goes
into some good questions from attendees.
Since all details are found in the relevant commit with a technical
dive into semantics and locking at the end I'm only adding the
motivation and core functionality for this from commit message and
leave out the invasive details. The code is also heavily commented and
annotated as well which was explicitly requested.
TL;DR:
> mount -t ext4 /dev/sda /mnt
|
└─/mnt /dev/sda ext4
> mount --beneath -t xfs /dev/sdb /mnt
|
└─/mnt /dev/sdb xfs
└─/mnt /dev/sda ext4
> umount /mnt
|
└─/mnt /dev/sdb xfs
The longer motivation is that various distributions are adding or are
in the process of adding support for system extensions and in the
future configuration extensions through various tools. A more detailed
explanation on system and configuration extensions can be found on the
manpage which is listed below at [3].
System extension images may – dynamically at runtime — extend the
/usr/ and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.
When one or more system extension images are activated, their /usr/
and /opt/ hierarchies are combined via overlayfs with the same
hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
with it ("merging"). When they are deactivated, the mount point is
disassembled — again revealing the unmodified original host version of
the hierarchy ("unmerging"). Merging thus makes the extension's
resources suddenly appear below the /usr/ and /opt/ hierarchies as if
they were included in the base OS image itself. Unmerging makes them
disappear again, leaving in place only the files that were shipped
with the base OS image itself.
System configuration images are similar but operate on directories
containing system or service configuration.
On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group
(usually with peer group id 1):
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1
On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation
mechanism services have. But usually they are even much more isolated
up to the point where they almost become indistinguishable from
containers.
Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.
In addition, the rootfs of each service, container, and sandbox is
also a shared mount in its separate peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47
For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is
the host rootfs as indicated by shared:1 above. The shared:24
indicates that the service rootfs is a shared mount in a separate peer
group with peer group id 24.
A service may run other services. Such nested services will also have
a rootfs mount that is a slave to the peer group of the outer service
rootfs mount.
For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60
So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the
shared mount /run on the host:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68
Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert
mounts into mount namespaces directly. But the new mount api does
support inserting mounts directly. For the interested reader the
blogpost in [4] might be worth reading where I explain the old and the
new approach to inserting mounts into mount namespaces.
Containers of course, can themselves be run as services. They often
run full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.
The whole system is designed so that it can be easily updated,
including all services in various fine-grained ways without having to
enter every single service's mount namespace which would be
prohibitively expensive. The mount propagation layout has been
carefully chosen so it is possible to propagate updates for system
extensions and configurations from the host into all services.
The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where
the old base system is accessible. This has to be avoided to protect
against downgrade attacks.
The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath
a top mount through the new MOVE_MOUNT_BENEATH flag for the
move_mount() system call. This allows userspace to seamlessly upgrade
mounts. After this series the only thing that will have changed is
that mounting beneath an existing mount can be done explicitly instead
of just implicitly.
The crux is that the proposed mechanism already exists and that it is
so powerful as to cover cases where mounts are supposed to be updated
with new versions. Crucially, it offers an important flexibility.
Namely that updates to a system may either be forced or can be delayed
and the umount of the top mount be left to a service if it is a
cooperative one"
Link: https://lwn.net/Articles/927491 [1]
Link: https://lwn.net/Articles/934094 [2]
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013
* tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: allow to mount beneath top mount
fs: use a for loop when locking a mount
fs: properly document __lookup_mnt()
fs: add path_mounted()
|
||
|
|
64bf6ae93e |
Merge tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Miscellaneous features, cleanups, and fixes for vfs and individual fs
Features:
- Use mode 0600 for file created by cachefilesd so it can be run by
unprivileged users. This aligns them with directories which are
already created with mode 0700 by cachefilesd
- Reorder a few members in struct file to prevent some false sharing
scenarios
- Indicate that an eventfd is used a semaphore in the eventfd's
fdinfo procfs file
- Add a missing uapi header for eventfd exposing relevant uapi
defines
- Let the VFS protect transitions of a superblock from read-only to
read-write in addition to the protection it already provides for
transitions from read-write to read-only. Protecting read-only to
read-write transitions allows filesystems such as ext4 to perform
internal writes, keeping writers away until the transition is
completed
Cleanups:
- Arnd removed the architecture specific arch_report_meminfo()
prototypes and added a generic one into procfs.h. Note, we got a
report about a warning in amdpgpu codepaths that suggested this was
bisectable to this change but we concluded it was a false positive
- Remove unused parameters from split_fs_names()
- Rename put_and_unmap_page() to unmap_and_put_page() to let the name
reflect the order of the cleanup operation that has to unmap before
the actual put
- Unexport buffer_check_dirty_writeback() as it is not used outside
of block device aops
- Stop allocating aio rings from highmem
- Protecting read-{only,write} transitions in the VFS used open-coded
barriers in various places. Replace them with proper little helpers
and document both the helpers and all barrier interactions involved
when transitioning between read-{only,write} states
- Use flexible array members in old readdir codepaths
Fixes:
- Use the correct type __poll_t for epoll and eventfd
- Replace all deprecated strlcpy() invocations, whose return value
isn't checked with an equivalent strscpy() call
- Fix some kernel-doc warnings in fs/open.c
- Reduce the stack usage in jffs2's xattr codepaths finally getting
rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088
bytes is larger than 1024 bytes [-Werror=frame-larger-than=]
royally annoying compilation warning
- Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not
fmode_t is required to avoid fmode_t to integer degradation
warnings
- Create coredumps with O_WRONLY instead of O_RDWR. There's a long
explanation in that commit how O_RDWR is actually a bug which we
found out with the help of Linus and git archeology
- Fix "no previous prototype" warnings in the pipe codepaths
- Add overflow calculations for remap_verify_area() as a signed
addition overflow could be triggered in xfstests
- Fix a null pointer dereference in sysv
- Use an unsigned variable for length calculations in jfs avoiding
compilation warnings with gcc 13
- Fix a dangling pipe pointer in the watch queue codepath
- The legacy mount option parser provided as a fallback by the VFS
for filesystems not yet converted to the new mount api did prefix
the generated mount option string with a leading ',' causing issues
for some filesystems
- Fix a repeated word in a comment in fs.h
- autofs: Update the ctime when mtime is updated as mandated by
POSIX"
* tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
readdir: Replace one-element arrays with flexible-array members
fs: Provide helpers for manipulating sb->s_readonly_remount
fs: Protect reconfiguration of sb read-write from racing writes
eventfd: add a uapi header for eventfd userspace APIs
autofs: set ctime as well when mtime changes on a dir
eventfd: show the EFD_SEMAPHORE flag in fdinfo
fs/aio: Stop allocating aio rings from HIGHMEM
fs: Fix comment typo
fs: unexport buffer_check_dirty_writeback
fs: avoid empty option when generating legacy mount string
watch_queue: prevent dangling pipe pointer
fs.h: Optimize file struct to prevent false sharing
highmem: Rename put_and_unmap_page() to unmap_and_put_page()
cachefiles: Allow the cache to be non-root
init: remove unused names parameter in split_fs_names()
jfs: Use unsigned variable for length calculations
fs/sysv: Null check to prevent null-ptr-deref bug
fs: use UB-safe check for signed addition overflow in remap_verify_area
procfs: consolidate arch_report_meminfo declaration
fs: pipe: reveal missing function protoypes
...
|
||
|
|
0b3d412798 |
elf: correct note name comment
NT_PRFPREG note is named "CORE". Correct the comment accordingly.
Fixes:
|
||
|
|
95a55437dc |
block: change all __u32 annotations to __be32 in affs_hardblocks.h
The Amiga partition parser module uses signed int for partition sector
address and count, which will overflow for disks larger than 1 TB.
Use u64 as type for sector address and size to allow using disks up to
2 TB without LBD support, and disks larger than 2 TB with LBD. The RBD
format allows to specify disk sizes up to 2^128 bytes (though native
OS limitations reduce this somewhat, to max 2^68 bytes), so check for
u64 overflow carefully to protect against overflowing sector_t.
This bug was reported originally in 2012, and the fix was created by
the RDB author, Joanne Dow <jdow@earthlink.net>. A patch had been
discussed and reviewed on linux-m68k at that time but never officially
submitted (now resubmitted as patch 1 of this series).
Patch 3 (this series) adds additional error checking and warning
messages. One of the error checks now makes use of the previously
unused rdb_CylBlocks field, which causes a 'sparse' warning
(cast to restricted __be32).
Annotate all 32 bit fields in affs_hardblocks.h as __be32, as the
on-disk format of RDB and partition blocks is always big endian.
Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=43511
Fixes:
|
||
|
|
2d8c9dcf71 |
eventfd: add a uapi header for eventfd userspace APIs
Create a uapi header include/uapi/linux/eventfd.h, move the associated flags to the uapi header, and include it from linux/eventfd.h. Suggested-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Wen Yang <wenyang.linux@foxmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dylan Yudaken <dylany@fb.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Eric Biggers <ebiggers@google.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <tencent_2B6A999A23E86E522D5D9859D54FFCF9AA05@qq.com> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
52f79609c0 |
net: ethtool: correct MAX attribute value for stats
When compiling YNL generated code compiler complains about
array-initializer-out-of-bounds. Turns out the MAX value
for STATS_GRP uses the value for STATS.
This may lead to random corruptions in user space (kernel
itself doesn't use this value as it never parses stats).
Fixes:
|
||
|
|
cf264e1329 |
cachestat: implement cachestat syscall
There is currently no good way to query the page cache state of large file
sets and directory trees. There is mincore(), but it scales poorly: the
kernel writes out a lot of bitmap data that userspace has to aggregate,
when the user really doesn not care about per-page information in that
case. The user also needs to mmap and unmap each file as it goes along,
which can be quite slow as well.
Some use cases where this information could come in handy:
* Allowing database to decide whether to perform an index scan or
direct table queries based on the in-memory cache state of the
index.
* Visibility into the writeback algorithm, for performance issues
diagnostic.
* Workload-aware writeback pacing: estimating IO fulfilled by page
cache (and IO to be done) within a range of a file, allowing for
more frequent syncing when and where there is IO capacity, and
batching when there is not.
* Computing memory usage of large files/directory trees, analogous to
the du tool for disk usage.
More information about these use cases could be found in the following
thread:
https://lore.kernel.org/lkml/20230315170934.GA97793@cmpxchg.org/
This patch implements a new syscall that queries cache state of a file and
summarizes the number of cached pages, number of dirty pages, number of
pages marked for writeback, number of (recently) evicted pages, etc. in a
given range. Currently, the syscall is only wired in for x86
architecture.
NAME
cachestat - query the page cache statistics of a file.
SYNOPSIS
#include <sys/mman.h>
struct cachestat_range {
__u64 off;
__u64 len;
};
struct cachestat {
__u64 nr_cache;
__u64 nr_dirty;
__u64 nr_writeback;
__u64 nr_evicted;
__u64 nr_recently_evicted;
};
int cachestat(unsigned int fd, struct cachestat_range *cstat_range,
struct cachestat *cstat, unsigned int flags);
DESCRIPTION
cachestat() queries the number of cached pages, number of dirty
pages, number of pages marked for writeback, number of evicted
pages, number of recently evicted pages, in the bytes range given by
`off` and `len`.
An evicted page is a page that is previously in the page cache but
has been evicted since. A page is recently evicted if its last
eviction was recent enough that its reentry to the cache would
indicate that it is actively being used by the system, and that
there is memory pressure on the system.
These values are returned in a cachestat struct, whose address is
given by the `cstat` argument.
The `off` and `len` arguments must be non-negative integers. If
`len` > 0, the queried range is [`off`, `off` + `len`]. If `len` ==
0, we will query in the range from `off` to the end of the file.
The `flags` argument is unused for now, but is included for future
extensibility. User should pass 0 (i.e no flag specified).
Currently, hugetlbfs is not supported.
Because the status of a page can change after cachestat() checks it
but before it returns to the application, the returned values may
contain stale information.
RETURN VALUE
On success, cachestat returns 0. On error, -1 is returned, and errno
is set to indicate the error.
ERRORS
EFAULT cstat or cstat_args points to an invalid address.
EINVAL invalid flags.
EBADF invalid file descriptor.
EOPNOTSUPP file descriptor is of a hugetlbfs file
[nphamcs@gmail.com: replace rounddown logic with the existing helper]
Link: https://lkml.kernel.org/r/20230504022044.3675469-1-nphamcs@gmail.com
Link: https://lkml.kernel.org/r/20230503013608.2431726-3-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Brian Foster <bfoster@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
||
|
|
3a41db531e |
pktcdvd: Get rid of custom printing macros
We may use traditional dev_*() macros instead of custom ones provided by the driver. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20230310164549.22133-2-andriy.shevchenko@linux.intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
5538213436 |
capability: erase checker warnings about struct __user_cap_data_struct
Currently Sparse warns the following when compiling kernel/capability.c:
kernel/capability.c:191:35: warning: incorrect type in argument 2
(different address spaces)
kernel/capability.c:191:35: expected void const *from
kernel/capability.c:191:35: got struct __user_cap_data_struct
[noderef] __user *
kernel/capability.c:168:14: warning: dereference of noderef expression
...... (multiple noderef warnings on different locations)
kernel/capability.c:244:29: warning: incorrect type in argument 1
(different address spaces)
kernel/capability.c:244:29: expected void *to
kernel/capability.c:244:29: got struct __user_cap_data_struct
[noderef] __user ( * )[2]
kernel/capability.c:247:42: warning: dereference of noderef expression
...... (multiple noderef warnings on different locations)
It seems that defining `struct __user_cap_data_struct` together with
`cap_user_data_t` make Sparse believe that the struct is `noderef` as
well. Separate their definitions to clarify their respective attributes.
Signed-off-by: GONG, Ruiqi <gongruiqi@huaweicloud.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
[PM: wrapped long lines in the description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
||
|
|
6afc770048 |
s390/vfio-ap: realize the VFIO_DEVICE_GET_IRQ_INFO ioctl
Realize the VFIO_DEVICE_GET_IRQ_INFO ioctl to retrieve the information for the VFIO device request IRQ. Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Link: https://lore.kernel.org/r/20230530223538.279198-2-akrowiak@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> |
||
|
|
132328e8e8 |
bpf: netfilter: Add BPF_NETFILTER bpf_attach_type
Andrii Nakryiko writes:
And we currently don't have an attach type for NETLINK BPF link.
Thankfully it's not too late to add it. I see that link_create() in
kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
have done that. Instead we need to add BPF_NETLINK attach type to enum
bpf_attach_type. And wire all that properly throughout the kernel and
libbpf itself.
This adds BPF_NETFILTER and uses it. This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.
v2: check link_attack prog type in link_create too
Fixes:
|
||
|
|
224d80c584 |
types: Introduce [us]128
Introduce [us]128 (when available). Unlike [us]64, ensure they are always naturally aligned. This also enables 128bit wide atomics (which require natural alignment) such as cmpxchg128(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20230531132323.385005581@infradead.org |
||
|
|
b5bbc52fd0 |
ublk: add control command of UBLK_U_CMD_GET_FEATURES
Add control command of UBLK_U_CMD_GET_FEATURES for returning driver's feature set or capability. This way can simplify userspace for maintaining compatibility because userspace doesn't need to send command to one device for querying driver feature set any more. Such as, with the queried feature set, userspace can choose to use: - UBLK_CMD_GET_DEV_INFO2 or UBLK_CMD_GET_DEV_INFO, - UBLK_U_CMD_* or UBLK_CMD_* Userspace code: https://github.com/ming1/ubdsrv/commits/features-cmd Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230603040601.775227-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
e910c8e3aa |
autofs: use flexible array in ioctl structure
Commit
|
||
|
|
a45baa079e |
spi: add SPI_MOSI_IDLE_LOW mode bit
Some spi controller switch the mosi line to high, whenever they are idle. This may not be desired in all use cases. For example neopixel leds can get confused and flicker due to misinterpreting the idle state. Therefore, we introduce a new spi-mode bit, with which the idle behaviour can be overwritten on a per device basis. Signed-off-by: Boerge Struempfel <boerge.struempfel@gmail.com> Link: https://lore.kernel.org/r/20230530141641.1155691-2-boerge.struempfel@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org> |
||
|
|
26fb5480a2 |
net/handshake: Enable the SNI extension to work properly
Enable the upper layer protocol to specify the SNI peername. This
avoids the need for tlshd to use a DNS lookup, which can return a
hostname that doesn't match the incoming certificate's SubjectName.
Fixes:
|
||
|
|
3632679d9e |
ipv{4,6}/raw: fix output xfrm lookup wrt protocol
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the
protocol field of the flow structure, build by raw_sendmsg() /
rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy
lookup when some policies are defined with a protocol in the selector.
For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to
specify the protocol. Just accept all values for IPPROTO_RAW socket.
For ipv4, the sin_port field of 'struct sockaddr_in' could not be used
without breaking backward compatibility (the value of this field was never
checked). Let's add a new kind of control message, so that the userland
could specify which protocol is used.
Fixes:
|
||
|
|
1172d5b8be |
ublk: support user copy
Currently copy between io request buffer(pages) and userspace buffer is done inside ublk_map_io() or ublk_unmap_io(). This way performs very well in case of pre-allocated userspace io buffer. For dynamically allocated or external userspace backend io buffer, UBLK_F_NEED_GET_DATA is added for ublk server to provide buffer by one extra command communication for WRITE request. For READ, userspace simply provides buffer, but can't know when the buffer is done[1]. Add UBLK_F_USER_COPY by moving io data copy out of kernel by providing read()/write() on /dev/ublkcN, and simply let ublk server do the io data copy. This way makes both side cleaner, the cost is that one extra syscall for copy io data between request and backend buffer. With UBLK_F_USER_COPY, it actually becomes possible to run per-io zero copy now, such as, only do zero copy for big size IO, so it can be thought as one prep patch for supporting zero copy. Meantime zero copy still needs to expose read()/write() buffer for some corner case, such as passthrough IO. [1] READ buffer in UBLK_F_NEED_GET_DATA https://lore.kernel.org/linux-block/116d8a56-0881-56d3-9bcc-78ff3e1dc4e5@linux.alibaba.com/T/#m23bd4b8634c0a054e6797063167b469949a247bb ublksrv loop usercopy code: https://github.com/ming1/ubdsrv/commits/usercopy Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230519065030.351216-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
62fe99cef9 |
ublk: add read()/write() support for ublk char device
Support pread()/pwrite() on ublk char device for reading/writing request io buffer, so data copy between io request buffer and userspace buffer can be moved to ublk server from ublk driver. Then UBLK_F_NEED_GET_DATA becomes not necessary, so ublk server can allocate buffer without one extra round uring command communication for userspace to provide buffer. IO buffer can be located by iocb->ki_pos which encodes buffer offset, io tag and queue id info, and type of iocb->ki_pos is u64, so it is big enough for holding reasonable queue depth, nr_queues and max io buffer size. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230519065030.351216-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
6ac3928156 |
fs: allow to mount beneath top mount
Various distributions are adding or are in the process of adding support
for system extensions and in the future configuration extensions through
various tools. A more detailed explanation on system and configuration
extensions can be found on the manpage which is listed below at [1].
System extension images may – dynamically at runtime — extend the /usr/
and /opt/ directory hierarchies with additional files. This is
particularly useful on immutable system images where a /usr/ and/or
/opt/ hierarchy residing on a read-only file system shall be extended
temporarily at runtime without making any persistent modifications.
When one or more system extension images are activated, their /usr/ and
/opt/ hierarchies are combined via overlayfs with the same hierarchies
of the host OS, and the host /usr/ and /opt/ overmounted with it
("merging"). When they are deactivated, the mount point is disassembled
— again revealing the unmodified original host version of the hierarchy
("unmerging"). Merging thus makes the extension's resources suddenly
appear below the /usr/ and /opt/ hierarchies as if they were included in
the base OS image itself. Unmerging makes them disappear again, leaving
in place only the files that were shipped with the base OS image itself.
System configuration images are similar but operate on directories
containing system or service configuration.
On nearly all modern distributions mount propagation plays a crucial
role and the rootfs of the OS is a shared mount in a peer group (usually
with peer group id 1):
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:1 29 1
On such systems all services and containers run in a separate mount
namespace and are pivot_root()ed into their rootfs. A separate mount
namespace is almost always used as it is the minimal isolation mechanism
services have. But usually they are even much more isolated up to the
point where they almost become indistinguishable from containers.
Mount propagation again plays a crucial role here. The rootfs of all
these services is a slave mount to the peer group of the host rootfs.
This is done so the service will receive mount propagation events from
the host when certain files or directories are updated.
In addition, the rootfs of each service, container, and sandbox is also
a shared mount in its separate peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/ / ext4 shared:24 master:1 71 47
For people not too familiar with mount propagation, the master:1 means
that this is a slave mount to peer group 1. Which as one can see is the
host rootfs as indicated by shared:1 above. The shared:24 indicates that
the service rootfs is a shared mount in a separate peer group with peer
group id 24.
A service may run other services. Such nested services will also have a
rootfs mount that is a slave to the peer group of the outer service
rootfs mount.
For containers things are just slighly different. A container's rootfs
isn't a slave to the service's or host rootfs' peer group. The rootfs
mount of a container is simply a shared mount in its own peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/home/ubuntu/debian-tree / ext4 shared:99 61 60
So whereas services are isolated OS components a container is treated
like a separate world and mount propagation into it is restricted to a
single well known mount that is a slave to the peer group of the shared
mount /run on the host:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID
/propagate/debian-tree /run/host/incoming tmpfs master:5 71 68
Here, the master:5 indicates that this mount is a slave to the peer
group with peer group id 5. This allows to propagate mounts into the
container and served as a workaround for not being able to insert mounts
into mount namespaces directly. But the new mount api does support
inserting mounts directly. For the interested reader the blogpost in [2]
might be worth reading where I explain the old and the new approach to
inserting mounts into mount namespaces.
Containers of course, can themselves be run as services. They often run
full systems themselves which means they again run services and
containers with the exact same propagation settings explained above.
The whole system is designed so that it can be easily updated, including
all services in various fine-grained ways without having to enter every
single service's mount namespace which would be prohibitively expensive.
The mount propagation layout has been carefully chosen so it is possible
to propagate updates for system extensions and configurations from the
host into all services.
The simplest model to update the whole system is to mount on top of
/usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
will then propagate into every service. This works cleanly the first
time. However, when the system is updated multiple times it becomes
necessary to unmount the first update on /opt, /usr, /etc and then
propagate the new update. But this means, there's an interval where the
old base system is accessible. This has to be avoided to protect against
downgrade attacks.
The vfs already exposes a mechanism to userspace whereby mounts can be
mounted beneath an existing mount. Such mounts are internally referred
to as "tucked". The patch series exposes the ability to mount beneath a
top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount()
system call. This allows userspace to seamlessly upgrade mounts. After
this series the only thing that will have changed is that mounting
beneath an existing mount can be done explicitly instead of just
implicitly.
Today, there are two scenarios where a mount can be mounted beneath an
existing mount instead of on top of it:
(1) When a service or container is started in a new mount namespace and
pivot_root()s into its new rootfs. The way this is done is by
mounting the new rootfs beneath the old rootfs:
fd_newroot = open("/var/lib/machines/fedora", ...);
fd_oldroot = open("/", ...);
fchdir(fd_newroot);
pivot_root(".", ".");
After the pivot_root(".", ".") call the new rootfs is mounted
beneath the old rootfs which can then be unmounted to reveal the
underlying mount:
fchdir(fd_oldroot);
umount2(".", MNT_DETACH);
Since pivot_root() moves the caller into a new rootfs no mounts must
be propagated out of the new rootfs as a consequence of the
pivot_root() call. Thus, the mounts cannot be shared.
(2) When a mount is propagated to a mount that already has another mount
mounted on the same dentry.
The easiest example for this is to create a new mount namespace. The
following commands will create a mount namespace where the rootfs
mount / will be a slave to the peer group of the host rootfs /
mount's peer group. IOW, it will receive propagation from the host:
mount --make-shared /
unshare --mount --propagation=slave
Now a new mount on the /mnt dentry in that mount namespace is
created. (As it can be confusing it should be spelled out that the
tmpfs mount on the /mnt dentry that was just created doesn't
propagate back to the host because the rootfs mount / of the mount
namespace isn't a peer of the host rootfs.):
mount -t tmpfs tmpfs /mnt
TARGET SOURCE FSTYPE PROPAGATION
└─/mnt tmpfs tmpfs
Now another terminal in the host mount namespace can observe that
the mount indeed hasn't propagated back to into the host mount
namespace. A new mount can now be created on top of the /mnt dentry
with the rootfs mount / as its parent:
mount --bind /opt /mnt
TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 shared:1
The mount namespace that was created earlier can now observe that
the bind mount created on the host has propagated into it:
TARGET SOURCE FSTYPE PROPAGATION
└─/mnt /dev/sda2[/opt] ext4 master:1
└─/mnt tmpfs tmpfs
But instead of having been mounted on top of the tmpfs mount at the
/mnt dentry the /opt mount has been mounted on top of the rootfs
mount at the /mnt dentry. And the tmpfs mount has been remounted on
top of the propagated /opt mount at the /opt dentry. So in other
words, the propagated mount has been mounted beneath the preexisting
mount in that mount namespace.
Mount namespaces make this easy to illustrate but it's also easy to
mount beneath an existing mount in the same mount namespace
(The following example assumes a shared rootfs mount / with peer
group id 1):
mount --bind /opt /opt
TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/opt] ext4 188 29 shared:1
If another mount is mounted on top of the /opt mount at the /opt
dentry:
mount --bind /tmp /opt
The following clunky mount tree will result:
TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION
└─/opt /dev/sda2[/tmp] ext4 405 29 shared:1
└─/opt /dev/sda2[/opt] ext4 188 405 shared:1
└─/opt /dev/sda2[/tmp] ext4 404 188 shared:1
The /tmp mount is mounted beneath the /opt mount and another copy is
mounted on top of the /opt mount. This happens because the rootfs /
and the /opt mount are shared mounts in the same peer group.
When the new /tmp mount is supposed to be mounted at the /opt dentry
then the /tmp mount first propagates to the root mount at the /opt
dentry. But there already is the /opt mount mounted at the /opt
dentry. So the old /opt mount at the /opt dentry will be mounted on
top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent
is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root
is /opt which is what shows up as /opt under SOURCE). So again, a
mount will be mounted beneath a preexisting mount.
(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a
shared rootfs is a good example of what could be referred to as
mount explosion.)
The main point is that such mounts allows userspace to umount a top
mount and reveal an underlying mount. So for example, umounting the
tmpfs mount on /mnt that was created in example (1) using mount
namespaces reveals the /opt mount which was mounted beneath it.
In (2) where a mount was mounted beneath the top mount in the same mount
namespace unmounting the top mount would unmount both the top mount and
the mount beneath. In the process the original mount would be remounted
on top of the rootfs mount / at the /opt dentry again.
This again, is a result of mount propagation only this time it's umount
propagation. However, this can be avoided by simply making the parent
mount / of the @opt mount a private or slave mount. Then the top mount
and the original mount can be unmounted to reveal the mount beneath.
These two examples are fairly arcane and are merely added to make it
clear how mount propagation has effects on current and future features.
More common use-cases will just be things like:
mount -t btrfs /dev/sdA /mnt
mount -t xfs /dev/sdB --beneath /mnt
umount /mnt
after which we'll have updated from a btrfs filesystem to a xfs
filesystem without ever revealing the underlying mountpoint.
The crux is that the proposed mechanism already exists and that it is so
powerful as to cover cases where mounts are supposed to be updated with
new versions. Crucially, it offers an important flexibility. Namely that
updates to a system may either be forced or can be delayed and the
umount of the top mount be left to a service if it is a cooperative one.
This adds a new flag to move_mount() that allows to explicitly move a
beneath the top mount adhering to the following semantics:
* Mounts cannot be mounted beneath the rootfs. This restriction
encompasses the rootfs but also chroots via chroot() and pivot_root().
To mount a mount beneath the rootfs or a chroot, pivot_root() can be
used as illustrated above.
* The source mount must be a private mount to force the kernel to
allocate a new, unused peer group id. This isn't a required
restriction but a voluntary one. It avoids repeating a semantical
quirk that already exists today. If bind mounts which already have a
peer group id are inserted into mount trees that have the same peer
group id this can cause a lot of mount propagation events to be
generated (For example, consider running mount --bind /opt /opt in a
loop where the parent mount is a shared mount.).
* Avoid getting rid of the top mount in the kernel. Cooperative services
need to be able to unmount the top mount themselves.
This also avoids a good deal of additional complexity. The umount
would have to be propagated which would be another rather expensive
operation. So namespace_lock() and lock_mount_hash() would potentially
have to be held for a long time for both a mount and umount
propagation. That should be avoided.
* The path to mount beneath must be mounted and attached.
* The top mount and its parent must be in the caller's mount namespace
and the caller must be able to mount in that mount namespace.
* The caller must be able to unmount the top mount to prove that they
could reveal the underlying mount.
* The propagation tree is calculated based on the destination mount's
parent mount and the destination mount's mountpoint on the parent
mount. Of course, if the parent of the destination mount and the
destination mount are shared mounts in the same peer group and the
mountpoint of the new mount to be mounted is a subdir of their
->mnt_root then both will receive a mount of /opt. That's probably
easier to understand with an example. Assuming a standard shared
rootfs /:
mount --bind /opt /opt
mount --bind /tmp /opt
will cause the same mount tree as:
mount --bind /opt /opt
mount --beneath /tmp /opt
because both / and /opt are shared mounts/peers in the same peer
group and the /opt dentry is a subdirectory of both the parent's and
the child's ->mnt_root. If a mount tree like that is created it almost
always is an accident or abuse of mount propagation. Realistically
what most people probably mean in this scenarios is:
mount --bind /opt /opt
mount --make-private /opt
mount --make-shared /opt
This forces the allocation of a new separate peer group for the /opt
mount. Aferwards a mount --bind or mount --beneath actually makes
sense as the / and /opt mount belong to different peer groups. Before
that it's likely just confusion about what the user wanted to achieve.
* Refuse MOVE_MOUNT_BENEATH if:
(1) the @mnt_from has been overmounted in between path resolution and
acquiring @namespace_sem when locking @mnt_to. This avoids the
proliferation of shadow mounts.
(2) if @to_mnt is moved to a different mountpoint while acquiring
@namespace_sem to lock @to_mnt.
(3) if @to_mnt is unmounted while acquiring @namespace_sem to lock
@to_mnt.
(4) if the parent of the target mount propagates to the target mount
at the same mountpoint.
This would mean mounting @mnt_from on @mnt_to->mnt_parent and then
propagating a copy @c of @mnt_from onto @mnt_to. This defeats the
whole purpose of mounting @mnt_from beneath @mnt_to.
(5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at
the same mountpoint.
If @mnt_to->mnt_parent propagates to @mnt_from this would mean
propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards
@mnt_from would be mounted on top of @mnt_to->mnt_parent and
@mnt_to would be unmounted from @mnt->mnt_parent and remounted on
@mnt_from. But since @c is already mounted on @mnt_from, @mnt_to
would ultimately be remounted on top of @c. Afterwards, @mnt_from
would be covered by a copy @c of @mnt_from and @c would be covered
by @mnt_from itself. This defeats the whole purpose of mounting
@mnt_from beneath @mnt_to.
Cases (1) to (3) are required as they deal with races that would cause
bugs or unexpected behavior for users. Cases (4) and (5) refuse
semantical quirks that would not be a bug but would cause weird mount
trees to be created. While they can already be created via other means
(mount --bind /opt /opt x n) there's no reason to repeat past mistakes
in new features.
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1]
Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2]
Link: https://github.com/flatcar/sysext-bakery
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
Link: https://github.com/systemd/systemd/pull/26013
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
||
|
|
6e76ac5958 |
io_uring: Add io_uring_setup flag to pre-register ring fd and never install it
With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything *except* the initial mmap. With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the user, rather than requiring a subsequent mmap. The combination of the two allows a user to operate *entirely* via a registered ring fd, making it unnecessary to ever install the fd in the first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make io_uring_setup register the fd and return a registered index, without installing the fd. This allows an application to avoid touching the fd table at all, and allows a library to never even momentarily install a file descriptor. This splits out an io_ring_add_registered_file helper from io_ring_add_registered_fd, for use by io_uring_setup. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
03d89a2de2 |
io_uring: support for user allocated memory for rings/sqes
Currently io_uring applications must call mmap(2) twice to map the rings themselves, and the sqes array. This works fine, but it does not support using huge pages to back the rings/sqes. Provide a way for the application to pass in pre-allocated memory for the rings/sqes, which can then suitably be allocated from shmfs or via mmap to get huge page support. Particularly for larger rings, this reduces the TLBs needed. If an application wishes to take advantage of that, it must pre-allocate the memory needed for the sq/cq ring, and the sqes. The former must be passed in via the io_uring_params->cq_off.user_data field, while the latter is passed in via the io_uring_params->sq_off.user_data field. Then it must set IORING_SETUP_NO_MMAP in the io_uring_params->flags field, and io_uring will then map the existing memory into the kernel for shared use. The application must not call mmap(2) to map rings as it otherwise would have, that will now fail with -EINVAL if this setup flag was used. The pages used for the rings and sqes must be contigious. The intent here is clearly that huge pages should be used, otherwise the normal setup procedure works fine as-is. The application may use one huge page for both the rings and sqes. Outside of those initialization changes, everything works like it did before. Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
a3b111b046 |
Merge tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- MD pull request via Song:
- Improve raid5 sequential IO performance on spinning disks, which
fixes a regression since v6.0 (Jan Kara)
- Fix bitmap offset types, which fixes an issue introduced in this
merge window (Jonathan Derrick)
- Cleanup of hweight type used for cgroup writeback (Maxim)
- Fix a regression with the "has_submit_bio" changes across partitions
(Ming)
- Cleanup of QUEUE_FLAG_ADD_RANDOM clearing.
We used to set this flag on queues non blk-mq queues, and hence some
drivers clear it unconditionally. Since all of these have since been
converted to true blk-mq drivers, drop the useless clear as the bit
is not set (Chaitanya)
- Fix the flags being set in a bio for a flush for drbd (Christoph)
- Cleanup and deduplication of the code handling setting block device
capacity (Damien)
- Fix for ublk handling IO timeouts (Ming)
- Fix for a regression in blk-cgroup teardown (Tao)
- NBD documentation and code fixes (Eric)
- Convert blk-integrity to using device_attributes rather than a second
kobject to manage lifetimes (Thomas)
* tag 'for-6.4/block-2023-05-06' of git://git.kernel.dk/linux:
ublk: add timeout handler
drbd: correctly submit flush bio on barrier
mailmap: add mailmap entries for Jens Axboe
block: Skip destroyed blkg when restart in blkg_destroy_all()
writeback: fix call of incorrect macro
md: Fix bitmap offset type in sb writer
md/raid5: Improve performance for sequential IO
docs nbd: userspace NBD now favors github over sourceforge
block nbd: use req.cookie instead of req.handle
uapi nbd: add cookie alias to handle
uapi nbd: improve doc links to userspace spec
blk-integrity: register sysfs attributes on struct device
blk-integrity: convert to struct device_attribute
blk-integrity: use sysfs_emit
block/drivers: remove dead clear of random flag
block: sync part's ->bd_has_submit_bio with disk's
block: Cleanup set_capacity()/bdev_set_nr_sectors()
|
||
|
|
7994beabfb |
Merge tag 'dmaengine-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine
Pull dmaengine updates from Vinod Koul:
"New support:
- Apple admac t8112 device support
- StarFive JH7110 DMA controller
Updates:
- Big pile of idxd updates to support IAA 2.0 device capabilities,
DSA 2.0 Event Log and completion record faulting features and
new DSA operations
- at_xdmac supend & resume updates and driver code cleanup
- k3-udma supend & resume support
- k3-psil thread support for J784s4"
* tag 'dmaengine-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine: (57 commits)
dmaengine: idxd: add per wq PRS disable
dmaengine: idxd: add pid to exported sysfs attribute for opened file
dmaengine: idxd: expose fault counters to sysfs
dmaengine: idxd: add a device to represent the file opened
dmaengine: idxd: add per file user counters for completion record faults
dmaengine: idxd: process batch descriptor completion record faults
dmaengine: idxd: add descs_completed field for completion record
dmaengine: idxd: process user page faults for completion record
dmaengine: idxd: add idxd_copy_cr() to copy user completion record during page fault handling
dmaengine: idxd: create kmem cache for event log fault items
dmaengine: idxd: add per DSA wq workqueue for processing cr faults
dmanegine: idxd: add debugfs for event log dump
dmaengine: idxd: add interrupt handling for event log
dmaengine: idxd: setup event log configuration
dmaengine: idxd: add event log size sysfs attribute
dmaengine: idxd: make misc interrupt one shot
dt-bindings: dma: snps,dw-axi-dmac: constrain the items of resets for JH7110 dma
dt-bindings: dma: Drop unneeded quotes
dmaengine: at_xdmac: align declaration of ret with the rest of variables
dmaengine: at_xdmac: add a warning message regarding for unpaused channels
...
|
||
|
|
c8c655c34e |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"s390:
- More phys_to_virt conversions
- Improvement of AP management for VSIE (nested virtualization)
ARM64:
- Numerous fixes for the pathological lock inversion issue that
plagued KVM/arm64 since... forever.
- New framework allowing SMCCC-compliant hypercalls to be forwarded
to userspace, hopefully paving the way for some more features being
moved to VMMs rather than be implemented in the kernel.
- Large rework of the timer code to allow a VM-wide offset to be
applied to both virtual and physical counters as well as a
per-timer, per-vcpu offset that complements the global one. This
last part allows the NV timer code to be implemented on top.
- A small set of fixes to make sure that we don't change anything
affecting the EL1&0 translation regime just after having having
taken an exception to EL2 until we have executed a DSB. This
ensures that speculative walks started in EL1&0 have completed.
- The usual selftest fixes and improvements.
x86:
- Optimize CR0.WP toggling by avoiding an MMU reload when TDP is
enabled, and by giving the guest control of CR0.WP when EPT is
enabled on VMX (VMX-only because SVM doesn't support per-bit
controls)
- Add CR0/CR4 helpers to query single bits, and clean up related code
where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long"
return as a bool
- Move AMD_PSFD to cpufeatures.h and purge KVM's definition
- Avoid unnecessary writes+flushes when the guest is only adding new
PTEs
- Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s
optimizations when emulating invalidations
- Clean up the range-based flushing APIs
- Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a
single A/D bit using a LOCK AND instead of XCHG, and skip all of
the "handle changed SPTE" overhead associated with writing the
entire entry
- Track the number of "tail" entries in a pte_list_desc to avoid
having to walk (potentially) all descriptors during insertion and
deletion, which gets quite expensive if the guest is spamming
fork()
- Disallow virtualizing legacy LBRs if architectural LBRs are
available, the two are mutually exclusive in hardware
- Disallow writes to immutable feature MSRs (notably
PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features
- Overhaul the vmx_pmu_caps selftest to better validate
PERF_CAPABILITIES
- Apply PMU filters to emulated events and add test coverage to the
pmu_event_filter selftest
- AMD SVM:
- Add support for virtual NMIs
- Fixes for edge cases related to virtual interrupts
- Intel AMX:
- Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if
XTILE_DATA is not being reported due to userspace not opting in
via prctl()
- Fix a bug in emulation of ENCLS in compatibility mode
- Allow emulation of NOP and PAUSE for L2
- AMX selftests improvements
- Misc cleanups
MIPS:
- Constify MIPS's internal callbacks (a leftover from the hardware
enabling rework that landed in 6.3)
Generic:
- Drop unnecessary casts from "void *" throughout kvm_main.c
- Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the
struct size by 8 bytes on 64-bit kernels by utilizing a padding
hole
Documentation:
- Fix goof introduced by the conversion to rST"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits)
KVM: s390: pci: fix virtual-physical confusion on module unload/load
KVM: s390: vsie: clarifications on setting the APCB
KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA
KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state
KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init()
KVM: selftests: Test the PMU event "Instructions retired"
KVM: selftests: Copy full counter values from guest in PMU event filter test
KVM: selftests: Use error codes to signal errors in PMU event filter test
KVM: selftests: Print detailed info in PMU event filter asserts
KVM: selftests: Add helpers for PMC asserts in PMU event filter test
KVM: selftests: Add a common helper for the PMU event filter guest code
KVM: selftests: Fix spelling mistake "perrmited" -> "permitted"
KVM: arm64: vhe: Drop extra isb() on guest exit
KVM: arm64: vhe: Synchronise with page table walker on MMU update
KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc()
KVM: arm64: nvhe: Synchronise with page table walker on TLBI
KVM: arm64: Handle 32bit CNTPCTSS traps
KVM: arm64: nvhe: Synchronise with page table walker on vcpu run
KVM: arm64: vgic: Don't acquire its_lock before config_lock
KVM: selftests: Add test to verify KVM's supported XCR0
...
|
||
|
|
7acc137211 |
Merge tag 'cxl-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull compute express link updates from Dan Williams:
"DOE support is promoted from drivers/cxl/ to drivers/pci/ with Bjorn's
blessing, and the CXL core continues to mature its media management
capabilities with support for listing and injecting media errors. Some
late fixes that missed v6.3-final are also included:
- Refactor the DOE infrastructure (Data Object Exchange
PCI-config-cycle mailbox) to be a facility of the PCI core rather
than the CXL core.
This is foundational for upcoming support for PCI
device-attestation and PCIe / CXL link encryption.
- Add support for retrieving and injecting poison for CXL memory
expanders.
This enabling uses trace-events to convey CXL media error records
to user tooling. It includes translation of device-local addresses
(DPA) to system physical addresses (SPA) and their corresponding
CXL region.
- Fixes for decoder enumeration that missed v6.3-final
- Miscellaneous fixups"
* tag 'cxl-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (38 commits)
cxl/test: Add mock test for set_timestamp
cxl/mbox: Update CMD_RC_TABLE
tools/testing/cxl: Require CONFIG_DEBUG_FS
tools/testing/cxl: Add a sysfs attr to test poison inject limits
tools/testing/cxl: Use injected poison for get poison list
tools/testing/cxl: Mock the Clear Poison mailbox command
tools/testing/cxl: Mock the Inject Poison mailbox command
cxl/mem: Add debugfs attributes for poison inject and clear
cxl/memdev: Trace inject and clear poison as cxl_poison events
cxl/memdev: Warn of poison inject or clear to a mapped region
cxl/memdev: Add support for the Clear Poison mailbox command
cxl/memdev: Add support for the Inject Poison mailbox command
tools/testing/cxl: Mock support for Get Poison List
cxl/trace: Add an HPA to cxl_poison trace events
cxl/region: Provide region info to the cxl_poison trace event
cxl/memdev: Add trigger_poison_list sysfs attribute
cxl/trace: Add TRACE support for CXL media-error records
cxl/mbox: Add GET_POISON_LIST mailbox command
cxl/mbox: Initialize the poison state
cxl/mbox: Restrict poison cmds to debugfs cxl_raw_allow_all
...
|
||
|
|
4e1c80ae5c |
Merge tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever: "The big ticket item for this release is that support for RPC-with-TLS [RFC 9289] has been added to the Linux NFS server. The goal is to provide a simple-to-deploy, low-overhead in-transit confidentiality and peer authentication mechanism. It can supplement NFS Kerberos and it can protect the use of legacy non-cryptographic user authentication flavors such as AUTH_SYS. The TLS Record protocol is handled entirely by kTLS, meaning it can use either software encryption or offload encryption to smart NICs. Aside from that, work continues on improving NFSD's open file cache. Among the many clean-ups in that area is a patch to convert the rhashtable to use the list-hashing version of that data structure" * tag 'nfsd-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits) NFSD: Handle new xprtsec= export option SUNRPC: Support TLS handshake in the server-side TCP socket code NFSD: Clean up xattr memory allocation flags NFSD: Fix problem of COMMIT and NFS4ERR_DELAY in infinite loop SUNRPC: Clear rq_xid when receiving a new RPC Call SUNRPC: Recognize control messages in server-side TCP socket code SUNRPC: Be even lazier about releasing pages SUNRPC: Convert svc_xprt_release() to the release_pages() API SUNRPC: Relocate svc_free_res_pages() nfsd: simplify the delayed disposal list code SUNRPC: Ignore return value of ->xpo_sendto SUNRPC: Ensure server-side sockets have a sock->file NFSD: Watch for rq_pages bounds checking errors in nfsd_splice_actor() sunrpc: simplify two-level sysctl registration for svcrdma_parm_table SUNRPC: return proper error from get_expiry() lockd: add some client-side tracepoints nfs: move nfs_fhandle_hash to common include file lockd: server should unlock lock if client rejects the grant lockd: fix races in client GRANTED_MSG wait logic lockd: move struct nlm_wait to lockd.h ... |
||
|
|
d579c468d7 |
Merge tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing updates from Steven Rostedt: - User events are finally ready! After lots of collaboration between various parties, we finally locked down on a stable interface for user events that can also work with user space only tracing. This is implemented by telling the kernel (or user space library, but that part is user space only and not part of this patch set), where the variable is that the application uses to know if something is listening to the trace. There's also an interface to tell the kernel about these events, which will show up in the /sys/kernel/tracing/events/user_events/ directory, where it can be enabled. When it's enabled, the kernel will update the variable, to tell the application to start writing to the kernel. See https://lwn.net/Articles/927595/ - Cleaned up the direct trampolines code to simplify arm64 addition of direct trampolines. Direct trampolines use the ftrace interface but instead of jumping to the ftrace trampoline, applications (mostly BPF) can register their own trampoline for performance reasons. - Some updates to the fprobe infrastructure. fprobes are more efficient than kprobes, as it does not need to save all the registers that kprobes on ftrace do. More work needs to be done before the fprobes will be exposed as dynamic events. - More updates to references to the obsolete path of /sys/kernel/debug/tracing for the new /sys/kernel/tracing path. - Add a seq_buf_do_printk() helper to seq_bufs, to print a large buffer line by line instead of all at once. There are users in production kernels that have a large data dump that originally used printk() directly, but the data dump was larger than what printk() allowed as a single print. Using seq_buf() to do the printing fixes that. - Add /sys/kernel/tracing/touched_functions that shows all functions that was every traced by ftrace or a direct trampoline. This is used for debugging issues where a traced function could have caused a crash by a bpf program or live patching. - Add a "fields" option that is similar to "raw" but outputs the fields of the events. It's easier to read by humans. - Some minor fixes and clean ups. * tag 'trace-v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (41 commits) ring-buffer: Sync IRQ works before buffer destruction tracing: Add missing spaces in trace_print_hex_seq() ring-buffer: Ensure proper resetting of atomic variables in ring_buffer_reset_online_cpus recordmcount: Fix memory leaks in the uwrite function tracing/user_events: Limit max fault-in attempts tracing/user_events: Prevent same address and bit per process tracing/user_events: Ensure bit is cleared on unregister tracing/user_events: Ensure write index cannot be negative seq_buf: Add seq_buf_do_printk() helper tracing: Fix print_fields() for __dyn_loc/__rel_loc tracing/user_events: Set event filter_type from type ring-buffer: Clearly check null ptr returned by rb_set_head_page() tracing: Unbreak user events tracing/user_events: Use print_format_fields() for trace output tracing/user_events: Align structs with tabs for readability tracing/user_events: Limit global user_event count tracing/user_events: Charge event allocs to cgroups tracing/user_events: Update documentation for ABI tracing/user_events: Use write ABI in example tracing/user_events: Add ABI self-test ... |
||
|
|
33afd4b763 |
Merge tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton: "Mainly singleton patches all over the place. Series of note are: - updates to scripts/gdb from Glenn Washburn - kexec cleanups from Bjorn Helgaas" * tag 'mm-nonmm-stable-2023-04-27-16-01' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (50 commits) mailmap: add entries for Paul Mackerras libgcc: add forward declarations for generic library routines mailmap: add entry for Oleksandr ocfs2: reduce ioctl stack usage fs/proc: add Kthread flag to /proc/$pid/status ia64: fix an addr to taddr in huge_pte_offset() checkpatch: introduce proper bindings license check epoll: rename global epmutex scripts/gdb: add GDB convenience functions $lx_dentry_name() and $lx_i_dentry() scripts/gdb: create linux/vfs.py for VFS related GDB helpers uapi/linux/const.h: prefer ISO-friendly __typeof__ delayacct: track delays from IRQ/SOFTIRQ scripts/gdb: timerlist: convert int chunks to str scripts/gdb: print interrupts scripts/gdb: raise error with reduced debugging information scripts/gdb: add a Radix Tree Parser lib/rbtree: use '+' instead of '|' for setting color. proc/stat: remove arch_idle_time() checkpatch: check for misuse of the link tags checkpatch: allow Closes tags with links ... |
||
|
|
7fa8a8ee94 |
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
|
||
|
|
2686eb845d |
uapi nbd: add cookie alias to handle
The uapi <linux/nbd.h> header declares a 'char handle[8]' per request; which is overloaded in English (are you referring to "handle" the verb, such as handling a signal or writing a callback handler, or "handle" the noun, the value used in a lookup table to correlate a response back to the request). Many user-space NBD implementations (both servers and clients) have instead used 'uint64_t cookie' or similar, as it is easier to directly assign an integer than to futz around with memcpy. In fact, upstream documentation is now encouraging this shift in terminology: https://github.com/NetworkBlockDevice/nbd/commit/ca4392eb2b Accomplish this by use of an anonymous union to provide the alias for anyone getting the definition from the uapi; this does not break existing clients, while exposing the nicer name for those who prefer it. Note that block/nbd.c still uses the term handle (in fact, it actually combines a 32-bit cookie and a 32-bit tag into the 64-bit handle), but that internal usage is not changed by the public uapi, since no compliant NBD server has any reason to inspect or alter the 64 bits sent over the socket. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230410180611.1051618-3-eblake@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
daf376a366 |
uapi nbd: improve doc links to userspace spec
The uapi <linux/nbd.h> header intentionally documents only the NBD server features that the kernel module will utilize as a client. But while it already had one mention of skipped bits due to userspace extensions, it did not actually direct the reader to the canonical source to learn about those extensions. While touching comments, fix an outdated reference that listed only READ and WRITE as commands. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20230410180611.1051618-2-eblake@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> |
||
|
|
8ccd54fe45 |
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin: "virtio,vhost,vdpa: features, fixes, and cleanups: - reduction in interrupt rate in virtio - perf improvement for VDUSE - scalability for vhost-scsi - non power of 2 ring support for packed rings - better management for mlx5 vdpa - suspend for snet - VIRTIO_F_NOTIFICATION_DATA - shared backend with vdpa-sim-blk - user VA support in vdpa-sim - better struct packing for virtio and fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (52 commits) vhost_vdpa: fix unmap process in no-batch mode MAINTAINERS: make me a reviewer of VIRTIO CORE AND NET DRIVERS tools/virtio: fix build caused by virtio_ring changes virtio_ring: add a struct device forward declaration vdpa_sim_blk: support shared backend vdpa_sim: move buffer allocation in the devices vdpa/snet: use likely/unlikely macros in hot functions vdpa/snet: implement kick_vq_with_data callback virtio-vdpa: add VIRTIO_F_NOTIFICATION_DATA feature support virtio: add VIRTIO_F_NOTIFICATION_DATA feature support vdpa/snet: support the suspend vDPA callback vdpa/snet: support getting and setting VQ state MAINTAINERS: add vringh.h to Virtio Core and Net Drivers vringh: address kdoc warnings vdpa: address kdoc warnings virtio_ring: don't update event idx on get_buf vdpa_sim: add support for user VA vdpa_sim: replace the spinlock with a mutex to protect the state vdpa_sim: use kthread worker vdpa_sim: make devices agnostic for work management ... |
||
|
|
9280c57743 |
NFSD: Handle new xprtsec= export option
Enable administrators to require clients to use transport layer security when accessing particular exports. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> |
||
|
|
cec24b8b6b |
Merge tag 'char-misc-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc drivers updates from Greg KH: "Here is the "big" set of char/misc and other driver subsystems for 6.4-rc1. It's pretty big, but due to the removal of pcmcia drivers, almost breaks even for number of lines added vs. removed, a nice change. Included in here are: - removal of unused PCMCIA drivers (finally!) - Interconnect driver updates and additions - Lots of IIO driver updates and additions - MHI driver updates - Coresight driver updates - NVMEM driver updates, which required some OF updates - W1 driver updates and a new maintainer to manage the subsystem - FPGA driver updates - New driver subsystem, CDX, for AMD systems - lots of other small driver updates and additions All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (196 commits) mcb-lpc: Reallocate memory region to avoid memory overlapping mcb-pci: Reallocate memory region to avoid memory overlapping mcb: Return actual parsed size when reading chameleon table kernel/configs: Drop Android config fragments virt: acrn: Replace obsolete memalign() with posix_memalign() spmi: Add a check for remove callback when removing a SPMI driver spmi: fix W=1 kernel-doc warnings spmi: mtk-pmif: Drop of_match_ptr for ID table spmi: pmic-arb: Convert to platform remove callback returning void spmi: mtk-pmif: Convert to platform remove callback returning void spmi: hisi-spmi-controller: Convert to platform remove callback returning void w1: gpio: remove unnecessary ENOMEM messages w1: omap-hdq: remove unnecessary ENOMEM messages w1: omap-hdq: add SPDX tag w1: omap-hdq: allow compile testing w1: matrox: remove unnecessary ENOMEM messages w1: matrox: use inline over __inline__ w1: matrox: switch from asm to linux header w1: ds2482: do not use assignment in if condition w1: ds2482: drop unnecessary header ... |
||
|
|
b39667abcd |
Merge tag 'tty-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty / serial updates from Greg KH: "Here is the big set of tty/serial driver updates for 6.4-rc1. Nothing major, just lots of tiny, constant, forward development. This includes: - obligatory n_gsm updates and feature additions - 8250_em driver updates - sh-sci driver updates - dts cleanups and updates - general cleanups and improvements by Ilpo and Jiri - other small serial driver core fixes and driver updates All of these have been in linux-next for a while with no reported problems" * tag 'tty-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (87 commits) n_gsm: Use array_index_nospec() with index that comes from userspace tty: vt: drop checks for undefined VT_SINGLE_DRIVER tty: vt: distribute EXPORT_SYMBOL() tty: vt: simplify some cases in tioclinux() tty: vt: reformat tioclinux() tty: serial: sh-sci: Fix end of transmission on SCI tty: serial: sh-sci: Add support for tx end interrupt handling tty: serial: sh-sci: Fix TE setting on SCI IP tty: serial: sh-sci: Add RZ/G2L SCIFA DMA rx support tty: serial: sh-sci: Add RZ/G2L SCIFA DMA tx support serial: max310x: fix IO data corruption in batched operations serial: core: Disable uart_start() on uart_remove_one_port() serial: 8250: Reinit port->pm on port specific driver unbind serial: 8250: Add missing wakeup event reporting tty: serial: fsl_lpuart: use UARTMODIR register bits for lpuart32 platform tty: serial: fsl_lpuart: adjust buffer length to the intended size serial: fix TIOCSRS485 locking serial: make SiFive serial drivers depend on ARCH_ symbols tty: synclink_gt: don't allocate and pass dummy flags tty: serial: simplify qcom_geni_serial_send_chunk_fifo() ... |
||
|
|
6e98b09da9 |
Merge tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Introduce a config option to tweak MAX_SKB_FRAGS. Increasing the
default value allows for better BIG TCP performances
- Reduce compound page head access for zero-copy data transfers
- RPS/RFS improvements, avoiding unneeded NET_RX_SOFTIRQ when
possible
- Threaded NAPI improvements, adding defer skb free support and
unneeded softirq avoidance
- Address dst_entry reference count scalability issues, via false
sharing avoidance and optimize refcount tracking
- Add lockless accesses annotation to sk_err[_soft]
- Optimize again the skb struct layout
- Extends the skb drop reasons to make it usable by multiple
subsystems
- Better const qualifier awareness for socket casts
BPF:
- Add skb and XDP typed dynptrs which allow BPF programs for more
ergonomic and less brittle iteration through data and
variable-sized accesses
- Add a new BPF netfilter program type and minimal support to hook
BPF programs to netfilter hooks such as prerouting or forward
- Add more precise memory usage reporting for all BPF map types
- Adds support for using {FOU,GUE} encap with an ipip device
operating in collect_md mode and add a set of BPF kfuncs for
controlling encap params
- Allow BPF programs to detect at load time whether a particular
kfunc exists or not, and also add support for this in light
skeleton
- Bigger batch of BPF verifier improvements to prepare for upcoming
BPF open-coded iterators allowing for less restrictive looping
capabilities
- Rework RCU enforcement in the verifier, add kptr_rcu and enforce
BPF programs to NULL-check before passing such pointers into kfunc
- Add support for kptrs in percpu hashmaps, percpu LRU hashmaps and
in local storage maps
- Enable RCU semantics for task BPF kptrs and allow referenced kptr
tasks to be stored in BPF maps
- Add support for refcounted local kptrs to the verifier for allowing
shared ownership, useful for adding a node to both the BPF list and
rbtree
- Add BPF verifier support for ST instructions in
convert_ctx_access() which will help new -mcpu=v4 clang flag to
start emitting them
- Add ARM32 USDT support to libbpf
- Improve bpftool's visual program dump which produces the control
flow graph in a DOT format by adding C source inline annotations
Protocols:
- IPv4: Allow adding to IPv4 address a 'protocol' tag. Such value
indicates the provenance of the IP address
- IPv6: optimize route lookup, dropping unneeded R/W lock acquisition
- Add the handshake upcall mechanism, allowing the user-space to
implement generic TLS handshake on kernel's behalf
- Bridge: support per-{Port, VLAN} neighbor suppression, increasing
resilience to nodes failures
- SCTP: add support for Fair Capacity and Weighted Fair Queueing
schedulers
- MPTCP: delay first subflow allocation up to its first usage. This
will allow for later better LSM interaction
- xfrm: Remove inner/outer modes from input/output path. These are
not needed anymore
- WiFi:
- reduced neighbor report (RNR) handling for AP mode
- HW timestamping support
- support for randomized auth/deauth TA for PASN privacy
- per-link debugfs for multi-link
- TC offload support for mac80211 drivers
- mac80211 mesh fast-xmit and fast-rx support
- enable Wi-Fi 7 (EHT) mesh support
Netfilter:
- Add nf_tables 'brouting' support, to force a packet to be routed
instead of being bridged
- Update bridge netfilter and ovs conntrack helpers to handle IPv6
Jumbo packets properly, i.e. fetch the packet length from
hop-by-hop extension header. This is needed for BIT TCP support
- The iptables 32bit compat interface isn't compiled in by default
anymore
- Move ip(6)tables builtin icmp matches to the udptcp one. This has
the advantage that icmp/icmpv6 match doesn't load the
iptables/ip6tables modules anymore when iptables-nft is used
- Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device
Driver API:
- Remove redundant Device Control Error Reporting Enable, as PCI core
has already error reporting enabled at enumeration time
- Move Multicast DB netlink handlers to core, allowing devices other
then bridge to use them
- Allow the page_pool to directly recycle the pages from safely
localized NAPI
- Implement lockless TX queue stop/wake combo macros, allowing for
further code de-duplication and sanitization
- Add YNL support for user headers and struct attrs
- Add partial YNL specification for devlink
- Add partial YNL specification for ethtool
- Add tc-mqprio and tc-taprio support for preemptible traffic classes
- Add tx push buf len param to ethtool, specifies the maximum number
of bytes of a transmitted packet a driver can push directly to the
underlying device
- Add basic LED support for switch/phy
- Add NAPI documentation, stop relaying on external links
- Convert dsa_master_ioctl() to netdev notifier. This is a
preparatory work to make the hardware timestamping layer selectable
by user space
- Add transceiver support and improve the error messages for CAN-FD
controllers
New hardware / drivers:
- Ethernet:
- AMD/Pensando core device support
- MediaTek MT7981 SoC
- MediaTek MT7988 SoC
- Broadcom BCM53134 embedded switch
- Texas Instruments CPSW9G ethernet switch
- Qualcomm EMAC3 DWMAC ethernet
- StarFive JH7110 SoC
- NXP CBTX ethernet PHY
- WiFi:
- Apple M1 Pro/Max devices
- RealTek rtl8710bu/rtl8188gu
- RealTek rtl8822bs, rtl8822cs and rtl8821cs SDIO chipset
- Bluetooth:
- Realtek RTL8821CS, RTL8851B, RTL8852BS
- Mediatek MT7663, MT7922
- NXP w8997
- Actions Semi ATS2851
- QTI WCN6855
- Marvell 88W8997
- Can:
- STMicroelectronics bxcan stm32f429
Drivers:
- Ethernet NICs:
- Intel (1G, icg):
- add tracking and reporting of QBV config errors
- add support for configuring max SDU for each Tx queue
- Intel (100G, ice):
- refactor mailbox overflow detection to support Scalable IOV
- GNSS interface optimization
- Intel (i40e):
- support XDP multi-buffer
- nVidia/Mellanox:
- add the support for linux bridge multicast offload
- enable TC offload for egress and engress MACVLAN over bond
- add support for VxLAN GBP encap/decap flows offload
- extend packet offload to fully support libreswan
- support tunnel mode in mlx5 IPsec packet offload
- extend XDP multi-buffer support
- support MACsec VLAN offload
- add support for dynamic msix vectors allocation
- drop RX page_cache and fully use page_pool
- implement thermal zone to report NIC temperature
- Netronome/Corigine:
- add support for multi-zone conntrack offload
- Solarflare/Xilinx:
- support offloading TC VLAN push/pop actions to the MAE
- support TC decap rules
- support unicast PTP
- Other NICs:
- Broadcom (bnxt): enforce software based freq adjustments only on
shared PHC NIC
- RealTek (r8169): refactor to addess ASPM issues during NAPI poll
- Micrel (lan8841): add support for PTP_PF_PEROUT
- Cadence (macb): enable PTP unicast
- Engleder (tsnep): add XDP socket zero-copy support
- virtio-net: implement exact header length guest feature
- veth: add page_pool support for page recycling
- vxlan: add MDB data path support
- gve: add XDP support for GQI-QPL format
- geneve: accept every ethertype
- macvlan: allow some packets to bypass broadcast queue
- mana: add support for jumbo frame
- Ethernet high-speed switches:
- Microchip (sparx5): Add support for TC flower templates
- Ethernet embedded switches:
- Broadcom (b54):
- configure 6318 and 63268 RGMII ports
- Marvell (mv88e6xxx):
- faster C45 bus scan
- Microchip:
- lan966x:
- add support for IS1 VCAP
- better TX/RX from/to CPU performances
- ksz9477: add ETS Qdisc support
- ksz8: enhance static MAC table operations and error handling
- sama7g5: add PTP capability
- NXP (ocelot):
- add support for external ports
- add support for preemptible traffic classes
- Texas Instruments:
- add CPSWxG SGMII support for J7200 and J721E
- Intel WiFi (iwlwifi):
- preparation for Wi-Fi 7 EHT and multi-link support
- EHT (Wi-Fi 7) sniffer support
- hardware timestamping support for some devices/firwmares
- TX beacon protection on newer hardware
- Qualcomm 802.11ax WiFi (ath11k):
- MU-MIMO parameters support
- ack signal support for management packets
- RealTek WiFi (rtw88):
- SDIO bus support
- better support for some SDIO devices (e.g. MAC address from
efuse)
- RealTek WiFi (rtw89):
- HW scan support for 8852b
- better support for 6 GHz scanning
- support for various newer firmware APIs
- framework firmware backwards compatibility
- MediaTek WiFi (mt76):
- P2P support
- mesh A-MSDU support
- EHT (Wi-Fi 7) support
- coredump support"
* tag 'net-next-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2078 commits)
net: phy: hide the PHYLIB_LEDS knob
net: phy: marvell-88x2222: remove unnecessary (void*) conversions
tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.
net: amd: Fix link leak when verifying config failed
net: phy: marvell: Fix inconsistent indenting in led_blink_set
lan966x: Don't use xdp_frame when action is XDP_TX
tsnep: Add XDP socket zero-copy TX support
tsnep: Add XDP socket zero-copy RX support
tsnep: Move skb receive action to separate function
tsnep: Add functions for queue enable/disable
tsnep: Rework TX/RX queue initialization
tsnep: Replace modulo operation with mask
net: phy: dp83867: Add led_brightness_set support
net: phy: Fix reading LED reg property
drivers: nfc: nfcsim: remove return value check of `dev_dir`
net: phy: dp83867: Remove unnecessary (void*) conversions
net: ethtool: coalesce: try to make user settings stick twice
net: mana: Check if netdev/napi_alloc_frag returns single page
net: mana: Rename mana_refill_rxoob and remove some empty lines
net: veth: add page_pool stats
...
|
||
|
|
b68ee1c613 |
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (megaraid_sas, scsi_debug, lpfc, target, mpi3mr, hisi_sas, arcmsr). The major core change is the constification of the host templates (which touches everything) along with other minor fixups and clean ups" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (207 commits) scsi: ufs: mcq: Use pointer arithmetic in ufshcd_send_command() scsi: ufs: mcq: Annotate ufshcd_inc_sq_tail() appropriately scsi: cxlflash: s/semahpore/semaphore/ scsi: lpfc: Silence an incorrect device output scsi: mpi3mr: Use IRQ save variants of spinlock to protect chain frame allocation scsi: scsi_debug: Fix missing error code in scsi_debug_init() scsi: hisi_sas: Work around build failure in suspend function scsi: lpfc: Fix ioremap issues in lpfc_sli4_pci_mem_setup() scsi: mpt3sas: Fix an issue when driver is being removed scsi: mpt3sas: Remove HBA BIOS version in the kernel log scsi: target: core: Fix invalid memory access scsi: scsi_debug: Drop sdebug_queue scsi: scsi_debug: Only allow sdebug_max_queue be modified when no shosts scsi: scsi_debug: Use scsi_host_busy() in delay_store() and ndelay_store() scsi: scsi_debug: Use blk_mq_tagset_busy_iter() in stop_all_queued() scsi: scsi_debug: Use blk_mq_tagset_busy_iter() in sdebug_blk_mq_poll() scsi: scsi_debug: Dynamically allocate sdebug_queued_cmd scsi: scsi_debug: Use scsi_block_requests() to block queues scsi: scsi_debug: Protect block_unblock_all_queues() with mutex scsi: scsi_debug: Change shost list lock to a mutex ... |
||
|
|
36006b1d5c |
Merge tag 'ata-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ata updates from Damien Le Moal: - Many cleanups of the pata_parport driver and of its protocol modules (Ondrej) - Remove unused code (ata_id_xxx() functions) (Sergey) - Add Add UniPhier SATA controller DT bindings (Kunihiko) - Fix dependencies for the Freescale QorIQ AHCI SATA controller driver (Geert) - DT property handling improvements (Rob) * tag 'ata-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata: (57 commits) ata: pata_parport-bpck6: Declare mode_map as static ata: pata_parport-bpck6: Remove dependency on 64BIT ata: pata_parport-bpck6: reduce indents in bpck6_open ata: pata_parport-bpck6: delete ppc6lnx.c ata: pata_parport-bpck6: move defines and mode_map to bpck6.c ata: pata_parport-bpck6: move ppc6_wr_data_byte to bpck6.c and rename ata: pata_parport-bpck6: move ppc6_rd_data_byte to bpck6.c and rename ata: pata_parport-bpck6: move ppc6_send_cmd to bpck6.c and rename ata: pata_parport-bpck6: move ppc6_deselect to bpck6.c and rename ata: pata_parport-bpck6: merge ppc6_select into bpck6_open ata: pata_parport-bpck6: move ppc6_open to bpck6.c and rename ata: pata_parport-bpck6: move ppc6_wr_extout to bpck6.c and rename ata: pata_parport-bpck6: move ppc6_wait_for_fifo to bpck6.c and rename ata: pata_parport-bpck6: merge ppc6_wr_data_blk into bpck6_write_block ata: pata_parport-bpck6: merge ppc6_rd_data_blk into bpck6_read_block ata: pata_parport-bpck6: merge ppc6_wr_port16_blk into bpck6_write_block ata: pata_parport-bpck6: merge ppc6_rd_port16_blk into bpck6_read_block ata: pata_parport-bpck6: merge ppc6_wr_port into bpck6_write_regr ata: pata_parport-bpck6: merge ppc6_rd_port into bpck6_read_regr ata: pata_parport-bpck6: remove ppc6_close ... |
||
|
|
48dc810012 |
Merge tag 'for-6.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer: - Split dm-bufio's rw_semaphore and rbtree. Offers improvements to dm-bufio's locking to allow increased concurrent IO -- particularly for read access for buffers already in dm-bufio's cache. - Also split dm-bio-prison-v1's spinlock and rbtree with comparable aim at improving concurrent IO (for the DM thinp target). - Both the dm-bufio and dm-bio-prison-v1 scaling of the number of locks and rbtrees used are managed by dm_num_hash_locks(). And the hash function used by both is dm_hash_locks_index(). - Allow DM targets to require DISCARD, WRITE_ZEROES and SECURE_ERASE to be split at the target specified boundary (in terms of max_discard_sectors, max_write_zeroes_sectors and max_secure_erase_sectors respectively). - DM verity error handling fix for check_at_most_once on FEC. - Update DM verity target to emit audit events on verification failure and more. - DM core ->io_hints improvements needed in support of new discard support that is added to the DM "zero" and "error" targets. - Fix missing kmem_cache_destroy() call in initialization error path of both the DM integrity and DM clone targets. - A couple fixes for DM flakey, also add "error_reads" feature. - Fix DM core's resume to not lock FS when the DM map is NULL; otherwise initial table load can race with FS mount that takes superblock's ->s_umount rw_semaphore. - Various small improvements to both DM core and DM targets. * tag 'for-6.4/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (40 commits) dm: don't lock fs when the map is NULL in process of resume dm flakey: add an "error_reads" option dm flakey: remove trailing space in the table line dm flakey: fix a crash with invalid table line dm ioctl: fix nested locking in table_clear() to remove deadlock concern dm: unexport dm_get_queue_limits() dm: allow targets to require splitting WRITE_ZEROES and SECURE_ERASE dm: add helper macro for simple DM target module init and exit dm raid: remove unused d variable dm: remove unnecessary (void*) conversions dm mirror: add DMERR message if alloc_workqueue fails dm: push error reporting down to dm_register_target() dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path dm clone: call kmem_cache_destroy() in dm_clone_init() error path dm error: add discard support dm zero: add discard support dm table: allow targets without devices to set ->io_hints dm verity: emit audit events on verification failure and more dm verity: fix error handling for check_at_most_once on FEC dm: improve hash_locks sizing and hash function ... |