Commit 9251e3e9 authored by Linus Torvalds

Merge tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "21 hotfixes. 13 are cc:stable. 13 are MM and 8 are non-MM.

  No particular theme here - mainly singletons, a couple of doubletons.
  Please see the changelogs"

* tag 'mm-hotfixes-stable-2024-10-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm: avoid unconditional one-tick sleep when swapcache_prepare fails
  mseal: update mseal.rst
  mm: split critical region in remap_file_pages() and invoke LSMs in between
  selftests/mm: fix deadlock for fork after pthread_create with atomic_bool
  Revert "selftests/mm: replace atomic_bool with pthread_barrier_t"
  Revert "selftests/mm: fix deadlock for fork after pthread_create on ARM"
  tools: testing: add expand-only mode VMA test
  mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()
  resource,kexec: walk_system_ram_res_rev must retain resource flags
  nilfs2: fix kernel bug due to missing clearing of checked flag
  mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node id
  ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
  mm: shmem: fix data-race in shmem_getattr()
  mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAIL
  x86/traps: move kmsan check after instrumentation_begin
  resource: remove dependency on SPARSEMEM from GET_FREE_REGION
  mm/mmap: fix race in mmap_region() with ftruncate()
  mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reserves
  fork: only invoke khugepaged, ksm hooks if no error
  fork: do not invoke uffd on fork if error occurs
  ...
parents d5b2ee0f 01626a18
+148 −159
@@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].

User API
========
mseal()
-----------
The mseal() syscall has the following signature:

``int mseal(void *addr, size_t len, unsigned long flags)``

**addr/len**: virtual memory address range.

The address range set by ``addr``/``len`` must meet:
SYSCALL
=======
mseal syscall signature
-----------------------
   ``int mseal(void \* addr, size_t len, unsigned long flags)``

   **addr**/**len**: virtual memory address range.
      The address range set by **addr**/**len** must meet:
         - The start address must be in an allocated VMA.
         - The start address must be page aligned.
   - The end address (``addr`` + ``len``) must be in an allocated VMA.
         - The end address (**addr** + **len**) must be in an allocated VMA.
         - no gap (unallocated memory) between start and end address.

      The ``len`` will be page aligned implicitly by the kernel.

   **flags**: reserved for future use.

**return values**:

- ``0``: Success.

- ``-EINVAL``:
    - Invalid input ``flags``.
    - The start address (``addr``) is not page aligned.
    - Address range (``addr`` + ``len``) overflow.

- ``-ENOMEM``:
    - The start address (``addr``) is not allocated.
    - The end address (``addr`` + ``len``) is not allocated.
    - A gap (unallocated memory) between start and end address.

- ``-EPERM``:
    - sealing is supported only on 64-bit CPUs, 32-bit is not supported.

   **Return values**:
      - **0**: Success.
      - **-EINVAL**:
         * Invalid input ``flags``.
         * The start address (``addr``) is not page aligned.
         * Address range (``addr`` + ``len``) overflow.
      - **-ENOMEM**:
         * The start address (``addr``) is not allocated.
         * The end address (``addr`` + ``len``) is not allocated.
         * A gap (unallocated memory) between start and end address.
      - **-EPERM**:
         * sealing is supported only on 64-bit CPUs, 32-bit is not supported.
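
The documented ``-EINVAL`` conditions can be checked in userspace before the call is made. The following is a minimal sketch, not the kernel's implementation: ``mseal_args_precheck`` and its ``page_size`` parameter are hypothetical names introduced here, and the helper covers only the three ``-EINVAL`` cases listed above (it cannot detect the ``-ENOMEM`` cases, which depend on the process's VMAs).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (an assumption, not a kernel API).
 * It mirrors only the documented -EINVAL conditions of mseal().
 */
int mseal_args_precheck(uintptr_t addr, size_t len, unsigned long flags,
			size_t page_size)
{
	/* page_size is assumed to be a non-zero power of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	/* flags is reserved for future use, so anything non-zero is invalid. */
	if (flags != 0)
		return -EINVAL;

	/* The start address must be page aligned. */
	if (addr & (page_size - 1))
		return -EINVAL;

	/* addr + len must not wrap around the address space. */
	if (len > UINTPTR_MAX - addr)
		return -EINVAL;

	return 0;
}
```

A caller would typically pass ``sysconf(_SC_PAGESIZE)`` as ``page_size`` and skip the syscall when the helper returns a negative value.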

   **Note about error return**:
      - For above error cases, users can expect the given memory range is
        unmodified, i.e. no partial update.

      - There might be other internal errors/cases not listed here, e.g.
  error during merging/splitting VMAs, or the process reaching the max
        error during merging/splitting VMAs, or the process reaching the maximum
        number of supported VMAs. In those cases, partial updates to the given
        memory range could happen. However, those cases should be rare.

**Blocked operations after sealing**:
    Unmapping, moving to another location, and shrinking the size,
    via munmap() and mremap(), can leave an empty space, therefore
    can be replaced with a VMA with a new set of attributes.

    Moving or expanding a different VMA into the current location,
    via mremap().

    Modifying a VMA via mmap(MAP_FIXED).

    Size expansion, via mremap(), does not appear to pose any
    specific risks to sealed VMAs. It is included anyway because
    the use case is unclear. In any case, users can rely on
    merging to expand a sealed VMA.

    mprotect() and pkey_mprotect().

    Some destructive madvise() behaviors (e.g. MADV_DONTNEED)
    for anonymous memory, when users don't have write permission to the
    memory. Those behaviors can alter region contents by discarding pages,
    effectively a memset(0) for anonymous memory.

    Kernel will return -EPERM for blocked operations.

    For blocked operations, one can expect the given address is unmodified,
    i.e. no partial update. Note, this is different from existing mm
    system call behaviors, where partial updates are made till an error is
    found and returned to userspace. To give an example:

    Assume following code sequence:

    - ptr = mmap(null, 8192, PROT_NONE);
    - munmap(ptr + 4096, 4096);
    - ret1 = mprotect(ptr, 8192, PROT_READ);
    - mseal(ptr, 4096);
    - ret2 = mprotect(ptr, 8192, PROT_NONE);

    ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
   **Architecture support**:
      mseal only works on 64-bit CPUs, not 32-bit CPUs.

    ret2 will be -EPERM, the page remains to be PROT_READ.

**Note**:

- mseal() only works on 64-bit CPUs, not 32-bit CPU.

- users can call mseal() multiple times, mseal() on an already sealed memory
   **Idempotent**:
      Users can call mseal multiple times. Calling mseal on already sealed memory
      is a no-op (not an error).

- munseal() is not supported.

Use cases:
==========
   **no munseal**
      Once a mapping is sealed, it can't be unsealed. The kernel should never
      have munseal; this is consistent with other sealing features, e.g.
      F_SEAL_SEAL for files.

Blocked mm syscall for sealed mapping
-------------------------------------
   It might be important to note: **once the mapping is sealed, it will
   stay in the process's memory until the process terminates**.

   Example::

         ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
         rc = mseal(ptr, 4096, 0);
         /* munmap will fail */
         rc = munmap(ptr, 4096);
         assert(rc < 0);
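
A more complete version of the same sequence might look like the sketch below. It assumes the C library does not provide an ``mseal()`` wrapper and therefore calls ``syscall(2)`` with ``__NR_mseal`` when the headers define it; ``try_mseal`` and ``seal_then_unmap`` are hypothetical names used only for illustration. Note that a successfully sealed page stays mapped until the process exits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Assumption: the C library may not ship an mseal() wrapper, so this
 * hypothetical helper goes through syscall(2) when __NR_mseal is known.
 */
static int try_mseal(void *addr, size_t len)
{
#ifdef __NR_mseal
	return (int)syscall(__NR_mseal, addr, len, 0UL);
#else
	(void)addr;
	(void)len;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Returns 1 if the seal blocked munmap() with EPERM, 0 if the mapping
 * could not be sealed (old kernel, 32-bit, sandbox), -1 on anything else.
 */
int seal_then_unmap(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *ptr;

	assert(page > 0);
	ptr = mmap(NULL, (size_t)page, PROT_READ,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (ptr == MAP_FAILED)
		return -1;

	if (try_mseal(ptr, (size_t)page) != 0) {
		/* Not sealed, so the mapping can still be released. */
		munmap(ptr, (size_t)page);
		return 0;
	}

	/* Sealed: munmap() is expected to fail and leave the page mapped. */
	errno = 0;
	if (munmap(ptr, (size_t)page) == 0)
		return -1;
	return errno == EPERM ? 1 : -1;
}
```

Treating any ``mseal`` failure as "not sealed" keeps the sketch usable on kernels or sandboxes where the syscall is unavailable.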

   Blocked mm syscall:
      - munmap
      - mmap
      - mremap
      - mprotect and pkey_mprotect
      - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
        MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK

   The first set of syscalls to block is munmap, mremap, mmap. They can
   either leave an empty space in the address space, therefore allowing
   replacement with a new mapping with new set of attributes, or can
   overwrite the existing mapping with another mapping.

   mprotect and pkey_mprotect are blocked because they change the
   protection bits (RWX) of the mapping.

   Certain destructive madvise behaviors, specifically MADV_DONTNEED,
   MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
   risks when applied to anonymous memory by threads lacking write
   permissions. Consequently, these operations are prohibited under such
   conditions. The aforementioned behaviors have the potential to modify
   region contents by discarding pages, effectively performing a memset(0)
   operation on the anonymous memory.

   Kernel will return -EPERM for blocked syscalls.

   When a blocked syscall returns -EPERM due to sealing, the memory regions may
   or may not be changed, depending on the syscall being blocked:

      - munmap: munmap is atomic. If one of the VMAs in the given range is
        sealed, none of the VMAs are updated.
      - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
        when mprotect spans multiple VMAs, mprotect might update the beginning
        VMAs before reaching the sealed VMA and returning -EPERM.
      - mmap and mremap: undefined behavior.

Use cases
=========
- glibc:
  The dynamic linker, during loading ELF executables, can apply sealing to
  non-writable memory segments.

- Chrome browser: protect some security sensitive data-structures.
  mapping segments.

Notes on which memory to seal:
==============================
- Chrome browser: protect some security sensitive data structures.

It might be important to note that sealing changes the lifetime of a mapping,
i.e. the sealed mapping won’t be unmapped till the process terminates or the
exec system call is invoked. Applications can apply sealing to any virtual
memory region from userspace, but it is crucial to thoroughly analyze the
mapping's lifetime prior to apply the sealing.
When not to use mseal
=====================
Applications can apply sealing to any virtual memory region from userspace,
but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
applying the sealing. This is because the sealed mapping *won’t be unmapped*
until the process terminates or the exec system call is invoked.

For example:

   - aio/shm
     aio/shm can call mmap and munmap on behalf of userspace, e.g.
     ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
     the lifetime of the process. If those mappings are sealed from userspace,
     then munmap will fail, causing leaks in VMA address space during the
     lifetime of the process.

   - ptr allocated by malloc (heap)
     Don't use mseal on a pointer returned from malloc().
     malloc() is implemented by an allocator, e.g. glibc. The heap manager
     might allocate the pointer from brk or from a mapping created by mmap.
     If an app calls mseal on a pointer returned from malloc(), this can affect
     the heap manager's ability to manage the mappings; the outcome is
     non-deterministic.

     Example::

        ptr = malloc(size);
        /* don't call mseal on a ptr returned from malloc. */
        mseal(ptr, size, 0);
        /* free will succeed, but the allocator can't shrink the heap below ptr */
        free(ptr);

mseal doesn't block
===================
In a nutshell, mseal blocks certain mm syscalls from modifying some of a VMA's
attributes, such as protection bits (RWX). A sealed mapping doesn't mean the
memory is immutable.

  aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
  shm.c. The lifetimes of those mappings are not tied to the lifetime of the
  process. If those memories are sealed from userspace, then munmap() will fail,
  causing leaks in VMA address space during the lifetime of the process.

- Brk (heap)

  Currently, userspace applications can seal parts of the heap by calling
  malloc() and mseal().
  Let's assume the following calls from user space:

  - ptr = malloc(size);
  - mprotect(ptr, size, RO);
  - mseal(ptr, size);
  - free(ptr);

  Technically, before mseal() is added, the user can change the protection of
  the heap by calling mprotect(RO). As long as the user changes the protection
  back to RW before free(), the memory range can be reused.

  Adding mseal() into the picture, however, the heap is then sealed partially,
  the user can still free it, but the memory remains to be RO. If the address
  is re-used by the heap manager for another malloc, the process might crash
  soon after. Therefore, it is important not to apply sealing to any memory
  that might get recycled.

  Furthermore, even if the application never calls the free() for the ptr,
  the heap manager may invoke the brk system call to shrink the size of the
  heap. In the kernel, the brk-shrink will call munmap(). Consequently,
  depending on the location of the ptr, the outcome of brk-shrink is
  nondeterministic.


Additional notes:
=================
As Jann Horn pointed out in [3], there are still a few ways to write
to RO memory, which is, in a way, by design. Those cases are not covered
by mseal(). If applications want to block such cases, sandbox tools (such as
seccomp, LSM, etc) might be considered.
to RO memory, which is, in a way, by design. And those could be blocked
by different security measures.

Those cases are:

- Write to read-only memory through /proc/self/mem interface.
   - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
   - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
   - userfaultfd.
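
The first case can be illustrated with a short sketch. It makes an anonymous page read-only with mprotect (no sealing is involved, since the write path through /proc/self/mem is the same either way) and then tries to overwrite one byte through /proc/self/mem. The function name ``poke_readonly_page`` is hypothetical, and whether the write succeeds depends on kernel configuration and any sandboxing in place, so the sketch reports the outcome rather than assuming it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if a byte in a read-only page was changed via /proc/self/mem,
 * 0 if the interface was unavailable or refused the write, -1 on setup error.
 */
int poke_readonly_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile char *view;
	char *map;
	ssize_t n;
	int fd, changed;

	assert(page > 0);
	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	map[0] = 'A';
	if (mprotect(map, (size_t)page, PROT_READ) != 0) {
		munmap(map, (size_t)page);
		return -1;
	}

	fd = open("/proc/self/mem", O_RDWR);
	if (fd < 0) {
		munmap(map, (size_t)page);
		return 0;
	}

	/* The file offset is the virtual address being written. */
	n = pwrite(fd, "B", 1, (off_t)(uintptr_t)map);
	close(fd);

	/* Read back through a volatile pointer so the load is not elided. */
	view = map;
	changed = (n == 1 && view[0] == 'B');

	munmap(map, (size_t)page);
	return changed ? 1 : 0;
}
```

A return value of 1 shows that page protection bits alone do not make the contents immutable, which is why additional measures such as seccomp or an LSM may be needed.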

The idea that inspired this patch comes from Stephen Röttger’s work in V8
CFI [4]. Chrome browser in ChromeOS will be the first user of this API.

Reference:
==========
[1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274

[2] https://man.openbsd.org/mimmutable.2

[3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com

[4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
Reference
=========
- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
- [2] https://man.openbsd.org/mimmutable.2
- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
+6 −6
@@ -261,12 +261,6 @@ static noinstr bool handle_bug(struct pt_regs *regs)
	int ud_type;
	u32 imm;

	/*
	 * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
	 * is a rare case that uses @regs without passing them to
	 * irqentry_enter().
	 */
	kmsan_unpoison_entry_regs(regs);
	ud_type = decode_bug(regs->ip, &imm);
	if (ud_type == BUG_NONE)
		return handled;
@@ -275,6 +269,12 @@ static noinstr bool handle_bug(struct pt_regs *regs)
	 * All lies, just get the WARN/BUG out.
	 */
	instrumentation_begin();
	/*
	 * Normally @regs are unpoisoned by irqentry_enter(), but handle_bug()
	 * is a rare case that uses @regs without passing them to
	 * irqentry_enter().
	 */
	kmsan_unpoison_entry_regs(regs);
	/*
	 * Since we're emulating a CALL with exceptions, restore the interrupt
	 * state to what it was at the exception site.
+1 −0
@@ -401,6 +401,7 @@ void nilfs_clear_folio_dirty(struct folio *folio)

	folio_clear_uptodate(folio);
	folio_clear_mappedtodisk(folio);
	folio_clear_checked(folio);

	head = folio_buffers(folio);
	if (head) {
+8 −0
@@ -1787,6 +1787,14 @@ int ocfs2_remove_inode_range(struct inode *inode,
		return 0;

	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
		int id_count = ocfs2_max_inline_data_with_xattr(inode->i_sb, di);

		if (byte_start > id_count || byte_start + byte_len > id_count) {
			ret = -EINVAL;
			mlog_errno(ret);
			goto out;
		}

		ret = ocfs2_truncate_inline(inode, di_bh, byte_start,
					    byte_start + byte_len, 0);
		if (ret) {
+28 −0
@@ -692,6 +692,34 @@ void dup_userfaultfd_complete(struct list_head *fcs)
	}
}

void dup_userfaultfd_fail(struct list_head *fcs)
{
	struct userfaultfd_fork_ctx *fctx, *n;

	/*
	 * An error has occurred on fork, we will tear memory down, but have
	 * allocated memory for fctx's and raised reference counts for both the
	 * original and child contexts (and on the mm for each as a result).
	 *
	 * These would ordinarily be taken care of by a user handling the event,
	 * but we are no longer doing so, so manually clean up here.
	 *
	 * mm tear down will take care of cleaning up VMA contexts.
	 */
	list_for_each_entry_safe(fctx, n, fcs, list) {
		struct userfaultfd_ctx *octx = fctx->orig;
		struct userfaultfd_ctx *ctx = fctx->new;

		atomic_dec(&octx->mmap_changing);
		VM_BUG_ON(atomic_read(&octx->mmap_changing) < 0);
		userfaultfd_ctx_put(octx);
		userfaultfd_ctx_put(ctx);

		list_del(&fctx->list);
		kfree(fctx);
	}
}

void mremap_userfaultfd_prep(struct vm_area_struct *vma,
			     struct vm_userfaultfd_ctx *vm_ctx)
{