Commit e6b324fb authored by Linus Torvalds

Merge tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Mainly MM singleton fixes. And a couple of ocfs2 regression fixes"

* tag 'mm-hotfixes-stable-2024-06-17-11-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kcov: don't lose track of remote references during softirqs
  mm: shmem: fix getting incorrect lruvec when replacing a shmem folio
  mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick
  mm: fix possible OOB in numa_rebuild_large_mapping()
  mm/migrate: fix kernel BUG at mm/compaction.c:2761!
  selftests: mm: make map_fixed_noreplace test names stable
  mm/memfd: add documentation for MFD_NOEXEC_SEAL MFD_EXEC
  mm: mmap: allow for the maximum number of bits for randomizing mmap_base by default
  gcov: add support for GCC 14
  zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
  mm: huge_memory: fix misused mapping_large_folio_support() for anon folios
  lib/alloc_tag: fix RCU imbalance in pgalloc_tag_get()
  lib/alloc_tag: do not register sysctl interface when CONFIG_SYSCTL=n
  MAINTAINERS: remove Lorenzo as vmalloc reviewer
  Revert "mm: init_mlocked_on_free_v3"
  mm/page_table_check: fix crash on ZONE_DEVICE
  gcc: disable '-Warray-bounds' for gcc-9
  ocfs2: fix NULL pointer dereference in ocfs2_abort_trigger()
  ocfs2: fix NULL pointer dereference in ocfs2_journal_dirty()
parents 5cf81d7b 01c8f980
+0 −6
@@ -2192,12 +2192,6 @@
			Format: 0 | 1
			Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.

	init_mlocked_on_free=	[MM] Fill freed userspace memory with zeroes if
				it was mlock'ed and not explicitly munlock'ed
				afterwards.
				Format: 0 | 1
				Default set by CONFIG_INIT_MLOCKED_ON_FREE_DEFAULT_ON

	init_pkru=	[X86] Specify the default memory protection keys rights
			register contents for all processes.  0x55555554 by
			default (disallow access to all but pkey 0).  Can
+1 −0
@@ -32,6 +32,7 @@ Security-related interfaces
   seccomp_filter
   landlock
   lsm
   mfd_noexec
   spec_ctrl
   tee

+86 −0
.. SPDX-License-Identifier: GPL-2.0

==================================
Introduction of non-executable mfd
==================================
:Author:
    Daniel Verkamp <dverkamp@chromium.org>
    Jeff Xu <jeffxu@chromium.org>

:Contributor:
	Aleksa Sarai <cyphar@cyphar.com>

Since Linux introduced the memfd feature, memfds have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.

However, in a secure-by-default system such as ChromeOS (where all
executables should come from the rootfs, which is protected by verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables a "confused deputy" attack.  E.g., in VRP bug [1], the cros_vm
process created a memfd to share content with an external process, but
the memfd was overwritten and used to execute arbitrary code and
escalate to root. [2] lists more VRP bugs of this kind.

On the other hand, executable memfds have legitimate uses: runc uses memfd's
sealing and executable features to copy the contents of a binary and then
execute it. For such a system, we need a way to differentiate runc's
use of executable memfds from an attacker's [3].

To address the above:
 - Let memfd_create() set the X bit at creation time.
 - Let a memfd be sealed against modifying the X bit when NX is set.
 - Add a new pid namespace sysctl: vm.memfd_noexec to help applications in
   migrating and enforcing non-executable MFD.

User API
========
``int memfd_create(const char *name, unsigned int flags)``

``MFD_NOEXEC_SEAL``
	When MFD_NOEXEC_SEAL bit is set in the ``flags``, memfd is created
	with NX. F_SEAL_EXEC is set and the memfd can't be modified to
	add X later. MFD_ALLOW_SEALING is also implied.
	This is the most common case for the application to use memfd.

``MFD_EXEC``
	When MFD_EXEC bit is set in the ``flags``, memfd is created with X.

Note:
	``MFD_NOEXEC_SEAL`` implies ``MFD_ALLOW_SEALING``. In case that
	an app doesn't want sealing, it can add F_SEAL_SEAL after creation.
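
As a minimal sketch of the flags described above (an illustration, not part of
this commit): the helper names ``create_noexec_memfd`` and ``is_exec_sealed``
are hypothetical, and the fallback ``#define`` values are assumed to mirror the
kernel UAPI for userspace headers that predate these flags. On kernels without
``MFD_NOEXEC_SEAL``, memfd_create() is expected to fail with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed to match the kernel UAPI). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif
#ifndef F_SEAL_EXEC
#define F_SEAL_EXEC 0x0020
#endif

/*
 * Hypothetical helper: create a memfd with NX set and F_SEAL_EXEC applied.
 * Returns the new fd, or -1 with errno set by memfd_create().
 */
int create_noexec_memfd(const char *name)
{
	assert(name != NULL);
	return memfd_create(name, MFD_CLOEXEC | MFD_NOEXEC_SEAL);
}

/*
 * Hypothetical helper: report whether F_SEAL_EXEC is present on fd.
 * Returns 1 if sealed, 0 if not, -1 if the seals could not be read.
 */
int is_exec_sealed(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -1;
	return (seals & F_SEAL_EXEC) ? 1 : 0;
}
```

A caller that genuinely needs an executable memfd would pass ``MFD_EXEC``
instead of ``MFD_NOEXEC_SEAL``; whether that succeeds depends on the
vm.memfd_noexec setting described in the next section.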


Sysctl:
========
``pid namespaced sysctl vm.memfd_noexec``

The new pid namespaced sysctl vm.memfd_noexec has 3 values:

 - 0: MEMFD_NOEXEC_SCOPE_EXEC
	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
	MFD_EXEC was set.

 - 1: MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
	memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
	MFD_NOEXEC_SEAL was set.

 - 2: MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED
	memfd_create() without MFD_NOEXEC_SEAL will be rejected.
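
The three values above can be read as a small decision table. The following
userspace sketch models that table literally as worded here; it is not the
kernel implementation. The function name ``resolve_memfd_flags``, the use of
``-EACCES`` for rejection, and the assumption that callers set at most one of
the two flags are all illustrative choices, and the kernel's actual handling of
corner cases (for instance a call with neither flag at level 2) may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag values assumed to mirror the kernel UAPI. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/* Sysctl levels, named as in the list above. */
enum memfd_noexec_scope {
	MEMFD_NOEXEC_SCOPE_EXEC = 0,
	MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL = 1,
	MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED = 2,
};

/*
 * Model of the documented rules. On success returns 0 and stores the flags
 * that would take effect in *effective; returns -EACCES (an assumed error
 * code) when the request would be rejected.
 */
int resolve_memfd_flags(enum memfd_noexec_scope scope, unsigned int flags,
			unsigned int *effective)
{
	assert(effective != NULL);

	if (scope >= MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED) {
		/* Level 2: anything without MFD_NOEXEC_SEAL is rejected. */
		if (!(flags & MFD_NOEXEC_SEAL))
			return -EACCES;
	} else if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
		/* Levels 0 and 1: fill in the default for legacy callers. */
		flags |= (scope == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)
				 ? MFD_NOEXEC_SEAL
				 : MFD_EXEC;
	}

	*effective = flags;
	return 0;
}
```

For example, a legacy caller passing ``flags == 0`` ends up with ``MFD_EXEC``
at level 0 and with ``MFD_NOEXEC_SEAL`` at level 1, while an explicit
``MFD_EXEC`` request is honoured at levels 0 and 1 and refused at level 2.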

The sysctl allows finer control of memfd_create for old software that
doesn't set the executable bit; for example, a container with
vm.memfd_noexec=1 means the old software will create non-executable memfd
by default while new software can create executable memfd by setting
MFD_EXEC.

The value of vm.memfd_noexec is passed to child namespace at creation
time. In addition, the setting is hierarchical, i.e. during memfd_create,
we will search from current ns to root ns and use the most restrictive
setting.
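
The hierarchical lookup described above can be sketched as a walk up a parent
chain that keeps the largest (most restrictive) value seen. This is a userspace
model under stated assumptions: ``struct pidns_model`` and
``effective_memfd_noexec`` are hypothetical names, not kernel structures or
functions, and a higher number is taken to mean a more restrictive setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace: its own setting and its parent. */
struct pidns_model {
	int memfd_noexec;                 /* 0, 1 or 2, as listed above */
	const struct pidns_model *parent; /* NULL for the root namespace */
};

/*
 * Walk from ns up to the root namespace and return the most restrictive
 * (numerically largest) vm.memfd_noexec value found along the way.
 */
int effective_memfd_noexec(const struct pidns_model *ns)
{
	assert(ns != NULL);

	int scope = ns->memfd_noexec;

	for (const struct pidns_model *p = ns->parent; p != NULL; p = p->parent) {
		if (p->memfd_noexec > scope)
			scope = p->memfd_noexec;
	}
	return scope;
}
```

Under this model a child namespace cannot relax a restriction imposed by an
ancestor: if the root is set to 0, a child to 2 and a grandchild to 1, the
grandchild still sees an effective value of 2.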

[1] https://crbug.com/1305267

[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1

[3] https://lwn.net/Articles/781013/
+0 −1
@@ -23974,7 +23974,6 @@ VMALLOC
M:	Andrew Morton <akpm@linux-foundation.org>
R:	Uladzislau Rezki <urezki@gmail.com>
R:	Christoph Hellwig <hch@infradead.org>
R:	Lorenzo Stoakes <lstoakes@gmail.com>
L:	linux-mm@kvack.org
S:	Maintained
W:	http://www.linux-mm.org
+12 −0
@@ -1046,10 +1046,21 @@ config ARCH_MMAP_RND_BITS_MAX
config ARCH_MMAP_RND_BITS_DEFAULT
	int

config FORCE_MAX_MMAP_RND_BITS
	bool "Force maximum number of bits to use for ASLR of mmap base address"
	default y if !64BIT
	help
	  ARCH_MMAP_RND_BITS and ARCH_MMAP_RND_COMPAT_BITS represent the number
	  of bits to use for ASLR and if no custom value is assigned (EXPERT)
	  then the architecture's lower bound (minimum) value is assumed.
	  This toggle changes that default assumption to assume the arch upper
	  bound (maximum) value instead.

config ARCH_MMAP_RND_BITS
	int "Number of bits to use for ASLR of mmap base address" if EXPERT
	range ARCH_MMAP_RND_BITS_MIN ARCH_MMAP_RND_BITS_MAX
	default ARCH_MMAP_RND_BITS_DEFAULT if ARCH_MMAP_RND_BITS_DEFAULT
	default ARCH_MMAP_RND_BITS_MAX if FORCE_MAX_MMAP_RND_BITS
	default ARCH_MMAP_RND_BITS_MIN
	depends on HAVE_ARCH_MMAP_RND_BITS
	help
@@ -1084,6 +1095,7 @@ config ARCH_MMAP_RND_COMPAT_BITS
	int "Number of bits to use for ASLR of mmap base address for compatible applications" if EXPERT
	range ARCH_MMAP_RND_COMPAT_BITS_MIN ARCH_MMAP_RND_COMPAT_BITS_MAX
	default ARCH_MMAP_RND_COMPAT_BITS_DEFAULT if ARCH_MMAP_RND_COMPAT_BITS_DEFAULT
	default ARCH_MMAP_RND_COMPAT_BITS_MAX if FORCE_MAX_MMAP_RND_BITS
	default ARCH_MMAP_RND_COMPAT_BITS_MIN
	depends on HAVE_ARCH_MMAP_RND_COMPAT_BITS
	help