Commit 5c00ff74 authored by Linus Torvalds

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - The series "zram: optimal post-processing target selection" from
   Sergey Senozhatsky improves zram's post-processing selection
   algorithm. This leads to improved memory savings.

 - Wei Yang has gone to town on the mapletree code, contributing several
   series which clean up the implementation:
	- "refine mas_mab_cp()"
	- "Reduce the space to be cleared for maple_big_node"
	- "maple_tree: simplify mas_push_node()"
	- "Following cleanup after introduce mas_wr_store_type()"
	- "refine storing null"

 - The series "selftests/mm: hugetlb_fault_after_madv improvements" from
   David Hildenbrand fixes this selftest for s390.

 - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng
   implements some rationalizations and cleanups in the page mapping
   code.

 - The series "mm: optimize shadow entries removal" from Shakeel Butt
   optimizes the file truncation code by speeding up the handling of
   shadow entries.

 - The series "Remove PageKsm()" from Matthew Wilcox completes the
   migration of this flag over to being a folio-based flag.

 - The series "Unify hugetlb into arch_get_unmapped_area functions" from
   Oscar Salvador implements a bunch of consolidations and cleanups in
   the hugetlb code.

 - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain
   takes away the wp-fault time practice of turning a huge zero page
   into small pages. Instead we replace the whole thing with a THP. More
   consistent, cleaner, and potentially saves a large number of pagefaults.

 - The series "percpu: Add a test case and fix for clang" from Andy
   Shevchenko enhances and fixes the kernel's built-in percpu test code.

 - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett
   optimizes mremap() by avoiding doing things which we didn't need to
   do.

 - The series "Improve the tmpfs large folio read performance" from
   Baolin Wang teaches tmpfs to copy data into userspace at the folio
   size rather than as individual pages. A 20% speedup was observed.

 - The series "mm/damon/vaddr: Fix issue in
   damon_va_evenly_split_region()" from Zheng Yejian fixes DAMON
   splitting.

 - The series "memcg-v1: fully deprecate charge moving" from Shakeel
   Butt removes the long-deprecated memcg-v1 charge moving feature.

 - The series "fix error handling in mmap_region() and refactor" from
   Lorenzo Stoakes cleans up some of the mmap() error handling and
   addresses some potential performance issues.

 - The series "x86/module: use large ROX pages for text allocations"
   from Mike Rapoport teaches x86 to use large pages for
   read-only-execute module text.

 - The series "page allocation tag compression" from Suren Baghdasaryan
   is follow-on maintenance work for the new page allocation profiling
   feature.

 - The series "page->index removals in mm" from Matthew Wilcox removes
   most references to page->index in mm/. A slow march towards shrinking
   struct page.

 - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs
   interface tests" from Andrew Paniakin performs maintenance work for
   DAMON's self testing code.

 - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar
   improves zswap's batching of compression and decompression. It is a
   step along the way towards using Intel IAA hardware acceleration for
   this zswap operation.

 - The series "kasan: migrate the last module test to kunit" from
   Sabyrzhan Tasbolatov completes the migration of the KASAN built-in
   tests over to the KUnit framework.

 - The series "implement lightweight guard pages" from Lorenzo Stoakes
   permits userspace to place fault-generating guard pages within a
   single VMA, rather than requiring that multiple VMAs be created for
   this. Improved efficiencies for userspace memory allocators are
   expected.

 - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses
   tracepoints to provide increased visibility into memcg stats flushing
   activity.

 - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky
   fixes a zram buglet which potentially affected performance.

 - The series "mm: add more kernel parameters to control mTHP" from
   Maíra Canal enhances our ability to control/configure multisize THP
   from the kernel boot command line.

 - The series "kasan: few improvements on kunit tests" from Sabyrzhan
   Tasbolatov has a couple of fixups for the KASAN KUnit tests.

 - The series "mm/list_lru: Split list_lru lock into per-cgroup scope"
   from Kairui Song optimizes list_lru memory utilization when lockdep
   is enabled.

* tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits)
  cma: enforce non-zero pageblock_order during cma_init_reserved_mem()
  mm/kfence: add a new kunit test test_use_after_free_read_nofault()
  zram: fix NULL pointer in comp_algorithm_show()
  memcg/hugetlb: add hugeTLB counters to memcg
  vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event
  mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount
  zram: ZRAM_DEF_COMP should depend on ZRAM
  MAINTAINERS/MEMORY MANAGEMENT: add document files for mm
  Docs/mm/damon: recommend academic papers to read and/or cite
  mm: define general function pXd_init()
  kmemleak: iommu/iova: fix transient kmemleak false positive
  mm/list_lru: simplify the list_lru walk callback function
  mm/list_lru: split the lock to per-cgroup scope
  mm/list_lru: simplify reparenting and initial allocation
  mm/list_lru: code clean up for reparenting
  mm/list_lru: don't export list_lru_add
  mm/list_lru: don't pass unnecessary key parameters
  kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller
  kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW
  kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols
  ...
parents 228a1157 2532e6c7
+2 −0
@@ -47,6 +47,8 @@ The list of possible return codes:
-ENOMEM	  zram was not able to allocate enough memory to fulfil your
	  needs.
-EINVAL	  invalid input has been provided.
-EAGAIN	  re-try operation later (e.g. when attempting to run recompress
	  and writeback simultaneously).
========  =============================================================
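The -EAGAIN entry above suggests a retry pattern for callers that write to zram's sysfs attributes. The following is a minimal sketch of that pattern, not part of the zram documentation: the helper names (`write_with_retry`, `write_sysfs`), the attempt count, and the delay are all hypothetical choices made for illustration.

```python
import errno
import time


def write_with_retry(write_once, attempts=5, delay_s=0.1, sleep=time.sleep):
    """Call write_once() until it stops failing with EAGAIN.

    write_once is any zero-argument callable that performs the write and
    raises OSError on failure. Errors other than EAGAIN propagate at once,
    and EAGAIN itself propagates after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            return write_once()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            sleep(delay_s)  # back off before the next attempt


def write_sysfs(path, value):
    """Write value to a sysfs attribute; failures surface as OSError."""
    with open(path, "w") as attr:
        attr.write(value)
```

A caller might wrap a single write like `write_with_retry(lambda: write_sysfs(path, value))`, where `path` and `value` name whichever zram attribute is being driven. Retrying only on EAGAIN keeps -EINVAL and -ENOMEM visible to the caller, since those indicate problems that a later attempt is unlikely to fix.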

If you use 'echo', the returned value is set by the 'echo' utility,
+3 −79
@@ -90,9 +90,7 @@ Brief summary of control files.
                                     used.
 memory.swappiness		     set/show swappiness parameter of vmscan
				     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.move_charge_at_immigrate     This knob is deprecated.
 memory.oom_control		     set/show oom controls.
                                     This knob is deprecated and shouldn't be
                                     used.
@@ -243,10 +241,6 @@ behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a
task to another cgroup, its pages may be recharged to the new cgroup, if
move_charge_at_immigrate has been chosen.

2.4 Swap Extension
--------------------------------------

@@ -756,78 +750,8 @@ If we want to change this to 1G, we can at any time use::

THIS IS DEPRECATED!

It's expensive and unreliable! It's better practice to launch workload
tasks directly from inside their target cgroup. Use dedicated workload
cgroups to allow fine-grained policy adjustments without having to
move physical pages between control domains.

Users can move charges associated with a task along with task migration, that
is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of lack of
page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

	# echo (some positive value) > memory.move_charge_at_immigrate

.. note::
      Each bits of move_charge_at_immigrate has its own meaning about what type
      of charges should be moved. See :ref:`section 8.2
      <cgroup-v1-memory-movable-charges>` for details.

.. note::
      Charges are moved only when you move mm->owner, in other words,
      a leader of a thread group.

.. note::
      If we cannot find enough space for the task in the destination cgroup, we
      try to make space by reclaiming memory. Task migration may fail if we
      cannot make enough space.

.. note::
      It can take several seconds if you move charges much.

And if you want disable it again::

	# echo 0 > memory.move_charge_at_immigrate

.. _cgroup-v1-memory-movable-charges:

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't done page fault, i.e. they might   |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
|   | The mapcount of the page is ignored (the page can be moved independent   |
|   | of the mapcount). You must enable Swap Extension (see 2.4) to            |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

8.3 TODO
--------

- All of moving charge operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex too long, so we may need some trick.
Reading memory.move_charge_at_immigrate will always return 0 and writing
to it will always return -EINVAL.

9. Memory thresholds
====================
+5 −0
@@ -1655,6 +1655,11 @@ The following nested keys are defined.
	  pgdemote_khugepaged
		Number of pages demoted by khugepaged.

	  hugetlb
		Amount of memory used by hugetlb pages. This metric only shows
		up if hugetlb usage is accounted for in memory.current (i.e.
		cgroup is mounted with the memory_hugetlb_accounting option).

  memory.numa_stat
	A read-only nested-keyed file which exists on non-root cgroups.
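Because the hugetlb key described in this hunk only appears when hugetlb usage is accounted for in memory.current, a reader of memory.stat has to treat it as optional. Below is a small sketch of that handling, assuming the usual "name value" line layout of memory.stat; the function names and the sample text are hypothetical.

```python
def parse_memory_stat(text):
    """Parse "name value" lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def hugetlb_usage(text):
    """Return the hugetlb value, or None when the key is not reported."""
    return parse_memory_stat(text).get("hugetlb")


# Made-up sample contents, with and without the optional key.
with_key = "anon 4096\nfile 8192\nhugetlb 2097152\n"
without_key = "anon 4096\nfile 8192\n"

print(hugetlb_usage(with_key))     # 2097152
print(hugetlb_usage(without_key))  # None
```

Returning None rather than 0 for a missing key lets the caller tell "no hugetlb memory in use" apart from "hugetlb accounting is not enabled for this mount".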

+17 −0
@@ -6711,6 +6711,16 @@
			Force threading of all interrupt handlers except those
			marked explicitly IRQF_NO_THREAD.

	thp_shmem=	[KNL]
			Format: <size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy>
			Control the default policy of each hugepage size for the
			internal shmem mount. <policy> is one of policies available
			for the shmem mount ("always", "inherit", "never", "within_size",
			and "advise").
			It can be used multiple times for multiple shmem THP sizes.
			See Documentation/admin-guide/mm/transhuge.rst for more
			details.

	topology=	[S390,EARLY]
			Format: {off | on}
			Specify if the kernel should make use of the cpu
@@ -6952,6 +6962,13 @@
			See Documentation/admin-guide/mm/transhuge.rst
			for more details.

	transparent_hugepage_shmem= [KNL]
			Format: [always|within_size|advise|never|deny|force]
			Can be used to control the hugepage allocation policy for
			the internal shmem mount.
			See Documentation/admin-guide/mm/transhuge.rst
			for more details.

	trusted.source=	[KEYS]
			Format: <string>
			This parameter identifies the trust source as a backend
+33 −2
@@ -326,6 +326,29 @@ PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER
is not defined within a valid ``thp_anon``, its policy will default to
``never``.

Similarly to ``transparent_hugepage``, you can control the hugepage
allocation policy for the internal shmem mount by using the kernel parameter
``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the
six valid policies for shmem (``always``, ``within_size``, ``advise``,
``never``, ``deny``, and ``force``).

In the same manner as ``thp_anon`` controls each supported anonymous THP
size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
has the same format as ``thp_anon``, but also supports the policy
``within_size``.

``thp_shmem=`` may be specified multiple times to configure all THP sizes
as required. If ``thp_shmem=`` is specified at least once, any shmem THP
sizes not explicitly configured on the command line are implicitly set to
``never``.

``transparent_hugepage_shmem`` setting only affects the global toggle. If
``thp_shmem`` is not specified, PMD_ORDER hugepage will default to
``inherit``. However, if a valid ``thp_shmem`` setting is provided by the
user, the PMD_ORDER hugepage policy will be overridden. If the policy for
PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
default to ``never``.
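The defaulting rules for ``thp_shmem=`` can be easier to follow with a worked model. The sketch below is an illustration only and does not mirror the kernel's parser: the function names, the list of supported sizes, and the 2M PMD size are assumptions, and a malformed value raises an exception here, which is a choice of this sketch rather than documented kernel behavior.

```python
POLICIES = {"always", "inherit", "never", "within_size", "advise"}
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(token):
    """Convert a token such as "64K" or "2M" to bytes."""
    token = token.strip()
    if token and token[-1] in UNITS:
        return int(token[:-1]) * UNITS[token[-1]]
    return int(token)


def resolve_thp_shmem(values, supported_sizes, pmd_size):
    """Map each supported size (in bytes) to a policy name.

    values holds the string after each "thp_shmem=" on the command line,
    in order of appearance; later settings override earlier ones.
    """
    # Assumption: without any thp_shmem=, non-PMD sizes stay "never".
    policies = {size: "never" for size in supported_sizes}
    if not values:
        policies[pmd_size] = "inherit"
        return policies

    for value in values:
        for entry in value.split(";"):
            sizes, sep, policy = entry.partition(":")
            if not sep or policy not in POLICIES:
                raise ValueError(f"bad entry: {entry!r}")
            for item in sizes.split(","):
                low, dash, high = item.partition("-")
                lo = parse_size(low)
                hi = parse_size(high) if dash else lo
                matched = [s for s in supported_sizes if lo <= s <= hi]
                if not matched:
                    raise ValueError(f"no supported size in {item!r}")
                for size in matched:
                    policies[size] = policy
    return policies


K = 1024
# Hypothetical set of supported sizes, with 2M as the PMD size.
SUPPORTED = [16 * K, 32 * K, 64 * K, 128 * K, 256 * K, 512 * K, 1024 * K, 2048 * K]
result = resolve_thp_shmem(["16K,32K:always;64K-512K:within_size"], SUPPORTED, 2048 * K)
for size in SUPPORTED:
    print(f"{size // K}K -> {result[size]}")
```

In this example the 1M and 2M sizes come out as ``never``, which matches the rule above: once ``thp_shmem=`` is given, any size it leaves out, including PMD_ORDER, no longer inherits the global setting.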

Hugepages in tmpfs/shmem
========================

@@ -530,10 +553,18 @@ anon_fault_fallback_charge
	instead falls back to using huge pages with lower orders or
	small pages even though the allocation was successful.

swpout
	is incremented every time a huge page is swapped out in one
zswpout
	is incremented every time a huge page is swapped out to zswap in one
	piece without splitting.

swpin
	is incremented every time a huge page is swapped in from a non-zswap
	swap device in one piece.

swpout
	is incremented every time a huge page is swapped out to a non-zswap
	swap device in one piece without splitting.

swpout_fallback
	is incremented if a huge page has to be split before swapout.
	Usually because failed to allocate some continuous swap space