Commit 93908083 authored by Wei Yang's avatar Wei Yang Committed by Andrew Morton
Browse files

mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP

Commit 60fbb143 ("mm/huge_memory: adjust try_to_migrate_one() and
split_huge_pmd_locked()") return false unconditionally after
split_huge_pmd_locked().  This may fail try_to_migrate() early when
TTU_SPLIT_HUGE_PMD is specified.

The reason is the above commit adjusted try_to_migrate_one() to, when a
PMD-mapped THP entry is found, and TTU_SPLIT_HUGE_PMD is specified (for
example, via unmap_folio()), return false unconditionally.  This breaks
the rmap walk and fail try_to_migrate() early, if this PMD-mapped THP is
mapped in multiple processes.

The user sensible impact of this bug could be:

  * On memory pressure, shrink_folio_list() may split partially mapped
    folio with split_folio_to_list(). Then free unmapped pages without IO.
    If failed, it may not be reclaimed.
  * On memory failure, memory_failure() would call try_to_split_thp_page()
    to split folio contains the bad page. If succeed, the PG_has_hwpoisoned
    bit is only set in the after-split folio contains @split_at. By doing
    so, we limit bad memory. If failed to split, the whole folios is not
    usable.

One way to reproduce:

    Create an anonymous THP range and fork 512 children, so we have a
    THP shared mapped in 513 processes. Then trigger folio split with
    /sys/kernel/debug/split_huge_pages debugfs to split the THP folio to
    order 0.

Without the above commit, we can successfully split to order 0.  With the
above commit, the folio is still a large folio.

And currently there are two core users of TTU_SPLIT_HUGE_PMD:

  * try_to_unmap_one()
  * try_to_migrate_one()

try_to_unmap_one() would restart the rmap walk, so only
try_to_migrate_one() is affected.

We can't simply revert commit 60fbb143 ("mm/huge_memory: adjust
try_to_migrate_one() and split_huge_pmd_locked()"), since it removed some
duplicated check covered by page_vma_mapped_walk().

This patch fixes this by restart page_vma_mapped_walk() after
split_huge_pmd_locked().  Since we cannot simply return "true" to fix the
problem, as that would affect another case:

    When invoking folio_try_share_anon_rmap_pmd() from
    split_huge_pmd_locked(), the latter can fail and leave a large folio
    mapped through PTEs, in which case we ought to return true from
    try_to_migrate_one(). This might result in unnecessary walking of the
    rmap but is relatively harmless.

Link: https://lkml.kernel.org/r/20260305015006.27343-1-richard.weiyang@gmail.com


Fixes: 60fbb143 ("mm/huge_memory: adjust try_to_migrate_one() and split_huge_pmd_locked()")
Signed-off-by: default avatarWei Yang <richard.weiyang@gmail.com>
Reviewed-by: default avatarBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
Tested-by: default avatarLance Yang <lance.yang@linux.dev>
Reviewed-by: default avatarLance Yang <lance.yang@linux.dev>
Reviewed-by: default avatarGavin Guo <gavinguo@igalia.com>
Acked-by: default avatarDavid Hildenbrand (arm) <david@kernel.org>
Reviewed-by: default avatarLorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
parent 29f40594
Loading
Loading
Loading
Loading
+9 −3
Original line number Diff line number Diff line
@@ -2450,11 +2450,17 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
			__maybe_unused pmd_t pmdval;

			if (flags & TTU_SPLIT_HUGE_PMD) {
				/*
				 * split_huge_pmd_locked() might leave the
				 * folio mapped through PTEs. Retry the walk
				 * so we can detect this scenario and properly
				 * abort the walk.
				 */
				split_huge_pmd_locked(vma, pvmw.address,
						      pvmw.pmd, true);
				ret = false;
				page_vma_mapped_walk_done(&pvmw);
				break;
				flags &= ~TTU_SPLIT_HUGE_PMD;
				page_vma_mapped_walk_restart(&pvmw);
				continue;
			}
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
			pmdval = pmdp_get(pvmw.pmd);